I have had jobs where sites stop being able to connect to the mother-ship. Usually these are sites using an xDSL modem to log into the mother-ship, and the login is of course handled by the trusty Radius server.
The problem isn’t that the cheapo xDSL modem is dead (though that is always the second thing investigated), or that the cheapo xDSL line is dead (though that is always the first thing investigated). The problem is that the Radius server has just stopped working, and you can “fix” it by making a change that simply should not make any difference: changing the Radius password (the shared secret) on both the Radius server and the xDSL modem / router.
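Not that anyone ever asks me to prove it, but when I want to know whether it is the server or the modem that has lost its mind, a quick probe straight from the laptop to the Radius server settles it. Here is a minimal sketch of the idea, assuming Python with the pyrad library and a freeradius-style dictionary file to hand; the server address, shared secret and username below are made-up placeholders, not anything from a real job.

```python
# Minimal Radius "is anybody home?" probe -- a sketch, not production tooling.
# Assumes: pip install pyrad, plus a freeradius-style "dictionary" file.
from pyrad.client import Client, Timeout
from pyrad.dictionary import Dictionary
import pyrad.packet

# Placeholder address and shared secret -- substitute your own.
srv = Client(server="192.0.2.10", secret=b"testing123",
             dict=Dictionary("dictionary"))

req = srv.CreateAuthPacket(code=pyrad.packet.AccessRequest,
                           User_Name="site-router")
req["User-Password"] = req.PwCrypt("not-the-real-password")

try:
    reply = srv.SendPacket(req)
    # Any answer at all (Accept or Reject) means the server process is
    # alive and talking; a timeout is the interesting, "stuck" case.
    print("server answered, reply code", reply.code)
except Timeout:
    print("no answer -- dead server, wrong secret, or 'stuck in the RAM'")
```

If that times out while the line and the modem test clean, you are in exactly the territory where changing the shared secret on both ends mysteriously brings everything back.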
I’ve had this on Cisco kit too. You need to TFTP a patch across, so you go into configure terminal and give the box an IP address, give your laptop an IP address, and as a final sanity check before starting the TFTP you try to ping each box from the other… and it doesn’t work. You can repeat the process ten times and it won’t work, but if you reboot the Cisco box it will work first time.
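For what it’s worth, the laptop half of that dance can be scripted rather than done by hand. A rough sketch, assuming the tftpy library and made-up addresses and paths: ping the box first, then stand up a TFTP server so the router can pull the image from its own console (something along the lines of copy tftp: flash:). The reverse ping still has to be done on the IOS side.

```python
# Laptop side only: quick reachability check, then serve the image over TFTP.
# A sketch with placeholder values, assuming: pip install tftpy.
import subprocess
import tftpy

ROUTER = "192.168.1.1"          # the address you just gave the Cisco box
IMAGE_DIR = "/srv/ios-images"   # wherever the patch image is sitting

# One direction of the sanity check (laptop -> router); "-c" is the
# Linux / macOS ping flag for a fixed packet count.
result = subprocess.run(["ping", "-c", "3", ROUTER], capture_output=True)
if result.returncode != 0:
    raise SystemExit("router not answering pings -- time for the reboot trick")

# Serve the image directory on UDP 69 (needs root / admin rights).
server = tftpy.TftpServer(IMAGE_DIR)
server.listen("0.0.0.0", 69)
```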
Neither of these problems should exist within the framework of “things as they should be”, or rather “things as they are taught”. For example, it is heresy to suggest rebooting the Radius server, so it is discounted as a source of problems when a client site cannot log into the mother-ship, and it is heresy to suggest that any console / command line output from Cisco IOS is less than 100% truthful. And yet, if either piece of orthodoxy were true, the fixes I used would not have worked.
When asked what the problem was, I say something like “it was stuck in the RAM”, which is of course meaningless *and* inaccurate, but it is an explanation of sorts, and it is *far* closer to the truth than the official answers.
I’m not a coder, but I suspect the truth could be found somewhere in the realms of buffer overflows and bounds checking.
However, nobody calls a senior coder in when a remote office fails to connect to the mother-ship (which, one way or another, is what 99% of my day job is about: making two sites connect to each other), so as a result you get anything *but* the truth.
As an aside, before I continue: if you are thinking that these are problems I only encounter because I am working with cheap-ass kit on cheap-ass contracts for cheap-ass clients, you would be as mistaken as you can possibly be… I absolutely guarantee that even if you have never set foot in the UK you will know 50% of the end users by brand name and reputation alone, even if they have no presence local to you.
Most of the kit is, relatively speaking, not very much money, anything from 500 to 5,000 bucks a box, and that is not a lot for a site that is turning over a million a week, or next to an engineer who costs the end user 250 bucks before I even leave MY home, much less turn up on site… the kit itself is very mediocre quality, hardware-wise, and that is me speaking as an engineer. Trust me on this.
Cisco kit sells because it all runs IOS, and finding people with Cisco qualifications who can write / edit / troubleshoot the config files (the files that tell IOS what to do) is about as hard as finding a web designer; worst case scenario, there are several tens of thousands of them available for not very much, about 90 milliseconds away in Mumbai.
This, by the way, is the SOLE reason everyone loves the cloud and virtual machines: virtual machines don’t have ANY hardware, so you NEVER need a field engineer to turn up and move a patch cable, power-cycle a box to unstick the RAM, do an actual install or upgrade, or anything else…
So, back to the plot…
It’s down to ETHOS. Car brakes were basically designed so that the default state is off; truck brakes were designed so that the default state is on (it takes air pressure to keep them off). So you pressurise a car system to make it stop, and you leak pressure out of a truck system to make it stop (there’s a toy sketch of the two default states after the questions below).
Ask yourself two questions:
- Which is safest?
- Which is cheapest to make?
Suddenly everything becomes clear.
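To make the default-state point concrete, here is a toy sketch in code, entirely my own illustration and nothing to do with real brake controllers: the “car style” system needs pressure to apply the brakes, the “truck style” system needs pressure to hold them off, so when the line fails the car freewheels and the truck stops.

```python
# Toy illustration only -- not real brake control logic. The point is what
# each design does by default, when the pressure signal goes away.

def car_brakes_applied(line_pressure_psi: float) -> bool:
    # Car ethos: default OFF. You add pressure to make it stop.
    return line_pressure_psi > 0.0

def truck_brakes_applied(line_pressure_psi: float) -> bool:
    # Truck ethos: default ON. Air pressure holds the springs off;
    # lose the air and the brakes clamp on their own.
    return line_pressure_psi < 60.0

# Burst line, zero pressure:
print(car_brakes_applied(0.0))    # False -> the car keeps rolling
print(truck_brakes_applied(0.0))  # True  -> the truck stops, fail-safe
```

The truck design is the safer one; the car design is the cheaper one. That is the whole argument in two functions.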
Unless you are the bit of NASA writing the actual code that directly controls the spacecraft flight hardware, or the bit of GE writing the actual code that directly controls the control rods in the nuke pile, or… and I cannot think of a third fucking example… then option 2 (cheapest) always gets a look in.
Most of the time the bottom line is the bottom line.
“Good enough” (mostly)
By definition you are excluding the “one in a million” event from your calculations.
Which is great, *until* that event comes along… luckily for humanity, in the sphere of my job all that means, until I fix it, is that someone didn’t get their wages, someone didn’t get their stock in trade to sell, someone didn’t get a product or service that they were going to re-sell to someone else.
It can all be very serious and even life-changing to the individuals concerned, but the small print can cover that shit; nobody got killed… fuck em…
We have had quite a few “cascade failures” in teh intertubez. They aren’t yet as serious as the power grid blackouts we have had, but then again the power grid is everywhere and literally in everything, and the net is still a relative newbie; Chromebooks running exclusively on data that lives on a virtual machine in the cloud somewhere, and needing 100% fast net connectivity even to boot up into anything useful, are still rare.
But the times, as Dylan said, they are a-changin’.
I am seeing, as a result of these changes, cases where the 1st, 2nd and 3rd level responses to problems simply do not work, because the RAM that is stuck is not in the local machine; it is in a central machine that MUST NOT be rebooted or, worse still, in a cloud virtual machine.
At that point the on-the-spot field engineer (me) can no longer just ring the remote server engineer, compare notes, agree on a likely cause and course of action, and resolve the problem.
I saw this happen in the flesh, before my own eyes, for the first time, personally, yesterday, on NetApp kit. Unfortunately there were so many levels of virtuality that the server guy couldn’t diagnose which layer of virtual RAM was stuck, or where, and there was no possibility of simply rebooting, as that would take the entire enterprise down and trash that whole day’s production, which was already sold and due to be in the shops tomorrow, nor of changing the CHAP / TACACS / Radius logins and resetting the problem that way… no worries, a whole new virtual machine was created and the problem ignored.
Fuck it, I still get paid either way.
Asking people like me for an opinion on such things, well, that would be like asking a doctor about disease; fuck that, ask the pharma marketing machine, they have their eye on the bottom line.