I have had jobs where sites stop being able to connect to the mother-ship. Usually these are sites using an xDSL modem to log into the mother-ship, and login is of course by the trusty Radius server.
The problem isn’t that the cheapo xDSL modem is dead, though that is always the second thing investigated, or that the cheapo xDSL line is dead, though that is always the first thing investigated. The problem is that the Radius server just stopped working, and you can “fix” it by making a change that simply should not make any difference: changing the Radius password on the Radius server and the xDSL modem / router.
I’ve had this on Cisco kit too. You need to TFTP a patch across, so you configure terminal and give the box an IP address, give your laptop an IP address, and as a final sanity check before starting the TFTP you attempt to ping each box from the other. It doesn’t work, and you can repeat the process ten times and it still won’t work, but if you reboot the Cisco box it will work first time.
Neither of these problems should exist within the framework of “things as they should be”, or rather “things as they are taught”. For example, it is heresy to suggest rebooting the Radius server, so it is discounted as a source of problems when a client site cannot log into a mother-ship. Likewise it is heresy to suggest that any console / command line output from Cisco IOS is less than 100% truthful. And yet, if either of these orthodoxies were true, the fixes I used would not work.
When asked what the problem was, I say something “was stuck in the RAM”, which is of course meaningless *and* inaccurate, but it is an explanation of sorts, and it is *far* closer to the truth than the official answers.
I’m not a coder, but I suspect the truth could be found somewhere in the realms of buffer overflows and bounds checking.
However, nobody calls a senior coder in when a remote office fails to connect to the mother-ship (which one way or another is what 99% of my day job is about, making two sites connect to each other), so as a result you get anything *but* the truth.
As an aside, before I continue, if you are thinking that these are only problems encountered because I am working with cheap ass kit on cheap ass contracts for cheap ass clients, you would be as mistaken as you can possibly be… I absolutely guarantee that even if you have never set foot in the UK you will know 50% of the end users by brand name and reputation alone, even if they do not have a presence local to you.
Most of the kit is, relatively speaking, not very much money: anything from 500 to 5,000 bucks a box, and that is not a lot of money for a site that is turning over a million a week or an engineer that costs the end user 250 bucks before I even leave MY home, much less turn up on site… the kit itself is very mediocre quality, hardware wise, and that is me speaking as an engineer. Trust me on this.
Cisco kit sells because it all runs IOS, and finding people with Cisco qualifications who can write / edit / troubleshoot the config files, which are the files that tell the IOS what to do, is about as hard as finding a web designer. Worst case scenario, there are several tens of thousands available for not very much, about 90 milliseconds away in Mumbai.
This, by the way, is the SOLE reason everyone loves the cloud and virtual machines: virtual machines don’t have ANY hardware, so you NEVER need a field engineer to turn up and move a patch cable, power cycle to unstick the RAM, do an actual install or upgrade, or anything else…
So, back to the plot…
It’s down to ETHOS. Car brakes were basically designed so the default state was that they were off; truck brakes were designed so the default state was that they were on (and it took air pressure to keep them off). So you pressurise a car system to make it stop, and you leak pressure out of a truck system to make it stop.
Ask yourself two questions:
- Which is safest?
- Which is cheapest to make?
Suddenly everything becomes clear.
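For anyone who prefers it spelled out in code, here is a rough Python sketch of the two brake philosophies. The class names, the pressure units and the 0.5 threshold are all made up for illustration; this is not modelled on any real braking spec, just the default-off versus default-on idea described above.

```python
from dataclasses import dataclass


@dataclass
class CarBrake:
    """Default state is OFF: it takes pressure to apply the brakes."""
    line_pressure: float = 0.0  # hypothetical units; 0.0 means a dead or leaking line

    def is_braking(self) -> bool:
        return self.line_pressure > 0.5  # arbitrary illustrative threshold


@dataclass
class TruckBrake:
    """Default state is ON: it takes pressure to hold the brakes off."""
    line_pressure: float = 0.0

    def is_braking(self) -> bool:
        return self.line_pressure < 0.5  # same arbitrary threshold


def lose_pressure(brake) -> bool:
    """Simulate the failure: the line loses all pressure. Returns whether we are still braking."""
    brake.line_pressure = 0.0
    return brake.is_braking()


car = CarBrake(line_pressure=1.0)      # driver standing on the pedal
truck = TruckBrake(line_pressure=1.0)  # truck rolling along, brakes held off
print(lose_pressure(car))    # False: the failure took the brakes away
print(lose_pressure(truck))  # True: the failure put the brakes on
```

Same failure, opposite outcomes, and the only difference is which state was chosen as the default.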
Unless you are the bit of NASA writing the actual code that directly controls the spacecraft flight hardware, or the bit of GE writing the actual code that directly controls the control rods in the nuke pile, or… and I cannot think of a third fucking example….. then option 2 always gets a look in.
Most of the time the bottom line is the bottom line.
“Good enough” (mostly)
By definition you are excluding the “one in a million” event from your calculations.
Which is great, *until* that event comes along… luckily for humanity, in the sphere of my job that means that, until I fix it, someone didn’t get their wages, someone didn’t get their stock in trade to sell, someone didn’t get a product or service that they were going to re-sell to someone else.
It can all be very serious and even life changing to the individuals concerned, but, the small print can cover that shit, nobody got killed…. fuck em…
We have had quite a few “cascade failures” in teh intertubez. They aren’t yet as serious as the power grid blackouts we have had, but then again the power grid is everywhere and literally in everything, and the net is still a relative newbie. Chromebooks running exclusively on data living on a virtual machine in the cloud somewhere, and needing 100% fast net connectivity even to boot up into anything useful, are still rare.
But the times, as Dylan said, they are a changin’
I am seeing, as a result of these changes, cases where the 1st, 2nd and 3rd level responses to problems simply do not work, because the RAM that is stuck is not in the local machine, it is in a central machine that MUST NOT be rebooted, or worse still, in a cloud virtual machine.
At that point the on the spot field engineer (me) can no longer just ring the remote server engineer, compare notes, agree on a likely cause and course of action, and resolve the problem.
I saw this happen, in the flesh, before my own eyes, for the first time, yesterday: NetApp. Unfortunately there were so many levels of virtuality that the server guy couldn’t diagnose which layer of virtual RAM was stuck, or where. There was no possibility of simply rebooting, as that would take the entire enterprise down and trash that whole day’s production, which was already sold and due to be in the shops tomorrow, and no possibility of changing chap/tacacs/radius logins and resetting the problem that way… no worries, a whole new virtual machine was created, problem ignored.
Fuck it, I still get paid either way.
Asking people like me about my opinion on such things, well, that would be like asking a doctor about disease, fuck that, ask the pharma marketing machine, they have their eye on the bottom line.
There isn’t a back-yard mechanic in the world who hasn’t muttered “Intermittent problem” under his breath.
In automobiles, these only start cropping up after 15 years or so; a relay develops a high-resistance short, or part of the pneumatic fuel injection control system develops a very slight resistance with gunk in the tubes that only affects you between 3000 & 3250 RPM. Quite frankly, these problems *should* not exist… but they saved $0.05 by getting the cheaper relay, and besides, who owns a fifteen year old car? We have a 2% increase in engine efficiency by adding the computer, for a 50% reduction in life expectancy (at least Holly got a decent IQ bonus…).
The infrastructure and ecosystem we’re developing for software is deeply worrying to me; unfortunately I’m not the sort of computer genius that anybody would listen to. All I know is that my gut instincts, thus far, have been frequently affirmed by actual software geniuses.
But hey, Windows 8 has cool transition screens, and that’s all that matters, right?
Comment by Aurini — October 8, 2013 @ 3:29 pm
Much as we all like to malign winders, it is actually much better than most of the software that gets installed on top of it.
The problems *really* start when you have, say, each API assigned 100 millimetres of virtual machine space, with a 10 mm gap all around it to separate it from the other APIs. Developer A writes some software that pushes 5mm into that boundary to make function x work easier, and developer B of a different product uses the neighbouring API but pushes THAT boundary 5mm to make function y work easier. Now you have products A and B installed, and possibly both running simultaneously, with fuck all in between them.
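To put numbers on that, here is a minimal Python sketch of the analogy. It assumes the 10 mm gap is shared between the two neighbours, and the names `Region`, `gap_between` and `checked_offset` are invented for the illustration; real memory layout is nothing like this tidy.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """A block of space in the analogy's millimetres; end is exclusive."""
    name: str
    start: int
    end: int


def gap_between(left: Region, right: Region,
                left_overrun: int = 0, right_overrun: int = 0) -> int:
    """Guard gap remaining once each neighbour has pushed into it."""
    return (right.start - right_overrun) - (left.end + left_overrun)


def checked_offset(region: Region, offset: int) -> int:
    """Bounds-checked access: refuse any offset outside the region's own space."""
    size = region.end - region.start
    if not 0 <= offset < size:
        raise IndexError(f"{region.name}: offset {offset} is outside 0..{size - 1}")
    return region.start + offset


api_a = Region("API A", 0, 100)    # 100 mm of space
api_b = Region("API B", 110, 210)  # next door, 10 mm away

print(gap_between(api_a, api_b))        # 10
print(gap_between(api_a, api_b, 5, 5))  # 0
```

With the check in place, `checked_offset(api_a, 104)` raises an `IndexError` instead of quietly landing in the gap, which is the whole point of bounds checking: the overrun fails loudly at the moment it is attempted, not months later in someone else's product.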
A *classic* case of this is EVERY_CUNT_THAT_EVER_WROTE_A_FUCKING_GAME.
My current win7pro install came with MS Visual C++ 2012, and everything worked fucking perfectly.
Then along came EA, not even asking but installing an older version for Crysis 3, then along came Bethesda, not even asking but installing an older and DIFFERENT fucking version for Skyrim, then along came the same story for FEAR3, and next thing you know, when I open a folder full of movies, fucking explorer throws up an error trying to generate preview thumbnails.
This is NOT Microsoft’s fault, even though they wrote all the versions of Visual C++.
Comment by wimminz — October 8, 2013 @ 8:29 pm
Crap, you had to mention the power grid. 😦
I’ve managed to forget this bogus “Grid Down” Drill the US gubinmint wankers are preparing for the 13th next month.
Fuck, I’m getting that pissed-off feeling again.
Comment by hans — October 8, 2013 @ 4:38 pm