Wimminz – celebrating skank ho's everywhere

October 8, 2013

Stuck in the RAM


I have had jobs where sites stop being able to connect to the mother-ship. Usually these are sites using an xDSL modem to log into the mother-ship, and login is of course handled by the trusty RADIUS server.

The problem isn't that the cheapo xDSL modem is dead, though that is always the second thing investigated, or that the cheapo xDSL line is dead, though that is always the first thing investigated. The problem is that the RADIUS server has just stopped working, and you can "fix" it by making a change that simply should not make any difference: changing the RADIUS password on the RADIUS server and the xDSL modem / router.
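
If you want to see for yourself whether the mother-ship end is answering at all before you start swapping kit, something like the sketch below does the job. It is a minimal sketch only, leaning on the FreeRADIUS radtest utility via subprocess; the hostname, test account, password and shared secret are all placeholders, not anything from a real site.

```python
#!/usr/bin/env python3
"""Quick sanity check of a RADIUS server from the field.

A minimal sketch using the FreeRADIUS 'radtest' utility; all hostnames,
credentials and the shared secret below are placeholders for illustration.
"""
import subprocess

RADIUS_HOST = "radius.mothership.example"   # placeholder mother-ship RADIUS server
TEST_USER = "site-test"                     # placeholder test account
TEST_PASS = "changeme"                      # placeholder password
SHARED_SECRET = "s3cret"                    # placeholder shared secret
NAS_PORT = "0"

def radius_alive() -> bool:
    """Send an Access-Request; True if an Access-Accept comes back."""
    try:
        result = subprocess.run(
            ["radtest", TEST_USER, TEST_PASS, RADIUS_HOST, NAS_PORT, SHARED_SECRET],
            capture_output=True, text=True, timeout=10,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return False
    # radtest prints the reply packet; look for an Access-Accept in the output
    return "Access-Accept" in result.stdout

if __name__ == "__main__":
    print("RADIUS answering:", radius_alive())
```

In the "stuck" state described above, a probe like this tends to just time out or get rejected until the password is changed on both ends, which is exactly the fix that should not make any difference.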

I've had this on Cisco kit too: you need to TFTP a patch across, so you go into configure terminal and give the box an IP address, give your laptop an IP address, and as a final sanity check before starting the TFTP you attempt to ping each box from the other. It doesn't work, and you can repeat the process ten times and it won't work, but if you reboot the Cisco box it will work first time.
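
The laptop side of that sanity check is about as dumb as it gets; a rough sketch, assuming a Linux laptop and made-up addresses for whatever you have just configured on the router interface and the laptop NIC:

```python
#!/usr/bin/env python3
"""Pre-TFTP sanity check: can the laptop see the Cisco box?

A rough sketch of the 'ping each box from the other' step, done from the
laptop side only. IP addresses are placeholders; ping flags assume Linux.
"""
import subprocess

LAPTOP_IP = "192.168.1.2"   # placeholder address given to the laptop
ROUTER_IP = "192.168.1.1"   # placeholder address given to the Cisco box

def can_ping(target: str, count: int = 3) -> bool:
    """Return True if at least one ICMP echo reply comes back."""
    result = subprocess.run(
        ["ping", "-c", str(count), "-W", "2", target],
        capture_output=True, text=True,
    )
    return result.returncode == 0

if __name__ == "__main__":
    if can_ping(ROUTER_IP):
        print(f"{ROUTER_IP} answers, safe to start the TFTP transfer")
    else:
        print(f"{ROUTER_IP} is not answering; on a 'stuck' box a reboot "
              "is often the only thing that brings it back")
```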

Neither of these problems should exist within the framework of "things as they should be", or rather "things as they are taught"... For example, it is heresy to suggest rebooting the RADIUS server, so it is discounted as a source of problems when a client site cannot log into a mother-ship, and it is heresy to suggest that any console / command line output from Cisco IOS is less than 100% truthful. And yet, if the orthodoxy were right on either count, the fixes I used would not have worked.

When asked what the problem was, I say something like "It was stuck in the RAM", which is of course meaningless *and* inaccurate, but it is an explanation of sorts, and it is *far* closer to the truth than the official answers.

I’m not a coder, but I suspect the truth could be found somewhere in the realms of buffer overflows and bounds checking.

However, nobody calls a senior coder in when a remote office fails to connect to the mother-ship (which one way or another is what 99% of my day job is about, making two sites connect to each other), so as a result you get anything *but* the truth.

As an aside, before I continue, if you are thinking that these are only problems encountered because I am working with cheap ass kit on cheap ass contracts for cheap ass clients, you would be as mistaken as you can possibly be… I absolutely guarantee that even if you have never set foot in the UK you will know 50% of the end users by brand name and reputation alone, even if they do not have a presence local to you.

Most of the kit is, relatively speaking, not very much money, anything from 500 to 5,000 bucks a box, and that is not a lot of money for a site that is turning over a million a week, or for an engineer that costs the end user 250 bucks before I even leave MY home, much less turn up on site... the kit itself is very mediocre quality, hardware wise, and that is me speaking as an engineer. Trust me on this.

Cisco kit sells because it all runs IOS, and finding people with Cisco qualifications who can write / edit / troubleshoot the config files, which are the files that tell IOS what to do, is about as hard as finding a web designer; worst case scenario, there are several tens of thousands of them available for not very much, about 90 milliseconds away in Mumbai.

This, by the way, is the SOLE reason everyone loves the cloud and virtual machines: virtual machines don't have ANY hardware, so you NEVER need a field engineer to turn up and move a patch cable, power cycle a box to unstick the RAM, do an actual install or upgrade, or anything else...

So, back to the plot…

It's down to ETHOS. Car brakes were basically designed so the default state was off; truck brakes were designed so the default state was on (and it takes air pressure to keep them off)... so you pressurise a car system to make it stop, and you leak pressure out of a truck system to make it stop.

Ask yourself two questions:

  1. Which is safest?
  2. Which is cheapest to make?

Suddenly everything becomes clear.
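
That brake ethos maps straight onto a design decision you see in control code: what does the thing do when its input goes away? A toy sketch only, not taken from any real braking or control system:

```python
"""Toy illustration of the two ethoses: fail-safe vs. cheapest.

Nothing here comes from a real product; it just shows how the 'default
state' decision looks once it is written down as code.
"""

class TruckBrake:
    """Fail-safe: with no air pressure, the spring applies the brake."""
    def state(self, air_pressure_ok: bool) -> str:
        return "released" if air_pressure_ok else "applied"

class CarBrake:
    """Cheapest: the brake is only applied while pressure is being supplied."""
    def state(self, pressure_ok: bool) -> str:
        return "applied" if pressure_ok else "released"

if __name__ == "__main__":
    # What happens when the system loses its input (the one-in-a-million event)?
    print("truck brake on failure:", TruckBrake().state(False))  # applied -> safe
    print("car brake on failure:  ", CarBrake().state(False))    # released -> cheap
```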

Unless you are the bit of NASA writing the actual code that directly controls the spacecraft flight hardware, or the bit of GE writing the actual code that directly controls the control rods in the nuke pile, or… and I cannot think of a third fucking example…..  then option 2 always gets a look in.

Most of the time the bottom line is the bottom line.

“Good enough” (mostly)

By definition you are excluding the “one in a million” event from your calculations.

Which is great, *until* that event comes along... luckily for humanity, in the sphere of my job that means that, until I fix it, someone didn't get their wages, someone didn't get their stock in trade to sell, someone didn't get a product or service that they were going to re-sell to someone else.

It can all be very serious and even life changing to the individuals concerned, but, the small print can cover that shit, nobody got killed…. fuck em…

We have had quite a few "cascade failures" in teh intertubez. They aren't yet as serious as the power grid blackouts we have had, but then again the power grid is everywhere and literally in everything, and the net is still a relative newbie; Chromebooks running exclusively on data living on a virtual machine in the cloud somewhere, and needing 100% fast net connectivity even to boot up into anything useful, are still rare.

But the times, as Dylan said, they are a changin’

I am seeing, as a result of these changes, cases where the 1st, 2nd and 3rd level responses to problems simply do not work, because the RAM that is stuck is not in the local machine, it is in a central machine that MUST NOT be rebooted, or worse still, in a cloud virtual machine.

At that point the on the spot field engineer (me) can no longer just ring the remote server engineer, compare notes, agree on a likely cause and course of action, and resolve the problem.

I saw this happen, in the flesh, before my own eyes, for the first time, personally, yesterday, on NetApp kit. Unfortunately there were so many levels of virtuality that the server guy couldn't diagnose which layer of virtual RAM was stuck, or where, and there was no possibility of simply rebooting, as that would take the entire enterprise down and trash that whole day's production, which was already sold and due to be in the shops tomorrow, or of changing CHAP / TACACS / RADIUS logins and resetting the problem that way... no worries, a whole new virtual machine was created, problem ignored.

Fuck it, I still get paid either way.

Asking people like me about my opinion on such things, well, that would be like asking a doctor about disease, fuck that, ask the pharma marketing machine, they have their eye on the bottom line.

March 8, 2013

Living in a virtualised world


I’ve been busy of late, hence the dearth of new posts..

My current gig is basically summed up thus: world + dog are chasing economies wherever they can find them (a good example is regional offices that years ago would have been on leased lines now being connected by xDSL), and so ACME corp's 447 regional offices get new Cisco 887 ADSL routers and all that, and the IT management can then be outsourced and offshored.... 447 expensive leased lines dropped, the in-house 500-strong IT department sacked en masse, loadsa money saved, trebles all around at the bean-counters' offices.

But some cunt has to turn up with the box and physically plug in the patch cables and so on, and when, not if, when that shit breaks, some cunt has to turn up and physically reset or repair the thing that cannot be fixed remotely.... even if that someone is just a remote pair of hands for a reseller's reseller's reseller's reseller....

Don't get me wrong, this shit is slick, but it is a basic engineering principle that the more layers of complexity you build up, the more there is to go wrong... which is why twice in the last week alone NatWest Bank customers have seen all the ATMs simply stop working, and no on-line banking either, and this is being repeated across the nation in all things IT.

Like the song says, Do your fucking job till the end

Till your job ends that is…. meanwhile back at the gig the crowd I work for are all gung ho, gangbusters and corporate image, which is fucking great while it lasts, which is by definition going to be a finite amount of time, we are hyenas feeding on corpses, for the moment it is a banquet…

I smile sweetly at them all, and Friday rolls around and I think to myself that is another week's money grabbed, wonder what next week will bring, because you see I am old enough and cynical enough to know that in this Soylent Green world, the crowd I work for can disappear with as little warning as the jobs of those we are replacing with little Cisco boxes (themselves now made in the Czech Republic, oh the irony) went down the Swanee...

They tell me about all the valuable skills and qualifications I can earn while working for them, and there is an element of truth in that, but I had valuable skills and qualifications in my previous trade of marine engineering, and they don’t put food on the table today, but my survivalist attitudes to life do, so what is more useful to me?

Never take a job you aren't prepared to walk away from at a moment's notice is a good motto, because already in this young and dynamic company I can see signs of the rot setting in, and the infection is spreading a lot faster than it did 20 years ago... I see this all over now: they will give some guy a £25k car to turn up at the customer's premises in, because it looks good, but no 5 dollar uniform sweatshirt, just wear something of your own, and if you are given any tools they came from Walmart. It is utter fucking madness.... exactly the same brand of utter fucking madness that created the jobs I do in the first place... by sacking all the IT staff and sticking in a remotely managed router and some switches.

Of course VOIP is all the rage, so when the cute little cabinet goes down the ACME corp regional office does not just disappear from the HQ WAN, all the fucking phone lines go down too... how many of these sites have all this shit running on a UPS, even a cheap and nasty SOHO job from APC or similar that will only keep it running for ten minutes?

You got it, haven’t seen a single fucking one yet….
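
For what it's worth, the arithmetic behind that "ten minutes" figure is not complicated. The sketch below uses invented, typical-ish numbers for a small comms cabinet and a cheap SOHO UPS, not measurements or specs from any real site or vendor.

```python
"""Back-of-envelope UPS runtime estimate for a small comms cabinet.

All figures are illustrative guesses: a couple of small routers/switches
plus a VOIP gateway, sitting on a cheap SOHO UPS.
"""

LOAD_WATTS = 120.0          # assumed draw of router + switch + VOIP kit
BATTERY_WH = 12.0 * 7.0     # assumed 12 V, 7 Ah sealed lead-acid battery
INVERTER_EFFICIENCY = 0.85  # assumed inverter efficiency
USABLE_FRACTION = 0.5       # lead-acid shouldn't be run completely flat

usable_wh = BATTERY_WH * INVERTER_EFFICIENCY * USABLE_FRACTION
runtime_minutes = usable_wh / LOAD_WATTS * 60

print(f"Estimated runtime: {runtime_minutes:.0f} minutes")
# With these guesses: 84 Wh battery, ~36 Wh usable, ~18 minutes at 120 W.
```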

It is fucking dreadfully incompetent and amateurish. I don't give a flying fuck how swish and fancy and cute all this remotely managed Cisco kit is, WHEN IT IS WORKING, I don't care how impressive the tricks are that you can do, WHEN IT IS WORKING, I don't care with what ease you can do quite complex tasks, WHEN IT IS WORKING; all I see is a system that studiously ignores the 9,000lb gorilla in the room: what the fuck do you do when it stops working? And their answer to that is to point at dudes like me.... whoooosh...

So anyway I’m chilling after a job yesterday with another of the field engineers, who is of a similar age to me, and we are discussing this, and the one thing my extensive experience has taught me…. and this is from the year dot of web servers on…

  1. The least likely person to bork such a box is the field engineer sat there physically in front of it; in true CYA mode he covers his ass at every step. When I am asked to type in console commands for a box that has lost connectivity to a remote IT management centre, I read and spell everything back phonetically, and then ask them, do you want me to press return now? No matter how simple the command.
  2. The MOST likely person to bork such a box is the remotely connected tech telnetting in or whatever; they don't give a fuck, and this is before they get confused between the three other field techs they are talking to at the same time as me.
  3. The MORE of a wizard the remote tech is, the WORSE they will bork the box…

All of which means that instead of us field guys being remote waldoes for the megamind remote admin guys, which is how all this shit is marketed by the bean-counters, we are just another point of failure, for exactly the same reasons that someone playing Call of Duty will have a different approach to a crunchie on the ground in Afdiggastan with actual bullets flying around...

[Graphic: Networkfailure, a cascade failure diagram]

Now these people, if you push them, will admit that there are things like the graphic above, a "cascade failure", but these same cunts have never had to RECOVER from one, because the fact is they have never been in one, or if they have, they were but one node...

I can distinctly remember being in a large hydroelectric turbine hall when a (local) cascade failure hit, because one of the turbines was tripped out by a vibration sensor, which they think was caused by a log getting down into the vent. So one goes down, and it takes aaaaaages to spin down, but the SOUND is indescribably different when it is not under load, and then the next one went because it was overloaded thanks to the first one going down, and then the remaining three went almost together.... and everyone is stood there looking at each other and the hall lighting goes out, and emergency DC lighting flicks on and the turbines continue to spool down... it is the most eerie motherfucking experience... and it took the onsite diesel gen set and four hours of work before they could start spooling up again, another two hours to get the first two turbines synced to the grid, and another four hours for the remaining three.

But they had ENGINEERS on site, not fucking remote wizards with the only thing on-site being some field techs told over a phone: press this button now, now press this one, now type this in, now move that cable from here to there, OK I'm in, you can go to the next job ta....

SLAs, well, SLAs are fulfilled if the reseller's reseller can get a warm body on site within 4 hours; that warm body doesn't have to actually DO anything, or FIX anything, he is just there so the SLA penalties can't be invoked.

What the people I am currently working for do not know, that I do, is this.

THE DIFFERENCE BETWEEN THEORY AND PRACTICE, IS GREATER IN PRACTICE THAN IN THEORY.

So what happens in extreme cases? Well, someone ships a new box down, and it gets swapped out, and we see if that fixes the problem. The only thing rarer than a UPS is the proverbial "smoking gun" when responding to an error call; nobody knows what went wrong or what the causes were, and nobody gives a fuck, this job has had a 2 hour slot allocated to it, and that's all there is.

Various three letter government agencies are waffling on about the threat of cyber terrorism, and hackers are getting sent to gitmo for 999 years of waterboarding pre trial, but the fact is that the real terrorists are all the fucking beancounters putting these bastard systems in place in the first fucking place, it isn’t IF it falls over, it is WHEN it falls over.

Currently these failures ain’t that bad, wossname bank goes down for 6 hours, wossname ISP goes down for 8 hours, wossname supermarket goes down for 4 hours, but no measures are being put in place to improve on this, on the contrary…. the opposite is what is happening.

Currently, cascade failures in IT have been confined to so-called fucked up countries where fucked up stuff like the so-called Arab Spring uprisings were going on, and again the shit was blamed on guvvmint shutting shit down; it hasn't YET happened to a western country on the scale of the seventies east coast USA power grid cascade failure, which was ultimately caused by ONE part dying... hasn't happened YET.

But it's gonna, why else is everyone getting the pre-emptive bullshit excuses in place about digital Pearl Harbours?

And it is not just ACME corp and your local supermarket and your local mobile phone shop doing this shit, it is also your local Court of law, your local Police station, your local lawyers, your local bank, your local hospital, and the technology is spreading in all these places.

Sure, they may well have a diesel genny out back that can be fired up to keep the lights on, but what fucking use is that when packets carrying everything from data to voice suddenly find no routes outside the LAN?

Which reminds me, with next week's money I need to buy myself a new NAS box and a couple of WD Red 3TB disks... lol

 
