I built a system where our developers can do instant deployments of any of our software packages (and instant point-in-time rollbacks), and then do zero-downtime restarts of services.
Now we deploy dozens of times a day and I never get called on a Friday night because someone did something stupid.
Edit: I do get called when I did something stupid and it broke the deployment system. But that's gotten much rarer lately.
I've built a similar system, around five years ago. Users were able to deploy any version to any cluster from a nice UI. Basically you could select any version, click the "Install" button and follow the logs in real time.
Behind the scenes it was a decentralized continuous delivery system. Very cool stuff, highly automated. Reduced a lot of work and sped up development cycles from months to minutes. Served quite a large software development organization (1000+). I think we had 1000 servers in 5 datacenters around the globe.
Nowadays I'm working on an open source version of that system. It's still missing a few critical features, but hopefully I'll get the first release out next spring.
btw, I'm looking for projects/contacts that would be interested in trying out how the system would fit their needs.
Hey Mikko; I'm a sysadmin at a research university, and I'd be very curious about at least "picking your brain" about your tool. I can't make any promises about actual usage, but I always love to see novel approaches to relevant problems.
Cool, sysadmin at a research university sounds like a nice position to be at.
Yes, it's already on github. Unfortunately, since it's missing those critical features it's not easy to see how the whole system is going to work. If you want to talk, just drop me an email at gmail. mikko.apo is the account.
I really hate the idea that deploying on a Friday afternoon is a bad idea. It's only bad when you have shit developers or shit processes that don't catch broken code.
Personally, I think it's better to release at 5pm on a Friday. Once people stay late a few times to fix their broken shit they'll be smarter about not checking in crap.
> It's only bad when you have shit developers or shit processes that don't catch broken code.
Or when the bug is only triggered in specific user profiles.
Or when all the devs went on a retreat in the mountains with no cell service.
Or when a dev makes a mistake (which we know never happens to even the best devs)
Or when the only developer who knows which one of the 1000 changes that were pushed could be the one breaking things has turned his phone off.
Or when a flaw is discovered in the process for the first time (which we know never happens because everyone's process is perfect, until it isn't)
Or how change management's requirement that the fix be tested and verified by all affected teams might have people staying a few hours after 5pm on a Friday when they just want to get their weekend started.
Or how 10 different people from 10 different teams might need to be called in and kept working until 2am because the change can't be pulled: the database was already modified, the old client data has already expired from the cache, and a refresh would destroy the frontend servers.
Yes, this! "Good code" and a CI box and deployment automation and some chef recipes don't spell ultimate success.
It drives me nuts when people tell me off for saying 'yeah yeah, no, automating our entire infrastructure of 5 servers isn't really worth it right now', like I'm some unprofessional bozo.
I pretty much have experience with all but one or two of your suggested scenarios, and by now I have no patience for annoying software developers who think that using chef or puppet somehow sufficiently embiggens them to run ops on their own (of course dev ops is almost a political assault on existing ops guys, not merely a nice new solution to existing problems).
Sigh. This is why I don't work on teams these days (if I can help it).
EDIT: Though I also agree with the sub-parent that deploying at 5pm is fine on certain teams and certain projects. The most important thing is: are the guys pushing the deploy going to own it? Are they going to hang around for another 60 minutes to check everything is OK? Are they going to be available at 10pm or on Saturday if something goes wrong, and are they going to own it? If the answer is no, then nope, don't do it.
I'd agree that in a perfect scenario you should be able to push code at any time confidently. But for many companies and projects that's simply not attainable. As well, in many organizations the person who has to fix broken stuff is not the same as the person who develops and pushes code. I'm not saying that's a good thing, but it is reality for many people.
Even if it only happens once in your career: once you've had a dev push out code at 5pm on a Friday, jet out the door and hit the bar, while you (the sysadmin/ops on call) get woken up at 1am by a site-down alert and have to debug/rollback the changes while the dev who pushed them is unreachable, you learn to really avoid Friday evening pushes. Fool me once...
It's sad that most places don't have a proper technical copy (with a full copy of live data) to do full tests on. TDD is all very well, but you need to test the entire system.
Yeah, because all problems are foreseeable and only ever caused by crap code... right.
No matter how great your processes and your code are, no test can catch everything that can go wrong in a live environment, and doubly so if your system interfaces with anything third party.
It's not always the person who pushed it on a Friday that ends up fixing it, though. They can be unreachable, without a computer, etc etc. It's just easier to change less during hours when you have fewer people on hand, is all.
Maybe it's me, but I have no problem staying late on a Friday to fix my screw-up. However, I'm terrified of having to fix something Monday morning while everyone else is watching.
But the real reason we deploy weekday mornings is so everyone is on deck and we can get outside help if required. When I was doing system integration, the problem was never in my code, it was the vendor's. Testing can only get you so close to the real world.
Seen this pattern many times before. One of those 'en' strings is the current user's language being written into the source, the other is hardcoded. If your server-side templating engine is impotent and only supports variable interpolation without conditionals, this approach is easier than pulling the right JS snippet from somewhere else.
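A minimal sketch of how you end up with that (the {{ user_lang }} placeholder and the snippet itself are made up for illustration): the engine can only substitute a variable into the inline JS, not branch on it, so the "conditional" becomes a runtime comparison between the interpolated value and a literal.

  // template: the engine only interpolates, it can't branch
  if ('{{ user_lang }}' === 'en') {
    // English-only behaviour
  }

  // rendered for an English user, hence the 'en' === 'en'
  if ('en' === 'en') {
    // English-only behaviour
  }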
- Somebody deployed new features on a Friday at 5pm.
- Fifteen hundred machines running mod_perl.
- Supporting Oracle - TWICE.
- It turns out your entire infrastructure is dependent on a single 8U Sun Solaris machine from 15 years ago, and nobody knows where it is.
- Troubleshooting a bug in a site, view source.... and see SQL in the JS.