Reading my Twitter stream over coffee recently, I saw Nick Galbreath post the Nanex Research theory/analysis of the Knight Capital trading platform disaster, which you can read about in more detail at the links I've provided in the appendix below.
The point is, and this is conjecture and hypothesis, that a piece of code meant to stay in the lab made it out into production. We can see hints of why in Nanex's analysis of the issue:
"When the time comes to deploy the new market making software, which is likely handled by a different group, the Tester is accidentally included in the release package and started on NYSE's live system."
That makes sense. How many organizations have this same setup for deploying software? The developers in one group and the packaging/deployment people (otherwise known as Operations) in another? Thinking back to my days at GE Power Systems, this was rigidly enforced.
Developers, who were offshore resources more often than not, would work on code as it was spec'd out to them and then FTP (or otherwise transfer) the completed software bits to "our side," where someone would package it up and put it on a server to test. Testing eventually included security testing.
I can't tell you the fun things we found in this pre-production environment when we started digging around during security testing. No, really, I can't tell you, but rest assured it didn't end with misconfigurations, or accidental code bits being included. Once we found a few files from another piece of software that was not designated for our environment... maybe one day I'll be able to tell that story.
Anyway... the moral of the story is that there were these separate groups that designed, built, packaged, tested, deployed, and then monitored these applications - and we often found ourselves in situations similar to Knight's, except without the multi-billion-dollar holding problem.
As you can guess, security issues were not scarce, and neither were configuration bugs, developer-included 'whoops' comments and pieces of test harnesses that should never make it to production. But they did.
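To make the failure mode concrete: one cheap defense against test harnesses slipping into a release is an automated gate in the packaging step that refuses to ship an artifact containing test files. Here's a minimal sketch; the file-name patterns and directory layout are my own illustrative assumptions, not anything from the Knight or GE incidents.

```python
# Hypothetical pre-deployment gate: refuse to ship a release
# directory that still contains test-harness files.
# The FORBIDDEN_PATTERNS list is an assumption for illustration.
import pathlib
import sys

FORBIDDEN_PATTERNS = ("test_*", "*_test.*", "*tester*", "conftest.py")

def find_test_artifacts(release_dir: str) -> list:
    """Return file paths under release_dir matching any forbidden pattern."""
    root = pathlib.Path(release_dir)
    hits = []
    for pattern in FORBIDDEN_PATTERNS:
        # rglob searches the whole tree, so nested test files are caught too
        hits.extend(p for p in root.rglob(pattern) if p.is_file())
    return sorted(set(hits))

if __name__ == "__main__":
    offenders = find_test_artifacts(sys.argv[1] if len(sys.argv) > 1 else ".")
    if offenders:
        print("Refusing to deploy; test artifacts found:")
        for p in offenders:
            print(f"  {p}")
        sys.exit(1)
    print("Release package is clean.")
```

A check like this would run in the packaging group's pipeline, precisely because the people cutting the release may never have seen the code - it encodes a bit of the developers' knowledge at the handoff point.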
As I think about how a DevOps tribe could function differently, it becomes increasingly clear that disasters like this one could be diminished if only we had continuity in the SDLC. That's the key: continuity.
Any process that lacks continuity is doomed to stumble at some point. The costs we accrue go into technical debt discussions, but ultimately the piper comes calling. This Knight Capital incident may be an interesting case study in how DevOps could decrease the likelihood of failures like this one - or maybe not; we would need a similar organization doing DevOps to compare against.
I simply believe, more so with every incident like this, that rigid processes with a knowledge gap at the handoff between groups are the problem, not a solution. This is one of the primary reasons I'm so fervently behind DevOps: the knowledge gap has less chance to rear its head.
Knight Capital Links:
Cross-posted from Following the White Rabbit