The many-faced god of operational excellence, DevOps and now 'site reliability engineering' • DEVCLASS

The many-faced god of operational excellence, DevOps and now ‘site reliability engineering’

By Michael Cote

February 6, 2018

Someone’s been kicking up the “NoOps” ant pile again. There it was, sitting there finally rebuilt after the annual upturning, and The Lord of Cartography, Simon Wardley says: “I think you’ll find that the new legacy is going to be DevOps.” That said, it is winter, so the ants are moving a bit slower than usual.

But we’ve only just started…!

This “LessOps” vibe matches my own anecdotes as I flit about the IT departments at large organizations. At the same time, so many IT departments are hungry for DevOps: they want to understand it and put it into practice, to be sure. Surveys are showing growing interest: the annual DevOps report says the number of respondents working on DevOps teams rose from 16 per cent in 2014 to 27 per cent in 2017.

A 2016 Forrester survey found that 69 per cent of respondents had adopted “processes that embrace or resemble DevOps”. And while I can’t help but think that those 69 per cent are those of you, dear readers, who leave comments for me professing that you’ve been doing DevOps since Churchill’s second term – and “just didn’t call it that” – let’s take Forrester’s survey here as useful.

What was that middle part again?

First, what exactly is DevOps? As John Willis, one of the co-authors of the DevOps Handbook, told me: “Unfortunately, DevOps means whatever the definer wants you to believe and no definition is wrong.” He went on to give his definition by way of describing the end state: “DevOps is about service as a supply chain and all the things that enable fast, resilient and consumable delivery of the service.”

Another dean of DevOps, Gene Kim, described it much the same way, as quoted in Gary Gruver’s excellent book on scaling DevOps: “DevOps should be defined by the outcomes. It is those sets of cultural norms and technology practices that enable the fast flow of planned work from, among other things, development through tests into operations, while preserving world-class reliability, operation, and security.

“DevOps is not about what you do, but what your outcomes are. So many things that we associate with DevOps, such as communication and culture, fit underneath this very broad umbrella of beliefs and practices.”

As ever, successful technology-driven definitions quickly become a description of the outcomes rather than how you get there. But wait! Our DevOps report friends took a bold swing at defining exactly what DevOps is, first by practices, then by effects, and then by outcomes.

This year, they even made a single chart of it, below.

Of course, unless you’ve had deep MBA training, one chart like that isn’t going to define DevOps for you, but it does highlight a set of practices that lead to goals (like continuous delivery) that start to ensure predictable, reliable, and useful delivery of IT… that helps improve the business.

To me, the key to figuring out where DevOps begins and ends – what it is as a practice – is asking what’s done after a functional agile development team does a build. How does the organization deploy the build to production, then ensure it runs in production, and then ensure that it can be upgraded on-demand?

Early on, the answer was to automate, automate, automate. Instead of manually deploying builds to production, you’d use Puppet or Chef, for example. Then you’d use containers, and then came the idea of “cloud platforms” that dictated exactly how you’d package, deploy, and manage your software and gave you close to zero options about the stack below your application. Each of these was built around the idea of “the wall” between developers and operators, and removing the negative effects of that wall.

Developers would make their build, then throw it over to operations staff who’d have to figure out how to deploy the build to production and then manage it ongoing. This wall introduced so much variability in configuration management that, inevitably, someone would forget to configure the DNS servers and the whole system would go down each time a build went to production. I’m greatly oversimplifying here, but solving that problem of frequent deployment drove a tremendous amount of DevOps thought and innovation.

If you re-read Willis’s delightfully concise definition, this notion fits in pretty well. Back-solving from the goal of more frequently deploying software (that, as a bonus, also stayed up), DevOps discovered a host of “culture” practices and issues that it’s become much more famous for.

And at some point, the venerable practices of “agile” were added whole-hog into the mix. And why wouldn’t they be mixed into the batter? The end goal is creating better customer and user experiences, which means not only ensuring that the software runs in production, but that it’s well designed.

Todd Underwood, a site reliability engineer at Google, summarized the process and cultural consequences well a few years back: “DevOps seeks to integrate operational concerns into the software and business practices and software/skills capabilities into operations.” This can include operations staff actually embedding with the developers, especially those developer teams who have very few operations skills.

Who automates the automaters?

Early on, the notion of DevOps was that a unified team of developers and operators (I mean, it’s right there in the name, right?) would figure all of this out, working hand in hand and all carrying pagers to create, deploy, and then manage their applications. A huge amount of work in DevOps centered around automating the end-to-end process of getting software into production, so it’s little wonder that “configuration management” is often seen as “DevOps”. But, as the tools and practices started to coagulate, these teams did more than just automate configuration management: they’d build “platforms” out of the standard stacks and processes they’d been following.

Getting your teams to build platforms, being “full stack developers” as we used to call them, seems excellent, at first, until you’re lucky enough to operate at enterprise scale. Take, for example, the 19,000+ developers at JP Morgan Chase. At that scale, you get a real 1 + 1 = -3 effect because you’re duplicating all those stacks. Using production monitoring and management as a tracer for this anti-pattern of too many stack developers, 451 Research’s Nancy Gohring told me: “[This] leads to the situation where some enterprises have 50 monitoring tools, sometimes including multiple deployments of the same tool. That seems inefficient.”

Ideally, to “scale”, you want to not only automate the toil of IT management, but standardize and centralize it. You don’t want developer teams building their own stacks and managing their production applications in unique ways. You want to automate the automation. As Google’s Kelsey Hightower put it recently talking about serverless and DevOps: “Once we get the practice right, it should turn into technology.”

How dare you say my bash scripts aren’t proper programming!

The idea of Site Reliability Engineering, or “SRE”, fits better in this view of what DevOps is. SRE-think is not focused on pulling developers into a unified team with sysadmins. Instead, the goal is to get sysadmins to start thinking like programmers, actually writing code and developing systems for production use. Sysadmins are no longer responsible for just running what developers give them.

Sure, they spend time troubleshooting production problems like a classic sysadmin would, but once the hair-on-fires are extinguished, the immediate question is: “OK, how can we change our platform to automate all this ops toil?” The idea, as Underwood put it, is to: “Write infrastructure that doesn’t require that kind of procedural automation.”

Technologically, the re-emergence of platforms as a service (PaaS) and the growing dominance of Kubernetes for management are automating many of the manual, one-off processes of DevOps. (You should know, dear reader, that I pay my mortgage by working at one of the vendors that peddles such kit.)

The end result is enabling developers to focus on their applications, not actually carrying a pager or worrying about whatever a “DNS” is. As Allstate’s Matt Curry described it to me: “The goal is to eliminate cognitive overhead for the developers and keep their pipelines as simple as possible. They should get a ton of operational value for free just by pushing to an environment, be it monitoring, release process, security scanning, architecture patterns, or anything else that is repetitive and fairly consistent between deployments.” Similarly, good SRE staff take a code-first approach to solving problems and make operating production as simple as possible.

Who’s SRE’ing my résumé updates?

Does this mean we can start using all those DevOps books as kindling? Well, hardly. Doing software well has always passed through many names, but the general end goals have remained the same. We can all agree with a sentiment put well by Microsoft principal cloud developer advocate Bridget Kromhout: if DevOps means “good, tool-enhanced cross-team collaboration, I sure hope that never goes away.”

DevOps is “a cross-team practice, not a task,” she adds. “SRE is the new Ops Engineer, but DevOps shouldn’t have been considered to be a job in the first place. We don’t hire a collaboration expert to do all the collaborating.”

The idea of the organization treating development and operations as part of the same concern – creating and running good software – shouldn’t be lost. The DevOps practices of being more humane in working with people seem, well, humane and pragmatic. Most importantly, the emphasis on continuous improvement and the injection of lean thinking have helped lead to huge improvements in the organizations I talk with. As someone who’s always complaining about how boring the “cultural” aspects of DevOps are, I suspect they’ll be the most long-lasting, important precepts.

There may be a few stick-in-the-muds who get all roiled at the idea of DevOps shifting around, even “dying” if you’re the kind of person who likes that hyperbole (why are you looking at me?). Clearly, “DevOps” is evolving as it spreads into mainstream organizations and as newer technologies automate what were once manual practices and disparate tools. That doesn’t mean the core goals and “culture” zoom away. Just as with “serverless”, you’re supposed to take the idea seriously, not literally.

So, if you buy all that, consider this some 2018 career advice: it’s time to go update your résumé and say you’ve been doing “SRE” this whole time, but just not calling it that.