The Next Step in Operations – O’Reilly


Platform engineering is the latest buzzword in IT operations. And like all other buzzwords, it’s in danger of becoming meaningless—in danger of meaning whatever some company with a “platform engineering” product wants to sell. We’ve seen that happen to too many useful concepts: Edge computing meant everything from caches at a cloud provider’s data center to cell phones to unattended data collection nodes on remote islands. DevOps meant, well, whatever anyone wanted. Culture? Job title? A specialized group within IT?

We don’t want that to happen to platform engineering. IT operations at scale is too important to leave to chance. In her forthcoming book Platform Engineering, Camille Fournier notes that platform engineering has been used to mean anything from an ops team wiki to dashboards to APIs to container orchestration with Kubernetes. All of these have some bearing on platform engineering. But none of them are platform engineering. Taken together, they sound like the story of blind men describing an elephant: one grabs hold of a tusk, another the tail, another a leg, but none of them have a picture of the whole. Camille offers a holistic definition of platform engineering: “a product approach to developing internal platforms that create leverage by abstracting away complexity, being operated to provide reliable and scalable foundations, and by enabling application engineers to focus on delivering great products and user experiences.” (Emphasis Camille’s.)


That sounds abstract, but it’s both precise and helpful. “A product approach” is a theme that comes up repeatedly in discussions of platform engineering: treating the platform as a product and software developers—the users of the platform—as customers, and building with the customer’s needs in mind. There’s been a lot of talk about the death of DevOps; there was even a brief NoOps movement. But as Charity Majors pointed out at PlatformCon 2023, the reality of operations engineering is that it has become fantastically complex. The time when “operations” meant racking a few servers and installing Apache and MySQL is long gone. While cloud providers have taken over the racking, stacking, and software installation, they now offer scores of services, each of which has to be configured correctly. Applications have grown more complex too: we now have fleets of microservices operating asynchronously across hundreds or thousands of cloud instances. And as applications have become more complex, so has operations. It’s been years since operations meant mumbling magical incantations into server consoles. That’s not repeatable; that’s not scalable; that’s not reliable. Unfortunately, we’ve ended up with a different problem: modern software systems can only be operated by the developers who created them.

The problem is that software engineers want to do what software engineers do best, and that’s write cool new applications. They don’t want to become experts in the details of hosted Kubernetes, complex rules for identity and access management (IAM), monitoring and observability, or any of the other tasks that have become part of their workload. What’s needed is a new set of abstractions that allows both developers and operations staff to move to a higher level.

That gets to the heart of platform engineering: abstracting away complexity (in Camille’s words) or making developers more effective (in Charity’s). How do we develop software in the 21st century? Can improved tooling make developers more effective by working around productivity roadblocks? Can we let operations staff worry about issues like service-level agreements (SLAs) and uptime? Can operations staff take care of complex issues like load balancing, business continuity, and failover, which the applications developers use through a set of well-designed abstractions? That’s the challenge of platform engineering. Developers have enough complexity to worry about without taking on operations.

The fantasy of platform engineering is “one-click deployment”: write your application and click on a “deployment” item in your control panel, and the application moves smoothly and painlessly through testing, integration, and deployment. Life is almost never that simple. Deployment itself isn’t a simple concept, what with canary deployments, A/B testing, rollbacks, and so on.

But there is a reality, and behind that reality are some real successes. Facebook used to talk about requiring new hires to deploy something to its site on their first day at work. This predates “platform engineering,” “developer platforms,” and all of that, but it clearly shows that abstractions that simplify software deployment in a complex environment aren’t new.

Writing about his experience at LinkedIn in 2011, Kevin Scott (now CTO of Microsoft) describes how the company found itself in a huge developmental mess just as it went public. It was almost impossible to deploy new features: several years as a startup that was moving fast and breaking things had resulted in a tangled web of conflicting processes and technical debt. “Automate all the things” was a powerful slogan—but as attractive as that sounds, it has a very real downside. LinkedIn took the bold step of halting new development for as long as it took to build a consistent platform for deploying software. It ended up taking several months (and put several careers on the line, including Scott’s), but it was ultimately a success. LinkedIn went from releasing new features once a month, if that, to being able to release several times a day.

What’s particularly interesting about this story is that, writing several years after the fact, Scott uses none of the language that we now associate with “platform engineering.” He doesn’t talk about developer experience, internal developer platform, or any of that. But what his team clearly accomplished was platform engineering of the highest order—and that probably saved LinkedIn because, despite its highly successful IPO, a web startup that can’t deploy is dead in the water.

Walmart has a similar story about improving its DevOps and CI/CD practices. Daily deployment exposed problems in tools, procedures, and processes. These problems were addressed by a DevOps team and were forwarded to a platform team. Like the events recounted above, the work took place in the 2010s. Also like Scott’s LinkedIn story, Walmart’s narrative doesn’t use the language that we now associate with platform engineering.

The Heroku platform as a service is another example of platform engineering’s prehistory. Heroku, which made its debut in 2007, made single-click deployment a reality, at least for simple applications. When programming with Heroku, you didn’t need to know anything about the cloud and very little about how to wire the database to your application. Almost everything was taken care of for you. While Heroku never went quite far enough, it gave web developers a taste of what might be possible.

All of these examples make it clear that platform engineering isn’t anything new. What we now call “platform engineering” consolidates practices that have been around for some time; it’s the natural evolution of movements like DevOps, infrastructure as code, and even the scripting of common maintenance tasks. Whether they’re “software developers” as such or operations staff, people in the software industry have always built tools to make their jobs easier. Platform engineering puts this tool-building on a more rigorous and formal basis: it recognizes that building tools and creating abstractions for complex processes is engineering, not hacking. LinkedIn’s problem wasn’t a lack of tooling. It was several years of wildcat tool development and ad hoc solutions that eventually turned into a mass of seething bits and choked out progress. The solution was doing a better job of engineering the company’s tooling to build a consistent and coordinated platform.

In “DevOps Isn’t Dead, But It’s Not in Great Health Either,” Steven Vaughan-Nichols argues that DevOps may not be delivering: only 14% of companies can get software into production in a day and only 9% can deploy multiple times per day. To some extent, this is no doubt because many organizations that claim to have adopted DevOps, CI/CD, and similar ideas never really change their practices or their culture; they rename existing practices without changing anything substantial. But it’s also true that software deployment has become more complex and that, as LinkedIn learned, undisciplined tool development can result in a mountain of technical debt. Architectural styles like microservices decompose large monoliths into smaller services—but then the correct configuration and deployment of those services becomes a new bottleneck, a new nucleus around which technical debt can accumulate.

The list of problems that platform engineering should solve for software developers gets long quickly. It contains everything from smoothing the path from the developer’s laptop to a source control repository to deploying software to the cloud in production. The more you look, the more tasks you’ll find that need simplifying. Many security problems result from incorrectly configured identity and access management (IAM). Can IAM be simplified in a way that prevents errors? When AWS first appeared, we were all amazed at how simple it was to spin up virtual instances and store data. But provisioning a service that uses dozens of available services and runs across thousands of instances, some in the cloud and some on-premises, is far from simple. Getting it wrong can lead to a nightmare for performance and scaling. Can the burden of correctly provisioning infrastructure be minimized? Deployment isn’t just pushing something to a server or even a fleet of servers; it may include canary deployments, A/B testing, and rollback capabilities. Can these complex deployment scenarios be simplified? Any deployment needs to take scaling into account; if software can’t accommodate the company’s current and near-term needs, it’s in trouble. Can a platform incorporate practices that simplify scalability? Failover and business continuity in the event of outages, minimizing cost by optimizing the size of the server fleet, regulatory compliance: these are all issues that are important in the 2020s and that, if we’re being honest, we really didn’t think much about 20 years ago. Do developers need to worry about failover, or can it be part of the platform?
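To make one of those questions concrete, here is a minimal sketch of how a platform might hide a canary rollout behind a single call. Everything in it is hypothetical: the `CanaryPolicy` fields, the traffic steps, and the `error_rate` callback stand in for whatever a real platform would read from its own monitoring system.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class CanaryPolicy:
    """Platform-owned defaults (hypothetical); developers rarely need to touch them."""
    traffic_steps: Sequence[int] = (5, 25, 50, 100)  # percent of traffic per stage
    max_error_rate: float = 0.01                      # roll back above 1% errors


def run_canary(error_rate: Callable[[int], float],
               policy: CanaryPolicy = CanaryPolicy()) -> str:
    """Walk through the traffic steps, rolling back at the first unhealthy one.

    `error_rate(percent)` is an assumed hook that returns the observed error
    rate of the new version while it serves `percent` of production traffic.
    """
    for percent in policy.traffic_steps:
        observed = error_rate(percent)
        if observed > policy.max_error_rate:
            return f"rolled back at {percent}% (error rate {observed:.3f})"
    return "promoted to 100%"


# Simulated monitoring data: one healthy release, one that degrades under load.
healthy = lambda percent: 0.002
degrading = lambda percent: 0.002 if percent < 50 else 0.04

print(run_canary(healthy))    # promoted to 100%
print(run_canary(degrading))  # rolled back at 50% (error rate 0.040)
```

The point of the sketch is the division of labor: the platform team owns the policy and the rollback logic, while the application developer only triggers the rollout and reads the result.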

The key word in platform engineering isn’t “platform”; it’s “engineering.” Solid engineering is needed to move up the abstraction ladder, as Yevgeniy Brikman has said. But what does that mean?

Definitions of platform engineering frequently talk about treating the developer as a customer. That can feel very weird when you think (or read) about it. Your company already has “customers.” Are your engineers “customers” too? But that shift in mindset, from treating software developers as a labor asset to treating them as customers, is crucial. Camille Fournier means the same thing when she writes about “a product approach to developing internal platforms”: a platform engineering team has to take its customers seriously, has to understand what the customers’ problems are, and has to come up with effective solutions to those problems.

Platform engineering has the same pitfalls as other kinds of product development. It’s important to build for the customer, not for the engineer designing the product. Techno-solutionism—thinking that all problems can be solved by applying state-of-the-art technology—usually degenerates into implementing ideas because they’re cool, not because they’re appropriate. It almost always imposes solutions from outside the problem space, forcing one group’s ideas on customers without thinking adequately about the customers’ needs. It’s poor engineering. Good engineering may require sitting in the customer’s chair and performing their tasks often enough to get a good feel for their real requirements. Domain-driven design (DDD) is a good tool for flushing out customers’ needs; DDD stresses doing in-depth research to understand product requirements and doesn’t assume that every group within an organization has the same requirements. An organization may be represented by a number of bounded contexts, each of which has its own requirements and each of which needs to be considered in engineering a developer platform. One-size-fits-all solutions usually fail. It’s also a mistake to assume that a developer platform should solve all of the developers’ problems. Getting to 80% may be all you can do; the old 80/20 rule is still a good rule of thumb.

Platform engineering is necessarily opinionated: platform engineers need to develop ideas about how software development workflows should be handled. But it’s also important to understand the limits of “opinionated software.” David Heinemeier Hansson (DHH) popularized the idea of “opinionated software” with Ruby on Rails, which implemented his ideas about what kinds of support a web platform should provide. Were DHH’s opinions correct? That’s the wrong question. DHH’s opinions allowed Rails to thrive, but that’s only platform engineering within the context of DHH’s company, 37signals. Rails’ success among web developers would have meant little if it hadn’t been accepted by 37signals, regardless of how successful it was outside. Likewise, if the software developers at your company choose not to use the platform you develop, it has failed, no matter how good your opinions may be. If the platform imposes rules and procedures that aren’t natural to the platform’s users, it will fail. Opinionated software has to recognize that there are many ways to solve a problem and that users are always free to reject the software that you build. The users’ opinions are more important than the platform engineers’. Writing about site reliability engineering, Laura Nolan discusses the importance of the Greek concept metis: local, specific, practical, and experiential knowledge. Platform engineering must take that local knowledge into account without getting stuck on “we’ve always done it that way.” Listening to the platform’s eventual users is key; that’s how you develop a coherent product focus.

Platform engineering is necessarily an attempt to impose some kind of order on a chaotic situation—that’s the lesson LinkedIn learned. But it’s also important to recognize, as Camille Fournier said in conversation, that there’s always chaos. We may not like to admit it, but software development is inherently a chaotic process. What happens when one company acquires another company that has its own developer platform? How do you reconcile the two, or should you even try? What happens when different groups in a company develop different processes for managing their problems? Domain-driven design’s concept of “bounded context” can help here. Some unification is probably necessary, but complete unification would almost certainly require a huge expense of time and effort, in addition to alienating a lot of developers. Imposing structure under the guise of “being opinionated” is a path to failure for a software platform. Platform engineers need to develop a product that their users want, not one that their users will fight. Again, good engineering requires listening to the customers. They may not know what they need, but their experience is the ground truth that a platform engineer has to work from.

Platform engineers also need to think carefully about “paved paths.” The term “paved paths” (often called “golden paths”) shows up frequently in the platform engineering literature. A paved path is a process that has been smoothed out, regularized, made easy by the platform. It’s common wisdom to pave the simplest and most frequently used paths first; after all, this makes it look like you’re accomplishing a lot and have good coverage. But is this the best way to look at the problem? Software developers probably already have tools and processes for managing the simplest and most commonly used paths (which aren’t necessarily the same). The right question to ask is where platform engineering can make the biggest difference. Given that the goal is to reduce the burden of complexity, what processes are the biggest problem? What solution would most reduce the developers’ burden of complexity? The best approach probably isn’t to reinvent solutions to problems that have already been solved—that can come later, if it’s necessary at all. Instead, it may be worthwhile to fit older solutions into a new framework. What problems get in developers’ way? That’s where to start.

By now, it should be obvious that, while platform engineering is about product development, it isn’t about a product like Excel or GitHub. It’s not about building a one-size-fits-all platform that can be packaged and marketed to different organizations. Each company has its own context, as does each group within a company. Each has its own requirements, its own culture, its own rules, and those must be observed—or if they must be changed, they must be changed very carefully. Engineering is always about making compromises, and frequently the most appropriate solution is the least worst, as Neal Ford has said. This is where domain-driven design, with its understanding of bounded context, can be very helpful. A platform engineer must discover the rules and requirements that aren’t stated, as well as the ones that are.

And now with AI? Sure. There’s no reason not to incorporate AI into engineering platforms. But there’s little here that requires AI. It’s likely that AI could be used effectively to analyze a project and estimate infrastructure requirements. It’s possible that AI could be used to help with code review, though the final word on code review needs to be human. There are many other possible applications. AI’s biggest value might lie not in suggesting ways to smooth various pathways but in the design process behind the platform. It’s possible that AI could analyze and summarize current practices and suggest better abstractions. AI is less likely than humans to be stuck in the trap of “the way we’ve always done it.” But humans have to remain in the loop at all times. As with software architecture, the hard work of platform engineering is understanding human processes. Gathering information about processes, understanding the reasoning behind them, and coming to grips with the history, the economics, and the politics still requires human judgment. It’s not something that AI is good at yet. Will we see increased use of AI in platform engineering? Almost certainly. But whatever you do or don’t do with AI, please don’t do it merely for buzzword compliance. AI will have a place. Find it.

That’s one side of the coin. The other side is that companies are investing in building applications that incorporate AI. It’s easy to assume that software incorporating AI isn’t much different from traditional applications, but that’s a mistake. Platform engineering is all about managing complexity, and incorporating AI in an application will inevitably increase complexity. Accommodating AI will certainly stress our ideas about continuous delivery: What does automated testing mean when a model’s output is stochastic, not deterministic? What does CD mean when evaluating an application’s fitness may take much longer than developing it? Platform engineering will need a role in testing and evaluation of AI models. There will need to be tools to detect when an application is being abused or delivering inappropriate results. Models need to be monitored so they can be retrained when they grow stale. And there will be new options for managing the cost of deploying AI applications. How do you help manage that complexity? Platform engineers will need to take all of this, and more, into account. A platform that only solves yesterday’s problems is an obstruction.
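One hedged way to approach the first of those questions: instead of asserting on a single output, a test can sample the model repeatedly and assert on a pass rate. The sketch below assumes a hypothetical `generate` callable and a hypothetical acceptance check; a seeded stand-in model keeps the example reproducible, and the sample count and threshold are illustrative, not recommendations.

```python
import random
from typing import Callable


def pass_rate(generate: Callable[[str], str],
              is_acceptable: Callable[[str], bool],
              prompt: str,
              samples: int = 200) -> float:
    """Fraction of sampled outputs that satisfy the acceptance check."""
    passed = sum(is_acceptable(generate(prompt)) for _ in range(samples))
    return passed / samples


def make_fake_model(seed: int, failure_probability: float) -> Callable[[str], str]:
    """Stand-in for a real model: usually answers, occasionally refuses."""
    rng = random.Random(seed)

    def generate(prompt: str) -> str:
        if rng.random() < failure_probability:
            return "I cannot help with that."
        return f"Answer to: {prompt}"

    return generate


model = make_fake_model(seed=7, failure_probability=0.05)
rate = pass_rate(model, lambda text: text.startswith("Answer"), "reset my password")

# The assertion is statistical: it tolerates a few bad samples but fails the
# build if quality drops well below the expected level of roughly 95%.
assert rate >= 0.85, f"pass rate too low: {rate:.2%}"
print(f"pass rate: {rate:.2%}")
```

A test like this trades certainty for coverage. Choosing the threshold and the number of samples becomes an engineering decision in its own right, and one a platform could reasonably standardize.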

So what does a platform engineer engineer? Is it a surprise to say that what a platform engineer builds depends on the situation? A developer dashboard for deploying and other tasks might be part of a solution. It’s hard to imagine a platform engineering project in which an API isn’t part of the solution. A DevOps wiki might even be part of a solution, though standing up a wiki hardly requires engineering. Collecting a company’s collective wisdom and lore about building projects might help platform engineers to work toward a better solution. But it’s important not to point to any of these things and say “This is it—building that is platform engineering.” Focusing on any single thing tends to attract platform engineering teams to the latest fad. Does this repeat the history of DevOps, which was hampered by its refusal to define itself? No. Platform engineering is ultimately engineering. And that engineering must take into account the entire process, starting with gathering requirements, understanding how software developers work, learning where complexity becomes burdensome, and finding what paths are most in need of paving. It proceeds to building a solution—a solution that is, by definition, never finished. There will always be new paths to pave, new kinds of complexity to abstract. Platform engineering is an ongoing process.

Why are you doing platform engineering? How do you justify it to senior management? And how do you justify it to the software developers that you’re serving?

We hope that justifying platform engineering to software developers is easy—but that isn’t guaranteed. You’re most likely to succeed with software developers if they feel like they’ve been listened to and that you’re not imposing a set of opinions on them. Developers have insight into the problems they face; take advantage of it. Engineering solutions that reduce the burden of complexity are the key to success. If you’re succeeding, you should be seeing deployments increase; you should be seeing less frustration; and you should see metrics for developer productivity headed in the right direction. On the other hand, if a platform engineering solution just becomes one more thing for software developers to work around, it has failed. It doesn’t need to solve all problems initially, but a quick minimum viable product will go a long way to convincing developers that a platform has value.

Justifying platform engineering to management is a different proposition. It’s easy to look at a platform engineering team and ask, “Why does this exist? What’s the ROI? Why am I paying expensive engineers to create something that doesn’t contribute directly to the product we sell?”

The first part of the answer is simple. Platform engineering isn’t anything new. It’s the next stage in the evolution of operations, and operations has been a cost center since the start of computing. In the long arc of computing history, we’ve been evolving from a large number of operators watching over a single computer (a 1960s mainframe required a significant staff and had less computational ability and storage than a Raspberry Pi) to a small number of operators responsible for thousands of virtual machines or instances running in the cloud. Platform engineering done well is the next stage in that evolution, allowing the staff to operate even larger and more complex systems. It’s not additive, something new that has to be implemented and resourced. It’s doing what you’re already doing but better.

If senior management thinks that platform engineering doesn’t contribute directly to the product, they need to be educated in what it means to ship a software product. They need to understand that there is no product without deployment, without testing, without provisioning infrastructure. Doing this infrastructure work more efficiently and effectively contributes directly to the product. A product that can’t be deployed—or where deployments take months rather than hours—is dead in the water.

But that argument isn’t really convincing without metrics. Go back to the business problem you’re trying to solve. Do you want to increase the rate at which you release software? Document that. Are you trying to make it easier to add features or fixes without a full redeployment? Document that. Are you trying to decrease the time between a bug report and a bug fix? Document that. Programmers often think that software is self-justifying. It isn’t. It’s important to keep your eyes on the business goals and how the platform is affecting them.

The DORA metrics are a good way to show the need for better processes, along with measuring whether platform engineering is making processes more efficient. Can you demonstrate that platform engineering efforts are enabling you to get features and bug fixes into your company’s product and out to customers more quickly? Can a platform engineering effort help the company use cloud services more efficiently by avoiding duplication and oversubscription? Can you measure the amount of time developers spend on new features or fixes, as opposed to infrastructure tasks? In his PlatformCon 2024 talk, Manuel Pais suggests measuring the percentage of the company’s income that’s supported by the platform. That exercise shows how important the platform is to the company. Platforms do generate value, but platform engineers frequently don’t make the effort to quantify that value when they talk to management. Once you know the value of the platform, it’s possible to forecast how the platform’s value increases over time. A platform is a strategic asset, not just a sunk cost.
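As a rough illustration of what documenting those numbers could look like, the sketch below computes the four DORA metrics (deployment frequency, lead time for changes, change failure rate, and time to restore service) from a list of deployment records. The `Deployment` record, its field names, and the sample data are assumptions made for this example; a real team would pull the equivalent facts from its CI/CD and incident tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass(frozen=True)
class Deployment:
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause an incident?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_summary(deployments: list[Deployment], period_days: int) -> dict[str, float]:
    """Summarize the four DORA metrics over a reporting period."""
    if not deployments:
        raise ValueError("need at least one deployment to summarize")
    one_hour = timedelta(hours=1)
    failures = [d for d in deployments if d.failed]
    lead_times = [(d.deployed_at - d.committed_at) / one_hour for d in deployments]
    restore_times = [(d.restored_at - d.deployed_at) / one_hour
                     for d in failures if d.restored_at is not None]
    return {
        "deployments_per_day": len(deployments) / period_days,
        "median_lead_time_hours": median(lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        "median_time_to_restore_hours": median(restore_times) if restore_times else 0.0,
    }


# Made-up sample data covering one week.
day = datetime(2024, 1, 1, 9, 0)
history = [
    Deployment(day, day + timedelta(hours=4)),
    Deployment(day, day + timedelta(hours=30), failed=True,
               restored_at=day + timedelta(hours=32)),
    Deployment(day + timedelta(days=2), day + timedelta(days=2, hours=6)),
    Deployment(day + timedelta(days=3), day + timedelta(days=3, hours=2)),
]
print(dora_summary(history, period_days=7))
```

Running the same summary before and after a platform change gives management a comparison in numbers rather than an argument from principle.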

Most companies already have a developer platform, whether it’s a bunch of old shell scripts, an unmaintained wiki, or a highly engineered set of tools for continuous integration and deployment. These platforms don’t all deliver the same kind of value; some may not deliver any value at all. The reality is that no company can exist for long without deploying software, and no company can develop software if its developers are spending all their time chasing down infrastructure problems.

The platform is already there. Whether it’s working for or against you is a different question. Treating your engineering teams as customers and building a product that satisfies their needs is hard, important work. It means understanding their problems as they see them. It means coming up with new abstractions that hide complexity. And in the end, it means making it easier to deploy software successfully at scale. That’s platform engineering.




