6 lessons from the Amazon Prime Video serverless vs. monolith flap
A software-development team caused quite a stir recently with a blog post describing how it abandoned a serverless architecture project in favor of a monolith—and slashed cloud infrastructure costs by 90% in the process.
But this wasn’t just any team; the post was written by Marcin Kolny, a senior software-development engineer at Amazon Prime Video.
Since Amazon is one of the leading advocates for serverless computing, not to mention the market leader in cloud services, the post was viewed either as a commendable act of openness or as the very definition of throwing your company under the bus. Either way, it triggered a passionate back-and-forth on social media platforms that focused on larger questions:
- Has the whole serverless/microservices/service-oriented architecture (SOA) movement been overhyped?
- Has the traditional monolithic approach to software development been underestimated?
- Is it time for a market correction similar to what we’re seeing with cloud in general, where some companies are moving apps from the cloud back to the data center and rethinking their cloud-first strategies?
Now that the dust has settled a bit, a closer examination of the Prime Video team’s experience reveals some key lessons that enterprises can apply going forward. Just as important, the issues they faced highlight the need for early input from networking pros when the application-planning process is just getting underway.
What went wrong?
The first question that needs to be asked is: Was this an edge case or an outlier, or does it have broader implications? The Amazon team was dealing with real-time video streams, so not exactly your average enterprise app, but the takeaways apply to any development process involving data-intensive, low-latency applications.
Prime Video was building a tool to analyze video streams for quality issues, such as video freezes or lack of synchronization between audio and video. In a complex, multi-step process, a media converter broke the streams into video frames and audio buffers that were then sent to defect detectors. Each defect detector, software that uses algorithms to identify defects and send real-time notifications, was running as its own microservice.
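The article doesn’t show Prime Video’s code, but the data flow it describes can be sketched in a few lines. The Python below is a minimal, hypothetical illustration: the `Chunk` format, the freeze check, and the silent-audio check (a stand-in for a real audio/video sync detector) are all assumptions. It runs every detector in one process; in the team’s original design, each detector was a separate microservice.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Optional


@dataclass
class Chunk:
    """One slice of a stream: a video frame plus its audio buffer (hypothetical format)."""
    index: int
    video_frame: bytes
    audio_buffer: bytes


@dataclass
class Defect:
    chunk_index: int
    kind: str


# A detector inspects one chunk and returns a Defect, or None if the chunk looks fine.
Detector = Callable[[Chunk], Optional[Defect]]


def media_converter(stream: Iterable[tuple[bytes, bytes]]) -> Iterator[Chunk]:
    """Stand-in for real decoding: each (video, audio) pair becomes a Chunk."""
    for i, (video, audio) in enumerate(stream):
        yield Chunk(i, video, audio)


def make_freeze_detector() -> Detector:
    """Flags a frame that is byte-for-byte identical to the previous one."""
    last_frame: Optional[bytes] = None

    def detect(chunk: Chunk) -> Optional[Defect]:
        nonlocal last_frame
        frozen = last_frame is not None and chunk.video_frame == last_frame
        last_frame = chunk.video_frame
        return Defect(chunk.index, "video_freeze") if frozen else None

    return detect


def silent_audio_detector(chunk: Chunk) -> Optional[Defect]:
    """Flags an audio buffer made entirely of zero bytes (hypothetical check)."""
    if not any(chunk.audio_buffer):
        return Defect(chunk.index, "silent_audio")
    return None


def analyze(stream: Iterable[tuple[bytes, bytes]], detectors: list[Detector]) -> list[Defect]:
    """Runs every detector over every chunk and collects the defects found."""
    defects: list[Defect] = []
    for chunk in media_converter(stream):
        for detect in detectors:
            found = detect(chunk)
            if found is not None:
                defects.append(found)
    return defects
```

Feeding `analyze` a three-chunk stream in which the second frame repeats the first and the third carries silent audio returns one `video_freeze` defect for chunk 1 and one `silent_audio` defect for chunk 2.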
Two problems became apparent as the team began to scale the application: there were too many expensive calls to Amazon S3 storage, and the process was difficult to orchestrate.
The Amazon Prime Video team explained, “We designed our initial solution as a distributed system using serverless components (for example, AWS Step Functions or AWS Lambda), which was a good choice for building the service quickly. In theory, this would allow us to scale each service component independently.”
“However, the way we used some components caused us to hit a hard scaling limit at around 5% of the expected load. Also, the overall cost of all the building blocks was too high to accept the solution at a large scale. To address this, we moved all components into a single process to keep the data transfer within the process memory, which also simplified the orchestration logic.”
The high-level architecture remained the same, and the team was able to reuse the original code and quickly migrate it to the new architecture, which consolidated the workflow into a single Amazon Elastic Container Service (ECS) task.
“Moving our service to a monolith reduced our infrastructure cost by over 90%. It also increased our scaling capabilities. Today, we’re able to handle thousands of streams and we still have capacity to scale the service even further,” the team wrote.
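The article doesn’t say exactly how data moved between the original components, but the team’s point about keeping data transfer within process memory can be illustrated with a simple call count. The sketch below assumes, hypothetically, that the media converter wrote each chunk to object storage and that each detector read it back. `CountingStore` is an in-memory stand-in for a service like S3, not an AWS API.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class CountingStore:
    """In-memory stand-in for an object store such as S3; it only counts calls."""
    objects: dict[str, bytes] = field(default_factory=dict)
    puts: int = 0
    gets: int = 0

    def put(self, key: str, data: bytes) -> None:
        self.puts += 1
        self.objects[key] = data

    def get(self, key: str) -> bytes:
        self.gets += 1
        return self.objects[key]


def run_distributed(chunks: list[bytes], detectors: list[Callable[[bytes], bool]],
                    store: CountingStore) -> list[bool]:
    """Every hop goes through storage: one write per chunk, one read per detector."""
    results = []
    for i, chunk in enumerate(chunks):
        key = f"chunks/{i}"
        store.put(key, chunk)
        for detect in detectors:
            results.append(detect(store.get(key)))
    return results


def run_in_process(chunks: list[bytes], detectors: list[Callable[[bytes], bool]]) -> list[bool]:
    """Monolith: detectors receive each chunk straight from memory."""
    return [detect(chunk) for chunk in chunks for detect in detectors]


if __name__ == "__main__":
    chunks = [b"frame-%d" % i for i in range(1_000)]
    detectors = [lambda c: len(c) == 0, lambda c: c.endswith(b"0"), lambda c: b"9" in c]
    store = CountingStore()
    assert run_distributed(chunks, detectors, store) == run_in_process(chunks, detectors)
    print(store.puts, store.gets)  # 1000 3000
```

Both versions return the same answers. The difference is the 4,000 storage calls the distributed version makes for 1,000 chunks and three detectors, a count that grows with every stream processed and every detector added.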
Reactions run the gamut
The post triggered lengthy discussions on social media. David Heinemeier Hansson, co-owner and CTO at SaaS vendor 37signals, was quick to jump into the fray. Hansson caused something of a stir himself recently when he decided to pull his company’s applications and data out of the Amazon public cloud.
Hansson fired off a blog post that took this basic position: “I won’t deny there may well be cases where a microservices-first architecture makes sense, but I think they’re few and far in between. The vast majority of systems are much better served by starting and staying with a majestic monolith.”
Hansson argues that the microservices/SOA approach works for large enterprises and hyperscalers, but not necessarily for smaller organizations. “If you’re Amazon or Google or any other software organization with thousands of developers, it’s a wonderful way to parallelize opportunities for improvement. Each service can be its own team with its own timeline, staff, and objectives. It can evolve independently, at least somewhat, of whatever else the rest of the constellation is doing. When you reach a certain scale, there simply is no other reasonable way to make coordination of effort happen. Otherwise, everyone will step on each other’s feet, and you’ll have to deal with endless merge conflicts.”
But the problem with breaking an application into multiple pieces is that it increases complexity. “Every time you extract a collaboration between objects to a collaboration between systems, you’re accepting a world of hurt with a myriad of liabilities and failure states,” Hansson says.
He adds that in today’s tech culture, the traditional monolithic application has become “a point of derision.” But he wants the culture to “embrace the monolith with pride.”
His definition of a monolith is “an integrated system that collapses as many unnecessary conceptual models as possible, eliminates as much needless abstraction as you can swing a hammer at. It’s a big fat ‘no’ to distributing your system lest it truly prevents you from doing what really needs to be done.”
Adrian Cockcroft, an industry veteran whose resume includes stints at Sun Microsystems, eBay, Netflix, Battery Ventures and AWS, weighed in with a different take.
He argues that the Prime Video team essentially used inaccurate terminology: they didn’t really go back to a monolith but were simply refactoring their initial implementation, which Cockcroft describes as a best practice.
Cockcroft says, “The Prime Video team followed a path I call Serverless First, where the first try at building something is put together with Step Functions and Lambda calls. When you are exploring how to construct something, building a prototype in a few days or weeks is a good approach. Then they tried to scale it to cope with high traffic and discovered that some of the state transitions in their step functions were too frequent, and they had some overly chatty calls between AWS Lambda functions and S3. They were able to re-use most of their working code by combining it into a single long-running microservice that is horizontally scaled using ECS, and which is invoked via a Lambda function. The problem is that they called this refactoring a microservice-to-monolith transition, when it’s clearly a microservice-refactoring step and is exactly what I recommend people do.”
Cockcroft does agree that microservices have been somewhat oversold and that there has been some backlash as organizations realize that “the complexity of Kubernetes has a cost, which you don’t need unless you are running at scale with a large team.”
He adds, “I don’t advocate ‘serverless only’, and I recommended that if you need sustained high traffic, low latency, and higher efficiency, then you should re-implement your rapid prototype as a continuously running autoscaled container, as part of a larger serverless-event driven architecture, which is what they did.”
6 takeaways IT pros should remember
There are important lessons that enterprise IT leaders can learn from the Amazon Prime Video example.
1. It’s not about the technology
“Don’t start with technology; start with goals,” recommends Pavel Despot, senior product marketing manager at Akamai. “Start with what you want to accomplish and build for the requirements presented.”
Vijay Nayar, founder and CEO at Funnel-Labs.io, agrees. “If you approach a problem and state that microservices or a monolith system is or isn’t the answer before you’ve even heard the problem, you’re shooting first and then asking questions. It’s reckless and leads to bad decision making.”
2. Analyze the trade-offs
Microservices bring flexibility, enabling independent development, deployment, and scalability of individual services. They also introduce complexity, including the need for service discovery, inter-service communication, and managing distributed systems.
Going the serverless route has the advantage of fast deployment because the underlying infrastructure upon which you’re building the application is spun up by the service provider on demand.
3. The original design has to be right
The underlying architecture that the application will run on has to be correct in the first place, or else any attempt to move from prototype to production will run into scaling problems.
David Gatti, an AWS specialist and CTO, says, “If you design the architecture incorrectly, it won’t work, will be expensive, and complicated. The idea of passing data from one Lambda to another to do some work is not a great idea; do all the processing in one Lambda.” He says the Amazon Prime Video team “made a bad design based on the workload needed, and now they are doing the right thing. This does not mean that all serverless is bad.”
4. Simplify languages and dependencies
Hansson says that “one of the terrible side effects of microservices madness is the tendency to embrace a million different programming languages, frameworks, and ecosystems.” He recommends no more than two languages: one tuned for productivity that can be used the vast majority of the time, and a second, high-performance language for addressing hot spots.
Nayar adds, “If you split your service into 100 different tiny services, and you can’t figure out where problems emerge from, and the spiderweb of dependencies among them makes deployment a nightmare, then that’s because you split your services without thinking about keeping their purpose clear and their logic orthogonal.”
5. Target specific use cases
Cockcroft says intermittent, small-scale enterprise workloads are good candidates for the serverless approach using AWS Step Functions and Lambda.
“When microservices are done right, they often target a narrow, isolated, and usually performance-critical segment of the system,” adds Hansson.
And Despot points out that while the serverless approach provides flexibility, the requirement that microservices talk to each other and to backend databases can impact latency.
6. Consider cost
Because serverless providers charge for the amount of time code is running, it might not be cost-effective to run an application with long-running processes in a serverless environment.
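A rough back-of-the-envelope model shows why. AWS Lambda, for example, bills per request plus compute time measured in gigabyte-seconds, while a continuously running container is billed for the time it runs whether or not it’s busy. The Python sketch below compares the two. All prices are placeholders the caller supplies, not current AWS rates, and the 730-hour month is an approximation.

```python
HOURS_PER_MONTH = 730  # approximate average month


def serverless_monthly_cost(invocations: float, avg_duration_s: float, memory_gb: float,
                            price_per_gb_s: float, price_per_request: float) -> float:
    """Pay-per-use model: compute billed in GB-seconds plus a per-request fee."""
    gb_seconds = invocations * avg_duration_s * memory_gb
    return gb_seconds * price_per_gb_s + invocations * price_per_request


def always_on_monthly_cost(instances: int, hourly_rate: float,
                           hours: float = HOURS_PER_MONTH) -> float:
    """Always-on model: pay for every hour, busy or idle."""
    return instances * hourly_rate * hours


def break_even_invocations(avg_duration_s: float, memory_gb: float,
                           price_per_gb_s: float, price_per_request: float,
                           always_on_cost: float) -> float:
    """Monthly invocation count above which the always-on option becomes cheaper."""
    per_invocation = avg_duration_s * memory_gb * price_per_gb_s + price_per_request
    return always_on_cost / per_invocation


if __name__ == "__main__":
    # Placeholder prices for illustration only; check the provider's pricing page.
    container = always_on_monthly_cost(instances=1, hourly_rate=0.05)
    n = break_even_invocations(2.0, 1.0, 0.00002, 0.0000002, container)
    print(f"always-on: ${container:.2f}/month; serverless costs more beyond ~{n:,.0f} invocations")
```

This is the logic behind Cockcroft’s advice: intermittent, small workloads tend to sit well below the break-even line, while sustained, high-volume processing can sit well above it.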
And then there’s the lesson that the Amazon Prime Video team learned: storage costs can bite you if you’re not careful. AWS storage pricing varies by storage class, with faster-access classes costing more per gigabyte stored than slower, archival ones. On top of that, S3 customers are charged for every request and for some data transfers, so overly chatty applications can rack up charges pretty quickly.
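The request side of the bill scales in a similar way. Amazon S3, for instance, prices requests per thousand, with write-type requests such as PUT costing more than GET. The hypothetical sketch below assumes the converter writes each chunk once and every detector reads it once. The stream count, chunk rate, and per-thousand prices are illustrative parameters, not Prime Video figures or AWS list prices.

```python
SECONDS_PER_30_DAYS = 30 * 24 * 3600


def monthly_request_cost(streams: int, chunks_per_second: float, detectors: int,
                         put_price_per_1k: float, get_price_per_1k: float,
                         seconds: float = SECONDS_PER_30_DAYS) -> float:
    """Request charges when every chunk makes a round-trip through object storage.

    Assumes one PUT per chunk (the converter's write) and one GET per chunk
    per detector (each detector's read).
    """
    chunks = streams * chunks_per_second * seconds
    puts = chunks
    gets = chunks * detectors
    return puts / 1000 * put_price_per_1k + gets / 1000 * get_price_per_1k
```

Under these assumptions, doubling the number of streams or the chunk rate doubles the request bill, and each added detector adds another full set of GETs. Keeping intermediate data in memory, as the Prime Video team did, removes these intermediate requests altogether.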
Copyright © 2023 IDG Communications, Inc.