Beyond Google SRE: What is Site Reliability Engineering like at Medium?
We had the opportunity to sit down with Nathaniel Felsen, DevOps Engineer at Medium and the author of “Effective DevOps with AWS”. We are happy to share some practical insights from Nathaniel’s extensive experience as a seasoned DevOps and SRE practitioner.
While we hear a lot about these experiences from Google, Netflix, etc., we wanted to gather perspectives on DevOps and SRE life at other, more relatable companies. From tech-stack challenges to organization structure, Nathaniel provides a wide range of practical insights that we hope will be valuable in improving DevOps practices at your organization.
How is Site Reliability Engineering (SRE) practiced at Medium?
Medium takes a slightly different approach to SRE. We have split the responsibility into two groups:
There is a DevOps team responsible for automation, deployment management, understanding the impact of deployments and improving collaboration across teams.
There is a separate group called “the Watch”, which is the on-call rotation team. The Watch is made up of two product developers who serve for two weeks at a time. They mostly handle day-to-day operations: responding to pages, triaging bugs, handling deployment issues and rollbacks, etc. As a developer, you go on the Watch every 4 months.
My time is divided across both these teams and I get to see both worlds quite closely. While I lead the Watch team, I also invest significant time doing DevOps engineering including creating build pipelines, automating aspects such as monitoring, and providing feedback on production readiness when a service is ready to launch.
Give us a flavor of your tech-stack?
All AWS and containers using ECS. We started with a monolithic application and still have that. But in order to build newer services faster, we are using containers and adding them as services. So, we have a monolith as well as quite a few containerized services.
We heavily use Node.js, but it becomes harder to manage Node.js as the application grows. Many new services are being written in Go, and we are experimenting with React as the new framework for our front-ends. Other than that, we leverage all the usual suspects from AWS: Auto-scaling, ELB, SQS, Kinesis, Lambda, DynamoDB, etc.
What would you say are the top challenges for the Watch team?
In distributed architectures, including microservices, it is hard to address the bottlenecks. For example, we use a lot of queues in the backend for asynchronous processing. We do a lot of asynchronous processing for our recommendation engine, which tends to increase the number of messages added to SQS queues. To address this, we add more “queue consumers” but then we hit limits on how fast we can write into DynamoDB. Our primary challenges are around understanding the dependencies, identifying bottlenecks and addressing them for the short and long term.
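The bottleneck described above can be sketched as simple arithmetic. This is a minimal illustration, not Medium's tooling: the function name, the rates, and the assumption that each message results in exactly one database write are all hypothetical.

```python
def queue_growth_per_second(produce_rate, consumers, per_consumer_rate, db_write_limit):
    """Return how fast the queue backlog grows, in messages per second.

    Assumes (hypothetically) that every consumed message causes one write,
    so consumers can never drain faster than the database accepts writes.
    """
    drain_rate = min(consumers * per_consumer_rate, db_write_limit)
    # A result of 0 means the consumers are keeping up with the producers.
    return max(0, produce_rate - drain_rate)


# 5 consumers: the consumers themselves are the bottleneck.
few = queue_growth_per_second(1000, consumers=5, per_consumer_rate=100, db_write_limit=800)
# 20 consumers: the backlog still grows, now capped by the database write limit.
many = queue_growth_per_second(1000, consumers=20, per_consumer_rate=100, db_write_limit=800)
```

With these made-up numbers, adding consumers shrinks the backlog growth but cannot eliminate it, because the limit has moved downstream to the database.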
Share with us a “house on fire” war story along with the key learning?
This incident is a great example of how features that you build down the line might not play well with your database structure from the past. The White House took the initiative to publish the script of the State of the Union. Naturally, this was going to be a very popular Medium post.
Usually, our auto-scaling capabilities are very well established to handle such traffic spikes. However, we had recently launched a new feature of “highlighting and sharing” text. So, naturally, along with huge readership came heavy usage of the highlight feature. Through a sequence of dependencies, the highlight feature eventually ends up invoking a service that does a write call to DynamoDB. We were sharding the content based on post-id, which had worked fine for that table until now. The highlight feature, though, swamped our table because all the writes were on the same post-id and hence on the same shard!
As I had said earlier, the key learning here is to be able to visualize and understand the intricate dependencies in modern applications.
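To make the hot-shard effect concrete, here is a small simulation. It is only a sketch under stated assumptions: the shard count, the MD5-based `shard_for` function, and the `spread_key` suffix scheme are hypothetical and do not reflect how DynamoDB partitions data internally or what Medium actually changed. Appending a bounded suffix to the key is one commonly used way to spread writes for a single popular item.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 8  # hypothetical shard count, for illustration only


def shard_for(key: str) -> int:
    """Map a key to a shard with a stable hash (a stand-in for real partitioning)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS


def spread_key(post_id: str, write_seq: int, buckets: int = NUM_SHARDS) -> str:
    """Hypothetical write-sharded key: the post id plus a bounded suffix."""
    return f"{post_id}#{write_seq % buckets}"


def shard_load(keys) -> Counter:
    """Count how many writes land on each shard."""
    return Counter(shard_for(k) for k in keys)


# 1000 highlights on one post, keyed by post-id alone: a single shard takes them all.
hot = shard_load("post-123" for _ in range(1000))
# The same writes with a suffix are distributed over several shards.
spread = shard_load(spread_key("post-123", i) for i in range(1000))
```

The trade-off is on the read side: fetching all highlights for a post now means querying every suffix and merging the results.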
What attracted you to Netsil?
Since we have a bunch of services and microservices, understanding dependencies is a common and critical task. We were running tcpdump and loading the captures into Wireshark. But when I heard about Netsil, I found that it could do dependency analysis for us. Netsil would auto-discover API communications and automatically produce a nice graphical version (maps) of what we were doing with tcpdump + Wireshark.
With Netsil’s auto-discovered maps we are able to identify dependency chains such as “The Monolith → Queue → other services → HAProxy & ELB → Social Service → Graph database”. Netsil also gives us insights into latencies and throughput for these API calls, which helps us identify the hotspots in our dependency chains. If you have a modern microservices application, then Netsil is great for monitoring and tracing the dependencies.
Another use case for us was to do build comparisons and catch code deployment issues. Our entire deployment pipeline is automated, allowing us to deploy new services dozens of times a day. In order to identify regressions caused by bugs, we use tools such as Netsil to analyze the HTTP status codes (e.g. 400/500 errors) of new builds by exposing the build id in the HTTP header. Thanks to that system, we are able to prevent bugs from making their way to production and have our Watch team analyze and file bugs for these issues.
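The build comparison described above could look roughly like the following. This is a hedged sketch, not Netsil's API or Medium's pipeline: the record shape `(build_id, status_code)`, the function names, and the 1% tolerance are assumptions made for illustration.

```python
from collections import defaultdict


def error_rate_by_build(requests):
    """Compute the share of 4xx/5xx responses per build.

    `requests` is an iterable of (build_id, status_code) pairs, e.g. parsed
    from traffic where each response exposes its build id in a header.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for build_id, status in requests:
        totals[build_id] += 1
        if status >= 400:
            errors[build_id] += 1
    return {build: errors[build] / totals[build] for build in totals}


def is_regression(rates, baseline, candidate, tolerance=0.01):
    """Flag the candidate build if its error rate exceeds the baseline by more than `tolerance`."""
    return rates[candidate] > rates[baseline] + tolerance


sample = (
    [("build-41", 200)] * 99 + [("build-41", 500)] * 1
    + [("build-42", 200)] * 90 + [("build-42", 500)] * 10
)
rates = error_rate_by_build(sample)
```

Here `build-42` would be flagged, since 10% of its responses are errors against 1% for `build-41`.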
We’ve been implementing a request tracing service for over a year and it’s not complete yet. The challenge with this type of tool is that we need to add code around each span to truly understand what’s happening during the lifetime of our requests. The frustrating part is that if the code is not instrumented, or a header is not carrying the id, that code becomes a risky blind spot for operations.
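The point about headers carrying the id can be illustrated with a few lines. This is a simplified sketch: the header name `X-Trace-Id` and the helper `with_trace_id` are hypothetical, and real tracing systems also record spans, timing, and parent/child relationships.

```python
import uuid

TRACE_HEADER = "X-Trace-Id"  # hypothetical header name


def with_trace_id(incoming_headers):
    """Return headers for an outgoing call, reusing the caller's trace id.

    If the incoming request has no trace id, a new one is generated so that
    downstream calls can still be correlated with each other.
    """
    headers = dict(incoming_headers)
    headers.setdefault(TRACE_HEADER, uuid.uuid4().hex)
    return headers


# A service that forwards the id keeps the chain intact.
forwarded = with_trace_id({TRACE_HEADER: "abc123", "Accept": "application/json"})
# A request arriving without an id starts a new trace.
started = with_trace_id({"Accept": "application/json"})
```

Any service in the chain that skips this step breaks the correlation, which is the blind spot mentioned above.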
What would be your tips for fellow DevOps Engineers & SREs?
To check out my book! 🙂
Measuring everything doesn’t mean alerting on everything. Whenever you are investigating an issue, having all the data you need is critical to get to the bottom of an issue. You don’t want to spend the first 20 mins of an outage trying to gather information about what’s going on.
You want service alerts to be important, timely and actionable. You don’t want the on-call engineer to suffer from alert fatigue and constantly see warnings (or even worse, get paged) for issues they can’t fix or that don’t matter (e.g., issues with an internal reporting system may not require waking up the on-call engineer at 3am). For web applications, for example, you can usually focus on top-level metrics such as latency and error rate and rely on your dashboards inside Netsil and other monitoring tools to tell you why those metrics are higher than expected.
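One way to encode "measure everything, page on little" is a small routing rule. The sketch below is illustrative only: the function name, the thresholds, and the `user_facing` flag are assumptions, and real limits should come from your own service objectives.

```python
def alert_action(latency_p95_ms, error_rate, user_facing,
                 latency_limit_ms=500, error_limit=0.02):
    """Decide what to do with a reading of the two top-level metrics.

    Returns "none" when both metrics are within limits, "page" when a
    user-facing service breaches a limit, and "ticket" for internal systems
    that can wait until working hours.
    """
    breached = latency_p95_ms > latency_limit_ms or error_rate > error_limit
    if not breached:
        return "none"
    return "page" if user_facing else "ticket"


healthy = alert_action(120, 0.001, user_facing=True)
site_down = alert_action(120, 0.15, user_facing=True)
reporting_slow = alert_action(4000, 0.0, user_facing=False)
```

The internal reporting system still produces a record of the problem, but nobody is woken up for it.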
With respect to alerting, some of the common questions that need to be answered are: Was the page justified or avoidable? Was there proper documentation? Can something be done to prevent that issue from happening again? After each important incident, review what happened and, whenever possible, create a post-mortem. Include information like the timeline, the root cause, and top-level metrics such as mean time to detect and mean time to recover, and note what went well and what could be improved.
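The two post-mortem metrics can be computed directly from incident timelines. This is a minimal sketch with made-up incidents and hypothetical field names; note that teams differ on whether time to recover is measured from the start of the incident (as here) or from the moment of detection.

```python
from datetime import datetime
from statistics import mean


def mean_minutes(incidents, start_field, end_field):
    """Average the gap between two timestamps across incidents, in minutes."""
    return mean(
        (incident[end_field] - incident[start_field]).total_seconds() / 60
        for incident in incidents
    )


# Made-up incident timelines, for illustration only.
incidents = [
    {"started": datetime(2017, 1, 10, 10, 0),
     "detected": datetime(2017, 1, 10, 10, 5),
     "resolved": datetime(2017, 1, 10, 10, 45)},
    {"started": datetime(2017, 2, 3, 14, 0),
     "detected": datetime(2017, 2, 3, 14, 15),
     "resolved": datetime(2017, 2, 3, 14, 35)},
]

mttd = mean_minutes(incidents, "started", "detected")  # mean time to detect
mttr = mean_minutes(incidents, "started", "resolved")  # mean time to recover
```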
On behalf of the Netsil team and all our readers, our sincere thanks to Nathaniel @ Medium for sharing practical insights on DevOps and SRE life. We look forward to learning more about Nathaniel’s experiences in the upcoming book: “Effective DevOps with AWS”.