I was doing a Velocity Conference flashback the other day and this talk about managing complexity in distributed systems by Astrid Atkinson (from Google) caught my attention. I’m not very much of a “5 great tips” guy, but the points she was making were so spot on, she left me no choice.
Over the past few years I’ve been involved in a number of projects where I had to troubleshoot performance issues in microservices-based systems. Had these practices been in place, my job would have been over before I even started.
So here are 5 things you can do (or avoid doing) that will help you build better distributed systems:
- Always have a tiny HTTP server running that exports a status page
- Each service should be able to look after itself, and
- Each service should be able to look after its connections to others
- Don’t have more than 3-4 services doing the same thing but just a little bit differently
- Managing distributed systems is an infinite game. You’re never really done. So don’t even try to make it perfect. Okay is often good enough.
While most of these things seem like common sense, in practice they are rarely implemented properly, if at all. Here are some specifics.
1. Status page
For any given service, you should be able to go to the http://serv.er/status URL to check the status of the service(s) it provides. It doesn’t have to be much, but here are a few things you could display fairly easily:
- When was the service built
- How long it’s been up
- Is it actually up
- Number of active connections
- How much traffic it’s getting
- Who built it and how to reach the team or a person responsible for it
- Utilization metrics and how much capacity the service has (CPU, disk space etc.)
And you can expand it from there. It’s probably best to export it in JSON format so it’s both human- and machine-readable.
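As a rough illustration, here is a minimal sketch of such a status endpoint using only Python's standard library. The field names, the `BUILD_INFO` values and the `start_status_server` helper are all made up for this example; a real service would fill them in from its build system and its own counters.

```python
import json
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

START_TIME = time.time()

# Hypothetical values: a real build pipeline would stamp these in.
BUILD_INFO = {"built_at": "unknown", "owner": "team@example.com"}

# A very crude traffic counter, guarded by a lock because the server is threaded.
STATS = {"requests_served": 0}
STATS_LOCK = threading.Lock()


def collect_status():
    """Gather the fields exposed on the status page."""
    with STATS_LOCK:
        served = STATS["requests_served"]
    return {
        "up": True,
        "uptime_seconds": round(time.time() - START_TIME, 1),
        "requests_served": served,
        **BUILD_INFO,
    }


class StatusHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/status":
            self.send_error(404)
            return
        with STATS_LOCK:
            STATS["requests_served"] += 1
        body = json.dumps(collect_status()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the status endpoint out of the request log


def start_status_server(port=8080):
    """Serve /status from a background thread and return the server object."""
    server = ThreadingHTTPServer(("127.0.0.1", port), StatusHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Calling `start_status_server()` once at service startup is enough. Because it runs in a daemon thread, it stays out of the way of the main service and goes away with the process.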
2. Self support
Once the service is running in production, you may start noticing some routine tasks you need to do to support it. Be very mindful of them and think carefully about whether you actually need to be doing them by hand. Maybe they can be automated?
Here are some obvious examples:
- log rotation if the logs are stored locally
- self-monitoring and auto-restart when the service goes down (if it’s safe to do so)
- backups if they are not centralised
- controlled auto-updates (and no, “yum -y update” is not what I mean)
You will discover others as time goes by, so every time you do something, ask yourself: do I really need to do what I’m doing manually? Or can the system do it itself next time?
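To make the first example concrete, here is a small sketch of a service rotating its own logs with Python's standard `logging.handlers.RotatingFileHandler`. The logger name, size limit and backup count are arbitrary choices for illustration, not recommendations.

```python
import logging
from logging.handlers import RotatingFileHandler


def make_rotating_logger(path, name="myservice", max_bytes=10 * 1024 * 1024, backups=5):
    """Return a logger that rotates `path` by size and keeps `backups` old files.

    `name` is a hypothetical service name; call this once per logger name,
    since every call attaches another handler.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backups)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

With this in place, the oldest file is dropped automatically once the backup count is reached, so nobody has to clean up the disk by hand.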
3. Looking after connections to others
Most of the time this is ignored completely. Yet it’s one of the main reasons a single service failure causes collateral damage across the entire architecture. Here are the points Astrid made:
- If you talk to another service, have a timeout (and do something sensible when that timeout expires)
- If the server doesn’t get back to you, do an exponential backoff with jitter (retry in 2s, then in 4s, 8s etc.)
- Jitter is important here to avoid lockstep bugs (i.e. retry request arrival should be somewhat random)
- Servers should be able to perform reasonably well in degraded modes – understand which responses are critical for your product and which are not. Drop the responses that are not critical.
With these practices in place, even without proper monitoring, many issues could be resolved much faster as it would be clear early on which service is failing. In fact, due to the exponential backoff, services would be more likely to recover by themselves if they failed due to overload.
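Here is one way the timeout and backoff points could look in Python, as a sketch rather than a definitive implementation. The function names are invented for this example, and the jitter scheme (a random wait between half of and the full 2s, 4s, 8s window) is just one of several common variants, not necessarily the one Astrid had in mind.

```python
import random
import time
import urllib.request


def backoff_delays(base=2.0, cap=60.0, attempts=5, rng=random):
    """Yield waits that grow as base, 2*base, 4*base... (capped), with jitter."""
    for attempt in range(attempts):
        ceiling = min(cap, base * 2 ** attempt)
        # Jitter: pick a random point in the upper half of the window so that
        # clients that failed together do not all retry at the same instant.
        yield rng.uniform(ceiling / 2, ceiling)


def call_with_retries(func, delays, retry_on=(OSError,), sleep=time.sleep):
    """Call func(); on a retryable error, wait for the next delay and retry."""
    delays = iter(delays)
    while True:
        try:
            return func()
        except retry_on:
            delay = next(delays, None)
            if delay is None:
                raise  # out of retries: let the caller fall back or degrade
            sleep(delay)


def fetch(url, timeout=3.0):
    """GET a URL with an explicit timeout instead of waiting forever."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()
```

A call would then look like `call_with_retries(lambda: fetch("http://serv.er/status"), backoff_delays())`. When the retries run out, the exception reaches the caller, which is the place to do "something sensible", such as serving a degraded response.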
4. Controlling the overhead
As your infrastructure and the number of teams grow, you will eventually end up running multiple services that do the same thing but just a little bit differently. At first it may seem fine. It may even seem like you’re going to migrate from the old system to the new one soon.
3 months later the conversation may go like:
– “Hey, we’re building a new image processor and we need some infrastructure for that.”
– “Don’t we have 3 of those image processors already?”
– “Yes, but ours is different. It has special requirements.”
This is the point you should start thinking about how you can actually optimize it – how you can get the teams to work together to unify these services.
I’ve managed a few such service merge projects myself and I know it can be a painful and time-consuming process. But it does pay off in the end. Every single time.
5. Don’t try to make it perfect
Every system is a living organism. Both the system itself and the environment it is in evolve faster than ever, especially in the devops era, where deployments happen tens to hundreds of times per week.
So don’t let yourself think for a moment that you’re building something perfect, something that will last forever, or that at some point you are actually going to be done. Don’t spend too much time buried in the details, making decisions that are likely to have a short lifetime, like where to put the config file or how to structure it.
And rather than getting frustrated every time you realize problems are never going to end, learn to celebrate small victories. Every milestone reached is an opportunity to celebrate. So celebrate!
I believe that providing the best user experience is the key to business success. A big part of that is making online systems fast, stable and reliable. On Speedemy.com, I share best practices and tested methods that I have used over the last 12 years working as an architect of scalable systems and a database performance consultant.