Don’t have time to read the whole article? Here’s the summary:
Full ownership: you build it, you own it
My users (or metrics) will tell me when things are off
Single owner of the project
Canary releases
Simplicity over everything
No time for bullshit meetings
For almost three years I worked for a company that I would call a seasoned startup. The company was 10-ish years old, but a lot of things were organized the way startups are organized. Bus factor of one - if one person disappears, we have a problem. Not exactly a paralyzing problem, since there was usually more than one person who more or less knew the domain, but usually a single person was committing to a particular service.
It also had other consequences, like not having much support from teammates in daily work. I was in the so-called backend team, which over a couple of years slowly and steadily transformed into a core team. We did backend. We did frontend. We did data processing. We did everything; we owned the most crucial pieces of software. Okay, maybe not everything (yet!), but we owned a big piece of the company's software, and it just worked. I owned somewhere around 20 different services from various domains, which means I was the main person responsible for fixing them when something went wrong, sometimes also during weekends or even vacations.
It took years of slow and consistent work to prove that we were able to deliver. I want to describe my experience of working in this team, what I've learned, and how we worked.
Full ownership: you build it, you own it
This is the most important prerequisite for all of it. I had to take full ownership of the services I maintained. I was the only active contributor to the services I owned. Why? Because I will not fix things on a Saturday evening that someone else broke through carelessness. Of course, we had a code review process, so I was not left alone, but it was my responsibility to test the code before deployment, and if something went wrong, it was also my responsibility to fix it. We had no strict SLAs, but I usually fixed any problems within minutes or hours. That was possible thanks to good monitoring.
My users (or metrics) will tell me when things are off
How do you know that your service is not working the way it should? Usually the answer is straightforward: you open the application and check whether it works. In my case it was not that simple. One of the areas I was solely responsible for was crawlers. Crawling the internet is hard and unpredictable. I feel like I've seen it all when it comes to ways of handling errors and failures, and some of them were creative. A response with HTTP code 200 and a carefully hand-crafted error page. Multiple chains of redirects. HTTP 403 instead of 404 on a page which has been removed. Geo blockers of all sorts. Different sorts of anti-bot protections. Even honeypots for crawlers!
In such a volatile environment it's really hard to know quickly whether something is broken, especially in distributed services. Crawlers were processing hundreds of messages per second, so the failure rate was also high at all times. Exceptions were thrown every second. Some of them were expected and recoverable, and some of them were unknown unknowns - the things we hadn't thought of. Known exceptions are fine - we know what is happening and can safely determine, without a huge rush, what's up and fix it relatively slowly. With the unexpected ones, it's a different story. I had a lot of alerts on multiple metrics (a rough sketch of what such checks might look like follows the list), such as:
too many unexpected exceptions during the last hour
no successful crawl globally for the past hour
no successful crawl in one country for the past 4/6/12h (depending on how often we were running crawlers)
avg. number of successful crawls over the past 3 days vs the past 7 days
queue count exceeded the threshold
messages on the queue are older than 4h
queue message count exceeding some threshold
and many more
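As a rough illustration, here is a minimal sketch of what a couple of these checks could look like in Python. The metrics client, metric names, and thresholds are all hypothetical - the point is only that each alert is a simple, explicit rule over a window of data.

```python
from datetime import timedelta

def check_alerts(metrics):
    """Hypothetical metrics client; names and thresholds are illustrative only."""
    alerts = []

    # Too many unexpected exceptions during the last hour.
    unexpected = metrics.count("crawler.unexpected_exception", window=timedelta(hours=1))
    if unexpected > 50:
        alerts.append(f"{unexpected} unexpected exceptions in the last hour")

    # No successful crawl globally for the past hour.
    if metrics.count("crawler.success", window=timedelta(hours=1)) == 0:
        alerts.append("no successful crawl globally in the last hour")

    # Messages on the queue are older than 4 hours.
    oldest_age = metrics.oldest_message_age("crawler.queue")
    if oldest_age > timedelta(hours=4):
        alerts.append(f"oldest queue message is {oldest_age} old")

    return alerts
```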
Some of the alerts were more important than others. After some time working on the project, I had a feel for when something needed immediate attention and when a problem could wait for a day or three, or might even resolve itself. Because sometimes it was totally out of my control - what if the website we're trying to crawl is down? Even worse - what if it's down because we're crawling it a bit too hard? Then the only way to fix the problem is to wait. And since they might not like the fact that we're crawling them, we cannot write an email saying
Hey, we're crawling you and it seems like our crawlers are failing, could you check it and let us know what's up?
There were also times when I made some changes, deployed them to prod, and worried that I might have introduced a bug. And it did happen from time to time. But since I had a lot of alerts, I usually knew right away and was able to fix & deploy the change quickly.
Single owner of the project
I feel like this might be a very underrated thing. It forces people (at least the right people) to shift their thinking. If there is only one contributor to the project, after some time you know the whole codebase by heart. You know all the why's and where's (okay, maybe not all). I'll mention crawlers again. I spent multiple months working almost exclusively on maintaining and improving crawlers. It taught me a lot about the domain and the project. I got to understand the project deeply. I got to experience the consequences of my code and design decisions. I had no one else to blame for bugs other than me. When I implemented a feature without fully understanding the why's behind it, I usually had to re-do the whole thing a month later, since "You know what? In my head it was working differently, could you change it please?". And that reality taught me a lot. I shifted my thinking about features from "what is needed today" to "today they ask for X, they need Y, and it will probably evolve into Z in the future".

Let's look at a simple example. Our crawlers had pretty advanced configuration options, allowing us, for example, to set retries on some exceptions and also to specify which error codes we don't want to retry. When the requirement came in that 'website X is returning 404 (not found) instead of 429 (too many requests), could you please make a config so that we can retry 404s?', I translated it in my head into 'we need a mechanism for mapping an HTTP code to an action'. Whenever possible, I tried to make features as generic as possible. Instead of implementing "a flag to interpret 404 as 429", I implemented "when <HTTP code>, then <action>". And pretty often I was right - usually weeks or months later another requirement came in: "Hey, do you remember that site that was doing 404 on too many requests? We have a new similar website, but they're returning 403s. Help pls". And when I anticipated the future requirements correctly, I usually saved a couple of days of development and testing.
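To make the idea concrete, here is a minimal sketch of what such a generic "when <HTTP code>, then <action>" rule could look like. The class names, field names, and example sites are made up for illustration; the real configuration was more involved.

```python
from dataclasses import dataclass
from enum import Enum

class Action(Enum):
    RETRY = "retry"        # treat the response as transient and retry later
    GIVE_UP = "give_up"    # treat the response as permanent and stop

# Hypothetical rule shape: "when <HTTP code>, then <action>".
@dataclass
class StatusRule:
    status_code: int
    action: Action

# Per-site rules instead of a one-off "treat 404 as 429" flag.
SITE_RULES = {
    "site-x.example": [StatusRule(404, Action.RETRY)],  # 404 really means "too many requests"
    "site-y.example": [StatusRule(403, Action.RETRY)],  # the later, similar requirement
}

def action_for(site: str, status_code: int) -> Action:
    for rule in SITE_RULES.get(site, []):
        if rule.status_code == status_code:
            return rule.action
    return Action.GIVE_UP  # default: don't retry anything unexpected
```

With a shape like this, the later "now it's 403s" request becomes a one-line config change instead of a new feature.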
Canary releases
Crawlers ran in multiple countries. For some countries it was more important than for others that crawling kept working continuously. In other words, we could afford for some countries to be broken, and for others we could not. So for bigger releases I used canary releases. By bigger releases I mean changes that I knew touched multiple areas, so they could break in multiple places. I released to one country first, then waited for a day or three, depending on importance, size, deadlines, and my current capacity. After that period, I contacted all stakeholders and checked the logs, and if everything looked good, I deployed another batch of countries, this time more significant ones. I left the most important and the biggest countries for the very last batch. This way I had multiple checkpoints at which I made sure everything was fine. It also stretched the deployment period over days or even weeks, so the first country deployed had already accumulated a significant amount of data proving that the code worked fine, even before the last batch of the canary hit prod.
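In pseudo-config terms, a rollout looked roughly like this. The country groups, soak times, and helper functions are illustrative, not the real plan:

```python
# Illustrative canary plan: each batch soaks for a while, and logs, alerts,
# and stakeholder feedback are checked before the next batch goes out.
ROLLOUT_PLAN = [
    {"countries": ["small-country"],                  "soak_days": 3},  # canary
    {"countries": ["mid-country-1", "mid-country-2"], "soak_days": 2},
    {"countries": ["big-country-1", "big-country-2"], "soak_days": 0},  # most important last
]

def roll_out(deploy, wait_days, verify):
    """deploy(countries) ships the release; verify() checks logs, alerts, and stakeholders."""
    for batch in ROLLOUT_PLAN:
        deploy(batch["countries"])
        wait_days(batch["soak_days"])
        if not verify():
            raise RuntimeError("stop the rollout and fix the problem before the next batch")
```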
Simplicity over everything
Do you know that feeling when you sit down and try to understand the code, but the number of virtual classes, abstractions, and inheritance makes you jump from file to file and spend hours debugging? I know it very well. Usually the story behind such code at the places I've worked was that there was this brilliant 10x developer. He coded the core of the project over a weekend, then over the next 3 months he built more features on top of that, and then he decided it was time to leave the company. And guess who gets to maintain this little monster? Not the person who wrote it!
When I'm alone in the project, I have a lot of room to do things my way. And my way means simple. I like simple code. If there are too many things going on at once, I cannot be productive, because I'm blocked by the complexity: I have to dive deep into clever functions doing clever things in non-explicit ways just so the code is more compressed. That's not to say I never enjoy a good one-liner. Let's take regex as an example. Whenever possible, I try to avoid regex. I'd rather write 10 lines of code doing step-by-step filtering than 1 line of regex doing everything at once (of course there's more to it, like performance, but it's just a simple example), as in the sketch below. The only reason is that I know it will be hard for me to wrap my head around the regex in 6 months. If it's step by step, it's easier not only to debug but also to read the code.
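As a toy example of what I mean (the parsing task itself is made up), compare a single regex with the step-by-step version:

```python
import re

line = "crawl fr 2024-01-15 status=429 retries=3"

# One-liner: everything at once, hard to re-read six months later.
match = re.match(r"^crawl (\w{2}) (\d{4}-\d{2}-\d{2}) status=(\d{3}) retries=(\d+)$", line)
country, date, status, retries = match.groups()

# Step by step: each line does one obvious thing and is easy to debug.
parts = line.split()
country = parts[1]
date = parts[2]
status = int(parts[3].removeprefix("status="))
retries = int(parts[4].removeprefix("retries="))
```

Both versions do the same work; the second one just makes each step explicit.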
No time for bullshit meetings
This requires a certain kind of bravery: being able to hop into a meeting and ask 'Hey folks, do you need me at this meeting? I have some deadlines to meet and I feel like I will not contribute to this discussion'. Of course, that does not mean I was allowed to skip every meeting; it's just that I had a sense that my time was worth something - after all, I was the only person responsible for delivering my project on time. And I was the one who had to explain why a deadline was missed. And if people were pushing me to do things that I felt were wasting my time and were unjustified by the business requirements, I escalated it, so that the 'business owner' of the project (someone who wanted my feature to be completed) could escalate further and negotiate my time allocation. I was just being clear that doing thing X would delay thing Y, that's it.