Narrative 001: No one has a reliability budget

These days, caring for users more than you already do needs a budget. Otherwise, the question about reliability raised in your last meeting, the outage you just recovered from (and could have prevented), and so on all end up as backlog residents.

So the matter isn't simply whether you care. A lot of people do; they just can't act on it for various reasons. Some aren't even aware. Capacity and awareness have to be considered too.

This gets translated into care for consumers, users, business and continuity.

So, long story short: yes, you need a reliability strategy.

It depends. It always does.

Insight into reliability is valuable, especially when you can act on it before things break.

Where I find this especially relevant:

If you work with:

Containers and Kubernetes
Internal developer platforms
A team large enough to fight over merge conflicts
CI/CD pipelines that break more than they build
[if you're not able to ship your ideas from cradle to grave with ease, add more reasons]

I've generalised that list, but if you're not on it and want to know how reliability applies to your work or business, let's have a coffee!

And so we are gathered here because in the early days (what do I even know) functionality was a huge priority. That remains relevant depending on context, but modern software environments are more complex and expectations are at an all-time high.

So, do any of these sound familiar?

You're running software critical to your business
You have customers, or even just one user who matters
You're on the modern cloud (AWS, GCP, Azure, etc.)
You're building or scaling a team
You're dealing with sensitive data
You're not sure how your system behaves under load or during failure

If yes to any of the above, then it's worth exploring a reliability strategy.

Making this happen depends on a lot of factors and it doesn't have to be complex.

At any scale, reliability is everyone's responsibility, and I believe an investment in discovering what it takes to start or keep it going is absolutely worth it. I also think reliability (SRE) is greatly misunderstood because we have been reading the books by Google while trying to be Google, and when we weren't trying, we didn't experiment enough. Btw, they are great books and what they started is still a game changer. Shout out to them ALWAYS.

So what does a strategy look like?

There’s no one-size-fits-all answer.
Strategies come in different shapes and sizes but what matters is that they’re intentional and grounded in reality.

Here are just a few of the questions a good reliability strategy should help you answer:

What kind of value do you provide? Who’s depending on it and what breaks or who is affected if they don't get what they need?
Where does your software run? How important is location to fulfilling the value you provide? Does it even matter where it runs?
How many people work on and depend on the system, internally and externally? And how do they talk (or collaborate)?
What are your biggest sources of failure today in your software / platform / team / domain or even organization? (Be honest. And if you know and can share, how much does it cost you?) This matters because studying the symptoms carries lessons. Sometimes learning those lessons is what it takes. It's an ocean of possibility.
How do you currently respond to problems?

The answers won’t give you a complete strategy, but they’re a start.
They give shape to the system you’re working with and show you where to focus your efforts. Btw, let’s not forget that a good strategy isn’t just a set of policies or tools; it can involve multiple people and roles, so part of the agenda is to drive a shared understanding.

There’s also no shortage of modern techniques and tools that can support your approach, but remember: tools complement a strategy; they don’t replace or complete one.

TL;DR

If you care about uptime, trust, and delivering value, you need a reliability strategy. At some point, it won't be a choice.
But it can (and should) reflect your scale, maturity, and architectural constraints.

Leadership coaching: A shout out

Some things are easier to do for others than they are to do for yourself. For me, showing up, as in really being present, grounded, and intentional, has always been one of those things. I’m one of those people in the great battle against people pleasing, and if I could “disable that endpoint”, my friend, I would. I’m not here pretending I’ve cracked the code, as life is not that simple. But I do want to give a shoutout to Sarah Gruneisen 🐉 and her inspiring leadership coaching. I first heard Sarah speak at last year’s SREday Amsterdam (presentation photo below), and since then, I literally couldn’t stop listening! I’m truly honored to be a part of the Avagasso Leadership Landing program and honestly, it’s been a breath of fresh air because Sarah has a way of holding space that makes you feel seen, heard, and challenged. Halfway into the program, I’m learning how to lead in a way that feels more like me: with clarity, initiative, structure and tonnes of inspiration. M...

Reliability Narratives