These days, caring for users more than you already do needs a budget. Otherwise, the reliability question raised in your last meeting, the outage you just recovered from and could have prevented, and everything like them all end up as backlog residents.
So the matter isn't simply whether you care. A lot of people do; they just can't act on it for various reasons. Some aren't even aware. Capacity and awareness have to be considered too.
That care translates into care for consumers, users, the business, and continuity.
So, long story short: yes, you need a reliability strategy.
It depends. It always does.
Insight into reliability is valuable, especially when you can act on it before things break.
Where I find this especially relevant:
If you work with:
- Containers and Kubernetes
- Internal developer platforms
- A team large enough to fight over merge conflicts
- CI/CD pipelines that break more than they build
[if you're not able to ship your ideas from cradle to grave with ease, add more reasons]
So, do any of these sound familiar?
- You're running software critical to your business
- You have customers, or even just one user who matters
- You're on the modern cloud (AWS, GCP, Azure, etc.)
- You're building or scaling a team
- You're dealing with sensitive data
- You're not sure how your system behaves under load or during failure
If yes to any of the above, then it's worth exploring a reliability strategy.
At any scale, reliability is everyone's responsibility, and I believe an investment in discovering what it takes to start or keep it going is absolutely worth it. I also think reliability (SRE) is greatly misunderstood, because we have been reading the books by Google while trying to be Google, and when we weren't trying, we didn't experiment enough. Btw, they are great books, and what they started is still a game changer. Shout out to them ALWAYS.
Making this happen depends on a lot of factors, but it doesn't have to be complex.
So what does a strategy look like?
There’s no one-size-fits-all answer.
Strategies come in different shapes and sizes, but what matters is that they’re intentional and grounded in reality.
Here are just a few of the questions a good reliability strategy should help you answer:
- What kind of value do you provide? Who’s depending on it, and what breaks or who is affected if they don't get what they need?
- Where does your software run? How important is location to fulfilling the value you provide? Does it even matter where it runs?
- How many people work on and depend on the system, internally and externally? And how do they talk (or collaborate)?
- What are your biggest sources of failure today in your software, platform, team, domain, or even organization? (Be honest. And if you know and can share, how much does it cost you?) This matters because studying the symptoms carries lessons. Sometimes learning those lessons is what it takes. It's an ocean of possibility.
- How do you currently respond to problems?
The answers won’t give you a complete strategy, but they’re a start.
They give shape to the system you’re working with and show you where to focus your efforts. Btw, let’s not forget that a good strategy isn’t just a set of policies or tools; it can involve multiple people and roles, so part of the agenda is to drive a shared understanding.
There’s also no shortage of modern techniques and tools that can support your approach, but remember: tools complement a strategy; they don’t replace or complete one.
TL;DR
If you care about uptime, trust, and delivering value, you need one. At some point, it won't be a choice.
But it can (and should) reflect your scale, maturity, and architectural constraints.