Narrative #002: Beginning SRE

Speed and Quality

These are what a lot of us want and need but then there's scale in any relevant dimension...

So which one comes first? Can we ever achieve both? Once you start thinking about measuring the reliability of your code, this will most likely be the playing field: how to have both or, in reality, how to balance them. We have always put speed first when experimenting, but a lot of us never really refactor or move on from MVPs, and who can blame us? Things move quickly and we never truly understand what we're getting ourselves into during that phase. I hope you get to experience moving on from them, especially before AI takes over (completely); it's such a joyous learning experience.

It may be overkill, and maybe not a good financial move, to figure out everything you need to succeed in building software. What can really help you stand out is working within a culture that keeps you in balance, and I believe that's where SRE comes in.

What you end up with is a well-equipped feedback loop that is flexible enough to change with your goals or business objectives, a workflow that is informed by today's evolving technology and methodologies, a confidence and productivity enabler, and a primary way to deal with operational challenges. SRE implements DevOps.

If this sounds familiar, that's because it is. SRE is similar to DevOps in a lot of ways, only that it has a unique way of prioritising business objectives. DevOps ushered us into efficient software delivery and operations. It has helped us achieve our goals: speed and quality. SRE comes in with the rise of complexity. When we decided we wanted everyone to use our software, we committed to ensuring everyone enjoys it.

DevOps and SRE are both cultural practices that:

1. Reduce organisational silos

2. Require and aid us in measuring everything

3. Leverage tooling and automation

4. Re-define and embrace failure.

So then, how do you go about implementing SRE?

There are a number of proven ways to implement SRE, but before hopping on any of those trains, you have to understand that there is no static definition or recommended approach. It is very contextual and, I believe, the easiest way is to borrow from how you already practise DevOps. Like it or not, your SRE and DevOps practices will interface, so it's better to ensure there is a clear way to integrate them or maybe even merge them. If you are unsure about the relevance of either and want to reap value from both, keep the lights on in both teams or cultures.

Organisations come in various shapes and sizes, some need both, some don't; it depends (as always) on your context and by all means, start small.

Some things to expect along this journey include:

1. Understanding existing release processes and ensuring they suit your business objectives and socio-technical dynamics.
There is a lot that can be defined here, but it's not worth going into too much detail without a working example; we shall cover this later. The simplest way of picturing a good example is implementing small, gradual changes that reduce your mean time to recover (MTTR).
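To make MTTR a little more concrete, here is a minimal sketch of how it could be computed from incident records. The `incidents` list and the `mean_time_to_recover` helper are made up for illustration; in practice the timestamps would come from whatever incident tooling you already use.

```python
from datetime import datetime, timedelta

# Hypothetical incident records: (detected_at, recovered_at).
# These timestamps are illustrative, not real data.
incidents = [
    (datetime(2024, 1, 3, 10, 0), datetime(2024, 1, 3, 10, 45)),     # 45 min
    (datetime(2024, 1, 17, 22, 15), datetime(2024, 1, 17, 23, 45)),  # 90 min
    (datetime(2024, 2, 2, 8, 30), datetime(2024, 2, 2, 8, 45)),      # 15 min
]


def mean_time_to_recover(records):
    """Average time between detecting an incident and recovering from it."""
    if not records:
        raise ValueError("need at least one incident to compute MTTR")
    total = sum((end - start for start, end in records), timedelta())
    return total / len(records)


mttr = mean_time_to_recover(incidents)
print(f"MTTR over {len(incidents)} incidents: {mttr}")
```

Tracking this number before and after a change to your release process is one simple way to tell whether small, gradual changes are actually paying off.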

2. Measuring reliability: SLOs, error budgets & blameless postmortems

Reliability is the core idea behind SRE, and what makes it actionable is meaningful measurement. The two go hand in hand: reliable systems need metrics that are as accurate as possible to manage and improve their reliability. Without meaningful, accurate and actionable metrics, it would be impossible to ensure or improve the reliability of a system.

To foster reliability, SLOs, SLAs, SLIs, error budgets and blameless postmortems are valuable tools that need to be leveraged. I believe a lot of teams that struggle with these tools fail to accommodate their requirements and can't get past the struggle of defining them. Honestly, for very genuine reasons.
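If the definitions feel abstract, a small worked example may help. The sketch below assumes a request-based availability SLI (good requests divided by total requests) and derives the error budget from an SLO target. The function name, the 99.9% target and the request counts are all assumptions for illustration, not a recommendation.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Return (sli, allowed_failures, remaining_budget_fraction).

    slo_target is a fraction such as 0.999 (i.e. 99.9%).
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")

    # SLI: the share of requests that were served successfully.
    sli = (total_requests - failed_requests) / total_requests
    # Error budget: how many failures the SLO tolerates in this window.
    allowed_failures = (1 - slo_target) * total_requests
    # Fraction of the budget still unspent; negative means the SLO is blown.
    remaining = 1 - failed_requests / allowed_failures
    return sli, allowed_failures, remaining


# Assumed numbers: a 99.9% SLO over a window with 1,000,000 requests, 400 failed.
sli, allowed, remaining = error_budget(0.999, 1_000_000, 400)
print(f"SLI: {sli:.4%}")
print(f"Allowed failures: {allowed:.0f}")
print(f"Budget remaining: {remaining:.0%}")
```

With these assumed numbers, roughly 1,000 failures are tolerated and 400 have been spent, so about 60% of the budget is left. How a team reacts when that number approaches zero (slowing releases, prioritising reliability work) is exactly the kind of decision that needs stakeholder alignment.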

It's worth noting that practising resilience as part of your reliability strategy makes your reliability more robust, but it's expensive and honestly more relevant for mature systems and platforms that already practise reliability, because the measurements only increase and failure is tolerable. There's a flip side though: high-stakes environments in sectors such as finance and healthcare could benefit from early investment in resilience, because the cost of failure outweighs even a high initial cost.

3. Figure out who the necessary stakeholders are and establish alignment: A number of aspects of your development process(es), product processes and operational procedures will need to change, and change is hard, no doubt. Stakeholder alignment deserves a notable mention because that is where the power to adjust velocity and focus across toil, maintenance work and feature work lies. One of the greatest adversaries remains speed against quality, and part of maintaining balance is knowing when to slow down and when to regard or disregard failure. This is better done in alignment, with stakeholders who can enforce it across the relevant architectural planes.

In the case of siloed operations and infrastructure teams, a typical approach would involve defining reliability requirements that can be acted on by those teams with oversight from the SREs and, better yet, in a collaborative manner. This helps with the various learning curves to be expected in the journey. If you're aiming for quick wins or perhaps even shortcuts to value, which are achievable and even advisable as long as you remember this is a marathon, this is worth considering. If your SREs don't have the relevant experience, that's even more reason to consider it. The job then becomes communication. This is dissolving the silos in practice while maintaining the relevance of architectural boundaries and ownership.

4. Understanding existing tooling and automation to measure and manage toil.

Operational toil is a greater adversary today with how bloated the ecosystem has become. It's worth mentioning that modern approaches such as platform engineering can potentially help. In that particular case, SRE can be implemented as part of a platform team or SRE ends up implementing platform engineering.

Let's break that down a little: The very basic definition of a platform team involves n engineers with relevant operational and infrastructure experience, implementing a moderate- to large-scale internal developer platform for benefits such as standardisation, confidence and scale, and with requirements such as shorter time to market / value. If that sounds ideal, it's worth exploring the domain. It was never implied that platform teams should strictly consist of operational or infrastructure skills. In fact, those two are broad enough to warrant specialisation, which brings in quality engineers, SREs, architects, etc. Once more, it depends.

Back to the point... we can now see how SRE can be implemented as part of a platform team. On the other end, SRE implementing platform engineering means that platform engineering has more relevance as a cultural aspect, or doesn't make sense separate from the SRE team. This is typical in smaller organisations, where the SRE team or person is also responsible for the platform(s).

I honestly believe it's rare, and pretty hard, to benefit from either of these two domains with this type of implementation. Both are so important; I guess it depends on what matters to the business.
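Whichever shape the team takes, the "measure" part of measuring and managing toil can start very small. Here is a minimal sketch that assumes you log hours against a few work categories; the `work_log` entries, the category names and the 50% `TOIL_BUDGET` threshold are assumptions for illustration, so pick a threshold that fits your own context.

```python
from collections import defaultdict

# Assumed threshold: the share of time you are willing to spend on toil.
TOIL_BUDGET = 0.5

# Hypothetical work log for one sprint: (category, hours).
work_log = [
    ("toil", 6.0),
    ("feature", 10.0),
    ("toil", 4.0),
    ("maintenance", 5.0),
    ("feature", 15.0),
]


def hours_by_category(log):
    """Sum the logged hours per work category."""
    totals = defaultdict(float)
    for category, hours in log:
        totals[category] += hours
    return dict(totals)


def toil_share(log):
    """Fraction of all logged hours that went to toil."""
    totals = hours_by_category(log)
    all_hours = sum(totals.values())
    if all_hours == 0:
        raise ValueError("work log contains no hours")
    return totals.get("toil", 0.0) / all_hours


share = toil_share(work_log)
status = "over" if share > TOIL_BUDGET else "within"
print(f"Toil share: {share:.0%} ({status} the assumed budget of {TOIL_BUDGET:.0%})")
```

The number itself matters less than the trend: if the toil share keeps climbing sprint after sprint, that is a signal to invest in automation or platform work.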

5. Identify, understand and leverage ownership boundaries:

SRE ideally realises shared production responsibilities. This requires nurtured feedback loops, depending on how you decide to implement SRE and of course, the size of your organisation.

- More access and privilege on the platform level from an application ownership and not platform ownership perspective. This helps enforce the value of having a platform team. In a lot of enterprise-level organisations this is common, and there are a lot of tools that facilitate it, e.g. ServiceNow. Honestly, failure is more catastrophic the lower you go in the stack, so the strict boundaries serve a genuine purpose. Should these teams be more flexible? Absolutely. We simply haven't evolved enough to make it easier at the lower levels, our greatest limitation as a species.

I believe it's necessary to establish and nurture good relationships with infrastructure providers, internally and externally. This relationship is also reflected in the SLOs and SLAs defined at the platform and application level. It doesn't matter how green your application or platform dashboards look: if your infrastructure is unreliable, so are you.
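A quick bit of arithmetic shows why. The sketch below assumes the layers are hard dependencies in series and fail independently, which is a simplification; under that assumption the availability a user sees is the product of the layers' availabilities. The layer names and figures are made up for illustration.

```python
from math import prod


def serial_availability(availabilities):
    """Combined availability of hard dependencies in series.

    Assumes independent failures, which real systems only approximate.
    """
    return prod(availabilities)


# Hypothetical availability per layer, as fractions.
layers = {
    "infrastructure": 0.995,
    "platform": 0.999,
    "application": 0.9995,
}

combined = serial_availability(layers.values())
weakest = min(layers, key=layers.get)
print(f"Combined availability: {combined:.4%}")
print(f"Weakest layer: {weakest} at {layers[weakest]:.2%}")
```

Under these assumptions the combined figure can never be better than the weakest layer, so an application SLO that promises more than the infrastructure beneath it can deliver is not a meaningful promise.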

Knowing all the tools SRE presents is one thing and applying them is another. This is where contextual relevance comes into play and is the hardest part of your journey.

At the very least, I hope this gives you an idea of how you can think about SRE and take a shot at implementing it. The best advice you'll get from me is: "Start small, take your time, track your progress and involve the right people. Figure out what the smallest thing is, find the right people and keep it accountable."
