
Narrative #002: Beginning SRE

 Speed and Quality

These are what a lot of us want and need but then there's scale in any relevant dimension...

So which one comes first? Can we ever achieve both? Once you start thinking about measuring the reliability of your code, this will most likely be the playing field: how to have both or, in reality, how to balance them. We have always put speed first when experimenting, but a lot of us never really refactor or move on from MVPs, and who can blame us? Things move quickly and we never truly understand what we're getting ourselves into during that phase. I hope you get to experience moving on from them, especially before AI takes over (completely); it's such a joyous learning experience.

It may be overkill, and maybe not a good financial move, to figure out everything you need to succeed in building software. What can really help you stand out is working within a culture that keeps you in balance, and I believe that's where SRE comes in.

What you end up with is a well-equipped feedback loop that is flexible enough to change with your goals or business objectives, a workflow that is well informed by today's evolving technology and methodologies, a confidence and productivity enabler, and a primary way to deal with operational challenges. SRE implements DevOps.

If this already sounds familiar, that's because it is. SRE is similar to DevOps in a lot of ways, except that it has a unique way of prioritising business objectives. DevOps ushered us into efficient software delivery and operations and has helped us achieve our goals: speed and quality. SRE comes in with the rise of complexity. When we decided we wanted everyone to use our software, we committed to ensuring everyone enjoys it.

DevOps and SRE are both cultural practices that:

1. Reduce organisational silos

2. Require and aid us in measuring everything

3. Leverage tooling and automation

4. Re-define and embrace failure.


So then, how do you go about implementing SRE? 

There are a number of proven ways to implement SRE, but before hopping on any of those trains, you have to understand that there is no static definition or recommended approach. It is very contextual, and I believe the easiest way is to borrow from how you already practise DevOps. Like it or not, your SRE and DevOps will interface, so it's better to ensure there is a clear way to integrate or maybe even merge them. If you are unsure about the relevance of either and still want to reap value from both, it is necessary to keep the lights on in both teams or cultures.

Organisations come in various shapes and sizes: some need both, some don't. It depends (as always) on your context, and by all means, start small.

Some things to expect along this journey include:

1. Understanding existing release processes and ensuring they suit your business objectives and sociotechnical dynamics.
There is a lot that can be defined here, but it's not worth going into too much detail without a working example; we shall cover this later. The simplest way of picturing a good example is implementing small, gradual changes that reduce your mean time to recover (MTTR).
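To make MTTR a little more concrete, here is a minimal sketch of how it could be tracked. The incident timestamps and the function name `mean_time_to_recover` are made up for illustration; in practice the data would come from whatever incident tooling you already have.

```python
from datetime import datetime, timedelta


def mean_time_to_recover(incidents):
    """Average time between detection and recovery.

    `incidents` is a list of (detected_at, recovered_at) datetime pairs.
    """
    if not incidents:
        raise ValueError("need at least one incident to compute MTTR")
    total = sum(
        (recovered - detected for detected, recovered in incidents),
        timedelta(),  # start value so sum() adds timedeltas, not ints
    )
    return total / len(incidents)


# Hypothetical incidents, not real data: (detected_at, recovered_at)
incidents = [
    (datetime(2024, 1, 3, 10, 0), datetime(2024, 1, 3, 10, 45)),   # 45 min
    (datetime(2024, 1, 9, 14, 0), datetime(2024, 1, 9, 14, 15)),   # 15 min
    (datetime(2024, 1, 20, 8, 30), datetime(2024, 1, 20, 9, 30)),  # 60 min
]

print(mean_time_to_recover(incidents))  # 0:40:00
```

Tracking this number before and after a release-process change is one simple way to see whether small, gradual changes are actually helping.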

2. Measuring reliability: SLOs, error budgets & blameless postmortems

Reliability is the core idea behind SRE, and what makes it workable is meaningful measurement. The two go hand in hand: reliable systems need metrics that are as accurate as possible to manage and improve their reliability. Without meaningful, accurate and actionable metrics, it would be impossible to ensure or improve the reliability of a system.

To foster reliability, SLOs, SLAs, SLIs, error budgets and blameless postmortems are valuable tools that need to be leveraged. I believe a lot of teams that struggle with these tools fail to accommodate their requirements and can't get past the struggle of defining them. Honestly, for very genuine reasons.
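If defining these feels abstract, a small worked example may help. The sketch below assumes a request-based availability SLI (good requests divided by total requests) and a 99.9% SLO; the numbers and the function name `error_budget` are illustrative assumptions, not a recommendation for your system.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Summarise a request-based SLI against an SLO target.

    slo_target is a fraction, e.g. 0.999 for "99.9% of requests succeed".
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    sli = 1 - failed_requests / total_requests          # measured reliability
    allowed_failures = (1 - slo_target) * total_requests  # the error budget
    return {
        "sli": sli,
        "allowed_failures": allowed_failures,
        "remaining_failures": allowed_failures - failed_requests,
        "budget_consumed": failed_requests / allowed_failures,
        "slo_met": sli >= slo_target,
    }


# Hypothetical month: 1,000,000 requests, 400 of them failed.
report = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
print(f"SLI: {report['sli']:.4%}")
print(f"Budget consumed: {report['budget_consumed']:.0%}")
print(f"SLO met: {report['slo_met']}")
```

With these made-up numbers the budget allows roughly 1,000 failed requests, so about 40% of it is spent. What to do when the budget runs low (slow down releases, prioritise reliability work) is exactly the kind of thing to agree on with stakeholders.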

It's worth noting that practising resilience as part of your reliability strategy makes your reliability more robust, but really, it's expensive and honestly more relevant for mature systems and platforms that already practise reliability, because the measurements only increase and failure is tolerable. There's a flip side though: high-stakes environments in sectors such as finance and healthcare could benefit from early investment in resilience, because the cost of failure is greater, even if the initial cost is high.

3. Figure out who the necessary stakeholders are and establish alignment: A number of aspects of your development process(es), product processes and operational procedures will need to change, and change is hard, no doubt. A notable mention is stakeholder alignment, which is where the power lies to adjust velocity and focus between toil and maintenance work on one side and feature work on the other. One of the greatest adversaries remains speed against quality, and part of maintaining balance is knowing when to slow down and when to regard or disregard failure. This is better done in alignment, with stakeholders who can enforce it across the architectural planes concerned.

In the case of siloed operations and infrastructure teams, a typical approach would involve defining reliability requirements that those teams can act on, with oversight from the SREs and, better yet, in a collaborative manner. This helps with the various learning curves to be expected on the journey. If you're aiming for quick wins or perhaps even shortcuts to value, which can logically be achieved (and recommended) without disregarding that this is a marathon, this is a worthy consideration. If your SREs don't have the relevant experience, that's even more reason to consider it. The job then becomes communication. This is dissolving the silos in practice while maintaining the relevance of architectural boundaries and ownership.

4. Understanding existing tooling and automation to measure and manage toil.

Operational toil is a greater adversary today with how bloated the ecosystem has become. It's worth mentioning that modern approaches such as platform engineering can potentially help. In that particular case, SRE can be implemented as part of a platform team or SRE ends up implementing platform engineering.
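Before reaching for a platform, it helps to know how much toil you actually have. Here is a rough sketch, assuming the team logs hours against work categories; the category labels, the log entries and the function name `toil_fraction` are all hypothetical, and what counts as toil is something your team has to decide.

```python
from collections import defaultdict

# Hypothetical labels for work the team has agreed to call toil.
TOIL_CATEGORIES = {"manual_deploy", "ticket_triage", "cert_rotation"}


def toil_fraction(work_log, toil_categories=TOIL_CATEGORIES):
    """Return the share of logged hours spent on toil (0.0 to 1.0).

    `work_log` is an iterable of (category, hours) pairs.
    """
    hours_by_category = defaultdict(float)
    for category, hours in work_log:
        hours_by_category[category] += hours

    total = sum(hours_by_category.values())
    if total == 0:
        return 0.0  # nothing logged, so nothing to report
    toil = sum(h for c, h in hours_by_category.items() if c in toil_categories)
    return toil / total


# Made-up week for one engineer.
work_log = [
    ("manual_deploy", 6),
    ("feature_work", 20),
    ("ticket_triage", 4),
    ("automation", 10),
]

print(f"Toil: {toil_fraction(work_log):.0%}")  # Toil: 25%
```

Watching this share over time tells you whether your tooling and automation work is paying off, and gives you a number to bring to the stakeholder conversation.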

Let's break that down a little: the most basic definition of a platform team involves a number of engineers with relevant operational and infrastructure experience, implementing a moderate- to large-scale internal developer platform for benefits such as standardisation, confidence and scale, with requirements such as shorter time to market / value. If that sounds ideal, it's worth exploring the domain. It was never implied that platform teams should strictly consist of operational or infrastructure skills. In fact, those two are broad enough to warrant specialisation, which brings in quality engineers, SREs, architects, etc. Once more, it depends.

Back to the point... we can now see how SRE can be implemented as part of a platform team. On the other end, SRE implementing platform engineering means platform engineering has more relevance as a cultural aspect, or doesn't make sense separate from the SRE team, for example. This is typical in smaller organisations, where the SRE team or person is also responsible for the platform(s).

I honestly believe it's rare, and also pretty hard, to benefit from either of these two domains with this type of implementation. Both are so important; I guess it depends on what's important to a business.

5. Identify, understand and leverage ownership boundaries:

SRE ideally realises shared production responsibilities. This requires nurtured feedback loops, depending on how you decide to implement SRE and of course, the size of your organisation.

- More access and privilege on the platform level from an application ownership and not platform ownership perspective. This helps enforce the value of having a platform team. In a lot of enterprise-level organisations this is common, and there are a lot of tools that facilitate it, e.g. ServiceNow. Honestly, failure is more catastrophic the lower you go in the stack, so the strict boundaries serve a genuine purpose. Should these teams be more flexible? Absolutely. We simply haven't evolved enough to make it easier on lower levels, our greatest limitation as a species.

I believe it's necessary to establish and nurture good relationships with infrastructure providers, internally and externally. This relationship is also reflected in the SLOs and SLAs defined on a platform and application level. It doesn't matter how green your application or platform dashboards look: if your infrastructure is unreliable, so are you.
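A quick bit of arithmetic shows why. The sketch below assumes a strictly serial chain (the application needs the platform, which needs the infrastructure) and that failures are independent; both are simplifications, and the layer names and availability figures are made up.

```python
import math


def composite_availability(availabilities):
    """Availability of a serial dependency chain, assuming independent failures.

    Every layer must be up for the whole to be up, so the fractions multiply.
    """
    return math.prod(availabilities)


# Hypothetical figures for each layer of the stack.
layers = {
    "application": 0.999,
    "platform": 0.9995,
    "infrastructure": 0.995,
}

overall = composite_availability(layers.values())
print(f"What users experience: {overall:.3%}")  # roughly 99.351%
```

Even though the application alone looks like "three nines", what users experience is capped by the weakest layer underneath, which is why the SLAs you get from infrastructure providers belong in the same conversation as your own SLOs.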


Knowing all the tools SRE presents is one thing; applying them is another. This is where contextual relevance comes into play, and it is the hardest part of your journey.

At the very least, I hope this gives you an idea of how you can think about SRE and take a shot at implementing it. The best advice you'll get from me is: "Start small, take your time, track your progress and involve the right people. Figure out what the smallest thing is, find the right people and keep it accountable."
