SREs/ SROs - Distributed and centralized teams

Site Reliability Operations - SRO

SROs support Site Reliability Engineers in delivering business outcomes by continuously observing the digital services that run on applications requiring rapid development, testing, and release to production.

SROs provide SRE teams with the ability to:

Manage their on-call schedules and escalations
Correlate telemetry signals from monitoring tools
Manage alert rules for telemetry signals

SROs provide Central IT Operations with:

Central visibility of services registered by SRE Teams
Visibility into the work of distributed teams without slowing them down

Site Reliability Engineering - SRE

SRE is often described as “when we treat operations as if it’s a software problem”: a set of practices that incorporates aspects of software engineering into operations, thereby increasing the efficiency and reliability of software systems and improving workflow.

Site Reliability Engineers follow these practices to bring aspects of software engineering into operations.

In simple terms, a site reliability engineer (SRE) bridges development and IT/SRO operations by taking on tasks typically done by operations. These engineers use automation tools to solve problems and to build scalable, reliable software systems.

Standardization and automation are at the heart of what an SRE does, especially as systems migrate to the cloud. SREs therefore often have a background in software engineering, systems engineering, or system administration, along with IT operations experience.

SREs should focus on CI/CD, improving the stability of applications and their underlying infrastructure, issue resolution, and cross-team collaboration.

Automation

SREs build automation tools to manage IT operations, aiming to automate functions rather than perform them manually. Such functions include:

Continuous integration and continuous delivery
Monitoring
Incident response and remediation

Effective Monitoring

SREs are responsible for ensuring that the underlying infrastructure runs smoothly and that systems and tools work as expected.
They also monitor critical applications and services to minimize downtime and ensure availability.

Issue resolution

SREs work closely with developers, especially when issues arise: they help with troubleshooting and provide consultation when alerts fire.
When a developer runs into a problem, the SRE investigates and resolves the issue.
After the incident is resolved, the SRE revisits it to determine the cause and ensure it doesn’t happen again.

Cross-team collaboration

As the sections above show, SREs work across different teams, mainly operations and development. By building reliable systems and supporting these teams, SREs give them more time to build new features and get them out to customers faster.

Now that we understand SREs/SROs and their focus areas, how can they define benchmarks for issue resolution and CI/CD implementations? Here, SLIs and SLOs play a vital role in helping SREs define the uptime and availability of the system.

Defining SLOs, SLIs, and SLAs

Service Level Objectives: SLOs are key threshold values for each Service Level Indicator that quantify the availability and quality of service. They specify a target level for the reliability of our service and performance goals. Because SLOs are key to making data-driven decisions about reliability, they’re at the core of SRE practices.

Service Level Indicators:

SLIs are measurements of the characteristics of a service or product. SLIs directly gauge the behaviors that have the greatest impact on the customer experience. The most common SLIs, known as the Four Golden Signals, are:

Latency
Traffic
Error rate
Saturation

Other variations are USE (Utilization, Saturation, and Errors) and RED (Rate, Errors, and Duration).

The formula used to calculate an SLI is: SLI = Good Events * 100 / Valid Events

If the SLI is 100%, the system is performing ideally; if it drops to 0%, the system is broken.
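The formula above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the function name `sli_percent` and the event counts are hypothetical, and deciding what counts as a "good" or "valid" event is left to the service owner.

```python
def sli_percent(good_events: int, valid_events: int) -> float:
    """Return the SLI as a percentage: good events * 100 / valid events."""
    if valid_events <= 0:
        raise ValueError("valid_events must be positive")
    if not 0 <= good_events <= valid_events:
        raise ValueError("good_events must be between 0 and valid_events")
    return good_events * 100 / valid_events


# Hypothetical counts: 9,990 successful requests out of 10,000 valid ones.
print(sli_percent(9_990, 10_000))  # 99.9
```

Rejecting a zero denominator explicitly avoids reporting a misleading SLI for a window in which no valid events were recorded.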

An SLI is product (service) centric: it always revolves around measuring the capabilities or characteristics of a product or service.

Service Level Agreements:

An SLA normally involves a promise to the users of our service that its availability will meet a certain level over a certain period; if it fails to do so, some kind of penalty will be paid. SREs are not directly involved with SLAs. However, anything that affects the SLOs also affects the SLAs.

Drafting system performance and Reliability for SREs

As a product or service organization, we should start by defining key performance indicators that measure our product's performance; these form our SLIs. Remember that they are a direct measure of our system’s behavior at every stage of our business.

Secondly, we have to set availability targets for these indicators, which form our SLOs. This is a completely data-driven phase: we gather data from customer queries and stakeholders' expectations, extract insights, and finalize the target/threshold values to achieve better reliability.

As the final step, we create our SLA. Here we list the reliability values and help customers understand our product's capabilities.

Thus, SLIs are the foundational blocks that help in building SLOs, which in turn support the overall reliability stated in our SLA.

Choosing the Right SLOs & Target Values

Choosing the target values:

Customer experience plays an important part in deciding key SRE metrics. SLOs are the focus points in deciding the assured system reliability the company would offer to the end-users.

The target value of a service level is always measured by an SLI.

SLIs and SLOs are closely linked, and this link governs how we measure and monitor the entire system architecture. A natural structure for an SLO is therefore SLI ≤ target, or lower bound ≤ SLI ≤ upper bound:

Lower bound SLO ≤ SLI ≤ Upper bound SLO
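The bound structure above can be expressed as a small check. The sketch below is illustrative only: the function name `meets_slo`, its parameters, and the sample values are assumptions, not part of any monitoring tool's API. Either bound may be omitted, which covers both the one-sided form (SLI ≤ target) and the two-sided form.

```python
from typing import Optional


def meets_slo(sli: float,
              lower: Optional[float] = None,
              upper: Optional[float] = None) -> bool:
    """Return True when lower <= sli <= upper; a missing bound is not checked."""
    if lower is not None and sli < lower:
        return False
    if upper is not None and sli > upper:
        return False
    return True


# Hypothetical availability SLI (in %) checked against a lower bound.
print(meets_slo(99.93, lower=99.9))   # True

# Hypothetical latency SLI (in ms) checked against an upper bound.
print(meets_slo(120.0, upper=100.0))  # False
```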

Choosing the right SLOs:

Never choose targets/SLOs based on the current performance of our systems; base them on our historical performance.
Keep it simple: don't specify absolute target values as SLOs.
Don't aim for over-achievement or perfection; reliability cannot be 100%.
Always keep a safety margin in SLOs, for example by using the historical average of our availability as a reference.
Choose only as many SLOs as needed to cover the attributes of the system, which means having only a few SLOs.
While drafting an SLA, always remember: reliability values in the SLA < historical average of our availability SLOs.

Best Practices for Defining SLOs

SLOs should specify how they’re measured and the conditions under which they’re valid. SLOs can never be 100%, but we can specify the time limits within which we can assure reliability.

For example, we can specify SLO targets along the performance curve as:

99.9% of tasks will complete in less than 100 ms.
99.5% of tasks will complete in less than 50 ms.
99% of tasks will complete in less than 10 ms.
90% of tasks will complete in less than 1 ms.

As a best practice, always start with lower targets and raise them as the stability and reliability of the systems improve.
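As a rough sketch, the latency targets listed above can be checked against a set of measured task durations. The target pairs are taken from the list; the function names and the sample data are hypothetical and chosen only to illustrate the idea.

```python
from typing import Dict, List, Sequence, Tuple

# (required percentage of tasks, latency threshold in ms), from the list above.
LATENCY_TARGETS: List[Tuple[float, float]] = [
    (99.9, 100.0),
    (99.5, 50.0),
    (99.0, 10.0),
    (90.0, 1.0),
]


def percent_within(latencies_ms: Sequence[float], threshold_ms: float) -> float:
    """Percentage of samples that finished in strictly less than threshold_ms."""
    if not latencies_ms:
        raise ValueError("at least one latency sample is required")
    within = sum(1 for value in latencies_ms if value < threshold_ms)
    return within * 100 / len(latencies_ms)


def evaluate_targets(
    latencies_ms: Sequence[float],
    targets: Sequence[Tuple[float, float]] = LATENCY_TARGETS,
) -> Dict[Tuple[float, float], bool]:
    """Map each (percentage, threshold) target to whether the samples meet it."""
    return {
        (percent, threshold): percent_within(latencies_ms, threshold) >= percent
        for percent, threshold in targets
    }


# Hypothetical sample of 1,000 task durations in milliseconds.
sample = [0.5] * 900 + [5.0] * 90 + [20.0] * 5 + [70.0] * 4 + [150.0]

for (percent, threshold), ok in evaluate_targets(sample).items():
    status = "met" if ok else "missed"
    print(f"{percent}% under {threshold} ms: {status}")
```

Evaluating several percentiles at once guards against a long tail that a single average would hide.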

We have now defined our SLIs and SLOs to keep customers happy. But how do we keep the focus on CI/CD? This is where error budgets come in handy: an error budget is the rate at which SLOs can be missed. It provides a clear, objective metric for how unreliable a service is allowed to be over a specific period, and it helps establish a balance between reliability and innovation.

Error Budgeting

An error budget is the tool SREs use to balance service reliability with the pace of innovation. It is the amount of error that our service can accumulate over a certain period before our users start being unhappy. An SLI is expressed as a percentage, and the objectives derived from SLIs are the SLOs. The error budget is what remains once the SLO is subtracted from 100%.

The formula for the error budget is: Error budget = [100 - Internal Availability SLO] (in %)

For example, if the internal availability SLO is 99.95%, the corresponding error budget is (100 - 99.95) = 0.05%. That is, we can tolerate an error rate of up to 0.05%.
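The error budget formula can be sketched as follows, together with a conversion of the budget into allowed downtime. The 30-day window and the function names are assumptions made for illustration; the source only gives the percentage formula.

```python
def error_budget_percent(slo_percent: float) -> float:
    """Error budget in percent: 100 minus the internal availability SLO."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in the range (0, 100]")
    return 100 - slo_percent


def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Downtime the budget permits over the window (assumed 30 days by default)."""
    window_minutes = window_days * 24 * 60
    return window_minutes * error_budget_percent(slo_percent) / 100


# Using the 99.95% internal availability SLO from the example above.
print(f"{error_budget_percent(99.95):.2f}%")            # 0.05%
print(f"{allowed_downtime_minutes(99.95):.1f} minutes")  # 21.6 minutes
```

The results are formatted when printed because floating-point subtraction such as 100 - 99.95 does not yield exactly 0.05.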

Setting SLOs based on customer satisfaction

Set an SLO buffer to accommodate maintenance windows and performance improvements without disappointing users.
Limit dependencies on services that drag down other services and take longer to load.
While drafting an SLA, business and legal teams must pick appropriate consequences and penalties in the event the agreement is breached. An SRE on the team helps them understand the likelihood and difficulty of meeting the SLOs contained in the SLA.
Be smart and conservative when advertising our services’ SLOs, because an SLA that turns out to be unachievable cannot simply be withdrawn.

Conclusion

Observability means instrumenting our systems with tools (like OpenTelemetry and Prometheus) to collect actionable data that tells us when errors occur and, more importantly, why they happen. It’s an approach to understanding multilayer architectures to find what is broken and what requires improvement for better performance.

SREs can use observability to:

Provide high-quality applications and software at scale.
View real-time performance for their digital assets.
Construct a sustainable innovation environment.
Maximize organizational investments in the cloud and other advanced tools.
Strictly define SLOs and SLIs, and meet the SLAs for customers.
