
March 31, 2022

Site reliability engineering at Systematic

Site Reliability Engineer or DevOps? What's the difference?

Alexandru Dejanu, Site Reliability Engineer at Systematic: "One of the most significant advantages of being an SRE at Systematic is that the team is technology agnostic, which means that I'm interacting with new frameworks frequently".

by Alexandru Dejanu, Site Reliability Engineer

 

The Site Reliability Engineer role is crucial at Systematic since it enables the development teams to achieve better product reliability. Alexandru Dejanu, Site Reliability Engineer at Systematic, tells us what it is like to be part of the Customer Operations team. Learn more from Alex and his experience as an SRE at Systematic.

apiVersion: apps/v1
kind: SRE
metadata:
  name: SYSTEMATIC
  labels:
    app: ALEX DEJANU

At Systematic, I fully embraced the Site Reliability Engineering role, a pretty new paradigm in the IT field (especially in Romania) whose goal is to improve the reliability of systems in production.

Before embarking on the SRE journey, I worked as a DevOps engineer. My main focus was to bridge the gap between development and operations teams by enabling CI/CD and automating different processes. Still, here at Systematic, I discovered that a new challenge lay in front of me.

Taking a step forward as an SRE, I've come to understand some of the main responsibilities of this position by helping both development and operations teams gain full visibility into the complete application lifecycle. Here, I am focused on reducing toil and ensuring the applications' availability while also establishing and monitoring service-level metrics.

Three main categories of activities that a Site Reliability Engineer does at Systematic

Now I am part of the Customer Operations department. I'm working in a multi-project squad, which means that we serve multiple teams across various industry sectors such as library and learning, healthcare, defence, and renewables.

The tech stack is quite diverse: we're working with Kubernetes, OpenShift, Azure, Ansible, Grafana, Prometheus, and so forth. Given the range of industries and the breadth of the technology stack, I can say that no two days are the same.

From a high-level perspective, the main activities are focused on observability and incident response (e.g., postmortems). Observability is not to be confused with monitoring: with monitoring you are handling "predictable" failures, whereas observability provides a way to infer the state of a system. Last but not least, another big part of the tasks is implementing POCs, capacity management, and incident management.

A day in the life of an SRE and recurrent tasks

Recurrent is quite a strong word. There aren't intrinsically recurrent tasks. We're using the Feature Driven Development process, which is oriented towards speed and efficiency.

One day you could implement a new Prometheus exporter, and the next day you could measure the cost allocation for a K8s cluster. Grafana dashboards are for sure one of our "golden hammers," and at some point, some investigation tasks will require juggling between Lucene's query syntax and PromQL. 

But at the end of the day, an essential detail is taking the DevOps mindset a step forward. I can wholeheartedly say that all the daily tasks aim to achieve better product reliability. And when the team's main values are collaboration and progress, we are confident that this goal will be reached.

The challenging part is finding new ways to measure service reliability while proactively monitoring and optimizing workflows.

Also, one key detail of this role is understanding the importance of Service-Level Objectives (SLOs), Service-Level Agreements (SLAs), and Service-Level Indicators (SLIs). I would say that they're a direct measurement of a service's behaviour.
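To make the relationship between an indicator and an objective a bit more concrete, here is a minimal Python sketch. It assumes a simple request-based availability SLI (good requests divided by total requests); the names `evaluate_slo` and `SloReport`, and the numbers in the usage example, are hypothetical and only for illustration, not a description of how any Systematic team computes these values.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    sli: float              # measured fraction of good requests
    error_budget: float     # allowed fraction of bad requests (1 - SLO target)
    budget_consumed: float  # share of the error budget used (1.0 = fully spent)
    slo_met: bool           # True when the SLI is at or above the target


def evaluate_slo(good_requests: int, total_requests: int, slo_target: float) -> SloReport:
    """Compare a request-based SLI against an SLO target (hypothetical helper)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")

    sli = good_requests / total_requests
    error_budget = 1.0 - slo_target
    budget_consumed = (1.0 - sli) / error_budget
    return SloReport(sli, error_budget, budget_consumed, sli >= slo_target)


if __name__ == "__main__":
    # Assumed example: 999,500 good requests out of 1,000,000 with a 99.9% target.
    report = evaluate_slo(999_500, 1_000_000, 0.999)
    print(f"SLI: {report.sli:.4%}, budget consumed: {report.budget_consumed:.0%}, "
          f"SLO met: {report.slo_met}")
```

In this sketch the SLI is what was measured, the SLO is the target it is compared against, and the error budget is how much unreliability the target still tolerates. An SLA would sit on top of this as the contractual commitment to customers and is not modelled here.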

Keeping up to date with the SRE key trends

One of the most significant advantages of being an SRE at Systematic is that the team is technology agnostic, which means that I'm interacting with new frameworks quite frequently. In one project, you could work in a setup consisting of Terraform with Azure, and in another, Ansible with OpenShift.

I tend to read different articles and blog posts, such as those from Red Hat. I'm also part of various communities such as Stack Overflow and GitKraken, which certainly helps with staying up to date on multiple topics.

Want to find out more about what we are doing in the Customer Operations team?