iZONE
Roadmap · 2026
Updated May 29, 2026

Site Reliability Engineer Roadmap for Beginners

A 14-month path from zero to junior SRE. Linux, observability, SLOs, incident response, Kubernetes reliability, and chaos engineering — no experience needed.

What a Site Reliability Engineer does

Build dashboards and useful alerts
Define SLIs, SLOs, and error budgets
Lead and document incident response
Manage reliability on Kubernetes
Run chaos experiments safely
Automate toil away from humans
Introduction

What is this roadmap and who is it for?

A site reliability engineer is the person who makes sure production systems stay available, fast, and recoverable when things go wrong, and who works to ensure things go wrong less often. SRE turns operations from a reactive, heroic discipline into an engineering one: measurable, automatable, and improvable.

This roadmap follows the order that actually builds SRE thinking: Linux and networking first (you can't debug what you don't understand), then observability and SLOs (you can't improve what you don't measure), then incident response and automation (you can't scale what depends on people), then advanced reliability work. Every layer depends on the one before.

One thing we want to be upfront about: SRE is one of the most context-dependent roles in engineering. The same alert means something different in a fintech system and a media platform. Building judgment about what good reliability looks like in practice is what the project work in this roadmap is for.

Before you start — 3 Things to Keep in Mind

  • 1. Linux is where almost all SRE work actually happens. You need to navigate a server, read its logs, and understand what's running before anything else makes sense.
  • 2. SLOs are the language of SRE. Start thinking about reliability in terms of user experience and error budgets early, and the rest of the role clicks into place faster than you'd expect.
  • 3. Build real things at every step. A dashboard you built for a service you actually deployed teaches more than one you copied from a tutorial.

Estimated duration

This roadmap takes 14 months at a pace of 15 to 20 hours per week.

If you can only commit 10 hours per week, plan for 18 to 22 months.

Consistency matters far more than speed.

Before you begin — what you need

  • 1. A computer: Windows, Mac, or Linux. WSL2 on Windows gives you a Linux terminal for free.
  • 2. A free AWS, GCP, or Azure account. Free tiers are sufficient for most of this roadmap.
  • 3. A GitHub account. It is free, and it is where your SRE portfolio will live.
  • 4. Docker Desktop or Podman: free, for running containers locally.
  • 5. A basic comfort with English, since the Google SRE books, most documentation, and the best community resources are written in it.
  • 6. No prior SRE, DevOps, or operations experience needed. This roadmap starts from zero.
History & Evolution

How site reliability engineering evolved over time.

SRE began as a solution to a specific problem at Google in the early 2000s. Understanding where it came from, and what it was designed to replace, is what makes the philosophy, not just the tools, make sense.
Pre-2003

Traditional Ops: The Firefighting Era

Before SRE, operations was largely reactive. Systems went down, people stayed up late to fix them, and the same incidents recurred because the fixes were manual and the knowledge was tribal. The 'war room' model — everyone in one place, hands on keyboards, heroic debugging — was the norm. It didn't scale.

2003

Google Creates the First SRE Team

Ben Treynor Sloss formed Google's first SRE team in 2003. His premise was that reliability work should be done by software engineers who apply engineering discipline to operations — not operators who improvise. The team built tools, automated repetitive work, and set measurable targets for system health. SRE was born as a reaction to heroism, not an extension of it.

2016

The SRE Book Makes the Practice Public

Google published the Site Reliability Engineering book in 2016 and made it freely available online. For the first time, the practices around SLOs, error budgets, toil reduction, postmortems, and production readiness reviews were documented in one place. The SRE book became the closest thing the industry has to a canonical foundation for the discipline.

2018–2020

SLOs Become the Industry Language

The concept of Service Level Objectives spread beyond Google into the broader industry. Teams everywhere began defining SLIs (what to measure), SLOs (the target), and error budgets (how much failure is acceptable before action is required). Cloud providers published reliability frameworks — AWS Well-Architected, Azure Well-Architected, Google Cloud Reliability — all centred on these same concepts. SRE stopped being a Google-only idea.

2019–2022

Observability Matures Beyond Monitoring

OpenTelemetry was formed in 2019 as a merger of OpenCensus and OpenTracing — providing a vendor-neutral standard for traces, metrics, and logs. The shift from 'monitoring' (did something break?) to 'observability' (why did it break?) became mainstream. Distributed systems running in Kubernetes made this transition necessary — you can't debug a microservices failure with a single dashboard.

2020–2023

Chaos Engineering Goes Mainstream

Chaos engineering — deliberately injecting failures to find weaknesses before they become incidents — moved from Netflix's internal practice to an industry standard. Chaos Mesh became a CNCF project. AWS introduced Fault Injection Simulator. The mindset shifted: a system that has never been tested under failure isn't actually reliable, it's just untested.

2024–2026

AI Enters Operations — With Guardrails

AI tools began assisting SRE workflows — summarising incidents, drafting runbooks, triaging alerts, and suggesting capacity adjustments. Google's 2026 SRE research describes this shift explicitly: AI adds an autonomy layer to operations, but requires guardrails, dry-run modes, least-privilege access, and progressive authorisation before acting on production systems. The SRE role didn't disappear — it gained a new tool that requires new judgment.

In 2026, SRE is one of the most in-demand technical disciplines in the industry. The job market shows 15% annual growth in SRE roles, and entry-level positions start between $85K and $130K in the US. The skills — observability, SLOs, incident response, Kubernetes reliability, and automation — are in demand across SaaS, fintech, e-commerce, cloud infrastructure, and data platforms. Remote work is common. And the Google SRE books remain the canonical foundation: free to read, still the most authoritative description of what the role is and why it exists.

Market Reality

The honest state of SRE jobs in 2026.

SRE is one of the highest-compensated technical roles in the industry — and one of the most specific about what it expects from candidates. Understanding what the market actually needs gives you a much clearer picture of what to build before your job search begins.

What's happening in the market

High Demand and Strong Salaries

The SRE job market shows 15% annual growth in open roles. Entry-level positions in the US range from $85K to $130K, mid-level roles from $106K to $178K, and senior positions from $165K to $215K, according to Glassdoor (May 2026). Indeed reports an average SRE salary of $156K across all experience levels. The skills are scarce enough that experienced SREs consistently command compensation above comparable software engineering roles.

Remote Work Is Standard

SRE work is cloud-native and asynchronous by design — incidents get paged, acknowledged, and investigated from wherever the engineer is. Remote and hybrid SRE roles are common at every company size. Global organisations seek SREs who understand cloud infrastructure, compliance, and distributed systems — English fluency and cloud platform experience are the common denominators across borders.

The Best Roles Are in Serious Production Environments

SRE work is most meaningful — and most educational — at organisations that run production systems with real traffic and real consequences. SaaS companies, fintech, e-commerce, cloud platforms, and data infrastructure companies are the natural homes for SRE teams. At these organisations, reliability work is a daily discipline, not something done only during outages.

Hands-On Projects Are the Differentiator

A beginner who has deployed a service, instrumented it with OpenTelemetry, defined SLOs, built dashboards, written alerts, and run a chaos drill will stand out far more than someone who watched tutorials about those things. The practical, workshop-driven emphasis in Google's SRE materials reflects what hiring managers actually test — they want to see what you built, not what you've read.

What you can do instead — or as well

Platform Engineering

Platform engineering builds the internal developer platforms that SRE practices run on — CI/CD pipelines, Kubernetes clusters, developer portals, golden path templates. SRE and platform engineering overlap heavily in tooling; the difference is focus. SRE focuses on production reliability and incident response. Platform engineering focuses on developer productivity and self-service infrastructure. Many SREs transition into platform engineering as their careers progress.

Security Engineering

Google's framing of security and reliability as inseparable makes SRE a natural entry point into security engineering. SREs who develop strong IAM, threat modeling, and secure deployment skills can move into dedicated security roles that command even higher compensation and are in even shorter supply.

MLOps

Machine learning systems have their own reliability challenges — model drift, pipeline failures, data quality degradation — that require SRE thinking applied to ML workflows. SREs who add ML infrastructure knowledge can move into MLOps roles, which combine the observability and incident response skills of SRE with the feature stores, training pipelines, and model serving infrastructure of ML engineering.

Freelance SRE Consulting

Small and mid-size companies need help designing their observability stack, defining SLOs, building incident processes, and improving their production infrastructure — but don't have the budget for a full-time senior SRE. Freelance SRE consulting is a real path for engineers with demonstrated skills and a portfolio of real production work.

SRE Advocacy and Education

The SRE discipline is still maturing and the educational resources are limited outside the Google books. SREs who can write clearly about observability, SLOs, and incident management — through blogs, conference talks, or courses — build a public reputation that accelerates both their career and the field.

SRE is one of the most intellectually engaging roles in production engineering — you're building the systems that keep other systems working, learning from every failure, and constantly improving. The 14-month timeline reflects the real depth the role requires. Engineers who rush through it know the vocabulary but lack the judgment that only comes from operating real systems under real conditions.

The Learning Path

Your step-by-step guide.

Foundation

The ground everything else stands on

4 steps

Core Skills

The must-have tools of the job

4 steps

Advanced

What separates beginners from job-ready developers

4 steps

Professional

The layer that makes you hireable

2 steps

14-Month Plan

A simple 14-month learning path.

One focused area per month. Go deep — don't rush ahead before the current step feels comfortable. This timeline assumes about 15–20 hours of practice per week.
Month 1 · Step 1 of 14

Linux Fundamentals

Terminal navigation, file permissions, process management, SSH, log files, text processing with grep and awk

15–20 hrs/week
Month 2 · Step 2 of 14

Networking Fundamentals

TCP/IP, DNS, HTTP/S, TLS, ports, load balancers, latency percentiles, diagnostic tools (curl, dig, tcpdump)

15–20 hrs/week
Month 3 · Step 3 of 14

Scripting and Git

Bash scripting, Python for log parsing and API calls, error handling, git daily workflow, GitHub, README habits

15–20 hrs/week
Month 4 · Step 4 of 14

Cloud Fundamentals

Cloud compute, regions and AZs, IAM, managed databases, load balancers, autoscaling, billing alerts

15–20 hrs/week
Month 5 · Step 5 of 14

Prometheus and Grafana

Metrics types, PromQL, instrumentation, alerting rules, dashboard design with RED and USE methods, alert fatigue

15–20 hrs/week
Month 6 · Step 6 of 14

OpenTelemetry and Tracing

Spans and traces, context propagation, auto-instrumentation, OTel Collector, Jaeger, trace-metric correlation

15–20 hrs/week
Month 7 · Step 7 of 14

SLIs, SLOs, and Error Budgets

SLI design, SLO targets, error budget policy, multi-window SLOs, Prometheus recording rules for SLO tracking

15–20 hrs/week
Month 8 · Step 8 of 14

Incident Response and Postmortems

Incident lifecycle, severity levels, runbooks, blameless postmortems, on-call sustainability, communication during incidents

15–20 hrs/week
Month 9 · Step 9 of 14

Kubernetes Reliability

Probes, PodDisruptionBudgets, resource limits, HPA, rolling updates, kube-state-metrics, Kubernetes events

15–20 hrs/week
Month 10 · Step 10 of 14

Infrastructure as Code

Terraform resources and state, monitoring stack as code, production readiness reviews, GitOps for infrastructure

15–20 hrs/week
Month 11 · Step 11 of 14

Chaos Engineering

Chaos Mesh experiments, AWS FIS, steady-state hypothesis, game days, game day reports, blast radius management

15–20 hrs/week
Month 12 · Step 12 of 14

Capacity and Performance

Four golden signals, load testing with k6, bottleneck identification, graceful degradation, circuit breakers, saturation

15–20 hrs/week
Month 13 · Step 13 of 14

Security for Reliability

IAM least privilege, secret management with Vault, image scanning with Trivy, secure container configuration, TLS everywhere

15–20 hrs/week
Month 14 · Step 14 of 14

Portfolio and Interview Preparation

Complete SRE portfolio project, on-call simulation with PagerDuty, postmortem portfolio, CKA exam prep, interview practice

15–20 hrs/week
Priority Order

What to focus on first.

Starting from zero? Follow this order. It is the fastest path to being job-ready. Each item builds on the one before it — don't skip ahead.
1

Linux

SRE work starts with understanding what is actually running on a server. Without command-line fluency, every other tool is a black box — and production incidents are investigated at the command line, not through a UI.

2

Networking

Most production incidents involving services that 'can't reach' each other are network problems. Engineers who can work through a network diagnostic checklist — DNS, firewall, TLS, port, routing — resolve connectivity incidents in minutes instead of hours.
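
As a rough illustration of that checklist, here is a minimal Python sketch that checks three of the layers in order: DNS resolution, TCP reachability, and the TLS handshake. The function names (`check_dns`, `check_tcp`, `check_tls`, `diagnose`) and the timeouts are our own choices for this example, not part of any standard tool, and a real diagnosis would also use curl, dig, and tcpdump.

```python
import socket
import ssl


def check_dns(host):
    """Step 1: does the name resolve? Returns the resolved IP addresses."""
    infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def check_tcp(host, port, timeout=3.0):
    """Step 2: is the port reachable? A refusal or timeout points at a firewall,
    a routing problem, or a service that is not listening."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_tls(host, port=443, timeout=3.0):
    """Step 3: does the TLS handshake succeed? Returns the negotiated version."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as raw:
        with context.wrap_socket(raw, server_hostname=host) as tls:
            return tls.version()


def diagnose(host, port, use_tls=False):
    """Walk the checklist in order and report the first layer that fails."""
    try:
        addresses = check_dns(host)
    except socket.gaierror as exc:
        return f"DNS failed for {host}: {exc}"
    if not check_tcp(host, port):
        return f"DNS ok ({addresses[0]}), but TCP port {port} is unreachable"
    if use_tls:
        try:
            version = check_tls(host, port)
        except (ssl.SSLError, OSError) as exc:
            return f"DNS and TCP ok, but the TLS handshake failed: {exc}"
        return f"DNS, TCP, and TLS ({version}) all ok for {host}:{port}"
    return f"DNS ok ({addresses[0]}), TCP port {port} reachable"
```

Calling `diagnose("example.com", 443, use_tls=True)` from a machine with internet access walks all three steps. The value of the ordering is that each step rules out a layer before you look at the next one.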

3

Scripting and Git

Toil — repetitive manual work that computers could do — is one of the things SRE is designed to eliminate. Scripting is the tool that makes elimination possible. Git is how operational scripts stay reviewed, versioned, and trustworthy.
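
To make that concrete, here is a small sketch of the kind of script that replaces a manual "grep the logs and count by hand" routine. The log format and the service names are made up for this example; adapt the pattern to whatever your own service writes.

```python
import re
from collections import Counter

# Assumed log format, for illustration only:
#   "<date> <time> <LEVEL> <service>: <message>"
LINE_PATTERN = re.compile(
    r"^(?P<date>\S+) (?P<time>\S+) (?P<level>[A-Z]+) (?P<service>[\w-]+): (?P<message>.*)$"
)


def count_errors_by_service(lines):
    """Count ERROR lines per service, skipping lines that do not match the format."""
    counts = Counter()
    for line in lines:
        match = LINE_PATTERN.match(line.strip())
        if match and match.group("level") == "ERROR":
            counts[match.group("service")] += 1
    return counts


sample = [
    "2026-05-01 10:00:01 INFO checkout: request served in 120ms",
    "2026-05-01 10:00:02 ERROR checkout: upstream timeout",
    "2026-05-01 10:00:03 ERROR payments: connection pool exhausted",
    "2026-05-01 10:00:04 ERROR checkout: upstream timeout",
    "not a structured line",
]

# Print services with the most errors first.
for service, total in count_errors_by_service(sample).most_common():
    print(f"{service}: {total} error(s)")
```

Once a script like this lives in Git, a teammate can review the pattern, and the next incident starts from a known-good tool instead of a half-remembered one-liner.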

4

Cloud Fundamentals

Modern production runs in the cloud. IAM, availability zones, managed services, and load balancers are the vocabulary of every reliability conversation at a cloud-native company.

5

Prometheus and Grafana

You can't improve reliability without measuring it. Prometheus and Grafana are the dominant metrics and dashboarding stack for SRE teams — they're the daily tools for both passive monitoring and active incident investigation.

6

OpenTelemetry

Distributed systems fail in distributed ways. Traces are how you find out which service and which operation is responsible for a latency or error problem — and OpenTelemetry is the standard for collecting them without vendor lock-in.
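
The idea of using a trace to find the responsible operation can be sketched without any tracing library. The example below uses a hypothetical, simplified `Span` record (real OpenTelemetry spans carry many more fields) and computes each span's "self time": its duration minus the time spent in its direct children. It assumes child spans run one after another rather than in parallel.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    service: str
    operation: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self):
        return self.end_ms - self.start_ms


def self_times(spans):
    """Return {span_id: self time}: span duration minus its direct children's durations."""
    child_total = {}
    for span in spans:
        if span.parent_id is not None:
            child_total[span.parent_id] = child_total.get(span.parent_id, 0) + span.duration_ms
    return {s.span_id: s.duration_ms - child_total.get(s.span_id, 0) for s in spans}


def slowest_operation(spans):
    """Pick the span that spent the most time in its own work."""
    times = self_times(spans)
    return max(spans, key=lambda s: times[s.span_id])


# A made-up trace: the gateway calls cart, then payments, which queries its database.
trace = [
    Span("a", None, "gateway", "GET /checkout", 0, 900),
    Span("b", "a", "cart", "load_cart", 10, 110),
    Span("c", "a", "payments", "authorize", 120, 880),
    Span("d", "c", "payments-db", "SELECT card", 130, 830),
]

culprit = slowest_operation(trace)
print(culprit.service, culprit.operation, self_times(trace)[culprit.span_id], "ms")
```

The root span looks slowest by total duration, but almost all of that time belongs to the database query three levels down. That is the question a trace view in a tool like Jaeger lets you answer visually.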

7

SLOs and Error Budgets

SLOs are the language of SRE. They resolve the tension between velocity and reliability by making trade-offs objective. An SRE who can define, measure, and act on SLOs is productive on any team from day one.
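
Here is a minimal sketch of how an error budget turns that trade-off into numbers. The 99.9% target, the request counts, and the ship-or-freeze rule are illustrative assumptions; a real error budget policy is something your team agrees on and writes down.

```python
def error_budget_report(slo_target, total_requests, failed_requests):
    """Summarise how much of the error budget a service has spent in a window."""
    allowed_failures = total_requests * (1 - slo_target)
    consumed = failed_requests / allowed_failures
    return {
        "sli": 1 - failed_requests / total_requests,
        "allowed_failures": allowed_failures,
        "budget_consumed": consumed,
        "budget_remaining": 1 - consumed,
    }


def burn_rate(error_ratio, slo_target):
    """How fast the budget is being spent: 1.0 uses exactly the whole budget
    over the SLO window, 10.0 uses it ten times faster."""
    return error_ratio / (1 - slo_target)


def release_decision(report):
    """Hypothetical policy: keep shipping while budget remains, otherwise
    prioritise reliability work."""
    if report["budget_remaining"] > 0:
        return "ship features"
    return "freeze releases and fix reliability"


report = error_budget_report(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
print(f"budget remaining: {report['budget_remaining']:.0%}")
print(release_decision(report))
```

With 400 failures against roughly 1,000 allowed, about 60% of the budget is left, so the sketch's policy says to keep shipping. The same function with 1,500 failures returns the freeze decision, which is the objective trade-off the paragraph above describes.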

8

Incident Response

Incidents are the highest-stakes moments in an SRE's day. A clear process — triage, mitigation, recovery, postmortem — is what prevents a 30-minute incident from becoming a 4-hour one, and what prevents the same incident from happening again.

9

Kubernetes Reliability

Kubernetes introduces failure modes that don't exist in static deployments — probe misconfiguration, resource exhaustion, disrupted rollouts. SREs who understand these patterns don't just fix Kubernetes problems; they prevent them.
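
As one small example of prevention, the sketch below builds a PodDisruptionBudget manifest as a plain Python dict and shows the arithmetic behind it. The names `checkout-pdb` and `checkout` are placeholders, and `allowed_disruptions` is a simplification of what the Kubernetes controller computes from the pods that match the selector.

```python
import json


def pod_disruption_budget(name, app_label, min_available):
    """Build a policy/v1 PodDisruptionBudget manifest as a plain dict."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name},
        "spec": {
            "minAvailable": min_available,
            "selector": {"matchLabels": {"app": app_label}},
        },
    }


def allowed_disruptions(healthy_pods, min_available):
    """Voluntary evictions the budget permits right now (never negative)."""
    return max(0, healthy_pods - min_available)


manifest = pod_disruption_budget("checkout-pdb", "checkout", min_available=2)
print(json.dumps(manifest, indent=2))   # kubectl apply accepts JSON as well as YAML

for healthy in (3, 2):
    print(f"{healthy} healthy pods -> {allowed_disruptions(healthy, 2)} eviction(s) allowed")
```

With three healthy pods and `minAvailable: 2`, a node drain may evict one pod at a time. With only two healthy pods, the drain has to wait, which is exactly the outage the budget is there to prevent.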

10

Infrastructure as Code

Infrastructure that can't be recreated from code in 30 minutes is a fragile single point of failure. IaC makes recovery reproducible, makes changes reviewable, and makes the infrastructure observable to the whole team.

11

Chaos Engineering

Chaos experiments verify that your resilience mechanisms — redundancy, health checks, PodDisruptionBudgets, circuit breakers — actually work before a real incident tests them. A portfolio that includes chaos experiment results is a genuine differentiator.

12

Portfolio and Certifications

A complete SRE portfolio project — deployed service, observability stack, SLO definitions, runbooks, postmortems — is the most credible signal that a junior candidate can give. The CKA and Google Cloud Professional DevOps Engineer certifications add structured external validation on top.

Challenges & Solutions

Problems every beginner faces — and how to get through them.

You will hit these walls. Every developer does. Knowing they are coming makes them much easier to push through.

Building Alerts Before You Understand What to Measure

What it looks like

You set up Prometheus and immediately create 20 alert rules for CPU, memory, disk, and every metric you can find. Two weeks later, you've had 50 alert notifications and investigated 2 real problems. The other 48 were noise.

How to get through it

Start with user-facing SLIs and the RED method — request rate, error rate, duration. Build 3 good alerts on those signals before adding anything infrastructure-level. A pager that fires on a real user experience problem is worth ten dashboards of CPU graphs. Alert fatigue is a reliability problem, not just an annoyance.
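
To show what the three RED signals are in practice, here is a sketch that computes them from a list of request records. In a real stack Prometheus does this work from counters and histograms; the `Request` record, the nearest-rank percentile, and the sample data below are assumptions made for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    timestamp_s: float   # seconds since the start of the window
    status: int          # HTTP status code
    duration_ms: float


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def red_summary(requests, window_s):
    """Rate, Errors, Duration for one window of request records."""
    if not requests:
        raise ValueError("no requests in window")
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "rate_per_s": len(requests) / window_s,
        "error_ratio": errors / len(requests),
        "p95_ms": percentile([r.duration_ms for r in requests], 95),
    }


# Twenty made-up requests over a 10-second window, two of them server errors.
requests = [
    Request(timestamp_s=i * 0.5, status=500 if i in (3, 17) else 200, duration_ms=50 + 10 * i)
    for i in range(20)
]
print(red_summary(requests, window_s=10))
```

Three alerts built on these numbers (traffic dropped, error ratio too high, p95 too slow) cover most user-visible problems before you add a single CPU or disk alert.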

Treating Uptime as the Only Goal

What it looks like

You focus entirely on keeping the service 'up' and measure yourself by whether it went down. You miss three incidents where the service was technically running but returning errors for 20% of requests for 45 minutes each.

How to get through it

Define SLOs before you think about uptime. Google's SRE materials are explicit: reliability means user experience, not server heartbeats. A service returning 20% errors is not reliable even if every process is running. Measure what users experience, not what servers report.
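
The scenario above can be worked through as arithmetic. The sketch below assumes a 30-day window, steady traffic, and a 99.9% SLO target; none of those numbers come from the roadmap, they are only there to make the comparison concrete.

```python
WINDOW_MINUTES = 30 * 24 * 60      # 30-day window
SLO_TARGET = 0.999                 # assumed target for this example

# Three incidents: 45 minutes each, 20% of requests failing.
incidents = [(45, 0.20), (45, 0.20), (45, 0.20)]

# "Bad minutes": each incident weighted by the share of requests it broke
# (valid only under the steady-traffic assumption).
bad_minutes = sum(minutes * error_fraction for minutes, error_fraction in incidents)

uptime = 1.0                                      # every process stayed up the whole time
request_sli = 1 - bad_minutes / WINDOW_MINUTES    # what users actually experienced
budget_minutes = WINDOW_MINUTES * (1 - SLO_TARGET)
budget_consumed = bad_minutes / budget_minutes

print(f"uptime:          {uptime:.4%}")
print(f"request SLI:     {request_sli:.4%}")
print(f"budget consumed: {budget_consumed:.1%}")
```

Under these assumptions the uptime number stays at 100% while well over half of the month's error budget is gone. That gap is why the SLI should be defined on requests, not on whether a process is running.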

The Same Incident Keeps Happening

What it looks like

You've fixed the same database connection exhaustion problem three times in six weeks. Each time you resolve the immediate incident, but the service is back to its original configuration within days.

How to get through it

Write postmortems with concrete action items that have owners and deadlines — not vague observations. 'We should improve connection pooling' is not an action item. 'PR #247 adds a connection pool limit of 50 by Month 1, owned by Alex' is. If the same incident recurs, it means the postmortem action items weren't completed or weren't effective — go back and ask why.

Kubernetes Is Confusing Before It Clicks

What it looks like

You read the Kubernetes documentation and still can't form a clear mental model of why your service isn't routing traffic, why a pod keeps restarting, or what the difference between a liveness and readiness probe actually means in practice.

How to get through it

Run kubectl get events --sort-by=.lastTimestamp and kubectl describe pod on every problem before changing anything. The Kubernetes event log and pod description contain plain-text explanations of almost every common failure. The concepts click after you've used them on real workloads — not from reading the documentation alone.
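
One way to make the liveness versus readiness distinction click is to write both endpoints yourself. The sketch below uses only the Python standard library; the `/healthz` and `/readyz` paths are a common convention rather than a requirement, and the `ready` flag stands in for whatever startup work your real service does.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Set once startup work (cache warm-up, database connection) has finished.
ready = threading.Event()


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            # Liveness: the process can still answer. If this probe keeps failing,
            # the kubelet restarts the container.
            self._reply(200, "alive")
        elif self.path == "/readyz":
            # Readiness: safe to receive traffic. If this probe fails, the pod is
            # removed from Service endpoints but is NOT restarted.
            if ready.is_set():
                self._reply(200, "ready")
            else:
                self._reply(503, "starting")
        else:
            self._reply(404, "not found")

    def _reply(self, status, body):
        payload = body.encode()
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def start_server(port=0):
    """Start the server on a background thread; port 0 lets the OS pick a free port."""
    server = ThreadingHTTPServer(("127.0.0.1", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Start it with `server = start_server(8080)`, request both paths with curl, then call `ready.set()` and request `/readyz` again. A pod in the "alive but not ready" state is running and healthy yet receives no traffic, which explains many "why isn't my service routing?" moments.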

Fear of Chaos Engineering

What it looks like

The idea of deliberately breaking a service feels reckless. You understand the theory but avoid running actual chaos experiments because you're worried about causing real damage.

How to get through it

Start in a local environment with Minikube or Kind — no cloud cost, no real traffic, no stakeholder impact. Chaos Mesh works perfectly on a local cluster. Run experiments on services you built yourself. The goal is to find out whether your resilience mechanisms work before a real incident does. The first experiment that reveals a missing PodDisruptionBudget is one of the most valuable learning moments in this roadmap.
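
Before touching a real cluster, it can help to see the shape of an experiment in plain code. The sketch below is a simulation only: `SimulatedService`, `kill_one_pod`, and the 99% threshold are invented for illustration and have nothing to do with the Chaos Mesh API. What it does show is the standard loop: check steady state, inject a fault, measure, always roll back, and record the result.

```python
from dataclasses import dataclass


@dataclass
class SimulatedService:
    """Stand-in for a real deployment: requests succeed while enough replicas are healthy."""
    replicas: int
    min_replicas_for_full_capacity: int
    healthy: int = 0

    def __post_init__(self):
        self.healthy = self.replicas

    def success_ratio(self):
        # Capacity degrades linearly once healthy replicas drop below what the load needs.
        return min(1.0, self.healthy / self.min_replicas_for_full_capacity)


def run_experiment(service, inject, rollback, threshold=0.99):
    """Steady-state hypothesis: success ratio stays at or above `threshold` during the fault."""
    before = service.success_ratio()
    if before < threshold:
        return {"status": "aborted", "reason": "system not in steady state before the experiment"}
    inject(service)
    try:
        during = service.success_ratio()
    finally:
        rollback(service)   # always restore, even if measuring fails
    after = service.success_ratio()
    return {
        "status": "hypothesis held" if during >= threshold else "weakness found",
        "before": before,
        "during": during,
        "after": after,
    }


def kill_one_pod(service):
    service.healthy -= 1


def restore(service):
    service.healthy = service.replicas


with_headroom = SimulatedService(replicas=3, min_replicas_for_full_capacity=2)
no_headroom = SimulatedService(replicas=2, min_replicas_for_full_capacity=2)
print(run_experiment(with_headroom, kill_one_pod, restore))
print(run_experiment(no_headroom, kill_one_pod, restore))
```

The first service survives losing a pod because it has a spare replica; the second drops to half capacity. On a real cluster the injection would be a Chaos Mesh experiment and the measurement would come from your SLO dashboard, but the structure of the experiment stays the same.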

Imposter Syndrome in a Field That Reads Complex

What it looks like

You see SRE job descriptions listing Kubernetes, Prometheus, OpenTelemetry, Terraform, Chaos Mesh, SLOs, and Go, and assume you need to master all of them before applying to anything.

How to get through it

Junior SRE roles don't expect expertise across the full stack. They expect demonstrable competence in observability, incident response, and SLO thinking — plus genuine curiosity about systems and reliability. One well-built portfolio project with a real SLO, real dashboards, and two or three real postmortems is more persuasive than a list of tools you've heard of.

Can't Get the First SRE Role

What it looks like

SRE job postings require two to four years of 'production experience with SLOs.' You haven't had a production system to manage, so you feel like the entry point doesn't exist.

How to get through it

Create your own production environment — deploy a real service on a cloud free tier, instrument it, define SLOs, configure alerting, run chaos experiments, and write postmortems. DevOps, infrastructure, and on-call support roles are legitimate entry points that provide the production exposure SRE roles later require. Internships at companies with serious production infrastructure are worth taking even at lower compensation — the learning compounds quickly.

Job-ready checklist

You're ready for a junior SRE role when you can:

Navigate a Linux server entirely from the terminal, investigate a failing service through its logs and process state, and fix a simple configuration problem without instructions.

Build a Prometheus and Grafana observability stack from scratch, instrument a service with RED method metrics, and write a useful alert rule that fires on a real condition and resolves automatically.

Add OpenTelemetry instrumentation to a service, view distributed traces in Jaeger, and use trace data to identify which operation is responsible for high latency.

Write a formal SLO document for a service — SLI definition, target, measurement method, and error budget policy — and implement it as a Prometheus recording rule with a Grafana dashboard.

Work through a simulated incident with a clear process — detection, triage, mitigation, recovery — and write a blameless postmortem with a timeline and three concrete action items.

Deploy a service to Kubernetes with correctly configured probes, resource limits, a HorizontalPodAutoscaler, and a PodDisruptionBudget — and explain what failure each one prevents.

Design and run a chaos experiment, observe its effect on SLO burn rate in Grafana, and document the finding in a gameday report with a conclusion about system resilience.

A good junior SRE isn't someone who's memorised the Google SRE book. They understand how to measure reliability from the user's perspective, can investigate a production incident methodically, know how to write a postmortem that leads to real improvement, and have the engineering instinct to automate what shouldn't need a human. Fourteen months is a real investment — and the portfolio you finish with proves you can actually do the work.

Conclusion

You now have a clear path forward.

SRE compounds the same way other engineering disciplines do — every incident you investigate teaches you something the next one benefits from, every SLO you define sharpens your ability to measure what actually matters, and every chaos experiment reveals a weakness that would have found you eventually anyway. The roadmap gives you the order. The depth comes from operating real systems under real conditions.

The goal was never to learn every tool in the observability ecosystem. It was to reach a point where you can look at a production system, understand how it behaves under normal and abnormal conditions, define what 'healthy' means in measurable terms, and improve it systematically when it falls short.

Start with Linux, read the system logs on a real server, and keep going from there.


External Resources

Trusted places to keep learning.

Google SRE Books — sre.google/books

The Site Reliability Engineering book, The Site Reliability Workbook, and Building Secure and Reliable Systems — all free to read online. These are the canonical foundation of the discipline. Start with the SRE book, use the Workbook for practical implementation guidance, and read Building Secure and Reliable Systems for the security-reliability connection.

Google SRE — Art of SLOs Workshop

Google's official SLO workshop — a practical, structured guide to defining SLIs, setting SLO targets, calculating error budgets, and writing error budget policies. The most authoritative free resource for SLO design on this roadmap. Includes templates and worked examples.

OpenTelemetry Documentation

The official OpenTelemetry docs — covering traces, metrics, logs, auto-instrumentation, manual instrumentation, the OTel Collector, and exporters for every major backend. The most important resource for building vendor-neutral observability. Start with the Getting Started guide for your language of choice.

Prometheus Documentation

The official Prometheus docs — covering the data model, PromQL query language, alerting rules, recording rules, and instrumentation client libraries. The most authoritative reference for the metrics layer of the SRE observability stack. The PromQL documentation is worth reading cover to cover.

Chaos Mesh Documentation

The official Chaos Mesh docs — a CNCF cloud-native chaos engineering platform. Covers installation, experiment types (network chaos, pod failures, stress tests, application-level chaos), and workflow design. The starting point for any structured chaos engineering practice in a Kubernetes environment.
