Site Reliability Engineer Roadmap for Beginners
A 14-month path from zero to junior SRE. Linux, observability, SLOs, incident response, Kubernetes reliability, and chaos engineering — no experience needed.
What a Site Reliability Engineer does
What is this roadmap and who is it for?
A site reliability engineer is the person who makes sure production systems stay available, fast, and recoverable when things go wrong, and who works to ensure things go wrong less often. SRE turns operations from a reactive, heroic discipline into an engineering one: measurable, automatable, and improvable.
This roadmap follows the order that actually builds SRE thinking: Linux and networking first (you can't debug what you don't understand), then observability and SLOs (you can't improve what you don't measure), then incident response and automation (you can't scale what depends on people), then advanced reliability work. Every layer depends on the one before.
One thing we want to be upfront about: SRE is one of the most context-dependent roles in engineering. The same alert means something different in a fintech system and a media platform. Building judgment about what good reliability looks like in practice is what the project work in this roadmap is for.
Before you start — 3 Things to Keep in Mind
1. Linux is where almost all SRE work actually happens. You need to navigate a server, read its logs, and understand what's running before anything else makes sense.
2. SLOs are the language of SRE. Start thinking about reliability in terms of user experience and error budgets early; the rest of the role clicks into place faster than you'd expect.
3. Build real things at every step. A dashboard you built for a service you actually deployed teaches more than one you copied from a tutorial.
Estimated duration
This roadmap takes 14 months at a pace of 15 to 20 hours per week.
If you can only commit 10 hours per week, plan for 18 to 22 months.
Consistency matters far more than speed.
Before you begin — what you need
1. A computer: Windows, Mac, or Linux. WSL2 on Windows gives you a Linux terminal for free.
2. A free AWS, GCP, or Azure account. Free tiers are sufficient for most of this roadmap.
3. A GitHub account. It is free, and it is where your SRE portfolio will live.
4. Docker Desktop or Podman, free, for running containers locally.
5. A basic comfort with English, since the Google SRE books, most documentation, and the best community resources are written in it.
6. No prior SRE, DevOps, or operations experience needed. This roadmap starts from zero.
How site reliability engineering evolved over time.
Traditional Ops: The Firefighting Era
Before SRE, operations was largely reactive. Systems went down, people stayed up late to fix them, and the same incidents recurred because the fixes were manual and the knowledge was tribal. The 'war room' model — everyone in one place, hands on keyboards, heroic debugging — was the norm. It didn't scale.
Google Creates the First SRE Team
Ben Treynor Sloss formed Google's first SRE team in 2003. His premise was that reliability work should be done by software engineers who apply engineering discipline to operations — not operators who improvise. The team built tools, automated repetitive work, and set measurable targets for system health. SRE was born as a reaction to heroism, not an extension of it.
The SRE Book Makes the Practice Public
Google published the Site Reliability Engineering book in 2016 and made it freely available online. For the first time, the practices around SLOs, error budgets, toil reduction, postmortems, and production readiness reviews were documented in one place. The SRE book became the closest thing the industry has to a canonical foundation for the discipline.
SLOs Become the Industry Language
The concept of Service Level Objectives spread beyond Google into the broader industry. Teams everywhere began defining SLIs (what to measure), SLOs (the target), and error budgets (how much failure is acceptable before action is required). Cloud providers published reliability frameworks — AWS Well-Architected, Azure Well-Architected, Google Cloud Reliability — all centred on these same concepts. SRE stopped being a Google-only idea.
Observability Matures Beyond Monitoring
OpenTelemetry was formed in 2019 as a merger of OpenCensus and OpenTracing — providing a vendor-neutral standard for traces, metrics, and logs. The shift from 'monitoring' (did something break?) to 'observability' (why did it break?) became mainstream. Distributed systems running in Kubernetes made this transition necessary — you can't debug a microservices failure with a single dashboard.
Chaos Engineering Goes Mainstream
Chaos engineering — deliberately injecting failures to find weaknesses before they become incidents — moved from Netflix's internal practice to an industry standard. Chaos Mesh became a CNCF project. AWS introduced Fault Injection Simulator. The mindset shifted: a system that has never been tested under failure isn't actually reliable, it's just untested.
AI Enters Operations — With Guardrails
AI tools began assisting SRE workflows — summarising incidents, drafting runbooks, triaging alerts, and suggesting capacity adjustments. Google's 2026 SRE research describes this shift explicitly: AI adds an autonomy layer to operations, but requires guardrails, dry-run modes, least-privilege access, and progressive authorisation before acting on production systems. The SRE role didn't disappear — it gained a new tool that requires new judgment.
In 2026, SRE is one of the most in-demand technical disciplines in the industry. The job market shows 15% annual growth in SRE roles, and entry-level positions start between $85K and $130K in the US. The skills — observability, SLOs, incident response, Kubernetes reliability, and automation — are in demand across SaaS, fintech, e-commerce, cloud infrastructure, and data platforms. Remote work is common. And the Google SRE books remain the canonical foundation: free to read, still the most authoritative description of what the role is and why it exists.
What's shaping SRE in 2026.
SLOs Are the Center of Every Reliability Conversation
Service Level Objectives have become the shared language between engineering, product, and business leadership. Error budgets create an objective conversation about reliability trade-offs — when the budget is healthy, teams ship faster; when it's exhausted, reliability work takes priority. Organisations that skip SLOs have less productive conversations about reliability and make worse decisions.
Observability Is Now Expected Infrastructure
OpenTelemetry is the de facto standard for vendor-neutral instrumentation. Prometheus and Grafana remain the dominant metrics and dashboarding stack. Distributed tracing has moved from advanced topic to operational baseline for any system running more than a handful of services. SREs who can build and maintain a full observability stack are immediately productive on almost any modern team.
Chaos Engineering Is a Standard Reliability Practice
Chaos Mesh, AWS Fault Injection Simulator, and similar tools have made controlled failure injection accessible. Mature reliability teams run chaos experiments as a routine part of production readiness reviews. For SRE beginners, building a chaos lab and documenting what you found is one of the most distinctive portfolio items available.
AI Augments Operations — Carefully
AI tools assist with incident triage, runbook generation, log summarisation, and alert routing. Google's 2026 SRE research describes the required architecture: autonomy levels, dry-run modes, least-privilege access, and human approval gates before AI agents can act on production. SREs who understand how to design safe autonomy boundaries are the most valuable in AI-integrated operations.
Security and Reliability Are the Same Thing
Google's Building Secure and Reliable Systems book frames security and reliability as inseparable — a system that can be compromised is not actually reliable, and a system that fails unpredictably cannot be secured. In 2026, SREs are expected to understand IAM, secret management, least-privilege deployment patterns, and how security incidents affect reliability SLOs.
The honest state of SRE jobs in 2026.
What's happening in the market
High Demand and Strong Salaries
The SRE job market shows 15% annual growth in open roles. Entry-level positions in the US start at $85K to $130K, with mid-level roles at $106K to $178K and senior positions reaching $165K to $215K according to Glassdoor (May 2026). Indeed reports the average SRE salary at $156K across all experience levels. The skills are scarce enough that experienced SREs consistently command compensation above comparable software engineering roles.
Remote Work Is Standard
SRE work is cloud-native and asynchronous by design — incidents get paged, acknowledged, and investigated from wherever the engineer is. Remote and hybrid SRE roles are common at every company size. Global organisations seek SREs who understand cloud infrastructure, compliance, and distributed systems — English fluency and cloud platform experience are the common denominators across borders.
The Best Roles Are in Serious Production Environments
SRE work is most meaningful — and most educational — at organisations that run production systems with real traffic and real consequences. SaaS companies, fintech, e-commerce, cloud platforms, and data infrastructure companies are the natural homes for SRE teams. At these organisations, reliability work is a daily discipline, not something done only during outages.
Hands-On Projects Are the Differentiator
A beginner who has deployed a service, instrumented it with OpenTelemetry, defined SLOs, built dashboards, written alerts, and run a chaos drill will stand out far more than someone who watched tutorials about those things. The practical, workshop-driven emphasis in Google's SRE materials reflects what hiring managers actually test — they want to see what you built, not what you've read.
What you can do instead — or as well
Platform Engineering
Platform engineering builds the internal developer platforms that SRE practices run on — CI/CD pipelines, Kubernetes clusters, developer portals, golden path templates. SRE and platform engineering overlap heavily in tooling; the difference is focus. SRE focuses on production reliability and incident response. Platform engineering focuses on developer productivity and self-service infrastructure. Many SREs transition into platform engineering as their careers progress.
Security Engineering
Google's framing of security and reliability as inseparable makes SRE a natural entry point into security engineering. SREs who develop strong IAM, threat modeling, and secure deployment skills can move into dedicated security roles that command even higher compensation and are in even shorter supply.
MLOps
Machine learning systems have their own reliability challenges — model drift, pipeline failures, data quality degradation — that require SRE thinking applied to ML workflows. SREs who add ML infrastructure knowledge can move into MLOps roles, which combine the observability and incident response skills of SRE with the feature stores, training pipelines, and model serving infrastructure of ML engineering.
Freelance SRE Consulting
Small and mid-size companies need help designing their observability stack, defining SLOs, building incident processes, and improving their production infrastructure — but don't have the budget for a full-time senior SRE. Freelance SRE consulting is a real path for engineers with demonstrated skills and a portfolio of real production work.
SRE Advocacy and Education
The SRE discipline is still maturing and the educational resources are limited outside the Google books. SREs who can write clearly about observability, SLOs, and incident management — through blogs, conference talks, or courses — build a public reputation that accelerates both their career and the field.
SRE is one of the most intellectually engaging roles in production engineering — you're building the systems that keep other systems working, learning from every failure, and constantly improving. The 14-month timeline reflects the real depth the role requires. Engineers who rush through it know the vocabulary but lack the judgment that only comes from operating real systems under real conditions.
Your step-by-step guide.
Foundation
The ground everything else stands on
4 steps
Core Skills
The must-have tools of the job
4 steps
Advanced
What separates beginners from job-ready engineers
4 steps
Professional
The layer that makes you hireable
2 steps
A simple 14-month learning path.
Linux Fundamentals
Terminal navigation, file permissions, process management, SSH, log files, text processing with grep and awk
Networking Fundamentals
TCP/IP, DNS, HTTP/S, TLS, ports, load balancers, latency percentiles, diagnostic tools (curl, dig, tcpdump)
Scripting and Git
Bash scripting, Python for log parsing and API calls, error handling, git daily workflow, GitHub, README habits
Cloud Fundamentals
Cloud compute, regions and AZs, IAM, managed databases, load balancers, autoscaling, billing alerts
Prometheus and Grafana
Metrics types, PromQL, instrumentation, alerting rules, dashboard design with RED and USE methods, alert fatigue
OpenTelemetry and Tracing
Spans and traces, context propagation, auto-instrumentation, OTel Collector, Jaeger, trace-metric correlation
SLIs, SLOs, and Error Budgets
SLI design, SLO targets, error budget policy, multi-window SLOs, Prometheus recording rules for SLO tracking
Incident Response and Postmortems
Incident lifecycle, severity levels, runbooks, blameless postmortems, on-call sustainability, communication during incidents
Kubernetes Reliability
Probes, PodDisruptionBudgets, resource limits, HPA, rolling updates, kube-state-metrics, Kubernetes events
Infrastructure as Code
Terraform resources and state, monitoring stack as code, production readiness reviews, GitOps for infrastructure
Chaos Engineering
Chaos Mesh experiments, AWS FIS, steady-state hypothesis, game days, gameday reports, blast radius management
Capacity and Performance
Four golden signals, load testing with k6, bottleneck identification, graceful degradation, circuit breakers, saturation
Security for Reliability
IAM least privilege, secret management with Vault, image scanning with Trivy, secure container configuration, TLS everywhere
Portfolio and Interview Preparation
Complete SRE portfolio project, on-call simulation with PagerDuty, postmortem portfolio, CKA exam prep, interview practice
What to focus on first.
Linux
SRE work starts with understanding what is actually running on a server. Without command-line fluency, every other tool is a black box — and production incidents are investigated at the command line, not through a UI.
Networking
Most production incidents involving services that 'can't reach' each other are network problems. Engineers who can work through a network diagnostic checklist — DNS, firewall, TLS, port, routing — resolve connectivity incidents in minutes instead of hours.
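The checklist order above (DNS, then port reachability, then TLS) can be sketched with the Python standard library. This is a minimal illustration, not a replacement for dig, curl, or tcpdump; the function names and the default timeout are assumptions made for this example, and firewall and routing problems are only inferred from a failed TCP connection rather than diagnosed directly.

```python
import socket
import ssl


def check_dns(host):
    """Return the sorted list of addresses the name resolves to, or [] on failure."""
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return []
    return sorted({info[4][0] for info in infos})


def check_tcp(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_tls(host, port=443, timeout=3.0):
    """Return the negotiated TLS version string, or None if the handshake fails."""
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with context.wrap_socket(raw, server_hostname=host) as tls:
                return tls.version()
    except OSError:  # ssl.SSLError and certificate errors are OSError subclasses
        return None


def diagnose(host, port, use_tls=False):
    """Run the checks in order and report the first layer that fails."""
    if not check_dns(host):
        return "dns: name does not resolve"
    if not check_tcp(host, port):
        return "tcp: port unreachable (firewall, routing, or nothing listening)"
    if use_tls and check_tls(host, port) is None:
        return "tls: handshake failed"
    return "ok"
```

A call such as diagnose("example.internal", 443, use_tls=True) (a hypothetical hostname) returns a short string naming the first failing layer, which is usually enough to decide which specialised tool to reach for next.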
Scripting and Git
Toil — repetitive manual work that computers could do — is one of the things SRE is designed to eliminate. Scripting is the tool that makes elimination possible. Git is how operational scripts stay reviewed, versioned, and trustworthy.
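As one small example of scripting away toil, the sketch below summarises an access log instead of reading it by eye. The log format, the SAMPLE lines, and the function name are assumptions made for illustration; a real log would need the regular expression adjusted to match it.

```python
import re
from collections import Counter

# Assumed format (hypothetical): <ip> - - [<time>] "<method> <path> <proto>" <status> <bytes>
LINE_RE = re.compile(r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) ')


def summarize(lines):
    """Count responses by status class and count 5xx responses per path."""
    by_class = Counter()
    failing_paths = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            # Keep track of lines the pattern could not read instead of hiding them.
            by_class["unparsed"] += 1
            continue
        status = match.group("status")
        by_class[status[0] + "xx"] += 1
        if status.startswith("5"):
            failing_paths[match.group("path")] += 1
    return by_class, failing_paths


SAMPLE = [
    '10.0.0.1 - - [01/Jan/2026:00:00:01 +0000] "GET /health HTTP/1.1" 200 12',
    '10.0.0.2 - - [01/Jan/2026:00:00:02 +0000] "POST /orders HTTP/1.1" 503 0',
    '10.0.0.3 - - [01/Jan/2026:00:00:03 +0000] "POST /orders HTTP/1.1" 500 0',
    "garbage line",
]

if __name__ == "__main__":
    classes, paths = summarize(SAMPLE)
    print(dict(classes))
    print(paths.most_common(1))
```

Committing a script like this to Git, with a README that states the expected log format, is what turns a one-off command into a reviewed and reusable tool.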
Cloud Fundamentals
Modern production runs in the cloud. IAM, availability zones, managed services, and load balancers are the vocabulary of every reliability conversation at a cloud-native company.
Prometheus and Grafana
You can't improve reliability without measuring it. Prometheus and Grafana are the dominant metrics and dashboarding stack for SRE teams — they're the daily tools for both passive monitoring and active incident investigation.
OpenTelemetry
Distributed systems fail in distributed ways. Traces are how you find out which service and which operation is responsible for a latency or error problem — and OpenTelemetry is the standard for collecting them without vendor lock-in.
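To make "which service and which operation is responsible" concrete, here is a minimal sketch that works on spans represented as plain dictionaries. It does not use the OpenTelemetry SDK; the field names, the TRACE data, and the assumption that child spans run sequentially inside their parent are all made up for illustration.

```python
def self_times(spans):
    """Return {span_id: self time in ms}: span duration minus its direct children.

    Assumes children run one after another and inside their parent (no overlap).
    """
    duration = {s["span_id"]: s["end_ms"] - s["start_ms"] for s in spans}
    self_ms = dict(duration)
    for s in spans:
        parent = s["parent_id"]
        if parent in self_ms:
            self_ms[parent] -= duration[s["span_id"]]
    return self_ms


def slowest_operation(spans):
    """Return (service, operation name, self time in ms) for the worst span."""
    self_ms = self_times(spans)
    worst = max(spans, key=lambda s: self_ms[s["span_id"]])
    return worst["service"], worst["name"], self_ms[worst["span_id"]]


# Hypothetical trace of one checkout request.
TRACE = [
    {"span_id": "a", "parent_id": None, "service": "gateway",
     "name": "GET /checkout", "start_ms": 0, "end_ms": 900},
    {"span_id": "b", "parent_id": "a", "service": "cart",
     "name": "load_cart", "start_ms": 10, "end_ms": 110},
    {"span_id": "c", "parent_id": "a", "service": "payments",
     "name": "authorize", "start_ms": 120, "end_ms": 880},
    {"span_id": "d", "parent_id": "c", "service": "payments-db",
     "name": "SELECT card", "start_ms": 130, "end_ms": 830},
]

if __name__ == "__main__":
    print(slowest_operation(TRACE))
```

The root span takes 900 ms in total, but almost all of that time is spent two levels down in the database call. Subtracting child time from each span is what separates "this service is slow" from "this service is waiting on something slow".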
SLOs and Error Budgets
SLOs are the language of SRE. They resolve the tension between velocity and reliability by making trade-offs objective. An SRE who can define, measure, and act on SLOs is productive on any team from day one.
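The arithmetic behind an error budget is short enough to sketch directly. The function names and the example numbers (a 99.9% target, one million requests) are assumptions chosen for illustration, not values from any particular service.

```python
def _check_target(slo_target):
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")


def error_budget_minutes(slo_target, window_days=30):
    """Minutes of full unavailability a time-based SLO allows in the window."""
    _check_target(slo_target)
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of a request-based error budget still unspent (negative if overspent)."""
    _check_target(slo_target)
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    print(f"{error_budget_minutes(0.999):.1f} minutes")
    print(f"{budget_remaining(0.999, 1_000_000, 250):.0%} of budget left")
```

With a 99.9% target, a 30-day window allows about 43.2 minutes of full outage, and 250 failures out of one million requests leaves about 75% of the budget. A negative result from budget_remaining is the signal that reliability work should take priority over new releases.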
Incident Response
Incidents are the highest-stakes moments in an SRE's day. A clear process — triage, mitigation, recovery, postmortem — is what prevents a 30-minute incident from becoming a 4-hour one, and what prevents the same incident from happening again.
Kubernetes Reliability
Kubernetes introduces failure modes that don't exist in static deployments — probe misconfiguration, resource exhaustion, disrupted rollouts. SREs who understand these patterns don't just fix Kubernetes problems; they prevent them.
Infrastructure as Code
Infrastructure that can't be recreated from code in 30 minutes is a fragile single point of failure. IaC makes recovery reproducible, makes changes reviewable, and makes the infrastructure observable to the whole team.
Chaos Engineering
Chaos experiments verify that your resilience mechanisms — redundancy, health checks, PodDisruptionBudgets, circuit breakers — actually work before a real incident tests them. A portfolio that includes chaos experiment results is a genuine differentiator.
Portfolio and Certifications
A complete SRE portfolio project — deployed service, observability stack, SLO definitions, runbooks, postmortems — is the most credible signal that a junior candidate can give. The CKA and Google Cloud Professional DevOps Engineer certifications add structured external validation on top.
Problems every beginner faces — and how to get through them.
Building Alerts Before You Understand What to Measure
What it looks like
You set up Prometheus and immediately create 20 alert rules for CPU, memory, disk, and every metric you can find. Two weeks later, you've had 50 alert notifications and investigated 2 real problems. The other 48 were noise.
How to get through it
Start with user-facing SLIs and the RED method — request rate, error rate, duration. Build 3 good alerts on those signals before adding anything infrastructure-level. A pager that fires on a real user experience problem is worth ten dashboards of CPU graphs. Alert fatigue is a reliability problem, not just an annoyance.
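The three RED signals can be computed from nothing more than a list of request records, which is a useful way to understand them before writing the equivalent PromQL. This is a sketch under stated assumptions: the record shape (status code, duration in seconds), the choice of HTTP 5xx as "error", and the nearest-rank percentile method are all choices made for this example.

```python
def red_summary(requests, window_seconds):
    """Compute rate, error ratio, and p95 duration from (status, duration_s) records."""
    if not requests:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "p95_s": 0.0}
    durations = sorted(duration for _, duration in requests)
    errors = sum(1 for status, _ in requests if status >= 500)
    # Nearest-rank p95: the smallest sample with at least 95% of samples at or below it.
    rank = (95 * len(durations) + 99) // 100
    return {
        "rate_per_s": len(requests) / window_seconds,
        "error_ratio": errors / len(requests),
        "p95_s": durations[rank - 1],
    }


# Hypothetical 10-second window: 19 fast successes and 1 slow server error.
SAMPLE_REQUESTS = [(200, 0.1)] * 19 + [(500, 2.0)]

if __name__ == "__main__":
    print(red_summary(SAMPLE_REQUESTS, window_seconds=10))
```

An alert built on error_ratio fires when users are actually seeing failures, which is the property that separates it from a CPU or memory threshold.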
Treating Uptime as the Only Goal
What it looks like
You focus entirely on keeping the service 'up' and measure yourself by whether it went down. You miss three incidents where the service was technically running but returning errors for 20% of requests for 45 minutes each.
How to get through it
Define SLOs before you think about uptime. Google's SRE materials are explicit: reliability means user experience, not server heartbeats. A service returning 20% errors is not reliable even if every process is running. Measure what users experience, not what servers report.
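The scenario above (three 45-minute periods with 20% of requests failing) can be worked through numerically. The sketch assumes constant traffic across a 30-day window and a 99.9% availability target; both are assumptions added for this example, as are the names used.

```python
WINDOW_MINUTES = 30 * 24 * 60   # 30-day window
SLO_TARGET = 0.999              # assumed target, not taken from the text

# (duration in minutes, fraction of requests failing) for each degraded period
INCIDENTS = [(45, 0.20), (45, 0.20), (45, 0.20)]


def availability(incidents, window_minutes):
    """Request-based availability, assuming constant traffic over the window."""
    bad_minutes = sum(minutes * fraction for minutes, fraction in incidents)
    return 1.0 - bad_minutes / window_minutes


def budget_consumed(incidents, window_minutes, slo_target):
    """Fraction of the error budget spent by the listed incidents."""
    bad_minutes = sum(minutes * fraction for minutes, fraction in incidents)
    return bad_minutes / ((1.0 - slo_target) * window_minutes)


if __name__ == "__main__":
    print(f"availability: {availability(INCIDENTS, WINDOW_MINUTES):.4%}")
    print(f"budget used:  {budget_consumed(INCIDENTS, WINDOW_MINUTES, SLO_TARGET):.1%}")
```

A process-level uptime check would report 100% for this month because nothing ever went down. Measured by requests, the service sits at roughly 99.94% and has spent about 62.5% of its error budget without a single "outage".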
The Same Incident Keeps Happening
What it looks like
You've fixed the same database connection exhaustion problem three times in six weeks. Each time you resolve the immediate incident, but the service is back to its original configuration within days.
How to get through it
Write postmortems with concrete action items that have owners and deadlines — not vague observations. 'We should improve connection pooling' is not an action item. 'PR #247 adds a connection pool limit of 50 by Month 1, owned by Alex' is. If the same incident recurs, it means the postmortem action items weren't completed or weren't effective — go back and ask why.
Kubernetes Is Confusing Before It Clicks
What it looks like
You read the Kubernetes documentation and still can't form a clear mental model of why your service isn't routing traffic, why a pod keeps restarting, or what the difference between a liveness and readiness probe actually means in practice.
How to get through it
Run kubectl get events --sort-by=.lastTimestamp and kubectl describe pod <pod-name> on every problem before changing anything. The Kubernetes event log and pod description contain plain-text explanations of almost every common failure. The concepts click after you've used them on real workloads, not from reading the documentation alone.
Fear of Chaos Engineering
What it looks like
The idea of deliberately breaking a service feels reckless. You understand the theory but avoid running actual chaos experiments because you're worried about causing real damage.
How to get through it
Start in a local environment with Minikube or Kind — no cloud cost, no real traffic, no stakeholder impact. Chaos Mesh works perfectly on a local cluster. Run experiments on services you built yourself. The goal is to find out whether your resilience mechanisms work before a real incident does. The first experiment that reveals a missing PodDisruptionBudget is one of the most valuable learning moments in this roadmap.
Imposter Syndrome in a Field That Reads Complex
What it looks like
You see SRE job descriptions listing Kubernetes, Prometheus, OpenTelemetry, Terraform, Chaos Mesh, SLOs, and Go, and assume you need to master all of them before applying to anything.
How to get through it
Junior SRE roles don't expect expertise across the full stack. They expect demonstrable competence in observability, incident response, and SLO thinking — plus genuine curiosity about systems and reliability. One well-built portfolio project with a real SLO, real dashboards, and two or three real postmortems is more persuasive than a list of tools you've heard of.
Can't Get the First SRE Role
What it looks like
SRE job postings require two to four years of 'production experience with SLOs.' You haven't had a production system to manage, so you feel like the entry point doesn't exist.
How to get through it
Create your own production environment — deploy a real service on a cloud free tier, instrument it, define SLOs, configure alerting, run chaos experiments, and write postmortems. DevOps, infrastructure, and on-call support roles are legitimate entry points that provide the production exposure SRE roles later require. Internships at companies with serious production infrastructure are worth taking even at lower compensation — the learning compounds quickly.
You're ready for a junior SRE role when you can:
Navigate a Linux server entirely from the terminal, investigate a failing service through its logs and process state, and fix a simple configuration problem without instructions.
Build a Prometheus and Grafana observability stack from scratch, instrument a service with RED method metrics, and write a useful alert rule that fires on a real condition and resolves automatically.
Add OpenTelemetry instrumentation to a service, view distributed traces in Jaeger, and use trace data to identify which operation is responsible for high latency.
Write a formal SLO document for a service — SLI definition, target, measurement method, and error budget policy — and implement it as a Prometheus recording rule with a Grafana dashboard.
Work through a simulated incident with a clear process — detection, triage, mitigation, recovery — and write a blameless postmortem with a timeline and three concrete action items.
Deploy a service to Kubernetes with correctly configured probes, resource limits, a HorizontalPodAutoscaler, and a PodDisruptionBudget — and explain what failure each one prevents.
Design and run a chaos experiment, observe its effect on SLO burn rate in Grafana, and document the finding in a gameday report with a conclusion about system resilience.
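The SLO burn rate mentioned in the last item is simply the observed error ratio divided by the error ratio the SLO allows. The sketch below shows that calculation and a two-window paging check. The function names are assumptions for this example; the default threshold of 14.4 follows the one-hour example in the Site Reliability Workbook's chapter on alerting on SLOs and should be tuned for your own service.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than 'exactly on budget' the budget is being spent."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return error_ratio / (1.0 - slo_target)


def should_page(short_window_ratio, long_window_ratio, slo_target, threshold=14.4):
    """Page only when both the short and the long window exceed the threshold.

    The long window shows the burn is significant; the short window shows it
    is still happening, so the alert resolves soon after the problem stops.
    """
    return (
        burn_rate(short_window_ratio, slo_target) > threshold
        and burn_rate(long_window_ratio, slo_target) > threshold
    )


if __name__ == "__main__":
    # Hypothetical chaos experiment: 2% of requests fail against a 99.9% SLO.
    print(round(burn_rate(0.02, 0.999), 1))
    print(should_page(0.02, 0.02, 0.999))
```

A burn rate of 1 means the budget would last exactly the length of the SLO window. In the hypothetical experiment above the rate is about 20, so a gameday report can state that the failure, if sustained, would exhaust a 30-day budget in roughly a day and a half.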
A good junior SRE isn't someone who's memorised the Google SRE book. They understand how to measure reliability from the user's perspective, can investigate a production incident methodically, know how to write a postmortem that leads to real improvement, and have the engineering instinct to automate what shouldn't need a human. Fourteen months is a real investment — and the portfolio you finish with proves you can actually do the work.
You now have a clear path forward.
SRE compounds the same way other engineering disciplines do — every incident you investigate teaches you something the next one benefits from, every SLO you define sharpens your ability to measure what actually matters, and every chaos experiment reveals a weakness that would have found you eventually anyway. The roadmap gives you the order. The depth comes from operating real systems under real conditions.
The goal was never to learn every tool in the observability ecosystem. It was to reach a point where you can look at a production system, understand how it behaves under normal and abnormal conditions, define what 'healthy' means in measurable terms, and improve it systematically when it falls short.
Start with Linux, read the system logs on a real server, and keep going from there.
Trusted places to keep learning.
Google SRE Books — sre.google/books
The Site Reliability Engineering book, The Site Reliability Workbook, and Building Secure and Reliable Systems — all free to read online. These are the canonical foundation of the discipline. Start with the SRE book, use the Workbook for practical implementation guidance, and read Building Secure and Reliable Systems for the security-reliability connection.
Google SRE — Art of SLOs Workshop
Google's official SLO workshop — a practical, structured guide to defining SLIs, setting SLO targets, calculating error budgets, and writing error budget policies. The most authoritative free resource for SLO design on this roadmap. Includes templates and worked examples.
OpenTelemetry Documentation
The official OpenTelemetry docs — covering traces, metrics, logs, auto-instrumentation, manual instrumentation, the OTel Collector, and exporters for every major backend. The most important resource for building vendor-neutral observability. Start with the Getting Started guide for your language of choice.
Prometheus Documentation
The official Prometheus docs — covering the data model, PromQL query language, alerting rules, recording rules, and instrumentation client libraries. The most authoritative reference for the metrics layer of the SRE observability stack. The PromQL documentation is worth reading cover to cover.
Chaos Mesh Documentation
The official Chaos Mesh docs — a CNCF cloud-native chaos engineering platform. Covers installation, experiment types (network chaos, pod failures, stress tests, application-level chaos), and workflow design. The starting point for any structured chaos engineering practice in a Kubernetes environment.