Roadmap to a job

DevOps / Platform / Site Reliability Engineer

DevOps, Platform Engineering, and SRE are three views of one job: shipping and operating software reliably, safely, and fast.

7 stages · 29 skills · 73 free resources

Core stack

Docker · Kubernetes · Terraform · Prometheus · Linux

Stage 01
Stage 1, Systems & Scripting Foundations
Become fluent on a Linux command line, automate routine work with scripts, and be able to model how data moves across a network. Everything above this stage assumes these reflexes.
Linux fundamentals (filesystem, permissions, processes, systemd/journald, package managers) · Essential · 3 links
Linux is the operating system underpinning most servers, containers, and CI infrastructure. Its filesystem hierarchy, permission model, and process management form the foundation of system administration. On most distributions systemd manages services and boot targets and journald handles structured logging, while package managers such as apt, dnf, and apk install and maintain software.
Why it matters · Nearly every container, server, and CI runner runs Linux; the shell is where you will spend your days.
article · The Linux Commands Handbook (freeCodeCamp)
docs · The Linux Documentation Project, guides
course · MIT 'Missing Semester', the shell and shell tools
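To make the permission model concrete, here is a minimal Python sketch that decodes a numeric file mode the way ls -l displays it. The helper names (describe_mode, can_others_write) are made up for illustration; the only thing assumed is the standard-library stat module.

```python
import stat


def describe_mode(mode: int) -> str:
    """Render a numeric mode as the ten-character string shown by `ls -l`."""
    return stat.filemode(mode)


def can_others_write(mode: int) -> bool:
    """True if the 'other' class has the write bit set (often worth a second look)."""
    return bool(mode & stat.S_IWOTH)


# S_IFREG marks a regular file; the low nine bits are rwx for user, group, other.
regular_644 = stat.S_IFREG | 0o644
directory_755 = stat.S_IFDIR | 0o755

print(describe_mode(regular_644))      # -rw-r--r--
print(describe_mode(directory_755))    # drwxr-xr-x
print(can_others_write(regular_644))   # False
```

The same functions work on real files by passing os.stat(path).st_mode instead of a hand-built mode.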
Shell scripting (Bash) · Essential · 3 links
Bash is the default interactive shell on most Linux distributions and the most widely used shell scripting language. It is available on nearly every Linux system, although some minimal container images ship only a POSIX sh. Bash automates repetitive tasks through scripts that chain commands, handle control flow, and manipulate files and processes, and it is commonly used for CI steps, server provisioning, log processing, and operational tooling.
Why it matters · The glue for automation, CI steps, and fast ops fixes; the lowest-friction tool in your kit.
docs · Bash Reference Manual (GNU, official)
article · Bash Scripting Tutorial (freeCodeCamp)
docs · Google Shell Style Guide
Networking (TCP/IP, DNS, HTTP/HTTPS, TLS, load balancing, subnets/CIDR) · Essential · 3 links
TCP/IP is the foundational protocol suite for internet and intranet communication, with DNS translating hostnames to addresses and TLS providing encrypted transport for HTTP. Load balancing distributes traffic across multiple servers to improve availability and throughput. CIDR notation is used to define subnets, controlling how IP address ranges are allocated and routed within networks.
Why it matters · A large share of production incidents are network, DNS, or TLS; you cannot debug what you cannot model.
course · Computer Networking: A Top-Down Approach, free companion site (slides, labs, videos)
article · How DNS works (Cloudflare Learning Center)
docs · MDN, HTTP overview
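CIDR arithmetic is easier to reason about with a worked example. The sketch below uses Python's standard ipaddress module to size a /24 block, split it into /26 subnets, and look up which subnet holds a given address. The 10.0.0.0/24 range and the find_subnet helper are illustrative choices, not part of any particular cloud setup.

```python
import ipaddress

# A /24 leaves 8 host bits, so it spans 2**8 = 256 addresses.
vpc = ipaddress.ip_network("10.0.0.0/24")
print(vpc.num_addresses)  # 256

# Borrowing 2 more bits for the prefix yields four /26 subnets of 64 addresses each.
subnets = list(vpc.subnets(new_prefix=26))
for subnet in subnets:
    print(subnet, subnet.num_addresses)


def find_subnet(address: str, candidates):
    """Return the first candidate network that contains `address`, or None."""
    ip = ipaddress.ip_address(address)
    for network in candidates:
        if ip in network:
            return network
    return None


print(find_subnet("10.0.0.70", subnets))  # 10.0.0.64/26
```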
Python for automation · Essential · 2 links
Python is a general-purpose, interpreted programming language widely used for scripting, automation, and tooling in infrastructure contexts. Its standard library and rich ecosystem (including boto3 for AWS, the google-cloud libraries, and the Kubernetes client) make it the preferred language for writing more complex operational scripts, internal CLIs, and lightweight services beyond what Bash handles cleanly.
Why it matters · The default language for tooling, glue, and small APIs once Bash becomes unwieldy.
docs · Python official tutorial
course · Automate the Boring Stuff with Python (3rd ed., full text free under CC BY-NC-SA)
Stage 02
Stage 2, Version Control & Cloud Fundamentals
Track everything in Git, collaborate through pull requests, and stand up real infrastructure on one cloud provider you understand deeply.
Git and a forge (GitHub or GitLab): branching, PRs, merge conflicts, tags · Essential · 3 links
Git is a distributed version control system that tracks changes to source code over time, supporting branching, merging, and tagging. GitHub and GitLab are web-based forges that host Git repositories and add collaboration features such as pull requests, code review, issue tracking, and integrated CI/CD. Together they form the workflow through which code changes are proposed, reviewed, and merged.
Why it matters · Git is the source of truth for application code and, later, for infrastructure itself (GitOps).
docs · Pro Git book (free, official)
course · GitHub Skills (hands-on interactive courses)
project · Learn Git Branching (interactive visual)
One cloud provider deeply (AWS, GCP, or Azure): compute, networking/VPC, IAM, object storage, managed databases · Essential · 3 links
AWS, GCP, and Azure are the three dominant public cloud platforms, each offering compute (virtual machines and serverless), virtual private networking, identity and access management, object storage, and managed databases as foundational services. Deep knowledge of one provider means understanding how its services interact, how resources are billed, and how to architect resilient, secure workloads within its specific abstractions and console or CLI tooling.
Why it matters · Core concepts transfer across clouds, and employers want demonstrable depth in at least one.
course · AWS Cloud Practitioner Essentials (free, AWS Skill Builder)
docs · Google Cloud free tier and documentation
course · Microsoft Learn, Azure fundamentals
Cloud IAM and the least-privilege model · Essential · 2 links
Cloud Identity and Access Management (IAM) systems control which principals (users, service accounts, roles) can perform which actions on which resources within a cloud environment. The least-privilege model is a security principle dictating that each identity is granted only the minimum permissions required to perform its function. Properly scoped IAM policies reduce the blast radius of compromised credentials or misconfigured services.
Why it matters · Misconfigured permissions are a leading cause of cloud breaches; security awareness starts at IAM, not at the end of the pipeline.
docs · AWS IAM documentation
docs · Google Cloud IAM overview
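The least-privilege idea can be sketched as a toy policy evaluator: nothing is allowed unless a statement grants it, and an explicit deny always wins. This is a simplified model in the spirit of cloud IAM evaluation, not any provider's real engine. The statement schema (effect, actions, resources) and the action and resource names are hypothetical.

```python
from fnmatch import fnmatchcase


def is_allowed(statements, action: str, resource: str) -> bool:
    """Default deny; an explicit deny overrides any matching allow."""
    allowed = False
    for statement in statements:
        matches = any(fnmatchcase(action, p) for p in statement["actions"]) and any(
            fnmatchcase(resource, p) for p in statement["resources"]
        )
        if not matches:
            continue
        if statement["effect"] == "deny":
            return False
        allowed = True
    return allowed


# Hypothetical policy: read-only access to one prefix, with one object carved out.
report_reader = [
    {"effect": "allow", "actions": ["storage:GetObject"], "resources": ["reports/*"]},
    {"effect": "deny", "actions": ["storage:*"], "resources": ["reports/salaries.csv"]},
]

print(is_allowed(report_reader, "storage:GetObject", "reports/q1.csv"))        # True
print(is_allowed(report_reader, "storage:PutObject", "reports/q1.csv"))        # False
print(is_allowed(report_reader, "storage:GetObject", "reports/salaries.csv"))  # False
```

Scoping the allow statement to a single action and prefix is what keeps the blast radius small if the credential leaks.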
Cloud cost awareness (FinOps basics) · Optional · 2 links
FinOps (Financial Operations) is a practice that brings financial accountability to cloud spending by connecting engineering, finance, and product teams around shared cost visibility. Core concepts include understanding cloud billing models (on-demand, reserved, spot), reading cost explorer dashboards, tagging resources for allocation, and right-sizing compute. Awareness of cost drivers helps teams make architectural decisions that balance performance with expenditure.
Why it matters · Budget ownership is increasingly part of platform roles; cost-aware infra design is a real differentiator in 2026.
docs · FinOps Foundation, FinOps Framework
docs · AWS Well-Architected, Cost Optimization pillar
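The trade-off between on-demand and reserved pricing comes down to simple arithmetic, sketched below. The hourly rates are invented for illustration and do not reflect any provider's price list. The model also assumes a reservation is billed for every hour of the month whether or not the instance runs.

```python
HOURS_PER_MONTH = 730  # common approximation: 8760 hours per year / 12


def on_demand_cost(rate_per_hour: float, hours_used: float) -> float:
    """On-demand capacity is billed only for the hours actually used."""
    return rate_per_hour * hours_used


def reserved_cost(rate_per_hour: float) -> float:
    """Assumption: a reservation is billed for the whole month regardless of use."""
    return rate_per_hour * HOURS_PER_MONTH


def break_even_utilization(on_demand_rate: float, reserved_rate: float) -> float:
    """Fraction of the month above which the reservation becomes cheaper."""
    return reserved_rate / on_demand_rate


def cheaper_option(on_demand_rate: float, reserved_rate: float, hours_used: float) -> str:
    if reserved_cost(reserved_rate) < on_demand_cost(on_demand_rate, hours_used):
        return "reserved"
    return "on-demand"


# Hypothetical rates: 0.10 per hour on demand, 0.06 per hour effective when reserved.
print(break_even_utilization(0.10, 0.06))  # roughly 0.6, i.e. about 60% of the month
print(cheaper_option(0.10, 0.06, hours_used=300))  # on-demand
print(cheaper_option(0.10, 0.06, hours_used=600))  # reserved
```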
Build it · Jot: A Terminal Notes CLI · beginner · 8-12 hours
Stage 03
Stage 3, Containers & CI/CD (Shipping Code)
Package any application into a container and build an automated pipeline that tests and ships it on every commit. This is the core DevOps loop.
Docker and containerization (images, layers, Dockerfiles, multi-stage builds, registries, compose; the cgroups/namespaces underneath) · Essential · 3 links
Docker is a platform for building and running containers: ordinary Linux processes packaged with their dependencies, isolated from one another by namespaces and resource-limited by cgroups. A Dockerfile defines the build instructions for an image, and multi-stage builds reduce final image size by separating the build environment from the runtime environment. Container registries store and distribute images, while Docker Compose runs multi-container applications on a single host.
Why it matters · Containers are the universal unit of deployment, and knowing the Linux primitives behind them makes Kubernetes far less magical.
docs · Docker official Get Started guide
course · Docker Curriculum (hands-on, free)
project · Play with Docker (free browser lab)
CI/CD pipelines (GitHub Actions or GitLab CI) · Essential · 3 links
Continuous integration and continuous delivery (CI/CD) pipelines automate the process of building, testing, and deploying application code on every change. GitHub Actions and GitLab CI define these workflows as YAML files stored alongside source code, triggering jobs on events such as pull requests or tag pushes. A well-designed pipeline provides fast feedback on failures and allows code to move safely from commit to production.
Why it matters · Automated build/test/deploy on every change is the heart of the role; manual deploys are a red flag.
docs · GitHub Actions, official docs
docs · GitLab CI/CD, official docs
article · freeCodeCamp, CI/CD explained
Artifact and image registries plus image scanning · Recommended · 2 links
Artifact registries (such as AWS ECR, GCP Artifact Registry, GitHub Packages, or JFrog Artifactory) store versioned build outputs including container images, packages, and binaries. Image scanning tools analyze container images for known CVEs in OS packages and application dependencies before deployment. Combining a registry with scanning creates a controlled gate that ensures only vetted artifacts reach production environments.
Why it matters · You need a place to store, version, and vet build outputs before they reach production.
docs · OCI image-spec (the standard behind registries)
docs · Trivy, vulnerability and config scanner (official site)
Build it · minigrep: A Search Tool from Scratch · beginner · 8-12 hours
Stage 04
Stage 4, Orchestration & Infrastructure as Code
Run containers at scale on Kubernetes and define all infrastructure declaratively in code. This is roughly the line between junior and mid-level.
Kubernetes (pods, deployments, services, ingress, configmaps/secrets, namespaces, RBAC) · Essential · 3 links
Kubernetes is an open-source container orchestration platform that automates deployment, scaling, and management of containerized workloads across clusters of nodes. Core API objects include pods (the smallest deployable unit), deployments (declarative replica management), services (stable network endpoints), and ingresses (HTTP routing rules, fulfilled by an ingress controller). Namespaces provide logical isolation, while RBAC controls which principals can interact with which resources.
Why it matters · The de facto orchestration platform; CKA-level fluency appears on most mid and senior postings.
docs · Kubernetes official tutorials (including Basics)
project · Killercoda, free interactive Kubernetes (Killer Shell CKA) scenarios
project · Kubernetes the Hard Way (Kelsey Hightower, free)
Helm (packaging and templating Kubernetes manifests) · Recommended · 2 links
Helm is the package manager for Kubernetes, allowing teams to define, install, and upgrade applications as versioned chart packages. A Helm chart bundles Kubernetes manifests with a templating engine (Go templates) and a values file, enabling a single chart to be deployed with different configurations across environments. Helm also tracks release history, enabling rollbacks to previous release revisions.
Why it matters · The standard way to package, version, and parameterize Kubernetes deployments.
docs · Helm official docs
docs · Helm Quickstart Guide
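The chart idea, one template plus per-environment values, can be illustrated by analogy in a few lines of Python. Helm itself uses Go templates; the standard-library string.Template stands in here purely to show the pattern. The manifest is a trimmed fragment rather than a complete Deployment, and the names (DEFAULT_VALUES, render, web) are hypothetical.

```python
from string import Template

# Trimmed manifest fragment with placeholders, playing the role of a chart template.
MANIFEST_TEMPLATE = Template(
    "apiVersion: apps/v1\n"
    "kind: Deployment\n"
    "metadata:\n"
    "  name: $name\n"
    "spec:\n"
    "  replicas: $replicas\n"
)

# Defaults play the role of a chart's values file.
DEFAULT_VALUES = {"name": "web", "replicas": 1}


def render(overrides=None) -> str:
    """Merge per-environment overrides onto the defaults and fill in the template."""
    values = {**DEFAULT_VALUES, **(overrides or {})}
    return MANIFEST_TEMPLATE.substitute(values)


print(render())                 # development: defaults only
print(render({"replicas": 3}))  # production: same template, different values
```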
Infrastructure as Code: Terraform or OpenTofu · Essential · 3 links
Terraform is a declarative Infrastructure as Code tool that provisions and manages cloud and on-premises resources through provider plugins by comparing desired state (HCL configuration files) against current state stored in a state file. OpenTofu is the open-source, community-governed fork of Terraform created after HashiCorp relicensed Terraform under the Business Source License in 2023. Both tools support plan-and-apply workflows, enabling reproducible and version-controlled infrastructure changes.
Why it matters · Declarative, version-controlled, repeatable infrastructure is non-negotiable in 2026 hiring; OpenTofu is the open-source fork after Terraform moved to the BSL.
docs · HashiCorp Terraform tutorials (official, free)
docs · OpenTofu docs (open-source Terraform fork)
video · freeCodeCamp, Terraform Associate certification course (7h video)
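The plan step, comparing desired configuration with recorded state, can be sketched as a dictionary diff. This is a deliberately reduced model: real tools also refresh state from the provider and order changes by dependency graph. The resource addresses and attributes below are invented for illustration.

```python
def plan(desired: dict, state: dict) -> dict:
    """Compare desired resources with recorded state and list the changes needed."""
    return {
        "create": sorted(name for name in desired if name not in state),
        "update": sorted(
            name for name in desired if name in state and desired[name] != state[name]
        ),
        "delete": sorted(name for name in state if name not in desired),
    }


# Hypothetical resources keyed by address, with their attributes as plain dicts.
desired = {
    "bucket.logs": {"versioning": True},
    "vm.web": {"size": "medium"},
}
state = {
    "vm.web": {"size": "small"},
    "vm.old": {"size": "small"},
}

print(plan(desired, state))
# {'create': ['bucket.logs'], 'update': ['vm.web'], 'delete': ['vm.old']}
```

Running plan again after the state matches the desired configuration yields three empty lists, which is the property that makes the workflow repeatable.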
Configuration management (Ansible) · Recommended · 2 links
Ansible is an agentless configuration management and automation tool that uses SSH to apply declarative YAML playbooks to remote hosts. It is commonly used to provision virtual machines, enforce configuration consistency across fleets, and coordinate multi-step deployment procedures. Ansible inventories describe target hosts, roles organize reusable task collections, and modules abstract individual operations such as package installation or file templating.
Why it matters · Still the go-to for provisioning VMs and taming config drift; complements rather than replaces Terraform.
docs · Ansible Getting Started (official)
docs · Ansible for DevOps, free source manuscript (CC BY-SA, by Jeff Geerling)
Service mesh (Istio or Linkerd) · Optional · 2 links
A service mesh is an infrastructure layer for managing service-to-service communication in microservice architectures, typically implemented as sidecar proxies (Envoy in Istio, linkerd-proxy in Linkerd) injected alongside each workload. It provides mutual TLS for encryption and authentication between services, traffic management features (retries, timeouts, canary splits), and L7 telemetry without requiring application code changes. Istio offers broad feature coverage, while Linkerd prioritizes operational simplicity and lower resource overhead.
Why it matters · Common in large microservice estates for mTLS, traffic shaping, and L7 telemetry, but no longer a junior/mid prerequisite; learn it when a real system needs it.
docs · Linkerd docs (getting started)
docs · Istio docs (concepts)
Build it · Local LLM Workstation · intermediate · 4-8 hours
Checkpoint
Don't wait, start applying
You don't have to finish the path to begin. Early applications and interviews show you exactly what to learn next.
Stage 05
Stage 5, Observability & GitOps Delivery
Make systems answer 'is it healthy?' and 'what broke?', and adopt Git as the single source of truth for deployment (GitOps), the 2026 default delivery model.
Metrics, logs, and traces with Prometheus, Grafana, and Loki · Essential · 3 links
Prometheus is an open-source metrics collection and alerting system that scrapes time-series data from instrumented services using a pull model and a query language called PromQL. Grafana is a visualization platform that builds dashboards from Prometheus metrics and other data sources. Loki is Grafana Labs' log aggregation system, designed to store and query logs with minimal indexing by using the same label model as Prometheus.
Why it matters · You cannot operate or be on-call for what you cannot see; instrumentation and dashboards are core SRE work.
docs · Prometheus official docs
docs · Grafana, fundamentals / get started (official)
docs · Grafana Loki docs (log aggregation)
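Most metrics work starts from counters that only ever increase, so the first calculation to understand is a per-second rate over scraped samples. The sketch below is a simplified version of that idea, including handling for a counter that resets when a process restarts. PromQL's rate() additionally extrapolates to the window boundaries, which is omitted here. The function name and sample data are illustrative.

```python
def simple_rate(samples):
    """Per-second increase of a counter.

    `samples` is a list of (timestamp_seconds, counter_value), oldest first.
    Returns None when there are too few samples or no elapsed time.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        if current >= previous:
            increase += current - previous
        else:
            # The counter dropped, so assume it restarted from zero.
            increase += current
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    return increase / elapsed


# Three scrapes 15 seconds apart: 60 requests in 30 seconds.
print(simple_rate([(0, 100), (15, 130), (30, 160)]))  # 2.0
```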
OpenTelemetry (vendor-neutral instrumentation standard) · Essential · 2 links
OpenTelemetry (OTel) is a CNCF project that provides a unified set of APIs, SDKs, and a collector for generating, collecting, and exporting telemetry data (traces, metrics, and logs) from applications and infrastructure. It defines a vendor-neutral wire protocol (OTLP) and semantic conventions, allowing instrumentation to be written once and exported to any compatible backend such as Grafana, Jaeger, Honeycomb, or Datadog. By 2026, OTel has become the de facto standard for cloud-native observability instrumentation.
Why it matters · The 2026 convergence point: every major observability vendor supports OTel natively, so instrumenting once avoids lock-in.
docs · OpenTelemetry official docs
docs · OpenTelemetry Collector docs
GitOps with Argo CD or Flux (plus progressive delivery / canaries) · Essential · 3 links
GitOps is an operational model in which a Git repository serves as the single source of truth for the desired state of Kubernetes workloads, with a controller continuously reconciling the cluster to match that state. Argo CD and Flux are the two leading GitOps controllers, each watching repositories and applying changes automatically or on approval. Progressive delivery extends GitOps with traffic-splitting strategies such as canary and blue-green deployments, allowing gradual rollout with automated promotion or rollback based on metrics.
Why it matters · Now the primary Kubernetes delivery mechanism across much of the industry; Git becomes the deploy source of truth and audit log.
docs · Argo CD official docs
docs · Flux official docs
docs · OpenGitOps, principles (CNCF GitOps WG)
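The promote-or-roll-back decision behind a canary rollout can be sketched as a small function. This only illustrates the control logic: the traffic steps and the 1% error threshold are arbitrary values chosen for the example, not defaults of any real tool, and next_canary_weight is a hypothetical name.

```python
def next_canary_weight(current_weight, canary_error_rate,
                       max_error_rate=0.01, steps=(10, 25, 50, 100)):
    """Return the next traffic percentage for the canary, or 0 to roll back."""
    if canary_error_rate > max_error_rate:
        return 0  # metrics breached the threshold: send all traffic to the stable version
    for step in steps:
        if step > current_weight:
            return step  # healthy: promote to the next step
    return 100  # already fully promoted


print(next_canary_weight(0, 0.000))    # 10  (start the rollout)
print(next_canary_weight(10, 0.002))   # 25  (healthy, promote)
print(next_canary_weight(25, 0.050))   # 0   (5% errors, roll back)
```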
eBPF-based networking and observability (Cilium) · Optional · 2 links
eBPF (extended Berkeley Packet Filter) is a Linux kernel technology that allows sandboxed programs to run in kernel space without modifying kernel source code, enabling high-performance networking, security enforcement, and observability with low overhead. Cilium is a CNCF networking and security project that uses eBPF to implement Kubernetes networking (CNI), network policies, service mesh capabilities, and rich L7 telemetry. Cilium's Hubble component provides real-time visibility into network flows without requiring application-level instrumentation.
Why it matters · A fast-growing 2026 approach to kernel-level networking, security, and telemetry; specialized today but increasingly visible on platform teams.
docs · Cilium docs (eBPF networking, observability, security)
article · What is eBPF? (official eBPF.io introduction)
Build it · GitOps Deploy Pipeline · intermediate · 8-14 hours
Stage 06
Stage 6, Security / DevSecOps (Shift-Left & Supply Chain)
Bake security into the pipeline and harden the software supply chain, increasingly driven by regulation (SBOMs, signing, policy-as-code), not just good hygiene.
Pipeline and dependency scanning plus secrets management (Trivy, Vault) · Essential · 3 links
Trivy is an open-source vulnerability and misconfiguration scanner that checks container images, filesystems, Git repositories, and Kubernetes manifests for known CVEs, exposed secrets, and insecure configurations as part of CI pipelines. HashiCorp Vault is a secrets management platform that centrally stores, leases, and rotates sensitive values (API keys, database credentials, TLS certificates) and provides short-lived dynamic secrets to applications and pipelines. Together, scanning and secrets management prevent vulnerable code and exposed credentials from reaching production.
Why it matters · Vulnerabilities and leaked secrets must be caught in CI, before they ever reach production.
docs · Trivy official site (vuln / IaC / secret scanning)
docs · OWASP DevSecOps Guideline
docs · HashiCorp Vault tutorials (secrets management)
Software supply-chain security: SBOMs and artifact signing (Syft, Sigstore/cosign, SLSA) · Recommended · 3 links
A Software Bill of Materials (SBOM) is a structured inventory of all components, libraries, and dependencies in a software artifact, enabling consumers to identify known vulnerabilities in what they are running. Syft is an open-source SBOM generator that produces CycloneDX or SPDX documents from container images and filesystems. Sigstore's cosign tool enables keyless cryptographic signing and verification of container images, and the SLSA (Supply-chain Levels for Software Artifacts) framework defines graduated levels of build integrity and provenance requirements to guard against tampering.
Why it matters · SBOMs and build provenance have moved from optional to regulatory expectation (e.g., US EO 14028, EU CRA) heading into 2026.
docs · Sigstore docs (cosign artifact signing)
docs · SLSA supply-chain framework
docs · Syft, generate SBOMs (Anchore, open source)
Policy as code (Open Policy Agent / Kyverno) · Recommended · 2 links
Policy as code is the practice of expressing compliance, security, and operational rules as machine-readable policies stored and versioned alongside infrastructure code. Open Policy Agent (OPA) is a general-purpose policy engine using the Rego language, often integrated as a Kubernetes admission controller or in CI pipelines to evaluate requests against defined rules. Kyverno is a Kubernetes-native policy engine that uses YAML policies to validate, mutate, and generate resources directly within the cluster without requiring a separate policy language.
Why it matters · Codified, automated guardrails enforce compliance and least-privilege at deploy time instead of relying on review by hand.
docs · Open Policy Agent docs
docs · Kyverno docs (Kubernetes-native policy)
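The shape of an admission-style policy check can be sketched in plain Python. Real policies would be written in Rego for OPA or as YAML for Kyverno; this sketch only mirrors the validate step. The two rules (pinned image, memory limit required) and the function names are illustrative assumptions, and the pod is a pared-down dictionary rather than a full manifest.

```python
def image_is_pinned(image: str) -> bool:
    """True if the image reference has a digest or a tag other than 'latest'."""
    last_segment = image.rsplit("/", 1)[-1]  # ignore any registry host:port prefix
    if "@" in last_segment:                  # digest reference, e.g. app@sha256:...
        return True
    if ":" not in last_segment:              # no tag: defaults to 'latest'
        return False
    return last_segment.rsplit(":", 1)[1] != "latest"


def validate_pod(pod: dict) -> list[str]:
    """Return a list of violations; an empty list means the pod is admitted."""
    violations = []
    for container in pod.get("spec", {}).get("containers", []):
        name = container.get("name", "<unnamed>")
        if not image_is_pinned(container.get("image", "")):
            violations.append(f"{name}: image must be pinned to a tag or digest")
        limits = container.get("resources", {}).get("limits", {})
        if "memory" not in limits:
            violations.append(f"{name}: memory limit is required")
    return violations


pod = {"spec": {"containers": [{"name": "web", "image": "nginx:latest"}]}}
for violation in validate_pod(pod):
    print(violation)
```

Because the rules are ordinary code in version control, a change to a guardrail goes through the same review as any other change.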
Build it · Distributed Rate Limiter & API Gateway · intermediate · 10-16 hours
Stage 07
Stage 7, SRE Practice & Platform Engineering (Job-Ready / Senior Track)
Adopt the operating discipline that defines senior roles: reliability as a measurable engineering target (SLOs/error budgets), real incident response, and building self-service internal platforms so other teams ship safely.
SRE fundamentals: SLIs, SLOs, error budgets, toil reduction · Essential · 3 links
Site Reliability Engineering (SRE) is a discipline that applies software engineering practices to operations, developed at Google and now widely adopted. Service Level Indicators (SLIs) are specific metrics measuring service behavior (such as request latency or error rate), Service Level Objectives (SLOs) are the target thresholds for those metrics, and error budgets define the allowable failure margin within a period. Toil reduction is the ongoing effort to identify and automate repetitive, manual operational work that scales with system size.
Why it matters · Turns 'reliability' into measurable, negotiable engineering, the language of SRE interviews and roadmaps.
docs · Google SRE Book (free, full text)
docs · Google SRE Workbook, Implementing SLOs (free)
article · Google SRE, Embracing Risk (error budgets)
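Error budgets are plain arithmetic: the budget is whatever the SLO leaves over, that is, one minus the target. The sketch below works it out for a time-based and a request-based SLO. The function names, the 30-day window, and the request counts are illustrative.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full downtime the SLO tolerates over the window."""
    return (1 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the request-based error budget still unspent (negative if overspent)."""
    allowed_failures = (1 - slo) * total_requests
    if allowed_failures == 0:
        return 0.0  # a 100% target leaves no budget at all
    return 1 - failed_requests / allowed_failures


# A 99.9% availability SLO over 30 days allows about 43.2 minutes of downtime.
print(round(error_budget_minutes(0.999), 1))

# 1,000,000 requests at 99.9% allows about 1,000 failures; 250 failures leaves about 75%.
print(round(budget_remaining(0.999, 1_000_000, 250), 2))
```

A team that has spent most of its budget early in the window has a concrete, numeric reason to slow releases and prioritize reliability work.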
Incident response, on-call, blameless postmortems, alerting · Essential · 3 links
Incident response is the structured process of detecting, triaging, mitigating, and resolving production outages or degradations, typically coordinated by an on-call engineer following a runbook. Alerting systems (Alertmanager, PagerDuty, or similar) route threshold-based or anomaly-based notifications to the appropriate responders. Blameless postmortems are written analyses produced after an incident that focus on systemic causes and preventive actions rather than individual fault, creating a shared learning artifact.
Why it matters · Production ownership is the core SRE/DevOps responsibility; runbooks and postmortems are everyday artifacts.
docs · Google SRE, Managing Incidents (free)
docs · PagerDuty Incident Response (open, free guide)
docs · Prometheus Alertmanager docs
Platform Engineering and Internal Developer Platforms (Backstage) · Recommended · 2 links
Platform Engineering is the discipline of designing and maintaining an Internal Developer Platform (IDP), a self-service layer that abstracts infrastructure complexity and provides standardized paths for developers to build, deploy, and operate services. Backstage is an open-source CNCF project originally developed by Spotify that serves as a framework for building IDPs, offering a software catalog, templated scaffolding, plugin-based integrations with CI/CD and cloud providers, and a unified developer portal. IDPs reduce cognitive load for application teams by encoding organizational best practices into reusable golden paths.
Why it matters · The dominant senior trajectory in 2026 (Gartner projects that 80% of large software engineering organizations will have platform teams by 2026, up from 45% in 2022); self-service IDPs are becoming the norm.
docs · Backstage official docs (CNCF IDP framework)
docs · CNCF Platforms white paper (TAG App Delivery)
Go for cloud-native tooling and operators · Optional · 2 links
Go is a statically typed, compiled programming language designed at Google for systems and server-side software, emphasizing simplicity, fast compilation, and built-in concurrency via goroutines and channels. It is the implementation language of much of the cloud-native ecosystem, including Kubernetes, Docker, Prometheus, Terraform, and many CNCF projects. Kubernetes controllers and operators written in Go typically build on the client-go and controller-runtime libraries to watch cluster resources and reconcile desired state.
Why it matters · The language Kubernetes and most CNCF tools are written in; needed to write controllers/operators and high-performance internal tooling.
course · A Tour of Go (official interactive)
article · Go by Example
AI-assisted operations and toil reduction · Optional · 2 links
AI-assisted operations refers to the integration of large language model tools and AI coding assistants into infrastructure workflows to accelerate repetitive tasks such as drafting Terraform modules, writing runbooks, generating Kubernetes manifests, or summarizing incident timelines. These tools function as accelerators for experienced practitioners rather than autonomous agents, requiring the engineer to review, test, and own all generated output. Effective use involves knowing which tasks are well-suited to AI generation and maintaining critical evaluation of results before applying them to production systems.
Why it matters · By 2026 teams routinely use AI to draft IaC, summarize incidents, and write runbooks; the skill is using it to reduce toil while still owning what ships, not outsourcing judgment.
article · Google SRE, Eliminating Toil (free, the discipline AI should serve)
docs · OWASP Top 10 for LLM Applications (risks when wiring AI into infra)
Capstone: build an end-to-end production-style platform · Recommended · 2 links
A capstone platform project integrates the full DevOps and SRE skill chain into a single cohesive system: infrastructure provisioned with Terraform or OpenTofu, applications containerized with Docker, deployed to Kubernetes via a GitOps controller (Argo CD or Flux), with CI/CD pipelines, centralized observability (Prometheus, Grafana, Loki), and defined SLOs. Building and operating such a project end-to-end demonstrates practical understanding of how each layer interacts and serves as a concrete portfolio artifact showing applied competency across the stack.
Why it matters · A running portfolio project (IaC to containers to Kubernetes to GitOps to observability to SLOs) demonstrates the whole chain and beats certifications in interviews.
project · DevOps Roadmap (open-source skill map, reference only)
docs · CNCF Cloud Native Landscape (tool reference)
Build it · Supply-Chain Security Scanner · advanced · 20-35 hours