Roadmap to a job

DevOps / Platform / Site Reliability Engineer

DevOps, Platform Engineering, and SRE are three views of one job: shipping and operating software reliably, safely, and fast.

7 stages · 29 skills · 73 free resources

Core stack

Docker · Kubernetes · Terraform · Prometheus · Linux

  1. Stage 01

    Stage 1, Systems & Scripting Foundations

    Become fluent on a Linux command line, automate routine work with scripts, and be able to model how data moves across a network. Everything above this stage assumes these reflexes.

    Linux fundamentals (filesystem, permissions, processes, systemd/journald, package managers) · Essential

    Linux is the operating system underpinning virtually all servers, containers, and CI infrastructure. Its filesystem hierarchy, permission model, and process management form the foundation of system administration. systemd manages services and boot targets, journald handles structured logging, and package managers such as apt, dnf, and apk install and maintain software.

    Why it matters · Every container, server, and CI runner is Linux; the shell is where you will spend your days.
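
    The permission model described above can be inspected from Python's standard library. This is a minimal sketch, assuming a Linux host and a throwaway temporary file; `describe_mode` is a hypothetical helper name used only for illustration.

```python
import os
import stat
import tempfile


def describe_mode(mode: int) -> str:
    """Render mode bits the way `ls -l` does, e.g. '-rw-r-----'."""
    return stat.filemode(mode)


# 0o640 = owner read+write, group read, others nothing.
regular_file_mode = stat.S_IFREG | 0o640
print(describe_mode(regular_file_mode))  # -rw-r-----

# Apply the same bits to a real (temporary) file and read them back.
with tempfile.NamedTemporaryFile() as handle:
    os.chmod(handle.name, 0o640)
    on_disk = os.stat(handle.name).st_mode
    print(describe_mode(on_disk), oct(stat.S_IMODE(on_disk)))
```

    Reading the octal form and the `ls -l` form side by side is a quick way to build the reflex of translating between them.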

    Shell scripting (Bash) · Essential

    Bash is the default shell on most Linux distributions and is available on nearly every Linux system. It enables automation of repetitive tasks through scripts that chain commands, handle control flow, and manipulate files and processes. Bash scripts are commonly used for CI steps, server provisioning, log processing, and operational tooling.

    Why it matters · The glue for automation, CI steps, and fast ops fixes; the lowest-friction tool in your kit.

    Networking (TCP/IP, DNS, HTTP/HTTPS, TLS, load balancing, subnets/CIDR) · Essential

    TCP/IP is the foundational protocol suite for internet and intranet communication, with DNS translating hostnames to addresses and TLS providing encrypted transport for HTTP. Load balancing distributes traffic across multiple servers to improve availability and throughput. CIDR notation is used to define subnets, controlling how IP address ranges are allocated and routed within networks.

    Why it matters · A large share of production incidents are network, DNS, or TLS; you cannot debug what you cannot model.
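
    CIDR arithmetic is easy to check with Python's standard `ipaddress` module. The sketch below uses an illustrative private range (`10.0.0.0/16`) and a hypothetical helper, `subnet_for`; neither comes from a real network.

```python
import ipaddress

# 10.0.0.0/16 is an illustrative private range, not a real deployment.
vpc = ipaddress.ip_network("10.0.0.0/16")

# Split the /16 into /24 subnets and keep the first three.
subnets = list(vpc.subnets(new_prefix=24))[:3]


def subnet_for(address: str, candidates):
    """Return the first candidate network containing the address, or None."""
    ip = ipaddress.ip_address(address)
    return next((net for net in candidates if ip in net), None)


print(vpc.num_addresses)                 # 65536
print([str(s) for s in subnets])         # ['10.0.0.0/24', '10.0.1.0/24', '10.0.2.0/24']
print(subnet_for("10.0.1.37", subnets))  # 10.0.1.0/24
print(subnet_for("10.0.9.1", subnets))   # None
```

    The prefix length fixes the number of addresses: a /16 leaves 16 host bits (65,536 addresses), and each /24 carved from it holds 256.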

    Python for automation · Essential

    Python is a general-purpose, interpreted programming language widely used for scripting, automation, and tooling in infrastructure contexts. Its standard library and rich ecosystem (including boto3 for AWS, the google-cloud libraries, and the Kubernetes client) make it the preferred language for writing more complex operational scripts, internal CLIs, and lightweight services beyond what Bash handles cleanly.

    Why it matters · The default language for tooling, glue, and small APIs once Bash becomes unwieldy.

  2. Stage 02

    Stage 2, Version Control & Cloud Fundamentals

    Track everything in Git, collaborate through pull requests, and stand up real infrastructure on one cloud provider you understand deeply.

    Git and a forge (GitHub or GitLab): branching, PRs, merge conflicts, tags · Essential

    Git is a distributed version control system that tracks changes to source code over time, supporting branching, merging, and tagging. GitHub and GitLab are web-based forges that host Git repositories and add collaboration features such as pull requests, code review, issue tracking, and integrated CI/CD. Together they form the workflow through which code changes are proposed, reviewed, and merged.

    Why it matters · Git is the source of truth for application code and, later, for infrastructure itself (GitOps).

    One cloud provider deeply (AWS, GCP, or Azure): compute, networking/VPC, IAM, object storage, managed databases · Essential

    AWS, GCP, and Azure are the three dominant public cloud platforms, each offering compute (virtual machines and serverless), virtual private networking, identity and access management, object storage, and managed databases as foundational services. Deep knowledge of one provider means understanding how its services interact, how resources are billed, and how to architect resilient, secure workloads within its specific abstractions and console or CLI tooling.

    Why it matters · Core concepts transfer across clouds, and employers want demonstrable depth in at least one.

    Cloud IAM and the least-privilege model · Essential

    Cloud Identity and Access Management (IAM) systems control which principals (users, service accounts, roles) can perform which actions on which resources within a cloud environment. The least-privilege model is a security principle dictating that each identity is granted only the minimum permissions required to perform its function. Properly scoped IAM policies reduce the blast radius of compromised credentials or misconfigured services.

    Why it matters · Misconfigured permissions are a leading cause of cloud breaches; security awareness starts at IAM, not at the end of the pipeline.
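
    The idea of principals, actions, and resources can be modeled in a few lines. This is a deliberately simplified sketch, not any provider's real evaluation algorithm: it assumes glob-style patterns, a default of deny, and that an explicit deny always wins. The `Statement` class, `is_allowed` function, and the action and resource names are all hypothetical.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Statement:
    effect: str    # "Allow" or "Deny"
    action: str    # glob pattern, e.g. "storage:Get*"
    resource: str  # glob pattern, e.g. "bucket/reports/*"


def is_allowed(statements, action: str, resource: str) -> bool:
    """Default deny; an explicit Deny beats any Allow."""
    matching = [
        s for s in statements
        if fnmatchcase(action, s.action) and fnmatchcase(resource, s.resource)
    ]
    if any(s.effect == "Deny" for s in matching):
        return False
    return any(s.effect == "Allow" for s in matching)


# Hypothetical policy for a reporting job: read one prefix, nothing else.
policy = [
    Statement("Allow", "storage:Get*", "bucket/reports/*"),
    Statement("Deny", "storage:*", "bucket/reports/secret/*"),
]

print(is_allowed(policy, "storage:GetObject", "bucket/reports/q1.csv"))    # True
print(is_allowed(policy, "storage:PutObject", "bucket/reports/q1.csv"))    # False
print(is_allowed(policy, "storage:GetObject", "bucket/reports/secret/k"))  # False
```

    Least privilege shows up as the shape of the policy: the job can read what it needs, and every action that is not explicitly allowed falls through to deny.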

    Cloud cost awareness (FinOps basics) · Optional

    FinOps (Financial Operations) is a practice that brings financial accountability to cloud spending by connecting engineering, finance, and product teams around shared cost visibility. Core concepts include understanding cloud billing models (on-demand, reserved, spot), reading cost explorer dashboards, tagging resources for allocation, and right-sizing compute. Awareness of cost drivers helps teams make architectural decisions that balance performance with expenditure.

    Why it matters · Budget ownership is increasingly part of platform roles; cost-aware infra design is a real differentiator in 2026.

  3. Stage 03

    Stage 3, Containers & CI/CD (Shipping Code)

    Package any application into a container and build an automated pipeline that tests and ships it on every commit. This is the core DevOps loop.

    Docker and containerization (images, layers, Dockerfiles, multi-stage builds, registries, compose; the cgroups/namespaces underneath) · Essential

    Docker is a platform for building and running containers, which are lightweight, isolated processes packaged with their dependencies using Linux cgroups and namespaces. A Dockerfile defines the build instructions for an image, with multi-stage builds reducing final image size by separating build and runtime environments. Container registries store and distribute images, while Docker Compose orchestrates multi-container applications on a single host.

    Why it matters · Containers are the universal unit of deployment, and knowing the Linux primitives behind them makes Kubernetes far less magical.

    CI/CD pipelines (GitHub Actions or GitLab CI) · Essential

    Continuous integration and continuous delivery (CI/CD) pipelines automate the process of building, testing, and deploying application code on every change. GitHub Actions and GitLab CI define these workflows as YAML files stored alongside source code, triggering jobs on events such as pull requests or tag pushes. A well-designed pipeline provides fast feedback on failures and allows code to move safely from commit to production.

    Why it matters · Automated build/test/deploy on every change is the heart of the role; manual deploys are a red flag.

    Artifact and image registries plus image scanning · Recommended

    Artifact registries (such as AWS ECR, GCP Artifact Registry, GitHub Packages, or JFrog Artifactory) store versioned build outputs including container images, packages, and binaries. Image scanning tools analyze container images for known CVEs in OS packages and application dependencies before deployment. Combining a registry with scanning creates a controlled gate that ensures only vetted artifacts reach production environments.

    Why it matters · You need a place to store, version, and vet build outputs before they reach production.

  4. Stage 04

    Stage 4, Orchestration & Infrastructure as Code

    Run containers at scale on Kubernetes and define all infrastructure declaratively in code. This is roughly the line between junior and mid-level.

    Kubernetes (pods, deployments, services, ingress, configmaps/secrets, namespaces, RBAC) · Essential

    Kubernetes is an open-source container orchestration platform that automates deployment, scaling, and management of containerized workloads across clusters of nodes. Core API objects include pods (the smallest deployable unit), deployments (declarative replica management), services (stable network endpoints), and ingresses (HTTP routing rules fulfilled by an ingress controller). ConfigMaps and Secrets inject configuration and sensitive values into workloads, namespaces provide logical isolation, and RBAC controls which principals can interact with which resources.

    Why it matters · The de facto orchestration platform; CKA-level fluency appears on most mid and senior postings.

    Helm (packaging and templating Kubernetes manifests) · Recommended

    Helm is the package manager for Kubernetes, allowing teams to define, install, and upgrade applications as versioned chart packages. A Helm chart bundles Kubernetes manifests with a templating engine (Go templates) and a values file, enabling a single chart to be deployed with different configurations across environments. Helm also tracks release history, enabling rollbacks to previous chart versions.

    Why it matters · The standard way to package, version, and parameterize Kubernetes deployments.
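
    The "one chart, many environments" idea comes down to merging default values with per-environment overrides and rendering a template from the result. The sketch below imitates that flow in plain Python; it is not Helm's implementation, `string.Template` merely stands in for Go templates, and the value names are hypothetical.

```python
from string import Template


def merge_values(defaults: dict, overrides: dict) -> dict:
    """Recursively overlay overrides on defaults without mutating either."""
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)
        else:
            merged[key] = value
    return merged


# Hypothetical chart defaults and a production override file.
defaults = {"replicas": 1, "image": {"repo": "example/web", "tag": "1.0.0"}}
production = {"replicas": 3, "image": {"tag": "1.2.0"}}

values = merge_values(defaults, production)
manifest = Template("replicas: $replicas\nimage: $repo:$tag").substitute(
    replicas=values["replicas"],
    repo=values["image"]["repo"],
    tag=values["image"]["tag"],
)
print(manifest)
# replicas: 3
# image: example/web:1.2.0
```

    Only the keys that differ per environment appear in the override, so the defaults stay the single place where the common configuration lives.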

    Infrastructure as Code: Terraform or OpenTofu · Essential

    Terraform is a declarative Infrastructure as Code tool that provisions and manages cloud and on-premises resources through provider plugins. It compares the desired state, written in HCL configuration files, against the current state recorded in a state file, then plans the changes needed to reconcile the two. OpenTofu is the open-source, community-governed fork of Terraform created after HashiCorp relicensed Terraform under the Business Source License in 2023. Both tools support plan-and-apply workflows, enabling reproducible and version-controlled infrastructure changes.

    Why it matters · Declarative, version-controlled, repeatable infrastructure is non-negotiable in 2026 hiring; OpenTofu is the open-source fork after Terraform moved to the BSL.
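
    The plan step can be pictured as a diff between two dictionaries. The sketch below is a toy model under that assumption, not how Terraform or OpenTofu is implemented (the real tools also refresh state from provider APIs and order changes by dependency). The resource names and attributes are hypothetical.

```python
def plan(desired: dict, state: dict) -> dict:
    """Compare desired config with recorded state and list the needed actions."""
    return {
        "create": sorted(desired.keys() - state.keys()),
        "update": sorted(
            name for name in desired.keys() & state.keys()
            if desired[name] != state[name]
        ),
        "delete": sorted(state.keys() - desired.keys()),
    }


# Hypothetical resources: what the configuration asks for...
desired = {
    "bucket.logs": {"versioning": True},
    "vm.web": {"size": "medium"},
}
# ...and what the state file says already exists.
state = {
    "vm.web": {"size": "small"},
    "vm.old": {"size": "small"},
}

print(plan(desired, state))
# {'create': ['bucket.logs'], 'update': ['vm.web'], 'delete': ['vm.old']}
```

    Reviewing this kind of output before anything changes is the point of the plan-and-apply workflow.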

    Configuration management (Ansible) · Recommended

    Ansible is an agentless configuration management and automation tool that uses SSH to apply declarative YAML playbooks to remote hosts. It is commonly used to provision virtual machines, enforce configuration consistency across fleets, and coordinate multi-step deployment procedures. Ansible inventories describe target hosts, roles organize reusable task collections, and modules abstract individual operations such as package installation or file templating.

    Why it matters · Still the go-to for provisioning VMs and taming config drift; complements rather than replaces Terraform.

    Service mesh (Istio or Linkerd) · Optional

    A service mesh is an infrastructure layer for managing service-to-service communication in microservice architectures, typically implemented as sidecar proxies (Envoy in Istio, linkerd-proxy in Linkerd) injected alongside each workload. It provides mutual TLS for encryption and authentication between services, traffic management features (retries, timeouts, canary splits), and L7 telemetry without requiring application code changes. Istio offers broad feature coverage, while Linkerd prioritizes operational simplicity and lower resource overhead.

    Why it matters · Common in large microservice estates for mTLS, traffic shaping, and L7 telemetry, but no longer a junior/mid prerequisite; learn it when a real system needs it.

  5. Checkpoint

    Don't wait, start applying

    You don't have to finish the path to begin. Early applications and interviews show you exactly what to learn next.

  6. Stage 05

    Stage 5, Observability & GitOps Delivery

    Make systems answer 'is it healthy?' and 'what broke?', and adopt Git as the single source of truth for deployment (GitOps), the 2026 default delivery model.

    Metrics, logs, and traces with Prometheus, Grafana, and Loki · Essential

    Prometheus is an open-source metrics collection and alerting system that scrapes time-series data from instrumented services using a pull model and a query language called PromQL. Grafana is a visualization platform that builds dashboards from Prometheus metrics and other data sources. Loki is Grafana Labs' log aggregation system, designed to store and query logs with minimal indexing by using the same label model as Prometheus.

    Why it matters · You cannot operate or be on-call for what you cannot see; instrumentation and dashboards are core SRE work.
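
    Most Prometheus dashboards and alerts start from the per-second rate of a counter. The sketch below shows the core idea on scraped `(timestamp, value)` samples, including the handling of a counter reset after a restart. It is a simplification: PromQL's `rate()` additionally extrapolates to the edges of the query window, which is omitted here. `simple_rate` and the sample data are hypothetical.

```python
def simple_rate(samples):
    """Per-second increase over (timestamp, value) counter samples, reset-aware.

    Expects at least two samples with increasing timestamps.
    """
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter restarted from zero; count the new value in full.
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Four scrapes 15 seconds apart; the process restarted before the third one.
scrapes = [(0, 100), (15, 130), (30, 10), (45, 40)]
print(round(simple_rate(scrapes), 3))  # 1.556
```

    Because counters only go up between restarts, a rate computed this way survives process restarts, which is why counters are generally preferred over gauges for request and error totals.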

    OpenTelemetry (vendor-neutral instrumentation standard) · Essential

    OpenTelemetry (OTel) is a CNCF project that provides a unified set of APIs, SDKs, and a collector for generating, collecting, and exporting telemetry data (traces, metrics, and logs) from applications and infrastructure. It defines a vendor-neutral wire protocol (OTLP) and semantic conventions, allowing instrumentation to be written once and exported to any compatible backend such as Grafana, Jaeger, Honeycomb, or Datadog. By 2026, OTel has become the de facto standard for cloud-native observability instrumentation.

    Why it matters · The 2026 convergence point: every major observability vendor supports OTel natively, so instrumenting once avoids lock-in.

    GitOps with Argo CD or Flux (plus progressive delivery / canaries) · Essential

    GitOps is an operational model in which a Git repository serves as the single source of truth for the desired state of Kubernetes workloads, with a controller continuously reconciling the cluster to match that state. Argo CD and Flux are the two leading GitOps controllers, each watching repositories and applying changes automatically or on approval. Progressive delivery extends GitOps with traffic-splitting strategies such as canary and blue-green deployments, allowing gradual rollout with automated promotion or rollback based on metrics.

    Why it matters · Now the primary Kubernetes delivery mechanism across much of the industry; Git becomes the deploy source of truth and audit log.
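
    Metric-driven promotion or rollback can be sketched as a loop over traffic weights. This is a minimal illustration under stated assumptions, not the logic of any particular controller: the tolerance, the weights, and the `fake_observe` metric source are all hypothetical.

```python
def canary_decision(baseline_error_rate: float,
                    canary_error_rate: float,
                    tolerance: float = 0.005) -> str:
    """Promote if the canary's error rate is within tolerance of the baseline."""
    if canary_error_rate <= baseline_error_rate + tolerance:
        return "promote"
    return "rollback"


def rollout(steps, observe):
    """Walk through traffic weights, stopping at the first failed check."""
    for weight in steps:
        baseline, canary = observe(weight)
        if canary_decision(baseline, canary) == "rollback":
            return ("rollback", weight)
    return ("promoted", steps[-1])


def fake_observe(weight):
    # Pretend the canary degrades once it receives half of the traffic.
    return (0.010, 0.012) if weight < 50 else (0.010, 0.040)


print(rollout([5, 25, 50, 100], fake_observe))  # ('rollback', 50)
```

    The small early steps limit how many users see a bad release before the metrics trip the rollback.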

    eBPF-based networking and observability (Cilium) · Optional

    eBPF (extended Berkeley Packet Filter) is a Linux kernel technology that allows sandboxed programs to run in kernel space without modifying kernel source code, enabling high-performance networking, security enforcement, and observability with low overhead. Cilium is a CNCF networking and security project that uses eBPF to implement Kubernetes networking (CNI), network policies, service mesh capabilities, and rich L7 telemetry. Cilium's Hubble component provides real-time visibility into network flows without requiring application-level instrumentation.

    Why it matters · A fast-growing 2026 approach to kernel-level networking, security, and telemetry; specialized today but increasingly visible on platform teams.

  7. Stage 06

    Stage 6, Security / DevSecOps (Shift-Left & Supply Chain)

    Bake security into the pipeline and harden the software supply chain, increasingly driven by regulation (SBOMs, signing, policy-as-code), not just good hygiene.

    Pipeline and dependency scanning plus secrets management (Trivy, Vault) · Essential

    Trivy is an open-source vulnerability and misconfiguration scanner that checks container images, filesystems, Git repositories, and Kubernetes manifests for known CVEs, exposed secrets, and insecure configurations as part of CI pipelines. HashiCorp Vault is a secrets management platform that centrally stores, leases, and rotates sensitive values (API keys, database credentials, TLS certificates) and provides short-lived dynamic secrets to applications and pipelines. Together, scanning and secrets management prevent vulnerable code and exposed credentials from reaching production.

    Why it matters · Vulnerabilities and leaked secrets must be caught in CI, before they ever reach production.

    Software supply-chain security: SBOMs and artifact signing (Syft, Sigstore/cosign, SLSA) · Recommended

    A Software Bill of Materials (SBOM) is a structured inventory of all components, libraries, and dependencies in a software artifact, enabling consumers to identify known vulnerabilities in what they are running. Syft is an open-source SBOM generator that produces CycloneDX or SPDX documents from container images and filesystems. Sigstore (specifically its cosign tool) enables keyless cryptographic signing and verification of container images. The SLSA (Supply-chain Levels for Software Artifacts) framework defines graduated levels of build-provenance assurance, so consumers can judge how well an artifact is protected against tampering.

    Why it matters · SBOMs and build provenance have moved from optional to regulatory expectation (e.g., US EO 14028, EU CRA) heading into 2026.

    Policy as code (Open Policy Agent / Kyverno) · Recommended

    Policy as code is the practice of expressing compliance, security, and operational rules as machine-readable policies stored and versioned alongside infrastructure code. Open Policy Agent (OPA) is a general-purpose policy engine using the Rego language, often integrated as a Kubernetes admission controller or in CI pipelines to evaluate requests against defined rules. Kyverno is a Kubernetes-native policy engine that uses YAML policies to validate, mutate, and generate resources directly within the cluster without requiring a separate policy language.

    Why it matters · Codified, automated guardrails enforce compliance and least-privilege at deploy time instead of relying on review by hand.
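
    The shape of an admission-style check can be shown without either engine. The sketch below validates a pod-like dictionary against two illustrative rules (pinned image tags, no privileged containers). It is plain Python, not Rego or a Kyverno policy, and the rules, the `violations` function, and the manifest are hypothetical.

```python
def violations(pod: dict) -> list[str]:
    """Return human-readable rule violations for a pod-like manifest."""
    problems = []
    for container in pod.get("spec", {}).get("containers", []):
        name = container.get("name", "<unnamed>")
        image = container.get("image", "")
        # Rule 1: images must carry an explicit tag other than "latest".
        # Only the last path segment is checked, so a registry port is not a tag.
        last_segment = image.rsplit("/", 1)[-1]
        tag = image.rsplit(":", 1)[1] if ":" in last_segment else ""
        if tag in ("", "latest"):
            problems.append(f"{name}: image '{image}' must use a pinned tag")
        # Rule 2: privileged containers are not allowed.
        if container.get("securityContext", {}).get("privileged", False):
            problems.append(f"{name}: privileged mode is not allowed")
    return problems


pod = {
    "spec": {
        "containers": [
            {"name": "web", "image": "nginx:latest"},
            {
                "name": "api",
                "image": "example/api:1.4.2",
                "securityContext": {"privileged": True},
            },
        ]
    }
}

for problem in violations(pod):
    print(problem)
# web: image 'nginx:latest' must use a pinned tag
# api: privileged mode is not allowed
```

    An admission controller or CI job would reject the manifest whenever the returned list is non-empty, which is what turns a written guideline into an enforced guardrail.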

  8. Stage 07

    Stage 7, SRE Practice & Platform Engineering (Job-Ready / Senior Track)

    Adopt the operating discipline that defines senior roles: reliability as a measurable engineering target (SLOs/error budgets), real incident response, and building self-service internal platforms so other teams ship safely.

    SRE fundamentals: SLIs, SLOs, error budgets, toil reduction · Essential

    Site Reliability Engineering (SRE) is a discipline that applies software engineering practices to operations, developed at Google and now widely adopted. Service Level Indicators (SLIs) are specific metrics measuring service behavior (such as request latency or error rate), Service Level Objectives (SLOs) are the target thresholds for those metrics, and error budgets define the allowable failure margin within a period. Toil reduction is the ongoing effort to identify and automate repetitive, manual operational work that scales with system size.

    Why it matters · Turns 'reliability' into measurable, negotiable engineering, the language of SRE interviews and roadmaps.
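
    Error budgets are simple arithmetic once the SLO and the window are fixed. The sketch below assumes a 30-day window and illustrative request counts; the function names are hypothetical, and both functions assume an SLO strictly below 100%.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed minutes of unavailability for an availability SLO over a window."""
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining(slo: float, good: int, total: int) -> float:
    """Fraction of a request-based error budget still unspent (negative if overspent)."""
    allowed_failures = (1.0 - slo) * total
    actual_failures = total - good
    return 1.0 - actual_failures / allowed_failures


# A 99.9% availability SLO over 30 days allows about 43 minutes of downtime.
print(round(error_budget_minutes(0.999), 1))  # 43.2

# 400 failed requests out of 1,000,000 spends 40% of a 99.9% budget.
print(round(budget_remaining(0.999, good=999_600, total=1_000_000), 2))  # 0.6
```

    When the remaining fraction approaches zero, the usual response is to slow feature releases and spend engineering time on reliability until the budget recovers.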

    Incident response, on-call, blameless postmortems, alerting · Essential

    Incident response is the structured process of detecting, triaging, mitigating, and resolving production outages or degradations, typically coordinated by an on-call engineer following a runbook. Alerting systems (Alertmanager, PagerDuty, or similar) route threshold-based or anomaly-based notifications to the appropriate responders. Blameless postmortems are written analyses produced after an incident that focus on systemic causes and preventive actions rather than individual fault, creating a shared learning artifact.

    Why it matters · Production ownership is the core SRE/DevOps responsibility; runbooks and postmortems are everyday artifacts.

    Platform Engineering and Internal Developer Platforms (Backstage) · Recommended

    Platform Engineering is the discipline of designing and maintaining an Internal Developer Platform (IDP), a self-service layer that abstracts infrastructure complexity and provides standardized paths for developers to build, deploy, and operate services. Backstage is an open-source CNCF project originally developed by Spotify that serves as a framework for building IDPs, offering a software catalog, templated scaffolding, plugin-based integrations with CI/CD and cloud providers, and a unified developer portal. IDPs reduce cognitive load for application teams by encoding organizational best practices into reusable golden paths.

    Why it matters · The dominant senior trajectory in 2026 (Gartner projects that 80% of large software engineering organizations will have platform teams by 2026, up from 45% in 2022); self-service IDPs are becoming the norm.

    Go for cloud-native tooling and operators · Optional

    Go is a statically typed, compiled programming language designed at Google for systems and server-side software, emphasizing simplicity, fast compilation, and built-in concurrency via goroutines and channels. It is the primary language of the cloud-native ecosystem, including Kubernetes, Docker, Prometheus, Terraform, and the majority of CNCF projects. Writing Kubernetes controllers and operators in Go uses the controller-runtime and client-go libraries to watch cluster resources and reconcile desired state.

    Why it matters · The language Kubernetes and most CNCF tools are written in; needed to write controllers/operators and high-performance internal tooling.

    AI-assisted operations and toil reduction · Optional

    AI-assisted operations refers to the integration of large language model tools and AI coding assistants into infrastructure workflows to accelerate repetitive tasks such as drafting Terraform modules, writing runbooks, generating Kubernetes manifests, or summarizing incident timelines. These tools function as accelerators for experienced practitioners rather than autonomous agents, requiring the engineer to review, test, and own all generated output. Effective use involves knowing which tasks are well-suited to AI generation and maintaining critical evaluation of results before applying them to production systems.

    Why it matters · By 2026 teams routinely use AI to draft IaC, summarize incidents, and write runbooks; the skill is using it to reduce toil while still owning what ships, not outsourcing judgment.

    Capstone: build an end-to-end production-style platform · Recommended

    A capstone platform project integrates the full DevOps and SRE skill chain into a single cohesive system: infrastructure provisioned with Terraform or OpenTofu, applications containerized with Docker, deployed to Kubernetes via a GitOps controller (Argo CD or Flux), with CI/CD pipelines, centralized observability (Prometheus, Grafana, Loki), and defined SLOs. Building and operating such a project end-to-end demonstrates practical understanding of how each layer interacts and serves as a concrete portfolio artifact showing applied competency across the stack.

    Why it matters · A running portfolio project (IaC to containers to Kubernetes to GitOps to observability to SLOs) demonstrates the whole chain and beats certifications in interviews.

  9. Land the job

    Turn these skills into offers

    ResuMax takes you from skilled to hired: a resume that proves it, applications tailored per role, and interview reps.
