Roadmap to a job
DevOps / Platform / Site Reliability Engineer
DevOps, Platform Engineering, and SRE are three views of one job: shipping and operating software reliably, safely, and fast.
7 stages · 29 skills · 73 free resources
Stage 01
Stage 1, Systems & Scripting Foundations
Become fluent on a Linux command line, automate routine work with scripts, and be able to model how data moves across a network. Everything above this stage assumes these reflexes.
Linux fundamentals (filesystem, permissions, processes, systemd/journald, package managers) · Essential · 3 links
Linux is the operating system underpinning virtually all servers, containers, and CI infrastructure. Its filesystem hierarchy, permission model, and process management form the foundation of system administration. systemd manages services and boot targets, journald handles structured logging, and package managers such as apt, dnf, and apk install and maintain software.
Why it matters · Every container, server, and CI runner is Linux; the shell is where you will spend your days.
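As a small illustration of the permission model, the sketch below uses Python's standard stat module to turn an octal mode into the rwx string that ls -l prints. The helper names explain_mode and is_world_writable are made up for this example.

```python
import stat


def explain_mode(octal_mode: str) -> str:
    """Render an octal permission string (e.g. "640") as ls-style text."""
    mode = int(octal_mode, 8)
    # S_IFREG marks a regular file, so the first character is "-".
    return stat.filemode(stat.S_IFREG | mode)


def is_world_writable(octal_mode: str) -> bool:
    """True if 'others' have the write bit set, a common audit finding."""
    return bool(int(octal_mode, 8) & stat.S_IWOTH)


if __name__ == "__main__":
    for m in ("755", "640", "666"):
        print(m, explain_mode(m), is_world_writable(m))
```

Reading modes this way is mostly a learning aid; on a real host, stat and ls -l report the same information directly.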
Shell scripting (Bash) · Essential · 3 links
Bash is the default shell on most Linux distributions and is available on nearly every Linux system. It enables automation of repetitive tasks through scripts that chain commands, handle control flow, and manipulate files and processes. Bash scripts are commonly used for CI steps, server provisioning, log processing, and operational tooling.
Why it matters · The glue for automation, CI steps, and fast ops fixes; the lowest-friction tool in your kit.
Networking (TCP/IP, DNS, HTTP/HTTPS, TLS, load balancing, subnets/CIDR) · Essential · 3 links
TCP/IP is the foundational protocol suite for internet and intranet communication, with DNS translating hostnames to addresses and TLS providing encrypted transport for HTTP. Load balancing distributes traffic across multiple servers to improve availability and throughput. CIDR notation is used to define subnets, controlling how IP address ranges are allocated and routed within networks.
Why it matters · A large share of production incidents are network, DNS, or TLS; you cannot debug what you cannot model.
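To make CIDR notation concrete, here is a minimal sketch using Python's standard ipaddress module. The 10.0.1.0/24 range and the function names are illustrative assumptions, not values from any particular network.

```python
import ipaddress


def describe_subnet(cidr: str) -> dict:
    """Summarize an IPv4 CIDR block."""
    net = ipaddress.ip_network(cidr, strict=True)
    return {
        "network": str(net.network_address),
        "broadcast": str(net.broadcast_address),
        "netmask": str(net.netmask),
        "total_addresses": net.num_addresses,
    }


def contains(cidr: str, ip: str) -> bool:
    """True if the address falls inside the CIDR block."""
    return ipaddress.ip_address(ip) in ipaddress.ip_network(cidr)


def split(cidr: str, new_prefix: int) -> list:
    """Carve a block into equal smaller subnets."""
    net = ipaddress.ip_network(cidr)
    return [str(s) for s in net.subnets(new_prefix=new_prefix)]


if __name__ == "__main__":
    print(describe_subnet("10.0.1.0/24"))
    print(contains("10.0.1.0/24", "10.0.1.57"))
    print(split("10.0.1.0/24", 26))
```

Being able to answer "is this address in that range?" quickly is what most routing and security-group debugging comes down to.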
Python for automation · Essential · 2 links
Python is a general-purpose, interpreted programming language widely used for scripting, automation, and tooling in infrastructure contexts. Its standard library and rich ecosystem (including boto3 for AWS, the google-cloud libraries, and the Kubernetes client) make it the preferred language for writing more complex operational scripts, internal CLIs, and lightweight services beyond what Bash handles cleanly.
Why it matters · The default language for tooling, glue, and small APIs once Bash becomes unwieldy.
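A typical first automation script is a log summary. The sketch below counts HTTP status codes from log lines; the "METHOD PATH STATUS" line format and the function name count_statuses are assumptions made for this example, so adapt the parsing to your real log format.

```python
from collections import Counter


def count_statuses(lines):
    """Count status codes in lines shaped like 'GET /path 200'."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        # Skip anything that does not match the assumed three-field shape.
        if len(parts) >= 3 and parts[2].isdigit():
            counts[parts[2]] += 1
    return counts


if __name__ == "__main__":
    sample = ["GET / 200", "GET /missing 404", "POST /orders 200", "garbage"]
    for status, n in sorted(count_statuses(sample).items()):
        print(status, n)
```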
Stage 02
Stage 2, Version Control & Cloud Fundamentals
Track everything in Git, collaborate through pull requests, and stand up real infrastructure on one cloud provider you understand deeply.
Git and a forge (GitHub or GitLab): branching, PRs, merge conflicts, tags · Essential · 3 links
Git is a distributed version control system that tracks changes to source code over time, supporting branching, merging, and tagging. GitHub and GitLab are web-based forges that host Git repositories and add collaboration features such as pull requests, code review, issue tracking, and integrated CI/CD. Together they form the workflow through which code changes are proposed, reviewed, and merged.
Why it matters · Git is the source of truth for application code and, later, for infrastructure itself (GitOps).
One cloud provider deeply (AWS, GCP, or Azure): compute, networking/VPC, IAM, object storage, managed databases · Essential · 3 links
AWS, GCP, and Azure are the three dominant public cloud platforms, each offering compute (virtual machines and serverless), virtual private networking, identity and access management, object storage, and managed databases as foundational services. Deep knowledge of one provider means understanding how its services interact, how resources are billed, and how to architect resilient, secure workloads within its specific abstractions and console or CLI tooling.
Why it matters · Core concepts transfer across clouds, and employers want demonstrable depth in at least one.
Cloud IAM and the least-privilege model · Essential · 2 links
Cloud Identity and Access Management (IAM) systems control which principals (users, service accounts, roles) can perform which actions on which resources within a cloud environment. The least-privilege model is a security principle dictating that each identity is granted only the minimum permissions required to perform its function. Properly scoped IAM policies reduce the blast radius of compromised credentials or misconfigured services.
Why it matters · Misconfigured permissions are a leading cause of cloud breaches; security awareness starts at IAM, not at the end of the pipeline.
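The difference between a broad and a least-privilege policy can be sketched with a toy evaluator. This is a deliberately simplified model, not any cloud provider's policy engine (real ones add explicit denies, conditions, and more); the policy shape, action names, and resource paths are all hypothetical.

```python
from fnmatch import fnmatchcase


def is_allowed(policy, action, resource):
    """Default deny: allow only if one statement matches action AND resource."""
    for stmt in policy:
        action_ok = any(fnmatchcase(action, a) for a in stmt["actions"])
        resource_ok = any(fnmatchcase(resource, r) for r in stmt["resources"])
        if action_ok and resource_ok:
            return True
    return False


# Hypothetical policies: one overly broad, one scoped to a single need.
BROAD = [{"actions": ["storage:*"], "resources": ["*"]}]
SCOPED = [{"actions": ["storage:GetObject"], "resources": ["bucket/reports/*"]}]

if __name__ == "__main__":
    print(is_allowed(SCOPED, "storage:GetObject", "bucket/reports/q1.csv"))
    print(is_allowed(SCOPED, "storage:DeleteObject", "bucket/reports/q1.csv"))
    print(is_allowed(BROAD, "storage:DeleteObject", "bucket/payroll/all.csv"))
```

If the credentials behind SCOPED leak, the attacker can read one prefix; with BROAD they can delete everything, which is the blast-radius argument in miniature.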
Cloud cost awareness (FinOps basics) · Optional · 2 links
FinOps (Financial Operations) is a practice that brings financial accountability to cloud spending by connecting engineering, finance, and product teams around shared cost visibility. Core concepts include understanding cloud billing models (on-demand, reserved, spot), reading cost explorer dashboards, tagging resources for allocation, and right-sizing compute. Awareness of cost drivers helps teams make architectural decisions that balance performance with expenditure.
Why it matters · Budget ownership is increasingly part of platform roles; cost-aware infra design is a real differentiator in 2026.
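The on-demand versus reserved trade-off is simple arithmetic once you know utilization. The sketch below assumes a simplified model where reserved capacity is billed for every hour of the month whether used or not; the hourly rates are invented for illustration and real pricing models vary by provider.

```python
HOURS_PER_MONTH = 730  # common approximation: 8760 hours / 12


def on_demand_cost(hourly_rate: float, hours_used: float) -> float:
    """Pay only for the hours the instance actually runs."""
    return hourly_rate * hours_used


def reserved_cost(hourly_rate: float) -> float:
    """Committed capacity is billed for the whole month in this model."""
    return hourly_rate * HOURS_PER_MONTH


def cheaper_option(on_demand_rate: float, reserved_rate: float, hours_used: float) -> str:
    """Pick the cheaper billing model for the expected usage."""
    od = on_demand_cost(on_demand_rate, hours_used)
    rs = reserved_cost(reserved_rate)
    return "reserved" if rs < od else "on-demand"


if __name__ == "__main__":
    # Hypothetical rates in dollars per hour.
    print(cheaper_option(0.10, 0.06, hours_used=730))
    print(cheaper_option(0.10, 0.06, hours_used=200))
```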
Stage 03
Stage 3, Containers & CI/CD (Shipping Code)
Package any application into a container and build an automated pipeline that tests and ships it on every commit. This is the core DevOps loop.
Docker and containerization (images, layers, Dockerfiles, multi-stage builds, registries, compose; the cgroups/namespaces underneath) · Essential · 3 links
Docker is a platform for building and running containers: lightweight processes packaged with their dependencies, isolated from the host by Linux namespaces, and resource-limited by cgroups. A Dockerfile defines the build instructions for an image, with multi-stage builds reducing final image size by separating build and runtime environments. Container registries store and distribute images, while Docker Compose orchestrates multi-container applications on a single host.
Why it matters · Containers are the universal unit of deployment, and knowing the Linux primitives behind them makes Kubernetes far less magical.
CI/CD pipelines (GitHub Actions or GitLab CI) · Essential · 3 links
Continuous integration and continuous delivery (CI/CD) pipelines automate the process of building, testing, and deploying application code on every change. GitHub Actions and GitLab CI define these workflows as YAML files stored alongside source code, triggering jobs on events such as pull requests or tag pushes. A well-designed pipeline provides fast feedback on failures and allows code to move safely from commit to production.
Why it matters · Automated build/test/deploy on every change is the heart of the role; manual deploys are a red flag.
Artifact and image registries plus image scanning · Recommended · 2 links
Artifact registries (such as AWS ECR, GCP Artifact Registry, GitHub Packages, or JFrog Artifactory) store versioned build outputs including container images, packages, and binaries. Image scanning tools analyze container images for known CVEs in OS packages and application dependencies before deployment. Combining a registry with scanning creates a controlled gate that ensures only vetted artifacts reach production environments.
Why it matters · You need a place to store, version, and vet build outputs before they reach production.
Stage 04
Stage 4, Orchestration & Infrastructure as Code
Run containers at scale on Kubernetes and define all infrastructure declaratively in code. This is roughly the line between junior and mid-level.
Kubernetes (pods, deployments, services, ingress, configmaps/secrets, namespaces, RBAC) · Essential · 3 links
Kubernetes is an open-source container orchestration platform that automates deployment, scaling, and management of containerized workloads across clusters of nodes. Core API objects include pods (the smallest deployable unit), deployments (declarative replica management), services (stable network endpoints), and ingress (HTTP routing rules, fulfilled by an ingress controller). Namespaces provide logical isolation, while RBAC controls which principals can interact with which resources.
Why it matters · The de facto orchestration platform; CKA-level fluency appears on most mid and senior postings.
Helm (packaging and templating Kubernetes manifests) · Recommended · 2 links
Helm is the package manager for Kubernetes, allowing teams to define, install, and upgrade applications as versioned chart packages. A Helm chart bundles Kubernetes manifests with a templating engine (Go templates) and a values file, enabling a single chart to be deployed with different configurations across environments. Helm also tracks release history, enabling rollbacks to a previous release revision.
Why it matters · The standard way to package, version, and parameterize Kubernetes deployments.
Infrastructure as Code: Terraform or OpenTofu · Essential · 3 links
Terraform is a declarative Infrastructure as Code tool that provisions and manages cloud and on-premises resources through provider plugins. It compares the desired state (HCL configuration files) against the current state recorded in a state file and plans the changes needed to reconcile them. OpenTofu is the open-source, community-governed fork of Terraform created after HashiCorp relicensed Terraform under the Business Source License in 2023. Both tools support plan-and-apply workflows, enabling reproducible and version-controlled infrastructure changes.
Why it matters · Declarative, version-controlled, repeatable infrastructure is non-negotiable in 2026 hiring; OpenTofu is the open-source fork after Terraform moved to the BSL.
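The plan step can be pictured as a diff between two dictionaries. The sketch below is a conceptual toy, not how Terraform or OpenTofu are implemented (they also handle dependencies, providers, and drift detection); the resource names and attributes are hypothetical.

```python
def plan(desired: dict, current: dict) -> dict:
    """Compare desired config against recorded state and list the changes."""
    create = sorted(k for k in desired if k not in current)
    delete = sorted(k for k in current if k not in desired)
    update = sorted(
        k for k in desired if k in current and desired[k] != current[k]
    )
    return {"create": create, "update": update, "delete": delete}


if __name__ == "__main__":
    # Hypothetical resources keyed by address.
    desired = {
        "vm.web": {"size": "medium"},
        "bucket.logs": {"versioning": True},
    }
    current = {
        "vm.web": {"size": "small"},
        "vm.old": {"size": "small"},
    }
    print(plan(desired, current))
    # Once desired and current match, the plan is empty.
    print(plan(desired, desired))
```

An empty plan on a second run is the property to look for: applying the same configuration twice should change nothing.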
Configuration management (Ansible) · Recommended · 2 links
Ansible is an agentless configuration management and automation tool that uses SSH to apply declarative YAML playbooks to remote hosts. It is commonly used to provision virtual machines, enforce configuration consistency across fleets, and coordinate multi-step deployment procedures. Ansible inventories describe target hosts, roles organize reusable task collections, and modules abstract individual operations such as package installation or file templating.
Why it matters · Still the go-to for provisioning VMs and taming config drift; complements rather than replaces Terraform.
Service mesh (Istio or Linkerd) · Optional · 2 links
A service mesh is an infrastructure layer for managing service-to-service communication in microservice architectures, typically implemented as sidecar proxies (Envoy in Istio, linkerd-proxy in Linkerd) injected alongside each workload. It provides mutual TLS for encryption and authentication between services, traffic management features (retries, timeouts, canary splits), and L7 telemetry without requiring application code changes. Istio offers broad feature coverage, while Linkerd prioritizes operational simplicity and lower resource overhead.
Why it matters · Common in large microservice estates for mTLS, traffic shaping, and L7 telemetry, but no longer a junior/mid prerequisite; learn it when a real system needs it.
Checkpoint
Don't wait, start applying
You don't have to finish the path to begin. Early applications and interviews show you exactly what to learn next.
Stage 05
Stage 5, Observability & GitOps Delivery
Make systems answer 'is it healthy?' and 'what broke?', and adopt Git as the single source of truth for deployment (GitOps), the 2026 default delivery model.
Metrics, logs, and traces with Prometheus, Grafana, and Loki · Essential · 3 links
Prometheus is an open-source metrics collection and alerting system that scrapes time-series data from instrumented services using a pull model; the stored series are queried with its own language, PromQL. Grafana is a visualization platform that builds dashboards from Prometheus metrics and other data sources. Loki is Grafana Labs' log aggregation system, designed to store and query logs with minimal indexing by using the same label model as Prometheus.
Why it matters · You cannot operate or be on-call for what you cannot see; instrumentation and dashboards are core SRE work.
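In the pull model, a service exposes its current metric values as plain text and Prometheus fetches them on a schedule. The sketch below renders one counter in the text exposition style; the metric name and labels are made up, and it assumes label values contain no characters that need escaping. In practice you would use an official client library rather than formatting this by hand.

```python
def render_counter(name, help_text, samples):
    """Render a counter as exposition-style text.

    samples is a list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        # Sort labels so the output is stable between scrapes.
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(
        render_counter(
            "http_requests_total",
            "Total HTTP requests.",
            [({"method": "GET", "code": "200"}, 1027)],
        ),
        end="",
    )
```

The label pairs are the same idea Loki reuses for log streams, which is why the two query models feel similar.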
OpenTelemetry (vendor-neutral instrumentation standard) · Essential · 2 links
OpenTelemetry (OTel) is a CNCF project that provides a unified set of APIs, SDKs, and a collector for generating, collecting, and exporting telemetry data (traces, metrics, and logs) from applications and infrastructure. It defines a vendor-neutral wire protocol (OTLP) and semantic conventions, allowing instrumentation to be written once and exported to any compatible backend such as Grafana, Jaeger, Honeycomb, or Datadog. By 2026, OTel has become the de facto standard for cloud-native observability instrumentation.
Why it matters · The 2026 convergence point: every major observability vendor supports OTel natively, so instrumenting once avoids lock-in.
GitOps with Argo CD or Flux (plus progressive delivery / canaries) · Essential · 3 links
GitOps is an operational model in which a Git repository serves as the single source of truth for the desired state of Kubernetes workloads, with a controller continuously reconciling the cluster to match that state. Argo CD and Flux are the two leading GitOps controllers, each watching repositories and applying changes automatically or on approval. Progressive delivery extends GitOps with traffic-splitting strategies such as canary and blue-green deployments, allowing gradual rollout with automated promotion or rollback based on metrics.
Why it matters · Now the primary Kubernetes delivery mechanism across much of the industry; Git becomes the deploy source of truth and audit log.
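Automated promotion or rollback during a canary reduces to a small decision made at each step. The sketch below is a toy version of that decision, not the logic of any specific progressive delivery tool; the traffic weights, error-rate threshold, and function name are assumptions for illustration.

```python
def next_weight(steps, current_weight, canary_error_rate, max_error_rate):
    """Return the canary's next traffic percentage.

    Roll back to 0 if the canary is unhealthy; otherwise advance to the
    next configured step, or stay put when already at the final step.
    """
    if canary_error_rate > max_error_rate:
        return 0
    later = [w for w in steps if w > current_weight]
    return later[0] if later else current_weight


if __name__ == "__main__":
    steps = [5, 25, 50, 100]  # hypothetical rollout schedule, in percent
    print(next_weight(steps, 5, canary_error_rate=0.002, max_error_rate=0.01))
    print(next_weight(steps, 25, canary_error_rate=0.05, max_error_rate=0.01))
```

Real controllers evaluate several metrics over a time window before each step, but the promote-or-rollback shape is the same.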
eBPF-based networking and observability (Cilium) · Optional · 2 links
eBPF (extended Berkeley Packet Filter) is a Linux kernel technology that allows sandboxed programs to run in kernel space without modifying kernel source code, enabling high-performance networking, security enforcement, and observability with low overhead. Cilium is a CNCF networking and security project that uses eBPF to implement Kubernetes networking (CNI), network policies, service mesh capabilities, and rich L7 telemetry. Cilium's Hubble component provides real-time visibility into network flows without requiring application-level instrumentation.
Why it matters · A fast-growing 2026 approach to kernel-level networking, security, and telemetry; specialized today but increasingly visible on platform teams.
Stage 06
Stage 6, Security / DevSecOps (Shift-Left & Supply Chain)
Bake security into the pipeline and harden the software supply chain. This work is increasingly driven by regulation (SBOMs, signing, policy-as-code), not just good hygiene.
Pipeline and dependency scanning plus secrets management (Trivy, Vault) · Essential · 3 links
Trivy is an open-source vulnerability and misconfiguration scanner that checks container images, filesystems, Git repositories, and Kubernetes manifests for known CVEs, exposed secrets, and insecure configurations as part of CI pipelines. HashiCorp Vault is a secrets management platform that centrally stores, leases, and rotates sensitive values (API keys, database credentials, TLS certificates) and provides short-lived dynamic secrets to applications and pipelines. Together, scanning and secrets management prevent vulnerable code and exposed credentials from reaching production.
Why it matters · Vulnerabilities and leaked secrets must be caught in CI, before they ever reach production.
Software supply-chain security: SBOMs and artifact signing (Syft, Sigstore/cosign, SLSA) · Recommended · 3 links
A Software Bill of Materials (SBOM) is a structured inventory of all components, libraries, and dependencies in a software artifact, enabling consumers to identify known vulnerabilities in what they are running. Syft is an open-source SBOM generator that produces CycloneDX or SPDX documents from container images and filesystems. Sigstore's cosign tool enables keyless cryptographic signing and verification of container images. The SLSA (Supply-chain Levels for Software Artifacts) framework defines graded levels of build provenance and integrity intended to prevent tampering.
Why it matters · SBOMs and build provenance have moved from optional to regulatory expectation (e.g., US EO 14028, EU CRA) heading into 2026.
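The value of an SBOM is that it makes "are we affected?" a lookup instead of an investigation. The sketch below matches a component inventory against a list of advisories; the flat name/version shape is a simplification (not the CycloneDX or SPDX schema), and the component names and advisory IDs are invented placeholders.

```python
def affected_components(sbom, advisories):
    """Return (name, version, advisory_id) for each inventory match."""
    vulnerable = {(a["name"], a["version"]): a["id"] for a in advisories}
    hits = []
    for component in sbom:
        key = (component["name"], component["version"])
        if key in vulnerable:
            hits.append((component["name"], component["version"], vulnerable[key]))
    return hits


if __name__ == "__main__":
    # Hypothetical inventory and advisory feed.
    sbom = [
        {"name": "libexample", "version": "1.2.0"},
        {"name": "webframework", "version": "4.1.3"},
    ]
    advisories = [
        {"id": "EXAMPLE-0001", "name": "libexample", "version": "1.2.0"},
    ]
    print(affected_components(sbom, advisories))
```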
Policy as code (Open Policy Agent / Kyverno) · Recommended · 2 links
Policy as code is the practice of expressing compliance, security, and operational rules as machine-readable policies stored and versioned alongside infrastructure code. Open Policy Agent (OPA) is a general-purpose policy engine using the Rego language, often integrated as a Kubernetes admission controller or in CI pipelines to evaluate requests against defined rules. Kyverno is a Kubernetes-native policy engine that uses YAML policies to validate, mutate, and generate resources directly within the cluster without requiring a separate policy language.
Why it matters · Codified, automated guardrails enforce compliance and least-privilege at deploy time instead of relying on review by hand.
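The core idea, rules as code that evaluate a resource and return violations, can be sketched without either engine. The example below is plain Python rather than Rego or a Kyverno policy, and the two rules (no unpinned image tags, no privileged containers) are illustrative choices, not a recommended baseline.

```python
def validate_pod(manifest: dict) -> list:
    """Return a list of human-readable policy violations for a pod manifest."""
    violations = []
    for c in manifest.get("spec", {}).get("containers", []):
        name = c.get("name", "<unnamed>")
        image = c.get("image", "")
        # Look for a tag only in the last path segment, so a registry
        # port such as registry:5000/app is not mistaken for a tag.
        last_segment = image.rsplit("/", 1)[-1]
        tag = image.rsplit(":", 1)[1] if ":" in last_segment else "latest"
        if tag == "latest":
            violations.append(f"{name}: image tag must be pinned, not 'latest'")
        if c.get("securityContext", {}).get("privileged", False):
            violations.append(f"{name}: privileged containers are not allowed")
    return violations


if __name__ == "__main__":
    pod = {
        "spec": {
            "containers": [
                {"name": "web", "image": "registry:5000/web"},
                {"name": "db", "image": "postgres:16", "securityContext": {"privileged": True}},
            ]
        }
    }
    for v in validate_pod(pod):
        print(v)
```

An admission controller runs checks like these on every request to the cluster, rejecting the resource when the list is not empty.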
Stage 07
Stage 7, SRE Practice & Platform Engineering (Job-Ready / Senior Track)
Adopt the operating discipline that defines senior roles: reliability as a measurable engineering target (SLOs/error budgets), real incident response, and building self-service internal platforms so other teams ship safely.
SRE fundamentals: SLIs, SLOs, error budgets, toil reduction · Essential · 3 links
Site Reliability Engineering (SRE) is a discipline that applies software engineering practices to operations, developed at Google and now widely adopted. Service Level Indicators (SLIs) are specific metrics measuring service behavior (such as request latency or error rate), Service Level Objectives (SLOs) are the target thresholds for those metrics, and error budgets define the allowable failure margin within a period (for example, a 99.9% SLO leaves a 0.1% budget). Toil reduction is the ongoing effort to identify and automate repetitive, manual operational work that scales with system size.
Why it matters · Turns 'reliability' into measurable, negotiable engineering: the language of SRE interviews and roadmaps.
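Error budgets are just arithmetic on the SLO. The sketch below shows two common views, allowed downtime in a window and the fraction of a request-based budget still unspent; the function names and sample numbers are assumptions for illustration.

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Minutes of full outage a time-based SLO tolerates in the window."""
    if not 0 < slo < 1:
        raise ValueError("slo must be strictly between 0 and 1")
    return (1 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of a request-based error budget still unspent.

    A negative result means the budget is overspent.
    """
    if not 0 < slo < 1:
        raise ValueError("slo must be strictly between 0 and 1")
    allowed_failures = (1 - slo) * total_requests
    return 1 - failed_requests / allowed_failures


if __name__ == "__main__":
    # A 99.9% SLO over 30 days allows roughly 43 minutes of downtime.
    print(round(error_budget_minutes(0.999, 30), 1))
    # 250 failures out of 1,000,000 requests spends a quarter of the budget.
    print(round(budget_remaining(0.999, 1_000_000, 250), 2))
```

Teams typically use the remaining fraction to decide whether to keep shipping features or pause and spend time on reliability.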
Incident response, on-call, blameless postmortems, alerting · Essential · 3 links
Incident response is the structured process of detecting, triaging, mitigating, and resolving production outages or degradations, typically coordinated by an on-call engineer following a runbook. Alerting systems (Alertmanager, PagerDuty, or similar) route threshold-based or anomaly-based notifications to the appropriate responders. Blameless postmortems are written analyses produced after an incident that focus on systemic causes and preventive actions rather than individual fault, creating a shared learning artifact.
Why it matters · Production ownership is the core SRE/DevOps responsibility; runbooks and postmortems are everyday artifacts.
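A threshold alert that fires on a single bad sample pages people for blips, so alert rules usually require the condition to hold for some duration. The sketch below models that with consecutive samples; it is a simplified stand-in for what an alerting system does, and the function name, threshold, and sample values are hypothetical.

```python
def should_page(samples, threshold: float, for_samples: int) -> bool:
    """True only if the most recent `for_samples` readings all exceed threshold."""
    if for_samples < 1:
        raise ValueError("for_samples must be at least 1")
    if len(samples) < for_samples:
        return False  # not enough data yet to be confident
    return all(v > threshold for v in samples[-for_samples:])


if __name__ == "__main__":
    error_rates = [0.01, 0.09, 0.02, 0.08, 0.07, 0.09]
    # A brief spike earlier in the series is ignored; the sustained
    # breach at the end is what triggers the page.
    print(should_page(error_rates[:3], threshold=0.05, for_samples=3))
    print(should_page(error_rates, threshold=0.05, for_samples=3))
```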
Platform Engineering and Internal Developer Platforms (Backstage) · Recommended · 2 links
Platform Engineering is the discipline of designing and maintaining an Internal Developer Platform (IDP), a self-service layer that abstracts infrastructure complexity and provides standardized paths for developers to build, deploy, and operate services. Backstage is an open-source CNCF project originally developed by Spotify that serves as a framework for building IDPs, offering a software catalog, templated scaffolding, plugin-based integrations with CI/CD and cloud providers, and a unified developer portal. IDPs reduce cognitive load for application teams by encoding organizational best practices into reusable golden paths.
Why it matters · The dominant senior trajectory in 2026 (Gartner projected that 80% of large software engineering organizations would establish platform teams by 2026, up from 45% in 2022); self-service IDPs are becoming the norm.
Go for cloud-native tooling and operators · Optional · 2 links
Go is a statically typed, compiled programming language designed at Google for systems and server-side software, emphasizing simplicity, fast compilation, and built-in concurrency via goroutines and channels. It is the primary language of the cloud-native ecosystem, including Kubernetes, Docker, Prometheus, Terraform, and the majority of CNCF projects. Kubernetes controllers and operators written in Go typically use the controller-runtime and client-go libraries to watch cluster resources and reconcile desired state.
Why it matters · The language Kubernetes and most CNCF tools are written in; needed to write controllers/operators and high-performance internal tooling.
AI-assisted operations and toil reduction · Optional · 2 links
AI-assisted operations refers to the integration of large language model tools and AI coding assistants into infrastructure workflows to accelerate repetitive tasks such as drafting Terraform modules, writing runbooks, generating Kubernetes manifests, or summarizing incident timelines. These tools function as accelerators for experienced practitioners rather than autonomous agents, requiring the engineer to review, test, and own all generated output. Effective use involves knowing which tasks are well-suited to AI generation and maintaining critical evaluation of results before applying them to production systems.
Why it matters · By 2026 teams routinely use AI to draft IaC, summarize incidents, and write runbooks; the skill is using it to reduce toil while still owning what ships, not outsourcing judgment.
Capstone: build an end-to-end production-style platform · Recommended · 2 links
A capstone platform project integrates the full DevOps and SRE skill chain into a single cohesive system: infrastructure provisioned with Terraform or OpenTofu, applications containerized with Docker, deployed to Kubernetes via a GitOps controller (Argo CD or Flux), with CI/CD pipelines, centralized observability (Prometheus, Grafana, Loki), and defined SLOs. Building and operating such a project end-to-end demonstrates practical understanding of how each layer interacts and serves as a concrete portfolio artifact showing applied competency across the stack.
Why it matters · A running portfolio project (IaC to containers to Kubernetes to GitOps to observability to SLOs) demonstrates the whole chain and beats certifications in interviews.