Roadmap to a job
Data Engineer
A data engineer builds and operates the pipelines and storage that turn raw, messy data into reliable, query-ready datasets for analysts.
6 stages · 21 skills · 45 free resources
Stage 01
Stage 1, Foundations: SQL, Python, and the command line
Read, write, and reason about SQL and Python fluently, and work confidently in a terminal with Git. This is the floor every data engineer interview tests.
SQL fundamentals to fluency · Essential · 3 links
SQL (Structured Query Language) is the standard language for interacting with relational databases. It is used to query, insert, update, and delete data, as well as to define schema structures and control access. Nearly all data storage and retrieval workflows in analytics and engineering depend on SQL proficiency.
Why it matters · SQL is requested in nearly every data engineering posting and is used daily to query, transform, and debug data. It is non-negotiable.
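As a minimal sketch of the four core operations, the example below uses Python's built-in sqlite3 module with an in-memory database. The `orders` table, its columns, and the sample values are hypothetical and chosen only for illustration.

```python
import sqlite3

# In-memory database; the `orders` table and its columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, customer TEXT NOT NULL, amount REAL NOT NULL)"
)

# INSERT: parameterized so values are never spliced into the SQL string.
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("ana", 20.0), ("ben", 35.5), ("ana", 12.5)],
)

# UPDATE and DELETE both select their target rows with a WHERE clause.
conn.execute("UPDATE orders SET amount = 40.0 WHERE customer = 'ben'")
conn.execute("DELETE FROM orders WHERE amount < 15")

# SELECT with aggregation: total spend per customer.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)
```

The same statements carry over to most relational databases with small dialect differences.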
Advanced SQL: window functions, CTEs, NULL handling · Essential · 3 links
Window functions compute aggregates across a sliding frame of rows without collapsing them into groups, enabling rankings, running totals, and lag/lead comparisons. Common Table Expressions (CTEs) allow complex queries to be broken into named, reusable subqueries for clarity. Correct NULL handling ensures that comparisons, aggregations, and joins behave predictably: any comparison with NULL evaluates to unknown rather than true or false, and aggregates such as COUNT(column) skip NULL values.
Why it matters · Window functions, CTEs, and correct NULL handling separate a hire from a no-hire in the SQL screen; you need these without hesitation.
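The three ideas can be seen together in one small query. This is a sketch using Python's sqlite3 module and assumes the bundled SQLite is version 3.25 or newer (needed for window functions). The `sales` table and its data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (rep TEXT, day INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("ana", 1, 10.0), ("ana", 2, None), ("ana", 3, 5.0), ("ben", 1, 7.0)],
)

query = """
WITH cleaned AS (                       -- CTE: a named intermediate result
    SELECT rep, day, COALESCE(amount, 0.0) AS amount   -- NULL becomes 0.0
    FROM sales
)
SELECT rep, day, amount,
       SUM(amount) OVER (PARTITION BY rep ORDER BY day) AS running_total
FROM cleaned
ORDER BY rep, day
"""
running = conn.execute(query).fetchall()

# NULL = NULL evaluates to NULL (unknown), which sqlite3 returns as None.
null_eq = conn.execute("SELECT NULL = NULL").fetchone()[0]

# COUNT(*) counts rows; COUNT(amount) skips rows where amount IS NULL.
counts = conn.execute("SELECT COUNT(*), COUNT(amount) FROM sales").fetchone()

print(running, null_eq, counts)
```

Note that the window function keeps every input row, whereas a GROUP BY on `rep` would return one row per rep.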
Python for data engineering · Essential · 2 links
Python is a general-purpose programming language widely adopted in data engineering for writing pipeline scripts, automating file processing, and connecting disparate systems. Its rich ecosystem (pandas, PyArrow, SQLAlchemy, requests) covers data parsing, transformation, and API integration. Data engineering usage emphasizes file I/O, object-oriented design, and robust error handling over algorithmic programming.
Why it matters · Python is the language for pipelines, automation, and glue code, and shows up in the overwhelming majority of postings. Focus on data-eng style (file parsing, OOP basics, error handling), not LeetCode puzzles.
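A short sketch of that data-engineering style: parse a file, model each record as a small class, and collect bad rows instead of crashing. The `Reading` class, the column names, and the sample data are hypothetical.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Reading:
    sensor_id: str
    value: float


def parse_readings(lines):
    """Parse CSV rows into Reading objects, collecting bad rows instead of failing."""
    good, bad = [], []
    # The header is line 1 of the file, so data rows start at line 2.
    for line_no, row in enumerate(csv.DictReader(lines), start=2):
        try:
            good.append(Reading(row["sensor_id"], float(row["value"])))
        except (KeyError, TypeError, ValueError) as exc:
            bad.append((line_no, repr(exc)))
    return good, bad


# io.StringIO stands in for an open file handle.
sample = io.StringIO("sensor_id,value\na1,3.5\na2,not_a_number\na3,7\n")
good, bad = parse_readings(sample)
print(good)
print(bad)
```

Returning rejected rows alongside the good ones lets the caller decide whether to quarantine them, alert, or abort the run.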
Command line, Bash, and Git/GitHub · Essential · 3 links
The command line and Bash scripting provide direct control over operating system processes, file systems, and scheduled tasks on servers where pipelines run. Git is a distributed version control system used to track code changes, manage branches, resolve merge conflicts, and collaborate via platforms such as GitHub. Both are foundational tools for deploying and maintaining data infrastructure.
Why it matters · Pipelines run on servers and ship through code review; terminal fluency and Git (branches, merges, conflicts) are everyday tools.
Stage 02
Stage 2, Data modeling and database internals
Design correct, query-efficient schemas and explain WHY. The 'design a schema' round decides a large share of loops, and most candidates fail it on reasoning, not SQL.
Dimensional modeling (Kimball): grain, fact & dimension tables · Essential · 2 links
Dimensional modeling, as formalized by Ralph Kimball, is a schema design technique for analytical databases that organizes data into fact tables (measurable events) and dimension tables (descriptive context). Declaring the grain, the precise level of detail each row represents, is the first step before designing any schema. The resulting star schema optimizes query performance and readability for BI and reporting workloads.
Why it matters · Star schema is the default answer in 2026 interviews; declaring the grain before drawing tables is what passing answers do.
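A minimal star-schema sketch, again using sqlite3 as a stand-in for a warehouse. The table names, columns, and rows are hypothetical. The declared grain here is one row per order line.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension: descriptive context, one row per product.
    CREATE TABLE dim_product (
        product_key INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        category    TEXT NOT NULL
    );
    -- Fact: measurable events. Grain = one row per order line.
    CREATE TABLE fact_order_line (
        order_id    INTEGER NOT NULL,
        line_no     INTEGER NOT NULL,
        product_key INTEGER NOT NULL REFERENCES dim_product (product_key),
        quantity    INTEGER NOT NULL,
        revenue     REAL NOT NULL,
        PRIMARY KEY (order_id, line_no)
    );
    INSERT INTO dim_product VALUES
        (1, 'Espresso', 'Coffee'), (2, 'Latte', 'Coffee'), (3, 'Scone', 'Bakery');
    INSERT INTO fact_order_line VALUES
        (100, 1, 1, 2, 6.0),
        (100, 2, 3, 1, 3.5),
        (101, 1, 2, 1, 4.5);
    """
)

# Typical BI query: aggregate a fact measure, sliced by a dimension attribute.
revenue_by_category = conn.execute(
    """
    SELECT d.category, SUM(f.revenue)
    FROM fact_order_line AS f
    JOIN dim_product AS d USING (product_key)
    GROUP BY d.category
    ORDER BY d.category
    """
).fetchall()
print(revenue_by_category)
```

Because the grain is fixed at the order line, the primary key (order_id, line_no) rejects any row that would represent a different level of detail twice.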
Slowly Changing Dimensions (SCD Types 1/2/3) and normalization · Essential · 2 links
Slowly Changing Dimensions (SCDs) are patterns for tracking how dimension attributes change over time in a data warehouse. Type 1 overwrites the old value, Type 2 adds a new row with effective dates to preserve history, and Type 3 stores both current and prior values in separate columns. Normalization is the complementary relational-design process of eliminating redundancy by decomposing tables into well-defined forms.
Why it matters · Naming the SCD type (Type 2 is the common one) and defending a star schema vs. Data Vault tradeoff is exactly what interviewers probe for.
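The Type 2 pattern can be sketched in a few lines of plain Python. The row layout (`key`, `attrs`, `valid_from`, `valid_to`) and the function name are assumptions made for illustration; in a warehouse this logic is usually expressed as a MERGE or a pair of UPDATE and INSERT statements.

```python
from datetime import date


def apply_scd2(rows, key, new_attrs, effective):
    """SCD Type 2: expire the current row for `key` and append a new version."""
    current = next(
        (r for r in rows if r["key"] == key and r["valid_to"] is None), None
    )
    if current is not None:
        if current["attrs"] == new_attrs:
            return rows  # nothing changed, so no new version is recorded
        current["valid_to"] = effective  # close the old version
    rows.append(
        {"key": key, "attrs": new_attrs, "valid_from": effective, "valid_to": None}
    )
    return rows


dim_customer = []
apply_scd2(dim_customer, "c1", {"city": "Lisbon"}, date(2024, 1, 1))
apply_scd2(dim_customer, "c1", {"city": "Porto"}, date(2024, 6, 1))
apply_scd2(dim_customer, "c1", {"city": "Porto"}, date(2024, 7, 1))  # unchanged

for row in dim_customer:
    print(row)
```

A Type 1 change would instead overwrite `attrs` on the existing row, losing the Lisbon history.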
Relational vs NoSQL, OLTP vs OLAP, columnar storage · Recommended · 2 links
Relational databases enforce a fixed schema with rows and strong consistency, while NoSQL systems (document, key-value, wide-column, graph) trade some consistency for flexibility and scale. OLTP (Online Transaction Processing) systems are optimized for high-throughput row-level reads and writes, whereas OLAP (Online Analytical Processing) systems are optimized for aggregating large volumes of historical data. Columnar storage, used in OLAP warehouses, compresses and reads only the needed columns, dramatically accelerating analytical queries.
Why it matters · Knowing when a workload wants a row store vs a columnar warehouse vs a document/key-value store is core architectural judgment.
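The difference between row and columnar layouts can be illustrated with plain Python structures. This is a toy model of the access patterns only, not of how any real storage engine is implemented, and the sample data is hypothetical.

```python
# Row-oriented layout: each record is stored together.
rows = [
    {"id": 1, "country": "PT", "amount": 10.0},
    {"id": 2, "country": "ES", "amount": 4.0},
    {"id": 3, "country": "PT", "amount": 6.0},
]

# Columnar layout: one contiguous list per column.
columns = {name: [r[name] for r in rows] for name in rows[0]}

# OLTP-style point lookup: the row layout returns a whole record in one step.
record = next(r for r in rows if r["id"] == 2)

# OLAP-style aggregate: the columnar layout touches only the "amount" values
# and never reads "id" or "country".
total = sum(columns["amount"])

print(record, total)
```

Storing similar values next to each other is also what makes columnar data compress well.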
Stage 03
Stage 3, Cloud data warehouse + the ELT transformation layer (dbt)
Go deep on ONE cloud warehouse and own the transformation layer with dbt. This is the heart of the modern data stack and where most data-eng work actually happens today.
One cloud data warehouse, deeply (Snowflake / BigQuery / Redshift) · Essential · 2 links
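Cloud data warehouses such as Snowflake, Google BigQuery, and Amazon Redshift are fully managed columnar storage and compute platforms designed for large-scale analytical queries. Snowflake and BigQuery separate storage from compute, and Redshift does so on its RA3 and Serverless offerings, which lets each layer scale independently. Pricing is usage-based but differs by platform: BigQuery's on-demand model bills by bytes scanned, while Snowflake bills for the time a virtual warehouse runs. Deep expertise in one platform includes understanding partitioning, clustering, query execution plans, caching, and cost controls.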
Why it matters · Cloud warehouses are where analysis-ready data lives; mastering one (partitioning, clustering, query planning) beats skimming three.
dbt (data build tool), transformation as code · Essential · 2 links
dbt (data build tool) is an open-source framework that allows data teams to write SQL-based transformations as version-controlled code within a data warehouse. It enforces a layered model structure (staging, intermediate, and mart layers), generates data lineage graphs, runs automated tests, and produces documentation from model metadata. dbt operates on the ELT pattern, transforming data after it has been loaded into the warehouse.
Why it matters · dbt is the de facto ELT transformation standard on 2026 data teams (staging/intermediate/mart layers, tests, docs, lineage) and shows up almost everywhere.
ELT vs ETL patterns and ingestion · Essential · 2 links
ETL (Extract, Transform, Load) transforms data before loading it into the destination, traditionally used when compute was expensive inside warehouses. ELT (Extract, Load, Transform) loads raw data first and transforms it inside the warehouse using its native compute power, which is the dominant pattern in cloud data stacks. Ingestion tools such as Airbyte, Fivetran, and dlt automate the extract-and-load steps from source systems into the warehouse.
Why it matters · Modern stacks load-then-transform (ELT); understanding ingestion (Airbyte/Fivetran/dlt) and when ELT beats ETL is core pipeline design.
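A small sketch of the load-then-transform order, with sqlite3 standing in for the warehouse. The table names (`raw_payments`, `stg_payments`) and the sample rows are hypothetical. In a real stack the load step would be handled by an ingestion tool and the transform step by warehouse SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Extract + Load: land the source rows as-is (all text, no cleanup).
source_rows = [(" Ana ", "12.50"), ("BEN", "3"), ("ana", "7.5")]
conn.execute("CREATE TABLE raw_payments (customer TEXT, amount TEXT)")
conn.executemany("INSERT INTO raw_payments VALUES (?, ?)", source_rows)

# Transform: performed afterwards, in SQL, by the database engine itself.
conn.execute(
    """
    CREATE TABLE stg_payments AS
    SELECT LOWER(TRIM(customer)) AS customer,
           CAST(amount AS REAL)  AS amount
    FROM raw_payments
    """
)

totals = conn.execute(
    "SELECT customer, SUM(amount) FROM stg_payments GROUP BY customer ORDER BY customer"
).fetchall()
raw_count = conn.execute("SELECT COUNT(*) FROM raw_payments").fetchone()[0]
print(totals, raw_count)
```

Because the raw table is kept untouched, the transformation can be changed and rerun later without going back to the source system. In an ETL design the cleanup would happen in application code before the first insert.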
Stage 04
Stage 4, Orchestration, scale, and production reliability
Schedule pipelines that survive failure, and process data too big for one machine. This is the jump from 'scripts' to 'production data engineering'.
Apache Airflow, DAGs, scheduling, dependencies · Essential · 2 links
Apache Airflow is an open-source platform for programmatically authoring, scheduling, and monitoring data workflows. Pipelines are expressed as Directed Acyclic Graphs (DAGs) in Python, where each node is a task and edges define dependencies and execution order. Airflow provides a web UI for tracking runs, configuring retries, and alerting on failures.
Why it matters · Airflow is the dominant orchestrator; designing DAGs with retries, alerting, and dependencies is a near-universal job requirement.
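The DAG idea itself (tasks as nodes, dependencies as edges, retries on failure) can be sketched with Python's standard library alone. This is not Airflow code and does not use the Airflow API. The task names, the `run_task` body, and the retry helper are hypothetical, and `graphlib` requires Python 3.9 or newer.

```python
from graphlib import TopologicalSorter

# Each key is a task; its value is the set of upstream tasks it depends on.
dag = {
    "extract_orders": set(),
    "extract_customers": set(),
    "build_mart": {"extract_orders", "extract_customers"},
    "run_tests": {"build_mart"},
}

attempts = {}


def run_task(name):
    """Hypothetical task body: 'build_mart' fails once to exercise the retry path."""
    attempts[name] = attempts.get(name, 0) + 1
    if name == "build_mart" and attempts[name] == 1:
        raise RuntimeError("transient failure")


def run_with_retries(name, max_retries=2):
    """Run a task, retrying up to `max_retries` times before giving up."""
    for attempt in range(max_retries + 1):
        try:
            run_task(name)
            return
        except RuntimeError:
            if attempt == max_retries:
                raise  # retries exhausted: surface the failure


# A topological order guarantees every task runs after its upstream tasks.
order = list(TopologicalSorter(dag).static_order())
for task in order:
    run_with_retries(task)

print(order, attempts)
```

An orchestrator adds what this sketch lacks: schedules, parallel execution of independent tasks, persisted run state, a UI, and alerting.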
Failure modes: idempotency, backfills, late/duplicate data · Essential · 2 links
Idempotency in pipeline design means that rerunning a job any number of times produces the same result as running it once, preventing duplicate records or inconsistent state. Backfilling is the process of reprocessing historical data ranges after a logic change or outage. Handling late-arriving and duplicate records requires deduplication logic, CDC (Change Data Capture) awareness, and watermarking strategies to maintain data correctness.
Why it matters · Real pipelines break; handling idempotent reruns, backfills, CDC, and late-arriving data is what makes a pipeline trustworthy in production.
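One common way to make a load idempotent is to replace a whole partition inside a single transaction, so a rerun or backfill of the same day cannot create duplicates. The sketch below uses sqlite3. The `daily_sales` table, the `load_day` function, and the sample batch are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (day TEXT, order_id INTEGER, amount REAL)")


def load_day(conn, day, records):
    """Idempotent load: replace the whole partition for `day` in one transaction."""
    # Deduplicate on order_id, keeping the last record seen for each id.
    deduped = {order_id: amount for order_id, amount in records}
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("DELETE FROM daily_sales WHERE day = ?", (day,))
        conn.executemany(
            "INSERT INTO daily_sales VALUES (?, ?, ?)",
            [(day, order_id, amount) for order_id, amount in deduped.items()],
        )


batch = [(1, 10.0), (2, 5.0), (2, 5.0)]  # order 2 arrives twice
load_day(conn, "2024-01-01", batch)
load_day(conn, "2024-01-01", batch)  # rerun or backfill of the same day

row_count, total = conn.execute(
    "SELECT COUNT(*), SUM(amount) FROM daily_sales"
).fetchone()
print(row_count, total)
```

A plain append would have left six rows after the second run. Delete-and-insert by partition is one option; a keyed upsert or MERGE achieves the same property.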
Apache Spark / PySpark, distributed processing · Essential · 2 links
Apache Spark is a distributed data processing engine designed for large-scale batch and streaming workloads across clusters of machines. PySpark is its Python API, exposing DataFrames and the RDD (Resilient Distributed Dataset) abstraction for transformations and actions executed in parallel. Core concepts include lazy evaluation (building an execution plan before running it), partitioning, shuffles, and performance tuning via caching and broadcast joins.
Why it matters · Spark remains the dominant big-data framework and a frequent interview topic; you need DataFrames, lazy evaluation, partitioning, and tuning for data beyond one machine.
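Lazy evaluation is easier to reason about with a toy model. The class below is not the PySpark API and runs on a single machine. It only mimics the split between transformations, which record a plan, and actions, which execute it. All names are hypothetical.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset; not the PySpark API."""

    def __init__(self, data, plan=()):
        self._data = data
        self._plan = plan  # recorded transformations, not yet executed

    def map(self, fn):
        # Transformation: returns a new dataset with a longer plan.
        return LazyDataset(self._data, self._plan + (("map", fn),))

    def filter(self, fn):
        return LazyDataset(self._data, self._plan + (("filter", fn),))

    def collect(self):
        """Action: only now is the recorded plan actually executed."""
        out = iter(self._data)
        for kind, fn in self._plan:
            out = map(fn, out) if kind == "map" else filter(fn, out)
        return list(out)


calls = []


def double(x):
    calls.append(x)  # side effect that reveals when the function really runs
    return x * 2


ds = LazyDataset(range(5)).map(double).filter(lambda x: x > 4)
calls_before_action = len(calls)  # still zero: nothing has run yet
result = ds.collect()

print(calls_before_action, result, calls)
```

Deferring execution is what gives an engine the chance to inspect the whole plan and optimize it before any data moves.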
Docker + cloud object storage & IAM (S3/GCS) · Essential · 2 links
Docker is a containerization platform that packages an application and its dependencies into a portable image, ensuring consistent behavior across development, CI, and production environments. Cloud object storage services such as Amazon S3 and Google Cloud Storage (GCS) provide scalable, durable storage for raw data files, intermediate outputs, and pipeline artifacts. IAM (Identity and Access Management) controls which identities can read, write, or manage these storage resources and other cloud services.
Why it matters · Pipelines ship as containers and read/write cloud object storage; Docker plus S3/GCS and basic IAM are baseline deployment skills.
Checkpoint
Don't wait, start applying
You don't have to finish the path to begin. Early applications and interviews show you exactly what to learn next.
Stage 05
Stage 5, Differentiators: streaming, AI-data, quality & governance
Layer on the skills that separate a strong candidate from a baseline one. Pick based on target roles: these are high-value but role-dependent, not universal day-one essentials.
Real-time streaming: Kafka + Spark Structured Streaming / Flink · Recommended · 2 links
Apache Kafka is a distributed event-streaming platform that acts as a high-throughput, durable message bus between producers and consumers. Spark Structured Streaming and Apache Flink are stream-processing engines that consume events from Kafka and apply transformations, aggregations, and joins in near-real-time. Together, these tools power use cases such as fraud detection, real-time dashboards, and live feature computation for machine learning systems.
Why it matters · More analytics is shifting into the streaming layer; Kafka (transport) plus a processor (Flink or Spark Streaming) powers fraud detection, live dashboards, and real-time features.
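A typical stream-processing aggregation is a count per key over fixed (tumbling) time windows. The sketch below shows only that windowing logic in plain Python. It does not use Kafka, Flink, or Spark APIs, and the event tuples and function name are hypothetical. In a real system the events would arrive from a consumer rather than a list.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per (window_start, key) for fixed, non-overlapping windows."""
    counts = defaultdict(int)
    for timestamp, key in events:
        # Round the event time down to the start of its window.
        window_start = timestamp - (timestamp % window_seconds)
        counts[(window_start, key)] += 1
    return dict(counts)


# (event time in seconds, card id): e.g. transactions checked for fraud signals.
events = [(0, "card_a"), (12, "card_a"), (59, "card_b"), (61, "card_a"), (119, "card_a")]
counts = tumbling_window_counts(events, window_seconds=60)
print(counts)
```

Stream processors add the hard parts on top of this: state that survives restarts, and watermarks that decide how long to wait for late events before a window is closed.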
Data quality, testing & observability · Recommended · 2 links
Data quality encompasses validation rules that check schemas, value ranges, uniqueness, and referential integrity at various points in a pipeline. Testing frameworks such as dbt tests and Great Expectations allow these checks to be codified and run automatically on each pipeline execution. Data observability extends quality by continuously monitoring freshness, volume, and distribution metrics to detect anomalies before downstream consumers are affected.
Why it matters · Bad data is worse than no data. Validation, tests, and freshness/volume/schema monitoring catch problems before they reach downstream consumers.
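The kinds of checks described above (not-null, uniqueness, value ranges) can be sketched as a plain Python function. This is a hand-rolled illustration, not the API of dbt tests or Great Expectations. The function name, arguments, and sample rows are hypothetical.

```python
def check_quality(rows, key, required, ranges):
    """Return a list of readable failures; an empty list means all checks passed."""
    failures = []
    seen = set()
    for i, row in enumerate(rows):
        # Not-null check on required columns.
        for col in required:
            if row.get(col) is None:
                failures.append(f"row {i}: {col} is null")
        # Uniqueness check on the key column.
        if row.get(key) in seen:
            failures.append(f"row {i}: duplicate {key}={row.get(key)}")
        seen.add(row.get(key))
        # Range check, skipped for nulls (already reported above).
        for col, (low, high) in ranges.items():
            value = row.get(col)
            if value is not None and not low <= value <= high:
                failures.append(f"row {i}: {col}={value} outside [{low}, {high}]")
    return failures


rows = [
    {"id": 1, "amount": 10.0},
    {"id": 2, "amount": -5.0},
    {"id": 2, "amount": None},
]
failures = check_quality(
    rows, key="id", required=["id", "amount"], ranges={"amount": (0, 1000)}
)
for failure in failures:
    print(failure)
```

Run as a pipeline step, a non-empty result would typically fail the run or route the batch to quarantine before anything is published.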
AI/ML data plumbing: vector databases, embeddings, RAG ingestion · Recommended · 2 links
Vector databases (such as Pinecone, pgvector, and Qdrant) store and index high-dimensional numerical embeddings, enabling fast approximate nearest-neighbor search used in semantic retrieval and recommendation systems. Embeddings are dense vector representations of text, images, or other data produced by machine learning models and used to capture semantic similarity. Retrieval-Augmented Generation (RAG) ingestion pipelines chunk source documents, compute embeddings, and upsert them into a vector store so that an LLM can retrieve relevant context at inference time.
Why it matters · In 2026, feeding LLM/RAG systems (embeddings, vector stores, feature pipelines) is becoming nearly as fundamental as relational data work, which makes it a real hiring edge.
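The chunk, embed, upsert, and retrieve steps can be sketched end to end with toy components. Everything here is an assumption made for illustration: `embed` is a bag-of-words counter standing in for a real embedding model, `store` is a dict standing in for a vector database, and the search is exact rather than approximate.

```python
import math


def chunk(text, size):
    """Split text into chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text, vocab):
    """Toy bag-of-words embedding; a real pipeline would call an embedding model."""
    words = text.lower().split()
    return [float(words.count(term)) for term in vocab]


def cosine(a, b):
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


vocab = ["spark", "shuffle", "kafka", "topic"]
doc = "spark shuffle moves data between partitions kafka topic stores ordered events"

store = {}  # chunk id -> (text, vector): stand-in for a vector database
for chunk_id, piece in enumerate(chunk(doc, size=6)):
    store[chunk_id] = (piece, embed(piece, vocab))  # upsert keyed by chunk id

# Retrieval: embed the query the same way and take the most similar chunk.
query_vec = embed("which kafka topic", vocab)
best_id = max(store, key=lambda cid: cosine(query_vec, store[cid][1]))
print(store[best_id][0])
```

Keying the upsert by a stable chunk id is what keeps re-ingestion of the same document from duplicating entries.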
Data governance, security & lineage; Infrastructure as Code (Terraform) · Optional · 2 links
Data governance covers the policies, access controls, and data-masking rules that ensure sensitive data is handled correctly across its lifecycle, including compliance with regulations such as GDPR and HIPAA. Data lineage tracks the origin and transformation path of each dataset, making it possible to trace errors and understand downstream impact of schema changes. Terraform is an Infrastructure as Code (IaC) tool that provisions and manages cloud resources (warehouses, storage buckets, IAM roles) through declarative configuration files checked into version control.
Why it matters · Regulated environments need access control, masking, lineage, and reproducible infra; valuable for mid/senior and platform-leaning roles.
Stage 06
Stage 6, Portfolio, end-to-end projects & interview prep
Prove the whole stack works together and get hired. Employers weight demonstrable end-to-end work and schema reasoning over certificates.
Build 3-5 end-to-end pipeline projects (ingest → warehouse → dbt → orchestrate) · Essential · 2 links
End-to-end pipeline projects demonstrate the full data engineering lifecycle: extracting data from a source, loading it into a cloud warehouse, applying dbt transformations across staging and mart layers, and orchestrating the workflow with a scheduler such as Airflow. Complete projects include architecture diagrams, documented design decisions, and infrastructure provisioned as code. Publishing these projects with source code and write-ups makes the engineering process visible and reviewable.
Why it matters · A portfolio of complete, documented pipelines (with architecture diagrams and tradeoff write-ups) is the strongest hiring signal, more than any course completion.
Interview prep: SQL + data modeling + pipeline system design · Essential · 2 links
Data engineering interviews typically include timed SQL and Python challenges, dimensional modeling exercises (designing a schema from a business scenario), and open-ended system design questions (designing an ingestion pipeline or warehouse architecture). Modeling exercises require declaring a grain, choosing SCD types, and defending star-schema decisions. System design rounds test the ability to reason about scalability, failure handling, latency, and cost tradeoffs out loud.
Why it matters · Loops are decided by timed SQL/Python and the 'design a schema / design this pipeline' rounds; practice articulating tradeoffs out loud.
Cost, performance & query optimization in practice · Recommended · 2 links
Query optimization in cloud warehouses involves reading execution plans (EXPLAIN output), eliminating full table scans through partitioning and clustering, and reducing data shuffled across the network. Cost controls include limiting bytes processed per query, caching results, choosing appropriate warehouse sizes, and scheduling heavy workloads during off-peak periods. Partition pruning, materialized views, and incremental models in dbt are practical techniques for balancing performance against spend.
Why it matters · Reading execution plans, pruning partitions, and controlling warehouse spend are what teams care about once you're on the job, and they make a strong senior-level talking point.
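Reading a plan before and after a physical-design change can be practiced locally. The sketch below uses SQLite's EXPLAIN QUERY PLAN as a small-scale analogy: an index plays the role that partitioning or clustering plays in a cloud warehouse. The `events` table and index name are hypothetical, and the exact plan wording varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (day TEXT, user_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(f"2024-01-{d:02d}", u, 1.0) for d in range(1, 11) for u in range(100)],
)

query = "SELECT SUM(amount) FROM events WHERE day = '2024-01-05'"


def plan(sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


plan_before = plan(query)  # typically a full scan of the table
conn.execute("CREATE INDEX idx_events_day ON events (day)")
plan_after = plan(query)   # typically a search using idx_events_day

print(plan_before)
print(plan_after)
```

The habit transfers directly: check whether the plan reads the whole table, change the layout or the filter, and confirm in the plan that less data is touched.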