Roadmap to a job

Data Scientist

Train on this path

A 2026 data scientist turns messy data into trustworthy decisions and, increasingly, shipped models.

6 stages · 24 skills · 49 free resources

Core stack

Python · pandas · NumPy · Jupyter · scikit-learn


Stage 01
Stage 1, Python & Programming Foundations
Write clean, working Python and get comfortable in the data scientist's daily environment: notebooks, Git, and the terminal.
Python fundamentals (data types, control flow, functions, modules, virtual environments) · Essential · 3 links
Python is a high-level, general-purpose programming language widely used in data science, machine learning, and automation. Its core concepts include built-in data types, conditional and loop constructs, reusable functions, importable modules, and isolated virtual environments for managing dependencies. These fundamentals form the foundation for every data manipulation and modeling task.
Why it matters · The base language you will use for cleaning data, building models, and automating work.
docs: The Official Python Tutorial · course: Kaggle Learn, Python · video: freeCodeCamp, Learn Python for Data Science (full video course)
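As a rough illustration of the fundamentals listed for this skill, the sketch below combines built-in data types, a loop with a conditional, a reusable function, and an imported standard-library module. The function name `word_lengths` and the sample list are made up for this example.

```python
from collections import Counter  # an importable standard-library module


def word_lengths(words: list[str]) -> dict[str, int]:
    """Map each non-empty word to its length, skipping blank entries."""
    lengths = {}
    for word in words:           # loop construct
        if not word.strip():     # conditional: ignore empty strings
            continue
        lengths[word] = len(word)
    return lengths


# Hypothetical sample data: a list of strings with one blank and one repeat
raw = ["pandas", "", "numpy", "jupyter", "numpy"]

lengths = word_lengths(raw)              # dict keyed by word
counts = Counter(w for w in raw if w)    # how often each word appears
longest = max(lengths, key=lengths.get)  # key with the largest value
```

Virtual environments are not shown here because they are created from the terminal rather than inside a script.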
Git & GitHub · Essential · 2 links
Git is a distributed version control system that tracks changes to source files over time, enabling rollback and parallel development through branches. GitHub is a cloud-hosted platform built on Git that adds collaboration features such as pull requests, code review, and public project hosting. Together they are the standard toolchain for managing and sharing code.
Why it matters · Version control is how you track your work, collaborate, and host the portfolio a recruiter will actually open.
docs: Pro Git (free official book) · docs: GitHub, Start Your Journey
Jupyter notebooks & a code editor (VS Code) · Essential · 2 links
Jupyter notebooks are interactive documents that combine runnable code cells, rich text, and visualizations in a single file, making them the standard medium for iterative data exploration and analysis. VS Code is a lightweight but extensible code editor with debugging, linting, and Git integration suited for larger, more structured Python projects. Both tools serve complementary stages of a data science workflow.
Why it matters · Notebooks are the standard medium for exploration; you need an editor once code grows beyond a notebook.
docs: Jupyter Documentation · docs: VS Code, Data Science in VS Code
Using an AI coding assistant effectively · Recommended · 1 link
AI coding assistants such as Claude and GitHub Copilot are tools that suggest, complete, and explain code inline within an editor or chat interface. They accelerate routine tasks like boilerplate generation, documentation, and debugging by surfacing context-aware suggestions. Effective use requires critically evaluating outputs rather than accepting them unchecked.
Why it matters · Fluency with assistants like Claude or Copilot is now an expected productivity multiplier, but treat their output as a draft to verify, not a final answer.
docs: GitHub Copilot, Get Started
Stage 02
Stage 2, SQL & Data Wrangling
Pull, join, filter, and aggregate data from real databases, then clean and reshape it with pandas.
SQL (SELECT, WHERE, JOIN, GROUP BY, aggregates, window functions, CTEs) · Essential · 3 links
SQL is a declarative language for querying and manipulating data stored in relational databases. Core clauses such as SELECT, WHERE, JOIN, and GROUP BY retrieve and filter rows, while window functions compute rolling aggregates and rankings without collapsing result sets. Common Table Expressions (CTEs) allow complex queries to be broken into readable, reusable named blocks.
Why it matters · You cannot analyze data you cannot retrieve; window functions and CTEs are routine in technical interviews.
course: Kaggle Learn, Intro to SQL · course: Kaggle Learn, Advanced SQL (joins, window functions) · docs: PostgreSQL, The SQL Language Tutorial
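To make the SQL clauses above concrete, here is a small sketch that runs a CTE plus a window function against an in-memory SQLite database from Python. The `customers` and `orders` tables and their rows are invented for illustration, and the window function assumes the bundled SQLite is version 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and rows, created only for this example
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'north'), (2, 'north'), (3, 'south');
    INSERT INTO orders VALUES
        (1, 1, 50.0), (2, 1, 30.0), (3, 2, 120.0), (4, 3, 70.0);
""")

query = """
WITH customer_totals AS (          -- CTE: one row per customer
    SELECT c.id, c.region, SUM(o.amount) AS total
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    WHERE o.amount > 0
    GROUP BY c.id, c.region
)
SELECT id, region, total,
       -- window function: rank customers within each region, keeping every row
       RANK() OVER (PARTITION BY region ORDER BY total DESC) AS region_rank
FROM customer_totals
ORDER BY region, region_rank;
"""

rows = conn.execute(query).fetchall()
conn.close()
```

Each row in `rows` keeps its customer-level detail while also carrying its rank inside the region, which is the "without collapsing result sets" behavior described above.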
pandas (loading, cleaning, missing values, merging, grouping, reshaping) · Essential · 3 links
pandas is a Python library that provides DataFrame and Series objects for tabular data manipulation. It covers reading files and databases, handling missing values, merging datasets, grouping records for aggregation, and reshaping data between wide and long formats. It is the primary tool for preparing structured data before analysis or modeling.
Why it matters · pandas is the workhorse for the tabular manipulation that every downstream model depends on.
course: Kaggle Learn, Pandas · docs: pandas, Getting Started · course: Kaggle Learn, Data Cleaning
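The pandas operations named for this skill can be sketched on two tiny DataFrames. The column names and values are hypothetical, and filling the gap with the median is just one possible imputation choice, not a recommendation for every dataset.

```python
import pandas as pd

# Hypothetical data: one order has a missing amount
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "amount": [50.0, None, 120.0, 70.0],
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "region": ["north", "north", "south"],
})

# Missing values: fill the gap with the column median (illustrative choice)
orders["amount"] = orders["amount"].fillna(orders["amount"].median())

# Merging: attach each customer's region to their orders
merged = orders.merge(customers, on="customer_id", how="left")

# Grouping: total amount per region
region_totals = merged.groupby("region")["amount"].sum()

# Reshaping: long -> wide with pivot_table, then wide -> long with melt
wide = merged.pivot_table(
    index="region", columns="customer_id", values="amount", aggfunc="sum"
)
long_form = (
    wide.reset_index()
    .melt(id_vars="region", var_name="customer_id", value_name="amount")
    .dropna()
)
```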
NumPy & vectorized operations · Essential · 1 link
NumPy is a Python library centered on the ndarray, a typed multi-dimensional array stored in contiguous memory. Vectorized operations apply arithmetic and mathematical functions across entire arrays at once using compiled C code, avoiding slow Python loops. pandas DataFrames and most machine learning libraries are built on top of NumPy arrays.
Why it matters · NumPy arrays underpin pandas and scikit-learn, and vectorization is how you keep numerical Python fast.
docs: NumPy, The Absolute Beginner's Guide
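A minimal sketch of what "vectorized" means in practice: the same revenue total computed with a Python loop and with a single array expression. The prices and quantities are made-up numbers.

```python
import numpy as np

# Hypothetical inputs
prices = np.array([10.0, 12.5, 8.0, 20.0])
quantities = np.array([3, 1, 4, 2])

# Loop version: one Python-level multiplication per element
loop_total = 0.0
for price, qty in zip(prices, quantities):
    loop_total += price * qty

# Vectorized version: elementwise multiply and sum run in compiled code
vector_total = float(np.sum(prices * quantities))

# Broadcasting: the scalar 0.9 is applied to every element at once
discounted = prices * 0.9
```

On four elements the speed difference is negligible; the gap becomes noticeable as arrays grow to thousands or millions of values.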
Build it · DataLab: Clean and Chart a Real Dataset · beginner · 6-10 hours
Stage 03
Stage 3, Statistics & Probability
Reason about uncertainty: distributions, sampling, hypothesis testing, correlation versus causation, and the intuition behind regression.
Descriptive & inferential statistics (distributions, sampling, confidence intervals, hypothesis testing, p-values) · Essential · 3 links
Descriptive statistics summarize a dataset through measures such as mean, median, variance, and distribution shape, while inferential statistics use samples to draw conclusions about larger populations. Key concepts include sampling distributions, confidence intervals that express the uncertainty around an estimate, and hypothesis tests whose p-values give the probability of seeing data at least as extreme as the observed data if the null hypothesis were true. These methods underpin valid interpretation of any analytical result.
Why it matters · Core to drawing valid conclusions and to not being fooled by noise in the data.
course: Khan Academy, Statistics and Probability · article: OpenIntro Statistics (free textbook) · course: Seeing Theory, A Visual Intro to Probability & Statistics
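The confidence interval and p-value ideas above can be sketched with the standard library alone. The sample values and the null mean of 12.0 are invented, and the normal approximation is a simplifying assumption: with only ten observations a t-distribution would usually be the better choice.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

# Hypothetical measurements
sample = [12.1, 11.8, 12.4, 12.0, 12.6, 11.9, 12.3, 12.2, 12.5, 12.0]

n = len(sample)
sample_mean = mean(sample)
std_error = stdev(sample) / sqrt(n)  # standard error of the mean

# 95% confidence interval using the normal approximation (z is about 1.96)
z_crit = NormalDist().inv_cdf(0.975)
ci_low = sample_mean - z_crit * std_error
ci_high = sample_mean + z_crit * std_error

# Two-sided test of the null hypothesis "the true mean is 12.0"
null_mean = 12.0
z_stat = (sample_mean - null_mean) / std_error
p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))
```

A small `p_value` means data this far from the null mean would be unlikely if the null were true; it is not the probability that the null is true.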
A/B testing & experiment design · Recommended · 1 link
A/B testing is a controlled experiment in which two or more variants of a product or feature are exposed to randomly assigned user groups to measure the causal effect of a change. Experiment design covers choosing sample sizes for adequate statistical power, selecting appropriate metrics, and avoiding confounding through randomization. Correct interpretation requires applying inferential statistics to separate real effects from noise.
Why it matters · Product and business data science roles lean heavily on running and correctly interpreting controlled experiments.
course: Udacity, A/B Testing (free course)
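One common way to interpret an A/B test on conversion rates is a two-proportion z-test. The sketch below implements it with the standard library; the function name `two_proportion_z_test` and the visitor and conversion counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical experiment: 4% vs 5% conversion on 5,000 users per variant
z, p_value = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=250, n_b=5000)
```

The test only speaks to noise versus signal; whether the groups were truly randomized and the sample size was fixed in advance still has to come from the experiment design.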
Linear algebra & calculus (intuition, not proofs) · Recommended · 2 links
Linear algebra studies vectors, matrices, and the transformations between them, forming the language in which data and model parameters are represented and manipulated. Calculus, particularly differentiation and gradients, describes how a function changes with respect to its inputs, which is how optimization algorithms adjust model weights during training. A conceptual grasp of these ideas explains the behavior of algorithms without requiring formal proofs.
Why it matters · Vectors, matrices, and gradients explain how models actually learn; you need the intuition for ML depth, though application matters more than derivation for a generalist.
video: 3Blue1Brown, Essence of Linear Algebra (YouTube series) · course: Khan Academy, Linear Algebra
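To connect matrices and gradients to how a model learns, the sketch below fits a two-weight linear model with plain gradient descent. The synthetic data, the true weights, the learning rate, and the step count are all assumptions chosen so the example converges quickly.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))    # data matrix: 200 rows, 2 features
true_w = np.array([2.0, -3.0])   # weights we hope to recover
y = X @ true_w                   # noiseless targets via a matrix-vector product

w = np.zeros(2)                  # start from all-zero weights
learning_rate = 0.1

for _ in range(500):
    residual = X @ w - y                   # prediction error per row
    grad = 2 * X.T @ residual / len(y)     # gradient of mean squared error w.r.t. w
    w -= learning_rate * grad              # step against the gradient
```

After the loop, `w` should sit very close to `true_w`. The same pattern of computing a gradient and stepping against it is what deep learning frameworks automate at much larger scale.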
Build it · Custom Image Classifier, Shipped · intermediate · 8-14 hours
Stage 04
Stage 4, Exploratory Data Analysis & Visualization
Interrogate a dataset, surface patterns and data-quality problems, and communicate what you find to non-technical people.
Exploratory Data Analysis (EDA) workflow · Essential · 2 links
Exploratory Data Analysis is a structured process of inspecting a new dataset to understand its distributions, relationships, missing values, and anomalies before building models. The workflow typically combines summary statistics, correlation checks, and visualizations to surface patterns and data quality issues early. Findings from EDA directly inform feature engineering and modeling choices.
Why it matters · EDA is the first thing you do with any real dataset and it shapes every modeling choice that follows.
course: Kaggle Learn, Data Visualization · course: freeCodeCamp, Data Analysis with Python (free certification)
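A first EDA pass often looks something like the sketch below: summary statistics, a missing-value count, a correlation check, and a simple outlier flag. The DataFrame is invented, and the 1.5 × IQR rule is one conventional heuristic rather than the only way to flag anomalies.

```python
import pandas as pd

# Hypothetical dataset with one missing age and one extreme income
df = pd.DataFrame({
    "age": [23, 35, 41, None, 29, 52],
    "income": [31000, 52000, 61000, 45000, 39000, 250000],
    "segment": ["a", "b", "b", "a", "a", "b"],
})

summary = df.describe()               # count, mean, std, quartiles per numeric column
missing = df.isna().sum()             # missing values per column
corr = df[["age", "income"]].corr()   # pairwise correlation of numeric columns

# Flag unusually high incomes with the 1.5 * IQR rule
q1, q3 = df["income"].quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = df[df["income"] > upper_fence]
```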
Plotting libraries (Matplotlib, Seaborn, Plotly) · Essential · 2 links
Matplotlib is the foundational Python plotting library that renders static charts through a low-level, object-oriented API. Seaborn builds on Matplotlib with a higher-level interface tailored for statistical graphics such as distribution plots, heatmaps, and regression charts. Plotly produces interactive charts that can be embedded in notebooks, web applications, and dashboards.
Why it matters · The standard Python stack for producing charts inside notebooks and reports.
docs: Matplotlib, Quick Start Guide · docs: seaborn, An Introduction
BI dashboards (Tableau or Power BI) · Recommended · 2 links
Tableau and Power BI are business intelligence platforms that let users connect to data sources, build interactive charts and filters, and publish shareable dashboards without writing code. They are the primary tools through which business stakeholders consume ongoing analysis and metrics. Both support direct database connections, scheduled refreshes, and role-based access control.
Why it matters · BI tools are how many businesses consume analysis and they show up regularly in postings; learn one well rather than both.
course: Microsoft Learn, Get Started with Power BI · video: Tableau, Free Training Videos
Build it · Build Your Own LLM From Scratch · advanced · 30-50 hours
Checkpoint
Don't wait, start applying
You don't have to finish the path to begin. Early applications and interviews show you exactly what to learn next.
Stage 05
Stage 5, Machine Learning (Classical / Core)
Build, validate, and tune supervised and unsupervised models end-to-end, and be able to explain why a model works or fails.
Supervised learning + the ML workflow (train/test split, cross-validation, metrics: precision/recall/ROC-AUC, RMSE) · Essential · 3 links
Supervised learning trains a model on labeled examples to predict outputs for unseen inputs, covering classification and regression tasks. The standard workflow splits data into training and held-out test sets, uses cross-validation on the training data to tune hyperparameters without leaking test information, and measures performance with metrics suited to the task, such as precision, recall, and ROC-AUC for classifiers and RMSE for regressors. This end-to-end process is the core of applied machine learning.
Why it matters · This is the day-to-day of model building and the backbone of nearly every ML interview.
course: Google, Machine Learning Crash Course · course: Kaggle Learn, Intro to Machine Learning · docs: scikit-learn, Getting Started
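The workflow described for this skill might look roughly like this in scikit-learn. The dataset is synthetic, and logistic regression is just a convenient stand-in classifier chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic binary classification data (assumption: stands in for a real dataset)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Hold out 20% as a test set that is touched only once, at the end
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation on the training data only, so no test information leaks
cv_auc = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc")

# Final fit on all training data, then a single evaluation on the held-out set
model.fit(X_train, y_train)
pred = model.predict(X_test)
proba = model.predict_proba(X_test)[:, 1]
metrics = {
    "precision": precision_score(y_test, pred),
    "recall": recall_score(y_test, pred),
    "roc_auc": roc_auc_score(y_test, proba),
}
```

Comparing `cv_auc.mean()` with `metrics["roc_auc"]` is a quick sanity check: a large gap can hint at leakage or an unlucky split.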
Feature engineering & model evaluation/selection · Essential · 2 links
Feature engineering transforms raw variables into representations that better expose signal to a learning algorithm, including encoding categoricals, creating interaction terms, and scaling numeric inputs. Model evaluation compares candidate models using held-out data and appropriate metrics, while model selection chooses among algorithms and hyperparameter settings based on generalization performance rather than training error. These steps have a larger practical impact on predictive quality than algorithm choice alone.
Why it matters · Better features and honest evaluation move real-world performance far more than swapping one algorithm for another.
course: Kaggle Learn, Feature Engineering · course: Kaggle Learn, Intermediate Machine Learning
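As a small sketch of the feature engineering steps mentioned above, the code below derives a ratio feature, one-hot encodes a categorical column, and scales the numeric columns. The column names and values are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical customer table
df = pd.DataFrame({
    "plan": ["free", "pro", "pro", "team"],
    "seats": [1, 5, 3, 20],
    "monthly_spend": [0.0, 50.0, 30.0, 400.0],
})

# Derived feature: combine two raw columns into a ratio
df["spend_per_seat"] = df["monthly_spend"] / df["seats"]

preprocess = ColumnTransformer([
    # Encode the categorical column as one indicator column per category
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["plan"]),
    # Scale numeric columns to zero mean and unit variance
    ("numeric", StandardScaler(), ["seats", "monthly_spend", "spend_per_seat"]),
])

features = preprocess.fit_transform(df)  # 3 one-hot columns + 3 scaled columns
```

In a real project the transformer would be fit on training data only and then applied to validation and test data, which keeps the evaluation honest.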
Gradient boosting (XGBoost / LightGBM) · Recommended · 2 links
Gradient boosting is an ensemble method that builds a sequence of decision trees, each correcting the residual errors of the previous ones, combined through a gradient descent procedure in function space. XGBoost and LightGBM are optimized implementations that add regularization, efficient histogram-based splitting, and support for missing values. They consistently achieve top performance on structured and tabular datasets.
Why it matters · These dominate structured and tabular problems and are the standard go-to for a strong baseline in industry.
docs: XGBoost, Get Started · docs: LightGBM, Documentation
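The core idea of boosting, each tree fitting the errors left by the previous ones, can be shown with a hand-rolled loop. This is a teaching sketch built on scikit-learn's `DecisionTreeRegressor`, not the XGBoost or LightGBM implementation, and it omits their regularization and histogram-based splitting. The data and hyperparameters are made up.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data: a noisy sine curve
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())          # start from a constant model
baseline_mse = np.mean((y - prediction) ** 2)

trees = []
for _ in range(100):
    # For squared error, the residual is the negative gradient (up to a constant)
    residual = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residual)
    prediction += learning_rate * tree.predict(X)  # small step toward the residual
    trees.append(tree)

boosted_mse = np.mean((y - prediction) ** 2)    # training error after boosting
```

Here `boosted_mse` is measured on the training data, so it only shows that the ensemble is learning; judging real performance still needs held-out data.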
Unsupervised learning (clustering, PCA) · Recommended · 1 link
Unsupervised learning discovers structure in unlabeled data without a predefined target variable. Clustering algorithms such as k-means and DBSCAN group observations by similarity, while Principal Component Analysis (PCA) reduces the number of features by projecting data onto directions of maximum variance. These techniques are used for customer segmentation, anomaly detection, and preprocessing high-dimensional data.
Why it matters · Segmentation and dimensionality reduction are common in real analyses and round out your ML toolkit.
docs: scikit-learn, Clustering User Guide
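A minimal sketch of both techniques on synthetic data: k-means recovers two well-separated groups, and PCA compresses five features into two. The blobs are generated for the example, and real data is rarely this cleanly separated.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Two synthetic groups of 50 points each in 5 dimensions
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(50, 5))
blob_b = rng.normal(loc=5.0, scale=0.5, size=(50, 5))
X = np.vstack([blob_a, blob_b])

# Clustering: assign each row to one of two groups, with no labels provided
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# PCA: project onto the two directions of greatest variance
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
explained = pca.explained_variance_ratio_  # share of variance kept per component
```

Because the blobs differ mainly along one direction, the first entry of `explained` should account for most of the variance.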
Deep learning (PyTorch), for CV/NLP-leaning roles · Optional · 2 links
Deep learning is a class of machine learning that uses multi-layer neural networks to learn hierarchical representations from raw data such as images, text, or audio. PyTorch is an open-source framework from Meta that provides automatic differentiation, GPU acceleration, and a dynamic computation graph for building and training neural networks. It is widely used in both research and production for computer vision and natural language processing tasks.
Why it matters · Powerful but role-specific; it appears in only a small slice of data-scientist postings, so skip it unless you target computer vision or NLP, since classical ML covers most generalist jobs.
docs: PyTorch, Learn the Basics · course: fast.ai, Practical Deep Learning for Coders
Build it · RAG Over Your Notes · advanced · weekend
Stage 06
Stage 6, Shipping: Deployment, Modern AI & Portfolio
Take a model out of the notebook, add a working layer of GenAI, and prove all of it with two to three real, documented projects.
Portfolio: 2-3 end-to-end projects on GitHub (with READMEs and write-ups) · Essential · 2 links
A data science portfolio on GitHub consists of complete projects that move from raw data through cleaning, exploration, modeling, and interpretation, each documented with a README and narrative write-up. End-to-end projects demonstrate the full analytical workflow rather than isolated code snippets. Clear documentation helps others understand the problem, methodology, and conclusions without running the code.
Why it matters · Tangible, clearly explained projects are the single strongest hiring signal and the way you demonstrate every other skill at once.
project: Kaggle, Datasets (project data) · project: Awesome Public Datasets (GitHub)
Model deployment basics (Streamlit / FastAPI, pickle/joblib) · Recommended · 2 links
Model deployment packages a trained machine learning model for use outside a notebook by saving it to disk with pickle or joblib and wrapping it in a web interface or API. Streamlit is a Python library for rapidly building interactive data apps, while FastAPI is a high-performance web framework for building REST endpoints. Together these tools bridge the gap between a trained model and a usable application.
Why it matters · Turning a model into a shareable app or API is what makes a project feel real and signals job-readiness.
docs: Streamlit, Get Started · docs: FastAPI, Tutorial
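The save-and-reload step at the heart of deployment can be sketched as follows. The file location and the `predict` helper are hypothetical, and the web layer is deliberately left out: `predict` is the kind of plain function a Streamlit app or FastAPI route could call.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Train a small model (stand-in for whatever was built in the notebook)
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Save to disk; the temporary directory is a hypothetical location
model_path = Path(tempfile.mkdtemp()) / "model.pkl"
with model_path.open("wb") as f:
    pickle.dump(model, f)

# Later, in the serving process: load the artifact back.
# Only unpickle files you trust, since pickle can execute arbitrary code.
with model_path.open("rb") as f:
    loaded = pickle.load(f)


def predict(features: list[float]) -> int:
    """Return the predicted class index for one row of four iris measurements."""
    return int(loaded.predict([features])[0])


same = bool((loaded.predict(X) == model.predict(X)).all())
```

The reloaded model should reproduce the original predictions exactly, provided the same scikit-learn version is installed in both places.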
MLOps basics (Docker, experiment tracking with MLflow) · Recommended · 2 links
MLOps practices apply software engineering discipline to machine learning workflows to improve reproducibility and reliability. Docker packages an application and its dependencies into a portable container image that runs consistently across environments. MLflow is an open-source platform for logging parameters, metrics, and artifacts from training runs and comparing experiments in a unified tracking UI.
Why it matters · Reproducibility, containers, and experiment tracking increasingly distinguish hireable candidates and tend to command higher pay.
docs: Docker, Get Started · docs: MLflow, Documentation
GenAI / LLMs / RAG (prompting, embeddings, vector search, basic orchestration) · Recommended · 2 links
Large language models (LLMs) are neural networks trained on text at scale that generate coherent prose, answer questions, and follow instructions. Retrieval-Augmented Generation (RAG) is a pattern that grounds LLM responses in external documents by converting text to dense vector embeddings, storing them in a vector search index, and retrieving relevant chunks at query time. Basic orchestration tools such as LangChain or direct API calls coordinate prompting, retrieval, and response assembly into a pipeline.
Why it matters · Retrieval-augmented generation is the dominant enterprise AI pattern in 2026 and GenAI fluency is shifting from differentiator toward baseline expectation.
course: Hugging Face, LLM Course · course: Microsoft, Generative AI for Beginners
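The retrieval half of RAG can be illustrated without any external service. In the toy sketch below, a bag-of-words count vector stands in for a real dense embedding model and a Python list stands in for a vector index; both are simplifying assumptions, and the documents and question are invented. No LLM is called; the final `prompt` string is what would be sent to one.

```python
import math
from collections import Counter

# Hypothetical document chunks
documents = [
    "pandas merges and groups tabular data",
    "docker packages an application into a container image",
    "window functions rank rows in sql",
]


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# "Index": each chunk stored next to its vector
index = [(doc, embed(doc)) for doc in documents]


def retrieve(question: str, k: int = 1) -> list[str]:
    """Return the k chunks most similar to the question."""
    query_vec = embed(question)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]


question = "how do i rank rows in sql"
context = retrieve(question)
# Grounded prompt that would be passed to an LLM of your choice
prompt = f"Answer using only this context:\n{context[0]}\n\nQuestion: {question}"
```

Swapping `embed` for a real embedding model and `index` for a vector store changes the quality of retrieval, but the shape of the pipeline stays the same.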
One cloud platform (AWS, Azure, or GCP), foundational fluency · Optional · 2 links
AWS, Azure, and GCP are the three dominant public cloud platforms, each offering compute, storage, managed databases, and machine learning services accessible via APIs and web consoles. Foundational fluency covers navigating the console, running virtual machines or containers, managing storage buckets, and understanding identity and access controls. Most data science infrastructure runs on one of these platforms, so familiarity with core services is a practical necessity.
Why it matters · A cloud platform shows up in many postings, but learn the job-specific one after the core stack rather than trying to cover all three up front.
course: Google Skills (Cloud Skills Boost), free training · course: Microsoft Learn, Get Started with AI on Azure
Communication & data storytelling · Essential · 2 links
Data storytelling is the practice of presenting analytical findings through a structured narrative that pairs visualizations with clear explanations tailored to a non-technical audience. It involves choosing the right chart type, reducing visual clutter, leading with the key insight, and connecting results to business decisions. Strong communication ensures that analytical work informs action rather than remaining confined to a notebook.
Why it matters · Explaining results to non-technical stakeholders is what gets analysis adopted, and technical skill alone rarely gets you hired or promoted.
article: Storytelling with Data, Chart Guide · article: Storytelling with Data, Blog
Build it · Fine-Tune and Serve a Domain LLM · advanced · 12-20 hours
Land the job
Turn these skills into offers
ResuMax takes you from skilled to hired: a resume that proves it, applications tailored per role, and interview reps.
Build a resume that proves these skills · In ResuMax · Open builder
Tailor it to each Data Science posting · In ResuMax · Tailor
Apply to Data Science jobs matched to you · In ResuMax · Browse jobs
Practice Data Science interviews · In ResuMax · Start prep
