Roadmap to a job

Data Scientist

A 2026 data scientist turns messy data into trustworthy decisions and, increasingly, shipped models.

6 stages · 24 skills · 49 free resources

Core stack

Python · pandas · NumPy · Jupyter · scikit-learn

  1. Stage 01

    Stage 1, Python & Programming Foundations

    Write clean, working Python and get comfortable in the data scientist's daily environment: notebooks, Git, and the terminal.

    Python fundamentals (data types, control flow, functions, modules, virtual environments) · Essential

    Python is a high-level, general-purpose programming language widely used in data science, machine learning, and automation. Its core concepts include built-in data types, conditional and loop constructs, reusable functions, importable modules, and isolated virtual environments for managing dependencies. These fundamentals form the foundation for every data manipulation and modeling task.

    Why it matters · The base language you will use for cleaning data, building models, and automating work.

    Git & GitHub · Essential

    Git is a distributed version control system that tracks changes to source files over time, enabling rollback and parallel development through branches. GitHub is a cloud-hosted platform built on Git that adds collaboration features such as pull requests, code review, and public project hosting. Together they are the standard toolchain for managing and sharing code.

    Why it matters · Version control is how you track your work, collaborate, and host the portfolio a recruiter will actually open.

    Jupyter notebooks & a code editor (VS Code) · Essential

    Jupyter notebooks are interactive documents that combine runnable code cells, rich text, and visualizations in a single file, making them the standard medium for iterative data exploration and analysis. VS Code is a lightweight but extensible code editor with debugging, linting, and Git integration suited for larger, more structured Python projects. Both tools serve complementary stages of a data science workflow.

    Why it matters · Notebooks are the standard medium for exploration; you need an editor once code grows beyond a notebook.

    Using an AI coding assistant effectively · Recommended

    AI coding assistants such as Claude and GitHub Copilot are tools that suggest, complete, and explain code inline within an editor or chat interface. They accelerate routine tasks like boilerplate generation, documentation, and debugging by surfacing context-aware suggestions. Effective use requires critically evaluating outputs rather than accepting them unchecked.

    Why it matters · Fluency with assistants like Claude or Copilot is now an expected productivity multiplier, but treat their output as a draft to verify, not a final answer.

  2. Stage 02

    Stage 2, SQL & Data Wrangling

    Pull, join, filter, and aggregate data from real databases, then clean and reshape it with pandas.

    SQL (SELECT, WHERE, JOIN, GROUP BY, aggregates, window functions, CTEs) · Essential

    SQL is a declarative language for querying and manipulating data stored in relational databases. Core clauses such as SELECT, WHERE, JOIN, and GROUP BY retrieve and filter rows, while window functions compute rolling aggregates and rankings without collapsing result sets. Common Table Expressions (CTEs) allow complex queries to be broken into readable, reusable named blocks.

    Why it matters · You cannot analyze data you cannot retrieve; window functions and CTEs are routine in technical interviews.
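
    As a rough illustration of a CTE feeding two window functions, here is a minimal sketch using Python's built-in sqlite3 module. The orders table and its rows are invented for this example, and it assumes the bundled SQLite is version 3.25 or newer, which is when window functions were added.

```python
import sqlite3

# Hypothetical in-memory table of orders, invented for this example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, order_day TEXT, amount REAL);
    INSERT INTO orders VALUES
        ('ana', '2026-01-01', 20.0),
        ('ana', '2026-01-05', 35.0),
        ('ben', '2026-01-02', 50.0),
        ('ben', '2026-01-03', 10.0),
        ('ben', '2026-01-09', 40.0);
""")

query = """
WITH big_orders AS (              -- CTE: a named, readable intermediate result
    SELECT customer, order_day, amount
    FROM orders
    WHERE amount >= 20
)
SELECT
    customer,
    order_day,
    amount,
    -- Window functions: computed per customer, but every row is kept
    SUM(amount) OVER (PARTITION BY customer ORDER BY order_day) AS running_total,
    RANK() OVER (PARTITION BY customer ORDER BY amount DESC) AS amount_rank
FROM big_orders
ORDER BY customer, order_day;
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

    Unlike GROUP BY, the query returns one row per qualifying order, each carrying its customer's running total and rank.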

    pandas (loading, cleaning, missing values, merging, grouping, reshaping) · Essential

    pandas is a Python library that provides DataFrame and Series objects for tabular data manipulation. It covers reading files and databases, handling missing values, merging datasets, grouping records for aggregation, and reshaping data between wide and long formats. It is the primary tool for preparing structured data before analysis or modeling.

    Why it matters · pandas is the workhorse for the tabular manipulation that every downstream model depends on.
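
    The operations named above can be sketched in a few lines. The two tables below are hypothetical, and filling the gap with the median is only one of several reasonable choices for handling a missing value.

```python
import pandas as pd

# Hypothetical data, invented for illustration.
sales = pd.DataFrame({
    "store_id": [1, 1, 2, 2, 3],
    "month": ["jan", "feb", "jan", "feb", "jan"],
    "revenue": [100.0, None, 80.0, 120.0, 60.0],
})
stores = pd.DataFrame({"store_id": [1, 2, 3], "region": ["north", "north", "south"]})

# Cleaning: fill the missing revenue with the column median.
sales["revenue"] = sales["revenue"].fillna(sales["revenue"].median())

# Merging: attach each store's region (like a SQL LEFT JOIN).
merged = sales.merge(stores, on="store_id", how="left")

# Grouping: total revenue per region.
by_region = merged.groupby("region")["revenue"].sum()

# Reshaping: long to wide, one row per store and one column per month.
wide = merged.pivot(index="store_id", columns="month", values="revenue")

print(by_region)
print(wide)
```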

    NumPy & vectorized operations · Essential

    NumPy is a Python library centered on the ndarray, a typed multi-dimensional array stored in contiguous memory. Vectorized operations apply arithmetic and mathematical functions across entire arrays at once using compiled C code, avoiding slow Python loops. pandas DataFrames are backed by NumPy arrays by default, and scikit-learn and many other machine learning libraries accept or build on them.

    Why it matters · NumPy arrays underpin pandas and scikit-learn, and vectorization is how you keep numerical Python fast.
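
    To make the contrast concrete, here is a small sketch that computes the same total with a Python loop and with a vectorized expression. The price and quantity arrays are invented for the example.

```python
import numpy as np

prices = np.array([10.0, 20.0, 30.0, 40.0])
quantities = np.array([1, 2, 3, 4])

# Loop version: one Python-level multiplication per element.
loop_total = 0.0
for p, q in zip(prices, quantities):
    loop_total += p * q

# Vectorized version: the multiply and the sum both run in compiled code.
vector_total = float(np.sum(prices * quantities))

# Broadcasting: a scalar is applied to every element with no explicit loop.
discounted = prices * 0.9

print(loop_total, vector_total)
print(discounted)
```

    Both totals agree. The difference is speed, which becomes noticeable once arrays hold many thousands of elements.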

  3. Stage 03

    Stage 3, Statistics & Probability

    Reason about uncertainty: distributions, sampling, hypothesis testing, correlation versus causation, and the intuition behind regression.

    Descriptive & inferential statistics (distributions, sampling, confidence intervals, hypothesis testing, p-values) · Essential

    Descriptive statistics summarize a dataset through measures such as mean, median, variance, and distribution shape, while inferential statistics use samples to draw conclusions about larger populations. Key concepts include sampling distributions, confidence intervals that give a range of plausible values for an estimate, and hypothesis tests whose p-values measure how surprising the observed data would be if the null hypothesis were true. These methods underpin valid interpretation of any analytical result.

    Why it matters · Core to drawing valid conclusions and to not being fooled by noise in the data.
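
    A minimal sketch of a confidence interval and a p-value, using only the standard library. The sample values and the null value of 2.5 seconds are invented, and the normal approximation is an assumption: with a sample this small, a t-distribution would give a slightly wider interval.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

# Hypothetical sample of page-load times in seconds, invented for this example.
sample = [2.1, 2.4, 1.9, 2.8, 2.3, 2.0, 2.6, 2.2, 2.5, 2.4,
          2.1, 2.7, 2.3, 2.2, 2.6, 2.0, 2.4, 2.5, 2.3, 2.2]

n = len(sample)
sample_mean = mean(sample)
std_error = stdev(sample) / sqrt(n)   # standard error of the mean

# 95% confidence interval (normal approximation, see the assumption above).
z = NormalDist().inv_cdf(0.975)
ci_low = sample_mean - z * std_error
ci_high = sample_mean + z * std_error

# Two-sided test of the null hypothesis "the true mean is 2.5 seconds".
null_mean = 2.5
z_stat = (sample_mean - null_mean) / std_error
p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))

print(f"mean = {sample_mean:.3f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f})")
print(f"p-value = {p_value:.4f}")
```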

    A/B testing & experiment design · Recommended

    A/B testing is a controlled experiment in which two or more variants of a product or feature are exposed to randomly assigned user groups to measure the causal effect of a change. Experiment design covers choosing sample sizes for adequate statistical power, selecting appropriate metrics, and avoiding confounding through randomization. Correct interpretation requires applying inferential statistics to separate real effects from noise.

    Why it matters · Product and business data science roles lean heavily on running and correctly interpreting controlled experiments.
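
    One common way to analyze a conversion-rate experiment is a two-proportion z-test. The sketch below is a simplified version under stated assumptions: the function name is hypothetical, the counts are invented, and it presumes users were randomly assigned and the samples are large enough for the normal approximation.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (absolute lift, two-sided p-value) for variant B versus A."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)   # shared rate under the null of no difference
    std_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z_stat = (rate_b - rate_a) / std_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))
    return rate_b - rate_a, p_value

# Hypothetical results, invented for illustration: 10,000 users per variant.
lift, p_value = two_proportion_z_test(conv_a=500, n_a=10_000, conv_b=580, n_b=10_000)
print(f"lift = {lift:.4f}, p-value = {p_value:.4f}")
```

    A small p-value suggests the difference is unlikely to be noise alone. Whether the lift is large enough to matter is a separate business question.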

    Linear algebra & calculus (intuition, not proofs) · Recommended

    Linear algebra studies vectors, matrices, and the transformations between them, forming the language in which data and model parameters are represented and manipulated. Calculus, particularly differentiation and gradients, describes how a function changes with respect to its inputs, which is how optimization algorithms adjust model weights during training. A conceptual grasp of these ideas explains the behavior of algorithms without requiring formal proofs.

    Why it matters · Vectors, matrices, and gradients explain how models actually learn; you need the intuition for ML depth, though application matters more than derivation for a generalist.
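
    To connect the two ideas, here is a minimal sketch of gradient descent fitting a linear model. The synthetic data, learning rate, and step count are all assumptions chosen so this toy problem converges, and are not recommended defaults.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data, invented for this example: y = 3*x1 - 2*x2 plus small noise.
X = rng.normal(size=(200, 2))                 # 200 rows, 2 features (a matrix)
true_w = np.array([3.0, -2.0])                # the weights we hope to recover
y = X @ true_w + rng.normal(scale=0.1, size=200)

w = np.zeros(2)                               # start with all weights at zero
learning_rate = 0.1

for _ in range(200):
    residual = X @ w - y                      # matrix-vector product gives predictions
    gradient = 2 * X.T @ residual / len(y)    # derivative of mean squared error w.r.t. w
    w -= learning_rate * gradient             # step against the gradient

print(w)
```

    The learned weights end up close to the true ones. The matrix product carries the data, and the gradient says which way to nudge each weight.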

  4. Stage 04

    Stage 4, Exploratory Data Analysis & Visualization

    Interrogate a dataset, surface patterns and data-quality problems, and communicate what you find to non-technical people.

    Exploratory Data Analysis (EDA) workflow · Essential

    Exploratory Data Analysis is a structured process of inspecting a new dataset to understand its distributions, relationships, missing values, and anomalies before building models. The workflow typically combines summary statistics, correlation checks, and visualizations to surface patterns and data quality issues early. Findings from EDA directly inform feature engineering and modeling choices.

    Why it matters · EDA is the first thing you do with any real dataset and it shapes every modeling choice that follows.
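
    A first pass over a new table might look like the sketch below. The dataset is hypothetical, and the 1.5 × IQR rule is just one common rule of thumb for flagging suspicious values.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset, invented for illustration.
df = pd.DataFrame({
    "age": [25, 32, 47, 51, np.nan, 38, 29, 120],   # 120 looks like a data-entry error
    "income": [30_000, 42_000, 61_000, 58_000, 45_000, np.nan, 35_000, 52_000],
    "segment": ["a", "b", "b", "a", "a", "b", "a", "b"],
})

summary = df.describe()                       # count, mean, std, quartiles per numeric column
missing = df.isna().sum()                     # missing values per column
correlations = df[["age", "income"]].corr()   # pairwise Pearson correlation

# Flag ages more than 1.5 * IQR outside the quartiles.
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)]

print(summary)
print(missing)
print(outliers)
```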

    Plotting libraries (Matplotlib, Seaborn, Plotly) · Essential

    Matplotlib is the foundational Python plotting library that renders static charts through a low-level, object-oriented API. Seaborn builds on Matplotlib with a higher-level interface tailored for statistical graphics such as distribution plots, heatmaps, and regression charts. Plotly produces interactive charts that can be embedded in notebooks, web applications, and dashboards.

    Why it matters · The standard Python stack for producing charts inside notebooks and reports.

    BI dashboards (Tableau or Power BI) · Recommended

    Tableau and Power BI are business intelligence platforms that let users connect to data sources, build interactive charts and filters, and publish shareable dashboards without writing code. They are the primary tools through which business stakeholders consume ongoing analysis and metrics. Both support direct database connections, scheduled refreshes, and role-based access control.

    Why it matters · BI tools are how many businesses consume analysis and they show up regularly in postings; learn one well rather than both.

  5. Checkpoint

    Don't wait, start applying

    You don't have to finish the path to begin. Early applications and interviews show you exactly what to learn next.

  6. Stage 05

    Stage 5, Machine Learning (Classical / Core)

    Build, validate, and tune supervised and unsupervised models end-to-end, and be able to explain why a model works or fails.

    Supervised learning + the ML workflow (train/test split, cross-validation, metrics: precision/recall/ROC-AUC, RMSE) · Essential

    Supervised learning trains a model on labeled examples to predict outputs for unseen inputs, covering classification and regression tasks. The standard workflow splits data into training and held-out test sets, uses cross-validation to tune hyperparameters without leaking test information, and measures performance with metrics suited to the task, such as precision, recall, and ROC-AUC for classifiers and RMSE for regressors. This end-to-end process is the core of applied machine learning.

    Why it matters · This is the day-to-day of model building and the backbone of nearly every ML interview.
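
    The workflow above could be sketched with scikit-learn roughly as follows. The data is synthetic, and the choice of logistic regression and the grid of C values are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic labeled data, generated for illustration only.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Hold out a test set that is never touched during tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Cross-validation on the training set alone chooses the regularization strength C.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="roc_auc",
)
search.fit(X_train, y_train)

# Evaluate once, at the end, on the held-out test set.
y_pred = search.predict(X_test)
y_score = search.predict_proba(X_test)[:, 1]
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_score)

print(search.best_params_, precision, recall, auc)
```

    The test set plays no part in choosing C, so the final scores stay an honest estimate of performance on unseen data.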

    Feature engineering & model evaluation/selection · Essential

    Feature engineering transforms raw variables into representations that better expose signal to a learning algorithm, including encoding categoricals, creating interaction terms, and scaling numeric inputs. Model evaluation compares candidate models using held-out data and appropriate metrics, while model selection chooses among algorithms and hyperparameter settings based on generalization performance rather than training error. These steps often have a larger practical impact on predictive quality than algorithm choice alone.

    Why it matters · Better features and honest evaluation move real-world performance far more than swapping one algorithm for another.

    Gradient boosting (XGBoost / LightGBM) · Recommended

    Gradient boosting is an ensemble method that builds a sequence of decision trees, each correcting the residual errors of the previous ones, combined through a gradient descent procedure in function space. XGBoost and LightGBM are optimized implementations that add regularization, efficient histogram-based splitting, and support for missing values. They are consistently among the top performers on structured, tabular datasets.

    Why it matters · These dominate structured and tabular problems and are the standard go-to for a strong baseline in industry.
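
    The residual-fitting idea can be shown with a from-scratch teaching sketch built on scikit-learn's DecisionTreeRegressor. This is not how XGBoost or LightGBM are implemented. The data, tree depth, learning rate, and number of rounds are assumptions, and the errors reported are training errors only.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic regression data, invented for this example.
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())        # start from a constant model
baseline_mse = np.mean((y - prediction) ** 2)
trees = []

for _ in range(100):
    # For squared error, the residual is the negative gradient (up to a constant factor).
    residual = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residual)                     # each tree learns what is still unexplained
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

boosted_mse = np.mean((y - prediction) ** 2)
print(baseline_mse, boosted_mse)
```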

    Unsupervised learning (clustering, PCA) · Recommended

    Unsupervised learning discovers structure in unlabeled data without a predefined target variable. Clustering algorithms such as k-means and DBSCAN group observations by similarity, while Principal Component Analysis (PCA) reduces the number of features by projecting data onto directions of maximum variance. These techniques are used for customer segmentation, anomaly detection, and preprocessing high-dimensional data.

    Why it matters · Segmentation and dimensionality reduction are common in real analyses and round out your ML toolkit.
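
    A short sketch of both techniques on synthetic data. The two groups are generated far apart on purpose so the result is easy to read; real data is rarely this clean.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data, invented for this example: two well-separated groups in 5 dimensions.
group_a = rng.normal(loc=0.0, scale=1.0, size=(100, 5))
group_b = rng.normal(loc=6.0, scale=1.0, size=(100, 5))
X = np.vstack([group_a, group_b])

# Clustering: assign each row to one of two clusters without using any labels.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# PCA: project the 5 features onto the 2 directions of greatest variance.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(np.bincount(labels))
print(pca.explained_variance_ratio_)
```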

    Deep learning (PyTorch), for CV/NLP-leaning roles · Optional

    Deep learning is a class of machine learning that uses multi-layer neural networks to learn hierarchical representations from raw data such as images, text, or audio. PyTorch is an open-source framework from Meta that provides automatic differentiation, GPU acceleration, and a dynamic computation graph for building and training neural networks. It is the dominant framework in research and is widely used in production for computer vision and natural language processing tasks.

    Why it matters · Powerful but role-specific; it appears in only a small slice of data-scientist postings, so skip it unless you target computer vision or NLP, since classical ML covers most generalist jobs.

  7. Stage 06

    Stage 6, Shipping: Deployment, Modern AI & Portfolio

    Take a model out of the notebook, add a working layer of GenAI, and prove all of it with two to three real, documented projects.

    Portfolio: 2-3 end-to-end projects on GitHub (with READMEs and write-ups) · Essential

    A data science portfolio on GitHub consists of complete projects that move from raw data through cleaning, exploration, modeling, and interpretation, each documented with a README and narrative write-up. End-to-end projects demonstrate the full analytical workflow rather than isolated code snippets. Clear documentation helps others understand the problem, methodology, and conclusions without running the code.

    Why it matters · Tangible, clearly explained projects are the single strongest hiring signal and the way you demonstrate every other skill at once.

    Model deployment basics (Streamlit / FastAPI, pickle/joblib) · Recommended

    Model deployment packages a trained machine learning model for use outside a notebook by saving it to disk with pickle or joblib and wrapping it in a web interface or API. Streamlit is a Python library for rapidly building interactive data apps, while FastAPI is a high-performance web framework for building REST endpoints. Together these tools bridge the gap between a trained model and a usable application.

    Why it matters · Turning a model into a shareable app or API is what makes a project feel real and signals job-readiness.
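
    The save-and-reload step might look like the sketch below, which uses joblib and a small scikit-learn model. The file name and the predict helper are hypothetical. The helper stands in for the function a Streamlit app or FastAPI route would call, and the web layer itself is not shown.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Train a small model on a dataset that ships with scikit-learn.
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "model.joblib"   # hypothetical file name
    joblib.dump(model, model_path)                # save the fitted model to disk
    loaded = joblib.load(model_path)              # what an app or API would do at startup

def predict(features):
    """Hypothetical helper: return the predicted class index for one row of features."""
    return int(loaded.predict([features])[0])

print(predict([5.1, 3.5, 1.4, 0.2]))
```

    Only load pickle or joblib files from sources you trust, because loading one can execute arbitrary code.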

    MLOps basics (Docker, experiment tracking with MLflow) · Recommended

    MLOps practices apply software engineering discipline to machine learning workflows to improve reproducibility and reliability. Docker packages an application and its dependencies into a portable container image that runs consistently across environments. MLflow is an open-source platform for logging parameters, metrics, and artifacts from training runs and comparing experiments in a unified tracking UI.

    Why it matters · Reproducibility, containers, and experiment tracking increasingly distinguish hireable candidates and tend to command higher pay.

    GenAI / LLMs / RAG (prompting, embeddings, vector search, basic orchestration) · Recommended

    Large language models (LLMs) are neural networks trained on text at scale that generate coherent prose, answer questions, and follow instructions. Retrieval-Augmented Generation (RAG) is a pattern that grounds LLM responses in external documents by converting text to dense vector embeddings, storing them in a vector search index, and retrieving relevant chunks at query time. Orchestration can be handled by a framework such as LangChain or by direct API calls that coordinate prompting, retrieval, and response assembly into a pipeline.

    Why it matters · Retrieval-augmented generation is the dominant enterprise AI pattern in 2026 and GenAI fluency is shifting from differentiator toward baseline expectation.
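
    The retrieval half of RAG can be sketched with the standard library alone. Everything here is a toy stand-in: the embed function uses word counts rather than a real embedding model, the list plays the role of a vector index, the document chunks are invented, and the final prompt would be sent to an LLM API that is not shown.

```python
import math
from collections import Counter

# Hypothetical document chunks, invented for this example.
chunks = [
    "Refunds are issued within 14 days of purchase.",
    "Our office is closed on public holidays.",
    "Premium plans include priority support by email.",
]

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    words = text.lower().replace(".", "").replace("?", "").split()
    return Counter(words)

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Stand-in for a vector index: each chunk stored next to its vector.
index = [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(question, k=1):
    """Return the k chunks most similar to the question."""
    query_vec = embed(question)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

question = "How many days do refunds take?"
context = retrieve(question)
prompt = f"Answer using only this context:\n{context[0]}\n\nQuestion: {question}"
print(prompt)
```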

    One cloud platform (AWS, Azure, or GCP), foundational fluency · Optional

    AWS, Azure, and GCP are the three dominant public cloud platforms, each offering compute, storage, managed databases, and machine learning services accessible via APIs and web consoles. Foundational fluency covers navigating the console, running virtual machines or containers, managing storage buckets, and understanding identity and access controls. Most data science infrastructure runs on one of these platforms, so familiarity with core services is a practical necessity.

    Why it matters · A cloud platform shows up in many postings, but learn the job-specific one after the core stack rather than trying to cover all three up front.

    Communication & data storytelling · Essential

    Data storytelling is the practice of presenting analytical findings through a structured narrative that pairs visualizations with clear explanations tailored to a non-technical audience. It involves choosing the right chart type, reducing visual clutter, leading with the key insight, and connecting results to business decisions. Strong communication ensures that analytical work informs action rather than remaining confined to a notebook.

    Why it matters · Explaining results to non-technical stakeholders is what gets analysis adopted, and technical skill alone rarely gets you hired or promoted.

  8. Land the job

    Turn these skills into offers

    ResuMax takes you from skilled to hired: a resume that proves it, applications tailored per role, and interview reps.
