90-Day Foundation Models & Gen AI Roadmap
A structured 90-day study plan to master transformers, diffusion models, and large language models. Join me as I work through this roadmap and share my learnings.
I’m working through this roadmap to deeply understand foundation models, from attention mechanisms to diffusion models to modern LLMs.
Time Commitment: 3-4 hours/day on weekdays, 5-6 hours/day on weekends
Total Hours: ~320 hours
Why This Roadmap?
As a data engineer transitioning into AI, I wanted a structured path that goes beyond surface-level tutorials. This roadmap covers the foundational concepts that power today’s AI systems, with a focus on actually implementing things from scratch.
Phase Overview
| Phase | Weeks | Focus Area | Hours |
|---|---|---|---|
| Phase 1 | 1-3 | Transformers & Attention Mechanisms | ~70 hrs |
| Phase 2 | 4-5 | Diffusion Models | ~50 hrs |
| Phase 3 | 6-7 | Foundation Model Architectures (Language) | ~50 hrs |
| Phase 4 | 8-9 | Vision Transformers & VL Models | ~50 hrs |
| Phase 5 | 10-11 | Training & Inference Optimization | ~50 hrs |
| Phase 6 | 12-13 | Evaluation, Ethics & Advanced Topics | ~50 hrs |
Phase 1: Transformers & Attention (Weeks 1-3)
Week 1: Attention Fundamentals
Days 1-2: Attention Origins
- Encoder-decoder architecture limitations
- Bahdanau & Luong attention mechanisms
Days 3-4: The Transformer Architecture
- Multi-head self-attention
- Positional encodings (see the sketch below)
- Layer normalization & residual connections
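To make the positional-encoding bullet concrete, here’s a minimal sketch of the sinusoidal encodings from the original paper. I’m assuming PyTorch for the code sketches in this post; the function name and shapes are my own.

```python
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Return a (seq_len, d_model) matrix of fixed sinusoidal position encodings."""
    positions = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    # Frequencies fall off geometrically across dimensions: 1 / 10000^(2i / d_model)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32)
        * (-torch.log(torch.tensor(10000.0)) / d_model)
    )
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(positions * div_term)   # even dimensions use sine
    pe[:, 1::2] = torch.cos(positions * div_term)   # odd dimensions use cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=128, d_model=512)
print(pe.shape)  # torch.Size([128, 512])
```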
Days 5-7: Hands-on Implementation
- Implement scaled dot-product attention from scratch (see the sketch below)
- Build multi-head attention module
- Create a minimal transformer encoder
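For the from-scratch days, this is roughly the core function everything else builds on. A sketch, assuming PyTorch and a (batch, heads, seq, head_dim) layout:

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k, v: (batch, heads, seq_len, head_dim). Returns (output, attention weights)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)          # (batch, heads, seq, seq)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))  # hide disallowed positions
    weights = F.softmax(scores, dim=-1)
    return weights @ v, weights

# Smoke test with random tensors: batch=2, heads=4, seq=10, head_dim=16
q = k = v = torch.randn(2, 4, 10, 16)
out, attn = scaled_dot_product_attention(q, k, v)
print(out.shape, attn.shape)  # torch.Size([2, 4, 10, 16]) torch.Size([2, 4, 10, 10])
```

Multi-head attention is then mostly bookkeeping: project into heads, call this function, concatenate, and project back out.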
Key Resources:
- Attention Is All You Need (Vaswani et al., 2017)
- The Illustrated Transformer
- Andrej Karpathy: Let’s build GPT
Week 2: Advanced Attention
- Self-attention deep dive & interpretability
- Cross-attention mechanisms
- Efficient attention variants (Longformer, BigBird)
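Most efficient-attention variants boil down to restricting which positions may attend to which. Here’s a sketch of a Longformer-style sliding-window mask (naming is my own) that could be dropped into the attention function from Week 1:

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean (seq_len, seq_len) mask: True where |i - j| <= window."""
    idx = torch.arange(seq_len)
    return (idx.unsqueeze(0) - idx.unsqueeze(1)).abs() <= window

print(sliding_window_mask(seq_len=8, window=2).int())
# Each row has at most 2 * window + 1 ones, so the attention cost grows
# linearly with sequence length instead of quadratically.
```

The real Longformer/BigBird kernels never materialize the full matrix; the dense mask here is just to make the sparsity pattern visible.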
Week 3: Perceiver & Non-Parametric Transformers
- Handling arbitrary input modalities
- In-context learning as implicit Bayesian inference
Checkpoint: Build a multi-modal classifier using a Perceiver-style architecture
Phase 2: Diffusion Models (Weeks 4-5)
Week 4: Diffusion Fundamentals
- Forward & reverse diffusion processes
- Score matching perspective
- DDPM & DDIM sampling
Week 5: Latent Diffusion
- VAE latent space compression
- Stable Diffusion architecture
- ControlNet & conditional generation
Checkpoint: Fine-tune a latent diffusion model using LoRA
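For the checkpoint I’ll likely lean on the Hugging Face diffusers library rather than reimplement the sampler. A minimal inference sketch, assuming diffusers is installed and the checkpoint id below is available to you (substitute whatever Stable Diffusion weights you can access); the LoRA fine-tuning itself would use the training scripts that ship with diffusers/PEFT:

```python
import torch
from diffusers import StableDiffusionPipeline

# The checkpoint id is an assumption; swap in whichever Stable Diffusion
# weights you have access to.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")  # assumes a CUDA GPU

image = pipe("a watercolor painting of a data center at sunset").images[0]
image.save("sample.png")
```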
Phase 3: Language Model Architectures (Weeks 6-7)
Week 6: GPT & Decoder-Only Models
- GPT-1 through GPT-4 evolution
- LLaMA, Mistral, Mixture of Experts
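Architecturally, every decoder-only model in this list shares the same causal constraint: position i can only attend to positions up to i. A tiny sketch of that mask, which plugs into the Week 1 attention function:

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    """Lower-triangular mask: position i may only attend to positions j <= i."""
    return torch.tril(torch.ones(seq_len, seq_len)).bool()

print(causal_mask(5).int())
# tensor([[1, 0, 0, 0, 0],
#         [1, 1, 0, 0, 0],
#         [1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 0],
#         [1, 1, 1, 1, 1]])
```

The differences between GPT variants, LLaMA, and Mistral live mostly elsewhere: normalization placement, positional scheme (e.g. RoPE), and feed-forward or MoE details.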
Week 7: BERT, T5 & State Space Models
- Masked vs Causal Language Modeling
- S4 & Mamba architectures
Checkpoint: Compare BERT-style MLM vs GPT-style CLM
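For the checkpoint, the key difference is how inputs and labels are constructed. A sketch on a toy sequence (the token ids and mask id are made up; -100 is PyTorch’s default ignore index for cross-entropy):

```python
import torch

tokens = torch.tensor([101, 7592, 2088, 2003, 4569, 102])   # toy token ids

# Causal LM (GPT-style): every position predicts the next token.
clm_inputs = tokens[:-1]
clm_labels = tokens[1:]                        # labels are the inputs shifted by one

# Masked LM (BERT-style): hide a random subset of tokens and predict only those.
MASK_ID = 103                                  # placeholder mask-token id
mlm_inputs = tokens.clone()
mlm_labels = torch.full_like(tokens, -100)     # positions set to -100 are ignored by the loss
masked_positions = torch.tensor([2, 4])        # e.g. ~15% of positions, chosen at random
mlm_labels[masked_positions] = tokens[masked_positions]
mlm_inputs[masked_positions] = MASK_ID

print(clm_inputs, clm_labels)
print(mlm_inputs, mlm_labels)
```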
Phase 4: Vision-Language Models (Weeks 8-9)
Week 8: Vision Transformers
- ViT patch embeddings (see the sketch below)
- DeiT training recipes
- MLP-Mixer & ConvNeXt
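The patch-embedding step is small enough to sketch directly; a strided convolution is the usual trick for "cut into patches, then project". Sizes below follow the ViT-Base/16 defaults, and the class name is my own:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and project each to an embedding."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to "cut into patches, then linear-project".
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                       # x: (B, 3, 224, 224)
        x = self.proj(x)                        # (B, embed_dim, 14, 14)
        return x.flatten(2).transpose(1, 2)     # (B, 196, embed_dim): a sequence of patch tokens

patches = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(patches.shape)  # torch.Size([2, 196, 768])
```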
Week 9: Vision-Language Models
- CLIP & contrastive learning
- Flamingo, BLIP-2, LLaVA architectures
Checkpoint: Build a CLIP-based image-text retrieval system
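Retrieval in the checkpoint is just cosine similarity between normalized embeddings; training uses CLIP’s symmetric contrastive loss. A sketch with a fixed temperature (CLIP actually learns it):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))            # matching pairs sit on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)       # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return (loss_i2t + loss_t2i) / 2

loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```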
Phase 5: Training & Inference (Weeks 10-11)
Week 10: Pre-training & Fine-tuning
- Pre-training objectives (CLM, MLM, contrastive)
- Instruction tuning, RLHF, DPO
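Of these, DPO is the easiest to see in code. A sketch of its loss, assuming you already have the summed log-probabilities of the chosen and rejected responses under the policy and a frozen reference model (beta is the usual scaling hyperparameter):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss over per-example summed log-probs, each of shape (batch,)."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp        # implicit reward of preferred answer
    rejected_margin = policy_rejected_logp - ref_rejected_logp  # implicit reward of dispreferred answer
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

loss = dpo_loss(torch.tensor([-10.0]), torch.tensor([-12.0]),
                torch.tensor([-11.0]), torch.tensor([-11.5]))
print(loss.item())
```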
Week 11: PEFT & Efficient Inference
- LoRA, QLoRA, adapters
- Flash Attention, KV-cache, quantization
- RAG implementation
Checkpoint: Build a RAG system with a QLoRA-adapted generator
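The LoRA idea itself fits in a few lines: freeze the pretrained weight and learn a low-rank additive update. A sketch (QLoRA additionally quantizes the frozen base to 4-bit, which I’m not showing here):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update: y = Wx + (alpha / r) * B(A x)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)             # freeze the pretrained weights
        self.lora_A = nn.Linear(base.in_features, r, bias=False)
        self.lora_B = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)      # the update starts as a no-op
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(x))

layer = LoRALinear(nn.Linear(512, 512))
print(layer(torch.randn(2, 512)).shape)  # torch.Size([2, 512])
```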
Phase 6: Evaluation & Ethics (Weeks 12-13)
Week 12: Benchmarks
- MMLU, HellaSwag, HumanEval
- Hallucination detection
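Most multiple-choice benchmarks like MMLU and HellaSwag are scored by comparing the model’s log-likelihood of each candidate answer. A sketch of that scoring loop, where `sequence_logprob` is a hypothetical helper you’d write on top of your model:

```python
def score_multiple_choice(question, choices, sequence_logprob):
    """Pick the choice whose continuation the model finds most likely.

    `sequence_logprob(prompt, continuation)` is a hypothetical helper returning the
    summed log-probability of `continuation` given `prompt` under your model.
    """
    scores = [sequence_logprob(question, choice) for choice in choices]
    return max(range(len(choices)), key=lambda i: scores[i])

# Benchmark accuracy is then the fraction of questions where this index
# matches the gold answer.
```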
Week 13: Safety & Advanced Topics
- Bias & fairness
- Privacy & memorization
- Prompt engineering patterns
Follow Along
I’ll be documenting my progress through this roadmap on this blog. Sign up below to get weekly updates on what I’m learning, code implementations, and insights from the papers I’m reading.
Key Resources
Courses
- Stanford CS224N - NLP with Deep Learning
- Stanford CS324 - Large Language Models
- Hugging Face NLP Course
Blogs
- Jay Alammar - Visual explanations
- Lilian Weng - Comprehensive surveys
- Transformer Circuits - Mechanistic interpretability
Code
- nanoGPT - Minimal GPT
- Hugging Face Transformers
- vLLM - Fast inference