💡 Post-training alignment in 7 sentences — one page covering the interview essentials (see §2–§9 for derivations). RLHF pipeline (Ouyang 2022 InstructGPT): SFT → RM (Bradley-Terry pairwise) → PPO + ...
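The RM step named above (Bradley-Terry pairwise) can be made concrete with a minimal sketch. It assumes the reward model emits one scalar score per response and that each training example is a (chosen, rejected) pair; the function names `bradley_terry_loss` and `preference_probability` are hypothetical, chosen for illustration and not taken from InstructGPT or any library. Under Bradley-Terry, P(chosen beats rejected) = sigmoid(r_chosen - r_rejected), and the loss is the negative log of that probability.

```python
import math


def preference_probability(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry probability that the chosen response is preferred."""
    margin = r_chosen - r_rejected
    # Branch on the sign so math.exp never receives a large positive argument.
    if margin >= 0:
        return 1.0 / (1.0 + math.exp(-margin))
    e = math.exp(margin)
    return e / (1.0 + e)


def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise loss: -log(sigmoid(r_chosen - r_rejected))."""
    margin = r_chosen - r_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); rewritten for m < 0 to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Hypothetical scalar scores for one preference pair.
    print(preference_probability(1.5, 0.3))
    print(bradley_terry_loss(1.5, 0.3))
```

The loss depends only on the score difference, so the reward scale is identified only up to an additive constant; in a real pipeline the same expression would be averaged over a batch of pairs and minimized with respect to the reward model's parameters.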