Writing

Notes on RL, optimization, numerical methods, and applied techniques.

Posts

Variance reduction and GAE — The same trick shows up in Monte Carlo (control variates), trading (minimum-variance hedge), quant finance (excess return), and policy-gradient RL (the advantage function). A walk through the unified formula and modern policy gradient methods like GAE and GRPO.
Fisher Information again and again — The same matrix appears as a preconditioner in natural gradient, a constraint in TRPO, a diagonal approximation in Adam, an ingredient in NES, and, through its inverse, a lower bound in Cramér-Rao.
Goal programming: hyperparameters as economic coefficients — Lagrange multipliers as shadow prices: prices we measure, dials we tune, constraints we cannot violate, and some questions about RLHF penalty terms.
Personalization and RLHF — Parallels between RankNet/LambdaMART and DPO, both built on Bradley-Terry.
Robustness and clipping — A guardrail against bad inputs can go on the input itself, with robust estimators like winsorization, MAD, or MCD, or on the function output, with gradient clipping, PPO ratio clipping, or per-trade PnL clipping.
Binary search and bisection — A one-stop C++23 template that solves integer search, bisection root-finding, bond yield-to-maturity, and inverse CDF sampling.
Systems under load — Little’s Law, M/M/1 wait times, and Amdahl’s Law, applied to software systems and to a trading desk in a sell-off.

Hand-rolled multi-head self-attention — Causal MHA with fused QKV, raw nn.Parameter matrices, and einops + einsum.
Two implementations of RoPE — Even/odd slicing and a block-diagonal rotation matrix.
Reward hacking: optimizer gaming a misspecified objective — Toy REINFORCE with two tasks and an add() tool. A bonus allows the optimizer to game the objective and use the tool unnecessarily.