Probability & Statistics for ML
1. Probability basics
Random variable. A random variable \(X\) is a quantity whose value is uncertain — the outcome of a coin flip, a die roll, tomorrow’s temperature. It takes values in a sample space (the set of possible outcomes). An event is a subset of outcomes (“the die shows an even number”).
Two kinds.
- Discrete \(X\) takes countably many values; described by a probability mass function (PMF) \(p(x) = \Pr(X=x)\), with \(p(x)\ge 0\) and \(\sum_x p(x)=1\).
- Continuous \(X\) takes values on a continuum; described by a probability density function (PDF) \(p(x)\ge 0\) with \(\int p(x)\,dx = 1\). Here \(p(x)\) is a density, not a probability — you read it by area, never by height (the rate intuition is unpacked right after the next figure). (New symbol: the \(\int\) in \(\int p(x)\,dx\) is an integral — read it as “add up the area under the curve \(p(x)\)”, the continuous twin of the sum \(\sum\) for discrete \(x\); the calculus that builds it is the Calculus primer’s job, and all we need here is the area reading.)
PMF vs PDF — the one distinction beginners trip on. Both say “how probability is spread over the values,” but they are read differently. For a PMF you read a height; for a PDF you must measure an area.
| PMF — discrete \(X\) | PDF — continuous \(X\) | |
|---|---|---|
| values | countable: a die, a count, a label | a continuum: height, temperature, wait |
| what \(p(x)\) is | the probability \(\Pr(X{=}x)\) directly | a density — probability per unit width |
| read it as | the bar height is the probability | the area under an interval is the probability |
| one exact value | can be probable: \(\Pr(X{=}3)=\tfrac16\) | has probability \(0\): \(\Pr(X{=}20.000°){=}0\) |
| can \(p(x)>1\)? | never (each bar \(\le 1\)) | yes — only the area need be \(\le 1\) |
| adds to \(1\) via | \(\sum_x p(x)=1\) | \(\int p(x)\,dx=1\) |
Why a density, not a probability — the rate intuition. A continuous \(X\) has infinitely many possible values (between \(170\) and \(171\) cm alone: \(170.1, 170.11, 170.111,\dots\)). They must share a total probability of \(1\), so each exact value gets zero: if every point held even a sliver of probability, infinitely many would sum to \(\infty\), not \(1\). That is why \(\Pr(X{=}170.000\dots)=0\), and why probability for continuous \(X\) lives on intervals, not points.
So \(p(x)\) cannot be a probability — it is a density, a rate, exactly like mass density on an iron rod (thick in the middle, thin at the ends): the mass at a single point is \(0\) (a point has no width), yet a stretch of rod has mass \(=\) density \(\times\) length. Probability behaves identically: \(\Pr(x\le X\le x{+}\Delta x)\approx p(x)\,\Delta x\). So \(p(x)\) carries units of probability per unit \(x\) — multiply by a width to get a probability; sum the slivers (integrate) to get an interval.
This is also why a density can exceed \(1\): a uniform distribution on \([0,0.5]\) must enclose area \(1\) over a width of \(0.5\), so its height is \(1/0.5=\mathbf{2}\) — a perfectly good rate (\(2\times 0.5 = 1\) ✓), even though a probability of \(2\) would be nonsense.
pdf-density-demo.html in this folder.
Function vs distribution — two levels, often blurred. The words “\(p(x)\)” and “distribution” get thrown around interchangeably, but they live on different levels:
- A distribution is the object — the whole pattern of how probability is spread over the values of \(X\). “The Gaussian,” “Bernoulli\((0.3)\),” “this die’s distribution” all name objects.
- A PMF or PDF is a function — a formula \(p(x)\) — that describes that object. It is one representation of the distribution, not the distribution itself.
Two facts pin the difference down:
- One distribution, many functions. The same distribution can be written as its PMF/PDF \(p(x)\) or as its CDF \(F(x)=\Pr(X\le x)\) — the running total of probability up to \(x\). Different functions, identical distribution (figure).
- Not every function is a distribution. A PMF/PDF must be \(\ge0\) and sum/integrate to \(1\); a function that doesn’t is just a function. The likelihood (§4) — the same formula read as a function of the parameter \(\theta\) instead of the data — is a perfectly good function but not a distribution over \(\theta\).
So “the distribution \(p(x)\)” is harmless shorthand (the function does pin the object down), but keep the levels straight: the distribution is the what; the function is how you write it down.
Expectation, mean, variance. The expectation (mean, \(\mu\)) is the probability-weighted average; the variance (\(\sigma^2\)) measures spread:
\[ \mathbb{E}[X] = \sum_x x\,p(x), \qquad \operatorname{Var}[X] = \mathbb{E}\big[(X-\mu)^2\big] = \mathbb{E}[X^2] - \mu^2 . \]
Why the two forms of \(\operatorname{Var}\) are equal (the right-hand one is the shortcut we actually compute with — one pass for \(\mathbb{E}[X^2]\), subtract \(\mu^2\)). Expand the square inside the expectation and use that \(\mathbb{E}\) is linear (the expectation of a sum is the sum of expectations, and constants like \(\mu\) pull out):
\[ \mathbb{E}\big[(X-\mu)^2\big] = \mathbb{E}\big[X^2 - 2\mu X + \mu^2\big] = \mathbb{E}[X^2] - 2\mu\,\underbrace{\mathbb{E}[X]}_{=\,\mu} + \mu^2 = \mathbb{E}[X^2] - 2\mu^2 + \mu^2 = \mathbb{E}[X^2] - \mu^2 . \]
(The middle step uses \(\mathbb{E}[X]=\mu\), so \(-2\mu\,\mathbb{E}[X]=-2\mu^2\); the two \(\mu^2\) terms then collapse to one.)
Worked example — a fair die. \(p(x)=\tfrac16\) for \(x\in\{1,\dots,6\}\). \(\mathbb{E}[X] = \tfrac16(1+2+3+4+5+6) = \tfrac{21}{6} = 3.5\). \(\mathbb{E}[X^2] = \tfrac16(1+4+9+16+25+36) = \tfrac{91}{6} \approx 15.17\), so \(\operatorname{Var}[X] = \tfrac{91}{6} - 3.5^2 = 15.17 - 12.25 \approx 2.92\) (exactly \(35/12\)). Its square root \(\sigma = \sqrt{\operatorname{Var}[X]}\) is the standard deviation — the spread in the same units as \(X\) (for the die, \(\sigma=\sqrt{35/12}\approx1.71\) faces).
Check it in code (the house rule — verify every worked number you write down):
import numpy as np
x = np.array([1, 2, 3, 4, 5, 6]) # die faces
p = np.full(6, 1/6) # fair-die PMF: each face has prob 1/6
assert np.isclose(p.sum(), 1.0) # a PMF must sum to 1
EX = (x * p).sum() # E[X] = 3.5
VarX = (x**2 * p).sum() - EX**2 # Var[X] = 35/12 = 2.9167
print(EX, VarX) # -> 3.5 2.9167
What these two numbers actually mean. Read \(\mathbb{E}[X]=3.5\) as a long-run average: roll the die a million times and average the faces, and you land on \(3.5\) — even though you can never roll a \(3.5\). (That the sample average converges to the expectation as samples pile up is the Law of Large Numbers — the reason an average is trustworthy.) Expectation is the balance point of the bars, not a value you expect to see. Variance is the average squared distance from that mean; we square so that misses above and below don’t cancel to zero, and so a big miss counts extra. Two coffee shops can both average a \(5\)-minute wait, but if shop A is always \(4\)–\(6\) min while shop B swings \(0\)–\(10\) min, B has the far larger variance — it is the unpredictable one.
Conditional probability & independence. The probability of \(A\) given that \(B\) happened:
\[ \Pr(A\mid B) = \frac{\Pr(A\cap B)}{\Pr(B)} . \]
Per component: \(\Pr(A\cap B)\) is the chance \(A\) and \(B\) both happen (\(\cap\) = “intersection”), and dividing by \(\Pr(B)\) restricts the world to the cases where \(B\) already occurred.
\(A\) and \(B\) are independent if knowing one tells you nothing about the other: \(\Pr(A\cap B)=\Pr(A)\Pr(B)\), equivalently \(\Pr(A\mid B)=\Pr(A)\).
Worked example. Two fair dice. \(\Pr(\text{sum}=8 \mid \text{first}=5)\)? Given first \(=5\), the second must be \(3\) (prob \(\tfrac16\)), so the answer is \(\tfrac16\). Compare the unconditional \(\Pr(\text{sum}=8)=\tfrac{5}{36}\approx 0.139\) — conditioning changed it, so “sum\(=8\)” and “first\(=5\)” are not independent.
Conditional probability is the single most important idea in this chapter: likelihood (§4) and Bayes’ rule (§5) are both built directly on it.
Joint, marginal, and the law of total probability. Three more terms close the loop:
- The joint \(p(A,B)=\Pr(A\cap B)\) — the chance of \(A\) and \(B\) together.
- A marginal — what you get by summing the joint over the variable you don’t care about: \(p(A)=\sum_B p(A,B)\), “marginalizing out” \(B\). (The name is literal: write the joint as a grid and the row/column totals in the margins are the marginals.)
- The law of total probability combines them: \(p(A)=\sum_B p(A\mid B)\,p(B)\) — the overall chance of \(A\) is its chance within each case \(B\), weighted by how likely each \(B\) is.
That last formula is the engine of the evidence term in Bayes’ rule (§5) and of the disease-test denominator there — keep it in view.
2. The five distributions ML actually uses
§1 established what a distribution is (the object) and how a PMF/PDF describes it. ML leans on surprisingly few — the five below. Learn each one’s shape (gallery at the end of the section) and a typical example, not its formula: each is a recipe for generating a kind of data, and each becomes a loss when you take \(-\log\) of it (Losses & Regularizers §6). The whole cast at a glance:
| Distribution | Type | Models — a typical example | Parameters | Shape |
|---|---|---|---|---|
| Bernoulli | discrete | one yes/no — will this user click? | \(p\) | two bars |
| Categorical | discrete | one of \(K\) labels — which of 5 genres? | \(p_1,\dots,p_K\) | \(K\) bars |
| Gaussian | continuous | a real value \(+\) noise — a height, a price | \(\mu,\sigma^2\) | bell |
| Laplace | continuous | a real value, robust to outliers | \(\mu,b\) | sharp-peaked bell |
| Beta | continuous | a belief about a probability — a coin’s bias | \(\alpha,\beta\) | blob on \([0,1]\) |
Bernoulli — a single yes/no (a biased coin), after Jacob Bernoulli (Ars Conjectandi, 1713, which founded the study of such yes/no trials). One parameter \(p=\Pr(X=1)\):
\[ p(x) = p^{x}(1-p)^{1-x}, \quad x\in\{0,1\}; \qquad \mathbb{E}[X]=p,\ \operatorname{Var}[X]=p(1-p). \]
That formula looks fancy but is just a switch: plug in \(x=1\) and it returns \(p^{1}(1-p)^{0}=p\); plug in \(x=0\) and it returns \(p^{0}(1-p)^{1}=1-p\). It simply picks the right one of the two probabilities. Use: binary labels (click / no-click). Its \(-\log\) is binary cross-entropy (Losses & Regularizers §2.2).
Categorical — one roll of a \(K\)-sided die (the name is literal — it picks one category / class). Parameters \(p_1,\dots,p_K\) summing to 1:
\[ p(x=c) = p_c , \qquad c\in\{1,\dots,K\},\quad \textstyle\sum_c p_c = 1 . \]
Here \(c\) is the observed class and \(p_c\) its probability. (No mean/variance is listed because the classes are unordered labels — “average class” is meaningless.) Use: multi-class labels (which of \(5\) genres, which of \(10\) digits). Paired with softmax outputs; its \(-\log\) is cross-entropy.
Gaussian (Normal) — the bell curve, parameters mean \(\mu\) and variance \(\sigma^2\):
\[ p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\Big(\!-\frac{(x-\mu)^2}{2\sigma^2}\Big). \]
Decoded: \((x-\mu)^2\) is “how far \(x\) is from the centre \(\mu\), squared” — so the farther out, the smaller the probability; \(\sigma\) sets the width; the front constant \(\frac{1}{\sqrt{2\pi\sigma^2}}\) only rescales things so the area integrates to \(1\). (Adult women’s heights cluster near \(\mu\approx162\) cm with most within one \(\sigma\approx7\) cm, so \(190\) cm is far less likely than \(162\) cm.) Use: real-valued targets / additive noise. Its \(-\log\) is (up to constants) the squared error \((x-\mu)^2\) — this is why MSE is the default regression loss. Why the name: after Carl Friedrich Gauss (1809), who derived it as the error law that makes the arithmetic mean the most likely estimate; “Normal” does not mean “usual”; its origin is debated — one common account ties it to the normal (perpendicular) equations of least squares — and Karl Pearson (who popularized it) later regretted the term for wrongly implying other distributions are abnormal.
Laplace — a sharper-peaked, heavier-tailed cousin, parameters centre \(\mu\) and scale \(b\):
\[ p(x) = \frac{1}{2b}\exp\!\Big(\!-\frac{|x-\mu|}{b}\Big). \]
Why the name: after Pierre-Simon Laplace, who studied it in 1774 (his “first law of errors”). Reading the shape: the only change from the Gaussian is \(|x-\mu|\) in place of \((x-\mu)^2\) — absolute value instead of square — and that one swap does two visible things:
- A sharp cusp at the centre. \(|x-\mu|\) has a kink at \(\mu\), so the density spikes to a point there: the model bets hard that values land right at \(\mu\). Most errors are tiny.
- Heavier tails. \(e^{-|x|}\) dies far slower than \(e^{-x^2}\), so a large deviation, though rare, is much likelier than under a Gaussian. A few big outliers are expected, not shocking.
Those two together are robustness. Its \(-\log\) is the absolute error \(|x-\mu|\) — i.e. MAE — which grows only linearly, so one wild outlier (a mansion among ordinary house prices) barely tugs a Laplace/MAE fit, while a Gaussian/MSE fit, with its squared error, lurches toward it. Said another way: the value that best fits data is the outlier-chasing mean under a Gaussian, but the outlier-resistant median under a Laplace — the same reason we quote a median house price as the honest “typical” one.
Beta — a distribution over a probability \(\theta\in[0,1]\), parameters \(\alpha,\beta>0\):
\[ p(\theta) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}, \qquad \mathbb{E}[\theta]=\frac{\alpha}{\alpha+\beta}. \]
Why the name: its normalizing constant is Euler’s Beta function \(B(\alpha,\beta)\), so the distribution borrows the name. The strange part — “a distribution over a probability” — is simple with our running coin: you don’t know its bias \(\theta\) (its \(\Pr(\text{head})\)), so rather than commit to one number you hold a whole belief curve over every bias in \([0,1]\).
The intuition that makes Beta click is pseudo-counts: \(\text{Beta}(\alpha,\beta)\) behaves as if you had already seen \(\alpha-1\) heads and \(\beta-1\) tails.
- \(\text{Beta}(1,1)\) — seen \(0,0\) — is dead flat: every bias equally plausible (“no idea”).
- \(\text{Beta}(2,2)\) — seen \(1\) head, \(1\) tail — a gentle hill at \(0.5\) (“probably fairish”).
- \(\text{Beta}(5,2)\) — seen \(4\) heads, \(1\) tail — leans toward \(1\) (“probably heads-heavy”).
So Beta has two knobs: the ratio \(\alpha:\beta\) sets where the belief sits (mean \(\alpha/(\alpha+\beta)\)), and the total \(\alpha+\beta\) sets how sharp it is (more pseudo-data → a narrower, more confident curve): concretely the variance \(\operatorname{Var}[\theta]=\dfrac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}\) shrinks as the total \(\alpha+\beta\) grows. Use: a prior on the coin’s bias (§6). It is the conjugate prior — after you flip the coin and add the real counts, the posterior is again a Beta, \(\text{Beta}(\alpha+\text{heads},\,\beta+\text{tails})\) (the term is in §11). That “just add your counts” update is exactly what turns §4’s likelihood and §6’s MAP into one-liners.
The table to remember (it is Losses & Regularizers §6).
Distribution Models \(-\log\) of it = Loss name Gaussian real value + noise \((x-\mu)^2\) MSE Laplace real value, robust \(\lvert x-\mu\rvert\) MAE Bernoulli one yes/no \(-[y\log p+(1{-}y)\log(1{-}p)]\) BCE Categorical one of \(K\) \(-\log p_{c^\star}\) cross-entropy “Choosing a loss” and “assuming a noise distribution” are the same act, viewed through a logarithm.
3. Odds, the logit, and the sigmoid
A model outputs an unbounded real score \(z\) (a dot product, a neural net’s last layer). A probability must live in \([0,1]\). The sigmoid is the bridge — and to see why it has its exact shape, start with odds.
Odds. If an event has probability \(p\), its odds are \(\dfrac{p}{1-p}\) (our coin landing heads with \(p=0.8\) has odds \(4\) — 4× as likely to land heads as not). Odds live in \([0,\infty)\): below \(1\) = unlikely, exactly \(1\) = an even bet (\(p=0.5\)), above \(1\) = likely. It is how bookmakers and doctors talk — “odds of \(3\) to \(1\)” means \(p=\tfrac{3}{4}=0.75\).
Logit (log-odds). Take the log of the odds: \(\operatorname{logit}(p)=\ln\dfrac{p}{1-p}\). Piece by piece — the inner \(\tfrac{p}{1-p}\) is the odds, and the outer \(\ln(\cdot)\) stretches them across the whole line: odds \(\to 0\) (\(p\to0\)) give \(\ln 0=-\infty\), odds \(=1\) (\(p=0.5\)) give \(\ln 1=0\), odds \(\to\infty\) (\(p\to1\)) give \(+\infty\). So \(\operatorname{logit}\) maps \(p\in(0,1)\) onto the entire real line \((-\infty,\infty)\) — exactly the range of a model score, so it is natural to let the score be the log-odds. (Why the name: Joseph Berkson coined logit in 1944 as “logistic unit”, echoing the earlier “probit”.)
Sigmoid = the inverse. Solving \(z=\ln\frac{p}{1-p}\) for \(p\) gives the logistic sigmoid (here \(\sigma\) denotes the sigmoid function — a separate use of the letter from the standard deviation \(\sigma = \sqrt{\operatorname{Var}[X]}\) of §1; context disambiguates):
\[ \sigma(z) = \frac{1}{1+e^{-z}} \in (0,1). \]
Component by component: \(e^{-z}\) flips and scales the score (a big positive \(z\Rightarrow e^{-z}\to0\); a big negative \(z\Rightarrow e^{-z}\to\infty\)); the \(1+\) holds the denominator above \(1\), so the reciprocal stays inside \((0,1)\); and at \(z=0\), \(e^{0}=1\) gives exactly \(\tfrac12\). So it squashes any score \(z\) into a probability. (Why the names: sigmoid \(=\) “S-shaped” — the Greek \(\sigma\) (sigma) plus “-oid” (“like”); logistic because the S-curve is Verhulst’s 1838 logistic growth law. Two names, one curve.) Key properties:
- \(\sigma(0)=0.5\); \(\;\sigma(-z)=1-\sigma(z)\) (symmetry); \(\;\sigma(\pm\infty)=\{0,1\}\).
- Derivative \(\sigma'(z)=\sigma(z)\,(1-\sigma(z))\) (derived next) — the reason logistic-regression gradients are so clean.
Why \(\sigma'=\sigma(1-\sigma)\) — one chain-rule step. Write \(\sigma(z)=(1+e^{-z})^{-1}\) and differentiate (the chain rule — Calculus primer): the outer power gives \(-(1+e^{-z})^{-2}\), the inner \(e^{-z}\) contributes \(-e^{-z}\), so \(\sigma'(z)=-(1+e^{-z})^{-2}\cdot(-e^{-z})=\dfrac{e^{-z}}{(1+e^{-z})^{2}}\). Split that product as \(\dfrac{1}{1+e^{-z}}\cdot\dfrac{e^{-z}}{1+e^{-z}}\): the first factor is \(\sigma(z)\), and the second is \(\dfrac{e^{-z}}{1+e^{-z}}=\dfrac{(1+e^{-z})-1}{1+e^{-z}}=1-\sigma(z)\). Hence \(\sigma'(z)=\sigma(z)\,(1-\sigma(z))\) — no new constants, just the sigmoid and its complement.
Worked example. Score \(z=2\). Then \(e^{-2}=0.1353\), so \(\sigma(2)=\dfrac{1}{1.1353}=0.8808\). Check via odds: \(\dfrac{0.8808}{1-0.8808}=\dfrac{0.8808}{0.1192}=7.389=e^{2}\) ✓ — the odds are exactly \(e^{z}\), confirming “score = log-odds.” Slope there: \(\sigma'(2)=0.8808\times0.1192=0.105\).
Why logarithms — and why \(-\log\)? Logs are everywhere in ML for three concrete reasons:
- Products become sums. A likelihood is a product over data points, \(\prod_i p_i\) — the numbers underflow to \(0\) and are awkward to differentiate. \(\log\prod_i p_i=\sum_i\log p_i\) turns it into a stable sum, differentiable term by term. (That is why §4 maximizes the log-likelihood, not the likelihood.)
- \(\log\) is monotonic, so it never moves where a maximum sits: maximizing likelihood and log-likelihood give the same answer — the numerical convenience is free.
- \(-\log p\) is “surprise,” a ready-made cost. A certain event (\(p=1\)) carries \(-\log 1=0\) surprise; a rare one (\(p\to0\)) carries \(-\log p\to\infty\). So \(-\log p\) — taking \(p\) to be the probability the model put on the right answer — is a natural loss: near \(0\) when the model was confident and right, exploding when confidently wrong. Every \(-\log\) loss in the book — BCE, cross-entropy, NLL — is exactly this, and their average is entropy (§7).
Softmax — turning \(K\) scores into a distribution
For multi-class problems the softmax generalizes the sigmoid: given scores \(z_1,\dots,z_K\),
\[ \operatorname{softmax}(z)_c=\frac{e^{z_c}}{\sum_{k}e^{z_k}}, \]
where the numerator \(e^{z_c}\) is class \(c\)’s positive weight and the denominator \(\sum_k e^{z_k}\) is the normalizer (the total) — together, \(K\) positive numbers that sum to \(1\), a probability distribution over the classes. Why exponentiate? \(e^{z}\) is always positive (valid probabilities), monotonic (a bigger score \(\to\) a bigger share), and smooth (differentiable, so gradient descent works); it also makes the largest score dominate — a soft version of taking the max, hence soft-max.
Worked example. Scores \(z=[2,1,0]\): \(e^2,e^1,e^0 = 7.39,\,2.72,\,1.00\), summing to \(11.11\), so \(\operatorname{softmax}=[0.665,\,0.245,\,0.090]\) (they sum to \(1\)). The top score takes the lion’s share, the rest keep nonzero probability.
A temperature \(\tau\) tunes the sharpness — feed \(z/\tau\): small \(\tau\to\) near one-hot (confident), large \(\tau\to\) near-uniform (unsure). (The formula is physics’ Boltzmann distribution, where \(\tau\) literally is temperature — the name’s origin.) With \(K=2\), softmax reduces to a sigmoid of the score difference \(\sigma(z_1-z_0)\) (Losses & Regularizers §2.3), so sigmoid is just the two-class softmax.
When to use which. Sigmoid for a binary yes/no, or several independent labels (multi-label); softmax for one-of-\(K\) mutually exclusive classes. Softmax is the classification head of almost every deep net, the attention weights in transformers (a softmax over similarity scores), and the core of InfoNCE / contrastive learning (SSL & Contrastive Learning). The sigmoid is the scores→probability link in BCE (Losses & Regularizers §2.2) and the squashing function inside BPR (Losses & Regularizers §9).
4. Likelihood, log-likelihood, and Maximum Likelihood (MLE)
Probability vs likelihood — the flip that confuses everyone. They use the same formula \(p(\text{data}\mid\theta)\) but read it in opposite directions:
- Probability: fix the parameters \(\theta\), ask how probable various data are.
- Likelihood: fix the observed data, view \(L(\theta)=p(\text{data}\mid\theta)\) as a function of \(\theta\) — “how well does each \(\theta\) explain what I saw?”
Why a separate word: R.A. Fisher coined likelihood (1921–22) precisely to mark this flip — it is not a probability of \(\theta\) (it need not integrate to \(1\) over \(\theta\); recall §1’s function-vs-distribution), so the everyday twins “probable” and “likely” become two distinct technical terms.
Maximum Likelihood Estimation (MLE). Pick the \(\theta\) that makes the observed data most probable: \(\hat\theta_{\text{MLE}}=\arg\max_\theta L(\theta)\).
Why the log. Data points are usually assumed independent, so \(L\) is a big product \(\prod_n p(x_n\mid\theta)\). Taking \(\log\) turns it into a sum \(\sum_n \log p(x_n\mid\theta)\) — easier to differentiate, numerically stable (no underflow from multiplying thousands of small numbers), and since \(\log\) is increasing, the maximizer is unchanged. Maximizing the log-likelihood = maximizing the likelihood.
Worked example — a coin. Flip 10 times, see 7 heads, 3 tails. Under a Bernoulli with \(\Pr(\text{head})=p\):
\[ L(p)=p^{7}(1-p)^{3}, \qquad \log L(p)=7\ln p + 3\ln(1-p). \]
Read \(L(p)=p^{7}(1-p)^{3}\) by its parts: each factor is one flip’s probability, multiplied because the flips are independent — \(p^{7}\) is the chance of the \(7\) heads (prob \(p\) each), \((1-p)^{3}\) the chance of the \(3\) tails (prob \(1-p\) each); their product is the probability of exactly this outcome under bias \(p\). Read as a function of \(p\) (data fixed), it is the likelihood; the \(\log\) then turns the product into the sum \(7\ln p+3\ln(1-p)\).
Set the derivative to zero:
\[ \frac{d\log L}{dp}=\frac{7}{p}-\frac{3}{1-p}=0 \;\Rightarrow\; 7(1-p)=3p \;\Rightarrow\; 7=10p \;\Rightarrow\; \hat p_{\text{MLE}}=0.7 . \]
So the MLE is just the observed frequency \(7/10\) — reassuringly. (Hand-check: at \(p=0.7\), \(\frac{7}{0.7}=10\) and \(\frac{3}{0.3}=10\); they cancel. ✓)
The bridge to losses. Define the negative log-likelihood (NLL) \(=-\log L(\theta)\). Then minimizing NLL = doing MLE. And Losses & Regularizers §6’s table is exactly this: NLL of a Gaussian = MSE, NLL of a Bernoulli = BCE. For one labelled point \(y\in\{0,1\}\) the Bernoulli NLL is \(-\log\!\big[p^{y}(1-p)^{1-y}\big]=-\big[y\log p+(1-y)\log(1-p)\big]\) — exactly binary cross-entropy. A loss function is a negative log-likelihood.
5. Bayes’ rule — prior, likelihood, evidence, posterior
MLE uses the data only. Bayesian thinking adds a prior belief about \(\theta\) before seeing data, then updates it. The update is Bayes’ rule — after the Rev. Thomas Bayes, whose essay on it was published posthumously in 1763 — which falls straight out of the conditional-probability definition (§1), since \(\Pr(A\cap B)=\Pr(B\mid A)\Pr(A)=\Pr(A\mid B)\Pr(B)\):
\[ \underbrace{p(\theta\mid D)}_{\textbf{posterior}} \;=\; \frac{\overbrace{p(D\mid\theta)}^{\textbf{likelihood}}\;\overbrace{p(\theta)}^{\textbf{prior}}}{\underbrace{p(D)}_{\textbf{evidence}}} . \]
The four names — memorize these, they recur everywhere:
- Prior \(p(\theta)\) — what you believed about \(\theta\) before the data (Latin prior, “earlier”).
- Likelihood \(p(D\mid\theta)\) — how well \(\theta\) explains the data (the §4 object).
- Evidence \(p(D)=\sum_\theta p(D\mid\theta)p(\theta)\) — a normalizing constant from the law of total probability (§1): marginalize out \(\theta\) by averaging the likelihood over every \(\theta\), weighted by the prior. It is the data’s overall probability (a.k.a. the marginal likelihood), independent of \(\theta\) — which is why it drops out in MAP (§6).
- Posterior \(p(\theta\mid D)\) — your updated belief after the data (Latin posterior, “later”).
In words: posterior ∝ likelihood × prior.
Our running coin is the cleanest instance — a Beta prior × the Bernoulli likelihood gives a Beta posterior, worked end-to-end in §6. But first, the example everyone must meet, because it overturns a strong intuition:
Worked example — the famous medical test. Let \(S\) = “has the disease” (we write \(S\) for sick, since \(D\) already means data above). The disease affects 1% of people (\(\Pr(S)=0.01\), the prior). A test has 99% sensitivity (\(\Pr(+\mid S)=0.99\)) and a 5% false-positive rate (\(\Pr(+\mid\neg S)=0.05\)). You test positive. What is the probability you actually have the disease?
\[ \Pr(+) = \Pr(+\mid S)\Pr(S) + \Pr(+\mid\neg S)\Pr(\neg S) = 0.99(0.01) + 0.05(0.99) = 0.0099 + 0.0495 = 0.0594 . \]
\[ \Pr(S\mid +) = \frac{\Pr(+\mid S)\Pr(S)}{\Pr(+)} = \frac{0.0099}{0.0594} \approx \mathbf{0.167} . \]
The same thing by counting (often the clearest way): picture 10,000 people. The 1% prior makes 100 of them sick; the test catches 99 (true positives). Of the 9,900 healthy, 5% — that is 495 — test positive anyway (false positives). So 594 people test positive, yet only 99 are truly sick: \(99/594\approx 17\%\).
Only ~17% — despite a “99% accurate” test! Because the disease is rare (low prior), most positives are false positives from the large healthy population. This base-rate effect is the single most important practical lesson of Bayes’ rule, and exactly why the prior matters.
Frequentist vs Bayesian, in one line. A frequentist treats \(\theta\) as a fixed unknown and asks “what data would this \(\theta\) produce?” (→ MLE). A Bayesian treats \(\theta\) as itself uncertain, carrying a prior, and computes a full posterior. MAP (§6) is the bridge.
6. MLE vs MAP — how a prior becomes a regularizer
Maximum A Posteriori (MAP) estimation — a posteriori is Latin for “from what comes after” (i.e. after the data), the mirror of a priori, “from before” — picks the peak of the posterior, the single most probable \(\theta\) given the data and your prior:
\[ \hat\theta_{\text{MAP}} = \arg\max_\theta\, p(\theta\mid D) = \arg\max_\theta\, p(D\mid\theta)\,p(\theta) \]
(the evidence \(p(D)\) drops out — it does not depend on \(\theta\)). Now take \(-\log\) of the product \(p(D\mid\theta)\,p(\theta)\). Three things happen — two already familiar from §4: the product becomes a sum (\(\log ab=\log a+\log b\)), and the maximizer is unmoved (\(\log\) is increasing). The third is new: the minus sign flips \(\arg\max\) into \(\arg\min\) (negating sends the largest value to the most negative — i.e. the smallest), turning each log-probability into a positive cost:
\[ \hat\theta_{\text{MAP}} = \arg\min_\theta\Big[\underbrace{-\log p(D\mid\theta)}_{\text{NLL} = \textbf{loss}} \;+\; \underbrace{-\log p(\theta)}_{\textbf{regularizer}}\Big]. \]
This is the whole bridge to Losses & Regularizers. MLE = “loss only.” MAP = “loss + a term from the prior.” That second term is the regularizer — so named because it keeps the solution regular (well-behaved), reining in extreme parameter values:
- A Gaussian prior on \(\theta\) (believe weights are small, near 0) → \(-\log p(\theta) \propto \lVert\theta\rVert_2^2\) → L2 / weight decay.
- A Laplace prior → \(\lVert\theta\rVert_1\) → L1 / lasso.
- A flat (uninformative) prior → the regularizer is constant → MAP collapses to MLE.
Worked example — the coin again, now Bayesian. We saw 7 heads in 10 (§4 gave MLE \(=0.7\)). Add a Beta\((2,2)\) prior — a gentle belief that the coin is fairish (prior mean \(0.5\)). Since Beta is conjugate to Bernoulli, the posterior is just “add the counts to the prior exponents”:
\[ p(\theta\mid D) \propto \underbrace{\theta^{7}(1-\theta)^{3}}_{\text{likelihood}}\cdot\underbrace{\theta^{1}(1-\theta)^{1}}_{\text{Beta}(2,2)\ \text{prior}} = \theta^{8}(1-\theta)^{4} \;\propto\; \text{Beta}(9,5). \]
The MAP estimate is the mode of Beta\((9,5)\), namely \(\dfrac{\alpha-1}{\alpha+\beta-2}=\dfrac{9-1}{9+5-2}=\dfrac{8}{12}=\mathbf{0.667}\).
| Estimate | Value | Reading |
|---|---|---|
| MLE | \(0.700\) | trust the data alone |
| Prior mean | \(0.500\) | belief before data |
| MAP | \(\mathbf{0.667}\) | data, pulled toward the prior |
probably fairish'') and observing $7$ heads + $3$ tails, conjugacy gives the posterior Beta$(9,5)$ (solid teal) by simply adding the counts to the exponents: $\alpha'=2+7=9$, $\beta'=2+3=5$. The curve **sharpens** (more pseudo-data $\to$ tighter belief) and **shifts right** (heads-heavy data pulls the centre toward $0.7$), with the MAP at the mode $(9-1)/(9+5-2)=8/12\approx0.667$.Just
add the counts’’ is conjugacy made visible.
The prior shrinks the estimate from \(0.7\) toward \(0.5\) — exactly what L2 does when it shrinks weights toward \(0\). With more data (say 70/100) the likelihood dominates and MAP moves back toward the MLE: a prior is a finite nudge, washed out by enough evidence.
One-line takeaway. MLE maximizes likelihood; MAP maximizes likelihood × prior; under \(-\log\) that is exactly loss + regularizer. Losses & Regularizers §6 is this section, applied to model parameters; Losses & Regularizers §9’s Bayesian Personalized Ranking is this section with a Gaussian prior giving its L2 term.
7. Entropy, cross-entropy, and KL divergence
These three information-theory quantities are the backbone of every classification loss and much of modern ML. All three grow from one idea from §3: surprise \(=-\log p\). A result you were sure of (\(p\approx1\)) carries almost no surprise (\(-\log p\approx0\)); a result you thought nearly impossible (\(p\approx0\)) is a huge shock (\(-\log p\) is huge). Everything below just averages that surprise, one way or another.
Entropy — average surprise
The entropy of a distribution \(p\) is the average surprise of its outcomes:
\[ H(p) = -\sum_x p(x)\,\log p(x) = \mathbb{E}_{x\sim p}\big[-\log p(x)\big]. \]
Component by component: \(-\log p(x)\) is the surprise of outcome \(x\) and \(p(x)\) is how often it occurs, so the sum is a probability-weighted average of surprise — exactly the \(\mathbb{E}_{x\sim p}[-\log p(x)]\) on the right. It measures uncertainty — how unpredictable \(p\) is. The most concrete reading: with \(\log_2\), entropy is the average number of yes/no questions you must ask to pin down the outcome (equivalently, the average bits to store one sample). Fair coin → one question (“heads?”) → \(1\) bit. Four equally likely outcomes → two questions → \(2\) bits. A near-certain coin → usually \(0\) questions → entropy near \(0\). That “20 questions” count is the entropy.
- Fair coin (\(p=0.5\)): \(H=-0.5\log_2 0.5-0.5\log_2 0.5=\mathbf{1}\) bit — maximal uncertainty for two outcomes.
- Biased coin (\(p=0.9\)): \(H=-0.9\log_2 0.9-0.1\log_2 0.1=\mathbf{0.47}\) bits — more predictable, so less surprise. Certain (\(p=1\)): \(H=0\). Uniform over 4: \(\log_2 4=2\) bits.
Why “entropy.” Borrowed from thermodynamics (Clausius, 1865; en- + Greek tropē, “transformation”), where it measures disorder. Claude Shannon (1948) showed the same formula measures the uncertainty of an information source — reportedly naming it “entropy” on von Neumann’s quip that “no one knows what entropy really is, so you’ll win every argument.” It is maximal at the uniform distribution and zero when one outcome is certain.
Units (bits vs nats). The log’s base is just a unit: \(\log_2\) gives bits (the “yes/no questions” reading above), \(\ln\) gives nats; they differ only by the constant \(\ln 2\approx0.693\). Entropy here is in bits; the cross-entropy and KL below switch to nats to match the loss formulas — so the fair coin’s \(1\) bit is the same as \(0.693\) nats.
Cross-entropy — surprise under the wrong distribution
Cross-entropy is the average surprise when reality follows \(p\) but you predict with a different distribution \(q\):
\[ H(p,q) = -\sum_x p(x)\,\log q(x). \]
In classification the truth \(p\) is one-hot (the right class has probability \(1\), the rest \(0\)). Watch the sum collapse: with classes {cat, dog, bird} and true label “cat” so \(p=[1,0,0]\), \[ H(p,q)=-\big(1\cdot\log q_{\text{cat}}+0\cdot\log q_{\text{dog}}+0\cdot\log q_{\text{bird}}\big)=-\log q_{\text{cat}}. \] The two \(0\)s delete their terms, leaving \(H(p,q)=-\log q(\text{true class})\) — the model’s surprise at the correct answer. That is the cross-entropy loss, and (since the one-hot likelihood is just \(q(\text{true class})\)) exactly the NLL of §4.
- Model puts \(q=0.7\) on the true class: loss \(=-\ln 0.7=\mathbf{0.357}\) (low).
- \(q=0.3\): loss \(=-\ln 0.3=\mathbf{1.204}\) (wrong-leaning, high). \(q=0.99\): \(-\ln0.99=\mathbf{0.01}\) (almost free).
Why “cross”-entropy. You cross the true distribution \(p\) with a different model distribution \(q\) — scoring \(p\)’s outcomes by \(q\)’s probabilities. When \(q=p\) it collapses to the ordinary entropy \(H(p)\).
KL divergence — the gap between two distributions
The Kullback–Leibler divergence is the extra surprise from using \(q\) instead of the true \(p\) — cross-entropy minus entropy:
\[ D_{\mathrm{KL}}(p\,\Vert\,q)=\sum_x p(x)\log\frac{p(x)}{q(x)} = \underbrace{H(p,q)}_{\text{cross-entropy}}-\underbrace{H(p)}_{\text{entropy}} \;\ge\; 0. \]
It is \(0\) exactly when \(q=p\) and grows as \(q\) drifts away — a one-directional “how far is \(q\) from the truth.” Worked example: truth \(p=[0.5,0.5]\), model \(q=[0.9,0.1]\): \(D_{\mathrm{KL}}=0.5\ln\tfrac{0.5}{0.9}+0.5\ln\tfrac{0.5}{0.1}=\mathbf{0.511}\) nats, and indeed (all in nats) \(H(p,q)-H(p)=1.204-0.693=0.511\) ✓ — note \(H(p)=0.693\) nats is the fair coin’s \(1\) bit in nats.
Why “divergence” (not “distance”) and “Kullback–Leibler.” After Solomon Kullback and Richard Leibler (1951). It is a divergence, not a distance, because it is asymmetric — \(D_{\mathrm{KL}}(p\Vert q)\neq D_{\mathrm{KL}}(q\Vert p)\) — and fails the triangle inequality. Why direction matters: \(D_{\mathrm{KL}}(p\Vert q)\) weights each surprise by how often the true \(p\) actually produces that outcome, so putting \(q\approx0\) on something \(p\) does often is catastrophic — while the reverse need not be. It still serves as a one-way “how far is \(q\) from \(p\).”
The punchline that unites them. Since \(H(p)\) is fixed by the data, \(\min_q H(p,q)\Leftrightarrow\min_q D_{\mathrm{KL}}(p\Vert q)\): training a classifier by minimizing cross-entropy is literally pulling the model’s \(q\) toward the true \(p\) in KL.
Where they power AI/ML
| Quantity | Used for |
|---|---|
| Cross-entropy | the default classification loss everywhere (softmax + cross-entropy) = NLL (§4) |
| KL divergence | VAEs (the ELBO’s KL term), knowledge distillation, RLHF (a KL penalty keeps the policy near a reference), t-SNE, variational inference |
| Entropy | decision-tree information gain, maximum-entropy models, the entropy bonus that encourages exploration in RL, the softmax temperature/uncertainty (§3) |
| Mutual information | itself a KL (built above); the quantity InfoNCE lower-bounds in contrastive learning (SSL & Contrastive Learning) |
8. Significance testing — is the improvement real? (the paired \(t\)-test)
You train model A and model B; A scores a little higher on the test set. Could that gap be luck? Change the random seed or the test split and the numbers wiggle. A significance test answers: “if the two models were really equally good, how surprising would a gap this big be?” If very surprising, the gap is unlikely to be noise — it is statistically significant.
8.1 The vocabulary (all you need)
- Null hypothesis \(H_0\) — the skeptical default: “there is no real difference; A and B are equally good.” A test tries to disprove it.
- \(p\)-value — the probability of seeing a gap at least this large if \(H_0\) were true. Small \(p\) = the data would be surprising under “no difference” = evidence the difference is real. (\(p\) is not the probability that \(H_0\) is true — a common myth.)
- Significance level \(\alpha\) — the bar you fix in advance (usually \(0.05\)). If \(p<\alpha\) you reject \(H_0\) and call the result significant.
- Type I / Type II error — rejecting a true \(H_0\) (false positive, rate \(\alpha\)) / missing a real effect (false negative).
8.2 Why paired (this is the version you use)
You evaluate A and B on the same things — the same test users, the same folds, or the same random seeds. Exploit that: instead of comparing two noisy averages, look at the per-pair difference
\[ d_i = (\text{A's score})_i - (\text{B's score})_i, \]
which cancels the shared difficulty of each fold/seed (an easy seed lifts both models, so it drops out of \(d_i\)). The question collapses to: is the mean difference \(\bar d\) significantly different from \(0\)? Pairing makes the test far more powerful than comparing the two columns as independent groups.
8.3 The paired \(t\)-test, component by component
\[ t = \frac{\bar d}{\,s_d/\sqrt{n}\,}, \qquad \bar d = \frac1n\sum_{i} d_i, \qquad s_d = \sqrt{\frac{1}{n-1}\sum_i (d_i-\bar d)^2}. \]
Read each piece in plain words:
- \(\bar d\) — the average improvement (the signal). Numerator.
- \(s_d\) — the spread of the differences (the noise: how inconsistent the gain is across seeds).
- \(s_d/\sqrt{n}\) — the standard error: how much \(\bar d\) itself would wobble if you repeated the whole experiment. It shrinks as you add pairs \(n\) because averaging cancels noise — flip a coin \(4\) times and the heads-fraction bounces wildly; flip it \(100\) times and it hugs \(0.5\). More seeds → steadier \(\bar d\) → easier to spot a real gap (this is exactly why more seeds give a test more power). That the average of many noisy pieces clusters ever more tightly, in a bell shape, around the truth is the Central Limit Theorem — it is why the \(t\)-curve is bell-shaped at all. (Formally: for \(n\) independent draws from almost any finite-variance distribution, the sample mean is approximately \(\mathcal N(\mu,\,\sigma^2/n)\), sharpening as \(n\) grows — which is exactly where the \(\sqrt n\) in the standard error comes from.)
- \(t\) — so \(t\) is signal-to-noise: how many standard errors \(\bar d\) sits above \(0\). Large \(|t|\) ⟹ unlikely to be chance.
- Convert \(t\) to a \(p\)-value via the Student-\(t\) distribution with \(n-1\) degrees of freedom — only \(n-1\) of the deviations are free: once you know \(\bar d\) and any \(n-1\) of the \(d_i-\bar d\), the last is forced (they must sum to \(0\)), so one was “spent” estimating \(\bar d\) — or compare against a critical value (for \(n-1=4\) and \(\alpha=0.05\) two-sided, the cutoff is \(2.776\)).
Why “Student”? William Sealy Gosset derived the \(t\)-distribution in 1908 while working at the Guinness brewery (small-sample quality control on barley). Guinness barred staff from publishing, so he wrote under the pen name “Student” — which stuck to both the distribution and the test. The “\(t\)” is just the symbol he used for the statistic.
8.4 Worked example by hand
Five seeds; NDCG@10 in milli-units (\(\times10^{-3}\), so \(409 = 0.409\)). Model A vs B:
| seed | A | B | \(d = A-B\) |
|---|---|---|---|
| 1 | 409 | 402 | 7 |
| 2 | 410 | 401 | 9 |
| 3 | 412 | 405 | 7 |
| 4 | 410 | 399 | 11 |
| 5 | 410 | 404 | 6 |
- \(\bar d = (7+9+7+11+6)/5 = 40/5 = \mathbf{8}\).
- deviations from the mean: \([-1,\,1,\,-1,\,3,\,-2]\); sum of squares \(=1+1+1+9+4 = 16\).
- \(s_d^2 = 16/(5-1) = 4 \Rightarrow s_d = 2\).
- standard error \(= 2/\sqrt{5} \approx 0.894\).
- \(t = 8 / 0.894 \approx \mathbf{8.94}\), with \(df = 4\).
Since \(8.94 \gg 2.776\), we reject \(H_0\): the \(+8\) milli-NDCG (\(+0.008\)) gain is significant, \(p \approx 0.0009\). Every seed has \(d_i>0\) and the gain (\(8\)) dwarfs the wobble (\(s_d=2\)) — A genuinely beats B, not by luck.
8.5 Caveats every paper runs into
- Significance \(\ne\) size. With enough seeds even a trivial gain turns “significant,” so always report the effect size (\(\bar d\)) next to \(p\) — e.g. “+0.008 NDCG@10 (\(p<0.001\)).”
- Report a confidence interval, not just \(p\). The 95% CI \(\bar d \pm t^{*}\,s_d/\sqrt n\) — here \(8\pm2.776(0.894)=[5.5,\,10.5]\) milli-NDCG — packages the effect size with its uncertainty in one object; if it excludes \(0\) the result is significant at that \(\alpha\) (the same verdict as the test, but more informative).
- The assumption. The \(t\)-test assumes the differences are roughly normal. If they are skewed or \(n\) is tiny, use the non-parametric Wilcoxon signed-rank test (Frank Wilcoxon, 1945 — same paired setup, ranks instead of raw values). On the 5 seeds above Wilcoxon gives \(p=0.0625\) — it literally cannot go below \(0.0625\) with \(n=5\), a reminder that tiny \(n\) starves non-parametric tests.
- Multiple comparisons. Testing many model pairs / datasets inflates false positives; correct for it (e.g. Bonferroni (after Carlo Bonferroni): divide \(\alpha\) by the number of tests), or use an omnibus test across datasets (Friedman, then post-hoc) — Demšar (2006).
- Report the protocol. State the number of seeds/folds, the test (paired \(t\) / Wilcoxon), and \(\alpha\); mark wins with e.g. \(^{*}p<0.05\), \(^{**}p<0.01\). This is exactly the experimental rigor top-venue reviewers expect — a few seeds plus a paired test turns “our number is bigger” into “our number is reliably bigger.”
In practice. When comparing two recommender models, always pair runs on the same seeds or folds — unpaired tests waste the pairing structure. Use a paired \(t\)-test when the per-pair differences look roughly normal; switch to the Wilcoxon signed-rank test when they are skewed or \(n\) is small (note: Wilcoxon needs \(n \ge 6\) pairs to even reach \(p < 0.05\)). Default to two-sided unless you pre-registered a directional hypothesis. When you test many variants or datasets at once, correct for multiple comparisons (Bonferroni: divide \(\alpha\) by the number of tests; or Holm for a less conservative bound). Finally, always report an effect size and a 95% confidence interval alongside \(p\) — a tiny gain can be “significant” with enough seeds yet matter nothing in deployment. The Evaluation Metrics chapter’s “evaluate honestly” section builds directly on this protocol.
8.6 The recipe — a runnable significance-test checklist
The prose above scatters the moving parts; here they are as one ordered procedure you can run top to bottom on a real A-vs-B comparison. Each step is a decision; the answer routes the next one.
Step 0 — fix the pairing unit, and pair on it. Decide what each pair is — one seed, one cross-validation fold, or one test user (per-user scores) — and evaluate both models on the same units, so \(d_i=A_i-B_i\) cancels that unit’s shared difficulty (§8.2). Use the same number \(n\) of pairs for both models. (Seeds are the usual unit in RecSys ablations; the worked example below uses \(n=5\) seeds.)
Step 1 — choose sidedness, before looking at the data. Two-sided (“A differs from B”) is the default and the safe reviewer choice. Use one-sided (“A is better than B”) only if you pre-registered that direction — it halves the \(p\)-value (our example: two-sided \(p=0.00086\) vs one-sided \(0.00043\)), so claiming it post hoc, after seeing A win, is a classic way to manufacture significance.
Step 2 — pick the test from the shape of the differences \(d_i\). Look at the \(n\) values \(d_i\) (not the raw scores):
- roughly symmetric / bell-ish (and \(n\) not tiny) -> paired \(t\)-test (§8.3). With only a handful of pairs you cannot really check normality, so lean on the t-test’s robustness for small symmetric \(d_i\) and keep reading.
- visibly skewed, or with a wild outlier -> Wilcoxon signed-rank (it ranks the \(d_i\), so one freak seed can’t dominate). But mind the floor: with all \(d_i\) the same sign the smallest two-sided \(p\) Wilcoxon can return is \(2/2^{n}\), so it needs \(\mathbf{n\ge 6}\) to even reach \(p<0.05\) (\(2/2^{5}=0.0625\) at \(n{=}5\); \(2/2^{6}=0.03125\) at \(n{=}6\) — verified below). Below \(n{=}6\), Wilcoxon cannot declare significance no matter how clean the win, so prefer the \(t\)-test (or collect more pairs).
per-pair differences d_i = A_i - B_i
|
+-- skewed, or a wild outlier? --+
| |
no (symmetric / bell-ish) yes
| |
paired t-test (8.3) n >= 6 ? (Wilcoxon floor 2/2^n)
| | |
| yes no
| | |
| Wilcoxon can't reach p<0.05:
| signed-rank use t-test or get
| more pairs
v v
p-value ----------> Step 3 (multiplicity) -> Step 4 (effect+CI)
Step 3 — correct for multiplicity if you ran more than one test. Comparing A against \(m\) baselines, or across \(m\) datasets, runs \(m\) tests, and each carries its own \(\alpha\) chance of a false positive — so the chance that at least one fires by luck balloons. Divide the bar: Bonferroni uses \(\alpha/m\) (compare each \(p\) to \(0.05/m\)); Holm is a uniformly less conservative stepwise version. A single pre-planned A-vs-B comparison needs no correction.
Step 4 — report the effect size with its CI, never \(p\) alone. Quote \(\bar d\) (the mean gain) and its 95% CI \(\bar d \pm t^{*}s_d/\sqrt n\) beside \(p\). The CI says both how big and how sure; it is significant at \(\alpha\) exactly when it excludes \(0\) — the same verdict as the test, but it also rules out trivially-small effects that “significance” alone hides (§8.5).
The recipe on the worked example, end to end (the same five seeds, \(d=[7,9,7,11,6]\)):
Unit = seed, \(n=5\) (Step 0). Two-sided, pre-chosen (Step 1). The \(d_i\) are tight and symmetric, so paired \(t\)-test (Step 2): \(t=8.94\), \(df=4\), \(p\approx 0.00086<0.05\) -> reject \(H_0\). A single comparison, so no multiplicity correction (Step 3). Report (Step 4): \(\bar d=\mathbf{+8}\) milli-NDCG@10, 95% CI \([5.5,\,10.5]\) (excludes \(0\) ✓), \(p<0.001\). Cross-check: Wilcoxon on these five gives \(p=0.0625\) — it cannot beat its \(n{=}5\) floor, exactly why we’d switch to it only at \(n\ge 6\).
import numpy as np
from scipy import stats
A = np.array([409,410,412,410,410]) # model A, milli-NDCG@10
B = np.array([402,401,405,399,404]) # model B, same 5 seeds
d = A - B # paired differences (Step 0)
n = len(d); dbar = d.mean(); sd = d.std(ddof=1)
t = dbar / (sd/np.sqrt(n)); df = n - 1
p_two = 2*stats.t.sf(abs(t), df) # two-sided p (Step 1)
tstar = stats.t.ppf(0.975, df) # 95% CI multiplier, df=4
ci = (dbar - tstar*sd/np.sqrt(n), dbar + tstar*sd/np.sqrt(n))
print(round(t,2), df, round(p_two,5)) # -> 8.94 4 0.00086 (Step 2)
print(round(dbar,1), [round(c,1) for c in ci]) # -> 8.0 [5.5, 10.5] (Step 4)
print(round(stats.wilcoxon(d).pvalue,4)) # -> 0.0625 (Wilcoxon floor at n=5)
print([2/2**k for k in (5,6)]) # -> [0.0625, 0.03125] (why n>=6)
8.7 Three worked cases that flip the verdict
A significance test earns its place only when it changes a decision. Three short cases — each in the milli-NDCG@10 units of §8.4, where \(d_i\) is B’s per-seed score minus A’s.
Case 1 — a positive headline that isn’t real. Five seeds give \(d=[7,-2,9,-5,1]\). The mean gain is \(\bar d=+2.0\) — positive, just like §8.4’s \(+8\). But the seeds disagree in sign, so the spread is large: \(s_d=5.92\), \(\text{SE}=2.65\), hence \(t=2.0/2.65=0.76\) on \(4\) df, \(p=0.49\), and the \(95\%\) CI is \([-5.4,\,+9.4]\) — it straddles \(0\). Verdict: do not claim B is better. The lesson §8.4’s clean win hides: it is the mean relative to its spread that decides, never the headline mean alone (\(+2\) here vs \(+8\) there).
Case 2 — when the \(t\)-test is fooled, rank instead (Wilcoxon by hand). Six seeds: \(d=[2,3,1,4,5,30]\) — five small consistent wins and one outlier. The \(t\)-test chokes: the \(30\) inflates \(s_d=11.1\), so \(t=7.5/4.54=1.65\), \(p=0.16\) — “not significant,” even though every seed favored B. The Wilcoxon signed-rank test throws away magnitudes and keeps only signs and ranks. Rank the \(|d_i|\): \(\{1,2,3,4,5,30\}\to\) ranks \(1,2,3,4,5,6\). Sum the ranks of the negative differences — there are none — so \(W^{-}=0\) (and \(W^{+}=21\)); the statistic is \(W=\min(W^{+},W^{-})=0\). With all \(n=6\) signs agreeing, the two-sided \(p\) hits its floor \(2/2^{6}=0.031<0.05\) — significant. Same data, opposite verdict: the rank test shrugged off the outlier that wrecked the \(t\)-test (the “skewed / outlier” branch of the §8.6 recipe).
Case 3 — test enough models and one wins by luck (multiplicity). You compare \(m=3\) candidates against the baseline and get \(p=\{0.012,\,0.030,\,0.045\}\). Uncorrected, all three clear \(0.05\) — three “wins.” But three tests give three shots at a false positive. Bonferroni divides the bar: compare each \(p\) to \(\alpha/m=0.05/3=0.0167\). Now only \(0.012\) survives. (Holm — compare the sorted \(p\) against \(0.0167,\,0.025,\,0.05\) — agrees: it stops at the first failure, \(0.030>0.025\).) Two of the three “wins” were the artefact that multiplicity manufactures.
The through-line. In all three cases the test reversed the naive read — a positive mean that was noise, a “non-significant” \(t\) the rank test rescued, three “wins” that were two false alarms. That is why significance testing, not the headline number, is what turns a result into a finding. (The assumption-free bootstrap CI for non-Gaussian per-user scores is worked in Evaluation Metrics §11d.)
9. Exercises
Ten by-hand problems spanning §1–§8. Each is (compute) — a number to crank, (concept) — a why-question to reason out, (extend) — push an idea one step past the text, or (apply) — use it as the book will. Every problem is self-contained and reuses this chapter’s own examples and numbers; nothing needs a calculator beyond arithmetic, a \(\ln\), and an \(e^{x}\). Full worked solutions are in the Solutions appendix at the back of this note — try each first, then check.
E1 (compute) — expectation and variance of a tiny RV. A fair four-sided die shows \(x\in\{1,2,3,4\}\), each with probability \(\tfrac14\). Using \(\mathbb{E}[X]=\sum_x x\,p(x)\) and the shortcut \(\operatorname{Var}[X]=\mathbb{E}[X^2]-\mu^2\) (§1), compute \(\mathbb{E}[X]\) and \(\operatorname{Var}[X]\). (The six-sided die in §1 gave \(3.5\) and \(35/12\); redo the arithmetic for four faces.)
E2 (concept) — name that distribution. For each description, name which of §2’s five distributions (Bernoulli, Categorical, Gaussian, Laplace, Beta) it is, and give its parameter(s): (a) “will this user click — yes or no?”; (b) “which one of \(5\) genres did they pick?”; (c) a belief about a coin’s unknown bias, written as the pseudo-count form “I’ve effectively seen \(4\) heads and \(1\) tail.” For (c), state the \((\alpha,\beta)\) and, using \(\mathbb{E}[\theta]=\alpha/(\alpha+\beta)\), its mean. Finally: for the Bernoulli in (a) with \(p=0.8\) (the §3 coin), give \(\mathbb{E}[X]\) and \(\operatorname{Var}[X]\) from \(\mathbb{E}[X]=p,\ \operatorname{Var}[X]=p(1-p)\).
E3 (compute) — odds → logit → sigmoid. The §3 coin lands heads with \(p=0.8\). (a) Compute its odds \(p/(1-p)\). (b) Compute the logit \(\ln\!\big(p/(1-p)\big)\). (c) Now treat that logit as a model score \(z\) and run it back through the sigmoid \(\sigma(z)=1/(1+e^{-z})\) — you should land back on \(0.8\) (the sigmoid is the logit’s inverse). (d) As a separate check, evaluate \(\sigma(0)\).
E4 (compute) — softmax of three scores. A classifier emits scores \(z=[1,0,0]\) for classes \(A,B,C\). Using \(\operatorname{softmax}(z)_c=e^{z_c}/\sum_k e^{z_k}\) (§3) with \(e^{1}\approx2.718\), compute the three probabilities (round to three decimals) and confirm they sum to \(1\). Which class wins, and do \(B\) and \(C\) come out equal — why?
E5 (compute) — MLE for a coin. You flip a coin 8 times and see 6 heads, 2 tails. Under a Bernoulli\((p)\) the likelihood is \(L(p)=p^{6}(1-p)^{2}\) (§4). Set the derivative of \(\log L(p)=6\ln p+2\ln(1-p)\) to zero and solve for \(\hat p_{\text{MLE}}\). Hand-check by plugging your answer back into \(\tfrac{6}{p}=\tfrac{2}{1-p}\).
E6 (apply) — Bayes’ rule, the medical test. Reuse §5’s exact numbers: a disease has prior \(\Pr(S)=0.01\), the test has sensitivity \(\Pr(+\mid S)=0.99\) and false-positive rate \(\Pr(+\mid\neg S)=0.05\). You test positive. (a) Compute the evidence \(\Pr(+)\) via the law of total probability, then the posterior \(\Pr(S\mid+)\) with Bayes’ rule. (b) Redo it by the counting picture over \(10{,}000\) people (how many true vs false positives?) and confirm you get the same number. Why is it so far below \(99\%\)?
E7 (extend) — MLE vs MAP: the prior as a regularizer. Take the same coin as E5 (6 heads, 2 tails) and add a gentle Beta\((2,2)\) prior (§6). (a) Using Beta–Bernoulli conjugacy (“add the counts to the prior exponents”), write the posterior as a Beta\((\alpha',\beta')\). (b) Its MAP estimate is the mode \(\dfrac{\alpha'-1}{\alpha'+\beta'-2}\) — compute it. (c) Compare the three numbers — MLE (E5), the prior mean \(0.5\), and your MAP — and say in one sentence which way the prior pulled the estimate, and what the analogous L2 effect on a weight would be.
E8 (compute) — entropy, cross-entropy, KL. Truth is a fair coin \(p=[0.5,0.5]\); a model predicts \(q=[0.8,0.2]\) (a new q, different from the §7 text example which used \(q=[0.9,0.1]\)). Compute (a) the entropy \(H(p)\) in bits (\(\log_2\)); (b) the cross-entropy \(H(p,q)=-\sum_x p(x)\ln q(x)\) in nats; (c) the KL divergence \(D_{\mathrm{KL}}(p\Vert q)=\sum_x p(x)\ln\frac{p(x)}{q(x)}\) in nats; and (d) verify the identity \(D_{\mathrm{KL}}(p\Vert q)=H(p,q)-H(p)\) (use \(H(p)=0.693\) nats \(=1\) bit). Useful logs: \(\ln 0.5=-0.693,\ \ln 0.8=-0.223,\ \ln 0.2=-1.609\).
E9 (apply) — a paired \(t\)-test over seeds. Model A and model B are evaluated on the same 5 seeds; the per-seed differences (§8) are \(d=[3,5,4,6,2]\). Compute (a) the mean \(\bar d\); (b) the sample standard deviation \(s_d=\sqrt{\tfrac{1}{n-1}\sum_i(d_i-\bar d)^2}\); (c) the standard error \(s_d/\sqrt{n}\) and the statistic \(t=\bar d/(s_d/\sqrt n)\) with its degrees of freedom. The two-sided critical value at \(\alpha=0.05\), \(df=4\) is \(2.776\) — do you reject \(H_0\)? (d) Give the 95% CI \(\bar d\pm 2.776\,(s_d/\sqrt n)\) and confirm it excludes \(0\).
E10 (concept) — the CLT and the standard error. In E9 the standard error was \(s_d/\sqrt n\). (a) If you re-ran the experiment with \(n=20\) seeds and the spread \(s_d\) stayed \(\approx s_d\) from E9, what would happen to the standard error, and hence to \(t\) and the chance of detecting a real gain (the test’s power)? Give the rough factor. (b) Name the theorem that says the average of many noisy per-seed differences is approximately bell-shaped around the truth — and explain in one sentence why that is exactly what puts the \(\sqrt n\) in the denominator of the standard error.
E11 (compute) — Wilcoxon signed-rank by hand. Six paired seed-differences are \(d=[+3,+1,+5,-2,+4,+6]\) (B minus A). (a) Rank the absolute values \(|d_i|\) smallest to largest. (b) Sum the ranks of the negative differences to get \(W^{-}\). (c) The statistic is \(W=\min(W^{+},W^{-})\); recall (§8.7) that at \(n=6\) the smallest two-sided \(p\) is the floor \(2/2^{6}=0.031\), reached only when all signs agree (\(W=0\)). Is this result significant at \(\alpha=0.05\)?
E12 (apply) — ship, or not? the multiple-comparison trap. You test four new models against a baseline. The best improves NDCG@10 by \(+0.004\) with a \(95\%\) CI of \([-0.001,\,+0.009]\) and a raw \(p=0.04\). (a) From that CI alone, would you conclude the model is better? (b) You ran four tests, not one — apply Bonferroni at \(\alpha=0.05\): does \(p=0.04\) survive? (c) Ship or not, and why? (The §8.7 lesson, as a decision.)
10. Where this fits in the book
Every concept here is a load-bearing prerequisite downstream:
| Foundation (here) | Used in | As… |
|---|---|---|
| Distributions → \(-\log\) (§2) | Losses & Regularizers §2, §6 | MSE / MAE / BCE / cross-entropy |
| Sigmoid, logit, softmax (§3) | Losses & Regularizers §2.2, §9; SSL & Contrastive Learning §3 | scores → probabilities; BPR’s \(\sigma\); InfoNCE |
| Likelihood, NLL, MLE (§4) | Losses & Regularizers §6 | “a loss is a negative log-likelihood” |
| Bayes: prior/likelihood/posterior (§5) | Losses & Regularizers §6, §9 | the MAP derivation; the “Bayesian” in BPR |
| MLE vs MAP (§6) | Losses & Regularizers §4, §6 | loss + regularizer; Gaussian prior → L2 |
| Expectation (§1) | Losses & Regularizers §1 (empirical risk); SSL & Contrastive Learning (alignment/uniformity) | averages of losses |
| Entropy / cross-entropy / KL (§7) | Losses & Regularizers §2 (cross-entropy loss); SSL & Contrastive Learning §3 (InfoNCE/MI) | the classification loss; KL in VAEs / RLHF / distillation |
| Significance test: paired \(t\) (§8) | empirical ML papers; ablation tables | “is the gain real, or noise?” |
THIS NOTE (probability → likelihood → Bayes → MAP)
│
▼
LOSSES & REGULARIZERS loss = −log p(data|θ); regularizer = −log p(θ); MAP = loss + reg
│
▼
FROM GRAPHS TO LIGHTGCN / SSL & CONTRASTIVE LEARNING / THE SPECTRAL VIEW
LightGCN trained with BPR (a Bayesian MAP estimate) + L2,
then SSL / spectral methods built on top
If §6’s “loss + regularizer = −log(likelihood × prior)” feels obvious now, the rest of the series will too.
11. Glossary
| Term | Plain meaning |
|---|---|
| Random variable | A quantity whose value is uncertain (coin, die, temperature). |
| Sample space / event | The set of possible outcomes / a subset of them. |
| PMF / PDF | Probability mass (discrete) / density (continuous) function; sums/integrates to 1. |
| CDF \(F(x)\) | Cumulative distribution function \(F(x)=\Pr(X\le x)\) — the running total of probability; a different function describing the same distribution. |
| Expectation \(\mathbb{E}[X]\) / mean \(\mu\) | Probability-weighted average value. |
| Variance \(\sigma^2\) | Expected squared distance from the mean; spread. |
| Law of Large Numbers | The sample average converges to the expectation as samples accumulate. |
| Conditional probability \(\Pr(A\mid B)\) | Probability of \(A\) given \(B\) occurred \(=\Pr(A\cap B)/\Pr(B)\). |
| Independence | \(\Pr(A\cap B)=\Pr(A)\Pr(B)\); one event tells you nothing about the other. |
| Joint / marginal | Joint \(p(A,B)=\Pr(A\cap B)\); marginal \(p(A)=\sum_B p(A,B)\) (sum the joint over the other variable). |
| Law of total probability | \(p(A)=\sum_B p(A\mid B)p(B)\); the engine of Bayes’ evidence term. |
| Bernoulli | One yes/no with parameter \(p\); \(-\log\) → BCE. |
| Categorical | One of \(K\) classes; \(-\log\) → cross-entropy; paired with softmax. |
| Gaussian (Normal) | Bell curve \((\mu,\sigma^2)\); \(-\log\) → squared error (MSE). |
| Laplace | Peaked/heavy-tailed \((\mu,b)\); \(-\log\) → absolute error (MAE). |
| Beta | Distribution over a probability; conjugate prior for Bernoulli. |
| Odds | \(p/(1-p)\). |
| Logit (log-odds) | \(\ln\!\big(p/(1-p)\big)\); maps \((0,1)\to(-\infty,\infty)\). |
| Sigmoid \(\sigma(z)\) | \(1/(1+e^{-z})\); inverse of the logit; score → probability. |
| Softmax | Multi-class sigmoid; scores → a probability distribution. |
| Likelihood \(L(\theta)\) | \(p(\text{data}\mid\theta)\) read as a function of \(\theta\) (data fixed). |
| Log-likelihood | \(\log L(\theta)\); products become sums. |
| NLL | Negative log-likelihood \(=-\log L\); minimizing it = MLE; this is a loss. |
| Surprise / self-information | \(-\log p(x)\); how surprising an outcome is (\(0\) if certain, \(\to\infty\) if rare). |
| Entropy \(H(p)\) | \(-\sum p\log p\); average surprise / uncertainty / bits to encode (maximal at uniform). |
| Cross-entropy \(H(p,q)\) | \(-\sum p\log q\); surprise of \(p\)’s outcomes scored by model \(q\) — the classification loss. |
| KL divergence \(D_{\mathrm{KL}}(p\Vert q)\) | \(H(p,q)-H(p)\ge0\); extra surprise from using \(q\) not \(p\); asymmetric (a divergence, not a distance). |
| MLE | Maximum Likelihood Estimate; \(\theta\) that best explains the data. |
| Bayes’ rule | posterior \(\propto\) likelihood \(\times\) prior. |
| Prior \(p(\theta)\) | Belief about \(\theta\) before data. |
| Evidence \(p(D)\) | Normalizer; data probability averaged over \(\theta\) (a.k.a. marginal likelihood). |
| Posterior \(p(\theta\mid D)\) | Updated belief after data. |
| MAP | Maximum A Posteriori; peak of the posterior \(=\) MLE \(+\) prior \(=\) loss \(+\) regularizer. |
| Conjugate prior | A prior whose posterior is the same family (Beta↔︎Bernoulli). |
| Base-rate effect | A rare condition (low prior) makes even an accurate test’s positives mostly false. |
| Null hypothesis \(H_0\) | The skeptical default “no real difference”; a test tries to disprove it. |
| \(p\)-value | Probability of a gap this large if \(H_0\) were true; small = significant. NOT \(\Pr(H_0\text{ true})\). |
| Significance level \(\alpha\) | The pre-set bar (usually \(0.05\)); reject \(H_0\) if \(p<\alpha\). |
| Type I / II error | False positive (reject true \(H_0\)) / false negative (miss a real effect). |
| Paired \(t\)-test | Tests whether the mean of per-pair differences \(\bar d\) differs from 0: \(t=\bar d/(s_d/\sqrt n)\). |
| Standard error | \(s_d/\sqrt n\) — how much an estimate wobbles across repeats; shrinks with more data. |
| Central Limit Theorem | The average of many noisy pieces is bell-shaped around the truth; basis of the \(t\)-curve. |
| Confidence interval | \(\bar d \pm t^{*}s_d/\sqrt n\); the effect size with its uncertainty — significant when it excludes \(0\). |
| Degrees of freedom | \(n-1\) for the paired test; sets which Student-\(t\) curve gives the \(p\)-value. |
| Wilcoxon signed-rank | Non-parametric paired test (ranks, not values); use when differences aren’t normal. |
| Effect size | The magnitude of the difference (\(\bar d\)); report it alongside \(p\) (significance \(\ne\) size). |
| One- vs two-sided test | Two-sided asks “A \(\ne\) B” (the default); one-sided asks “A \(>\) B” and halves \(p\) — valid only if the direction was pre-registered. |
| Bonferroni / Holm | Multiple-comparison fixes for \(m\) tests: Bonferroni compares each \(p\) to \(\alpha/m\); Holm is a uniformly less conservative stepwise version. |
12. References
- Bishop, C. M. (2006). Pattern recognition and machine learning. Springer.
- Brown University (2017). Seeing Theory: A visual introduction to probability and statistics. https://seeing-theory.brown.edu/
- Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7, 1–30. https://jmlr.org/papers/v7/demsar06a.html
- Jaynes, E. T. (2003). Probability theory: The logic of science. Cambridge University Press.
- MacKay, D. J. C. (2003). Information theory, inference, and learning algorithms. Cambridge University Press.
- Murphy, K. P. (2022). Probabilistic machine learning: An introduction. MIT Press.
- Ng, A. Y., & Jordan, M. I. (2002). On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In Proceedings of Advances in Neural Information Processing Systems 14 (NIPS 2001), pp. 841–848.
- Olah, C. (2015). Visual information theory. Blog post. https://colah.github.io/posts/2015-09-Visual-Information/
- Starmer, J. (StatQuest). Statistics fundamentals. Video series. https://www.youtube.com/c/joshstarmer
- Student [W. S. Gosset] (1908). The probable error of a mean. Biometrika, 6(1), 1–25. https://doi.org/10.2307/2331554
- 3Blue1Brown (2019). Bayes theorem, the geometry of changing beliefs. Video. https://www.youtube.com/watch?v=HZGCoVF3YvM
- Wilcoxon, F. (1945). Individual comparisons by ranking methods. Biometrics Bulletin, 1(6), 80–83. https://doi.org/10.2307/3001968
Online sources verified June 2026.