Glossary
A single alphabetical list of every term defined across the book. Each term is defined in full in its home chapter; this page gathers them for quick reference and search.
A
| Term | Meaning |
|---|---|
| A/B test (online eval) | Serve two models to randomly split halves of live users and compare a business metric; the real verdict. |
| Activation \(\phi\) | Nonlinear function applied to \(z\) (sigmoid, tanh, ReLU). |
| Adjacency matrix \(A\) | The whole graph’s connections as a square grid of 0/1. |
| Agentic recommender | An LLM that plans, remembers, and calls tools over multi-turn conversation (§2.4). |
| Alignment | Positive pairs end up close in space. |
| Alignment & uniformity | Pull positives together / spread embeddings on the sphere (DirectAU); fights popularity bias. |
| Alignment loss (RLMRec) | Ties LLM semantic vectors to CF embeddings (contrastive -Con or generative -Gen). |
| ANN | Approximate nearest-neighbour search — fast, slightly-inexact vector retrieval; the candidate-generation engine. |
| ANN (approximate nearest neighbour) | Fast similarity search that makes dot-product retrieval scale to millions of items. |
| ANN (approximate nearest-neighbour) | Fast vector search that returns almost the closest vectors to a query, trading a little accuracy for large speed-ups — how embedding retrieval is served (Implementation Choices). |
| AP / MAP | Average Precision (precision averaged over hit positions) / its mean over users. |
| Arm | One choosable option (an item); “pulling” it = showing it. From the slot-machine metaphor. |
| Attention | \(\mathrm{softmax}(QK^\top/\sqrt{d_k})V\): each position takes a similarity-weighted blend of all positions’ value vectors. |
| Attributed / featureless | Whether nodes carry input feature vectors; recsys nodes are featureless (IDs only). |
| AUC | Probability a random click outscores a random non-click; grades ordering, not calibration. |
| AUC / ROC | Area Under the ROC Curve = P(score of a random relevant > random irrelevant). |
| Autoregressive decoding | Generate one token, append it, run the model again, repeat — the GPT/SASRec generation loop. |
| Auxiliary task | The extra, label-free objective added alongside the main (BPR) loss. |
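
The AUC rows above describe a pairwise probability, which can be checked by brute force on a toy example. This is a minimal sketch, not a production metric: the name `pairwise_auc`, the toy scores, and the convention that a tie counts as half a win are assumptions, and the quadratic loop only suits small inputs.

```python
def pairwise_auc(scores, labels):
    """AUC as P(a random positive outscores a random negative).

    Ties count as 0.5 (an assumed convention).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


example_auc = pairwise_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
print(example_auc)  # 0.75: three of the four positive/negative pairs are ordered correctly
```

Rescaling every score by the same increasing function leaves the result unchanged, which is why AUC grades ordering and not calibration.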
B
| Term | Meaning |
|---|---|
| Back-propagation | Computing all weight-gradients by the chain rule, backward through the net. |
| Bandit feedback | You observe the reward only for the action you took, never the alternatives. |
| Base-rate effect | A rare condition (low prior) makes even an accurate test’s positives mostly false. |
| Batch / Layer Norm | Re-center/re-scale activations for training stability (LayerNorm powers the Transformer). |
| Bayes’ rule | posterior \(\propto\) likelihood \(\times\) prior. |
| BCE loss | Pointwise binary cross-entropy on positive vs. sampled-negative pairs. |
| Beam search (decoding) | Keep the \(B\) best partial ID prefixes per step; the completed tuples, ranked, are the top-\(K\) recommendations (§2.2). |
| Benchmark framework | RecBole / Cornac / Elliot / Microsoft Recommenders / LensKit — unified datasets + models + protocol. |
| Benchmaxxing | Training on (or near) benchmark data so a high score reflects recall, not ability — why one leaderboard number is untrustworthy. |
| Bernoulli | One yes/no with parameter \(p\); \(-\log\) → BCE. |
| BERT | Bidirectional Encoder (masked-LM); an encoder for understanding text → embeddings (Devlin 2019). |
| BERT4Rec | A bidirectional Transformer trained with the cloze (masked-item) objective. |
| Beta | Distribution over a probability; conjugate prior for Bernoulli. |
| Beta posterior | \(\text{Beta}(1+s,1+f)\) for \(s\) clicks, \(f\) non-clicks; conjugate to the Bernoulli click. |
| Bipartite graph | Nodes in two groups; edges only cross between groups (users ↔︎ movies). |
| Bonferroni / Holm | Multiple-comparison fixes for \(m\) tests: Bonferroni compares each \(p\) to \(\alpha/m\); Holm is a uniformly less conservative stepwise version. |
| BPR | Bayesian Personalized Ranking — pairwise ranking loss; “Bayesian” = its MAP derivation; default for LightGCN. |
| BPR loss | Pairwise ranking loss: push observed pairs above sampled negatives. |
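
The base-rate effect and Bayes' rule rows above can be tied together with a short worked calculation. The numbers (1% prior, 99% sensitivity, 95% specificity) and the function name are illustrative assumptions, not values from the book.

```python
def posterior_given_positive(prior, sensitivity, specificity):
    """Bayes' rule: P(condition | positive test)."""
    true_pos = sensitivity * prior                      # likelihood x prior
    false_pos = (1.0 - specificity) * (1.0 - prior)     # false alarms among the healthy
    return true_pos / (true_pos + false_pos)            # divide by the evidence


p = posterior_given_positive(prior=0.01, sensitivity=0.99, specificity=0.95)
print(round(p, 3))  # 0.167
```

Even with an accurate test, only about one positive in six is real here, because the condition is rare to begin with.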
C
| Term | Meaning |
|---|---|
| Candidate generation → ranking | The production funnel: a cheap recall-stage (MF/\(k\)NN) then a precise rank-stage. |
| Categorical | One of \(K\) classes; \(-\log\) → cross-entropy; paired with softmax. |
| Causal mask | Lower-triangular mask: position \(t\) may attend only to positions \(\le t\) (no peeking at the future). |
| CDF \(F(x)\) | Cumulative distribution function \(F(x)=\Pr(X\le x)\) — the running total of probability; a different function describing the same distribution. |
| Central Limit Theorem | The average of many noisy pieces is bell-shaped around the truth; basis of the \(t\)-curve. |
| Chain rule | Derivative of \(f(g(x))\) = \(f'(g)\cdot g'\); the basis of back-propagation. |
| Characteristic equation | \(\det(A-\lambda I)=0\); its roots are the eigenvalues. |
| Chebyshev filter | A numerically stable polynomial filter family (ChebyCF). |
| Client / server | Devices holding private data / the coordinator that aggregates their updates. |
| Closed-form / training-free CF | Recommenders that apply a fixed filter with no gradient training (GF-CF, PSGE). |
| Cloze objective | Mask random items and predict each from both-side context (fill-in-the-blank). |
| Cold start | No history yet for a new user/item; CF cannot act. |
| Collaborative filtering (CF) | Recommend using interaction patterns across users (no content). |
| Collaborative signal | What can be learned from who interacted with what (From Graphs to LightGCN, SSL & Contrastive Learning, The Spectral / Graph-Filter View), independent of item content. |
| Component / entry | One number inside a vector. |
| Conditional probability \(\Pr(A\mid B)\) | Probability of \(A\) given \(B\) occurred \(=\Pr(A\cap B)/\Pr(B)\). |
| Confidence interval | \(\bar d \pm t^{*}s_d/\sqrt n\); the effect size with its uncertainty — significant when it excludes \(0\). |
| Conjugate prior | A prior whose posterior is the same family (Beta↔︎Bernoulli). |
| Content-based filtering | Recommend items similar (in features) to what the user liked. |
| Contextual bandit | The reward depends on a context (user/item features); LinUCB models it linearly. |
| Contrastive learning | SSL pretext: pull a node’s two views together, push other nodes apart; needs negatives; discriminative mechanism. |
| Convex | Curves up everywhere; one variable \(f''\ge0\), many variables Hessian PSD; bowl-shaped; any local minimum is also a global minimum. |
| Cosine similarity | \(\dfrac{\mathbf{u}\cdot\mathbf{v}}{\lVert\mathbf{u}\rVert\lVert\mathbf{v}\rVert}\in[-1,1]\); direction agreement, ignoring length. |
| Cost / objective / criterion | Near-synonyms for loss (cost = total; objective = neutral; criterion = the code object). |
| Coverage | Fraction of the catalog ever recommended. |
| Critical / stationary point | Where \(f'(x)=0\): peak, valley, or plateau. |
| Croissant | MLCommons machine-readable dataset-metadata standard (2024); required for NeurIPS dataset tracks. |
| Cross network (DCN) | Stacked layers that raise interaction degree by one each, explicitly and with linear cost. |
| Cross-entropy / BCE / log loss | Classification loss; bits to encode truth \(p\) with code for prediction \(q\); = −log-likelihood of Bernoulli. |
| Cross-entropy \(H(p,q)\) | \(-\sum p\log q\); surprise of \(p\)’s outcomes scored by model \(q\) — the classification loss. |
| Cross-view alignment | Pulling a node’s two embeddings (collaborative \(e_i\) and semantic \(s_i\)) together with InfoNCE. |
| CTR (click-through rate) | \(P(\text{click}\mid \text{user, item, context})\); the score the ranking stage predicts. |
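
The cosine similarity row above says it measures direction agreement and ignores length. A small sketch makes the contrast with Euclidean distance concrete; the helper names and toy vectors are assumptions.

```python
import math


def cosine_similarity(u, v):
    """u.v / (|u| |v|): direction agreement in [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line gap between the two arrow tips."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


u, v = [1.0, 2.0], [2.0, 4.0]               # same direction, different length
same_direction = cosine_similarity(u, v)    # close to 1: length is ignored
gap = euclidean_distance(u, v)              # sqrt(5): length is felt
```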
D
| Term | Meaning |
|---|---|
| Data leakage | Test-period information seeping into training; silently inflates every metric. |
| DCG / IDCG / NDCG | Discounted Cumulative Gain / its ideal / the normalized ratio in \([0,1]\). |
| DeepFM | Wide & Deep with the wide arm replaced by an FM, sharing one embedding layer; no hand-crossing. |
| Degree | How many neighbors a node has. |
| Degree matrix \(D\) | Diagonal matrix of degrees; used for normalization. |
| Degrees of freedom | \(n-1\) for the paired test; sets which Student-\(t\) curve gives the \(p\)-value. |
| Demographic parity | Equal selection / exposure rate across groups. |
| Derivative \(f'(x)\) | Instantaneous rate of change = slope of the curve at a point. |
| Determinant \(\det A\) | Area/volume-scaling factor (\(ad-bc\) for \(2\times2\)); \(\det=0\) ⇒ collapses a dimension / non-invertible. |
| Difference quotient | \(\frac{f(x+h)-f(x)}{h}\) — rise over run before taking \(h\to0\). |
| Differential privacy | Add calibrated noise to updates so no single user’s data can be inferred. |
| Differentiate | Compute a derivative (limit of differences). |
| Dimension | How many components a vector has (\(\mathbb{R}^d\) = all \(d\)-vectors). |
| Directed / undirected | Whether an edge has a direction (A→B \(\ne\) B→A) or not. |
| Directional derivative | Rate of change of \(f\) along a unit direction \(\hat{\mathbf d}\): \(\nabla f\cdot\hat{\mathbf d}\). |
| Discount \(\gamma\) | How much a later reward is worth vs. now; \(\gamma\to0\) is myopic (a bandit), \(\gamma\to1\) far-sighted. |
| Discriminative vs. generative | Axis 1 — whether a model learns \(p(y\mid x)\) (a boundary; can’t generate) or \(p(x)\)/\(p(x,y)\) (the data; can sample). BPR/LightGCN is discriminative; VAEs/LLMs are generative. |
| Distributional hypothesis | “A word is known by the company it keeps” — similar contexts ⟹ similar vectors (Firth 1957). |
| Diversity | Average dissimilarity within a recommended list. |
| Dot / inner / scalar product \(\mathbf{u}\cdot\mathbf{v}\) | \(\sum_i u_iv_i\); one number measuring agreement. |
| Dot product | Multiply-and-sum two embeddings → a similarity / ranking score. |
| Dropout | Randomly zero units during training; an ensemble-like regularizer; “edge dropout” on graphs. |
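
The DCG / IDCG / NDCG row above can be sketched in a few lines. Two assumptions are made here that the glossary does not fix: the gain is linear (`rel / log2(rank + 1)`), and the ideal ordering is obtained by sorting the same list, which presumes the list already contains every relevant item.

```python
import math


def dcg_at_k(relevances, k):
    """relevances[i] is the gain of the item shown at rank i + 1."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(relevances, k):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


perfect = ndcg_at_k([1, 0, 0], k=3)      # relevant item already at rank 1
one_lower = ndcg_at_k([0, 1, 0], k=3)    # same item pushed down to rank 2
```

Moving the single relevant item from rank 1 to rank 2 scales the score by \(1/\log_2 3\), which is the discount at work.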
E
| Term | Meaning |
|---|---|
| Early stopping | Stop when validation stops improving; implicit regularizer. |
| Edge | A connection (user watched movie). |
| Effect size | The magnitude of the difference (\(\bar d\)); report it alongside \(p\) (significance \(\ne\) size). |
| Eigenbasis | A symmetric matrix’s full set of orthogonal eigenvectors. |
| Eigenvalue \(\lambda\) | The stretch factor of an eigenvector. |
| Eigenvector | A direction a matrix only stretches, not rotates: \(A\mathbf{v}=\lambda\mathbf{v}\). |
| Eigenvector \(v\) | A special direction a matrix only stretches (not rotates): \(\tilde{A}v=\lambda v\). The graph’s eigenvectors are its “frequency patterns.” |
| Elastic net | L1 + L2 combined; “stretchable net” keeping correlated features. |
| ELBO | Evidence Lower BOund; the VAE (variational autoencoder) objective. Maximizing it is equivalent to minimizing reconstruction loss + β·KL (β = 1 for the plain VAE). |
| Embedding | A learned vector representing a user/item as a point in space. |
| Embedding layer | A learnable lookup table; id → its row (= one-hot × matrix); the input side for discrete data. |
| Enhancer / reranker | Recsys LLM roles: generate item/user text features / re-order a candidate list — not chat. |
| Enhancer pipeline | Using an LLM offline to manufacture features for a fast collaborative backbone that serves. |
| Entropy \(H(p)\) | \(-\sum p\log p\); average surprise / uncertainty / bits to encode (maximal at uniform). |
| Epoch | One full pass over all observed positives during training. |
| Epoch / batch | One full pass over the data / the examples used per update. |
| Equal opportunity | Equal true-positive rate across groups (the truly-relevant are surfaced equally). |
| ERM | Empirical Risk Minimization — “fit the training data” (usually + a regularizer). |
| Euclidean distance \(\lVert\mathbf{u}-\mathbf{v}\rVert\) | Straight-line gap between two arrow-tips; feels magnitude (unlike cosine). |
| Evidence \(p(D)\) | Normalizer; data probability averaged over \(\theta\) (a.k.a. marginal likelihood). |
| Expectation \(\mathbb{E}[X]\) / mean \(\mu\) | Probability-weighted average value. |
| Explicit / implicit feedback | Stated ratings (likes and dislikes) vs. behavioural \(0/1\) (positives only — no true negatives). |
| Explicit vs. implicit feedback | Star ratings vs. clicks/plays (positive-only, unlabeled negatives). |
| Explore vs. exploit | Try an uncertain option to learn (explore) vs. take the best-known one now (exploit). |
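
The embedding layer row above states that a lookup equals a one-hot vector times the table. The sketch below checks that on a toy table; the helper names and the values are assumptions.

```python
def one_hot(index, size):
    """Length-`size` vector with a 1 in slot `index` and 0 elsewhere."""
    return [1.0 if j == index else 0.0 for j in range(size)]


def vec_mat(x, M):
    """Row vector x (length V) times matrix M (V rows, d columns)."""
    return [sum(x[r] * M[r][c] for r in range(len(M)))
            for c in range(len(M[0]))]


embedding_table = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # toy values: 3 ids, d = 2
item_id = 1
by_lookup = embedding_table[item_id]
by_matmul = vec_mat(one_hot(item_id, len(embedding_table)), embedding_table)
```

Both routes return row 1 of the table; the lookup simply skips the multiplications by zero.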
F
| Term | Meaning |
|---|---|
| F1@K | Harmonic mean of Precision@K and Recall@K. |
| Factor vector | A feature’s short learned vector \(\mathbf v_i\); the dot product of two is their interaction strength. |
| Factorization machine (FM) | MF generalized to any features: each feature gets a vector, each feature pair interacts via a dot product (Rendle, 2010). MF is the ids-only special case. |
| FCF | Federated collaborative filtering: item factors are global; each user’s factor stays on-device. |
| Feature interaction | A pair of features whose combination matters (user \(\times\) genre); FM scores it as \(\langle\mathbf v_i,\mathbf v_j\rangle\). |
| Feature interaction / cross | The conjunction of two categories (sci-fi ∧ mobile); where the predictive signal lives. |
| Feature store | The table of generated id → (profile, vector) the serving path reads; the LLM is never on the serving path. |
| FedAvg | Aggregate by a data-size-weighted average of clients’ local models (McMahan et al., 2017). |
| Federated learning | Train a shared model across devices that keep their raw data local; only updates are shared. |
| Feed-forward (FFN) | Per-token MLP \(\max(0,xW_1{+}b_1)W_2{+}b_2\) inside each block; transforms each token (most of the parameters). |
| Feedback loop | Recommend → click → log → retrain on skewed data → more skew; the “rich get richer” spiral that compounds bias. |
| Filter response \(h(\lambda)\) | The function deciding how much each frequency is kept; defines the filter. |
| Focal loss | Cross-entropy times \((1-\hat p)^\gamma\), where \(\hat p\) is the predicted probability of the true class, to down-weight easy examples; for class imbalance. |
| Forward pass | Running an input through the net to a prediction + loss. |
| FPMC | Factorizing Personalized Markov Chains: MF term \(+\) a learned (factorized) transition term. |
| Frequency (on a graph) | An eigenvector pattern; low = smooth over neighbors, high = jagged. |
| Frozen text encoder | Run text once through a fixed pretrained model and keep the output vector — the semantic \(s_i\) RecSys aligns/consumes. |
| Full vs. sampled ranking | Rank the true item against the whole catalog vs. a few sampled negatives (the latter is inconsistent: it can order models differently from the full ranking). |
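
The factorization machine rows above say every feature pair interacts through a dot product of factor vectors. The sketch below computes only that pairwise term, once by the direct double loop and once by the well-known linear-time rewrite \(\tfrac12\sum_f\big[(\sum_i v_{if}x_i)^2-\sum_i v_{if}^2x_i^2\big]\). The function names and toy values are assumptions, and the bias and linear terms of a full FM are left out.

```python
def fm_pairwise_naive(x, V):
    """Sum over i < j of <v_i, v_j> * x_i * x_j."""
    total = 0.0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dot = sum(a * b for a, b in zip(V[i], V[j]))
            total += dot * x[i] * x[j]
    return total


def fm_pairwise_fast(x, V):
    """Same quantity in O(n * k) via the square-of-sum minus sum-of-squares identity."""
    total = 0.0
    for f in range(len(V[0])):
        s = sum(V[i][f] * x[i] for i in range(len(x)))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(len(x)))
        total += 0.5 * (s * s - s_sq)
    return total


x = [1.0, 0.0, 1.0]                          # features 0 and 2 are active
V = [[1.0, 2.0], [3.0, 4.0], [0.5, -1.0]]    # one factor vector per feature
naive = fm_pairwise_naive(x, V)
fast = fm_pairwise_fast(x, V)
```

With only features 0 and 2 active, both versions reduce to the single dot product \(\langle\mathbf v_0,\mathbf v_2\rangle\).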
G
| Term | Meaning |
|---|---|
| Gaussian (Normal) | Bell curve \((\mu,\sigma^2)\); \(-\log\) → squared error (MSE). |
| GCL | Graph Contrastive Learning — contrastive SSL on a graph recommender. |
| GCN | The model built by stacking graph-convolution layers (+ loss, prediction head, and — in vanilla GCN — \(W\) and \(\sigma\)). Operation vs. architecture. |
| Generative learning | SSL pretext: reconstruct the masked/corrupted input; no negatives; generative mechanism. |
| Generative recommendation | Recommend by generating an item’s Semantic ID token-by-token (§2.2). |
| Gini coefficient | Concentration of recommendations on few items (\(0\) even … \(1\) concentrated). |
| GPT | Generative Pre-trained Transformer (next-token, causal); a decoder for generating text (Radford 2018). |
| Gradient \(\nabla f\) | Vector of all partials; points in the direction of steepest increase. |
| Gradient descent | \(\theta\leftarrow\theta-\eta\nabla f\): step against the gradient to minimize. |
| Graph | Nodes + edges. |
| Graph convolution | The operation: replace each node’s embedding with the normalized average of its neighbors’ (one matrix multiply by the normalized adjacency). |
| Graph Fourier Transform (GFT) | Projecting a graph signal onto the eigenvectors of the (normalized) adjacency/Laplacian. |
| Graph signal | One value per node (e.g. a user’s interaction vector over items). |
| Graph Signal Processing (GSP) | Treating one-number-per-node data as a “signal” and processing it with graph frequencies. |
| Graph-CF trio | Gowalla / Yelp2018 / Amazon-Book — the frozen-split benchmark of graph-CF papers. |
| GraphMAE | A generative (masked-autoencoder) graph SSL method. |
| ε-greedy | Exploit the best estimate with prob \(1-\varepsilon\); pull a uniform-random arm with prob \(\varepsilon\). |
| GRU | Gated Recurrent Unit: a lighter 2-gate LSTM (Cho 2014); GRU4Rec is its session-based recommender. |
| GRU4Rec | A gated RNN for session-based recommendation; session-parallel batches + a ranking loss. |
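
The gradient descent row above gives the update \(\theta\leftarrow\theta-\eta\nabla f\). A minimal sketch on a one-variable bowl shows the loop; the objective \(f(\theta)=(\theta-3)^2\), the step size, and the step count are assumptions chosen for illustration.

```python
def gradient_descent(grad, theta0, eta=0.1, steps=100):
    """Repeat theta <- theta - eta * grad(theta) for a fixed number of steps."""
    theta = theta0
    for _ in range(steps):
        theta = theta - eta * grad(theta)
    return theta


# f(theta) = (theta - 3)^2 has gradient 2 * (theta - 3) and its minimum at theta = 3.
theta_hat = gradient_descent(lambda t: 2.0 * (t - 3.0), theta0=0.0)
```

With these settings each step shrinks the distance to the minimum by a factor of 0.8, so the iterate settles very close to 3. A step size above 1.0 would make the same loop diverge on this objective.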
H
| Term | Meaning |
|---|---|
| Harness / orchestration | The engineering around the LLM call — templates, validation, batching, retries, caching, cost control. |
| Hessian \(H\) | Matrix of second partial derivatives (multivariable curvature). |
| Hidden layer | A layer that is neither input nor output. |
| Hinge loss | SVM loss; flat once correct-with-margin, then a linear ramp — shaped like a hinge. |
| Hit Rate (HR@K) | Fraction of users with \(\ge 1\) relevant item in the top-\(K\). |
| Hit-Rate@K / NDCG@K | Was the held-out next item in the top \(K\)? / how high was it? (Evaluation Metrics). |
| HNSW | A multi-layer navigable-graph ANN index with \({\sim}\log M\) query cost. |
| HNSW / IVF-PQ / DiskANN / CAGRA | ANN index types: graph (in-RAM default) / inverted-list + compression (memory-thrifty) / on-disk (billion-scale) / GPU graph (batched throughput). |
| Homogeneous / heterogeneous | One node type, vs. several (users and items). |
| Hop | One step along an edge (distance in the graph); fixed by the graph, not chosen. \(K\) layers ⟹ reach \(K\) hops. |
| Huber loss | Quadratic near 0, linear far out; robust compromise (after P. Huber). |
| Hybrid | Combine content-based + collaborative (e.g. LLM features + a CF backbone). |
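
The Huber loss row above describes a quadratic region near zero and a linear region far out. One common parameterization, with a threshold `delta` that is an assumption here since the glossary does not name it, can be sketched as:

```python
def huber_loss(residual, delta=1.0):
    """0.5 * r^2 inside |r| <= delta, delta * (|r| - 0.5 * delta) outside."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)


small = huber_loss(0.5)   # quadratic branch
large = huber_loss(3.0)   # linear branch: grows like MAE, not MSE
```

The two branches meet with equal value and slope at `delta`, which is what makes the loss a smooth compromise between squared and absolute error.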
I
| Term | Meaning |
|---|---|
| Idempotency | Re-running the pipeline on unchanged input does no new work (hash → cache hit), so you generate each profile once. |
| Identity \(I\) | The “do-nothing” matrix; \(I\mathbf{x}=\mathbf{x}\). |
| Impression | One shown item; the unit of a CTR row, labelled \(1\) (clicked) or \(0\) (not). |
| Indefinite | A symmetric matrix with both positive and negative eigenvalues (curves up some ways, down others) — the Hessian at a saddle. |
| Independence | \(\Pr(A\cap B)=\Pr(A)\Pr(B)\); one event tells you nothing about the other. |
| Inference / serving | Using the trained embeddings to return a user’s top-\(K\) list. |
| Inflection point | Where \(f''\) changes sign (the curve switches between bending up and down). |
| InfoNCE | The standard contrastive loss (softmax over similarities, temperature \(\tau\)). |
| InfoNCE / NT-Xent | Contrastive loss; Info = mutual-info bound, NCE = Noise-Contrastive Estimation; temperature \(\tau\). |
| Integral \(\int\) | Area under a curve; the reverse of differentiation; gives probabilities/expectations. |
| Interaction matrix \(R\) | Users-×-movies table of 0/1 (who watched what). |
| Invertible / singular | Invertible = has an inverse \(A^{-1}\) (\(\det\neq0\), full rank); singular = no inverse (\(\det=0\)). |
| item2vec | word2vec applied to user interaction sequences (“item = word, history = sentence”); a collaborative item embedding. |
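
The idempotency row above (hash, then cache hit) can be sketched with a content-addressed cache. Everything named here is hypothetical: `generate_profile` stands in for an expensive LLM call, and an in-memory dict stands in for whatever store a real pipeline would use.

```python
import hashlib

calls = {"count": 0}   # counts how often the expensive step actually runs
cache = {}             # hash of the input text -> generated profile


def generate_profile(text):
    """Hypothetical stand-in for an expensive LLM call."""
    calls["count"] += 1
    return f"profile for: {text}"


def cached_profile(text):
    """Generate a profile only if this exact input has not been seen before."""
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in cache:
        cache[key] = generate_profile(text)
    return cache[key]


first = cached_profile("user 42 watched three sci-fi films")
second = cached_profile("user 42 watched three sci-fi films")  # cache hit, no new work
```

Keying on a hash of the input means any change to the text produces a new key and a fresh generation, while unchanged input is served from the cache.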
J
| Term | Meaning |
|---|---|
| Jaccard similarity | \(\lvert A\cap B\rvert/\lvert A\cup B\rvert\) on liked-sets; ignores rating values. |
| Jacobian \(J\) | Matrix of partials of a vector-valued function (rows=outputs, cols=inputs). |
| Joint / marginal | Joint \(p(A,B)=\Pr(A\cap B)\); marginal \(p(A)=\sum_B p(A,B)\) (sum the joint over the other variable). |
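
The Jaccard similarity row above is a ratio of set sizes. A minimal sketch, where returning 0.0 for two empty sets is an assumed convention:

```python
def jaccard(a, b):
    """|A intersect B| / |A union B| on two liked-sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumed convention when both sets are empty
    return len(a & b) / len(union)


overlap = jaccard({"m1", "m2", "m3"}, {"m2", "m3", "m4"})  # 2 shared out of 4 distinct
```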
K
| Term | Meaning |
|---|---|
| @K | Evaluated over the top \(K\) recommended positions only. |
| k-core filter | Iteratively drop users/items with \(<k\) interactions until none remain; stabilizes the data. |
| \(k\)-NN / neighbourhood | The \(k\) most-similar users/items used for a prediction. |
| KL divergence | Asymmetric “distance” between distributions (Kullback–Leibler); a divergence, not a metric. |
| KL divergence \(D_{\mathrm{KL}}(p\Vert q)\) | \(H(p,q)-H(p)\ge0\); extra surprise from using \(q\) not \(p\); asymmetric (a divergence, not a distance). |
| KV-cache | Store past tokens’ keys/values so each decode step processes only the new token: \(O(L)\) instead of \(O(L^2)\). |
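
The k-core filter row above is an iterative procedure: removing a sparse item can push a user below the threshold, so the pass must be repeated until nothing changes. A minimal sketch, assuming interactions are unique `(user, item)` pairs and using toy data:

```python
from collections import Counter


def k_core_filter(interactions, k):
    """Repeatedly drop pairs whose user or item has fewer than k interactions."""
    kept = list(interactions)
    while True:
        user_deg = Counter(u for u, _ in kept)
        item_deg = Counter(i for _, i in kept)
        filtered = [(u, i) for u, i in kept
                    if user_deg[u] >= k and item_deg[i] >= k]
        if len(filtered) == len(kept):   # nothing dropped: the k-core is reached
            return filtered
        kept = filtered


toy = [("u1", "a"), ("u1", "b"), ("u2", "a"), ("u2", "b"), ("u3", "a"), ("u3", "c")]
core = k_core_filter(toy, k=2)
```

In the toy data, item `c` is dropped first; that leaves user `u3` with a single interaction, so `u3` is dropped on the second pass.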
L
| Term | Meaning |
|---|---|
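| L1 / lasso | \(\sum_k\lvert\theta_k\rvert\) penalty; drives some weights exactly to 0 (sparse solutions); = Laplace prior. |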
| L2 / ridge / Tikhonov / weight decay | \(\sum\theta_k^2\) penalty; shrinks weights smoothly toward 0; = Gaussian prior. |
| L2 / weight decay | Regularizer \(\lambda\lVert E^{(0)}\rVert^2\); LightGCN regularizes only the base embeddings. |
| \(\lambda\) (reg. strength) | How much we weight the regularizer vs. the loss. |
| LambdaRank / LambdaMART | Listwise loss whose gradient is weighted by each pair’s effect on NDCG. |
| Laplace | Peaked/heavy-tailed \((\mu,b)\); \(-\log\) → absolute error (MAE). |
| Latent factor | A hidden, learned dimension of taste/content. |
| Law of Large Numbers | The sample average converges to the expectation as samples accumulate. |
| Law of total probability | \(p(A)=\sum_B p(A\mid B)p(B)\); the engine of Bayes’ evidence term. |
| Layer | One application of the graph-convolution operation (a computation step in the model); a hyperparameter \(K\) you choose. |
| Layer / width / depth | A bank of neurons / neurons-per-layer / number of layers. |
| Layer combination | Averaging embeddings from all layers; LightGCN’s fix for over-smoothing. |
| Learning rate \(\eta\) | Step size in gradient descent. |
| Leave-one-last split | Hold out each user’s chronologically last interaction for test (temporal, leak-free). |
| Likelihood / prior / posterior | \(p(\text{data}\mid\theta)\) / \(p(\theta)\) / \(p(\theta\mid\text{data})\). |
| Likelihood \(L(\theta)\) | \(p(\text{data}\mid\theta)\) read as a function of \(\theta\) (data fixed). |
| Limit | The value a quantity approaches (here as the run \(h\to0\)). |
| LinUCB | Ridge-regression reward estimate \(\boldsymbol\theta^{\top}\mathbf x\) plus a feature-space confidence bonus. |
| LLM | A large pretrained Transformer language model; here, demystified as the lineage traced in this note, run at scale. |
| LLM-as-enhancer | The LLM produces semantic features/profiles offline that feed/align with a classical model (§2.3). |
| LLM-as-recommender | The LLM directly ranks/selects items from a prompt of the user’s history (§2.1). |
| LLM-as-reranker | The most-deployed sub-pattern: the LLM (often a cross-encoder) re-scores a short top-\(K\) shortlist a cheap retriever already fetched, rather than ranking the whole catalogue (§2.1). |
| Local / global minimum | Local = lowest within a neighbourhood; global = lowest anywhere. |
| Log-likelihood | \(\log L(\theta)\); products become sums. |
| Logistic regression | Linear score \(\to\) sigmoid \(\to\) probability; the CTR baseline and every model’s output head. |
| Logit (log-odds) | \(\ln\!\big(p/(1-p)\big)\); maps \((0,1)\to(-\infty,\infty)\). |
| LogLoss | Binary cross-entropy; grades how calibrated the predicted probabilities are (lower better). |
| Long tail | The many low-degree (niche) items; under-trained and under-served. |
| Long tail / popularity bias | A few blockbusters get most interactions; metrics can be gamed by pushing them. |
| Loss function | A single number measuring how wrong the model is; training minimizes it. |
| Low-pass filter | Keep low frequencies (smooth signal), suppress high (noise). |
| LSTM | Long Short-Term Memory: an RNN with a cell state + forget/input/output gates; the additive cell line stops gradients vanishing. |
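
The leave-one-last split row above can be sketched as a group-by-user, sort-by-time, pop-the-last operation. The `(user, item, timestamp)` tuple layout and the function name are assumptions; note that in this sketch a user with a single interaction ends up in test with no training history.

```python
def leave_one_last_split(events):
    """events: iterable of (user, item, timestamp). Returns (train, test)."""
    by_user = {}
    for user, item, ts in events:
        by_user.setdefault(user, []).append((ts, item))
    train, test = [], []
    for user, history in by_user.items():
        history.sort(key=lambda pair: pair[0])   # chronological order
        *earlier, last = history                 # hold out the final interaction
        train.extend((user, item, ts) for ts, item in earlier)
        test.append((user, last[1], last[0]))
    return train, test


events = [("u1", "a", 1), ("u1", "b", 3), ("u1", "c", 2), ("u2", "x", 5)]
train, test = leave_one_last_split(events)
```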
M
| Term | Meaning |
|---|---|
| MAE / absolute loss | Mean of absolute errors; from Laplace noise; outlier-robust; targets the median. |
| MAE / MSE / RMSE | Mean Absolute / Mean Squared / Root-Mean-Squared rating error. |
| MAP | Maximum A Posteriori; peak of the posterior \(=\) MLE \(+\) prior \(=\) loss \(+\) regularizer. |
| MAP estimate | Maximize posterior ∝ likelihood × prior; ⟹ minimize (−log-lik) + (−log-prior) = loss + regularizer. |
| Markov chain (first-order) | Next item depends only on the last one; transitions estimated by counting. |
| Matrix | A rectangular grid of numbers; also a linear transformation of vectors. |
| Matrix Factorization (MF) | \(\hat r_{ui}=p_u^\top q_i\); trained with regularized MSE. |
| Matrix–vector product \(A\mathbf{x}\) | Dot each row of \(A\) with \(\mathbf{x}\); “\(n\) in, \(m\) out.” |
| Matryoshka (MRL) | Truncatable embeddings — a shorter prefix of the vector still works, so you index at a smaller dim to save memory/latency. |
| MDP | Markov Decision Process: state, action, reward, transition, policy — the formalism of RL. |
| Memory-based CF | Predict from similar rows/columns of \(R\) at query time (\(k\)-NN). |
| Mini-batch | A small group of triples updated together (averaged gradient); also the source of in-batch negatives. |
| MIPS (Maximum Inner-Product Search) | Finding the item vectors with the largest dot product against a query — what top-\(K\) scoring is. |
| MLE | Maximum Likelihood Estimate; \(\theta\) that best explains the data. |
| MLP | Multi-layer perceptron — a feedforward stack of fully-connected layers. |
| MNAR data | Missing-Not-At-Random: whether an interaction is observed depends on what the system chose to show — recommender logs are the textbook case. |
| Model-based CF | Learn a compact model first (e.g. matrix factorization). |
| MoE (mixture-of-experts) | A model that holds many parameters but activates only a few per token, so inference costs less than the total parameter count suggests. |
| Momentum / Adam | SGD upgrades: a velocity term / per-parameter adaptive step (Adam = the default optimizer). |
| MRR | Mean Reciprocal Rank — \(1/\)rank of the first relevant item, averaged. |
| MSE / squared loss | Mean of squared errors; from Gaussian noise; outlier-sensitive. |
| MTEB / ann-benchmarks / LMArena / CoNLL | The standard leaderboards for embeddings / vector indexes / chat-LLMs / NER. Priors, not verdicts. |
| Multi-head | Several attentions in parallel with different learned projections, then concatenated. |
| Mutual-information maximization | The framing RLMRec uses for aligning the collaborative and semantic views. |
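
The MRR row above averages \(1/\)rank of the first relevant item over users. A minimal sketch, assuming a user with no relevant item in the list contributes 0:

```python
def mean_reciprocal_rank(ranked_lists, relevant_sets):
    """Average over users of 1 / (rank of the first relevant item)."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, item in enumerate(ranked, start=1):
            if item in relevant:
                total += 1.0 / rank
                break   # only the first relevant item counts
    return total / len(ranked_lists)


ranked_lists = [["a", "b", "c"], ["x", "y", "z"], ["p", "q"]]
relevant_sets = [{"a"}, {"z"}, {"missing"}]
mrr = mean_reciprocal_rank(ranked_lists, relevant_sets)  # (1 + 1/3 + 0) / 3
```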
N
| Term | Meaning |
|---|---|
| Nabla / del (\(\nabla\)) | The symbol for the gradient operator. |
| Negative sampling | Drawing un-interacted items to act as negatives. |
| Neuron / unit | A weighted sum of inputs + bias, passed through an activation. |
| Next-item prediction | The core task: given \((i_1,\dots,i_{t-1})\), score every item for being \(i_t\). |
| NLL | Negative log-likelihood \(=-\log L\); minimizing it = MLE; this is a loss. |
| Node | A thing (a user, a movie). |
| Non-IID data | Each client’s data is small and unrepresentative of the whole — the core difficulty of federation. |
| Norm / length \(\lVert\mathbf{u}\rVert\) | The arrow’s length; default L2 \(\sqrt{\sum_i u_i^2}\). L1 \(=\sum_i\lvert u_i\rvert\) (used in lasso). |
| Novelty | Non-obviousness, often \(-\log_2(\text{popularity})\). |
| Null hypothesis \(H_0\) | The skeptical default “no real difference”; a test tries to disprove it. |
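
The negative sampling row above can be sketched as drawing from the items a user has not interacted with. Uniform sampling without replacement is an assumption here (popularity-weighted schemes also exist), as are the function name and the toy catalog.

```python
import random


def sample_negatives(user_positives, all_items, n, rng):
    """Draw n distinct items the user has NOT interacted with, uniformly at random."""
    candidates = [item for item in all_items if item not in user_positives]
    if n > len(candidates):
        raise ValueError("not enough un-interacted items to sample from")
    return rng.sample(candidates, n)


catalog = ["i1", "i2", "i3", "i4", "i5", "i6"]
positives = {"i2", "i5"}
negatives = sample_negatives(positives, catalog, n=3, rng=random.Random(0))
```

Because implicit feedback has no true negatives, some sampled items may simply be unseen items the user would have liked; the sample is a working approximation, not a label.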
O
| Term | Meaning |
|---|---|
| Odds | \(p/(1-p)\). |
| Off-policy | Learning from data collected by a different (older) policy than the one being trained. |
| Offline replay | Unbiased evaluation of a bandit policy on a log of randomly-served actions (Li et al. 2011). |
| One- vs two-sided test | Two-sided asks “A \(\ne\) B” (the default); one-sided asks “A \(>\) B” and halves \(p\) — valid only if the direction was pre-registered. |
| One-hot vector | A length-\(V\) vector, \(1\) in one slot, \(0\) elsewhere; encodes identity but no similarity (all pairs orthogonal). |
| Open weights / open source / open data | Released weights only / weights + permissive code license / + training data too. They are different and often confused. |
| Orthogonal | Perpendicular; dot product \(0\). |
| Orthonormal basis | A full set of axes that are mutually orthogonal and each unit length; a coordinate is then just a dot product (\(U,V\) in §6). |
| Outer product \(\mathbf{u}\mathbf{v}^{\top}\) | A column times a row = a whole rank-1 matrix. |
| Over-smoothing | Too many layers make all embeddings collapse to one blurry point. |
| Overfitting | Memorizing training noise; great on train, poor on new data. |
P
| Term | Meaning |
|---|---|
| \(p\)-value | Probability of a gap at least this large if \(H_0\) were true; small = significant. NOT \(\Pr(H_0\text{ true})\). |
| Paired \(t\)-test | Tests whether the mean of per-pair differences \(\bar d\) differs from 0: \(t=\bar d/(s_d/\sqrt n)\). |
| Partial derivative \(\partial f/\partial x\) | Derivative wrt one variable, others held fixed. |
| Pearson correlation | A similarity measure (centred cosine) for ratings — removes each user’s rating bias. |
| PMF / PDF | Probability mass (discrete) / density (continuous) function; sums/integrates to 1. |
| Pointwise / pairwise / listwise | Score each pair / a pair’s order / the whole list. |
| Policy \(\pi(a\mid s)\) | The learned rule mapping a state to an action; what RL optimizes. |
| Polynomial filter | A filter that is a polynomial in \(\tilde{A}\) (what LightGCN layers compute). |
| Popularity bias | Tendency to over-recommend high-degree (blockbuster) items (§13). |
| Popularity bias (in eval) | Popular items dominate the test set, so pushing them inflates accuracy metrics. |
| Position bias | Higher-ranked slots draw more clicks regardless of true relevance. |
| Positional embedding | A learned per-position vector added to item embeddings so attention can see order. |
| Positional encoding | A per-position signal added to embeddings so order-blind attention can see order (sinusoidal or learned). |
| Positive / negative pair | Positive = two views of the same node; negative = views of different nodes. |
| Positive semidefinite (PSD) | A symmetric matrix with \(\mathbf x^{\!\top}\!H\mathbf x\ge0\) for all \(\mathbf x\) (all eigenvalues \(\ge0\)); the multivariable “\(\ge0\)” that makes a Hessian convex. |
| Positive-semidefinite (PSD) | A symmetric matrix with all eigenvalues \(\ge0\) (equivalently \(\mathbf{x}^{\top}A\mathbf{x}\ge0\) always); every \(M^{\top}M\) is PSD since \(\mathbf{x}^{\top}M^{\top}M\mathbf{x}=\lVert M\mathbf{x}\rVert^{2}\). |
| Posterior \(p(\theta\mid D)\) | Updated belief after data. |
| Pre-activation \(z\) (logit) | The raw score \(\mathbf{w}\cdot\mathbf{x}+b\) before the activation. |
| Precision@K | Of the \(K\) shown, the fraction relevant. |
| Predictive learning | SSL pretext: predict a property derived from the data (e.g. a masked item/attribute); one correct answer, no negatives; discriminative mechanism. |
| Pretraining | Train a big Transformer on a huge corpus with a self-supervised objective, then reuse it. |
| Prior \(p(\theta)\) | Belief about \(\theta\) before data. |
| Profile | A short LLM-written description of a user’s taste or an item’s character, used as input to an embedder. |
| Projection \(\mathrm{proj}_{\mathbf{v}}\mathbf{u}\) | The shadow of \(\mathbf{u}\) on \(\mathbf{v}\)’s line, \(\frac{\mathbf{u}\cdot\mathbf{v}}{\lVert\mathbf{v}\rVert^{2}}\mathbf{v}\); in an orthonormal basis a coordinate is just a dot product. |
| Propensity / IPS | Propensity = probability an item was shown; IPS re-weights each interaction by \(1/\)propensity to undo exposure bias. |
| Provider vs. consumer fairness | Comparable exposure across item creators (provider) vs. comparable quality across user groups (consumer). |
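
The paired \(t\)-test row above gives \(t=\bar d/(s_d/\sqrt n)\) with \(n-1\) degrees of freedom. The sketch below computes only the statistic and the degrees of freedom; turning them into a \(p\)-value needs the Student-\(t\) distribution, which a library routine such as `scipy.stats.ttest_rel` provides. The function name and toy numbers are assumptions, and the sketch fails if all differences are identical (\(s_d=0\)).

```python
import math
import statistics


def paired_t(a, b):
    """Return (t statistic, degrees of freedom) for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    d_bar = statistics.mean(diffs)
    s_d = statistics.stdev(diffs)          # sample standard deviation (n - 1 denominator)
    return d_bar / (s_d / math.sqrt(n)), n - 1


# Toy per-user scores for model A and model B on the same three users.
t_stat, dof = paired_t([2, 4, 6], [1, 2, 3])
```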
Q
| Term | Meaning |
|---|---|
| Quantization | Storing weights/vectors in fewer bits (4/8-bit, int8/binary) to shrink memory and speed inference, at a small quality cost. |
| Query / Key / Value | Soft database lookup: a query is matched (dot product) against keys to retrieve a blend of values. |
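
The Query / Key / Value row above, together with the formula \(\mathrm{softmax}(QK^\top/\sqrt{d_k})V\), can be sketched without any libraries. This is a single-head, unmasked, unbatched illustration with lists of lists standing in for matrices; the function names and toy inputs are assumptions.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, one query row at a time."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)            # how much each key matches the query
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out


K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
blended = attention([[0.0, 0.0]], K, V)    # a zero query matches every key equally
focused = attention([[10.0, 0.0]], K, V)   # a query aligned with the first key
```

The zero query returns the plain average of the value rows, while the aligned query returns something close to the first value row: a soft lookup.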
R
| Term | Meaning |
|---|---|
| Random split | Hold out a random fraction of interactions — leaks the future for time-ordered data. |
| Random variable | A quantity whose value is uncertain (coin, die, temperature). |
| Rank | The matrix’s number of independent directions: independent rows/columns (§2) = nonzero singular values (§6). |
| Rank / latent dimension \(d\) | Width of the factor vectors; the dial between under- and over-fitting. |
| Ranking | The expensive, high-precision second stage that reorders the shortlist with a feature-rich model. |
| Recall vs QPS | The ANN trade-off: fraction of true neighbours found vs queries-per-second. Always compare at a fixed recall. |
| Recall@K | Of all relevant items, the fraction in the top-\(K\). |
| Regret | \(\sum_t(\mu^{*}-\mu_{a_t})\): reward lost versus always pulling the best arm. |
| Regularization | Making the solution of an ill-posed problem “regular” (well-behaved); the term comes from Hadamard/Tikhonov. |
| Regularizer | A term/procedure that discourages complex solutions to improve generalization. |
| Relevant item | A ground-truth item the user actually likes. |
| ReLU | \(\max(0,z)\) — Rectified Linear Unit; the default activation. |
| Representation learning | The network learning useful features by itself. |
| Residual connection | A sub-layer computes \(x+f(x)\); the \(+1\) in its derivative is a gradient highway for deep stacks. |
| Retrieval / candidate generation | The cheap, high-recall first stage that cuts millions of items to a few hundred. |
| Return \(G\) | Cumulative discounted reward \(\sum_t \gamma^t r_t\) — the long-term objective. |
| Risk / empirical risk | Expected loss over the true distribution / its average on the training sample. |
| RLMRec | An enhancer that aligns an LLM-written semantic embedding with a CF embedding via contrastive/generative alignment (§3). |
| RLMRec-Con / -Gen | RLMRec’s contrastive / generative variants for aligning LLM semantics with GNN embeddings. |
| RMSE / Recall@K / NDCG@K | Rating-error / top-\(K\) ranking-quality metrics. |
| RNN | Recurrent Neural Network: one shared cell that updates a hidden-state “memory” \(h_t=\tanh(Wh_{t-1}+Ux_t+b)\) along a sequence. |
| RNN / hidden state \(\mathbf h_t\) | A net that folds each item into a running fixed-length summary of the session. |
| RQ-VAE | The residual-quantized variational autoencoder that learns the Semantic-ID codebooks by reconstructing the content embedding (§2.2). |
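
Precision@K and Recall@K differ only in the denominator, which a small example makes concrete. The sketch below is a minimal illustration in plain Python; the function names and the toy ranking are assumptions, and it divides precision by \(K\) even when fewer than \(K\) items are returned, which is one common convention among several.

```python
def precision_at_k(ranked, relevant, k):
    """Of the K shown, the fraction that is relevant."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k


def recall_at_k(ranked, relevant, k):
    """Of all relevant items, the fraction that appears in the top K."""
    if not relevant:
        return 0.0  # assumed convention when a user has no relevant items
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


# Toy data (assumed): the model's ranking and the ground-truth relevant set.
ranked = ["a", "b", "c", "d", "e"]
relevant = {"b", "e", "z"}   # "z" was never recommended

p3 = precision_at_k(ranked, relevant, 3)   # 1 hit out of 3 shown
r3 = recall_at_k(ranked, relevant, 3)      # 1 hit out of 3 relevant
p5 = precision_at_k(ranked, relevant, 5)   # 2 hits out of 5 shown
r5 = recall_at_k(ranked, relevant, 5)      # 2 hits out of 3 relevant
```

Growing \(K\) from 3 to 5 raises recall here, while precision moves only because the extra slots happen to contain one more hit; in general the two metrics can move in opposite directions as \(K\) grows.
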
S
| Term | Meaning |
|---|---|
| Saddle point | Critical point that is a min in some directions, a max in others (indefinite Hessian). |
| Sample space / event | The set of possible outcomes / a subset of them. |
| Sampled softmax (SSM) | Listwise cross-entropy over a positive + many sampled negatives; structurally = InfoNCE. |
| SASRec | Self-Attentive Sequential Recommendation: a causal Transformer over the item sequence. |
| SASRec / BERT4Rec | Sequential recommenders: a causal (left-to-right) / bidirectional (masked) Transformer over the item history. |
| Second derivative \(f''\) | Rate of change of the slope = curvature. |
| Secure aggregation | The server sees only the sum of client updates, never any individual one. |
| Selection / exposure bias | Only items the system showed can get feedback; un-shown items look “disliked” by silence. |
| Self-attention | Attention of a sequence to itself (Q, K, V all from the same tokens). |
| Self-loop | An artificial edge from a node to itself; present in GCN, absent in LightGCN. |
| Semantic embedding | A fixed-length vector encoding the meaning of the profile text (from a frozen sentence encoder), distinct from the CF latent space. |
| Semantic ID | A short sequence of content-derived codewords identifying an item (quantized text embedding). |
| Sequential recommendation | Predict the next item from a user’s ordered history. |
| Serendipity | Relevant and surprising. |
| Session-based recommendation | Sequential recommendation from a short, often anonymous current session (no user ID). |
| SGD | Stochastic Gradient Descent — gradient steps on random mini-batches. |
| SGL / SimGCL / XSimGCL / LightGCL | GCL methods differing in how the two views are built (edge-drop / noise / cross-layer noise / SVD). |
| Shape \(m\times n\) | \(m\) rows by \(n\) columns. |
| \(\sigma\) (nonlinearity) | Activation (e.g. ReLU) in GCN/NGCF; removed in LightGCN. |
| Sigmoid \(\sigma(z)\) | \(1/(1+e^{-z})\); inverse of the logit; score → probability. |
| Significance (paired \(t\)-test) | Is a gain bigger than run-to-run seed noise? Report mean \(\pm\) std and a \(p\)-value. |
| Significance level \(\alpha\) | The pre-set bar (usually \(0.05\)); reject \(H_0\) if \(p<\alpha\). |
| Singular value \(\sigma\) | The strength of an SVD pattern; \(\sigma\ge0\), biggest first. |
| Slate | A whole page/list of recommended items — the (combinatorial) action in slate RL. |
| SlateQ | Decomposes a slate’s value into per-item \(Q\)-values, making value-based slate RL tractable. |
| SLM | Small language model (\(\approx 0.5\)–9B parameters): cheap, high-throughput; the recsys feature-generation workhorse. |
| Smooth | Has a derivative everywhere — no kinks or jumps; looks straight up close. |
| Softmax | Multi-class sigmoid; scores → a probability distribution. |
| Softmax output | Multi-class output head: \(K\) scores → a probability distribution, paired with cross-entropy. |
| Sparse / one-hot feature | A category encoded as all-zeros with a single \(1\); CTR rows stack millions of these. |
| Sparsity | Most of \(R\) is unknown (real matrices \({\sim}99.5\%\) empty); the core difficulty. |
| SSL (self-supervised learning) | Training on an auxiliary task that needs no labels; supervision is manufactured from the data’s own structure. |
| Standard error | \(s_d/\sqrt n\) — how much an estimate wobbles across repeats; shrinks with more data. |
| State \(s\) | A summary of the current situation; the ingredient bandits lack (actions change it). |
| Static vs contextual | word2vec gives one fixed vector per word; a Transformer gives a different vector per occurrence (context-dependent). |
| Stochastic gradient descent (SGD) | Gradient descent on a noisy gradient estimated from a random minibatch, not the full dataset. |
| Structured output | Forcing an LLM to emit schema-valid JSON via constrained decoding — valid shape, not guaranteed-correct values. |
| Structured output / JSON mode | Forcing the LLM to emit schema-valid JSON, so the result is reliably parseable. |
| Subgradient | A stand-in slope at a kink where the true derivative is undefined (e.g. any value in \([0,1]\) for ReLU at \(0\)). |
| Subword tokenization (BPE) | Split text into reusable sub-word pieces by greedily merging the most frequent adjacent pair; fixed vocabulary, no `<UNK>` (WordPiece/SentencePiece are cousins). |
| Surprise / self-information | \(-\log p(x)\); how surprising an outcome is (\(0\) if certain, \(\to\infty\) as \(p(x)\to0\)). |
| SVD (\(M=U\Sigma V^\top\)) | Factorizes a (rectangular) matrix into patterns \(U,V\) and strengths \(\Sigma\); keeping the top-\(q\) = an ideal low-pass (PSGE, LightGCL). |
| SVD \(M=U\Sigma V^{\top}\) | Factor any matrix into orthonormal patterns (\(U,V\)) scaled by singular values (\(\Sigma\)). |
| Symmetric | \(A=A^{\top}\); clean orthogonal eigenvectors. |
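
Two statements in the table above can be verified directly: that the sigmoid inverts the logit, and that softmax is a "multi-class sigmoid". The sketch below uses only the standard library; the function names are illustrative, and the two-class check relies on the identity \(\mathrm{softmax}([z,0])_1 = e^{z}/(e^{z}+1) = \sigma(z)\).

```python
import math


def sigmoid(z):
    """1 / (1 + e^{-z}), written to avoid overflow for large negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def logit(p):
    """Log-odds of a probability p in (0, 1); the inverse of the sigmoid."""
    return math.log(p / (1.0 - p))


def softmax(scores):
    """Scores -> probability distribution (max-shifted for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


z = 1.7
round_trip = logit(sigmoid(z))       # recovers z up to rounding
two_class = softmax([z, 0.0])[0]     # matches sigmoid(z) up to rounding
probs = softmax([2.0, 1.0, 0.1])     # a distribution over three classes
```

With two classes and the second score pinned at zero, softmax reduces to the sigmoid, which is why a binary classifier needs only one output score.
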
T
| Term | Meaning |
|---|---|
| Taylor / linear approximation | \(f(x+\delta)\approx f(x)+f'(x)\delta\); a curve looks straight up close. |
| Temperature \(\tau\) | Scales the contrast sharpness in InfoNCE; a sensitive hyperparameter. |
| Temporal / leave-one-last split | Hold out each user’s chronologically last interaction; the honest, leak-free default. |
| TF-IDF | A way to turn text into a feature vector (term frequency × inverse document frequency). |
| The recsys leak | Sending an update for item \(i\) reveals the user touched \(i\); FedRec masks this with decoy items. |
| Thompson sampling | Keep a posterior per arm; sample one value from each and pull the largest sample. |
| Top-\(K\) | The \(K\) highest-scoring unseen items returned to the user. |
| Trace | Sum of a square matrix’s diagonal entries; equals the sum of its eigenvalues. |
| Train/test split | How held-out test data is chosen: random, leave-one-out (LOO), or temporal (by timestamp). |
| Transformer | A stack of (self-attention + add&norm + feed-forward + add&norm) blocks; no recurrence, fully parallel (Vaswani et al., 2017). |
| Transition probability | \(P(\text{next}=j\mid\text{last}=i)\) — count of \(i\!\to\!j\) over count of \(i\)-as-from. |
| Transpose \(A^{\top}\) | The matrix mirrored across its diagonal (rows ↔︎ columns). |
| Triple \((u,i^+,i^-)\) | A training example for pairwise ranking: user \(u\) prefers positive \(i^+\) over sampled negative \(i^-\). |
| Truncated SVD | Keep the top-\(q\) singular components = best low-rank approximation (least-squares / Frobenius sense). |
| Two-stage funnel | Retrieval → ranking: fast-and-forgiving then slow-and-sharp, so a rich model can run on only a few hundred candidates. |
| Two-tower LLM-embedding retrieval | A hybrid: embed users and items with an LLM-grade encoder offline, then retrieve by ANN over those vectors at serve time — enhancer-class cost, LLM semantics on the retrieval path (§4). |
| Type I / II error | False positive (reject true \(H_0\)) / false negative (miss a real effect). |
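
The counting definition of transition probability (count of \(i\to j\) over count of \(i\)-as-from) can be sketched as below. This is an illustrative plain-Python version; the function name `transition_probs` and the toy sessions are assumptions made for the example.

```python
from collections import Counter, defaultdict


def transition_probs(sessions):
    """Estimate P(next = j | last = i) from consecutive pairs in each session."""
    pair_counts = defaultdict(Counter)
    for session in sessions:
        for i, j in zip(session, session[1:]):
            pair_counts[i][j] += 1

    probs = {}
    for i, nexts in pair_counts.items():
        total = sum(nexts.values())          # how often i appears as "from"
        probs[i] = {j: c / total for j, c in nexts.items()}
    return probs


# Toy sessions (assumed): each list is one user's ordered item sequence.
sessions = [
    ["a", "b", "c"],
    ["a", "b", "d"],
    ["a", "c"],
]

P = transition_probs(sessions)
# "a" is followed by "b" twice and by "c" once.
```

Items that only ever appear last in a session (here `"c"` and `"d"`) have no outgoing counts and therefore no row in `P`; a real system would need some fallback for them.
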
U
| Term | Meaning |
|---|---|
| UCB | Pull \(\arg\max_i\,[\hat\mu_i + \sqrt{2\ln t / n_i}\,]\): highest upper bound — optimism under uncertainty. |
| Uniform / popularity-weighted / in-batch / hard negatives | Four sampling distributions for \(i^-\): equal-probability; \(\propto\text{pop}^{0.75}\); reuse the batch’s other positives; pick high-scoring (confusing) items. |
| Uniformity | Embeddings spread evenly over the sphere; fights popularity bias / collapse. |
| Unit vector | A vector of length \(1\); “normalizing” = dividing by your own length. |
| Universal approximation | A wide enough net can approximate any continuous function on a bounded (compact) domain to any desired accuracy. |
| User–item matrix \(R\) | Rows = users, cols = items, entries = ratings/clicks (mostly unknown). |
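
The UCB rule in the table above, \(\arg\max_i[\hat\mu_i+\sqrt{2\ln t/n_i}]\), can be sketched as a single selection step. The code below is a minimal illustration; the function name, the toy statistics, and the choice to pull any never-tried arm first are assumptions rather than details from the book.

```python
import math


def ucb_pick(means, counts, t):
    """Return the index of the arm with the highest upper confidence bound.

    means[i]  : empirical mean reward of arm i
    counts[i] : number of times arm i has been pulled
    t         : total number of pulls so far
    """
    best_arm, best_score = None, -math.inf
    for arm, (mu, n) in enumerate(zip(means, counts)):
        if n == 0:
            return arm  # assumed convention: an untried arm has an unbounded bonus
        score = mu + math.sqrt(2.0 * math.log(t) / n)
        if score > best_score:
            best_arm, best_score = arm, score
    return best_arm


# Toy statistics (assumed): arm 0 looks best but is already well explored.
means = [0.60, 0.50, 0.40]
counts = [100, 10, 2]
t = sum(counts)

choice = ucb_pick(means, counts, t)
```

Arm 2 has the lowest observed mean but has been pulled only twice, so its uncertainty bonus dominates and it is chosen; once all arms are equally explored the bonuses match and the rule falls back to the highest mean.
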
V
| Term | Meaning |
|---|---|
| Value / \(Q\)-value | Expected return from a state (\(V\)) or a state–action pair (\(Q\)). |
| Vanishing gradient | Back-prop through many steps multiplies many small factors → gradient decays → long-range dependencies unlearnable. |
| Variance \(\sigma^2\) | Expected squared distance from the mean; spread. |
| Vector | An ordered list of numbers; equivalently, an arrow from the origin to a point. |
| View | One augmented version of a node’s representation; contrastive learning needs two per node. |
W
| Term | Meaning |
|---|---|
| \(W\) (weight matrix) | Learnable feature transform in GCN/NGCF; removed in LightGCN. |
| Weight / bias | Learned multiplier per input / learned constant offset. |
| Weighted / unweighted | Whether edges carry a number (rating/count) or are just 0/1. |
| Wide & Deep | Joint linear-with-hand-crosses (memorize) \(+\) MLP-over-embeddings (generalize) model. |
| Wilcoxon signed-rank | Non-parametric paired test (ranks, not values); use when differences aren’t normal. |
| word2vec | Prediction-based word embeddings: skip-gram trains a word to predict its context, CBOW trains the context to predict the word (Mikolov et al., 2013). |
| WRMF / ALS | Weighted Regularized MF / Alternating Least Squares (implicit-feedback, squared error). |
X
| Term | Meaning |
|---|---|
| xDeepFM / AutoInt / DIN | Vector-wise crosses (CIN) / attention-learned interactions / per-candidate behavior attention. |
Z
| Term | Meaning |
|---|---|
| Zero-shot NER | Extracting arbitrary entity types with no task-specific training (GLiNER, NuNER). |