Representation Learning & the Transformer

1. From symbols to vectors: the representation problem

A neural network (previous note) consumes a vector of numbers. But the inputs we care about in recommendation are discrete symbols: the movie Titanic, the word “great”, the user #4072. There is no number you can hand a network for “Titanic”. So the first question of this whole note is:

How do you turn a discrete symbol into a vector — one whose geometry carries useful meaning?

1.1 The naive answer: one-hot, and why it fails

Number the catalog \(1,\dots,V\) (\(V\) = vocabulary size). Represent symbol \(k\) by the one-hot vector \(\mathbf{e}_k\in\mathbb{R}^V\): a \(1\) in slot \(k\), \(0\) everywhere else.

Decode it: the vector is a switchboard with exactly one switch on. Slot \(k\) means “is this symbol \(k\)? yes/no”. No slot means anything else.

This is technically a vector, but it has two fatal flaws:

Dimensionality. With a million movies, every item is a million-long vector that is all zeros but one. Wasteful, and the layer that consumes it needs a million weights per neuron.
No similarity — the killer. Any two distinct one-hots are orthogonal, so their dot product is \(0\) and their cosine similarity (Linear Algebra note) is

\[ \cos(\mathbf{e}_i,\mathbf{e}_j)=\frac{\mathbf{e}_i\cdot\mathbf{e}_j}{\lVert\mathbf{e}_i\rVert\,\lVert\mathbf{e}_j\rVert}=\frac{0}{1\cdot 1}=0 \quad (i\neq j). \]

Titanic and The Notebook (both romances) are exactly as far apart as Titanic and a chainsaw documentary. One-hot encodes identity but zero relationship — and relationship is the entire game in recommendation.
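The orthogonality claim can be checked numerically. The sketch below assumes a toy catalog of five symbols; the helper names `one_hot` and `cosine` and the slot assignments are illustrative, not part of the original note.

```python
import numpy as np

def one_hot(k: int, vocab_size: int) -> np.ndarray:
    """Return the one-hot vector e_k (0-indexed slot k) of length vocab_size."""
    e = np.zeros(vocab_size)
    e[k] = 1.0
    return e

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: dot product divided by the product of the norms."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

V = 5  # toy catalog size (assumption)
titanic, notebook, chainsaw_doc = one_hot(0, V), one_hot(1, V), one_hot(2, V)

sim_romances = cosine(titanic, notebook)       # two romances
sim_unrelated = cosine(titanic, chainsaw_doc)  # romance vs documentary
sim_self = cosine(titanic, titanic)            # a symbol with itself
```

Both cross-similarities come out as zero while the self-similarity is one, so the encoding distinguishes symbols but cannot rank how related they are.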

1.2 The fix: a dense, low-dimensional embedding

Replace the \(V\)-long switchboard with a short, dense vector of real numbers learned from data:

\[ \text{symbol } k \;\longmapsto\; \mathbf{v}_k \in \mathbb{R}^{d}, \qquad d \ll V \;\;(\text{e.g. } d=64,\ 300). \]

This \(\mathbf{v}_k\) is an embedding, and the table of all of them (a \(V\times d\) matrix) is an embedding matrix — a lookup table you train like any other weights.
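A minimal sketch of the lookup table, assuming toy sizes \(V=6\), \(d=3\) and random initial values (in a real model the rows are trained). It also shows that reading row \(k\) is the same as multiplying the one-hot \(\mathbf{e}_k\) by the matrix, which is why the embedding can be seen as the first layer of the network.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 3                      # toy vocabulary size and embedding width (assumptions)
E = rng.normal(size=(V, d))      # embedding matrix: one row per symbol

k = 4
v_k = E[k]                       # the lookup: just read row k

e_k = np.zeros(V)
e_k[k] = 1.0
v_via_matmul = e_k @ E           # the same row, obtained as one-hot times matrix
```

In practice only the row read is used; the matrix product is never formed, since it would multiply by \(V-1\) zeros.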

Why the name “embedding.” You embed the discrete set of \(V\) symbols into a continuous \(d\)-dimensional space, the way a curve is embedded in a plane — each symbol becomes a point, and now the distances and directions between points carry meaning. The guiding slogan of this whole note:

meaning = location in a vector space.

Two symbols that behave alike end up at nearby points (high cosine), so the network can generalize: learn something about Titanic and it transfers to its neighbor The Notebook. One-hot could never do that.

You have already met embeddings — three times. This is not a new object; it is the one object this whole curriculum keeps reusing:

The same **embedding** object appears across the whole book — what differs is only *how* it is learned: a dot-product fit (MF), graph propagation (LightGCN), or context-prediction (word2vec/Transformer).
Where	The “embedding” is	Note
Matrix factorization	the latent vectors \(\mathbf{p}_u,\mathbf{q}_i\) with \(r_{ui}\approx\mathbf{p}_u\cdot\mathbf{q}_i\)	Traditional RecSys
LightGCN	the featureless ID embedding of each user/item, refined by propagation	From Graphs to LightGCN
Word/text models	the word vector \(\mathbf{v}_k\) (this chapter)	here

The only question that differs across them is how the embedding is learned. The rest of this chapter is a tour of answers, each more powerful than the last.

1.3 Two ways to learn an embedding

Count-based. Build a big table of co-occurrence counts (how often words appear together; how often items are co-watched), then compress it. TF-IDF (Traditional RecSys note) and classical word-count matrices are this. Simple, but the vectors are still huge and sparse before compression.
Predict-based. Train low-dimensional vectors so that they are good at predicting context. This is word2vec (§2) — and it is the idea that scales all the way to LLMs.

The rest of the note follows the predict-based road, because it is the one the modern field is built on.

2. Distributed representations: word2vec (and item2vec)

2.1 The distributional hypothesis

How can a vector learn meaning with no dictionary? The trick is an old linguistic observation:

“You shall know a word by the company it keeps.” — J. R. Firth, 1957 (and Z. Harris, 1954). Words that appear in similar contexts tend to have similar meanings. Coffee and tea show up around drink, cup, morning; so if we force their vectors to predict the same neighbours, the vectors come out close.

This is the distributional hypothesis, and it has a direct recommendation analogue: items consumed in similar contexts (by similar users, in similar sessions) are similar items. Same idea, swap “word in a sentence” for “item in a user’s history”.

Why “distributed” representation. The meaning of a symbol is not stored in one slot (as in one-hot, a localist code) but spread across all \(d\) coordinates — each coordinate is a soft, shared feature (“is-royal-ish”, “is-romance-ish”). Meaning is distributed over the dimensions; that sharing is exactly what lets vectors generalize.

2.2 word2vec: train vectors to predict their neighbours

word2vec (Mikolov et al., 2013) makes the hypothesis into an objective. Its skip-gram variant: slide a window over text; for each center word \(w\), try to predict each context word \(c\) in the window. Every word carries two vectors — a center vector \(\mathbf{v}_w\) and a context vector \(\mathbf{u}_c\) — and the model scores a (center, context) pair by their dot product, turned into a probability by softmax over the whole vocabulary:

\[ P(c \mid w) \;=\; \frac{\exp\!\big(\mathbf{u}_c\cdot\mathbf{v}_w\big)}{\displaystyle\sum_{c'=1}^{V}\exp\!\big(\mathbf{u}_{c'}\cdot\mathbf{v}_w\big)} . \]

Decoded, component by component:

\(\mathbf{v}_w\) — the center word’s vector (“what I am”).
\(\mathbf{u}_c\) — a context word’s vector (“what tends to sit near me”).
\(\mathbf{u}_c\cdot\mathbf{v}_w\) — the dot product = compatibility score (Linear Algebra note): big when the two vectors point the same way.
\(\exp(\cdot)\) — make it positive.
the denominator \(\sum_{c'}\) — sum over every word in the vocabulary, so the result is a probability that sums to \(1\). This is exactly the softmax of the Probability primer (§3), and training minimizes its cross-entropy (§7; Losses & Regularizers). Push up the score of words that do appear in the window; push down the rest.

Why “skip-gram” and “CBOW.” Skip-gram skips across the window predicting context from the center; its mirror image, CBOW (Continuous Bag Of Words), predicts the center from the averaged (“bag of”) context. Same vectors, opposite direction of prediction.

Negative sampling — the practical trick. That denominator sums over all \(V\) words — far too expensive for a real vocabulary. word2vec replaces it with negative sampling: for each true (center, context) pair, draw a handful of random “negative” words and only train the model to say yes to the true pair and no to the few negatives. This is the same move you saw in Losses & Regularizers (sampled softmax / BPR) and that powers InfoNCE (SSL & Contrastive Learning): don’t normalize over everything; contrast the positive against a few negatives.
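As a rough sketch, the negative-sampling objective for one (center, context) pair can be written as \(-\log\sigma(\mathbf u_c\cdot\mathbf v_w)-\sum_{n}\log\sigma(-\mathbf u_n\cdot\mathbf v_w)\), the form used in the word2vec paper. The function name `sgns_loss` and the toy vectors below are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(v_w, u_pos, U_neg):
    """Negative-sampling loss for one (center, context) pair.

    v_w   : center vector, shape (d,)
    u_pos : context vector of the observed neighbour, shape (d,)
    U_neg : context vectors of k sampled negatives, shape (k, d)
    """
    pos_term = -np.log(sigmoid(u_pos @ v_w))              # say "yes" to the true pair
    neg_term = -np.sum(np.log(sigmoid(-(U_neg @ v_w))))   # say "no" to each negative
    return float(pos_term + neg_term)

v_w = np.array([1.0, 1.0])
u_pos = np.array([1.0, 1.0])
U_neg = np.array([[1.0, -1.0], [-1.0, 0.0]])   # two hypothetical negatives

loss_aligned = sgns_loss(v_w, u_pos, U_neg)      # true context aligned with center
loss_misaligned = sgns_loss(v_w, -u_pos, U_neg)  # true context pointing away
```

The cost per pair is proportional to the number of negatives drawn rather than to \(V\), which is the point of the trick.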

Worked: what “predict your neighbours” computes. Take a romance-leaning center word with vector \(\mathbf v_w=[1,1]\) and score three candidate context words by the dot product \(\mathbf u_c\cdot\mathbf v_w\): Notebook \(\mathbf u=[1,1]\) (score \(2\)), La La Land \(\mathbf u=[0,1]\) (score \(1\)), Die Hard \(\mathbf u=[1,-1]\) (score \(0\)). Softmax those scores (\(e^2,e^1,e^0 = 7.39,\,2.72,\,1.00\), summing to \(11.11\)) gives \[ P(\text{Notebook}\mid w)=\tfrac{7.39}{11.11}=0.665,\qquad P(\text{LaLaLand}\mid w)=0.245,\qquad P(\text{DieHard}\mid w)=0.090 . \] The model places two-thirds of its probability on the word whose vector aligns with the center and the least (\(0.09\)) on the word whose vector is orthogonal to it; training then nudges the vectors so the actually observed neighbours score higher still. That one pressure, repeated over a corpus, is what carves “meaning = location” (§2.3).
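The same arithmetic in a few lines of NumPy, as a check on the worked numbers. The vectors are the ones from the example; subtracting the maximum score before exponentiating is a standard numerical-stability step and does not change the result.

```python
import numpy as np

v_w = np.array([1.0, 1.0])          # center vector
U = np.array([[1.0, 1.0],           # Notebook
              [0.0, 1.0],           # La La Land
              [1.0, -1.0]])         # Die Hard

scores = U @ v_w                            # dot product of each context vector with v_w
exp_scores = np.exp(scores - scores.max())  # shift by the max for numerical stability
probs = exp_scores / exp_scores.sum()       # softmax over the three candidates
```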

2.3 “Meaning = location”: the famous parallelogram

The headline result that made word2vec famous: directions in the learned space line up with semantic relationships, so meaning can be done with arithmetic:

\[ \mathbf{v}_{\text{king}} - \mathbf{v}_{\text{man}} + \mathbf{v}_{\text{woman}} \;\approx\; \mathbf{v}_{\text{queen}} . \]

The picture below makes it exact with a 2-axis cartoon of the (really a few-hundred-dimensional) space — one axis stands for gender, one for royalty:

Figure 6.1: The “royalty” relationship is a single direction (the two red arrows are parallel and equal). So king \(-\) man \(+\) woman lands exactly on queen.

Check it by hand:

\[ \underbrace{(3,3)}_{\text{king}}-\underbrace{(3,1)}_{\text{man}}+\underbrace{(1,1)}_{\text{woman}}=(1,3)=\underbrace{\mathbf{v}_{\text{queen}}}_{(1,3)}, \qquad \cos\big(\text{result},\,\text{queen}\big)=1.0 . \]

The relationship “\(\text{man}\!\to\!\text{king}\)” is a vector (here \((0,2)\), “add royalty”); add it to woman and you arrive at queen. Meaning became geometry. (For contrast, \(\cos(\text{king},\text{man})=0.894\) — same gender, different royalty, so close but not identical.)
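The hand check can be reproduced in code. This sketch uses the four cartoon coordinates from the example and an illustrative `cosine` helper; real word vectors would only satisfy the analogy approximately.

```python
import numpy as np

vec = {
    "king":  np.array([3.0, 3.0]),
    "man":   np.array([3.0, 1.0]),
    "woman": np.array([1.0, 1.0]),
    "queen": np.array([1.0, 3.0]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

result = vec["king"] - vec["man"] + vec["woman"]

# Nearest word to the result by cosine similarity.
nearest = max(vec, key=lambda w: cosine(result, vec[w]))
cos_king_man = cosine(vec["king"], vec["man"])
```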

2.4 item2vec: word2vec is already a recommender

Here is the payoff. Take a user’s watch history in order and call it a “sentence”; call each item a “word”. Run skip-gram on millions of these “sentences” and you learn an item embedding in which co-watched items sit close together — purely from the distributional hypothesis applied to behaviour. This is item2vec (Barkan & Koenigstein, 2016), and its e-commerce cousin prod2vec.

No content, no ratings — just “items that keep each other company are similar”, learned by prediction.
The result is a content-free collaborative embedding, the same kind of object matrix factorization and LightGCN learn — reached by the language-model route.
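To make the “history as sentence” idea concrete, here is a sketch of how (center, context) training pairs could be generated from one watch history. The function name `skipgram_pairs` and the `window` parameter are assumptions for illustration; the pairs would then be fed to a skip-gram trainer.

```python
def skipgram_pairs(history, window=2):
    """Return (center, context) pairs for every item and its neighbours within the window."""
    pairs = []
    for i, center in enumerate(history):
        lo = max(0, i - window)
        hi = min(len(history), i + window + 1)
        for j in range(lo, hi):
            if j != i:                      # an item is not its own context
                pairs.append((center, history[j]))
    return pairs

history = ["Titanic", "The Notebook", "La La Land"]
pairs = skipgram_pairs(history, window=1)
```

With a window of one, the middle item is paired with both neighbours and each end item with one, giving four pairs for this three-item history.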

The bridge, stated once. Every “turn a thing into a vector” method in modern RecSys is word2vec’s idea wearing different clothes. Keep the slogan: meaning = location, learned by predicting context.

3. Sequences: RNNs and LSTMs

Embeddings turn one symbol into one vector. But a review is a sequence of words, and a watch history is a sequence of items — and order matters: “not good” \(\neq\) “good not”; Empire Strikes Back before Return of the Jedi tells a story that the reverse does not. Averaging the embeddings (a “bag of vectors”) throws order away. We need a model that reads left to right and remembers.

3.1 The Recurrent Neural Network (RNN)

An RNN keeps a running summary — a hidden state \(h_t\) (“memory”) — and updates it one step at a time:

\[ h_t \;=\; \tanh\!\big(\,W\,h_{t-1} \;+\; U\,x_t \;+\; b\,\big). \]

Decoded, component by component:

\(x_t\) — the input vector at step \(t\) (the \(t\)-th word/item embedding).
\(h_{t-1}\) — the memory carried in from the previous step (the past, compressed).
\(U\,x_t\) — how the new input enters the memory (\(U\) = input-to-hidden weights).
\(W\,h_{t-1}\) — how the old memory carries forward (\(W\) = hidden-to-hidden weights — the recurrent weights).
\(b\) — a bias (Neural Networks note).
\(\tanh(\cdot)\) — squash the sum into \((-1,1)\) so the memory can’t blow up (Neural Networks note).
\(h_t\) — the updated memory, used both to make a prediction now and as the input to step \(t{+}1\).

Why “recurrent.” The same weights \(W,U,b\) are applied again and again at every step, and the output feeds back into the next step: the computation recurs. One small cell, unrolled across the sequence (Figure 6.2).

Worked by hand (scalars, so you can check every digit). Let \(W=0.5,\ U=1,\ b=0\), start \(h_0=0\), and feed the tiny sequence \(x=(+1,-1)\):

\[ \begin{aligned} h_1 &= \tanh(0.5\cdot 0 + 1\cdot(+1)) = \tanh(1.0) = \mathbf{0.762},\\ h_2 &= \tanh(0.5\cdot 0.762 + 1\cdot(-1)) = \tanh(-0.619) = \mathbf{-0.551}. \end{aligned} \]

The memory rose to \(+0.762\) on the positive input, then the negative input pulled it down to \(-0.551\) — but not all the way, because \(h_1\)’s value carried forward through \(W h_1\). That carry-over is the memory.
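The scalar trace above, as a loop. The function name `rnn_step` is illustrative; the weights and inputs are the ones from the worked example.

```python
import math

def rnn_step(h_prev, x, W=0.5, U=1.0, b=0.0):
    """One scalar RNN update: h_t = tanh(W * h_{t-1} + U * x_t + b)."""
    return math.tanh(W * h_prev + U * x + b)

h = 0.0                      # h_0
trace = []
for x in (+1.0, -1.0):       # the tiny input sequence
    h = rnn_step(h, x)       # the same weights are reused at every step
    trace.append(h)
```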

Figure 6.2: The RNN cell unrolled: one shared cell (\(W,U,b\)) applied at each step; memory \(h_t\) flows left to right while inputs \(x_t\) enter from below.

3.2 The vanishing-gradient problem

Train an RNN by back-propagation (Neural Networks note) and you back-propagate through time: the chain rule runs across every step. The gradient that reaches step \(1\) from a loss at step \(T\) is a product of \(T-1\) factors \(\partial h_t/\partial h_{t-1}\), one per step crossed. If those factors are mostly \(<1\) in magnitude, the product shrinks toward zero (\(0.5^{20}\approx 10^{-6}\)); if mostly \(>1\), it explodes.

Vanishing gradients → the network cannot learn long-range dependencies: by the time the signal from step 50 reaches step 1, it has decayed to nothing. The RNN effectively forgets the distant past.
This is the same chain-rule product from the Calculus and Neural Networks notes — just stretched over time, where the depth equals the sequence length.
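A small sketch of that product for the scalar RNN of §3.1. For \(h_t=\tanh(W h_{t-1}+U x_t)\) the per-step factor is \(\partial h_t/\partial h_{t-1}=W\,(1-h_t^2)\). The all-zero input sequence is an assumption chosen so that \(h_t\) stays at \(0\) and each factor is exactly \(W\); the function name is illustrative.

```python
import math

def grad_through_time(W, U, xs, h0=0.0):
    """Return d h_T / d h_0 for a scalar tanh RNN as a product of per-step factors."""
    h, grad = h0, 1.0
    for x in xs:
        h = math.tanh(W * h + U * x)
        grad *= W * (1.0 - h * h)    # chain rule: d h_t / d h_{t-1}
    return grad

xs = [0.0] * 20                                        # 20 steps of zero input (assumption)
g_vanish = grad_through_time(W=0.5, U=1.0, xs=xs)      # factors of 0.5: shrinks
g_explode = grad_through_time(W=1.5, U=1.0, xs=xs)     # factors of 1.5: grows
```

With non-zero inputs the \((1-h_t^2)\) term is below one, which pushes the product further toward zero.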

3.3 The LSTM: a protected memory with gates

The LSTM (Long Short-Term Memory; Hochreiter & Schmidhuber, 1997) fixes vanishing gradients by adding a cell state \(c_t\) — a memory “conveyor belt” that runs straight through with only gentle, additive edits — controlled by three gates. A gate is a vector of valves in \((0,1)\), produced by a sigmoid \(\sigma\) (Probability §3): \(0\) = “block”, \(1\) = “let through”, and \(\odot\) is element-wise multiply.

\[ \begin{aligned} f_t &= \sigma(W_f[h_{t-1},x_t]+b_f) &&\textbf{forget gate: what to erase from memory}\\ i_t &= \sigma(W_i[h_{t-1},x_t]+b_i) &&\textbf{input gate: how much new info to write}\\ \tilde c_t &= \tanh(W_c[h_{t-1},x_t]+b_c) &&\textbf{candidate: the new info itself}\\ c_t &= f_t\odot c_{t-1} + i_t\odot \tilde c_t &&\textbf{update the cell: keep + write}\\ o_t &= \sigma(W_o[h_{t-1},x_t]+b_o) &&\textbf{output gate: what to read out}\\ h_t &= o_t\odot \tanh(c_t) &&\textbf{the readable hidden state} \end{aligned} \]

Decoded: the cell line \(c_t = f_t\odot c_{t-1} + i_t\odot\tilde c_t\) is the whole point. The forget gate \(f_t\) decides how much of the old memory to keep; the input gate \(i_t\) decides how much of the new candidate \(\tilde c_t\) to add. Because the old memory passes through by a near-\(1\) multiply and an addition (not a long chain of matrix multiplies), the gradient can flow back many steps without vanishing — the cell state is a gradient highway.

Worked by hand — one step, and why the memory survives. Suppose the cell already carries a fact from far back, \(c_{t-1}=0.80\) (say “the subject is singular”), and this step’s input is irrelevant to it. A trained LSTM then sets gates (each a \(\sigma(\cdot)\in(0,1)\) of \([h_{t-1},x_t]\); values shown) that protect the fact — \(f_t=0.95\) (keep almost all), \(i_t=0.10\) (write almost nothing), candidate \(\tilde c_t=0.40\), output \(o_t=0.60\) — so the cell update and readout, scalar and by hand, are:

\[ \begin{aligned} c_t &= f_t\,c_{t-1}+i_t\,\tilde c_t = 0.95(0.80)+0.10(0.40)=0.760+0.040=\mathbf{0.800},\\ h_t &= o_t\,\tanh(c_t)=0.60\,\tanh(0.800)=0.60(0.664)=\mathbf{0.398}. \end{aligned} \]

The memory came in at \(0.80\) and leaves at \(0.80\) — the new input barely moved it. Contrast the RNN trace in §3.1, where every step forced the whole state through a fresh \(\tanh\) and the old value decayed. And the “gradient highway” is now a number: the cell’s sensitivity to its own past is \(\partial c_t/\partial c_{t-1}=f_t=0.95\), so a signal from \(20\) steps later arrives scaled by \(0.95^{20}\approx0.36\) — versus the plain RNN’s \(0.5^{20}\approx10^{-6}\) (§3.2). A near-\(1\) forget gate, multiplied (not matrix-chained), is exactly what keeps the gradient alive.
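The one-step update above, written as code with the gate values passed in directly. This is a sketch: in a real LSTM the gates are computed from \([h_{t-1},x_t]\) by learned weights; here they are the fixed numbers from the example, and the function name is illustrative.

```python
import math

def lstm_cell_update(c_prev, f, i, c_tilde, o):
    """Scalar LSTM cell update given already-computed gate values."""
    c = f * c_prev + i * c_tilde     # keep (forget gate) + write (input gate)
    h = o * math.tanh(c)             # read out through the output gate
    return c, h

c_t, h_t = lstm_cell_update(c_prev=0.80, f=0.95, i=0.10, c_tilde=0.40, o=0.60)

# Gradient carried along the cell state over 20 steps with f = 0.95 at each step.
carry_20 = 0.95 ** 20
```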

Figure 6.3: **The LSTM cell.** The **cell state** \(c\) runs straight across the top, edited only by a **forget** multiply (\(\times f_t\)) and a **write** add (\(+ i_t\odot\tilde c_t\)) — gentle, additive edits, so gradients flow back un-decayed (the *gradient highway*). The **read** gate \(o_t\) taps it for the output \(h_t\). Three sigmoid valves, one protected memory.

Why “Long Short-Term Memory.” An ordinary RNN has only a short-term memory (it fades within a few steps). The LSTM keeps that short-term memory but makes it last a long time — a long short-term memory. The GRU (Gated Recurrent Unit; Cho et al., 2014) is a lighter cousin that merges the gates into two; often as good, cheaper to run.

3.4 RecSys bridge: GRU4Rec

The GRU, written out. The recommender below runs on a GRU, so here are its two gates (vs the LSTM’s three) — a reset gate \(r_t\) and an update gate \(z_t\), each a sigmoid of \([h_{t-1},x_t]\):

\[ r_t=\sigma(W_r[h_{t-1},x_t]),\quad z_t=\sigma(W_z[h_{t-1},x_t]),\quad h_t=(1-z_t)\odot h_{t-1}+z_t\odot\tilde h_t, \]

with candidate \(\tilde h_t=\tanh\!\big(W[\,r_t\odot h_{t-1},\,x_t\,]\big)\). The update gate \(z_t\) interpolates between keeping the old state and writing the new — one gate doing the LSTM’s forget+input job; the reset gate decides how much past to use when forming the candidate. Fewer gates, no separate cell state, often as good — and cheaper.
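A minimal NumPy sketch of one GRU step, following the equations above (no bias terms, as written there). The sizes, the random weights, and the name `gru_step` are assumptions for illustration, not a trained model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h_prev, x, W_r, W_z, W_h):
    """One GRU update. Each weight matrix has shape (d_h, d_h + d_x)."""
    hx = np.concatenate([h_prev, x])
    r = sigmoid(W_r @ hx)                                     # reset gate
    z = sigmoid(W_z @ hx)                                     # update gate
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x]))  # candidate state
    return (1.0 - z) * h_prev + z * h_tilde                   # interpolate old and new

rng = np.random.default_rng(0)
d_h, d_x = 3, 2                                               # toy sizes (assumptions)
W_r, W_z, W_h = (rng.normal(scale=0.5, size=(d_h, d_h + d_x)) for _ in range(3))

h = np.zeros(d_h)
for x in rng.normal(size=(4, d_x)):    # a 4-step toy input sequence
    h = gru_step(h, x, W_r, W_z, W_h)
```

Because each step is a convex combination of the old state and a \(\tanh\) output, every coordinate of \(h\) stays inside \((-1,1)\).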

Swap “word” for “item” and an RNN becomes a session-based recommender: feed the items a user has clicked in order, let the hidden state summarize the session, and predict the next item. GRU4Rec (Hidasi et al., 2016) did exactly this with a GRU and was the first deep sequential recommender — the direct ancestor of the attention-based ones in §4. The hidden state \(h_t\) is the user’s “current intent” vector; score candidates by dot product with it, exactly as everywhere else in the book.

4. Attention and the Transformer

RNNs read sequences, but they have two weaknesses: they are inherently sequential (step \(t\) must wait for step \(t{-}1\), so they are slow to train), and even an LSTM strains over very long ranges. The fix that took over the field is attention.

4.1 The idea: let every position look at every other

Instead of forcing all the past through one hidden state, attention lets each position look directly at every other position and pull out exactly what it needs. The central operation is self-attention:

\[ \mathrm{Attention}(Q,K,V) \;=\; \mathrm{softmax}\!\left(\frac{Q\,K^\top}{\sqrt{d_k}}\right) V . \]

Each input token is turned (by three learned weight matrices) into three vectors:

\(Q\) — queries: “what am I looking for?”
\(K\) — keys: “what do I offer / advertise?”
\(V\) — values: “what do I actually pass on if chosen?”

Why query / key / value. It is a soft database lookup. In a database you match a query against stored keys and return the matching values. Self-attention does this softly: the query matches all keys to a degree, and you get a weighted blend of all values. The names are literal.

Why three different matrices? The worked example below sets \(Q=K=V=X\) to isolate the mechanism, but in practice each is a learned projection \(W_Q,W_K,W_V\) of the token. That lets a token advertise one thing (its key), search for another (its query), and pass on a third (its value) — e.g. a verb can look for its subject (query \(\ne\) key) yet contribute its tense (value). Identical \(Q,K,V\) could only ever match on raw similarity; the three separate projections are what make attention learnable.

Decoded, component by component:

\(Q\,K^\top\) — every query dotted with every key: an “\(n\times n\)” table of similarities (dot product = alignment, Linear Algebra note), where \(n\) is the sequence length (number of tokens). The matrices have shapes \(Q, K \in \mathbb{R}^{n \times d_k}\) and \(V \in \mathbb{R}^{n \times d_v}\), so \(QK^\top\) is indeed \(n\times n\). Entry \((i,j)\) = how much token \(i\) should attend to token \(j\).
\(\div\sqrt{d_k}\) — scale down by the square root of the key dimension. Dot products grow with dimension; without this, large values push softmax into a near-one-hot spike with tiny gradients. (“Scaled dot-product attention.”)
\(\mathrm{softmax}(\cdot)\) — turn each row of similarities into weights that sum to 1 (Probability §7).
\(\cdots\,V\) — use those weights to take a weighted average of the value vectors. Each token’s output is a blend of everything it chose to attend to.

The cost. Because every token attends to every other, self-attention builds an \(n\times n\) score table — it is \(O(n^2)\) in the sequence length \(n\). Feel it: a \(10\)-token sentence costs \(10^2=100\) scores; a \(1{,}000\)-token document costs \(1{,}000^2=\mathbf{10^6}\) — and doubling the length to \(2{,}000\) quadruples it to \(4\times10^6\). That quadratic cost is the price for the parallelism and the direct long-range links, and it is exactly what the post-2017 “efficient-Transformer” and long-context research works to reduce.

4.2 Self-attention, worked by hand

Take the romance triple from LLM × RecSys — a user watched \(t_1=\) Titanic, \(t_2=\) The Notebook, \(t_3=\) La La Land — with tiny 2-D taste embeddings (axis 1 = “romance”, axis 2 = “musical/drama”):

\[ X=\begin{bmatrix} 1&0\\ 1&1\\ 0&1\end{bmatrix} \quad(\text{Titanic}=[1,0],\ \text{Notebook}=[1,1],\ \text{LaLaLand}=[0,1]). \]

To isolate the mechanism, use identity projections, so \(Q=K=V=X\) (in a real model these are learned). With \(d_k=2\), \(\sqrt{d_k}=1.414\). The scaled scores \(S=XX^\top/\sqrt2\) and the softmax weights \(A\) (each row sums to 1) come out to:

attends to →	\(t_1\)	\(t_2\)	\(t_3\)		weight on \(t_1\)	\(t_2\)	\(t_3\)
\(t_1\) query	0.707	0.707	0.000	→	0.401	0.401	0.198
\(t_2\) query	0.707	1.414	0.707	→	0.248	0.503	0.248
\(t_3\) query	0.000	0.707	0.707	→	0.198	0.401	0.401

Then each output row is its weights \(\times\) the value vectors \(V=X\):

\[ \text{out} = A\,X =\begin{bmatrix} 0.802 & 0.599\\[1pt] \mathbf{0.752} & \mathbf{0.752}\\[1pt] 0.599 & 0.802\end{bmatrix}. \]

Read the middle row (The Notebook, \(t_2\)): its scaled self-score is highest (\(1.414\), because \([1,1]\cdot[1,1]=2\)), so after softmax it attends mostly to itself (\(0.503\)) and equally to its two neighbours (\(0.248\) each). Its output \([0.752,0.752]\) is therefore a blend of all three — fitting, since The Notebook sits “between” the pure-romance Titanic and the pure-musical La La Land. Each token’s new vector is a context-aware mix of the whole sequence, computed in one parallel shot — no stepping.
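The whole walk-through fits in a short function. This sketch implements scaled dot-product attention for a single head and reproduces the table and output matrix above with \(Q=K=V=X\); the helper names are illustrative.

```python
import numpy as np

def softmax_rows(S):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    S = S - S.max(axis=1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention; returns the outputs and the weight matrix A."""
    d_k = K.shape[1]
    A = softmax_rows(Q @ K.T / np.sqrt(d_k))   # (n, n) attention weights
    return A @ V, A

X = np.array([[1.0, 0.0],    # Titanic
              [1.0, 1.0],    # The Notebook
              [0.0, 1.0]])   # La La Land

out, A = attention(X, X, X)  # identity projections: Q = K = V = X
```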

Figure 6.4: **The attention-weight matrix \(A\)** (each *row* sums to 1; darker = more weight). No token attends to another more than to **itself** (the diagonal): the middle token \(t_2\) is the most self-focused (\(0.50\)), while \(t_1\) and \(t_3\) weight themselves and \(t_2\) equally (\(0.40\) each), with the rest spread over the remaining tokens. Reading a row tells you where that token *looks*.

Figure 6.5: Self-attention for the middle token: it pulls a weighted blend of all three value vectors (weights \(0.248,0.503,0.248\)). Attention = “look at everything, keep what matches.”

What the learned projections add (why \(Q,K,V\) are separate). We set \(Q=K=V=X\) to expose the mechanism; a trained Transformer instead learns three different linear views — \(Q=XW_Q\), \(K=XW_K\), \(V=XW_V\) — so it learns what to match on (via \(W_Q,W_K\)) separately from what to pass along (via \(W_V\)). One token makes it concrete: take \(t=[1,2]\) with \[ W_Q=\begin{bmatrix}1&0\\1&0\end{bmatrix},\quad W_K=\begin{bmatrix}0&1\\0&1\end{bmatrix},\quad W_V=I \;\Rightarrow\; q=tW_Q=[3,0],\;\; k=tW_K=[0,3],\;\; v=tW_V=[1,2]. \] Three different vectors from one token — and now \(q\cdot k=0\): under these projections the token’s own query and key are orthogonal, so it would attend away from itself, the opposite of the identity case. That is the freedom training buys — attention on a learned relationship, not just raw similarity. The §4.2 walk-through is the special case \(W_Q=W_K=W_V=I\); everything else is the same softmax-blend, on projected vectors.
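The single-token projection example can be verified directly. The matrices are the hand-picked ones from the paragraph above, not learned weights.

```python
import numpy as np

t = np.array([1.0, 2.0])
W_Q = np.array([[1.0, 0.0],
                [1.0, 0.0]])
W_K = np.array([[0.0, 1.0],
                [0.0, 1.0]])
W_V = np.eye(2)

q, k, v = t @ W_Q, t @ W_K, t @ W_V

self_score_projected = q @ k   # the token's own query against its own key
self_score_identity = t @ t    # the same score under identity projections
```

The self-score drops from 5 under identity projections to 0 under these ones, which is the sense in which the projections decide what a token matches on.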

The jump that matters: static → contextual embeddings. word2vec (§2) gives each word one fixed vector, whatever the sentence. Self-attention gives each token a different vector per occurrence — a blend of its neighbours — so “bank” by a river and “bank” with money end up with different representations. This contextual embedding (vs word2vec’s static one) is the leap that carried NLP from word2vec to BERT/GPT, and it is the real reason the Transformer took over.

Figure 6.6: **Static vs. contextual embeddings.** **Left (static, word2vec):** “bank” is *one* fixed point in the space: the same vector whether the sentence is about a riverbank or a financial bank, so the two senses are conflated at a single location. **Right (contextual, Transformer):** the *same* token “bank” becomes *two different vectors* depending on context: “bank by a river” is pulled toward the geography cluster (green), while “bank with money” lands near the finance cluster (orange). The Transformer’s self-attention does this: each token’s output is a context-aware blend of its neighbours, so meaning shifts with the sentence.

4.3 Multi-head attention, positional encoding, the block

Multi-head attention. One attention is one “point of view”. Run \(h\) of them in parallel with different learned \(Q/K/V\) matrices, then concatenate. Why: one head can track romance, another recency, another genre — different relations at once. (“Head” = one parallel attention.)

Figure 6.7: **Multi-head attention.** The same input feeds \(h\) attentions in parallel, each with its *own* learned \(Q/K/V\): one head can follow *romance*, another *recency*, another *genre*. Their outputs are concatenated and linearly mixed by \(W_O\). §4.2 worked **one** head by hand; multi-head just runs several at once and joins them.
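A sketch of multi-head attention in NumPy: run one scaled dot-product attention per head, concatenate, then mix with \(W_O\). The shapes, the random weights, and the function names are assumptions for illustration; real implementations batch the heads into one tensor operation instead of looping.

```python
import numpy as np

def softmax_rows(S):
    S = S - S.max(axis=1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    """X: (n, d_model). W_Q, W_K, W_V: (h, d_model, d_head). W_O: (h * d_head, d_model)."""
    heads = []
    for Wq, Wk, Wv in zip(W_Q, W_K, W_V):          # one pass per head
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        A = softmax_rows(Q @ K.T / np.sqrt(K.shape[1]))
        heads.append(A @ V)                        # (n, d_head)
    return np.concatenate(heads, axis=1) @ W_O     # join the heads, then mix

rng = np.random.default_rng(0)
n, d_model, h, d_head = 3, 4, 2, 2                 # toy sizes (assumptions)
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V = (rng.normal(size=(h, d_model, d_head)) for _ in range(3))
W_O = rng.normal(size=(h * d_head, d_model))

out = multi_head_attention(X, W_Q, W_K, W_V, W_O)
```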

Positional encoding. Look again at \(\mathrm{softmax}(QK^\top/\sqrt{d_k})V\): shuffle the tokens and every output is just shuffled too. Attention is a set operation, blind to order. But order matters (§3). So we add a position signal to each token’s embedding. The original Transformer uses fixed sinusoids:

\[ PE(\text{pos},2i)=\sin\!\Big(\tfrac{\text{pos}}{10000^{2i/d}}\Big),\qquad PE(\text{pos},2i+1)=\cos\!\Big(\tfrac{\text{pos}}{10000^{2i/d}}\Big). \]

Each dimension is a sine/cosine wave of a different wavelength, so each position \(\text{pos}\) gets a unique, smooth fingerprint the model can read. (For \(\text{pos}=1,\ d=4\): \(PE=[\,0.841,\ 0.540,\ 0.010,\ 1.000\,]\).) Modern models often use learned positional embeddings instead — same purpose.
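The sinusoid formula translates directly into a few array operations. This sketch assumes an even \(d\) and reproduces the \(\text{pos}=1,\ d=4\) example; the function name is illustrative.

```python
import numpy as np

def sinusoidal_pe(n_positions, d):
    """Return an (n_positions, d) table of sinusoidal positional encodings (d even)."""
    assert d % 2 == 0, "this sketch assumes an even embedding width"
    pos = np.arange(n_positions)[:, None]       # column of positions, shape (n, 1)
    i = np.arange(d // 2)[None, :]              # row of pair indices, shape (1, d/2)
    angle = pos / (10000 ** (2 * i / d))        # one wavelength per dimension pair
    pe = np.zeros((n_positions, d))
    pe[:, 0::2] = np.sin(angle)                 # even dimensions: sine
    pe[:, 1::2] = np.cos(angle)                 # odd dimensions: cosine
    return pe

pe = sinusoidal_pe(n_positions=3, d=4)
```

Row `pe[1]` is the fingerprint for position 1; the table is added to the token embeddings before the first block.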

The Transformer block. One layer stacks self-attention → add & norm → feed-forward → add & norm, repeated \(N\) times to form the Transformer encoder. (That order — normalize after each sub-layer — is the original Post-LN design; modern large models almost always use Pre-LN, normalizing before each sub-layer as \(x+f(\mathrm{LN}(x))\), which trains more stably at depth.) Two pieces beyond attention deserve a decode:

The feed-forward network (FFN). A small per-token MLP — \(\mathrm{FFN}(x)=\max(0,\,xW_1+b_1)\,W_2+b_2\) (a ReLU layer up to a wider dimension, then back down) — applied to each position independently. Attention mixes tokens; the FFN transforms each one nonlinearly. It holds the majority of a Transformer’s parameters and is where much of the model’s stored knowledge lives.
Residual connection (“add the input back”). Each sub-layer computes \(x+f(x)\), not \(f(x)\). Why it matters is the LSTM’s trick again (§3.3): the derivative of \(x+f(x)\) carries a \(+1\), so a gradient always has a clear, undiminished path straight back — a gradient highway that lets very deep stacks train. LayerNorm then rescales each token’s vector to keep training stable (the normalization of the Neural-Networks note).
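The pieces of the block can be sketched together as follows. Several simplifications are assumed: a single attention head with identity projections, a LayerNorm without its learned gain and bias, and random FFN weights. The function names are illustrative, and both the Post-LN and the Pre-LN orderings described above are shown.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token vector (row) to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def self_attention(x):
    """Single-head attention with identity projections (Q = K = V = x)."""
    s = x @ x.T / np.sqrt(x.shape[1])
    s = s - s.max(axis=1, keepdims=True)
    a = np.exp(s)
    a = a / a.sum(axis=1, keepdims=True)
    return a @ x

def ffn(x, W1, b1, W2, b2):
    """Per-token MLP: ReLU layer up to d_ff, then back down to d."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def post_ln_block(x, W1, b1, W2, b2):
    x = layer_norm(x + self_attention(x))         # residual, then normalize
    x = layer_norm(x + ffn(x, W1, b1, W2, b2))
    return x

def pre_ln_block(x, W1, b1, W2, b2):
    x = x + self_attention(layer_norm(x))         # normalize, then residual
    x = x + ffn(layer_norm(x), W1, b1, W2, b2)
    return x

rng = np.random.default_rng(0)
n, d, d_ff = 3, 4, 16                             # toy sizes (assumptions)
x = rng.normal(size=(n, d))
W1, b1 = rng.normal(scale=0.1, size=(d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(scale=0.1, size=(d_ff, d)), np.zeros(d)

y_post = post_ln_block(x, W1, b1, W2, b2)
y_pre = pre_ln_block(x, W1, b1, W2, b2)
```

Both variants keep the shape \((n,d)\), which is what allows the block to be stacked \(N\) times.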

Why “Transformer.” Vaswani et al. titled the 2017 paper “Attention Is All You Need” — they removed recurrence and convolution (the sliding filters used in vision models) entirely; the name is usually read as attention transforming each token’s representation using the others (the paper itself never explains the choice). No recurrence means the whole sequence is processed in parallel, which is what let these models scale to the sizes we now call LLMs.

Figure 6.8: **One Transformer block** (stacked \(N\) times). Self-attention *mixes* tokens; the **FFN** *transforms* each one; each sub-layer is wrapped in a **residual** “add the input back” (red) and a **LayerNorm**. The residual is the gradient highway that lets deep stacks train. **BERT** stacks these with a full mask, **GPT** with a causal mask (§5).

4.4 RecSys bridge: SASRec and BERT4Rec

Replace “word” with “item” once more and the Transformer becomes the state-of-the-art sequential recommender:

SASRec (Self-Attentive Sequential Recommendation; Kang & McAuley, 2018) runs a causal Transformer over the watch history (each position may attend only to earlier items) and predicts the next item — a self-attention upgrade of GRU4Rec.
BERT4Rec (Sun et al., 2019) uses a bidirectional (masked) Transformer: it masks items in the history and predicts them from both sides, often beating the left-to-right SASRec.

How a next-item prediction is actually made. The Transformer turns the history into a context vector \(h\); the model then scores every candidate item by \(h\cdot e_j\) (a dot product with the item’s embedding) and applies a softmax over the catalog to get next-item probabilities — trained by cross-entropy. That is exactly §2’s skip-gram head, now at the top of a sequence model, and at scale it uses the same negative sampling (§2.2), since the catalog is too large to normalize over fully. The output side is word2vec’s machinery again.
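The scoring head described above, as a sketch over a hypothetical four-item catalog. The context vector `h` and the item embeddings `E` are made-up numbers standing in for the outputs of a trained model.

```python
import numpy as np

h = np.array([1.0, 0.5])            # context ("intent") vector from the sequence model
E = np.array([[1.0, 0.0],           # item 0
              [1.0, 1.0],           # item 1
              [0.0, 1.0],           # item 2
              [-1.0, 0.0]])         # item 3

scores = E @ h                                  # h . e_j for every candidate j
exp_scores = np.exp(scores - scores.max())      # stable softmax over the catalog
probs = exp_scores / exp_scores.sum()
next_item = int(np.argmax(probs))               # highest-probability next item
```

For a large catalog the full softmax here is the expensive part, which is where sampled negatives come in during training.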

The “causal vs bidirectional” choice between them is exactly the GPT-vs-BERT choice we turn to next.

Serving aside — L2-normalizing before scoring. The dot product \(h \cdot e_j\) that scores candidates is sensitive to vector norm: an item with a large \(\|e_j\|\) (typically a popular item whose embedding has been updated many more times) can accumulate a high score purely from its length, not from genuine compatibility. The standard fix is to L2-normalize both the intent vector \(h\) and every item embedding \(e_j\) before computing the dot product; the normalized dot product equals the cosine similarity (Linear Algebra note), which compares direction only and removes the length bias. This is the “length bias” Exercise 10 asks you to observe: the item whose raw dot-product score is inflated by norm is exactly the one that drops in the cosine re-ranking. Normalization at serving time is cheap — scale the catalog embedding table once offline — and it is the default in production ANN (approximate nearest-neighbour) retrieval.
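A toy illustration of the length bias, with made-up vectors: item 0 has a large norm but is only partly aligned with \(h\), item 1 is short but perfectly aligned. The helper name `l2_normalize` is an assumption.

```python
import numpy as np

def l2_normalize(M, eps=1e-12):
    """Scale each vector (last axis) to unit L2 norm; eps guards against zero vectors."""
    return M / np.maximum(np.linalg.norm(M, axis=-1, keepdims=True), eps)

h = np.array([1.0, 1.0])
E = np.array([[5.0, 0.0],    # item 0: long vector, 45 degrees off h
              [1.0, 1.0],    # item 1: short vector, same direction as h
              [-1.0, 1.0]])  # item 2: orthogonal to h

dot_scores = E @ h
cos_scores = l2_normalize(E) @ l2_normalize(h)

top_by_dot = int(np.argmax(dot_scores))   # the long item wins on raw dot product
top_by_cos = int(np.argmax(cos_scores))   # the aligned item wins after normalizing
```

The catalog side, `l2_normalize(E)`, can be computed once offline; only `h` needs normalizing per request.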

5. Pretraining and LLMs: BERT, GPT, and the frozen encoder

A Transformer is just an architecture. What makes a Large Language Model is pretraining at scale: train a big Transformer on a huge text corpus with a self-supervised objective (labels come free from the text itself — the same self-supervision idea as SSL & Contrastive Learning), then reuse the result. (Real models first split text into subword tokens — BPE / WordPiece — so a “token” is not quite a “word”: a rare word becomes a few sub-pieces. We say “word” for simplicity until §5.3 builds a tokenizer by hand.) Two families differ in the objective and the attention pattern.

5.1 BERT (encoder) vs GPT (decoder)

BERT (Bidirectional Encoder Representations from Transformers; Devlin et al., 2019). Objective: masked language modeling, which hides about 15% of the tokens and predicts them using context from both directions. Because every token sees the whole sentence, BERT is an encoder: great at understanding text and turning it into rich vectors.
GPT (Generative Pre-trained Transformer; Radford et al., 2018). Objective: next-token prediction, which predicts the next word from the left context only (a causal mask). Because it only looks left, GPT can generate text one token at a time. This is the family that “LLM” usually refers to.

The single difference that splits them is the attention mask — which tokens each token may look at:

BERT (bidirectional) — every token sees every token:

query ↓ / key →	\(t_1\)	\(t_2\)	\(t_3\)	\(t_4\)
\(t_1\)	✓	✓	✓	✓
\(t_2\)	✓	✓	✓	✓
\(t_3\)	✓	✓	✓	✓
\(t_4\)	✓	✓	✓	✓

GPT (causal) — a token sees only itself and earlier ones:

query ↓ / key →	\(t_1\)	\(t_2\)	\(t_3\)	\(t_4\)
\(t_1\)	✓
\(t_2\)	✓	✓
\(t_3\)	✓	✓	✓
\(t_4\)	✓	✓	✓	✓

Figure 6.9: **Bidirectional (BERT) vs causal (GPT) attention mask** for a 4-token sequence. **Left:** BERT uses a *full* \(n\times n\) grid — every token may attend to every other, so the model reads context from both directions (useful for understanding/embedding). **Right:** GPT uses only the *lower triangle* — token \(t_k\) may attend to \(t_1, \dots, t_k\) but not to any future token \(t_j\) with \(j>k\), enforcing the left-to-right causal order needed for generation. SASRec uses the causal mask; BERT4Rec uses the full mask.

The lower-triangular GPT mask is what makes generation possible: when predicting token \(t\), the model must not peek at the future. The same attention masks reappear in RecSys as BERT4Rec (bidirectional) vs SASRec (causal) — though their training objectives differ (see §4.4).
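The two masks can be sketched in a few lines. The function names are illustrative, and the masking is done by zeroing disallowed positions before normalizing; production code usually gets the same effect by adding \(-\infty\) to the masked scores before the softmax.

```python
import math

def attention_mask(n, causal):
    """mask[q][k] is True when query position q may attend to key position k."""
    return [[(k <= q) if causal else True for k in range(n)] for q in range(n)]

def masked_softmax(scores, allowed):
    """Softmax over the allowed positions only; masked positions get weight 0."""
    exps = [math.exp(s) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

bert_mask = attention_mask(4, causal=False)  # full 4x4 grid
gpt_mask = attention_mask(4, causal=True)    # lower triangle

# Equal raw scores make the effect of the mask easy to see for query t_2.
scores = [0.0, 0.0, 0.0, 0.0]
weights_t2_bert = masked_softmax(scores, bert_mask[1])
weights_t2_gpt = masked_softmax(scores, gpt_mask[1])
print(weights_t2_bert)  # [0.25, 0.25, 0.25, 0.25]
print(weights_t2_gpt)   # [0.5, 0.5, 0.0, 0.0]
```

Under the causal mask, \(t_2\) spreads its attention over \(t_1\) and itself only, which is the second row of the GPT table above.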

5.2 From pretraining to use — and the one bridge RecSys needs

A pretrained LLM is used in three ways, in rising order of cheapness:

Fine-tuning — keep training it on your labelled task (e.g. TALLRec, LLM × RecSys §2.1).
In-context / few-shot prompting — freeze it; just show examples in the prompt (an emergent ability of scale).
As a frozen encoder — run text through it once and keep the output vector.

That third use is the load-bearing bridge for recommendation:

The frozen text encoder = the semantic vector RecSys consumes. Take a movie’s reviews, or an LLM-written profile of a user, push it through a frozen encoder, and out comes one fixed semantic embedding of that text — meaning = location, one last time. This vector is precisely the \(s_i\) that RLMRec (LLM × RecSys §3) aligns with the collaborative embedding \(e_i\), and the signal KAR feeds a backbone. The expensive LLM runs offline; the recommender keeps serving a cheap dot product.

So the “LLM” in the LLM-as-enhancer line (LLM × RecSys §2.3) is, mechanically: §2’s meaning = location, built by §4’s Transformer, pretrained per §5, read out as a frozen vector. No magic — just this chapter’s lineage, scaled up.

→ Which encoder, in practice? “Push it through a frozen encoder” raises the practitioner’s real next question — which one? For the current (mid-2026) open-source choices — embedding models (e5, BGE-M3, Qwen3-Embedding), how to read the MTEB leaderboard, and the Matryoshka/quantization levers — see the Implementation Choices appendix §2.

In practice. For turning items or metadata into embeddings, reach for a current open text encoder (the Implementation Choices appendix lists mid-2026 options) rather than training one from scratch — a pretrained encoder brings hundreds of billions of tokens of world knowledge; fine-tuning a task-specific head on top costs a fraction of training from zero. And always use the model’s own tokenizer (next): switching tokenizers silently corrupts the input.

5.3 Subword tokenization: the practitioner’s literal first step

Before any of this runs, raw text must become a sequence of token IDs — and a real model never tokenizes by whole words. A word-level vocabulary cannot cope with the long tail (every typo, name, or new word becomes an out-of-vocabulary <UNK> with no vector), while a character-level one makes sequences punishingly long. The modern compromise is subword tokenization: keep frequent words whole, but break rare ones into reusable pieces.

The workhorse is byte-pair encoding (BPE). Why the name: it began as a 1994 data-compression trick that repeatedly replaces the most frequent adjacent pair of bytes with a new symbol; the tokenizer reuses the exact same idea on text. The recipe: start from individual characters, then greedily merge the most frequent adjacent pair into a new token, over and over, until you hit a target vocabulary size.

Why a “subword” beats a whole word. Whole-word vocabularies drown in the long tail; characters make sequences too long. Subwords sit in between — a fixed-size vocabulary that can still spell any word out of known pieces, so there is no <UNK>: an unseen word just costs a few more tokens.

Worked by hand (countable). Take a toy corpus of two word-types — low seen 5 times and lowest seen 2 times — and write each as a character sequence with an end-of-word marker _ (so the tokenizer can tell low the word from low inside lowest). The starting vocabulary is the 7 distinct characters l o w e s t _:

 l o w _              (x5)
 l o w e s t _        (x2)

Now merge the most frequent adjacent pair, twice:

step	most-frequent adjacent pair	count	corpus after the merge
1	`o`,`w` \(\to\) `ow`	7	`l ow _` (×5), `l ow e s t _` (×2)
2	`l`,`ow` \(\to\) `low`	7	`low _` (×5), `low e s t _` (×2)

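(Count merge 1 by hand: o w appears once in low_ \(\times5\) and once in lowest_ \(\times2\), so \(5+2=7\). The pair l o also occurs 7 times; the tie is broken in favour of o w here, and either choice arrives at the token low after two merges.) Two merge rules were learned, in order: o+w, then l+ow. The vocabulary grew from 7 characters to 9 tokens (real BPE vocabularies typically run to tens of thousands of tokens, e.g. roughly 30k–50k).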
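The merge table can be reproduced with a short training loop. This is a minimal sketch, not a production tokenizer: the function names are illustrative, and because l o and o w both occur 7 times, ties are broken by taking the larger pair in Python's tuple ordering, an assumption chosen only so that the output matches the table.

```python
from collections import Counter

def count_pairs(corpus):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    counts = Counter()
    for symbols, freq in corpus:
        for left, right in zip(symbols, symbols[1:]):
            counts[(left, right)] += freq
    return counts

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

def learn_bpe(corpus, num_merges):
    """Greedily learn `num_merges` merge rules from (symbols, frequency) pairs."""
    rules = []
    for _ in range(num_merges):
        counts = count_pairs(corpus)
        # Most frequent pair; ties go to the larger pair in tuple order (assumption).
        best = max(counts, key=lambda p: (counts[p], p))
        rules.append(best)
        corpus = [(merge_pair(symbols, best), freq) for symbols, freq in corpus]
    return rules, corpus

toy_corpus = [(list("low_"), 5), (list("lowest_"), 2)]
rules, merged_corpus = learn_bpe(toy_corpus, num_merges=2)
print(rules)          # [('o', 'w'), ('l', 'ow')]
print(merged_corpus)  # [(['low', '_'], 5), (['low', 'e', 's', 't', '_'], 2)]
```

A real trainer keeps looping until the vocabulary reaches its target size rather than stopping after two merges.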

The payoff — an unseen word. A word never in the training corpus, say lower, is tokenized by replaying the learned rules in order on l o w e r _:

 start:        l o w e r _
 apply o+w:    l ow e r _
 apply l+ow:   low e r _      ->  tokens [ low , e , r , _ ]

lower becomes 4 tokens, and the stem low is the same piece it shares with low and lowest — so the model reuses everything it already learned about that stem. No <UNK> is needed, provided the base alphabet covers every character: the toy alphabet above would have to include r, and byte-level BPE guarantees coverage by starting from all 256 byte values. This is why “Titanic” might be one token but an unknown surname ten: common strings survive as whole pieces; rare ones fall back to smaller, shared ones. (WordPiece, used by BERT, and SentencePiece/Unigram, used by many multilingual models, differ mainly in how they choose which pieces to merge or keep — the same divide-into-reusable-subwords idea.)
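Replaying the learned rules on a new word is mechanical enough to write down. A minimal sketch, assuming the two rules from the worked example and the same `_` end-of-word marker; `apply_bpe` is an illustrative name, not a library function.

```python
def apply_bpe(word, rules, end_marker="_"):
    """Tokenize one word by replaying learned merge rules in order."""
    symbols = list(word) + [end_marker]
    for left, right in rules:
        merged, i = [], 0
        while i < len(symbols):
            # Merge the pair wherever it occurs; copy everything else unchanged.
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# The two rules learned from the toy corpus, in the order they were learned.
learned_rules = [("o", "w"), ("l", "ow")]

lower_tokens = apply_bpe("lower", learned_rules)
lowest_tokens = apply_bpe("lowest", learned_rules)
print(lower_tokens)   # ['low', 'e', 'r', '_']
print(lowest_tokens)  # ['low', 'e', 's', 't', '_']
```

Both words start with the shared piece `low`, which is the reuse the paragraph above describes.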

5.4 How a decoder actually generates text

A GPT-style decoder does not emit a sentence in one shot. Generation is autoregressive — “regressing on itself”: predict one token, append it to the input, and run the model again on the now-longer sequence, repeating until a stop token:

  prompt: "The movie was"
  step 1: run model -> next token "great"   -> "The movie was great"
  step 2: run model -> next token "and"     -> "The movie was great and"
  step 3: run model -> next token "..."      ->  ... (until <end>)

Figure 6.10: **The autoregressive decode loop.** The model emits *one* token, that token is **appended** to the input (red), and the whole thing is fed back in for the next token — repeat until a stop token. SASRec’s next-item prediction (§4.4) is the recommender form of exactly this loop.
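The loop itself is only a few lines. In the sketch below, `toy_next_token` is a hypothetical stand-in for a trained decoder: it is a lookup on the last token, whereas a real model conditions on the whole prefix and samples or picks the most probable next token.

```python
def toy_next_token(tokens):
    """Hypothetical stand-in for a decoder: choose the next token from a table."""
    table = {"was": "great", "great": "and", "and": "moving", "moving": "<end>"}
    return table.get(tokens[-1], "<end>")

def generate(prompt_tokens, next_token_fn, max_new_tokens=10, stop_token="<end>"):
    """Autoregressive decoding: predict one token, append it, and run again."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = next_token_fn(tokens)  # one model call per generated token
        if next_token == stop_token:
            break
        tokens.append(next_token)           # the output becomes part of the input
    return tokens

generated = generate(["The", "movie", "was"], toy_next_token)
print(" ".join(generated))  # The movie was great and moving
```

The `max_new_tokens` cap is a common safeguard so that decoding ends even if the stop token never appears.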

The KV-cache — why this is not absurdly slow. Naively, every step re-runs the model on the whole prefix, so to produce \(L\) tokens (we write \(L\) for the generated length here, to distinguish it from the input sequence length \(n\) of §4) you re-encode \(1+2+\cdots+L=\tfrac{L(L+1)}{2}=O(L^2)\) token-passes. But under the causal mask an earlier token never sees later ones, so its key and value vectors (§4.1) never change once computed; implementations store them — the KV-cache — and at each step push only the one new token through. The new token still attends over every cached key, so the attention work per step grows with the prefix; what the cache removes is re-encoding the prefix itself. Generating \(L\) tokens then costs about \(L\) new-token passes, not on the order of \(L^2\):

Caching each token’s **key** and **value** vectors reduces autoregressive generation from \(O(L^2)\) to \(O(L)\) token-passes — a \(\sim500\times\) saving at \(L=1{,}000\).
generate \(L\) tokens	token-passes through the stack	order
without KV-cache (re-encode the prefix)	\(1+2+3+4=10\) (for \(L{=}4\))	\(O(L^2)\)
with KV-cache (one new token each step)	\(1+1+1+1=4\) (for \(L{=}4\))	\(O(L)\)

At \(L=4\) the cache already saves \(2.5\times\); at \(L=1{,}000\) it saves about \(500\times\). The cache is the difference between a chatbot that answers in a second and one that crawls.
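The counts in the table can be checked with a small counting function. It counts token-passes only, under the same simplification as the table (no prompt, one pass per token pushed through the stack); `token_passes` is an illustrative name.

```python
def token_passes(num_tokens, use_cache):
    """Count token-passes through the stack when generating `num_tokens` tokens."""
    total = 0
    for step in range(1, num_tokens + 1):
        # With the cache only the new token is processed at each step;
        # without it the whole prefix of length `step` is re-encoded.
        total += 1 if use_cache else step
    return total

for length in (4, 1000):
    without_cache = token_passes(length, use_cache=False)
    with_cache = token_passes(length, use_cache=True)
    print(length, without_cache, with_cache, without_cache / with_cache)
# 4 10 4 2.5
# 1000 500500 1000 500.5
```

The ratio is \((L+1)/2\), which is why the saving grows roughly in proportion to the generated length.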

Aside — cross-attention (encoder→decoder). The original 2017 Transformer was an encoder–decoder for translation: the decoder ran a second attention whose queries come from the decoder but whose keys/values come from the encoder’s output — cross-attention, the decoder “reading” the source sentence. The decoder-only LLMs and the encoder-only/causal recommenders this chapter targets (BERT4Rec, SASRec) drop that block, so we only name it; the self-attention machinery is identical, just with \(Q\) from one sequence and \(K,V\) from another.

6. Exercises

Work these by hand — the numbers are kept tiny on purpose, and each reuses an example or number already worked above. Full worked solutions are in the Solutions appendix at the back of the book.

(compute) Number a tiny catalog \(1,\dots,5\) and take the two one-hot vectors \(\mathbf{e}_2\) (a \(1\) in slot \(2\), else \(0\)) and \(\mathbf{e}_4\) (§1.1). Compute their dot product and their cosine similarity. Why does every pair of distinct one-hots come out this way, and what does that say about Titanic vs The Notebook under one-hot encoding?
(compute) Give two words the cartoon 2-D vectors \(\mathbf{v}_{\text{coffee}}=[4,3]\) and \(\mathbf{v}_{\text{tea}}=[3,4]\) (axes = “morning-drink-ish”, “hot-beverage-ish”). Compute the cosine similarity \(\cos(\mathbf{v}_{\text{coffee}},\mathbf{v}_{\text{tea}})\) (Linear Algebra note / §1.1). Is the value near \(+1\), and what would the distributional hypothesis (§2.1) predict about two words this close?
(concept) In one or two sentences, state the skip-gram objective of word2vec (§2.2): what does it slide over the text, what does each center word try to predict, and how does the dot product \(\mathbf{u}_c\cdot\mathbf{v}_w\) enter? Then explain what negative sampling changes and why it is needed — what expensive thing in \(P(c\mid w)\) does it avoid?
(compute) Reuse §4.2’s three tokens \(X=\left[\begin{smallmatrix}1&0\\1&1\\0&1\end{smallmatrix}\right]\) with identity projections (\(Q=K=V=X\)) and \(d_k=2\). The chapter worked the middle row (\(t_2\)); now do the first row (\(t_1=\) Titanic\(=[1,0]\)): (a) form its three scaled scores \(t_1\!\cdot\!t_j/\sqrt{2}\); (b) softmax them into attention weights; (c) take the weighted blend of the value rows to get \(t_1\)’s output vector. Confirm you reproduce the chapter’s first output row \([0.802,\,0.599]\).
(compute) Self-attention divides the scores by \(\sqrt{d_k}\) (§4.1). For the middle token of §4.2, the raw self-score is \([1,1]\cdot[1,1]=2\). (a) What is the scaled self-score \(2/\sqrt{2}\)? (b) In one sentence, say what would go wrong with the softmax in §4.2 if we used the raw scores at a large \(d_k\) instead, i.e. why the mechanism is called scaled dot-product attention.
(concept) §4.3 runs multi-head attention: several attentions in parallel, each with its own learned \(Q/K/V\) projections, concatenated at the end. The single head in §4.2 used identity projections. In two or three sentences, explain what a single head can and cannot capture, and what running several heads with different projections buys you (use the romance / recency / genre framing of §4.3).
(extend) Apply §5.3’s two learned BPE merge rules — o+w, then l+ow, in that order — to the new unseen word slow, written s l o w _. (a) Replay the rules step by step and list the final tokens. (b) Separately, in the original two-word corpus (low ×5, lowest ×2), what is the count of the adjacent pair e,s, and why is it not the pair BPE merges first?
(extend) Extend §5.4’s KV-cache table to a sequence of length \(L=5\). (a) Without the cache, each step re-encodes the whole prefix, so the total is \(1+2+3+4+5\) token-passes — how many is that? (b) With the cache, each step pushes only the one new token — how many passes total? (c) Confirm the ratio matches the chapter’s claim that the saving grows like \(L\) (the chapter quotes \(\approx 500\times\) at \(L=1{,}000\)).
(extend) Using the sinusoidal formula of §4.3 with \(d=4\), the chapter evaluates the positional encoding at \(\text{pos}=1\) as \(PE=[0.841,\,0.540,\,0.010,\,1.000]\). Evaluate it at \(\text{pos}=2\): compute \(PE(2,0)=\sin(2/10000^{0})\), \(PE(2,1)=\cos(2/10000^{0})\), \(PE(2,2)=\sin(2/10000^{1/2})\), \(PE(2,3)=\cos(2/10000^{1/2})\). (Note \(10000^{0}=1\) and \(10000^{1/2}=100\).) Why does giving each position its own vector matter to an order-blind attention layer?
(apply) A sequential recommender (SASRec, §4.4) turns a watch history into one intent vector \(h=[1,2]\) on §4.2’s two taste axes (here leaning musical), and scores each candidate item by the dot product \(h\cdot e_j\) — exactly §2.4’s item2vec / §4.4’s next-item head. Score the chapter’s three item vectors \(e=[1,0]\) (Titanic), \([1,1]\) (Notebook), \([0,1]\) (La La Land). (a) Rank them by the raw dot product. (b) Re-rank by cosine similarity. Do the rankings agree, and which item does the raw dot product flatter partly for its larger norm — the length bias §4’s serving discussion (and the Linear Algebra note) warns about?
(compute) LSTM cell, one step (§3.3). §3.3 traced a step that kept a memory; now trace one that overwrites it. The cell holds \(c_{t-1}=0.50\), and this step’s gates are \(f_t=0.40\) (mostly forget), \(i_t=0.90\) (mostly write), candidate \(\tilde c_t=0.80\), and output gate \(o_t=0.50\). (a) Compute the new cell state \(c_t=f_t\,c_{t-1}+i_t\,\tilde c_t\). (b) Compute the readout \(h_t=o_t\tanh(c_t)\) (use \(\tanh(0.92)\approx0.726\)). (c) In one sentence, contrast this step with §3.3’s worked step: which is the LSTM remembering, and which is it overwriting?
(compute) — skip-gram probabilities. A center word has vector \(\mathbf v_w=[1,1]\) (§2.2); three context candidates have \(\mathbf u_A=[2,1]\), \(\mathbf u_B=[0,1]\), \(\mathbf u_C=[-1,1]\). (a) Score each by \(\mathbf u_c\cdot\mathbf v_w\). (b) Turn the three scores into \(P(c\mid w)\) with the softmax. (c) Which context word is the model most sure of, and does it match which vector best aligns with \(\mathbf v_w\)?
(compute) — \(Q,K,V\) are learned views. A token is \(t=[2,1]\), with projections \(W_Q=\begin{bmatrix}2&0\\0&0\end{bmatrix}\), \(W_K=\begin{bmatrix}0&0\\0&2\end{bmatrix}\), \(W_V=I\) (§4.2). (a) Compute \(q=tW_Q\), \(k=tW_K\), \(v=tW_V\). (b) Are they three different vectors? (c) Compute \(q\cdot k\) — and say in one line what attention now matches on that the raw similarity \(t\cdot t\) would not.
(compute) — the KV-cache payoff. Generating \(L\) tokens autoregressively (§5.4). (a) Without a KV-cache you re-encode the whole prefix each step; how many token-passes for \(L=5\) (use \(1+2+\dots+L\))? (b) With the cache, one new token per step — how many passes? (c) What exactly is stored, and roughly what speedup does it give at \(L=1000\)?

7. The bridge to RecSys: one table, then the hand-off

Every technique in this chapter has a direct recommender incarnation. This is the map:

Every NLP representation technique in this chapter has a direct **recommender incarnation** — the progression from sparse vectors to pretrained LLMs maps cleanly onto the *LLM × RecSys* taxonomy.
Representation idea (this chapter)	RecSys incarnation	Where it lives
One-hot / BoW / TF-IDF (§1)	content-based filtering	Traditional RecSys
word2vec predict-based embedding (§2)	item2vec, prod2vec	§2.4
RNN / LSTM / GRU (§3)	GRU4Rec (session-based)	§3.4
Self-attention / Transformer (§4)	SASRec (causal), BERT4Rec (bidirectional)	§4.4
Pretrained encoder → text embedding (§5)	RLMRec semantic view, KAR features	LLM × RecSys §2.3, §3
Quantized embedding → discrete codes (§5/§2)	TIGER Semantic IDs (generative rec; not built here)	LLM × RecSys §2.2
Generative LLM (GPT-style, §5)	LLM-as-recommender (TALLRec, P5)	LLM × RecSys §2.1

The hand-off to LLM × RecSys. LLM × RecSys opens by assuming only that an LLM “maps text to a vector and can generate text”. You now know what is inside that box: a discrete symbol becomes a vector because we embed it (§1) and train it to predict context (§2); a sequence is processed by attention (§4); the model is a pretrained Transformer (§5); and “maps text to a vector” is a frozen encoder read-out (§5.2). The four roles of LLM × RecSys are four ways to wire that machinery into a recommender — and the enhancer/graph line (RLMRec-style) is the cheapest: LLM offline for meaning, graph backbone online for the collaborative signal.

That is the whole arc of the book in one sentence: turn things into vectors whose geometry carries meaning, learn those vectors ever more powerfully (counts → word2vec → RNN → Transformer → LLM), and recommend by comparing them.

8. Glossary

Term	Plain meaning
One-hot vector	A length-\(V\) vector, \(1\) in one slot, \(0\) elsewhere; encodes identity but no similarity (all pairs orthogonal).
Embedding	A learned dense low-dim vector for a discrete symbol; meaning = location in the space.
Static vs contextual	word2vec gives one fixed vector per word; a Transformer gives a different vector per occurrence (context-dependent).
Distributional hypothesis	“A word is known by the company it keeps” — similar contexts ⟹ similar vectors (Firth 1957).
word2vec	Predict-based word embeddings trained so a word predicts its context (skip-gram / CBOW), Mikolov 2013.
Negative sampling	Train against a few random negatives instead of normalizing over the whole vocabulary (cf. BPR, InfoNCE).
item2vec	word2vec applied to user interaction sequences (“item = word, history = sentence”); a collaborative item embedding.
RNN	Recurrent Neural Network: one shared cell that updates a hidden-state “memory” \(h_t=\tanh(Wh_{t-1}+Ux_t+b)\) along a sequence.
Vanishing gradient	Back-prop through many steps multiplies many small factors → gradient decays → long-range dependencies unlearnable.
LSTM	Long Short-Term Memory: an RNN with a cell state + forget/input/output gates; the additive cell line stops gradients vanishing.
GRU	Gated Recurrent Unit: a lighter 2-gate relative of the LSTM (Cho 2014); GRU4Rec is its session-based recommender.
Attention	\(\mathrm{softmax}(QK^\top/\sqrt{d_k})V\): each position takes a similarity-weighted blend of all positions’ value vectors.
Query / Key / Value	Soft database lookup: a query is matched (dot product) against keys to retrieve a blend of values.
Self-attention	Attention of a sequence to itself (Q, K, V all from the same tokens).
Multi-head	Several attentions in parallel with different learned projections, then concatenated.
Feed-forward (FFN)	Per-token MLP \(\max(0,xW_1{+}b_1)W_2{+}b_2\) inside each block; transforms each token (most of the parameters).
Positional encoding	A per-position signal added to embeddings so order-blind attention can see order (sinusoidal or learned).
Transformer	A stack of (self-attention + add&norm + feed-forward + add&norm) blocks; no recurrence, fully parallel (Vaswani 2017).
Residual connection	A sub-layer computes \(x+f(x)\); the \(+1\) in its derivative is a gradient highway for deep stacks.
SASRec / BERT4Rec	Sequential recommenders: a causal (left-to-right) / bidirectional (masked) Transformer over the item history.
Subword tokenization (BPE)	Split text into reusable sub-word pieces by greedily merging the most frequent adjacent pair; fixed vocabulary, no `<UNK>` (WordPiece/SentencePiece are cousins).
Pretraining	Train a big Transformer on a huge corpus with a self-supervised objective, then reuse it.
BERT	Bidirectional Encoder (masked-LM); an encoder for understanding text → embeddings (Devlin 2019).
GPT	Generative Pre-trained Transformer (next-token, causal); a decoder for generating text (Radford 2018).
Autoregressive decoding	Generate one token, append it, run the model again, repeat — the GPT/SASRec generation loop.
KV-cache	Store past tokens’ keys/values so each decode step processes only the new token: \(O(L)\) instead of \(O(L^2)\).
Frozen text encoder	Run text once through a fixed pretrained model and keep the output vector — the semantic \(s_i\) RecSys aligns/consumes.
LLM	A large pretrained Transformer language model; here, demystified as this chapter’s lineage at scale.

9. References

Barkan, O., & Koenigstein, N. (2016). Item2Vec: Neural item embedding for collaborative filtering. In Proceedings of the 2016 IEEE International Workshop on Machine Learning for Signal Processing (MLSP). arXiv:1603.04259
Cho, K., van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724–1734. arXiv:1406.1078
Deisenroth, M. P., Faisal, A. A., & Ong, C. S. (2020). Mathematics for machine learning. Cambridge University Press. https://mml-book.github.io/
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT 2019. arXiv:1810.04805
Firth, J. R. (1957). A synopsis of linguistic theory, 1930–1955. In Studies in linguistic analysis (pp. 1–32). Blackwell.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press. https://www.deeplearningbook.org/
Hidasi, B., Karatzoglou, A., Baltrunas, L., & Tikk, D. (2016). Session-based recommendations with recurrent neural networks (GRU4Rec). In Proceedings of the 4th International Conference on Learning Representations (ICLR). arXiv:1511.06939
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
Kang, W.-C., & McAuley, J. (2018). Self-attentive sequential recommendation (SASRec). In Proceedings of the 2018 IEEE International Conference on Data Mining (ICDM). arXiv:1808.09781
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26 (NeurIPS), pp. 3111–3119. arXiv:1310.4546
Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training (GPT). Technical report, OpenAI. https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
Sun, F., Liu, J., Wu, J., Pei, C., Lin, X., Ou, W., & Jiang, P. (2019). BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management (CIKM). arXiv:1904.06690
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., et al. (2017). Attention is all you need. In Advances in Neural Information Processing Systems 30 (NeurIPS). arXiv:1706.03762

Online sources verified June 2026.