Appendix D — Reinforcement Learning for Recommendation

1. From bandits to RL: the MDP

RL formalizes sequential decision-making as a Markov Decision Process (MDP) — five pieces:

a state \(s\) — a summary of the situation now (the user’s recent history, time, context);
an action \(a\) — what the agent does (which item, or slate of items, to show);
a reward \(r\) — the immediate payoff (a click, a watch-minute, a return visit);
a transition — taking action \(a\) in state \(s\) moves you to a new state \(s'\) (this is the ingredient bandits lack: actions change the next state);
a policy \(\pi(a\mid s)\) — the rule mapping states to actions, which is what we learn.

The goal is not the next reward but the expected long-term (cumulative, discounted) reward, the return:

\[ G \;=\; r_0 + \gamma\,r_1 + \gamma^2\,r_2 + \cdots \;=\; \sum_{t\ge 0}\gamma^{t}\,r_t, \]

where \(\gamma\in[0,1)\) is the discount factor — how much a reward one step later is worth than one now (it both encodes “sooner is better” and keeps an infinite sum finite). The value \(V(s)\) is the expected return from state \(s\) under the policy, and the action-value \(Q(s,a)\) the expected return from taking \(a\) in \(s\) then following the policy; RL methods learn one of these and act greedily (or explore, exactly as a bandit does) with respect to it.

Worked example — a discounted return. Suppose over three steps the agent collects rewards \(r = (1,\,0,\,1)\) with discount \(\gamma = 0.9\). Then \(G = 1 + 0.9(0) + 0.9^2(1) = 1 + 0 + 0.81 = \mathbf{1.81}.\) The click two steps out is worth only \(\gamma^2 = 0.81\) of an immediate one — the same click, discounted for being later. Push \(\gamma\to 0\) and RL collapses back to the bandit’s myopic “maximize the immediate reward”; push \(\gamma\to 1\) and the far future matters almost as much as now.

Figure D.1: **The discounted return \(G=\sum_t\gamma^t r_t\) on the §1 worked example** (\(r=(1,0,1)\), \(\gamma=0.9\)). Each bar is one step’s *discounted* reward \(\gamma^t r_t\): the immediate \(r_0\) counts in full, but the identical reward at \(t{=}2\) is shrunk to \(\gamma^2=0.81\). The bars sum to \(G=\mathbf{1.81}\). Send \(\gamma\to0\) and only the first bar survives (the myopic bandit); send \(\gamma\to1\) and every bar counts equally.

A bandit is exactly an MDP with a single state (no transitions): that is why the explore / exploit machinery of Bandits & Online Recommendation reappears here, now wrapped around a notion of long-term value rather than immediate reward.

Figure D.2: **The reinforcement-learning loop.** The agent picks an action from its policy; the environment (the user) returns a *reward* *and* a new *state*. The new state is what bandits lack — because the action changes it, the agent must optimize the *discounted return* \(G=\sum_t\gamma^t r_t\), not just the next reward. A bandit is this loop with a single, unchanging state.

2. Why RL is genuinely hard in recommendation

RL is alluring for recommendation — “optimize long-term engagement, not next-click clickbait” — but four obstacles make it hard in practice, and a from-zero reader should know them before reaching for it:

A combinatorial action space. The action is often a whole slate (a page of items), so the number of actions is astronomically large — ordinary Q-learning, which enumerates actions, does not apply directly. SlateQ (Ie et al., 2019) is the standard fix: it decomposes a slate’s value into per-item \(Q\)-values under a click model, making the problem tractable.
It is off-policy by necessity. You cannot let a half-trained agent experiment on real users for months, so you must learn from logged data collected by a different (old) policy — the hard, bias-prone off-policy setting (the same logged-data problem behind the bandit offline replay of Bandits & Online Recommendation).
Delayed, sparse reward + the simulator gap. “Long-term satisfaction” arrives much later than the action that caused it (credit assignment is hard), and training RL needs millions of interactions — so teams build user simulators, which are themselves recommendation models and can bake in their own errors.
Safety and non-stationarity. An exploring policy on a live platform can hurt real users and revenue, and user behaviour drifts under the policy itself.

3. The main approaches, in one line each

Value-based (Q-learning). Learn \(Q(s,a)\) and act greedily; for slates, SlateQ makes it tractable by decomposition.
Policy-gradient. Directly optimize the policy \(\pi_\theta\) by gradient ascent on expected return — common at web scale (e.g. off-policy-corrected REINFORCE for very large action sets).
Deep RL for news (DRN). Zheng et al. (2018) use deep Q-learning for online news, explicitly modelling future reward and a user-return signal beyond the immediate click — a concrete case of optimizing for retention rather than the next tap.

4. Where it pays off — and an honest caveat

RL earns its complexity when the objective is genuinely long-horizon: total session length, multi-step journeys (a course syllabus, a show’s episode order), or retention (will the user come back next week?) — outcomes a one-step click model or bandit cannot express. The honest caveat, in the spirit of the reproducibility caution of Datasets & Benchmarks §5: RL in recommendation is operationally heavy and often hard to reproduce, and a well-tuned bandit or supervised ranker is frequently competitive on the metrics that are actually measured. Reach for RL when the objective itself is long-term and you can pay the engineering cost — not by default.

This is why RL lives here as a pointer appendix rather than a core chapter: the book’s through-line is collaborative filtering from classical to graph- and LLM-based models, and RL is an adjacent paradigm the through-line touches (via bandits) but does not need. For the one-step case, return to Bandits & Online Recommendation; for the full theory, the references below.

5. Glossary

Term	Plain meaning
MDP	Markov Decision Process: state, action, reward, transition, policy — the formalism of RL.
State \(s\)	A summary of the current situation; the ingredient bandits lack (actions change it).
Policy \(\pi(a\mid s)\)	The learned rule mapping a state to an action; what RL optimizes.
Return \(G\)	Cumulative discounted reward \(\sum_t \gamma^t r_t\) — the long-term objective.
Discount \(\gamma\)	How much a later reward is worth vs. now; \(\gamma\to0\) is myopic (a bandit), \(\gamma\to1\) far-sighted.
Value / \(Q\)-value	Expected return from a state (\(V\)) or a state–action pair (\(Q\)).
Slate	A whole page/list of recommended items — the (combinatorial) action in slate RL.
Off-policy	Learning from data collected by a different (older) policy than the one being trained.
SlateQ	Decomposes a slate’s value into per-item \(Q\)-values, making value-based slate RL tractable.

6. References

Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction (2nd ed.). MIT Press. (the standard text: MDPs, TD, Q-learning, policy gradients).
Ie, E., et al. (2019). SlateQ: A tractable decomposition for reinforcement learning with recommendation sets. In Proceedings of IJCAI. https://doi.org/10.24963/ijcai.2019/360
Zheng, G., et al. (2018). DRN: A deep reinforcement learning framework for news recommendation. In Proceedings of WWW. https://doi.org/10.1145/3178876.3185994
Afsar, M. M., Crump, T., & Far, B. (2022). Reinforcement learning based recommender systems: A survey. ACM Computing Surveys, 55(7), Article 145. https://doi.org/10.1145/3543846

Online sources verified June 2026.