RL introduces bias into a model's distribution over internet text. By contrast, when a language model is trained with cross-entropy loss, it minimizes the divergence between its learned distribution over text and the distribution of the training data.
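To make the second claim concrete, a standard decomposition (the notation here is ours: $p_{\text{data}}$ for the training distribution, $p_\theta$ for the model) shows that the expected cross-entropy loss splits into an entropy term and a KL term:

\[
H(p_{\text{data}}, p_\theta)
  \;=\; \mathbb{E}_{x \sim p_{\text{data}}}\!\big[-\log p_\theta(x)\big]
  \;=\; H(p_{\text{data}}) \;+\; D_{\mathrm{KL}}\!\big(p_{\text{data}} \,\|\, p_\theta\big).
\]

Since $H(p_{\text{data}})$ does not depend on $\theta$, minimizing cross-entropy over the training data is equivalent to minimizing the forward KL divergence $D_{\mathrm{KL}}(p_{\text{data}} \,\|\, p_\theta)$, i.e., to matching the training data distribution as closely as the model class allows.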