Reinforcement Learning with LLMs

After a base large language model is pretrained, it goes through multiple rounds of post-training to shape and align how it responds. This post-training tends to focus on refining traits such as what it knows, how it reasons, and the style it uses. To do this, most modern systems rely on three main levers, each sketched in code after the list below:

  1. Supervised Fine-Tuning (SFT) to teach formats and tasks by imitation,
  2. Preference Training to prefer better answers over worse ones, and
  3. Reinforcement Learning (RL) to optimize for sequence-level goals and generalize procedures.
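
To make the distinction concrete, each lever optimizes a different training objective. Here is a toy PyTorch sketch of the three losses (simplified for illustration; production implementations such as TRL's trainers add batching, token masking, and KL control):

```python
import torch
import torch.nn.functional as F

# 1. SFT: imitate demonstrations via token-level cross-entropy.
def sft_loss(logits, target_ids):
    # logits: (batch, seq_len, vocab), target_ids: (batch, seq_len)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), target_ids.reshape(-1))

# 2. Preference training (DPO-style): widen the policy's margin between a
#    chosen and a rejected completion, relative to a frozen reference model.
def dpo_loss(pi_chosen_logp, pi_rejected_logp, ref_chosen_logp, ref_rejected_logp, beta=0.1):
    margin = beta * ((pi_chosen_logp - ref_chosen_logp) - (pi_rejected_logp - ref_rejected_logp))
    return -F.logsigmoid(margin).mean()

# 3. RL (policy-gradient style): reinforce entire sequences in proportion to a
#    sequence-level reward, using the batch mean as a simple baseline.
def policy_gradient_loss(sequence_logp, rewards):
    advantages = rewards - rewards.mean()
    return -(sequence_logp * advantages).mean()
```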

These techniques are applied at different stages of the model's training lifecycle depending on the alignment goal, often in the following order:

  • Pre-Training a language model using next-token prediction at scale.
  • SFT for base instruction-following + tools + formatting.
  • Preference Training to push general answer quality and safety (RLHF, DPO/KTO, or a reward model + small PPO; the DPO objective is shown after this list).
  • Targeted RL where you have strong rewards (coding, math, tool-using agents, browsing).
  • Continual Refresh with new data, procedures, and preferences as they become available.
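
For reference, the DPO objective named above (Rafailov et al., 2023) optimizes the policy directly from preference pairs, where y_w is the preferred and y_l the rejected completion, π_ref is a frozen reference policy, and β controls how far the policy may drift from it:

$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]$$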

Training language models to follow instructions with human feedback

Beyond pretraining and SFT, reinforcement learning is central to both preference training and (obviously) targeted RL above. For more on how RL drives alignment and preference optimization, check out my other resource on LLM alignment in post-training here!

A Survey of Reinforcement Learning for Large Reasoning Models

However, with quality pre-training data largely saturated and supervised fine-tuning yielding diminishing returns at scale, many labs have been innovating more heavily on reinforcement learning.

In this notebook we'll cover:

  • A light refresher of reinforcement learning in the context of large language models
  • When and why RL is used with LLMs
  • The adoption of reinforcement learning with verifiable rewards (a minimal reward function is sketched after this list)
  • How to create your own RL environment and train an LLM within it!
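
A verifiable reward is simply a program that scores a completion against ground truth, with no learned reward model involved. A minimal sketch in Python (the function name, the `answers` column, and the GSM8K-style `#### <answer>` format are illustrative assumptions; TRL's GRPOTrainer, for instance, accepts reward functions of roughly this shape):

```python
import re

def math_answer_reward(completions, answers, **kwargs):
    """Return 1.0 per completion whose final '#### <answer>' line matches
    the ground-truth answer, else 0.0. Deterministic and checkable."""
    rewards = []
    for completion, answer in zip(completions, answers):
        match = re.search(r"####\s*(-?[\d.,]+)", completion)
        predicted = match.group(1).replace(",", "") if match else None
        rewards.append(1.0 if predicted == str(answer) else 0.0)
    return rewards

# Example: the second completion never states a final answer, so it scores 0.
print(math_answer_reward(
    completions=["2 + 2 is 4.\n#### 4", "I think the answer is four."],
    answers=[4, 4],
))  # -> [1.0, 0.0]
```

Because the check is programmatic, it is far harder to game than a learned reward model, which is a big part of why RLVR has taken off in math, coding, and tool-use domains.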
