-
RLHF without RL - Direct Preference Optimization
We discuss the RL part of RLHF and its recent displacement by direct preference optimization (DPO). With DPO, a language model can be aligned with human preferences without sampling from the model during training, which significantly simplifies the process. By now, DPO has been implemented in many projects and seems to be here to stay.
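To make the idea concrete, here is a minimal sketch of the DPO loss for a single preference pair. It assumes the sequence log-probabilities of the chosen and rejected responses under the policy and a frozen reference model are already computed; the function name, argument names, and the beta value are illustrative, not taken from any particular implementation.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid(beta * margin),
    where the margin compares the policy's log-ratios (vs. the frozen
    reference model) on the chosen and rejected responses."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)); equals log(2) when the margin is zero
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# If the policy favors the chosen response more than the reference does,
# the margin is positive and the loss drops below log(2).
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

Minimizing this loss pushes the policy to put relatively more probability on preferred responses than the reference model does, using only logged preference data and no on-policy sampling.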