A Comprehensive Survey of LLM Alignment Techniques: RLHF, RLAIF, PPO, DPO and More

Time has passed and I have forgotten how I read this paper back then. My questions now are: What is alignment for an LLM? Do VLMs need alignment too? How is alignment actually implemented? Starting from this paper, what should I do? Re-reading the paper (1)

  • Paper title: A Comprehensive Survey of LLM Alignment Techniques: RLHF, RLAIF, PPO, DPO and More
  • code:
  • Tags: alignment
  • Date: July 23, 2024 (latest)

Content

The survey is organized into four topics (see the formula sketch after the list):

  1. Reward Model
    1. explicit RM vs. implicit RM
    2. pointwise RM vs. preference model
    3. Response-level reward vs. token-level reward
    4. negative preference optimization
  2. Feedback
    1. Preference Feedback vs. Binary Feedback
    2. Pairwise Feedback vs. Listwise Feedback
    3. Human Feedback vs. AI Feedback
  3. Reinforcement Learning
    1. Reference-Based RL vs. Reference-Free RL
    2. Length-Control RL
    3. Different Divergences in RL
    4. On Policy RL vs. Off-Policy RL
  4. Optimization
    1. Online/Iterative Preference Optimization vs. Offline/Non-iterative Preference Optimization
    2. Separating SFT and Alignment vs. Merging SFT and Alignment
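To make the "explicit RM vs. implicit RM" distinction in topic 1 concrete, here is a minimal sketch using the standard RLHF/DPO objectives from the literature (my own addition, not reproduced from the survey): an explicit reward model $r_\phi$ is trained separately and then optimized against under a KL penalty, whereas DPO rewrites the same objective so that the policy itself defines an implicit reward.

$$
\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot\mid x)}\big[r_\phi(x,y)\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\big[\pi_\theta(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big]
$$

$$
\mathcal{L}_{\mathrm{DPO}} \;=\; -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} \;-\; \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right]
$$

Here $y_w$ and $y_l$ are the preferred and dispreferred responses in a pairwise preference sample, $\pi_{\mathrm{ref}}$ is the frozen reference (SFT) policy, and the implicit reward is $\hat r_\theta(x,y)=\beta\log\frac{\pi_\theta(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)}$ up to a prompt-only term. The dependence on $\pi_{\mathrm{ref}}$ is also where the reference-based vs. reference-free distinction in topic 3 comes from.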

Thoughts

  1. What is alignment?