LLM RL Optimization with PRM
Reading DeepSeek-R1: although PRMs can facilitate the RL optimization of LLMs, they face several issues:
- Obtaining training data for a PRM is time- and money-consuming.
- It is hard for humans to accurately annotate process rewards.
- An accurate ORM is more useful than a rough PRM.

📍 How can we accurately annotate process rewards without intensive human labeling? This problem is especially important for Multi-hop QA / Web Search.

Watch Every Step! LLM Agent Learning via Iterative Step-Level Process Refinement: utilizes MCTS to build the PRM and combines it with the ORM to conduct RL optimization...
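To make the idea concrete, here is a minimal Python sketch, not the paper's implementation: the value of a step is estimated Monte-Carlo-style (as in MCTS rollouts) by completing the partial trajectory several times and scoring each completion with the ORM, then the per-step process signal is mixed with the final outcome reward. All names (`policy`, `outcome_reward`, `alpha`) are hypothetical placeholders, not from the paper.

```python
# Hedged sketch: MCTS/Monte-Carlo-style process reward estimation,
# combined with an outcome reward. `policy` and `outcome_reward` are
# assumed, hypothetical callables -- not an actual API from the paper.

def estimate_step_value(prefix_steps, policy, outcome_reward, n_rollouts=8):
    """Score a partial trajectory by how often random completions succeed.

    prefix_steps   : list[str] -- reasoning/action steps taken so far
    policy         : callable  -- samples a full continuation from a prefix
    outcome_reward : callable  -- ORM: maps a complete trajectory to [0, 1]
    """
    successes = 0.0
    for _ in range(n_rollouts):
        completion = policy(prefix_steps)              # one sampled rollout
        successes += outcome_reward(prefix_steps + completion)
    return successes / n_rollouts                      # MC estimate of value


def combined_reward(step_values, outcome_score, alpha=0.5):
    """Mix per-step process rewards with the final outcome reward.

    step_values   : list[float] -- MC-estimated value of each step prefix
    outcome_score : float       -- ORM score of the full trajectory
    alpha         : float       -- weight between process and outcome signal
    """
    process_score = sum(step_values) / max(len(step_values), 1)
    return alpha * process_score + (1 - alpha) * outcome_score
```

The appeal of this scheme is that the PRM needs no human step annotations: step labels come from rollout success rates, while the ORM anchors the overall signal; `alpha` trades dense step-level credit assignment against the more reliable but sparse outcome reward.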