Siri.Deng’s Log

👋 Hi, this is Siri.Deng, aka Zhirui Deng!

LLM RL Optimization with PRM

Notes after reading DeepSeek-R1: although a process reward model (PRM) can facilitate the optimization of LLMs during RL, it faces several problems. Obtaining training data for a PRM is time-consuming and expensive. It is hard for people to annotate process rewards accurately. An accurate outcome reward model (ORM) is more useful than a rough PRM. 📝 How can process rewards be annotated accurately without intensive human effort? This problem matters even more for multi-hop QA and web search. Watch Every Step! LLM Agent Learning via Iterative Step-Level Process Refinement: utilizes MCTS to build the PRM and combines it with the ORM to conduct RL optimization...
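To make the idea of annotating process rewards without human labels a bit more concrete, here is a minimal sketch. It uses plain Monte Carlo rollouts rather than a full MCTS, and it is not the method of any of the papers above. The names `estimate_step_rewards`, `combine_rewards`, `toy_rollout`, and the weight `alpha` are all hypothetical, and `toy_rollout` is only a stand-in for sampling a completion from the policy and scoring it with an ORM.

```python
import random
from typing import Callable, List, Sequence


def estimate_step_rewards(
    steps: Sequence[str],
    rollout: Callable[[Sequence[str], random.Random], float],
    num_rollouts: int = 8,
    seed: int = 0,
) -> List[float]:
    """Score each prefix steps[:t+1] by the mean outcome of sampled completions."""
    rng = random.Random(seed)
    rewards = []
    for t in range(len(steps)):
        prefix = steps[: t + 1]
        # Each rollout completes the trajectory from this prefix and returns an outcome score.
        outcomes = [rollout(prefix, rng) for _ in range(num_rollouts)]
        rewards.append(sum(outcomes) / num_rollouts)
    return rewards


def combine_rewards(
    step_rewards: Sequence[float], outcome_reward: float, alpha: float = 0.5
) -> List[float]:
    """Blend each step-level estimate with the final outcome reward (assumed linear mix)."""
    return [alpha * r + (1.0 - alpha) * outcome_reward for r in step_rewards]


def toy_rollout(prefix: Sequence[str], rng: random.Random) -> float:
    # Hypothetical environment: success is more likely when more steps in the prefix are "good".
    p_success = sum(1 for s in prefix if s == "good") / len(prefix)
    return 1.0 if rng.random() < p_success else 0.0


if __name__ == "__main__":
    trajectory = ["good", "good", "bad"]
    step_rewards = estimate_step_rewards(trajectory, toy_rollout, num_rollouts=16)
    print(combine_rewards(step_rewards, outcome_reward=1.0, alpha=0.5))
```

The per-step values returned by `combine_rewards` could then serve as the reward signal in an RL update. How to weight the process and outcome terms is a design choice that this sketch leaves open.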

2 min · Theme PaperMod