DeepSeek AI, which has gained recognition for its powerful open-source language models such as DeepSeek-R1, has announced a significant advance in reward modeling for large language models (LLMs).
Its new technique, Self-Principled Critique Tuning (SPCT), aims to create generalist and scalable reward models (RMs). This could lead to more capable AI applications for open-ended tasks and domains where current models cannot capture the nuances and complexity of their environment and users.
The key role and current limitations of reward models
Reinforcement learning (RL) has become a cornerstone of developing cutting-edge LLMs. In RL, the model is fine-tuned based on feedback signals that indicate the quality of its responses.
Reward models are the components that provide these signals. Essentially, an RM acts as a judge, evaluating LLM outputs and assigning a score or “reward” that guides the RL process and teaches the LLM to generate more useful responses.
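To make the RM's role concrete, here is a minimal sketch, in Python, of where a reward model sits in an RL fine-tuning loop. This is an illustration rather than DeepSeek's code; the `generate` and `reward_model` callables are hypothetical stand-ins.

```python
from typing import Callable, List

def rl_step(
    prompts: List[str],
    generate: Callable[[str], str],             # policy LLM: prompt -> response
    reward_model: Callable[[str, str], float],  # RM: (prompt, response) -> score
) -> List[float]:
    """Collect one batch of reward signals that a policy-gradient update would consume."""
    rewards = []
    for prompt in prompts:
        response = generate(prompt)
        # The RM acts as the judge: its score is the training signal for the policy.
        rewards.append(reward_model(prompt, response))
    return rewards

# Toy stand-ins so the sketch runs end to end.
if __name__ == "__main__":
    toy_policy = lambda p: p + " ... a generated answer"
    toy_rm = lambda prompt, resp: float(len(resp) > len(prompt))  # placeholder scoring rule
    print(rl_step(["Explain RL in one sentence."], toy_policy, toy_rm))
```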
However, current RMs often face limitations. They usually excel in narrow domains with clear rules or easily verifiable answers. For example, current state-of-the-art reasoning models such as DeepSeek-R1 underwent an RL phase in which they were trained on math and coding problems where the ground truth is clearly defined.
Creating reward models for complex, open-ended or subjective queries in general domains, however, remains a major hurdle. In their paper, the DeepSeek AI researchers explain their approach, writing that a “generalist RM should generate high-quality rewards beyond specific domains, where the reward criteria are more diverse and complex and often lack explicit references or ground truths.”
They highlight four key challenges in creating generalist RMs that can handle a wider range of tasks:
- Input flexibility: The RM must handle various input types and be able to evaluate one or more responses simultaneously.
- Accuracy: It must generate accurate reward signals across diverse domains where the criteria are complex and the ground truth is often unavailable.
- Inference-time scalability: The RM should produce higher-quality rewards when more computational resources are allocated during inference.
- Learning scalable behaviors: For RMs to scale effectively at inference time, they need to learn behaviors that allow performance to improve as more compute is used.
Reward models can be broadly classified by their “reward generation paradigm” (e.g., scalar RMs, which output a single score, versus generative RMs, which produce text critiques) and their “scoring pattern” (pointwise scoring assigns an individual score to each response, while pairwise scoring selects the better of two responses). These design choices affect a model's suitability for generalist tasks, particularly its input flexibility and its potential for inference-time scaling.
For example, simple scalar RMs struggle with inference-time scaling because they repeatedly generate the same score, while pairwise RMs cannot easily evaluate a single response.
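As a rough illustration of this design space, the sketch below contrasts the three interfaces in Python. The function bodies are placeholders, not real models.

```python
from dataclasses import dataclass

def scalar_pointwise_rm(query: str, response: str) -> float:
    """Outputs a single number; re-running it yields the same score, so extra samples add nothing."""
    return 7.0  # placeholder

def pairwise_rm(query: str, response_a: str, response_b: str) -> int:
    """Returns the index (0 or 1) of the preferred response; cannot rate one response on its own."""
    return 0  # placeholder

@dataclass
class Critique:
    text: str
    score: float

def pointwise_grm(query: str, response: str) -> Critique:
    """Writes a text critique and derives a numeric score from it; repeated sampling
    yields different critiques and scores, which is what enables inference-time scaling."""
    text = "Mostly correct, but the edge cases are not addressed. Score: 6/10."
    return Critique(text=text, score=6.0)
```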
The researchers propose that “pointwise generative reward modeling” (GRM), in which the model generates text critiques and derives scores from them, can provide the flexibility and scalability required of a generalist RM.
The DeepSeek team ran preliminary experiments on models such as GPT-4o and Gemma-2-27B and found that “certain principles can guide reward generation within proper criteria for GRMs and improve reward quality.”
Training RMs to generate their own principles
Based on these findings, the researchers developed Self-Principled Critique Tuning (SPCT), which trains the GRM to dynamically generate principles and critiques based on the query and responses.
The researchers suggest that principles should be “part of reward generation instead of a preprocessing step.” This way, the GRM can generate principles on the fly, tailored to the task it is evaluating, and then produce critiques grounded in those principles.
“This shift enables [the] principles to be generated based on the input query and responses, adaptively aligning [the] reward generation process, and the quality and granularity of the principles and corresponding critiques can be further improved with post-training on the GRM,” the researchers wrote.
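For illustration only (the output format below is our own, not the paper's), a single GRM generation can contain the principles, the critique, and the per-response scores, which are then parsed out:

```python
import re

SAMPLE_GENERATION = """\
Principles:
1. Factual accuracy (weight 0.5)
2. Completeness (weight 0.3)
3. Clarity (weight 0.2)

Critique:
Response 1 is accurate but omits an edge case; Response 2 is clearer but
contains a factual error.

Scores: [7, 4]
"""

def parse_grm_output(generation: str) -> list[int]:
    """Extract the per-response scores that follow the generated principles and critique."""
    match = re.search(r"Scores:\s*\[([^\]]+)\]", generation)
    if match is None:
        raise ValueError("no scores found in GRM output")
    return [int(s.strip()) for s in match.group(1).split(",")]

print(parse_grm_output(SAMPLE_GENERATION))  # -> [7, 4]
```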

SPCT includes two major phases:
- Rejective fine-tuning: In this phase, the GRM is trained to generate principles and critiques for various input types in the correct format. The model produces principles, critiques, and rewards for the given queries and responses. A trajectory (a generation attempt) is accepted only if its predicted reward agrees with the ground truth (e.g., it correctly identifies the better response) and is rejected otherwise. This process is repeated, and the model is fine-tuned on the filtered examples to improve its principle and critique generation (a simplified sketch of this filtering step appears after this list).
- Rule-based RL: In this phase, the model is further refined through outcome-based reinforcement learning. The GRM generates principles and critiques for each query, and the reward signal is computed with simple accuracy rules (e.g., did it pick the known best response?). The model is then updated, encouraging the GRM to learn to generate effective principles and accurate critiques dynamically and at scale.
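Below is a simplified sketch of the rejective fine-tuning filter, under the assumption that a pointwise GRM rollout returns a score per candidate response; `sample_trajectory` is a hypothetical stand-in, and the accept/reject check mirrors the “predicted reward matches ground truth” rule described above.

```python
import random
from typing import List, Tuple

def sample_trajectory(query: str, responses: List[str]) -> Tuple[str, List[int]]:
    """Hypothetical stand-in for one GRM rollout: returns (generated text, pointwise scores)."""
    scores = [random.randint(1, 10) for _ in responses]
    return f"Principles... Critique... Scores: {scores}", scores

def rejective_fine_tuning_data(
    dataset: List[Tuple[str, List[str], int]],  # (query, candidate responses, index of best response)
    samples_per_example: int = 4,
) -> List[Tuple[str, str]]:
    """Keep only trajectories whose predicted best response matches the ground truth."""
    accepted = []
    for query, responses, best_idx in dataset:
        for _ in range(samples_per_example):
            text, scores = sample_trajectory(query, responses)
            if scores.index(max(scores)) == best_idx:
                accepted.append((query, text))  # accept: prediction agrees with ground truth
            # otherwise the trajectory is rejected and discarded
    # The GRM is then fine-tuned on the accepted examples. In the second phase
    # (rule-based RL), a similar match/mismatch check serves as an online reward
    # signal rather than a data filter.
    return accepted
```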
“By leveraging rule-based online RL, SPCT enables GRMs to learn to adaptively posit principles and critiques based on the input query and responses, leading to better outcome rewards in general domains,” the researchers write.
To tackle the inference-time scaling challenge (getting better results with more compute), the researchers run the GRM multiple times on the same input, generating different sets of principles and critiques. The final reward is determined by voting (aggregating the sampled scores). This lets the model consider a broader range of perspectives when given more resources, leading to potentially more accurate and nuanced final judgments.
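A minimal sketch of this sampling-and-voting step, assuming a hypothetical `grm_sample` callable that returns one set of pointwise scores per rollout:

```python
import random
from typing import Callable, List

def vote_over_samples(
    grm_sample: Callable[[], List[float]],  # one GRM rollout -> per-response scores
    k: int = 8,
) -> List[float]:
    """Aggregate k independent GRM samples by summing the per-response scores."""
    totals: List[float] = []
    for _ in range(k):
        scores = grm_sample()
        totals = scores if not totals else [t + s for t, s in zip(totals, scores)]
    return totals

# Toy usage: two candidate responses judged with some per-sample noise.
fake_grm = lambda: [random.gauss(7, 1), random.gauss(5, 1)]
print(vote_over_samples(fake_grm, k=8))  # larger k -> a more stable final reward
```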
However, some of the generated principles and critiques may be low-quality or biased due to model limitations or randomness. To address this, the researchers introduced a “meta RM”: a separate, lightweight scalar RM trained specifically to predict whether a principle/critique generated by the primary GRM is likely to lead to the correct final reward.
During inference, the meta RM evaluates the generated samples and filters out low-quality judgments before the final vote, further improving scaling performance.
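Continuing the sketch above, meta-RM-guided voting scores each sampled judgment and keeps only the top-rated samples before aggregating. In the paper the meta RM judges the generated critique itself; here it is reduced, purely for brevity, to a made-up function over scores, so all names and the heuristic are hypothetical.

```python
from typing import Callable, List

def meta_guided_vote(
    samples: List[List[float]],               # k GRM samples of per-response scores
    meta_rm: Callable[[List[float]], float],  # predicts how trustworthy a sample's judgment is
    keep: int = 4,
) -> List[float]:
    """Filter samples by meta-RM score, then aggregate the survivors by summing."""
    ranked = sorted(samples, key=meta_rm, reverse=True)[:keep]
    n_responses = len(ranked[0])
    return [sum(sample[i] for sample in ranked) for i in range(n_responses)]

# Toy usage with a made-up meta RM that trusts clearly separated scores more.
toy_samples = [[7.0, 4.0], [6.5, 5.9], [8.0, 3.5], [5.0, 5.0]]
toy_meta_rm = lambda s: abs(s[0] - s[1])
print(meta_guided_vote(toy_samples, toy_meta_rm, keep=2))  # -> [15.0, 7.5]
```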
Putting SPCT into practice with DeepSeek-GRM
The researchers applied SPCT to Gemma-2-27B, Google's open-weight model, creating DeepSeek-GRM-27B. They evaluated it against several strong baseline RMs (including LLM-as-a-Judge, scalar RMs, and semi-scalar RMs) and public models (such as GPT-4o and Nemotron-4-340B-Reward).
They found that DeepSeek-GRM-27B outperformed the baseline methods trained on the same data. SPCT significantly improved the quality and, crucially, the inference-time scalability of the rewards compared to standard fine-tuning.

Scaling at inference time by generating more samples significantly improved DeepSeek-GRM-27B's performance, allowing it to outperform much larger models such as Nemotron-4-340B-Reward and GPT-4o. The meta RM further improved scaling and achieved the best results by filtering out low-quality judgments.
“With larger-scale sampling, DeepSeek-GRM could judge more accurately upon principles with higher diversity, and output rewards with finer granularity,” the researchers write.
Interestingly, SPCT showed less bias across domains than scalar RMs, which often perform well on verifiable tasks but fall short elsewhere.
Implications for the enterprise
Developing more generalist and scalable reward models holds promise for enterprise AI applications. Potential beneficiaries of generalist RMs include creative tasks and applications where the model must adapt to dynamic environments, such as evolving customer preferences.
Despite the strong results, DeepSeek-GRM still lags behind specialized scalar RMs on purely verifiable tasks, where generating explicit reasoning can be less efficient than direct scoring. Efficiency also remains a challenge compared to non-generative RMs.
The DeepSeek team suggests that future work will focus on improving efficiency and deeper integration. As the researchers conclude, “future directions could include integrating GRMs into online RL pipelines as versatile interfaces of reward systems, exploring inference-time co-scaling with policy models, or serving as robust offline evaluators for foundation models.”