Reasoning models such as OpenAI o1 and DeepSeek-R1 have a problem: ask a simple question such as "What is 1+1?" and they will spend several seconds thinking before answering.
Ideally, like humans, AI models should be able to tell when to give a direct answer and when to spend extra time and resources reasoning before responding. A new technique presented by researchers at Meta AI and the University of Illinois Chicago trains models to allocate inference budgets based on the difficulty of the query. The result is faster responses, lower costs, and better use of compute resources.
Expensive inference
Large language models (LLMs) can improve their performance on reasoning problems by generating longer reasoning chains, often called "chain-of-thought" (CoT). The success of CoT has led to a whole range of inference-time scaling techniques that prompt the model to "think" longer about a problem, produce multiple answers, and pick the best one.
One of the main methods used in reasoning models is to generate several answers and select the one that recurs most often, also known as "majority voting" (MV). The problem with this approach is that the model behaves uniformly, treating every prompt as a hard reasoning problem and spending unnecessary resources generating multiple answers.
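To make the mechanics concrete, here is a minimal sketch of majority voting in Python. The `generate_answer` sampler is a hypothetical stand-in for a call to an LLM, not something defined in the paper:

```python
from collections import Counter
from typing import Callable

def majority_vote(generate_answer: Callable[[str], str], prompt: str,
                  n_samples: int = 8) -> str:
    """Classic majority voting (MV): always draw n_samples answers,
    regardless of how easy the prompt is, and return the most frequent one."""
    answers = [generate_answer(prompt) for _ in range(n_samples)]
    # The most common answer wins; ties are broken by insertion order.
    return Counter(answers).most_common(1)[0][0]
```

The same number of samples is spent whether the prompt is "1+1" or a competition-level math problem, which is exactly the inefficiency the new techniques target.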
Smart inference
The new paper proposes a series of training techniques that make reasoning models more efficient at responding. The first step is "sequential voting" (SV), in which the model stops the reasoning process as soon as an answer appears a certain number of times. For example, the model generates up to eight answers and selects the answer that comes up at least three times. Given the simple query above, the first three answers will probably match, triggering an early stop and saving time and compute (see the sketch below).
Their experiments show that SV outperforms classic MV on math competition problems when it generates the same number of answers. However, SV requires additional instructions and token generation, which puts it roughly on par with MV in terms of token-to-accuracy ratio.
The second technique, "adaptive sequential voting" (ASV), improves on SV by prompting the model to examine the problem and only generate multiple answers when the problem is difficult. For simple problems (such as the 1+1 prompt), the model generates a single answer without running the voting process at all. This makes the model more efficient at handling both simple and complex problems, as the sketch after this paragraph illustrates.
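A minimal sketch of that early-stopping behavior, reusing the hypothetical `generate_answer` sampler and the eight-sample, three-vote cutoffs from the example above:

```python
from collections import Counter
from typing import Callable

def sequential_vote(generate_answer: Callable[[str], str], prompt: str,
                    max_samples: int = 8, votes_needed: int = 3) -> str:
    """Sequential voting (SV): sample answers one at a time and stop as soon
    as any answer has been seen votes_needed times."""
    counts: Counter = Counter()
    for _ in range(max_samples):
        answer = generate_answer(prompt)
        counts[answer] += 1
        if counts[answer] >= votes_needed:
            return answer  # easy prompts exit after just a few samples
    # No answer reached the threshold; fall back to the most frequent one.
    return counts.most_common(1)[0][0]
```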
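Building on the `sequential_vote` sketch above, the adaptive step could look roughly like this. The `looks_difficult` check is a placeholder for the model's own judgment of problem difficulty, which in the paper comes from the model itself rather than a separate classifier:

```python
from typing import Callable

def adaptive_sequential_vote(generate_answer: Callable[[str], str],
                             looks_difficult: Callable[[str], bool],
                             prompt: str) -> str:
    """Adaptive sequential voting (ASV): only run the voting procedure
    when the prompt is judged to be a hard reasoning problem."""
    if not looks_difficult(prompt):
        # Easy query (e.g. "What is 1+1?"): answer directly, no voting.
        return generate_answer(prompt)
    # Hard query: fall back to sequential voting with early stopping
    # (sequential_vote as defined in the previous sketch).
    return sequential_vote(generate_answer, prompt)
```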
Reinforcement learning
While both SV and ASV improve the model's efficiency, they require a lot of hand-labeled data. To alleviate this problem, the researchers propose "inference budget-constrained policy optimization" (IBPO), a reinforcement learning algorithm that teaches the model to adjust the length of its reasoning traces based on the difficulty of the query.
IBPO is designed to optimize responses while keeping the LLM within an inference budget constraint. The RL algorithm enables the model to surpass the gains obtained from training on manually labeled data by continuously generating ASV traces, evaluating the responses, and selecting outcomes that provide the correct answer within the optimal inference budget.
Their experiments show that IBPO improves the Pareto front, meaning that for a fixed inference budget, a model trained with IBPO outperforms the other baselines.
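At a high level, this kind of budget-constrained objective can be written as follows. This is a simplified paraphrase for illustration, not the paper's exact formulation: π is the model's policy, queries x are drawn from a training distribution D, r(x, y) rewards a correct answer y, c(y) counts the tokens in the reasoning trace, and B is the inference budget:

```latex
\max_{\pi} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
\quad \text{subject to} \quad
\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[ c(y) \big] \le B
```

In words: the policy is pushed toward correct answers while the constraint keeps the expected length of its reasoning traces within budget.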
The findings come against the backdrop of researchers warning that current AI models are hitting a wall. Companies are struggling to find high-quality training data and are exploring alternative methods to improve their models.
One promising solution is reinforcement learning, in which the model is given an objective and allowed to find its own solutions, as opposed to supervised fine-tuning (SFT), where the model is trained on manually labeled examples.
Surprisingly, models often find solutions that humans haven't thought of. This is a formula that seems to have worked well for DeepSeek-R1, which has challenged the dominance of U.S.-based AI labs.
The researchers note that "prompting-based and SFT-based methods struggle with both absolute improvement and efficiency, supporting the conjecture that SFT alone does not enable self-correction capability. This observation is also partially supported by concurrent work, which suggests that such self-correction behavior emerges automatically during RL, rather than through prompting or SFT."