Researchers at the University of Illinois Urbana-Champaign and the University of Virginia have developed a new model architecture that could lead to more robust AI systems with more powerful reasoning capabilities.
Called an energy-based transformer (EBT), the architecture shows a natural ability to use inference-time scaling to solve complex problems. For the enterprise, this could translate into cost-effective AI applications that can generalize to novel situations without the need for specialized fine-tuned models.
The challenge of System 2 thinking
In psychology, human thought is often divided into two modes: System 1, which is fast and intuitive, and System 2, which is slow, deliberate and analytical. Current large language models (LLMs) excel at System 1-style tasks, but the AI industry is increasingly focused on enabling System 2 thinking to tackle more complex reasoning challenges.
Reasoning models use various inference-time scaling techniques to improve their performance on difficult problems. One popular method is reinforcement learning (RL), used in models like DeepSeek-R1 and OpenAI’s “o-series” models, where the model is rewarded for producing reasoning tokens that lead to the correct answer. Another approach, often called best-of-n, involves generating multiple potential answers and using a verification mechanism to select the best one.
However, these methods have significant drawbacks. They are often limited to a narrow range of easily verifiable problems, like math and coding, and can degrade performance on other tasks such as creative writing. Furthermore, recent evidence suggests that RL-based approaches might not be teaching models new reasoning skills, instead just making them more likely to use successful reasoning patterns they already know. This limits their ability to solve problems that require true exploration and are beyond their training regime.
Energy-based models (EBMs)
The new architecture takes a different approach, building on a class of models known as energy-based models (EBMs). The core idea is simple: Instead of directly generating an answer, the model learns an “energy function” that acts as a verifier. This function takes an input (like a prompt) and a candidate prediction and assigns a value, or “energy,” to it. A low energy score indicates high compatibility, meaning the prediction is a good fit for the input, while a high energy score signifies a poor match.
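To make the idea concrete, here is a minimal, hypothetical sketch in PyTorch. The class name ToyEnergyFunction and its tiny architecture are illustrative assumptions, not the paper’s design; the point is only the interface: a network scores an (input, candidate) pair, and once trained, better-matching pairs get lower energy.

```python
import torch
import torch.nn as nn

class ToyEnergyFunction(nn.Module):
    """Hypothetical energy function: maps an (input, candidate) pair to a
    single scalar 'energy'. Lower energy = the candidate fits the input
    better (after training). Purely illustrative, not the paper's model."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, 1)
        )

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # Concatenate input and candidate, return one energy per pair.
        return self.score(torch.cat([x, y], dim=-1)).squeeze(-1)

energy_fn = ToyEnergyFunction(dim=16)
prompt = torch.randn(1, 16)       # the "input" / context
candidate = torch.randn(1, 16)    # a candidate prediction
print(energy_fn(prompt, candidate))  # scalar compatibility score
```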
Applying this to AI reasoning, the researchers propose in a paper that developers should view “thinking as an optimization procedure with respect to a learned verifier, which evaluates the compatibility (unnormalized probability) between an input and candidate prediction.” The process begins with a random prediction, which is then progressively refined by minimizing its energy score and exploring the space of possible solutions until it converges on a highly compatible answer. This approach is built on the principle that verifying a solution is often much easier than generating one from scratch.
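In code, that optimization procedure might look roughly like the following sketch, which reuses the hypothetical energy function above. The `think` helper, its step count and step size are assumptions for illustration, not the paper’s implementation; the essential move is gradient descent on the candidate itself.

```python
def think(energy_fn, x, steps: int = 20, step_size: float = 0.1):
    """Sketch of 'thinking as optimization' (illustrative): start from a
    random prediction and refine it by gradient descent on its energy."""
    y = torch.randn_like(x, requires_grad=True)   # random initial prediction
    for _ in range(steps):
        energy = energy_fn(x, y).sum()            # how poorly y fits x
        grad, = torch.autograd.grad(energy, y)    # direction of rising energy
        y = (y - step_size * grad).detach().requires_grad_(True)  # descend
    y = y.detach()
    return y, energy_fn(x, y).item()              # refined prediction + score
```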
This “verifier-centric” design addresses three key challenges in AI reasoning. First, it allows for dynamic compute allocation, meaning models can “think” for longer on harder problems and for less time on easier ones. Second, EBMs can naturally handle the uncertainty of real-world problems where there isn’t one clear answer. Third, they act as their own verifiers, eliminating the need for external models.
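The first point, dynamic compute allocation, falls out naturally from this framing: the refinement loop can simply stop once the energy stops improving, so easier inputs consume fewer steps. A hedged variant of the loop above (the `think_adaptively` name and tolerance are illustrative):

```python
def think_adaptively(energy_fn, x, max_steps: int = 100,
                     step_size: float = 0.1, tol: float = 1e-4):
    """Dynamic compute allocation (sketch): keep refining only while the
    energy is still dropping, so easy inputs take fewer steps."""
    y = torch.randn_like(x, requires_grad=True)
    prev_energy = float("inf")
    steps_used = 0
    for steps_used in range(1, max_steps + 1):
        energy = energy_fn(x, y).sum()
        if prev_energy - energy.item() < tol:     # converged: stop "thinking"
            break
        prev_energy = energy.item()
        grad, = torch.autograd.grad(energy, y)
        y = (y - step_size * grad).detach().requires_grad_(True)
    return y.detach(), steps_used                 # steps_used ~ compute spent
```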
Unlike other systems that use separate generators and verifiers, EBMs combine both into a single, unified model. A key advantage of this arrangement is better generalization. Because verifying a solution on new, out-of-distribution (OOD) data is often easier than generating a correct answer, EBMs can better handle unfamiliar scenarios.
Despite their promise, EBMs have historically struggled with scalability. To solve this, the researchers introduce EBTs, which are specialized transformer models designed for this paradigm. EBTs are trained to first verify the compatibility between a context and a prediction, then refine predictions until they find the lowest-energy (most compatible) output. This procedure effectively simulates a thinking process for every prediction. The researchers developed two EBT variants: a decoder-only model inspired by the GPT architecture, and a bidirectional model similar to BERT.
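As a loose illustration of how a transformer can play the verifier role, the hypothetical `MiniEBT` below appends a candidate embedding to the context sequence and outputs one energy score. This is a sketch under stated assumptions, not the paper’s architecture; the paper’s two variants differ in their attention patterns, as noted in the comments.

```python
class MiniEBT(nn.Module):
    """Loose sketch of an energy-based transformer: a transformer scores
    how compatible a candidate is with a context sequence. Illustrative
    only; the paper's actual EBT variants differ in detail."""

    def __init__(self, dim: int = 64, heads: int = 4, layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.energy_head = nn.Linear(dim, 1)

    def forward(self, context: torch.Tensor, candidate: torch.Tensor):
        # context: (batch, seq, dim); candidate: (batch, dim).
        # Append the candidate as an extra token and encode jointly.
        # A decoder-only (GPT-style) variant would pass a causal mask to
        # the encoder; this version attends bidirectionally, like BERT.
        tokens = torch.cat([context, candidate.unsqueeze(1)], dim=1)
        h = self.encoder(tokens)
        return self.energy_head(h[:, -1]).squeeze(-1)  # scalar energy
```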

The architecture of EBTs makes them flexible and compatible with various inference-time scaling techniques. “EBTs can generate longer CoTs, self-verify, do best-of-N [or] you can sample from many EBTs,” Alexi Gladstone, a PhD student in computer science at the University of Illinois Urbana-Champaign and lead author of the paper, told VentureBeat. “The best part is, all of these capabilities are learned during pretraining.”
EBTs in action
The researchers compared EBTs against established architectures: the popular transformer++ recipe for text generation (discrete modalities) and the diffusion transformer (DiT) for tasks like video prediction and image denoising (continuous modalities). They evaluated the models on two main criteria: “Learning scalability,” or how efficiently they train, and “thinking scalability,” which measures how performance improves with more computation at inference time.
During pretraining, EBTs demonstrated superior efficiency, achieving up to a 35% higher scaling rate than Transformer++ across data, batch size, parameters and compute. This means EBTs can be trained faster and more cheaply.
At inference, EBTs also outperformed existing models on reasoning tasks. By “thinking longer” (using more optimization steps) and performing “self-verification” (generating multiple candidates and choosing the one with the lowest energy), EBTs improved language modeling performance by 29% more than Transformer++. “This aligns with our claims that because traditional feed-forward transformers cannot dynamically allocate additional computation for each prediction being made, they are unable to improve performance for each token by thinking for longer,” the researchers write.
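In spirit, that self-verification step amounts to best-of-N selection over independently refined candidates, as in this short sketch built on the hypothetical `think` helper above:

```python
def self_verify(energy_fn, x, n: int = 8, steps: int = 20):
    """Best-of-N self-verification (sketch): refine N candidates
    independently, then keep the one the model itself scores lowest."""
    candidates = [think(energy_fn, x, steps=steps) for _ in range(n)]
    return min(candidates, key=lambda pair: pair[1])  # (prediction, energy)
```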
For image denoising, EBTs achieved better results than DiTs while using 99% fewer forward passes.
Crucially, the study found that EBTs generalize better than the other architectures. Even with the same or worse pretraining performance, EBTs outperformed existing models on downstream tasks. The performance gains from System 2 thinking were most substantial on data that was further out-of-distribution (different from the training data), suggesting that EBTs are particularly robust when faced with novel and challenging tasks.
The researchers suggest that “the benefits of EBTs’ thinking are not uniform across all data but scale positively with the magnitude of distributional shifts, highlighting thinking as a critical mechanism for robust generalization beyond training distributions.”
The benefits of EBTs are important for two reasons. First, they suggest that at the massive scale of today’s foundation models, EBTs could significantly outperform the classic transformer architecture used in LLMs. The authors note that “at the scale of modern foundation models trained on 1,000X more data with models 1,000X larger, we expect the pretraining performance of EBTs to be significantly better than that of the Transformer++ recipe.”
Second, EBTs show much better data efficiency. This is a critical advantage in an era where high-quality training data is becoming a major bottleneck for scaling AI. “As data has become one of the major limiting factors in further scaling, this makes EBTs especially appealing,” the paper concludes.
Despite their different inference mechanism, EBTs are highly compatible with the transformer architecture, making it possible to use them as a drop-in replacement for current LLMs.
“EBTs are very compatible with current hardware/inference frameworks,” Gladstone said, including speculative decoding using feed-forward models on both GPUs and TPUs. He said he is also confident they can run on specialized accelerators such as LPUs, work with optimized attention kernels such as FlashAttention-3, and be deployed through common inference frameworks like vLLM.
For developers and enterprises, the strong reasoning and generalization capabilities of EBTs could make them a powerful and reliable foundation for building the next generation of AI applications. “Thinking longer can broadly help on almost all enterprise applications, but I think the most exciting will be those requiring more important decisions, safety or applications with limited data,” Gladstone said.