Breaking the Algorithmic Ceiling in Pre-Training with Inductive Moment Matching

Authors: Jiaming Song, Linqi Zhou

March 11, 2025

Generated samples on ImageNet-256×256 using 8 steps.

Fully unlocking the potential of rich multi-modal data

There is a growing sentiment in the AI community that generative pre-training is reaching a limit. We argue, however, that these limits stem not from a lack of data but from a stagnation in algorithmic innovation. Since around mid-2020, the field has been dominated by just two paradigms: autoregressive models for discrete signals and diffusion models for continuous signals. This stagnation has created a bottleneck that prevents us from fully unlocking the potential of rich multi-modal data, which in turn limits progress on multimodal intelligence.

At Luma, we aim to overcome this algorithmic ceiling through the lens of efficient inference-time compute scaling. Today we are introducing Inductive Moment Matching (IMM), a new pre-training technique that not only delivers superior sample quality compared to diffusion models but also offers more than a tenfold increase in sampling efficiency. In contrast to consistency models (CMs), which are unstable as a pre-training technique and require special hyperparameter designs, IMM uses a single objective that remains stable across diverse settings.

Sample quality during training for IMM vs. CM. CM is highly unstable and collapses easily.

To facilitate future research, we release the code and checkpoints (https://github.com/lumalabs/imm), a technical paper detailing IMM (https://arxiv.org/abs/2503.07565), and a position paper on advancing generative pre-training algorithms from an efficient inference-time scaling perspective (https://arxiv.org/abs/2503.07154).

How Inductive Moment Matching Works

Inference can generally be scaled along two dimensions: extending the sequence length (in autoregressive models) and increasing the number of refinement steps (in diffusion models). While adding refinement steps significantly boosts diffusion models, simply increasing model capacity does not yield proportional improvements, because diffusion models inherently require fine-grained steps to converge to an optimal solution, regardless of the network's representational power. From an inference-time perspective, diffusion models therefore do not make optimal use of the network's capacity.

Diffusion model performance grows slowly with the number of steps, regardless of model size. In contrast, IMM scales much more efficiently.

We illustrate this limitation from an inference perspective by examining the DDIM sampler for diffusion models. In each DDIM iteration, the network first produces a prediction from the current input and timestep, and the sampler then moves to the next timestep via a linear combination involving that prediction. This constrains the expressive capacity of each iteration, since the update is linear with respect to the next timestep, ultimately capping performance regardless of the training method employed (see figure below).
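To make the structure of that update concrete, here is a minimal sketch of one deterministic DDIM step (eta = 0); the eps_model and alpha_bar names are illustrative assumptions, not the interface of our released code.

```python
def ddim_step(eps_model, x_t, t, t_next, alpha_bar):
    """One deterministic DDIM iteration (eta = 0); names are illustrative."""
    eps = eps_model(x_t, t)                        # network prediction at the current input and timestep
    a_t, a_next = alpha_bar(t), alpha_bar(t_next)  # cumulative signal coefficients of the noise schedule
    x0 = (x_t - (1 - a_t) ** 0.5 * eps) / a_t ** 0.5   # implied clean-sample estimate
    # The next input is a linear combination of x0 and eps evaluated at the next
    # timestep; each iteration is therefore linear in the network's prediction.
    return a_next ** 0.5 * x0 + (1 - a_next) ** 0.5 * eps
```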

The original DDIM sampler, which uses the diffusion model network, is not flexible with respect to the target timestep during inference.
Adding the target timestep as a network input makes the sampler flexible enough for few-step generation.

We design our new pre-training algorithm by first aiming to mitigate this inference limitation. IMM introduces a subtle yet powerful modification: alongside the current timestep, the network also processes the target timestep it should jump to. This change makes each inference iteration far more flexible, paving the way for state-of-the-art performance and efficiency. We realize this improvement by incorporating maximum mean discrepancy (MMD), a robust moment matching technique developed more than 15 years ago.
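As a rough illustration of these two ingredients, the sketch below shows an empirical maximum mean discrepancy with an RBF kernel and a few-step sampler whose network conditions on both the current and the target timestep. All names are hypothetical, and this is not the training objective from the paper, only the shape of the two components.

```python
import torch

def mmd2_rbf(x, y, bandwidth=1.0):
    """Empirical squared maximum mean discrepancy between two sample batches
    (biased V-statistic estimate) using an RBF kernel."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * bandwidth ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def few_step_sample(model, x_T, timesteps):
    """Hypothetical few-step sampler: the network receives both the current
    timestep t and the target timestep s it should jump toward."""
    x = x_T
    for t, s in zip(timesteps[:-1], timesteps[1:]):
        x = model(x, t, s)   # jump from time t directly toward time s
    return x
```

The full training objective in the paper is built around this kind of distribution-level comparison between model outputs; the snippet only indicates how the pieces fit together.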

Scaling and Stability

We test IMM across a range of hyperparameters and model architectures. On ImageNet-256×256, IMM achieves a Fréchet Inception Distance (FID) of 1.99, surpassing diffusion models (2.27 FID) and Flow Matching (2.15 FID) while using 30× fewer sampling steps. It similarly achieves a state-of-the-art 2-step FID of 1.98 on the standard CIFAR-10 dataset for a model trained from scratch.

Sample quality versus sampling compute. IMM dominates the optimal frontier for methods with inference-time scaling.

IMM scales with training compute, inference compute, and model size. The figure below shows FID versus training and inference compute, and we find a strong correlation between compute used and performance.

Scaling compute improves generative quality.

Unlike consistency models, which have been shown to exhibit unstable training dynamics, IMM trains stably across a wide range of hyperparameters and architectures.

IMM trains more stably than consistency models.

What’s next

Notably, IMM does not rely on denoising score matching or on the score-based stochastic differential equations that form the foundations of diffusion models. The key driver of our performance gains is not only moment matching itself but also our shift toward an inference-first perspective. This perspective not only reveals the inherent limitations of current pre-training paradigms but also empowers us to develop new algorithms that break through them.

We believe that this is just the beginning of a paradigm shift towards multi-modal foundation models that transcend current boundaries and fully unlock creative intelligence.

If you are interested in this mission, join us.

References

Linqi Zhou, Stefano Ermon, Jiaming Song. “Inductive Moment Matching.” arXiv, 2025.
Jiaming Song, Linqi Zhou. “Ideas in Inference-time Scaling can Benefit Generative Pre-training Algorithms.” arXiv, 2025.
Song et al. “Denoising Diffusion Implicit Models.” ICLR 2021.
Song et al. “Consistency Models.” ICML 2023.
Lipman et al. “Flow Matching for Generative Modeling.” ICLR 2023.
Gretton et al. “A Kernel Method for the Two-Sample Problem.” NeurIPS 2006.
Song et al. “Score-Based Generative Modeling through Stochastic Differential Equations.” ICLR 2021.
Kim et al. “Consistency Trajectory Models: Learning Probability Flow ODE Trajectory of Diffusion.” ICLR 2024.
Vincent. “A Connection Between Score Matching and Denoising Autoencoders.” Neural Computation, 2011.
Geng et al. “Consistency Models Made Easy.” ICLR 2025.

Luma AI