UNI-1
In the past 3 years, our journey at Luma has taken us from scene reconstruction, to 3D generation, to scaling video diffusion across the world’s visual content. Each step was a climb — more data, more compute, and higher ambitions. But scaling visual media alone has a fundamental ceiling: generation without understanding can only go so far.
Uni-1, Luma’s first unified understanding and generation model, is our next step on the path towards multimodal general intelligence.

Motivation
General intelligence requires the ability to reason and to imagine, to manipulate symbols and to simulate worlds. In humans, these capabilities are often described as left- and right-brain functions, spanning from language and logic to spatial awareness and creativity. Today's AI systems, from LLMs to image generators to world simulators, have mastered some of these capabilities in isolation.
But the brain does not operate as isolated halves. Language, perception, and imagination are deeply intertwined, connected by dense neural pathways that allow thought and imagery to co-evolve.
With the Uni-fied family of models, we take a brain-inspired approach. We grow a mind's eye from a logical brain: a system that reasons, imagines, plans, iterates, and executes across both digital and physical domains. Our models jointly model time, space, and logic in a single architecture, enabling forms of problem-solving that fragmented pipelines cannot achieve. With Uni-1, we're just getting started.
We believe unified intelligence is the path to general intelligence.
Architecture
At its core, our model is a decoder-only autoregressive transformer. Text and images are represented in a single interleaved sequence that serves as both input and output.
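To make the interleaving concrete, here is a minimal toy sketch of packing text and image tokens into one stream. The vocabulary, the `<boi>`/`<eoi>` markers, and the id offsets are all hypothetical stand-ins; the post does not describe Luma's actual tokenizer or image codec.

```python
# Toy interleaved text/image token stream. All ids and special tokens
# below are illustrative assumptions, not Uni-1's real vocabulary.

TEXT_VOCAB = {"a": 0, "cat": 1, "on": 2, "mat": 3}
BOI, EOI = 100, 101          # hypothetical begin/end-of-image markers
IMAGE_VOCAB_OFFSET = 200     # image codebook ids live in their own range

def interleave(segments):
    """Flatten alternating text/image segments into one token sequence.

    segments: list of ("text", [words]) or ("image", [codebook ids]).
    """
    seq = []
    for kind, payload in segments:
        if kind == "text":
            seq.extend(TEXT_VOCAB[w] for w in payload)
        else:  # image segment, wrapped in boundary markers
            seq.append(BOI)
            seq.extend(IMAGE_VOCAB_OFFSET + i for i in payload)
            seq.append(EOI)
    return seq

tokens = interleave([
    ("text", ["a", "cat"]),
    ("image", [5, 9, 3]),
    ("text", ["on", "mat"]),
])
print(tokens)  # [0, 1, 100, 205, 209, 203, 101, 2, 3]
```

A decoder-only transformer trained on such sequences predicts the next token regardless of modality, which is what lets a single model act as both input reader and output generator.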
To Reason: Thinking Improves Visual Generation
Uni-1 can perform structured internal reasoning before and during image synthesis. It decomposes instructions, resolves constraints, and plans composition, then renders accordingly.
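The reason-then-render pattern described above can be sketched as a tiny planning step that runs before any pixels are produced. The helper functions and the ordering heuristic below are illustrative assumptions, not Luma's actual pipeline or API.

```python
# Hypothetical sketch of reasoning before synthesis: decompose the
# instruction, resolve ordering constraints, then hand a plan to the
# renderer. Nothing here is Uni-1's real implementation.

def decompose(instruction):
    """Split a compound instruction into atomic edit steps."""
    return [s.strip() for s in instruction.split(",") if s.strip()]

def resolve_constraints(steps):
    """Order steps so layout decisions precede appearance decisions."""
    layout = [s for s in steps if "place" in s or "move" in s]
    style = [s for s in steps if s not in layout]
    return layout + style

def plan(instruction):
    return resolve_constraints(decompose(instruction))

print(plan("recolor the sky, place a lamp on the desk"))
# ['place a lamp on the desk', 'recolor the sky']
```

The point of the sketch is the control flow: composition is planned first, and rendering only happens once the constraints are resolved.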

We evaluate Uni-1 on RISEBench [1] (RISEBench: Evaluating Reasoning-Informed Visual Editing, arXiv:2504.02826), a benchmark specifically designed for Reasoning-Informed Visual Editing (RISE). RISEBench assesses four core reasoning capabilities: Temporal, Causal, Spatial, and Logical. Uni-1 achieves state-of-the-art results on this benchmark, demonstrating the model's capacity to decompose complex editing instructions, maintain scene coherence, and generate visually plausible outputs grounded in real-world reasoning. The Capability Showcase presented below illustrates these strengths directly.
To Imagine: Visual Generation Improves Understanding
Uni-1 shows that learning to generate images materially improves fine-grained visual understanding: reasoning over regions, objects, and layouts.
This leads to strong grounding and dense visual capabilities while preserving full generative flexibility within a single unified model.

We evaluate on ODinW-13, following the protocols of prior work [2] (Qwen3-VL Technical Report, arXiv:2511.21631) and [3] (Qwen 3.5: Towards Native Multimodal Agents, qwen.ai).
ODinW (Object Detection in the Wild) measures open-vocabulary dense detection, testing fine-grained visual reasoning. We use this benchmark to show how generation improves understanding in our unified model, and how it compares against prior state-of-the-art understanding-focused models.
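Detection benchmarks of this kind score each predicted box by its overlap with a ground-truth box. As background, here is a minimal intersection-over-union (IoU) computation; the `(x1, y1, x2, y2)` box convention and the 0.5 match threshold are common defaults, not details taken from the post.

```python
# Minimal IoU for axis-aligned boxes in (x1, y1, x2, y2) form.
# The 0.5 threshold shown below is the conventional match criterion,
# assumed here rather than stated in the post.

def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

pred, gt = (0, 0, 10, 10), (5, 0, 15, 10)
print(round(iou(pred, gt), 3), iou(pred, gt) >= 0.5)  # 0.333 False
```

Benchmark scores such as ODinW's are aggregates (mean average precision) built on top of this per-box matching.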
Intelligent
Common-sense scene completion, spatial reasoning, and plausibility-driven transformation.
Maintains consistency across time while evolving scenes through coherent motion and event progression.
Examples
Directable
Reference-guided generation with source-grounded controls.
Uses one or more references to preserve identity, composition, and key visual constraints in the output.

Studio Group Identity Swap

Medieval Feast

Autumn Bridge Couple

Posing as Alien

Cartoon to Realistic

Mona Liu

Cat and Dog Scientists

A Joyful Scene
Cultured
Culture-aware visual generation across aesthetics, memes, and manga.
Adopts distinct artistic languages while preserving subject identity and composition across style variants.
To Infinity and Beyond
This unified design naturally extends beyond static images to video, voice agents, and fully interactive world simulators.
With Uni-1, we are laying the foundation for a system that can see, speak, reason, and imagine in one continuous stream.
If you want to help build that future, join us.