UNI-1

In the past 3 years, our journey at Luma has taken us from scene reconstruction, to 3D generation, to scaling video diffusion across the world’s visual content. Each step was a climb — more data, more compute, and higher ambitions. But scaling visual media alone has a fundamental ceiling: generation without understanding can only go so far.

Uni-1, Luma’s first unified understanding and generation model, is our next step on the path towards multimodal general intelligence.

[Figure: Uni-1 hero visual]

Motivation

General intelligence requires the ability to reason and to imagine, to manipulate symbols and to simulate worlds. In humans, these capabilities are often described as left- and right-brain functions, spanning from language and logic to spatial awareness and creativity. Today's AI systems, from LLMs to image generators to world simulators, have mastered some of these capabilities in isolation.

But the brain does not operate as isolated halves. Language, perception, and imagination are deeply intertwined, connected by dense neural pathways that allow thought and imagery to co-evolve.

With the Uni-fied family of models, we take an approach inspired by that integration. We grow a mind's eye from a logical brain: a system that reasons, imagines, plans, iterates, and executes across both digital and physical domains. Our models represent time, space, and logic jointly in a single architecture, enabling forms of problem-solving that fractured pipelines cannot achieve. With Uni-1, we're just getting started.

We believe unified intelligence is the path to general intelligence.

Architecture

At its core, our model is a decoder-only autoregressive transformer. Text and images are represented in a single interleaved sequence, acting both as input and as output.
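
To make the idea concrete, here is a rough sketch of a decoder-only transformer over one such interleaved stream. Everything in it, the vocabulary sizes, the dimensions, and the folding of discrete image tokens into the shared embedding table, is an illustrative assumption rather than Uni-1's actual configuration; the point is that a single next-token objective covers both modalities.

```python
import torch
import torch.nn as nn

# Illustrative sketch: one causal sequence of mixed text and image tokens.
# Vocabulary sizes and dimensions are assumptions, not Uni-1's configuration.
TEXT_VOCAB = 32_000        # ordinary text tokens
IMAGE_VOCAB = 8_192        # discrete image-codebook tokens (e.g. from a VQ tokenizer)
VOCAB = TEXT_VOCAB + IMAGE_VOCAB   # one shared vocabulary for both modalities

class InterleavedDecoder(nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_layers=6, max_len=2048):
        super().__init__()
        self.tok = nn.Embedding(VOCAB, d_model)   # text and image ids share one table
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, batch_first=True, norm_first=True
        )
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB)     # predicts text or image tokens alike

    def forward(self, ids):
        # ids: (batch, seq) of mixed text/image token ids, causally masked
        seq = ids.shape[1]
        x = self.tok(ids) + self.pos(torch.arange(seq, device=ids.device))
        mask = nn.Transformer.generate_square_subsequent_mask(seq).to(ids.device)
        x = self.blocks(x, mask=mask)
        return self.head(x)                       # next-token logits over the joint vocabulary

# A prompt followed by image tokens: the same forward pass scores both.
ids = torch.randint(0, VOCAB, (1, 64))
print(InterleavedDecoder()(ids).shape)            # torch.Size([1, 64, 40192])
```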

To Reason: Thinking Improves Visual Generation

Uni-1 can perform structured internal reasoning before and during image synthesis. It decomposes instructions, resolves constraints, and plans composition, then renders accordingly.
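
As a sketch of how this can play out at decoding time (the special tokens and two-phase loop below are assumptions for exposition, not Uni-1's actual sampler): the model first emits ordinary text tokens as its plan, then switches to image tokens once it produces a start-of-image marker.

```python
import random

BOI, EOI = 50_000, 50_001   # hypothetical special tokens bracketing an image span

def generate(sample_next, prompt_ids, max_plan=256, image_len=16):
    """Two-phase decoding sketch: text planning, then image rendering.
    `sample_next(ids)` stands in for the model's next-token sampler."""
    ids = list(prompt_ids)
    # Phase 1: the model "thinks" in text, decomposing the instruction,
    # resolving constraints, and planning composition, until it emits BOI.
    for _ in range(max_plan):
        nxt = sample_next(ids)
        ids.append(nxt)
        if nxt == BOI:
            break
    # Phase 2: render the plan as a fixed-length grid of discrete image tokens.
    for _ in range(image_len):
        ids.append(sample_next(ids))
    ids.append(EOI)
    return ids

# Toy stand-in sampler so the sketch runs end to end.
print(len(generate(lambda ids: random.choice([42, 7, BOI]), [1, 2, 3])))
```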

[Figure: RISEBench results chart]

We evaluate Uni-1 on RISEBench [1] (RISEBench: Evaluating Reasoning-Informed Visual Editing, arXiv:2504.02826), a benchmark specifically designed for Reasoning-Informed Visual Editing (RISE). RISEBench assesses four core reasoning capabilities: temporal, causal, spatial, and logical. Uni-1 achieves state-of-the-art results on this benchmark, demonstrating the model's capacity to decompose complex editing instructions, maintain scene coherence, and generate visually plausible outputs grounded in real-world reasoning. The Capability Showcase below illustrates these strengths directly.

To Imagine: Visual Generation Improves Understanding

Uni-1 shows that learning to generate images materially improves fine-grained visual understanding: reasoning over regions, objects, and layouts.

This leads to strong grounding and dense visual capabilities while preserving full generative flexibility within a single unified model.

[Figure: ODinW-13 results chart]

We evaluate on ODinW-13, following the protocols of prior work [2] (Qwen3-VL Technical Report, arXiv:2511.21631) and [3] (Qwen 3.5: Towards Native Multimodal Agents, qwen.ai).

ODinW (Object Detection in the Wild) measures open-vocabulary dense detection, testing fine-grained visual reasoning. We use this benchmark to show how generation improves understanding in our unified model, and how it compares against prior state-of-the-art understanding-focused models.
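
One way to see how dense detection fits a single autoregressive interface: a unified model can emit boxes as plain text that an evaluation harness then parses. The coordinate-as-text convention below is a common pattern for unified models and an assumption here, not Uni-1's documented output format.

```python
import re

# Hypothetical text-formatted detection answer from a unified model.
response = "safety helmet: (112, 40, 198, 131); safety helmet: (305, 52, 377, 140)"

def parse_boxes(text):
    """Recover (x1, y1, x2, y2) pixel boxes from a text-formatted answer."""
    pattern = r"\((\d+),\s*(\d+),\s*(\d+),\s*(\d+)\)"
    return [tuple(map(int, m)) for m in re.findall(pattern, text)]

print(parse_boxes(response))  # [(112, 40, 198, 131), (305, 52, 377, 140)]
```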

Capability Showcase

Intelligent

Common-sense scene completion, spatial reasoning, and plausibility-driven transformation.

Maintains consistency across time while evolving scenes through coherent motion and event progression.

Examples

User

Generate a sequence of images that are each from a different shot, based on the following storyboard description: A fixed, unchanging camera frames the same upright piano in a quiet room as a young boy begins learning to play, his movements tentative; without any shift in angle, time flows forward as he grows into a confident teenager, then a passionate young man, then a gentle parent playing for a child at his side, then a reflective middle-aged figure pausing between phrases, and finally an elderly man whose slow, deliberate notes carry the weight of a lifetime, the worn piano and aging room silently marking the passage of years. The camera never changes angle throughout each frame; only the person's physical appearance, the background, and the characters around the person change. Generate the first frame of the sequence. A young boy sits playing a piano, carefully pressing each key as sunlight spills across the room. His mother is next to him, watching him play. The camera is facing the boy, as if mounted on the piano.

Keyframe 1: Child

Here's the opening frame. A young boy with short, dark hair and a light blue collared shirt is seated at a white upright piano in a dimly lit room. He is focused on playing the piano, his hands positioned over the keys. To his right, a woman with long dark hair, wearing a white blouse and a patterned skirt, sits beside him, watching him play with a gentle expression. The piano has sheet music on its stand. The background features a textured, off-white wall with visible stains or peeling paint, suggesting an aged or rustic interior. A small, dark-framed picture hangs on the wall behind them. The lighting is soft and natural, coming from the left side of the frame, casting subtle shadows. The camera angle is static, positioned slightly below eye level, looking up at the subjects, giving a sense of intimacy and focus on the interaction between the boy and the woman.

Directable

Reference-guided generation with source-grounded controls.

Uses one or more references to preserve identity, composition, and key visual constraints in the output.
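
In an interleaved autoregressive model, one natural way to support this, sketched below under assumed tokenizer and special-token conventions rather than Uni-1's actual interface, is to place the tokenized reference images directly in the prompt, so the decoder attends to their identity and composition while rendering the output.

```python
BEGIN_IMAGE = 50_000   # hypothetical special token that starts the output image

def build_prompt(encode_image, encode_text, instruction, references):
    """Assemble one interleaved conditioning sequence from references plus text."""
    ids = []
    for ref in references:
        ids += encode_image(ref)       # discrete tokens for each reference image
    ids += encode_text(instruction)    # the editing / generation request
    ids.append(BEGIN_IMAGE)            # the decoder renders the output after this
    return ids

# Toy encoders so the sketch runs; real tokenizers would replace these.
demo = build_prompt(lambda img: [9, 9, 9], lambda s: [4, 2],
                    "swap the background, keep the subject", ["ref.png"])
print(demo)   # [9, 9, 9, 4, 2, 50000]
```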

Cultured

Culture-aware visual generation across aesthetics, memes, and manga.

Adopts distinct artistic languages while preserving subject identity and composition across style variants.


To Infinity and Beyond

This unified design naturally extends beyond static images to video, voice agents, and fully interactive world simulators.

With Uni-1, we are laying the foundation for a system that can see, speak, reason, and imagine in one continuous stream.

If you want to help build that future, join us.