Media Industry & The Rise of World Models

Dr. Pirita Pyykkönen-Klauck
CEO & Co-Founder
Woman looking out window at mountains - Created with AI

As I write this in March 2026, the term "World Model" has definitively established itself on social media. Anyone who has been on LinkedIn recently has surely noticed the hype surrounding World Models. Or perhaps it's more than just hype. Perhaps it's one of the most important developments we can expect in the near future.

The recent news of AMI (Advanced Machine Intelligence), a Paris-based startup led by Yann LeCun, which raised €890 million in funding, caused a major stir in the European AI community. The company stated that it focuses on developing new World Models based on the physical world and our experiences within it.

As a former academic researcher, this approach is very close to my heart. It reminds me of my own time in university labs, where I ran experiments to understand how humans navigate a multimodal world. We don't just hear and see; we infer meaning. By building mental models, we can manage contradictions and ambiguities, establish causal relationships (regardless of whether they are explicitly signaled or not), and predict future states of the world without having to predict the exact next word (as language models do) or a pixel combination (as current image processing models do). World Models are not a new concept in science, but the AI industry hopes to make them available for professional applications.

I welcome the fact that the AI industry is pivoting to this grounded approach to build new models. On the one hand, they will facilitate the development of high-quality and controlled AI applications; on the other hand, they will offer new exciting opportunities for creatives in the media industry. But how exactly? Let us first take a brief look at the challenges of LLMs and vision models, and then outline what a "good" World Model for the media industry could look like.

Why LLMs are Grounded in Thin Air

In recent years, many industries have relied heavily on large language models (LLMs). They are undoubtedly impressive when it comes to co-writing screenplays or summarizing plots, but let's be honest: language is just one sensor. Many creatives agree that language is sometimes a very poor proxy for describing the visual world in which their screenplays are to be embedded.

The field of situated language research tells us that language is not the primary factor in how people interpret the world. Humans are visually oriented beings. You probably won't remember the exact wording of today's conversations tomorrow, but you'll easily recall the layout of the room where those conversations took place.

The core problem of LLMs is their statistical decoupling from reality:

  • LLMs work with representations of linguistic expressions, not with reality.
  • They predict the most likely next word (or token) based on the massive corpus of text they have been trained on.
  • They don't know that a glass falls due to gravity; they only know that in billions of sentences, the word "glass" was followed by the words "shattered on the floor."
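This purely statistical prediction can be sketched with a toy bigram model. This is illustrative only: real LLMs use neural networks over subword tokens, and the tiny corpus here is invented. The point is that the model tracks word co-occurrence, never physics.

```python
from collections import Counter, defaultdict

# Toy corpus: the model only ever sees which words follow which.
corpus = [
    "the glass shattered on the floor",
    "the glass shattered into pieces",
    "the glass sat on the table",
]

# Count word-to-next-word transitions (a bigram model).
following = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        following[prev][nxt] += 1

# "Predict" the most likely word after "glass".
next_word, count = following["glass"].most_common(1)[0]
print(next_word)  # "shattered" -- chosen from frequency, with no notion of gravity
```

The model outputs "shattered" simply because that continuation dominates the corpus, not because it understands that unsupported objects fall.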

This is why LLMs exhibit the frustrating hallucinations that we encounter in our applications and that we need to control and eliminate with various technical workarounds. Without a physical anchor, they lack an understanding of what is achievable. As one Gemini model put it to me in conversation: "LLMs are basically geniuses who live in a vacuum, dream in sentences, but are unable to catch a ball."

Vision Models: Pretty Pixels without Understanding

While current image and video generators produce stunning visuals, they essentially only master the next pixel. They analyze patterns in the millions of individual images on which they were trained in order to predict the optimal next pixel combination.

In a high-stakes media production environment, this pixel-guessing approach quickly reaches its limits:

  • Logical inconsistencies: You might witness a breathtaking sunset, but the sun sets in the north. To avoid such inconsistencies, storytellers must explain many logical aspects of the world to current models so that they behave as we expect.
  • Physics errors: A character could walk right through a solid table because the model doesn't know the table is a solid 3D object. To the model, it is just a collection of brown pixels; it can't tell which pixels a character can pass through. Try explaining that to the model; not so easy!
  • Causality gaps: For example, a model might indicate that a window breaks before a ball hits it because it lacks the causal understanding that action A leads to outcome B. If the training datasets contained many scenes in which a ball hit a window that subsequently broke, the model might predict correctly. However, if the ball is replaced with another object, it might fail.
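The causality gap above can be made concrete with a small sketch. A world model can enforce that effects never precede their causes, regardless of which objects are involved; the rule set and event names below are hypothetical, chosen purely for illustration.

```python
# Hypothetical causal rules: each effect may only appear after its cause.
CAUSAL_RULES = {
    ("ball_hits_window", "window_breaks"),
    ("glass_dropped", "glass_shatters"),
}

def violates_causality(timeline):
    """Return True if any known effect appears before its cause."""
    index = {event: i for i, event in enumerate(timeline)}
    for cause, effect in CAUSAL_RULES:
        if cause in index and effect in index and index[effect] < index[cause]:
            return True
    return False

print(violates_causality(["window_breaks", "ball_hits_window"]))  # True: broken order
print(violates_causality(["ball_hits_window", "window_breaks"]))  # False: valid order
```

A pixel-based generator has no such check; it can only hope that enough training clips showed the events in the right order.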

What is a good World Model for the media industry?

So what distinguishes a World Model from the models mentioned above, and how should it be designed to be a good one? Instead of predicting the next token or pixel, it predicts the next world state. It's a simulation engine that understands the "why" behind the "what".

To be truly useful for the media industry, a World Model must master several fundamental pillars:

  • Fundamental physics: A good World Model must understand gravity, momentum, and velocity. If a director wants an action sequence, the AI shouldn't simply guess how a car will roll over based on similar scenes in the training datasets. Instead, the model should calculate the approximate trajectory in a simulated physical environment, taking into account the physical properties of the objects in that environment.
  • Object persistence: Something we learn as children: an object still exists even when it is hidden behind a curtain. A good World Model captures the 3D existence of objects in new situations even when they are outside the camera's field of view.
  • Spatial geometry: A good World Model takes into account depth, lighting, and the 3D structure of a scene. It prioritizes the essential elements of a scene over individual decorative details, thus enabling the creation of realistic events.
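The first pillar can be sketched in a few lines: instead of guessing from similar scenes, a world model advances object state with explicit dynamics. This is a deliberately minimal Euler integration under assumed ideal conditions (no air resistance); real simulation engines are far richer.

```python
def simulate_fall(height_m, dt=0.01, g=9.81):
    """Integrate a falling object's position until it reaches the ground."""
    y, v, t = height_m, 0.0, 0.0
    while y > 0:
        v += g * dt   # gravity accelerates the object
        y -= v * dt   # position follows velocity
        t += dt
    return t

# A glass falling from a 1.2 m table hits the floor after roughly half a
# second -- derived from dynamics, not from sentence or pixel statistics.
print(round(simulate_fall(1.2), 2))
```

The same machinery generalizes to any object with known mass and initial state, which is exactly what pattern-matching on training clips cannot guarantee.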

Beyond Physics: Pragmatics and the Hierarchy of Truths

But that's not all. Things only get interesting when a world model is introduced that takes into account pragmatics and the hierarchy of truths, including potential discrepancies between what is said and what is meant in the current situation. (This all ties in with my own academic research in the field of experimental psychology and psycholinguistics, so you can understand my enthusiasm.)

In the simplest case, if a film character says, "It's hot in here," current models likely only consider the temperature. A good world model understands the context: the character is probably asking someone to open a window or turn on the air conditioning. It uses latent variables to represent uncertainty, allowing the AI to consider multiple possible interpretations of a scene instead of just issuing a single (and often incorrect) prediction.
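The idea of latent variables over interpretations can be sketched as a distribution that gets updated by scene context. Everything here is invented for illustration: the interpretation labels, the probabilities, and the toy conditioning rule.

```python
# A distribution over possible intended meanings of "it's hot in here".
# Labels and probabilities are hypothetical.
interpretations = {
    "statement_about_temperature": 0.2,
    "request_open_window": 0.5,
    "request_turn_on_ac": 0.3,
}

def condition_on_scene(dist, scene_objects):
    """Drop readings the scene makes impossible, then renormalize."""
    filtered = {k: p for k, p in dist.items()
                if k != "request_open_window" or "window" in scene_objects}
    total = sum(filtered.values())
    return {k: p / total for k, p in filtered.items()}

# In a windowless room with an AC unit, the "open the window" reading
# disappears and its probability mass shifts to the remaining options.
posterior = condition_on_scene(interpretations, scene_objects={"ac_unit"})
print(max(posterior, key=posterior.get))  # "request_turn_on_ac"
```

The model never commits to one literal reading; it carries the ambiguity forward and lets context resolve it.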

Furthermore, these models must consider the hierarchy of truths. Research shows that adults, in certain situations, trust visual information more than verbal information when the two sensory signals contradict or compete with each other. In a film, this would also mean that if the script states "he flew through space" but the model's internal physics engine says, "This is a realistic drama, not a superhero movie," the AI must be able to recognize this contradiction and interpret the information as metaphorical language rather than factual truth. A good World Model would also understand that, if it were a superhero movie, actual physical flying was meant.
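One way to picture this genre-conditioned reading is a lookup of what is physically possible within the story world. The function, genre labels, and the crude verb extraction below are all hypothetical; a real model would resolve this with learned representations rather than hand-written sets.

```python
# Hypothetical: which actions are literally possible in each story world.
PHYSICALLY_POSSIBLE = {
    "realistic_drama": {"walked", "drove", "ran"},
    "superhero": {"walked", "drove", "ran", "flew"},
}

def interpret(script_line, genre):
    """Classify a script line as literal or metaphorical for a given genre."""
    verb = script_line.split()[1]  # crude: "he flew through space" -> "flew"
    if verb in PHYSICALLY_POSSIBLE[genre]:
        return "literal"
    return "metaphorical"

print(interpret("he flew through space", "realistic_drama"))  # "metaphorical"
print(interpret("he flew through space", "superhero"))        # "literal"
```

The same sentence flips meaning depending on the simulated world's rules, which is exactly the contradiction-handling the hierarchy of truths demands.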

Consistency as a Byproduct

The biggest challenge with creative AI right now is maintaining temporal and character consistency. We've all seen videos where a character's hair color changes or their shirt looks different between takes. A good World Model solves this problem by reversing the approach.

While current AI models try to keep pixels consistent across all frames (which is very difficult), a good World Model first creates the 3D environment and character model. It then "films" the scene within this stable internal world. By anchoring the production in a simulated reality, consistency becomes a byproduct of the system, rather than a struggle for creatives looking for optimal control or for engineers finding technical solutions to maintain consistency across scenes.
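The "world state first, frames second" architecture can be sketched as a single source of truth that every rendered frame reads from. All names here are hypothetical stand-ins; a real system would hold a full 3D scene graph rather than a dictionary.

```python
# One stable world record: character appearance is stored exactly once.
world = {
    "characters": {
        "lena": {"hair": "red", "shirt": "blue"},
    }
}

def render_frame(world, camera_angle):
    """Stand-in for a renderer: every frame is a view of the shared state."""
    char = world["characters"]["lena"]
    return f"angle={camera_angle}: lena hair={char['hair']} shirt={char['shirt']}"

# Two takes from different angles, one source of truth -- hair and shirt
# cannot drift between shots, because no frame stores its own copy.
take_1 = render_frame(world, camera_angle=30)
take_2 = render_frame(world, camera_angle=120)
print(take_1)
print(take_2)
```

Contrast this with frame-by-frame generation, where each frame re-invents the character's appearance and consistency must be enforced after the fact.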

World Models as Grounded Collaborators for Creators

The practical challenge that science has grappled with for decades remains: We live in an ambiguous, multimodal world where signals often contradict each other. Visual realism alone is just as unreliable a substitute for reality as an LLM is. For the media industry, the ideal approach lies in a model that understands the laws of physics and the causality of different creative worlds so well that it can deliberately bend them for artistic effect.

With the new World Models, we are moving towards an era in which AI is not just a generator, but a reliable partner that understands the various world representations in which creators want to set their stories. This world doesn't have to be realistic; a good world model should also be able to simulate alternative worlds. In practice, these models will not automate storytelling in any of these worlds, but they will allow human creatives even more creative freedom to create new innovative worlds for their stories.

It's difficult to predict how quickly these new World Models can be realized for industrial and creative applications. However, I'm pleased that investors are backing them. Personally, I want these models to be available to us as a development company as soon as possible, as they will allow us to develop practical, easier-to-control applications more quickly. It's also crucial that creative professionals get to use these models themselves, without the significant technical hurdles that still exist in today's creative work with AI.

Note: Some of the visuals in this blog post were created using AI technology.

AI with Purpose. Innovation with Integrity.
ZDF Sparks GmbH
Office: Hausvogteiplatz 3-4, 10117 Berlin