Seminar

World Models for Vision and Artificial Intelligence: Bayes or Bust

June 2, 2026 · Alan Yuille · Johns Hopkins University

Alan Yuille, Bloomberg Distinguished Professor at Johns Hopkins University, delivers his lecture on “World Models for Vision and Artificial Intelligence: Bayes or Bust” at the Seoul AI Hub. [Source=NAIRL]

As large language models and vision-language models (VLMs) push the boundaries of semantic understanding, Alan Yuille, Bloomberg Distinguished Professor at Johns Hopkins University, offered a pointed counterargument: it is perception, not reasoning, that is currently holding back progress in artificial intelligence. Drawing on decades of work spanning computer science and cognitive science, he argued that today’s models still lack the 3D and 4D world knowledge that humans naturally acquire by interacting with the physical world from infancy.

On May 29, the National AI Research Lab (NAIRL) and the KAIST Kim Jaechul Graduate School of AI co-hosted a Distinguished Scholar Seminar featuring Professor Yuille at the Seoul AI Hub. Held under the title “World Models for Vision and Artificial Intelligence: Bayes or Bust,” the seminar drew around 50 participants, including researchers, graduate students, and industry professionals who gathered to hear his perspective on where vision research must head next.

To frame the problem, Professor Yuille revisited the Bayesian framework developed in the 1980s and 1990s, which combines analysis, synthesis, and prior world knowledge. He argued that this perspective offers a unified way to integrate perception, reasoning, and action, allowing a system to first estimate the 3D structure of the world and then reason about it, rather than collapsing the two into a single opaque step.
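For readers unfamiliar with the Bayesian view of vision, its textbook form can be sketched as below. The notation is an illustrative assumption and is not taken from the lecture: $S$ stands for the 3D scene, $I$ for the observed image, $Q$ for a question about the scene, and $A$ for its answer.

```latex
% Bayes' rule: the posterior over scenes combines a generative
% (synthesis) model of images with prior world knowledge.
P(S \mid I) \;=\; \frac{P(I \mid S)\, P(S)}{P(I)} \;\propto\; P(I \mid S)\, P(S)

% Step 1 (perception): estimate the most probable scene given the image.
\hat{S} \;=\; \arg\max_{S} \; P(I \mid S)\, P(S)

% Step 2 (reasoning): answer the question from the estimated scene,
% rather than mapping the image to an answer in a single opaque step.
\hat{A} \;=\; \arg\max_{A} \; P(A \mid Q, \hat{S})
```

Here $P(I \mid S)$ plays the role of synthesis (how a scene would appear as an image), $P(S)$ encodes prior world knowledge, and inferring $S$ from $I$ is the analysis step. The two-stage split is a simplified reading of the "estimate, then reason" idea described above, not a statement of the speaker's exact formulation.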

Participants listen attentively to Professor Yuille’s presentation during the seminar held at the Seoul AI Hub. [Source=NAIRL]

Building on this view, Professor Yuille presented three lines of work supporting his argument. First, he showed that state-of-the-art VLMs perform poorly on visual question-answering tasks requiring genuine 3D understanding, such as judging the relative height, distance, or orientation of objects, reaching only 54 AP against near-perfect human performance. He demonstrated that fine-tuning VLMs with 3D-aware data and explicitly estimating 3D structure before reasoning substantially closes this gap, approaching human-level accuracy on synthetic benchmarks. Second, he introduced his group’s work on medical image analysis, where tumor detection systems were trained directly from paired CT scans and radiology reports, treating the task as a missing-data problem rather than relying on costly expert annotations. Recognized as a best paper at MICCAI, this approach achieved detection performance surpassing radiologists on pancreatic tumors.

Third, Professor Yuille described a new benchmark for evaluating generative world models in embodied agent tasks, including active recognition, image-goal navigation, embodied question answering, and robotic manipulation. The benchmark is designed to test whether world models can serve as mental models that allow an agent to imagine the outcomes of its actions before taking them, a capability he positioned as central to bridging perception and action.

Concluding his lecture, Professor Yuille stressed that vision researchers must produce models capable of estimating the 3D and 4D properties of the world, noting that such structured approaches also yield more interpretable AI systems that are less prone to shortcut learning. NAIRL plans to continue its exchanges with leading AI researchers from around the world, fostering an ecosystem in which domestic researchers can drive both scientific discovery and industrial innovation with cutting-edge AI technologies.