Generative AI in focus – deep dive summaries and presentations as part of an nxtAIM project event on fundamental and innovative topics
nxtAIM at Forschungszentrum Jülich
Fourteen months after the launch of the nxtAIM project – Generative Methods for Perception, Prediction and Planning – 95 researchers gathered for the first Winter School at Forschungszentrum Jülich. The goal was to foster knowledge exchange and collaboration while enabling the participating partners to make efficient use of Jülich’s computing resources.
On March 13 and 14, 2025, academic and industrial project partners delivered and engaged in lectures, workshops, and deep dives covering fundamental, advanced, and practical topics in generative AI.
The program was complemented by sessions on supercomputing and the necessary data-loading processes. Guided tours of the Jülich supercomputer provided participants with an impression of the scale of the available computing hardware.
At the end of the two-day event, all participants gave positive feedback: a well-organized program, insightful technical discussions, an excellent tour of the computing center, and engaging deep dives were among the comments heard at the farewell.
Building on the success of this first Winter School, the project organizers plan to hold a second Winter School in 2026.
Deep Dive Summaries
In these compact Deep Dive Summaries, nxtAIM researchers provide insights into current approaches, challenges, and solutions — ranging from precise image processing to safe AI for autonomous driving.
As the Winter School was an internal event, you will find here an overview in the form of brief summaries of the contributions.
If you’d like to dive deeper into the topics, be sure to save the date: In March 2026, the nxtAIM Open Project Day in Freiburg, Germany will offer the opportunity to explore the topics in detail and engage directly with the researchers.
Current discriminative depth estimation methods often produce blurry artifacts, while generative approaches suffer from slow sampling due to curvatures in the noise-to-depth transport. Our method addresses these challenges by framing depth estimation as a direct transport between image and depth distributions. We are the first to explore flow matching in this field, and we demonstrate that its interpolation trajectories enhance both training and sampling efficiency while preserving high performance. While generative models typically require extensive training data, we mitigate this dependency by integrating external knowledge from a pre-trained image diffusion model, enabling effective transfer even across differing objectives. To further boost performance, we employ synthetic data and utilize image-depth pairs generated by a discriminative model on an in-the-wild image dataset. As a generative approach, our model can also reliably estimate depth confidence, which provides an additional advantage. Our approach achieves competitive zero-shot performance on standard benchmarks of complex natural scenes while improving sampling efficiency and only requiring minimal synthetic data for training.
Paper: Ming Gui, Johannes Schusterbauer, Ulrich Prestel, Pingchuan Ma, Dmytro Kotovenko, Olga Grebenkova, Stefan Andreas Baumann, Vincent Tao Hu, Björn Ommer: DepthFM: Fast Generative Monocular Depth Estimation with Flow Matching
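To make the flow-matching formulation more concrete, here is a minimal PyTorch sketch of the training objective: latents are linearly interpolated between the image and depth distributions, and a network regresses the constant transport velocity. `VelocityNet`, the tensor shapes, and the dummy data are placeholders, not the architecture or data pipeline used in DepthFM.

```python
# Minimal flow-matching training step: direct transport from image latents
# to depth latents (names and shapes are illustrative, not from the paper).
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Toy stand-in for the network that predicts the transport velocity."""
    def __init__(self, channels=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels + 1, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, channels, 3, padding=1),
        )

    def forward(self, x_t, t):
        # Broadcast the scalar time t as an extra input channel.
        t_map = t.view(-1, 1, 1, 1).expand(-1, 1, *x_t.shape[2:])
        return self.net(torch.cat([x_t, t_map], dim=1))

def flow_matching_loss(model, x_image, x_depth):
    """Straight-line interpolation between image and depth latents;
    the network regresses the constant velocity x_depth - x_image."""
    t = torch.rand(x_image.size(0), device=x_image.device)
    x_t = (1 - t.view(-1, 1, 1, 1)) * x_image + t.view(-1, 1, 1, 1) * x_depth
    v_target = x_depth - x_image
    return nn.functional.mse_loss(model(x_t, t), v_target)

model = VelocityNet()
x_image = torch.randn(2, 4, 32, 32)   # dummy image latents
x_depth = torch.randn(2, 4, 32, 32)   # dummy depth latents
loss = flow_matching_loss(model, x_image, x_depth)
loss.backward()
```

Because the learned trajectories are close to straight lines, sampling at inference reduces to integrating the predicted velocity from an image latent towards the depth distribution in very few steps.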
Text-to-image generative models have become a prominent and powerful tool that excels at generating high-resolution realistic images. However, guiding the generative process of these models to consider detailed forms of conditioning reflecting style and/or structure information remains an open problem. We present LoRAdapter, an approach that unifies both style and structure conditioning under the same formulation using a novel conditional LoRA block that enables zero-shot control. LoRAdapter is an efficient, powerful, and architecture-agnostic approach for conditioning text-to-image diffusion models, which enables fine-grained control over conditioning during generation and outperforms recent state-of-the-art approaches.
Paper: Nick Stracke, Stefan Andreas Baumann, Josh Susskind, Miguel Angel Bautista, Björn Ommer: CTRLorALTer: Conditional LoRAdapter for Efficient 0-Shot Control & Altering of T2I Models
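The following sketch illustrates the general idea of a conditional LoRA block: a low-rank update whose per-rank scaling is predicted from a conditioning embedding such as a style or structure code. `CondLoRALinear` and all dimensions are illustrative assumptions; the exact LoRAdapter formulation is given in the paper.

```python
# Illustrative conditional LoRA layer: the low-rank update B(A(x)) is gated
# per rank by a conditioning embedding (e.g. a style or structure code).
import torch
import torch.nn as nn

class CondLoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, cond_dim: int, rank: int = 8):
        super().__init__()
        self.base = base                            # frozen pretrained projection
        self.base.requires_grad_(False)
        self.A = nn.Linear(base.in_features, rank, bias=False)
        self.B = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.B.weight)               # adapter starts as a no-op
        self.to_gates = nn.Linear(cond_dim, rank)   # conditioning -> per-rank gates

    def forward(self, x, cond):
        gates = self.to_gates(cond)                 # (batch, rank)
        delta = self.B(self.A(x) * gates.unsqueeze(1))
        return self.base(x) + delta

base = nn.Linear(320, 320)                          # e.g. a projection inside a U-Net block
layer = CondLoRALinear(base, cond_dim=128)
tokens = torch.randn(2, 77, 320)                    # dummy token sequence
cond = torch.randn(2, 128)                          # dummy style/structure embedding
out = layer(tokens, cond)                           # (2, 77, 320)
```

Only the small adapter matrices and the gating projection are trained, which is what keeps the approach efficient and architecture-agnostic.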
Generative world modeling is an emerging and rapidly evolving research area with significant potential to enhance our understanding of complex dynamic environments, such as driving scenarios. Unlike traditional synthetic video generation, generative world modeling enables nuanced control over both real and synthetic environments using diverse conditioning inputs, including textual instructions, odometry data, maps, and historical observations. By leveraging driving world models, we can achieve precise control and superior planning capabilities informed by real-world perception. In this project, we introduce a masked generative model capable of generating multiple plausible future scenarios based on a limited sequence of past observations. Our model effectively generates temporally coherent predictions spanning up to 25 seconds into the future, accurately modeling the dynamics of both the ego-vehicle and other dynamic objects in the scene. Notably, our approach requires significantly less data and computational resources compared to existing driving world models. The current implementation employs a model with 400 million parameters, trained on just 130 hours of driving data.
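As a rough illustration of masked generative world modeling, the sketch below trains a small transformer to reconstruct masked tokens of future frames conditioned on tokens of past frames. The tokenizer, vocabulary size, sequence lengths, and architecture are placeholder assumptions, not the 400-million-parameter model described above.

```python
# Sketch of masked future-token prediction: past frame tokens condition a
# transformer that fills in masked tokens of future frames.
import torch
import torch.nn as nn

VOCAB, MASK_ID, DIM = 1024, 1024, 256   # discrete codes from a video tokenizer

class MaskedWorldModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB + 1, DIM)   # +1 for the [MASK] token
        layer = nn.TransformerEncoderLayer(DIM, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens):
        return self.head(self.backbone(self.embed(tokens)))

model = MaskedWorldModel()
past = torch.randint(0, VOCAB, (1, 64))          # tokens of observed frames
future = torch.randint(0, VOCAB, (1, 64))        # ground-truth future tokens
mask = torch.rand(1, 64) < 0.7                   # mask a random subset of the future
corrupted = future.masked_fill(mask, MASK_ID)
logits = model(torch.cat([past, corrupted], dim=1))[:, 64:]
loss = nn.functional.cross_entropy(logits[mask], future[mask])
loss.backward()
```

At inference, the future tokens start fully masked and are filled in over a few iterations, which is what allows long, temporally coherent rollouts at modest computational cost; conditioning signals such as odometry or maps would enter as additional tokens.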
Agents in real-world scenarios such as automated driving deal with uncertainty in their environment, in particular due to perceptual uncertainty. Although reinforcement learning is dedicated to autonomous decision-making under uncertainty, these algorithms are typically not informed about the uncertainty currently present in their environment. On the other hand, uncertainty estimation for perception itself is typically evaluated directly in the perception domain, e.g., in terms of false positive detection rates or calibration errors based on camera images. Its use for deciding on goal-oriented actions remains largely unstudied. In this paper, we investigate how an agent's behavior is influenced by an uncertain perception and how this behavior changes if information about this uncertainty is available. To this end, we consider a proxy task in which the agent is rewarded for driving a route as fast as possible without colliding with other road users. For controlled experiments, we introduce uncertainty into the observation space by perturbing the agent's perception while informing the agent about this perturbation. Our experiments show that an unreliable observation space, modeled by a perturbed perception, leads to defensive driving behavior of the agent. Furthermore, when the information about the current uncertainty is added directly to the observation space, the agent adapts to the specific situation and in general accomplishes its task faster while, at the same time, accounting for risks.
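The controlled setup described above can be sketched with a gymnasium-style observation wrapper that perturbs observations with noise of a known scale and, optionally, appends that scale to the observation so the agent is informed about its current uncertainty. The wrapper, the toy environment, and the noise model below are illustrative assumptions, not the proxy driving task used in the study.

```python
# Perturb observations with noise of a randomly drawn scale and optionally
# expose that scale to the agent as an extra observation entry.
import numpy as np
import gymnasium as gym

class PerturbedObservation(gym.ObservationWrapper):
    def __init__(self, env, max_noise=0.2, inform_agent=True):
        super().__init__(env)
        self.max_noise = max_noise
        self.inform_agent = inform_agent
        if inform_agent:
            low = np.append(env.observation_space.low, 0.0)
            high = np.append(env.observation_space.high, max_noise)
            self.observation_space = gym.spaces.Box(low, high, dtype=np.float32)

    def observation(self, obs):
        noise_scale = np.random.uniform(0.0, self.max_noise)   # current uncertainty
        noisy = obs + np.random.normal(0.0, noise_scale, size=obs.shape)
        if self.inform_agent:
            noisy = np.append(noisy, noise_scale)              # inform the agent
        return noisy.astype(np.float32)

env = PerturbedObservation(gym.make("CartPole-v1"), inform_agent=True)
obs, _ = env.reset()
print(obs.shape)   # original observation plus one uncertainty entry
```

Training the same reinforcement learning agent with and without the appended uncertainty entry is what separates the "uninformed" from the "informed" condition in such experiments.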
Automated driving validation is a comprehensive process to prove that the system is free from unreasonable risk within the defined operational design domain. It requires identifying a sufficient scenario space, complying with regulatory requirements and traffic rules, creating representative real-world scenarios, and testing the complete system. In this work, generative AI is therefore employed to improve this process through a complete framework that ranges from regulations to the realistic testing of scenarios. Generative AI approaches help execute this comprehensive process more efficiently and manage a large number of scenarios, requirements, and other artifacts. The framework starts by analyzing the regulations of any given country with LLMs to derive compliance requirements from the perspective of the automated driving system (ADS). These requirements are processed within the same LLM prompt and converted into abstract scenario definitions. This establishes comprehensive coverage of traffic rules and regulations, ensuring that every required scenario is generated. The abstract scenario definitions are then passed to the LLM to generate OpenSCENARIO files. These scenario files are augmented with realistic trajectories generated by a conditional variational auto-encoder (CVAE) trained on trajectory data representing human driving and pedestrian behaviour. Finally, the generated OpenSCENARIO files are automatically placed at an appropriate scenario location for the prompted countries and tested in simulation environments. Overall, this approach enables an end-to-end framework for generating and testing realistic automated driving scenarios from given legislative sources.
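Structurally, the framework can be summarized as a small pipeline of steps, sketched below with placeholder functions: `query_llm`, the prompts, and the trajectory-splicing logic are assumptions made for illustration, not the project's actual prompts or OpenSCENARIO tooling.

```python
# Structural sketch of the regulation-to-simulation pipeline described above.
# query_llm(), the prompts, and the CVAE interface are placeholders.

def query_llm(prompt: str) -> str:
    """Hypothetical wrapper around whichever LLM the framework uses."""
    raise NotImplementedError

def regulation_to_abstract_scenarios(regulation_text: str) -> list[str]:
    """Steps 1-2: derive ADS compliance requirements and abstract scenarios."""
    answer = query_llm(
        "From the following regulation, list ADS compliance requirements and "
        f"convert each into an abstract scenario definition:\n{regulation_text}"
    )
    return answer.splitlines()

def abstract_to_openscenario(abstract_scenario: str) -> str:
    """Step 3: let the LLM emit an OpenSCENARIO (.xosc) description."""
    return query_llm(f"Write an OpenSCENARIO file for:\n{abstract_scenario}")

def augment_with_trajectories(xosc: str, sample_trajectory) -> str:
    """Step 4: splice a CVAE-sampled, human-like trajectory into the file.
    sample_trajectory is a callable wrapping the trained CVAE decoder;
    the placeholder marker is a stand-in for real file editing."""
    trajectory = sample_trajectory()
    return xosc.replace("<!--TRAJECTORY-->", str(trajectory))
```

The augmented files are then dropped into the scenario database for the relevant country and executed in the simulation environment, closing the loop from legislative text to system-level tests.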
Transformer-based models generate hidden states that are difficult to interpret. In this work, we aim to interpret these hidden states and control them at inference, with a focus on motion forecasting. We use linear probes to measure neural collapse towards interpretable motion features in the hidden states. High probing accuracy implies meaningful directions and distances between hidden states of opposing features, which we use to fit interpretable control vectors for activation steering at inference. To optimize our control vectors, we use sparse autoencoders with fully-connected, convolutional, and MLP-Mixer layers and various activation functions. Notably, we show that enforcing sparsity in the hidden states leads to a more linear relationship between control vector temperatures and forecasts. Our approach enables mechanistic interpretability and zero-shot generalization to unseen dataset characteristics with negligible computational overhead.
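A minimal sketch of the probing-and-steering recipe, assuming a binary motion feature and precomputed hidden states: a linear probe checks that the feature is linearly decodable, a control vector is formed from the difference of class means, and scaled copies of it are added to the hidden states at inference. The sparse-autoencoder optimization of the control vectors mentioned above is omitted; all names and data are placeholders.

```python
# Fit a linear probe on hidden states for a binary motion feature, derive a
# control vector from class means, and apply it as activation steering.
import torch

def fit_probe(hidden, labels, steps=200, lr=1e-2):
    """Logistic-regression probe on frozen hidden states."""
    w = torch.zeros(hidden.size(1), requires_grad=True)
    b = torch.zeros(1, requires_grad=True)
    opt = torch.optim.Adam([w, b], lr=lr)
    for _ in range(steps):
        loss = torch.nn.functional.binary_cross_entropy_with_logits(
            hidden @ w + b, labels.float())
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach(), b.detach()

def control_vector(hidden, labels):
    """Direction between the class means of opposing feature values."""
    return hidden[labels == 1].mean(0) - hidden[labels == 0].mean(0)

hidden = torch.randn(512, 64)              # dummy hidden states from the forecaster
labels = (torch.rand(512) > 0.5).long()    # dummy binary motion feature
w, b = fit_probe(hidden, labels)           # high accuracy -> feature is linearly encoded
v = control_vector(hidden, labels)
temperature = 2.0                          # steering strength
steered = hidden + temperature * v         # activation steering at inference
```

Sweeping the temperature and observing how the forecasts respond is what makes the relationship between control strength and model behaviour measurable.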
In this lecture, I would like to emphasize the need for model optimization, stress the benefits of compression, and give an overview of existing optimization techniques. The main topic of this session is low-rank compression methods. A brief introduction to low-rank compression will be followed by techniques for implementing such compression methods in neural networks. Three main low-rank methods will be the focus of this presentation: the canonical polyadic (CP/PARAFAC) decomposition, the Tucker decomposition, and the tensor-train decomposition. Finally, I would like to take a deeper look at existing low-rank methods and present some initial results of our work.
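As a simple illustration of the idea behind these methods, the sketch below compresses a dense linear layer with a truncated SVD, the matrix special case of low-rank factorization; CP, Tucker, and tensor-train decompositions extend this principle to higher-order weight tensors such as convolution kernels. The layer sizes and rank are arbitrary examples, not results from the presented work.

```python
# Replace a dense linear layer by two thinner layers via truncated SVD.
import torch
import torch.nn as nn

def low_rank_linear(layer: nn.Linear, rank: int) -> nn.Sequential:
    U, S, Vh = torch.linalg.svd(layer.weight.data, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]          # absorb singular values into the left factor
    V_r = Vh[:rank, :]
    first = nn.Linear(layer.in_features, rank, bias=False)
    second = nn.Linear(rank, layer.out_features, bias=True)
    first.weight.data.copy_(V_r)
    second.weight.data.copy_(U_r)
    second.bias.data.copy_(layer.bias.data)
    return nn.Sequential(first, second)

dense = nn.Linear(1024, 1024)
compressed = low_rank_linear(dense, rank=64)
x = torch.randn(8, 1024)
err = ((dense(x) - compressed(x)).norm() / dense(x).norm()).item()
params = lambda m: sum(p.numel() for p in m.parameters())
print(f"relative output error: {err:.2f}")
print(params(dense), params(compressed))   # 1,049,600 vs. 132,096 parameters
```

For pretrained layers the approximation error is typically recovered by a short fine-tuning phase, while the parameter and compute savings remain, which is the trade-off the lecture examines in more detail.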