Introduction
The most recent NeurIPS conference took place in Vancouver, BC and I had the pleasure of attending with a large cohort from Isomorphic Labs (although I was a few minutes too late for the all-important team photo!).
One of the major themes of the conference was the proliferation of diffusion which, assuming things continue on their current trajectory, will soon account for over 100% of NeurIPS papers. With diffusion being one of the key architectural changes introduced in AlphaFold 3 [1], I was keen to take the opportunity to meet many others working in the area and review some of the emerging trends. This post summarises what I found but only offers a glimpse of the diverse works presented at the conference (and is rather biased by what I attended). Apologies if I didn’t include your work, and please do let me know what I’ve missed!
This blog assumes background knowledge about diffusion models, but if you would benefit from an introduction I recommend the sacred texts, Lilian Weng’s blog post [2] and Calvin Luo’s tutorial [3].
Personal Highlights
Among the works I found particularly interesting was the introduction of autoguidance [4] by the same authors who brought you EDM [5]. In addition to some very snazzy visualisations, they offer a unique insight into why classifier-free guidance improves image quality: not just by boosting prompt alignment but also by guiding sampling towards the core of the data manifold. Moreover, they introduce a technique to achieve this effect in the unconditional setting by guiding sampling against the score induced by a ‘bad’ version of a strong model.
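To make the idea concrete, here is a minimal sketch of forming the guided prediction (my own paraphrase, not the authors’ code), assuming `d_good` and `d_bad` are denoisers that return clean-image estimates:

```python
def autoguided_denoise(d_good, d_bad, x_t, sigma, w=2.0):
    """Sketch of autoguidance: extrapolate away from a deliberately weakened
    ('bad') version of the model towards the strong model's prediction."""
    pred_good = d_good(x_t, sigma)
    pred_bad = d_bad(x_t, sigma)
    # With w > 1, samples are pushed towards the core of the data manifold,
    # mirroring the extrapolation used in classifier-free guidance.
    return pred_bad + w * (pred_good - pred_bad)
```

Replacing `d_bad` with an unconditional model and `d_good` with the conditional one recovers the familiar classifier-free guidance extrapolation.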
Also noteworthy was TFG [6] which introduces a framework to unify Training-free Guidance approaches, such as Universal Guidance [7]. Such approaches become instances of TFG in a hyperparameter subspace. These methods allow you to guide a pretrained diffusion model using a predictor, but unlike classifier guidance, you need only define the predictor in the clean data space. The authors introduce a mechanism to search the hyperparameter space for different tasks and demonstrate improved performance across 16 task types compared to previous training-free guidance approaches.
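As a rough illustration (not the full TFG recipe, which also searches over its guidance hyperparameters), a single training-free guidance step might look something like this, where `denoiser` returns a clean-data estimate and `predictor_loss` is a scalar loss defined only on clean data:

```python
import torch

def training_free_guidance_step(denoiser, predictor_loss, x_t, sigma, scale=1.0):
    x_t = x_t.detach().requires_grad_(True)
    x0_hat = denoiser(x_t, sigma)        # clean-data estimate of the noisy latent
    loss = predictor_loss(x0_hat)        # the predictor only ever sees clean data
    grad = torch.autograd.grad(loss, x_t)[0]
    # Nudge the noisy latent downhill on the predictor loss before the next sampling step.
    return x_t.detach() - scale * grad
```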
Finally, while it’s emphatically not a diffusion paper, this year’s best paper (setting aside the soap opera) introduces Visual AutoRegressive modeling [8]. This is a novel image generation approach which offers comparable generation quality to diffusion but with significantly reduced inference times. The technique hinges on applying autoregressive decoding, not pixel by pixel, but from low resolution to high resolution. A neat aspect of this work is the demonstration of power-law scaling laws for VAR, akin to those observed for LLMs, which haven’t been shown for diffusion. Could be one to watch in case we all need to pivot.
Data-to-data Translation
Traditional diffusion sampling commences from Gaussian noise samples. Instead, data-to-data translation aims to map from and to arbitrary data distributions. While not papers exactly, there were a couple of excellent talks in this direction worth mentioning.
Flow Matching has become the go-to technique for this problem in contrast to diffusion (although the two are more closely related than you might hear [9]). Yaron Lipman and his team from FAIR presented a comprehensive introduction to Flow Matching, along with a superb accompanying guide complete with code examples [10].
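For readers who haven’t seen it, the objective in its simplest linear-path form is only a few lines; this sketch assumes `v_theta(x_t, t)` is a network predicting a velocity field:

```python
import torch

def conditional_flow_matching_loss(v_theta, x0, x1):
    """x0 ~ source distribution (e.g. Gaussian noise), x1 ~ data."""
    t = torch.rand(x0.shape[0], *([1] * (x0.dim() - 1)), device=x0.device)
    x_t = (1 - t) * x0 + t * x1       # point on the straight path from x0 to x1
    target = x1 - x0                  # velocity of that path
    return ((v_theta(x_t, t) - target) ** 2).mean()
```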
Returning to diffusion, Arnaud Doucet presented an overview of Schrödinger Bridges (so-called because Erwin Schrödinger developed a rudimentary form of iterative proportional fitting in 1931), which bring unpaired data-to-data translation to diffusion. Schrödinger Bridges are stochastic paths between data distributions which minimise kinetic energy, approximating optimal transport. Such bridges can be learnt using an algorithm called Iterative Markovian Fitting (NeurIPS 2023) [11]. Unfortunately, this is prohibitively expensive, as each iteration requires retraining a diffusion model and resampling data from it. Conveniently, Doucet has a solution in his NeurIPS 2024 paper, which introduces α-Iterative Markovian Fitting [12] to allow you to fine-tune existing Brownian Bridge models to create Schrödinger Bridges.
Discrete Generation
Although diffusion models have emerged as a powerful generative framework for continuous data such as images, their application to discrete data, such as language, remains limited and rather complex. Autoregressive (AR) models are still dominant despite being limited by sequential sampling and necessitating the imposition of an order, which is unnatural for some datatypes.
MD4 [13] from Google DeepMind addresses this by introducing a simplified and general framework for masked diffusion models (or ‘absorbing diffusion’). In the forward process, such models gradually introduce noise by replacing ground truth tokens with a mask according to a schedule and the model learns to denoise. The authors demonstrate that the reverse generation process follows a simple logic and that the loss can be simplified to an integral over weighted cross entropy losses. Remarkably, this makes training similar to BERT [14] and more numerically stable. The resultant models outperform previous discrete diffusion approaches and, on some tasks, surpass the best AR models for a given model size.
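The training loop really is BERT-like; here is a rough sketch of a masked-token loss under an assumed masking schedule `mask_prob_fn(t)` with per-time weight `weight_fn(t)` (my simplification, not the exact MD4 objective):

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # hypothetical id of the mask token

def masked_diffusion_loss(model, tokens, mask_prob_fn, weight_fn):
    b, seq_len = tokens.shape
    t = torch.rand(b, 1, device=tokens.device)
    is_masked = torch.rand(b, seq_len, device=tokens.device) < mask_prob_fn(t)
    noisy = torch.where(is_masked, torch.full_like(tokens, MASK_ID), tokens)
    logits = model(noisy)                                         # (b, seq_len, vocab)
    ce = F.cross_entropy(logits.transpose(1, 2), tokens, reduction="none")
    # Only masked positions contribute, weighted according to the noise schedule.
    return (weight_fn(t) * ce * is_masked).sum() / is_masked.sum().clamp(min=1)
```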
On the Flow Matching side (which will be entirely ignored for the rest of this blog post, my apologies), Discrete Flow Matching [15] was introduced by Lipman’s group. This framework, you guessed it, extends Flow Matching to model discrete data. I’m much less well read on the Flow Matching literature, but it appears to be a natural extension of the continuous setup and, notably, the generating probability velocity for Discrete Flow Matching has an identical form to its continuous counterpart.
Unconditional Generation
In addition to the aforementioned autoguidance, The Return of Unconditional Generation [16] introduces Representation-Conditioned Generation (RCG). To perform unconditional generation, this framework applies conditional generation to produce the final image. The trick is that the conditioning information is itself sampled unconditionally, using a representation generator which maps noise to the distribution produced by a pretrained encoder. For the conditional image generator they use a variety of diffusion models, but also show that the benefit of this technique extends to MAGE [17], a transformer-based generative model using iterative decoding. This approach significantly improves unconditional generation quality and achieves a new state-of-the-art FID on ImageNet 256×256.
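The two-stage pipeline is easy to sketch schematically (placeholder names, not the authors’ API):

```python
import torch

def rcg_sample(rep_generator, conditional_generator, batch_size, rep_dim, device="cpu"):
    """Unconditional generation via conditioning: sample a representation from
    noise, then condition an off-the-shelf generator on it."""
    z = torch.randn(batch_size, rep_dim, device=device)
    rep = rep_generator(z)                       # noise -> representation (matching a pretrained encoder's distribution)
    return conditional_generator(condition=rep)  # ordinary conditional image generation
```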
Image Editing
There were a number of fun papers looking at image editing from various angles.
TextCtrl [18] focuses on the task of controllably modifying the text in a visual scene while preserving the original style. They achieve this by constructing a text style encoder disentangled into various features of the text and a text glyph structure encoder. These encoders are used to guide generation. This work also introduces the ScenePair dataset for benchmarking this task.
LOCO Edit [19] analyses the semantic subspaces of pretrained models and identifies interesting editing directions. They apply these directions with a mask to do training-free local image editing based on DDIM Inversion¹.
Schedule Your Edit [20] takes a closer look at error accumulation in DDIM Inversion which harms content preservation and edit fidelity. They identify a singularity problem at the beginning of the inversion process in traditional noise schedules and attempt to ameliorate this by proposing a ‘logistic’ noise schedule.
Training-free Methods
A great many works used training-free methods (‘training-free’ has become something of a catch-all buzzword), promising to save you compute by exploiting pretrained models without extra fine-tuning.
In addition to TFG, which I highlighted above, Ctrl-X [21] is one such method. This work introduces a mechanism for controllable generation using text-to-image diffusion models without retraining (expensive and data hungry) or guidance (sensitive to sampling hyperparameters and costly at inference). Two input images, one for structure conditioning and one for appearance conditioning, are partially noised and fed into the pretrained denoiser. They use clever copying techniques based on previous findings in the literature (e.g., diffusion features [22]) to inject information from each into a copy of the denoiser used to produce the final output.
Trust Sampling [23] extends previous works which introduce constraints by guiding sampling with a loss term. At each diffusion step, they add support for multi-step optimisation of the constraint loss and early termination when the sample starts to wander away from the state manifold (as measured by the denoiser magnitude).
FIFO-Diffusion [24] brings infinite length video support to existing video diffusion models in a training-free manner. The trick is to move from chunked autoregressive generation, which is how the model was trained, to a first-in-first-out system. In the training setup, one contiguous chunk of frames is denoised to completion before moving on to the next chunk. With FIFO-Diffusion, each frame in the batch is at a different noise level and consecutive denoiser invocations introduce a new frame in a sliding window pattern. As a result, later frames are conditioned on better-resolved earlier frames. This causes a train-inference distribution shift which the authors address by introducing ‘latent partitioning’.
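The queueing logic is the heart of it; here is an illustrative sketch that ignores details like latent partitioning, where `denoise_one_level(latents, sigmas)` is assumed to move every frame in the window down one noise level:

```python
import torch
from collections import deque

def fifo_generate(denoise_one_level, sigmas, latent_shape, num_frames, device="cpu"):
    """sigmas are ordered from nearly clean (front of the queue) to pure noise (back)."""
    queue = deque(torch.randn(latent_shape, device=device) * s for s in sigmas)
    video = []
    while len(video) < num_frames:
        latents = denoise_one_level(torch.stack(list(queue)), sigmas)
        queue = deque(latents)                    # every frame is now one level cleaner
        video.append(queue.popleft())             # the front frame has reached (roughly) sigma = 0
        queue.append(torch.randn(latent_shape, device=device) * sigmas[-1])  # fresh noise at the back
    return torch.stack(video)
```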
Controlling Input Noise
Numerous works this year looked at controlling the noise input which is used as the starting point to the reverse diffusion process.
ReNO (Reward-based Noise Optimization) [25] is one such work which optimises the initial noise in text-to-image models using human preference reward models, such as ImageReward. To sidestep exploding/vanishing gradients caused by backpropagating through the denoising process, the authors focus on distilled, one-step models [26]. With only a few gradient optimisation iterations, they show this approach can outperform multi-step models like SDXL.
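Conceptually the optimisation loop is tiny; a hedged sketch, with `one_step_generator` and `reward_model` standing in for, say, a distilled text-to-image model and ImageReward:

```python
import torch

def optimise_initial_noise(one_step_generator, reward_model, noise, steps=10, lr=0.1):
    noise = noise.detach().requires_grad_(True)
    opt = torch.optim.Adam([noise], lr=lr)
    for _ in range(steps):
        image = one_step_generator(noise)    # a single denoising step keeps gradients well behaved
        loss = -reward_model(image).mean()   # maximise the human-preference reward
        opt.zero_grad()
        loss.backward()
        opt.step()
    return noise.detach()
```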
Blind Image Restoration via Fast Diffusion Inversion [27] presents a method for image restoration based on noise optimisation. Starting from noise, they generate a clean image, compare it to the provided degraded image (using some predicted degradation parameters) and invert the DDIM process. This provides gradients to tweak the original noise to produce a better match at t = 0. This must be repeated several times, making the process expensive, a problem they address by using large sampling step sizes. By only modifying the noise and not intermediate latents, they ensure restored images lie on the data manifold.
Immiscible Diffusion [28] seeks to accelerate diffusion model training by controlling the input noise such that different clean data modes correspond to different regions of the input noise distribution. At training time, they sample a batch of Gaussian noise and a batch of target images and form image-noise pairs so as to minimise pairwise distance. Evaluating this technique on a few small datasets, they observe a 1.2x to 3x speed-up.
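The pairing step boils down to a small assignment problem per batch; a sketch, assuming image and noise batches of equal size:

```python
import torch
from scipy.optimize import linear_sum_assignment

def immiscible_pairing(images, noise):
    """Re-pair each image with a nearby noise sample before noising during training."""
    cost = torch.cdist(images.flatten(1), noise.flatten(1))   # pairwise L2 distances
    _, cols = linear_sum_assignment(cost.cpu().numpy())       # minimise the total assignment cost
    return images, noise[torch.as_tensor(cols)]               # noise reordered to match the images
```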
Image Diffusion to Video Diffusion
A final theme I picked up on at NeurIPS was the application of image diffusion models for video generation. This has become important in part because state-of-the-art video diffusion models are not generally publicly available and in part due to the prohibitive cost of training and running such models.
For instance, Warped Diffusion [29] from NVIDIA combines the idea of controlling input noise with the direction of extending image diffusion models into video diffusion models. To introduce video support, they warp the noise between consecutive frames using transformations derived from an optical flow. To make this work, they fine-tune models (also using SDXL) to take noise sampled from a Gaussian process as input, and introduce an equivariance self-guidance to ensure that the generated frames are consistent under the transformation.
MotionCraft [30] was another interesting work in this vein, which introduces motion to Stable Diffusion by exploiting the fact that the latent space is image-like. They apply a masked velocity field from a physics simulation to warp this space directly and use the model to denoise, which results in fewer artefacts than applying the simulation directly to a clean image.
COrrespondence-guided Video Editing (COVE) [31] introduces a technique to use text-to-image diffusion models for video editing. Following image editing, the authors apply DDIM Inversion and denoise with a text prompt. To extend to videos, the authors extract diffusion features [22] from noisy latents representing video frames and apply them to measure the semantic correspondence of tokens in adjacent frames. At each timestep of inversion and denoising, the tokens in the noisy latent are sampled using the computed correspondence and merged [32], before self-attention is conducted on the merged tokens.
Wrap Up
And that’s everything I can remember that happened at NeurIPS. Thanks to all the kind people who explained their work to me; it was great to meet so many of you at the diffusion meetup. And lastly, thanks to Sebastian Ruder, whose idea of writing conference summary blog posts I shamelessly ripped off, and Sander Dieleman, whose excellent blog focussed on diffusion motivated me to start this one.
What was your favorite work? What did I miss? Let me know on Twitter (DMs open if you’re shy).
Citation
Cited as:
Bambrick, Joshua. (Jan 2025). NeurIPS 2024: Diffusion Themes and Memes, Joshua Bambrick’s Blog. https://joshbambrick.com/blog/posts/neurips-2024/.
Or
@article{bambrick2025neurips2024,
title = "NeurIPS 2024: Diffusion Themes and Memes",
author = "Bambrick, Joshua",
journal = "Joshua Bambrick's Blog",
year = "2025",
month = "Jan",
url = "https://joshbambrick.com/blog/posts/neurips-2024/"
}
References
Footnotes
DDIM Inversion is going to come up a lot. The idea is to edit an image by starting from a clean image, repeatedly calling the denoiser and stepping in the opposite direction to obtain a partially noised latent. Running DDIM, which applies deterministic backwards sampling (ODE), starting from this latent should approximately regenerate the original sample. We can modify this sampling with our editing information to obtain a similar but modified image.↩︎
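For concreteness, a rough sketch of the inversion loop (notation simplified; `alphas_cumprod` is assumed to be a 1-D tensor of ᾱ values indexed from low to high noise, so its values decrease):

```python
import torch

def ddim_invert(predict_noise, x0, alphas_cumprod, num_steps):
    x = x0
    for i in range(num_steps - 1):
        a_cur, a_next = alphas_cumprod[i], alphas_cumprod[i + 1]   # a_next is the noisier level
        eps = predict_noise(x, i)                                   # denoiser's noise prediction
        x0_hat = (x - (1 - a_cur).sqrt() * eps) / a_cur.sqrt()      # implied clean image
        x = a_next.sqrt() * x0_hat + (1 - a_next).sqrt() * eps      # deterministic step towards more noise
    return x  # partially noised latent; running DDIM forwards from here approximately recovers x0
```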