NeurIPS 2024: Diffusion Themes and Memes

diffusion
NeurIPS
Author

Joshua Bambrick

Published

January 18, 2025

Introduction

The most recent NeurIPS conference took place in Vancouver, BC and I had the pleasure of attending with a large cohort from Isomorphic Labs (although I was a few minutes too late for the all-important team photo!).

NeurIPS 2024, Vancouver, BC

One of the major themes of the conference was the proliferation of diffusion, which, assuming things continue on their current trajectory, will soon account for over 100% of NeurIPS papers. With diffusion being one of the key architectural changes introduced in AlphaFold 3 [1], I was keen to take the opportunity to meet many others working in the area and review some of the emerging trends. This post summarises what I found but only offers a glimpse of the diverse works presented at the conference (and is rather biased by what I attended). Apologies if I didn’t include your work, and please do let me know what I’ve missed!

This blog assumes background knowledge about diffusion models, but if you would benefit from an introduction I recommend the sacred texts, Lilian Weng’s blog post [2] and Calvin Luo’s tutorial [3].

Personal Highlights

Among the works I found particularly interesting was the introduction of autoguidance [4] by the same authors who brought you EDM [5]. In addition to some very snazzy visualisations, they offer a unique insight into why classifier-free guidance improves image quality: not just by boosting prompt alignment but also by guiding sampling towards the core of the data manifold. Moreover, they introduce a technique to achieve this effect in the unconditional setting by guiding sampling against the score induced by a ‘bad’ version of a strong model.
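
To make the mechanism concrete, here is a minimal sketch of the guidance rule, assuming a toy interface in which each denoiser maps a noisy sample and noise level to a denoised estimate (the function names are my own, not the paper’s code):

```python
def autoguided_denoise(denoise_strong, denoise_bad, x_t, sigma, w=2.0):
    """Combine a strong denoiser with a deliberately degraded version of itself.

    Analogous to classifier-free guidance: w = 1 recovers the strong model, while
    w > 1 extrapolates away from the errors that the 'bad' model makes.
    """
    d_strong = denoise_strong(x_t, sigma)
    d_bad = denoise_bad(x_t, sigma)
    return d_bad + w * (d_strong - d_bad)
```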

Also noteworthy was TFG [6] which introduces a framework to unify Training-free Guidance approaches, such as Universal Guidance [7]. Such approaches become instances of TFG in a hyperparameter subspace. These methods allow you to guide a pretrained diffusion model using a predictor, but unlike classifier guidance, you need only define the predictor in the clean data space. The authors introduce a mechanism to search the hyperparameter space for different tasks and demonstrate improved performance across 16 task types compared to previous training-free guidance approaches.
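
As a rough illustration of the recipe such methods share (not TFG’s actual algorithm, which additionally searches its hyperparameter space), a single guided sampler update might look like the sketch below, with all interfaces assumed:

```python
import torch

def training_free_guided_step(denoiser, predictor, sampler_step, x_t, sigma, scale=1.0):
    """One guided update using a predictor defined on clean data only.

    denoiser(x_t, sigma)             -> x0_hat, an estimate of the clean sample.
    predictor(x0_hat)                -> differentiable score, e.g. log p(y | x0), returning a scalar per sample.
    sampler_step(x_t, x0_hat, sigma) -> the unguided sampler update (e.g. a DDIM step).
    """
    x_t = x_t.detach().requires_grad_(True)
    x0_hat = denoiser(x_t, sigma)                  # the predictor never sees noisy inputs
    grad = torch.autograd.grad(predictor(x0_hat).sum(), x_t)[0]
    x_prev = sampler_step(x_t.detach(), x0_hat.detach(), sigma)
    return x_prev + scale * grad                   # nudge the sample towards the predictor's preference
```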

Finally, while it’s adamantly not a diffusion paper, this year’s best paper (setting aside the soap opera) introduces Visual AutoRegressive modeling [8]. This is a novel image generation approach which offers comparable generation quality to diffusion but with significantly reduced inference times. The technique hinges on applying autoregressive decoding, not pixel by pixel, but from low resolution to high resolution. A neat aspect of this work is the demonstration of power-law scaling laws for VAR, mirroring those observed for LLMs, which have yet to be shown for diffusion. Could be one to watch in case we all need to pivot.
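
A very loose sketch of that coarse-to-fine decoding loop, purely to show where the autoregression happens (the interfaces are hypothetical):

```python
def var_style_generate(next_scale_model, decode_tokens, scales=(1, 2, 4, 8, 16)):
    """Autoregress over resolutions rather than pixels.

    next_scale_model(context, scale) predicts the whole token map at `scale`,
    conditioned on the token maps of all coarser scales generated so far;
    decode_tokens turns the accumulated multi-scale token maps into an image.
    """
    context = []
    for scale in scales:
        tokens = next_scale_model(context, scale)  # one forward pass per scale, not per pixel
        context.append(tokens)                     # later scales see everything coarser
    return decode_tokens(context)
```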

Classifier-free guidance and autoguidance guide sampling towards the core of the data manifold. (Image source: Karras et al. 2024 [4])

Data-to-data Translation

Traditional diffusion sampling commences from Gaussian noise samples. Instead, data-to-data translation aims to map from and to arbitrary data distributions. While not papers exactly, there were a couple of excellent talks in this direction worth mentioning.

Flow Matching has become the go-to technique for this problem in contrast to diffusion (although the two are more closely related than you might hear [9]). Yaron Lipman and his team from FAIR presented a comprehensive introduction to Flow Matching, along with a superb accompanying guide complete with code examples [10].
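
For readers who want the one-line version of the idea, a minimal (straight-path) Flow Matching training loss looks roughly like this; it follows the standard formulation rather than any specific code from the guide:

```python
import torch

def flow_matching_loss(velocity_model, x0, x1):
    """Conditional Flow Matching with straight-line paths from x0 (source) to x1 (target).

    velocity_model(x_t, t) is trained to predict the constant velocity x1 - x0
    of the linear interpolation path at a randomly sampled time t in [0, 1].
    """
    t = torch.rand(x1.shape[0], *([1] * (x1.dim() - 1)))  # one time per sample, broadcastable
    x_t = (1 - t) * x0 + t * x1                            # point on the straight path
    target = x1 - x0                                       # velocity of that path
    pred = velocity_model(x_t, t)
    return torch.mean((pred - target) ** 2)
```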

Returning to diffusion, Arnaud Doucet presented an overview of Schrödinger Bridges (so-called because Erwin Schrödinger developed a rudimentary form of iterative proportional fitting in 1931), which bring unpaired data-to-data translation to diffusion. Schrödinger Bridges are stochastic paths between data distributions which minimise kinetic energy, approximating optimal transport. Such bridges can be learnt using an algorithm called Iterative Markovian Fitting (NeurIPS 2023) [11]. Unfortunately this is prohibitively expensive as each iteration requires retraining a diffusion model and resampling data from it. Conveniently, Doucet has a solution in his NeurIPS 2024 paper which introduces α-Iterative Markovian Fitting [12] to allow you to fine-tune existing Brownian Bridge models to create Schrödinger Bridges.

Bridge Matching generalises Flow Matching with a stochastic process (Brownian Bridge). Schrödinger Bridges are Brownian Bridges which come closer to achieving optimal transport. (Image source: Doucet 2024)

Discrete Generation

Although diffusion models have emerged as a powerful generative framework for continuous data such as images, their application to discrete data, such as language, remains limited and rather complex. Autoregressive (AR) models are still dominant despite being limited by sequential sampling and by the need to impose an ordering, which is unnatural for some data types.

MD4 [13] from Google DeepMind addresses this by introducing a simplified and general framework for masked diffusion models (or ‘absorbing diffusion’). In the forward process, such models gradually introduce noise by replacing ground truth tokens with a mask according to a schedule, and the model learns to denoise. The authors demonstrate that the reverse generation process follows a simple logic and that the loss can be simplified to an integral of weighted cross-entropy losses. Remarkably, this makes training similar to BERT [14] and more numerically stable. The resultant models outperform previous discrete diffusion approaches and, on some tasks, surpass the best AR models for a given model size.

Masked diffusion models gradually replace ground truth tokens with a mask and the model learns to denoise. (Image source: Shi et al. 2024 [13])
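
A toy version of that training recipe, assuming a simple linear masking schedule and a model that returns per-token logits; the weighting is illustrative of the paper’s weighted cross-entropy form, not its exact objective:

```python
import torch
import torch.nn.functional as F

def masked_diffusion_loss(model, tokens, mask_id):
    """Masked ('absorbing') diffusion training step with a linear schedule alpha(t) = 1 - t.

    Each token survives unmasked at time t with probability alpha(t); the model is
    trained to recover the original tokens at the masked positions, with a per-time
    weight in the spirit of the paper's weighted cross-entropy objective.
    """
    batch, seq_len = tokens.shape
    t = torch.rand(batch, 1).clamp(min=1e-3)           # one time per sequence
    keep_prob = 1.0 - t                                 # alpha(t) for the linear schedule
    masked = torch.rand(batch, seq_len) > keep_prob     # which positions get absorbed
    noisy = torch.where(masked, torch.full_like(tokens, mask_id), tokens)

    logits = model(noisy)                               # (batch, seq_len, vocab)
    ce = F.cross_entropy(logits.transpose(1, 2), tokens, reduction="none")
    weight = 1.0 / t                                    # |alpha'(t)| / (1 - alpha(t)) for this schedule
    return (weight * ce * masked.float()).sum() / masked.float().sum().clamp(min=1.0)
```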

On the Flow Matching side (which will be entirely ignored for the rest of this blog post, my apologies), Discrete Flow Matching [15] was introduced by Lipman’s group. This, you guessed it, extends the Flow Matching framework to discrete data. I’m much less well read on the Flow Matching literature, but it appears to be a natural extension of the continuous setup and, notably, the generating probability velocity for Discrete Flow Matching has an identical form to its continuous counterpart.

Unconditional Generation

In addition to the aforementioned autoguidance, The Return of Unconditional Generation [16] introduces Representation-Conditioned Generation (RCG). In order to do unconditional generation, this framework applies conditional generation to produce the final image. The trick is that they sample the conditioning information itself in an unconditional way, using a representation generator which maps noise to the distribution of representations produced by a pretrained encoder. For the conditional image generator they use a variety of diffusion models, but also show that the benefit of this technique extends to MAGE [17], a transformer-based generative model using iterative decoding. This approach significantly improves unconditional generation quality and achieves a new state-of-the-art FID on ImageNet 256×256.

Representation-Conditioned Generation (RCG) unconditionally generates representations which are input to a conditional diffusion model. (Image source: Li et al. 2024 [16])
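
The pipeline itself is compact. Here is a sketch under assumed interfaces (a representation generator trained to mimic a frozen encoder’s outputs, plus any conditional image generator):

```python
import torch

def rcg_sample(representation_generator, conditional_image_generator, n, rep_dim):
    """Unconditional generation via an unconditionally generated condition.

    representation_generator(z)      -> a representation resembling a pretrained encoder's output.
    conditional_image_generator(rep) -> an image conditioned on that representation
                                        (e.g. a conditional diffusion model or MAGE-style decoder).
    """
    z = torch.randn(n, rep_dim)
    rep = representation_generator(z)          # stage 1: sample the condition itself
    return conditional_image_generator(rep)    # stage 2: ordinary conditional generation
```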

Image Editing

There were a number of fun papers looking at image editing from various angles.

TextCtrl [18] focuses on the task of controllably modifying the text in a visual scene while preserving the original style. They achieve this by constructing a text style encoder disentangled into various features of the text and a text glyph structure encoder. These encoders are used to guide generation. This work also introduces the ScenePair dataset for benchmarking this task.

LOCO Edit [19] analyses the semantic subspaces of pretrained models and identifies interesting editing directions. They apply these directions with a mask to do training-free local image editing based on DDIM Inversion¹.

Schedule Your Edit [20] takes a closer look at error accumulation in DDIM Inversion which harms content preservation and edit fidelity. They identify a singularity problem at the beginning of the inversion process in traditional noise schedules and attempt to ameliorate this by proposing a ‘logistic’ noise schedule.

The singularity problem with traditional noise schedules. (Image source: Lin et al. 2024 [20])

The logistic schedule avoids this problem. (Image source: Lin et al. 2024 [20])

Training-free Methods

There were a great many works using training-free methods (something of a catch-all buzzword at this point), which promise to save you compute by exploiting pretrained models without extra fine-tuning.

In addition to TFG, which I highlighted above, Ctrl-X [21] is one such method. This work introduces a mechanism for controllable generation using text-to-image diffusion models without retraining (expensive and data hungry) or guidance (sensitive to sampling hyperparameters and with high inference cost). Two input images, one for structure conditioning and one for appearance conditioning, are partially noised and fed into the pretrained denoiser. They use clever copying techniques based on previous findings in the literature (e.g., diffusion features [22]) to inject information from each into a copy of the denoiser used to produce the final output.

Trust Sampling [23] extends previous works which introduce constraints by guiding sampling with a loss term. At each diffusion step, they add support for multi-step optimisation of the constraint loss and early termination when the sample starts to wander away from the state manifold (as measured by the denoiser magnitude).
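
My loose reading of that inner loop, with assumed interfaces and a crude proxy for the trust test (the paper’s actual criterion is defined more carefully):

```python
import torch

def trust_constrained_step(denoiser, constraint_loss, x_t, sigma, max_inner=5, lr=0.1, trust=1.0):
    """Take several constraint-gradient steps at one diffusion step, stopping early
    when the denoiser's predicted correction grows too large (low 'trust').

    constraint_loss(x0_hat) must return a scalar measuring constraint violation.
    """
    for _ in range(max_inner):
        x_t = x_t.detach().requires_grad_(True)
        x0_hat = denoiser(x_t, sigma)                        # current estimate of the clean sample
        grad = torch.autograd.grad(constraint_loss(x0_hat), x_t)[0]
        x_t = x_t.detach() - lr * grad                       # move towards satisfying the constraint
        with torch.no_grad():
            correction = (denoiser(x_t, sigma) - x_t).norm() / x_t.norm()
        if correction > trust:                               # sample appears to have left the manifold
            break
    return x_t
```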

FIFO-Diffusion [24] brings infinite length video support to existing video diffusion models in a training-free manner. The trick is to move from chunked autoregressive generation, which is how the model was trained, to a first-in-first-out system. In the training setup, one contiguous chunk of frames is denoised to completion before moving on to the next chunk. With FIFO-Diffusion, each frame in the batch is at a different noise level and consecutive denoiser invocations introduce a new frame in a sliding window pattern. As a result, later frames are conditioned on better-resolved earlier frames. This causes a train-inference distribution shift which the authors address by introducing ‘latent partitioning’.

Comparison between chunked autoregressive generation and FIFO-Diffusion. Black represents random noise, white represents clean generated images, and grey are intermediate latents. (Image source: Kim et al. 2024 [24])
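
A structural sketch of that queue, assuming a denoiser that advances every latent in the window by one noise level per call; warm-up and the paper’s latent partitioning are omitted:

```python
from collections import deque
import torch

def fifo_generate(denoise_one_level, sample_noise, window=16, total_frames=64):
    """Sliding-window video generation: each frame in the queue sits at a different
    noise level, the front frame leaves once (nearly) clean, and fresh noise joins the back."""
    levels = torch.linspace(1.0 / window, 1.0, window)      # front = low noise, back = pure noise
    queue = deque(sample_noise() for _ in range(window))    # toy initialisation (no warm-up)
    frames = []
    for _ in range(total_frames):
        latents = torch.stack(list(queue))
        latents = denoise_one_level(latents, levels)        # one denoising step for every frame
        queue = deque(latents.unbind(0))
        frames.append(queue.popleft())                      # emit the most-denoised frame
        queue.append(sample_noise())                        # a brand new noisy frame enters
    return frames
```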

Controlling Input Noise

Numerous works this year looked at controlling the noise input which is used as the starting point to the reverse diffusion process.

ReNO (Reward-based Noise Optimization) [25] is one such work which optimises the initial noise in text-to-image models using human preference reward models, such as ImageReward. To sidestep exploding/vanishing gradients caused by backpropagating through the denoising process, the authors focus on distilled, one-step models [26]. With only a few gradient optimisation iterations, they show this approach can outperform multi-step models like SDXL.
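
A sketch of that optimisation loop, with an assumed one-step generator and reward model interface (the latent shape is purely illustrative):

```python
import torch

def reward_optimised_noise(one_step_generator, reward_model, prompt, steps=20, lr=0.05):
    """Optimise the initial noise of a distilled one-step text-to-image model
    to maximise a differentiable human-preference reward."""
    noise = torch.randn(1, 4, 64, 64, requires_grad=True)   # illustrative latent shape
    optimiser = torch.optim.Adam([noise], lr=lr)
    for _ in range(steps):
        image = one_step_generator(noise, prompt)            # short graph: a single forward pass
        loss = -reward_model(image, prompt)                   # maximise the reward
        optimiser.zero_grad()
        loss.backward()
        optimiser.step()
    return one_step_generator(noise.detach(), prompt)
```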

Blind Image Restoration via Fast Diffusion Inversion [27] presents a method for image restoration based on noise optimisation. Starting from noise, they generate a clean image, compare it to the provided degraded image (using some predicted degradation parameters) and invert the DDIM process. This provides gradients to tweak the original noise to produce a better match at t = 0. This must be repeated several times, making the process expensive, a problem they address by using large sampling step sizes. By only modifying the noise and not intermediate latents, they ensure restored images lie on the data manifold.

Immiscible Diffusion [28] seeks to accelerate diffusion model training by controlling the input noise such that different clean data modes correspond to different regions of the input noise distribution. At training time, they sample a batch of Gaussian noise and a batch of target images, then form image-noise pairs so as to minimise the total pairwise distance across the batch. Evaluating this technique on a few small datasets, they observe a 1.2x to 3x speed-up.

Immiscible forward diffusion, by analogy with the immiscible diffusion of fluids in physics. (Image source: Li et al. 2024 [28])
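
A minimal version of that batch-level assignment using a Hungarian-style solver; the helper below is my own illustration of the idea, not the paper’s code:

```python
import torch
from scipy.optimize import linear_sum_assignment

def immiscible_noise_assignment(images, noise):
    """Reorder a batch of Gaussian noise so each image is paired with a nearby noise
    sample, minimising the total image-noise distance across the batch."""
    cost = torch.cdist(images.flatten(1), noise.flatten(1))      # pairwise L2 distances
    _, cols = linear_sum_assignment(cost.detach().cpu().numpy()) # optimal assignment
    return noise[torch.as_tensor(cols, device=noise.device)]

# Illustrative use inside a training step (the forward-process helper is hypothetical):
# noise = immiscible_noise_assignment(x0, torch.randn_like(x0))
# x_t   = add_noise(x0, noise, t)
```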

Image Diffusion to Video Diffusion

A final theme I picked up on at NeurIPS was the application of image diffusion models for video generation. This has become important in part because state-of-the-art video diffusion models are not generally publicly available and in part due to the prohibitive cost of training and running such models.

For instance, Warped Diffusion [29] from NVIDIA combines the idea of controlling input noise with that of extending image diffusion models to video. To introduce video support, they warp the noise between consecutive frames using transformations derived from optical flow. To make this work, they fine-tune models (also using SDXL) on noise drawn from Gaussian processes and introduce an equivariance self-guidance to ensure that the generated frames are consistent under the transformation.

MotionCraft [30] was another interesting work in this vein which introduces motion to Stable Diffusion by exploiting the fact that the latent space is image-like. They apply a masked velocity field from a physics simulation to warp this space directly and use the model to denoise which results in fewer artefacts than applying the simulation directly to a clean image.

COrrespondence-guided Video Editing (COVE) [31] introduces a technique to use text-to-image diffusion models for video editing. As in image editing, the authors apply DDIM Inversion and then denoise with a text prompt. To extend this to videos, the authors extract diffusion features [22] from the noisy latents representing video frames and use them to measure the semantic correspondence of tokens in adjacent frames. At each timestep of inversion and denoising, the tokens in the noisy latent are sampled using the computed correspondence and merged [32], before self-attention is conducted on the merged tokens.

Video editing using a text-to-image diffusion model with COVE. (Image source: Wang et al. 2024 [31])

Wrap Up

And that’s everything I can remember that happened at NeurIPS. Thanks to all the kind people who explained their work to me; it was great to meet so many of you at the diffusion meetup. And lastly, thanks to Sebastian Ruder, whose idea of writing conference summary blog posts I shamelessly ripped off, and to Sander Dieleman, whose excellent diffusion-focussed blog motivated me to start this one.

What was your favourite work? What did I miss? Let me know on Twitter (DMs open if you’re shy).

Let me know what I missed! See, I told you there would be memes. (Image source: Hamilton Wiki)

Citation

Cited as:

Bambrick, Joshua. (Jan 2025). NeurIPS 2024: Diffusion Themes and Memes, Joshua Bambrick’s Blog. https://joshbambrick.com/blog/posts/neurips-2024/.

Or

@article{bambrick2025neurips2024,
  title   = "NeurIPS 2024: Diffusion Themes and Memes",
  author  = "Bambrick, Joshua",
  journal = "Joshua Bambrick's Blog",
  year    = "2025",
  month   = "Jan",
  url     = "https://joshbambrick.com/blog/posts/neurips-2024/"
}

References

1. Abramson J, Adler J, Dunger J, et al (2024) Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature
2. Weng L (2021) What are diffusion models? Lil’Log
3. Luo C (2022) Understanding diffusion models: A unified perspective. arXiv preprint arXiv:2208.11970
4. Karras T, Aittala M, Kynkäänniemi T, Lehtinen J, Aila T, Laine S (2024) Guiding a diffusion model with a bad version of itself. Neural Information Processing Systems (NeurIPS)
5. Karras T, Aittala M, Aila T, Laine S (2022) Elucidating the design space of diffusion-based generative models. Neural Information Processing Systems (NeurIPS)
6. Ye H, Lin H, Han J, et al (2024) TFG: Unified training-free guidance for diffusion models. Neural Information Processing Systems (NeurIPS)
7. Bansal A, Chu H-M, Schwarzschild A, et al (2023) Universal guidance for diffusion models. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
8. Tian K, Jiang Y, Yuan Z, Peng B, Wang L (2024) Visual autoregressive modeling: Scalable image generation via next-scale prediction. Neural Information Processing Systems (NeurIPS)
9. Gao R, Hoogeboom E, Heek J, De Bortoli V, Murphy KP, Salimans T (2024) Diffusion meets flow matching: Two sides of the same coin
10. Lipman Y, Havasi M, Holderrieth P, et al (2024) Flow matching guide and code. arXiv preprint arXiv:2412.06264
11. Shi Y, De Bortoli V, Campbell A, Doucet A (2023) Diffusion Schrödinger bridge matching. Neural Information Processing Systems (NeurIPS)
12. De Bortoli V, Korshunova I, Mnih A, Doucet A (2024) Schrödinger bridge flow for unpaired data translation. Neural Information Processing Systems (NeurIPS)
13. Shi J, Han K, Wang Z, Doucet A, Titsias MK (2024) Simplified and generalized masked diffusion for discrete data. Neural Information Processing Systems (NeurIPS)
14. Devlin J, Chang M-W, Lee K, Toutanova K (2019) BERT: Pre-training of deep bidirectional transformers for language understanding. North American Chapter of the Association for Computational Linguistics (NAACL)
15. Gat I, Remez T, Shaul N, et al (2024) Discrete flow matching. Neural Information Processing Systems (NeurIPS)
16. Li T, Katabi D, He K (2024) Return of unconditional generation: A self-supervised representation generation method. Neural Information Processing Systems (NeurIPS)
17. Li T, Chang H, Mishra S, Zhang H, Katabi D, Krishnan D (2023) MAGE: Masked generative encoder to unify representation learning and image synthesis. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
18. Zeng W, Shu Y, Li Z, Yang D, Zhou Y (2024) TextCtrl: Diffusion-based scene text editing with prior guidance control. Neural Information Processing Systems (NeurIPS)
19. Chen S, Zhang H, Guo M, Lu Y, Wang P, Qu Q (2024) Exploring low-dimensional subspaces in diffusion models for controllable image editing. Neural Information Processing Systems (NeurIPS)
20. Lin H, Wang M, Wang J, et al (2024) Schedule your edit: A simple yet effective diffusion noise schedule for image editing. Neural Information Processing Systems (NeurIPS)
21. Lin KH, Mo S, Klingher B, Mu F, Zhou B (2024) Ctrl-X: Controlling structure and appearance for text-to-image generation without guidance. Neural Information Processing Systems (NeurIPS)
22. Tang L, Jia M, Wang Q, Phoo CP, Hariharan B (2023) Emergent correspondence from image diffusion. Neural Information Processing Systems (NeurIPS)
23. Huang W, Jiang Y, Van Wouwe T, Liu CK (2024) Constrained diffusion with trust sampling. Neural Information Processing Systems (NeurIPS)
24. Kim J, Kang J, Choi J, Han B (2024) FIFO-Diffusion: Generating infinite videos from text without training. Neural Information Processing Systems (NeurIPS)
25. Eyring L, Karthik S, Roth K, Dosovitskiy A, Akata Z (2024) ReNO: Enhancing one-step text-to-image models through reward-based noise optimization. Neural Information Processing Systems (NeurIPS)
26. Luhman E, Luhman T (2021) Knowledge distillation in iterative generative models for improved sampling speed. arXiv preprint arXiv:2101.02388
27. Chihaoui H, Lemkhenter A, Favaro P (2024) Blind image restoration via fast diffusion inversion. Neural Information Processing Systems (NeurIPS)
28. Li Y, Jiang H, Kodaira A, Tomizuka M, Keutzer K, Xu C (2024) Immiscible diffusion: Accelerating diffusion training with noise assignment. Neural Information Processing Systems (NeurIPS)
29. Daras G, Nie W, Kreis K, et al (2024) Warped diffusion: Solving video inverse problems with image diffusion models. Neural Information Processing Systems (NeurIPS)
30. Aira LS, Montanaro A, Aiello E, Valsesia D, Magli E (2024) MotionCraft: Physics-based zero-shot video generation. Neural Information Processing Systems (NeurIPS)
31. Wang J, Ma Y, Guo J, Xiao Y, Huang G, Li X (2024) COVE: Unleashing the diffusion feature correspondence for consistent video editing. Neural Information Processing Systems (NeurIPS)
32. Bolya D, Fu C-Y, Dai X, Zhang P, Feichtenhofer C, Hoffman J (2022) Token merging: Your ViT but faster. arXiv preprint arXiv:2210.09461

Footnotes

  1. DDIM Inversion is going to come up a lot. The idea is to edit an image by starting from a clean image, repeatedly calling the denoiser and stepping in the opposite direction to obtain a partially noised latent. Running DDIM, which applies deterministic backwards sampling (ODE), starting from this latent should approximately regenerate the original sample. We can modify this sampling with our editing information to obtain a similar but modified image.↩︎
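
For concreteness, here is a minimal sketch of that inversion loop under the usual ε-prediction parameterisation (the interfaces are illustrative, not any particular paper’s code):

```python
def ddim_invert(predict_noise, x0, alphas_cumprod):
    """Run the deterministic DDIM update in reverse, from a clean image towards noise.

    alphas_cumprod: 1-D tensor of cumulative alpha-bar values, close to 1 at t = 0 and
    close to 0 at the final step; predict_noise(x, t) is the trained epsilon network.
    """
    x = x0
    for t in range(len(alphas_cumprod) - 1):                 # walk clean -> noisy
        a_t, a_next = alphas_cumprod[t], alphas_cumprod[t + 1]
        eps = predict_noise(x, t)
        x0_hat = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()   # current clean estimate
        x = a_next.sqrt() * x0_hat + (1 - a_next).sqrt() * eps  # DDIM step run in reverse
    return x                                                 # latent that approximately regenerates x0
```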