Flow matching has emerged as a powerful generative framework, with recent few-step methods achieving remarkable inference acceleration. However, we identify a critical yet overlooked limitation: these models suffer from severe diversity degradation, concentrating samples on dominant modes while neglecting rare but valid variations of the target distribution. We trace this degradation to averaging distortion: when trained with MSE objectives, class-conditional flows learn a frequency-weighted mean over intra-class sub-modes, causing the model to over-represent high-density modes while systematically neglecting low-density ones.
To address this, we propose SubFlow (Sub-mode Conditioned Flow Matching), which eliminates averaging distortion by decomposing each class into fine-grained sub-modes via semantic clustering and conditioning the flow on sub-mode indices. Each conditioned sub-distribution is approximately unimodal, so the learned flow targets individual modes accurately, restoring full mode coverage in a single inference step. Crucially, SubFlow is entirely plug-and-play: it integrates seamlessly into existing one-step models such as MeanFlow and Shortcut Models without any architectural modifications. Extensive experiments on ImageNet-256 show that SubFlow yields substantial gains in generation diversity (Recall) while maintaining competitive image quality (FID), confirming its broad applicability across one-step generation frameworks.
We provide an interactive Colab notebook to walk you through the entire SubFlow pipeline on a 2D toy example. The tutorial covers standard flow matching, MeanFlow, and SubFlow — you can train the models and visualize mode collapse vs. diversity recovery in real time.
The core insight of SubFlow is that dominant-mode bias arises because the conditional mean velocity \(\mathbb{E}[x_1 - x_0 \mid x_t, t, c]\) must average over all sub-modes within class \(c\). By further conditioning on a sub-mode index \(k\), each sub-distribution \((c, k)\) becomes approximately unimodal, and the conditional mean accurately points to a specific mode with no averaging distortion:
\[ v_\theta^*(x_t, t, c, k) = \mathbb{E}[x_1 - x_0 \mid x_t, t, c, k] \]
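Averaging distortion can be seen in closed form on a 1-D toy (a minimal sketch in the spirit of the Colab tutorial, not from the paper): with two intra-class sub-modes of unequal frequency, the MSE-optimal target is their frequency-weighted mean, which lies between the modes, while conditioning on a sub-mode index recovers each mode exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy class "c" with two intra-class sub-modes on the real line:
# a dominant mode at +2 (80% of samples) and a rare mode at -2 (20%).
x1 = np.where(rng.random(100_000) < 0.8, 2.0, -2.0)

# The MSE-optimal velocity target is the conditional mean.  Marginalising
# over sub-modes gives the frequency-weighted mean, which points BETWEEN
# the modes rather than at either of them:
class_mean = x1.mean()            # ~= 0.8 * 2 + 0.2 * (-2) = 1.2

# Conditioning on the sub-mode index k makes each target unimodal,
# so the conditional mean lands exactly on each mode:
mean_k0 = x1[x1 > 0].mean()       # ~= +2.0 (dominant sub-mode)
mean_k1 = x1[x1 < 0].mean()       # ~= -2.0 (rare sub-mode)
```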
SubFlow consists of three simple steps: (a) Offline pre-processing: extract semantic features (e.g., DINOv3) and cluster within each class to obtain sub-mode assignments; (b) Training: optimize the vector field \(v_\theta(x_t, t, c, k)\) conditioned on both class and sub-mode; (c) Inference: sample \(k \sim p(k \mid c)\) and generate with the sub-mode-conditioned flow.
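The three steps above can be sketched as follows. This is a minimal NumPy illustration with stand-ins: a tiny k-means replaces the real feature clustering (the paper uses DINOv3 features), the training step is only noted in a comment, and all names are illustrative rather than the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# (a) Offline pre-processing: cluster pre-extracted semantic features
# within one class into K sub-modes (tiny k-means as a stand-in).
def kmeans_assign(feats, K, iters=20):
    centers = feats[rng.choice(len(feats), K, replace=False)]
    for _ in range(iters):
        assign = np.argmin(((feats[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for k in range(K):
            if (assign == k).any():
                centers[k] = feats[assign == k].mean(0)
    return assign

# Two well-separated feature blobs standing in for intra-class sub-modes.
feats = np.concatenate([rng.normal(-3, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
k_idx = kmeans_assign(feats, K=2)

# (b) Training would optimize v_theta(x_t, t, c, k) with the usual
# flow-matching MSE loss, now conditioned on (c, k) instead of c alone.

# (c) Inference: sample k from the empirical prior p(k | c), then
# generate with the sub-mode-conditioned flow.
counts = np.bincount(k_idx, minlength=2)
p_k_given_c = counts / counts.sum()
k_sample = rng.choice(2, p=p_k_given_c)
```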
Overview of SubFlow. (a) Offline pre-processing: semantic features are extracted and clustered within each class. (b) Training: the vector field is optimized with class and sub-mode conditioning. For CFG, only the class label is dropped while sub-mode index is always retained. (c) Inference: a sub-mode index is sampled from the empirical prior, and the conditioned vector field generates a sample.
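The CFG conditioning rule in (b) might look like the following sketch: during training, only the class label is replaced by a null token, while the sub-mode index is always kept. `NULL_CLASS` and `DROP_PROB` are illustrative names and values, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

NULL_CLASS = 1000   # illustrative null token for an ImageNet-like label space
DROP_PROB = 0.1     # assumed CFG label-dropout rate (not specified here)

def cfg_condition(c, k, train=True):
    """Classifier-free-guidance conditioning: drop only the class label.

    The sub-mode index k is always retained, so even the 'unconditional'
    branch stays sub-mode conditioned, as described in the overview.
    """
    if train and rng.random() < DROP_PROB:
        return NULL_CLASS, k   # class dropped, sub-mode kept
    return c, k

c_out, k_out = cfg_condition(c=207, k=3)
```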
Qualitative comparison between MeanFlow (left) and MeanFlow + SubFlow (right) on ImageNet-256. Green boxes highlight samples where SubFlow produces visibly higher image quality with sharper details and fewer artifacts.
First column: MeanFlow output from a fixed noise \(x_0\). Columns 2–6: MeanFlow + SubFlow with different sub-mode indices \(k\). Despite sharing the same noise, the generated images exhibit clearly distinct visual styles.
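The fixed-noise experiment can be mimicked with a toy one-step sampler. Here `v_theta` is an illustrative stand-in for the trained sub-mode-conditioned field (not the actual model); the point is only that the same noise `x0` paired with different sub-mode indices `k` yields distinct outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

def one_step_sample(v_theta, x0, c, k):
    # One-step flow inference over the unit time interval: x1 = x0 + v * 1.
    return x0 + v_theta(x0, 0.0, c, k)

# Illustrative stand-in for a trained sub-mode-conditioned velocity field:
# each sub-mode index steers the sample toward a distinct target.
def v_theta(x, t, c, k):
    return np.full_like(x, float(k))

x0 = rng.normal(size=4)  # fixed noise shared across sub-modes
samples = {k: one_step_sample(v_theta, x0, c=207, k=k) for k in range(3)}
```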
@article{lin2026subflow,
title={SubFlow: Sub-mode Conditioned Flow Matching for Diverse One-Step Generation},
author={Yexiong Lin and Jia Shi and Shanshan Ye and Wanyu Wang and Yu Yao and Tongliang Liu},
year={2026},
eprint={2604.12273},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2604.12273}
}