Flow matching trains a velocity field $v_\theta(x, t)$ to transport a simple base distribution to data along a path. The trouble, in its naive form, is that the paths can be curved — and curved paths are expensive to integrate and high-variance to learn from. Optimal transport fixes this beautifully.
The setup
Pick a base sample $x_0 \sim \mathcal{N}(0, I)$ and a data sample $x_1$, and interpolate linearly:
The velocity along this straight interpolant is constant, $x_1 - x_0$, so the flow-matching objective is just a regression:
Where the coupling matters
Here's the subtlety: how do we pair $x_0$ with $x_1$? If we pair them independently, different $(x_0, x_1)$ interpolants cross, and the marginal velocity the network must learn becomes curved and high-variance.
Instead, draw the pair from an optimal-transport coupling $\pi_{\text{OT}}$. Within a minibatch this is cheap — and in one dimension it's exact: just sort and rank-pair. OT coupling minimizes the expected squared displacement, so on average the straight interpolants don't cross:
The payoff is concrete. Straighter paths mean:
- fewer integration steps at inference — four Euler steps can be enough;
- lower-variance targets during training;
- and a cleaner link to the theory of Wasserstein geodesics.
Inference under the right loss
One last trick I like. If your metric is mean absolute error, the optimal point estimate is the conditional median, not the mean. So at inference I sample $K$ noise seeds, integrate the ODE $\dot{x} = v_\theta(x, t)$ for each, and take the per-coordinate median:
where $\Phi_\theta$ is the learned flow map. It's a small thing, but it's the statistically correct way to turn a generative model into a point predictor — and it's exactly what made the difference in my Padova traffic project.
That's the whole idea: let optimal transport choose the geometry, and let the loss choose the estimator.