Diffusion models have recently become a practical mechanism for sampling long-horizon action sequences and trajectories, and they have been adopted as planning primitives in robotics and control. The typical deployment pattern is model-predictive control (MPC): given a current context (e.g., the observed start state and a goal), we sample a batch of candidate trajectories from a learned conditional generator and execute the first action of the best-scoring sample under a known task reward. This approach is attractive in offline settings because it avoids explicit system identification and can exploit the expressivity of generative models to represent multimodal long-horizon behavior.
However, the offline regime imposes a structural limitation that is particularly acute for diffusion-based planners. The dataset induces an empirical context distribution μ𝒟, and at test time we are often required to act under contexts c ∼ μtest that differ systematically from those in supp(μ𝒟). In robotics this mismatch is not merely covariate shift in the ambient state space; it arises from structured missingness: there may be entire ``holes’’ of start–goal pairs, or entire regions of visited states, absent from the data. When diffusion planners are trained only by conditional score matching, they can succeed spectacularly on covered contexts and yet fail sharply on such missing regions ℛ, despite these regions being geometrically close to the dataset. Empirically, the failure is consistent with a representational pathology: the shared representation used by the conditional denoiser can discard conditioning information when it is not needed to reduce training loss on μ𝒟, and the learned reverse process then becomes weakly conditional or effectively unconditional precisely where conditioning is required for generalization.
We study this phenomenon in the setting of goal-conditioned offline planning. We treat the environment only through an offline dataset of labeled trajectories and assume no further interaction during learning. Our focus is on planners that operate by sampling a length-H trajectory τ (or its vectorization x = vec(τ)) from a learned conditional law p̂(⋅ ∣ c) and selecting among N i.i.d. samples by maximizing a known return functional R(τ, c). Our guiding question is the following: under what conditions can a diffusion-based conditional trajectory model generalize across structured dataset holes, and how does conditional generation error translate into planning suboptimality for MPC?
Our main algorithmic contribution is Reconstruct-and-Plan (RAP): a conditional trajectory
diffusion model augmented with an auxiliary context reconstruction
objective applied at every diffusion time step. Concretely, we maintain
a shared trunk fϕ that consumes
(xt, c, t)
and produces a representation z, and we attach two heads: a
diffusion head fθ predicting
the forward noise (equivalently the conditional score), and a
reconstruction head fψ predicting
the context c (or its
components such as s0 and g). We train on the combined
objective
∥ε − ε̂∥₂² + λ ∥c − ĉ∥₂²,
with (t, ε) sampled
as in standard denoising diffusion training. The reconstruction term
does not modify the inference-time sampling recursion; rather, it
constrains the representation to retain sufficient information about
c across noise scales. In this
sense RAP is not a new planner at inference time but a modification of
the learned conditional generator that targets generalization under
missing contexts.
Our theoretical contributions formalize the above intuition through a chain of implications from geometric regularity in context space to conditional generation accuracy and then to planning performance. First, we assume that contexts lie on or near a k-dimensional manifold ℳc and that the true conditional trajectory law p*(⋅ ∣ c) varies smoothly along this manifold in Wasserstein-2 distance, in the sense of an α-Hölder condition. Under these hypotheses, we prove an upper bound on the conditional generation error of RAP that decomposes into three terms: (i) a diffusion estimation term corresponding to score-matching error, (ii) a representation/conditioning term controlled by the reconstruction loss, and (iii) an extrapolation term scaling with the distance δ(c) from a test context to the dataset support. This bound makes explicit that smoothness alone is insufficient if the learned representation discards the conditioning signal, and it identifies reconstruction at all diffusion times as a sufficient mechanism (at the level of population error) to prevent this failure mode.
Second, we provide a minimax lower bound showing that, even under the same manifold-smoothness assumptions, no offline method can achieve worst-case error below the intrinsic-dimension rate Ω(n^{−α/(2α + k)}) for conditional generation in W2, up to constants and secondary terms. This establishes that the favorable scaling in k predicted by our upper bound is not an artifact of the analysis but is information-theoretically unavoidable. Moreover, when test contexts include points separated from supp(μ𝒟) by a fixed distance, an additional term proportional to Lδ^α is unavoidable, matching the structure of the extrapolation term in the upper bound.
Third, we connect conditional generation error to MPC planning performance. Assuming the reward is Lipschitz as a functional of the trajectory (in the vectorized Euclidean metric), we show that the expected gap between the best trajectory one could draw from the true law p*(⋅ ∣ c) and the trajectory chosen by MPC when sampling from p̂(⋅ ∣ c) is bounded by a term proportional to W2(p̂(⋅ ∣ c), p*(⋅ ∣ c)), plus a finite-sampling term ΔN capturing the order-statistics effect of selecting the best among N samples. This theorem makes precise how improvements in conditional generation translate into downstream planning gains, while also highlighting an orthogonal limitation: even a perfect generator may require large N when high-reward trajectories occupy very small probability mass.
Finally, we complement the positive results with an impossibility statement showing that without coverage and smoothness assumptions, purely offline planning over missing regions is fundamentally ill-posed. We construct pairs of conditional trajectory laws that agree on supp(μ𝒟) yet induce incompatible optimal actions on some missing context c ∈ ℛ. Consequently, any offline algorithm is forced to fail on at least one of the two hypotheses, yielding a worst-case success probability bounded away from 1 on ℛ. This result clarifies the role of our manifold proximity assumption: it is not merely a convenient regularity condition but a necessary structural restriction for nontrivial guarantees.
Empirically, we validate these claims on goal-conditioned robotics-style benchmarks with explicitly constructed missing-context and missing-visited-state regimes. Across tasks, RAP improves success and normalized return in the missing regions while matching or improving performance on covered contexts. Ablations varying λ, the diffusion horizon, and the number of denoising steps support the central claim that reconstruction regularizes the conditional representation and thereby reduces failure under structured dataset holes, consistent with the decomposition in our bounds.
We consider goal-conditioned offline planning in a Markov decision
process (MDP) ℳ = (𝒮, 𝒜, P, r, ρ0, γ)
augmented with a goal variable g ∈ 𝒢. A context is c = (s0, g) ∈ 𝒞 ≔ 𝒮 × 𝒢,
and the task reward is a known function r : 𝒮 × 𝒜 × 𝒢 → ℝ. We focus on
finite-horizon planning over length-H trajectories τ = (s0, a0, s1, a1, …, sH)
with (possibly goal-conditioned) trajectory return $R(\tau,c)=\sum_{t=0}^{H-1} r(s_t,a_t,g)$,
including the common sparse-goal specialization $r(s_t,a_t,g)=\mathbbm{1}\{d(s_{t+1},g)\le
\varepsilon\}$ or terminal-only variants. For notational
convenience we also employ a vectorized representation x = vec(τ) ∈ ℝdτ,
which may concatenate states and actions.
We are given an offline dataset $\mathcal{D}=\{(c_i,\tau_i)\}_{i=1}^{n}$, where each τi is a
length-H trajectory segment.
When raw logs contain variable-length episodes, we interpret 𝒟 as the collection of sliding windows of
length H extracted from those
episodes (with the appropriate s0 and goal label). Let
μ𝒟 denote the
empirical distribution over contexts induced by {ci}i = 1n,
and let μtest
denote the (unknown) test-time distribution over contexts. Our goal is
to construct a goal-conditioned planning policy π(a ∣ s, g)
that maximizes expected discounted test return. Operationally, we
implement π via MPC over a
learned conditional trajectory generator: at each environment step we
observe the current context c = (s, g), sample
N candidate trajectories from
a learned conditional law p̂(⋅ ∣ c), score them using
the known R(⋅, c),
and execute the first action of the best-scoring sample.
A defining difficulty in offline planning is that μ𝒟 and μtest may differ in a way that cannot be reduced to small i.i.d. covariate shift. We isolate two forms of structured missingness that are convenient to analyze and to instantiate experimentally.
There exists a region ℛ ⊂ 𝒞 such
that μ𝒟(ℛ) = 0 while μtest(ℛ) > 0,
i.e., certain start–goal pairs never appear in the offline data yet
arise at test time. In goal-conditioned robotics benchmarks this
corresponds, for example, to holding out an entire range of goal
locations, or holding out combinations of start poses and goals. For
analysis it is useful to quantify how far a test context is from covered
contexts; we write δ(c) ≔ dist(c, supp(μ𝒟)),
with respect to a fixed norm on 𝒞
(typically Euclidean after appropriate normalization of state and goal
coordinates).
Even when μ𝒟
provides coverage of start–goal pairs, the dataset may systematically
avoid portions of the state space that are nevertheless required for
successful test-time execution. To formalize this, let ν𝒟 be the empirical state
visitation distribution induced by dataset trajectories (e.g., the
uniform mixture over time indices and trajectories). A missing-visited-state region is a measurable
subset 𝒮miss ⊂ 𝒮 such that
ν𝒟(𝒮miss) = 0
while the optimal (or any high-return) test-time behavior under some
contexts must pass through 𝒮miss with positive probability.
This setting is practically relevant when demonstrations or prior
policies avoid narrow passages, obstacle-adjacent regions, or rare
contact configurations; at test time, however, the goal or perturbations
force traversal of precisely these regions.
While missing visited states is not, on its face, missing contexts, it can often be reduced to it by enriching the context variable to include additional conditioning information (e.g., the current state at replanning time) and observing that MPC repeatedly queries the generator at contexts (st, g) for states st that may lie outside the dataset visitation support. Accordingly, both forms of missingness exert pressure on the conditional generator in essentially the same way: it must produce high-quality trajectories for condition values that were never (or rarely) seen during training.
We separate evaluation into (i) conditional generative modeling fidelity and (ii) downstream planning performance.
For a fixed context c, let
p*(⋅ ∣ c)
denote the (unknown) true conditional trajectory law induced by the
environment and whatever data-collection process generated 𝒟, and let p̂(⋅ ∣ c) denote the learned
conditional generator. We use the Wasserstein-2 metric
ε(c) ≔ W2 (p̂(⋅ ∣ c), p*(⋅ ∣ c)),
computed over the space of vectorized trajectories x = vec(τ) with the
Euclidean cost. This metric is well-aligned with planning under
Lipschitz trajectory rewards, since it controls discrepancies in
low-order moments and admits coupling-based bounds that translate
distribution shift into reward shift. In empirical studies where p*(⋅ ∣ c) is
inaccessible, we approximate ε(c) by two-sample
estimators (e.g., Sinkhorn-regularized W2) using generated
samples and reference trajectories available for the same context (from
held-out rollouts, a simulator, or an auxiliary data split), reporting
averages over c ∼ μtest and
also stratifying by δ(c).
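To make the two-sample estimator concrete, the following sketch computes a Sinkhorn-regularized W2 estimate between generated and reference trajectories for a single context; it assumes the POT library (`ot`) as a dependency, and the sample counts, regularization strength, and array layout are illustrative choices rather than prescriptions.

```python
import numpy as np
import ot  # POT: Python Optimal Transport (assumed dependency)


def sinkhorn_w2(gen_traj: np.ndarray, ref_traj: np.ndarray, reg: float = 0.05) -> float:
    """Two-sample, entropy-regularized estimate of W2(p_hat(.|c), p*(.|c)).

    gen_traj, ref_traj: arrays of shape (num_samples, d_tau) holding vectorized
    trajectories x = vec(tau) generated for, and observed under, the same context c.
    """
    n, m = len(gen_traj), len(ref_traj)
    a = np.full(n, 1.0 / n)  # uniform weights on generated samples
    b = np.full(m, 1.0 / m)  # uniform weights on reference samples
    M = ot.dist(gen_traj, ref_traj, metric="sqeuclidean")  # pairwise squared Euclidean costs
    cost = ot.sinkhorn2(a, b, M, reg)  # regularized transport cost, approximates W2^2
    return float(np.sqrt(cost))
```

Averaging this estimate over c ∼ μtest, and stratifying it by δ(c), yields the reported quantities.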
In sparse-goal tasks, a primary metric is the probability of reaching
the goal within tolerance at (or before) horizon H. For a rollout $\{s_t\}_{t=0}^{H}$
under the executed MPC policy, define
$\mathrm{Succ}(c)\coloneqq\mathbbm{1}\{\exists\, t\le H:\ d(s_t,g)\le \epsilon\}$,
and report 𝔼c ∼ μtest[Succ(c)]
(often separately on covered contexts and on missing regions).
For shaped rewards, or when comparing across tasks with different reward
scales, we additionally report a normalized return. Given per-episode
return $G(c)\coloneqq \sum_{t=0}^{H-1}
r(s_t,a_t,g)$ under the evaluated planner, we normalize via
$$
\mathrm{NR}(c)\;\coloneqq\;\frac{G(c)-G_{\min}(c)}{\,G_{\max}(c)-G_{\min}(c)+\eta\,},
$$
where Gmax(c) and
Gmin(c)
are task-dependent reference returns (e.g., from an expert/oracle and a
weak baseline, respectively) and η > 0 prevents division by zero.
We then report 𝔼c ∼ μtest[NR(c)]
and its decomposition over covered versus missing regimes. In purely
sparse-goal settings, NR typically
coincides (up to affine transformation) with success, but it remains
useful when intermediate shaping terms or energy penalties are
present.
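A minimal sketch of these two metrics; the tolerance `eps`, the reference returns, and the assumption that distances are computed on the goal-relevant coordinates are placeholders, not values fixed by the text.

```python
import numpy as np


def success(states: np.ndarray, goal: np.ndarray, eps: float) -> float:
    """Succ(c): 1 if the rollout reaches the goal within tolerance at or before horizon H."""
    # states: (H+1, goal_dim), already projected onto the coordinates compared with the goal
    dists = np.linalg.norm(states - goal, axis=-1)
    return float(np.any(dists <= eps))


def normalized_return(G: float, G_min: float, G_max: float, eta: float = 1e-8) -> float:
    """NR(c): per-episode return rescaled by task-dependent reference returns."""
    return (G - G_min) / (G_max - G_min + eta)
```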
These metrics together disentangle (a) whether the conditional generator produces trajectories close to the true conditional law in a distributional sense, and (b) whether the induced MPC policy achieves the intended control objective under structured missingness.
Diffusion generative models provide a convenient interface between offline data and test-time planning because they learn a conditional distribution over entire trajectory segments and defer objective-specific selection to inference. In contrast to value-based offline RL methods, which require extrapolating counterfactual action values for out-of-distribution state–action pairs, a trajectory generator can be used as a proposal mechanism: it produces candidate rollouts that are subsequently evaluated by the known reward functional. This separation is particularly natural in goal-conditioned settings, where the conditional trajectory law can be highly multimodal (multiple homotopy classes of paths to the same goal), and where the offline data may cover only a subset of contexts.
Let x0 = vec(τ) ∈ ℝdτ
denote a vectorized trajectory. A (conditional) diffusion model
specifies a forward noising process
$x_t=\sqrt{\bar\alpha_t}\,x_0+\sqrt{1-\bar\alpha_t}\,\varepsilon$ with $\varepsilon\sim\mathcal{N}(0,I)$,
for t ∈ {1, …, T},
and learns a parameterized reverse-time denoiser that maps (xt, c, t)
to an estimate of the injected noise. Concretely, one trains a network
ε̂θ(xt, c, t)
so that, when embedded into the standard DDPM reverse update, repeated
denoising steps transform an initial xT ∼ 𝒩(0, I)
into a sample x̂0
whose induced trajectory τ̂
resembles the data distribution conditioned on c. Conditioning is typically
implemented by concatenating c
to per-time-step embeddings, by FiLM-style modulation, or by
cross-attention from trajectory tokens to context tokens; for the
present discussion we treat these as interchangeable mechanisms that aim
to approximate a conditional score model.
The canonical objective samples (c, x0) from the
offline dataset, then samples t (often uniformly) and ε ∼ 𝒩(0, I), forms xt, and
minimizes the squared error
$$
\mathcal{L}_{\mathrm{diff}}\;\coloneqq\;\mathbb{E}\bigl[\|\varepsilon-\hat\varepsilon_\theta(x_t,c,t)\|_2^2\bigr].
$$
Under standard assumptions, this training procedure is equivalent to
score matching for the conditional distributions {p(xt ∣ c)}t = 1T
and, in the infinite-data/infinite-capacity limit, yields a reverse
process whose samples match the target conditional law p*(x0 ∣ c).
At test time, given a context c = (s, g), we generate N candidate trajectories by sampling xT(j) ∼ 𝒩(0, I) and iterating the learned reverse update for t = T, T − 1, …, 1 to obtain x̂0(j) and hence τ̂(j). We then evaluate each candidate under the known trajectory return R(τ̂(j), c) and select j⋆ ∈ arg maxjR(τ̂(j), c). The MPC policy executes the first action of τ̂(j⋆), observes the next state, updates the context to the new (s, g), and replans. This ``generate–and–select’’ mechanism has two salient properties: it naturally exploits multimodality through sampling, and it can incorporate reward shaping or constraints at inference without retraining the generator, provided the reward is computable from the trajectory.
The preceding picture implicitly assumes that the learned denoiser meaningfully uses the context. However, under structured missingness, conditional diffusion models exhibit a failure mode we refer to as conditional collapse: the learned reverse process becomes only weakly dependent on c, so that p̂(⋅ ∣ c) degenerates toward an approximately unconditional model (or, more subtly, becomes correct only on supp(μ𝒟) and behaves unpredictably elsewhere). We emphasize that this is not a purely architectural pathology; it is compatible with the training objective.
One way to see the issue is to observe that for many noise schedules, a substantial fraction of training steps use t for which ᾱt is close to 1. In that regime xt already contains nearly all information about x0, and the Bayes-optimal predictor of ε can be well-approximated as a function of xt alone. Consequently, minimizing ℒdiff does not, by itself, force the network to retain or exploit c, especially when c is statistically redundant given xt on the dataset distribution. In finite-capacity networks trained by SGD, this pressure manifests as a representation that prioritizes features predictive of ε averaged over μ𝒟, potentially discarding low-signal conditioning information. The problem is exacerbated precisely in the regimes we care about: if μ𝒟(c) = 0 for c ∈ ℛ, then the loss provides no direct gradient signal enforcing correct dependence on c in ℛ, and extrapolation is driven only by inductive bias.
A complementary perspective is information-theoretic. Let z denote an intermediate representation used to predict ε. If the trunk can map (xt, c, t) to z in such a way that I(z; c) is small while still allowing accurate prediction of ε on training data, then conditioning information is effectively absent from the computation that determines the reverse dynamics. In that case, at test time the sampling distribution varies weakly with c, and MPC reduces to selecting among trajectories that are typical under the dataset marginal rather than appropriate for the queried goal. Under missing contexts or missing visited states (which induce out-of-support replanning contexts), this collapse translates directly into poor success: the high-reward set may have negligible probability mass under the near-unconditional generator, making best-of-N sampling ineffective even if N is large.
A natural remedy is to explicitly enforce that the internal
representation used for denoising retains the conditioning information.
The SRDP idea is to augment the diffusion network with an auxiliary
reconstruction task: from the same representation z used to predict ε, we also predict ĉ and penalize reconstruction error
at every diffusion step. Abstractly, if ĉψ(z)
is a reconstruction head, we add the penalty $\lambda\,\mathbb{E}\bigl[\|c-\hat c_\psi(z)\|_2^2\bigr]$ to the diffusion objective.
Because the reconstruction must succeed for all t, the representation cannot ignore
c even at low-noise steps
where xt
nearly determines x0. In effect, we
constrain the learned features to preserve conditioning information in a
form that remains accessible throughout denoising, which empirically
mitigates collapse and, in analysis, supports stability arguments
linking small reconstruction error to small conditional mismatch.
These considerations motivate our Reconstruct-and-Plan (RAP) architecture, which we present next as a trajectory diffusion planner with a shared trunk and dual heads for denoising and context reconstruction.
We now specify the Reconstruct-and-Plan (RAP) model class and its use as a goal-conditioned planner. RAP instantiates a conditional trajectory diffusion model together with an auxiliary reconstruction task designed to prevent the learned reverse process from becoming weakly dependent on the context under structured missingness. The method is deliberately modular: the diffusion component is trained to represent a conditional trajectory law, while planning is delegated to a sampling-based MPC loop that evaluates candidates under a known return functional.
We fix a planning horizon H
and represent a length-H
trajectory segment by a vector
x ≡ x0 ≔ vec(τ) ∈ ℝdτ, τ = (s0, a0, …, sH),
where vec(⋅) denotes a fixed
concatenation of states and actions (or, in action-only variants,
actions plus a compact state encoding). Each training example consists
of a context–trajectory pair (c, x0) with
context c = (s0, g),
where g ∈ 𝒢 denotes a goal
descriptor. When demonstrations have variable length, we reduce them to
a collection of length-H
windows, treating each window start state as s0 and inheriting the
goal g from the parent
episode. This reduction yields an empirical context distribution μ𝒟 that may have holes
relative to μtest,
which is precisely the regime in which we seek stable conditioning.
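A minimal sketch of this window reduction, assuming episodes are stored as state and action arrays with a per-episode goal (field names and array layout are hypothetical):

```python
import numpy as np


def extract_windows(states: np.ndarray, actions: np.ndarray, goal: np.ndarray, H: int):
    """Reduce one variable-length episode to length-H (context, trajectory) training pairs.

    states: (T+1, state_dim), actions: (T, action_dim), goal: (goal_dim,).
    Each window start state becomes s0 and the goal g is inherited from the episode.
    """
    pairs = []
    for start in range(len(actions) - H + 1):
        s = states[start : start + H + 1]            # s_0, ..., s_H
        a = actions[start : start + H]               # a_0, ..., a_{H-1}
        c = np.concatenate([s[0], goal])             # context c = (s0, g)
        x0 = np.concatenate([s.ravel(), a.ravel()])  # fixed vectorization x0 = vec(tau)
        pairs.append((c, x0))
    return pairs
```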
RAP parameterizes all diffusion steps with a shared feature extractor
(the trunk)
z = fϕ(xt, c, t),
followed by two task-specific heads. The diffusion head predicts the noise injected in
the forward process,
ε̂ = fθ(z),
and the reconstruction head predicts the context,
ĉ = fψ(z).
In goal-conditioned tasks it is often convenient to decompose ĉ into (ŝ0, ĝ) and to
measure error componentwise; we keep the notation ∥c − ĉ∥22
for brevity. Architecturally, fϕ may be
implemented by a transformer over time-indexed tokens with time
embedding of t, with c injected by concatenation,
cross-attention, or FiLM modulation; our analysis treats these
mechanisms as different realizations of a Lipschitz map (xt, c, t) ↦ z.
The salient design choice is not the particular conditioning mechanism,
but the requirement that the representation z simultaneously supports denoising
and context reconstruction at every diffusion step.
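A minimal PyTorch-style sketch of this layout, using concatenation-based conditioning and an MLP trunk as one of the admissible realizations (layer sizes and the embedding choice are illustrative assumptions):

```python
import torch
import torch.nn as nn


class RAPNetwork(nn.Module):
    """Shared trunk f_phi with a diffusion head f_theta and a reconstruction head f_psi."""

    def __init__(self, d_tau: int, d_c: int, d_z: int = 256, T: int = 1000):
        super().__init__()
        self.t_embed = nn.Embedding(T + 1, 32)      # diffusion-time embedding
        self.trunk = nn.Sequential(                 # f_phi: (x_t, c, t) -> z
            nn.Linear(d_tau + d_c + 32, d_z), nn.SiLU(),
            nn.Linear(d_z, d_z), nn.SiLU(),
        )
        self.diff_head = nn.Linear(d_z, d_tau)      # f_theta: z -> eps_hat
        self.recon_head = nn.Linear(d_z, d_c)       # f_psi:   z -> c_hat

    def forward(self, x_t: torch.Tensor, c: torch.Tensor, t: torch.Tensor):
        z = self.trunk(torch.cat([x_t, c, self.t_embed(t)], dim=-1))
        return self.diff_head(z), self.recon_head(z)
```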
We adopt the standard forward noising process
$$
x_t \;=\; \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\varepsilon,
\qquad \varepsilon\sim\mathcal{N}(0,I),
\qquad t\in\{1,\dots,T\},
$$
with schedule {βt}t = 1T,
αt = 1 − βt,
and $\bar\alpha_t=\prod_{i=1}^t\alpha_i$.
Training proceeds by sampling (c, x0) from
𝒟, sampling t (typically uniformly on {1, …, T}) and ε, forming xt, and
minimizing the composite loss
$$
\mathcal{L}(\phi,\theta,\psi)\;=\;\mathbb{E}\Bigl[\,\|\varepsilon-f_\theta(z)\|_2^2+\lambda\,\|c-f_\psi(z)\|_2^2\,\Bigr],\qquad z=f_\phi(x_t,c,t),
$$
where λ > 0 controls the
strength of the anti-collapse regularization. The reconstruction term is
enforced at every diffusion step t, so that the trunk cannot satisfy denoising by relying on
xt alone
at low-noise steps while ignoring c. In particular, because the
reconstruction head receives only z (and not xt directly),
small reconstruction error forces the trunk to encode context
information into z in a manner
that remains accessible uniformly across noise levels. We emphasize that
the reconstruction target is the conditioning variable c supplied at training time; the
composite loss therefore does not require additional labels beyond those
already present in goal-conditioned planning datasets.
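One stochastic gradient step on the composite loss might then look as follows; `model` is the dual-head network sketched above, `alpha_bar` holds the cumulative products ᾱ_t indexed by t, and the value of λ is an illustrative assumption.

```python
import torch
import torch.nn.functional as F


def rap_training_step(model, optimizer, x0, c, alpha_bar, lam: float = 0.1) -> float:
    """Sample (t, eps), form x_t, and minimize ||eps - eps_hat||^2 + lam * ||c - c_hat||^2."""
    B, T = x0.shape[0], alpha_bar.shape[0] - 1
    t = torch.randint(1, T + 1, (B,), device=x0.device)       # t ~ Unif({1, ..., T})
    eps = torch.randn_like(x0)                                 # forward-process noise
    ab = alpha_bar[t].unsqueeze(-1)
    x_t = torch.sqrt(ab) * x0 + torch.sqrt(1.0 - ab) * eps     # forward noising
    eps_hat, c_hat = model(x_t, c, t)
    loss = F.mse_loss(eps_hat, eps) + lam * F.mse_loss(c_hat, c)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```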
At inference, given a queried context c = (s, g), we
sample candidate trajectories by running the learned reverse diffusion
starting from Gaussian noise. Concretely, for each sample index j ∈ {1, …, N} we draw xT(j) ∼ 𝒩(0, I)
and iterate for t = T, T − 1, …, 1:
zt(j) = fϕ(xt(j), c, t), ε̂t(j) = fθ(zt(j)), xt − 1(j) = denoise(xt(j), ε̂t(j), t),
where denoise(⋅) denotes a standard
DDPM update (or an accelerated deterministic/stochastic variant such as
DDIM or a solver-based step). After T steps we decode x̂0(j) ≡ x0(j)
into a trajectory τ̂(j) = unvec(x̂0(j)).
The reconstruction head fψ is not used to
modify this sampling rule at test time; its role is to shape the learned
representation during training so that the conditional reverse dynamics
induced by fθ ∘ fϕ
meaningfully depends on c.
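A sketch of this sampling recursion with a plain DDPM ancestral update; the schedule tensors are assumed inputs (indexed by t), and accelerated samplers would replace the inner update without otherwise changing the procedure.

```python
import torch


@torch.no_grad()
def sample_trajectories(model, c, N, T, betas, alphas, alpha_bar, d_tau):
    """Draw N candidate vectorized trajectories x0_hat from p_hat(. | c) by reverse diffusion."""
    c_batch = c.unsqueeze(0).expand(N, -1)              # repeat the queried context
    x = torch.randn(N, d_tau, device=c.device)          # x_T ~ N(0, I)
    for t in range(T, 0, -1):
        t_batch = torch.full((N,), t, dtype=torch.long, device=c.device)
        eps_hat, _ = model(x, c_batch, t_batch)         # reconstruction head unused at inference
        mean = (x - betas[t] / torch.sqrt(1.0 - alpha_bar[t]) * eps_hat) / torch.sqrt(alphas[t])
        x = mean + torch.sqrt(betas[t]) * torch.randn_like(x) if t > 1 else mean
    return x                                            # (N, d_tau) candidate x0_hat
```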
RAP uses the learned conditional generator as a proposal distribution
inside a receding-horizon controller. Given current state s and fixed goal g, we form the context c = (s, g),
generate N candidate
trajectories {τ̂(j)}j = 1N
as above, and score each candidate under a known return functional
$$
J^{(j)} \;\coloneqq\; R(\hat\tau^{(j)},c) \;=\; \sum_{t=0}^{H-1} r(\hat
s_t^{(j)},\hat a_t^{(j)},g),
$$
or, in sparse-goal settings, a terminal indicator or shaped variant
computable from the generated rollout. We select an index j⋆ ∈ arg maxjJ(j)
and execute the first action a = â0(j⋆)
in the environment. After observing the next state, we update the
context to (s′, g) and
repeat. This MPC loop separates generation from selection: the generator need only place
non-negligible probability mass on plausible goal-directed trajectories,
while the known reward selects among sampled modes. The dependence of
performance on both conditional generation error and the finite sampling
budget N is quantified in our
subsequent theory.
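The surrounding receding-horizon loop can be sketched as follows; the environment interface, the `build_context` and `unvec` helpers, and the return functional `R` are task-specific placeholders rather than parts of the method.

```python
def mpc_rollout(env, model, goal, R, build_context, unvec, sampler_args, N, max_steps):
    """Best-of-N MPC: sample candidates, score with the known return, execute the first action."""
    s = env.reset()
    for _ in range(max_steps):
        c = build_context(s, goal)                   # context c = (s, g) as a tensor
        x0_batch = sample_trajectories(model, c, N, **sampler_args)
        candidates = [unvec(x0) for x0 in x0_batch]  # decode vec(tau) -> (states, actions)
        scores = [R(tau, c) for tau in candidates]
        j_star = max(range(N), key=lambda j: scores[j])
        a0 = candidates[j_star][1][0]                # first action of the best-scoring sample
        s, reward, done, info = env.step(a0)
        if done:
            break
```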
Training adds only a second head to a standard conditional diffusion model; the trunk computation is shared, so the incremental cost is a single small forward/backward through fψ. Inference costs O(NT) network evaluations per replanning step, with the usual tradeoffs between T and sample quality; accelerated samplers reduce the effective number of denoising steps without changing the method. In all cases, RAP is defined by the single principle that conditioning information is explicitly reconstructible from the denoising representation across diffusion time, which is the property we will exploit to obtain conditional generation guarantees under manifold-smoothness assumptions.
We formalize the conditional generation problem induced by offline
goal-conditioned planning with holes in the empirical context
distribution. Let p*(⋅ ∣ c) denote
the (unknown) conditional law of vectorized trajectories x0 = vec(τ)
given context c = (s0, g),
and let p̂(⋅ ∣ c) be
the conditional law implicitly defined by running the learned reverse
diffusion dynamics using the denoising network fθ ∘ fϕ
(with c held fixed). Our goal
in this section is to bound
W2 (p̂(⋅ ∣ c), p*(⋅ ∣ c))
for test contexts c ∼ μtest that
may lie outside supp(μ𝒟).
The principal structural assumption is that conditional behavior
varies smoothly along a low-dimensional set of contexts, rather than in
the ambient space. Concretely, we assume there exists a k-dimensional (possibly curved) set
ℳc ⊂ 𝒞 such that
both the dataset and test-time contexts concentrate near ℳc, and that the
conditional law is Hölder along ℳc in Wasserstein
distance: there exist α ∈ (0, 1] and L < ∞ such that for all c, c′ ∈ ℳc,
$$
W_2\bigl(p^*(\cdot\mid c),\,p^*(\cdot\mid c')\bigr)\;\le\;L\,\|c-c'\|^{\alpha}.
$$
To quantify structured missingness, we measure the test context’s
distance to covered contexts by
δ(c) ≔ dist(c, supp(μ𝒟)).
When c ∈ ℛ (a missing-context
region), we have δ(c) > 0 by
definition.
Our analysis treats the learned denoiser and reconstruction maps as
Lipschitz function classes. Specifically, we assume that for each
diffusion step t the mappings
(xt, c) ↦ fϕ(xt, c, t)
and z ↦ fθ(z)
are L-Lipschitz in their
continuous arguments, and that the reverse diffusion update denoise(⋅) is stable to perturbations of the
noise prediction in a manner controlled by the schedule {βt}t = 1T
(a standard property in diffusion perturbation analyses). We summarize
statistical/approximation errors at the population level by two
quantities:
(i) a diffusion estimation error εD, representing
the gap between the learned noise predictor and the true conditional
score/noise (aggregated over t); and (ii) a reconstruction error
εR defined
by
$$
\varepsilon_R \;\coloneqq\; \mathbb{E}\Bigl[\|c - f_\psi\bigl(f_\phi(x_t,c,t)\bigr)\|_2^2\Bigr],
$$
where the expectation is over (c, x0) ∼ 𝒟,
t, and the forward-process
noise. The purpose of the reconstruction term is not to use ĉ at test time, but to certify that
the shared representation retains conditioning information across
diffusion time, thereby preventing the learned reverse dynamics from
becoming effectively unconditional.
We sketch the logical structure underlying Theorem~1. First, we compare the ideal reverse diffusion that uses the true conditional score/noise to the learned reverse diffusion that uses fθ ∘ fϕ; standard diffusion stability yields a term proportional to εD. Second, we isolate the effect of conditional collapse: even if the denoiser is accurate on average, the representation z = fϕ(xt, c, t) could discard c and thereby induce a reverse process that is weakly dependent on context. The reconstruction constraint rules out this collapse by forcing I(z; c) to be large in an operational sense; quantitatively, it yields the term C2 εR^{1/2} by translating mean-squared reconstruction error into a bound on the discrepancy between the learned conditional and a context-agnostic surrogate. Third, when c lies outside the dataset support, we cannot expect zero error even with perfect estimation; we therefore relate p*(⋅ ∣ c) to p*(⋅ ∣ c′) at a nearest covered context c′ and invoke the Hölder condition, producing the extrapolation penalty C3 L δ(c)^α.
The upper bound is meaningful only if its dependence on sample size is unimprovable (up to constants) under the stated structural assumptions. We therefore consider a minimax class 𝒫k, α of conditional trajectory laws whose context dependence is α-Hölder on a k-dimensional manifold, and ask for the best possible worst-case W2 estimation rate from n offline samples. The following statement is standard in nonparametric estimation on manifolds, and applies to conditional diffusion modeling when we view the model as estimating a smooth family of distributions.
Theorems~1–2 have three consequences. (i) Under manifold proximity, the best achievable rate depends on k rather than the ambient dimension of (s0, g). (ii) Even with infinite data near supp(μ𝒟), extrapolation into ℛ costs at least order δ(c)^α without further structure. (iii) The εR^{1/2} term quantifies the benefit of enforcing context-retention across diffusion time; it cannot remove the δ(c)^α term, but it prevents an additional failure mode in which p̂(⋅ ∣ c) becomes nearly independent of c even on covered regions. These generation guarantees will be converted into planning guarantees once we account for MPC selection from finitely many sampled trajectories.
We now convert a conditional generation guarantee for p̂(⋅ ∣ c) into a bound on
the planning loss incurred by MPC that selects the best trajectory among
N samples. Throughout, we
write x = vec(τ) and
define the conditional generative error
ε(c) ≔ W2 (p̂(⋅ ∣ c), p*(⋅ ∣ c)).
We treat R(τ, c) as a known
functional used only for scoring candidate trajectories at
inference.
We assume that for each fixed context c, the trajectory reward is LR-Lipschitz in
the vectorized trajectory:
$$
|R(\tau,c)-R(\tau',c)|\;\le\;L_R\,\|\mathrm{vec}(\tau)-\mathrm{vec}(\tau')\|_2 .
$$
This assumption is natural when R is a sum of per-step costs that
are themselves Lipschitz in (st, at)
and the vectorization concatenates the corresponding components. For
instance, if
$$
R(\tau,c)=\sum_{t=0}^{H-1} r(s_t,a_t,g),\qquad |r(u,g)-r(u',g)|\le
L_r\|u-u'\|_2,
$$
and vec(τ) concatenates ut = (st, at),
then the Lipschitz condition above holds with $L_R\le L_r\sqrt{H}$
(up to fixed constants determined by the chosen vectorization norm).
Thus the bound below will typically degrade at most polynomially with
horizon through LR.
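For completeness, the Cauchy–Schwarz step behind the $L_R\le L_r\sqrt{H}$ claim, writing $u_t=(s_t,a_t)$:
$$
|R(\tau,c)-R(\tau',c)|
\;\le\; L_r\sum_{t=0}^{H-1}\|u_t-u_t'\|_2
\;\le\; L_r\sqrt{H}\,\Bigl(\sum_{t=0}^{H-1}\|u_t-u_t'\|_2^2\Bigr)^{1/2}
\;\le\; L_r\sqrt{H}\,\|\mathrm{vec}(\tau)-\mathrm{vec}(\tau')\|_2 .
$$
The last inequality uses that vec(τ) may contain additional components (such as sH), which can only enlarge the Euclidean norm.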
Let (X, X̂) be any
coupling of p*(⋅ ∣ c) and
p̂(⋅ ∣ c). By the Lipschitz condition and
Cauchy–Schwarz,
$$
\bigl|\mathbb{E}[R(X,c)]-\mathbb{E}[R(\hat X,c)]\bigr|
\;\le\; L_R\,\mathbb{E}\|X-\hat X\|_2
\;\le\; L_R\,\bigl(\mathbb{E}\|X-\hat X\|_2^2\bigr)^{1/2}.
$$
Taking the infimum over couplings yields
$$
\bigl|\mathbb{E}_{p^*(\cdot\mid c)}[R(\tau,c)]-\mathbb{E}_{\hat p(\cdot\mid c)}[R(\tau,c)]\bigr|
\;\le\; L_R\,W_2\bigl(\hat p(\cdot\mid c),\,p^*(\cdot\mid c)\bigr)\;=\;L_R\,\varepsilon(c).
$$
This step is the sole place where we pass from distributional error to
planning error.
Fix a context c and
consider a single MPC planning call. Let τ⋆(c) ∈ arg maxτR(τ, c)
denote an (idealized) optimal horizon-H trajectory for context c.
Let $\tau^{(1)},\dots,\tau^{(N)}\stackrel{\text{i.i.d.}}{\sim}p^*(\cdot\mid
c)$ and define the best reward achievable by N samples,
τbest*(c) ∈ arg max1 ≤ j ≤ NR(τ(j), c).
Similarly, let $\hat\tau^{(1)},\dots,\hat\tau^{(N)}\stackrel{\text{i.i.d.}}{\sim}\hat
p(\cdot\mid c)$ and define the MPC-selected trajectory
τ̂MPC(c) ∈ arg max1 ≤ j ≤ NR(τ̂(j), c).
We quantify the intrinsic limitation of best-of-N selection under the true
conditional law by
$$
\Delta_N(c)\;\coloneqq\;R(\tau^\star(c),c)\;-\;\mathbb{E}\bigl[R(\tau^*_{\mathrm{best}}(c),c)\bigr].
$$
Equivalently, if Jj = R(τ(j), c)
for i.i.d. τ(j) ∼ p*(⋅ ∣ c)
and J(N) = maxj ≤ N Jj,
then ΔN(c) = R(τ⋆(c), c) − 𝔼[J(N)].
In particular, if the reward distribution has a regular upper tail,
ΔN(c)
decays with N (e.g., as a
quantile gap); conversely, for sparse rewards ΔN(c)
can remain large unless N is
sufficiently large.
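When reference samples from p*(⋅ ∣ c) are available (e.g. from a simulator), ΔN(c) can be estimated by a simple Monte Carlo scheme; the sketch below treats the reward sampler and the reference optimal return as diagnostic conveniences that are not part of the method itself.

```python
import numpy as np


def estimate_delta_N(sample_reward, R_star: float, N: int, num_trials: int = 1000, seed: int = 0) -> float:
    """Monte Carlo estimate of Delta_N(c) = R(tau_star, c) - E[max of N i.i.d. rewards under p*].

    sample_reward: callable(rng) returning R(tau, c) for one draw tau ~ p*(. | c).
    R_star: reference optimal return R(tau_star(c), c).
    """
    rng = np.random.default_rng(seed)
    best_of_N = np.empty(num_trials)
    for i in range(num_trials):
        rewards = np.array([sample_reward(rng) for _ in range(N)])
        best_of_N[i] = rewards.max()
    return R_star - best_of_N.mean()
```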
We decompose
$$
R(\tau^\star,c)-R(\hat\tau_{\mathrm{MPC}},c)
=\underbrace{R(\tau^\star,c)-R(\tau^*_{\mathrm{best}},c)}_{\text{finite-sampling
under }p^*}
+\underbrace{R(\tau^*_{\mathrm{best}},c)-R(\hat\tau_{\mathrm{MPC}},c)}_{\text{model
mismatch}}.
$$
Taking expectations, the first term is exactly ΔN(c)
by definition of ΔN(c). For the second term, we take an optimal W2 coupling π of p*(⋅ ∣ c) and
p̂(⋅ ∣ c) and draw
N i.i.d. coupled pairs (τ(j), τ̂(j)) ∼ π.
By maximality of τbest* and
τ̂MPC and the
elementary inequality |maxjuj − maxjvj| ≤ maxj|uj − vj|,
we obtain
R(τbest*, c) − R(τ̂MPC, c) ≤ max1 ≤ j ≤ N|R(τ(j), c) − R(τ̂(j), c)|.
Applying the Lipschitz condition and bounding the maximum by a sum yields
$$
\mathbb{E}\bigl[R(\tau^*_{\mathrm{best}},c)-R(\hat\tau_{\mathrm{MPC}},c)\bigr]
\;\le\;L_R\sum_{j=1}^N
\mathbb{E}\|\mathrm{vec}(\tau^{(j)})-\mathrm{vec}(\hat\tau^{(j)})\|_2,
$$
and by exchangeability and the optimality of the coupling we conclude that this term is at most CLRε(c)
for a constant C (absorbing
the crude max-to-sum step into constants, or tightening it under
additional tail control on the coupling error). Substituting into the
decomposition gives the bound stated in Theorem~3.
Theorem~3 is stated for a single planning call at a fixed context. In receding-horizon execution, we replan at each time step with a new context ct = (st, g). Under standard stability assumptions on the environment and the replanning loop, we may apply Theorem~3 conditionally at each time step and sum (with discount) to obtain a cumulative return degradation controlled by ∑t ≥ 0γt(LRε(ct) + ΔN(ct)). Consequently, once we insert a bound on ε(c) from conditional generation analysis, planning performance degrades gracefully with both structured missingness (through ε(c)) and finite sampling (through ΔN(c)).
The preceding bounds are necessarily conditional on structure: when μ𝒟 assigns zero mass to a region of contexts ℛ that is encountered at test time, no purely offline procedure can in general determine what should be done on ℛ. We formalize this via a standard indistinguishability argument (Le Cam two-point method) specialized to goal-conditioned planning: we construct two conditional trajectory laws that coincide on supp(μ𝒟), yet induce incompatible optimal actions for some c ∈ ℛ. Consequently, any offline learner—whether model-based, behavior-cloning, diffusion-based, or otherwise—must fail on at least one of the two worlds.
We outline one concrete instantiation. Consider a deterministic system with a ``junction’’ state sJ that is reached only from contexts in ℛ (equivalently, the dataset never contains trajectories whose initial condition is such that the induced rollouts enter the junction). From sJ there are two actions, aL and aR, leading to two disjoint corridors of length H′ terminating in terminal states sgoal and strap. Define a sparse reward that gives 1 if and only if sgoal is reached by time H, and 0 otherwise. Now define two environments (or conditional trajectory laws) that agree everywhere on supp(μ𝒟), but differ at the junction: in world~1, aL leads to sgoal and aR leads to strap; in world~2 the mapping is swapped. Because μ𝒟(ℛ) = 0, the offline data never contains transitions from sJ (nor any other distinguishing evidence), so the induced joint distribution over 𝒟 is identical under world~1 and world~2. However, optimal behavior at c† (a context that reaches sJ) differs: one must choose aL in world~1 and aR in world~2.
Let ℙ1 and ℙ2 denote the distributions over
datasets 𝒟 induced by p1* and p2* together
with μ𝒟. By
construction, ℙ1 = ℙ2. Fix any
algorithm Alg producing π = Alg(𝒟). Since 𝒟 has the same law under both hypotheses, the
random output π has the same
law under ℙ1 and ℙ2. Yet the optimal success events
at c† are disjoint:
if π selects (or induces)
action aL at the
junction, it succeeds in world~1 and fails in world~2; if it selects
aR, it succeeds in
world~2 and fails in world~1. Thus,
𝔼𝒟 ∼ ℙ1[Succ(π; c†)] + 𝔼𝒟 ∼ ℙ2[Succ(π; c†)] ≤ 1,
which implies the stated $\le
\tfrac{1}{2}$ bound for at least one of the two worlds. This is
the simplest form of an indistinguishability lower bound: any guarantee
that would exceed 1/2 uniformly over
both hypotheses would contradict ℙ1 = ℙ2.
Theorem~4 isolates what smoothness/manifold assumptions buy us: they rule out such arbitrary discontinuities on ℛ by requiring that p*(⋅ ∣ c) vary regularly with c, enabling extrapolation from nearby covered contexts. In particular, the additional term proportional to δ(c)^α in Theorem~1 (and the matching necessity in Theorem~2) should be viewed as the minimal price of leaving the support of μ𝒟 even when the problem is otherwise well behaved. Conversely, without any continuity control, no method—including those that reconstruct c at each diffusion step—can infer the correct behavior at c†, because the training signal contains no information about the relevant transitions.
A practical corollary is that success in missing regions demands either (i) additional structure that links ℛ to observed contexts (smoothness, latent manifold proximity, symmetries, known dynamics), or (ii) an explicit mechanism to detect when the current context is out-of-support and to trigger a safe fallback. In our setting, one natural diagnostic is a distance-to-data estimate for c (or for the trunk representation), which can be used to abstain, increase conservatism, enlarge uncertainty bounds, or switch to a reactive controller. Such mechanisms do not contradict the impossibility result; rather, they acknowledge it by declining to commit to an unidentifiable choice when μ𝒟(c) is effectively zero.
Even if p*(⋅ ∣ c) were known, best-of-N MPC can fail in sparse-reward problems when successful trajectories have tiny probability under the generator; in that case N must scale as Ω(1/p) to observe a success with constant probability. This phenomenon is orthogonal to offline identifiability: it persists under perfect coverage and perfect modeling, and motivates guidance, shaping, or hierarchical decompositions when H is large.
We record the basic resource requirements of RAP and note several implementation choices that materially affect the cost–quality tradeoff in offline planning.
Each stochastic gradient step draws a minibatch of size B, samples t ∼ Unif({1, …, T}) and
ε ∼ 𝒩(0, I), forms
$x_t=\sqrt{\bar\alpha_t}x_0+\sqrt{1-\bar\alpha_t}\varepsilon$,
and evaluates the trunk and two heads. Denoting by Ctrunk the cost of a
forward/backward pass through fϕ(⋅) and by
Cθ and Cψ the
corresponding costs for the diffusion and reconstruction heads, the per-step cost is
O (B ⋅ Ctrunk + B ⋅ (Cθ + Cψ)),
and is thus a constant-factor overhead relative to a baseline
conditional diffusion model (which omits fψ and the
reconstruction term). In practice Cψ is small (a
shallow MLP) and the dominant cost is the trunk. Since we sample only
one diffusion time index t per
datapoint per iteration (as in standard score matching), training cost
is independent of T up to a
negligible constant for schedule lookups; T matters primarily at
inference.
Training memory consists of model parameters plus activation storage for backpropagation. The batch storage for trajectories is O(B ⋅ dτ), and the additional head increases activation memory by O(B ⋅ dz) where dz is the trunk feature dimension. Because reconstruction is computed from the same z used by the diffusion head, we do not require any additional trunk activations beyond those already needed for score matching. When trajectories are stored as length-H windows extracted from longer episodes, it is convenient to precompute the vectorization map x0 = vec(τ) and to cache normalization statistics (per-dimension mean/variance) so that the diffusion noise scale is commensurate across state and action channels.
At test time, the per-action computational cost is dominated by
denoising N candidate
trajectories for T reverse
steps. Writing Cnet
for the cost of one network evaluation (trunk plus diffusion head), the
sampling cost is
O (N ⋅ T ⋅ Cnet).
The scoring cost depends on the reward evaluation. When the reward is a
known functional $R(\tau,c)=\sum_{t=0}^{H-1}
r(s_t,a_t,g)$, a direct evaluation is O(N ⋅ H) up to the
cost of computing r(⋅), which
is typically negligible compared to Cnet for image-free state
spaces. When reward evaluation requires additional checks
(e.g. collision or constraint penalties computed from st), the scoring
term may be comparable to denoising; in that case one should fuse reward
computation with unvectorization to avoid repeated passes over
memory.
A practical point is that MPC is typically receding-horizon: only the first action of the maximizing sample is executed, after which the context is updated and the optimization is rerun. Consequently, even moderate T and N can suffice in environments where frequent replanning corrects trajectory imperfections; conversely, long-horizon sparse-reward tasks may require either larger N (to hit rare successes) or guided sampling, irrespective of modeling error.
Naively storing all N denoised trajectories requires O(N ⋅ dτ) memory, which can be large when H is large. This can be reduced by streaming: we maintain the current xt(j) for all j during denoising (still O(N ⋅ dτ)), but after producing x0(j) we can immediately compute J(j) = R(τ(j), c) and retain only the current best candidate (or top-K) rather than all trajectories. This reduces post-sampling storage to O(dτ) while leaving sampling memory unchanged; further reductions are possible by sampling candidates sequentially (one j at a time), at the expense of losing vectorization/batching benefits on modern accelerators.
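A sketch of this streaming selection, which generates candidates one at a time and keeps only the running best (the one-sample generator and scoring function are placeholders):

```python
def streaming_best_of_n(sample_one, score, N: int):
    """Sequential best-of-N: O(d_tau) post-sampling storage instead of O(N * d_tau).

    sample_one: callable running the full reverse diffusion for one candidate trajectory.
    score: callable mapping a decoded trajectory to its return J = R(tau, c).
    """
    best_tau, best_J = None, float("-inf")
    for _ in range(N):
        tau = sample_one()
        J = score(tau)
        if J > best_J:
            best_tau, best_J = tau, J
    return best_tau, best_J
```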
The dependence on T motivates using accelerated sampling schemes. Deterministic or semi-deterministic samplers (e.g. DDIM-style updates) can reduce variance and permit using smaller T without catastrophic degradation in sample quality. Higher-order ODE/SDE solvers (e.g. DPM-Solver-type methods) often allow T in the single digits while maintaining perceptual fidelity in standard diffusion settings; here the relevant metric is not visual quality but the induced distribution over feasible, high-reward trajectories, so the appropriate choice of solver is empirical. A further reduction is possible through distillation of the reverse process into a small number of steps (even T ≈ 1), which directly lowers MPC cost from O(NT) to O(N). Since our theory treats T as an algorithmic parameter and does not rely on asymptotics in T, such accelerations are fully compatible with the framework, though they alter constants through solver bias.
The reconstruction head fψ may be tasked with predicting c = (s0, g) directly, or with predicting its components separately (e.g. ŝ0 and ĝ) to allow different scalings and losses. We generally normalize s0 and g to comparable magnitudes and set λ so that ∥ε − ε̂∥22 and ∥c − ĉ∥22 are of the same order early in training; otherwise the trunk may ignore one objective. Conditioning c and time t into the trunk can be realized by concatenation followed by an MLP, FiLM-style modulation, or cross-attention; the reconstruction term is intended to prevent the trunk from becoming condition-agnostic under any of these mechanisms.
To isolate the effect of reconstruction, we recommend the following
ablation axes.
(i) set λ = 0 and remove fψ, yielding a
standard conditional diffusion planner trained purely by score
matching.
(ii) reconstruct c from z = fϕ(xt, c, t)
at every t (the default),
versus reconstruct only at small noise levels (e.g. t ≤ t0) or only
at a single t sampled from a
restricted set; this tests whether information retention is needed
across the entire noise schedule or only near x0.
(iii) reconstruct the full context (s0, g) versus
reconstruct only g (or only
s0). In
goal-conditioned settings, reconstructing only g probes whether the improvement
arises from preserving goal information specifically or from a broader
regularization effect.
(iv) replace ∥c − ĉ∥2 with a
loss that predicts selected trajectory states (e.g. endpoints (s0, sH)
or a subset {stℓ})
from z. This tests whether the
benefit comes from enforcing retention of conditioning variables or from
encouraging the representation to encode semantically meaningful
trajectory features. We emphasize that such variants change the
inductive bias (and may implicitly regularize dynamics consistency), but
preserve the central principle: auxiliary prediction prevents the trunk
from discarding information that the planner later needs.
We recommend evaluating RAP in goal-conditioned continuous-control benchmarks where structured missingness can be induced in a controlled manner and where success is sensitive to the ability to condition on (s0, g) outside the empirical context support. Our primary emphasis is on Maze2D-style navigation and AntMaze-style long-horizon planning, since both admit natural ``holes’’ (unvisited corridors, removed goal sets, or disconnected start–goal pairs) while retaining a well-defined test-time reward.
Let the underlying environment be a planar maze with state s = (x, y, ẋ, ẏ) (or position-only) and goal g = (xg, yg), with sparse terminal reward $\mathbbm{1}\{\|s_H-g\|\le \epsilon\}$ or shaped reward −∥st − g∥. From a standard offline dataset, we construct missingness variants by editing the context support (holding out ranges of start–goal pairs) and/or the state-visitation support (removing trajectories that traverse designated regions). For each protocol, we recommend reporting performance as a function
of the distance-to-coverage proxy δ(c), approximated by the kNN distance of c = (s0, g)
to the set of dataset contexts, together with separate analyses for
``in-support'' and ``out-of-support'' test contexts.
AntMaze is a high-dimensional setting where diffusion planners are
often competitive, but where missingness can be made concrete by
removing narrow passages or goals in a specific region. We recommend two
controlled edits:
(i) delete all trajectories that enter a chosen corridor or doorway;
test on tasks that require that corridor to reach the goal.
(ii) delete all trajectories whose goals lie in a designated region of
the maze; test on goals in that region while keeping start states
broad.
To reduce confounds, we suggest keeping the behavior policy and
environment identical and only modifying dataset filtering; in
particular, the dynamics remain fixed, so any performance gap is
attributable to offline generalization rather than domain shift.
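A sketch of the kNN distance-to-coverage proxy recommended above; the number of neighbors and the normalization of context coordinates are experiment-level choices, and the brute-force distance computation assumes modest dataset sizes.

```python
import numpy as np


def knn_delta(test_contexts: np.ndarray, dataset_contexts: np.ndarray, k: int = 5) -> np.ndarray:
    """Approximate delta(c) = dist(c, supp(mu_D)) by the mean distance to the k nearest dataset contexts.

    test_contexts: (n_test, d_c) array of c = (s0, g); dataset_contexts: (n_data, d_c).
    """
    diffs = test_contexts[:, None, :] - dataset_contexts[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)   # (n_test, n_data) pairwise distances
    knn = np.sort(dists, axis=1)[:, :k]      # k smallest distances per test context
    return knn.mean(axis=1)                  # one delta(c) proxy per test context
```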
A minimal baseline set should include (a) a standard conditional trajectory diffusion model trained with score matching only (RAP with λ = 0), (b) an implicit behavior-cloning policy with goal conditioning (e.g. goal-conditioned BC), and (c) an offline RL method appropriate for the benchmark if available. For RAP-specific ablations, we recommend varying: (i) reconstruction on/off; (ii) reconstruction enforced for all t versus only small-noise t; (iii) reconstruct g only versus reconstruct (s0, g); and (iv) replace context reconstruction with predicting a subset of trajectory states (endpoint or waypoints). To isolate representation effects, all variants should share the same trunk architecture, diffusion schedule, and inference sampler.
Sparse-reward success is necessary but insufficient to diagnose why reconstruction helps. We therefore recommend supplementing it with conditional-fidelity diagnostics (e.g., the two-sample W2 estimates described earlier), computed on held-out contexts and stratified by δ(c). For each benchmark and missingness protocol, we recommend reporting: (i) success/return under μtest and separately on ℛ, (ii) the diagnostics above, and (iii) compute budgets (N, T, H) and wall-clock inference per action. Since RAP targets missing-context generalization, we regard stratified evaluation by δ(c) as essential: improvements confined to near-support contexts are qualitatively different from improvements that persist deeper into ℛ.
The operative effect of the reconstruction head is not to modify the reverse-time sampling rule at inference, but to constrain the learned representation z = fϕ(xt, c, t) so that c remains recoverable even when xt is heavily corrupted. In conditional diffusion, a frequent pathology in the offline setting is that the trunk learns to rely predominantly on correlations between xt and a subset of context coordinates that are well-covered by μ𝒟, while discarding those components of c that are predictive only in rare or missing regions. Imposing ∥c − fψ(z)∥2 for all t forces a form of conditional identifiability: among hypotheses that fit the score-matching objective comparably well on supp(μ𝒟), we bias toward those that retain conditioning information globally. Under the manifold-smoothness assumption, this bias is aligned with the desired extrapolation behavior, since p*(⋅ ∣ c) varies continuously along ℳc and thus the ``correct’’ conditioning signal is stable under small context perturbations.
A second regime where reconstruction helps is high-noise conditioning. For large t, the signal-to-noise ratio of xt is small, and an unconstrained conditional model may effectively revert to an unconditional generator (or a generator conditioned only on easy-to-learn context statistics), thereby producing samples that are not discriminative with respect to g or s0. Since MPC performance depends on drawing at least one high-reward candidate among τ̂(1 : N), this collapse can be catastrophic in sparse-reward tasks: the mass of goal-reaching trajectories becomes too small for any feasible N. Reconstruction loss applied uniformly in t is a direct way to prevent this failure mode by ensuring that z remains informative about c throughout the diffusion chain.
Reconstruction is not free regularization; it competes with the score-matching objective for representational capacity. If λ is too large relative to the diffusion loss scale, the trunk may allocate excessive capacity to encoding c at the expense of encoding the fine details of xt needed to predict ε. In the extreme, this yields a representation that is nearly a function of c alone, which can reduce sample fidelity and increase W2(p̂(⋅ ∣ c), p*(⋅ ∣ c)) even on in-support contexts. A more subtle harm is induced mode narrowing: if the reconstruction head is easiest to satisfy by learning a near-deterministic encoding of c into z, gradients may push fϕ toward features that correlate strongly with a dominant mode of p*(⋅ ∣ c), reducing conditional diversity. This can worsen best-of-N planning when multiple qualitatively different homotopy classes of trajectories exist (e.g. distinct corridors in a maze) and the optimal route depends on downstream stochasticity or constraints not well-modeled by the offline data.
Reconstruction can also be mis-specified. If c = (s0, g) contains nuisance coordinates irrelevant to planning (or coordinates that are noisy labels), forcing accurate reconstruction can encourage the trunk to represent precisely what we would prefer to ignore, thereby decreasing robustness. Conversely, if the true conditional law depends on additional latent context not included in c (e.g. unobserved environment parameters or partial observability), then strong reconstruction of the observed c does not resolve ambiguity and may give a misleading sense of ``conditioning correctness.’’ In such settings the correct remedy is to expand the context, infer a latent variable, or adopt a stochastic conditional objective, rather than to increase λ.
The reconstruction term can be viewed as a tractable surrogate for maximizing mutual information between c and z under the joint distribution induced by the forward diffusion. While we do not require an explicit information-theoretic formalism for the stated bounds, the proof intuition is consistent: small reconstruction error implies that distinct contexts map to distinguishable representations, and hence the conditional score predictor fθ(z) can, in principle, implement a family of reverse dynamics that varies with c. In the absence of such a constraint, the function class can contain degenerate solutions that fit the marginal of xt well but implement only weak conditional dependence, a phenomenon exacerbated by structured missingness.
When c is high-dimensional (e.g. images, language, or long goal descriptors), reconstructing c directly can be unnecessarily stringent. A natural variant is to learn an encoder eη(c) and reconstruct only a low-dimensional latent u = eη(c), replacing ∥c − ĉ∥2 by ∥u − û∥2 (or a cosine/contrastive loss). This retains the main mechanism—preventing the trunk from discarding conditioning—while allowing invariances and reducing the risk of overfitting to irrelevant context details. Another variant is to reconstruct only the components of c that matter for planning (e.g. g alone), which is compatible with our analysis provided the smoothness condition is stated in the corresponding reduced context metric.
The reconstruction head yields an auxiliary signal that can be repurposed at test time: the error ∥c − fψ(fϕ(xt, c, t))∥ (or its expectation across t) can serve as a proxy for being far from supp(μ𝒟). While not a calibrated distance estimator in general, it can be used for conservative planning heuristics: increase N or T when the proxy is large, fall back to a safer baseline policy, or reject samples whose internal representations appear inconsistent with c. Such gating is consistent with the bound structure in Thm.~1, in the sense that deterioration deep in ℛ is expected and should be detected rather than ignored.
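A sketch of this gating heuristic: the reconstruction error is accumulated over (xt, t) pairs encountered while running the reverse chain for the queried context, and compared against a threshold; the threshold and the fallback policy are deployment choices outside the scope of the analysis.

```python
import torch


@torch.no_grad()
def recon_ood_score(model, c: torch.Tensor, probes) -> float:
    """Proxy for distance to supp(mu_D): reconstruction error of c averaged over recorded (x_t, t) probes."""
    errs = []
    for x_t, t in probes:                                  # (x_t, t) pairs saved during reverse sampling
        c_b = c.unsqueeze(0).expand(x_t.shape[0], -1)
        _, c_hat = model(x_t, c_b, t)
        errs.append(torch.norm(c_hat - c_b, dim=-1).mean().item())
    return sum(errs) / max(len(errs), 1)


# Example gating rule (threshold and fallback are hypothetical):
# if recon_ood_score(model, c, probes) > tau_ood:
#     action = fallback_controller(s, goal)   # abstain / switch to a conservative controller
# else:
#     action = rap_mpc_action(...)            # plan as usual, possibly with larger N or T
```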
RAP can be instantiated with pixel observations by letting s be an image (or a learned visual state) and letting c = (s0, g) be either two images (start and goal) or an image plus a goal embedding. The trunk then combines (i) a temporal model over xt (trajectory tokens) and (ii) a visual encoder for c. Reconstruction may be applied in the visual embedding space rather than pixel space, to avoid penalizing irrelevant appearance changes. The principal technical caution is that the effective dimension k in the manifold assumption should be interpreted as the intrinsic dimension of the visual factors, not the raw pixel dimension; otherwise the minimax rate becomes vacuous.
Our guarantees rely on smoothness/manifold proximity and thus do not apply to adversarial missingness or to environments where p*(⋅ ∣ c) changes discontinuously with c (e.g. bifurcations caused by hard contact dynamics or discrete task modes). Moreover, state-coverage holes that remove essential intermediate states can violate the premise that a nearby context suffices to infer feasible trajectories; in such cases, extrapolation from c alone is not information-theoretically justified, aligning with Thm.~4. Finally, RAP does not address the computational hardness of long-horizon sparse-reward planning: even with an accurate generator, the mass of successful trajectories may be too small for moderate N. These limitations motivate combining reconstruction with additional mechanisms such as hierarchical decomposition, reward guidance, or explicit feasibility models, while keeping the central requirement intact: the representation used by the diffusion model must remain genuinely conditional on (s0, g) throughout the denoising process.
We studied goal-conditioned planning from offline data in the presence of structured missingness, formalized by a region ℛ ⊂ 𝒞 of contexts that is absent under the empirical distribution μ𝒟 but has positive mass under μtest. In this regime, a conditional trajectory model is required not only to interpolate within the support of μ𝒟, but also to extrapolate in a controlled manner to contexts at nonzero distance from that support. Our central claim was that such extrapolation is possible only under explicit structure on the conditional law p*(⋅ ∣ c), and that the representation used by a conditional diffusion model must be constrained to actually retain the conditioning information in the high-noise part of the diffusion chain.
To this end we proposed Reconstruct-and-Plan (RAP), which augments a standard conditional diffusion trajectory model with an auxiliary head fψ that reconstructs c = (s0, g) (or selected components thereof) from the shared trunk representation z = fϕ(xt, c, t) at every diffusion step. The algorithm does not alter the reverse-time sampling rule at test time; instead, it biases learning toward conditional hypotheses in which the trunk does not collapse to an effectively unconditional representation when xt is heavily corrupted. This modification is compatible with conventional score-matching training and preserves the typical computational footprint of diffusion planners up to constant factors. At inference, RAP is deployed as a planner via MPC: we sample N candidate trajectories τ̂(1 : N) ∼ p̂(⋅ ∣ c) and execute the first action from the highest-reward sample under a known reward functional.
Our theoretical results connect this architectural constraint to distributional and planning guarantees. Under a manifold proximity hypothesis—contexts lying on or near a k-dimensional manifold ℳc on which p*(⋅ ∣ c) is α-Hölder in W2—we showed in Thm.~1 that the conditional generation error at a test context c decomposes into three interpretable terms: (i) a diffusion estimation term εD reflecting score-matching error, (ii) a representation/conditioning term controlled by the reconstruction error εR, and (iii) an unavoidable extrapolation term scaling as L δ(c)^α, where δ(c) is the distance from c to supp(μ𝒟). This bound explicitly separates what can be improved by model capacity and optimization from what is dictated by dataset coverage. Thm.~2 complements this by giving a minimax lower bound over 𝒫k, α, showing that dependence on intrinsic dimension k and smoothness α is information-theoretically sharp up to constants, and that a term of order Lδ^α cannot be eliminated when test contexts are separated from the dataset support.
We then translated conditional generation error into planning performance. Thm.~3 provides an additive suboptimality bound for best-of-N MPC selection: when the reward is Lipschitz in the vectorized trajectory, the expected gap between the optimal achievable return (under the true conditional distribution) and the return attained by MPC using samples from p̂(⋅ ∣ c) is controlled by O(LR ε(c)) plus a finite-sampling term ΔN. The qualitative implication is that improving the conditional generator in W2 directly improves planning, and the degradation in missing regions is precisely mediated by δ(c). This formalizes a phenomenon observed in offline diffusion planning: failures in out-of-support contexts are not merely ``distribution shift,’’ but can be quantified as a combination of estimation error and an irreducible extrapolation penalty dictated by the smoothness class.
Conversely, we made explicit the limits of offline planning in the absence of structural assumptions. Thm.~4 provides an indistinguishability-based impossibility: if p*(⋅ ∣ c) can differ arbitrarily on ℛ while agreeing on supp(μ𝒟), then no offline method can guarantee nontrivial success uniformly on ℛ. In other words, reconstruction (or any other regularizer) is not a substitute for assumptions; it is a mechanism for aligning the learned representation with the assumptions under which generalization is possible. This separation is useful in practice: when empirical performance deteriorates in missing regions, the correct diagnosis may be that the smoothness/manifold hypothesis is violated, rather than that the learning algorithm is inadequately optimized.
Several practical lessons follow from the analysis. First, tuning λ is not merely a regularization heuristic but a means of trading diffusion estimation against conditional identifiability: overly small λ permits conditional collapse, while overly large λ may consume capacity needed for accurate score prediction. Second, the reconstruction head provides an auxiliary signal that can be repurposed for conservative deployment, e.g. as a proxy for being far from supp(μ𝒟), suggesting adaptive inference budgets in (N, T) or fallback strategies. Third, since MPC success in sparse reward can be limited by the mass of successful trajectories even under an accurate model, one should view RAP as complementary to guidance, hierarchical decomposition, or alternative search procedures that mitigate needle-in-haystack effects without sacrificing conditional fidelity.
We conclude with open problems that are suggested by, but not resolved within, our framework. A first direction is to formalize reconstruction as an information constraint (or a conditional sufficiency constraint) and to relate εR to mutual information between c and z under the forward diffusion, potentially yielding sharper or more interpretable constants. A second direction is to extend the smoothness assumption to settings with discrete task modes or contact-induced discontinuities, where the appropriate notion of ``nearby context’’ may be non-Euclidean or require latent mode variables. A third direction is computational: bridging the gap between the statistical rates captured by W2 bounds and the algorithmic efficiency of sampling-based MPC in long horizons. Establishing guarantees for guided or amortized samplers, or for planners that combine diffusion proposals with explicit feasibility checks, would sharpen the connection between conditional generation fidelity and end-to-end control performance in the most challenging offline regimes.