Diffusion models have recently become a practical mechanism for sampling long-horizon action sequences and trajectories, and they have been adopted as planning primitives in robotics and control. The typical deployment pattern is model-predictive control (MPC): given a current context (e.g., the observed start state and a goal), we sample a batch of candidate trajectories from a learned conditional generator and execute the first action of the best-scoring sample under a known task reward. This approach is attractive in offline settings because it avoids explicit system identification and can exploit the expressivity of generative models to represent multimodal long-horizon behavior.
However, the offline regime imposes a structural limitation that is particularly acute for diffusion-based planners. The dataset induces an empirical context distribution μ𝒟, and at test time we are often required to act under contexts c ∼ μtest that differ systematically from those in supp(μ𝒟). In robotics this mismatch is not merely covariate shift in the ambient state space; it arises from structured missingness: there may be entire ``holes’’ of start–goal pairs, or entire regions of visited states, absent from the data. When diffusion planners are trained only by conditional score matching, they can succeed spectacularly on covered contexts and yet fail sharply on such missing regions ℛ, despite these regions being geometrically close to the dataset. Empirically, the failure is consistent with a representational pathology: the shared representation used by the conditional denoiser can discard conditioning information when it is not needed to reduce training loss on μ𝒟, and the learned reverse process then becomes weakly conditional or effectively unconditional precisely where conditioning is required for generalization.
We study this phenomenon in the setting of goal-conditioned offline planning. We treat the environment only through an offline dataset of labeled trajectories and assume no further interaction during learning. Our focus is on planners that operate by sampling a length-H trajectory τ (or its vectorization x = vec(τ)) from a learned conditional law p̂(⋅ ∣ c) and selecting among N i.i.d. samples by maximizing a known return functional R(τ, c). Our guiding question is the following: under what conditions can a diffusion-based conditional trajectory model generalize across structured dataset holes, and how does conditional generation error translate into planning suboptimality for MPC?
Our main algorithmic contribution is Reconstruct-and-Plan (RAP): a conditional trajectory
diffusion model augmented with an auxiliary context reconstruction
objective applied at every diffusion time step. Concretely, we maintain
a shared trunk fϕ that consumes
(xt, c, t)
and produces a representation z, and we attach two heads: a
diffusion head fθ predicting
the forward noise (equivalently the conditional score), and a
reconstruction head fψ predicting
the context c (or its
components such as s0 and g). We train on the combined
objective
∥ε − ε̂∥₂² + λ ∥c − ĉ∥₂²,
with (t, ε) sampled
as in standard denoising diffusion training. The reconstruction term
does not modify the inference-time sampling recursion; rather, it
constrains the representation to retain sufficient information about
c across noise scales. In this
sense RAP is not a new planner at inference time but a modification of
the learned conditional generator that targets generalization under
missing contexts.
Our theoretical contributions formalize the above intuition through a chain of implications from geometric regularity in context space to conditional generation accuracy and then to planning performance. First, we assume that contexts lie on or near a k-dimensional manifold ℳc and that the true conditional trajectory law p*(⋅ ∣ c) varies smoothly along this manifold in Wasserstein-2 distance, in the sense of an α-Hölder condition. Under these hypotheses, we prove an upper bound on the conditional generation error of RAP that decomposes into three terms: (i) a diffusion estimation term corresponding to score-matching error, (ii) a representation/conditioning term controlled by the reconstruction loss, and (iii) an extrapolation term scaling with the distance δ(c) from a test context to the dataset support. This bound makes explicit that smoothness alone is insufficient if the learned representation discards the conditioning signal, and it identifies reconstruction at all diffusion times as a sufficient mechanism (at the level of population error) to prevent this failure mode.
Second, we provide a minimax lower bound showing that, even under the same manifold-smoothness assumptions, no offline method can achieve worst-case error below the intrinsic-dimension rate Ω(n^{−α/(2α + k)}) for conditional generation in W2, up to constants and secondary terms. This establishes that the favorable scaling in k predicted by our upper bound is not an artifact of the analysis but is information-theoretically unavoidable. Moreover, when test contexts include points separated from supp(μ𝒟) by a fixed distance, an additional term proportional to Lδ^α is unavoidable, matching the structure of the extrapolation term in the upper bound.
Third, we connect conditional generation error to MPC planning performance. Assuming the reward is Lipschitz as a functional of the trajectory (in the vectorized Euclidean metric), we show that the expected gap between the best trajectory one could draw from the true law p*(⋅ ∣ c) and the trajectory chosen by MPC when sampling from p̂(⋅ ∣ c) is bounded by a term proportional to W2(p̂(⋅ ∣ c), p*(⋅ ∣ c)), plus a finite-sampling term ΔN capturing the order-statistics effect of selecting the best among N samples. This theorem makes precise how improvements in conditional generation translate into downstream planning gains, while also highlighting an orthogonal limitation: even a perfect generator may require large N when high-reward trajectories occupy very small probability mass.
Finally, we complement the positive results with an impossibility statement showing that without coverage and smoothness assumptions, purely offline planning over missing regions is fundamentally ill-posed. We construct pairs of conditional trajectory laws that agree on supp(μ𝒟) yet induce incompatible optimal actions on some missing context c ∈ ℛ. Consequently, any offline algorithm is forced to fail on at least one of the two hypotheses, yielding a worst-case success probability bounded away from 1 on ℛ. This result clarifies the role of our manifold proximity assumption: it is not merely a convenient regularity condition but a necessary structural restriction for nontrivial guarantees.
Empirically, we validate these claims on goal-conditioned robotics-style benchmarks with explicitly constructed missing-context and missing-visited-state regimes. Across tasks, RAP improves success and normalized return in the missing regions while matching or improving performance on covered contexts. Ablations varying λ, the diffusion horizon, and the number of denoising steps support the central claim that reconstruction regularizes the conditional representation and thereby reduces failure under structured dataset holes, consistent with the decomposition in our bounds.
We consider goal-conditioned offline planning in a Markov decision
process (MDP) ℳ = (𝒮, 𝒜, P, r, ρ0, γ)
augmented with a goal variable g ∈ 𝒢. A context is c = (s0, g) ∈ 𝒞 ≔ 𝒮 × 𝒢,
and the task reward is a known function r : 𝒮 × 𝒜 × 𝒢 → ℝ. We focus on
finite-horizon planning over length-H trajectories τ = (s0, a0, s1, a1, …, sH)
with (possibly goal-conditioned) trajectory return $R(\tau,c)=\sum_{t=0}^{H-1} r(s_t,a_t,g)$,
including the common sparse-goal specialization $r(s_t,a_t,g)=\mathbbm{1}\{d(s_{t+1},g)\le
\varepsilon\}$ or terminal-only variants. For notational
convenience we also employ a vectorized representation x = vec(τ) ∈ ℝdτ,
which may concatenate states and actions.
We are given an offline dataset $\mathcal{D}=\{(c_i,\tau_i)\}_{i=1}^{n}$, where each τi is a
length-H trajectory segment.
When raw logs contain variable-length episodes, we interpret 𝒟 as the collection of sliding windows of
length H extracted from those
episodes (with the appropriate s0 and goal label). Let
μ𝒟 denote the
empirical distribution over contexts induced by {ci}i = 1n,
and let μtest
denote the (unknown) test-time distribution over contexts. Our goal is
to construct a goal-conditioned planning policy π(a ∣ s, g)
that maximizes expected discounted test return. Operationally, we
implement π via MPC over a
learned conditional trajectory generator: at each environment step we
observe the current context c = (s, g), sample
N candidate trajectories from
a learned conditional law p̂(⋅ ∣ c), score them using
the known R(⋅, c),
and execute the first action of the best-scoring sample.
A defining difficulty in offline planning is that μ𝒟 and μtest may differ in a way that cannot be reduced to small i.i.d. covariate shift. We isolate two forms of structured missingness that are convenient to analyze and to instantiate experimentally.
There exists a region ℛ ⊂ 𝒞 such
that μ𝒟(ℛ) = 0 while μtest(ℛ) > 0,
i.e., certain start–goal pairs never appear in the offline data yet
arise at test time. In goal-conditioned robotics benchmarks this
corresponds, for example, to holding out an entire range of goal
locations, or holding out combinations of start poses and goals. For
analysis it is useful to quantify how far a test context is from covered
contexts; we write δ(c) ≔ dist(c, supp(μ𝒟)),
with respect to a fixed norm on 𝒞
(typically Euclidean after appropriate normalization of state and goal
coordinates).
Even when μ𝒟
provides coverage of start–goal pairs, the dataset may systematically
avoid portions of the state space that are nevertheless required for
successful test-time execution. To formalize this, let ν𝒟 be the empirical state
visitation distribution induced by dataset trajectories (e.g., the
uniform mixture over time indices and trajectories). A missing-visited-state region is a measurable
subset 𝒮miss ⊂ 𝒮 such that
ν𝒟(𝒮miss) = 0
while the optimal (or any high-return) test-time behavior under some
contexts must pass through 𝒮miss with positive probability.
This setting is practically relevant when demonstrations or prior
policies avoid narrow passages, obstacle-adjacent regions, or rare
contact configurations; at test time, however, the goal or perturbations
force traversal of precisely these regions.
While missing visited states is not, on its face, missing contexts, it can often be reduced to it by enriching the context variable to include additional conditioning information (e.g., the current state at replanning time) and observing that MPC repeatedly queries the generator at contexts (st, g) for states st that may lie outside the dataset visitation support. Accordingly, both forms of missingness exert pressure on the conditional generator in essentially the same way: it must produce high-quality trajectories for condition values that were never (or rarely) seen during training.
We separate evaluation into (i) conditional generative modeling fidelity and (ii) downstream planning performance.
For a fixed context c, let
p*(⋅ ∣ c)
denote the (unknown) true conditional trajectory law induced by the
environment and whatever data-collection process generated 𝒟, and let p̂(⋅ ∣ c) denote the learned
conditional generator. We use the Wasserstein-2 metric
ε(c) ≔ W2 (p̂(⋅ ∣ c), p*(⋅ ∣ c)),
computed over the space of vectorized trajectories x = vec(τ) with the
Euclidean cost. This metric is well-aligned with planning under
Lipschitz trajectory rewards, since it controls discrepancies in
low-order moments and admits coupling-based bounds that translate
distribution shift into reward shift. In empirical studies where p*(⋅ ∣ c) is
inaccessible, we approximate ε(c) by two-sample
estimators (e.g., Sinkhorn-regularized W2) using generated
samples and reference trajectories available for the same context (from
held-out rollouts, a simulator, or an auxiliary data split), reporting
averages over c ∼ μtest and
also stratifying by δ(c).
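To make the two-sample estimator concrete, the following sketch computes a Sinkhorn-regularized W2 estimate between generated and reference trajectories for a single context; it assumes the POT library (`ot`) as a dependency, and the sample counts, regularization strength, and array layout are illustrative choices rather than prescriptions.

```python
import numpy as np
import ot  # POT: Python Optimal Transport (assumed dependency)


def sinkhorn_w2(gen_traj: np.ndarray, ref_traj: np.ndarray, reg: float = 0.05) -> float:
    """Two-sample, entropy-regularized estimate of W2(p_hat(.|c), p*(.|c)).

    gen_traj, ref_traj: arrays of shape (num_samples, d_tau) holding vectorized
    trajectories x = vec(tau) generated for, and observed under, the same context c.
    """
    n, m = len(gen_traj), len(ref_traj)
    a = np.full(n, 1.0 / n)  # uniform weights on generated samples
    b = np.full(m, 1.0 / m)  # uniform weights on reference samples
    M = ot.dist(gen_traj, ref_traj, metric="sqeuclidean")  # pairwise squared Euclidean costs
    cost = ot.sinkhorn2(a, b, M, reg)  # regularized transport cost, approximates W2^2
    return float(np.sqrt(cost))
```

Averaging this estimate over c ∼ μtest, and stratifying it by δ(c), yields the reported quantities.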
In sparse-goal tasks, a primary metric is the probability of reaching
the goal within tolerance at (or before) horizon H. For a rollout $\{s_t\}_{t=0}^{H}$
under the executed MPC policy, define
$\mathrm{Succ}(c)\coloneqq\mathbbm{1}\{\exists\, t\le H:\ d(s_t,g)\le \epsilon\}$,
and report 𝔼c ∼ μtest[Succ(c)]
(often separately on covered contexts and on missing regions).
For shaped rewards, or when comparing across tasks with different reward
scales, we additionally report a normalized return. Given per-episode
return $G(c)\coloneqq \sum_{t=0}^{H-1}
r(s_t,a_t,g)$ under the evaluated planner, we normalize via
$$
\mathrm{NR}(c)\;\coloneqq\;\frac{G(c)-G_{\min}(c)}{\,G_{\max}(c)-G_{\min}(c)+\eta\,},
$$
where Gmax(c) and
Gmin(c)
are task-dependent reference returns (e.g., from an expert/oracle and a
weak baseline, respectively) and η > 0 prevents division by zero.
We then report 𝔼c ∼ μtest[NR(c)]
and its decomposition over covered versus missing regimes. In purely
sparse-goal settings, NR typically
coincides (up to affine transformation) with success, but it remains
useful when intermediate shaping terms or energy penalties are
present.
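A minimal sketch of these two metrics; the tolerance `eps`, the reference returns, and the assumption that distances are computed on the goal-relevant coordinates are placeholders, not values fixed by the text.

```python
import numpy as np


def success(states: np.ndarray, goal: np.ndarray, eps: float) -> float:
    """Succ(c): 1 if the rollout reaches the goal within tolerance at or before horizon H."""
    # states: (H+1, goal_dim), already projected onto the coordinates compared with the goal
    dists = np.linalg.norm(states - goal, axis=-1)
    return float(np.any(dists <= eps))


def normalized_return(G: float, G_min: float, G_max: float, eta: float = 1e-8) -> float:
    """NR(c): per-episode return rescaled by task-dependent reference returns."""
    return (G - G_min) / (G_max - G_min + eta)
```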
These metrics together disentangle (a) whether the conditional generator produces trajectories close to the true conditional law in a distributional sense, and (b) whether the induced MPC policy achieves the intended control objective under structured missingness.
Diffusion generative models provide a convenient interface between offline data and test-time planning because they learn a conditional distribution over entire trajectory segments and defer objective-specific selection to inference. In contrast to value-based offline RL methods, which require extrapolating counterfactual action values for out-of-distribution state–action pairs, a trajectory generator can be used as a proposal mechanism: it produces candidate rollouts that are subsequently evaluated by the known reward functional. This separation is particularly natural in goal-conditioned settings, where the conditional trajectory law can be highly multimodal (multiple homotopy classes of paths to the same goal), and where the offline data may cover only a subset of contexts.
Let x0 = vec(τ) ∈ ℝdτ
denote a vectorized trajectory. A (conditional) diffusion model
specifies a forward noising process
$x_t=\sqrt{\bar\alpha_t}\,x_0+\sqrt{1-\bar\alpha_t}\,\varepsilon$ with $\varepsilon\sim\mathcal{N}(0,I)$,
for t ∈ {1, …, T},
and learns a parameterized reverse-time denoiser that maps (xt, c, t)
to an estimate of the injected noise. Concretely, one trains a network
ε̂θ(xt, c, t)
so that, when embedded into the standard DDPM reverse update, repeated
denoising steps transform an initial xT ∼ 𝒩(0, I)
into a sample x̂0
whose induced trajectory τ̂
resembles the data distribution conditioned on c. Conditioning is typically
implemented by concatenating c
to per-time-step embeddings, by FiLM-style modulation, or by
cross-attention from trajectory tokens to context tokens; for the
present discussion we treat these as interchangeable mechanisms that aim
to approximate a conditional score model.
The canonical objective samples (c, x0) from the
offline dataset, then samples t (often uniformly) and ε ∼ 𝒩(0, I), forms xt, and
minimizes the squared error
$$
\mathcal{L}_{\mathrm{diff}}\;\coloneqq\;\mathbb{E}\bigl[\|\varepsilon-\hat\varepsilon_\theta(x_t,c,t)\|_2^2\bigr].
$$
Under standard assumptions, this training procedure is equivalent to
score matching for the conditional distributions {p(xt ∣ c)}t = 1T
and, in the infinite-data/infinite-capacity limit, yields a reverse
process whose samples match the target conditional law p*(x0 ∣ c).
At test time, given a context c = (s, g), we generate N candidate trajectories by sampling xT(j) ∼ 𝒩(0, I) and iterating the learned reverse update for t = T, T − 1, …, 1 to obtain x̂0(j) and hence τ̂(j). We then evaluate each candidate under the known trajectory return R(τ̂(j), c) and select j⋆ ∈ arg maxjR(τ̂(j), c). The MPC policy executes the first action of τ̂(j⋆), observes the next state, updates the context to the new (s, g), and replans. This ``generate–and–select’’ mechanism has two salient properties: it naturally exploits multimodality through sampling, and it can incorporate reward shaping or constraints at inference without retraining the generator, provided the reward is computable from the trajectory.
The preceding picture implicitly assumes that the learned denoiser meaningfully uses the context. However, under structured missingness, conditional diffusion models exhibit a failure mode we refer to as conditional collapse: the learned reverse process becomes only weakly dependent on c, so that p̂(⋅ ∣ c) degenerates toward an approximately unconditional model (or, more subtly, becomes correct only on supp(μ𝒟) and behaves unpredictably elsewhere). We emphasize that this is not a purely architectural pathology; it is compatible with the training objective.
One way to see the issue is to observe that for many noise schedules, a substantial fraction of training steps use t for which ᾱt is close to 1. In that regime xt already contains nearly all information about x0, and the Bayes-optimal predictor of ε can be well-approximated as a function of xt alone. Consequently, minimizing ℒdiff does not, by itself, force the network to retain or exploit c, especially when c is statistically redundant given xt on the dataset distribution. In finite-capacity networks trained by SGD, this pressure manifests as a representation that prioritizes features predictive of ε averaged over μ𝒟, potentially discarding low-signal conditioning information. The problem is exacerbated precisely in the regimes we care about: if μ𝒟(c) = 0 for c ∈ ℛ, then the loss provides no direct gradient signal enforcing correct dependence on c in ℛ, and extrapolation is driven only by inductive bias.
A complementary perspective is information-theoretic. Let z denote an intermediate representation used to predict ε. If the trunk can map (xt, c, t) to z in such a way that I(z; c) is small while still allowing accurate prediction of ε on training data, then conditioning information is effectively absent from the computation that determines the reverse dynamics. In that case, at test time the sampling distribution varies weakly with c, and MPC reduces to selecting among trajectories that are typical under the dataset marginal rather than appropriate for the queried goal. Under missing contexts or missing visited states (which induce out-of-support replanning contexts), this collapse translates directly into poor success: the high-reward set may have negligible probability mass under the near-unconditional generator, making best-of-N sampling ineffective even if N is large.
A natural remedy is to explicitly enforce that the internal
representation used for denoising retains the conditioning information.
The SRDP idea is to augment the diffusion network with an auxiliary
reconstruction task: from the same representation z used to predict ε, we also predict ĉ and penalize reconstruction error
at every diffusion step. Abstractly, if ĉψ(z)
is a reconstruction head, we add the penalty $\lambda\,\mathbb{E}\bigl[\|c-\hat c_\psi(z)\|_2^2\bigr]$ to the diffusion objective.
Because the reconstruction must succeed for all t, the representation cannot ignore
c even at low-noise steps
where xt
nearly determines x0. In effect, we
constrain the learned features to preserve conditioning information in a
form that remains accessible throughout denoising, which empirically
mitigates collapse and, in analysis, supports stability arguments
linking small reconstruction error to small conditional mismatch.
These considerations motivate our Reconstruct-and-Plan (RAP) architecture, which we present next as a trajectory diffusion planner with a shared trunk and dual heads for denoising and context reconstruction.
We now specify the Reconstruct-and-Plan (RAP) model class and its use as a goal-conditioned planner. RAP instantiates a conditional trajectory diffusion model together with an auxiliary reconstruction task designed to prevent the learned reverse process from becoming weakly dependent on the context under structured missingness. The method is deliberately modular: the diffusion component is trained to represent a conditional trajectory law, while planning is delegated to a sampling-based MPC loop that evaluates candidates under a known return functional.
We fix a planning horizon H
and represent a length-H
trajectory segment by a vector
x ≡ x0 ≔ vec(τ) ∈ ℝdτ, τ = (s0, a0, …, sH),
where vec(⋅) denotes a fixed
concatenation of states and actions (or, in action-only variants,
actions plus a compact state encoding). Each training example consists
of a context–trajectory pair (c, x0) with
context c = (s0, g),
where g ∈ 𝒢 denotes a goal
descriptor. When demonstrations have variable length, we reduce them to
a collection of length-H
windows, treating each window start state as s0 and inheriting the
goal g from the parent
episode. This reduction yields an empirical context distribution μ𝒟 that may have holes
relative to μtest,
which is precisely the regime in which we seek stable conditioning.
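A minimal sketch of this window reduction, assuming episodes are stored as state and action arrays with a per-episode goal (field names and array layout are hypothetical):

```python
import numpy as np


def extract_windows(states: np.ndarray, actions: np.ndarray, goal: np.ndarray, H: int):
    """Reduce one variable-length episode to length-H (context, trajectory) training pairs.

    states: (T+1, state_dim), actions: (T, action_dim), goal: (goal_dim,).
    Each window start state becomes s0 and the goal g is inherited from the episode.
    """
    pairs = []
    for start in range(len(actions) - H + 1):
        s = states[start : start + H + 1]            # s_0, ..., s_H
        a = actions[start : start + H]               # a_0, ..., a_{H-1}
        c = np.concatenate([s[0], goal])             # context c = (s0, g)
        x0 = np.concatenate([s.ravel(), a.ravel()])  # fixed vectorization x0 = vec(tau)
        pairs.append((c, x0))
    return pairs
```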
RAP parameterizes all diffusion steps with a shared feature extractor
(the trunk)
z = fϕ(xt, c, t),
followed by two task-specific heads. The diffusion head predicts the noise injected in
the forward process,
ε̂ = fθ(z),
and the reconstruction head predicts the context,
ĉ = fψ(z).
In goal-conditioned tasks it is often convenient to decompose ĉ into (ŝ0, ĝ) and to
measure error componentwise; we keep the notation ∥c − ĉ∥22
for brevity. Architecturally, fϕ may be
implemented by a transformer over time-indexed tokens with time
embedding of t, with c injected by concatenation,
cross-attention, or FiLM modulation; our analysis treats these
mechanisms as different realizations of a Lipschitz map (xt, c, t) ↦ z.
The salient design choice is not the particular conditioning mechanism,
but the requirement that the representation z simultaneously supports denoising
and context reconstruction at every diffusion step.
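A minimal PyTorch-style sketch of this layout, using concatenation-based conditioning and an MLP trunk as one of the admissible realizations (layer sizes and the embedding choice are illustrative assumptions):

```python
import torch
import torch.nn as nn


class RAPNetwork(nn.Module):
    """Shared trunk f_phi with a diffusion head f_theta and a reconstruction head f_psi."""

    def __init__(self, d_tau: int, d_c: int, d_z: int = 256, T: int = 1000):
        super().__init__()
        self.t_embed = nn.Embedding(T + 1, 32)      # diffusion-time embedding
        self.trunk = nn.Sequential(                 # f_phi: (x_t, c, t) -> z
            nn.Linear(d_tau + d_c + 32, d_z), nn.SiLU(),
            nn.Linear(d_z, d_z), nn.SiLU(),
        )
        self.diff_head = nn.Linear(d_z, d_tau)      # f_theta: z -> eps_hat
        self.recon_head = nn.Linear(d_z, d_c)       # f_psi:   z -> c_hat

    def forward(self, x_t: torch.Tensor, c: torch.Tensor, t: torch.Tensor):
        z = self.trunk(torch.cat([x_t, c, self.t_embed(t)], dim=-1))
        return self.diff_head(z), self.recon_head(z)
```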
We adopt the standard forward noising process
$$
x_t \;=\; \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\varepsilon,
\qquad \varepsilon\sim\mathcal{N}(0,I),
\qquad t\in\{1,\dots,T\},
$$
with schedule {βt}t = 1T,
αt = 1 − βt,
and $\bar\alpha_t=\prod_{i=1}^t\alpha_i$.
Training proceeds by sampling (c, x0) from
𝒟, sampling t (typically uniformly on {1, …, T}) and ε, forming xt, and
minimizing the composite loss
$$
\mathcal{L}(\phi,\theta,\psi)\;=\;\mathbb{E}\Bigl[\,\|\varepsilon-f_\theta(z)\|_2^2+\lambda\,\|c-f_\psi(z)\|_2^2\,\Bigr],\qquad z=f_\phi(x_t,c,t),
$$
where λ > 0 controls the
strength of the anti-collapse regularization. The reconstruction term is
enforced at every diffusion step t, so that the trunk cannot satisfy denoising by relying on
xt alone
at low-noise steps while ignoring c. In particular, because the
reconstruction head receives only z (and not xt directly),
small reconstruction error forces the trunk to encode context
information into z in a manner
that remains accessible uniformly across noise levels. We emphasize that
the reconstruction target is the conditioning variable c supplied at training time; the
composite loss therefore does not require additional labels beyond those
already present in goal-conditioned planning datasets.
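One stochastic gradient step on the composite loss might then look as follows; `model` is the dual-head network sketched above, `alpha_bar` holds the cumulative products ᾱ_t indexed by t, and the value of λ is an illustrative assumption.

```python
import torch
import torch.nn.functional as F


def rap_training_step(model, optimizer, x0, c, alpha_bar, lam: float = 0.1) -> float:
    """Sample (t, eps), form x_t, and minimize ||eps - eps_hat||^2 + lam * ||c - c_hat||^2."""
    B, T = x0.shape[0], alpha_bar.shape[0] - 1
    t = torch.randint(1, T + 1, (B,), device=x0.device)       # t ~ Unif({1, ..., T})
    eps = torch.randn_like(x0)                                 # forward-process noise
    ab = alpha_bar[t].unsqueeze(-1)
    x_t = torch.sqrt(ab) * x0 + torch.sqrt(1.0 - ab) * eps     # forward noising
    eps_hat, c_hat = model(x_t, c, t)
    loss = F.mse_loss(eps_hat, eps) + lam * F.mse_loss(c_hat, c)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```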
At inference, given a queried context c = (s, g), we
sample candidate trajectories by running the learned reverse diffusion
starting from Gaussian noise. Concretely, for each sample index j ∈ {1, …, N} we draw xT(j) ∼ 𝒩(0, I)
and iterate for t = T, T − 1, …, 1:
zt(j) = fϕ(xt(j), c, t), ε̂t(j) = fθ(zt(j)), xt − 1(j) = denoise(xt(j), ε̂t(j), t),
where denoise(⋅) denotes a standard
DDPM update (or an accelerated deterministic/stochastic variant such as
DDIM or a solver-based step). After T steps we decode x̂0(j) ≡ x0(j)
into a trajectory τ̂(j) = unvec(x̂0(j)).
The reconstruction head fψ is not used to
modify this sampling rule at test time; its role is to shape the learned
representation during training so that the conditional reverse dynamics
induced by fθ ∘ fϕ
meaningfully depends on c.
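A sketch of this sampling recursion with a plain DDPM ancestral update; the schedule tensors are assumed inputs (indexed by t), and accelerated samplers would replace the inner update without otherwise changing the procedure.

```python
import torch


@torch.no_grad()
def sample_trajectories(model, c, N, T, betas, alphas, alpha_bar, d_tau):
    """Draw N candidate vectorized trajectories x0_hat from p_hat(. | c) by reverse diffusion."""
    c_batch = c.unsqueeze(0).expand(N, -1)              # repeat the queried context
    x = torch.randn(N, d_tau, device=c.device)          # x_T ~ N(0, I)
    for t in range(T, 0, -1):
        t_batch = torch.full((N,), t, dtype=torch.long, device=c.device)
        eps_hat, _ = model(x, c_batch, t_batch)         # reconstruction head unused at inference
        mean = (x - betas[t] / torch.sqrt(1.0 - alpha_bar[t]) * eps_hat) / torch.sqrt(alphas[t])
        x = mean + torch.sqrt(betas[t]) * torch.randn_like(x) if t > 1 else mean
    return x                                            # (N, d_tau) candidate x0_hat
```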
RAP uses the learned conditional generator as a proposal distribution
inside a receding-horizon controller. Given current state s and fixed goal g, we form the context c = (s, g),
generate N candidate
trajectories {τ̂(j)}j = 1N
as above, and score each candidate under a known return functional
$$
J^{(j)} \;\coloneqq\; R(\hat\tau^{(j)},c) \;=\; \sum_{t=0}^{H-1} r(\hat
s_t^{(j)},\hat a_t^{(j)},g),
$$
or, in sparse-goal settings, a terminal indicator or shaped variant
computable from the generated rollout. We select an index j⋆ ∈ arg maxjJ(j)
and execute the first action a = â0(j⋆)
in the environment. After observing the next state, we update the
context to (s′, g) and
repeat. This MPC loop separates generation from selection: the generator need only place
non-negligible probability mass on plausible goal-directed trajectories,
while the known reward selects among sampled modes. The dependence of
performance on both conditional generation error and the finite sampling
budget N is quantified in our
subsequent theory.
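The surrounding receding-horizon loop can be sketched as follows; the environment interface, the `build_context` and `unvec` helpers, and the return functional `R` are task-specific placeholders rather than parts of the method.

```python
def mpc_rollout(env, model, goal, R, build_context, unvec, sampler_args, N, max_steps):
    """Best-of-N MPC: sample candidates, score with the known return, execute the first action."""
    s = env.reset()
    for _ in range(max_steps):
        c = build_context(s, goal)                   # context c = (s, g) as a tensor
        x0_batch = sample_trajectories(model, c, N, **sampler_args)
        candidates = [unvec(x0) for x0 in x0_batch]  # decode vec(tau) -> (states, actions)
        scores = [R(tau, c) for tau in candidates]
        j_star = max(range(N), key=lambda j: scores[j])
        a0 = candidates[j_star][1][0]                # first action of the best-scoring sample
        s, reward, done, info = env.step(a0)
        if done:
            break
```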
Training adds only a second head to a standard conditional diffusion model; the trunk computation is shared, so the incremental cost is a single small forward/backward through fψ. Inference costs O(NT) network evaluations per replanning step, with the usual tradeoffs between T and sample quality; accelerated samplers reduce the effective number of denoising steps without changing the method. In all cases, RAP is defined by the single principle that conditioning information is explicitly reconstructible from the denoising representation across diffusion time, which is the property we will exploit to obtain conditional generation guarantees under manifold-smoothness assumptions.
We formalize the conditional generation problem induced by offline
goal-conditioned planning with holes in the empirical context
distribution. Let p*(⋅ ∣ c) denote
the (unknown) conditional law of vectorized trajectories x0 = vec(τ)
given context c = (s0, g),
and let p̂(⋅ ∣ c) be
the conditional law implicitly defined by running the learned reverse
diffusion dynamics using the denoising network fθ ∘ fϕ
(with c held fixed). Our goal
in this section is to bound
W2 (p̂(⋅ ∣ c), p*(⋅ ∣ c))
for test contexts c ∼ μtest that
may lie outside supp(μ𝒟).
The principal structural assumption is that conditional behavior
varies smoothly along a low-dimensional set of contexts, rather than in
the ambient space. Concretely, we assume there exists a k-dimensional (possibly curved) set
ℳc ⊂ 𝒞 such that
both the dataset and test-time contexts concentrate near ℳc, and that the
conditional law is Hölder along ℳc in Wasserstein
distance: there exist α ∈ (0, 1] and L < ∞ such that for all c, c′ ∈ ℳc,
$$
W_2\bigl(p^*(\cdot\mid c),\,p^*(\cdot\mid c')\bigr)\;\le\;L\,\|c-c'\|^{\alpha}.
$$
To quantify structured missingness, we measure the test context’s
distance to covered contexts by
δ(c) ≔ dist(c, supp(μ𝒟)).
When c ∈ ℛ (a missing-context
region), we have δ(c) > 0 by
definition.
Our analysis treats the learned denoiser and reconstruction maps as
Lipschitz function classes. Specifically, we assume that for each
diffusion step t the mappings
(xt, c) ↦ fϕ(xt, c, t)
and z ↦ fθ(z)
are L-Lipschitz in their
continuous arguments, and that the reverse diffusion update denoise(⋅) is stable to perturbations of the
noise prediction in a manner controlled by the schedule {βt}t = 1T
(a standard property in diffusion perturbation analyses). We summarize
statistical/approximation errors at the population level by two
quantities:
(i) a diffusion estimation error εD, representing
the gap between the learned noise predictor and the true conditional
score/noise (aggregated over t); and (ii) a reconstruction error
εR defined
by
$$
\varepsilon_R \;\coloneqq\; \mathbb{E}\Bigl[\|c - f_\psi\bigl(f_\phi(x_t,c,t)\bigr)\|_2^2\Bigr],
$$
where the expectation is over (c, x0) ∼ 𝒟,
t, and the forward-process
noise. The purpose of the reconstruction term is not to use ĉ at test time, but to certify that
the shared representation retains conditioning information across
diffusion time, thereby preventing the learned reverse dynamics from
becoming effectively unconditional.
We sketch the logical structure underlying Theorem~1. First, we compare the ideal reverse diffusion that uses the true conditional score/noise to the learned reverse diffusion that uses fθ ∘ fϕ; standard diffusion stability yields a term proportional to εD. Second, we isolate the effect of conditional collapse: even if the denoiser is accurate on average, the representation z = fϕ(xt, c, t) could discard c and thereby induce a reverse process that is weakly dependent on context. The reconstruction constraint rules out this collapse by forcing I(z; c) to be large in an operational sense; quantitatively, it yields the term C2 εR^{1/2} by translating mean-squared reconstruction error into a bound on the discrepancy between the learned conditional and a context-agnostic surrogate. Third, when c lies outside the dataset support, we cannot expect zero error even with perfect estimation; we therefore relate p*(⋅ ∣ c) to p*(⋅ ∣ c′) at a nearest covered context c′ and invoke the Hölder condition, producing the extrapolation penalty C3 L δ(c)^α.
The upper bound is meaningful only if its dependence on sample size is unimprovable (up to constants) under the stated structural assumptions. We therefore consider a minimax class 𝒫k, α of conditional trajectory laws whose context dependence is α-Hölder on a k-dimensional manifold, and ask for the best possible worst-case W2 estimation rate from n offline samples. The following statement is standard in nonparametric estimation on manifolds, and applies to conditional diffusion modeling when we view the model as estimating a smooth family of distributions.
Theorems~1–2 have three consequences. (i) Under manifold proximity, the best achievable rate depends on k rather than the ambient dimension of (s0, g). (ii) Even with infinite data near supp(μ𝒟), extrapolation into ℛ costs at least order δ(c)^α without further structure. (iii) The εR^{1/2} term quantifies the benefit of enforcing context-retention across diffusion time; it cannot remove the δ(c)^α term, but it prevents an additional failure mode in which p̂(⋅ ∣ c) becomes nearly independent of c even on covered regions. These generation guarantees will be converted into planning guarantees once we account for MPC selection from finitely many sampled trajectories.
We now convert a conditional generation guarantee for p̂(⋅ ∣ c) into a bound on
the planning loss incurred by MPC that selects the best trajectory among
N samples. Throughout, we
write x = vec(τ) and
define the conditional generative error
ε(c) ≔ W2 (p̂(⋅ ∣ c), p*(⋅ ∣ c)).
We treat R(τ, c) as a known
functional used only for scoring candidate trajectories at
inference.
We assume that for each fixed context c, the trajectory reward is LR-Lipschitz in
the vectorized trajectory:
$$
|R(\tau,c)-R(\tau',c)|\;\le\;L_R\,\|\mathrm{vec}(\tau)-\mathrm{vec}(\tau')\|_2 .
$$
This assumption is natural when R is a sum of per-step costs that
are themselves Lipschitz in (st, at)
and the vectorization concatenates the corresponding components. For
instance, if
$$
R(\tau,c)=\sum_{t=0}^{H-1} r(s_t,a_t,g),\qquad |r(u,g)-r(u',g)|\le
L_r\|u-u'\|_2,
$$
and vec(τ) concatenates ut = (st, at),
then the Lipschitz condition above holds with $L_R\le L_r\sqrt{H}$
(up to fixed constants determined by the chosen vectorization norm).
Thus the bound below will typically degrade at most polynomially with
horizon through LR.
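For completeness, the Cauchy–Schwarz step behind the $L_R\le L_r\sqrt{H}$ claim, writing $u_t=(s_t,a_t)$:
$$
|R(\tau,c)-R(\tau',c)|
\;\le\; L_r\sum_{t=0}^{H-1}\|u_t-u_t'\|_2
\;\le\; L_r\sqrt{H}\,\Bigl(\sum_{t=0}^{H-1}\|u_t-u_t'\|_2^2\Bigr)^{1/2}
\;\le\; L_r\sqrt{H}\,\|\mathrm{vec}(\tau)-\mathrm{vec}(\tau')\|_2 .
$$
The last inequality uses that vec(τ) may contain additional components (such as sH), which can only enlarge the Euclidean norm.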
Let (X, X̂) be any
coupling of p*(⋅ ∣ c) and
p̂(⋅ ∣ c). By the Lipschitz condition and
Cauchy–Schwarz,
$$
\bigl|\mathbb{E}[R(X,c)]-\mathbb{E}[R(\hat X,c)]\bigr|
\;\le\; L_R\,\mathbb{E}\|X-\hat X\|_2
\;\le\; L_R\,\bigl(\mathbb{E}\|X-\hat X\|_2^2\bigr)^{1/2}.
$$
Taking the infimum over couplings yields
$$
\bigl|\mathbb{E}_{p^*(\cdot\mid c)}[R(\tau,c)]-\mathbb{E}_{\hat p(\cdot\mid c)}[R(\tau,c)]\bigr|
\;\le\; L_R\,W_2\bigl(\hat p(\cdot\mid c),\,p^*(\cdot\mid c)\bigr)\;=\;L_R\,\varepsilon(c).
$$
This step is the sole place where we pass from distributional error to
planning error.
Fix a context c and
consider a single MPC planning call. Let τ⋆(c) ∈ arg maxτR(τ, c)
denote an (idealized) optimal horizon-H trajectory for context c.
Let $\tau^{(1)},\dots,\tau^{(N)}\stackrel{\text{i.i.d.}}{\sim}p^*(\cdot\mid
c)$ and define the best reward achievable by N samples,
τbest*(c) ∈ arg max1 ≤ j ≤ NR(τ(j), c).
Similarly, let $\hat\tau^{(1)},\dots,\hat\tau^{(N)}\stackrel{\text{i.i.d.}}{\sim}\hat
p(\cdot\mid c)$ and define the MPC-selected trajectory
τ̂MPC(c) ∈ arg max1 ≤ j ≤ NR(τ̂(j), c).
We quantify the intrinsic limitation of best-of-N selection under the true
conditional law by
$$
\Delta_N(c)\;\coloneqq\;R(\tau^\star(c),c)\;-\;\mathbb{E}\bigl[R(\tau^*_{\mathrm{best}}(c),c)\bigr].
$$
Equivalently, if Jj = R(τ(j), c)
for i.i.d. τ(j) ∼ p*(⋅ ∣ c)
and J(N) = maxj ≤ N Jj,
then ΔN(c) = R(τ⋆(c), c) − 𝔼[J(N)].
In particular, if the reward distribution has a regular upper tail,
ΔN(c)
decays with N (e.g., as a
quantile gap); conversely, for sparse rewards ΔN(c)
can remain large unless N is
sufficiently large.
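When reference samples from p*(⋅ ∣ c) are available (e.g. from a simulator), ΔN(c) can be estimated by a simple Monte Carlo scheme; the sketch below treats the reward sampler and the reference optimal return as diagnostic conveniences that are not part of the method itself.

```python
import numpy as np


def estimate_delta_N(sample_reward, R_star: float, N: int, num_trials: int = 1000, seed: int = 0) -> float:
    """Monte Carlo estimate of Delta_N(c) = R(tau_star, c) - E[max of N i.i.d. rewards under p*].

    sample_reward: callable(rng) returning R(tau, c) for one draw tau ~ p*(. | c).
    R_star: reference optimal return R(tau_star(c), c).
    """
    rng = np.random.default_rng(seed)
    best_of_N = np.empty(num_trials)
    for i in range(num_trials):
        rewards = np.array([sample_reward(rng) for _ in range(N)])
        best_of_N[i] = rewards.max()
    return R_star - best_of_N.mean()
```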
We decompose
$$
R(\tau^\star,c)-R(\hat\tau_{\mathrm{MPC}},c)
=\underbrace{R(\tau^\star,c)-R(\tau^*_{\mathrm{best}},c)}_{\text{finite-sampling
under }p^*}
+\underbrace{R(\tau^*_{\mathrm{best}},c)-R(\hat\tau_{\mathrm{MPC}},c)}_{\text{model
mismatch}}.
$$
Taking expectations, the first term is exactly ΔN(c)
by definition of ΔN(c). For the second term, we take an optimal W2 coupling π of p*(⋅ ∣ c) and
p̂(⋅ ∣ c) and draw
N i.i.d. coupled pairs (τ(j), τ̂(j)) ∼ π.
By maximality of τbest* and
τ̂MPC and the
elementary inequality |maxjuj − maxjvj| ≤ maxj|uj − vj|,
we obtain
R(τbest*, c) − R(τ̂MPC, c) ≤ max1 ≤ j ≤ N|R(τ(j), c) − R(τ̂(j), c)|.
Applying the Lipschitz condition and bounding the maximum by a sum yields
$$
\mathbb{E}\bigl[R(\tau^*_{\mathrm{best}},c)-R(\hat\tau_{\mathrm{MPC}},c)\bigr]
\;\le\;L_R\sum_{j=1}^N
\mathbb{E}\|\mathrm{vec}(\tau^{(j)})-\mathrm{vec}(\hat\tau^{(j)})\|_2,
$$
and by exchangeability and the optimality of the coupling we conclude that this term is at most CLRε(c)
for a constant C (absorbing
the crude max-to-sum step into constants, or tightening it under
additional tail control on the coupling error). Substituting into the
decomposition gives the bound stated in Theorem~3.
Theorem~3 is stated for a single planning call at a fixed context. In receding-horizon execution, we replan at each time step with a new context ct = (st, g). Under standard stability assumptions on the environment and the replanning loop, we may apply Theorem~3 conditionally at each time step and sum (with discount) to obtain a cumulative return degradation controlled by ∑t ≥ 0γt(LRε(ct) + ΔN(ct)). Consequently, once we insert a bound on ε(c) from conditional generation analysis, planning performance degrades gracefully with both structured missingness (through ε(c)) and finite sampling (through ΔN(c)).
The preceding bounds are necessarily conditional on structure: when μ𝒟 assigns zero mass to a region of contexts ℛ that is encountered at test time, no purely offline procedure can in general determine what should be done on ℛ. We formalize this via a standard indistinguishability argument (Le Cam two-point method) specialized to goal-conditioned planning: we construct two conditional trajectory laws that coincide on supp(μ𝒟), yet induce incompatible optimal actions for some c ∈ ℛ. Consequently, any offline learner—whether model-based, behavior-cloning, diffusion-based, or otherwise—must fail on at least one of the two worlds.
We outline one concrete instantiation. Consider a deterministic system with a ``junction’’ state sJ that is reached only from contexts in ℛ (equivalently, the dataset never contains trajectories whose initial condition is such that the induced rollouts enter the junction). From sJ there are two actions, aL and aR, leading to two disjoint corridors of length H′ terminating in terminal states sgoal and strap. Define a sparse reward that gives 1 if and only if sgoal is reached by time H, and 0 otherwise. Now define two environments (or conditional trajectory laws) that agree everywhere on supp(μ𝒟), but differ at the junction: in world~1, aL leads to sgoal and aR leads to strap; in world~2 the mapping is swapped. Because μ𝒟(ℛ) = 0, the offline data never contains transitions from sJ (nor any other distinguishing evidence), so the induced joint distribution over 𝒟 is identical under world~1 and world~2. However, optimal behavior at c† (a context that reaches sJ) differs: one must choose aL in world~1 and aR in world~2.
Let ℙ1 and ℙ2 denote the distributions over
datasets 𝒟 induced by p1* and p2* together
with μ𝒟. By
construction, ℙ1 = ℙ2. Fix any
algorithm Alg producing π = Alg(𝒟). Since 𝒟 has the same law under both hypotheses, the
random output π has the same
law under ℙ1 and ℙ2. Yet the optimal success events
at c† are disjoint:
if π selects (or induces)
action aL at the
junction, it succeeds in world~1 and fails in world~2; if it selects
aR, it succeeds in
world~2 and fails in world~1. Thus,
𝔼𝒟 ∼ ℙ1[Succ(π; c†)] + 𝔼𝒟 ∼ ℙ2[Succ(π; c†)] ≤ 1,
which implies the stated $\le
\tfrac{1}{2}$ bound for at least one of the two worlds. This is
the simplest form of an indistinguishability lower bound: any guarantee
that would exceed 1/2 uniformly over
both hypotheses would contradict ℙ1 = ℙ2.
Theorem~4 isolates what smoothness/manifold assumptions buy us: they rule out such arbitrary discontinuities on ℛ by requiring that p*(⋅ ∣ c) vary regularly with c, enabling extrapolation from nearby covered contexts. In particular, the additional term proportional to δ(c)^α in Theorem~1 (and the matching necessity in Theorem~2) should be viewed as the minimal price of leaving the support of μ𝒟 even when the problem is otherwise well behaved. Conversely, without any continuity control, no method—including those that reconstruct c at each diffusion step—can infer the correct behavior at c†, because the training signal contains no information about the relevant transitions.
A practical corollary is that success in missing regions demands either (i) additional structure that links ℛ to observed contexts (smoothness, latent manifold proximity, symmetries, known dynamics), or (ii) an explicit mechanism to detect when the current context is out-of-support and to trigger a safe fallback. In our setting, one natural diagnostic is a distance-to-data estimate for c (or for the trunk representation), which can be used to abstain, increase conservatism, enlarge uncertainty bounds, or switch to a reactive controller. Such mechanisms do not contradict the impossibility result; rather, they acknowledge it by declining to commit to an unidentifiable choice when μ𝒟(c) is effectively zero.
Even if p*(⋅ ∣ c) were known, best-of-N MPC can fail in sparse-reward problems when successful trajectories have tiny probability under the generator; in that case N must scale as Ω(1/p) to observe a success with constant probability. This phenomenon is orthogonal to offline identifiability: it persists under perfect coverage and perfect modeling, and motivates guidance, shaping, or hierarchical decompositions when H is large.
We record the basic resource requirements of RAP and note several implementation choices that materially affect the cost–quality tradeoff in offline planning.
Each stochastic gradient step draws a minibatch of size B, samples t ∼ Unif({1, …, T}) and
ε ∼ 𝒩(0, I), forms
$x_t=\sqrt{\bar\alpha_t}x_0+\sqrt{1-\bar\alpha_t}\varepsilon$,
and evaluates the trunk and two heads. Denoting by Ctrunk the cost of a
forward/backward pass through fϕ(⋅) and by
Cθ and Cψ the
corresponding costs for the diffusion and reconstruction heads, the per-step cost is
O (B ⋅ Ctrunk + B ⋅ (Cθ + Cψ)),
and is thus a constant-factor overhead relative to a baseline
conditional diffusion model (which omits fψ and the
reconstruction term). In practice Cψ is small (a
shallow MLP) and the dominant cost is the trunk. Since we sample only
one diffusion time index t per
datapoint per iteration (as in standard score matching), training cost
is independent of T up to a
negligible constant for schedule lookups; T matters primarily at
inference.
Training memory consists of model parameters plus activation storage for backpropagation. The batch storage for trajectories is O(B ⋅ dτ), and the additional head increases activation memory by O(B ⋅ dz) where dz is the trunk feature dimension. Because reconstruction is computed from the same z used by the diffusion head, we do not require any additional trunk activations beyond those already needed for score matching. When trajectories are stored as length-H windows extracted from longer episodes, it is convenient to precompute the vectorization map x0 = vec(τ) and to cache normalization statistics (per-dimension mean/variance) so that the diffusion noise scale is commensurate across state and action channels.
At test time, the per-action computational cost is dominated by
denoising N candidate
trajectories for T reverse
steps. Writing Cnet
for the cost of one network evaluation (trunk plus diffusion head), the
sampling cost is
O (N ⋅ T ⋅ Cnet).
The scoring cost depends on the reward evaluation. When the reward is a
known functional $R(\tau,c)=\sum_{t=0}^{H-1}
r(s_t,a_t,g)$, a direct evaluation is O(N ⋅ H) up to the
cost of computing r(⋅), which
is typically negligible compared to Cnet for image-free state
spaces. When reward evaluation requires additional checks
(e.g. collision or constraint penalties computed from st), the scoring
term may be comparable to denoising; in that case one should fuse reward
computation with unvectorization to avoid repeated passes over
memory.
A practical point is that MPC is typically receding-horizon: only the first action of the maximizing sample is executed, after which the context is updated and the optimization is rerun. Consequently, even moderate T and N can suffice in environments where frequent replanning corrects trajectory imperfections; conversely, long-horizon sparse-reward tasks may require either larger N (to hit rare successes) or guided sampling, irrespective of modeling error.
Naively storing all N denoised trajectories requires O(N ⋅ dτ) memory, which can be large when H is large. This can be reduced by streaming: we maintain the current xt(j) for all j during denoising (still O(N ⋅ dτ)), but after producing x0(j) we can immediately compute J(j) = R(τ(j), c) and retain only the current best candidate (or top-K) rather than all trajectories. This reduces post-sampling storage to O(dτ) while leaving sampling memory unchanged; further reductions are possible by sampling candidates sequentially (one j at a time), at the expense of losing vectorization/batching benefits on modern accelerators.
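A sketch of this streaming selection, which generates candidates one at a time and keeps only the running best (the one-sample generator and scoring function are placeholders):

```python
def streaming_best_of_n(sample_one, score, N: int):
    """Sequential best-of-N: O(d_tau) post-sampling storage instead of O(N * d_tau).

    sample_one: callable running the full reverse diffusion for one candidate trajectory.
    score: callable mapping a decoded trajectory to its return J = R(tau, c).
    """
    best_tau, best_J = None, float("-inf")
    for _ in range(N):
        tau = sample_one()
        J = score(tau)
        if J > best_J:
            best_tau, best_J = tau, J
    return best_tau, best_J
```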
The dependence on T motivates using accelerated sampling schemes. Deterministic or semi-deterministic samplers (e.g. DDIM-style updates) can reduce variance and permit using smaller T without catastrophic degradation in sample quality. Higher-order ODE/SDE solvers (e.g. DPM-Solver-type methods) often allow T in the single digits while maintaining perceptual fidelity in standard diffusion settings; here the relevant metric is not visual quality but the induced distribution over feasible, high-reward trajectories, so the appropriate choice of solver is empirical. A further reduction is possible through distillation of the reverse process into a small number of steps (even T ≈ 1), which directly lowers MPC cost from O(NT) to O(N). Since our theory treats T as an algorithmic parameter and does not rely on asymptotics in T, such accelerations are fully compatible with the framework, though they alter constants through solver bias.
The reconstruction head fψ may be tasked with predicting c = (s0, g) directly, or with predicting its components separately (e.g. ŝ0 and ĝ) to allow different scalings and losses. We generally normalize s0 and g to comparable magnitudes and set λ so that ∥ε − ε̂∥22 and ∥c − ĉ∥22 are of the same order early in training; otherwise the trunk may ignore one objective. Conditioning c and time t into the trunk can be realized by concatenation followed by an MLP, FiLM-style modulation, or cross-attention; the reconstruction term is intended to prevent the trunk from becoming condition-agnostic under any of these mechanisms.
To isolate the effect of reconstruction, we recommend the following
ablation axes.
(i) set λ = 0 and remove fψ, yielding a
standard conditional diffusion planner trained purely by score
matching.
(ii) reconstruct c from z = fϕ(xt, c, t)
at every t (the default),
versus reconstruct only at small noise levels (e.g. t ≤ t0) or only
at a single t sampled from a
restricted set; this tests whether information retention is needed
across the entire noise schedule or only near x0.
(iii) reconstruct the full context (s0, g) versus
reconstruct only g (or only
s0). In
goal-conditioned settings, reconstructing only g probes whether the improvement
arises from preserving goal information specifically or from a broader
regularization effect.
(iv) replace ∥c − ĉ∥2 with a
loss that predicts selected trajectory states (e.g. endpoints (s0, sH)
or a subset {stℓ})
from z. This tests whether the
benefit comes from enforcing retention of conditioning variables or from
encouraging the representation to encode semantically meaningful
trajectory features. We emphasize that such variants change the
inductive bias (and may implicitly regularize dynamics consistency), but
preserve the central principle: auxiliary prediction prevents the trunk
from discarding information that the planner later needs.
We recommend evaluating RAP in goal-conditioned continuous-control benchmarks where structured missingness can be induced in a controlled manner and where success is sensitive to the ability to condition on (s0, g) outside the empirical context support. Our primary emphasis is on Maze2D-style navigation and AntMaze-style long-horizon planning, since both admit natural ``holes’’ (unvisited corridors, removed goal sets, or disconnected start–goal pairs) while retaining a well-defined test-time reward.
Let the underlying environment be a planar maze with state s = (x, y, ẋ, ẏ) (or position-only) and goal g = (xg, yg), with sparse terminal reward $\mathbbm{1}\{\|s_H-g\|\le \epsilon\}$ or shaped reward −∥st − g∥. From a standard offline dataset, we construct missingness variants by editing the context support (holding out ranges of start–goal pairs) and/or the state-visitation support (removing trajectories that traverse designated regions). For each protocol, we recommend reporting performance as a function
of the distance-to-coverage proxy δ(c), approximated by the kNN distance of c = (s0, g)
to the set of dataset contexts, together with separate analyses for
``in-support'' and ``out-of-support'' test contexts.
AntMaze is a high-dimensional setting where diffusion planners are
often competitive, but where missingness can be made concrete by
removing narrow passages or goals in a specific region. We recommend two
controlled edits:
(i) delete all trajectories that enter a chosen corridor or doorway;
test on tasks that require that corridor to reach the goal.
(ii) delete all trajectories whose goals lie in a designated region of
the maze; test on goals in that region while keeping start states
broad.
To reduce confounds, we suggest keeping the behavior policy and
environment identical and only modifying dataset filtering; in
particular, the dynamics remain fixed, so any performance gap is
attributable to offline generalization rather than domain shift.
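A sketch of the kNN distance-to-coverage proxy recommended above; the number of neighbors and the normalization of context coordinates are experiment-level choices, and the brute-force distance computation assumes modest dataset sizes.

```python
import numpy as np


def knn_delta(test_contexts: np.ndarray, dataset_contexts: np.ndarray, k: int = 5) -> np.ndarray:
    """Approximate delta(c) = dist(c, supp(mu_D)) by the mean distance to the k nearest dataset contexts.

    test_contexts: (n_test, d_c) array of c = (s0, g); dataset_contexts: (n_data, d_c).
    """
    diffs = test_contexts[:, None, :] - dataset_contexts[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)   # (n_test, n_data) pairwise distances
    knn = np.sort(dists, axis=1)[:, :k]      # k smallest distances per test context
    return knn.mean(axis=1)                  # one delta(c) proxy per test context
```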
A minimal baseline set should include (a) a standard conditional trajectory diffusion model trained with score matching only (RAP with λ = 0), (b) an implicit behavior-cloning policy with goal conditioning (e.g. goal-conditioned BC), and (c) an offline RL method appropriate for the benchmark if available. For RAP-specific ablations, we recommend varying: (i) reconstruction on/off; (ii) reconstruction enforced for all t versus only small-noise t; (iii) reconstruct g only versus reconstruct (s0, g); and (iv) replace context reconstruction with predicting a subset of trajectory states (endpoint or waypoints). To isolate representation effects, all variants should share the same trunk architecture, diffusion schedule, and inference sampler.
Sparse-reward success is necessary but insufficient to diagnose why reconstruction helps. We therefore recommend supplementing it with conditional-fidelity diagnostics (e.g., the two-sample W2 estimates described earlier), computed on held-out contexts and stratified by δ(c). For each benchmark and missingness protocol, we recommend reporting: (i) success/return under μtest and separately on ℛ, (ii) the diagnostics above, and (iii) compute budgets (N, T, H) and wall-clock inference per action. Since RAP targets missing-context generalization, we regard stratified evaluation by δ(c) as essential: improvements confined to near-support contexts are qualitatively different from improvements that persist deeper into ℛ.
The operative effect of the reconstruction head is not to modify the reverse-time sampling rule at inference, but to constrain the learned representation z = fϕ(xt, c, t) so that c remains recoverable even when xt is heavily corrupted. In conditional diffusion, a frequent pathology in the offline setting is that the trunk learns to rely predominantly on correlations between xt and a subset of context coordinates that are well-covered by μ𝒟, while discarding those components of c that are predictive only in rare or missing regions. Imposing ∥c − fψ(z)∥2 for all t forces a form of conditional identifiability: among hypotheses that fit the score-matching objective comparably well on supp(μ𝒟), we bias toward those that retain conditioning information globally. Under the manifold-smoothness assumption, this bias is aligned with the desired extrapolation behavior, since p*(⋅ ∣ c) varies continuously along ℳc and thus the ``correct’’ conditioning signal is stable under small context perturbations.
A second regime where reconstruction helps is high-noise conditioning. For large t, the signal-to-noise ratio of xt is small, and an unconstrained conditional model may effectively revert to an unconditional generator (or a generator conditioned only on easy-to-learn context statistics), thereby producing samples that are not discriminative with respect to g or s0. Since MPC performance depends on drawing at least one high-reward candidate among τ̂(1 : N), this collapse can be catastrophic in sparse-reward tasks: the mass of goal-reaching trajectories becomes too small for any feasible N. Reconstruction loss applied uniformly in t is a direct way to prevent this failure mode by ensuring that z remains informative about c throughout the diffusion chain.
Reconstruction is not free regularization; it competes with the score-matching objective for representational capacity. If λ is too large relative to the diffusion loss scale, the trunk may allocate excessive capacity to encoding c at the expense of encoding the fine details of xt needed to predict ε. In the extreme, this yields a representation that is nearly a function of c alone, which can reduce sample fidelity and increase W2(p̂(⋅ ∣ c), p*(⋅ ∣ c)) even on in-support contexts. A more subtle harm is induced mode narrowing: if the reconstruction head is easiest to satisfy by learning a near-deterministic encoding of c into z, gradients may push fϕ toward features that correlate strongly with a dominant mode of p*(⋅ ∣ c), reducing conditional diversity. This can worsen best-of-N planning when multiple qualitatively different homotopy classes of trajectories exist (e.g. distinct corridors in a maze) and the optimal route depends on downstream stochasticity or constraints not well-modeled by the offline data.
Reconstruction can also be mis-specified. If c = (s0, g) contains nuisance coordinates irrelevant to planning (or coordinates that are noisy labels), forcing accurate reconstruction can encourage the trunk to represent precisely what we would prefer to ignore, thereby decreasing robustness. Conversely, if the true conditional law depends on additional latent context not included in c (e.g. unobserved environment parameters or partial observability), then strong reconstruction of the observed c does not resolve ambiguity and may give a misleading sense of ``conditioning correctness.’’ In such settings the correct remedy is to expand the context, infer a latent variable, or adopt a stochastic conditional objective, rather than to increase λ.
The reconstruction term can be viewed as a tractable surrogate for maximizing mutual information between c and z under the joint distribution induced by the forward diffusion. While we do not require an explicit information-theoretic formalism for the stated bounds, the proof intuition is consistent: small reconstruction error implies that distinct contexts map to distinguishable representations, and hence the conditional score predictor fθ(z) can, in principle, implement a family of reverse dynamics that varies with c. In the absence of such a constraint, the function class can contain degenerate solutions that fit the marginal of xt well but implement only weak conditional dependence, a phenomenon exacerbated by structured missingness.
When c is high-dimensional (e.g. images, language, or long goal descriptors), reconstructing c directly can be unnecessarily stringent. A natural variant is to learn an encoder eη(c) and reconstruct only a low-dimensional latent u = eη(c), replacing ∥c − ĉ∥2 by ∥u − û∥2 (or a cosine/contrastive loss). This retains the main mechanism—preventing the trunk from discarding conditioning—while allowing invariances and reducing the risk of overfitting to irrelevant context details. Another variant is to reconstruct only the components of c that matter for planning (e.g. g alone), which is compatible with our analysis provided the smoothness condition is stated in the corresponding reduced context metric.
The reconstruction head yields an auxiliary signal that can be repurposed at test time: the error ∥c − fψ(fϕ(xt, c, t))∥ (or its expectation across t) can serve as a proxy for being far from supp(μ𝒟). While not a calibrated distance estimator in general, it can be used for conservative planning heuristics: increase N or T when the proxy is large, fall back to a safer baseline policy, or reject samples whose internal representations appear inconsistent with c. Such gating is consistent with the bound structure in Thm.~1, in the sense that deterioration deep in ℛ is expected and should be detected rather than ignored.
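A sketch of this gating heuristic: the reconstruction error is accumulated over (xt, t) pairs encountered while running the reverse chain for the queried context, and compared against a threshold; the threshold and the fallback policy are deployment choices outside the scope of the analysis.

```python
import torch


@torch.no_grad()
def recon_ood_score(model, c: torch.Tensor, probes) -> float:
    """Proxy for distance to supp(mu_D): reconstruction error of c averaged over recorded (x_t, t) probes."""
    errs = []
    for x_t, t in probes:                                  # (x_t, t) pairs saved during reverse sampling
        c_b = c.unsqueeze(0).expand(x_t.shape[0], -1)
        _, c_hat = model(x_t, c_b, t)
        errs.append(torch.norm(c_hat - c_b, dim=-1).mean().item())
    return sum(errs) / max(len(errs), 1)


# Example gating rule (threshold and fallback are hypothetical):
# if recon_ood_score(model, c, probes) > tau_ood:
#     action = fallback_controller(s, goal)   # abstain / switch to a conservative controller
# else:
#     action = rap_mpc_action(...)            # plan as usual, possibly with larger N or T
```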
RAP can be instantiated with pixel observations by letting s be an image (or a learned visual state) and letting c = (s0, g) be either two images (start and goal) or an image plus a goal embedding. The trunk then combines (i) a temporal model over xt (trajectory tokens) and (ii) a visual encoder for c. Reconstruction may be applied in the visual embedding space rather than pixel space, to avoid penalizing irrelevant appearance changes. The principal technical caution is that the effective dimension k in the manifold assumption should be interpreted as the intrinsic dimension of the visual factors, not the raw pixel dimension; otherwise the minimax rate becomes vacuous.
Our guarantees rely on smoothness/manifold proximity and thus do not apply to adversarial missingness or to environments where p*(⋅ ∣ c) changes discontinuously with c (e.g. bifurcations caused by hard contact dynamics or discrete task modes). Moreover, state-coverage holes that remove essential intermediate states can violate the premise that a nearby context suffices to infer feasible trajectories; in such cases, extrapolation from c alone is not information-theoretically justified, aligning with Thm.~4. Finally, RAP does not address the computational hardness of long-horizon sparse-reward planning: even with an accurate generator, the mass of successful trajectories may be too small for moderate N. These limitations motivate combining reconstruction with additional mechanisms such as hierarchical decomposition, reward guidance, or explicit feasibility models, while keeping the central requirement intact: the representation used by the diffusion model must remain genuinely conditional on (s0, g) throughout the denoising process.
We studied goal-conditioned planning from offline data in the presence of structured missingness, formalized by a region ℛ ⊂ 𝒞 of contexts that is absent under the empirical distribution μ𝒟 but has positive mass under μtest. In this regime, a conditional trajectory model is required not only to interpolate within the support of μ𝒟, but also to extrapolate in a controlled manner to contexts at nonzero distance from that support. Our central claim was that such extrapolation is possible only under explicit structure on the conditional law p*(⋅ ∣ c), and that the representation used by a conditional diffusion model must be constrained to actually retain the conditioning information in the high-noise part of the diffusion chain.
To this end we proposed Reconstruct-and-Plan (RAP), which augments a standard conditional diffusion trajectory model with an auxiliary head fψ that reconstructs c = (s0, g) (or selected components thereof) from the shared trunk representation z = fϕ(xt, c, t) at every diffusion step. The algorithm does not alter the reverse-time sampling rule at test time; instead, it biases learning toward conditional hypotheses in which the trunk does not collapse to an effectively unconditional representation when xt is heavily corrupted. This modification is compatible with conventional score-matching training and preserves the typical computational footprint of diffusion planners up to constant factors. At inference, RAP is deployed as a planner via MPC: we sample N candidate trajectories τ̂(1 : N) ∼ p̂(⋅ ∣ c) and execute the first action from the highest-reward sample under a known reward functional.
Our theoretical results connect this architectural constraint to distributional and planning guarantees. Under a manifold proximity hypothesis—contexts lying on or near a k-dimensional manifold ℳc on which p*(⋅ ∣ c) is α-Hölder in W2—we showed in Thm.~1 that the conditional generation error at a test context c decomposes into three interpretable terms: (i) a diffusion estimation term εD reflecting score-matching error, (ii) a representation/conditioning term controlled by the reconstruction error εR, and (iii) an unavoidable extrapolation term scaling as L δ(c)^α, where δ(c) is the distance from c to supp(μ𝒟). This bound explicitly separates what can be improved by model capacity and optimization from what is dictated by dataset coverage. Thm.~2 complements this by giving a minimax lower bound over 𝒫k, α, showing that dependence on intrinsic dimension k and smoothness α is information-theoretically sharp up to constants, and that a term of order Lδ^α cannot be eliminated when test contexts are separated from the dataset support.
We then translated conditional generation error into planning performance. Thm.~3 provides an additive suboptimality bound for best-of-N MPC selection: when the reward is Lipschitz in the vectorized trajectory, the expected gap between the optimal achievable return (under the true conditional distribution) and the return attained by MPC using samples from p̂(⋅ ∣ c) is controlled by O(LR ε(c)) plus a finite-sampling term ΔN. The qualitative implication is that improving the conditional generator in W2 directly improves planning, and the degradation in missing regions is precisely mediated by δ(c). This formalizes a phenomenon observed in offline diffusion planning: failures in out-of-support contexts are not merely ``distribution shift,’’ but can be quantified as a combination of estimation error and an irreducible extrapolation penalty dictated by the smoothness class.
Conversely, we made explicit the limits of offline planning in the absence of structural assumptions. Thm.~4 provides an indistinguishability-based impossibility: if p*(⋅ ∣ c) can differ arbitrarily on ℛ while agreeing on supp(μ𝒟), then no offline method can guarantee nontrivial success uniformly on ℛ. In other words, reconstruction (or any other regularizer) is not a substitute for assumptions; it is a mechanism for aligning the learned representation with the assumptions under which generalization is possible. This separation is useful in practice: when empirical performance deteriorates in missing regions, the correct diagnosis may be that the smoothness/manifold hypothesis is violated, rather than that the learning algorithm is inadequately optimized.
Several practical lessons follow from the analysis. First, tuning λ is not merely a regularization heuristic but a means of trading diffusion estimation against conditional identifiability: overly small λ permits conditional collapse, while overly large λ may consume capacity needed for accurate score prediction. Second, the reconstruction head provides an auxiliary signal that can be repurposed for conservative deployment, e.g. as a proxy for being far from supp(μ𝒟), suggesting adaptive inference budgets in (N, T) or fallback strategies. Third, since MPC success in sparse reward can be limited by the mass of successful trajectories even under an accurate model, one should view RAP as complementary to guidance, hierarchical decomposition, or alternative search procedures that mitigate needle-in-haystack effects without sacrificing conditional fidelity.
We conclude with open problems that are suggested by, but not resolved within, our framework. A first direction is to formalize reconstruction as an information constraint (or a conditional sufficiency constraint) and to relate εR to mutual information between c and z under the forward diffusion, potentially yielding sharper or more interpretable constants. A second direction is to extend the smoothness assumption to settings with discrete task modes or contact-induced discontinuities, where the appropriate notion of ``nearby context’’ may be non-Euclidean or require latent mode variables. A third direction is computational: bridging the gap between the statistical rates captured by W2 bounds and the algorithmic efficiency of sampling-based MPC in long horizons. Establishing guarantees for guided or amortized samplers, or for planners that combine diffusion proposals with explicit feasibility checks, would sharpen the connection between conditional generation fidelity and end-to-end control performance in the most challenging offline regimes.