
Fast Approximation of DIDM Mover’s Distance via Entropic Multiscale Optimal Transport


Table of Contents

  1. Introduction: motivation from GNN stability metrics; bottleneck of exact δDIDML; contributions (algorithm + guarantees + empirical validation).
  2. Background and Notation: graph-signals, computation IDMs/DIDMs, exact δDIDML recursion; unbalanced OT; entropic regularization; multiscale clustering primitives.
  3. Problem Formulation: define approximation goals (additive/multiplicative), parameter regimes (dense vs sparse), and evaluation tasks (stability auditing, retrieval, OOD).
  4. Baseline Exact Computation and Bottlenecks: restate the O(LN^5 log N) method; identify the N^2 OT subproblems per layer and their size as the core cost driver.
  5. Entropic Unbalanced OT as a Drop-in Replacement: define entropic UOT objective; show computability via generalized Sinkhorn; state bias bounds vs exact UOT.
  6. Multiscale Coarsening of Neighborhood Measures: define coarsening operator 𝒞t mapping node-wise neighbor measures to k-bin histograms; show how to build costs between bins from previous-layer costs; discuss choices (k-means on features, WL-style hashing, random projections).
  7. Main Algorithm: compute δ̂DIDML via recursive compressed entropic OT; include implementation details for sparse adjacency and batching; discuss optional low-rank/landmark extension.
  8. Theory: per-layer error propagation lemma; overall approximation theorem; runtime and memory bounds; (optional) topology/weak* approximation under smoothing.
  9. Empirical Evaluation Plan: runtime scaling; correlation with MPNN output perturbations; retrieval and OOD detection; ablations on (k, η, T); comparisons to exact on small graphs and to other metrics on medium graphs.
  10. Discussion and Limitations: when coarsening fails; dependence on L; relation to aggregation choice; open problems (tight lower bounds, adaptive depth, transformer kernels).

Content

1. Introduction: motivation from GNN stability metrics; bottleneck of exact δDIDML; contributions (algorithm + guarantees + empirical validation).

Message passing graph neural networks (GNNs) are routinely deployed in regimes where the input graph is not fixed but subject to perturbations: edges may be missing, spurious, or reweighted; node features may be noisy; and the number of observed nodes may vary across samples and time. In such settings, a central methodological question is how to measure the discrepancy between two graph-signals (G, f) and (H, g) in a manner that is faithful to the computation performed by a depth-L message passing architecture. A discrepancy notion aligned with the network dynamics serves at least three purposes. First, it yields a stability metric for auditing robustness with respect to structural and feature perturbations. Second, it enables similarity search and retrieval of graphs in a geometry that reflects the model’s inductive bias. Third, it provides a principled way to compare learned representations across architectures and depths, by making explicit which local-to-global aggregations are deemed close.

The distance δDIDML we consider is defined by transporting hierarchical neighborhood measures induced by normalized-sum aggregation, iterated up to depth L. Informally, it compares nodes across two graphs by a base attribute cost, and then recursively augments this cost by an optimal transport discrepancy between the nodes’ (multi)set of neighbors, where neighbor-to-neighbor costs are themselves computed from the previous depth. This recursion naturally lives in a space of distributional representations (computation IDMs/DIDMs) and yields a global graph-to-graph distance as an outer unbalanced optimal transport (OT) problem. The use of OT is not incidental: it accommodates graphs with different sizes and degree distributions by permitting creation and destruction of mass at a controlled penalty, which is required when neighborhoods do not match in cardinality or when the induced measures are subprobabilities.

Despite its conceptual alignment with message passing, the exact computation of δDIDML is a bottleneck. At each depth t, and for each cross-graph node pair (i, j) ∈ V(G) × V(H), one must solve a transport problem between two neighborhood measures whose ground cost is the depth-(t − 1) node-pair metric. Even when the neighborhood measures are sparse, the number of node pairs is nGnH, and the recursion requires propagating costs across depths. In the worst case, exact OT reduces to min-cost flow computations, and the nested structure of the recursion amplifies runtime and memory demands. Consequently, while δDIDML is an attractive target for stability analysis, it is not directly usable as a drop-in primitive inside larger pipelines (e.g., retrieval, auditing across many graph pairs, or hyperparameter sweeps) unless we replace the exact OT calls by approximations with controlled error.

We address this gap by designing an approximation scheme that preserves the recursive structure while reducing the per-layer OT burden. The guiding principle is that, for the purpose of evaluating message-passing-induced discrepancies, it is often unnecessary to resolve fine-grained neighbor identities at every depth; rather, it suffices to compare neighbor distributions under a multiscale partition that is stable with respect to the previous-layer costs. We therefore compress each neighborhood measure into a k-bin histogram over clusters and compute OT only at the cluster level, while using entropic regularization to obtain fast Sinkhorn iterations. Entropic unbalanced OT yields a differentiable, numerically stable objective and admits efficient parallelization, at the cost of a controlled smoothing bias that can be traded against runtime. The resulting procedure is a recursion: at depth t we (i) build a k × k cluster-to-cluster cost matrix from the current cross-graph costs, (ii) encode each node’s neighborhood as a k-dimensional histogram, and (iii) evaluate a small entropic unbalanced OT problem for each node pair to obtain the depth-t contribution to the updated ground cost.

Our main contributions are as follows. First, we design an approximation algorithm that preserves the recursive DIDM structure while replacing each exact OT call by a compressed entropic unbalanced OT evaluation, combining multiscale coarsening of neighborhood measures with generalized Sinkhorn iterations (Sections 5–7). Second, we give guarantees: a per-layer error propagation lemma and an end-to-end additive approximation bound in terms of the parameters (k, η, T, L), together with runtime and memory bounds (Section 8). Third, we lay out an empirical validation program covering runtime scaling, correlation with MPNN output perturbations, retrieval, and OOD detection, with ablations over (k, η, T) (Section 9).

Finally, we note that entropic regularization induces a smoothed variant of the distance that is itself of independent interest: fixing η > 0 yields a stable, computationally convenient geometry on computation DIDMs that converges to the unsmoothed one as η → 0. This viewpoint clarifies the role of regularization as an explicit modeling choice rather than a purely numerical trick, and it provides a principled axis along which one can interpolate between fidelity to exact OT and computational tractability.


2. Background and Notation: graph-signals, computation IDMs/DIDMs, exact δDIDML recursion; unbalanced OT; entropic regularization; multiscale clustering primitives.

We collect the definitions needed to state the exact depth-L distance δDIDML and the approximation primitives used later. Throughout, (G, f) and (H, g) denote attributed graphs with node sets V(G), V(H) and node attributes f : V(G) → 𝒳 and g : V(H) → 𝒳, where 𝒳 ⊂ ℝd is compact. We write nG = |V(G)|, nH = |V(H)|, mG = |E(G)|, mH = |E(H)|, and N = max (nG, nH).

We view a graph-signal (G, f) as a finite relational structure equipped with node features. Let NG(i) denote the (possibly weighted) neighborhood of a node i ∈ V(G). In the normalized-sum message passing regime, the contribution of i to depth-t computations depends on a normalized empirical measure over its neighbors. Concretely, for each i ∈ V(G) we define a subprobability measure on V(G) by
$$ \mu^{G}_i \;=\; \frac{1}{|V(G)|}\sum_{u\in N_G(i)} w_{iu}\,\delta_u, $$
where wiu ≥ 0 are edge weights (default wiu = 1 in the unweighted case) and δu denotes the Dirac mass at u. The factor 1/|V(G)| yields total mass at most 1 and makes the construction compatible with varying graph sizes. Analogously, for j ∈ V(H) we obtain μjH.
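As a concrete illustration, the following minimal numpy sketch builds these neighbor measures from an edge list; the dense (n, n) array layout and the function name are illustrative choices of ours, not part of the definition.

```python
import numpy as np

def neighbor_measures(n, edges, weights=None):
    """Return an (n, n) array whose i-th row is the subprobability measure
    mu_i = (1/n) * sum_{u in N(i)} w_iu * delta_u from the definition above."""
    mu = np.zeros((n, n))
    for idx, (i, u) in enumerate(edges):
        w = 1.0 if weights is None else weights[idx]
        mu[i, u] += w / n                      # mass w_iu / |V(G)| on neighbor u
    return mu

# Example: a 4-node path graph (undirected, so both edge directions are listed).
edges = [(0, 1), (1, 0), (1, 2), (2, 1), (2, 3), (3, 2)]
mu_G = neighbor_measures(4, edges)
assert np.all(mu_G.sum(axis=1) <= 1.0 + 1e-12)  # each row is a subprobability
```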

The computation at depth t induces a cross-graph cost between nodes of G and H that we interpret as an (IDM) ground cost. We define the base (order-0) cost by the attribute discrepancy
$$ d^{0}_{\mathrm{IDM}}(i,j) \;=\; \|f(i)-g(j)\|_2, \qquad i\in V(G),\ j\in V(H). $$
For t ≥ 1, the depth-t cost augments the base discrepancy by a transport term between the neighbor measures, where the ground metric for transporting neighbor mass is the previous-depth cost. Writing OTd(⋅, ⋅) for the unbalanced OT objective used in the source with ground metric d, we set
$$ d^{t}_{\mathrm{IDM}}(i,j) \;=\; d^{0}_{\mathrm{IDM}}(i,j) \;+\; \mathrm{OT}_{d^{t-1}_{\mathrm{IDM}}}\big(\mu^{G}_i,\mu^{H}_j\big). $$
This recursion is well-defined because dIDMt − 1 is a finite matrix on V(G) × V(H) and the neighbor measures have finite support. The compactness of 𝒳 ensures dIDM0 is bounded, and hence, under the standing assumptions of the source unbalanced OT formulation, each dIDMt remains finite.

At depth L, the node-level costs dIDML serve as the ground metric in an unbalanced OT problem that transports mass between the node populations of G and H. Let αG and αH denote the node measures, taken for definiteness as uniform subprobabilities,
$$ \alpha_G \;=\; \frac{1}{|V(G)|}\sum_{i\in V(G)}\delta_i,\qquad \alpha_H \;=\; \frac{1}{|V(H)|}\sum_{j\in V(H)}\delta_j. $$
The depth-L DIDM Mover’s Distance is then
$$ \delta^{L}_{\mathrm{DIDM}}\big((G,f),(H,g)\big) \;=\; \mathrm{OT}_{d^{L}_{\mathrm{IDM}}}\big(\alpha_G,\alpha_H\big). $$
Thus δDIDML is a two-level hierarchy: (i) an inner recursion that compares nodes by transporting their normalized neighborhoods using previous-depth costs, and (ii) an outer transport that aggregates node-to-node costs into a graph-to-graph discrepancy. The terminology ``computation IDM/DIDM’’ reflects that the objects being compared at each stage are measures arising from the computation graph of normalized-sum message passing.

We require unbalanced OT because the measures μiG and μjH (and likewise αG, αH) may have different total masses, reflecting differing degrees and differing graph sizes. Given nonnegative vectors a ∈ ℝ+p, b ∈ ℝ+q and a ground cost matrix C ∈ ℝ+p × q, a standard unbalanced OT instance takes the form
$$ \mathrm{OT}_{C}(a,b) \;=\; \min_{\gamma\in\mathbb{R}_+^{p\times q}} \;\langle C,\gamma\rangle \;+\;\lambda\,D(\gamma\mathbf{1}\,\|\,a)\;+\;\lambda\,D(\gamma^{\top}\mathbf{1}\,\|\,b), $$
where D(⋅∥⋅) is a divergence (e.g. a Kullback–Leibler divergence) and λ > 0 controls the penalty for creating/destroying mass. In our notation, OTd(μ, ν) denotes the corresponding value when C is induced by a ground metric d on the supports of μ, ν. We will use only the two structural properties needed for approximation: (a) OT is monotone and Lipschitz in the ground costs under entrywise ∥ ⋅ ∥∞ perturbations, and (b) it is stable under mild perturbations of the marginals, both of which are standard in this setting.

To accelerate repeated OT evaluations, we replace OT by an entropic-regularized counterpart. For η > 0, we write UOTC, η(a, b) for the entropic unbalanced OT value obtained by adding a negative-entropy term to the primal objective,
$$ \mathrm{UOT}_{C,\eta}(a,b) \;=\; \min_{\gamma\ge 0}\;\langle C,\gamma\rangle \;+\;\lambda\,D(\gamma\mathbf{1}\,\|\,a)\;+\;\lambda\,D(\gamma^{\top}\mathbf{1}\,\|\,b)\;+\;\eta\sum_{u,v}\gamma_{uv}\big(\log\gamma_{uv}-1\big), $$
with the convention 0log 0 := 0. This regularization yields a strictly convex objective and admits fast scaling algorithms. In practice, we approximate UOTC, η(a, b) using T Sinkhorn iterations; we will explicitly track both the smoothing bias (controlled by η) and the finite-iteration error (controlled by T) when we propagate approximation guarantees through the recursion.

The remaining bottleneck is that, at depth t, one must evaluate a neighbor-to-neighbor OT term for each pair (i, j) ∈ V(G) × V(H), and the supports of μiG, μjH are subsets of V(G), V(H). To reduce the effective support size, we introduce at each layer t partitions (clusterings) ΠtG of V(G) and ΠtH of V(H) into k clusters. The associated coarsening operator 𝒞t maps a neighbor measure to a k-dimensional histogram by aggregating mass per cluster:
$$ h^{G}_t(i)[a] \;=\; \mu^{G}_i\big(\{u\in V(G):\Pi^{G}_t(u)=a\}\big), \qquad a\in\{1,\dots,k\}, $$
and similarly htH(j). Given a cross-graph cost matrix C approximating the previous-layer costs on V(G) × V(H), we form a compressed k × k ground cost matrix Mt by averaging C across cluster pairs,
$$ M_t[a,b] \;=\; \operatorname{Avg}_{\,u:\,\Pi^{G}_t(u)=a,\ v:\,\Pi^{H}_t(v)=b}\; C[u,v]. $$
The compressed OT subproblem UOTMt, η(htG(i), htH(j)) then serves as a proxy for transporting the original neighborhood measures. The quality of this proxy is governed by the within-cluster diameters (radii) of ΠtG, ΠtH as measured by the previous-layer costs, a fact we will exploit when we bound the coarsening error layer-by-layer.

These definitions fully specify the exact recursion for δDIDML and the two approximation mechanisms—entropic smoothing and multiscale histogram compression—that we will combine in the algorithmic construction and guarantees that follow.


3. Problem Formulation: define approximation goals (additive/multiplicative), parameter regimes (dense vs sparse), and evaluation tasks (stability auditing, retrieval, OOD).

Our objective is to approximate the depth-L DIDM Mover’s Distance δDIDML((G, f), (H, g)) between two attributed graphs under the recursive definition given in the previous section, while controlling both statistical distortion (approximation error relative to the exact, non-entropic, non-compressed objective) and computational cost. We emphasize that the target is a discrepancy between graphs, but that any algorithm which follows the recursion must implicitly approximate a hierarchy of intermediate node-to-node ground costs.

Given an accuracy parameter ε > 0, we seek an output
$$ \widehat\delta^{L}_{\mathrm{DIDM}}\big((G,f),(H,g)\big) $$
computed from (G, f), (H, g) and algorithmic parameters (in our case (k, η, T)) such that an end-to-end additive guarantee holds:
$$ \Big|\widehat\delta^{L}_{\mathrm{DIDM}}\big((G,f),(H,g)\big) \;-\; \delta^{L}_{\mathrm{DIDM}}\big((G,f),(H,g)\big)\Big| \;\le\; \varepsilon. $$
We formulate the goal additively for two reasons. First, δDIDML may be arbitrarily small when graphs are nearly isomorphic in attributes and neighborhoods, so multiplicative approximation cannot hold uniformly. Second, additive control aligns with the natural stability properties of OT objectives with respect to perturbations of ground costs and marginals (which we will leverage when propagating error through depth). When a multiplicative criterion is nevertheless desired (e.g. ranking-only applications), we may interpret our guarantee as a relative bound on instances satisfying a separation condition δDIDML ≥ τ > 0, in which case
$$ \frac{\big|\widehat\delta^{L}_{\mathrm{DIDM}}-\delta^{L}_{\mathrm{DIDM}}\big|}{\delta^{L}_{\mathrm{DIDM}}}\;\le\;\frac{\varepsilon}{\tau}. $$
This is a corollary rather than a primary objective, and it requires an application-specific lower bound τ.

Our approximation has three explicit knobs. The parameter k controls the support size of compressed neighborhood measures (histogram bin count), η > 0 controls entropic smoothing, and T is the number of iterations in the scaling (Sinkhorn-type) solver used to approximate entropic unbalanced OT values. The algorithm will instantiate these parameters either by direct user choice or by a rule depending on (ε, L) and coarse bounds on cost magnitudes. At a conceptual level, these parameters correspond to three error sources:
(coarsening error from k) + (entropic bias from η) + (optimization error from T).
The central formulation requirement is that these contributions be budgeted across depths so that the sum of per-layer distortions is at most ε. Our later analysis will enforce a layerwise allocation (e.g. each term at most ε/L) in order to obtain a clean end-to-end bound.

We distinguish two input-access regimes, since the bottlenecks depend strongly on how the graph is presented.

Dense regime: the input may include adjacency matrices (or otherwise permit O(1) adjacency queries) so that mG = Θ(nG^2) and mH = Θ(nH^2) in the worst case. In this regime the input size is already quadratic, and our computational goal is therefore near-quadratic time in the dominant size parameter N = max (nG, nH), up to factors depending on (L, k, T) and polylogarithms. Memory is also a limiting factor: any method that explicitly materializes a cross-graph cost matrix requires Θ(nGnH) space, which is acceptable for moderate N but prohibitive for large-scale retrieval unless block processing is used.

Sparse regime: the input is presented via adjacency lists with mG ≪ nG^2 and mH ≪ nH^2. Here, neighborhood aggregation can be streamed in O(mG + mH) time per depth, and one would like the overall runtime to scale nearly linearly in the number of edges. However, the definition of δDIDML couples all pairs of nodes across the two graphs through OT, so an nGnH interaction cost is inherent unless additional compression is introduced. Our formulation therefore permits two levels of ambition: (i) compute δ̂DIDML exactly with respect to the cross-graph pairwise structure (still requiring nGnH interactions), or (ii) compute a distance for auditing and retrieval using landmarking or low-rank sketches that avoid forming all nGnH pairs. The latter sacrifices exactness of the full cross-graph recursion but can preserve the qualitative behavior needed for screening tasks.

We frame the approximation problem in terms of three concrete downstream uses, each imposing slightly different requirements.

Stability auditing: given a graph-signal (G, f) and a perturbed version (G′, f′) (edge flips, rewiring, or attribute noise), we compute δ̂DIDML((G, f), (G′, f′)) to quantify sensitivity of the induced computation DIDMs. In this setting, an additive ε guarantee is operationally meaningful: if the measured discrepancy is below a tolerance ρ by a margin larger than ε, we may certify stability at scale ρ up to the approximation slack. Moreover, because auditing often requires many comparisons against perturbed instances, it benefits from regimes where the method can reuse intermediate structures (e.g. coarsenings) or process many targets in a batched manner.

Retrieval: given a database {(H_ℓ, g_ℓ)}_{ℓ=1}^{M} and a query (G, f), we seek to rank candidates by distance and return approximate nearest neighbors. Here, absolute accuracy is less important than the relative ordering among close candidates, but additive control remains useful: if the gap between the best and second-best matches exceeds 2ε, then an additive-ε approximation cannot change the identity of the minimizer. This task further motivates memory-aware computation: we prefer not to materialize all pairwise nGnH cost matrices simultaneously, and instead compute distances in blocks or via compressed summaries.

OOD detection: given a training collection 𝒯 of graph-signals, we define an OOD score for a test instance (G, f) by a statistic such as min(H, g) ∈ 𝒯 δDIDML((G, f), (H, g)) or a k-NN average. An additive approximation yields immediate control:
$$ \Big|\min_{(H,g)\in\mathcal{T}}\widehat\delta^{L}_{\mathrm{DIDM}}\big((G,f),(H,g)\big) \;-\; \min_{(H,g)\in\mathcal{T}}\delta^{L}_{\mathrm{DIDM}}\big((G,f),(H,g)\big)\Big| \;\le\; \varepsilon, $$
since the minimum of ε-accurate quantities is itself ε-accurate. Thus, threshold-based OOD rules can be made robust to approximation error by inflating thresholds by ε.

Given (G, f), (H, g) and parameters (L, ε), we seek an algorithm that outputs δ̂DIDML with additive error at most ε, while achieving runtime and memory that are competitive with the input size in the dense regime and with mG + mH (up to controlled interaction terms) in the sparse regime. The next sections will present (a) the exact baseline computation and why it is intractable beyond small graphs, and (b) an explicit compressed entropic recursion whose error and complexity can be quantified in terms of (k, η, T, L).


4. Baseline Exact Computation and Bottlenecks: restate the O(LN^5 log N) method; identify the N^2 OT subproblems per layer and their size as the core cost driver.

We first recall the direct, ``by-definition’’ evaluation of the depth-L DIDM Mover’s Distance δDIDML((G, f), (H, g)), and we make explicit why it is computationally prohibitive beyond small graphs. The key point is that the recursion defining the order-t IDM ground metric dIDMt induces, at each depth, a complete collection of unbalanced OT subproblems indexed by cross-graph node pairs. In the worst case, each such OT subproblem has support size Θ(N), where N = max (nG, nH), and therefore admits only cubic-time exact solvers up to polylogarithmic factors. Multiplying these two effects (Θ(N^2) subproblems per layer, each requiring on the order of N^3 time) yields an O(LN^5 log N) baseline.

It is convenient to express the exact computation in terms of a sequence of cross-graph cost matrices Ct ∈ ℝ+nG × nH, where Ct[i, j] is the exact value of the order-t IDM ground metric dIDMt(i, j) between node i ∈ V(G) and node j ∈ V(H). At depth 0, the ground costs depend only on node attributes:
C0[i, j] = ∥f(i) − g(j)∥2   (i ∈ V(G), j ∈ V(H)).
For t ≥ 1, the recursion defining dIDMt adds to C0 an OT discrepancy between the (normalized-sum) neighborhood measures induced by the previous layer. Concretely, writing μt, iG for the order-t neighborhood measure of node i in G and μt, jH for the analogous measure of node j in H, the exact update has the schematic form
Ct[i, j] = C0[i, j] + OTCt − 1 (μt, iG, μt, jH),
where OTCt − 1(⋅, ⋅) denotes the (non-entropic) unbalanced OT objective used in the source with ground cost matrix Ct − 1. Finally, the depth-L DIDM Mover’s Distance itself is obtained by an outer unbalanced OT over the node sets with ground cost CL:
δDIDML((G, f), (H, g)) = OTCL (unif(V(G)), unif(V(H))).
The above identities are not meant as alternative definitions, but rather as a bookkeeping device: an algorithm that follows the recursion must (implicitly or explicitly) be able to evaluate OTCt − 1(μt, iG, μt, jH) for all (i, j), because those values constitute the additive refinement from Ct − 1 to Ct.
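To make this bookkeeping concrete, the following sketch spells out the nested loops of the by-definition evaluation. The exact unbalanced OT solver is deliberately left as a placeholder callable (here named exact_uot), standing in for any min-cost-flow or LP-based routine; all names are illustrative.

```python
import numpy as np

def exact_didm_distance(C0, mu_G, mu_H, L, exact_uot):
    """By-definition evaluation of the depth-L recursion (illustration only).

    C0        : (nG, nH) attribute costs, C0[i, j] = ||f(i) - g(j)||_2
    mu_G      : (nG, nG) array; row i is the neighbor measure of node i in G
    mu_H      : (nH, nH) array; row j is the neighbor measure of node j in H
    exact_uot : placeholder callable (C, a, b) -> float for exact unbalanced OT
    """
    nG, nH = C0.shape
    C = C0.copy()                              # C_0
    for _ in range(L):
        D = np.empty_like(C)
        for i in range(nG):                    # nG * nH subproblems per layer,
            for j in range(nH):                # each of support size up to N
                D[i, j] = exact_uot(C, mu_G[i], mu_H[j])
        C = C0 + D                             # C_t = C_0 + neighbor-OT refinement
    # outer unbalanced OT between uniform node measures with ground cost C_L
    return exact_uot(C, np.full(nG, 1.0 / nG), np.full(nH, 1.0 / nH))
```

The two inner loops are exactly the nGnH OT calls per layer discussed next; in the dense regime each call is itself a transportation problem on supports of size Θ(N).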

At each depth t ∈ {1, …, L}, the recursion asks for one OT value per cross-graph node pair. Thus the number of OT calls is exactly
$$ n_G\,n_H \;=\; \Theta(N^{2}) $$
per layer (and an additional outer OT call at the end). This quadratic multiplicity is unavoidable for the literal evaluation of the matrix Ct: the updated ground costs Ct[i, j] are defined pointwise for each pair (i, j).

We emphasize that this nGnH factor is present regardless of sparsity of G and H. Sparsity affects the cost of the measures μt, iG, μt, jH (which can be streamed in O(mG + mH) time per layer), but it does not reduce the number of cross-graph pairs to be evaluated when one insists on computing the exact recursion on all pairs.

We next examine the size of a single OT subproblem OTCt − 1(μt, iG, μt, jH). Under normalized-sum message passing, each μt, iG is a (sub)probability measure supported on the neighbors of i (or on a neighborhood multiset with weights derived from degrees). In the worst case (e.g. dense graphs), a node has Θ(N) neighbors, so the support sizes of the two marginals are Θ(N), and the ground cost is given by the restriction of the full (nG × nH) matrix Ct − 1 to these supports. Even if one represents μt, iG and μt, jH as vectors over the full vertex sets (padding zeros outside the neighborhood), the induced transportation problem remains an unbalanced OT instance on a complete bipartite graph with Θ(N) nodes per side.

Exact (non-entropic) unbalanced OT with general costs is a min-cost flow / transportation-type problem. With support size s per side, a standard baseline is O(s^3 log s) time via cost-scaling methods (up to problem-dependent constants), and we may take s = Θ(N) in the worst case. Hence an exact OT evaluation costs O(N^3 log N) time in the dense regime.

For sparse graphs with maximum degree Δ ≪ N, the per-subproblem support size can be bounded by s = O(Δ), yielding O(Δ^3 log Δ) time per OT call. However, the multiplicity nGnH remains, so even in this favorable case the total work per layer scales as O(nGnH Δ^3 log Δ), which is still prohibitive when nGnH is large (e.g. in retrieval scenarios).

Combining the preceding two observations yields the stated baseline. In the worst case nG ≍ nH ≍ N and supports of neighborhood measures are Θ(N), so at each depth we perform N^2 OT problems at O(N^3 log N) time each:
$$ \text{time per layer} \;=\; O\big(N^{2}\cdot N^{3}\log N\big) \;=\; O\big(N^{5}\log N\big), \qquad \text{total time} \;=\; O\big(L\,N^{5}\log N\big). $$
The final outer OT on nG versus nH nodes adds another O(N^3 log N) term, which is negligible compared to LN^5.

In addition, a direct implementation requires storing (at least) the current cost matrix Ct − 1 ∈ ℝnG × nH in order to serve as the ground costs for all subproblems at depth t, implying Θ(nGnH) = Θ(N^2) memory. This is already at the limit for many large-scale applications, and it becomes more acute when one considers repeated evaluations (e.g. database search) where multiple such matrices might be required or where blockwise computation becomes necessary.

We summarize the obstruction as follows. The DIDM recursion couples node pairs across graphs at every depth, and each coupling requires solving an OT problem whose ground costs are themselves outputs of the previous depth. This ``OT-of-OT’’ nesting is conceptually natural but algorithmically expensive: the baseline cost driver is not the formation of neighborhood measures (which is near-linear in mG + mH), but the nGnH exact OT subproblems per layer, each of size up to N. This is precisely the computational pressure that motivates the two relaxations developed next: entropic regularization (to replace exact min-cost flow by scalable matrix-scaling iterations) and histogram coarsening (to reduce support size from Θ(N) to k).


5. Entropic Unbalanced OT as a Drop-in Replacement: define entropic UOT objective; show computability via generalized Sinkhorn; state bias bounds vs exact UOT.

The baseline obstruction identified above is the repeated appearance of unbalanced OT objectives whose supports can be large and whose ground costs vary across layers. Our first relaxation is to replace each (non-entropic) unbalanced OT call by an entropic-regularized unbalanced OT objective, which admits a scalable matrix-scaling solver and interacts well with the recursive structure.

Let M ∈ ℝ+p × q be a ground cost matrix and let a ∈ ℝ+p, b ∈ ℝ+q be nonnegative vectors with ∥a∥1, ∥b∥1 ≤ 1 (subprobability marginals, as induced by normalized-sum aggregation). We fix an unbalanced-penalty strength λ > 0 (as in the source formulation) and define the entropic unbalanced OT value by
$$ \mathrm{UOT}_{M,\eta}(a,b) \;:=\; \min_{\Gamma\in\mathbb{R}_+^{p\times q}} \;\Big\langle \Gamma,M\Big\rangle \;+\;\lambda\,\mathrm{KL}(\Gamma\mathbf{1}\,\|\,a) \;+\;\lambda\,\mathrm{KL}(\Gamma^\top\mathbf{1}\,\|\,b) \;+\;\eta\sum_{i=1}^p\sum_{j=1}^q \Gamma_{ij}\big(\log \Gamma_{ij}-1\big), $$
where 1 denotes the all-ones vector of the appropriate dimension and
$$ \mathrm{KL}(u\,\|\,v)\;:=\;\sum_{i} \Big(u_i\log\frac{u_i}{v_i}-u_i+v_i\Big), \qquad u,v\in\mathbb{R}_+^{r}, $$
with the standard conventions 0log 0 = 0 and KL(u ∥ v) = +∞ if u is not absolutely continuous with respect to v (componentwise). The objective is strictly convex in Γ for η > 0, hence the minimizer exists and is unique.

In the recursion, this replacement is literal: whenever the exact recursion requests OTCt − 1(μ, ν) we instead compute UOTCt − 1, η(μ, ν), and at the outermost level we compute UOTCL, η(unif(V(G)), unif(V(H))). We denote by δDIDML, η the resulting fully entropic variant (no coarsening yet); later we will additionally compress supports to obtain δ̂DIDML.

The computational advantage is that UOTM, η can be evaluated by a generalized Sinkhorn procedure. Define the Gibbs kernel
$$ K \;:=\; \exp(-M/\eta)\;\in\;\mathbb{R}_+^{p\times q}, $$
where the exponential is taken entrywise. One may show (via first-order optimality conditions) that the unique minimizer has the scaling form
$$ \Gamma \;=\; \mathrm{diag}(u)\,K\,\mathrm{diag}(v) $$
for some scaling vectors u ∈ ℝ+p, v ∈ ℝ+q. These scalings satisfy nonlinear fixed-point relations determined by the unbalanced penalty. A convenient parametrization uses the exponent
$$ \alpha\;:=\;\frac{\lambda}{\lambda+\eta}\in(0,1), $$
and the iterative updates
$$ u \;\leftarrow\; \big(a \oslash (Kv)\big)^{\alpha}, \qquad v \;\leftarrow\; \big(b \oslash (K^{\top}u)\big)^{\alpha}, $$
where ⊘ denotes componentwise division and the power is taken componentwise. For η > 0 and strictly positive initialization, these iterations preserve positivity. Under mild conditions (in particular, K strictly positive, which holds when M is finite), the iterates converge to the unique scaling pair (u, v), yielding Γ.

Each Sinkhorn step requires forming Kv and K⊤u, which costs O(pq) arithmetic operations. Consequently, for a k × k ground matrix (as will arise after coarsening), T Sinkhorn steps cost O(k^2 T) time and O(k^2) memory if K (or M) is materialized. In the uncompressed setting, one obtains the same O(s^2 T) dependence on the marginal support sizes s, replacing the O(s^3) baseline of exact min-cost flow.

Once Γ^(T) = diag(u^(T)) K diag(v^(T)) is obtained after T iterations, we evaluate the primal objective at Γ^(T) to obtain an approximate value UOT^(T)_{M, η}(a, b). In practice, when η is small one should work in the log-domain to avoid underflow in K = exp (−M/η) and to stabilize the scaling iterations; this does not alter the asymptotic operation count.
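For concreteness, a minimal plain-domain numpy implementation of these scaling iterations is sketched below; the function name and numerical guards are ours, and, as just noted, a log-domain variant should be preferred when η is small.

```python
import numpy as np

def sinkhorn_uot(M, a, b, lam, eta, T):
    """Approximate UOT_{M,eta}(a, b) with T generalized Sinkhorn iterations.

    M   : (p, q) ground cost;  a : (p,), b : (q,) nonnegative marginals
    lam : KL marginal-penalty strength;  eta : entropic regularization
    Returns (objective value at the T-th iterate, transport plan).
    """
    a, b = np.asarray(a, float), np.asarray(b, float)
    K = np.exp(-M / eta)                          # Gibbs kernel
    alpha = lam / (lam + eta)                     # exponent of the updates
    u, v = np.ones(len(a)), np.ones(len(b))
    tiny = 1e-300                                 # guard against division by zero
    for _ in range(T):
        u = (a / np.maximum(K @ v, tiny)) ** alpha
        v = (b / np.maximum(K.T @ u, tiny)) ** alpha
    G = u[:, None] * K * v[None, :]               # Gamma^(T) = diag(u) K diag(v)

    def kl(x, y):
        x, y = np.maximum(x, tiny), np.maximum(y, tiny)
        return float(np.sum(x * np.log(x / y) - x + y))

    ent = float(np.sum(G * (np.log(np.maximum(G, tiny)) - 1.0)))
    val = float(np.sum(G * M)) + lam * (kl(G.sum(1), a) + kl(G.sum(0), b)) + eta * ent
    return val, G
```

On a k × k ground cost this is the O(k^2 T) routine invoked for every cross-graph node pair in the compressed recursion developed later.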

Entropic regularization introduces a controlled bias. Let OTM(a, b) denote the corresponding non-entropic unbalanced OT value (same λ, no entropy term). Since the entropy term is nonnegative up to an additive constant depending on Γ1, one expects UOTM, η(a, b) ≤ OTM(a, b) and a gap that vanishes as η → 0. A standard quantitative bound (proved by comparing an optimal plan for OTM to a smoothed plan and using that the negative entropy is bounded by log (pq) on couplings of bounded mass) yields an estimate of the form
$$ 0 \;\le\; \mathrm{OT}_{M}(a,b) - \mathrm{UOT}_{M,\eta}(a,b) \;\le\; \eta\,\log(pq)\,\|\Gamma_{\mathrm{OT}}\|_{1} \;\le\; \eta\,\log(pq), $$
where ΓOT is an optimal coupling for the non-entropic problem and we used ∥ΓOT∥1 ≤ 1 for subprobability marginals. Thus, for fixed support size k we may regard the entropic bias as δent(η) = O(η log k) per OT call. This is precisely the bias term to which we will allocate an ε/L budget in the layerwise error propagation.

Replacing UOTM, η by T Sinkhorn steps introduces an additional error δsink(T) in objective value. For strictly positive kernels, generalized Sinkhorn is a contraction in an appropriate projective metric, and one obtains geometric convergence of the iterates; translating this into an objective gap yields a bound of the schematic form
$$ \big|\mathrm{UOT}^{(T)}_{M,\eta}(a,b) - \mathrm{UOT}_{M,\eta}(a,b)\big| \;\le\; \delta_{\mathrm{sink}}(T), \qquad \delta_{\mathrm{sink}}(T)\downarrow 0 \ \text{ as } T\to\infty, $$
with constants depending on η, λ, and dynamic range of M. For our purposes it suffices that, for fixed η > 0, choosing T polynomially large in log (1/ε) makes δsink(T) ≤ ε/L.

Entropic regularization replaces each exact transportation solve by T matrix-scaling iterations and yields (i) a runtime that is quadratic in the support size rather than cubic, and (ii) an explicit bias–accuracy knob η that can be tuned in the global ε budget. The remaining issue is that, without further compression, the neighborhood measures may still have large supports. We therefore next introduce a multiscale coarsening that reduces each neighborhood marginal to a k-bin histogram while preserving the induced OT costs up to a controlled additive error.


6. Multiscale Coarsening of Neighborhood Measures: define coarsening operator 𝒞t mapping node-wise neighbor measures to k-bin histograms; show how to build costs between bins from previous-layer costs; discuss choices (k-means on features, WL-style hashing, random projections).

At depth t ≥ 1, the exact recursion requires evaluating, for many node pairs (i, j) ∈ V(G) × V(H), an unbalanced OT cost between the neighbor measures of i and j with ground metric given by the previous-layer IDM costs. Even after introducing entropic regularization, the support sizes of these neighbor measures can be as large as deg (i) and deg (j), which in turn forces each local OT problem to scale with the (possibly large) degrees. We therefore introduce a coarsening operator 𝒞t which compresses each node-wise neighbor measure to a k-bin histogram, together with a corresponding k × k ground cost matrix between bins.

We work with normalized-sum aggregation, hence each node induces a subprobability distribution over its neighbors. Concretely, for i ∈ V(G) we define the (unnormalized) neighbor multiset measure
$$ \mu^{G}_{t,i}\;:=\;\frac{1}{|V(G)|}\sum_{u\in N_G(i)} w_{iu}\,\delta_{u}, $$
where wiu ≥ 0 are optional edge weights (take wiu ≡ 1 in the unweighted case). Thus ∥μt, iG∥1 ≤ 1 and similarly for μt, jH. The local OT term at depth t compares μt, iG to μt, jH using the ground costs dIDMt − 1(⋅, ⋅) (or their current approximation). Our coarsening replaces μt, iG and μt, jH by k-dimensional histograms while preserving these local OT values up to a controlled additive error.

Fix a layer t. We choose partitions (clusterings)
ΠtG : V(G) → [k],   ΠtH : V(H) → [k],
mapping each node to one of k bins. Given ΠtG, the coarsening operator 𝒞tG maps any measure μ on V(G) to a histogram in +k by summing mass within each bin:
(𝒞tG(μ))[a] := μ({u ∈ V(G) : ΠtG(u) = a}),   a ∈ [k].
In particular, the compressed neighbor representation of node i is
htG(i) := 𝒞tG(μt, iG) ∈ ℝ+k,   htH(j) := 𝒞tH(μt, jH) ∈ ℝ+k.
These histograms can be constructed in a single streaming pass over adjacency lists: for each edge (i, u) ∈ E(G) we increment htG(i)[ΠtG(u)] by wiu/|V(G)|. The time is O(mG) per layer (and analogously O(mH)), while the memory footprint is O((nG + nH)k) for storing all histograms.
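A minimal numpy sketch of this streaming pass is given below; the function name and the dense (n, k) output layout are illustrative.

```python
import numpy as np

def neighbor_histograms(n, k, edges, labels, weights=None):
    """Coarsen every node's neighbor measure into a k-bin histogram.

    n, k    : number of nodes and number of bins
    edges   : (m, 2) integer array with rows (i, u)
    labels  : (n,) bin assignment Pi_t, values in {0, ..., k-1}
    Returns H of shape (n, k) with H[i, a] = mu_i({u : Pi_t(u) = a}).
    """
    edges = np.asarray(edges)
    w = np.ones(len(edges)) if weights is None else np.asarray(weights, float)
    H = np.zeros((n, k))
    # one streaming pass: add w_iu / n to bin labels[u] of row i
    np.add.at(H, (edges[:, 0], labels[edges[:, 1]]), w / n)
    return H
```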

To compare htG(i) and htH(j) via OT, we must define a k × k ground cost matrix Mt that approximates the cost of moving mass between a bin of G and a bin of H under the previous-layer ground metric. Let Ĉt−1 ∈ ℝ+nG × nH denote the currently available approximation of the previous-layer cross-graph cost matrix (intended to approximate dIDMt − 1 on node pairs). For bins a, b ∈ [k], define the cluster sets
$$ (\Pi^{G}_t)^{-1}(a) \;:=\; \{u\in V(G): \Pi^{G}_t(u)=a\}, \qquad (\Pi^{H}_t)^{-1}(b) \;:=\; \{v\in V(H): \Pi^{H}_t(v)=b\}. $$
A simple and effective choice is the average linkage cost
$$ M_t[a,b]\;:=\;\frac{1}{|\Pi^{G^{-1}}_t(a)|\,|\Pi^{H^{-1}}_t(b)|} \sum_{u\in \Pi^{G^{-1}}_t(a)}\;\sum_{v\in \Pi^{H^{-1}}_t(b)} \widehat C_{t-1}[u,v], $$
with the convention that empty clusters are disallowed (or merged) so that denominators are nonzero. Alternatives include medoid linkage (choosing representatives ua ∈ (ΠtG)−1(a) and vb ∈ (ΠtH)−1(b) and setting Mt[a, b] = Ĉt−1[ua, vb]), or trimmed averages to reduce sensitivity to outliers. Any such construction is admissible provided it yields a uniform bound on within-bin variability (formalized below via a cluster radius).
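The average-linkage construction can be written compactly with one-hot indicator matrices, as in the following sketch (the function name is ours; empty bins are assumed to have been merged away beforehand).

```python
import numpy as np

def average_linkage_cost(C_hat, labels_G, labels_H, k):
    """Bin-to-bin cost M_t[a, b] = average of C_hat over the cluster pair (a, b).

    C_hat    : (nG, nH) current cross-graph cost matrix
    labels_G : (nG,) bin assignments of G's nodes, values in {0, ..., k-1}
    labels_H : (nH,) bin assignments of H's nodes
    """
    P = np.eye(k)[labels_G]                  # (nG, k) indicator of the G-partition
    Q = np.eye(k)[labels_H]                  # (nH, k) indicator of the H-partition
    block_sums = P.T @ C_hat @ Q             # S_t[a, b]: sum of C_hat over the block
    sizes = np.outer(P.sum(0), Q.sum(0))     # |cluster a| * |cluster b|
    return block_sums / sizes
```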

With Mt in hand, we replace each degree-scale local problem by a k-support problem:
$$ \mathrm{UOT}_{\widehat C_{t-1},\,\eta}\big(\mu^{G}_{t,i},\,\mu^{H}_{t,j}\big) \;\;\rightsquigarrow\;\; \mathrm{UOT}_{M_t,\,\eta}\big(h^{G}_t(i),\,h^{H}_t(j)\big). $$
Since k is fixed across all pairs (i, j), each evaluation costs O(k^2 T) arithmetic operations via generalized Sinkhorn, independent of the degrees. The only remaining question is how to choose the partitions ΠtG, ΠtH so that this replacement is accurate.

We aim to ensure that nodes grouped into the same bin are ``close’’ under the previous-layer IDM, in the sense that Ĉt−1[u, v] does not vary too much when u ranges within a bin of G and v ranges within a bin of H. One convenient sufficient condition is a bounded-radius hypothesis: there exists rt ≥ 0 such that for all a ∈ [k] and all u, u′ ∈ (ΠtG)−1(a) we have
$$ \sup_{v\in V(H)}\big|\widehat C_{t-1}[u,v]-\widehat C_{t-1}[u',v]\big| \;\le\; r_t, $$
and analogously with the roles of G and H swapped. Under such a condition, transporting mass within a bin changes costs by at most rt, and the induced error in replacing a measure by its histogram is controlled in OT. In practice, we enforce this criterion approximately by refining the partition as t increases (hence ``multiscale’’): early layers may use small k (coarse bins) because costs are dominated by attributes, while later layers may require larger k to keep rt within the allocated error budget. For simplicity we continue to write k, but one may take k = kt without altering the structure of the analysis.

We record three implementation patterns for constructing ΠtG, ΠtH from data available at depth t: (i) k-means clustering on per-node sketches of the current cost rows (or on cached node features); (ii) WL-style hashing of local neighborhoods, refined across depths; and (iii) random-projection bucketing, which assigns nodes to bins by quantizing one or a few random projections of their cost-row sketches. A minimal sketch of the third pattern is given after the next paragraph.

In all cases, the coarsening step is compatible with streaming edge access, and it isolates the expensive transportation computations to a fixed k × k scale. We now combine this coarsening with entropic UOT recursively to obtain the full compressed entropic DIDM distance and its stated runtime bounds.
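As one concrete instance of the third pattern, the following sketch buckets nodes by quantizing a single random projection of their cost rows (or of any per-node sketch); the function name is ours, and more elaborate variants (several projections, or k-means on the sketches) follow the same template.

```python
import numpy as np

def partition_by_projection(X, k, seed=0):
    """Assign each row of X (a node's cost-row sketch or feature vector) to one
    of k bins by quantile-bucketing a random one-dimensional projection."""
    rng = np.random.default_rng(seed)
    proj = X @ rng.standard_normal(X.shape[1])        # random 1-D projection
    cuts = np.quantile(proj, np.linspace(0, 1, k + 1)[1:-1])
    return np.searchsorted(cuts, proj)                # labels in {0, ..., k-1}
```

Quantile bucketing keeps the bins roughly balanced and nonempty whenever the number of nodes is at least k.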


7. Main Algorithm: compute δ̂DIDML via recursive compressed entropic OT; include implementation details for sparse adjacency and batching; discuss optional low-rank/landmark extension.

We now specify the procedure producing δ̂DIDML((G, f), (H, g)) by combining (i) the recursive DIDM/IDM construction, (ii) the coarsening operator 𝒞t from the previous section, and (iii) entropic-regularized unbalanced OT evaluated approximately by T generalized Sinkhorn iterations. The output is a single scalar, but the recursion naturally exposes an intermediate cross-graph cost matrix Ĉt ∈ ℝ+nG × nH at each depth t, which we exploit both for defining the next-layer partitions and for building the bin-to-bin costs.

We begin with the attribute-level ground costs
$$ \widehat C_0[i,j] \;:=\; \|f(i)-g(j)\|_2, \qquad (i,j)\in V(G)\times V(H). $$
This is the only step that directly inspects f, g; thereafter, attributes enter only through Ĉ0 being carried along additively. We set Ĉ ← Ĉ0 and maintain the invariant that Ĉ represents the current approximation to the recursive IDM ground costs at the current depth.

At depth t we compute an additive correction D̂t ∈ ℝ+nG × nH whose (i, j) entry approximates the local neighbor-measure OT term. Concretely, we execute the following steps.


(i) We choose clusterings ΠtG : V(G) → [k] and ΠtH : V(H) → [k] using any of the constructions described previously (cost-row sketches and k-means, WL-style hashing, or projection bucketing). The only algorithmic requirement is that the partitions can be computed from data available at depth t (typically from Ĉ and/or cached node features) and that empty bins are avoided (e.g., by merging into the nearest nonempty bin).


(ii) Given ΠtG, ΠtH, we form the bin-to-bin costs
$$ M_t[a,b]\;=\;\frac{1}{|\Pi^{G^{-1}}_t(a)|\,|\Pi^{H^{-1}}_t(b)|}\sum_{u\in\Pi^{G^{-1}}_t(a)}\sum_{v\in\Pi^{H^{-1}}_t(b)}\widehat C[u,v], $$
or an alternative linkage (e.g., medoids) if desired. Operationally, we compute Mt with a single pass over Ĉ by accumulating block sums
$$ S_t[a,b] \;:=\; \sum_{u:\,\Pi^{G}_t(u)=a}\ \ \sum_{v:\,\Pi^{H}_t(v)=b} \widehat C[u,v], $$
and then dividing by cluster sizes. This costs O(nGnH) time and O(k^2) auxiliary memory beyond storing Ĉ.


(iii) We compute histograms via streaming adjacency access:
for each edge (i, u) ∈ E(G) we add wiu/|V(G)| to coordinate ΠtG(u) of htG(i), and similarly for H. This requires O(mG + mH) time per layer and O((nG + nH)k) memory.


(iv) For each pair (i, j) ∈ V(G) × V(H) we define
$$ \widehat D_t[i,j] \;:=\; \mathrm{Sinkhorn\text{-}UOT}\big(M_t,\ h^{G}_t(i),\ h^{H}_t(j);\ \eta,\ T\big), $$
where Sinkhorn-UOT denotes the entropic unbalanced OT solver specified in Section~5. Since Mt is shared by all pairs at depth t, we may precompute the Gibbs kernel Kt := exp (−Mt/η) ∈ ℝ+k × k once, and reuse it throughout the nGnH subproblems.


(v) We update the running cross-graph ground costs by
$$ \widehat C \;\leftarrow\; \widehat C \;+\; \widehat D_t, $$
so that Ĉ carries the attribute term and all neighbor-OT corrections accumulated up to depth t.

After completing depths t = 1, …, L, we return the global distance as an entropic unbalanced OT between uniform node measures on V(G) and V(H) with ground cost Ĉ:
$$ \widehat\delta^{L}_{\mathrm{DIDM}}((G,f),(H,g)) \;:=\; \mathrm{Sinkhorn}\text{-}\mathrm{UOT}\big(\widehat C,\ \tfrac{1}{n_G}\mathbf{1}_{n_G},\ \tfrac{1}{n_H}\mathbf{1}_{n_H};\ \eta,\ T\big). $$
In applications where exact mass preservation is desired, one may take a nearly balanced regime (large mass-penalty parameters in the unbalanced formulation), but we keep the unbalanced setting as in the source definition.
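Putting the steps together, the following compact driver sketches the full recursion using the helper functions from the earlier sketches (partition_by_projection, average_linkage_cost, neighbor_histograms, sinkhorn_uot). These names are our illustrations rather than a fixed API, the inner double loop is left unbatched for clarity (a tensorized variant is sketched below), and we assume nG, nH ≥ k so the quantile-based partitions have nonempty bins.

```python
import numpy as np

# Uses partition_by_projection, average_linkage_cost, neighbor_histograms,
# and sinkhorn_uot from the earlier sketches.
def approx_didm_distance(C0, edges_G, edges_H, nG, nH, L, k, lam, eta, T, seed=0):
    """Compressed entropic recursion, steps (i)-(v), followed by the outer UOT."""
    C_hat = C0.copy()                                      # attribute-level costs
    for _ in range(L):
        # (i) layer partitions from the current cost rows / columns
        labels_G = partition_by_projection(C_hat, k, seed)
        labels_H = partition_by_projection(C_hat.T, k, seed)
        # (ii) k x k bin-to-bin costs by average linkage
        M = average_linkage_cost(C_hat, labels_G, labels_H, k)
        # (iii) coarsened neighbor histograms
        HG = neighbor_histograms(nG, k, edges_G, labels_G)
        HH = neighbor_histograms(nH, k, edges_H, labels_H)
        # (iv) one small entropic UOT per cross-graph pair
        D_hat = np.empty((nG, nH))
        for i in range(nG):
            for j in range(nH):
                D_hat[i, j], _ = sinkhorn_uot(M, HG[i], HH[j], lam, eta, T)
        # (v) accumulate the depth-t correction
        C_hat = C_hat + D_hat
    # outer entropic UOT between uniform node measures with ground cost C_hat
    val, _ = sinkhorn_uot(C_hat, np.full(nG, 1 / nG), np.full(nH, 1 / nH), lam, eta, T)
    return val
```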

The sparse structure of G and H affects only the histogram construction. We store adjacency lists in CSR/CSC form and build the k-dimensional histograms in one pass over edges; no per-node sorting is required. When k is moderate, we store histograms densely as contiguous arrays of length k per node to enable batched linear algebra in the Sinkhorn loops; when k is large but degrees are small, we may store histograms sparsely and densify only within batches.

The dominant term is the evaluation of D̂t[i, j] over nGnH pairs, each involving T iterations over k × k matrix–vector products. We therefore implement step (iv) in batches:
choose blocks I ⊆ V(G) and J ⊆ V(H) and compute {D̂t[i, j]}i ∈ I, j ∈ J jointly using tensorized operations. Since Kt is shared, each Sinkhorn iteration reduces to applying Kt and Kt⊤ to many vectors in parallel, which is well-suited to GPU/BLAS3 execution. Memory permits storing Ĉ as a dense matrix in the dense regime; otherwise, one may store and update Ĉ in blocks, writing blocks to disk or recomputing them as needed. The only subtlety is that forming Mt from Ĉ requires access to all entries; in block mode we perform a streaming pass over blocks to accumulate St[a, b], then a second streaming pass to compute D̂t and update Ĉ blockwise.
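The batched step (iv) can be written so that one generalized Sinkhorn loop advances a whole block of pairs at once against the shared kernel Kt. The following numpy sketch (names ours) processes P pairs whose histograms are stacked row-wise; a GPU implementation would replace the arrays by device tensors.

```python
import numpy as np

def sinkhorn_uot_batch(M, A, B, lam, eta, T):
    """Entropic UOT values for P pairs that share the k x k ground cost M.

    A : (P, k) left histograms (one row per pair);  B : (P, k) right histograms.
    Returns an array of P objective values evaluated at the T-th iterate.
    """
    A, B = np.asarray(A, float), np.asarray(B, float)
    K = np.exp(-M / eta)                               # shared Gibbs kernel
    alpha = lam / (lam + eta)
    tiny = 1e-300
    U, V = np.ones(A.shape), np.ones(B.shape)
    for _ in range(T):                                 # BLAS3-friendly: (P,k) @ (k,k)
        U = (A / np.maximum(V @ K.T, tiny)) ** alpha
        V = (B / np.maximum(U @ K, tiny)) ** alpha
    G = U[:, :, None] * K[None, :, :] * V[:, None, :]  # (P, k, k) transport plans

    def kl(X, Y):
        X, Y = np.maximum(X, tiny), np.maximum(Y, tiny)
        return np.sum(X * np.log(X / Y) - X + Y, axis=-1)

    ent = np.sum(G * (np.log(np.maximum(G, tiny)) - 1.0), axis=(1, 2))
    return (np.sum(G * M, axis=(1, 2))
            + lam * (kl(G.sum(2), A) + kl(G.sum(1), B))
            + eta * ent)
```

Materializing the (P, k, k) plans to evaluate the objective costs O(P k^2) memory per block, so the block size P should be chosen with the available memory in mind.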

When nGnH is prohibitively large (e.g., retrieval over many candidates), we may replace Ĉ by a low-rank surrogate built from r ≪ min {nG, nH} landmarks per graph. At each depth, we compute costs only between all nodes and the opposite graph’s landmarks (or between landmark sets), derive partitions and Mt from these sketches, and evaluate local UOT terms using the coarsened histograms together with landmark-induced features. The resulting quantity is not, in general, an additive-ε approximation of the full δDIDML without further structural assumptions; nevertheless it yields a computationally efficient proxy that preserves stability trends and is empirically adequate for auditing and candidate filtering. The theoretical development in the next section therefore treats the full pairwise version as the primary object, and regards landmark compression as an optional acceleration layer on top of it.


8. Theory: per-layer error propagation lemma; overall approximation theorem; runtime and memory bounds; (optional) topology/weak* approximation under smoothing.

We analyze the discrepancy between the exact depth-L distance δDIDML((G, f), (H, g)) and the output δ̂DIDML((G, f), (H, g)) produced by Algorithm~1. Let Ct⋆ ∈ ℝ+nG × nH denote the exact recursive IDM ground-cost matrix at order t (so C0⋆[i, j] = ∥f(i) − g(j)∥2 and $C_t^\star=C_0^\star+\sum_{s=1}^t D_s^\star$, where Ds⋆[i, j] is the exact unbalanced OT term between the order-(s − 1) neighbor measures of i and j with ground metric dIDMs − 1). Likewise, let Ĉt be the matrix maintained by Algorithm~1 after completing depth t, and D̂t its additive correction at depth t.

We work with the entrywise norm ∥A∥∞ := maxi, j|A[i, j]|. Fix a depth t ≥ 1. The update Ĉt = Ĉt − 1 + D̂t differs from Ct⋆ = Ct − 1⋆ + Dt⋆ for three logically distinct reasons: (a) the depth-t ground costs are taken from Ĉt − 1 (and further aggregated into the bin-to-bin matrix Mt) rather than from Ct − 1⋆; (b) the neighbor measures are replaced by their k-bin coarsenings; and (c) each local OT value is computed with entropic regularization and finitely many Sinkhorn iterations.

The following lemma packages these contributions into an additive bound suitable for induction.


Fix a pair (i, j). Write μi and νj for the exact neighbor measures at depth t (over V(G) and V(H), respectively), and μ̂i = 𝒞t(μi), ν̂j = 𝒞t(νj) for their coarsened k-bin representations. The exact local term has the form Dt⋆[i, j] = OTCt − 1⋆(μi, νj), whereas Algorithm~1 computes an entropic approximation based on the bin-to-bin cost Mt derived from Ĉt − 1 and on (μ̂i, ν̂j). By Theorem~1 (Lipschitz stability in the ground cost), replacing Ct − 1⋆ by Ĉt − 1 perturbs any OT value by at most ∥Ĉt − 1 − Ct − 1⋆∥∞ ≤ εt − 1 (after accounting for the aggregation into Mt, whose entries are averages and hence inherit the same perturbation bound). By Theorem~2, replacing (μi, νj) by (μ̂i, ν̂j) perturbs the OT value by at most δcoarse, t. Finally, by definition of δent, t(η) and δsink, t(T), the entropic relaxation and finite Sinkhorn iterations contribute additively. Summing these contributions gives |D̂t[i, j] − Dt⋆[i, j]| bounded by the right-hand side, and adding the common update Ĉt = Ĉt − 1 + D̂t yields the stated bound.

Iterating the per-layer lemma yields an explicit telescoping control of the error across depths. In particular, letting ε0 = 0 and defining εt recursively by the lemma’s right-hand side, we obtain $\|\widehat C_L-C_L^\star\|_\infty\le \sum_{t=1}^L\big(\delta_{\mathrm{coarse},t}+\delta_{\mathrm{ent},t}(\eta)+\delta_{\mathrm{sink},t}(T)\big)$. The final output δ̂DIDML is itself an entropic Sinkhorn-UOT value computed with ground cost ĈL, so we incorporate one additional entropic-and-iteration error term at the top level.


The coarsening requirement is met whenever the induced cluster radii rt (in the order-(t − 1) metric) satisfy rt ≲ ε/L, which can be enforced by increasing k or by refining partitions adaptively until a target within-bin dispersion criterion is met. The entropic bias δent, t(η) decreases to 0 as η → 0 (on bounded-cost instances), while the iteration error δsink, t(T) decreases with T for fixed η; thus we may first choose η = η(ε, L) small enough to drive the bias below ε/(3L), and then choose T = T(ε, η) large enough to drive the solver error below ε/(3L). The theorem is therefore an existence statement with an explicit additive budget across the L recursive layers.

We now account for the computational cost implied by Algorithm~1. At each depth t, histogram construction costs O(mG + mH) via a single streaming pass over adjacency lists. Computing the bin-to-bin costs Mt by block sums costs O(nGnH) time given access to Ĉt − 1, and the dominant term is the evaluation of nGnH local entropic UOT problems on a shared k × k ground cost. If each Sinkhorn-UOT call takes O(k^2 T) arithmetic operations, then the total time is
$$ O\big(L\,n_G n_H\,k^{2}\,T \;+\; L\,(m_G+m_H)\big), $$
and the working memory is O(nGnH + (nG + nH)k) to store Ĉ and the per-node histograms (with optional block mode reducing peak memory at the expense of additional streaming passes). In the dense regime and for constant (k, T, L), this is near-quadratic in N = max {nG, nH}, which is information-theoretically unavoidable up to polylogarithmic factors under the standard dense-input model (Theorem~5).

Finally, we record the conceptual consequence of fixing η > 0 and interpreting Algorithm~1 (with k large and T → ∞) as computing a smoothed distance δDIDML, η in which all OT subproblems are replaced by exact entropic unbalanced OT. On compact families of graph-signals with uniformly bounded per-layer costs, the entropic objective converges to the unregularized objective as η → 0, and the layerwise perturbation bound lifts this convergence through the recursion. Concretely, one obtains a distortion bound of the form |δDIDML, η − δDIDML| ≤ O(Lη) (Proposition~6), and δDIDML, η metrizes the same weak* topology on the corresponding DIDM space. In applications where statistical stability is prioritized over exactness, this observation justifies choosing η as a fixed smoothing scale and treating δDIDML, η as the primary target, with δ̂DIDML serving as its computational surrogate via finite T and moderate k.


9. Empirical Evaluation Plan: runtime scaling; correlation with MPNN output perturbations; retrieval and OOD detection; ablations on (k, η, T); comparisons to exact on small graphs and to other metrics on medium graphs.

We now specify an empirical program whose purpose is to validate, in controlled and application-relevant regimes, the two claims implicit in our theory: (i) the approximation–runtime tradeoffs predicted by the theorems of Section~8 (in particular Theorem~4) are observable at practical scales, and (ii) the resulting distance is a meaningful proxy for stability of message passing outputs and for downstream tasks such as retrieval and out-of-distribution (OOD) detection.

We will evaluate on a mixture of (a) synthetic graph-signals where we can control n, m, attribute dimension, and perturbations, and (b) standard attributed-graph benchmarks with moderate node counts. For synthetic data we consider: Erdős–Rényi graphs at varying densities (to separate sparse vs. dense behavior), stochastic block models (to introduce mesoscopic structure relevant to coarsening), and geometric random graphs (to align with Wasserstein-type behavior). For real data we include molecular graphs (small n, sparse m), ego-networks and citation graphs (medium n, sparse m), and (when feasible) dense similarity graphs derived from kNN constructions on point clouds. In all cases we normalize the neighborhood measures consistently with the normalized-sum convention in the distance definition.

We will measure wall-clock time and peak memory as functions of (nG, nH, mG, mH), depth L, and parameters (k, η, T). The primary diagnostic is a scaling plot in which we fix (k, η, T, L) and vary nG ≈ nH ≈ N, reporting the empirical exponent in the fit time ≈ c·N^α separately for sparse and dense graph families. In the dense-input model, we expect α ≈ 2 (up to polylog factors), consistent with Theorem~5; in the sparse regime, we expect the histogram construction term to scale as mG + mH while the cross-graph interaction remains governed by nGnH unless block mode or landmark compression is enabled. We will therefore report three configurations: (i) full materialization of Ĉt (baseline implementation of Algorithm~1), (ii) block processing of Ĉt to reduce peak memory, and (iii) optional landmark/Nyström-style compression as an explicit ``proxy mode’’ (not covered by the main guarantee but relevant for retrieval-scale deployments). For each configuration we will report throughput (pairs per second) for the nGnH local k × k Sinkhorn-UOT calls and isolate their contribution by profiling.

On instances with nG, nH ≤ 30–50, we will compute a high-accuracy reference for δDIDML by replacing each entropic UOT with a near-exact solver for the unregularized unbalanced OT (e.g., min-cost flow or a sufficiently small η with very large T) and by disabling coarsening (k = n with singleton bins). This provides an empirical upper bound on the gap
|δ̂DIDML − δDIDML|
as a function of (k, η, T) and L. We will report absolute error and relative error on instances where the reference distance is bounded away from 0. In addition, we will separately quantify the three error channels suggested by the per-layer lemma: (a) by comparing exact UOT vs. entropic UOT at fixed marginals and ground costs, (b) by tracking objective decrease with T, and (c) by comparing costs computed using true neighbor measures versus k-bin histograms under fixed partitions. This decomposition is intended to empirically corroborate the additive error budget in the overall approximation theorem.

We will perform a three-way sweep over k ∈ {8, 16, 32, 64, 128}, η on a logarithmic grid, and T ∈ {10, 20, 50, 100, 200}, at fixed L, reporting (i) runtime, (ii) peak memory, and (iii) approximation error relative to the small-graph reference where available. To connect to the theoretical control parameters, we will also report a measured within-bin dispersion statistic (an empirical surrogate for a radius rt) and evaluate whether larger k reduces this dispersion in a predictable manner. In settings without a reference distance, we will instead report self-consistency diagnostics (e.g., monotonicity of the computed value as T increases at fixed η, and stability of rankings as k increases). The outcome of this sweep will be a practical rule for setting (k, η, T) for a target latency or a target empirical error level.

A central motivation for DIDM-type distances is to quantify changes that are relevant to message passing. We will therefore test whether δ̂DIDML correlates with perturbations of a depth-L MPNN output. Concretely, fix an MPNN ΦL with normalized-sum aggregation, and for each base graph-signal (G, f) generate perturbed versions (G′, f′) by: (i) random edge deletions/additions at controlled rates, (ii) attribute noise f′(v) = f(v) + ξv with ∥ξv∥ controlled, and (iii) structured perturbations (e.g., rewiring within/between blocks in an SBM). For each perturbation we measure
$$ \Delta\Phi \;:=\; \|\Phi_L(G,f)-\Phi_L(G',f')\| \qquad\text{and}\qquad \Delta\delta \;:=\; \widehat\delta^{L}_{\mathrm{DIDM}}\big((G,f),(G',f')\big). $$
We will report Spearman and Kendall rank correlations between ΔΦ and Δδ across perturbation families, as well as calibration curves of ΔΦ conditional on bins of Δδ. As baselines we include simpler structural distances (e.g., degree distribution Wasserstein distance, spectral distances) and embedding distances (e.g., Euclidean distance between pooled node embeddings). The hypothesis is not that δ̂DIDML predicts ΔΦ exactly, but that it provides a monotone proxy aligned with the aggregation depth L.
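A minimal sketch of this protocol is given below. The perturbation sampler, the MPNN forward pass, and the distance routine are placeholders to be supplied by the experimental harness; only the rank-correlation bookkeeping is spelled out, using scipy's spearmanr and kendalltau.

```python
import numpy as np
from scipy.stats import kendalltau, spearmanr

def stability_correlation(base_graphs, perturb, mpnn_forward, didm_hat):
    """Rank correlation between MPNN output shifts and approximate DIDM distances.

    base_graphs  : list of graph-signals (G, f), in whatever representation the
                   three callables below expect
    perturb      : placeholder (G, f) -> (G', f') perturbation sampler
    mpnn_forward : placeholder (G, f) -> pooled output vector of a depth-L MPNN
    didm_hat     : placeholder ((G, f), (G', f')) -> approximate DIDM distance
    """
    d_phi, d_delta = [], []
    for Gf in base_graphs:
        Gf_pert = perturb(Gf)
        d_phi.append(np.linalg.norm(mpnn_forward(Gf) - mpnn_forward(Gf_pert)))
        d_delta.append(didm_hat(Gf, Gf_pert))
    rho, _ = spearmanr(d_phi, d_delta)
    tau, _ = kendalltau(d_phi, d_delta)
    return rho, tau
```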

For retrieval, we will use δ̂DIDML as a similarity score in kNN search: given a query graph-signal, rank a database by increasing distance and evaluate precision at fixed cutoffs and mean average precision under label-based relevance (e.g., molecular property classes, graph categories). For OOD detection, we consider a training set 𝒟in and test graphs drawn from a shifted distribution; we compute an OOD score
$$ s(x) := \min_{x'\in \mathcal{D}_{\mathrm{in}}}\widehat\delta^L_{\mathrm{DIDM}}(x,x') \quad\text{or}\quad s(x):=\frac{1}{q}\sum_{x'\in \mathrm{NN}_q(x)}\widehat\delta^L_{\mathrm{DIDM}}(x,x'), $$
and report ROC-AUC for distinguishing in- vs. out-of-distribution samples. We will compare against baselines built from (i) graph kernels (e.g., WL subtree kernels), (ii) optimal transport baselines at the graph level (e.g., FGW where feasible), and (iii) MPNN embedding-based Mahalanobis or kNN distances. In retrieval/OOD experiments we will emphasize block mode and (optionally) landmarks to ensure that memory does not scale as nGnH for large databases.

On medium-scale datasets where exact DIDM is infeasible, we will compare to metrics that are computationally standard: spectral distances, WL-based kernels, graphlet-based distances, and FGW/GW approximations with entropic regularization. We will report both task performance and computational cost, with the intent to position δ̂DIDML as a metric that is (a) explicitly depth-aligned with message passing, and (b) tunable through (k, η, T) to span a range of runtime–fidelity tradeoffs.


10. Discussion and Limitations: when coarsening fails; dependence on L; relation to aggregation choice; open problems (tight lower bounds, adaptive depth, transformer kernels).

We conclude by clarifying the regimes in which the proposed compressed entropic recursion is well behaved, and by isolating limitations that are intrinsic to the DIDM construction as well as those specific to our approximation strategy.

Our approximation replaces exact neighborhood measures by k-bin histograms induced by partitions ΠtG, ΠtH. Theorem~2 makes explicit that the coarsening error is controlled by a within-bin radius parameter rt in the relevant previous-layer metric. This is simultaneously a strength and a limitation: the guarantee is meaningful precisely when there exist partitions with small within-bin radius for the transport costs that matter. There are natural counter-regimes. First, if the previous-layer cross-graph cost matrix exhibits high intrinsic dimension (e.g., many near-orthogonal attribute directions and weak structural regularity), then any fixed k may force large rt, and the additive error term δcoarse, t may dominate. Second, if the graph contains many small, idiosyncratic neighborhoods (e.g., heavy-tailed degree with numerous unique local motifs), then coarse binning may blur rare but decisive mass, again inflating rt. Third, heterophilous graphs can produce neighbor measures whose support is effectively spread across many distant bins in the previous-layer metric; in such cases the coarsening step may systematically underestimate the cost of matching ``incompatible’’ neighborhoods unless k scales accordingly. These failure modes are not merely artifacts of our proof technique: any compression that replaces a distribution by a coarse quantization must pay, in the worst case, an error comparable to the quantization radius in the induced geometry.

A related practical limitation is that our pseudocode constructs Mt via averages over cross-cluster blocks of t − 1. This is a crude representative of inter-cluster geometry and may be unstable when within-cluster variability is large. One can mitigate this by storing additional per-block statistics (e.g., quantiles) or by refining Mt using a small set of sampled representatives, but the analysis would then require a correspondingly finer stability statement than Theorem~1 with respect to such surrogates.

Our approximation guarantee is additive and, in its simplest form, scales at most linearly with L through the telescoping recursion (Theorem~3). This linear dependence is sharp for a worst-case additive budget: each layer introduces an additional OT evaluation and hence an additional channel for entropic bias, Sinkhorn truncation error, and coarsening. Consequently, if one targets a fixed end-to-end tolerance ε while increasing L, then the per-layer tolerances must shrink on the order of ε/L, which in turn pushes (k, T) upward and η downward. From a computational standpoint, deeper distances are therefore unavoidably more expensive if one insists on uniform approximation quality across layers.

Independently of approximation, the discriminative signal captured by DIDM-type constructions may saturate for large L on graphs exhibiting mixing phenomena akin to oversmoothing in message passing. In such cases many neighborhoods become similar in distributional terms, and the incremental contribution of deeper layers to dIDMt may become small relative to numerical error. This suggests that, beyond a certain depth, the dominant concern is not only runtime but also statistical relevance: whether the additional layer meaningfully refines the geometry induced by the distance.

Our definitions, and hence our approximation program, are aligned with normalized-sum aggregation (the setting in which DIDMs naturally represent neighborhood measures as normalized subprobabilities). If one changes the aggregation rule, the induced measure structure can change qualitatively. For example, max aggregation is not linear in neighbor multisets and does not admit a direct representation as an OT between neighbor measures without additional lifting; degree-normalized variants modify the scaling of mass and can change the unbalancedness parameters implicit in the OT subproblems; attention-weighted aggregation introduces data-dependent reweightings that are not fixed by the graph alone. In such cases, a DIDM-style recursion may still be definable, but the stability constants and even the appropriate notion of marginal penalties in unbalanced OT may differ. Thus, while our method is conceptually modular (replace each OT call by entropic UOT; compress measures), the present guarantees should be viewed as specific to the normalized-sum convention and to costs that decompose as attribute discrepancy plus neighbor-transport discrepancy.

Theorem~5 only addresses an input-reading lower bound in the dense regime. It leaves open the more delicate question of whether, for fixed L, one can beat the apparent Θ(nGnH) cross-graph interaction cost in the dense setting for nontrivial accuracy guarantees. In particular, it would be valuable to establish conditional lower bounds (e.g., under SETH-type hypotheses) showing that even approximating δDIDML within additive ε requires near-quadratic time for worst-case attributed graphs, thereby justifying our focus on constant-factor improvements, coarsening, and entropic smoothing. Conversely, identifying structural assumptions (low doubling dimension of attributes, bounded treewidth, low-rank cost structure, or clusterability in the induced IDM metric) under which subquadratic algorithms are possible remains an attractive direction.

Our analysis allocates a uniform error budget ε/L per layer. This is conservative: empirically, some layers contribute little incremental mass transport, and spending computational effort uniformly can be wasteful. A principled adaptive-depth procedure would (i) estimate an upper bound on the remaining contribution of deeper layers to the final distance and (ii) terminate when this bound falls below a prescribed tolerance. Analogously, one could allocate (kt, ηt, Tt) per layer to balance coarsening and entropic bias against the magnitude of that layer’s contribution. Making such procedures rigorous appears to require sharper continuity properties of the recursion, including bounds that depend on observable quantities (e.g., measured within-bin dispersions and Sinkhorn residuals) rather than worst-case radii.

Finally, there is a conceptual connection between DIDM recursions and attention mechanisms: both compare neighborhoods via data-dependent similarity scores. One could replace the OT-based neighborhood comparison by a kernel induced by a transformer block (e.g., softmax attention over learned pairwise costs) and then iterate across layers. The resulting object would no longer be an OT distance, and the metric properties would generally fail, but one might gain expressive power and amortized computation via learned low-rank structure. Understanding whether one can retain stability guarantees reminiscent of Theorem~1 for such ``transformer kernels,’’ and whether they admit coarsening schemes analogous to 𝒞t with provable error control, constitutes a concrete open problem at the boundary of optimal transport, graph representation, and modern sequence models.

In summary, our theory delineates a tractable approximation envelope for DIDM distances through entropic smoothing and multiscale coarsening, but it also exposes the central bottlenecks: the geometric feasibility of quantization in the induced IDM metric, the inevitable accumulation with depth, and the sensitivity to the aggregation semantics that define the underlying neighborhood measures.