
Fast Approximation of DIDM Mover’s Distance via Entropic Multiscale Optimal Transport


Table of Contents

  1. Introduction: motivation from GNN stability metrics; bottleneck of exact δDIDML; contributions (algorithm + guarantees + empirical validation).
  2. Background and Notation: graph-signals, computation IDMs/DIDMs, exact δDIDML recursion; unbalanced OT; entropic regularization; multiscale clustering primitives.
  3. Problem Formulation: define approximation goals (additive/multiplicative), parameter regimes (dense vs sparse), and evaluation tasks (stability auditing, retrieval, OOD).
  4. Baseline Exact Computation and Bottlenecks: restate the O(LN^5 log N) method; identify the N^2 OT subproblems per layer and their size as the core cost driver.
  5. Entropic Unbalanced OT as a Drop-in Replacement: define entropic UOT objective; show computability via generalized Sinkhorn; state bias bounds vs exact UOT.
  6. Multiscale Coarsening of Neighborhood Measures: define coarsening operator 𝒞t mapping node-wise neighbor measures to k-bin histograms; show how to build costs between bins from previous-layer costs; discuss choices (k-means on features, WL-style hashing, random projections).
  7. Main Algorithm: compute δ̂DIDML via recursive compressed entropic OT; include implementation details for sparse adjacency and batching; discuss optional low-rank/landmark extension.
  8. Theory: per-layer error propagation lemma; overall approximation theorem; runtime and memory bounds; (optional) topology/weak* approximation under smoothing.
  9. Empirical Evaluation Plan: runtime scaling; correlation with MPNN output perturbations; retrieval and OOD detection; ablations on (k, η, T); comparisons to exact on small graphs and to other metrics on medium graphs.
  10. Discussion and Limitations: when coarsening fails; dependence on L; relation to aggregation choice; open problems (tight lower bounds, adaptive depth, transformer kernels).

Content

1. Introduction: motivation from GNN stability metrics; bottleneck of exact δDIDML; contributions (algorithm + guarantees + empirical validation).

Message passing graph neural networks (GNNs) are routinely deployed in regimes where the input graph is not fixed but subject to perturbations: edges may be missing, spurious, or reweighted; node features may be noisy; and the number of observed nodes may vary across samples and time. In such settings, a central methodological question is how to measure the discrepancy between two graph-signals (G, f) and (H, g) in a manner that is faithful to the computation performed by a depth-L message passing architecture. A discrepancy notion aligned with the network dynamics serves at least three purposes. First, it yields a stability metric for auditing robustness with respect to structural and feature perturbations. Second, it enables similarity search and retrieval of graphs in a geometry that reflects the model’s inductive bias. Third, it provides a principled way to compare learned representations across architectures and depths, by making explicit which local-to-global aggregations are deemed close.

The distance δDIDML we consider is defined by transporting hierarchical neighborhood measures induced by normalized-sum aggregation, iterated up to depth L. Informally, it compares nodes across two graphs by a base attribute cost, and then recursively augments this cost by an optimal transport discrepancy between the nodes’ (multi)set of neighbors, where neighbor-to-neighbor costs are themselves computed from the previous depth. This recursion naturally lives in a space of distributional representations (computation IDMs/DIDMs) and yields a global graph-to-graph distance as an outer unbalanced optimal transport (OT) problem. The use of OT is not incidental: it accommodates graphs with different sizes and degree distributions by permitting creation and destruction of mass at a controlled penalty, which is required when neighborhoods do not match in cardinality or when the induced measures are subprobabilities.

Despite its conceptual alignment with message passing, the exact computation of δDIDML is a bottleneck. At each depth t, and for each cross-graph node pair (i, j) ∈ V(G) × V(H), one must solve a transport problem between two neighborhood measures whose ground cost is the depth-(t − 1) node-pair metric. Even when the neighborhood measures are sparse, the number of node pairs is nGnH, and the recursion requires propagating costs across depths. In the worst case, exact OT reduces to min-cost flow computations, and the nested structure of the recursion amplifies runtime and memory demands. Consequently, while δDIDML is an attractive target for stability analysis, it is not directly usable as a drop-in primitive inside larger pipelines (e.g., retrieval, auditing across many graph pairs, or hyperparameter sweeps) unless we replace the exact OT calls by approximations with controlled error.

We address this gap by designing an approximation scheme that preserves the recursive structure while reducing the per-layer OT burden. The guiding principle is that, for the purpose of evaluating message-passing-induced discrepancies, it is often unnecessary to resolve fine-grained neighbor identities at every depth; rather, it suffices to compare neighbor distributions under a multiscale partition that is stable with respect to the previous-layer costs. We therefore compress each neighborhood measure into a k-bin histogram over clusters and compute OT only at the cluster level, while using entropic regularization to obtain fast Sinkhorn iterations. Entropic unbalanced OT yields a differentiable, numerically stable objective and admits efficient parallelization, at the cost of a controlled smoothing bias that can be traded against runtime. The resulting procedure is a recursion: at depth t we (i) build a k × k cluster-to-cluster cost matrix from the current cross-graph costs, (ii) encode each node’s neighborhood as a k-dimensional histogram, and (iii) evaluate a small entropic unbalanced OT problem for each node pair to obtain the depth-t contribution to the updated ground cost.

Our main contributions are as follows. First, we design an approximation algorithm that preserves the recursive DIDM structure while replacing each exact OT call by a compressed entropic unbalanced OT evaluation, combining multiscale coarsening of neighborhood measures with generalized Sinkhorn iterations (Sections 5–7). Second, we give guarantees: a per-layer error propagation lemma and an end-to-end additive approximation bound in terms of the parameters (k, η, T, L), together with runtime and memory bounds (Section 8). Third, we lay out an empirical validation program covering runtime scaling, correlation with MPNN output perturbations, retrieval, and OOD detection, with ablations over (k, η, T) (Section 9).

Finally, we note that entropic regularization induces a smoothed variant of the distance that is itself of independent interest: fixing η > 0 yields a stable, computationally convenient geometry on computation DIDMs that converges to the unsmoothed one as η → 0. This viewpoint clarifies the role of regularization as an explicit modeling choice rather than a purely numerical trick, and it provides a principled axis along which one can interpolate between fidelity to exact OT and computational tractability.


2. Background and Notation: graph-signals, computation IDMs/DIDMs, exact δDIDML recursion; unbalanced OT; entropic regularization; multiscale clustering primitives.

We collect the definitions needed to state the exact depth-L distance δDIDML and the approximation primitives used later. Throughout, (G, f) and (H, g) denote attributed graphs with node sets V(G), V(H) and node attributes f : V(G) → 𝒳 and g : V(H) → 𝒳, where 𝒳 ⊂ ℝd is compact. We write nG = |V(G)|, nH = |V(H)|, mG = |E(G)|, mH = |E(H)|, and N = max (nG, nH).

We view a graph-signal (G, f) as a finite relational structure equipped with node features. Let NG(i) denote the (possibly weighted) neighborhood of a node i ∈ V(G). In the normalized-sum message passing regime, the contribution of i to depth-t computations depends on a normalized empirical measure over its neighbors. Concretely, for each i ∈ V(G) we define a subprobability measure on V(G) by
$$ \mu^{G}_i \;=\; \frac{1}{|V(G)|}\sum_{u\in N_G(i)} w_{iu}\,\delta_u, $$
where wiu ≥ 0 are edge weights (default wiu = 1 in the unweighted case) and δu denotes the Dirac mass at u. The factor 1/|V(G)| yields total mass at most 1 and makes the construction compatible with varying graph sizes. Analogously, for j ∈ V(H) we obtain μjH.
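As a concrete illustration, the following minimal numpy sketch builds these neighbor measures from an edge list; the dense (n, n) array layout and the function name are illustrative choices of ours, not part of the definition.

```python
import numpy as np

def neighbor_measures(n, edges, weights=None):
    """Return an (n, n) array whose i-th row is the subprobability measure
    mu_i = (1/n) * sum_{u in N(i)} w_iu * delta_u from the definition above."""
    mu = np.zeros((n, n))
    for idx, (i, u) in enumerate(edges):
        w = 1.0 if weights is None else weights[idx]
        mu[i, u] += w / n                      # mass w_iu / |V(G)| on neighbor u
    return mu

# Example: a 4-node path graph (undirected, so both edge directions are listed).
edges = [(0, 1), (1, 0), (1, 2), (2, 1), (2, 3), (3, 2)]
mu_G = neighbor_measures(4, edges)
assert np.all(mu_G.sum(axis=1) <= 1.0 + 1e-12)  # each row is a subprobability
```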

The computation at depth t induces a cross-graph cost between nodes of G and H that we interpret as an (IDM) ground cost. We define the base (order-0) cost by the attribute discrepancy
$$ d^{0}_{\mathrm{IDM}}(i,j) \;=\; \|f(i)-g(j)\|_2, \qquad i\in V(G),\ j\in V(H). $$
For t ≥ 1, the depth-t cost augments the base discrepancy by a transport term between the neighbor measures, where the ground metric for transporting neighbor mass is the previous-depth cost. Writing OTd(⋅, ⋅) for the unbalanced OT objective used in the source with ground metric d, we set
$$ d^{t}_{\mathrm{IDM}}(i,j) \;=\; d^{0}_{\mathrm{IDM}}(i,j) \;+\; \mathrm{OT}_{d^{t-1}_{\mathrm{IDM}}}\big(\mu^{G}_i,\mu^{H}_j\big). $$
This recursion is well-defined because dIDMt − 1 is a finite matrix on V(G) × V(H) and the neighbor measures have finite support. The compactness of 𝒳 ensures dIDM0 is bounded, and hence, under the standing assumptions of the source unbalanced OT formulation, each dIDMt remains finite.

At depth L, the node-level costs dIDML serve as the ground metric in an unbalanced OT problem that transports mass between the node populations of G and H. Let αG and αH denote the node measures, taken for definiteness as uniform subprobabilities,
$$ \alpha_G \;=\; \frac{1}{|V(G)|}\sum_{i\in V(G)}\delta_i,\qquad \alpha_H \;=\; \frac{1}{|V(H)|}\sum_{j\in V(H)}\delta_j. $$
The depth-L DIDM Mover’s Distance is then
$$ \delta^{L}_{\mathrm{DIDM}}\big((G,f),(H,g)\big) \;=\; \mathrm{OT}_{d^{L}_{\mathrm{IDM}}}\big(\alpha_G,\alpha_H\big). $$
Thus δDIDML is a two-level hierarchy: (i) an inner recursion that compares nodes by transporting their normalized neighborhoods using previous-depth costs, and (ii) an outer transport that aggregates node-to-node costs into a graph-to-graph discrepancy. The terminology ``computation IDM/DIDM’’ reflects that the objects being compared at each stage are measures arising from the computation graph of normalized-sum message passing.

We require unbalanced OT because the measures μiG and μjH (and likewise αG, αH) may have different total masses, reflecting differing degrees and differing graph sizes. Given nonnegative vectors a ∈ ℝ+p, b ∈ ℝ+q and a ground cost matrix C ∈ ℝ+p × q, a standard unbalanced OT instance takes the form
$$ \mathrm{OT}_{C}(a,b) \;=\; \min_{\gamma\in\mathbb{R}_+^{p\times q}} \;\langle C,\gamma\rangle \;+\;\lambda\,D(\gamma\mathbf{1}\,\|\,a)\;+\;\lambda\,D(\gamma^{\top}\mathbf{1}\,\|\,b), $$
where D(⋅∥⋅) is a divergence (e.g. a Kullback–Leibler divergence) and λ > 0 controls the penalty for creating/destroying mass. In our notation, OTd(μ, ν) denotes the corresponding value when C is induced by a ground metric d on the supports of μ, ν. We will use only the two structural properties needed for approximation: (a) OT is monotone and Lipschitz in the ground costs under entrywise ∥ ⋅ ∥∞ perturbations, and (b) it is stable under mild perturbations of the marginals, both of which are standard in this setting.

To accelerate repeated OT evaluations, we replace OT by an entropic-regularized counterpart. For η > 0, we write UOTC, η(a, b) for the entropic unbalanced OT value obtained by adding a negative-entropy term to the primal objective,
$$ \mathrm{UOT}_{C,\eta}(a,b) \;=\; \min_{\gamma\ge 0}\;\langle C,\gamma\rangle \;+\;\lambda\,D(\gamma\mathbf{1}\,\|\,a)\;+\;\lambda\,D(\gamma^{\top}\mathbf{1}\,\|\,b)\;+\;\eta\sum_{u,v}\gamma_{uv}\big(\log\gamma_{uv}-1\big), $$
with the convention 0log 0 := 0. This regularization yields a strictly convex objective and admits fast scaling algorithms. In practice, we approximate UOTC, η(a, b) using T Sinkhorn iterations; we will explicitly track both the smoothing bias (controlled by η) and the finite-iteration error (controlled by T) when we propagate approximation guarantees through the recursion.

The remaining bottleneck is that, at depth t, one must evaluate a neighbor-to-neighbor OT term for each pair (i, j) ∈ V(G) × V(H), and the supports of μiG, μjH are subsets of V(G), V(H). To reduce the effective support size, we introduce at each layer t partitions (clusterings) ΠtG of V(G) and ΠtH of V(H) into k clusters. The associated coarsening operator 𝒞t maps a neighbor measure to a k-dimensional histogram by aggregating mass per cluster:
$$ h^{G}_t(i)[a] \;=\; \mu^{G}_i\big(\{u\in V(G):\Pi^{G}_t(u)=a\}\big), \qquad a\in\{1,\dots,k\}, $$
and similarly htH(j). Given a cross-graph cost matrix C approximating the previous-layer costs on V(G) × V(H), we form a compressed k × k ground cost matrix Mt by averaging C across cluster pairs,
$$ M_t[a,b] \;=\; \operatorname{Avg}_{\,u:\,\Pi^{G}_t(u)=a,\ v:\,\Pi^{H}_t(v)=b}\; C[u,v]. $$
The compressed OT subproblem UOTMt, η(htG(i), htH(j)) then serves as a proxy for transporting the original neighborhood measures. The quality of this proxy is governed by the within-cluster diameters (radii) of ΠtG, ΠtH as measured by the previous-layer costs, a fact we will exploit when we bound the coarsening error layer-by-layer.

These definitions fully specify the exact recursion for δDIDML and the two approximation mechanisms—entropic smoothing and multiscale histogram compression—that we will combine in the algorithmic construction and guarantees that follow.


3. Problem Formulation: define approximation goals (additive/multiplicative), parameter regimes (dense vs sparse), and evaluation tasks (stability auditing, retrieval, OOD).

Our objective is to approximate the depth-L DIDM Mover’s Distance δDIDML((G, f), (H, g)) between two attributed graphs under the recursive definition given in the previous section, while controlling both statistical distortion (approximation error relative to the exact, non-entropic, non-compressed objective) and computational cost. We emphasize that the target is a discrepancy between graphs, but that any algorithm which follows the recursion must implicitly approximate a hierarchy of intermediate node-to-node ground costs.

Given an accuracy parameter ε > 0, we seek an output
$$ \widehat\delta^{L}_{\mathrm{DIDM}}\big((G,f),(H,g)\big) $$
computed from (G, f), (H, g) and algorithmic parameters (in our case (k, η, T)) such that an end-to-end additive guarantee holds:
$$ \Big|\widehat\delta^{L}_{\mathrm{DIDM}}\big((G,f),(H,g)\big) \;-\; \delta^{L}_{\mathrm{DIDM}}\big((G,f),(H,g)\big)\Big| \;\le\; \varepsilon. $$
We formulate the goal additively for two reasons. First, δDIDML may be arbitrarily small when graphs are nearly isomorphic in attributes and neighborhoods, so multiplicative approximation cannot hold uniformly. Second, additive control aligns with the natural stability properties of OT objectives with respect to perturbations of ground costs and marginals (which we will leverage when propagating error through depth). When a multiplicative criterion is nevertheless desired (e.g. ranking-only applications), we may interpret our guarantee as a relative bound on instances satisfying a separation condition δDIDML ≥ τ > 0, in which case
$$ \frac{\big|\widehat\delta^{L}_{\mathrm{DIDM}}-\delta^{L}_{\mathrm{DIDM}}\big|}{\delta^{L}_{\mathrm{DIDM}}}\;\le\;\frac{\varepsilon}{\tau}. $$
This is a corollary rather than a primary objective, and it requires an application-specific lower bound τ.

Our approximation has three explicit knobs. The parameter k controls the support size of compressed neighborhood measures (histogram bin count), η > 0 controls entropic smoothing, and T is the number of iterations in the scaling (Sinkhorn-type) solver used to approximate entropic unbalanced OT values. The algorithm will instantiate these parameters either by direct user choice or by a rule depending on (ε, L) and coarse bounds on cost magnitudes. At a conceptual level, these parameters correspond to three error sources:
(coarsening error from k) + (entropic bias from η) + (optimization error from T).
The central formulation requirement is that these contributions be budgeted across depths so that the sum of per-layer distortions is at most ε. Our later analysis will enforce a layerwise allocation (e.g. each term at most ε/L) in order to obtain a clean end-to-end bound.

We distinguish two input-access regimes, since the bottlenecks depend strongly on how the graph is presented.

Dense regime: the input may include adjacency matrices (or otherwise permit O(1) adjacency queries) so that mG = Θ(nG^2) and mH = Θ(nH^2) in the worst case. In this regime the input size is already quadratic, and our computational goal is therefore near-quadratic time in the dominant size parameter N = max (nG, nH), up to factors depending on (L, k, T) and polylogarithms. Memory is also a limiting factor: any method that explicitly materializes a cross-graph cost matrix requires Θ(nGnH) space, which is acceptable for moderate N but prohibitive for large-scale retrieval unless block processing is used.

Sparse regime: the input is presented via adjacency lists with mG ≪ nG^2 and mH ≪ nH^2. Here, neighborhood aggregation can be streamed in O(mG + mH) time per depth, and one would like the overall runtime to scale nearly linearly in the number of edges. However, the definition of δDIDML couples all pairs of nodes across the two graphs through OT, so an nGnH interaction cost is inherent unless additional compression is introduced. Our formulation therefore permits two levels of ambition: (i) compute δ̂DIDML exactly with respect to the cross-graph pairwise structure (still requiring nGnH interactions), or (ii) compute a distance for auditing and retrieval using landmarking or low-rank sketches that avoid forming all nGnH pairs. The latter sacrifices exactness of the full cross-graph recursion but can preserve the qualitative behavior needed for screening tasks.

We frame the approximation problem in terms of three concrete downstream uses, each imposing slightly different requirements.

Stability auditing: given a graph-signal (G, f) and a perturbed version (G′, f′) (edge flips, rewiring, or attribute noise), we compute δ̂DIDML((G, f), (G′, f′)) to quantify sensitivity of the induced computation DIDMs. In this setting, an additive ε guarantee is operationally meaningful: if the measured discrepancy is below a tolerance ρ by a margin larger than ε, we may certify stability at scale ρ up to the approximation slack. Moreover, because auditing often requires many comparisons against perturbed instances, it benefits from regimes where the method can reuse intermediate structures (e.g. coarsenings) or process many targets in a batched manner.

Retrieval: given a database {(H_ℓ, g_ℓ)}_{ℓ=1}^{M} and a query (G, f), we seek to rank candidates by distance and return approximate nearest neighbors. Here, absolute accuracy is less important than the relative ordering among close candidates, but additive control remains useful: if the gap between the best and second-best matches exceeds 2ε, then an additive-ε approximation cannot change the identity of the minimizer. This task further motivates memory-aware computation: we prefer not to materialize all pairwise nGnH cost matrices simultaneously, and instead compute distances in blocks or via compressed summaries.

OOD detection: given a training collection 𝒯 of graph-signals, we define an OOD score for a test instance (G, f) by a statistic such as min(H, g) ∈ 𝒯 δDIDML((G, f), (H, g)) or a k-NN average. An additive approximation yields immediate control:
$$ \Big|\min_{(H,g)\in\mathcal{T}}\widehat\delta^{L}_{\mathrm{DIDM}}\big((G,f),(H,g)\big) \;-\; \min_{(H,g)\in\mathcal{T}}\delta^{L}_{\mathrm{DIDM}}\big((G,f),(H,g)\big)\Big| \;\le\; \varepsilon, $$
since the minimum of ε-accurate quantities is itself ε-accurate. Thus, threshold-based OOD rules can be made robust to approximation error by inflating thresholds by ε.

Given (G, f), (H, g) and parameters (L, ε), we seek an algorithm that outputs δ̂DIDML with additive error at most ε, while achieving runtime and memory that are competitive with the input size in the dense regime and with mG + mH (up to controlled interaction terms) in the sparse regime. The next sections will present (a) the exact baseline computation and why it is intractable beyond small graphs, and (b) an explicit compressed entropic recursion whose error and complexity can be quantified in terms of (k, η, T, L).


4. Baseline Exact Computation and Bottlenecks: restate the O(LN^5 log N) method; identify the N^2 OT subproblems per layer and their size as the core cost driver.

We first recall the direct, ``by-definition’’ evaluation of the depth-L DIDM Mover’s Distance δDIDML((G, f), (H, g)), and we make explicit why it is computationally prohibitive beyond small graphs. The key point is that the recursion defining the order-t IDM ground metric dIDMt induces, at each depth, a complete collection of unbalanced OT subproblems indexed by cross-graph node pairs. In the worst case, each such OT subproblem has support size Θ(N), where N = max (nG, nH), and therefore admits only cubic-time exact solvers up to polylogarithmic factors. Multiplying these two effects (Θ(N^2) subproblems per layer, each requiring on the order of N^3 time) yields an O(LN^5 log N) baseline.

It is convenient to express the exact computation in terms of a sequence of cross-graph cost matrices Ct ∈ ℝ+nG × nH, where Ct[i, j] is the exact value of the order-t IDM ground metric dIDMt(i, j) between node i ∈ V(G) and node j ∈ V(H). At depth 0, the ground costs depend only on node attributes:
C0[i, j] = ∥f(i) − g(j)∥2   (i ∈ V(G), j ∈ V(H)).
For t ≥ 1, the recursion defining dIDMt adds to C0 an OT discrepancy between the (normalized-sum) neighborhood measures induced by the previous layer. Concretely, writing μt, iG for the order-t neighborhood measure of node i in G and μt, jH for the analogous measure of node j in H, the exact update has the schematic form
Ct[i, j] = C0[i, j] + OTCt − 1 (μt, iG, μt, jH),
where OTCt − 1(⋅, ⋅) denotes the (non-entropic) unbalanced OT objective used in the source with ground cost matrix Ct − 1. Finally, the depth-L DIDM Mover’s Distance itself is obtained by an outer unbalanced OT over the node sets with ground cost CL:
δDIDML((G, f), (H, g)) = OTCL (unif(V(G)), unif(V(H))).
The above identities are not meant as alternative definitions, but rather as a bookkeeping device: an algorithm that follows the recursion must (implicitly or explicitly) be able to evaluate OTCt − 1(μt, iG, μt, jH) for all (i, j), because those values constitute the additive refinement from Ct − 1 to Ct.
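To make this bookkeeping concrete, the following sketch spells out the nested loops of the by-definition evaluation. The exact unbalanced OT solver is deliberately left as a placeholder callable (here named exact_uot), standing in for any min-cost-flow or LP-based routine; all names are illustrative.

```python
import numpy as np

def exact_didm_distance(C0, mu_G, mu_H, L, exact_uot):
    """By-definition evaluation of the depth-L recursion (illustration only).

    C0        : (nG, nH) attribute costs, C0[i, j] = ||f(i) - g(j)||_2
    mu_G      : (nG, nG) array; row i is the neighbor measure of node i in G
    mu_H      : (nH, nH) array; row j is the neighbor measure of node j in H
    exact_uot : placeholder callable (C, a, b) -> float for exact unbalanced OT
    """
    nG, nH = C0.shape
    C = C0.copy()                              # C_0
    for _ in range(L):
        D = np.empty_like(C)
        for i in range(nG):                    # nG * nH subproblems per layer,
            for j in range(nH):                # each of support size up to N
                D[i, j] = exact_uot(C, mu_G[i], mu_H[j])
        C = C0 + D                             # C_t = C_0 + neighbor-OT refinement
    # outer unbalanced OT between uniform node measures with ground cost C_L
    return exact_uot(C, np.full(nG, 1.0 / nG), np.full(nH, 1.0 / nH))
```

The two inner loops are exactly the nGnH OT calls per layer discussed next; in the dense regime each call is itself a transportation problem on supports of size Θ(N).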

At each depth t ∈ {1, …, L}, the recursion asks for one OT value per cross-graph node pair. Thus the number of OT calls is exactly
$$ n_G\,n_H \;=\; \Theta(N^{2}) $$
per layer (and an additional outer OT call at the end). This quadratic multiplicity is unavoidable for the literal evaluation of the matrix Ct: the updated ground costs Ct[i, j] are defined pointwise for each pair (i, j).

We emphasize that this nGnH factor is present regardless of sparsity of G and H. Sparsity affects the cost of the measures μt, iG, μt, jH (which can be streamed in O(mG + mH) time per layer), but it does not reduce the number of cross-graph pairs to be evaluated when one insists on computing the exact recursion on all pairs.

We next examine the size of a single OT subproblem OTCt − 1(μt, iG, μt, jH). Under normalized-sum message passing, each μt, iG is a (sub)probability measure supported on the neighbors of i (or on a neighborhood multiset with weights derived from degrees). In the worst case (e.g. dense graphs), a node has Θ(N) neighbors, so the support sizes of the two marginals are Θ(N), and the ground cost is given by the restriction of the full (nG × nH) matrix Ct − 1 to these supports. Even if one represents μt, iG and μt, jH as vectors over the full vertex sets (padding zeros outside the neighborhood), the induced transportation problem remains an unbalanced OT instance on a complete bipartite graph with Θ(N) nodes per side.

Exact (non-entropic) unbalanced OT with general costs is a min-cost flow / transportation-type problem. With support size s per side, a standard baseline is O(s^3 log s) time via cost-scaling methods (up to problem-dependent constants), and we may take s = Θ(N) in the worst case. Hence an exact OT evaluation costs O(N^3 log N) time in the dense regime.

For sparse graphs with maximum degree Δ ≪ N, the per-subproblem support size can be bounded by s = O(Δ), yielding O(Δ^3 log Δ) time per OT call. However, the multiplicity nGnH remains, so even in this favorable case the total work per layer scales as O(nGnH Δ^3 log Δ), which is still prohibitive when nGnH is large (e.g. in retrieval scenarios).

Combining the preceding two observations yields the stated baseline. In the worst case nG ≍ nH ≍ N and supports of neighborhood measures are Θ(N), so at each depth we perform N^2 OT problems at O(N^3 log N) time each:
$$ \text{time per layer} \;=\; O\big(N^{2}\cdot N^{3}\log N\big) \;=\; O\big(N^{5}\log N\big), \qquad \text{total time} \;=\; O\big(L\,N^{5}\log N\big). $$
The final outer OT on nG versus nH nodes adds another O(N^3 log N) term, which is negligible compared to LN^5.

In addition, a direct implementation requires storing (at least) the current cost matrix Ct − 1 ∈ ℝnG × nH in order to serve as the ground costs for all subproblems at depth t, implying Θ(nGnH) = Θ(N^2) memory. This is already at the limit for many large-scale applications, and it becomes more acute when one considers repeated evaluations (e.g. database search) where multiple such matrices might be required or where blockwise computation becomes necessary.

We summarize the obstruction as follows. The DIDM recursion couples node pairs across graphs at every depth, and each coupling requires solving an OT problem whose ground costs are themselves outputs of the previous depth. This ``OT-of-OT’’ nesting is conceptually natural but algorithmically expensive: the baseline cost driver is not the formation of neighborhood measures (which is near-linear in mG + mH), but the nGnH exact OT subproblems per layer, each of size up to N. This is precisely the computational pressure that motivates the two relaxations developed next: entropic regularization (to replace exact min-cost flow by scalable matrix-scaling iterations) and histogram coarsening (to reduce support size from Θ(N) to k).


5. Entropic Unbalanced OT as a Drop-in Replacement: define entropic UOT objective; show computability via generalized Sinkhorn; state bias bounds vs exact UOT.

The baseline obstruction identified above is the repeated appearance of unbalanced OT objectives whose supports can be large and whose ground costs vary across layers. Our first relaxation is to replace each (non-entropic) unbalanced OT call by an entropic-regularized unbalanced OT objective, which admits a scalable matrix-scaling solver and interacts well with the recursive structure.

Let M ∈ ℝ+p × q be a ground cost matrix and let a ∈ ℝ+p, b ∈ ℝ+q be nonnegative vectors with ∥a∥1, ∥b∥1 ≤ 1 (subprobability marginals, as induced by normalized-sum aggregation). We fix an unbalanced-penalty strength λ > 0 (as in the source formulation) and define the entropic unbalanced OT value by
$$ \mathrm{UOT}_{M,\eta}(a,b) \;:=\; \min_{\Gamma\in\mathbb{R}_+^{p\times q}} \;\Big\langle \Gamma,M\Big\rangle \;+\;\lambda\,\mathrm{KL}(\Gamma\mathbf{1}\,\|\,a) \;+\;\lambda\,\mathrm{KL}(\Gamma^\top\mathbf{1}\,\|\,b) \;+\;\eta\sum_{i=1}^p\sum_{j=1}^q \Gamma_{ij}\big(\log \Gamma_{ij}-1\big), $$
where 1 denotes the all-ones vector of the appropriate dimension and
$$ \mathrm{KL}(u\,\|\,v)\;:=\;\sum_{i} \Big(u_i\log\frac{u_i}{v_i}-u_i+v_i\Big), \qquad u,v\in\mathbb{R}_+^{r}, $$
with the standard conventions 0log 0 = 0 and KL(u ∥ v) = +∞ if u is not absolutely continuous with respect to v (componentwise). The objective is strictly convex in Γ for η > 0, hence the minimizer exists and is unique.

In the recursion, this replacement is literal: whenever the exact recursion requests OTCt − 1(μ, ν) we instead compute UOTCt − 1, η(μ, ν), and at the outermost level we compute UOTCL, η(unif(V(G)), unif(V(H))). We denote by δDIDML, η the resulting fully entropic variant (no coarsening yet); later we will additionally compress supports to obtain δ̂DIDML.

The computational advantage is that UOTM, η can be evaluated by a generalized Sinkhorn procedure. Define the Gibbs kernel
$$ K \;:=\; \exp(-M/\eta)\;\in\;\mathbb{R}_+^{p\times q}, $$
where the exponential is taken entrywise. One may show (via first-order optimality conditions) that the unique minimizer has the scaling form
$$ \Gamma \;=\; \mathrm{diag}(u)\,K\,\mathrm{diag}(v) $$
for some scaling vectors u ∈ ℝ+p, v ∈ ℝ+q. These scalings satisfy nonlinear fixed-point relations determined by the unbalanced penalty. A convenient parametrization uses the exponent
$$ \alpha\;:=\;\frac{\lambda}{\lambda+\eta}\in(0,1), $$
and the iterative updates
$$ u \;\leftarrow\; \big(a \oslash (Kv)\big)^{\alpha}, \qquad v \;\leftarrow\; \big(b \oslash (K^{\top}u)\big)^{\alpha}, $$
where ⊘ denotes componentwise division and the power is taken componentwise. For η > 0 and strictly positive initialization, these iterations preserve positivity. Under mild conditions (in particular, K strictly positive, which holds when M is finite), the iterates converge to the unique scaling pair (u, v), yielding Γ.

Each Sinkhorn step requires forming Kv and K⊤u, which costs O(pq) arithmetic operations. Consequently, for a k × k ground matrix (as will arise after coarsening), T Sinkhorn steps cost O(k^2 T) time and O(k^2) memory if K (or M) is materialized. In the uncompressed setting, one obtains the same O(s^2 T) dependence on the marginal support sizes s, replacing the O(s^3) baseline of exact min-cost flow.

Once Γ^(T) = diag(u^(T)) K diag(v^(T)) is obtained after T iterations, we evaluate the primal objective at Γ^(T) to obtain an approximate value UOT^(T)_{M, η}(a, b). In practice, when η is small one should work in the log-domain to avoid underflow in K = exp (−M/η) and to stabilize the scaling iterations; this does not alter the asymptotic operation count.
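For concreteness, a minimal plain-domain numpy implementation of these scaling iterations is sketched below; the function name and numerical guards are ours, and, as just noted, a log-domain variant should be preferred when η is small.

```python
import numpy as np

def sinkhorn_uot(M, a, b, lam, eta, T):
    """Approximate UOT_{M,eta}(a, b) with T generalized Sinkhorn iterations.

    M   : (p, q) ground cost;  a : (p,), b : (q,) nonnegative marginals
    lam : KL marginal-penalty strength;  eta : entropic regularization
    Returns (objective value at the T-th iterate, transport plan).
    """
    a, b = np.asarray(a, float), np.asarray(b, float)
    K = np.exp(-M / eta)                          # Gibbs kernel
    alpha = lam / (lam + eta)                     # exponent of the updates
    u, v = np.ones(len(a)), np.ones(len(b))
    tiny = 1e-300                                 # guard against division by zero
    for _ in range(T):
        u = (a / np.maximum(K @ v, tiny)) ** alpha
        v = (b / np.maximum(K.T @ u, tiny)) ** alpha
    G = u[:, None] * K * v[None, :]               # Gamma^(T) = diag(u) K diag(v)

    def kl(x, y):
        x, y = np.maximum(x, tiny), np.maximum(y, tiny)
        return float(np.sum(x * np.log(x / y) - x + y))

    ent = float(np.sum(G * (np.log(np.maximum(G, tiny)) - 1.0)))
    val = float(np.sum(G * M)) + lam * (kl(G.sum(1), a) + kl(G.sum(0), b)) + eta * ent
    return val, G
```

On a k × k ground cost this is the O(k^2 T) routine invoked for every cross-graph node pair in the compressed recursion developed later.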

Entropic regularization introduces a controlled bias. Let OTM(a, b) denote the corresponding non-entropic unbalanced OT value (same λ, no entropy term). Since the entropy term is nonnegative up to an additive constant depending on Γ1, one expects UOTM, η(a, b) ≤ OTM(a, b) and a gap that vanishes as η → 0. A standard quantitative bound (proved by comparing an optimal plan for OTM to a smoothed plan and using that the negative entropy is bounded by log (pq) on couplings of bounded mass) yields an estimate of the form
$$ 0 \;\le\; \mathrm{OT}_{M}(a,b) - \mathrm{UOT}_{M,\eta}(a,b) \;\le\; \eta\,\log(pq)\,\|\Gamma_{\mathrm{OT}}\|_{1} \;\le\; \eta\,\log(pq), $$
where ΓOT is an optimal coupling for the non-entropic problem and we used ∥ΓOT∥1 ≤ 1 for subprobability marginals. Thus, for fixed support size k we may regard the entropic bias as δent(η) = O(η log k) per OT call. This is precisely the bias term to which we will allocate an ε/L budget in the layerwise error propagation.

Replacing UOTM, η by T Sinkhorn steps introduces an additional error δsink(T) in objective value. For strictly positive kernels, generalized Sinkhorn is a contraction in an appropriate projective metric, and one obtains geometric convergence of the iterates; translating this into an objective gap yields a bound of the schematic form
$$ \big|\mathrm{UOT}^{(T)}_{M,\eta}(a,b) - \mathrm{UOT}_{M,\eta}(a,b)\big| \;\le\; \delta_{\mathrm{sink}}(T), \qquad \delta_{\mathrm{sink}}(T)\downarrow 0 \ \text{ as } T\to\infty, $$
with constants depending on η, λ, and dynamic range of M. For our purposes it suffices that, for fixed η > 0, choosing T polynomially large in log (1/ε) makes δsink(T) ≤ ε/L.

Entropic regularization replaces each exact transportation solve by T matrix-scaling iterations and yields (i) a runtime that is quadratic in the support size rather than cubic, and (ii) an explicit bias–accuracy knob η that can be tuned in the global ε budget. The remaining issue is that, without further compression, the neighborhood measures may still have large supports. We therefore next introduce a multiscale coarsening that reduces each neighborhood marginal to a k-bin histogram while preserving the induced OT costs up to a controlled additive error.


6. Multiscale Coarsening of Neighborhood Measures: define coarsening operator 𝒞t mapping node-wise neighbor measures to k-bin histograms; show how to build costs between bins from previous-layer costs; discuss choices (k-means on features, WL-style hashing, random projections).

At depth t ≥ 1, the exact recursion requires evaluating, for many node pairs (i, j) ∈ V(G) × V(H), an unbalanced OT cost between the neighbor measures of i and j with ground metric given by the previous-layer IDM costs. Even after introducing entropic regularization, the support sizes of these neighbor measures can be as large as deg (i) and deg (j), which in turn forces each local OT problem to scale with the (possibly large) degrees. We therefore introduce a coarsening operator 𝒞t which compresses each node-wise neighbor measure to a k-bin histogram, together with a corresponding k × k ground cost matrix between bins.

We work with normalized-sum aggregation, hence each node induces a subprobability distribution over its neighbors. Concretely, for i ∈ V(G) we define the (unnormalized) neighbor multiset measure
$$ \mu^{G}_{t,i}\;:=\;\frac{1}{|V(G)|}\sum_{u\in N_G(i)} w_{iu}\,\delta_{u}, $$
where wiu ≥ 0 are optional edge weights (take wiu ≡ 1 in the unweighted case). Thus ∥μt, iG∥1 ≤ 1 and similarly for μt, jH. The local OT term at depth t compares μt, iG to μt, jH using the ground costs dIDMt − 1(⋅, ⋅) (or their current approximation). Our coarsening replaces μt, iG and μt, jH by k-dimensional histograms while preserving these local OT values up to a controlled additive error.

Fix a layer t. We choose partitions (clusterings)
ΠtG : V(G) → [k],   ΠtH : V(H) → [k],
mapping each node to one of k bins. Given ΠtG, the coarsening operator 𝒞tG maps any measure μ on V(G) to a histogram in +k by summing mass within each bin:
(𝒞tG(μ))[a] := μ({u ∈ V(G) : ΠtG(u) = a}),   a ∈ [k].
In particular, the compressed neighbor representation of node i is
htG(i) := 𝒞tG(μt, iG) ∈ ℝ+k,   htH(j) := 𝒞tH(μt, jH) ∈ ℝ+k.
These histograms can be constructed in a single streaming pass over adjacency lists: for each edge (i, u) ∈ E(G) we increment htG(i)[ΠtG(u)] by wiu/|V(G)|. The time is O(mG) per layer (and analogously O(mH)), while the memory footprint is O((nG + nH)k) for storing all histograms.
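A minimal numpy sketch of this streaming pass is given below; the function name and the dense (n, k) output layout are illustrative.

```python
import numpy as np

def neighbor_histograms(n, k, edges, labels, weights=None):
    """Coarsen every node's neighbor measure into a k-bin histogram.

    n, k    : number of nodes and number of bins
    edges   : (m, 2) integer array with rows (i, u)
    labels  : (n,) bin assignment Pi_t, values in {0, ..., k-1}
    Returns H of shape (n, k) with H[i, a] = mu_i({u : Pi_t(u) = a}).
    """
    edges = np.asarray(edges)
    w = np.ones(len(edges)) if weights is None else np.asarray(weights, float)
    H = np.zeros((n, k))
    # one streaming pass: add w_iu / n to bin labels[u] of row i
    np.add.at(H, (edges[:, 0], labels[edges[:, 1]]), w / n)
    return H
```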

To compare htG(i) and htH(j) via OT, we must define a k × k ground cost matrix Mt that approximates the cost of moving mass between a bin of G and a bin of H under the previous-layer ground metric. Let Ĉt−1 ∈ ℝ+nG × nH denote the currently available approximation of the previous-layer cross-graph cost matrix (intended to approximate dIDMt − 1 on node pairs). For bins a, b ∈ [k], define the cluster sets
$$ (\Pi^{G}_t)^{-1}(a) \;:=\; \{u\in V(G): \Pi^{G}_t(u)=a\}, \qquad (\Pi^{H}_t)^{-1}(b) \;:=\; \{v\in V(H): \Pi^{H}_t(v)=b\}. $$
A simple and effective choice is the average linkage cost
$$ M_t[a,b]\;:=\;\frac{1}{|\Pi^{G^{-1}}_t(a)|\,|\Pi^{H^{-1}}_t(b)|} \sum_{u\in \Pi^{G^{-1}}_t(a)}\;\sum_{v\in \Pi^{H^{-1}}_t(b)} \widehat C_{t-1}[u,v], $$
with the convention that empty clusters are disallowed (or merged) so that denominators are nonzero. Alternatives include medoid linkage (choosing representatives ua ∈ (ΠtG)−1(a) and vb ∈ (ΠtH)−1(b) and setting Mt[a, b] = Ĉt−1[ua, vb]), or trimmed averages to reduce sensitivity to outliers. Any such construction is admissible provided it yields a uniform bound on within-bin variability (formalized below via a cluster radius).
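The average-linkage construction can be written compactly with one-hot indicator matrices, as in the following sketch (the function name is ours; empty bins are assumed to have been merged away beforehand).

```python
import numpy as np

def average_linkage_cost(C_hat, labels_G, labels_H, k):
    """Bin-to-bin cost M_t[a, b] = average of C_hat over the cluster pair (a, b).

    C_hat    : (nG, nH) current cross-graph cost matrix
    labels_G : (nG,) bin assignments of G's nodes, values in {0, ..., k-1}
    labels_H : (nH,) bin assignments of H's nodes
    """
    P = np.eye(k)[labels_G]                  # (nG, k) indicator of the G-partition
    Q = np.eye(k)[labels_H]                  # (nH, k) indicator of the H-partition
    block_sums = P.T @ C_hat @ Q             # S_t[a, b]: sum of C_hat over the block
    sizes = np.outer(P.sum(0), Q.sum(0))     # |cluster a| * |cluster b|
    return block_sums / sizes
```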

With Mt in hand, we replace each degree-scale local problem by a k-support problem:
$$ \mathrm{UOT}_{\widehat C_{t-1},\,\eta}\big(\mu^{G}_{t,i},\,\mu^{H}_{t,j}\big) \;\;\rightsquigarrow\;\; \mathrm{UOT}_{M_t,\,\eta}\big(h^{G}_t(i),\,h^{H}_t(j)\big). $$
Since k is fixed across all pairs (i, j), each evaluation costs O(k^2 T) arithmetic operations via generalized Sinkhorn, independent of the degrees. The only remaining question is how to choose the partitions ΠtG, ΠtH so that this replacement is accurate.

We aim to ensure that nodes grouped into the same bin are ``close’’ under the previous-layer IDM, in the sense that Ĉt−1[u, v] does not vary too much when u ranges within a bin of G and v ranges within a bin of H. One convenient sufficient condition is a bounded-radius hypothesis: there exists rt ≥ 0 such that for all a ∈ [k] and all u, u′ ∈ (ΠtG)−1(a) we have
$$ \sup_{v\in V(H)}\big|\widehat C_{t-1}[u,v]-\widehat C_{t-1}[u',v]\big| \;\le\; r_t, $$
and analogously with the roles of G and H swapped. Under such a condition, transporting mass within a bin changes costs by at most rt, and the induced error in replacing a measure by its histogram is controlled in OT. In practice, we enforce this criterion approximately by refining the partition as t increases (hence ``multiscale’’): early layers may use small k (coarse bins) because costs are dominated by attributes, while later layers may require larger k to keep rt within the allocated error budget. For simplicity we continue to write k, but one may take k = kt without altering the structure of the analysis.

We record three implementation patterns for constructing ΠtG, ΠtH from data available at depth t: (i) k-means clustering on per-node sketches of the current cost rows (or on cached node features); (ii) WL-style hashing of local neighborhoods, refined across depths; and (iii) random-projection bucketing, which assigns nodes to bins by quantizing one or a few random projections of their cost-row sketches. A minimal sketch of the third pattern is given after the next paragraph.

In all cases, the coarsening step is compatible with streaming edge access, and it isolates the expensive transportation computations to a fixed k × k scale. We now combine this coarsening with entropic UOT recursively to obtain the full compressed entropic DIDM distance and its stated runtime bounds.
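As one concrete instance of the third pattern, the following sketch buckets nodes by quantizing a single random projection of their cost rows (or of any per-node sketch); the function name is ours, and more elaborate variants (several projections, or k-means on the sketches) follow the same template.

```python
import numpy as np

def partition_by_projection(X, k, seed=0):
    """Assign each row of X (a node's cost-row sketch or feature vector) to one
    of k bins by quantile-bucketing a random one-dimensional projection."""
    rng = np.random.default_rng(seed)
    proj = X @ rng.standard_normal(X.shape[1])        # random 1-D projection
    cuts = np.quantile(proj, np.linspace(0, 1, k + 1)[1:-1])
    return np.searchsorted(cuts, proj)                # labels in {0, ..., k-1}
```

Quantile bucketing keeps the bins roughly balanced and nonempty whenever the number of nodes is at least k.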


7. Main Algorithm: compute δ̂DIDML via recursive compressed entropic OT; include implementation details for sparse adjacency and batching; discuss optional low-rank/landmark extension.

We now specify the procedure producing δ̂DIDML((G, f), (H, g)) by combining (i) the recursive DIDM/IDM construction, (ii) the coarsening operator 𝒞t from the previous section, and (iii) entropic-regularized unbalanced OT evaluated approximately by T generalized Sinkhorn iterations. The output is a single scalar, but the recursion naturally exposes an intermediate cross-graph cost matrix Ĉt ∈ ℝ+nG × nH at each depth t, which we exploit both for defining the next-layer partitions and for building the bin-to-bin costs.

We begin with the attribute-level ground costs
$$ \widehat C_0[i,j] \;:=\; \|f(i)-g(j)\|_2, \qquad (i,j)\in V(G)\times V(H). $$
This is the only step that directly inspects f, g; thereafter, attributes enter only through Ĉ0 being carried along additively. We set Ĉ ← Ĉ0 and maintain the invariant that Ĉ represents the current approximation to the recursive IDM ground costs at the current depth.

At depth t we compute an additive correction D̂t ∈ ℝ+nG × nH whose (i, j) entry approximates the local neighbor-measure OT term. Concretely, we execute the following steps.


(i) We choose clusterings ΠtG : V(G) → [k] and ΠtH : V(H) → [k] using any of the constructions described previously (cost-row sketches and k-means, WL-style hashing, or projection bucketing). The only algorithmic requirement is that the partitions can be computed from data available at depth t (typically from Ĉ and/or cached node features) and that empty bins are avoided (e.g., by merging into the nearest nonempty bin).


(ii) Given ΠtG, ΠtH, we form the bin-to-bin costs
$$ M_t[a,b]\;=\;\frac{1}{|\Pi^{G^{-1}}_t(a)|\,|\Pi^{H^{-1}}_t(b)|}\sum_{u\in\Pi^{G^{-1}}_t(a)}\sum_{v\in\Pi^{H^{-1}}_t(b)}\widehat C[u,v], $$
or an alternative linkage (e.g., medoids) if desired. Operationally, we compute Mt with a single pass over Ĉ by accumulating block sums
$$ S_t[a,b] \;:=\; \sum_{u:\,\Pi^{G}_t(u)=a}\ \ \sum_{v:\,\Pi^{H}_t(v)=b} \widehat C[u,v], $$
and then dividing by cluster sizes. This costs O(nGnH) time and O(k^2) auxiliary memory beyond storing Ĉ.


(iii) We compute histograms via streaming adjacency access:
for each edge (i, u) ∈ E(G) we add wiu/|V(G)| to coordinate ΠtG(u) of htG(i), and similarly for H. This requires O(mG + mH) time per layer and O((nG + nH)k) memory.


(iv) For each pair (i, j) ∈ V(G) × V(H) we define
$$ \widehat D_t[i,j] \;:=\; \mathrm{Sinkhorn\text{-}UOT}\big(M_t,\ h^{G}_t(i),\ h^{H}_t(j);\ \eta,\ T\big), $$
where Sinkhorn-UOT denotes the entropic unbalanced OT solver specified in Section~5. Since Mt is shared by all pairs at depth t, we may precompute the Gibbs kernel Kt := exp (−Mt/η) ∈ ℝ+k × k once, and reuse it throughout the nGnH subproblems.


(v) We update the running cross-graph ground costs by
$$ \widehat C \;\leftarrow\; \widehat C \;+\; \widehat D_t, $$
so that Ĉ carries the attribute term and all neighbor-OT corrections accumulated up to depth t.

After completing depths t = 1, …, L, we return the global distance as an entropic unbalanced OT between uniform node measures on V(G) and V(H) with ground cost Ĉ:
$$ \widehat\delta^{L}_{\mathrm{DIDM}}((G,f),(H,g)) \;:=\; \mathrm{Sinkhorn}\text{-}\mathrm{UOT}\big(\widehat C,\ \tfrac{1}{n_G}\mathbf{1}_{n_G},\ \tfrac{1}{n_H}\mathbf{1}_{n_H};\ \eta,\ T\big). $$
In applications where exact mass preservation is desired, one may take a nearly balanced regime (large mass-penalty parameters in the unbalanced formulation), but we keep the unbalanced setting as in the source definition.
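Putting the steps together, the following compact driver sketches the full recursion using the helper functions from the earlier sketches (partition_by_projection, average_linkage_cost, neighbor_histograms, sinkhorn_uot). These names are our illustrations rather than a fixed API, the inner double loop is left unbatched for clarity (a tensorized variant is sketched below), and we assume nG, nH ≥ k so the quantile-based partitions have nonempty bins.

```python
import numpy as np

# Uses partition_by_projection, average_linkage_cost, neighbor_histograms,
# and sinkhorn_uot from the earlier sketches.
def approx_didm_distance(C0, edges_G, edges_H, nG, nH, L, k, lam, eta, T, seed=0):
    """Compressed entropic recursion, steps (i)-(v), followed by the outer UOT."""
    C_hat = C0.copy()                                      # attribute-level costs
    for _ in range(L):
        # (i) layer partitions from the current cost rows / columns
        labels_G = partition_by_projection(C_hat, k, seed)
        labels_H = partition_by_projection(C_hat.T, k, seed)
        # (ii) k x k bin-to-bin costs by average linkage
        M = average_linkage_cost(C_hat, labels_G, labels_H, k)
        # (iii) coarsened neighbor histograms
        HG = neighbor_histograms(nG, k, edges_G, labels_G)
        HH = neighbor_histograms(nH, k, edges_H, labels_H)
        # (iv) one small entropic UOT per cross-graph pair
        D_hat = np.empty((nG, nH))
        for i in range(nG):
            for j in range(nH):
                D_hat[i, j], _ = sinkhorn_uot(M, HG[i], HH[j], lam, eta, T)
        # (v) accumulate the depth-t correction
        C_hat = C_hat + D_hat
    # outer entropic UOT between uniform node measures with ground cost C_hat
    val, _ = sinkhorn_uot(C_hat, np.full(nG, 1 / nG), np.full(nH, 1 / nH), lam, eta, T)
    return val
```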

The sparse structure of G and H affects only the histogram construction. We store adjacency lists in CSR/CSC form and build the k-dimensional histograms in one pass over edges; no per-node sorting is required. When k is moderate, we store histograms densely as contiguous arrays of length k per node to enable batched linear algebra in the Sinkhorn loops; when k is large but degrees are small, we may store histograms sparsely and densify only within batches.

The dominant term is the evaluation of D̂t[i, j] over nGnH pairs, each involving T iterations over k × k matrix–vector products. We therefore implement step (iv) in batches:
choose blocks I ⊆ V(G) and J ⊆ V(H) and compute {D̂t[i, j]}i ∈ I, j ∈ J jointly using tensorized operations. Since Kt is shared, each Sinkhorn iteration reduces to applying Kt and Kt⊤ to many vectors in parallel, which is well-suited to GPU/BLAS3 execution. Memory permits storing Ĉ as a dense matrix in the dense regime; otherwise, one may store and update Ĉ in blocks, writing blocks to disk or recomputing them as needed. The only subtlety is that forming Mt from Ĉ requires access to all entries; in block mode we perform a streaming pass over blocks to accumulate St[a, b], then a second streaming pass to compute D̂t and update Ĉ blockwise.
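The batched step (iv) can be written so that one generalized Sinkhorn loop advances a whole block of pairs at once against the shared kernel Kt. The following numpy sketch (names ours) processes P pairs whose histograms are stacked row-wise; a GPU implementation would replace the arrays by device tensors.

```python
import numpy as np

def sinkhorn_uot_batch(M, A, B, lam, eta, T):
    """Entropic UOT values for P pairs that share the k x k ground cost M.

    A : (P, k) left histograms (one row per pair);  B : (P, k) right histograms.
    Returns an array of P objective values evaluated at the T-th iterate.
    """
    A, B = np.asarray(A, float), np.asarray(B, float)
    K = np.exp(-M / eta)                               # shared Gibbs kernel
    alpha = lam / (lam + eta)
    tiny = 1e-300
    U, V = np.ones(A.shape), np.ones(B.shape)
    for _ in range(T):                                 # BLAS3-friendly: (P,k) @ (k,k)
        U = (A / np.maximum(V @ K.T, tiny)) ** alpha
        V = (B / np.maximum(U @ K, tiny)) ** alpha
    G = U[:, :, None] * K[None, :, :] * V[:, None, :]  # (P, k, k) transport plans

    def kl(X, Y):
        X, Y = np.maximum(X, tiny), np.maximum(Y, tiny)
        return np.sum(X * np.log(X / Y) - X + Y, axis=-1)

    ent = np.sum(G * (np.log(np.maximum(G, tiny)) - 1.0), axis=(1, 2))
    return (np.sum(G * M, axis=(1, 2))
            + lam * (kl(G.sum(2), A) + kl(G.sum(1), B))
            + eta * ent)
```

Materializing the (P, k, k) plans to evaluate the objective costs O(P k^2) memory per block, so the block size P should be chosen with the available memory in mind.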

When nGnH is prohibitively large (e.g., retrieval over many candidates), we may replace Ĉ by a low-rank surrogate built from r ≪ min {nG, nH} landmarks per graph. At each depth, we compute costs only between all nodes and the opposite graph’s landmarks (or between landmark sets), derive partitions and Mt from these sketches, and evaluate local UOT terms using the coarsened histograms together with landmark-induced features. The resulting quantity is not, in general, an additive-ε approximation of the full δDIDML without further structural assumptions; nevertheless it yields a computationally efficient proxy that preserves stability trends and is empirically adequate for auditing and candidate filtering. The theoretical development in the next section therefore treats the full pairwise version as the primary object, and regards landmark compression as an optional acceleration layer on top of it.


8. Theory: per-layer error propagation lemma; overall approximation theorem; runtime and memory bounds; (optional) topology/weak* approximation under smoothing.

We analyze the discrepancy between the exact depth-L distance δDIDML((G, f), (H, g)) and the output δ̂DIDML((G, f), (H, g)) produced by Algorithm~1. Let Ct⋆ ∈ ℝ+nG × nH denote the exact recursive IDM ground-cost matrix at order t (so C0⋆[i, j] = ∥f(i) − g(j)∥2 and $C_t^\star=C_0^\star+\sum_{s=1}^t D_s^\star$, where Ds⋆[i, j] is the exact unbalanced OT term between the order-(s − 1) neighbor measures of i and j with ground metric dIDMs − 1). Likewise, let Ĉt be the matrix maintained by Algorithm~1 after completing depth t, and D̂t its additive correction at depth t.

We work with the entrywise norm ∥A∥∞ := maxi, j|A[i, j]|. Fix a depth t ≥ 1. The update Ĉt = Ĉt − 1 + D̂t differs from Ct⋆ = Ct − 1⋆ + Dt⋆ for three logically distinct reasons: (a) the depth-t ground costs are taken from Ĉt − 1 (and further aggregated into the bin-to-bin matrix Mt) rather than from Ct − 1⋆; (b) the neighbor measures are replaced by their k-bin coarsenings; and (c) each local OT value is computed with entropic regularization and finitely many Sinkhorn iterations.

The following lemma packages these contributions into an additive bound suitable for induction.


Fix a pair (i, j). Write μi and νj for the exact neighbor measures at depth t (over V(G) and V(H), respectively), and μ̂i = 𝒞t(μi), ν̂j = 𝒞t(νj) for their coarsened k-bin representations. The exact local term has the form Dt⋆[i, j] = OTCt − 1⋆(μi, νj), whereas Algorithm~1 computes an entropic approximation based on the bin-to-bin cost Mt derived from Ĉt − 1 and on (μ̂i, ν̂j). By Theorem~1 (Lipschitz stability in the ground cost), replacing Ct − 1⋆ by Ĉt − 1 perturbs any OT value by at most ∥Ĉt − 1 − Ct − 1⋆∥∞ ≤ εt − 1 (after accounting for the aggregation into Mt, whose entries are averages and hence inherit the same perturbation bound). By Theorem~2, replacing (μi, νj) by (μ̂i, ν̂j) perturbs the OT value by at most δcoarse, t. Finally, by definition of δent, t(η) and δsink, t(T), the entropic relaxation and finite Sinkhorn iterations contribute additively. Summing these contributions gives |D̂t[i, j] − Dt⋆[i, j]| bounded by the right-hand side, and adding the common update Ĉt = Ĉt − 1 + D̂t yields the stated bound.

Iterating the per-layer lemma yields an explicit telescoping control of the error across depths. In particular, letting ε0 = 0 and defining εt recursively by the lemma’s right-hand side, we obtain $\|\widehat C_L-C_L^\star\|_\infty\le \sum_{t=1}^L\big(\delta_{\mathrm{coarse},t}+\delta_{\mathrm{ent},t}(\eta)+\delta_{\mathrm{sink},t}(T)\big)$. The final output δ̂DIDML is itself an entropic Sinkhorn-UOT value computed with ground cost ĈL, so we incorporate one additional entropic-and-iteration error term at the top level.


The coarsening requirement is met whenever the induced cluster radii rt (in the order-(t − 1) metric) satisfy rt ≲ ε/L, which can be enforced by increasing k or by refining partitions adaptively until a target within-bin dispersion criterion is met. The entropic bias δent, t(η) decreases to 0 as η → 0 (on bounded-cost instances), while the iteration error δsink, t(T) decreases with T for fixed η; thus we may first choose η = η(ε, L) small enough to drive the bias below ε/(3L), and then choose T = T(ε, η) large enough to drive the solver error below ε/(3L). The theorem is therefore an existence statement with an explicit additive budget across the L recursive layers.

We now account for the computational cost implied by Algorithm~1. At each depth t, histogram construction costs O(mG + mH) via a single streaming pass over adjacency lists. Computing the bin-to-bin costs Mt by block sums costs O(nGnH) time given access to Ĉt − 1, and the dominant term is the evaluation of nGnH local entropic UOT problems on a shared k × k ground cost. If each Sinkhorn-UOT call takes O(k^2 T) arithmetic operations, then the total time is
$$ O\big(L\,n_G n_H\,k^{2}\,T \;+\; L\,(m_G+m_H)\big), $$
and the working memory is O(nGnH + (nG + nH)k) to store Ĉ and the per-node histograms (with optional block mode reducing peak memory at the expense of additional streaming passes). In the dense regime and for constant (k, T, L), this is near-quadratic in N = max {nG, nH}, which is information-theoretically unavoidable up to polylogarithmic factors under the standard dense-input model (Theorem~5).

Finally, we record the conceptual consequence of fixing η > 0 and interpreting Algorithm~1 (with k large and T → ∞) as computing a smoothed distance δDIDML, η in which all OT subproblems are replaced by exact entropic unbalanced OT. On compact families of graph-signals with uniformly bounded per-layer costs, the entropic objective converges to the unregularized objective as η → 0, and the layerwise perturbation bound lifts this convergence through the recursion. Concretely, one obtains a distortion bound of the form |δDIDML, η − δDIDML| ≤ O(Lη) (Proposition~6), and δDIDML, η metrizes the same weak* topology on the corresponding DIDM space. In applications where statistical stability is prioritized over exactness, this observation justifies choosing η as a fixed smoothing scale and treating δDIDML, η as the primary target, with δ̂DIDML serving as its computational surrogate via finite T and moderate k.


9. Empirical Evaluation Plan: runtime scaling; correlation with MPNN output perturbations; retrieval and OOD detection; ablations on (k, η, T); comparisons to exact on small graphs and to other metrics on medium graphs.

We now specify an empirical program whose purpose is to validate, in controlled and application-relevant regimes, the two claims implicit in our theory: (i) the approximation–runtime tradeoffs predicted by the theorems of Section~8 (in particular Theorem~4) are observable at practical scales, and (ii) the resulting distance is a meaningful proxy for stability of message passing outputs and for downstream tasks such as retrieval and out-of-distribution (OOD) detection.

We will evaluate on a mixture of (a) synthetic graph-signals where we can control n, m, attribute dimension, and perturbations, and (b) standard attributed-graph benchmarks with moderate node counts. For synthetic data we consider: Erdős–Rényi graphs at varying densities (to separate sparse vs. dense behavior), stochastic block models (to introduce mesoscopic structure relevant to coarsening), and geometric random graphs (to align with Wasserstein-type behavior). For real data we include molecular graphs (small n, sparse m), ego-networks and citation graphs (medium n, sparse m), and (when feasible) dense similarity graphs derived from kNN constructions on point clouds. In all cases we normalize the neighborhood measures consistently with the normalized-sum convention in the distance definition.

We will measure wall-clock time and peak memory as functions of (nG, nH, mG, mH), depth L, and parameters (k, η, T). The primary diagnostic is a scaling plot in which we fix (k, η, T, L) and vary nG ≈ nH ≈ N, reporting the empirical exponent in the fit time ≈ c·N^α separately for sparse and dense graph families. In the dense-input model, we expect α ≈ 2 (up to polylog factors), consistent with Theorem~5; in the sparse regime, we expect the histogram construction term to scale as mG + mH while the cross-graph interaction remains governed by nGnH unless block mode or landmark compression is enabled. We will therefore report three configurations: (i) full materialization of Ĉt (baseline implementation of Algorithm~1), (ii) block processing of Ĉt to reduce peak memory, and (iii) optional landmark/Nyström-style compression as an explicit ``proxy mode’’ (not covered by the main guarantee but relevant for retrieval-scale deployments). For each configuration we will report throughput (pairs per second) for the nGnH local k × k Sinkhorn-UOT calls and isolate their contribution by profiling.

On instances with nG, nH ≤ 30–50, we will compute a high-accuracy reference for δDIDML by replacing each entropic UOT with a near-exact solver for the unregularized unbalanced OT (e.g., min-cost flow or a sufficiently small η with very large T) and by disabling coarsening (k = n with singleton bins). This provides an empirical upper bound on the gap
|δ̂DIDML − δDIDML|
as a function of (k, η, T) and L. We will report absolute error and relative error on instances where the reference distance is bounded away from 0. In addition, we will separately quantify the three error channels suggested by the per-layer lemma: (a) by comparing exact UOT vs. entropic UOT at fixed marginals and ground costs, (b) by tracking objective decrease with T, and (c) by comparing costs computed using true neighbor measures versus k-bin histograms under fixed partitions. This decomposition is intended to empirically corroborate the additive error budget in the overall approximation theorem.

We will perform a three-way sweep over k ∈ {8, 16, 32, 64, 128}, η on a logarithmic grid, and T ∈ {10, 20, 50, 100, 200}, at fixed L, reporting (i) runtime, (ii) peak memory, and (iii) approximation error relative to the small-graph reference where available. To connect to the theoretical control parameters, we will also report a measured within-bin dispersion statistic (an empirical surrogate for a radius rt) and evaluate whether larger k reduces this dispersion in a predictable manner. In settings without a reference distance, we will instead report self-consistency diagnostics (e.g., monotonicity of the computed value as T increases at fixed η, and stability of rankings as k increases). The outcome of this sweep will be a practical rule for setting (k, η, T) for a target latency or a target empirical error level.

A central motivation for DIDM-type distances is to quantify changes that are relevant to message passing. We will therefore test whether δ̂DIDML correlates with perturbations of a depth-L MPNN output. Concretely, fix an MPNN ΦL with normalized-sum aggregation, and for each base graph-signal (G, f) generate perturbed versions (G′, f′) by: (i) random edge deletions/additions at controlled rates, (ii) attribute noise f′(v) = f(v) + ξv with ∥ξv∥ controlled, and (iii) structured perturbations (e.g., rewiring within/between blocks in an SBM). For each perturbation we measure
$$ \Delta\Phi \;:=\; \|\Phi_L(G,f)-\Phi_L(G',f')\| \qquad\text{and}\qquad \Delta\delta \;:=\; \widehat\delta^{L}_{\mathrm{DIDM}}\big((G,f),(G',f')\big). $$
We will report Spearman and Kendall rank correlations between ΔΦ and Δδ across perturbation families, as well as calibration curves of ΔΦ conditional on bins of Δδ. As baselines we include simpler structural distances (e.g., degree distribution Wasserstein distance, spectral distances) and embedding distances (e.g., Euclidean distance between pooled node embeddings). The hypothesis is not that δ̂DIDML predicts ΔΦ exactly, but that it provides a monotone proxy aligned with the aggregation depth L.
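A minimal sketch of this protocol is given below. The perturbation sampler, the MPNN forward pass, and the distance routine are placeholders to be supplied by the experimental harness; only the rank-correlation bookkeeping is spelled out, using scipy's spearmanr and kendalltau.

```python
import numpy as np
from scipy.stats import kendalltau, spearmanr

def stability_correlation(base_graphs, perturb, mpnn_forward, didm_hat):
    """Rank correlation between MPNN output shifts and approximate DIDM distances.

    base_graphs  : list of graph-signals (G, f), in whatever representation the
                   three callables below expect
    perturb      : placeholder (G, f) -> (G', f') perturbation sampler
    mpnn_forward : placeholder (G, f) -> pooled output vector of a depth-L MPNN
    didm_hat     : placeholder ((G, f), (G', f')) -> approximate DIDM distance
    """
    d_phi, d_delta = [], []
    for Gf in base_graphs:
        Gf_pert = perturb(Gf)
        d_phi.append(np.linalg.norm(mpnn_forward(Gf) - mpnn_forward(Gf_pert)))
        d_delta.append(didm_hat(Gf, Gf_pert))
    rho, _ = spearmanr(d_phi, d_delta)
    tau, _ = kendalltau(d_phi, d_delta)
    return rho, tau
```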

For retrieval, we will use δ̂DIDML as a similarity score in kNN search: given a query graph-signal, rank a database by increasing distance and evaluate precision at fixed cutoffs and mean average precision under label-based relevance (e.g., molecular property classes, graph categories). For OOD detection, we consider a training set 𝒟in and test graphs drawn from a shifted distribution; we compute an OOD score
$$ s(x) := \min_{x'\in \mathcal{D}_{\mathrm{in}}}\widehat\delta^L_{\mathrm{DIDM}}(x,x') \quad\text{or}\quad s(x):=\frac{1}{q}\sum_{x'\in \mathrm{NN}_q(x)}\widehat\delta^L_{\mathrm{DIDM}}(x,x'), $$
and report ROC-AUC for distinguishing in- vs. out-of-distribution samples. We will compare against baselines built from (i) graph kernels (e.g., WL subtree kernels), (ii) optimal transport baselines at the graph level (e.g., FGW where feasible), and (iii) MPNN embedding-based Mahalanobis or kNN distances. In retrieval/OOD experiments we will emphasize block mode and (optionally) landmarks to ensure that memory does not scale as nGnH for large databases.

On medium-scale datasets where exact DIDM is infeasible, we will compare to metrics that are computationally standard: spectral distances, WL-based kernels, graphlet-based distances, and FGW/GW approximations with entropic regularization. We will report both task performance and computational cost, with the intent to position δ̂DIDML as a metric that is (a) explicitly depth-aligned with message passing, and (b) tunable through (k, η, T) to span a range of runtime–fidelity tradeoffs.


10. Discussion and Limitations: when coarsening fails; dependence on L; relation to aggregation choice; open problems (tight lower bounds, adaptive depth, transformer kernels).

We conclude by clarifying the regimes in which the proposed compressed entropic recursion is well behaved, and by isolating limitations that are intrinsic to the DIDM construction as well as those specific to our approximation strategy.

Our approximation replaces exact neighborhood measures by k-bin histograms induced by partitions ΠtG, ΠtH. Theorem~2 makes explicit that the coarsening error is controlled by a within-bin radius parameter rt in the relevant previous-layer metric. This is simultaneously a strength and a limitation: the guarantee is meaningful precisely when there exist partitions with small within-bin radius for the transport costs that matter. There are natural counter-regimes. First, if the previous-layer cross-graph cost matrix exhibits high intrinsic dimension (e.g., many near-orthogonal attribute directions and weak structural regularity), then any fixed k may force large rt, and the additive error term δcoarse, t may dominate. Second, if the graph contains many small, idiosyncratic neighborhoods (e.g., heavy-tailed degree with numerous unique local motifs), then coarse binning may blur rare but decisive mass, again inflating rt. Third, heterophilous graphs can produce neighbor measures whose support is effectively spread across many distant bins in the previous-layer metric; in such cases the coarsening step may systematically underestimate the cost of matching ``incompatible’’ neighborhoods unless k scales accordingly. These failure modes are not merely artifacts of our proof technique: any compression that replaces a distribution by a coarse quantization must pay, in the worst case, an error comparable to the quantization radius in the induced geometry.

A related practical limitation is that our pseudocode constructs Mt via averages over cross-cluster blocks of t − 1. This is a crude representative of inter-cluster geometry and may be unstable when within-cluster variability is large. One can mitigate this by storing additional per-block statistics (e.g., quantiles) or by refining Mt using a small set of sampled representatives, but the analysis would then require a correspondingly finer stability statement than Theorem~1 with respect to such surrogates.

Our approximation guarantee is additive and, in its simplest form, scales at most linearly with L through the telescoping recursion (Theorem~3). This linear dependence is sharp for a worst-case additive budget: each layer introduces an additional OT evaluation and hence an additional channel for entropic bias, Sinkhorn truncation error, and coarsening. Consequently, if one targets a fixed end-to-end tolerance ε while increasing L, then the per-layer tolerances must shrink on the order of ε/L, which in turn pushes (k, T) upward and η downward. From a computational standpoint, deeper distances are therefore unavoidably more expensive if one insists on uniform approximation quality across layers.

Independently of approximation, the discriminative signal captured by DIDM-type constructions may saturate for large L on graphs exhibiting mixing phenomena akin to oversmoothing in message passing. In such cases many neighborhoods become similar in distributional terms, and the incremental contribution of deeper layers to dIDMt may become small relative to numerical error. This suggests that, beyond a certain depth, the dominant concern is not only runtime but also statistical relevance: whether the additional layer meaningfully refines the geometry induced by the distance.

Our definitions, and hence our approximation program, are aligned with normalized-sum aggregation (the setting in which DIDMs naturally represent neighborhood measures as normalized subprobabilities). If one changes the aggregation rule, the induced measure structure can change qualitatively. For example, max aggregation is not linear in neighbor multisets and does not admit a direct representation as an OT between neighbor measures without additional lifting; degree-normalized variants modify the scaling of mass and can change the unbalancedness parameters implicit in the OT subproblems; attention-weighted aggregation introduces data-dependent reweightings that are not fixed by the graph alone. In such cases, a DIDM-style recursion may still be definable, but the stability constants and even the appropriate notion of marginal penalties in unbalanced OT may differ. Thus, while our method is conceptually modular (replace each OT call by entropic UOT; compress measures), the present guarantees should be viewed as specific to the normalized-sum convention and to costs that decompose as attribute discrepancy plus neighbor-transport discrepancy.

Theorem~5 only addresses an input-reading lower bound in the dense regime. It leaves open the more delicate question of whether, for fixed L, one can beat the apparent Θ(nGnH) cross-graph interaction cost in the dense setting for nontrivial accuracy guarantees. In particular, it would be valuable to establish conditional lower bounds (e.g., under SETH-type hypotheses) showing that even approximating δDIDML within additive ε requires near-quadratic time for worst-case attributed graphs, thereby justifying our focus on constant-factor improvements, coarsening, and entropic smoothing. Conversely, identifying structural assumptions (low doubling dimension of attributes, bounded treewidth, low-rank cost structure, or clusterability in the induced IDM metric) under which subquadratic algorithms are possible remains an attractive direction.

Our analysis allocates a uniform error budget ε/L per layer. This is conservative: empirically, some layers contribute little incremental mass transport, and spending computational effort uniformly can be wasteful. A principled adaptive-depth procedure would (i) estimate an upper bound on the remaining contribution of deeper layers to the final distance and (ii) terminate when this bound falls below a prescribed tolerance. Analogously, one could allocate (kt, ηt, Tt) per layer to balance coarsening and entropic bias against the magnitude of that layer’s contribution. Making such procedures rigorous appears to require sharper continuity properties of the recursion, including bounds that depend on observable quantities (e.g., measured within-bin dispersions and Sinkhorn residuals) rather than worst-case radii.

Finally, there is a conceptual connection between DIDM recursions and attention mechanisms: both compare neighborhoods via data-dependent similarity scores. One could replace the OT-based neighborhood comparison by a kernel induced by a transformer block (e.g., softmax attention over learned pairwise costs) and then iterate across layers. The resulting object would no longer be an OT distance, and the metric properties would generally fail, but one might gain expressive power and amortized computation via learned low-rank structure. Understanding whether one can retain stability guarantees reminiscent of Theorem~1 for such ``transformer kernels,’’ and whether they admit coarsening schemes analogous to 𝒞t with provable error control, constitutes a concrete open problem at the boundary of optimal transport, graph representation, and modern sequence models.

In summary, our theory delineates a tractable approximation envelope for DIDM distances through entropic smoothing and multiscale coarsening, but it also exposes the central bottlenecks: the geometric feasibility of quantization in the induced IDM metric, the inevitable accumulation with depth, and the sensitivity to the aggregation semantics that define the underlying neighborhood measures.