Multi-Relational DIDM Mover’s Distance: Compact, Stable, and Universal Theory for Edge-Attributed GNNs

Table of Contents

  1. Introduction: motivation from edge-rich and multi-relational graphs (KGs, molecules, temporal interaction graphs); limitations of existing compact/separating metrics; summary of contributions.
  2. Preliminaries: measures/weak* topology, (unbalanced) optimal transport; graphon-signals; recap of IDM/DIDM idea and why compactness+separation matter (universality + generalization).
  3. Multi-Relational Directed Edge-Attributed Graphon-Signals: define (W, E, f), metric identification under measure-preserving relabelings; embedding finite graphs as step multi-graphons; discussion of dense-regime scope and relation to practice.
  4. Enriched IDM/DIDM Hierarchy: define Ht, computation IDMs γt, DIDMs ΓL; define recursive ground metric dMR-IDML with relation/edge costs; define δMR-DIDML.
  5. Compactness and Topology: prove Ht and 𝒫(Ht) compact; prove 𝒲r, sd, q, R/∼ compact under δMR-DIDML; relate to cut distance/coarser topologies where appropriate.
  6. Relation-Aware Edge-Conditioned MPNNs: define graph/graphon MPNNs with per-relation kernels and edge-conditioned messages; show factorization through IDMs/DIDMs (equivalence lemma).
  7. Stability: Lipschitz Bounds: inductive Lipschitz proof on IDM space; propagate to DIDMs via OT; explicit dependence on depth, Lipschitz constants, relation count, and edge-attribute bounds.
  8. Fine-Grained Expressivity and Universality: separation theorem (topological converse) and Stone–Weierstrass density on compact domains; discuss what class of continuous functionals is captured.
  9. Algorithmics: Polynomial-Time Computation on Finite Graphs: layered OT recursion; block-structured cost matrices by relation channels; complexity bounds; practical accelerations (optional entropic OT).
  10. Generalization: Covering-Number Bounds: apply uniform Monte Carlo estimation over compact metric domain; discuss dependence on covering numbers and what can/cannot be made explicit.
  11. Empirical Plan (Optional but Strengthening): correlation of δ with output perturbations; KG and molecule benchmarks; ablations on relation mismatch and edge-feature influence.
  12. Limitations and Future Work: sparse regime, attention/transformers, quantitative approximation rates, explicit metric entropy.

Content

1. Introduction: motivation from edge-rich and multi-relational graphs (KGs, molecules, temporal interaction graphs); limitations of existing compact/separating metrics; summary of contributions.

A substantial fraction of contemporary graph learning problems is driven by edge-rich and multi-relational data. Knowledge graphs represent heterogeneous semantics by relation labels and often attach qualifiers or timestamps to edges; molecular graphs carry bond types together with geometric or energetic attributes; temporal and interaction networks record directed events with weights, times, and contexts. In such settings, the inductive bias of message passing is not merely ``aggregate over neighbors’’, but rather ``aggregate over relation- and attribute-conditioned neighborhoods’’, and the analytic objects that govern approximation, stability, and generalization must faithfully encode this additional structure.

We are thus led to a metric-design question: can we endow a space of multi-relational, possibly directed, edge-attributed graph objects with a distance that is simultaneously (i) compatible with the invariances of interest (relabeling of nodes, graphon-type identifications), (ii) compact, so that uniform approximation and distribution-free learning statements become accessible, (iii) fine enough to separate non-isomorphic signals at the appropriate limit level, and (iv) computable on finite graphs in time polynomial in the number of nodes? While there is a large literature on graph distances, metrics that satisfy all four desiderata for relation-aware, edge-attributed graphs are not readily available.

To place the difficulty in context, we recall that many classical distances either disregard rich edge information or are not suited to compactness/separation arguments. Graph edit distances and maximum common subgraph variants are combinatorial and typically intractable; kernel methods (e.g. Weisfeiler–Lehman type constructions) are efficient but depend on fixed feature maps and do not directly yield a compact metric structure on an ambient signal space; spectral or Laplacian-based distances can be unstable under perturbations of the graph and are not, in general, separating modulo node relabeling. Optimal-transport-based notions such as Gromov–Wasserstein are attractive as they incorporate invariances, but their standard forms are not tailored to labeled relations and edge attributes with an explicit recursion matching message passing depth; moreover, separation properties at the level of graphon-signals and compactness of the induced quotient spaces are delicate. Finally, metrics arising from cut norms and related graphon distances yield compactness for unlabeled dense graphs, yet they are not designed to track the attribute- and relation-conditioned aggregations that drive modern edge-conditioned MPNNs.

Our approach is to build the geometry directly from the computation performed by relation-aware, edge-conditioned message passing. At a high level, we attach to each ``node’’ (in a finite graph, or a point in the unit interval representation) a recursively enriched object that records its feature together with a measure-valued summary of its typed and attributed neighborhood. This yields an iterated degree measure (IDM) hierarchy, and by pushing forward the base measure we obtain a distribution of iterated degree measures (DIDM) that represents the entire graph(-signal) as a probability measure on a compact state space. Distances are then defined by (unbalanced) optimal transport between these DIDMs, using a ground cost that itself is defined recursively and incorporates (a) the previous-layer IDM distance between neighbor states, (b) a penalty for relation mismatch, and (c) a discrepancy term for edge attributes. In this way, the metric mirrors the inductive structure of message passing: at each depth, node representations are functions of distributions of neighbor representations enriched with relation labels and edge attributes.

This construction yields a single framework that simultaneously addresses the analytic and algorithmic requirements outlined above. The central point is that, by working with measures on compact spaces and employing weak* topologies, we obtain compactness by design, while the use of optimal transport yields a metrization that is sufficiently strong to control Lipschitz message passing operations. In particular, the distance is not an auxiliary tool: it is the natural ``mover’s distance’’ induced by the computation hierarchy. The resulting metric is also explicitly relation-aware, accommodates directed kernels, and supports edge attributes in a bounded Euclidean domain without sacrificing measurability or compactness.

The contributions may be summarized as follows. First, we define the enriched IDM/DIDM recursion for multi-relational directed edge-attributed graphon-signals and show that, for each fixed depth, the induced IDM spaces and DIDM spaces are compact metrizable. Second, we define the corresponding mover’s distance δMR-DIDML as an optimal-transport distance between DIDMs with a recursively defined ground metric, and we show that this distance metrizes the weak* topology on the relevant measure spaces. Third, we prove a factorization statement: relation-aware edge-conditioned normalized-sum MPNNs at depth t depend on the input only through the order-t IDMs, and graph-level readouts depend only through the order-L DIDM. Fourth, we establish Lipschitz stability: any such MPNN with Lipschitz message/update/readout maps is Lipschitz with respect to δMR-DIDML, yielding quantitative robustness to perturbations measured in our geometry. Fifth, we prove a converse-type separation statement: up to arbitrarily small tolerance, if Lipschitz-bounded depth-L MPNNs fail to distinguish two DIDMs, then the DIDMs must be close in the OT sense; this is the key input for Stone–Weierstrass-type arguments. Sixth, we obtain universality: scalar-valued relation-aware edge-conditioned MPNN readouts form a uniformly dense family in the continuous functions on the compact metric-identified domain. Finally, we provide a polynomial-time exact algorithm for finite graphs at fixed depth: the distance can be computed via a layered recursion consisting of N² optimal transport subproblems per layer and one final transport solve, each reducible to min-cost flow, leading to worst-case time O(LN⁵ log N) for graphs with at most N nodes (for fixed relation count).

Two methodological aspects are worth emphasizing. We use unbalanced optimal transport at intermediate stages to accommodate the fact that neighborhood measures may have total mass less than one (e.g. due to normalization conventions or sparsity), which avoids forcing artificial mass-matching and improves stability under degree fluctuations. Moreover, relation labels are incorporated as elements of a discrete metric space, so that relation mismatch contributes additively to the ground cost in a way that is both transparent and compatible with the recursion. These choices yield a metric that is expressive enough to reflect typed, attributed neighborhoods, while remaining analytically tractable and algorithmically reducible to standard transportation problems.

In sum, we provide a compact, separating, relation-aware metric geometry for edge-attributed graphon-signals and their finite counterparts, tailored to the computational structure of edge-conditioned MPNNs. This geometry supports a unified theory in which approximation (universality), stability (Lipschitz bounds), and learning (distribution-free generalization via covering arguments) follow from the same compactness-and-separation backbone, while exact computation on finite graphs remains polynomial-time for fixed depth.


2. Preliminaries: measures/weak* topology, (unbalanced) optimal transport; graphon-signals; recap of IDM/DIDM idea and why compactness+separation matter (universality + generalization).

Let (X, d) be a compact metric space. We write M ≤ 1(X) for the set of finite Borel measures on X with total mass at most one, and 𝒫(X) for Borel probability measures. We will repeatedly use the duality between measures and continuous test functions: for μ ∈ M ≤ 1(X) and φ ∈ C(X, ℝ) we write ⟨μ, φ⟩ := ∫Xφdμ. The weak* topology on M ≤ 1(X) is the coarsest topology such that μ ↦ ⟨μ, φ⟩ is continuous for every φ ∈ C(X, ℝ); equivalently, μn → μ weak* if and only if ∫φdμn → ∫φdμ for all continuous φ. Since X is compact, C(X, ℝ) is a Banach space under the sup norm, and M ≤ 1(X) is naturally identified with a weak*-closed subset of the dual of C(X, ℝ) (Riesz–Markov). In particular, M ≤ 1(X) is weak* compact, and so is 𝒫(X) ⊂ M ≤ 1(X).

For measurable maps T : X → Y between measurable spaces, we denote by T*μ the push-forward measure on Y, defined by (T*μ)(A) := μ(T−1(A)). When X = [0, 1] equipped with Lebesgue measure λ, push-forward along node-level maps will be our basic mechanism for encoding graph(-signal) objects as measures on a state space.

Let (X, d) be a compact metric space. In the balanced setting, for μ, ν ∈ 𝒫(X) one defines the 1-Wasserstein distance
W1(μ, ν) := infπ ∈ Π(μ, ν) ∫X × X d(x, y) dπ(x, y),
where Π(μ, ν) is the set of couplings with marginals μ and ν. We recall that on compact X, W1 metrizes weak* convergence on 𝒫(X): μn → μ weak* if and only if W1(μn, μ) → 0. We also use the corresponding dual form
W1(μ, ν) = sup {∫Xφdμ − ∫Xφdν : φ ∈ Lip1(X)},
which makes explicit why optimal transport is the correct distance for controlling integrals of Lipschitz test functions.

In our recursion, intermediate objects are neighborhood measures whose total mass may be strictly smaller than one. This occurs already in finite graphs with normalized adjacency (e.g. degree normalization) or in sparse regimes where neighborhood measures record only the realized edges. We therefore allow transport on M ≤ 1(X), which permits partial matching of mass together with a penalty for creation/destruction. Concretely, we assume the existence of a functional
OTd : M ≤ 1(X) × M ≤ 1(X) → [0, ∞)
with the following properties (all standard in unbalanced OT constructions on compact spaces): (i) OTd(μ, ν) = 0 if and only if μ = ν; (ii) OTd is jointly lower semicontinuous for weak* convergence; (iii) OTd metrizes weak* convergence on mass-bounded sets; and (iv) OTd satisfies a Lipschitz test-function bound analogous to the balanced case, namely that differences of integrals of 1-Lipschitz functions are controlled by OTd. One may keep in mind a representative formulation via couplings π ∈ M ≤ 1(X × X) with relaxed marginal constraints and a divergence penalty on the mismatch, but the precise choice is not essential for the structural arguments: what matters is that the ground cost is induced by a metric on X and that the resulting transport distance is compatible with weak* compactness.
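For concreteness, one such functional on finitely supported measures can be realized by the classical reservoir reduction to a balanced transportation linear program. The sketch below is a minimal illustration under that assumption (the name ot_unbalanced, the scipy-based solve, and the single per-side reservoir atom are our choices for exposition, not a construction fixed by the text):

```python
import numpy as np
from scipy.optimize import linprog

def ot_unbalanced(mu, nu, C, kappa):
    """One representative unbalanced-OT functional on finitely supported
    measures: pad each side with a 'reservoir' atom that creates/destroys
    mass at cost kappa per unit, then solve the balanced transportation LP.
    (Illustrative choice only; the theory fixes properties, not a formula.)"""
    m, n = len(mu), len(nu)
    Cp = np.full((m + 1, n + 1), float(kappa))
    Cp[:m, :n] = C
    Cp[m, n] = 0.0                        # reservoir-to-reservoir is free
    a = np.append(mu, np.sum(nu))         # padded marginals are balanced:
    b = np.append(nu, np.sum(mu))         # sum(a) == sum(b) by construction
    nv = (m + 1) * (n + 1)                # one LP variable per coupling entry
    A_eq, b_eq = [], []
    for i in range(m + 1):                # row-marginal equality constraints
        row = np.zeros(nv); row[i * (n + 1):(i + 1) * (n + 1)] = 1.0
        A_eq.append(row); b_eq.append(a[i])
    for j in range(n + 1):                # column-marginal equality constraints
        col = np.zeros(nv); col[j::n + 1] = 1.0
        A_eq.append(col); b_eq.append(b[j])
    res = linprog(Cp.ravel(), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, None), method="highs")
    return res.fun
```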

Our ambient objects will be dense-limit graph(-signal) representations. We recall the usual analytic model: a (possibly directed) graphon is a measurable kernel W : [0, 1]2 → [0, 1], interpreted as an edge-probability (or weighted adjacency) function. To encode multi-relational structure, we consider a finite relation set [R] and a tuple W = (Wr)r ∈ [R] of such kernels. To incorporate edge attributes, we attach to each relation an attribute kernel Er : [0, 1]2 → Bsq, where Bsq is the closed Euclidean ball in ℝq of radius s. Finally, node features are modeled by a measurable signal f : [0, 1] → Brd. We write (W, E, f) for the resulting multi-relational directed edge-attributed graphon-signal. The compactness of the feature and attribute ranges will be used repeatedly: it bounds all local states produced by our recursion and ensures that weak* compactness applies at every measure-valued layer.

We briefly recall the organizing principle behind the IDM/DIDM construction. Message passing at a node depends on its own current state together with a collection (in the graphon setting, a distribution) of neighbor states enriched by relation labels and edge attributes. This suggests defining, for each depth t, a state space Ht whose elements consist of a base coordinate (the current node state) and a measure coordinate recording a typed, attributed neighborhood distribution over the previous state space. At depth 0 we take H0 := Brd. Given Ht − 1, we form an enriched neighborhood domain
Zt − 1 := Ht − 1 × ℛ × Bsq,   ℛ = [R] with d(r, r′) = 1{r ≠ r′},
and we set Ht := Ht − 1 × M ≤ 1(Zt − 1), endowed with the product topology where M ≤ 1(Zt − 1) carries the weak* topology. For a given graphon-signal (W, E, f), the computation map γ(W, E, f), t : [0, 1] → Ht assigns to each point x ∈ [0, 1] its order-t enriched state: the first coordinate is γt − 1(x), while the second coordinate is the push-forward of the neighborhood sampling measure on [0, 1] (weighted by the kernels Wr(x, ⋅)) through the map y ↦ (γt − 1(y), r, Er(x, y)), aggregated over relations. The graph-level object is then the DIDM
$$ \Gamma_{(\mathbf{W},\mathbf{E},f),t}\;:=\;(\gamma_{(\mathbf{W},\mathbf{E},f),t})_{*}\lambda\ \in\ \mathcal{P}(H_t), $$
which records the distribution of local computation states under the base measure.

Two structural requirements drive the subsequent theory. First, compactness: since H0 is compact and M ≤ 1(⋅) over a compact space is weak* compact, the recursion produces compact Ht by induction, and hence compact DIDM spaces 𝒫(Ht). This is the backbone for uniform approximation statements, because continuous functions on compact domains are precisely the setting in which Stone–Weierstrass yields uniform density once we have point separation by a function class.

Second, separation and stability: we will equip each Ht with a recursively defined metric dMR-IDMt whose measure coordinate is compared using OTdMR-IDMt − 1 on Zt − 1, augmented by explicit penalties for relation mismatch and edge-attribute discrepancy. The corresponding mover’s distance on DIDMs is then defined by optimal transport between Γ’s using dMR-IDML as ground cost. Because this geometry is built from transport on compact spaces, it both metrizes weak* and controls Lipschitz test functions; consequently it is well suited to proving that Lipschitz message passing maps are stable (small perturbations in δMR-DIDML imply small changes in outputs). Moreover, compactness yields finite covering numbers, which are the key input for distribution-free generalization bounds for Lipschitz hypothesis classes via uniform concentration over metric nets.

We next specialize these preliminaries to the concrete multi-relational directed edge-attributed graphon-signal model, including the relevant identifications under relabelings and the embedding of finite graphs as step objects.


3. Multi-Relational Directed Edge-Attributed Graphon-Signals: define (W, E, f), metric identification under measure-preserving relabelings; embedding finite graphs as step multi-graphons; discussion of dense-regime scope and relation to practice.

Fix a finite relation set [R], a node-feature ball Brd ⊂ ℝd, and an edge-attribute ball Bsq ⊂ ℝq. A multi-relational directed edge-attributed graphon-signal is a triple
(W, E, f) ∈ 𝒲r, sd, q, R,
where W = (Wρ)ρ ∈ [R] is a tuple of measurable kernels Wρ : [0, 1]2 → [0, 1], E = (Eρ)ρ ∈ [R] is a tuple of measurable kernels Eρ : [0, 1]2 → Bsq, and f : [0, 1] → Brd is a measurable node signal. We interpret Wρ(x, y) as the (possibly directed) weight of the relation-ρ edge from x to y, and Eρ(x, y) as the corresponding q-dimensional edge attribute. We do not impose symmetry: in general Wρ(x, y) ≠ Wρ(y, x) and likewise for Eρ, so the model covers directed graphs. In the multi-relational setting, relations are not mutually exclusive: different Wρ may place positive mass on the same pair (x, y), encoding multiplex structure.

Edge attributes should be read through the edge weight: the attribute kernel Eρ(x, y) is always defined as a measurable map into a compact set, but its contribution to any neighborhood aggregate will be weighted by Wρ(x, y). In particular, values of Eρ on pairs where Wρ(x, y) = 0 are immaterial for the induced computation measures we construct later, and hence for the geometric comparisons built from those measures. This convention allows us to keep the state space compact (by forcing Eρ to take values in Bsq everywhere) without constraining the model on null edge sets.

As usual in graphon theory, the analytic representation is not unique: relabeling the latent space [0, 1] by a measure-preserving transformation yields an equivalent object. Let σ : [0, 1] → [0, 1] be a measure-preserving bijection (modulo null sets). We define the relabeled graphon-signal (Wσ, Eσ, fσ) by
Wρσ(x, y) := Wρ(σ(x), σ(y)),   Eρσ(x, y) := Eρ(σ(x), σ(y)),   fσ(x) := f(σ(x)).
We then write
(W, E, f) ∼ (V, F, g)
if there exists such a σ with (V, F, g) = (Wσ, Eσ, fσ) almost everywhere. This is the natural invariance for any node-wise computation whose base distribution on nodes is Lebesgue measure: if we push forward λ along the relabeling, the distribution of local neighborhoods is unchanged. All of our subsequent constructions (IDMs, DIDMs, and the transport distances derived from them) are designed to be invariant under ∼, and the final metric space is the quotient 𝒲r, sd, q, R/∼ obtained by metric identification at distance zero.

We now record the canonical embedding of a finite multi-relational directed edge-attributed graph with node features into 𝒲r, sd, q, R. Let
G = (V, {Eρ}ρ ∈ [R], fG, {eρ}ρ ∈ [R])
be a finite graph with |V| = n, where fG : V → Brd and each relation ρ has an adjacency weight function Aρ : V × V → [0, 1] (with Aρ(i, j) = 0 meaning no ρ-edge) and an edge attribute map eρ : V × V → Bsq (defined on all ordered pairs, with arbitrary values allowed when Aρ(i, j) = 0). Choose an ordering V = {1, …, n} and partition [0, 1] into intervals $I_i:=[\frac{i-1}{n},\frac{i}{n})$ for i = 1, …, n (with the last interval closed at 1). Define the step kernels and signal by, for x ∈ Ii and y ∈ Ij,
WρG(x, y) := Aρ(i, j),   EρG(x, y) := eρ(i, j),   fG(x) := fG(i).
Then (WG, EG, fG) ∈ 𝒲r, sd, q, R is a step multi-graphon-signal representing G. Different orderings of V lead to relabelings of [0, 1] by measure-preserving permutations of the partition, hence to ∼-equivalent objects. This embedding is the bridge between the analytic definitions and the finite-graph computations: the IDM/DIDM maps defined later reduce on step objects to explicit operations on finite neighborhoods with the same normalization induced by Lebesgue measure on the intervals Ii.
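For illustration, a minimal sketch of the step embedding (names hypothetical): evaluating the induced kernel at latent coordinates amounts to an interval lookup.

```python
import numpy as np

def step_kernel(A):
    """Step multi-graphon W_rho^G induced by a weighted adjacency matrix A
    (values in [0, 1]): a point x in I_i = [(i-1)/n, i/n) is mapped to row i.
    Intervals are 0-indexed below, with the last interval closed at 1."""
    n = A.shape[0]
    def W(x, y):
        i = min(int(n * x), n - 1)   # index of the interval containing x
        j = min(int(n * y), n - 1)
        return A[i, j]
    return W

# Example: a directed 3-cycle under a single relation.
A = np.array([[0, 1, 0], [0, 0, 1], [1, 0, 0]], dtype=float)
W = step_kernel(A)
assert W(0.1, 0.5) == 1.0   # x in the 1st interval, y in the 2nd: edge present
```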

The step embedding above is compatible with the convention that node neighborhoods are recorded as measures with total mass at most one. Indeed, in the graphon setting, neighborhood aggregation at a point x uses the kernel values Wρ(x, ⋅) as densities against λ(dy). For step objects, this corresponds to distributing mass $\frac{1}{n}A_\rho(i,j)$ to each neighbor j under relation ρ. In particular, even when Aρ(i, j) ∈ {0, 1}, the total outgoing mass of a node equals its (weighted) out-degree divided by n, which may be strictly smaller than one. This is precisely why we work with M ≤ 1(⋅) and allow unbalanced transport in subsequent layers.

The representation (W, E, f) is the dense-limit model: Wρ is a bounded kernel on [0, 1]2 rather than a σ-finite object as in graphex/sparse graph limit theories. Consequently, the most direct statistical interpretation is as a limit of dense graphs (or dense weighted graphs) with multiple edge types and attributes. This choice is deliberate: it yields compact state spaces once node and edge features are bounded, and compactness is the structural condition enabling weak* metrizability, separation results, and uniform approximation via Stone–Weierstrass in later sections.

At the same time, the finite-graph distance computation we ultimately obtain does not require dense inputs: if the adjacency matrices are sparse, then the induced neighborhood measures simply have small total mass. Unbalanced transport then compares such measures without forcing artificial normalization. Thus sparsity manifests in computation as smaller supports and smaller transported mass, which can be exploited algorithmically even though the analytic ambient class is dense. We emphasize that we do not claim to develop a full sparse-limit theory here; rather, we use the dense graphon-signal formalism as a compact analytic envelope within which our recursion and stability/expressivity arguments can be carried out cleanly, while still allowing finite sparse graphs to be embedded and compared.

The multi-relational directed edge-attributed graphon-signal formalism provides (i) a common language for finite and limit objects, (ii) a canonical invariance under measure-preserving relabelings, and (iii) bounded feature/attribute ranges ensuring compactness at every recursive stage. In the next section we build the enriched IDM/DIDM hierarchy on top of this representation and define the relation- and attribute-aware transport metrics used throughout the theory.


4. Enriched IDM/DIDM Hierarchy: define Ht, computation IDMs γt, DIDMs ΓL; define recursive ground metric dMR-IDML with relation/edge costs; define δMR-DIDML.

We define a hierarchy of compact IDM state spaces (Ht)t ≥ 0 which records, at each depth, both a node state and a labeled, edge-attributed neighborhood measure. We set
H0 := Brd.
For t ≥ 1 we define
Ht := Ht − 1 × M ≤ 1 (Ht − 1 × ℛ × Bsq),
equipped with the product σ-algebra and, later, a recursively defined metric. Thus an element h ∈ Ht can be written as h = (hself, hnbh) where hself ∈ Ht − 1 is the ``current’’ state and hnbh is a subprobability measure over triples consisting of a neighbor state in Ht − 1, a relation label in ℛ = [R], and an edge attribute in Bsq.

Fix (W, E, f) ∈ 𝒲r, sd, q, R. We define recursively measurable maps
γ(W, E, f), t : [0, 1] → Ht,   t = 0, 1, …,
interpreted as order-t computation-IDMs. At depth t = 0 we set
γ0(x) := f(x) ∈ H0.
Assume γt − 1 has been defined. For each x ∈ [0, 1] we define the outgoing enriched neighborhood measure μx, t ∈ M ≤ 1(Ht − 1 × ℛ × Bsq) by specifying its action on measurable sets A ⊆ Ht − 1 × ℛ × Bsq:
$$ \mu_{x,t}(A) \;:=\; \frac{1}{R}\sum_{\rho\in[R]}\int_0^1 W_\rho(x,y)\,\mathbf{1}\bigl\{(\gamma_{t-1}(y),\rho,E_\rho(x,y))\in A\bigr\}\,dy. $$
The factor 1/R is a convenient normalization ensuring μx, t has total mass at most one:
$$ \mu_{x,t}\bigl(H_{t-1}\times\mathcal{R}\times B_s^q\bigr) = \frac{1}{R}\sum_{\rho\in[R]}\int_0^1 W_\rho(x,y)\,dy \le 1, $$
since each Wρ is bounded by 1. (Any other fixed normalization can be absorbed into constants in later Lipschitz bounds; we choose to remain within M ≤ 1(⋅) at every depth.) We then set
γt(x) := (γt − 1(x), μx, t) ∈ Ht.
By construction, μx, t depends on (W, E) only through the weighted push-forward of Lebesgue measure by the map y ↦ (γt − 1(y), ρ, Eρ(x, y)), with weight Wρ(x, y); thus edge attributes are incorporated only through the edges on which they are weighted, in accordance with our measurability convention from Section 3.
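On a step object this measure is purely atomic and can be materialized explicitly. A minimal sketch (hypothetical helper; the node index u stands in for the state γt − 1(u), as in the layered algorithm of Section 9):

```python
def neighborhood_atoms(A, e, i):
    """Atoms of mu_{i,t} for node i of a finite graph: each rho-edge (i -> u)
    contributes mass A[rho][i, u] / (R * n) at the atom (u, rho, e[rho][i, u]),
    with the node index u standing in for the state gamma_{t-1}(u).
    Here A is a list of n x n weight matrices, e a list of n x n x q arrays."""
    R, n = len(A), A[0].shape[0]
    return [((u, rho, e[rho][i, u]), A[rho][i, u] / (R * n))
            for rho in range(R) for u in range(n) if A[rho][i, u] > 0]
```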

The depth-L DIDM is the distribution of order-L IDMs under the base node measure λ on [0, 1]. Concretely, for each L ∈ ℕ we define
Γ(W, E, f), L := (γ(W, E, f), L)*λ ∈ 𝒫(HL).
This is the canonical graph-level representation induced by the local recursion: it records the law of rooted depth-L enriched neighborhoods when a root x is sampled uniformly from [0, 1].

We now define metrics dMR-IDMt on Ht recursively. At depth t = 0 we take the Euclidean metric
dMR-IDM0(u, v) := ∥u − v2,   u, v ∈ Brd.
Fix weights λR ≥ 0 and λE ≥ 0. For t ≥ 1, write h = (x, μ) and h′ = (x′, ν) with x, x′ ∈ Ht − 1 and μ, ν ∈ M ≤ 1(Ht − 1 × ℛ × Bsq). We define a ground cost on atoms
ct − 1  : (Ht − 1 × ℛ × Bsq)2 → ℝ ≥ 0
by
$$ c_{t-1}\bigl((x,\rho,a),(x',\rho',a')\bigr) \;:=\; d_{\mathrm{MR\text{-}IDM}}^{t-1}(x,x') \;+\; \lambda_R\,\mathbf{1}\{\rho\neq\rho'\} \;+\; \lambda_E\,\|a-a'\|_2, $$
where 1{ρ ≠ ρ′} is the discrete metric on ℛ. We then set
$$ d_{\mathrm{MR\text{-}IDM}}^{t}(h,h') \;:=\; d_{\mathrm{MR\text{-}IDM}}^{t-1}(x,x') \;+\; OT_{c_{t-1}}(\mu,\nu). $$
Here OTct − 1 denotes an optimal transport distance on subprobability measures, so that OTct − 1(μ, ν) remains finite even when μ and ν have different total masses. This choice is forced by the normalization in the definition of μx, t: neighborhood mass reflects (normalized) outgoing edge mass and may vary across nodes and across graphs.

For a fixed recursion depth L, we define the multi-relational DIDM mover’s distance between two multi-graphon-signals (W, E, f) and (V, F, g) by transporting their depth-L DIDMs with ground metric dMR-IDML:
$$ \delta_{\mathrm{MR\text{-}DIDM}}^{L}\bigl((\mathbf{W},\mathbf{E},f),(\mathbf{V},\mathbf{F},g)\bigr) \;:=\; OT_{d_{\mathrm{MR\text{-}IDM}}^{L}}\bigl(\Gamma_{(\mathbf{W},\mathbf{E},f),L},\ \Gamma_{(\mathbf{V},\mathbf{F},g),L}\bigr). $$
Because Γ(W, E, f), L is defined as a push-forward of Lebesgue measure, it is immediate that relabelings by measure-preserving bijections σ leave Γ(W, E, f), L unchanged, and hence leave δMR-DIDML unchanged. In particular, for step multi-graphon-signals induced by finite graphs, the recursion above reduces to an explicit layered optimal transport computation on finite neighborhoods (with relation mismatch penalized by λR and attribute mismatch penalized by λE), which is the algorithmic form used later to obtain polynomial-time exact computation results.


5. Compactness and Topology: prove Ht and 𝒫(Ht) compact; prove 𝒲r, sd, q, R/∼ compact under δMR-DIDML; relate to cut distance/coarser topologies where appropriate.

We equip each kernel component Wρ with the weak* topology inherited from the unit ball of L∞([0, 1]2), and each coordinate of Eρ with the weak* topology inherited from the ball of radius s in L∞([0, 1]2) (viewing Eρ as a q-tuple of bounded real kernels and enforcing the pointwise constraint Eρ(x, y) ∈ Bsq a.e.). Likewise we equip each coordinate of f with the weak* topology of the ball of radius r in L∞([0, 1]) and enforce f(x) ∈ Brd a.e. Under these choices, the admissible set 𝒲r, sd, q, R is a weak* closed subset of a product of weak* compact sets; hence 𝒲r, sd, q, R is compact by Banach–Alaoglu and Tychonoff. This is the ambient compactness used to extract convergent subsequences of kernels/signals without any regularity beyond boundedness.

We now justify that the recursive state spaces Ht are compact and that their associated measure spaces are compact in the weak* sense. The key observation is that if K is compact Hausdorff, then the set M ≤ 1(K) of finite Borel measures of total mass at most one is weak* compact when identified with a weak* closed subset of the unit ball of C(K)* via the Riesz representation theorem. Moreover, 𝒫(K) is weak* compact for the same reason.

Since ℛ = [R] is finite (hence compact in the discrete topology) and Bsq is compact, we have that Ht − 1 × ℛ × Bsq is compact whenever Ht − 1 is compact. Consequently M ≤ 1(Ht − 1 × ℛ × Bsq) is weak* compact, and then
Ht = Ht − 1 × M ≤ 1 (Ht − 1 × ℛ × Bsq)
is compact as a product of compact sets. Inducting from H0 = Brd yields compactness of every Ht. In particular, for each fixed L, the DIDM codomain 𝒫(HL) is weak* compact.

Compactness alone is not sufficient for quantitative control; we also require that the recursively defined costs induce the relevant weak* topologies. For this we use the standard fact that on a compact metric space (K, d), the 1-Wasserstein distance W1(d) metrizes weak* convergence on 𝒫(K). The same conclusion holds for customary optimal transport distances on M ≤ 1(K) (e.g. formulations via semi-couplings or via a reservoir point with finite creation/destruction cost), provided the transport cost is continuous and bounded on K × K; we write OTd for a fixed such choice.

Applying this at each depth, we note that if dMR-IDMt − 1 is a metric inducing the topology of Ht − 1, then the atom cost ct − 1 defined in Section 4 is continuous and bounded on the compact product Ht − 1 × ℛ × Bsq. Hence OTct − 1 metrizes weak* on M ≤ 1(Ht − 1 × ℛ × Bsq). The recursion then equips Ht with a product-type metric which metrizes precisely the product of the topology of Ht − 1 and the weak* topology on M ≤ 1(⋅). Inducting on t yields that (Ht, dMR-IDMt) is a compact metric space for each t, and that (𝒫(Ht), OTdMR-IDMt) is a compact metric space whose topology agrees with weak* on probability measures.

Fix L. The map
ΦL : 𝒲r, sd, q, R → 𝒫(HL),   ΦL(W, E, f) := Γ(W, E, f), L,
is well-defined by construction. We view δMR-DIDML as the pullback pseudometric
δMR-DIDML(X, Y) = OTdMR-IDML(ΦL(X), ΦL(Y)),
and we denote by ∼ the induced metric identification (i.e. X ∼ Y iff δMR-DIDML(X, Y) = 0). The quotient (𝒲r, sd, q, R/∼, δMR-DIDML) is then isometric to the image ΦL(𝒲r, sd, q, R) equipped with OTdMR-IDML.

Thus, to prove compactness of the quotient, it suffices to show that ΦL(𝒲r, sd, q, R) is compact in 𝒫(HL). Since 𝒲r, sd, q, R is compact in the weak* product topology described above, it is enough to verify that ΦL is continuous with respect to these topologies. This continuity follows by induction on t: for any bounded continuous test function ζ on Ht, the map
(W, E, f) ↦ ∫01ζ(γ(W, E, f), t(x)) dx
depends on (W, E, f) only through iterated operations of (i) composition with continuous functions on compact domains, and (ii) integration of bounded kernels against bounded test functions, as in the recursion defining μx, t. These operations are continuous under weak* convergence of bounded kernels/signals. Therefore ΦL is continuous into the compact metric space (𝒫(HL), OTdMR-IDML), and ΦL(𝒲r, sd, q, R) is compact. The claimed compactness of the quotient follows.

In the classical case R = 1 and without attributes, the cut metric on graphons is a standard relabeling-invariant topology. Our distance δMR-DIDML shares the same invariance under measure-preserving relabelings (as already noted in Section 4), but it is tailored to the depth-L enriched neighborhood statistics encoded by Γ(⋅), L. In particular, cut convergence of kernels (and analogous multi-relational cut-type convergence defined by taking the maximum cut norm over relations) implies convergence of all integrals against bounded test functions of finite depth, and therefore implies δMR-DIDML → 0 for each fixed L. Conversely, because δMR-DIDML only probes depth-L neighborhoods, it is generally coarser than cut-type topologies: distinct kernels may induce the same depth-L DIDM and hence have zero δMR-DIDML at fixed L. This is precisely why we treat L as a resolution parameter in later separation and universality results.


6. Relation-Aware Edge-Conditioned MPNNs: define graph/graphon MPNNs with per-relation kernels and edge-conditioned messages; show factorization through IDMs/DIDMs (equivalence lemma).

We formalize the class of relation-aware, edge-conditioned MPNNs that will be controlled by the multi-relational DIDM distances defined above. Throughout, we fix an embedding width pt ∈ ℕ at each depth t ∈ {0, …, L} and an output dimension m ∈ ℕ. We consider update maps ϕ(t) : ℝpt − 1 × ℝpt → ℝpt, relation-specific message maps φρ(t) : ℝpt − 1 × ℝpt − 1 × Bsq → ℝpt for ρ ∈ [R], and a readout ψ : ℝpL → ℝm. (Lipschitz assumptions on these maps are imposed later, in the stability section.)

Given a multi-graphon-signal (W, E, f) ∈ 𝒲r, sd, q, R, we define node embeddings h(t) : [0, 1] → ℝpt recursively by h(0)(x) := η(0)(f(x)) and, for t = 1, …, L,
$$ h^{(t)}(x) \;:=\; \phi^{(t)}\Bigl(h^{(t-1)}(x),\ \frac{1}{R}\sum_{\rho\in[R]}\int_0^1 W_\rho(x,y)\,\varphi_\rho^{(t)}\bigl(h^{(t-1)}(x),\,h^{(t-1)}(y),\,E_\rho(x,y)\bigr)\,dy\Bigr), $$
where η(0) : Brd → ℝp0 is a fixed measurable (typically continuous) feature lift. The resulting graph-level embedding is defined via average pooling,
$$ F(\mathbf{W},\mathbf{E},f)\;:=\;\psi\Bigl(\int_0^1 h^{(L)}(x)\,dx\Bigr). $$
The normalization implicit in this recursion matches the graphon scaling and the neighborhood measures used in the IDM recursion: neighbors are sampled according to Lebesgue measure and weighted by Wρ(x, y), with relation ρ and attribute Eρ(x, y) exposed to the message map.

For finite graphs G = (V, {Eρ}ρ ∈ [R], f, {eρ}ρ ∈ [R]) with |V| = n, we recover the usual normalized-sum MPNN by replacing integrals with normalized sums. Writing Aρ(i, u) ∈ {0, 1} for adjacency (directed allowed) and eρ(i, u) ∈ Bsq for the edge attribute when Aρ(i, u) = 1, we set
$$ h^{(t)}(i) \;:=\; \phi^{(t)}\Bigl(h^{(t-1)}(i),\ \frac{1}{Rn}\sum_{\rho\in[R]}\sum_{u\in V} A_\rho(i,u)\,\varphi_\rho^{(t)}\bigl(h^{(t-1)}(i),\,h^{(t-1)}(u),\,e_\rho(i,u)\bigr)\Bigr), $$
with readout $F(G):=\psi\bigl(\frac{1}{n}\sum_{i\in V}h^{(L)}(i)\bigr)$. This is consistent with applying the graphon recursion above to the induced step graphon-signal of G.
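For concreteness, the finite recursion can be phrased as a short reference implementation; the sketch below assumes hypothetical callables msgs[t][r] and phis[t] standing in for φρ(t) and ϕ(t), and uses the 1/(Rn) normalization above.

```python
import numpy as np

def mpnn_forward(A, e, X, phis, msgs, readout, L):
    """Normalized-sum relation-aware edge-conditioned MPNN on a finite graph:
    A[r] is an n x n 0/1 adjacency, e[r] an n x n x q attribute array, X an
    n x p0 feature matrix; msgs[t][r] and phis[t] are callables standing in
    for varphi_r^(t) and phi^(t). Returns the graph-level readout."""
    n, R = X.shape[0], len(A)
    H = [X[i] for i in range(n)]
    for t in range(L):
        H_new = []
        for i in range(n):
            agg = 0.0
            for r in range(R):
                for u in np.nonzero(A[r][i])[0]:
                    # message along the typed, attributed edge (i -> u)
                    agg = agg + msgs[t][r](H[i], H[u], e[r][i, u]) / (R * n)
            H_new.append(phis[t](H[i], agg))
        H = H_new
    return readout(np.mean(H, axis=0))
```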

We recall that the order-t enriched IDM γ(W, E, f), t(x) ∈ Ht stores the current state and a measure-valued summary of the 1-hop neighborhood at the previous depth. Concretely, writing γt − 1(y) := γ(W, E, f), t − 1(y) ∈ Ht − 1, the last coordinate of γt(x) is a finite measure μx, t ∈ M ≤ 1(Ht − 1 × ℛ × Bsq) characterized (up to null sets) by
$$ \int g\,d\mu_{x,t} \;=\; \frac{1}{R}\sum_{\rho\in[R]}\int_0^1 W_\rho(x,y)\,g\bigl(\gamma_{t-1}(y),\,\rho,\,E_\rho(x,y)\bigr)\,dy \qquad\text{for all } g\in C\bigl(H_{t-1}\times\mathcal{R}\times B_s^q\bigr). $$
Thus, any computation at depth t that aggregates a continuous function of (γt − 1(y), ρ, Eρ(x, y)) against the weights Wρ(x, y) can be viewed as an integral against μx, t.

This observation yields the equivalence lemma: for each t there is a continuous map Θt : Ht → ℝpt, built from the constituent maps (ϕ(τ), φρ(τ))τ ≤ t, such that h(t)(x) = Θt(γ(W, E, f), t(x)) for all x, and consequently F(W, E, f) = ψ(∫ΘL dΓ(W, E, f), L). This lemma is the precise sense in which our enriched IDMs/DIDMs are the canonical state variables for relation-aware edge-conditioned MPNNs: the network does not access (W, E, f) beyond what is recorded in γt pointwise and in ΓL at the graph level. This factorization is the structural input for the stability bounds that follow, where we will control μ ↦ ∫ΘLdμ by optimal transport on 𝒫(HL).


7. Stability: Lipschitz Bounds: inductive Lipschitz proof on IDM space; propagate to DIDMs via OT; explicit dependence on depth, Lipschitz constants, relation count, and edge-attribute bounds.

We now impose global Lipschitz assumptions on the maps defining the MPNN and prove that the resulting graph-level functional is Lipschitz with respect to δMR-DIDML. The argument proceeds in two steps: we first establish an inductive Lipschitz bound for the node-level factor maps Θt : Ht → ℝpt with respect to the recursive ground metrics dMR-IDMt, and then lift this bound to the graph level by the Kantorovich–Rubinstein control of integrals under optimal transport on 𝒫(HL).

Fix L and assume that η(0) is Lη-Lipschitz (w.r.t. ∥ ⋅ ∥2 on Brd), each update ϕ(t) is Lϕ, t-Lipschitz (w.r.t. the product Euclidean norm), each message map φρ(t) is Lφ, t-Lipschitz uniformly over ρ ∈ [R] (again w.r.t. a product Euclidean norm on ℝpt − 1 × ℝpt − 1 × ℝq), and ψ is Lψ-Lipschitz. Since Ht is compact for each t and Θt will be measurable and (under the above assumptions) continuous, we also have finiteness of the range bounds
Bt := supξ ∈ HtΘt(ξ)∥2 < ∞.
When an explicit estimate is desired, one may upper bound Bt inductively using B0 ≤ supx ∈ Brdη(0)(x)∥2 and the inequalities
sup ∥φρ(t)(h, h′, a)∥2 ≤ ∥φρ(t)(0, 0, 0)∥2 + Lφ, t(2Bt − 1 + s),   a ∈ Bsq,
together with sup ∥ϕ(t)(u, v)∥2 ≤ ∥ϕ(t)(0, 0)∥2 + Lϕ, t(∥u∥2 + ∥v∥2) and the fact that each neighborhood measure has mass ≤ 1.

We write the generic element of Ht as ξ = (z, μ) with z ∈ Ht − 1 and μ ∈ M ≤ 1(Ht − 1 × ℛ × Bsq). Recall that the recursive metric dMR-IDMt has the schematic form
$$ d_{\mathrm{MR\text{-}IDM}}^{t}\bigl((z,\mu),(z',\nu)\bigr) \;=\; d_{\mathrm{MR\text{-}IDM}}^{t-1}(z,z') \;+\; OT_{\ell_{t-1}}(\mu,\nu), $$
where ℓt − 1 is the ground metric on Ht − 1 × ℛ × Bsq given by
ℓt − 1((ξ, ρ, a), (ξ′, ρ′, a′)) := dMR-IDMt − 1(ξ, ξ′) + λR 1{ρ ≠ ρ′} + λE ∥a − a′∥2,
for the fixed weights λR, λE > 0 used in the definition of dMR-IDMt.

Carrying out the induction over t (comparing the self coordinates directly and the measure coordinates via the Kantorovich–Rubinstein bound for OTℓt − 1) yields, for each t, a finite constant Kt such that Θt is Kt-Lipschitz on (Ht, dMR-IDMt), with K0 = Lη and Kt controlled by products of the layer constants Lϕ, τ, Lφ, τ (τ ≤ t) together with the range bounds Bτ. We now propagate the Lipschitz bound to the DIDM level. Recall that the graph-level functional factors as
F(W, E, f) = Ψ(Γ(W, E, f), L),   Ψ(μ) := ψ(∫ΘLdμ).
Since ΘL is KL-Lipschitz on (HL, dMR-IDML), the map μ ↦ ∫ΘLdμ is KL-Lipschitz on (𝒫(HL), OTdMR-IDML) (again by the Kantorovich–Rubinstein bound applied coordinatewise). Composing with the Lψ-Lipschitz readout ψ yields the desired stability estimate with constant C := LψKL.


8. Fine-Grained Expressivity and Universality: separation theorem (topological converse) and Stone–Weierstrass density on compact domains; discuss what class of continuous functionals is captured.

The stability result of Theorem~5 asserts that every depth-L relation-aware edge-conditioned MPNN readout is Lipschitz with respect to δMR-DIDML. We now establish a partial converse: on the compact DIDM domain, agreement of Lipschitz-bounded MPNN readouts forces closeness in the OT-geometry induced by the enriched IDM recursion. Formally, let ℱL, m, C denote the class of ℝm-valued depth-L MPNN readouts Hϕ, ψ (as in the factorization statement) whose constituent maps have Lipschitz constants bounded by C and whose parameters are otherwise unconstrained.

The argument is purely topological and uses the compactness and weak* metrizability of 𝒫(HL) under OTdMR-IDML (Theorems~1–2 in our sequence). Since OTdMR-IDML metrizes weak*, to prove OTdMR-IDML(μ, ν) ≤ ε it suffices to control |∫Φdμ − ∫Φdν| uniformly over a separating set of continuous test functions Φ ∈ C(HL). We therefore show that, for L sufficiently large and C sufficiently permissive, the collection of maps realizable as ΘL : HL → ℝ by depth-L relation-aware edge-conditioned architectures is (after closing under linear combinations and products) uniformly dense in C(HL). In particular, for any Φ ∈ C(HL) and η > 0 we can realize Φ̃ = Φ̃ϕ such that supξ ∈ HL|Φ(ξ) − Φ̃(ξ)| ≤ η. By choosing m large enough (or working coordinatewise) and setting ψ to be a linear functional, the graph-level map μ ↦ ∫Φ̃dμ becomes a member of ℱL, m, C up to an arbitrarily small uniform perturbation. The hypothesis that all readouts in ℱL, m, C nearly agree on μ and ν then yields |∫Φdμ − ∫Φdν| small for all Φ in a dense subset of C(HL), hence for all Φ ∈ C(HL) by approximation; weak* proximity follows, and metrizability converts it to a quantitative OT bound.

We make the preceding density statement precise. Let 𝒜L ⊂ C(HL) be the set of scalar node-level maps ΘL realizable by depth-L relation-aware, edge-conditioned normalized-sum architectures with continuous constituent maps. By closure properties of continuous composition, 𝒜L is closed under multiplication after enlarging the architecture class in the standard way (e.g., by concatenating embeddings and applying a final bilinear map), and it contains constants. Moreover, 𝒜L separates points of HL once L is sufficiently large: if ξ ≠ ξ′ in HL, then either their HL − 1-coordinates differ, in which case the inductive expressivity at depth L − 1 separates them, or their last-coordinate measures in M ≤ 1(HL − 1 × ℛ × Bsq) differ. In the latter case there exists a continuous test function on HL − 1 × ℛ × Bsq witnessing the discrepancy, and by the edge-conditioned form of φρ(L) we can approximate such tests by message functions of (ξ, ρ, a), including dependence on the relation label ρ and edge attribute a ∈ Bsq (in particular, the discrete metric on ℛ permits exact gating by ρ at bounded Lipschitz cost). Integrating these messages against the last-coordinate measure produces a scalar summary that differs between ξ and ξ′.

Since HL is compact Hausdorff, the Stone–Weierstrass theorem applies: the uniform closure of any subalgebra of C(HL) that contains constants and separates points equals C(HL). Thus, after taking the algebra generated by 𝒜L, we obtain uniform density in C(HL). This yields point separation of measures on HL via the standard corollary: if μ ≠ ν are Borel probability measures on a compact metric space, then there exists Φ ∈ C(HL) with Φdμ ≠ ∫Φdν, and we may approximate Φ uniformly by functions realized by node-level MPNN embeddings.

We now lift the preceding density to functionals on the graphon-signal quotient (𝒲r, sd, q, R/∼,δMR-DIDML). Recall that every depth-L readout factors through the DIDM,
(W, E, f) ↦ Γ(W, E, f), L ∈ 𝒫(HL),   μ ↦ ψ(∫ΘLdμ).
Since (W, E, f) ↦ Γ(W, E, f), L is continuous and the quotient is compact, it follows that any continuous functional on 𝒲r, sd, q, R/∼ may be uniformly approximated by compositions of continuous functionals on 𝒫(HL) with the DIDM map. Combining this with the Stone–Weierstrass density on HL (and the fact that integration μ ↦ ∫Φdμ produces a separating family of continuous functions on 𝒫(HL)), we obtain the following.

The stability and universality theorems together identify δMR-DIDML as the appropriate topology for depth-L relation-aware message passing: (i) every such readout is continuous (indeed Lipschitz) in this topology; (ii) conversely, up to uniform approximation, continuous functionals on the compact quotient are representable. Equivalently, depth-L architectures capture precisely those graphon-signal functionals that depend continuously on the order-L enriched neighborhood statistics encoded by the DIDM Γ(W, E, f), L, where the enrichment includes the relation channel and edge attribute space Bsq. We next turn to the algorithmic question of computing δMR-DIDML on finite graphs.


9. Algorithmics: Polynomial-Time Computation on Finite Graphs: layered OT recursion; block-structured cost matrices by relation channels; complexity bounds; practical accelerations (optional entropic OT).

We now address the computational problem suggested by the preceding topological results: given two finite multi-relational directed edge-attributed graphs
G = (V, {Er}r ∈ [R], f, {er}r ∈ [R]),   H = (U, {Fr}r ∈ [R], g, {ẽr}r ∈ [R]),
how do we compute the induced distance δMR-DIDML(G, H) exactly, and what is the worst-case complexity? Throughout we assume |V|,|U| ≤ N and that L and R are fixed constants (as in the metric definition); arithmetic is performed in the unit-cost RAM model on O(log N)-bit numbers.

The finite-graph computation mirrors the enriched IDM recursion on step graphon-signals. We maintain an |V|×|U| cost matrix Dt whose entry Dt(i, j) represents the order-t ground distance dMR-IDMt between the order-t IDMs rooted at i ∈ V and j ∈ U. At depth t = 0 we set
D0(i, j) := ∥f(i) − g(j)∥2,
which is well-defined since node features lie in the compact ball Brd.

For each layer t ≥ 1, and for each node pair (i, j) ∈ V × U, we form a source measure μi, t and a target measure νj, t supported on neighbor-atoms enriched with relation labels and edge attributes. Concretely, write ArG(i, u) ∈ {0, 1} for the indicator of a directed r-edge from i to u in G (and similarly ArH(j, v)). We define
$$ \mu_{i,t}\;:=\;\frac{1}{R}\sum_{r\in[R]}\sum_{u\in V}\frac{A_r^G(i,u)}{|V|}\,\delta_{(u,r,e_r(i,u))}, \qquad \nu_{j,t}\;:=\;\frac{1}{R}\sum_{r\in[R]}\sum_{v\in U}\frac{A_r^H(j,v)}{|U|}\,\delta_{(v,r,\tilde e_r(j,v))}, $$
which are finite measures of total mass at most 1 (indeed, equal to ∑r degr(i)/(R|V|) and ∑r degr(j)/(R|U|), matching the 1/R normalization of the graphon recursion). The ground cost between atoms (u, r, a) and (v, r′, b) is taken to be
ct((u, r, a), (v, r′, b)) := Dt − 1(u, v) + λR 1{r ≠ r′} + λE ∥a − b∥2,
with fixed weights λR, λE ≥ 0. We then set the transport discrepancy
Tt(i, j) := OTct(μi, t, νj, t),   and   Dt := Dt − 1 + Tt,
where OTct denotes the chosen (possibly unbalanced) transport objective with ground costs ct.

Each quantity Tt(i, j) is an optimal transport problem on a finite support of size at most R|V| versus R|U|, hence O(N) for fixed R. In the balanced case (equal total mass), Tt(i, j) is exactly a transportation linear program on a complete bipartite graph with O(N²) edges, solvable in polynomial time by standard min-cost flow reductions. In the unbalanced setting (total masses differ), we can reduce to a balanced transportation instance by adjoining a dummy ``reservoir’’ node on each side: the dummy absorbs excess mass with a prescribed deletion/insertion cost (equivalently, one uses an unbalanced OT objective that can be written as a min-cost flow on an augmented bipartite network). Since the augmentation increases the support size by at most a constant factor, the polynomial-time solvability and asymptotic bounds are unchanged.
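As a small worked example of the reservoir reduction (reusing the ot_unbalanced sketch from Section 2), a mass mismatch of 0.3 is simply created/destroyed at cost κ:

```python
import numpy as np

# mu has mass 0.6 on two atoms, nu has mass 0.3 on one atom; the excess
# 0.3 is routed to the reservoir at cost kappa = 1.0 per unit of mass.
mu, nu = np.array([0.3, 0.3]), np.array([0.3])
C = np.array([[0.2],
              [0.5]])
print(ot_unbalanced(mu, nu, C, kappa=1.0))  # 0.3*0.2 + 0.3*1.0 = 0.36
```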

After completing the L layers, we perform a final transport between the uniform node measures on V and U with ground cost DL:
$$ \delta^{L}_{\mathrm{MR\text{-}DIDM}}(G,H) \;:=\; OT_{D_L}\Bigl(\tfrac{1}{|V|}\sum_{i\in V}\delta_i,\;\tfrac{1}{|U|}\sum_{j\in U}\delta_j\Bigr). $$
This last step is again a transportation/min-cost flow problem on O(N) nodes per side.
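Assembling the layers gives a compact reference implementation of the whole computation. The sketch below is a naive dense instantiation under our own naming and argument-layout assumptions (and it reuses the ot_unbalanced helper from Section 2); it realizes the updates D0, Tt, Dt and the final transport described above.

```python
import numpy as np

def mr_didm_distance(A_G, e_G, f_G, A_H, e_H, g_H, L,
                     lam_R=1.0, lam_E=1.0, kappa=1.0):
    """Layered MR-DIDM recursion on two finite graphs: A_G[r] is an n x n
    0/1 adjacency for relation r, e_G[r] an n x n x q attribute array, f_G
    an n x d feature matrix (likewise for H). Requires the ot_unbalanced
    helper sketched in Section 2."""
    n, m, R = len(f_G), len(g_H), len(A_G)
    # D_0(i, j): Euclidean distance between node features.
    D = np.linalg.norm(f_G[:, None, :] - g_H[None, :, :], axis=2)
    for _ in range(L):
        T = np.zeros((n, m))
        for i in range(n):
            for j in range(m):
                mu = [(u, r) for r in range(R)
                      for u in np.nonzero(A_G[r][i])[0]]
                nu = [(v, r) for r in range(R)
                      for v in np.nonzero(A_H[r][j])[0]]
                if not mu and not nu:
                    continue            # both neighborhoods empty
                # Block-structured ground cost over enriched atoms.
                C = np.zeros((len(mu), len(nu)))
                for a, (u, r) in enumerate(mu):
                    for b, (v, rp) in enumerate(nu):
                        C[a, b] = (D[u, v] + lam_R * (r != rp)
                                   + lam_E * np.linalg.norm(
                                         e_G[r][i, u] - e_H[rp][j, v]))
                w_mu = np.full(len(mu), 1.0 / (R * n))
                w_nu = np.full(len(nu), 1.0 / (R * m))
                T[i, j] = ot_unbalanced(w_mu, w_nu, C, kappa)
        D = D + T
    # Final transport between uniform node measures with ground cost D_L.
    return ot_unbalanced(np.full(n, 1.0 / n), np.full(m, 1.0 / m), D, kappa)
```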

A salient feature of the multi-relational setting is that the cost matrix in each neighborhood OT admits an explicit block structure. If we order the atoms of μi, t (and νj, t) by their relation labels, the ground cost matrix decomposes into R × R blocks indexed by (r, r′). The diagonal blocks (r = r′) have entries
Dt − 1(u, v) + λE ∥er(i, u) − ẽr(j, v)∥2,
while every off-diagonal block (r ≠ r′) is obtained from the corresponding diagonal form by adding the constant penalty λR. Thus, for fixed (i, j), once the base matrix (Dt − 1(u, v))u, v is available, the additional overhead to incorporate relations is purely additive and can be implemented without changing the asymptotic order. In particular, for fixed q the edge-attribute term is a Euclidean distance in Bsq and can be computed on the fly (or precomputed per (i, j) and relation-pair) within the same polynomial budget.

Let n := |V| and m := |U| and set N := max {n, m}. For each layer t, we must solve nm ≤ N² neighborhood OT problems, each over supports of size O(RN) = O(N) for fixed R. Using a strongly polynomial min-cost flow algorithm or a standard network simplex bound, one may take the running time of each such OT solve to be O(N³ log N) (up to constants depending on the precise flow routine and the fixed dimensions q, d through cost evaluation). Consequently, each layer costs O(N² · N³ log N) = O(N⁵ log N) time, and the L layers contribute O(LN⁵ log N). The final OT solve contributes an additional O(N³ log N), which is dominated. Memory usage is governed by storage of the current ground matrix Dt − 1 and the updated matrix Dt, which can be streamed to maintain O(N²) working space.

The stated O(LN⁵ log N) bound is a dense worst-case guarantee. In typical sparse graphs, each measure μi, t is supported only on actual neighbors of i across all relations; thus its support size is O(∑r degr(i)) rather than O(N). In that regime, the neighborhood OT problems shrink substantially, and the empirical runtime is closer to a degree-weighted sum over node pairs (and can be reduced further by restricting transport to truncated neighborhoods if one is willing to approximate the metric).

A separate, widely used acceleration is entropic regularization. Replacing exact OT by its entropic variant yields Sinkhorn iterations with per-iteration cost essentially quadratic in the support size, often leading to substantial speedups in practice. For our purposes this is optional: entropic OT produces an approximation to δMR-DIDML with controllable error, but the core theoretical statements (compactness, stability, and exact polynomial-time computability) concern the unregularized distance. Finally, the block structure induced by relations is amenable to implementation-level optimizations (e.g., precomputing relation-gated costs or reusing portions of the cost matrix across (i, j)), which improves constants but does not alter the worst-case asymptotic order.


10. Generalization: Covering-Number Bounds: apply uniform Monte Carlo estimation over compact metric domain; discuss dependence on covering numbers and what can/cannot be made explicit.

We briefly explain how compactness of our metric domains yields distribution-free generalization guarantees for Lipschitz-bounded predictors, and we make explicit the precise point at which covering numbers enter. Let (X, d) denote either of the compact metric spaces
X = (𝒫(HL), OTdMR-IDML)   or   X = (𝒲r, sd, q, R/∼, δMR-DIDML),
with L fixed. In either case, compactness implies total boundedness: for every ε > 0 there exists a finite ε-net, and hence the covering number
$$ \mathcal{N}(X,d,\varepsilon)\;:=\;\min\Bigl\{M:\exists x_1,\dots,x_M\in X\text{ with }X\subseteq\bigcup_{k=1}^M B_d(x_k,\varepsilon)\Bigr\} $$
is finite.

We consider an i.i.d. sample (xi, yi)i = 1n drawn from an arbitrary distribution P on X × Y, where Y ⊂ ℝp is bounded. For a predictor h : X → ℝp and a loss ℓ : ℝp × Y → ℝ, define the population and empirical risks
$$ \mathcal{R}(h)\;:=\;\mathbb{E}_{(x,y)\sim\mathsf{P}}\bigl[\ell(h(x),y)\bigr], \qquad \widehat{\mathcal{R}}_n(h)\;:=\;\frac{1}{n}\sum_{i=1}^n \ell\bigl(h(x_i),y_i\bigr). $$
Assume ℓ(⋅, y) is Lℓ-Lipschitz uniformly in y and bounded in an interval of length B (e.g. ℓ ∈ [0, 1]). If h is Lh-Lipschitz with respect to d, then for each fixed y the map x ↦ ℓ(h(x), y) is LℓLh-Lipschitz on X. In particular, for any two probability measures μ, ν ∈ 𝒫(X) one has the standard transport bound
$$ \Bigl|\int_X \varphi\,d\mu-\int_X \varphi\,d\nu\Bigr| \;\le\; \mathrm{Lip}(\varphi)\; OT_d(\mu,\nu) \qquad\text{for all Lipschitz } \varphi: X\to\mathbb{R}, $$
valid for 1-Wasserstein by Kantorovich–Rubinstein duality and, for the unbalanced variants we allow, as a direct consequence of their dual feasibility constraints (hence as an inequality sufficient for stability). Applying this bound with φ(x) = ℓ(h(x), y) and then averaging over y yields
$$ \bigl|\mathcal{R}(h)-\widehat{\mathcal{R}}_n(h)\bigr| \;\le\; L_\ell L_h\; OT_d\bigl(\mathsf{P}_X,\widehat{\mathsf{P}}_{X,n}\bigr) \;+\; \Delta_n, $$
where PX is the X-marginal, $\widehat{\mathsf{P}}_{X,n}:=\frac{1}{n}\sum_{i=1}^n\delta_{x_i}$ is the empirical input measure, and Δn collects the residual label-sampling error. The additional term Δn accounts for sampling noise in y even at fixed x (and is bounded by a standard Hoeffding or Bernstein inequality using B); for our present purpose, the salient point is that the hypothesis h enters only through its Lipschitz constant Lh, while the data-dependent component is $OT_d(\mathsf{P}_X,\widehat{\mathsf{P}}_{X,n})$.

We thus reduce the problem to controlling the empirical transport discrepancy $OT_d(\mathsf{P}_X,\widehat{\mathsf{P}}_{X,n})$ uniformly over all data distributions. A convenient route, which makes the dependence on the metric entropy of (X, d) explicit, is to compare PX to a discretization on an ε-net. Fix ε > 0, let {z1, …, zM} be an ε-net with M = 𝒩(X, d, ε), and choose a measurable nearest-net projection π : X → {z1, …, zM} with d(x, π(x)) ≤ ε for all x ∈ X. By the triangle inequality and the push-forward contraction property of OTd,
$$ OT_d(\mathsf{P}_X,\widehat{\mathsf{P}}_{X,n}) \;\le\; OT_d(\mathsf{P}_X,\pi_\#\mathsf{P}_X) \;+\; OT_d(\pi_\#\mathsf{P}_X,\pi_\#\widehat{\mathsf{P}}_{X,n}) \;+\; OT_d(\pi_\#\widehat{\mathsf{P}}_{X,n},\widehat{\mathsf{P}}_{X,n}). $$
The first and third terms are at most ε because the coupling (x, π(x)) moves each point by at most ε, hence
$$ OT_d\bigl(\mathsf{P}_X,\pi_\#\mathsf{P}_X\bigr)\;\le\;\varepsilon \qquad\text{and}\qquad OT_d\bigl(\pi_\#\widehat{\mathsf{P}}_{X,n},\widehat{\mathsf{P}}_{X,n}\bigr)\;\le\;\varepsilon. $$
Now π#PX and $\pi_\#\widehat{\mathsf{P}}_{X,n}$ are measures on the finite set {zk}k = 1M. Writing D := diam(X), one has the crude domination OTd(α, β) ≤ D ∥α − β∥TV on any space of diameter D, hence
$$ OT_d(\pi_\#\mathsf{P}_X,\pi_\#\widehat{\mathsf{P}}_{X,n}) \;\le\; D\,\|\pi_\#\mathsf{P}_X-\pi_\#\widehat{\mathsf{P}}_{X,n}\|_{\mathrm{TV}}. $$
Since $\pi_\#\widehat{\mathsf{P}}_{X,n}$ is the empirical distribution of a multinomial variable with M categories, standard concentration for multinomials gives, for example, a bound of the form
$$ \bigl\|\pi_\#\mathsf{P}_X-\pi_\#\widehat{\mathsf{P}}_{X,n}\bigr\|_{\mathrm{TV}} \;\le\; C\sqrt{\frac{M+\log(1/\delta)}{n}} \qquad\text{with probability at least } 1-\delta, $$
where C > 0 is universal. Combining the last three displays yields the schematic high-probability estimate
$$ OT_d\bigl(\mathsf{P}_X,\widehat{\mathsf{P}}_{X,n}\bigr) \;\le\; 2\varepsilon \;+\; C\,D\sqrt{\frac{\mathcal{N}(X,d,\varepsilon)+\log(1/\delta)}{n}} \qquad\text{for every } \varepsilon>0, $$
and plugging this into the risk bound above gives the desired distribution-free control of the generalization gap for any hypothesis class with a uniform Lipschitz bound. In our setting, Theorem~5 provides exactly such a bound for relation-aware edge-conditioned MPNNs in terms of δMR-DIDML (or OTdMR-IDML), so the estimate immediately implies a sample-complexity statement whose dependence on the network parameters enters only through the induced Lipschitz constant.
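As a numerical illustration of how the pieces combine, the following sketch evaluates the transport component of the schematic bound for a given net size and sample count (the helper name and the placeholder constant C = 1 are our assumptions; the label-noise term Δn is omitted):

```python
import math

def gap_bound(L_ell, L_h, D, eps, M, n, delta, C=1.0):
    """Transport component of the schematic generalization bound (label-noise
    term omitted): L_ell * L_h * (2*eps + C*D*sqrt((M + log(1/delta)) / n)),
    where M plays the role of the covering number N(X, d, eps)."""
    return L_ell * L_h * (2 * eps
                          + C * D * math.sqrt((M + math.log(1 / delta)) / n))

# Example: a net of M = 10**4 points at resolution eps = 0.05, n = 10**5 samples.
print(gap_bound(L_ell=1.0, L_h=2.0, D=4.0, eps=0.05, M=10**4, n=10**5, delta=0.01))
```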

We emphasize what is, and is not, explicit in this estimate. The diameter D is elementary to bound from the construction: H0 = Brd has diameter 2r, and each recursion step adds bounded relation and edge-attribute penalties, so D can be bounded in terms of (r, s, L, λR, λE) alone. By contrast, sharp control of ε ↦ 𝒩(X, d, ε) is typically delicate. Compactness ensures finiteness of 𝒩(X, d, ε) for each fixed ε > 0, which suffices for qualitative generalization. Quantitative rates require metric entropy estimates for Wasserstein-type spaces and, for the quotient graphon-signal domain, additional control under measure-preserving relabelings. Even for X = 𝒫(K) with K ⊂ ℝd, known bounds for log 𝒩(X, W1, ε) are at best exponential in d and polynomial/exponential in ε−1; under our recursion in L (which alternates products with measure spaces), naive entropy bounds compound rapidly and are not expected to be tight. Accordingly, we regard the estimate as a principled distribution-free guarantee whose constants can be made explicit in special regimes (e.g. low-dimensional features, restricted graph families, or bounded-degree sparsity), while in full generality we rely on compactness to ensure that the covering numbers exist and hence that uniform Monte Carlo estimation applies.


11. Empirical Plan (Optional but Strengthening): correlation of δ with output perturbations; KG and molecule benchmarks; ablations on relation mismatch and edge-feature influence.

Although our main contribution is structural and metric-theoretic, we may strengthen the narrative by an empirical study whose purpose is not to compete with task-optimized architectures, but to test whether the distance δMR-DIDML behaves in practice as the theory suggests: it should track perturbations that matter to relation-aware, edge-conditioned MPNNs, and it should respond in a controlled way to relation mismatches and edge-attribute discrepancies.

Fix an MPNN architecture class of the type covered by Theorem~5 (normalized sum aggregation, relation-conditioned message maps, and a Lipschitz readout), and train a representative model Fθ on a supervised task. For each test input graph G, generate perturbed graphs G′ by applying controlled modifications and measure the pair
(δMR-DIDML(G, G′),  ∥Fθ(G) − Fθ(G′)∥2).
We then evaluate (i) rank correlation (Spearman) between δ and output deviation, and (ii) the tightness of the inequality ∥Fθ(G) − Fθ(G′)∥2 ≤ C δMR-DIDML(G, G′) for an empirically estimated constant C (e.g. the smallest constant satisfying the bound on a calibration set). The perturbation family should include: (a) edge rewiring within a fixed relation r (structure-only changes), (b) relation-label swaps (pure relation mismatch), (c) additive noise on edge attributes er (pure attribute mismatch), and (d) node-feature noise (base-layer mismatch). In addition, we may consider compositions of these perturbations to test additivity across coordinates in the recursive construction. A useful diagnostic is to plot robustness curves p ↦ 𝔼[∥Fθ(G) − Fθ(Gp)∥2] versus 𝔼[δ(G, Gp)], where p is a perturbation rate (e.g. fraction of edges rewired).
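A minimal sketch of perturbation family (b) and the correlation diagnostic, with hypothetical names throughout (relation_swap, mr_didm_distance, and a trained-model stand-in):

```python
import numpy as np
from scipy.stats import spearmanr

def relation_swap(A, frac, rng):
    """Perturbation (b): reassign a fraction `frac` of typed edges to a
    uniformly random other relation, keeping endpoints fixed (assumes R >= 2).
    A is a list of R binary adjacency matrices; returns a perturbed copy."""
    A = [a.copy() for a in A]
    R = len(A)
    edges = [(r, i, j) for r in range(R)
             for i, j in zip(*np.nonzero(A[r]))]
    rng.shuffle(edges)
    for r, i, j in edges[:int(frac * len(edges))]:
        r_new = int(rng.choice([s for s in range(R) if s != r]))
        A[r][i, j], A[r_new][i, j] = 0, 1
    return A

# Diagnostic: Spearman correlation between the metric and output deviation
# over perturbed copies G'_1, ..., G'_k of a test graph G, schematically:
#   deltas = [mr_didm_distance(G, Gp, ...) for Gp in perturbed]
#   devs   = [np.linalg.norm(model(G) - model(Gp)) for Gp in perturbed]
#   rho, _ = spearmanr(deltas, devs)
```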

Knowledge graphs provide a canonical multi-relational directed setting (R large, edges typed, directionality essential). We propose to use standard link-prediction datasets (e.g. WN18RR, FB15k-237) and treat each triple (u, r, v) as a directed edge of relation type r. Edge attributes may be absent (q = 0) in the base setting; alternatively, one may attach lightweight attributes such as confidence scores (when available) or learned relation embeddings used only as fixed edge features (to test the edge-attribute coordinate without changing the relational coordinate). The supervised map Fθ can be a graph-level predictor defined on induced subgraphs (e.g. enclosing subgraphs around candidate links), so that Fθ(G) is meaningful for each sampled subgraph instance. In this regime, we expect that increasing λR in the ground cost
c((u, r, a), (v, r′, b)) = Dt − 1[u, v] + λR 1{r ≠ r′} + λE ∥a − b∥2
should increase sensitivity of δ to relation swaps while leaving structure-only perturbations relatively unaffected. A direct experiment is to take a subgraph instance G and form G′ by randomly reassigning a fraction ρ of relation labels among existing edges while keeping endpoints fixed; we then test whether δ(G, G′) increases approximately monotonically in ρ and whether this increase correlates with ∥Fθ(G) − Fθ(G′)∥2.

Molecular graphs naturally instantiate a small-R multi-relational domain by taking relation labels to be bond types (single/double/triple/aromatic) and direction to be either absent (by symmetrization) or present (by encoding an arbitrary but fixed orientation to match our directed formalism). Edge attributes can include geometric bond lengths, stereochemical indicators, or simple physicochemical descriptors computed from coordinates when available. We may use graph regression and classification tasks (e.g. QM9 for regression, ZINC for constrained property prediction) and train an edge-conditioned message passing model. Perturbations can be made physically interpretable: (a) small additive noise to bond lengths (attribute perturbation), (b) bond type flips subject to valence constraints (relation perturbation), and (c) substructure rewiring that preserves degree sequences (structure perturbation). Since many molecular predictors are known to be sensitive to bond order and geometry, we expect δ to reflect such perturbations via the λR and λE channels, respectively, and to correlate with output changes more strongly than distances that ignore edge attributes or relation labels.

We propose explicit ablations that mirror the metric definition: setting λR = 0 (removing the relation-mismatch channel), setting λE = 0 (removing the edge-attribute channel), and varying the recursion depth L.

For each ablation, we report both correlation statistics and a calibration of Ĉ in the inequality ∥Fθ(G) − Fθ(G′)∥2 ≤ Ĉ δ(G, G′), thereby connecting observed stability to the theoretical form of Theorem~5.

To contextualize δMR-DIDML, we compare against (i) feature-only costs (e.g. the ℓ2 distance on pooled node features), (ii) relation-agnostic variants (collapse all r), and (iii) graph distances based on Gromov–Wasserstein or assignment costs when feasible. For computation, we implement the layered OT recursion as in Algorithm~MR-DIDM-Distance, using exact min-cost flow for small graphs and an entropic approximation (Sinkhorn) for larger instances to enable systematic sweeps over L and hyperparameters. The empirical study should explicitly record runtime as a function of N and L, as well as the sensitivity of the correlation metrics to approximate OT tolerances, since in practice one may trade exactness for throughput while retaining the qualitative behavior predicted by the Lipschitz framework.
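To make the entropic variant of the layered recursion concrete, here is a simplified sketch of a single layer: each (i, j) subproblem couples the two neighborhood measures under the block ground cost, with uniform neighborhood weights and a hand-rolled Sinkhorn solver standing in for Algorithm~MR-DIDM-Distance (both simplifications are assumptions of the sketch; the base layer D0 and the final aggregation into δMR-DIDML are omitted).

```python
import numpy as np

def sinkhorn_cost(mu, nu, C, eps=0.05, iters=500):
    """Entropic OT cost <P, C> between discrete measures mu, nu with
    cost matrix C, via plain Sinkhorn iterations."""
    K = np.exp(-C / eps)
    u = np.ones_like(mu)
    for _ in range(iters):
        v = nu / (K.T @ u)
        u = mu / (K @ v)
    P = u[:, None] * K * v[None, :]
    return float((P * C).sum())

def next_layer_distances(D_prev, nbrs, lam_R, lam_E):
    """One layer of the recursion: D_t[i, j] = OT between the neighborhood
    measures of i and j under the block ground cost.

    `nbrs[i]` is a list of (j, r, a) triples (neighbor index, relation
    label, edge-attribute vector); every node is assumed to have at least
    one neighbor."""
    N = len(nbrs)
    D = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            # c((u,r,a),(v,r',b)) = D_{t-1}[u,v] + lam_R*1{r != r'} + lam_E*||a-b||_2
            C = np.array([[D_prev[u, v] + lam_R * (r != s)
                           + lam_E * np.linalg.norm(a - b)
                           for (v, s, b) in nbrs[j]]
                          for (u, r, a) in nbrs[i]])
            mu = np.full(len(nbrs[i]), 1.0 / len(nbrs[i]))
            nu = np.full(len(nbrs[j]), 1.0 / len(nbrs[j]))
            D[i, j] = sinkhorn_cost(mu, nu, C)
    return D
```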


12. Limitations and Future Work: sparse regime, attention/transformers, quantitative approximation rates, explicit metric entropy.

Our development is deliberately aligned with the normalized-sum, relation-aware, edge-conditioned MPNN formalism, and with a recursion depth L that is fixed when defining δMR-DIDML. This choice yields a clean compactness–stability–universality story, but it also exposes several limitations that suggest natural extensions.

The exact computation guarantee of Theorem~8 is worst-case dense and scales as O(LN5) when we instantiate the recursion naively with N2 transport subproblems per layer, each on supports of size O(N). In many domains (knowledge graphs, molecules, and most relational data), the graphs are sparse and degrees are bounded or heavy-tailed. In that setting, the last-coordinate neighborhood measures in the order-t IDMs have supports of size closer to deg(i) than to N, and thus the effective cost of each OT call is governed by deg(i) · deg(j) rather than N2. It would be valuable to formalize an explicit complexity bound, stated directly in terms of the degree sequences and the number of nonzeros per relation, and to identify the appropriate data structures so that the recursion can be implemented without materializing dense N × N cost matrices Dt. A related direction is to exploit the fact that the ground costs used in the OT subproblems decompose as
c((u, r, a), (v, r′, b)) = Dt − 1[u, v] + λR1{r ≠ r′} + λE∥a − b∥2,
so that one may hope for incremental updates or warm-started min-cost flow solvers across (i, j) pairs or across layers t. At present, our theory guarantees polynomial-time computability but does not provide the strongest possible bounds for sparse inputs, nor does it characterize the best achievable asymptotics under commonly used sparsity models.
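The decomposition suggests caching: only the Dt − 1 block changes across layers, so the relation and attribute blocks can be computed once per (i, j) pair at size deg(i) × deg(j), never materializing a dense N × N cost matrix. A sketch under the same neighbor-list convention as the earlier sketch:

```python
import numpy as np

def static_cost_blocks(nbrs, lam_R, lam_E):
    """Precompute the layer-independent part of each (i, j) ground cost,
    lam_R * 1{r != r'} + lam_E * ||a - b||_2, of size deg(i) x deg(j)."""
    N = len(nbrs)
    blocks = {}
    for i in range(N):
        for j in range(N):
            blocks[(i, j)] = np.array(
                [[lam_R * (r != s) + lam_E * np.linalg.norm(a - b)
                  for (_, s, b) in nbrs[j]]
                 for (_, r, a) in nbrs[i]])
    return blocks

def layer_cost(i, j, nbrs, D_prev, blocks):
    """Assemble the layer-t ground cost for the (i, j) OT subproblem by
    adding the current D_{t-1} sub-block to the cached static block."""
    rows = [u for (u, _, _) in nbrs[i]]
    cols = [v for (v, _, _) in nbrs[j]]
    return D_prev[np.ix_(rows, cols)] + blocks[(i, j)]
```

Warm-starting the (i, j) solver at layer t from its layer t − 1 solution is the natural next step, though we make no claim here about its worst-case benefit.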

Our factorization and stability results rely on the fact that the per-node aggregation at each layer can be written as an integral of a fixed Lipschitz message map against a neighborhood measure. Attention mechanisms (including graph transformers) typically introduce data-dependent weights, e.g. softmax scores that depend jointly on query and key representations; these weights are not merely a function of an edge label (r, a) but also of the current node embeddings. While one may still view attention as producing an integral against a probability kernel, the kernel itself becomes an output of the model, and the relevant continuity properties are then more subtle: Lipschitzness must control not only the integrand but also the induced reweighting of neighbor measures. An important future direction is to define an enriched IDM hierarchy that can represent such adaptive reweightings. One plausible route is to enlarge the last-coordinate space to include joint or higher-order features (e.g. storing joint distributions of neighbor embeddings and edge attributes), or to move from measures to kernels as coordinates. Doing so would require revisiting compactness (since kernel spaces are larger) and re-proving weak* metrizability for the resulting measure-of-kernels recursion. A second route is to restrict attention to Lipschitz families with bounded temperature and bounded feature norms and to show that the resulting transformer layers are Lipschitz with respect to our existing OT geometry; this would yield stability but likely at the cost of conservative constants.

Our metric definition is exact, and the computation result of Theorem~8 is likewise exact. In practice, one often uses entropic regularization (Sinkhorn) or other approximate OT solvers, and one may also approximate neighborhood measures by sampling (especially when N is large or when the underlying object is a graphon-signal). At present, we do not provide quantitative approximation guarantees that control the deviation between the exact δMR-DIDML and an approximation δ̂L. Two issues arise. First, for entropic OT one would like explicit bounds of the form
|OTd(μ, ν) − OTdε(μ, ν)| ≤ α(ε; d, μ, ν),
with α(ε; ⋅) → 0 as ε ↓ 0, preferably uniform over μ, ν in 𝒫(Ht) and with explicit dependence on the diameter of (Ht, dMR-IDMt). Second, because our ground metric is defined recursively, any error introduced at layer t − 1 influences the costs at layer t, so one needs a stability-to-approximation analysis that tracks how additive or multiplicative errors propagate across the depth L. Establishing such bounds would convert our qualitative robustness statement into a quantitative robustness guarantee, in which one can trade accuracy for speed while maintaining provable control on the induced perturbation of MPNN outputs via Theorem~5.
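The first issue is easy to probe empirically. For uniform measures with equal support sizes the exact OT cost reduces to an assignment problem (the optimal coupling is a permutation, by Birkhoff's theorem), so the per-call entropic gap α(ε) can be measured directly; a minimal sketch follows (the Sinkhorn solver repeats the one above for self-containment, and the random cost matrix is a stand-in for a recursive ground cost):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def sinkhorn_cost(mu, nu, C, eps, iters=2000):
    K = np.exp(-C / eps)
    u = np.ones_like(mu)
    for _ in range(iters):
        v = nu / (K.T @ u)
        u = mu / (K @ v)
    return float(((u[:, None] * K * v[None, :]) * C).sum())

def exact_ot_uniform(C):
    """Exact OT between uniform measures of equal size via assignment."""
    rows, cols = linear_sum_assignment(C)
    return float(C[rows, cols].mean())

rng = np.random.default_rng(0)
n = 50
C = rng.random((n, n))                # stand-in ground cost matrix
mu = nu = np.full(n, 1.0 / n)
exact = exact_ot_uniform(C)
for eps in [1.0, 0.3, 0.1, 0.03]:
    gap = abs(sinkhorn_cost(mu, nu, C, eps) - exact)
    print(f"eps = {eps:4.2f}   empirical alpha(eps) = {gap:.4f}")
```

The harder part, as noted above, is propagating such per-call bounds through the recursive construction of the layer-t costs.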

The distribution-free generalization statement we outline uses compactness to assert finiteness of covering numbers, but it remains non-quantitative. For learning-theoretic purposes, one would like explicit upper bounds on the metric entropy of (𝒫(HL), OTdMR-IDML) or of the quotient space (𝒲r, sd, q, R/∼, δMR-DIDML) in terms of (d, q, R, L) and the radii (r, s), possibly also incorporating constraints such as bounded degree or bounded variation of kernels. Even in classical Wasserstein spaces, sharp entropy estimates are delicate and depend strongly on dimension and regularity; here, the recursive construction increases the effective dimension with L because each layer introduces a measure-valued coordinate. Nevertheless, obtaining even coarse bounds (e.g. of the form log 𝒩(ε) ≤ Cε−β with explicit β > 0) would sharpen the generalization bound in Theorem~9 and clarify how depth affects sample complexity. A particularly relevant question is whether the recursion produces entropy growth that is exponential in L, or whether additional structure (such as bounded supports induced by sparsity, or Lipschitz regularity of kernels) yields milder scaling.

Our universality result is stated on the metric-identified quotient 𝒲r, sd, q, R/∼. While this is the correct domain for continuous function approximation under δMR-DIDML, it may identify objects that are distinct under other equivalence notions (e.g. cut distance for graphons or task-specific invariances). Understanding the relationship between δMR-DIDML-equivalence and classical graphon equivalences, and identifying conditions under which they coincide or differ, remains open. Likewise, our ground costs treat relation labels via the discrete metric 1{r ≠ r′}; in some applications relations have internal geometry (e.g. hierarchical types), suggesting a replacement of the discrete metric by a metric on the relation set, or even by a continuous relation embedding space. Extending the theory to such settings appears straightforward at the level of definitions, but the induced constants and entropy properties may change, and it is not yet clear which choices yield the best empirical and theoretical trade-offs.

In short, the present framework establishes a compact OT geometry that is tailored to relation-aware, edge-conditioned MPNNs and is computable in polynomial time, but it leaves open: (i) sharper sparse and large-scale algorithms, (ii) a principled integration of attention and transformer layers into the IDM/DIDM factorization, (iii) quantitative approximation and depth-dependent error bounds, and (iv) explicit metric entropy estimates enabling sharper, parameter-free generalization rates. These directions are, in our view, the most direct paths toward converting the structural results here into a mature theory of scalable, statistically efficient relational representation learning.