Multi-Relational DIDM Mover’s Distance: Compact, Stable, and Universal Theory for Edge-Attributed GNNs

Table of Contents

  1. Introduction: motivation from edge-rich and multi-relational graphs (KGs, molecules, temporal interaction graphs); limitations of existing compact/separating metrics; summary of contributions.
  2. Preliminaries: measures/weak* topology, (unbalanced) optimal transport; graphon-signals; recap of IDM/DIDM idea and why compactness+separation matter (universality + generalization).
  3. Multi-Relational Directed Edge-Attributed Graphon-Signals: define (W, E, f), metric identification under measure-preserving relabelings; embedding finite graphs as step multi-graphons; discussion of dense-regime scope and relation to practice.
  4. Enriched IDM/DIDM Hierarchy: define Ht, computation IDMs γt, DIDMs ΓL; define recursive ground metric dMR-IDML with relation/edge costs; define δMR-DIDML.
  5. Compactness and Topology: prove Ht and 𝒫(Ht) compact; prove 𝒲r, sd, q, R/∼ compact under δMR-DIDML; relate to cut distance/coarser topologies where appropriate.
  6. Relation-Aware Edge-Conditioned MPNNs: define graph/graphon MPNNs with per-relation kernels and edge-conditioned messages; show factorization through IDMs/DIDMs (equivalence lemma).
  7. Stability: Lipschitz Bounds: inductive Lipschitz proof on IDM space; propagate to DIDMs via OT; explicit dependence on depth, Lipschitz constants, relation count, and edge-attribute bounds.
  8. Fine-Grained Expressivity and Universality: separation theorem (topological converse) and Stone–Weierstrass density on compact domains; discuss what class of continuous functionals is captured.
  9. Algorithmics: Polynomial-Time Computation on Finite Graphs: layered OT recursion; block-structured cost matrices by relation channels; complexity bounds; practical accelerations (optional entropic OT).
  10. Generalization: Covering-Number Bounds: apply uniform Monte Carlo estimation over compact metric domain; discuss dependence on covering numbers and what can/cannot be made explicit.
  11. Empirical Plan (Optional but Strengthening): correlation of δ with output perturbations; KG and molecule benchmarks; ablations on relation mismatch and edge-feature influence.
  12. Limitations and Future Work: sparse regime, attention/transformers, quantitative approximation rates, explicit metric entropy.

Content

1. Introduction: motivation from edge-rich and multi-relational graphs (KGs, molecules, temporal interaction graphs); limitations of existing compact/separating metrics; summary of contributions.

A substantial fraction of contemporary graph learning problems is driven by edge-rich and multi-relational data. Knowledge graphs represent heterogeneous semantics by relation labels and often attach qualifiers or timestamps to edges; molecular graphs carry bond types together with geometric or energetic attributes; temporal and interaction networks record directed events with weights, times, and contexts. In such settings, the inductive bias of message passing is not merely ``aggregate over neighbors’’, but rather ``aggregate over relation- and attribute-conditioned neighborhoods’’, and the analytic objects that govern approximation, stability, and generalization must faithfully encode this additional structure.

We are thus led to a metric-design question: can we endow a space of multi-relational, possibly directed, edge-attributed graph objects with a distance that is simultaneously (i) compatible with the invariances of interest (relabeling of nodes, graphon-type identifications), (ii) compact, so that uniform approximation and distribution-free learning statements become accessible, (iii) fine enough to separate non-isomorphic signals at the appropriate limit level, and (iv) computable on finite graphs in time polynomial in the number of nodes? While there is a large literature on graph distances, metrics that satisfy all four desiderata for relation-aware, edge-attributed graphs are not readily available.

To place the difficulty in context, we recall that many classical distances either disregard rich edge information or are not suited to compactness/separation arguments. Graph edit distances and maximum common subgraph variants are combinatorial and typically intractable; kernel methods (e.g. Weisfeiler–Lehman type constructions) are efficient but depend on fixed feature maps and do not directly yield a compact metric structure on an ambient signal space; spectral or Laplacian-based distances can be unstable under perturbations of the graph and are not, in general, separating modulo node relabeling. Optimal-transport-based notions such as Gromov–Wasserstein are attractive as they incorporate invariances, but their standard forms are not tailored to labeled relations and edge attributes with an explicit recursion matching message passing depth; moreover, separation properties at the level of graphon-signals and compactness of the induced quotient spaces are delicate. Finally, metrics arising from cut norms and related graphon distances yield compactness for unlabeled dense graphs, yet they are not designed to track the attribute- and relation-conditioned aggregations that drive modern edge-conditioned MPNNs.

Our approach is to build the geometry directly from the computation performed by relation-aware, edge-conditioned message passing. At a high level, we attach to each ``node’’ (in a finite graph, or a point in the unit interval representation) a recursively enriched object that records its feature together with a measure-valued summary of its typed and attributed neighborhood. This yields an iterated degree measure (IDM) hierarchy, and by pushing forward the base measure we obtain a distribution of iterated degree measures (DIDM) that represents the entire graph(-signal) as a probability measure on a compact state space. Distances are then defined by (unbalanced) optimal transport between these DIDMs, using a ground cost that itself is defined recursively and incorporates (a) the previous-layer IDM distance between neighbor states, (b) a penalty for relation mismatch, and (c) a discrepancy term for edge attributes. In this way, the metric mirrors the inductive structure of message passing: at each depth, node representations are functions of distributions of neighbor representations enriched with relation labels and edge attributes.

This construction yields a single framework that simultaneously addresses the analytic and algorithmic requirements outlined above. The central point is that, by working with measures on compact spaces and employing weak* topologies, we obtain compactness by design, while the use of optimal transport yields a metrization that is sufficiently strong to control Lipschitz message passing operations. In particular, the distance is not an auxiliary tool: it is the natural ``mover’s distance’’ induced by the computation hierarchy. The resulting metric is also explicitly relation-aware, accommodates directed kernels, and supports edge attributes in a bounded Euclidean domain without sacrificing measurability or compactness.

The contributions may be summarized as follows. First, we define the enriched IDM/DIDM recursion for multi-relational directed edge-attributed graphon-signals and show that, for each fixed depth, the induced IDM spaces and DIDM spaces are compact metrizable. Second, we define the corresponding mover’s distance δMR-DIDML as an optimal-transport distance between DIDMs with a recursively defined ground metric, and we show that this distance metrizes the weak* topology on the relevant measure spaces. Third, we prove a factorization statement: relation-aware edge-conditioned normalized-sum MPNNs at depth t depend on the input only through the order-t IDMs, and graph-level readouts depend only through the order-L DIDM. Fourth, we establish Lipschitz stability: any such MPNN with Lipschitz message/update/readout maps is Lipschitz with respect to δMR-DIDML, yielding quantitative robustness to perturbations measured in our geometry. Fifth, we prove a converse-type separation statement: up to arbitrarily small tolerance, if Lipschitz-bounded depth-L MPNNs fail to distinguish two DIDMs, then the DIDMs must be close in the OT sense; this is the key input for Stone–Weierstrass-type arguments. Sixth, we obtain universality: scalar-valued relation-aware edge-conditioned MPNN readouts form a uniformly dense family in the continuous functions on the compact metric-identified domain. Finally, we provide a polynomial-time exact algorithm for finite graphs at fixed depth: the distance can be computed via a layered recursion consisting of N² optimal transport subproblems per layer and one final transport solve, each reducible to min-cost flow, leading to worst-case time O(LN⁵ log N) for graphs with at most N nodes (for fixed relation count).

Two methodological aspects are worth emphasizing. We use unbalanced optimal transport at intermediate stages to accommodate the fact that neighborhood measures may have total mass less than one (e.g. due to normalization conventions or sparsity), which avoids forcing artificial mass-matching and improves stability under degree fluctuations. Moreover, relation labels are incorporated as elements of a discrete metric space, so that relation mismatch contributes additively to the ground cost in a way that is both transparent and compatible with the recursion. These choices yield a metric that is expressive enough to reflect typed, attributed neighborhoods, while remaining analytically tractable and algorithmically reducible to standard transportation problems.

In sum, we provide a compact, separating, relation-aware metric geometry for edge-attributed graphon-signals and their finite counterparts, tailored to the computational structure of edge-conditioned MPNNs. This geometry supports a unified theory in which approximation (universality), stability (Lipschitz bounds), and learning (distribution-free generalization via covering arguments) follow from the same compactness-and-separation backbone, while exact computation on finite graphs remains polynomial-time for fixed depth.


2. Preliminaries: measures/weak* topology, (unbalanced) optimal transport; graphon-signals; recap of IDM/DIDM idea and why compactness+separation matter (universality + generalization).

Let (X, d) be a compact metric space. We write M ≤ 1(X) for the set of finite Borel measures on X with total mass at most one, and 𝒫(X) for Borel probability measures. We will repeatedly use the duality between measures and continuous test functions: for μ ∈ M ≤ 1(X) and φ ∈ C(X, ℝ) we write ⟨μ, φ⟩ := ∫Xφdμ. The weak* topology on M ≤ 1(X) is the coarsest topology such that μ ↦ ⟨μ, φ⟩ is continuous for every φ ∈ C(X, ℝ); equivalently, μn → μ weak* if and only if ∫φdμn → ∫φdμ for all continuous φ. Since X is compact, C(X, ℝ) is a Banach space under the sup norm, and M ≤ 1(X) is naturally identified with a weak*-closed subset of the dual of C(X, ℝ) (Riesz–Markov). In particular, M ≤ 1(X) is weak* compact, and so is 𝒫(X) ⊂ M ≤ 1(X).

For measurable maps T : X → Y between measurable spaces, we denote by T*μ the push-forward measure on Y, defined by (T*μ)(A) := μ(T−1(A)). When X = [0, 1] equipped with Lebesgue measure λ, push-forward along node-level maps will be our basic mechanism for encoding graph(-signal) objects as measures on a state space.

Let (X, d) be a compact metric space. In the balanced setting, for μ, ν ∈ 𝒫(X) one defines the 1-Wasserstein distance
W1(μ, ν) := infπ ∈ Π(μ, ν) ∫X × X d(x, y) dπ(x, y),
where Π(μ, ν) is the set of couplings with marginals μ and ν. We recall that on compact X, W1 metrizes weak* convergence on 𝒫(X): μn → μ weak* if and only if W1(μn, μ) → 0. We also use the corresponding dual form
W1(μ, ν) = sup {∫Xφdμ − ∫Xφdν : φ ∈ Lip1(X)},
which makes explicit why optimal transport is the correct distance for controlling integrals of Lipschitz test functions.

In our recursion, intermediate objects are neighborhood measures whose total mass may be strictly smaller than one. This occurs already in finite graphs with normalized adjacency (e.g. degree normalization) or in sparse regimes where neighborhood measures record only the realized edges. We therefore allow transport on M ≤ 1(X), which permits partial matching of mass together with a penalty for creation/destruction. Concretely, we assume the existence of a functional
OTd : M ≤ 1(X) × M ≤ 1(X) → [0, ∞)
with the following properties (all standard in unbalanced OT constructions on compact spaces): (i) OTd(μ, ν) = 0 if and only if μ = ν; (ii) OTd is jointly lower semicontinuous for weak* convergence; (iii) OTd metrizes weak* convergence on mass-bounded sets; and (iv) OTd satisfies a Lipschitz test-function bound analogous to the balanced case, namely that differences of integrals of 1-Lipschitz functions are controlled by OTd. One may keep in mind a representative formulation via couplings π ∈ M ≤ 1(X × X) with relaxed marginal constraints and a divergence penalty on the mismatch, but the precise choice is not essential for the structural arguments: what matters is that the ground cost is induced by a metric on X and that the resulting transport distance is compatible with weak* compactness.
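For concreteness, one such functional on finitely supported measures can be realized by the classical reservoir reduction to a balanced transportation linear program. The sketch below is a minimal illustration under that assumption (the name ot_unbalanced, the scipy-based solve, and the single per-side reservoir atom are our choices for exposition, not a construction fixed by the text):

```python
import numpy as np
from scipy.optimize import linprog

def ot_unbalanced(mu, nu, C, kappa):
    """One representative unbalanced-OT functional on finitely supported
    measures: pad each side with a 'reservoir' atom that creates/destroys
    mass at cost kappa per unit, then solve the balanced transportation LP.
    (Illustrative choice only; the theory fixes properties, not a formula.)"""
    m, n = len(mu), len(nu)
    Cp = np.full((m + 1, n + 1), float(kappa))
    Cp[:m, :n] = C
    Cp[m, n] = 0.0                        # reservoir-to-reservoir is free
    a = np.append(mu, np.sum(nu))         # padded marginals are balanced:
    b = np.append(nu, np.sum(mu))         # sum(a) == sum(b) by construction
    nv = (m + 1) * (n + 1)                # one LP variable per coupling entry
    A_eq, b_eq = [], []
    for i in range(m + 1):                # row-marginal equality constraints
        row = np.zeros(nv); row[i * (n + 1):(i + 1) * (n + 1)] = 1.0
        A_eq.append(row); b_eq.append(a[i])
    for j in range(n + 1):                # column-marginal equality constraints
        col = np.zeros(nv); col[j::n + 1] = 1.0
        A_eq.append(col); b_eq.append(b[j])
    res = linprog(Cp.ravel(), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, None), method="highs")
    return res.fun
```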

Our ambient objects will be dense-limit graph(-signal) representations. We recall the usual analytic model: a (possibly directed) graphon is a measurable kernel W : [0, 1]2 → [0, 1], interpreted as an edge-probability (or weighted adjacency) function. To encode multi-relational structure, we consider a finite relation set [R] and a tuple W = (Wr)r ∈ [R] of such kernels. To incorporate edge attributes, we attach to each relation an attribute kernel Er : [0, 1]2 → Bsq, where Bsq is the closed Euclidean ball in ℝq of radius s. Finally, node features are modeled by a measurable signal f : [0, 1] → Brd. We write (W, E, f) for the resulting multi-relational directed edge-attributed graphon-signal. The compactness of the feature and attribute ranges will be used repeatedly: it bounds all local states produced by our recursion and ensures that weak* compactness applies at every measure-valued layer.

We briefly recall the organizing principle behind the IDM/DIDM construction. Message passing at a node depends on its own current state together with a collection (in the graphon setting, a distribution) of neighbor states enriched by relation labels and edge attributes. This suggests defining, for each depth t, a state space Ht whose elements consist of a base coordinate (the current node state) and a measure coordinate recording a typed, attributed neighborhood distribution over the previous state space. At depth 0 we take H0 := Brd. Given Ht − 1, we form an enriched neighborhood domain
Zt − 1 := Ht − 1 × ℛ × Bsq,   ℛ = [R] with d(r, r′) = 1{r ≠ r′},
and we set Ht := Ht − 1 × M ≤ 1(Zt − 1), endowed with the product topology where M ≤ 1(Zt − 1) carries the weak* topology. For a given graphon-signal (W, E, f), the computation map γ(W, E, f), t : [0, 1] → Ht assigns to each point x ∈ [0, 1] its order-t enriched state: the first coordinate is γt − 1(x), while the second coordinate is the push-forward of the neighborhood sampling measure on [0, 1] (weighted by the kernels Wr(x, ⋅)) through the map y ↦ (γt − 1(y), r, Er(x, y)), aggregated over relations. The graph-level object is then the DIDM
$$ \Gamma_{(\mathbf{W},\mathbf{E},f),t}\;:=\;(\gamma_{(\mathbf{W},\mathbf{E},f),t})_{*}\lambda\ \in\ \mathcal{P}(H_t), $$
which records the distribution of local computation states under the base measure.

Two structural requirements drive the subsequent theory. First, compactness: since H0 is compact and M ≤ 1(⋅) over a compact space is weak* compact, the recursion produces compact Ht by induction, and hence compact DIDM spaces 𝒫(Ht). This is the backbone for uniform approximation statements, because continuous functions on compact domains are precisely the setting in which Stone–Weierstrass yields uniform density once we have point separation by a function class.

Second, separation and stability: we will equip each Ht with a recursively defined metric dMR-IDMt whose measure coordinate is compared using OTdMR-IDMt − 1 on Zt − 1, augmented by explicit penalties for relation mismatch and edge-attribute discrepancy. The corresponding mover’s distance on DIDMs is then defined by optimal transport between Γ’s using dMR-IDML as ground cost. Because this geometry is built from transport on compact spaces, it both metrizes weak* and controls Lipschitz test functions; consequently it is well suited to proving that Lipschitz message passing maps are stable (small perturbations in δMR-DIDML imply small changes in outputs). Moreover, compactness yields finite covering numbers, which are the key input for distribution-free generalization bounds for Lipschitz hypothesis classes via uniform concentration over metric nets.

We next specialize these preliminaries to the concrete multi-relational directed edge-attributed graphon-signal model, including the relevant identifications under relabelings and the embedding of finite graphs as step objects.


3. Multi-Relational Directed Edge-Attributed Graphon-Signals: define (W, E, f), metric identification under measure-preserving relabelings; embedding finite graphs as step multi-graphons; discussion of dense-regime scope and relation to practice.

Fix a finite relation set [R], a node-feature ball Brd ⊂ ℝd, and an edge-attribute ball Bsq ⊂ ℝq. A multi-relational directed edge-attributed graphon-signal is a triple
(W, E, f) ∈ 𝒲r, sd, q, R,
where W = (Wρ)ρ ∈ [R] is a tuple of measurable kernels Wρ : [0, 1]2 → [0, 1], E = (Eρ)ρ ∈ [R] is a tuple of measurable kernels Eρ : [0, 1]2 → Bsq, and f : [0, 1] → Brd is a measurable node signal. We interpret Wρ(x, y) as the (possibly directed) weight of the relation-ρ edge from x to y, and Eρ(x, y) as the corresponding q-dimensional edge attribute. We do not impose symmetry: in general Wρ(x, y) ≠ Wρ(y, x) and likewise for Eρ, so the model covers directed graphs. In the multi-relational setting, relations are not mutually exclusive: different Wρ may place positive mass on the same pair (x, y), encoding multiplex structure.

Edge attributes should be read through the edge weight: the attribute kernel Eρ(x, y) is always defined as a measurable map into a compact set, but its contribution to any neighborhood aggregate will be weighted by Wρ(x, y). In particular, values of Eρ on pairs where Wρ(x, y) = 0 are immaterial for the induced computation measures we construct later, and hence for the geometric comparisons built from those measures. This convention allows us to keep the state space compact (by forcing Eρ to take values in Bsq everywhere) without constraining the model on null edge sets.

As usual in graphon theory, the analytic representation is not unique: relabeling the latent space [0, 1] by a measure-preserving transformation yields an equivalent object. Let σ : [0, 1] → [0, 1] be a measure-preserving bijection (modulo null sets). We define the relabeled graphon-signal (Wσ, Eσ, fσ) by
Wρσ(x, y) := Wρ(σ(x), σ(y)),   Eρσ(x, y) := Eρ(σ(x), σ(y)),   fσ(x) := f(σ(x)).
We then write
(W, E, f) ∼ (V, F, g)
if there exists such a σ with (V, F, g) = (Wσ, Eσ, fσ) almost everywhere. This is the natural invariance for any node-wise computation whose base distribution on nodes is Lebesgue measure: if we push forward λ along the relabeling, the distribution of local neighborhoods is unchanged. All of our subsequent constructions (IDMs, DIDMs, and the transport distances derived from them) are designed to be invariant under ∼, and the final metric space is the quotient 𝒲r, sd, q, R/∼ obtained by metric identification at distance zero.

We now record the canonical embedding of a finite multi-relational directed edge-attributed graph with node features into 𝒲r, sd, q, R. Let
G = (V, {Eρ}ρ ∈ [R], fG, {eρ}ρ ∈ [R])
be a finite graph with |V| = n, where fG : V → Brd and each relation ρ has an adjacency weight function Aρ : V × V → [0, 1] (with Aρ(i, j) = 0 meaning no ρ-edge) and an edge attribute map eρ : V × V → Bsq (defined on all ordered pairs, with arbitrary values allowed when Aρ(i, j) = 0). Choose an ordering V = {1, …, n} and partition [0, 1] into intervals $I_i:=[\frac{i-1}{n},\frac{i}{n})$ for i = 1, …, n (with the last interval closed at 1). Define the step kernels and signal by, for x ∈ Ii and y ∈ Ij,
WρG(x, y) := Aρ(i, j),   EρG(x, y) := eρ(i, j),   fG(x) := fG(i).
Then (WG, EG, fG) ∈ 𝒲r, sd, q, R is a step multi-graphon-signal representing G. Different orderings of V lead to relabelings of [0, 1] by measure-preserving permutations of the partition, hence to ∼-equivalent objects. This embedding is the bridge between the analytic definitions and the finite-graph computations: the IDM/DIDM maps defined later reduce on step objects to explicit operations on finite neighborhoods with the same normalization induced by Lebesgue measure on the intervals Ii.
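For illustration, a minimal sketch of the step embedding (names hypothetical): evaluating the induced kernel at latent coordinates amounts to an interval lookup.

```python
import numpy as np

def step_kernel(A):
    """Step multi-graphon W_rho^G induced by a weighted adjacency matrix A
    (values in [0, 1]): a point x in I_i = [(i-1)/n, i/n) is mapped to row i.
    Intervals are 0-indexed below, with the last interval closed at 1."""
    n = A.shape[0]
    def W(x, y):
        i = min(int(n * x), n - 1)   # index of the interval containing x
        j = min(int(n * y), n - 1)
        return A[i, j]
    return W

# Example: a directed 3-cycle under a single relation.
A = np.array([[0, 1, 0], [0, 0, 1], [1, 0, 0]], dtype=float)
W = step_kernel(A)
assert W(0.1, 0.5) == 1.0   # x in the 1st interval, y in the 2nd: edge present
```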

The step embedding above is compatible with the convention that node neighborhoods are recorded as measures with total mass at most one. Indeed, in the graphon setting, neighborhood aggregation at a point x uses the kernel values Wρ(x, ⋅) as densities against λ(dy). For step objects, this corresponds to distributing mass $\frac{1}{n}A_\rho(i,j)$ to each neighbor j under relation ρ. In particular, even when Aρ(i, j) ∈ {0, 1}, the total outgoing mass of a node equals its (weighted) out-degree divided by n, which may be strictly smaller than one. This is precisely why we work with M ≤ 1(⋅) and allow unbalanced transport in subsequent layers.

The representation (W, E, f) is the dense-limit model: Wρ is a bounded kernel on [0, 1]2 rather than a σ-finite object as in graphex/sparse graph limit theories. Consequently, the most direct statistical interpretation is as a limit of dense graphs (or dense weighted graphs) with multiple edge types and attributes. This choice is deliberate: it yields compact state spaces once node and edge features are bounded, and compactness is the structural condition enabling weak* metrizability, separation results, and uniform approximation via Stone–Weierstrass in later sections.

At the same time, the finite-graph distance computation we ultimately obtain does not require dense inputs: if the adjacency matrices are sparse, then the induced neighborhood measures simply have small total mass. Unbalanced transport then compares such measures without forcing artificial normalization. Thus sparsity manifests in computation as smaller supports and smaller transported mass, which can be exploited algorithmically even though the analytic ambient class is dense. We emphasize that we do not claim to develop a full sparse-limit theory here; rather, we use the dense graphon-signal formalism as a compact analytic envelope within which our recursion and stability/expressivity arguments can be carried out cleanly, while still allowing finite sparse graphs to be embedded and compared.

The multi-relational directed edge-attributed graphon-signal formalism provides (i) a common language for finite and limit objects, (ii) a canonical invariance under measure-preserving relabelings, and (iii) bounded feature/attribute ranges ensuring compactness at every recursive stage. In the next section we build the enriched IDM/DIDM hierarchy on top of this representation and define the relation- and attribute-aware transport metrics used throughout the theory.


4. Enriched IDM/DIDM Hierarchy: define Ht, computation IDMs γt, DIDMs ΓL; define recursive ground metric dMR-IDML with relation/edge costs; define δMR-DIDML.

We define a hierarchy of compact IDM state spaces (Ht)t ≥ 0 which records, at each depth, both a node state and a labeled, edge-attributed neighborhood measure. We set
H0 := Brd.
For t ≥ 1 we define
Ht := Ht − 1 × M ≤ 1 (Ht − 1 × ℛ × Bsq),
equipped with the product σ-algebra and, later, a recursively defined metric. Thus an element h ∈ Ht can be written as h = (hself, hnbh) where hself ∈ Ht − 1 is the ``current’’ state and hnbh is a subprobability measure over triples consisting of a neighbor state in Ht − 1, a relation label in ℛ = [R], and an edge attribute in Bsq.

Fix (W, E, f) ∈ 𝒲r, sd, q, R. We define recursively measurable maps
γ(W, E, f), t : [0, 1] → Ht,   t = 0, 1, …,
interpreted as order-t computation-IDMs. At depth t = 0 we set
γ0(x) := f(x) ∈ H0.
Assume γt − 1 has been defined. For each x ∈ [0, 1] we define the outgoing enriched neighborhood measure μx, t ∈ M ≤ 1(Ht − 1 × ℛ × Bsq) by specifying its action on measurable sets A ⊆ Ht − 1 × ℛ × Bsq:
$$ \mu_{x,t}(A) \;:=\; \frac{1}{R}\sum_{\rho\in[R]}\int_0^1 W_\rho(x,y)\,\mathbf{1}\bigl\{(\gamma_{t-1}(y),\rho,E_\rho(x,y))\in A\bigr\}\,dy. $$
The factor 1/R is a convenient normalization ensuring μx, t has total mass at most one:
$$ \mu_{x,t}\bigl(H_{t-1}\times\mathcal{R}\times B_s^q\bigr) = \frac{1}{R}\sum_{\rho\in[R]}\int_0^1 W_\rho(x,y)\,dy \le 1, $$
since each Wρ is bounded by 1. (Any other fixed normalization can be absorbed into constants in later Lipschitz bounds; we choose to remain within M ≤ 1(⋅) at every depth.) We then set
γt(x) := (γt − 1(x), μx, t) ∈ Ht.
By construction, μx, t depends on (W, E) only through the weighted push-forward of Lebesgue measure by the map y ↦ (γt − 1(y), ρ, Eρ(x, y)), with weight Wρ(x, y); thus edge attributes are incorporated only through the edges on which they are weighted, in accordance with our measurability convention from Section 3.
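On a step object this measure is purely atomic and can be materialized explicitly. A minimal sketch (hypothetical helper; the node index u stands in for the state γt − 1(u), as in the layered algorithm of Section 9):

```python
def neighborhood_atoms(A, e, i):
    """Atoms of mu_{i,t} for node i of a finite graph: each rho-edge (i -> u)
    contributes mass A[rho][i, u] / (R * n) at the atom (u, rho, e[rho][i, u]),
    with the node index u standing in for the state gamma_{t-1}(u).
    Here A is a list of n x n weight matrices, e a list of n x n x q arrays."""
    R, n = len(A), A[0].shape[0]
    return [((u, rho, e[rho][i, u]), A[rho][i, u] / (R * n))
            for rho in range(R) for u in range(n) if A[rho][i, u] > 0]
```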

The depth-L DIDM is the distribution of order-L IDMs under the base node measure λ on [0, 1]. Concretely, for each L ∈ ℕ we define
Γ(W, E, f), L := (γ(W, E, f), L)*λ ∈ 𝒫(HL).
This is the canonical graph-level representation induced by the local recursion: it records the law of rooted depth-L enriched neighborhoods when a root x is sampled uniformly from [0, 1].

We now define metrics dMR-IDMt on Ht recursively. At depth t = 0 we take the Euclidean metric
dMR-IDM0(u, v) := ∥u − v2,   u, v ∈ Brd.
Fix weights λR ≥ 0 and λE ≥ 0. For t ≥ 1, write h = (x, μ) and h′ = (x′, ν) with x, x′ ∈ Ht − 1 and μ, ν ∈ M ≤ 1(Ht − 1 × ℛ × Bsq). We define a ground cost on atoms
ct − 1  : (Ht − 1 × ℛ × Bsq)2 → ℝ ≥ 0
by
$$ c_{t-1}\bigl((x,\rho,a),(x',\rho',a')\bigr) \;:=\; d_{\mathrm{MR\text{-}IDM}}^{t-1}(x,x') \;+\; \lambda_R\,\mathbf{1}\{\rho\neq\rho'\} \;+\; \lambda_E\,\|a-a'\|_2, $$
where 1{ρ ≠ ρ′} is the discrete metric on ℛ. We then set
$$ d_{\mathrm{MR\text{-}IDM}}^{t}(h,h') \;:=\; d_{\mathrm{MR\text{-}IDM}}^{t-1}(x,x') \;+\; OT_{c_{t-1}}(\mu,\nu). $$
Here OTct − 1 denotes an optimal transport distance on subprobability measures, so that OTct − 1(μ, ν) remains finite even when μ and ν have different total masses. This choice is forced by the normalization in the definition of μx, t: neighborhood mass reflects (normalized) outgoing edge mass and may vary across nodes and across graphs.

For a fixed recursion depth L, we define the multi-relational DIDM mover’s distance between two multi-graphon-signals (W, E, f) and (V, F, g) by transporting their depth-L DIDMs with ground metric dMR-IDML:
$$ \delta_{\mathrm{MR\text{-}DIDM}}^{L}\bigl((\mathbf{W},\mathbf{E},f),(\mathbf{V},\mathbf{F},g)\bigr) \;:=\; OT_{d_{\mathrm{MR\text{-}IDM}}^{L}}\bigl(\Gamma_{(\mathbf{W},\mathbf{E},f),L},\ \Gamma_{(\mathbf{V},\mathbf{F},g),L}\bigr). $$
Because Γ(W, E, f), L is defined as a push-forward of Lebesgue measure, it is immediate that relabelings by measure-preserving bijections σ leave Γ(W, E, f), L unchanged, and hence leave δMR-DIDML unchanged. In particular, for step multi-graphon-signals induced by finite graphs, the recursion above reduces to an explicit layered optimal transport computation on finite neighborhoods (with relation mismatch penalized by λR and attribute mismatch penalized by λE), which is the algorithmic form used later to obtain polynomial-time exact computation results.


5. Compactness and Topology: prove Ht and 𝒫(Ht) compact; prove 𝒲r, sd, q, R/∼ compact under δMR-DIDML; relate to cut distance/coarser topologies where appropriate.

We equip each kernel component Wρ with the weak* topology inherited from the unit ball of L∞([0, 1]2), and each coordinate of Eρ with the weak* topology inherited from the ball of radius s in L∞([0, 1]2) (viewing Eρ as a q-tuple of bounded real kernels and enforcing the pointwise constraint Eρ(x, y) ∈ Bsq a.e.). Likewise we equip each coordinate of f with the weak* topology of the ball of radius r in L∞([0, 1]) and enforce f(x) ∈ Brd a.e. Under these choices, the admissible set 𝒲r, sd, q, R is a weak* closed subset of a product of weak* compact sets; hence 𝒲r, sd, q, R is compact by Banach–Alaoglu and Tychonoff. This is the ambient compactness used to extract convergent subsequences of kernels/signals without any regularity beyond boundedness.

We now justify that the recursive state spaces Ht are compact and that their associated measure spaces are compact in the weak* sense. The key observation is that if K is compact Hausdorff, then the set M ≤ 1(K) of finite Borel measures of total mass at most one is weak* compact when identified with a weak* closed subset of the unit ball of C(K)* via the Riesz representation theorem. Moreover, 𝒫(K) is weak* compact for the same reason.

Since ℛ = [R] is finite (hence compact in the discrete topology) and Bsq is compact, we have that Ht − 1 × ℛ × Bsq is compact whenever Ht − 1 is compact. Consequently M ≤ 1(Ht − 1 × ℛ × Bsq) is weak* compact, and then
Ht = Ht − 1 × M ≤ 1 (Ht − 1 × ℛ × Bsq)
is compact as a product of compact sets. Inducting from H0 = Brd yields compactness of every Ht. In particular, for each fixed L, the DIDM codomain 𝒫(HL) is weak* compact.

Compactness alone is not sufficient for quantitative control; we also require that the recursively defined costs induce the relevant weak* topologies. For this we use the standard fact that on a compact metric space (K, d), the 1-Wasserstein distance W1(d) metrizes weak* convergence on 𝒫(K). The same conclusion holds for customary optimal transport distances on M ≤ 1(K) (e.g. formulations via semi-couplings or via a reservoir point with finite creation/destruction cost), provided the transport cost is continuous and bounded on K × K; we write OTd for a fixed such choice.

Applying this at each depth, we note that if dMR-IDMt − 1 is a metric inducing the topology of Ht − 1, then the atom cost ct − 1 defined in Section 4 is continuous and bounded on the compact product Ht − 1 × ℛ × Bsq. Hence OTct − 1 metrizes weak* on M ≤ 1(Ht − 1 × ℛ × Bsq). The recursion then equips Ht with a product-type metric which metrizes precisely the product of the topology of Ht − 1 and the weak* topology on M ≤ 1(⋅). Inducting on t yields that (Ht, dMR-IDMt) is a compact metric space for each t, and that (𝒫(Ht), OTdMR-IDMt) is a compact metric space whose topology agrees with weak* on probability measures.

Fix L. The map
ΦL : 𝒲r, sd, q, R → 𝒫(HL),   ΦL(W, E, f) := Γ(W, E, f), L,
is well-defined by construction. We view δMR-DIDML as the pullback pseudometric
δMR-DIDML(X, Y) = OTdMR-IDML(ΦL(X), ΦL(Y)),
and we denote by ∼ the induced metric identification (i.e. X ∼ Y iff δMR-DIDML(X, Y) = 0). The quotient (𝒲r, sd, q, R/∼, δMR-DIDML) is then isometric to the image ΦL(𝒲r, sd, q, R) equipped with OTdMR-IDML.

Thus, to prove compactness of the quotient, it suffices to show that ΦL(𝒲r, sd, q, R) is compact in 𝒫(HL). Since 𝒲r, sd, q, R is compact in the weak* product topology described above, it is enough to verify that ΦL is continuous with respect to these topologies. This continuity follows by induction on t: for any bounded continuous test function ζ on Ht, the map
(W, E, f) ↦ ∫01ζ(γ(W, E, f), t(x)) dx
depends on (W, E, f) only through iterated operations of (i) composition with continuous functions on compact domains, and (ii) integration of bounded kernels against bounded test functions, as in the recursion defining μx, t. These operations are continuous under weak* convergence of bounded kernels/signals. Therefore ΦL is continuous into the compact metric space (𝒫(HL), OTdMR-IDML), and ΦL(𝒲r, sd, q, R) is compact. The claimed compactness of the quotient follows.

In the classical case R = 1 and without attributes, the cut metric on graphons is a standard relabeling-invariant topology. Our distance δMR-DIDML shares the same invariance under measure-preserving relabelings (as already noted in Section 4), but it is tailored to the depth-L enriched neighborhood statistics encoded by Γ(⋅), L. In particular, cut convergence of kernels (and analogous multi-relational cut-type convergence defined by taking the maximum cut norm over relations) implies convergence of all integrals against bounded test functions of finite depth, and therefore implies δMR-DIDML → 0 for each fixed L. Conversely, because δMR-DIDML only probes depth-L neighborhoods, it is generally coarser than cut-type topologies: distinct kernels may induce the same depth-L DIDM and hence have zero δMR-DIDML at fixed L. This is precisely why we treat L as a resolution parameter in later separation and universality results.


6. Relation-Aware Edge-Conditioned MPNNs: define graph/graphon MPNNs with per-relation kernels and edge-conditioned messages; show factorization through IDMs/DIDMs (equivalence lemma).

We formalize the class of relation-aware, edge-conditioned MPNNs that will be controlled by the multi-relational DIDM distances defined above. Throughout, we fix an embedding width pt ∈ ℕ at each depth t ∈ {0, …, L} and an output dimension m ∈ ℕ. We consider update maps ϕ(t) : ℝpt − 1 × ℝpt → ℝpt, relation-specific message maps φρ(t) : ℝpt − 1 × ℝpt − 1 × Bsq → ℝpt for ρ ∈ [R], and a readout ψ : ℝpL → ℝm. (Lipschitz assumptions on these maps are imposed later, in the stability section.)

Given a multi-graphon-signal (W, E, f) ∈ 𝒲r, sd, q, R, we define node embeddings h(t) : [0, 1] → ℝpt recursively by h(0)(x) := η(0)(f(x)) and, for t = 1, …, L,
$$ h^{(t)}(x) \;:=\; \phi^{(t)}\Bigl(h^{(t-1)}(x),\ \frac{1}{R}\sum_{\rho\in[R]}\int_0^1 W_\rho(x,y)\,\varphi_\rho^{(t)}\bigl(h^{(t-1)}(x),\,h^{(t-1)}(y),\,E_\rho(x,y)\bigr)\,dy\Bigr), $$
where η(0) : Brd → ℝp0 is a fixed measurable (typically continuous) feature lift. The resulting graph-level embedding is defined via average pooling,
$$ F(\mathbf{W},\mathbf{E},f)\;:=\;\psi\Bigl(\int_0^1 h^{(L)}(x)\,dx\Bigr). $$
The normalization implicit in this recursion matches the graphon scaling and the neighborhood measures used in the IDM recursion: neighbors are sampled according to Lebesgue measure and weighted by Wρ(x, y), with relation ρ and attribute Eρ(x, y) exposed to the message map.

For finite graphs G = (V, {Eρ}ρ ∈ [R], f, {eρ}ρ ∈ [R]) with |V| = n, we recover the usual normalized-sum MPNN by replacing integrals with normalized sums. Writing Aρ(i, u) ∈ {0, 1} for adjacency (directed allowed) and eρ(i, u) ∈ Bsq for the edge attribute when Aρ(i, u) = 1, we set
$$ h^{(t)}(i) \;:=\; \phi^{(t)}\Bigl(h^{(t-1)}(i),\ \frac{1}{Rn}\sum_{\rho\in[R]}\sum_{u\in V} A_\rho(i,u)\,\varphi_\rho^{(t)}\bigl(h^{(t-1)}(i),\,h^{(t-1)}(u),\,e_\rho(i,u)\bigr)\Bigr), $$
with readout $F(G):=\psi\bigl(\frac{1}{n}\sum_{i\in V}h^{(L)}(i)\bigr)$. This is consistent with applying the graphon recursion above to the induced step graphon-signal of G.
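For concreteness, the finite recursion can be phrased as a short reference implementation; the sketch below assumes hypothetical callables msgs[t][r] and phis[t] standing in for φρ(t) and ϕ(t), and uses the 1/(Rn) normalization above.

```python
import numpy as np

def mpnn_forward(A, e, X, phis, msgs, readout, L):
    """Normalized-sum relation-aware edge-conditioned MPNN on a finite graph:
    A[r] is an n x n 0/1 adjacency, e[r] an n x n x q attribute array, X an
    n x p0 feature matrix; msgs[t][r] and phis[t] are callables standing in
    for varphi_r^(t) and phi^(t). Returns the graph-level readout."""
    n, R = X.shape[0], len(A)
    H = [X[i] for i in range(n)]
    for t in range(L):
        H_new = []
        for i in range(n):
            agg = 0.0
            for r in range(R):
                for u in np.nonzero(A[r][i])[0]:
                    # message along the typed, attributed edge (i -> u)
                    agg = agg + msgs[t][r](H[i], H[u], e[r][i, u]) / (R * n)
            H_new.append(phis[t](H[i], agg))
        H = H_new
    return readout(np.mean(H, axis=0))
```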

We recall that the order-t enriched IDM γ(W, E, f), t(x) ∈ Ht stores the current state and a measure-valued summary of the 1-hop neighborhood at the previous depth. Concretely, writing γt − 1(y) := γ(W, E, f), t − 1(y) ∈ Ht − 1, the last coordinate of γt(x) is a finite measure μx, t ∈ M ≤ 1(Ht − 1 × ℛ × Bsq) characterized (up to null sets) by
$$ \int g\,d\mu_{x,t} \;=\; \frac{1}{R}\sum_{\rho\in[R]}\int_0^1 W_\rho(x,y)\,g\bigl(\gamma_{t-1}(y),\,\rho,\,E_\rho(x,y)\bigr)\,dy \qquad\text{for all } g\in C\bigl(H_{t-1}\times\mathcal{R}\times B_s^q\bigr). $$
Thus, any computation at depth t that aggregates a continuous function of (γt − 1(y), ρ, Eρ(x, y)) against the weights Wρ(x, y) can be viewed as an integral against μx, t.

This observation yields the equivalence lemma: for each t there is a continuous map Θt : Ht → ℝpt, built from the constituent maps (ϕ(τ), φρ(τ))τ ≤ t, such that h(t)(x) = Θt(γ(W, E, f), t(x)) for all x, and consequently F(W, E, f) = ψ(∫ΘL dΓ(W, E, f), L). This lemma is the precise sense in which our enriched IDMs/DIDMs are the canonical state variables for relation-aware edge-conditioned MPNNs: the network does not access (W, E, f) beyond what is recorded in γt pointwise and in ΓL at the graph level. This factorization is the structural input for the stability bounds that follow, where we will control μ ↦ ∫ΘLdμ by optimal transport on 𝒫(HL).


7. Stability: Lipschitz Bounds: inductive Lipschitz proof on IDM space; propagate to DIDMs via OT; explicit dependence on depth, Lipschitz constants, relation count, and edge-attribute bounds.

We now impose global Lipschitz assumptions on the maps defining the MPNN and prove that the resulting graph-level functional is Lipschitz with respect to δMR-DIDML. The argument proceeds in two steps: we first establish an inductive Lipschitz bound for the node-level factor maps Θt : Ht → ℝpt with respect to the recursive ground metrics dMR-IDMt, and then lift this bound to the graph level by the Kantorovich–Rubinstein control of integrals under optimal transport on 𝒫(HL).

Fix L and assume that η(0) is Lη-Lipschitz (w.r.t. ∥ ⋅ ∥2 on Brd), each update ϕ(t) is Lϕ, t-Lipschitz (w.r.t. the product Euclidean norm), each message map φρ(t) is Lφ, t-Lipschitz uniformly over ρ ∈ [R] (again w.r.t. a product Euclidean norm on ℝpt − 1 × ℝpt − 1 × ℝq), and ψ is Lψ-Lipschitz. Since Ht is compact for each t and Θt will be measurable and (under the above assumptions) continuous, we also have finiteness of the range bounds
Bt := supξ ∈ HtΘt(ξ)∥2 < ∞.
When an explicit estimate is desired, one may upper bound Bt inductively using B0 ≤ supx ∈ Brdη(0)(x)∥2 and the inequalities
sup ∥φρ(t)(h, h′, a)∥2 ≤ ∥φρ(t)(0, 0, 0)∥2 + Lφ, t(2Bt − 1 + s),   a ∈ Bsq,
together with sup ∥ϕ(t)(u, v)∥2 ≤ ∥ϕ(t)(0, 0)∥2 + Lϕ, t(∥u∥2 + ∥v∥2) and the fact that each neighborhood measure has mass ≤ 1.

We write the generic element of Ht as ξ = (z, μ) with z ∈ Ht − 1 and μ ∈ M ≤ 1(Ht − 1 × ℛ × Bsq). Recall that the recursive metric dMR-IDMt has the schematic form
$$ d_{\mathrm{MR\text{-}IDM}}^{t}\bigl((z,\mu),(z',\nu)\bigr) \;=\; d_{\mathrm{MR\text{-}IDM}}^{t-1}(z,z') \;+\; OT_{\ell_{t-1}}(\mu,\nu), $$
where ℓt − 1 is the ground metric on Ht − 1 × ℛ × Bsq given by
ℓt − 1((ξ, ρ, a), (ξ′, ρ′, a′)) := dMR-IDMt − 1(ξ, ξ′) + λR 1{ρ ≠ ρ′} + λE ∥a − a′∥2,
for the fixed weights λR, λE > 0 used in the definition of dMR-IDMt.

Carrying out the induction over t (comparing the self coordinates directly and the measure coordinates via the Kantorovich–Rubinstein bound for OTℓt − 1) yields, for each t, a finite constant Kt such that Θt is Kt-Lipschitz on (Ht, dMR-IDMt), with K0 = Lη and Kt controlled by products of the layer constants Lϕ, τ, Lφ, τ (τ ≤ t) together with the range bounds Bτ. We now propagate the Lipschitz bound to the DIDM level. Recall that the graph-level functional factors as
F(W, E, f) = Ψ(Γ(W, E, f), L),   Ψ(μ) := ψ(∫ΘLdμ).
Since ΘL is KL-Lipschitz on (HL, dMR-IDML), the map μ ↦ ∫ΘLdμ is KL-Lipschitz on (𝒫(HL), OTdMR-IDML) (again by the Kantorovich–Rubinstein bound applied coordinatewise). Composing with the Lψ-Lipschitz readout ψ yields the desired stability estimate with constant C := LψKL.


8. Fine-Grained Expressivity and Universality: separation theorem (topological converse) and Stone–Weierstrass density on compact domains; discuss what class of continuous functionals is captured.

The stability result of Theorem~5 asserts that every depth-L relation-aware edge-conditioned MPNN readout is Lipschitz with respect to δMR-DIDML. We now establish a partial converse: on the compact DIDM domain, agreement of Lipschitz-bounded MPNN readouts forces closeness in the OT-geometry induced by the enriched IDM recursion. Formally, let ℱL, m, C denote the class of ℝm-valued depth-L MPNN readouts Hϕ, ψ (as in the factorization statement) whose constituent maps have Lipschitz constants bounded by C and whose parameters are otherwise unconstrained.

The argument is purely topological and uses the compactness and weak* metrizability of 𝒫(HL) under OTdMR-IDML (Theorems~1–2 in our sequence). Since OTdMR-IDML metrizes weak*, to prove OTdMR-IDML(μ, ν) ≤ ε it suffices to control |∫Φdμ − ∫Φdν| uniformly over a separating set of continuous test functions Φ ∈ C(HL). We therefore show that, for L sufficiently large and C sufficiently permissive, the collection of maps realizable as ΘL : HL → ℝ by depth-L relation-aware edge-conditioned architectures is (after closing under linear combinations and products) uniformly dense in C(HL). In particular, for any Φ ∈ C(HL) and η > 0 we can realize Φ̃ = Φ̃ϕ such that supξ ∈ HL|Φ(ξ) − Φ̃(ξ)| ≤ η. By choosing m large enough (or working coordinatewise) and setting ψ to be a linear functional, the graph-level map μ ↦ ∫Φ̃dμ becomes a member of ℱL, m, C up to an arbitrarily small uniform perturbation. The hypothesis that all readouts in ℱL, m, C nearly agree on μ and ν then yields |∫Φdμ − ∫Φdν| small for all Φ in a dense subset of C(HL), hence for all Φ ∈ C(HL) by approximation; weak* proximity follows, and metrizability converts it to a quantitative OT bound.

We make the preceding density statement precise. Let 𝒜L ⊂ C(HL) be the set of scalar node-level maps ΘL realizable by depth-L relation-aware, edge-conditioned normalized-sum architectures with continuous constituent maps. By closure properties of continuous composition, 𝒜L is closed under multiplication after enlarging the architecture class in the standard way (e.g., by concatenating embeddings and applying a final bilinear map), and it contains constants. Moreover, 𝒜L separates points of HL once L is sufficiently large: if ξ ≠ ξ′ in HL, then either their HL − 1-coordinates differ, in which case the inductive expressivity at depth L − 1 separates them, or their last-coordinate measures in M ≤ 1(HL − 1 × ℛ × Bsq) differ. In the latter case there exists a continuous test function on HL − 1 × ℛ × Bsq witnessing the discrepancy, and by the edge-conditioned form of φρ(L) we can approximate such tests by message functions of (ξ, ρ, a), including dependence on the relation label ρ and edge attribute a ∈ Bsq (in particular, the discrete metric on ℛ permits exact gating by ρ at bounded Lipschitz cost). Integrating these messages against the last-coordinate measure produces a scalar summary that differs between ξ and ξ′.

Since HL is compact Hausdorff, the Stone–Weierstrass theorem applies: the uniform closure of any subalgebra of C(HL) that contains constants and separates points equals C(HL). Thus, after taking the algebra generated by 𝒜L, we obtain uniform density in C(HL). This yields point separation of measures on HL via the standard corollary: if μ ≠ ν are Borel probability measures on a compact metric space, then there exists Φ ∈ C(HL) with Φdμ ≠ ∫Φdν, and we may approximate Φ uniformly by functions realized by node-level MPNN embeddings.

We now lift the preceding density to functionals on the graphon-signal quotient (𝒲r, sd, q, R/∼,δMR-DIDML). Recall that every depth-L readout factors through the DIDM,
(W, E, f) ↦ Γ(W, E, f), L ∈ 𝒫(HL),   μ ↦ ψ(∫ΘLdμ).
Since (W, E, f) ↦ Γ(W, E, f), L is continuous and the quotient is compact, it follows that any continuous functional on 𝒲r, sd, q, R/∼ may be uniformly approximated by compositions of continuous functionals on 𝒫(HL) with the DIDM map. Combining this with the Stone–Weierstrass density on HL (and the fact that integration μ ↦ ∫Φdμ produces a separating family of continuous functions on 𝒫(HL)), we obtain the following.

The stability and universality theorems together identify δMR-DIDML as the appropriate topology for depth-L relation-aware message passing: (i) every such readout is continuous (indeed Lipschitz) in this topology; (ii) conversely, up to uniform approximation, continuous functionals on the compact quotient are representable. Equivalently, depth-L architectures capture precisely those graphon-signal functionals that depend continuously on the order-L enriched neighborhood statistics encoded by the DIDM Γ(W, E, f), L, where the enrichment includes the relation channel and edge attribute space Bsq. We next turn to the algorithmic question of computing δMR-DIDML on finite graphs.


9. Algorithmics: Polynomial-Time Computation on Finite Graphs: layered OT recursion; block-structured cost matrices by relation channels; complexity bounds; practical accelerations (optional entropic OT).

We now address the computational problem suggested by the preceding topological results: given two finite multi-relational directed edge-attributed graphs
G = (V, {Er}r ∈ [R], f, {er}r ∈ [R]),   H = (U, {Fr}r ∈ [R], g, {ẽr}r ∈ [R]),
how do we compute the induced distance δMR-DIDML(G, H) exactly, and what is the worst-case complexity? Throughout we assume |V|,|U| ≤ N and that L and R are fixed constants (as in the metric definition); arithmetic is performed in the unit-cost RAM model on O(log N)-bit numbers.

The finite-graph computation mirrors the enriched IDM recursion on step graphon-signals. We maintain an |V|×|U| cost matrix Dt whose entry Dt(i, j) represents the order-t ground distance dMR-IDMt between the order-t IDMs rooted at i ∈ V and j ∈ U. At depth t = 0 we set
D0(i, j) := ∥f(i) − g(j)∥2,
which is well-defined since node features lie in the compact ball Brd.

For each layer t ≥ 1, and for each node pair (i, j) ∈ V × U, we form a source measure μi, t and a target measure νj, t supported on neighbor-atoms enriched with relation labels and edge attributes. Concretely, write ArG(i, u) ∈ {0, 1} for the indicator of a directed r-edge from i to u in G (and similarly ArH(j, v)). We define
$$ \mu_{i,t}\;:=\;\frac{1}{R}\sum_{r\in[R]}\sum_{u\in V}\frac{A_r^G(i,u)}{|V|}\,\delta_{(u,r,e_r(i,u))}, \qquad \nu_{j,t}\;:=\;\frac{1}{R}\sum_{r\in[R]}\sum_{v\in U}\frac{A_r^H(j,v)}{|U|}\,\delta_{(v,r,\tilde e_r(j,v))}, $$
which are finite measures of total mass at most 1 (indeed, equal to ∑r degr(i)/(R|V|) and ∑r degr(j)/(R|U|), matching the 1/R normalization of the graphon recursion). The ground cost between atoms (u, r, a) and (v, r′, b) is taken to be
ct((u, r, a), (v, r′, b)) := Dt − 1(u, v) + λR 1{r ≠ r′} + λE ∥a − b∥2,
with fixed weights λR, λE ≥ 0. We then set the transport discrepancy
Tt(i, j) := OTct(μi, t, νj, t),   and   Dt := Dt − 1 + Tt,
where OTct denotes the chosen (possibly unbalanced) transport objective with ground costs ct.

Each quantity Tt(i, j) is an optimal transport problem on a finite support of size at most R|V| versus R|U|, hence O(N) for fixed R. In the balanced case (equal total mass), Tt(i, j) is exactly a transportation linear program on a complete bipartite graph with O(N²) edges, solvable in polynomial time by standard min-cost flow reductions. In the unbalanced setting (total masses differ), we can reduce to a balanced transportation instance by adjoining a dummy ``reservoir’’ node on each side: the dummy absorbs excess mass with a prescribed deletion/insertion cost (equivalently, one uses an unbalanced OT objective that can be written as a min-cost flow on an augmented bipartite network). Since the augmentation increases the support size by at most a constant factor, the polynomial-time solvability and asymptotic bounds are unchanged.
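As a small worked example of the reservoir reduction (reusing the ot_unbalanced sketch from Section 2), a mass mismatch of 0.3 is simply created/destroyed at cost κ:

```python
import numpy as np

# mu has mass 0.6 on two atoms, nu has mass 0.3 on one atom; the excess
# 0.3 is routed to the reservoir at cost kappa = 1.0 per unit of mass.
mu, nu = np.array([0.3, 0.3]), np.array([0.3])
C = np.array([[0.2],
              [0.5]])
print(ot_unbalanced(mu, nu, C, kappa=1.0))  # 0.3*0.2 + 0.3*1.0 = 0.36
```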

After completing the L layers, we perform a final transport between the uniform node measures on V and U with ground cost DL:
$$ \delta^{L}_{\mathrm{MR\text{-}DIDM}}(G,H) \;:=\; OT_{D_L}\Bigl(\tfrac{1}{|V|}\sum_{i\in V}\delta_i,\;\tfrac{1}{|U|}\sum_{j\in U}\delta_j\Bigr). $$
This last step is again a transportation/min-cost flow problem on O(N) nodes per side.
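Assembling the layers gives a compact reference implementation of the whole computation. The sketch below is a naive dense instantiation under our own naming and argument-layout assumptions (and it reuses the ot_unbalanced helper from Section 2); it realizes the updates D0, Tt, Dt and the final transport described above.

```python
import numpy as np

def mr_didm_distance(A_G, e_G, f_G, A_H, e_H, g_H, L,
                     lam_R=1.0, lam_E=1.0, kappa=1.0):
    """Layered MR-DIDM recursion on two finite graphs: A_G[r] is an n x n
    0/1 adjacency for relation r, e_G[r] an n x n x q attribute array, f_G
    an n x d feature matrix (likewise for H). Requires the ot_unbalanced
    helper sketched in Section 2."""
    n, m, R = len(f_G), len(g_H), len(A_G)
    # D_0(i, j): Euclidean distance between node features.
    D = np.linalg.norm(f_G[:, None, :] - g_H[None, :, :], axis=2)
    for _ in range(L):
        T = np.zeros((n, m))
        for i in range(n):
            for j in range(m):
                mu = [(u, r) for r in range(R)
                      for u in np.nonzero(A_G[r][i])[0]]
                nu = [(v, r) for r in range(R)
                      for v in np.nonzero(A_H[r][j])[0]]
                if not mu and not nu:
                    continue            # both neighborhoods empty
                # Block-structured ground cost over enriched atoms.
                C = np.zeros((len(mu), len(nu)))
                for a, (u, r) in enumerate(mu):
                    for b, (v, rp) in enumerate(nu):
                        C[a, b] = (D[u, v] + lam_R * (r != rp)
                                   + lam_E * np.linalg.norm(
                                         e_G[r][i, u] - e_H[rp][j, v]))
                w_mu = np.full(len(mu), 1.0 / (R * n))
                w_nu = np.full(len(nu), 1.0 / (R * m))
                T[i, j] = ot_unbalanced(w_mu, w_nu, C, kappa)
        D = D + T
    # Final transport between uniform node measures with ground cost D_L.
    return ot_unbalanced(np.full(n, 1.0 / n), np.full(m, 1.0 / m), D, kappa)
```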

A salient feature of the multi-relational setting is that the cost matrix in each neighborhood OT admits an explicit block structure. If we order the atoms of μi, t (and νj, t) by their relation labels, the ground cost matrix decomposes into R × R blocks indexed by (r, r′). The diagonal blocks (r = r′) have entries
Dt − 1(u, v) + λE ∥er(i, u) − ẽr(j, v)∥2,
while every off-diagonal block (r ≠ r′) is obtained from the corresponding diagonal form by adding the constant penalty λR. Thus, for fixed (i, j), once the base matrix (Dt − 1(u, v))u, v is available, the additional overhead to incorporate relations is purely additive and can be implemented without changing the asymptotic order. In particular, for fixed q the edge-attribute term is a Euclidean distance in Bsq and can be computed on the fly (or precomputed per (i, j) and relation-pair) within the same polynomial budget.

Let n := |V| and m := |U| and set N := max {n, m}. For each layer t, we must solve nm ≤ N² neighborhood OT problems, each over supports of size O(RN) = O(N) for fixed R. Using a strongly polynomial min-cost flow algorithm or a standard network simplex bound, one may take the running time of each such OT solve to be O(N³ log N) (up to constants depending on the precise flow routine and the fixed dimensions q, d through cost evaluation). Consequently, each layer costs O(N² · N³ log N) = O(N⁵ log N) time, and the L layers contribute O(LN⁵ log N). The final OT solve contributes an additional O(N³ log N), which is dominated. Memory usage is governed by storage of the current ground matrix Dt − 1 and the updated matrix Dt, which can be streamed to maintain O(N²) working space.

The stated O(LN⁵ log N) bound is a dense worst-case guarantee. In typical sparse graphs, each measure μi, t is supported only on actual neighbors of i across all relations; thus its support size is O(∑r degr(i)) rather than O(N). In that regime, the neighborhood OT problems shrink substantially, and the empirical runtime is closer to a degree-weighted sum over node pairs (and can be reduced further by restricting transport to truncated neighborhoods if one is willing to approximate the metric).

A separate, widely used acceleration is entropic regularization. Replacing exact OT by its entropic variant yields Sinkhorn iterations with per-iteration cost essentially quadratic in the support size, often leading to substantial speedups in practice. For our purposes this is optional: entropic OT produces an approximation to δMR-DIDML with controllable error, but the core theoretical statements (compactness, stability, and exact polynomial-time computability) concern the unregularized distance. Finally, the block structure induced by relations is amenable to implementation-level optimizations (e.g., precomputing relation-gated costs or reusing portions of the cost matrix across (i, j)), which improves constants but does not alter the worst-case asymptotic order.


10. Generalization: Covering-Number Bounds: apply uniform Monte Carlo estimation over compact metric domain; discuss dependence on covering numbers and what can/cannot be made explicit.

We briefly explain how compactness of our metric domains yields distribution-free generalization guarantees for Lipschitz-bounded predictors, and we make explicit the precise point at which covering numbers enter. Let (X, d) denote either of the compact metric spaces
X = (𝒫(HL), OTdMR-IDML)   or   X = (𝒲r, sd, q, R/∼, δMR-DIDML),
with L fixed. In either case, compactness implies total boundedness: for every ε > 0 there exists a finite ε-net, and hence the covering number
$$ \mathcal{N}(X,d,\varepsilon)\;:=\;\min\Bigl\{M:\exists x_1,\dots,x_M\in X\text{ with }X\subseteq\bigcup_{k=1}^M B_d(x_k,\varepsilon)\Bigr\} $$
is finite.

We consider an i.i.d. sample (xi, yi)i = 1n drawn from an arbitrary distribution P on X × Y, where Y ⊂ ℝp is bounded. For a predictor h : X → ℝp and a loss ℓ : ℝp × Y → ℝ, define the population and empirical risks
$$ \mathcal{R}(h)\;:=\;\mathbb{E}_{(x,y)\sim\mathsf{P}}\bigl[\ell(h(x),y)\bigr], \qquad \widehat{\mathcal{R}}_n(h)\;:=\;\frac{1}{n}\sum_{i=1}^n \ell\bigl(h(x_i),y_i\bigr). $$
Assume ℓ(⋅, y) is Lℓ-Lipschitz uniformly in y and bounded in an interval of length B (e.g. ℓ ∈ [0, 1]). If h is Lh-Lipschitz with respect to d, then for each fixed y the map x ↦ ℓ(h(x), y) is LℓLh-Lipschitz on X. In particular, for any two probability measures μ, ν ∈ 𝒫(X) one has the standard transport bound
$$ \Bigl|\int_X \varphi\,d\mu-\int_X \varphi\,d\nu\Bigr| \;\le\; \mathrm{Lip}(\varphi)\; OT_d(\mu,\nu) \qquad\text{for all Lipschitz } \varphi: X\to\mathbb{R}, $$
valid for 1-Wasserstein by Kantorovich–Rubinstein duality and, for the unbalanced variants we allow, as a direct consequence of their dual feasibility constraints (hence as an inequality sufficient for stability). Applying this bound with φ(x) = ℓ(h(x), y) and then averaging over y yields
$$ \bigl|\mathcal{R}(h)-\widehat{\mathcal{R}}_n(h)\bigr| \;\le\; L_\ell L_h\; OT_d\bigl(\mathsf{P}_X,\widehat{\mathsf{P}}_{X,n}\bigr) \;+\; \Delta_n, $$
where PX is the X-marginal, $\widehat{\mathsf{P}}_{X,n}:=\frac{1}{n}\sum_{i=1}^n\delta_{x_i}$ is the empirical input measure, and Δn collects the residual label-sampling error. The additional term Δn accounts for sampling noise in y even at fixed x (and is bounded by a standard Hoeffding or Bernstein inequality using B); for our present purpose, the salient point is that the hypothesis h enters only through its Lipschitz constant Lh, while the data-dependent component is $OT_d(\mathsf{P}_X,\widehat{\mathsf{P}}_{X,n})$.

We thus reduce the problem to controlling the empirical transport discrepancy $OT_d(\mathsf{P}_X,\widehat{\mathsf{P}}_{X,n})$ uniformly over all data distributions. A convenient route, which makes the dependence on the metric entropy of (X, d) explicit, is to compare PX to a discretization on an ε-net. Fix ε > 0, let {z1, …, zM} be an ε-net with M = 𝒩(X, d, ε), and choose a measurable nearest-net projection π : X → {z1, …, zM} with d(x, π(x)) ≤ ε for all x ∈ X. By the triangle inequality and the push-forward contraction property of OTd,
$$ OT_d(\mathsf{P}_X,\widehat{\mathsf{P}}_{X,n}) \;\le\; OT_d(\mathsf{P}_X,\pi_\#\mathsf{P}_X) \;+\; OT_d(\pi_\#\mathsf{P}_X,\pi_\#\widehat{\mathsf{P}}_{X,n}) \;+\; OT_d(\pi_\#\widehat{\mathsf{P}}_{X,n},\widehat{\mathsf{P}}_{X,n}). $$
The first and third terms are at most ε because the coupling (x, π(x)) moves each point by at most ε, hence
$$ OT_d\bigl(\mathsf{P}_X,\pi_\#\mathsf{P}_X\bigr)\;\le\;\varepsilon \qquad\text{and}\qquad OT_d\bigl(\pi_\#\widehat{\mathsf{P}}_{X,n},\widehat{\mathsf{P}}_{X,n}\bigr)\;\le\;\varepsilon. $$
Now π#PX and $\pi_\#\widehat{\mathsf{P}}_{X,n}$ are measures on the finite set {zk}k = 1M. Writing D := diam(X), one has the crude domination OTd(α, β) ≤ D ∥α − β∥TV on any space of diameter D, hence
$$ OT_d(\pi_\#\mathsf{P}_X,\pi_\#\widehat{\mathsf{P}}_{X,n}) \;\le\; D\,\|\pi_\#\mathsf{P}_X-\pi_\#\widehat{\mathsf{P}}_{X,n}\|_{\mathrm{TV}}. $$
Since $\pi_\#\widehat{\mathsf{P}}_{X,n}$ is the empirical distribution of a multinomial variable with M categories, standard concentration for multinomials gives, for example, a bound of the form
$$ \bigl\|\pi_\#\mathsf{P}_X-\pi_\#\widehat{\mathsf{P}}_{X,n}\bigr\|_{\mathrm{TV}} \;\le\; C\sqrt{\frac{M+\log(1/\delta)}{n}} \qquad\text{with probability at least } 1-\delta, $$
where C > 0 is universal. Combining the last three displays yields the schematic high-probability estimate
$$ OT_d\bigl(\mathsf{P}_X,\widehat{\mathsf{P}}_{X,n}\bigr) \;\le\; 2\varepsilon \;+\; C\,D\sqrt{\frac{\mathcal{N}(X,d,\varepsilon)+\log(1/\delta)}{n}} \qquad\text{for every } \varepsilon>0, $$
and plugging this into the risk bound above gives the desired distribution-free control of the generalization gap for any hypothesis class with a uniform Lipschitz bound. In our setting, Theorem~5 provides exactly such a bound for relation-aware edge-conditioned MPNNs in terms of δMR-DIDML (or OTdMR-IDML), so the estimate immediately implies a sample-complexity statement whose dependence on the network parameters enters only through the induced Lipschitz constant.
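As a numerical illustration of how the pieces combine, the following sketch evaluates the transport component of the schematic bound for a given net size and sample count (the helper name and the placeholder constant C = 1 are our assumptions; the label-noise term Δn is omitted):

```python
import math

def gap_bound(L_ell, L_h, D, eps, M, n, delta, C=1.0):
    """Transport component of the schematic generalization bound (label-noise
    term omitted): L_ell * L_h * (2*eps + C*D*sqrt((M + log(1/delta)) / n)),
    where M plays the role of the covering number N(X, d, eps)."""
    return L_ell * L_h * (2 * eps
                          + C * D * math.sqrt((M + math.log(1 / delta)) / n))

# Example: a net of M = 10**4 points at resolution eps = 0.05, n = 10**5 samples.
print(gap_bound(L_ell=1.0, L_h=2.0, D=4.0, eps=0.05, M=10**4, n=10**5, delta=0.01))
```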

We emphasize what is, and is not, explicit in this estimate. The diameter D is elementary to bound from the construction: H0 = Brd has diameter 2r, and each recursion step adds bounded relation and edge-attribute penalties, so D can be bounded in terms of (r, s, L, λR, λE) alone. By contrast, sharp control of ε ↦ 𝒩(X, d, ε) is typically delicate. Compactness ensures finiteness of 𝒩(X, d, ε) for each fixed ε > 0, which suffices for qualitative generalization. Quantitative rates require metric entropy estimates for Wasserstein-type spaces and, for the quotient graphon-signal domain, additional control under measure-preserving relabelings. Even for X = 𝒫(K) with K ⊂ ℝd, known bounds for log 𝒩(X, W1, ε) are at best exponential in d and polynomial/exponential in ε−1; under our recursion in L (which alternates products with measure spaces), naive entropy bounds compound rapidly and are not expected to be tight. Accordingly, we regard the estimate as a principled distribution-free guarantee whose constants can be made explicit in special regimes (e.g. low-dimensional features, restricted graph families, or bounded-degree sparsity), while in full generality we rely on compactness to ensure that the covering numbers exist and hence that uniform Monte Carlo estimation applies.


11. Empirical Plan (Optional but Strengthening): correlation of δ with output perturbations; KG and molecule benchmarks; ablations on relation mismatch and edge-feature influence.

Although our main contribution is structural and metric-theoretic, we may strengthen the narrative by an empirical study whose purpose is not to compete with task-optimized architectures, but to test whether the distance δMR-DIDML behaves in practice as the theory suggests: it should track perturbations that matter to relation-aware, edge-conditioned MPNNs, and it should respond in a controlled way to relation mismatches and edge-attribute discrepancies.

Fix an MPNN architecture class of the type covered by Theorem~5 (normalized sum aggregation, relation-conditioned message maps, and a Lipschitz readout), and train a representative model Fθ on a supervised task. For each test input graph G, generate perturbed graphs G′ by applying controlled modifications and measure the pair
(δMR-DIDML(G, G′),  ∥Fθ(G) − Fθ(G′)∥2).
We then evaluate (i) rank correlation (Spearman) between δ and output deviation, and (ii) the tightness of the inequality ∥Fθ(G) − Fθ(G′)∥2 ≤ C δMR-DIDML(G, G′) for an empirically estimated constant C (e.g. the smallest constant satisfying the bound on a calibration set). The perturbation family should include: (a) edge rewiring within a fixed relation r (structure-only changes), (b) relation-label swaps (pure relation mismatch), (c) additive noise on edge attributes er (pure attribute mismatch), and (d) node-feature noise (base-layer mismatch). In addition, we may consider compositions of these perturbations to test additivity across coordinates in the recursive construction. A useful diagnostic is to plot robustness curves p ↦ 𝔼[∥Fθ(G) − Fθ(Gp)∥2] versus 𝔼[δ(G, Gp)], where p is a perturbation rate (e.g. fraction of edges rewired).
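A minimal sketch of perturbation family (b) and the correlation diagnostic, with hypothetical names throughout (relation_swap, mr_didm_distance, and a trained-model stand-in):

```python
import numpy as np
from scipy.stats import spearmanr

def relation_swap(A, frac, rng):
    """Perturbation (b): reassign a fraction `frac` of typed edges to a
    uniformly random other relation, keeping endpoints fixed (assumes R >= 2).
    A is a list of R binary adjacency matrices; returns a perturbed copy."""
    A = [a.copy() for a in A]
    R = len(A)
    edges = [(r, i, j) for r in range(R)
             for i, j in zip(*np.nonzero(A[r]))]
    rng.shuffle(edges)
    for r, i, j in edges[:int(frac * len(edges))]:
        r_new = int(rng.choice([s for s in range(R) if s != r]))
        A[r][i, j], A[r_new][i, j] = 0, 1
    return A

# Diagnostic: Spearman correlation between the metric and output deviation
# over perturbed copies G'_1, ..., G'_k of a test graph G, schematically:
#   deltas = [mr_didm_distance(G, Gp, ...) for Gp in perturbed]
#   devs   = [np.linalg.norm(model(G) - model(Gp)) for Gp in perturbed]
#   rho, _ = spearmanr(deltas, devs)
```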

Knowledge graphs provide a canonical multi-relational directed setting (R large, edges typed, directionality essential). We propose to use standard link-prediction datasets (e.g. WN18RR, FB15k-237) and treat each triple (u, r, v) as a directed edge of relation type r. Edge attributes may be absent (q = 0) in the base setting; alternatively, one may attach lightweight attributes such as confidence scores (when available) or learned relation embeddings used only as fixed edge features (to test the edge-attribute coordinate without changing the relational coordinate). The supervised map Fθ can be a graph-level predictor defined on induced subgraphs (e.g. enclosing subgraphs around candidate links), so that Fθ(G) is meaningful for each sampled subgraph instance. In this regime, we expect that increasing λR in the ground cost
c((u, r, a), (v, r′, b)) = Dt − 1[u, v] + λR 1{r ≠ r′} + λE ∥a − b∥2
should increase sensitivity of δ to relation swaps while leaving structure-only perturbations relatively unaffected. A direct experiment is to take a subgraph instance G and form G′ by randomly reassigning a fraction ρ of relation labels among existing edges while keeping endpoints fixed; we then test whether δ(G, G′) increases approximately monotonically in ρ and whether this increase correlates with ∥Fθ(G) − Fθ(G′)∥2.

Molecular graphs naturally instantiate a small-R multi-relational domain by taking relation labels to be bond types (single/double/triple/aromatic) and direction to be either absent (by symmetrization) or present (by encoding an arbitrary but fixed orientation to match our directed formalism). Edge attributes can include geometric bond lengths, stereochemical indicators, or simple physicochemical descriptors computed from coordinates when available. We may use graph regression and classification tasks (e.g. QM9 for regression, ZINC for constrained property prediction) and train an edge-conditioned message passing model. Perturbations can be made physically interpretable: (a) small additive noise to bond lengths (attribute perturbation), (b) bond type flips subject to valence constraints (relation perturbation), and (c) substructure rewiring that preserves degree sequences (structure perturbation). Since many molecular predictors are known to be sensitive to bond order and geometry, we expect δ to reflect such perturbations via the λR and λE channels, respectively, and to correlate with output changes more strongly than distances that ignore edge attributes or relation labels.

We propose explicit ablations that mirror the metric definition: setting λR = 0 (removing the relation-mismatch channel), setting λE = 0 (removing the edge-attribute channel), and varying the recursion depth L.

For each ablation, we report both correlation statistics and a calibration of Ĉ in the inequality ∥Fθ(G) − Fθ(G′)∥2 ≤ Ĉ δ(G, G′), thereby connecting observed stability to the theoretical form of Theorem~5.

To contextualize δMR-DIDML, we compare against (i) feature-only costs (e.g. the ℓ2 distance on pooled node features), (ii) relation-agnostic variants (collapse all r), and (iii) graph distances based on Gromov–Wasserstein or assignment costs when feasible. For computation, we implement the layered OT recursion as in Algorithm~MR-DIDM-Distance, using exact min-cost flow for small graphs and an entropic approximation (Sinkhorn) for larger instances to enable systematic sweeps over L and hyperparameters. The empirical study should explicitly record runtime as a function of N and L, as well as the sensitivity of the correlation metrics to approximate OT tolerances, since in practice one may trade exactness for throughput while retaining the qualitative behavior predicted by the Lipschitz framework.
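To make the entropic variant of the layered recursion concrete, here is a simplified sketch of a single layer: each (i, j) subproblem couples the two neighborhood measures under the block ground cost, with uniform neighborhood weights and a hand-rolled Sinkhorn solver standing in for Algorithm~MR-DIDM-Distance (both simplifications are assumptions of the sketch; the base layer D0 and the final aggregation into δMR-DIDML are omitted).

```python
import numpy as np

def sinkhorn_cost(mu, nu, C, eps=0.05, iters=500):
    """Entropic OT cost <P, C> between discrete measures mu, nu with
    cost matrix C, via plain Sinkhorn iterations."""
    K = np.exp(-C / eps)
    u = np.ones_like(mu)
    for _ in range(iters):
        v = nu / (K.T @ u)
        u = mu / (K @ v)
    P = u[:, None] * K * v[None, :]
    return float((P * C).sum())

def next_layer_distances(D_prev, nbrs, lam_R, lam_E):
    """One layer of the recursion: D_t[i, j] = OT between the neighborhood
    measures of i and j under the block ground cost.

    `nbrs[i]` is a list of (j, r, a) triples (neighbor index, relation
    label, edge-attribute vector); every node is assumed to have at least
    one neighbor."""
    N = len(nbrs)
    D = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            # c((u,r,a),(v,r',b)) = D_{t-1}[u,v] + lam_R*1{r != r'} + lam_E*||a-b||_2
            C = np.array([[D_prev[u, v] + lam_R * (r != s)
                           + lam_E * np.linalg.norm(a - b)
                           for (v, s, b) in nbrs[j]]
                          for (u, r, a) in nbrs[i]])
            mu = np.full(len(nbrs[i]), 1.0 / len(nbrs[i]))
            nu = np.full(len(nbrs[j]), 1.0 / len(nbrs[j]))
            D[i, j] = sinkhorn_cost(mu, nu, C)
    return D
```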


12. Limitations and Future Work: sparse regime, attention/transformers, quantitative approximation rates, explicit metric entropy.

Our development is deliberately aligned with the normalized-sum, relation-aware, edge-conditioned MPNN formalism, and with a recursion depth L that is fixed when defining δMR-DIDML. This choice yields a clean compactness–stability–universality story, but it also exposes several limitations that suggest natural extensions.

The exact computation guarantee of Theorem~8 is worst-case dense and scales as O(LN5) when we instantiate the recursion naively with N2 transport subproblems per layer, each on supports of size O(N). In many domains (knowledge graphs, molecules, and most relational data), the graphs are sparse and degrees are bounded or heavy-tailed. In that setting, the last-coordinate neighborhood measures in the order-t IDMs have supports of size closer to deg(i) than to N, and thus the effective cost of each OT call is governed by deg(i) · deg(j) rather than N2. It would be valuable to formalize an explicit complexity bound, stated directly in terms of the degree sequences and the number of nonzeros per relation, and to identify the appropriate data structures so that the recursion can be implemented without materializing dense N × N cost matrices Dt. A related direction is to exploit the fact that the ground costs used in the OT subproblems decompose as
c((u, r, a), (v, r′, b)) = Dt − 1[u, v] + λR1{r ≠ r′} + λE∥a − b∥2,
so that one may hope for incremental updates or warm-started min-cost flow solvers across (i, j) pairs or across layers t. At present, our theory guarantees polynomial-time computability but does not provide the strongest possible bounds for sparse inputs, nor does it characterize the best achievable asymptotics under commonly used sparsity models.
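The decomposition suggests caching: only the Dt − 1 block changes across layers, so the relation and attribute blocks can be computed once per (i, j) pair at size deg(i) × deg(j), never materializing a dense N × N cost matrix. A sketch under the same neighbor-list convention as the earlier sketch:

```python
import numpy as np

def static_cost_blocks(nbrs, lam_R, lam_E):
    """Precompute the layer-independent part of each (i, j) ground cost,
    lam_R * 1{r != r'} + lam_E * ||a - b||_2, of size deg(i) x deg(j)."""
    N = len(nbrs)
    blocks = {}
    for i in range(N):
        for j in range(N):
            blocks[(i, j)] = np.array(
                [[lam_R * (r != s) + lam_E * np.linalg.norm(a - b)
                  for (_, s, b) in nbrs[j]]
                 for (_, r, a) in nbrs[i]])
    return blocks

def layer_cost(i, j, nbrs, D_prev, blocks):
    """Assemble the layer-t ground cost for the (i, j) OT subproblem by
    adding the current D_{t-1} sub-block to the cached static block."""
    rows = [u for (u, _, _) in nbrs[i]]
    cols = [v for (v, _, _) in nbrs[j]]
    return D_prev[np.ix_(rows, cols)] + blocks[(i, j)]
```

Warm-starting the (i, j) solver at layer t from its layer t − 1 solution is the natural next step, though we make no claim here about its worst-case benefit.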

Our factorization and stability results rely on the fact that the per-node aggregation at each layer can be written as an integral of a fixed Lipschitz message map against a neighborhood measure. Attention mechanisms (including graph transformers) typically introduce data-dependent weights, e.g. softmax scores that depend jointly on query and key representations; these weights are not merely a function of an edge label (r, a) but also of the current node embeddings. While one may still view attention as producing an integral against a probability kernel, the kernel itself becomes an output of the model, and the relevant continuity properties are then more subtle: Lipschitzness must control not only the integrand but also the induced reweighting of neighbor measures. An important future direction is to define an enriched IDM hierarchy that can represent such adaptive reweightings. One plausible route is to enlarge the last-coordinate space to include joint or higher-order features (e.g. storing joint distributions of neighbor embeddings and edge attributes), or to move from measures to kernels as coordinates. Doing so would require revisiting compactness (since kernel spaces are larger) and re-proving weak* metrizability for the resulting measure-of-kernels recursion. A second route is to restrict attention to Lipschitz families with bounded temperature and bounded feature norms and to show that the resulting transformer layers are Lipschitz with respect to our existing OT geometry; this would yield stability but likely at the cost of conservative constants.

Our metric definition is exact, and the computation result of Theorem~8 is likewise exact. In practice, one often uses entropic regularization (Sinkhorn) or other approximate OT solvers, and one may also approximate neighborhood measures by sampling (especially when N is large or when the underlying object is a graphon-signal). At present, we do not provide quantitative approximation guarantees that control the deviation between the exact δMR-DIDML and an approximation δ̂L. Two issues arise. First, for entropic OT one would like explicit bounds of the form
|OTd(μ, ν) − OTdε(μ, ν)| ≤ α(ε; d, μ, ν),
with α(ε; ⋅) → 0 as ε ↓ 0, preferably uniform over μ, ν in 𝒫(Ht) and with explicit dependence on the diameter of (Ht, dMR-IDMt). Second, because our ground metric is defined recursively, any error introduced at layer t − 1 influences the costs at layer t, so one needs a stability-to-approximation analysis that tracks how additive or multiplicative errors propagate across the depth L. Establishing such bounds would convert our qualitative robustness statement into a quantitative robustness guarantee, in which one can trade accuracy for speed while maintaining provable control on the induced perturbation of MPNN outputs via Theorem~5.
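The first issue is easy to probe empirically. For uniform measures with equal support sizes the exact OT cost reduces to an assignment problem (the optimal coupling is a permutation, by Birkhoff's theorem), so the per-call entropic gap α(ε) can be measured directly; a minimal sketch follows (the Sinkhorn solver repeats the one above for self-containment, and the random cost matrix is a stand-in for a recursive ground cost):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def sinkhorn_cost(mu, nu, C, eps, iters=2000):
    K = np.exp(-C / eps)
    u = np.ones_like(mu)
    for _ in range(iters):
        v = nu / (K.T @ u)
        u = mu / (K @ v)
    return float(((u[:, None] * K * v[None, :]) * C).sum())

def exact_ot_uniform(C):
    """Exact OT between uniform measures of equal size via assignment."""
    rows, cols = linear_sum_assignment(C)
    return float(C[rows, cols].mean())

rng = np.random.default_rng(0)
n = 50
C = rng.random((n, n))                # stand-in ground cost matrix
mu = nu = np.full(n, 1.0 / n)
exact = exact_ot_uniform(C)
for eps in [1.0, 0.3, 0.1, 0.03]:
    gap = abs(sinkhorn_cost(mu, nu, C, eps) - exact)
    print(f"eps = {eps:4.2f}   empirical alpha(eps) = {gap:.4f}")
```

The harder part, as noted above, is propagating such per-call bounds through the recursive construction of the layer-t costs.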

The distribution-free generalization statement we outline uses compactness to assert finiteness of covering numbers, but it remains non-quantitative. For learning-theoretic purposes, one would like explicit upper bounds on the metric entropy of (𝒫(HL), OTdMR-IDML) or of the quotient space (𝒲r, sd, q, R/∼, δMR-DIDML) in terms of (d, q, R, L) and the radii (r, s), possibly also incorporating constraints such as bounded degree or bounded variation of kernels. Even in classical Wasserstein spaces, sharp entropy estimates are delicate and depend strongly on dimension and regularity; here, the recursive construction increases the effective dimension with L because each layer introduces a measure-valued coordinate. Nevertheless, obtaining even coarse bounds (e.g. of the form log 𝒩(ε) ≤ Cε−β with explicit β > 0) would sharpen the generalization bound in Theorem~9 and clarify how depth affects sample complexity. A particularly relevant question is whether the recursion produces entropy growth that is exponential in L, or whether additional structure (such as bounded supports induced by sparsity, or Lipschitz regularity of kernels) yields milder scaling.

Our universality result is stated on the metric-identified quotient 𝒲r, sd, q, R/∼. While this is the correct domain for continuous function approximation under δMR-DIDML, it may identify objects that are distinct under other equivalence notions (e.g. cut distance for graphons or task-specific invariances). Understanding the relationship between δMR-DIDML-equivalence and classical graphon equivalences, and identifying conditions under which they coincide or differ, remains open. Likewise, our ground costs treat relation labels via the discrete metric 1{r ≠ r′}; in some applications relations have internal geometry (e.g. hierarchical types), suggesting a replacement of the discrete metric by a metric on the relation set, or even by a continuous relation embedding space. Extending the theory to such settings appears straightforward at the level of definitions, but the induced constants and entropy properties may change, and it is not yet clear which choices yield the best empirical and theoretical trade-offs.

In short, the present framework establishes a compact OT geometry that is tailored to relation-aware, edge-conditioned MPNNs and is computable in polynomial time, but it leaves open: (i) sharper sparse and large-scale algorithms, (ii) a principled integration of attention and transformer layers into the IDM/DIDM factorization, (iii) quantitative approximation and depth-dependent error bounds, and (iv) explicit metric entropy estimates enabling sharper, parameter-free generalization rates. These directions are, in our view, the most direct paths toward converting the structural results here into a mature theory of scalable, statistically efficient relational representation learning.