Budgeted Communication Expressiveness: Unifying Message Passing and Graph Transformers via WL Canonicalization

Table of Contents

  1. Introduction: why WL-as-function, why communication budgets matter for 2026 graph Transformers; summary of contributions and how BGFM extends RL-CONGEST.
  2. Related Work: WL/GNN expressiveness, RL-CONGEST and distributed models, attention vs virtual nodes, communication complexity and distributed lower bounds, random features vs shared randomness.
  3. The BGFM Model: formal definition, routed vs direct variants, randomness (private/public), node compute class 𝒞, and how message passing and attention instantiate BGFM.
  4. Problems and Metrics: define WL-ITER (exact), WL-FP (approximate fingerprints), and evaluation metrics (collision probability, type-recovery rate).
  5. Lower Bounds for Exact WL-ITER: communication-complexity reduction (EQ/INDEX), lifting to BGFM, and the role of diameter in the routed variant.
  6. Matching Upper Bounds with IDs: budgeted gather-sort-scatter algorithms; extensions to direct (all-to-all) budgeted communication; discussion of node compute requirements and practical approximations.
  7. Randomness Separations and Approximate WL: public-coin fingerprinting algorithm (depth 1), private-coin limitations under low budget, and a tight tradeoff between bits and collision probability.
  8. Experiments (Optional but Strengthening): synthetic and real graphs; vary attention budget/sparsity; measure WL-type recovery; compare private vs public seeds and ID features; test where theory matches practice.
  9. Extensions and Open Problems: multi-token/hub variants, beyond 1-step WL, tighter expander-based upper bounds under budget, and learnability vs computability.

Content

1. Introduction: why WL-as-function, why communication budgets matter for 2026 graph Transformers; summary of contributions and how BGFM extends RL-CONGEST.

We consider the one-step Weisfeiler–Lehman refinement procedure as an explicit distributed function to be computed under communication constraints. Concretely, an input instance consists of an undirected graph G = (V, E) and an initial color vector x ∈ [p(n)]^n. Each node u forms its local WL-type
$$ t(u)\;=\;\Bigl(x_u,\ \multiset{x_v : v\in N(u)}\Bigr), $$
and the refinement task asks for a new labeling y that is the quotient of V by equality of t(⋅), i.e., yu = yv ⇔ t(u) = t(v). This task, which we denote by WL-ITER, is elementary from the perspective of classical WL theory, yet it is the appropriate unit of computation when one wishes to separate (i) local neighborhood aggregation, which is essentially free once adjacency is locally accessible, from (ii) the global step that turns multisets into collision-free identifiers shared across the graph. The latter step is precisely where limited bandwidth, limited attention, and limited global coordination become mathematically constraining.
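
As a concrete reference, the following minimal sketch (ours, purely illustrative; the dictionary-based graph encoding is an assumption) computes t(u) and an exact one-step refinement centrally. The distributed question studied in this paper is what it costs to realize this same map when no single party holds the whole input.

```python
def wl_iter(adj, x):
    """One step of exact WL refinement (centralized reference sketch).

    adj: dict mapping each node u to an iterable of its neighbors.
    x:   dict mapping each node u to its current color.
    Returns y with y[u] == y[v] iff t(u) == t(v).
    """
    # t(u) = (x_u, multiset of neighbor colors), encoded canonically
    # as (color, sorted tuple of neighbor colors).
    types = {u: (x[u], tuple(sorted(x[v] for v in adj[u]))) for u in adj}
    # Collision-free renaming: rank the distinct types.
    rank = {t: i for i, t in enumerate(sorted(set(types.values())))}
    return {u: rank[types[u]] for u in adj}

# Example: a path 0-1-2-3 with all colors equal.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
x = {u: 0 for u in adj}
print(wl_iter(adj, x))  # {0: 0, 1: 1, 2: 1, 3: 0}: endpoints vs interior
```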

Our starting point is the observation that many graph representation architectures used in current practice can be interpreted as layered distributed protocols. Each layer transforms node states by combining local computation with information received from other nodes. The relevant resource is not only depth (the number of layers), but also the amount of information exchanged per layer. This perspective is particularly salient for graph Transformers in regimes where attention is sparse or budgeted: even if a node may in principle attend to arbitrary others, the model typically enforces a bound on the number of attended pairs, or on the total number of key–value transfers, for computational reasons. In such settings, the correct abstraction is not a per-edge bandwidth cap, but a global cap on the aggregate amount of communication (equivalently, attention mass) that can be expended per layer. Our goal is to make this constraint explicit and then to characterize, for a canonical WL computation, what depth is necessary and sufficient as a function of that budget.

To this end we introduce the Budgeted Graph Foundation Model (BGFM), a synchronous layered model that retains the essential degrees of freedom of modern graph foundation models while exposing a single, auditable communication parameter. In BGFM, each layer allows nodes to send point-to-point messages, after which each node applies an update function from a prescribed computation class 𝒞. The distinguishing feature is that the total number of transmitted words in a layer is at most B, i.e., the total communicated bits are O(Blog n) under the standard word-RAM encoding. We consider two variants. In the routed variant, messages must traverse the edges of the input graph, possibly over multiple layers; this captures message-passing GNNs and graph algorithms executed over the underlying network. In the direct variant, nodes may communicate nonlocally, but still within the same global budget B; this captures sparse-attention Transformers, architectures with a small number of global tokens, and models that implement a limited set of learned or engineered long-range links per layer.

It is instructive to contrast BGFM with the standard distributed CONGEST family and its randomized-listening variants (RL-CONGEST). In CONGEST, one constrains the message size per edge per round. This is appropriate when all edges are simultaneously usable and when the limiting factor is local bandwidth. In our setting, the limiting factor is often global: one may be able to communicate along many edges, but only a small number of communications can be afforded in total per layer (e.g., due to attention sparsity, accelerator memory bandwidth, or a requirement that inference scale subquadratically in n). BGFM can be viewed as a relaxation in one direction and a tightening in another: we do not insist on per-edge uniformity, but we insist that the sum of all communication fits within B. This permits faithful modeling of nonuniform communication patterns (leaders, hubs, virtual nodes) while still enabling lower bounds via information arguments.

We emphasize that even for WL-ITER, local access to neighbor colors does not solve the problem: computing t(u) is local, but producing a globally consistent, collision-free label for each distinct type is a global coordination problem. In particular, if two distant regions of the graph contain identical WL-types, the output labels must coincide, and if they contain different WL-types, the labels must differ. This is a canonicalization requirement rather than a mere aggregation requirement. Under small B, the canonicalization step becomes the bottleneck, and it is here that communication complexity enters. Our lower bounds formalize this intuition by reducing from equality testing instances encoded into graphs whose correct refinement output reveals whether two strings are equal; any BGFM computation that succeeds must therefore transmit enough information, across layers, to resolve that equality instance.

The contributions we establish within this framework are as follows. First, we formalize BGFM, with routed and direct variants, explicit word-level accounting of per-layer communication, and a clean interface to node compute classes and randomness regimes, and we show how message passing and budgeted attention instantiate it. Second, we prove that exact WL-ITER requires depth L = Ω(D(G) + m/(B log n)) in the routed variant and L = Ω(m/(B log n)) in the direct variant, via reductions from two-party equality. Third, we give matching gather–sort–scatter upper bounds of O(D(G) + m/B) and O(m/B) layers when unique IDs are available. Fourth, we show that the bounded-error relaxation WL-FP admits a depth-1 public-coin algorithm with fingerprints of Θ(log(n/ε)) bits, and that private coins are strictly weaker under low budgets.

The logical structure of the development is therefore as follows. We first specify BGFM precisely, including the accounting of communicated words per layer and the distinction between routed and direct communication. We then treat WL-ITER as a canonicalization problem and prove lower bounds by encoding equality instances into WL-refinement outputs, translating any BGFM execution into a bounded-communication transcript. Next we present the matching gather–sort–scatter upper bound in the presence of IDs, exhibiting that the lower bounds are not artifacts of the reduction but reflect the true communication bottleneck. Finally, we analyze the bounded-error regime and show that shared randomness permits depth-1 approximate refinement with logarithmic fingerprint length, together with separations between public and private coins.


2. Related Work: WL/GNN expressiveness, RL-CONGEST and distributed models, attention vs virtual nodes, communication complexity and distributed lower bounds, random features vs shared randomness.

The one-dimensional Weisfeiler–Lehman (WL) refinement, also known as color refinement, is a classical procedure in graph isomorphism testing and in the study of equitable partitions. In the modern learning literature it has become standard to use 1-WL as a yardstick for the distinguishing power of message-passing graph neural networks (MPGNNs): a broad class of neighborhood-aggregation architectures cannot separate any pair of graphs that 1-WL fails to separate, and conversely sufficiently expressive aggregators match 1-WL on node and graph distinctions under mild conditions. Subsequent work studied higher-order variants (e.g. k-WL and corresponding k-GNNs) and the effect of architectural choices such as multisets versus sets, injective aggregation, and invariant/equivariant readouts. Our focus is intentionally orthogonal: we isolate a single step of WL refinement as a distributed task and ask what global coordination is required to produce collision-free identifiers for WL-types under explicit communication budgets.

A recurring theme in both distributed computing and GNN theory is that symmetry breaking mechanisms (unique IDs, port numberings, random features, positional encodings) can fundamentally change what can be computed in bounded time or depth. In distributed algorithms, unique identifiers are often assumed in the LOCAL and CONGEST models to avoid trivial impossibility results on anonymous networks; conversely, in anonymous settings one typically relies on randomness or on additional structure (e.g. orientation, leader) to break symmetries. In GNNs, node IDs are usually not available as features, and a substantial body of work studies whether random initial features, Laplacian eigenvectors, or other positional encodings can simulate the effect of identifiers and thereby increase expressiveness. Our results separate regimes where identifiers are available (yielding a straightforward gather–sort–scatter solution for exact refinement) from regimes where they are not, and we further separate shared randomness (a common seed) from independent randomness at nodes in the approximate setting.

The classical distributed graph computation models LOCAL and CONGEST impose synchronous rounds and, in CONGEST, restrict message size per edge (typically to O(log n) bits), and a large lower-bound literature identifies tasks whose round complexity is constrained by information propagation and bandwidth bottlenecks. Several variants aim to model alternative bottlenecks. The Congested Clique model allows all-to-all communication with per-link bandwidth limits; the node-capacitated clique and related models instead cap per-node throughput, capturing settings where the network is dense but each processor has bounded I/O. Randomized-listening and radio-network variants constrain which neighbors can successfully receive messages in a given round, inducing a global contention phenomenon. These models are close in spirit to our concern with global throughput, but they typically constrain communication in a localized manner (per edge, per node, or via contention rules), whereas our abstraction accounts directly for a budget on the total number of transmitted words. This accounting is intended to match architectural constraints in foundation-model-style computation, where the limiting factor is often a fixed attention or memory bandwidth budget per layer rather than the existence of many simultaneous low-cost links.

Graph Transformers and related attention-based architectures have been developed to overcome locality limitations of MPGNNs by enabling nonlocal interactions, often mediated by learned attention patterns, global tokens, or explicitly introduced virtual nodes. In practice, attention is frequently sparse for computational reasons: one restricts each node to attend to a bounded number of other nodes, or one uses low-rank and routing mechanisms that reduce the number of key–value transfers. Virtual-node constructions can be viewed as concentrating communication through a small set of hubs, enabling a form of global aggregation in few layers but at a cost in bandwidth through those hubs. Our model is designed to encompass both message passing and sparse attention within a single quantitative parameter, namely the total per-layer communication budget. In this sense, our upper bounds can be interpreted as formalizing the intuitive algorithm “send everything to a global token, canonicalize, and broadcast back,” while our lower bounds quantify when such a strategy is information-theoretically unavoidable for exact refinement.

The use of two-party (and multi-party) communication complexity to prove distributed lower bounds is by now standard. Starting from the seminal connections between distributed computation and communication protocols, a large literature shows how to embed instances of equality, set disjointness, and related primitives into graph problems so that any distributed algorithm with limited communication yields a protocol with correspondingly limited transcript size. In the CONGEST model, this technique was refined to produce tight lower bounds for problems such as minimum spanning tree, shortest paths, and cut computations, often via carefully designed graph partitions that force information to cross a sparse cut. Our lower bounds follow the same meta-principle but target a different bottleneck: rather than a sparse cut limiting simultaneous edge transmissions, we impose an explicit cap on total communication per layer, which naturally translates to a bound on transcript length over L layers. The reduction from equality is particularly natural for WL refinement because exact canonicalization requires global agreement on whether two local multisets (or collections of them) coincide, and equality testing is the canonical primitive capturing that requirement.

On the algorithmic side, representing multisets compactly and comparing them under bandwidth constraints is a classical problem, with tools ranging from sorting-based canonicalization to randomized hashing and sketching. In distributed settings, exact multiset equality typically requires enough information to reconstruct or uniquely identify the underlying multiset, which in turn forces near-linear communication in the input size in worst cases. In contrast, approximate equality and fingerprinting admit efficient randomized protocols: with shared randomness, one can evaluate suitable hash functions or polynomial fingerprints so that unequal objects collide with small probability while equal objects collide with probability 1. In streaming and sketching, linear sketches for frequency vectors (e.g. AMS and related families) provide compact summaries with provable collision and concentration guarantees. Our depth-1 construction is best viewed as importing these fingerprinting ideas into a WL-type setting: the WL-type of a node is a pair consisting of a color and a multiset of neighbor colors, and short random sketches suffice to produce fingerprints with collision probability bounded by a user-specified ε.

The distinction between public-coin and private-coin randomized protocols is fundamental in communication complexity: public coins can reduce the cost of equality from Θ(log n) bits to O(1) bits at constant error, while private-coin protocols for equality require Θ(log n) bits. In learning systems, one often has access to a shared pseudorandom seed (or, equivalently, a fixed randomized feature map) across all nodes at inference time, which operationally corresponds to public coins. By contrast, assigning each node independent random features corresponds to private coins unless those features are correlated via a shared seed. Our separation result instantiates this classic gap within BGFM under low budget: shared seeds enable globally consistent fingerprints essentially for free, while independent randomness forces additional communication (or depth) to reconcile local randomness into a globally comparable representation.

To summarize, we treat one-step WL refinement as a minimal but nontrivial canonicalization primitive and analyze it through the lens of globally budgeted communication, motivated by both sparse-attention architectures and bandwidth-limited distributed computation. The related literatures above provide, respectively, (i) the WL/GNN expressiveness baseline that motivates the task, (ii) the role of IDs and randomness in symmetry breaking, (iii) distributed bandwidth models that motivate explicit accounting, and (iv) communication-complexity tools and fingerprinting techniques that underlie our lower and upper bounds. We now turn to a formal definition of the Budgeted Graph Foundation Model and its routed and direct variants.


3. The BGFM Model: formal definition, routed vs direct variants, randomness (private/public), node compute class 𝒞, and how message passing and attention instantiate BGFM.

We formalize a synchronous distributed model intended to capture the dominant bottleneck in several graph representation architectures, namely a per-layer limit on how many words can be communicated in total. The model is parameterized by a communication budget B (words per layer), a depth L (number of layers), a local computation class 𝒞, and a randomness regime (none, private coins, or public coins). The input is a graph G = (V, E) (undirected unless stated) on n = |V| nodes and m = |E| edges, together with an initial feature/color vector x ∈ [p(n)]^n for some fixed polynomial p. Each node u ∈ V initially knows n, its own feature xu, and (in the routed variant) its adjacency list N(u); we do not assume global knowledge of G unless explicitly provided by preprocessing.

Before the L layers begin we allow an optional preprocessing map
fpre : 𝒢 → 𝒢,
which may augment the instance with auxiliary features (e.g. IDs, a leader mark, a BFS-tree labeling, or additional edges) depending on the regime under study. Since preprocessing can trivialize symmetry-related lower bounds if left unconstrained, we treat it as an explicit part of the model specification: in “no-IDs” regimes we disallow preprocessing that introduces unique identifiers, while in “IDs allowed” regimes we permit fpre to provide distinct IDs in [poly(n)]. When needed, one can also attach an explicit cost to preprocessing; in this work our main statements either (i) disallow ID-creating preprocessing entirely, or (ii) allow it without cost and focus on the ensuing layer complexity.

The computation proceeds in L synchronous layers ℓ = 1, 2, …, L. Each node u maintains a state su(ℓ) after layer ℓ, with su(0) determined by the input features and any preprocessing output. A layer consists of (a) message generation, (b) message delivery subject to the communication constraints, and (c) local state update. Concretely, each node u applies a message function MSG(ℓ) to its current state to produce one or more messages; each message is a finite string over {0, 1}, typically of length O(log n) bits (one “word”), though we allow larger messages provided they are charged proportionally in words. The node then applies an update function UPD(ℓ) to its previous state and the multiset of received messages:
su(ℓ) ← UPD(ℓ)(su(ℓ − 1), { { Mv → u(ℓ) : v sent to u } }).
We require UPD(ℓ) ∈ 𝒞 for every node u and layer ℓ. The choice of 𝒞 is orthogonal to our communication accounting; for exact canonicalization procedures we will typically assume 𝒞 = TIME(poly(n)), while for lower bounds we allow arbitrary local computation so that the bottleneck is information transfer.

The defining constraint of BGFM is that the amount of communication per layer is bounded. We measure communication in words, where one word represents O(log n) bits—enough to send a node ID in [poly(n)], a color in [poly(n)], or a hash value of comparable length. Formally, in each layer the sum of message lengths (in words) over all delivered messages is at most B. Equivalently, the total number of communicated bits per layer is O(Blog n). This accounting differs from standard edge-bandwidth models: we do not require that each edge (or each node) has its own bandwidth limit; rather, we constrain the global throughput, which is the appropriate abstraction when a system enforces a fixed attention/memory bandwidth per layer.
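
The accounting can be made mechanical. Below is a minimal simulator sketch of the per-layer budget check (ours; the class name, message format, and set-of-frozensets edge encoding are illustrative assumptions, not part of the formal model).

```python
import math

class BGFMLayer:
    """Minimal sketch of BGFM per-layer budget accounting (illustrative only).

    Messages are (src, dst, payload_bits) triples; each payload is charged
    ceil(bits / log2(n)) words. The layer rejects any schedule whose total
    word count exceeds the budget B.
    """

    def __init__(self, n, B, edges=None):
        self.n = n              # number of nodes
        self.B = B              # global word budget per layer
        self.edges = edges      # set of frozenset({u, v}) for routed; None = direct

    def words(self, payload_bits):
        return max(1, math.ceil(payload_bits / math.log2(self.n)))

    def deliver(self, messages):
        """messages: list of (src, dst, payload_bits). Returns inbox dict."""
        total = sum(self.words(bits) for _, _, bits in messages)
        if total > self.B:
            raise ValueError(f"budget exceeded: {total} > {self.B} words")
        inbox = {}
        for src, dst, bits in messages:
            if self.edges is not None and frozenset((src, dst)) not in self.edges:
                raise ValueError("routed variant: message must follow an edge")
            inbox.setdefault(dst, []).append((src, bits))
        return inbox
```

A routed layer is obtained by passing the edge set, a direct layer by edges=None; the depth of an algorithm is then the number of deliver calls it makes.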

We consider two variants that differ in which pairs of nodes are eligible to exchange a message in a single layer.


Routed variant. All communication must traverse edges of the input graph G. In each layer ℓ, a message Mu → v(ℓ) may be delivered only if {u, v} ∈ E (or, in a directed formulation, (u, v) ∈ E). Multi-hop communication is possible only by relaying information across successive layers. Thus, even if the global budget B is large, the graph diameter D(G) can impose an unavoidable latency term in tasks that require correlating far-apart regions. Routed-BGFM is the natural abstraction for message-passing architectures and for distributed settings where the underlying network is exactly G.


Direct variant. A node may send a message to any other node in one layer (point-to-point), subject only to the global budget B. This captures nonlocal interactions such as attention or global read/write to a shared memory. Since point-to-point delivery requires an addressing mechanism, we explicitly expose the dependence on identifiers: if unique IDs are present then a message can specify its recipient by ID; if IDs are absent, then any nontrivial addressing must be implemented by the algorithm itself and charged to the budget. The direct variant removes the diameter barrier but does not remove the information bottleneck: if the task requires moving Θ(m) words of neighborhood-derived information into a consistent global view, then the number of layers is still constrained by the rate B.

We allow randomized algorithms and distinguish two standard regimes.


Private coins. Each node u has access to an independent random tape Ru, and MSG(ℓ) and UPD(ℓ) may depend on Ru. Private coins model independently sampled random initial features or independent dropout-style randomness. Importantly, private randomness is not automatically comparable across nodes without additional communication: two nodes cannot assume they sampled the same hash function or the same random challenge unless they coordinate.


Public coins. All nodes share the same random tape R. Equivalently, there is a public seed (or globally fixed randomness) that defines shared hash functions, random sketches, and other correlated choices. Public coins are strictly stronger in low-communication settings, since they enable nodes to compute globally consistent fingerprints without exchanging the seed. In our approximate WL setting, public coins will permit depth-1 fingerprinting with O(log (n/ε)) bits, while private coins can require additional communication to align the randomness.

The budget B is stated in words to make it invariant under constant-factor encoding choices. When we compare BGFM to architectures that use fixed-width vectors, we interpret a message of dimension d with b-bit coordinates as Θ(db/log n) words; at constant precision this is Θ(d/log n) words, whereas preserving 1/poly(n)-scale distinctions requires b = Θ(log n) and hence Θ(d) words. For our canonicalization and fingerprinting tasks, the relevant messages will be integers modulo a prime or color/ID values, and thus naturally fit in O(1) words. One may also impose a per-message cap (e.g. one word per message); this only strengthens our lower bounds and does not affect our upper bounds, which can always stream longer objects across multiple messages and layers.

BGFM is designed so that common graph representation mechanisms embed as special cases by appropriately setting B.


Message passing. An L-layer message-passing GNN on G that sends a w-word message across each directed edge per layer has total per-layer communication Θ(mw) words (up to constant factors depending on whether we count both directions). Thus it is captured by routed-BGFM with B = Θ(mw). Conversely, any routed-BGFM algorithm with budget B can be seen as a message-passing procedure in which only a subset of edges are “active” per layer, with average activity constrained by B/m.


Sparse attention. A graph Transformer layer that realizes at most K attended query–key pairs (i.e. at most K value transfers) and uses dmodel-dimensional values at constant coordinate precision induces total communication Θ(K ⋅ dmodel/log n) words per layer, hence is a direct-BGFM instance with B = Θ(K ⋅ dmodel/log n). Virtual-node or global-token constructions correspond to spending a large fraction of B on messages incident to a designated hub node; BGFM makes this concentration explicit, as the hub does not create bandwidth from nothing but merely reallocates the global budget.
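
The conversion from attention parameters to a BGFM budget is simple arithmetic; the helper below (ours, with hypothetical parameter names) makes it explicit under the constant-precision reading above.

```python
import math

def attention_budget_words(K, d_model, n, bits_per_coord=1):
    """Convert a sparse-attention layer into a BGFM word budget.

    K:              number of attended query-key pairs (value transfers)
    d_model:        dimensionality of the transferred values
    bits_per_coord: quantization precision per coordinate
    One word is log2(n) bits, so B = Theta(K * d_model * bits / log n).
    """
    word_bits = math.log2(n)
    return math.ceil(K * d_model * bits_per_coord / word_bits)

# Example: n = 10**6 nodes, K = 8n attended pairs, d_model = 256 at 1-bit
# precision gives about 1.0e8 words per layer (~103 words per node).
print(attention_budget_words(K=8 * 10**6, d_model=256, n=10**6))
```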

We emphasize what BGFM does and does not attempt to capture. We do not charge for local access to N(u) in routed-BGFM (adjacency is part of the input), nor for purely local computation beyond requiring membership in 𝒞. We also do not impose fairness constraints on which nodes may communicate: in any layer the algorithm may allocate the entire budget B to a small set of nodes if desired. This is deliberate, as it reflects architectures where a small number of tokens can dominate attention. The model is therefore well-suited to asking a specific question: given a global throughput limit B, what depth L is forced (or sufficient) to implement collision-free canonicalization tasks such as one-step WL refinement?

In the next section we instantiate this question by defining the exact task WL-ITER and its approximate fingerprinting relaxation WL-FP, together with the metrics by which we evaluate success under bounded error.


4. Problems and Metrics: define WL-ITER (exact), WL-FP (approximate fingerprints), and evaluation metrics (collision probability, type-recovery rate).

We now specify the computational tasks studied in this work. Throughout, an instance consists of an undirected graph G = (V, E) on n = |V| nodes and m = |E| edges, together with an initial color vector x ∈ [p(n)]^n for some fixed polynomial p. We interpret x as the current coloring on which a Weisfeiler–Lehman refinement step is to be performed. Our focus on one-step refinement isolates the information bottleneck arising already from aggregating and canonicalizing neighborhood information.

For a node u ∈ V, we define its one-step WL type as
$$ t(u)\ :=\ \Big(x_u,\ \multiset{ x_v : v\in N(u)}\Big), $$
where N(u) denotes the neighborhood of u in G, and $\multiset{\cdot}$ denotes a multiset (so multiplicities matter). In standard WL refinement, nodes with equal types receive equal new colors, and nodes with distinct types receive distinct new colors. Importantly, the multiset component can have total size deg (u), so the collection {t(u) : u ∈ V} contains Θ(m) color occurrences in the worst case.

We emphasize that t(u) is a mathematical function of (G, x) and not, by itself, a communication primitive. In routed-BGFM, computing t(u) from the local inputs requires learning {xv : v ∈ N(u)}, which can be achieved by a local neighbor exchange if the budget allocation permits. In direct-BGFM, the same information can be obtained either by explicit point-to-point queries or by preprocessing/local access assumptions. Our lower bounds will be stated in a way that only strengthens when we grant nodes additional local access, whereas our upper bounds will explicitly account for any required communication.

The first task is the exact one-step WL refinement problem, which we treat as a canonicalization task for the partition of nodes induced by types.

Definition (WL-ITER). Given an instance (G, x), every node u must output a label yu such that, for all u, v ∈ V, yu = yv ⇔ t(u) = t(v).

Thus y must induce the equivalence relation of type equality. In particular, WL-ITER requires both:
(i) (t(u) = t(v) ⇒ yu = yv), and
(ii) (t(u) ≠ t(v) ⇒ yu ≠ yv).
We will typically impose perfect correctness, or equivalently zero-error correctness for randomized algorithms (i.e., the output must satisfy the biconditional with probability 1 over the algorithm’s randomness). This choice is deliberate: even a single mistaken collision yu = yv for t(u) ≠ t(v) breaks the refinement partition and, in downstream iterations, can only propagate the error.

In anonymous graphs (no unique IDs), WL-ITER remains well-defined because the output is required only to be consistent with type equality; no particular numeric naming of types is mandated. Nevertheless, producing a collision-free assignment y in a distributed setting is an inherently global coordination problem: two far-apart nodes of the same type must agree on the same label, while two nodes of different types must avoid accidental agreement. Our lower bounds in the no-IDs regime formalize this coordination barrier under a global communication budget.

We also study a bounded-error relaxation, WL-FP, in which nodes output short fingerprints of their types. This relaxation is motivated by the observation that many architectures do not aim to produce an explicit collision-free renaming of all distinct types; rather, they compute a compressed embedding in which distinct local neighborhoods may collide.

Definition (WL-FP). Given an instance (G, x) and an error parameter ε ∈ (0, 1), every node u must output a fingerprint fu ∈ {0, 1}^k such that (a) t(u) = t(v) implies fu = fv with probability 1, and (b) for every pair u ≠ v with t(u) ≠ t(v), Pr [fu = fv] ≤ ε.

The one-sided requirement reflects the intended interpretation of fingerprints as sketches of types: identical inputs must map to identical outputs regardless of randomness. The bounded collision requirement captures that the sketch should be injective on types with high probability, but we permit collisions among distinct types at a controlled rate. We will be particularly interested in the dependence of k on (n, ε), and in the tradeoff between depth L, budget B, and the randomness regime.

The distinction between public and private coins is operationally significant for WL-FP. With public coins, nodes can apply the randomly chosen hash functions to their local multisets without exchanging seeds, enabling consistent sketches at essentially no communication cost. With private coins, nodes that hash locally do not automatically align their hash functions, and additional communication may be necessary to ensure that equality of types implies equality of fingerprints. Our separation results will formalize that, under tight budgets, public coins can strictly reduce the required depth or total communicated bits.

We evaluate BGFM algorithms for WL-ITER and WL-FP along three axes: depth, communication, and error.

An algorithm is parameterized by a depth L and per-layer communication budget B (words), as defined by BGFM. The principal complexity measure is the minimal L achievable for a given B (and vice versa), under the specified computation class 𝒞 and randomness regime. In routed-BGFM, we will additionally track the dependence on the graph diameter D(G), since information about types may need to traverse the graph.

For WL-FP, the definition above states a per-pair collision bound. It is often convenient to control the stronger global failure event
Fail := {∃ u ≠ v : t(u) ≠ t(v) ∧ fu = fv}.
We say that an algorithm achieves global failure probability at most ε if Pr [Fail] ≤ ε. A per-pair bound implies a global bound via a union bound over O(n^2) pairs when the collision probability is sufficiently small; conversely, a global bound immediately implies that all pairs are distinguished simultaneously with probability at least 1 − ε. Our depth-1 public-coin fingerprinting upper bound will be stated in this global form, since it directly matches the intended “no collisions anywhere” guarantee up to probability ε.

While our formal guarantees are phrased in terms of collision probability, it is sometimes useful to quantify how well the fingerprint partition approximates the true type partition. Let Πt be the partition of V into equality classes of t(⋅), and Πf the partition into equality classes of f(⋅). We define the type-recovery rate (TRR) as
$$ \mathrm{TRR}\ :=\ 1\ -\ \frac{\big|\{ \{u,v\}: u\neq v,\ t(u)\neq t(v),\ f_u=f_v\}\big|}{\big|\{ \{u,v\}: u\neq v,\ t(u)\neq t(v)\}\big|}, $$
and we consider 𝔼[TRR] over the algorithm’s randomness. Under a global failure guarantee Pr [Fail] ≤ ε, we have TRR = 1 with probability at least 1 − ε, hence 𝔼[TRR] ≥ 1 − ε. We stress that TRR is a diagnostic rather than an additional requirement; our main results will be stated using the explicit collision/failure probabilities above.
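
For small diagnostic instances, the failure event and TRR can be computed directly from the type and fingerprint maps; the following quadratic-time sketch (ours) does exactly that.

```python
from itertools import combinations

def fail_and_trr(t, f):
    """t, f: dicts from node to (hashable) type and fingerprint.

    Returns (fail, trr): fail indicates that some distinct-type pair
    collided; trr is the fraction of distinct-type pairs separated.
    """
    distinct = collided = 0
    for u, v in combinations(t, 2):
        if t[u] != t[v]:
            distinct += 1
            if f[u] == f[v]:
                collided += 1
    fail = collided > 0
    trr = 1.0 if distinct == 0 else 1.0 - collided / distinct
    return fail, trr
```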

With these problem definitions and metrics in place, we next turn to lower bounds for WL-ITER in BGFM, obtained via reductions from communication complexity and, in the routed setting, via diameter-padding constructions that enforce an Ω(D(G)) latency term.


5. Lower Bounds for Exact WL-ITER: communication-complexity reduction (EQ/INDEX), lifting to BGFM, and the role of diameter in the routed variant.

We now prove that exact one-step canonicalization, in the sense of the WL-ITER definition above, is communication-limited in BGFM. The guiding principle is that WL-ITER forces the system to implement a collision-free renaming of the Θ(m) neighborhood-color occurrences implicit in the multiset data {t(u) : u ∈ V}. When unique IDs are unavailable, no node can a priori serve as a canonical coordinator, and any correct algorithm must implicitly transmit enough information to prevent accidental agreement between distinct types.

Our lower bounds are obtained by embedding standard two-party communication problems into WL-ITER instances and then compiling any BGFM algorithm into a two-party protocol whose communication is bounded by the BGFM budget B times the depth L. In the routed variant we additionally enforce an Ω(D(G)) latency term by diameter padding.

We use the two-party Equality problem EQm: Alice receives a ∈ {0, 1}^m, Bob receives b ∈ {0, 1}^m, and they must decide whether a = b. It is classical that any deterministic protocol for EQm communicates Θ(m) bits, and the same Ω(m) lower bound holds for zero-error randomized protocols (by a standard fooling-set argument, or by the fact that zero-error protocols induce a partition into monochromatic rectangles just as in the deterministic case). We emphasize that we deliberately work in the deterministic/zero-error regime since WL-ITER itself requires the biconditional yu = yv ⇔ t(u) = t(v) to hold with probability 1.

Our reduction will map (a, b) to an instance (G(a, b), x(a, b)) of WL-ITER with two designated nodes rA, rB such that

yrA = yrB  ⟺  t(rA) = t(rB)  ⟺  a = b

for every correct output y. Given an algorithm for WL-ITER, Alice and Bob will simulate it on (G(a, b), x(a, b)) and then decide EQm by comparing the produced labels of rA and rB. Thus any BGFM algorithm yields a two-party protocol for EQm.

We formalize the simulation step in the following standard lemma. Fix a partition V = VA ∪ VB where Alice simulates all nodes in VA and Bob simulates all nodes in VB. We assume both parties know the construction of G and x, and each party knows the private part of the input (the edges and initial colors incident to its simulated nodes) induced by the reduction. They then jointly simulate the L-layer BGFM algorithm 𝒜.

In each BGFM layer ℓ, every message is an O(log n)-bit word by definition of the budget unit. The total number of such words sent system-wide is at most B. In the two-party simulation, whenever a message originates at a node in VA and is delivered to a node in VB (or vice versa), the sender’s party transmits its content to the other party. Messages whose endpoints lie on the same side need not be communicated. Since the total number of words is at most B, the total cross-party communication in layer ℓ is at most O(Blog n) bits, and summing over L layers yields O(L ⋅ Blog n). For zero-error randomness, we include the random tape as local to each side (private coins) or shared (public coins); in either case the induced protocol remains zero-error with the same communication bound.

The simulation lemma above reduces lower bounding BGFM depth to exhibiting reductions satisfying the displayed root equivalence and invoking CC(EQm) = Ω(m).

We now describe a concrete reduction. Fix m and set the node set to contain two roots rA, rB and, for each position i ∈ [m] and bit β ∈ {0, 1}, two auxiliary nodes ui, βA and ui, βB. We assign initial colors by
xrA = xrB = c⋆,   xui, βA = xui, βB = ci, β,
where the colors c⋆ and ci, β are all distinct, but each ci, β appears twice (once on Alice’s side and once on Bob’s side), so no node is uniquely identified by its initial color alone. The edges encode a and b by selecting exactly one neighbor per position: connect rA to ui, aiA for each i ∈ [m], and connect rB to ui, biB for each i ∈ [m]. No other edges are needed for the basic (direct-BGFM) lower bound; the remaining nodes may be isolated.

By construction,
$$ t(r_A)=\Big(c_\star,\ \multiset{ c_{i,a_i} : i\in[m]}\Big),\qquad t(r_B)=\Big(c_\star,\ \multiset{ c_{i,b_i} : i\in[m]}\Big). $$
Since the colors ci, 0 and ci, 1 are distinct for each i, we have t(rA) = t(rB) if and only if for every i the selected neighbor-color matches, i.e., ai = bi for all i, which is equivalent to a = b. Thus the displayed equivalence holds for the designated roots.
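
The gadget and the equivalence it establishes are easy to verify mechanically; the sketch below (ours; the node and color encodings are illustrative) builds G(a, b) and checks that t(rA) = t(rB) iff a = b.

```python
from collections import Counter

def eq_gadget(a, b):
    """Build the equality gadget described above (illustrative sketch).

    Nodes: roots 'rA', 'rB'; auxiliaries (side, i, beta).
    Colors: roots share c_star; aux nodes on both sides share c_{i,beta}.
    Returns (adjacency dict, color dict).
    """
    m = len(a)
    adj = {'rA': [], 'rB': []}
    x = {'rA': ('c_star',), 'rB': ('c_star',)}
    for i in range(m):
        for beta in (0, 1):
            for side in ('A', 'B'):
                node = (side, i, beta)
                adj[node] = []          # isolated unless selected below
                x[node] = ('c', i, beta)
    for i in range(m):
        adj['rA'].append(('A', i, a[i])); adj[('A', i, a[i])].append('rA')
        adj['rB'].append(('B', i, b[i])); adj[('B', i, b[i])].append('rB')
    return adj, x

def wl_type(adj, x, u):
    # (own color, multiset of neighbor colors) as a hashable value
    return (x[u], frozenset(Counter(x[v] for v in adj[u]).items()))

for a, b in ([(0, 1, 1), (0, 1, 1)], [(0, 1, 1), (0, 1, 0)]):
    adj, x = eq_gadget(a, b)
    assert (wl_type(adj, x, 'rA') == wl_type(adj, x, 'rB')) == (a == b)
print("t(rA) = t(rB) iff a = b: verified")
```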

Combining this reduction with the simulation lemma, we immediately obtain that any deterministic or zero-error randomized direct-BGFM algorithm for WL-ITER must satisfy
$$ L\cdot B\log n = \Omega(m), \qquad\text{hence}\qquad L=\Omega\!\left(\frac{m}{B\log n}\right), $$
on a worst-case family of n-node m-edge instances (here m = Θ(number of encoded bits) and n = Θ(m)). This is the content of Theorem 3.

In routed-BGFM, the preceding gadget alone does not enforce any latency beyond the communication bound: if the two roots were adjacent (or close), information could flow between them in O(1) layers whenever the budget permits. To obtain a lower bound that also depends on the diameter, we apply a diameter-padding transformation that separates the Alice and Bob regions by a long path.

Concretely, we take the above instance and connect rA to rB by an induced path P of length D, introducing intermediate vertices p1, …, pD − 1 with an arbitrary fixed fresh initial color (say c0) and edges (rA, p1), (p1, p2), …, (pD − 1, rB). All bit-encoding neighbors of rA remain attached to rA, and similarly for rB. The resulting graph has diameter Θ(D). Since routed-BGFM permits only one-hop communication per layer (multi-hop only across layers), any influence of Alice’s side on Bob’s side (and vice versa) must traverse P, requiring at least D layers for information to propagate from rA to rB. This yields an Ω(D) lower bound that is independent of B.

To combine this with the Ω(m/(Blog n)) term, we observe that the simulation argument already yields a protocol whose communication is O(L ⋅ Blog n), and EQm requires Ω(m) bits. The diameter padding does not reduce the amount of information that must cross between the two regions; it only restricts the rate at which such information can traverse the graph. Therefore, on the padded family we obtain
$$ L=\Omega\!\left(D+\frac{m}{B\log n}\right), $$
for deterministic and zero-error randomized routed-BGFM algorithms without IDs (and without preprocessing that creates IDs). This establishes the form of Theorem 2.

Finally, we note the role of the mild assumption B = o(n2): if B were Θ(n2), then in the direct variant the model can essentially realize dense all-to-all exchange of Θ(n2) words per layer, and the Ω(m/(Blog n)) term becomes vacuous on sparse graphs. Our lower bounds are meaningful precisely in the sparse-attention or bandwidth-limited regimes in which B is subquadratic and the global information budget per layer is the constraining resource.


6. Matching Upper Bounds with IDs: budgeted gather-sort-scatter algorithms; extensions to direct (all-to-all) budgeted communication; discussion of node compute requirements and practical approximations.

We now show that the communication lower bounds from Section 5 are essentially tight once we allow unique identifiers. Throughout this section we assume that each node u ∈ V is given an ID(u) ∈ [poly(n)] as part of the input features, or that such IDs are created by an allowed preprocessing step. We also assume 𝒞 = TIME(poly(n)), so that a designated coordinator can perform sorting/dictionary operations on Θ(m) words.

The basic idea is the canonical gather–sort–scatter template: we first assemble all one-step WL types t(u) at a leader r, compute a collision-free renaming of types at r, and then disseminate the resulting labels yu back to the nodes. Since each type $t(u)=(x_u,\multiset\{x_v:v\in N(u)\})$ contains Θ(deg (u)) color occurrences, the total input size of the global canonicalization problem is Θ(m) words, and hence any algorithm that is forced to move this information through a global budget of B words per layer should have depth Θ(m/B) up to additive routing latency.

With IDs, we may select a leader r as the node of minimum ID (or assume r is specified). Each node u locally materializes its WL type
$$ t(u) \;=\; \bigl(x_u,\; M_u\bigr),\qquad M_u := \multiset\{x_v : v\in N(u)\}. $$
In routed-BGFM, forming Mu requires one local neighbor exchange of the current colors; in direct-BGFM we may equivalently treat this as local adjacency access. We represent t(u) as a sequence of ℓu := 1 + deg (u) words: one word for xu and one word per neighbor color occurrence (with repetition). For exactness it is convenient (but not necessary) to sort the neighbor colors locally so that Mu is represented canonically as a nondecreasing list; local sorting costs O(deg (u)log n) time and does not affect the communication accounting.

Let
W := ∑u ∈ V ℓu = n + ∑u ∈ V deg (u) = n + 2m = Θ(m)
denote the total number of words that encode all types.

We first give the routed-BGFM upper bound. We assume that a rooted spanning tree T of G with root r is available for routing (e.g., computed in preprocessing, or obtained by any standard exploration whose total message volume is O(m) and hence fits within the same asymptotic depth bound we prove below). Once T is fixed, we can treat each edge of T as a store-and-forward link; messages traverse one tree edge per layer.

Each node u must send its ℓu-word description of t(u) to r. We pipeline these words upward along T. Concretely, each node maintains an outgoing queue of words destined for its parent. In each layer, we select an arbitrary set of at most B directed tree edges and transmit one word along each selected edge. This respects the per-layer global budget of B words. Since all words are destined to r, no routing decisions beyond parent pointers are needed.

Two constraints govern the number of layers: (i) throughput, since in total W words must be moved system-wide at rate at most B per layer; and (ii) latency, since a word originating at distance d from r must traverse d edges, implying an additive Ω(D) term. Standard pipelining yields that the gather completes in

$$ L_{\mathrm{gather}} \;=\; O\!\left(D(G)+\frac{W}{B}\right) \;=\; O\!\left(D(G)+\frac{m}{B}\right) $$

layers. One way to formalize this is to charge one unit of global budget to each word-edge traversal. The total number of traversals is at most W ⋅ D(G) in the worst case, but because all paths are directed toward the root and edges can be fully pipelined, the critical path contributes only O(D) layers while the remaining work is amortized at throughput B words/layer; this is the usual “store-and-forward with bounded total injections” analysis.

After the gather, the leader r holds all types t(u) (together with the sender IDs, so that labels can be returned). Node r now computes a collision-free map
t(u) ↦ yu
such that yu = yv if and only if t(u) = t(v). Since we require exactness, we do not hash types into a short range. Instead, we sort (or otherwise canonically group) the received records by their full representation. For example, r may sort the list of pairs (ID(u), t(u)) by the key t(u) using comparison-based sorting in time O(Wlog n), then scan the sorted list and assign consecutive ranks to distinct keys. This computation fits in TIME(poly(n)) and uses O(W) words of memory at r.
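
The leader-side sort-and-rank step is short enough to state in full; the following sketch (ours) is one way to implement the exact canonicalization just described.

```python
def canonicalize_at_leader(records):
    """records: list of (node_id, type_repr) pairs gathered at the leader,
    where type_repr = (color, tuple_of_sorted_neighbor_colors).

    Returns {node_id: rank} with equal ranks iff equal types.
    """
    # Sort by the full type representation: O(W log n) comparisons overall.
    records = sorted(records, key=lambda r: r[1])
    labels, rank, prev = {}, -1, None
    for node_id, type_repr in records:
        if type_repr != prev:        # new distinct type => next rank
            rank, prev = rank + 1, type_repr
        labels[node_id] = rank
    return labels

# Example with three nodes and two distinct types:
recs = [(7, (0, (0, 1))), (3, (0, (0, 1))), (9, (1, (0,)))]
print(canonicalize_at_leader(recs))  # {7: 0, 3: 0, 9: 1}
```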

Finally, r sends the resulting label yu (one word) back to each node u along T, again using pipelining under the global budget B. This takes
$$ L_{\mathrm{scatter}} \;=\; O\!\left(D(G)+\frac{n}{B}\right) \;=\; O\!\left(D(G)+\frac{m}{B}\right) $$
layers, dominated by the gather phase when m = Ω(n). Combining the steps yields the routed-BGFM depth bound
$$ L \;=\; O\!\left(D(G)+\frac{m}{B}\right), $$
matching Theorem 4 (up to the log n factor gap in the deterministic lower bound, which is an artifact of measuring the budget in words).

In direct-BGFM, nodes may send point-to-point messages to arbitrary destinations, with the only constraint being the total per-layer budget of B words. With IDs, each node can address the leader r directly. The gather therefore becomes a pure streaming problem: each node u sends its ℓu-word record of t(u) to r, possibly split across multiple layers. Since r can receive at most B words per layer in total, the gather finishes in
$$ L_{\mathrm{gather}}^{\mathrm{dir}} \;=\; O\!\left(\frac{W}{B}\right) \;=\; O\!\left(\frac{m}{B}\right) $$
layers. The leader performs the same exact canonicalization as above, and then streams one-word labels yu back to the nodes, taking O(n/B) layers. Thus the direct-BGFM depth bound is
$$ L \;=\; O\!\left(\frac{m}{B}\right), $$
again matching Theorem 4.

The only nontrivial computation occurs at the coordinator r, which must implement an exact collision-free grouping of {t(u) : u ∈ V}. Under 𝒞 = TIME(poly(n)) this is straightforward via sorting or balanced search trees keyed by full type descriptions. If one wishes to reduce the peak memory at r, a standard variant is to use several aggregators r1, …, rs and perform a multi-stage merge: nodes first stream their types to aggregators, each aggregator sorts locally and emits a sorted run, and a final merge (at one node or distributed) produces global ranks. This does not change the asymptotic communication volume W and hence does not improve the m/B term, but it can reduce the per-node memory footprint from O(m) to O(m/s) at the cost of additional constant-factor coordination.

Finally, if 𝒞 were substantially weaker than polynomial time (e.g., constant-depth circuits), exact sorting/grouping may be infeasible even after communication succeeds. In such regimes one may replace exact canonicalization by short fingerprints of types and accept a small collision probability. Since this alters the correctness condition from a worst-case guarantee over all pairs u, v to a bounded-error guarantee over random seeds, we treat it separately in the next section, where we also explain why shared (public-coin) randomness enables qualitatively stronger guarantees than private randomness under the same communication budget.


7. Randomness Separations and Approximate WL: public-coin fingerprinting algorithm (depth 1), private-coin limitations under low budget, and a tight tradeoff between bits and collision probability.

The lower bounds of Section 5 apply to exact one-step refinement (WL-ITER), where the output must be a collision-free canonical labeling of the set of WL-types. In this section we relax the requirement to a bounded-error fingerprinting condition (WL-FP) and show that, with public coins, the task becomes solvable in constant depth (indeed depth 1 after the local neighbor-color exchange). We then formalize a separation showing that private coins are strictly weaker under small budgets: in low-depth regimes, achieving the same collision guarantees with private randomness requires an additional Ω(log n) factor in communicated bits (or an Ω(log n) increase in depth at fixed budget).

We assume that each node u can materialize its one-step type
$$ t(u) \;=\; \bigl(x_u,\; M_u\bigr), \qquad M_u := \multiset\{x_v : v\in N(u)\}, $$
after one local exchange of colors along the edges (routed-BGFM) or by local adjacency access (direct-BGFM). The remaining step is to produce a short bit string fu such that equality of WL-types implies equality of fingerprints deterministically, while distinct types collide only with small probability over shared randomness.

We fix public coins that choose a prime P and a small collection of hash functions
g1, …, gs : [p(n)] → ℤP   and   h : [p(n)] → ℤP.
(For h, one may take the standard affine family z ↦ az + b mod  P with a ≠ 0, which is pairwise independent when P > p(n). For the gi, pairwise independence alone does not control the multiset sums used below; we therefore take the gi to be independent uniformly random functions, which is admissible because public randomness is shared and is not charged against the communication budget. A seed-efficient alternative is the polynomial fingerprint gi(z) := ri^z mod  P for uniformly random ri ∈ ℤP, at the cost of a p(n)/P rather than 1/P collision bound per sum.) Given these public choices, each node u computes
fu := (h(xu), S1(u), …, Ss(u)),   where   Si(u) := ∑v ∈ N(u) gi(xv) (mod  P).

This computation is purely local once u has received {xv : v ∈ N(u)}. Thus, after the one neighbor exchange needed to read neighbor colors, computing fu yields a depth-1 algorithm in either BGFM variant. In particular, in the direct-BGFM variant there is no need to spend any additional global communication budget B beyond the color exchange inherent in defining t(u).
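
A complete toy implementation of the depth-1 fingerprint (ours; the shared seed, the choice of prime, and the table-based random functions are illustrative assumptions consistent with the public-coin regime):

```python
import random

def make_public_hashes(seed, p_colors, P, s):
    """Shared-seed hash tables: h plus g_1..g_s as random functions
    [p_colors] -> Z_P. All nodes call this with the same public seed,
    so the tables coincide without any communication."""
    rng = random.Random(seed)
    h = [rng.randrange(P) for _ in range(p_colors)]
    gs = [[rng.randrange(P) for _ in range(p_colors)] for _ in range(s)]
    return h, gs

def fingerprint(u, adj, x, h, gs, P):
    """f_u = (h(x_u), S_1(u), ..., S_s(u)); purely local given neighbor colors."""
    sums = tuple(sum(g[x[v]] for v in adj[u]) % P for g in gs)
    return (h[x[u]],) + sums

# Depth-1 run on a path 0-1-2-3 with all colors equal.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
x = {u: 0 for u in adj}
P, s, shared_seed = 1_000_003, 1, 42            # prime P ~ n^2/eps
h, gs = make_public_hashes(shared_seed, p_colors=1, P=P, s=s)
f = {u: fingerprint(u, adj, x, h, gs, P) for u in adj}
print(f)  # endpoints agree, interior nodes agree; classes differ w.h.p.
```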

If t(u) = t(v), then xu = xv and Mu = Mv as multisets. Since the public functions h, gi are fixed and the sums Si(⋅) are multiset-invariant, it follows deterministically that fu = fv. This direction is unconditional and does not use randomness bounds.

Fix two nodes u ≠ v with t(u) ≠ t(v). Then either xu ≠ xv or Mu ≠ Mv. If xu ≠ xv, then Pr [h(xu) = h(xv)] ≤ 1/P by pairwise independence. Otherwise xu = xv and the multisets differ. Let Δ denote the (nonempty) multiset difference between Mu and Mv: formally, Δ is the signed multiset over [p(n)] with integer coefficients cz := multMu(z) − multMv(z), not all zero. Then for a fixed i,
Si(u) − Si(v) ≡ ∑z czgi(z) (mod  P).
Since the values gi(z) are independent and uniform over ℤP, the random variable ∑z czgi(z) is uniform over ℤP whenever Δ ≠ 0: condition on the values gi(z′) for all z′ ≠ z at some fixed z with cz ≠ 0; then czgi(z) shifted by the conditioned remainder is uniform, because |cz| ≤ n < P makes cz invertible modulo P. Consequently,
$$ \Pr[S_i(u)=S_i(v)] \;=\; \Pr[S_i(u)-S_i(v)\equiv 0 \!\!\!\pmod P] \;\le\; \frac{1}{P}. $$
(Under the polynomial-fingerprint instantiation the same probability is at most p(n)/P, which suffices for the parameter choices below.)
Using s independent choices g1, …, gs (via the public seed) and taking the conjunction of all s modular sums, we obtain
Pr [(S1(u), …, Ss(u)) = (S1(v), …, Ss(v))] ≤ P^{−s}.
Choosing P = Θ(n^c/ε) for a sufficiently large constant c and s = Θ(1), or more simply choosing P = Θ(n^2/ε) and s = 1, yields a per-pair collision probability at most O(ε/n^2). A union bound over the at most $\binom{n}{2}$ unordered pairs implies
Pr [∃u ≠ v : t(u) ≠ t(v) ∧ fu = fv] ≤ ε,
which is the guarantee in Theorem 5.

Each field element of ℤP is representable in O(log P) bits. Thus the total fingerprint length is
k = O((s + 1)log P) = Θ(log (n/ε))
for the above parameter choices, matching the dependence claimed in Theorem 5 up to constant factors. We emphasize that the log n term is not an artifact: if one requests a collision guarantee over all node pairs, then ε must be allocated across Θ(n^2) events, and a logarithmic dependence on n is unavoidable for such a global guarantee.
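
A quick Monte Carlo check of the per-pair collision bound under the random-function instantiation (ours; the multisets and trial count are arbitrary):

```python
import random

def collision_rate(M_u, M_v, P, trials=100_000):
    """Estimate Pr[S(u) = S(v)] over a fresh random function g per trial,
    for two distinct color multisets M_u != M_v."""
    colors = sorted(set(M_u) | set(M_v))
    hits = 0
    for _ in range(trials):
        g = {z: random.randrange(P) for z in colors}
        if sum(g[z] for z in M_u) % P == sum(g[z] for z in M_v) % P:
            hits += 1
    return hits / trials

# Distinct multisets over the same support; the bound predicts <= 1/P.
print(collision_rate([0, 0, 1], [0, 1, 1], P=97))  # ~ 1/97 ~ 0.0103
```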

We now contrast the public-coin construction with the private-coin regime, where each node samples its own randomness independently. The formal obstacle is that WL-FP requires perfect completeness (type-consistency): if t(u) = t(v), then fu = fv must hold deterministically (or at least with probability 1, depending on the exact definition one adopts). Under private coins, two nodes of the same type cannot in general coordinate on the same random hash function (or the same random salt) unless they exchange enough information to synchronize their randomness; such synchronization is itself an information-transfer task.

This intuition can be made precise via the equality reduction underlying Theorem 6. One embeds a two-party EQ instance into a graph in which two designated nodes u, v have equal WL-type if and only if Alice’s string equals Bob’s string. Any BGFM algorithm that, with collision probability at most 1/3, allows a referee (or the designated nodes themselves) to distinguish whether t(u) = t(v) yields a two-party communication protocol for EQ with comparable transcript size. In the public-coin setting, EQ admits an O(1)-bit protocol with shared randomness (e.g., compare a shared random hash of the strings). In the private-coin setting, the classic lower bound is Θ(log n) communicated bits for constant error. Translating back, one obtains that for constant depth and small per-layer budget B, private-coin BGFM must either (i) increase the total communicated bits by Ω(log n) compared to the public-coin algorithm, or (ii) increase depth by Ω(log n) at fixed B, in order to reach the same bounded-error collision guarantee.
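
To make the public/private contrast concrete, here is the textbook public-coin equality protocol as a sketch (ours): with shared random vectors, Alice sends t inner-product bits, independent of the input length m.

```python
import random

def public_coin_eq(a, b, shared_seed, t=20):
    """Public-coin EQ: both parties derive the same random vectors from the
    shared seed; Alice communicates t bits total, independent of len(a)."""
    rng = random.Random(shared_seed)   # the public coins
    m = len(a)
    for _ in range(t):
        r = [rng.randrange(2) for _ in range(m)]
        alice_bit = sum(ai * ri for ai, ri in zip(a, r)) % 2  # sent to Bob
        bob_bit = sum(bi * ri for bi, ri in zip(b, r)) % 2
        if alice_bit != bob_bit:
            return False               # certainly a != b
    return True                        # a = b, or a collision of prob 2^-t

a = [0, 1, 1, 0, 1] * 200              # m = 1000 bits
b = list(a); b[777] ^= 1               # flip one position
print(public_coin_eq(a, a, shared_seed=7))  # True
print(public_coin_eq(a, b, shared_seed=7))  # False with prob 1 - 2^-20
```

With private coins the parties cannot agree on r for free; a private-coin protocol must instead spend Θ(log m) bits, which is exactly the gap invoked above.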

Operationally, the public-coin algorithm uses the shared seed to ensure that all nodes apply the same sketch functions to their multisets. With only private coins, any attempt to achieve the implication t(u) = t(v) ⇒ fu = fv forces the randomness to be a deterministic function of t(u) (or to be communicated), and this reintroduces the global-information bottleneck that exact WL-ITER suffers from under low budgets.

The public-coin fingerprinting viewpoint also clarifies a sharp tradeoff between the number of output bits and the admissible collision probability. If we demand only a per-pair collision probability δ, then choosing P = Θ(1/δ) and constant s suffices, giving k = Θ(log (1/δ)). If instead we demand a global guarantee Pr [∃u ≠ v : t(u) ≠ t(v) ∧ fu = fv] ≤ ε, then setting δ = Θ(ε/n^2) and applying a union bound yields k = Θ(log (n/ε)), exactly as in Theorem 5. This is the regime relevant to comparing refinement outcomes across all nodes simultaneously, and it is the sense in which depth-1 approximate refinement exhibits a qualitative separation from exact canonicalization under the same BGFM communication constraints.


8. Experiments (Optional but Strengthening): synthetic and real graphs; vary attention budget/sparsity; measure WL-type recovery; compare private vs public seeds and ID features; test where theory matches practice.

We include an experimental study intended to validate, in a controlled setting, the qualitative predictions of our BGFM analysis: (i) the sharp gap between exact canonicalization and bounded-error fingerprinting under small budgets, (ii) the dependence of recovery on the global communication budget B (or the number of attended pairs K in direct-BGFM instantiations), and (iii) the separation between public-coin and private-coin randomness in low-depth regimes. The experiments are not meant to optimize downstream task accuracy; rather, we measure the intrinsic ability of budgeted architectures to reproduce the one-step partition (exactly or approximately) from the same local information.

We evaluate two families of implementations consistent with the model.
In the routed setting we implement synchronous message passing on the input graph G, where each layer permits at most B words of communication globally (counting one word as O(log n) bits). Practically, we realize this via a global scheduler that selects a multiset of directed edge transmissions each round, of total cardinality at most B, and zeroes out all other edge messages; a minimal scheduler sketch is given below. In the direct setting we implement sparse-attention layers in which at most K ordered node pairs (u, v) are active per layer; each active pair allows a one-word message mu → v, so the per-layer word budget is B = Θ(K) up to constant encoding factors. We consider both fixed attention patterns (e.g., random or degree-biased) and learned patterns (trained end-to-end to minimize WL-type collisions), but always enforce the same B constraint for comparability.
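
A minimal sketch of the budget-enforcing scheduler for the routed setting (ours; the uniform-random selection policy is an arbitrary baseline):

```python
import random

def budgeted_round(adj, payloads, B, rng=random):
    """One routed-BGFM layer: keep at most B directed edge transmissions.

    adj:      dict node -> list of neighbors (undirected adjacency)
    payloads: dict (u, v) -> one-word message queued on directed edge (u, v)
    Returns the delivered subset; all other edge messages are dropped
    (zeroed out), enforcing the global per-layer budget.
    """
    candidates = [e for e in payloads if e[1] in adj and e[0] in adj[e[1]]]
    chosen = candidates if len(candidates) <= B else rng.sample(candidates, B)
    return {e: payloads[e] for e in chosen}
```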

Given an initial color vector x ∈ [p(n)]^n, we compute the ground-truth one-step WL-types
$$ t(u)=(x_u,\multiset\{x_v:v\in N(u)\}) $$
and the induced equivalence relation u ∼ v ⇔ t(u) = t(v). We then evaluate algorithms that output either (a) exact labels yu and aim to satisfy yu = yv ⇔ t(u) = t(v), or (b) fingerprints fu ∈ {0, 1}^k and aim to satisfy the WL-FP condition. Since our emphasis is one-step refinement, we restrict attention to algorithms that (as in our theoretical setup) are allowed one local exchange to materialize neighbor colors, and then use additional BGFM layers under budget B for any subsequent coordination.

We report three primary metrics. First, the pairwise disagreement rate
PairErr := Pru ≠ v[1{t(u) = t(v), fu ≠ fv} + 1{t(u) ≠ t(v), fu = fv}],
estimated over uniformly sampled node pairs; for exact algorithms we set fu := yu. Second, the global collision indicator
GlobCol := 1{∃u ≠ v : t(u) ≠ t(v) ∧ fu = fv},
which matches the guarantee stated for Theorem 5; we average GlobCol over random seeds and graph instances. Third, we compute a partition-agreement score between the partition induced by f and the true WL partition, using adjusted Rand index (ARI) and normalized mutual information (NMI), since these summarize multiway clustering agreement without privileging any specific labeling.
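
A compact harness for the three metrics (ours; it assumes scikit-learn is available for ARI and NMI):

```python
from itertools import combinations
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def evaluate(t, f, pair_samples=None, rng=None):
    """t, f: dicts node -> type / fingerprint.
    Returns (PairErr, GlobCol, ARI, NMI); pass a random.Random as rng to
    estimate PairErr from pair_samples sampled pairs instead of all pairs."""
    nodes = list(t)
    pairs = (list(combinations(nodes, 2)) if pair_samples is None
             else [tuple(rng.sample(nodes, 2)) for _ in range(pair_samples)])
    pair_err = sum(((t[u] == t[v]) != (f[u] == f[v])) for u, v in pairs) / len(pairs)
    glob_col = any(t[u] != t[v] and f[u] == f[v]
                   for u, v in combinations(nodes, 2))
    # Relabel types/fingerprints as integers for the clustering scores.
    t_ids = {v: i for i, v in enumerate(set(t.values()))}
    f_ids = {v: i for i, v in enumerate(set(f.values()))}
    t_lab = [t_ids[t[u]] for u in nodes]
    f_lab = [f_ids[f[u]] for u in nodes]
    return (pair_err, glob_col,
            adjusted_rand_score(t_lab, f_lab),
            normalized_mutual_info_score(t_lab, f_lab))
```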

For each method we sweep depth L ∈ {1, 2, 4, 8, …} and budget B (or K) across a logarithmic grid, reporting PairErr, Pr [GlobCol = 1], ARI, and wall-clock communication volume. The central plots are depth–budget tradeoff curves: for a fixed target error (e.g., Pr [GlobCol = 1] ≤ 0.1), we record the minimal L as a function of B.

We compare the following algorithmic variants, all using the same local neighbor-color exchange to form t(u) when needed.
(i) implementing : we choose a public seed defining the hash family, and each node outputs fu = (h(xu), S1(u), …, Ss(u)). We vary k by changing P and s, and we check empirically that Pr [GlobCol = 1] tracks the predicted dependence on k and n.
(ii) : each node samples its own hash parameters independently and applies the same sketch form; this deliberately violates type-consistency unless additional synchronization is performed. We then test two natural “repair” mechanisms: (a) communicating a seed from a designated leader (when IDs are present), and (b) attempting to derive a shared seed from local structure (no IDs). The first consumes budget; the second is expected to fail on symmetric instances.
(iii) as a correctness baseline when IDs are allowed; we implement streaming under budget B and measure the empirical depth needed to complete the gather and scatter phases on graphs of varying diameter and edge count.
(iv) Learned sparse-attention policies (direct-BGFM instantiations) trained to minimize a supervised loss that encourages fu = fv when t(u) = t(v) and discourages collisions otherwise, with a fixed k. We treat these as empirical counterparts to “adaptive routing/attention” policies and test whether learning can beat the public-coin sketch at the same B and L (our expectation is negative in worst-case families, but potentially positive on benign real graphs).
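For (i) and (ii), a minimal sketch of the fingerprint form fu = (h(xu), S1(u), …, Ss(u)); the SHA-256-based hash family and the specific prime modulus are illustrative stand-ins for the hash family used in the theory:

```python
# Additive sketches of the neighbor multiset under a seeded hash; a shared
# seed map yields the public-coin scheme, independent per-node seeds yield
# the private-coin scheme (which breaks type-consistency).
import hashlib

def _h(seed, j, color, P):
    msg = f"{seed}|{j}|{color}".encode()
    return int.from_bytes(hashlib.sha256(msg).digest()[:8], "big") % P

def fingerprint(adj, x, seed_of, P=2_147_483_647, s=3):
    """seed_of: node -> seed; P: prime modulus; s: number of sketches."""
    f = {}
    for u in adj:
        seed = seed_of(u)
        sketches = tuple(sum(_h(seed, j, x[v], P) for v in adj[u]) % P
                         for j in range(1, s + 1))
        f[u] = (_h(seed, 0, x[u], P),) + sketches
    return f

# Public coins: every node derives the same seed.
f_pub = lambda adj, x: fingerprint(adj, x, seed_of=lambda u: 42)
# Private coins: per-node seeds, as in variant (ii).
f_priv = lambda adj, x: fingerprint(adj, x, seed_of=lambda u: ("priv", u))
```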

We use both synthetic and real graphs to separate worst-case phenomena from typical-case behavior.
On the synthetic side, we include: (a) paths and grids to expose the routed diameter barrier; (b) random d-regular expanders to test whether small diameter eliminates the routed overhead and isolates the m/B term; (c) stochastic block models to vary the number of distinct WL-types by tuning intra-/inter-community densities; and (d) explicit lower-bound gadgets derived from equality-type constructions (with controlled m and planted pairs of nodes whose type equality encodes a hidden bit), to test private/public separations under symmetry and limited communication.
On the real side, we use standard graph collections (citation networks, collaboration networks, and small social graphs), but we evaluate only WL-type recovery metrics, not node classification accuracy, to avoid confounding expressivity with label noise.
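For the synthetic families (a)–(c), a networkx-based sketch (networkx assumed available); the equality gadgets of (d) depend on the Section 5 construction and are only stubbed here:

```python
# Synthetic suite: paths/grids (routed diameter barrier), random regular
# expanders (small diameter), and SBMs (tunable number of WL-types).
import networkx as nx

def synthetic_suite(n, d=4, seed=0):
    side = int(n ** 0.5)
    return {
        "path": nx.path_graph(n),
        "grid": nx.grid_2d_graph(side, side),
        "expander": nx.random_regular_graph(d, n, seed=seed),
        "sbm": nx.stochastic_block_model(
            [n // 2, n // 2], [[0.2, 0.01], [0.01, 0.2]], seed=seed),
        # (d) equality gadgets: construction-specific, omitted here.
    }
```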

We treat the availability of unique identifiers as a binary experimental condition: (a) with IDs, where each node has a unique ID feature in [n^{O(1)}]; (b) anonymous, where nodes have no IDs and only the input colors x. For anonymous graphs we also test mild preprocessing that assigns random features (private coins) to nodes, which is a common heuristic in practice; our measurements record whether this suffices to emulate public coins at low budget, and how much additional communication is required to synchronize such randomness to achieve type-consistency.

We use the experiments to check three theory-aligned predictions.
First, for public-coin sketches we expect Pr[GlobCol = 1] to decrease essentially exponentially in k at fixed n, and to require k = Θ(log(n/ε)) to keep Pr[GlobCol = 1] ≤ ε as n grows. Second, for exact refinement with IDs we expect the measured depth to scale as D + Θ(m/B) in routed graphs (up to scheduling constants), and as Θ(m/B) in direct graphs. Third, for private coins at small B and constant L, we expect a persistent gap relative to public coins on the equality-derived gadgets: either type-consistency fails (same-type nodes disagree), or collision rates remain bounded away from the public-coin baseline unless an additional Ω̃(log n) bits are devoted to seed synchronization, consistent with the qualitative content of Theorem~6.
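For reference, the first prediction is the standard union-bound calculation, under the idealization that fingerprints of distinct types collide like a uniform hash into {0, 1}^k:
$$ \Pr[\mathrm{GlobCol}=1]\;\le\;\binom{n}{2}\,2^{-k}\;\le\;n^2\,2^{-k}, \qquad\text{so } k \;=\; 2\log_2 n + \log_2(1/\varepsilon) \;=\; \Theta(\log(n/\varepsilon)) \text{ suffices for error } \le \varepsilon. $$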


9. Extensions and Open Problems: multi-token/hub variants, beyond 1-step WL, tighter expander-based upper bounds under budget, and learnability vs computability.

Our analysis isolates the one-step refinement operator and a global communication budget as the primary bottlenecks in WL-ITER and WL-FP. Several natural extensions suggest themselves; we record them as concrete directions where we do not yet have sharp theorems.

A common architectural modification of message-passing and attention-based models is to introduce a small number of auxiliary “tokens” (sometimes called hubs, global nodes, or registers) that can communicate broadly. In our language, this corresponds to augmenting the node set with r extra nodes H = {h1, …, hr} and edges from each hi to many or all u ∈ V, or alternatively allowing a restricted set of nodes to be endpoints of many direct-BGFM messages subject to the same global budget B. It is tempting to view such hubs as a way to simulate public coins (by broadcasting a seed) and to compress the gather–sort–scatter routine (by centralizing coordination).
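The augmentation itself is easy to state in code; a minimal sketch (hub naming is hypothetical), with the substantive modeling question being how hub traffic is charged against B:

```python
# Add r hub nodes adjacent to every original node; because hubs share the
# same edge set and scheduler, their traffic is charged against B.
def add_hubs(adj, r):
    hubs = [f"hub{i}" for i in range(r)]
    out = {u: list(vs) for u, vs in adj.items()}
    for h in hubs:
        out[h] = list(adj)          # hub sees every original node
        for u in adj:
            out[u].append(h)
    return out, hubs
```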

Two questions appear fundamental. First, what is the correct analogue of the lower bounds when we permit r hubs but keep the per-layer global budget B fixed? If hubs are allowed to receive fan-in outside the budget accounting, the model collapses to a centralized streaming algorithm and our bounds are no longer meaningful. If, however, every hub message is charged against B, then hubs may still improve constants but should not change the asymptotic m/(B log n) information constraint for exact canonicalization. We do not currently have a theorem formalizing this intuition for the most permissive hub semantics used in practice (e.g., a single “CLS” token attending to many nodes with a fixed-dimensional embedding). Second, for routed-BGFM, augmenting G with a hub connected to all nodes destroys the diameter D(G) term, but only if those hub edges are present and usable under the same budget. It is therefore natural to ask for a tradeoff theorem of the form
$$ L \;\ge\; \Omega\!\left(\frac{m}{B\log n}\right)\quad\text{and}\quad L \;\ge\; \Omega\!\left(\frac{\mathrm{dist}_G(A,B)}{\text{bandwidth through hubs}}\right), $$
where the second term captures how much of the original diameter barrier remains when only O(B) words can traverse the hub per layer. A clean formulation may require introducing an explicit “edge-capacity” view of the budget.

We restricted attention to a single WL refinement step because it cleanly separates local multiset formation from global canonicalization. For k-step WL (iterating the one-step operator k times), two distinct resource phenomena arise.

On the one hand, even the k-hop WL-type requires information propagation to radius k, implying an unavoidable Ω(k) depth term in routed-BGFM. On the other hand, each refinement step produces new colors of magnitude potentially polynomial in n, and exact global relabeling after each step seems to require repeated instances of the same canonicalization bottleneck. This suggests a lower bound of the schematic form
$$ L \;\ge\; \Omega\!\left(k + \frac{T_k}{B\log n}\right), $$
where Tk measures the total description length of the multiset data that must be globally de-duplicated across k iterations. Even defining Tk in a model-independent way is subtle: in dense graphs the neighborhood multisets grow quickly, while in bounded-degree graphs the number of distinct WL-types may stabilize early.
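To make the discussion concrete, a sketch of the iterated operator built on the one_step_wl reference above, tracking the number of distinct colors per iteration as one crude proxy for Tk (whose proper definition is exactly what is at issue):

```python
# k-step refinement: relabel after each step and record how many distinct
# colors survive each iteration (stabilization is visible as a plateau).
def k_step_wl(adj, x, k):
    colors, history = dict(x), []
    for _ in range(k):
        _, colors = one_step_wl(adj, colors)
        history.append(len(set(colors.values())))
    return colors, history
```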

An open problem is to obtain matching upper bounds that avoid naively paying a full gather–sort–scatter cost at every iteration. In classical distributed settings one can sometimes exploit incremental refinement, stable partitions, or the fact that many nodes share types. In BGFM terms, we would like an algorithm whose total communication scales with the number of type changes across iterations rather than with the raw number of edges. Whether such incremental schemes can be made worst-case optimal under a strict global budget B remains unclear.

Our routed-BGFM upper bound for exact WL-ITER with IDs follows a generic BFS-tree streaming schedule, yielding L = O(D + m/B). This is tight in general, but it may be pessimistic on graphs with good routing properties. For constant-degree expanders, for instance, one expects high throughput for many-to-one and one-to-many communication; in classical routing models, expanders support near-optimal congestion–dilation tradeoffs.

The corresponding open question in BGFM is: can we replace the term m/B by a smaller quantity that reflects the existence of many edge-disjoint paths or good mixing, while still respecting the budget constraint? In particular, if B is moderately large (say B = Θ(n)) and G is an expander of diameter O(log n), can we complete exact gather–sort–scatter in O(log n) layers, essentially matching the diameter, by using the budget to saturate the network’s expansion rather than streaming edge-by-edge? Any such statement would require a careful distinction between (i) local edge-capacity constraints (as in CONGEST-style models such as RL-CONGEST) and (ii) our global word budget, under which the scheduler may concentrate all communication on a small cut. One possible path is to show that on expanders, any schedule that uses B words per layer can be “spread” so as to achieve balanced load and thereby improve end-to-end throughput. We do not know whether this is true in our accounting, and we suspect counterexamples when the budget is global but adversarially scheduled.

The public-coin depth-1 sketch for WL-FP highlights that the main obstacle to approximate refinement is not information flow but the consistency of hashing across nodes. This raises several questions about the power of preprocessing and weak symmetry breaking.

First, in anonymous graphs, what is the minimal additional information that must be injected to emulate public coins? A single shared seed of O(log n) bits suffices, but producing it without IDs is nontrivial on symmetric instances. One can allow a preprocessing map fpre that assigns random features (private coins) to nodes; empirically this often works, but our lower-bound perspective suggests families where it cannot be synchronized without spending Ω̃(log n) additional bits in the computation itself. It would be valuable to formalize a taxonomy of preprocessing regimes (e.g., independent random bits per node, limited shared randomness among neighbors, or a small number of globally shared bits) and to pin down which regimes collapse the public/private separation of Theorem~6.

Second, even with public coins, we have only analyzed one-step fingerprints based on additive sketches of neighbor multisets. Are there strictly better depth-1 sketches (at comparable k = Θ(log (n/ε))) that reduce constants or improve robustness to adversarially chosen colors x? Conversely, can we prove that Θ(log (n/ε)) bits are necessary for collision probability  ≤ ε in the worst case, even allowing arbitrary 𝒞 and arbitrary depth-1 computation after the neighbor exchange? Such lower bounds would complement Theorem~5 by showing the sketch length is not merely sufficient but essentially necessary under the stated failure criterion.

Finally, our model is motivated by architectures trained end-to-end, where the communication pattern (attention) and the update maps are learned. This raises a conceptual distinction between computability and learnability: a function may be computable in BGFM with some depth/budget, yet difficult to learn from data by gradient-based methods; conversely, a learned heuristic may succeed on typical instances without contradicting worst-case lower bounds.

A natural open problem is to articulate learnability statements that respect the BGFM resource constraints. For example, one may fix B and L, restrict UPD() to a parametric family (say, bounded-width MLPs), and ask whether minimizing a supervised collision loss can reliably recover the public-coin fingerprint map, or approximate the gather–sort–scatter labeling on benign graph distributions. On the negative side, one may ask for distributional lower bounds: are there graph families (perhaps derived from equality gadgets) such that any model within a given (B, L) budget that achieves low collision probability must implicitly implement a hard communication protocol, thereby making learning sample-inefficient or optimization-hard? Establishing such results would require importing tools from statistical query complexity, hardness of learning, or information-theoretic generalization, and interfacing them with our communication-complexity reductions.
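As one illustration of what such a supervised collision loss might look like, a contrastive form over real-valued fingerprints (entirely an assumption, not an objective from this paper), with PyTorch assumed:

```python
# Pull same-type pairs together, push distinct-type pairs beyond a margin.
import torch

def collision_loss(f, same_type, margin=1.0):
    """f: (n, k) real-valued fingerprints; same_type: (n, n) boolean matrix."""
    d = torch.cdist(f, f)                                   # pairwise distances
    pos = same_type.float() * d.pow(2)                      # attract same types
    neg = (~same_type).float() * torch.clamp(margin - d, min=0).pow(2)
    mask = ~torch.eye(f.shape[0], dtype=torch.bool)         # drop diagonal
    return (pos + neg)[mask].mean()
```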

We view these questions as complementary to our current theorems: the theorems identify what must happen in successful computation under budget, while the open problems ask which of those computations are realized (or approximated) by the inductive biases and training dynamics of modern graph foundation models.