
Beyond 1-WL at Scale: Sparse Graph Transformers with Stable Spectral Positional Encoding

Table of Contents

  1. Introduction and contributions: expressivity bottlenecks of MPNNs; why transformers + positional encodings are the 2026 frontier; summary of separation and resource bounds.
  2. Preliminaries: graphs, permutations, equivariance/invariance; 1-WL; MPNNs as 1-WL-equivalent on featureless graphs; attention-based graph transformers and sparsification via inducing anchors.
  3. Problem formulation: graph-level separation under resource constraints; the canonical WL-hard family used for the main separation (connectivity on 2-regular graphs).
  4. Model definition (SAGT): (i) sparse anchor attention pattern; (ii) stable spectral positional encoding used through orthogonally invariant interactions; (iii) invariant readout.
  5. Lower bound for MPNNs: proof that any 1-WL-equivalent MPNN cannot distinguish C2n from Cn ⊔ Cn on featureless graphs.
  6. Upper bound/separation for SAGT: explicit construction of parameters and proof that SAGT distinguishes the pair for all n ≥ 3.
  7. Necessity and ablations as theorems: (a) without PE, SAGT collapses on the pair; (b) without anchor channel, bounded-hop variants fail for large n (resource-aware necessity).
  8. Complexity and stability analysis: time/memory bounds; PE computation cost and approximations; stability with respect to eigenvector gauge and small perturbations.
  9. (Optional but recommended) Experiments: synthetic WL-hard benchmarks; scaling curves; ablations verifying PE and anchor channel necessity; comparisons to GIN, k-hop GNNs, PEG-only transformers, and spectral baselines.
  10. Discussion and limitations: scope of separation; fairness of preprocessing assumptions; extensions to other global properties and stronger WL-hard families.

Content

1. Introduction and contributions: expressivity bottlenecks of MPNNs; why transformers + positional encodings are the 2026 frontier; summary of separation and resource bounds.

Message-passing neural networks (MPNNs) remain the default formalism for learning on sparse graphs because they respect permutation symmetries and admit near-linear implementations in the number of edges. Their core mechanism is local aggregation: at each layer, every node updates its representation by applying a shared function to its previous state and a permutation-invariant summary of its neighbors’ states. This architectural restriction has a precise expressivity interpretation: on graphs with identical initial node features, standard MPNNs are no more discriminative than the one-dimensional Weisfeiler–Lehman color refinement procedure (1-WL). In particular, on highly symmetric inputs the node states can remain indistinguishable, and any permutation-invariant graph readout then collapses to the same value across non-isomorphic graphs.

A canonical obstruction arises already for 2-regular graphs. Consider the connected cycle graph C2n and the disconnected union Cn ⊔ Cn. Both graphs are 2-regular and featureless, and 1-WL fails to separate them: every node sees the same multiset of neighbor colors at every refinement step. Consequently, any 1-WL-equivalent MPNN produces identical node embeddings (up to permutation) on both graphs, and therefore any permutation-invariant pooling yields identical graph-level outputs. This provides a minimal example of a “WL-hard” separation task that is not contrived by adversarial features, but instead follows from symmetry and locality.

Graph transformers are a natural candidate to surpass this limitation. By allowing attention-based mixing that is not confined to immediate adjacency, one can in principle aggregate global information in a constant number of layers. However, two difficulties must be addressed simultaneously. First, attention must be compatible with graph symmetries: the model must be permutation equivariant at the node level and permutation invariant at the graph level. Second, naive global attention is dense, incurring Θ(n²) interactions per layer, which is incompatible with the sparse-graph regime where m = O(n) and scalability is measured against m rather than n².

The present work formalizes a route through these constraints by combining (i) sparse global mixing via a constant number of inducing tokens, and (ii) stable spectral positional encodings used only through orthogonally invariant quantities. The guiding principle is that symmetry breaking should be introduced in a controlled way: not by injecting node identifiers (which would violate permutation symmetry), but by attaching to each node a positional encoding derived from the graph itself, computed equivariantly with respect to node relabelings. Spectral methods provide such encodings: low-frequency eigenvectors of the (normalized) Laplacian capture global structure, and their computation on sparse graphs can be carried out via iterative methods using only matrix–vector products with the adjacency.

A technical issue is that eigenvectors are not uniquely defined: sign flips and rotations inside degenerate eigenspaces are unavoidable. We therefore impose a requirement on positional encodings: under a permutation Π, the encoding matrix transforms as PEd(ΠAΠ⊤) = Π PEd(A) R for some orthogonal R ∈ O(d). To ensure that the downstream model is well-defined under this ambiguity, we restrict attention biases to depend on PEd only through O(d)-invariant functions, such as inner products ⟨pu, pv⟩. This is sufficient to recover a meaningful, permutation-consistent notion of relative position without committing to an arbitrary eigenbasis.

Our main contribution is an explicit existence result for a sparse anchor graph transformer (SAGT) operating in the sparse regime. The architecture combines three ingredients within each layer: node-to-anchor attention (to summarize the entire graph into a constant number of anchor states), anchor-to-node attention (to broadcast global context back to nodes), and a local edge-restricted attention step over E augmented by a positional bias derived from ⟨pu, pv⟩. The resulting computation uses only E together with O(ns) node–anchor edges, where s = O(1), and thus scales as Õ(m + ns) per layer up to hidden-width factors. In this sense, the model is a transformer in its mixing mechanism, while retaining the algorithmic footprint of message passing.

We prove three statements, corresponding to expressivity, impossibility for 1-WL-equivalent MPNNs, and computational complexity.

These results position sparse transformer-style architectures with stable positional encodings as a principled extension of message passing: they retain permutation symmetry and sparse computation, yet provably exceed 1-WL expressivity on featureless inputs. The remainder of the paper proceeds by formalizing the symmetry notions and 1-WL baseline, specifying the SAGT architecture and its invariances, establishing the MPNN impossibility result, and then giving the constructive separation argument together with complexity bounds.


2. Preliminaries: graphs, permutations, equivariance/invariance; 1-WL; MPNNs as 1-WL-equivalent on featureless graphs; attention-based graph transformers and sparsification via inducing anchors.

We work with a simple undirected graph G = (V, E) with |V| = n and |E| = m. Its adjacency matrix is A ∈ {0, 1}n × n, with Auv = 1 iff (u, v) ∈ E, and degree matrix D = diag(A1). We write the (combinatorial) Laplacian as L = D − A; when needed, one may substitute the normalized Laplacian Lnorm = I − D−1/2AD−1/2, but the symmetry notions below are identical for either choice. A node relabeling is represented by a permutation matrix Π ∈ {0, 1}n × n, acting by conjugation on adjacency: A ↦ ΠAΠ⊤. Likewise, a node feature matrix X ∈ ℝn × dx transforms as X ↦ ΠX. We say that two labeled graphs represent the same unlabeled graph when they are related by such a permutation.

Let f be a node-level map producing n node embeddings from a graph (and possibly node features), i.e., f(A, X) ∈ ℝn × h. We call f permutation equivariant if
f(ΠAΠ⊤, ΠX) = Π f(A, X)   for all permutation matrices Π.
Similarly, a graph-level readout g(A, X) ∈ ℝq is permutation invariant if
g(ΠAΠ⊤, ΠX) = g(A, X)   for all Π.
In this paper the graphs are featureless unless explicitly stated: the initial node features are identical constants, which we may model as X = 1c⊤ for some fixed c ∈ ℝdx. In particular, equivariance/invariance must be achieved without relying on external node identifiers.

The 1-WL procedure maintains discrete colors cv(t) for nodes v ∈ V. It starts from an initial coloring cv(0) (for featureless graphs, all nodes share the same color), and iterates
$$ c_v^{(t+1)} \;=\; \mathrm{Hash}\!\left(c_v^{(t)},\ \{\!\{\,c_u^{(t)} : u\in N(v)\,\}\!\}\right), $$
where N(v) denotes the neighbors of v and $\{\!\{\cdot\}\!\}$ denotes a multiset. Two graphs are 1-WL-indistinguishable if, for every round t, the multisets of colors produced on the two graphs agree up to relabeling of color symbols. In particular, when a graph is regular and featureless, all nodes retain the same color at every round, since every node observes the same multiset at each refinement step.
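As a concrete illustration, the following minimal sketch (plain Python, no external dependencies; the helper names are ours, not from any reference implementation) runs 1-WL color refinement on C2n and on Cn ⊔ Cn and confirms that the resulting color histograms coincide, as claimed above.

```python
# Minimal 1-WL refinement sketch; an injective relabeling of signatures stands in for Hash.
def cycle_edges(nodes):
    """Undirected edges of a cycle through the given node list."""
    k = len(nodes)
    return [(nodes[i], nodes[(i + 1) % k]) for i in range(k)]

def wl_histogram(n_nodes, edges, rounds=10):
    """Sorted multiset of 1-WL colors after the given number of refinement rounds."""
    adj = {v: [] for v in range(n_nodes)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    color = {v: 0 for v in range(n_nodes)}          # featureless: constant initial color
    for _ in range(rounds):
        sig = {v: (color[v], tuple(sorted(color[u] for u in adj[v]))) for v in adj}
        relabel = {s: i for i, s in enumerate(sorted(set(sig.values())))}
        color = {v: relabel[sig[v]] for v in adj}
    return sorted(color.values())

n = 5
conn = cycle_edges(list(range(2 * n)))                                    # C_{2n}
disc = cycle_edges(list(range(n))) + cycle_edges(list(range(n, 2 * n)))   # C_n disjoint C_n
print(wl_histogram(2 * n, conn) == wl_histogram(2 * n, disc))             # True: 1-WL cannot separate
```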

An MPNN is a layered node-embedding computation H(t) ∈ ℝn × h, with H(0) derived from X, and updates of the schematic form
$$ h_v^{(t+1)} \;=\; \psi_t\!\left(h_v^{(t)},\ \mathrm{AGG}\bigl(\{\!\{\,\phi_t(h_v^{(t)}, h_u^{(t)}) : u\in N(v)\,\}\!\}\bigr)\right), $$
where AGG is permutation invariant over multisets (e.g., sum/mean/max), and ϕt, ψt are shared functions (e.g., MLPs) applied at every node. A graph-level embedding is then produced by an invariant readout, e.g.,
$$ z(G) \;=\; \mathrm{READOUT}\bigl(\{\!\{\,h_v^{(T)} : v\in V\,\}\!\}\bigr), $$
with READOUT permutation invariant (sum/mean pooling followed by an MLP is standard). By construction, the node map f(A, X) = H(T) is permutation equivariant, and the graph map g(A, X) = z(G) is permutation invariant.

The link to 1-WL is that the information available to a node at depth t is constrained to the rooted t-hop neighborhood aggregated via a multiset operator; thus, on featureless graphs, MPNNs cannot systematically refine symmetries beyond what 1-WL can certify. Concretely, one may state the following standard implication: if two nodes (possibly in different graphs) receive the same 1-WL color at round t, then any 1-WL-equivalent MPNN (i.e., any MPNN with permutation-invariant neighborhood aggregation and shared parameters) can be chosen so that their embeddings at layer t coincide, and conversely MPNN embeddings induce at most a refinement of 1-WL colors. For our purposes we use the simplest corollary: on a k-regular featureless graph, all node embeddings remain identical at every layer. Indeed, with hv(0) constant, induction gives that at step t every node has the same state; hence each node sees the same multiset of k identical messages, so the aggregator outputs the same value everywhere and the update preserves equality. Any invariant readout then depends only on n (and fixed parameters), and therefore cannot distinguish two non-isomorphic k-regular featureless graphs of the same size.

Transformer-style attention replaces fixed local aggregators with data-dependent mixing weights. Given token states Y ∈ ℝN × h, a (single-head) attention map has the form
$$ \mathrm{Attn}(Y)_i \;=\; \sum_{j\in \mathcal{N}(i)} \alpha_{ij} \, W_V y_j, \qquad \alpha_{ij} \;=\; \frac{\exp(\ell_{ij})}{\sum_{k\in\mathcal{N}(i)} \exp(\ell_{ik})}, $$
with logits ℓij = ⟨WQyi, WKyj⟩ + bij. In a graph transformer, 𝒩(i) is specified by an attention pattern; in the dense case 𝒩(i) = {1, …, N}, yielding Θ(N²) interactions. To respect permutation equivariance on node tokens, it suffices that (i) the same linear maps WQ, WK, WV are applied to every token of a given type, (ii) the allowed neighborhoods 𝒩(i) transform compatibly under Π, and (iii) any bias term bij is defined from permutation-equivariant data (for node tokens, typically from adjacency indicators and positional information derived from G).
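For concreteness, the following numpy sketch (illustrative only; the function name and signature are ours) evaluates this masked single-head attention over an arbitrary admissible edge set 𝒮, with an optional per-edge logit bias, matching the row-wise softmax above.

```python
import numpy as np

def sparse_attention(Xq, Ykv, S, Wq, Wk, Wv, bias=None):
    """Xq: (Nq, h) queries, Ykv: (Nk, h) keys/values, S: list of admissible (i, j) pairs,
    bias: optional array of per-edge logit offsets aligned with S."""
    out = np.zeros((Xq.shape[0], Wv.shape[1]))
    Q, K, V = Xq @ Wq, Ykv @ Wk, Ykv @ Wv
    rows = {}                                   # group admissible edges by query index
    for e, (i, j) in enumerate(S):
        rows.setdefault(i, []).append((j, e))
    for i, pairs in rows.items():
        logits = np.array([Q[i] @ K[j] + (0.0 if bias is None else bias[e]) for j, e in pairs])
        w = np.exp(logits - logits.max())       # row-wise softmax over admissible keys only
        w /= w.sum()
        out[i] = sum(wij * V[j] for wij, (j, _) in zip(w, pairs))
    return out
```

Queries with no admissible edges are left at zero; in the model defined later, a residual connection adds such a block's output to the input states.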

To retain near-linear complexity on sparse graphs, we restrict attention edges to a union of (a) the original graph edges E (possibly symmetrized and with self-loops) and (b) node–anchor edges connecting nodes to a small set of s learned inducing tokens (anchors). Formally, we introduce anchor states U(t) ∈ ℝs × h, treat them as additional tokens, and allow complete bipartite attention between V and the anchors. This contributes O(ns) attention edges, while local node–node attention restricted to E contributes O(m) edges. Consequently, one layer of sparse attention can be implemented in time and memory Õ((m + ns)h), matching the sparse-graph regime when s = O(1) (or O(log n)). The role of anchors is to implement global communication in constant depth: node-to-anchor attention aggregates graph-wide statistics into U(t), and anchor-to-node attention broadcasts this global context back to all nodes, after which an E-restricted mixing step can combine the broadcasted signal with local structure.

These preliminaries fix the symmetry constraints and baseline limitations that motivate our subsequent problem formulation, where we focus on separating a canonical 1-WL-hard pair under sparse computational resources.


3. Problem formulation: graph-level separation under resource constraints; the canonical WL-hard family used for the main separation (connectivity on 2-regular graphs).

We consider the task of producing a permutation-invariant graph representation from a featureless input graph. Concretely, the input is a simple undirected graph G = (V, E) presented through its adjacency structure (equivalently, its adjacency matrix A or adjacency lists), with all node features identical constants. The output is an embedding
z(G) ∈ ℝq,
intended to support separation and classification of non-isomorphic graphs. Since node labels are arbitrary, we require invariance under relabeling: for every permutation matrix Π,
z(Π(G)) = z(G),   where   Π(G) denotes the relabeled graph with adjacency ΠAΠ⊤.
In this setting, any meaningful distinction must be derived solely from graph structure.

A central constraint of interest is computational: we target the sparse-graph regime, where m = |E| = O(n) and the model is expected to run in near-linear time in m (up to polylogarithmic factors), with commensurate memory usage. In particular, we exclude dense all-pairs interactions among nodes, since they incur Θ(n²) attention edges and would dominate runtime for large sparse graphs. Our goal is therefore to identify a graph family for which (i) purely local message passing provably fails, yet (ii) separation is achievable by a sparse architecture that augments local computation with a small amount of global communication.

We formalize this by focusing on a canonical global property that is invisible to local refinement: connectivity within a class of regular graphs. For each integer n ≥ 3, we define a pair of 2-regular graphs on the same number of vertices 2n:
Gconn := C2n,   Gdisc := Cn ⊔ Cn,
where Ck denotes the cycle graph on k nodes and ⊔ denotes disjoint union. Both graphs are 2-regular and featureless; they differ only in a global topological attribute, namely that Gconn is connected while Gdisc has two connected components of equal size. We treat the resulting decision problem as a minimal separation benchmark: any successful graph representation must capture information that is not reconstructible from bounded-radius neighborhoods alone.

This family is a standard obstruction for 1-WL and 1-WL-equivalent message passing. Indeed, 1-WL initialized from a constant coloring cannot refine any node distinctions on a regular graph: every node observes the same multiset of neighbor colors at every round, hence all nodes retain the same color indefinitely. Consequently, Gconn and Gdisc are indistinguishable by 1-WL for every n ≥ 3. By the usual correspondence between 1-WL and MPNNs with permutation-invariant neighborhood aggregation, this implies that any such MPNN operating on featureless inputs computes node embeddings that are constant across nodes at every depth when the input is 2-regular. Any permutation-invariant readout applied to constant node embeddings can depend only on |V| = 2n (and learned parameters), and therefore cannot separate Gconn from Gdisc for fixed n. Our separation target thus sits precisely at the boundary where local message passing is insufficient.

We therefore formulate the following separation objective.

For each n ≥ 3, we seek a permutation-invariant map G ↦ z(G) such that
z(Gconn) ≠ z(Gdisc),
with a single parameterization that works uniformly over all n (i.e., the architecture does not hard-code n). In particular, the model must extract a signal correlated with connectivity from a featureless, regular input where all local neighborhoods are isomorphic. We emphasize that this is not a supervised learning requirement; rather, it is an expressivity requirement: there must exist parameters (and a fixed finite depth) for which the architecture separates the two families.

We restrict attention to algorithms whose dominant operations are sparse matrix–vector products with A (or L) and sparse aggregations over a prescribed set of attention edges. The intended complexity per layer is near-linear in m on sparse graphs, allowing an additional O(ns) interactions with a small number s of auxiliary tokens (anchors). Thus, we permit attention edges of the form
Eattn = E ∪ Eanchor,   |Eanchor| = O(ns),
where s is a fixed constant or at most O(log n). This restriction enforces that global information cannot be obtained by dense node–node mixing; instead, it must be mediated through a small set of global channels.

Since the graphs are featureless, any separation mechanism must manufacture informative node-wise asymmetry from G itself while remaining well-defined on unlabeled graphs. A natural source of such information is spectral structure of A or L. At a high level, Gconn and Gdisc have markedly different low-frequency Laplacian structure: Gconn has a one-dimensional nullspace of L (connectedness), whereas Gdisc has a two-dimensional nullspace (two components). More generally, their low-eigenvalue geometry differs in a way that can be exposed by a small number d of spectral components. This motivates the use of a positional encoding matrix P = PEd(G) ∈ ℝn × d derived from G and computable by iterative sparse linear-algebra routines in time Õ(md) for fixed d.
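This low-frequency difference is easy to check numerically; the sketch below (numpy, dense linear algebra since the graphs are tiny, whereas the sparse regime would use an iterative eigensolver) reports a one-dimensional Laplacian nullspace for C2n and a two-dimensional one for Cn ⊔ Cn.

```python
import numpy as np

def cycle_adjacency(k):
    A = np.zeros((k, k))
    for i in range(k):
        A[i, (i + 1) % k] = A[(i + 1) % k, i] = 1.0
    return A

def laplacian_eigenvalues(A):
    L = np.diag(A.sum(axis=1)) - A
    return np.linalg.eigvalsh(L)                 # ascending eigenvalues of L = D - A

n = 8
A_conn = cycle_adjacency(2 * n)                                  # C_{2n}
A_disc = np.block([[cycle_adjacency(n), np.zeros((n, n))],       # C_n disjoint C_n
                   [np.zeros((n, n)), cycle_adjacency(n)]])
for name, A in [("C_2n", A_conn), ("C_n + C_n", A_disc)]:
    ev = laplacian_eigenvalues(A)
    print(name, "nullspace dim:", int(np.sum(ev < 1e-8)), "lowest eigenvalues:", np.round(ev[:4], 4))
```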

However, spectral features introduce an additional symmetry subtlety: eigenvectors are not uniquely defined in degenerate eigenspaces, and even in simple cases they are defined only up to sign. Accordingly, we require that the positional encoding be permutation equivariant up to an orthogonal gauge, and that the downstream architecture use P only through orthogonally invariant combinations (e.g., inner products ⟨pu, pv⟩), so that the overall model remains a well-defined function on unlabeled graphs. This requirement will be enforced explicitly in our model definition.

In summary, the problem we study is to separate the WL-hard pair (C2n, Cn ⊔ Cn) under permutation invariance and near-linear sparse computation. The negative result for 1-WL-equivalent MPNNs motivates augmenting local sparse mixing with (i) a globally aggregating mechanism of size O(ns) and (ii) a stable, permutation-equivariant positional signal derived from the graph itself. In the next section we define a concrete architecture—a sparse anchor graph transformer—that meets these constraints and achieves the desired separation.


4. Model definition (SAGT): (i) sparse anchor attention pattern; (ii) stable spectral positional encoding used through orthogonally invariant interactions; (iii) invariant readout.

We now define a permutation-invariant architecture that operates in near-linear time on sparse graphs while permitting a controlled form of global communication. The model combines (i) local mixing restricted to the input edges E, (ii) sparse global mixing mediated by a constant number s of inducing tokens, and (iii) a stable positional encoding derived from the spectrum of G and used only through orthogonally invariant interactions.

Let L denote a fixed choice of Laplacian (combinatorial L = D − A or normalized, the choice being immaterial for the invariance discussion). For a prescribed dimension d, we compute a matrix
P = PEd(G) ∈ ℝn × d,   pv := Pv, :,
intended to capture low-frequency geometry. Concretely, one may take P to be a collection of d eigenvectors corresponding to the d smallest nontrivial eigenvalues of L (or a stable relaxation thereof, e.g. a filtered Krylov basis). Since eigenspaces may have multiplicity, P is not unique; accordingly, we impose the standard assumption: for any permutation matrix Π,
PEd(ΠAΠ⊤) = Π PEd(A) R
for some R ∈ O(d) depending on Π and A (and on the chosen numerical procedure within degenerate subspaces). In sparse graphs, P can be approximated in time Õ(md) via Lanczos or power iterations, since these require only matrix–vector products with A or L.

To ensure well-definedness under this orthogonal ambiguity, the model uses P only through O(d)-invariant quantities. The canonical choice is the inner product kernel
K(u, v) := ⟨pu, pv⟩,
which is invariant under P ↦ PR for any orthogonal R ∈ O(d). More generally, any function of the Gram matrix PP⊤ is gauge-invariant. We will use K(u, v) as an additive bias in attention logits.
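The following sketch (numpy, dense for a small cycle; helper names are ours) illustrates this gauge invariance: replacing P by PR for a random orthogonal R ∈ O(d) leaves every inner product ⟨pu, pv⟩ on the edge set unchanged.

```python
import numpy as np

def spectral_pe(A, d):
    """Bottom-d nontrivial Laplacian eigenvectors (dense; a sparse solver would be used at scale)."""
    L = np.diag(A.sum(axis=1)) - A
    _, vecs = np.linalg.eigh(L)                 # columns in ascending eigenvalue order
    return vecs[:, 1:d + 1]                     # skip the constant (trivial) eigenvector

def edge_inner_products(P, edges):
    return np.array([P[u] @ P[v] for u, v in edges])

k = 12
A = np.zeros((k, k))
for i in range(k):
    A[i, (i + 1) % k] = A[(i + 1) % k, i] = 1.0  # a cycle, for concreteness
edges = [(i, (i + 1) % k) for i in range(k)]

P = spectral_pe(A, d=4)
R, _ = np.linalg.qr(np.random.default_rng(0).standard_normal((4, 4)))   # random orthogonal gauge
print(np.allclose(edge_inner_products(P, edges), edge_inner_products(P @ R, edges)))  # True
```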

We maintain node states Ht ∈ ℝn × h and anchor states Ut ∈ ℝs × h for layers t = 0, 1, …, Llayers. The initialization is featureless at the node level:
Hv0 ≡ c ∈ ℝh  for all v ∈ V,
for a learned constant vector c, and anchors are learned inducing tokens U0 ∈ ℝs × h. Anchors are not permuted under node relabelings; they serve as a fixed global workspace.

Given query vectors xi indexed by i ∈ ℐ and key/value vectors yj indexed by j ∈ 𝒥, and a sparse set of admissible edges 𝒮 ⊆ ℐ × 𝒥, we define single-head attention by
$$ \mathrm{Attn}(X, Y;\, \mathcal{S}, b)_i \;=\; \sum_{j:\,(i,j)\in\mathcal{S}} \alpha_{ij}\, W_V y_j, \qquad \alpha_{ij} \;=\; \frac{\exp\bigl(\langle W_Q x_i, W_K y_j\rangle + b_{ij}\bigr)}{\sum_{k:\,(i,k)\in\mathcal{S}} \exp\bigl(\langle W_Q x_i, W_K y_k\rangle + b_{ik}\bigr)}, $$
with learnable WQ, WK, WV ∈ ℝh × h. Multihead attention is obtained by running the map above in parallel heads and concatenating, as usual; we omit this notationally since it does not affect equivariance arguments.

Each layer performs three mixing steps.

Anchors query all nodes over the complete bipartite set
𝒮U ← H = [s] × V,
with no positional bias (or, optionally, a gauge-invariant bias depending on ⟨ua, pv⟩ if anchors are given their own learned positions). We set
Ut := Ut − 1 + Attn(Ut − 1, Ht − 1; 𝒮U ← H, b ≡ 0),
followed by a positionwise feed-forward network (FFN) and layer normalization if desired.

Nodes query anchors over
𝒮H ← U = V × [s],
again with b ≡ 0, yielding an intermediate state
H̃t := Ht − 1 + Attn(Ht − 1, Ut; 𝒮H ← U, b ≡ 0).
These two steps allow every node to exchange information through s global channels at cost O(ns) per layer.

Finally, nodes attend to their graph neighbors over the sparse set
𝒮local = {(u, v) : (u, v) ∈ E} ∪ {(v, v) : v ∈ V},
where we include self-loops for stability. The attention bias is computed by a learnable function ϕ applied to gauge-invariant and structural indicators:
buv = ϕ(⟨pu, pv⟩, 1[(u, v) ∈ E]),   (u, v) ∈ 𝒮local.
We then set
$$ H^t \;:=\; \widetilde H^t \;+\; \mathrm{Attn}\bigl(\widetilde H^t, \widetilde H^t;\, \mathcal{S}_{\mathrm{local}},\, b \text{ as defined above}\bigr), $$
followed by an FFN. The restriction to 𝒮local ensures that this step costs Õ(mh) per layer on sparse graphs.
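To make the data flow of steps (1)–(3) concrete, here is a minimal single-head numpy sketch of one SAGT layer under simplifying assumptions (no FFN or layer normalization, unit attention scaling, hypothetical parameter names); it is meant to show the structure of the layer, not an optimized or definitive implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sagt_layer(H, U, edges, P, params, phi):
    """H: (n, h) node states, U: (s, h) anchor states, edges: undirected edge list,
    P: (n, d) spectral PE, params: dict of (h, h) weight matrices, phi: scalar bias map."""
    n, s = H.shape[0], U.shape[0]
    # (1) node -> anchor: each anchor attends over all n nodes (cost O(n s h)).
    U_new = U.copy()
    for a in range(s):
        logits = np.array([(U[a] @ params["Wq_a"]) @ (params["Wk_a"] @ H[v]) for v in range(n)])
        w = softmax(logits)
        U_new[a] = U[a] + w @ (H @ params["Wv_a"].T)
    # (2) anchor -> node: each node attends over the s anchors (cost O(n s h)).
    H_mid = H.copy()
    for v in range(n):
        logits = np.array([(H[v] @ params["Wq_n"]) @ (params["Wk_n"] @ U_new[a]) for a in range(s)])
        w = softmax(logits)
        H_mid[v] = H[v] + w @ (U_new @ params["Wv_n"].T)
    # (3) local edge-restricted attention with the gauge-invariant PE bias (cost O(m h)).
    adj = {v: [v] for v in range(n)}            # self-loops for stability
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    H_out = H_mid.copy()
    for v in range(n):
        nb = adj[v]
        # phi(<p_v, p_u>): the edge-indicator argument is omitted since all pairs here are edges/self-loops
        logits = np.array([(H_mid[v] @ params["Wq_l"]) @ (params["Wk_l"] @ H_mid[u]) + phi(P[v] @ P[u])
                           for u in nb])
        w = softmax(logits)
        H_out[v] = H_mid[v] + w @ (H_mid[nb] @ params["Wv_l"].T)
    return H_out, U_new
```

Here phi is any scalar function of the gauge-invariant inner product; the anchor blocks carry no positional bias, matching the b ≡ 0 choice in steps (1)–(2).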

Let Π be a permutation matrix acting on node indices. By construction, the anchor set is fixed, while node states transform as Ht ↦ ΠHt. In steps (1)–(2), the bipartite attention is equivariant because the softmax normalization in the attention map is taken row-wise over the permuted node set, and the summation is over the same multiset of keys/values. In step (3), the admissible set 𝒮local is exactly the edge set (plus self-loops), hence it permutes to itself under relabeling, and the bias term is unchanged because ⟨pu, pv⟩ is invariant under the gauge transformation P ↦ PR implied by the equivariance assumption on PEd. Consequently, for all t, we obtain permutation equivariance of node states:
Ht(Π(G)) = ΠHt(G),
and well-definedness with respect to the orthogonal ambiguity in P.

The final graph embedding is obtained by a permutation-invariant pooling over nodes, followed by an MLP:
z(G) := MLP(∑v ∈ V HvLlayers).
Any commutative pooling (sum/mean) suffices; we use the sum for definiteness. This readout yields z(Π(G)) = z(G) by the equivariance of HLlayers.

For fixed h and s = O(1), each layer processes O(m) local attention edges and O(ns) anchor edges, giving time Õ((m + ns)h) and memory Õ((m + ns)h) per layer, plus preprocessing Õ(md) to compute P. This matches the sparse access model while enabling global information flow through the anchors and the PE-dependent local biases.


5. Lower bound for MPNNs: proof that any 1-WL-equivalent MPNN cannot distinguish C2n from Cn ⊔ Cn on featureless graphs.

We formalize the obstruction for message-passing neural networks (MPNNs) on the WL-hard pair
Gconn = C2n,   Gdisc = Cn ⊔ Cn,
when all input node features are identical. The argument is standard, but we record it in a form that makes clear that neither depth nor width can overcome the limitation, and that the conclusion is strictly stronger than merely invoking the 1-WL indistinguishability of these graphs.

Consider an MPNN with T layers maintaining hidden states hvt ∈ ℝh for each node v ∈ V and time t = 0, 1, …, T. We assume the update has the generic message-passing form
$$ h_v^{t+1} \;=\; \Psi_t\!\left(h_v^{t},\ \mathrm{AGG}_t\bigl(\{\!\{\,\Phi_t(h_v^{t}, h_u^{t}) : u\in N(v)\,\}\!\}\bigr)\right), $$
where N(v) is the (unordered) neighborhood of v, AGGt is permutation-invariant over multisets (e.g. sum/mean/max, or any DeepSets-style invariant map), and Ψt, Φt are shared across nodes. This covers essentially all 1-WL-equivalent MPNNs in the literature; the key property is that the update is local and treats neighbors as an unlabeled multiset. We further assume featureless initialization:
$$ h_v^{0} \;=\; c \qquad \text{for all } v \in V, $$
for some constant (learned or fixed) vector c.

Both Gconn and Gdisc are 2-regular on the same number of nodes, namely |V| = 2n. Applying the regular-graph collapse property from Section 2 (all node states of a featureless k-regular graph remain equal at every layer) with k = 2, we conclude that for any depth T and any parameterization of the maps in the update above, the final node embeddings satisfy
$$ \{\!\{\, h_v^{T} : v\in V(G_{\mathrm{conn}}) \,\}\!\} \;=\; \{\!\{\, h_v^{T} : v\in V(G_{\mathrm{disc}}) \,\}\!\} \;=\; \{\!\{\, \bar h^{T},\dots,\bar h^{T} \,\}\!\} \quad (2n \text{ copies of a common vector } \bar h^{T}) $$
as multisets.

Let the graph-level output be any permutation-invariant readout
$$ z(G) \;=\; \mathrm{READOUT}\bigl(\{\!\{\, h_v^{T} : v\in V \,\}\!\}\bigr), $$
where READOUT may be sum/mean pooling followed by an MLP, or any multiset-invariant map. Since both graphs produce the same multiset of 2n identical vectors by the collapse above, we obtain
z(Gconn) = z(Gdisc).
In particular, no such MPNN can separate C2n from Cn ⊔ Cn at the graph level.
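This collapse can also be observed numerically; the sketch below (numpy, randomly initialized shared weights, with a GIN-style sum aggregator standing in for the generic update) produces identical sum readouts on C2n and Cn ⊔ Cn.

```python
import numpy as np

def cycle_edges(nodes):
    k = len(nodes)
    return [(nodes[i], nodes[(i + 1) % k]) for i in range(k)]

def mpnn_readout(n_nodes, edges, weights, layers=4):
    """Sum-aggregation MPNN with shared weights, constant init, and sum readout."""
    adj = {v: [] for v in range(n_nodes)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    H = np.ones((n_nodes, weights[0].shape[0]))            # featureless constant initialization
    for t in range(layers):
        W1, W2 = weights[2 * t], weights[2 * t + 1]
        M = np.stack([sum(H[u] for u in adj[v]) for v in range(n_nodes)])
        H = np.tanh(H @ W1 + M @ W2)                       # shared update at every node
    return H.sum(axis=0)                                   # permutation-invariant readout

rng = np.random.default_rng(1)
h, layers = 8, 4
weights = [rng.standard_normal((h, h)) for _ in range(2 * layers)]
n = 6
z_conn = mpnn_readout(2 * n, cycle_edges(list(range(2 * n))), weights)
z_disc = mpnn_readout(2 * n, cycle_edges(list(range(n))) + cycle_edges(list(range(n, 2 * n))), weights)
print(np.allclose(z_conn, z_disc))   # True: identical graph-level outputs
```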

It is instructive to compare the above collapse to the 1-WL color refinement process. Starting from a constant color on all vertices, 1-WL refines colors by hashing the multiset of neighbor colors. On a 2-regular graph, every node sees the same multiset (two copies of the same color) at every iteration, so the coloring remains uniform forever, and the induced color histograms coincide between C2n and Cn ⊔ Cn. The MPNN limitation proven above is therefore consistent with, and in fact implies, the 1-WL indistinguishability of this pair in the featureless setting: local multiset aggregation cannot represent the global invariant “number of connected components” when all local neighborhoods look identical.

The proof crucially uses that the initialization contains no symmetry-breaking information (no node identifiers, no random features, and no external positional encoding). This is precisely the regime in which the separation result for our sparse-anchor transformer must operate: the additional distinguishing power must arise from stable global structure (here, spectral positional information and anchor-mediated aggregation), rather than from injecting labels by hand.


6. Upper bound/separation for SAGT: explicit construction of parameters and proof that SAGT distinguishes the pair for all n ≥ 3.

We now give an explicit parameter choice for SAGT that separates
Gconn = C2n,   Gdisc = Cn ⊔ Cn,
for every n ≥ 3, while using only sparse attention edges E ∪ Eanchor with s = O(1) anchors and using the positional encoding only through O(d)-invariant quantities.

For any graph G on N nodes, let L = 2I − A be the unnormalized Laplacian (we use 2I − A since our graphs are 2-regular, hence D = 2I). Define PE4(G) ∈ ℝN × 4 as an orthonormal basis of the direct sum of the Laplacian eigenspaces corresponding to the lowest nontrivial eigenvalues, counted with multiplicity. Concretely, for a cycle CN the spectrum of L is
$$ \lambda_k \;=\; 2-2\cos\Bigl(\frac{2\pi k}{N}\Bigr), \qquad k=0,1,\dots,N-1, $$
with λ0 = 0 and, for 1 ≤ k ≤ ⌊N/2⌋ − 1, the eigenvalue λk has multiplicity 2 (corresponding to the cosine/sine modes). Thus on CN, the space spanned by the k = 1 and k = 2 modes has dimension 4, and we take PE4(CN) to be any orthonormal basis of this 4-dimensional invariant subspace. On Cn ⊔ Cn, the corresponding invariant subspace is the direct sum of the k = 1, 2 subspaces on each component, again of total dimension 4; hence PE4(Gdisc) is well-defined up to an orthogonal gauge within this 4-space. In either case, the Gram quantities
K(u, v) := ⟨pu, pv⟩   for pv the row of PE4(G),
depend only on the orthogonal projector onto the chosen 4-dimensional invariant subspace, and are therefore invariant under the gauge transformation PE4 ↦ PE4R with R ∈ O(4).

We consider the scalar graph functional
$$ S(G) \;:=\; \frac{1}{|E|}\sum_{(u,v)\in E} \langle p_u, p_v\rangle \;=\; \frac{1}{|E|}\sum_{(u,v)\in E} K(u,v). $$
By the preceding discussion, S(G) is permutation-invariant and gauge-invariant.

We compute S(G) explicitly for our pair. First let G = CN with vertices indexed cyclically by {0, 1, …, N − 1}. The orthogonal projector onto the k = 1, 2 Fourier subspace has kernel
$$ K(i,j)\;=\;\frac{2}{N}\Bigl(\cos\Bigl(\frac{2\pi (i-j)}{N}\Bigr)+\cos\Bigl(\frac{4\pi (i-j)}{N}\Bigr)\Bigr), $$
independently of the particular orthonormal basis used. In particular, for each edge (i, i + 1) we have
$$ \langle p_i,p_{i+1}\rangle \;=\; \frac{2}{N}\Bigl(\cos\Bigl(\frac{2\pi}{N}\Bigr)+\cos\Bigl(\frac{4\pi}{N}\Bigr)\Bigr), $$
so that S(CN) equals this same value.

Applying this with N = 2n yields
$$ S(G_{\mathrm{conn}}) \;=\; \frac{2}{2n}\Bigl(\cos\Bigl(\frac{2\pi}{2n}\Bigr)+\cos\Bigl(\frac{4\pi}{2n}\Bigr)\Bigr) \;=\; \frac{1}{n}\Bigl(\cos\Bigl(\frac{\pi}{n}\Bigr)+\cos\Bigl(\frac{2\pi}{n}\Bigr)\Bigr). $$
For Gdisc = Cn ⊔ Cn, the projector (and hence K(u, v)) is block-diagonal across the two components and coincides on each component with the k = 1, 2 projector for Cn. Therefore each edge lies within a component and has
$$ \langle p_i,p_{i+1}\rangle \;=\; \frac{2}{n}\Bigl(\cos\Bigl(\frac{2\pi}{n}\Bigr)+\cos\Bigl(\frac{4\pi}{n}\Bigr)\Bigr), $$
hence
$$ S(G_{\mathrm{disc}}) \;=\; \frac{2}{n}\Bigl(\cos\Bigl(\frac{2\pi}{n}\Bigr)+\cos\Bigl(\frac{4\pi}{n}\Bigr)\Bigr). $$
Comparing the two expressions, we claim S(Gconn) ≠ S(Gdisc) for all n ≥ 3. Indeed, if they were equal then after multiplying by n we would have
$$ \cos\Bigl(\frac{\pi}{n}\Bigr)+\cos\Bigl(\frac{2\pi}{n}\Bigr) \;=\; 2\cos\Bigl(\frac{2\pi}{n}\Bigr)+2\cos\Bigl(\frac{4\pi}{n}\Bigr), $$
i.e.,
$$ \cos\Bigl(\frac{\pi}{n}\Bigr)-\cos\Bigl(\frac{2\pi}{n}\Bigr) \;=\; 2\cos\Bigl(\frac{4\pi}{n}\Bigr). $$
For n ≥ 3 this equality never holds. The left-hand side equals 2 sin(3π/(2n)) sin(π/(2n)), which is positive (since π/n < 2π/n < π and cos is strictly decreasing on (0, π)) and at most 3π²/(2n²). For 3 ≤ n ≤ 8 we have 4π/n ≥ π/2, so the right-hand side 2cos(4π/n) is nonpositive and cannot equal the positive left-hand side; for n ≥ 9 we have 2cos(4π/n) ≥ 2cos(4π/9) > 0.34, while the left-hand side is at most 3π²/162 < 0.19, so the two sides again differ. In particular, at n = 3 we have S(C6) = 0 while S(C3 ⊔ C3) = −2/3. Thus S(C2n) ≠ S(Cn ⊔ Cn) for every n ≥ 3, and S separates the pair uniformly in n.
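The closed-form values above are easy to verify numerically; the sketch below (numpy; it assumes n ≥ 5 so that the k = 1 and k = 2 modes of each cycle are distinct) builds P from explicit cosine/sine Fourier modes and averages ⟨pu, pv⟩ over edges.

```python
import numpy as np

def fourier_modes(N, ks=(1, 2)):
    """Orthonormal cosine/sine Laplacian eigenvectors of C_N for the listed modes k."""
    i = np.arange(N)
    cols = []
    for k in ks:
        cols.append(np.cos(2 * np.pi * k * i / N))
        cols.append(np.sin(2 * np.pi * k * i / N))
    P = np.stack(cols, axis=1)
    return P / np.linalg.norm(P, axis=0)          # each column normalized to unit norm

def S_cycle(N):
    """Average of <p_i, p_{i+1}> over the edges of C_N."""
    P = fourier_modes(N)
    return np.mean([P[i] @ P[(i + 1) % N] for i in range(N)])

n = 7
S_conn = S_cycle(2 * n)   # edges of C_{2n}
S_disc = S_cycle(n)       # every edge of C_n disjoint C_n lies inside a component with this value
print(np.isclose(S_conn, (np.cos(np.pi / n) + np.cos(2 * np.pi / n)) / n))          # True
print(np.isclose(S_disc, 2 * (np.cos(2 * np.pi / n) + np.cos(4 * np.pi / n)) / n))  # True
print(S_conn != S_disc)   # separation for this n
```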

We show that SAGT can compute S(G) with constant width and one anchor. We take hidden width h = 1, number of anchors s = 1, and Llayers = 2. We interpret the local “edge attention with PE” step as an edge-aware attention block that may inject the scalar ϕ(⟨pu, pv⟩, 1[(u, v) ∈ E]) into the value channel along (u, v) ∈ E (this is a standard graph-transformer variant and preserves the same sparsity and asymptotic cost).

Initialize node states by Hv0 = 0 and the single anchor state by a fixed learned scalar U0 ≠ 0. In the first layer, we perform only the local edge block and set the attention logits to be constant on each node’s neighborhood (e.g. by using zero queries/keys), so that each node averages its incident edge-values. Choose the edge-value injection to be exactly ϕ(⟨pv, pu⟩, 1) = ⟨pv, pu⟩, and set the usual node-value term to zero. Then the output at node v after this block is
$$ H^1_v \;=\; \frac{1}{\deg(v)}\sum_{u\in N(v)} \langle p_v,p_u\rangle. $$
On both Gconn and Gdisc, deg(v) = 2, hence ∑v Hv1 = ∑(u, v) ∈ E ⟨pu, pv⟩ = |E| · S(G): each edge contributes to both of its endpoints, which cancels the factor 1/deg(v) = 1/2.

In the second layer, we use the node-to-anchor attention with constant logits so that the anchor computes the mean of {Hv1}v ∈ V. Concretely, we set the anchor query/key projections so that all nodes receive equal weight, and we set the anchor value projection to be the identity. Then the anchor state U2 equals $\frac{1}{|V|}\sum_v H^1_v$ up to the additive residual constant, and $\frac{1}{|V|}\sum_v H^1_v = \frac{|E|}{|V|}\,S(G) = S(G)$ since |E| = |V| on 2-regular graphs; hence U2 is an affine function of S(G) with unit slope. Finally, we let READOUT output this anchor scalar. Since S(Gconn) ≠ S(Gdisc), we obtain distinct graph embeddings, proving separation.

All computations are permutation-equivariant at the node level and invariant at the graph level, and the only dependence on PE4(G) is through inner products ⟨pu, pv⟩, which are gauge-invariant. The resource bound follows from sparsity: computing the required inner products along edges costs O(md) with d = 4, and each layer costs Õ((m + ns)h) with h = 1 and s = 1.


7. Necessity and ablations as theorems: (a) without PE, SAGT collapses on the pair; (b) without anchor channel, bounded-hop variants fail for large n (resource-aware necessity).

We record two ablations showing that the two ingredients used above—a globally informative positional encoding and a sparse global mixing channel—are not cosmetic in the present resource regime.

Let SAGT¬PE denote the architecture obtained by deleting positional encodings from all attention logits/values, i.e., forcing the bias term ϕ(⟨pu, pv⟩, 1[(u, v) ∈ E]) to be independent of pu, pv (equivalently, setting pv ≡ 0).

The corresponding necessity statement is that anchors alone do not break the symmetry of featureless regular graphs: for every n ≥ 3 and every parameterization, SAGT¬PE maintains identical node states across all vertices of Gconn and Gdisc and hence produces identical graph embeddings. Without a positional signal that is not itself constant on vertex-transitive inputs, the model remains blind to the connected/disconnected distinction in this WL-hard pair.

We next isolate the role of the anchor channel as a constant-depth mechanism for global mixing. To state a clean necessity result, we consider a broad “bounded-hop” class capturing architectures whose per-node computation depends only on a bounded-radius neighborhood in the input graph. This includes standard graph transformers with attention restricted to E (no node–anchor edges) and any message-passing architecture of fixed depth, and also covers the common ablation in which the “positional encoding” is replaced by an r-local structural feature (e.g. random-walk features of length ≤ r, counts of length-≤ r walks, or any function of the rooted r-hop neighborhood).

Formally, fix integers r, L ≥ 0. We say that an architecture is (r, L)-bounded-hop if: (i) initial node features are r-local isomorphism invariants (in particular, constant features correspond to r = 0), and (ii) each of the L layers updates node states using only information from graph neighbors (attention edges restricted to E), so that the final state at a node v is a function only of the rooted (r + L)-hop neighborhood around v.

This yields the promised resource-aware statement: without the anchor channel (or some other non-local conduit), constant-depth architectures that only access bounded-hop information cannot separate the pair once n exceeds the receptive field; concretely, once n > 2(r + L) + 1, every rooted (r + L)-hop neighborhood in either graph is a path on 2(r + L) + 1 featureless nodes, so an (r, L)-bounded-hop architecture assigns the same state everywhere. In contrast, adding s = O(1) inducing anchors creates a diameter-O(1) communication path (node → anchor → node) using only O(ns) extra sparse edges, allowing constant-depth global aggregation while preserving near-linear per-layer cost.


8. Complexity and stability analysis: time/memory bounds; PE computation cost and approximations; stability with respect to eigenvector gauge and small perturbations.

We record the resource bounds of a forward pass of SAGT when attention is restricted to the sparse edge set E ∪ Eanchor with |Eanchor| = O(ns).
Fix hidden width h, number of heads a constant, and let L denote the number of layers.
In each layer we perform three sparse attention blocks: node→anchor (complete bipartite), anchor→node (complete bipartite), and node→node (restricted to E).
For node→anchor attention, we compute logits between s queries and n keys, apply a row-wise softmax over n keys for each anchor, and aggregate values; this costs Õ(nsh) time and Õ((n + s)h) memory for activations (plus O(ns) if we materialize the attention weights explicitly, which can be avoided by streaming).
The same bound holds for anchor→node attention.
For the local node→node block, we restrict to edges (u, v) ∈ E (and, if desired, include self-loops); computing logits, softmax normalization over each node’s adjacency list, and weighted sums costs Õ(mh) time and Õ(mh) memory in the usual sparse attention implementation.
Hence, for fixed h and s = O(1), the total per-layer complexity is
Õ((m + ns)h) time,   Õ((m + ns)h) memory,
and a full L-layer pass is Õ(L(m + ns)h) time with Õ((m + ns)h) working memory, in agreement with the resource regime stated in the problem formulation.
We emphasize that no n × n attention matrix is formed: the only global edges are the O(ns) node–anchor links.

At inference time, positional encodings enter the model only through evaluations of an orthogonally invariant function of the rows, e.g.,
K(u, v) = ⟨pu, pv⟩ = (PEd(G) PEd(G)⊤)uv,
and these evaluations are needed only on (u, v) ∈ E (or on whatever sparse set of logits we bias).
Consequently, even though K(⋅, ⋅) is a dense kernel in principle, we never materialize it: for each queried pair we compute ⟨pu, pv⟩ in O(d) time.
Thus the incremental per-layer overhead for PE-biased local attention is Õ(md) time (and Õ(nd) memory to store P = PEd(G)), which is negligible for fixed d relative to the Õ(mh) attention cost when h ≫ d and m = Θ(n).
The node–anchor blocks can similarly incorporate pv-dependent biases at cost Õ(nsd) if desired; in the algorithmic template above we set the anchor blocks’ bias to 0 to isolate the role of anchors as a mixing channel, but allowing PE-bias there does not change the asymptotics.

When PEd(G) is taken to be the matrix of the bottom d nontrivial eigenvectors of a Laplacian (combinatorial or normalized), it is standard to compute an approximate basis using Krylov subspace methods (Lanczos) requiring only matrix–vector products with L.
On a sparse graph, each product x ↦ Lx costs O(m) time (and O(n) additional time to apply diagonal degree terms), and each Lanczos iteration stores a constant number of n-vectors.
To obtain d extremal eigenvectors to accuracy ε one typically performs T = Õ(d log(1/ε)) iterations under mild spectral assumptions, yielding preprocessing time
Õ(md)  and memory  Õ(nd),
up to polylogarithmic factors, consistent with the stated Õ(md) access model.
If L is normalized, the required products involve D−1/2AD−1/2 and hence still admit O(m) sparse multiplication.
In settings where exact eigensolvers are numerically unstable due to near-degeneracy (as occurs for highly symmetric graphs), one may instead compute a stable spectral basis (which is precisely the object controlled by perturbation theory) and then rely on the model’s orthogonal gauge invariance as discussed next.

For spectral PEs, the ambiguity is not a defect but a symmetry: if P = PEd(G) ∈ ℝn × d spans a d-dimensional invariant subspace, then any PR with R ∈ O(d) is an equally valid choice, and in degenerate eigenspaces the output of numerical routines may vary arbitrarily within this rotation.
Our architecture is designed so that the PE enters only through functions invariant under P ↦ PR, such as PP⊤ or pairwise inner products ⟨pu, pv⟩.
Therefore the induced attention bias is gauge-invariant:
⟨(PR)u, (PR)v⟩ = ⟨pu, pv⟩,
and the model is well-defined even when eigenvectors are not uniquely determined.
The same observation addresses the common sign ambiguity of eigenvectors (a special case of R ∈ O(d)).
This invariance is also the mechanism by which permutation equivariance of P up to gauge, namely PEd(ΠAΠ⊤) = ΠPR, yields permutation equivariance of all PE-derived biases on sparse edges.

Two stability notions are relevant: (i) stability of the computed positional encoding, and (ii) stability of the downstream transformer computation with respect to perturbations in the bias.
For (i), assume P has orthonormal columns and let P̃ be an approximation satisfying
minR ∈ O(d) ‖P̃ − PR‖F ≤ ε.
Then the Gram matrices obey
‖P̃P̃⊤ − PP⊤‖2 ≤ 2ε + ε²,
so every biased logit term ⟨p̃u, p̃v⟩ differs from ⟨pu, pv⟩ by at most O(ε) uniformly over pairs; a short derivation is given after this paragraph.
For (ii), attention is a smooth map of its logits: writing the row-wise softmax as σ(⋅), one has a uniform bound on the Jacobian operator norm of σ (e.g. in the 2-norm, ‖Dσ‖2 ≤ 1/2), hence small perturbations of the logits yield proportionally small perturbations of attention weights.
Composing with linear value aggregation and the feed-forward blocks shows that, for fixed layer widths and bounded weights, the node/anchor states are Lipschitz in the PE-derived bias on the sparse edge set.
Finally, to relate changes in G to changes in P, we appeal to standard subspace perturbation results (Davis–Kahan): if L is perturbed to L + ΔL and the relevant d-dimensional eigenspace is separated by a spectral gap γ > 0 from the rest of the spectrum, then the principal angle between the subspaces is O(‖ΔL‖2/γ), and consequently the gauge-aligned error ε above scales as O(‖ΔL‖2/γ).
Thus, in regimes where the chosen spectral components are well-separated, both the PE and the resulting SAGT computation are stable under small graph perturbations and under approximate eigen-computation, while remaining well-defined under exact degeneracy by virtue of gauge invariance.
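As referenced above, here is the short derivation of the Gram bound in (i), under the stated assumptions (orthonormal columns, so ‖P‖2 = 1, and P̃ = PR + E with ‖E‖F ≤ ε after optimal gauge alignment):
$$ \widetilde P \widetilde P^{\top} - P P^{\top} \;=\; (PR + E)(PR + E)^{\top} - P P^{\top} \;=\; P R E^{\top} + E R^{\top} P^{\top} + E E^{\top}, $$
$$ \bigl\|\widetilde P \widetilde P^{\top} - P P^{\top}\bigr\|_{2} \;\le\; 2\,\|P\|_{2}\,\|E\|_{2} + \|E\|_{2}^{2} \;\le\; 2\varepsilon + \varepsilon^{2}, $$
using RR⊤ = I and ‖E‖2 ≤ ‖E‖F ≤ ε; the logit-level statement follows because each difference ⟨p̃u, p̃v⟩ − ⟨pu, pv⟩ is an entry of this matrix, and every entry is bounded in magnitude by the spectral norm.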


9. (Optional but recommended) Experiments: synthetic WL-hard benchmarks; scaling curves; ablations verifying PE and anchor channel necessity; comparisons to GIN, k-hop GNNs, PEG-only transformers, and spectral baselines.

While our separation argument is constructive and does not rely on empirical evidence, it is natural to validate in controlled settings that the proposed ingredients (spectral PE used only through gauge-invariant statistics, together with an anchor-based global mixing channel) are not merely sufficient in principle but also effective under standard training.


We consider binary classification on featureless 2-regular graphs, with positive examples Gconn = C2n and negative examples Gdisc = Cn ⊔ Cn, for multiple values of n.
All node features are identical constants, and we optionally randomize node orderings (apply a uniformly random permutation Π) per epoch to verify permutation invariance in practice.
Training and testing can be stratified by size in two regimes: (i) where train/test sizes coincide (fixed n), and (ii) where training uses n ∈ [nmin, nmax] and testing uses larger n (e.g. nmax up to an order of magnitude larger).
The latter setting is informative because the two graph families have a uniform local structure for all n, and hence any success must arise from nonlocal aggregation of PE-dependent structure rather than memorization of finite-size artifacts.
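A hypothetical generator for this benchmark (plain numpy; all names are ours) could look as follows: it emits an edge list, constant node features, and a connectivity label, applying a random relabeling to each example so that permutation invariance can be checked directly.

```python
import numpy as np

def cycle_edges(nodes):
    k = len(nodes)
    return [(nodes[i], nodes[(i + 1) % k]) for i in range(k)]

def make_example(n, connected, rng):
    """One benchmark instance: C_{2n} (label 1) or C_n disjoint C_n (label 0)."""
    if connected:
        edges = cycle_edges(list(range(2 * n)))
    else:
        edges = cycle_edges(list(range(n))) + cycle_edges(list(range(n, 2 * n)))
    perm = rng.permutation(2 * n)                         # random node relabeling
    edges = [(int(perm[u]), int(perm[v])) for u, v in edges]
    x = np.ones((2 * n, 1))                               # featureless constant input
    return edges, x, int(connected)

rng = np.random.default_rng(0)
dataset = [make_example(int(n), bool(rng.random() < 0.5), rng)
           for n in rng.integers(3, 50, size=256)]
```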


We compare SAGT against standard 1-WL-equivalent MPNNs (e.g. GCN/GraphSAGE/GIN with sum/mean aggregation) with global pooling, and against straightforward extensions that attempt to increase receptive field (e.g. k-hop message passing, Jumping Knowledge, or increasing depth with residual connections).
We additionally include: (a) a “PE-only transformer” that has access to PEd(G) at nodes but uses only local attention edges E (no anchors), (b) an “anchor-only” variant where local edge attention is removed but node–anchor attention is retained, and (c) spectral baselines that operate on graph-level summaries such as the smallest d nontrivial Laplacian eigenvalues (sorted), or low-order spectral moments tr(Lr) for small r.
The spectral baselines are included because they can distinguish C2n from Cn ⊔ Cn when n is known or encoded, but they lack the permutation-equivariant node representations that our architecture maintains and they do not satisfy the same computational access pattern when generalized beyond low-dimensional summaries.


To empirically support the logical separation proof, we test the following ablations, keeping total width and depth comparable.
(i) No PE: set ϕ ≡ 0 (or equivalently ignore P). On featureless regular graphs this reduces to a permutation-equivariant computation with identical node states at every layer, and we expect accuracy near chance (≈ 50%) regardless of training time.
(ii) No anchors: set s = 0 so that all computation is restricted to E. Even with PE-biased local attention, a purely local model can fail to aggregate global statistics robustly at shallow depth; we expect that separating C2n from Cn ⊔ Cn will degrade sharply as n grows unless depth scales with diameter.
(iii) Gauge-sensitive PE: feed pv directly as a vector feature without restricting usage to O(d)-invariant functions (or use a fixed but arbitrary eigenvector orientation). On highly symmetric graphs this can introduce instability across runs and across numerical eigensolvers; empirically, this typically manifests as high variance and poor size generalization, consistent with the gauge ambiguity discussed earlier.


We report classification accuracy and calibration (e.g. cross-entropy) as a function of n, along with runtime and peak memory per forward pass.
For runtime scaling, we sweep n over a wide range (e.g. 10³ to 10⁶ for sparse cycles) and verify that the measured wall-clock time per layer scales linearly with m plus the ns term from anchors, in agreement with the Õ((m + ns)h) profile.
We also report sensitivity to PE dimension d and number of anchors s, expecting a small d (e.g. d ∈ {4, 8, 16}) and constant s (e.g. s ∈ {1, 2, 4, 8}) to suffice for this benchmark, while larger d primarily affects preprocessing time through Õ(md).


On the WL-hard benchmark, SAGT should attain near-perfect accuracy with modest depth (e.g. L ≤ 4) and constant anchors, and should maintain performance under permutation randomization and across eigensolver runs (owing to gauge-invariant PE usage).
In contrast, 1-WL-equivalent MPNNs with constant initial features should remain at chance, matching the lower bound.
The PE-only local transformer is expected to perform well only when depth is allowed to grow with n (or when attention edges are densified, violating the sparse regime), highlighting the role of the anchor channel as an efficient global mixer.
Spectral graph-level baselines may separate the families when eigenvalues are computed accurately, but they do not provide a unified node-level architecture and their applicability to more general tasks is limited; we include them chiefly to confirm that the signal exploited by SAGT is genuinely spectral and not an implementation artifact.

These experiments do not add logical force to the theorem, but they serve to corroborate that the proposed architecture achieves the intended separation in the stated resource regime, and that the separation degrades precisely when the PE signal or the anchor-mediated global mixing is removed.


10. Discussion and limitations: scope of separation; fairness of preprocessing assumptions; extensions to other global properties and stronger WL-hard families.

Our separation result should be read as an existence statement about a concrete architectural motif (sparse anchors plus gauge-invariant spectral bias) rather than as a characterization of the full expressive power of transformers on graphs.
In particular, the pair (Gconn, Gdisc) = (C2n, Cn ⊔ Cn) is chosen to isolate a single obstruction: on featureless regular graphs, any 1-WL-equivalent MPNN maintains identical node states at every depth and therefore cannot express connectivity.
We do not claim that the proposed positional encoding and attention pattern is necessary for separating these graphs, nor that it suffices for all pairs that are hard for 1-WL (or even for 2-WL).
What the construction demonstrates is that one can separate a canonical WL-hard family while preserving a near-linear sparse access pattern and strict permutation invariance at the graph level.


The distinguishing information exploited by SAGT is fundamentally spectral and low-frequency: the positional encodings PEd(G) are derived from the Laplacian eigenspaces, and the model uses them only through O(d)-invariant interactions such as ⟨pu, pv⟩.
For cycles, the low-frequency Laplacian eigenvectors embed nodes approximately on a one-dimensional manifold, and the disconnected union yields two such manifolds with no cross-component mixing; this induces a global statistic that can be aggregated efficiently through anchors.
This mechanism extends immediately to separating many pairs that differ by a macroscopic global property detectable by low-frequency spectrum (e.g. number of components, coarse expansion, or large-scale bottlenecks), provided the chosen d captures the relevant eigenspaces and the downstream attention/readout extracts an appropriate invariant summary.
Conversely, the mechanism is limited on pairs that are cospectral (or nearly so) with respect to the Laplacian, or more generally on pairs for which the multiset of invariants made available to the model (e.g. pairwise inner products {{⟨pu, pv⟩}} restricted to edges and anchor edges) agrees.
Since low-dimensional spectral information is not a complete isomorphism invariant, we should not expect a fixed d and a fixed family of invariant functions to separate all nonisomorphic graphs.


A natural concern is whether the preprocessing PEd(G) implicitly solves a problem as hard as graph isomorphism, thereby trivializing the separation.
We avoid this in two ways.
First, the positional encoding is required to be computable from sparse matrix–vector products with A or L in Õ(md) time for small d, which is the same oracle access used by many scalable spectral routines.
Second, we require permutation equivariance up to an orthogonal gauge: for Π a permutation matrix, PEd(ΠAΠ⊤) = Π PEd(A) R for some R ∈ O(d), and the model only consumes P through O(d)-invariant functions.
This prevents the preprocessing from smuggling in an arbitrary canonical labeling (which would indeed be too strong) and forces the pipeline to respect symmetries and eigenspace degeneracies.

That said, there are genuine algorithmic subtleties.
Approximating low-frequency eigenvectors can be numerically delicate when spectral gaps are small; cycles are an extreme case where the gap scales as Θ(n⁻²).
If one insists on accurate eigenvectors in the classical sense, naive iterative methods may require many iterations, and thus the informal Õ(md) claim hides a dependence on conditioning and the desired approximation tolerance.
For the specific separation considered here, however, one typically does not require high-precision eigenvectors: coarse low-frequency structure (or even a stable proxy thereof) suffices to induce different global statistics for C2n versus Cn ⊔ Cn.
This motivates two refinements that remain within the stated access pattern: (i) using diffusion-based encodings (e.g. heat kernel features exp (−tL)ev or random-walk landing probabilities) approximated by truncated polynomials, which can be computed in O(mk) time for truncation degree k and are stable under small perturbations; and (ii) using multiple time scales t (or multiple polynomial degrees) to mitigate sensitivity to any single eigengap.
Such alternatives preserve permutation equivariance and can be consumed by the same gauge-invariant inner-product biases.
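As an illustration of alternative (i), the sketch below (numpy, dense for small graphs and assuming no isolated nodes; at scale the same recursion would use sparse matrix–vector products and stochastic diagonal estimation) computes random-walk return probabilities as a diffusion-style positional signal; unlike eigenvector-based PEs, these are per-node scalars with no orthogonal gauge ambiguity.

```python
import numpy as np

def random_walk_pe(A, k=8):
    """Return probabilities p_v[r] = [(D^{-1} A)^r]_{vv} for r = 1..k (assumes no isolated nodes)."""
    deg = A.sum(axis=1)
    T = A / deg[:, None]                  # row-stochastic random-walk operator D^{-1} A
    feats, M = [], np.eye(A.shape[0])
    for _ in range(k):
        M = M @ T                         # walk one more step; sparse products on large graphs
        feats.append(np.diag(M))          # landing-back probabilities after r steps
    return np.stack(feats, axis=1)        # (n, k) permutation-equivariant node features
```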


Our global mixing channel uses s = O(1) inducing anchors with O(ns) additional edges.
This suffices for the exhibited separation because the model needs only to aggregate a constant-dimensional summary of a global pattern.
For more complex global properties, constant s may be information-bottlenecked: anchors behave like a low-rank communication channel between distant parts of the graph.
Allowing s = O(log n) or mild growth with n can enlarge this channel while preserving near-linear complexity, but at the cost of more memory and more attention edges.
A separate limitation is that anchors do not remove the need for local computation; rather, they complement it.
In tasks where the discriminative signal depends on high-resolution local motifs and their global arrangement, one expects a nontrivial interaction between the number of layers Llayers, the anchor budget s, and the choice of PE-derived biases.


The cycle-versus-disjoint-cycles pair isolates a 1-WL limitation tied to regularity and featurelessness.
Stronger lower bounds arise from families such as strongly regular graphs, CFI constructions, or pairs separated only by k-WL for k ≥ 2.
Spectral encodings alone will not uniformly resolve such cases, especially for cospectral constructions.
Nonetheless, the architectural principle remains applicable: any permutation-equivariant node-level signal P that breaks symmetries in a stable, gauge-invariant manner can be injected into sparse attention to yield global summaries inaccessible to pure message passing.
One direction is to use higher-order or multiscale positional information (e.g. encodings derived from powers of A, from effective resistance metrics, or from learned graph filters) while maintaining invariance by restricting the model to functions of Gram matrices or other orthogonally invariant statistics.
Another direction is to combine anchor-based global mixing with limited higher-order message passing (e.g. edge- or tuple-based states) to approach the expressivity of k-WL while keeping sparsity constraints explicit.

In summary, the separation should be understood as evidence that (i) global mixing at near-linear cost and (ii) carefully invariant consumption of positional information together suffice to transcend the 1-WL barrier on a canonical hard instance.
The principal limitations are those inherent to low-dimensional spectral information, to eigengap-dependent preprocessing accuracy, and to the low-rank nature of constant-size anchor channels.