Message-passing neural networks (MPNNs) remain the default formalism for learning on sparse graphs because they respect permutation symmetries and admit near-linear implementations in the number of edges. Their core mechanism is local aggregation: at each layer, every node updates its representation by applying a shared function to its previous state and a permutation-invariant summary of its neighbors’ states. This architectural restriction has a precise expressivity interpretation: on graphs with identical initial node features, standard MPNNs are no more discriminative than the one-dimensional Weisfeiler–Lehman color refinement procedure (1-WL). In particular, on highly symmetric inputs the node states can remain indistinguishable, and any permutation-invariant graph readout then collapses to the same value across non-isomorphic graphs.
A canonical obstruction arises already for 2-regular graphs. Consider the connected cycle graph C2n and the disconnected union Cn ⊔ Cn. Both graphs are 2-regular and featureless, and 1-WL fails to separate them: every node sees the same multiset of neighbor colors at every refinement step. Consequently, any 1-WL-equivalent MPNN produces identical node embeddings (up to permutation) on both graphs, and therefore any permutation-invariant pooling yields identical graph-level outputs. This provides a minimal example of a ``WL-hard’’ separation task that is not contrived by adversarial features, but instead follows from symmetry and locality.
Graph transformers are a natural candidate to surpass this limitation. By allowing attention-based mixing that is not confined to immediate adjacency, one can in principle aggregate global information in a constant number of layers. However, two difficulties must be addressed simultaneously. First, attention must be compatible with graph symmetries: the model must be permutation equivariant at the node level and permutation invariant at the graph level. Second, naive global attention is dense, incurring Θ(n²) interactions per layer, which is incompatible with the sparse-graph regime where m = O(n) and scalability is measured against m rather than n².
The present work formalizes a route through these constraints by combining (i) sparse global mixing via a constant number of inducing tokens, and (ii) stable spectral positional encodings used only through orthogonally invariant quantities. The guiding principle is that symmetry breaking should be introduced in a controlled way: not by injecting node identifiers (which would violate permutation symmetry), but by attaching to each node a positional encoding derived from the graph itself, computed equivariantly with respect to node relabelings. Spectral methods provide such encodings: low-frequency eigenvectors of the (normalized) Laplacian capture global structure, and their computation on sparse graphs can be carried out via iterative methods using only matrix–vector products with the adjacency.
A technical issue is that eigenvectors are not uniquely defined: sign flips and rotations inside degenerate eigenspaces are unavoidable. We therefore impose a requirement on positional encodings: under a permutation Π, the encoding matrix transforms as PEd(ΠAΠ⊤) = Π PEd(A) R for some orthogonal R ∈ O(d). To ensure that the downstream model is well-defined under this ambiguity, we restrict attention biases to depend on PEd only through O(d)-invariant functions, such as inner products ⟨pu, pv⟩. This is sufficient to recover a meaningful, permutation-consistent notion of relative position without committing to an arbitrary eigenbasis.
Our main contribution is an explicit existence result for a sparse anchor graph transformer (SAGT) operating in the sparse regime. The architecture combines three ingredients within each layer: node-to-anchor attention (to summarize the entire graph into a constant number of anchor states), anchor-to-node attention (to broadcast global context back to nodes), and a local edge-restricted attention step over E augmented by a positional bias derived from ⟨pu, pv⟩. The resulting computation uses only E together with O(ns) node–anchor edges, where s = O(1), and thus scales as Õ(m + ns) per layer up to hidden-width factors. In this sense, the model is a transformer in its mixing mechanism, while retaining the algorithmic footprint of message passing.
We prove three statements, corresponding to expressivity, impossibility for 1-WL-equivalent MPNNs, and computational complexity. These results position sparse transformer-style architectures with stable positional encodings as a principled extension of message passing: they retain permutation symmetry and sparse computation, yet provably exceed 1-WL expressivity on featureless inputs. The remainder of the paper proceeds by formalizing the symmetry notions and 1-WL baseline, specifying the SAGT architecture and its invariances, establishing the MPNN impossibility result, and then giving the constructive separation argument together with complexity bounds.
We work with a simple undirected graph G = (V, E) with |V| = n and |E| = m. Its adjacency matrix is A ∈ {0, 1}n × n, with Auv = 1 iff (u, v) ∈ E, and degree matrix D = diag(A1). We write the (combinatorial) Laplacian as L = D − A; when needed, one may substitute the normalized Laplacian Lnorm = I − D−1/2AD−1/2, but the symmetry notions below are identical for either choice. A node relabeling is represented by a permutation matrix Π ∈ {0, 1}n × n, acting by conjugation on adjacency: A ↦ ΠAΠ⊤. Likewise, a node feature matrix X ∈ ℝn × dx transforms as X ↦ ΠX. We say that two labeled graphs represent the same unlabeled graph when they are related by such a permutation.
Let f be a node-level map producing n node embeddings from a graph (and possibly node features), i.e., f(A, X) ∈ ℝn × h.
We call f permutation equivariant if f(ΠAΠ⊤, ΠX) = Π f(A, X) for all permutation matrices Π.
Similarly, a graph-level readout g(A, X) ∈ ℝq is permutation invariant if g(ΠAΠ⊤, ΠX) = g(A, X) for all Π.
In this paper the graphs are featureless unless explicitly stated: the
initial node features are identical constants, which we may model as
X = 1c⊤
for some fixed c ∈ ℝdx.
In particular, equivariance/invariance must be achieved without relying
on external node identifiers.
The 1-WL procedure maintains discrete colors cv(t)
for nodes v ∈ V. It
starts from an initial coloring cv(0)
(for featureless graphs, all nodes share the same color), and
iterates
$$
c_v^{(t+1)} \;=\; \mathrm{Hash}\!\left(c_v^{(t)},\ \{\!\{\, c_u^{(t)} : u \in N(v) \,\}\!\}\right),
$$
where N(v) denotes the neighbors of v and $\{\!\{\cdot\}\!\}$ denotes the multiset of neighbor colors. Two graphs are 1-WL-indistinguishable if, for every round t, the multisets of colors produced on the two graphs agree up to relabeling of color symbols. In
particular, when a graph is regular and featureless, all nodes retain
the same color at every round, since every node observes the same
multiset at each refinement step.
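As a concrete illustration of this stalling behavior, the following minimal Python sketch runs the refinement above on the cycle pair used throughout the paper; the graph constructors and the hash-by-relabeling step are illustrative choices rather than part of the formal development.
\begin{verbatim}
# Minimal 1-WL color refinement on C_{2n} versus C_n disjoint-union C_n.
from collections import Counter

def cycle_edges(nodes):
    """Undirected edges of a cycle through the given node list."""
    k = len(nodes)
    return [(nodes[i], nodes[(i + 1) % k]) for i in range(k)]

def adjacency(n_nodes, edges):
    adj = {v: [] for v in range(n_nodes)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    return adj

def wl_color_histogram(adj, rounds):
    """Run 1-WL from a constant coloring; return the multiset of final colors."""
    colors = {v: 0 for v in adj}                      # featureless initialization
    for _ in range(rounds):
        signatures = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
                      for v in adj}
        relabel = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
        colors = {v: relabel[signatures[v]] for v in adj}   # "Hash" as relabeling
    return Counter(colors.values())

n = 5
G_conn = adjacency(2 * n, cycle_edges(list(range(2 * n))))
G_disc = adjacency(2 * n, cycle_edges(list(range(n)))
                        + cycle_edges(list(range(n, 2 * n))))

# Every node always sees two copies of the same color, so refinement stalls and
# the color histograms of the two graphs coincide at every round.
print(wl_color_histogram(G_conn, rounds=10))   # Counter({0: 10})
print(wl_color_histogram(G_disc, rounds=10))   # Counter({0: 10})
\end{verbatim}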
An MPNN is a layered node-embedding computation H(t) ∈ ℝn × h,
with H(0) derived
from X, and updates of the
schematic form
hv(t + 1) = ψt (hv(t), AGG({ϕt(hv(t), hu(t)) : u ∈ N(v)})),
where AGG is permutation invariant over
multisets (e.g., sum/mean/max), and ϕt, ψt
are shared functions (e.g., MLPs) applied at every node. A graph-level
embedding is then produced by an invariant readout, e.g.,
z(G) = READOUT({hv(T) : v ∈ V}),
with READOUT permutation invariant
(sum/mean pooling followed by an MLP is standard). By construction, the
node map f(A, X) = H(T)
is permutation equivariant, and the graph map g(A, X) = z(G)
is permutation invariant.
The link to 1-WL is that the information available to a node at depth t is constrained to the rooted t-hop neighborhood aggregated via a multiset operator; thus, on featureless graphs, MPNNs cannot systematically refine symmetries beyond what 1-WL can certify. Concretely, one may state the following standard implication: if two nodes (possibly in different graphs) receive the same 1-WL color at round t, then any 1-WL-equivalent MPNN (i.e., any MPNN with permutation-invariant neighborhood aggregation and shared parameters) can be chosen so that their embeddings at layer t coincide, and conversely MPNN embeddings induce at most a refinement of 1-WL colors. For our purposes we use the simplest corollary: on a k-regular featureless graph, all node embeddings remain identical at every layer. Indeed, with hv(0) constant, induction gives that at step t every node has the same state; hence each node sees the same multiset of k identical messages, so the aggregator outputs the same value everywhere and the update preserves equality. Any invariant readout then depends only on n (and fixed parameters), and therefore cannot distinguish two non-isomorphic k-regular featureless graphs of the same size.
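This collapse can be checked directly in a few lines; the sketch below instantiates a generic sum-aggregation MPNN with shared random weights (the tanh updates and the widths are illustrative assumptions) and confirms that the pooled readouts of C2n and Cn ⊔ Cn coincide.
\begin{verbatim}
# A generic sum-aggregation MPNN with shared random weights collapses on
# featureless 2-regular graphs, so C_{2n} and C_n u C_n pool to the same vector.
import numpy as np

def union_of_cycles(sizes):
    """Block-diagonal adjacency matrix of a disjoint union of cycles."""
    n = sum(sizes)
    A = np.zeros((n, n))
    off = 0
    for k in sizes:
        for i in range(k):
            u, v = off + i, off + (i + 1) % k
            A[u, v] = A[v, u] = 1.0
        off += k
    return A

def mpnn_readout(A, layers=4, width=8, seed=0):
    rng = np.random.default_rng(seed)       # same seed => shared parameters
    H = np.ones((A.shape[0], width))        # featureless constant initialization
    for _ in range(layers):
        W_msg = rng.standard_normal((width, width))
        W_upd = rng.standard_normal((2 * width, width))
        M = A @ np.tanh(H @ W_msg)          # sum over neighbor messages
        H = np.tanh(np.concatenate([H, M], axis=1) @ W_upd)
    return H.sum(axis=0)                    # permutation-invariant sum readout

n = 6
z_conn = mpnn_readout(union_of_cycles([2 * n]))
z_disc = mpnn_readout(union_of_cycles([n, n]))
print(np.allclose(z_conn, z_disc))          # True: the graph embeddings coincide
\end{verbatim}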
Transformer-style attention replaces fixed local aggregators with
data-dependent mixing weights. Given token states Y ∈ ℝN × h,
a (single-head) attention map has the form
$$
\mathrm{Attn}(Y)_i \;=\; \sum_{j\in \mathcal{N}(i)} \alpha_{ij} \, W_V
y_j,
\qquad
\alpha_{ij} \;=\; \frac{\exp(\ell_{ij})}{\sum_{k\in\mathcal{N}(i)}
\exp(\ell_{ik})},
$$
with logits ℓij = ⟨WQyi, WKyj⟩ + bij.
In a graph transformer, 𝒩(i) is specified by an admissible attention pattern; in the dense case 𝒩(i) = {1, …, N}, yielding Θ(N²)
interactions. To respect permutation equivariance on node tokens, it
suffices that (i) the same linear maps WQ, WK, WV
are applied to every token of a given type, (ii) the allowed
neighborhoods 𝒩(i) transform
compatibly under Π, and (iii)
any bias term bij is
defined from permutation-equivariant data (for node tokens, typically
from adjacency indicators and positional information derived from G).
To retain near-linear complexity on sparse graphs, we restrict attention edges to a union of (a) the original graph edges E (possibly symmetrized and with self-loops) and (b) node–anchor edges connecting nodes to a small set of s learned inducing tokens (anchors). Formally, we introduce anchor states U(t) ∈ ℝs × h, treat them as additional tokens, and allow complete bipartite attention between V and the anchors. This contributes O(ns) attention edges, while local node–node attention restricted to E contributes O(m) edges. Consequently, one layer of sparse attention can be implemented in time and memory Õ((m + ns)h), matching the sparse-graph regime when s = O(1) (or O(log n)). The role of anchors is to implement global communication in constant depth: node-to-anchor attention aggregates graph-wide statistics into U(t), and anchor-to-node attention broadcasts this global context back to all nodes, after which an E-restricted mixing step can combine the broadcasted signal with local structure.
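A back-of-the-envelope sketch of the resulting edge budget (with illustrative constants and self-loop conventions) makes the comparison with dense attention concrete.
\begin{verbatim}
# Attention-edge budget: sparse local + anchor mixing versus dense all-pairs.
# For a 2-regular graph |E| = n, so one layer touches O(n + n*s) edges rather
# than n^2.  The direction and self-loop conventions below are illustrative.
def attention_edge_counts(n, s):
    m = n                       # undirected edges of a cycle on n nodes
    local = 2 * m + n           # both directions of each edge, plus self-loops
    anchor = 2 * n * s          # node->anchor and anchor->node bipartite edges
    return local + anchor, n * n

for n in (1_000, 100_000, 1_000_000):
    sparse, dense = attention_edge_counts(n, s=4)
    print(f"n={n:>9,}  sparse={sparse:>12,}  dense={dense:>16,}")
\end{verbatim}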
These preliminaries fix the symmetry constraints and baseline limitations that motivate our subsequent problem formulation, where we focus on separating a canonical 1-WL-hard pair under sparse computational resources.
We consider the task of producing a permutation-invariant graph
representation from a featureless input graph. Concretely, the input is
a simple undirected graph G = (V, E)
presented through its adjacency structure (equivalently, its adjacency
matrix A or adjacency lists),
with all node features identical constants. The output is an
embedding
z(G) ∈ ℝq,
intended to support separation and classification of non-isomorphic
graphs. Since node labels are arbitrary, we require invariance under
relabeling: for every permutation matrix Π,
z(Π(G)) = z(G), where Π(G) denotes
the relabeled graph with
adjacency ΠAΠ⊤.
In this setting, any meaningful distinction must be derived solely from
graph structure.
A central constraint of interest is computational: we target the sparse-graph regime, where m = |E| = O(n) and the model is expected to run in near-linear time in m (up to polylogarithmic factors), with commensurate memory usage. In particular, we exclude dense all-pairs interactions among nodes, since they incur Θ(n²) attention edges and would dominate runtime for large sparse graphs. Our goal is therefore to identify a graph family for which (i) purely local message passing provably fails, yet (ii) separation is achievable by a sparse architecture that augments local computation with a small amount of global communication.
We formalize this by focusing on a canonical global property that is
invisible to local refinement: connectivity within a class of regular graphs. For
each integer n ≥ 3, we define
a pair of 2-regular graphs on the same
number of vertices 2n:
Gconn := C2n, Gdisc := Cn ⊔ Cn,
where Ck
denotes the cycle graph on k
nodes and ⊔ denotes disjoint union.
Both graphs are 2-regular and
featureless; they differ only in a global topological attribute, namely
that Gconn is
connected while Gdisc has two connected
components of equal size. We treat the resulting decision problem as a
minimal separation benchmark: any successful graph representation must
capture information that is not reconstructible from bounded-radius
neighborhoods alone.
This family is a standard obstruction for 1-WL and 1-WL-equivalent message passing. Indeed, 1-WL initialized from a constant coloring cannot refine any node distinctions on a regular graph: every node observes the same multiset of neighbor colors at every round, hence all nodes retain the same color indefinitely. Consequently, Gconn and Gdisc are indistinguishable by 1-WL for every n ≥ 3. By the usual correspondence between 1-WL and MPNNs with permutation-invariant neighborhood aggregation, this implies that any such MPNN operating on featureless inputs computes node embeddings that are constant across nodes at every depth when the input is 2-regular. Any permutation-invariant readout applied to constant node embeddings can depend only on |V| = 2n (and learned parameters), and therefore cannot separate Gconn from Gdisc for fixed n. Our separation target thus sits precisely at the boundary where local message passing is insufficient.
We therefore formulate the following separation objective.
For each n ≥ 3, we seek a
permutation-invariant map G ↦ z(G) such
that
z(Gconn) ≠ z(Gdisc),
with a single parameterization that works uniformly over all n (i.e., the architecture does not
hard-code n). In particular,
the model must extract a signal correlated with connectivity from a
featureless, regular input where all local neighborhoods are isomorphic.
We emphasize that this is not a supervised learning requirement; rather,
it is an expressivity requirement: there must exist parameters (and a
fixed finite depth) for which the architecture separates the two
families.
We restrict attention to algorithms whose dominant operations are
sparse matrix–vector products with A (or L) and sparse aggregations over a
prescribed set of attention edges. The intended complexity per layer is
near-linear in m on sparse
graphs, allowing an additional O(ns) interactions
with a small number s of
auxiliary tokens (anchors). Thus, we permit attention edges of the
form
Eattn = E ∪ Eanchor, |Eanchor| = O(ns),
where s is a fixed constant or
at most O(log n).
This restriction enforces that global information cannot be obtained by
dense node–node mixing; instead, it must be mediated through a small set
of global channels.
Since the graphs are featureless, any separation mechanism must manufacture informative node-wise asymmetry from G itself while remaining well-defined on unlabeled graphs. A natural source of such information is spectral structure of A or L. At a high level, Gconn and Gdisc have markedly different low-frequency Laplacian structure: Gconn has a one-dimensional nullspace of L (connectedness), whereas Gdisc has a two-dimensional nullspace (two components). More generally, their low-eigenvalue geometry differs in a way that can be exposed by a small number d of spectral components. This motivates the use of a positional encoding matrix P = PEd(G) ∈ ℝn × d derived from G and computable by iterative sparse linear-algebra routines in time Õ(md) for fixed d.
However, spectral features introduce an additional symmetry subtlety: eigenvectors are not uniquely defined in degenerate eigenspaces, and even in simple cases they are defined only up to sign. Accordingly, we require that the positional encoding be permutation-equivariant up to an orthogonal gauge, and that the downstream architecture use P only through orthogonally invariant combinations (e.g., inner products ⟨pu, pv⟩), so that the overall model remains a well-defined function on unlabeled graphs. This requirement will be enforced explicitly in our model definition.
In summary, the problem we study is to separate the WL-hard pair (C2n, Cn ⊔ Cn) under permutation invariance and near-linear sparse computation. The negative result for 1-WL-equivalent MPNNs motivates augmenting local sparse mixing with (i) a globally aggregating mechanism of size O(ns) and (ii) a stable, permutation-equivariant positional signal derived from the graph itself. In the next section we define a concrete architecture—a sparse anchor graph transformer—that meets these constraints and achieves the desired separation.
We now define a permutation-invariant architecture that operates in near-linear time on sparse graphs while permitting a controlled form of global communication. The model combines (i) local mixing restricted to the input edges E, (ii) sparse global mixing mediated by a constant number s of inducing tokens, and (iii) a stable positional encoding derived from the spectrum of G and used only through orthogonally invariant interactions.
Let L denote a fixed choice
of Laplacian (combinatorial L = D − A or
normalized, the choice being immaterial for the invariance discussion).
For a prescribed dimension d,
we compute a matrix
P = PEd(G) ∈ ℝn × d, pv := Pv, :,
intended to capture low-frequency geometry. Concretely, one may take
P to be a collection of d eigenvectors corresponding to the
d smallest nontrivial
eigenvalues of L (or a stable
relaxation thereof, e.g. a filtered Krylov basis). Since eigenspaces may
have multiplicity, P is not
unique; accordingly, we impose the standard assumption: for any permutation matrix Π,
PEd(ΠAΠ⊤) = Π PEd(A) R
for some R ∈ O(d) depending on Π and A (and on the chosen numerical
procedure within degenerate subspaces). In sparse graphs, P can be approximated in time Õ(md) via Lanczos
or power iterations, since these require only matrix–vector products
with A or L.
To ensure well-definedness under this orthogonal ambiguity, the model uses P only through O(d)-invariant quantities.
The canonical choice is the inner product kernel
K(u, v) := ⟨pu, pv⟩,
which is invariant under P ↦ PR for any orthogonal R. More
generally, any function of the Gram matrix PP⊤ is
gauge-invariant. We will use K(u, v) as an
additive bias in attention logits.
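The sketch below illustrates this usage pattern: it computes a small spectral encoding with a dense eigendecomposition (standing in, for brevity, for the Lanczos-type routines discussed above) and checks that the edge-wise inner products are unchanged under a random orthogonal gauge P ↦ PR.
\begin{verbatim}
# Spectral positional encodings consumed only through gauge-invariant
# inner products.  Dense eigh on a small cycle is used for clarity.
import numpy as np

def cycle_laplacian(k):
    A = np.zeros((k, k))
    for i in range(k):
        A[i, (i + 1) % k] = A[(i + 1) % k, i] = 1.0
    return np.diag(A.sum(axis=1)) - A, A

rng = np.random.default_rng(0)
L, A = cycle_laplacian(12)
_, vecs = np.linalg.eigh(L)                 # ascending eigenvalues
P = vecs[:, 1:5]                            # one valid basis of the d = 4 PE

R, _ = np.linalg.qr(rng.standard_normal((4, 4)))   # a random gauge in O(4)
P_rot = P @ R                               # an equally valid encoding

edges = np.argwhere(np.triu(A) > 0)
bias = np.array([P[u] @ P[v] for u, v in edges])
bias_rot = np.array([P_rot[u] @ P_rot[v] for u, v in edges])
print(np.allclose(bias, bias_rot))          # True: the bias is gauge-invariant
\end{verbatim}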
We maintain node states Ht ∈ ℝn × h
and anchor states Ut ∈ ℝs × h
for layers t = 0, 1, …, Llayers.
The initialization is featureless at the node level:
Hv0 ≡ c ∈ ℝh for
all v ∈ V,
for a learned constant vector c, and anchors are learned inducing
tokens U0 ∈ ℝs × h.
Anchors are not permuted under node relabelings; they serve as a fixed
global workspace.
Given query vectors xi indexed by i ∈ ℐ, key/value vectors yj indexed by j ∈ 𝒥, and a sparse set of admissible edges 𝒮 ⊆ ℐ × 𝒥, we define single-head attention by
$$
\mathrm{Attn}(x, y;\, \mathcal{S}, b)_i \;=\; \sum_{j:(i,j)\in\mathcal{S}} \alpha_{ij}\, W_V y_j,
\qquad
\alpha_{ij} \;=\; \frac{\exp(\ell_{ij})}{\sum_{k:(i,k)\in\mathcal{S}} \exp(\ell_{ik})},
\qquad
\ell_{ij} \;=\; \langle W_Q x_i,\, W_K y_j\rangle + b_{ij},
$$
with learnable WQ, WK, WV ∈ ℝh × h. Multihead attention is obtained by running this map in parallel heads and concatenating, as usual; we omit this notationally since it does not affect equivariance arguments.
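A minimal reference implementation of this sparse attention map, written with an explicit per-query loop for clarity (a practical implementation would use segment or scatter primitives), is sketched below; the toy shapes in the usage lines are illustrative.
\begin{verbatim}
# Single-head attention restricted to an explicit sparse edge set S, following
# the definition above; b maps admissible pairs (i, j) to an additive logit bias.
import numpy as np

def sparse_attention(X_q, Y_kv, S, b, W_Q, W_K, W_V):
    out = np.zeros_like(X_q @ W_V)
    by_query = {}
    for i, j in S:
        by_query.setdefault(i, []).append(j)
    Q, K, V = X_q @ W_Q, Y_kv @ W_K, Y_kv @ W_V
    for i, neighbors in by_query.items():
        logits = np.array([Q[i] @ K[j] + b.get((i, j), 0.0) for j in neighbors])
        alpha = np.exp(logits - logits.max())
        alpha /= alpha.sum()                     # row-wise softmax over S(i)
        out[i] = sum(a * V[j] for a, j in zip(alpha, neighbors))
    return out

# Toy usage: 5 node tokens attending to 2 anchor tokens over a complete
# bipartite edge set, as in the node-anchor blocks defined next.
rng = np.random.default_rng(0)
h, n, s = 4, 5, 2
H, U = rng.standard_normal((n, h)), rng.standard_normal((s, h))
W_Q, W_K, W_V = (rng.standard_normal((h, h)) for _ in range(3))
S_node_to_anchor = [(i, a) for i in range(n) for a in range(s)]
H_mixed = H + sparse_attention(H, U, S_node_to_anchor, {}, W_Q, W_K, W_V)
print(H_mixed.shape)                             # (5, 4)
\end{verbatim}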
Each layer performs three mixing steps.
Anchors query all nodes over the complete bipartite set
𝒮U ← H = [s] × V,
with no positional bias (or, optionally, a gauge-invariant bias
depending on ⟨ua, pv⟩
if anchors are given their own learned positions). We set
Ut := Ut − 1 + Attn(Ut − 1, Ht − 1; 𝒮U ← H, b ≡ 0),
followed by a positionwise feed-forward network (FFN) and layer
normalization if desired.
Nodes query anchors over
𝒮H ← U = V × [s],
again with b ≡ 0, yielding an
intermediate state
H̃t := Ht − 1 + Attn(Ht − 1, Ut; 𝒮H ← U, b ≡ 0).
These two steps allow every node to exchange information through s global channels at cost O(ns) per
layer.
Finally, nodes attend to their graph neighbors over the sparse
set
𝒮local = {(u, v) : (u, v) ∈ E} ∪ {(v, v) : v ∈ V},
where we include self-loops for stability. The attention bias is
computed by a learnable function ϕ applied to gauge-invariant and
structural indicators:
buv := ϕ(⟨pu, pv⟩, 1[(u, v) ∈ E]).
We then set
$$
H^t \;:=\; \widetilde H^t \;+\; \mathrm{Attn}\bigl(\widetilde H^t,\, \widetilde H^t;\, \mathcal{S}_{\mathrm{local}},\, b\bigr),
$$
followed by an FFN. The restriction to 𝒮local ensures that this step
costs Õ(mh)
per layer on sparse graphs.
Let Π be a permutation
matrix acting on node indices. By construction, the anchor set is fixed,
while node states transform as Ht ↦ ΠHt.
In steps (1)–(2), the bipartite attention is equivariant because the
softmax normalization in the attention map above is taken row-wise over the permuted node set,
and the summation is over the same multiset of keys/values. In step (3),
the admissible set 𝒮local is
exactly the edge set (plus self-loops), hence it permutes to itself
under relabeling, and the bias term is unchanged because ⟨pu, pv⟩
is invariant under the gauge transformation P ↦ PR implied by the equivariance assumption on PEd. Consequently, for all t, we
obtain permutation equivariance of node states:
Ht(Π(G)) = Π Ht(G),
and well-definedness with respect to the orthogonal ambiguity in P.
The final graph embedding is obtained by a permutation-invariant
pooling over nodes, followed by an MLP:
z(G) := MLP (∑v ∈ VHvLlayers).
Any commutative pooling (sum/mean) suffices; we use the sum for
definiteness. This readout yields z(Π(G)) = z(G)
by the equivariance of HLlayers.
For fixed h and s = O(1), each layer processes O(m) local attention edges and O(ns) anchor edges, giving time Õ((m + ns)h) and memory Õ((m + ns)h) per layer, plus preprocessing Õ(md) to compute P. This matches the sparse access model while enabling global information flow through the anchors and the PE-dependent local biases.
We formalize the obstruction for message-passing neural networks
(MPNNs) on the WL-hard pair
Gconn = C2n, Gdisc = Cn ⊔ Cn,
when all input node features are identical. The argument is standard,
but we record it in a form that makes clear that neither depth nor width
can overcome the limitation, and that the conclusion is strictly
stronger than merely invoking the 1-WL indistinguishability of these
graphs.
Consider an MPNN with T
layers maintaining hidden states hvt ∈ ℝh
for each node v ∈ V
and time t = 0, 1, …, T. We assume
the update has the generic message-passing form
$$
h_v^{t} \;=\; \Psi_t\!\left(h_v^{t-1},\ \mathrm{AGG}_t\!\left(\{\!\{\,\Phi_t(h_v^{t-1}, h_u^{t-1}) : u \in N(v)\,\}\!\}\right)\right),
$$
where N(v) is the (unordered) neighborhood of v, AGGt is permutation-invariant over multisets (e.g. sum/mean/max, or any DeepSets-style invariant map), and Ψt, Φt are shared across nodes. This covers essentially all 1-WL-equivalent MPNNs in the literature; the key property is that the update is local and treats neighbors as an unlabeled multiset. We further assume featureless initialization:
$$
h_v^{0} \;=\; c \quad \text{for all } v \in V,
$$
for some constant (learned or fixed) vector c.
Both Gconn and
Gdisc are 2-regular on the same number of nodes, namely
|V| = 2n. Applying the k-regular collapse observation above with k = 2, we conclude that for any depth T and any parameterization of the maps Ψt, Φt, AGGt, the final node embeddings satisfy
$$
\{\!\{\, h_v^{T}(G_{\mathrm{conn}}) : v \in V \,\}\!\} \;=\; \{\!\{\, h_v^{T}(G_{\mathrm{disc}}) : v \in V \,\}\!\}
$$
as multisets, with every embedding equal to one and the same vector on both graphs.
Let the graph-level output be any permutation-invariant readout
$$
z(G) \;=\; \mathrm{READOUT}\!\left(\{\!\{\, h_v^{T} : v \in V \,\}\!\}\right),
$$
where READOUT may be sum/mean pooling followed by an MLP, or any multiset-invariant map. Since both graphs produce the same multiset of 2n identical vectors, we obtain
z(Gconn) = z(Gdisc).
In particular, no such MPNN can separate C2n from Cn ⊔ Cn
at the graph level.
It is instructive to compare the above collapse to the 1-WL color refinement process. Starting from a constant color on all vertices, 1-WL refines colors by hashing the multiset of neighbor colors. On a 2-regular graph, every node sees the same multiset (two copies of the same color) at every iteration, so the coloring remains uniform forever, and the induced color histograms coincide between C2n and Cn ⊔ Cn. The MPNN limitation proven above is therefore consistent with, and in fact implies, the 1-WL indistinguishability of this pair in the featureless setting: local multiset aggregation cannot represent the global invariant ``number of connected components’’ when all local neighborhoods look identical.
The proof crucially uses that the initialization contains no symmetry-breaking information (no node identifiers, no random features, and no external positional encoding). This is precisely the regime in which the separation result for our sparse-anchor transformer must operate: the additional distinguishing power must arise from stable global structure (here, spectral positional information and anchor-mediated aggregation), rather than from injecting labels by hand.
We now give an explicit parameter choice for SAGT that separates
Gconn = C2n, Gdisc = Cn ⊔ Cn,
for every n ≥ 3, while using
only sparse attention edges E ∪ Eanchor with
s = O(1) anchors and
using the positional encoding only through O(d)-invariant
quantities.
For any graph G on N nodes, let L = 2I − A be the
unnormalized Laplacian (we use 2I − A since our graphs are
2-regular, hence D = 2I). Define PE4(G) ∈ ℝN × 4
as an orthonormal basis of the direct sum of the Laplacian eigenspaces
corresponding to the four smallest nonzero eigenvalues, counted with multiplicity. Concretely,
for a cycle CN the spectrum
of L is
$$
\lambda_k \;=\; 2-2\cos\Bigl(\frac{2\pi k}{N}\Bigr),
\qquad k=0,1,\dots,N-1,
$$
with λ0 = 0 and,
for 1 ≤ k < N/2, the
eigenvalue λk has
multiplicity 2 (corresponding to the
cosine/sine modes). Thus on C2n (where N = 2n ≥ 6), the four smallest nonzero eigenvalues are λ1, λ1, λ2, λ2, so the corresponding invariant subspace is spanned by the k = 1 and k = 2 modes and has dimension 4; we take PE4(C2n) to be any orthonormal basis of this 4-dimensional invariant subspace. On Cn ⊔ Cn, the four smallest nonzero eigenvalues all equal 2 − 2cos(2π/n), and the corresponding invariant subspace is the direct sum of the k = 1 subspaces of the two components, again of total dimension 4; hence PE4(Gdisc) is well-defined up to an orthogonal gauge within this 4-space. In either case, the Gram quantities
K(u, v) := ⟨pu, pv⟩ for
pv the row of
PE4(G),
depend only on the orthogonal projector onto the chosen 4-dimensional invariant subspace, and are
therefore invariant under the gauge transformation PE4 ↦ PE4R
with R ∈ O(4).
We consider the scalar graph functional
$$
S(G) \;:=\; \frac{1}{|E|}\sum_{\{u,v\}\in E} \langle p_u, p_v\rangle,
$$
the average of the gauge-invariant kernel K over the edges of G. By the preceding discussion, S(G) is permutation-invariant and gauge-invariant.
We compute S(G)
explicitly for our pair. First let G = CN
with vertices indexed cyclically by {0, 1, …, N − 1}. The orthogonal
projector onto the k = 1, 2
Fourier subspace has kernel
$$
K(i,j)\;=\;\frac{2}{N}\Bigl(\cos\Bigl(\frac{2\pi
(i-j)}{N}\Bigr)+\cos\Bigl(\frac{4\pi (i-j)}{N}\Bigr)\Bigr),
$$
independently of the particular orthonormal basis used. In particular,
for each edge (i, i + 1) we have
$$
\langle p_i,p_{i+1}\rangle
\;=\;
\frac{2}{N}\Bigl(\cos\Bigl(\frac{2\pi}{N}\Bigr)+\cos\Bigl(\frac{4\pi}{N}\Bigr)\Bigr),
$$
so that S(CN)
equals this same value.
Applying this with N = 2n yields
$$
S(G_{\mathrm{conn}}) \;=\; \frac{1}{n}\Bigl(\cos\Bigl(\frac{\pi}{n}\Bigr)+\cos\Bigl(\frac{2\pi}{n}\Bigr)\Bigr).
$$
For Gdisc = Cn ⊔ Cn, the projector (and hence K(u, v)) is block-diagonal across the two components and coincides on each component with the k = 1 projector for Cn, whose kernel is (2/n)cos(2π(i − j)/n). Therefore each edge lies within a component and has
$$
\langle p_i,p_{i+1}\rangle
\;=\;
\frac{2}{n}\cos\Bigl(\frac{2\pi}{n}\Bigr),
$$
hence
$$
S(G_{\mathrm{disc}}) \;=\; \frac{2}{n}\cos\Bigl(\frac{2\pi}{n}\Bigr).
$$
Comparing the two expressions, we claim S(Gconn) ≠ S(Gdisc) for all n ≥ 3. Indeed, if they were equal then after multiplying by n we would have
$$
\cos\Bigl(\frac{\pi}{n}\Bigr)+\cos\Bigl(\frac{2\pi}{n}\Bigr)
\;=\;
2\cos\Bigl(\frac{2\pi}{n}\Bigr),
$$
i.e.,
$$
\cos\Bigl(\frac{\pi}{n}\Bigr)
\;=\;
\cos\Bigl(\frac{2\pi}{n}\Bigr).
$$
For n ≥ 3 this is impossible: 0 < π/n < 2π/n < π and cos is strictly decreasing on (0, π), so cos(π/n) > cos(2π/n). The same monotonicity argument yields the strict inequality S(C2n) > S(Cn ⊔ Cn) for all n ≥ 3; for instance, at n = 3 we have S(C6) = 0 while S(C3 ⊔ C3) = −1/3. Thus the functional S separates the pair uniformly in n.
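The following short numerical check (dense eigendecompositions and small sizes, both simplifying assumptions) confirms that S evaluated from an actual 4-dimensional spectral encoding matches the closed-form values above and separates the pair.
\begin{verbatim}
# Check: S(C_2n) = (cos(pi/n) + cos(2pi/n))/n  and  S(Cn u Cn) = 2 cos(2pi/n)/n,
# and both match S computed directly from a 4-dimensional spectral encoding.
import math
import numpy as np

def union_of_cycles(sizes):
    n = sum(sizes)
    A = np.zeros((n, n))
    off = 0
    for k in sizes:
        for i in range(k):
            u, v = off + i, off + (i + 1) % k
            A[u, v] = A[v, u] = 1.0
        off += k
    return A

def S_from_spectrum(A, d=4, tol=1e-9):
    """Mean edge inner product of a PE spanning the d smallest nonzero Laplacian
    eigenvalues; by gauge invariance the basis choice does not matter."""
    L = np.diag(A.sum(axis=1)) - A
    vals, vecs = np.linalg.eigh(L)                 # ascending
    P = vecs[:, np.where(vals > tol)[0][:d]]
    edges = np.argwhere(np.triu(A) > 0)
    return float(np.mean([P[u] @ P[v] for u, v in edges]))

for n in (3, 4, 5, 8, 12):
    s_conn = S_from_spectrum(union_of_cycles([2 * n]))
    s_disc = S_from_spectrum(union_of_cycles([n, n]))
    assert abs(s_conn - (math.cos(math.pi/n) + math.cos(2*math.pi/n)) / n) < 1e-8
    assert abs(s_disc - 2 * math.cos(2*math.pi/n) / n) < 1e-8
    assert s_conn > s_disc                         # separation, uniformly in n
    print(f"n={n:2d}  S(C_2n)={s_conn:+.4f}  S(Cn u Cn)={s_disc:+.4f}")
\end{verbatim}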
We show that SAGT can compute S(G) with constant width and one anchor. We take hidden width h = 1, number of anchors s = 1, and Llayers = 2. We interpret the local ``edge attention with PE’’ step as an edge-aware attention block that may inject the scalar ϕ(⟨pu, pv⟩, 1[(u, v) ∈ E]) into the value channel along (u, v) ∈ E (this is a standard graph-transformer variant and preserves the same sparsity and asymptotic cost).
Initialize node states by Hv0 = 0
and the single anchor state by a fixed learned scalar U0 ≠ 0. In the first
layer, we perform only the local edge block and set the attention logits
to be constant on each node’s neighborhood (e.g. by using zero
queries/keys), so that each node averages its incident edge-values.
Choose the edge-value injection to be exactly ϕ(⟨pv, pu⟩, 1) = ⟨pv, pu⟩,
and set the usual node-value term to zero. Then the output at node v after this block is
$$
H^1_v \;=\; \frac{1}{\deg(v)}\sum_{u\in N(v)} \langle p_v,p_u\rangle.
$$
On both Gconn and Gdisc we have deg(v) = 2, hence
$$
\frac{1}{|V|}\sum_{v\in V} H^1_v \;=\; \frac{1}{|V|}\sum_{\{u,v\}\in E}\langle p_u, p_v\rangle \;=\; S(G),
$$
where the last equality uses |E| = |V| on 2-regular graphs.
In the second layer, we use the node-to-anchor attention with constant logits so that the anchor computes the mean of {Hv1}v ∈ V. Concretely, we set the anchor query/key projections so that all nodes receive equal weight, and we set the anchor value projection to be the identity. With the residual connection, the anchor state U2 equals U0 + $\frac{1}{|V|}\sum_v H^1_v$ = U0 + S(G), an affine function of S(G). Finally, we let READOUT output this anchor scalar. Since S(Gconn) ≠ S(Gdisc), we obtain distinct graph embeddings, proving separation.
All computations are permutation-equivariant at the node level and invariant at the graph level, and the only dependence on PE4(G) is through inner products ⟨pu, pv⟩, which are gauge-invariant. The resource bound follows from sparsity: computing the required inner products along edges costs O(md) with d = 4, and each layer costs Õ((m + ns)h) with h = 1 and s = 1.
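A compact sketch of this two-layer computation is given below: the first (local) step averages the gauge-invariant edge biases at each node, and the second (node-to-anchor) step takes their mean, so the anchor scalar equals U0 + S(G); the dense eigendecomposition and the constant U0 are illustrative simplifications.
\begin{verbatim}
# Constructive separation: local averaging of PE inner products, then a single
# anchor pooling step, reproducing U^2 = U^0 + S(G) for the two graph families.
import numpy as np

def union_of_cycles(sizes):
    n = sum(sizes)
    A = np.zeros((n, n))
    off = 0
    for k in sizes:
        for i in range(k):
            u, v = off + i, off + (i + 1) % k
            A[u, v] = A[v, u] = 1.0
        off += k
    return A

def pe4(A, tol=1e-9):
    """Orthonormal basis for the 4 smallest nonzero Laplacian eigenvalues."""
    L = np.diag(A.sum(axis=1)) - A
    vals, vecs = np.linalg.eigh(L)
    return vecs[:, np.where(vals > tol)[0][:4]]

def sagt_readout(A, U0=1.0):
    P = pe4(A)
    # Layer 1, local block with constant logits: average incident edge biases.
    H1 = np.array([np.mean([P[v] @ P[u] for u in np.nonzero(A[v])[0]])
                   for v in range(A.shape[0])])
    # Layer 2, node-to-anchor block with constant logits and residual: mean pool.
    return U0 + H1.mean()                    # READOUT outputs the anchor scalar

n = 7
z_conn = sagt_readout(union_of_cycles([2 * n]))
z_disc = sagt_readout(union_of_cycles([n, n]))
print(z_conn, z_disc, abs(z_conn - z_disc) > 1e-6)   # distinct graph embeddings
\end{verbatim}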
We record two ablations showing that the two ingredients used above—a globally informative positional encoding and a sparse global mixing channel—are not cosmetic in the present resource regime.
Let SAGT¬PE denote the architecture obtained by deleting positional encodings from all attention logits/values, i.e., forcing the bias term ϕ(⟨pu, pv⟩, 1[(u, v) ∈ E]) to be independent of ⟨pu, pv⟩ (equivalently, setting pv ≡ 0).
The corresponding no-PE impossibility statement formalizes that anchors alone do not break the symmetry of featureless regular graphs: without a positional signal that is not itself constant on vertex-transitive inputs, the model remains blind to the connected/disconnected distinction in this WL-hard pair.
We next isolate the role of the anchor channel as a constant-depth
mechanism for global mixing. To state a clean necessity result, we
consider a broad ``bounded-hop'' class capturing architectures whose per-node computation depends only on a bounded-radius neighborhood in the input graph. This includes standard graph transformers with attention restricted to E (no node–anchor edges) and any message-passing architecture of fixed depth, and also covers the common ablation in which the ``positional encoding'' is replaced by an r-local structural feature (e.g. random-walk features of length ≤ r, counts of length-≤ r walks, or any function of the rooted r-hop neighborhood).
Formally, fix integers r, L ≥ 0. We say that an architecture is (r, L)-bounded-hop if: (i) initial node features are r-local isomorphism invariants (in particular, constant features correspond to r = 0), and (ii) each of the L layers updates node states using only information from graph neighbors (attention edges restricted to E), so that the final state at a node v is a function only of the rooted (r + L)-hop neighborhood around v.
The corresponding bounded-hop impossibility statement yields the promised resource-aware claim: without the anchor channel (or some other non-local conduit), constant-depth architectures that only access bounded-hop information cannot separate the pair once n exceeds the receptive field. In contrast, adding s = O(1) inducing anchors creates a diameter-O(1) communication path (node → anchor → node) using only O(ns) extra sparse edges, allowing constant-depth global aggregation while preserving near-linear per-layer cost.
We record the resource bounds of a forward pass of SAGT when attention is restricted to the
sparse edge set E ∪ Eanchor with
|Eanchor| = O(ns).
Fix hidden width h, number of
heads a constant, and let L
denote the number of layers.
In each layer we perform three sparse attention blocks: node→anchor (complete bipartite), anchor→node (complete bipartite), and node→node (restricted to E).
For node→anchor attention, we compute
logits between s queries and
n keys, apply a row-wise
softmax over n keys for each
anchor, and aggregate values; this costs Õ(nsh) time and Õ((n + s)h) memory for activations, plus O(ns) memory if the attention weights are materialized explicitly (which can be avoided by streaming).
The same bound holds for anchor→node
attention.
For the local node→node block, we
restrict to edges (u, v) ∈ E (and,
if desired, include self-loops); computing logits, softmax normalization
over each node’s adjacency list, and weighted sums costs Õ(mh) time and
Õ(mh) memory
in the usual sparse attention implementation.
Hence, for fixed h and s = O(1), the total
per-layer complexity is
Õ((m + ns)h), Õ((m + ns)h) memory,
and a full L-layer pass is
Õ(L(m + ns)h)
time with Õ((m + ns)h)
working memory, in agreement with the resource regime stated in the
global context.
We emphasize that no n × n attention matrix is
formed: the only global edges are the O(ns) node–anchor
links.
At inference time, positional encodings enter the model only through
evaluations of an orthogonally invariant function of the rows,
e.g.,
K(u, v) = ⟨pu, pv⟩ = (PEd(G)PEd(G)⊤)uv,
and these evaluations are needed only on (u, v) ∈ E (or on
whatever sparse set of logits we bias).
Consequently, even though K(⋅, ⋅) is a dense kernel in
principle, we never materialize it: for each queried pair we compute
⟨pu, pv⟩
in O(d) time.
Thus the incremental per-layer overhead for PE-biased local attention is
Õ(md) time
(and Õ(nd)
to store P = PEd(G)),
which is negligible for fixed d relative to the Õ(mh) attention
cost when h ≫ d and
m = Θ(n).
The node–anchor blocks can similarly incorporate pv-dependent
biases at cost Õ(nsd) if
desired; in the algorithmic template above we set the anchor blocks’
bias to 0 to isolate the role of
anchors as a mixing channel, but allowing PE-bias there does not change
the asymptotics.
When PEd(G) is taken
to be the matrix of the bottom d nontrivial eigenvectors of a
Laplacian (combinatorial or normalized), it is standard to compute an
approximate basis using Krylov subspace methods (Lanczos) requiring only
matrix–vector products with L.
On a sparse graph, each product x ↦ Lx costs O(m) time (and O(n) additional time to
apply diagonal degree terms), and each Lanczos iteration stores a
constant number of n-vectors.
To obtain d extremal
eigenvectors to accuracy ε one
typically performs T = Õ(dlog (1/ε))
iterations under mild spectral assumptions, yielding preprocessing
time
Õ(md) and
memory Õ(nd),
up to polylogarithmic factors, consistent with the stated Õ(md) access
model.
If L is normalized, the
required products involve D−1/2AD−1/2
and hence still admit O(m) sparse
multiplication.
In settings where exact eigensolvers are numerically unstable due to
near-degeneracy (as occurs for highly symmetric graphs), one may instead
compute a stable spectral basis (which is precisely the object
controlled by perturbation theory) and then rely on the model’s
orthogonal gauge invariance as discussed next.
For spectral PEs, the ambiguity is not a defect but a symmetry: if
P = PEd(G) ∈ ℝn × d
spans a d-dimensional
invariant subspace, then any PR with R ∈ O(d) is an
equally valid choice, and in degenerate eigenspaces the output of
numerical routines may vary arbitrarily within this rotation.
Our architecture is designed so that the PE enters only through
functions invariant under P ↦ PR, such as
PP⊤ or
pairwise inner products ⟨pu, pv⟩.
Therefore the induced attention bias is gauge-invariant:
⟨(PR)u, (PR)v⟩ = ⟨pu, pv⟩,
and the model is well-defined even when eigenvectors are not uniquely
determined.
The same observation addresses the common sign ambiguity of eigenvectors
(a special case of R ∈ O(d)).
This invariance is also the mechanism by which permutation equivariance
of P up to gauge, namely PEd(ΠAΠ⊤) = ΠPR,
yields permutation equivariance of all PE-derived biases on sparse
edges.
Two stability notions are relevant: (i) stability of the computed
positional encoding, and (ii) stability of the downstream transformer
computation with respect to perturbations in the bias.
For (i), assume P has
orthonormal columns and P̂ is
an approximation satisfying
minR ∈ O(d)∥P̂ − PR∥F ≤ ε.
Then the Gram matrices obey
∥P̂P̂⊤ − PP⊤∥2 ≤ 2ε + ε2,
so every biased logit term ⟨p̂u, p̂v⟩
differs from ⟨pu, pv⟩
by at most O(ε)
uniformly over pairs.
For (ii), attention is a smooth map of its logits: writing the row-wise
softmax as σ(⋅), one has a
uniform bound on the Jacobian operator norm of σ (e.g. in ℓ2, ∥Dσ∥2 ≤ 1/2),
hence small perturbations of the logits yield proportionally small
perturbations of attention weights.
Composing with linear value aggregation and the feed-forward blocks
shows that, for fixed layer widths and bounded weights, the node/anchor
states are Lipschitz in the PE-derived bias on the sparse edge
set.
Finally, to relate changes in G to changes in P, we appeal to standard subspace
perturbation results (Davis–Kahan): if L is perturbed to L + ΔL and the
relevant d-dimensional
eigenspace is separated by a spectral gap γ > 0 from the rest of the
spectrum, then the principal angle between the subspaces is O(∥ΔL∥2/γ),
and consequently the gauge-aligned error ε above scales as O(∥ΔL∥2/γ).
Thus, in regimes where the chosen spectral components are
well-separated, both the PE and the resulting SAGT computation are stable under small graph
perturbations and under approximate eigen-computation, while remaining
well-defined under exact degeneracy by virtue of gauge invariance.
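The Gram-matrix bound in (i) is easy to spot-check numerically; the sizes and perturbation scale in the sketch below are arbitrary illustrative choices.
\begin{verbatim}
# Spot-check of || Phat Phat^T - P P^T ||_2 <= 2*eps + eps^2 for P with
# orthonormal columns and min over R in O(d) of || Phat - P R ||_F <= eps.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 4
P, _ = np.linalg.qr(rng.standard_normal((n, d)))     # orthonormal columns
R, _ = np.linalg.qr(rng.standard_normal((d, d)))     # an arbitrary gauge in O(d)
E = 1e-2 * rng.standard_normal((n, d))               # additive perturbation
P_hat = P @ R + E

eps = np.linalg.norm(E, "fro")                       # upper-bounds the aligned error
lhs = np.linalg.norm(P_hat @ P_hat.T - P @ P.T, 2)   # spectral norm of Gram change
print(lhs <= 2 * eps + eps**2)                       # True
\end{verbatim}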
While our separation argument is constructive and does not rely on empirical evidence, it is natural to validate in controlled settings that the proposed ingredients (spectral PE used only through gauge-invariant statistics, together with an anchor-based global mixing channel) are not merely sufficient in principle but also effective under standard training.
We consider binary classification on featureless 2-regular graphs, with
positive examples Gconn = C2n
and negative examples Gdisc = Cn ⊔ Cn,
for multiple values of n.
All node features are identical constants, and we optionally randomize
node orderings (apply a uniformly random permutation Π) per epoch to verify permutation
invariance in practice.
Training and testing can be stratified by size in two regimes: (i) where
train/test sizes coincide (fixed n), and (ii) where training uses
n ∈ [nmin, nmax]
and testing uses larger n
(e.g. nmax up to an
order of magnitude larger).
The latter setting is informative because the two graph families have a
uniform local structure for all n, and hence any success must arise
from nonlocal aggregation of PE-dependent structure rather than
memorization of finite-size artifacts.
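A minimal sketch of this data-generation protocol (the function names and size ranges are illustrative) is given below.
\begin{verbatim}
# Generate (adjacency, label) pairs for the WL-hard benchmark, applying a fresh
# uniformly random node relabeling to each sample.
import numpy as np

def union_of_cycles(sizes):
    n = sum(sizes)
    A = np.zeros((n, n))
    off = 0
    for k in sizes:
        for i in range(k):
            u, v = off + i, off + (i + 1) % k
            A[u, v] = A[v, u] = 1.0
        off += k
    return A

def make_dataset(ns, rng):
    samples = []
    for n in ns:
        for A, label in ((union_of_cycles([2 * n]), 1),    # connected: C_{2n}
                         (union_of_cycles([n, n]), 0)):     # disconnected: Cn u Cn
            perm = rng.permutation(A.shape[0])
            samples.append((A[np.ix_(perm, perm)], label))  # A -> P A P^T
    return samples

rng = np.random.default_rng(0)
train = make_dataset(range(5, 20), rng)      # small sizes for training
test = make_dataset(range(50, 60), rng)      # larger sizes for size generalization
print(len(train), len(test))                 # 30 20
\end{verbatim}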
We compare SAGT against standard
1-WL-equivalent MPNNs (e.g. GCN/GraphSAGE/GIN with sum/mean aggregation)
with global pooling, and against straightforward extensions that attempt
to increase receptive field (e.g. k-hop message passing, Jumping
Knowledge, or increasing depth with residual connections).
We additionally include: (a) a ``PE-only transformer'' that has access to PEd(G) at nodes but uses only local attention edges E (no anchors), (b) an ``anchor-only''
variant where local edge attention is removed but node–anchor attention
is retained, and (c) spectral baselines that operate on graph-level
summaries such as the smallest d nontrivial Laplacian eigenvalues
(sorted), or low-order spectral moments tr(Lr) for small
r.
The spectral baselines are included because they can distinguish C2n from Cn ⊔ Cn
when n is known or encoded,
but they lack the permutation-equivariant node representations that our
architecture maintains and they do not satisfy the same computational
access pattern when generalized beyond low-dimensional summaries.
To empirically support the logical separation proof, we test the
following ablations, keeping total width and depth comparable.
(i) No positional encoding: set ϕ ≡ 0 (or
equivalently ignore P). On
featureless regular graphs this reduces to a permutation-equivariant
computation with identical node states at every layer, and we expect
accuracy near chance ( ≈ 50%)
regardless of training time.
(ii) No anchors: set s = 0 so that all
computation is restricted to E. Even with PE-biased local
attention, a purely local model can fail to aggregate global statistics
robustly at shallow depth; we expect that separating C2n from Cn ⊔ Cn
will degrade sharply as n
grows unless depth scales with diameter.
(iii) Gauge-sensitive PE usage: feed pv directly as a
vector feature without restricting usage to O(d)-invariant functions
(or use a fixed but arbitrary eigenvector orientation). On highly
symmetric graphs this can introduce instability across runs and across
numerical eigensolvers; empirically, this typically manifests as high
variance and poor size generalization, consistent with the gauge
ambiguity discussed earlier.
We report classification accuracy and calibration (e.g. cross-entropy)
as a function of n, along with
runtime and peak memory per forward pass.
For runtime scaling, we sweep n over a wide range (e.g. 10³ to 10⁶ for sparse cycles) and verify
that the measured wall-clock time per layer scales linearly with m plus the ns term from anchors, in
agreement with the Õ((m + ns)h)
profile.
We also report sensitivity to PE dimension d and number of anchors s, expecting a small d (e.g. d ∈ {4, 8, 16}) and constant s (e.g. s ∈ {1, 2, 4, 8}) to suffice for
this benchmark, while larger d
primarily affects preprocessing time through Õ(md).
On the WL-hard benchmark, SAGT should
attain near-perfect accuracy with modest depth (e.g. L ≤ 4) and constant anchors, and
should maintain performance under permutation randomization and across
eigensolver runs (owing to gauge-invariant PE usage).
In contrast, 1-WL-equivalent MPNNs with constant initial features should
remain at chance, matching the lower bound.
The PE-only local transformer is expected to perform well only when
depth is allowed to grow with n (or when attention edges are
densified, violating the sparse regime), highlighting the role of the
anchor channel as an efficient global mixer.
Spectral graph-level baselines may separate the families when
eigenvalues are computed accurately, but they do not provide a unified
node-level architecture and their applicability to more general tasks is
limited; we include them chiefly to confirm that the signal exploited by
SAGT is genuinely spectral and not an
implementation artifact.
These experiments do not add logical force to the theorem, but they serve to corroborate that the proposed architecture achieves the intended separation in the stated resource regime, and that the separation degrades precisely when the PE signal or the anchor-mediated global mixing is removed.
Our separation result should be read as an existence statement about
a concrete architectural motif (sparse anchors plus gauge-invariant
spectral bias) rather than as a characterization of the full expressive
power of transformers on graphs.
In particular, the pair (Gconn, Gdisc) = (C2n, Cn ⊔ Cn)
is chosen to isolate a single obstruction: on featureless regular
graphs, any 1-WL-equivalent MPNN maintains identical node states at
every depth and therefore cannot express connectivity.
We do not claim that the proposed positional encoding and attention
pattern is necessary for separating these graphs, nor that it suffices
for all pairs that are hard for 1-WL (or even for 2-WL).
What the construction demonstrates is that one can separate a canonical
WL-hard family while preserving a near-linear sparse access pattern and
strict permutation invariance at the graph level.
The distinguishing information exploited by SAGT is fundamentally spectral and
low-frequency: the positional encodings PEd(G) are
derived from the Laplacian eigenspaces, and the model uses them only
through O(d)-invariant interactions
such as ⟨pu, pv⟩.
For cycles, the low-frequency Laplacian eigenvectors embed nodes
approximately on a one-dimensional manifold, and the disconnected union
yields two such manifolds with no cross-component mixing; this induces a
global statistic that can be aggregated efficiently through
anchors.
This mechanism extends immediately to separating many pairs that differ
by a macroscopic global property detectable by low-frequency spectrum
(e.g. number of components, coarse expansion, or large-scale
bottlenecks), provided the chosen d captures the relevant eigenspaces
and the downstream attention/readout extracts an appropriate invariant
summary.
Conversely, the mechanism is limited on pairs that are cospectral (or nearly so) with respect to the Laplacian, or more generally on pairs for which the multiset of invariants made available to the model (e.g. pairwise inner products {{⟨pu, pv⟩}}
restricted to edges and anchor edges) agrees.
Since low-dimensional spectral information is not a complete isomorphism
invariant, we should not expect a fixed d and a fixed family of invariant
functions to separate all nonisomorphic graphs.
A natural concern is whether the preprocessing PEd(G)
implicitly solves a problem as hard as graph isomorphism, thereby
trivializing the separation.
We avoid this in two ways.
First, the positional encoding is required to be computable from sparse
matrix–vector products with A
or L in Õ(md) time for
small d, which is the same
oracle access used by many scalable spectral routines.
Second, we require permutation equivariance up to an orthogonal gauge:
for Π a permutation matrix,
PEd(ΠAΠ⊤) = Π PEd(A) R
for some R ∈ O(d), and the
model only consumes P through
O(d)-invariant
functions.
This prevents the preprocessing from smuggling in an arbitrary canonical
labeling (which would indeed be too strong) and forces the pipeline to
respect symmetries and eigenspace degeneracies.
That said, there are genuine algorithmic subtleties.
Approximating low-frequency eigenvectors can be numerically delicate
when spectral gaps are small; cycles are an extreme case where the gap
scales as Θ(1/n²).
If one insists on accurate eigenvectors in the classical sense, naive
iterative methods may require many iterations, and thus the informal
Õ(md) claim
hides a dependence on conditioning and the desired approximation
tolerance.
For the specific separation considered here, however, one typically does
not require high-precision eigenvectors: coarse low-frequency structure
(or even a stable proxy thereof) suffices to induce different global
statistics for C2n versus Cn ⊔ Cn.
This motivates two refinements that remain within the stated access
pattern: (i) using diffusion-based encodings (e.g. heat kernel features
exp (−tL)ev
or random-walk landing probabilities) approximated by truncated
polynomials, which can be computed in O(mk) time for
truncation degree k and are
stable under small perturbations; and (ii) using multiple time scales
t (or multiple polynomial
degrees) to mitigate sensitivity to any single eigengap.
Such alternatives preserve permutation equivariance and can be consumed
by the same gauge-invariant inner-product biases.
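A minimal sketch of alternative (i) is given below: diffusion-style features exp(−tL)X for a few random probe vectors X, approximated by a truncated Taylor polynomial that uses only sparse matrix–vector products; the probe count, time scale, and truncation degree are illustrative choices.
\begin{verbatim}
# Truncated-polynomial approximation of exp(-t L) X using sparse mat-vecs only.
import numpy as np
import scipy.sparse as sp

def cycle_laplacian_sparse(k):
    rows = np.arange(k)
    cols = (rows + 1) % k
    A = sp.coo_matrix((np.ones(k), (rows, cols)), shape=(k, k))
    A = (A + A.T).tocsr()
    return sp.diags(np.asarray(A.sum(axis=1)).ravel()) - A

def heat_features(L, X, t=0.5, degree=12):
    """Taylor expansion of exp(-t L) @ X; one sparse mat-vec per added term."""
    out = X.copy()
    term = X.copy()
    for j in range(1, degree + 1):
        term = (-t / j) * (L @ term)     # next term: (-t L)^j X / j!
        out = out + term
    return out

rng = np.random.default_rng(0)
n, r = 1000, 8
L = cycle_laplacian_sparse(n)
X = rng.standard_normal((n, r)) / np.sqrt(r)   # random probe vectors
F = heat_features(L, X)                        # O(m * degree * r) total cost
print(F.shape)                                 # (1000, 8)
\end{verbatim}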
Our global mixing channel uses s = O(1) inducing anchors
with O(ns)
additional edges.
This suffices for the exhibited separation because the model needs only
to aggregate a constant-dimensional summary of a global pattern.
For more complex global properties, constant s may be information-bottlenecked:
anchors behave like a low-rank communication channel between distant
parts of the graph.
Allowing s = O(log n) or
mild growth with n can enlarge
this channel while preserving near-linear complexity, but at the cost of
more memory and more attention edges.
A separate limitation is that anchors do not remove the need for local
computation; rather, they complement it.
In tasks where the discriminative signal depends on high-resolution
local motifs as well as their global arrangement, one expects a nontrivial
interaction between the number of layers Llayers, the anchor
budget s, and the choice of
PE-derived biases.
The cycle-versus-disjoint-cycles pair isolates a 1-WL limitation tied to
regularity and featurelessness.
Stronger lower bounds arise from families such as strongly regular
graphs, CFI constructions, or pairs separated only by k-WL for k ≥ 2.
Spectral encodings alone will not uniformly resolve such cases,
especially for cospectral constructions.
Nonetheless, the architectural principle remains applicable: any
permutation-equivariant node-level signal P that breaks symmetries in a
stable, gauge-invariant manner can be injected into sparse attention to
yield global summaries inaccessible to pure message passing.
One direction is to use higher-order or multiscale positional
information (e.g. encodings derived from powers of A, from effective resistance
metrics, or from learned graph filters) while maintaining invariance by
restricting the model to functions of Gram matrices or other
orthogonally invariant statistics.
Another direction is to combine anchor-based global mixing with limited
higher-order message passing (e.g. edge- or tuple-based states) to
approach the expressivity of k-WL while keeping sparsity
constraints explicit.
In summary, the separation should be understood as evidence that (i)
global mixing at near-linear cost and (ii) carefully invariant
consumption of positional information together suffice to transcend the
1-WL barrier on a canonical hard instance.
The principal limitations are those inherent to low-dimensional spectral
information, to eigengap-dependent preprocessing accuracy, and to the
low-rank nature of constant-size anchor channels.