Sparse attention has become the operational bottleneck of graph transformers: in essentially all regimes where |VG| is large, one cannot afford to materialize (let alone repeatedly update) dense interactions over VG2, and so practical architectures restrict message passing to an attention graph A(G) with |EA(G)| near-linear in |VG| or with uniformly bounded degree. This restriction is not merely an implementation detail. It changes which dependencies can be transported across the model in bounded depth, and it therefore changes the set of graph patterns whose homomorphism counts are, even in principle, recoverable from the stabilized representation. Our aim is to make this limitation explicit and sharp, in a form that is invariant under isomorphism and independent of training.
A large fraction of the existing expressivity literature proceeds by comparing a model to a Weisfeiler–Leman (WL) refinement, or by bounding it between k-WL variants. These comparisons are informative when the model is allowed to aggregate over the full set of neighbors in the underlying graph (or over all pairs, in a dense 2-tuple refinement). However, WL-style upper bounds can miss the decisive feature of modern transformers: the neighborhood over which an update aggregates is neither NG(u) nor VG, but rather NA(G)(u) for an attention graph that is computed from G and is typically sparse. In particular, even when the state space is indexed by ordered pairs (u, v) as in 2-tuple refinement, the update of (u, v) is constrained to query only colors involving vertices w that are attention-neighbors of u or v. This induces a distinct notion of locality (and a distinct pebble-game equivalence) that is strictly weaker than classical 2-WL whenever A(G) omits a non-negligible fraction of pairs.
We therefore study a canonical discrete idealization of sparse-attention transformers, namely the refinement procedure SAT-2FGNN[A] parameterized by an isomorphism-equivariant attention generator A. The model refines colors χ(t)(u, v) on ordered pairs and uses perfect hashing so that the refinement is maximally discriminative subject to its access pattern. The central question is then structural: exactly which homomorphism counts are determined by the stabilized representation, and which are provably not? We insist on a tight answer, meaning (i) a completeness direction (the representation determines all relevant homomorphism counts) and (ii) a maximality direction (every excluded pattern admits explicit counterexamples that the model cannot distinguish).
Our first contribution is to show that sparse attention still admits
a homomorphism-basis characterization, provided one lets the basis
depend on A. Concretely, for each
isomorphism-equivariant attention generator A we define a maximal graph family FA such that, for all
graphs G, H, the
equality of stabilized graph representations χG(G) = χH(H)
is equivalent to the conjunction
hom(F, G) = hom(F, H) for
all F ∈ FA.
We establish existence and uniqueness of this family (and its rooted
analogues FA, n and FA, e) by
constructing an attention-guided unfolding/canonical-decomposition basis
whose isomorphism types are in bijection with stabilized colors. This
recovers, as a special case, the known homomorphism characterizations
for dense 2-tuple refinement and for purely local pair-refinement when
A(G) = G, while
simultaneously separating regimes where attention is strictly
sparser.
Our second contribution is structural: we characterize FA by an A-guided width-2 decomposition condition, equivalently by an attention-guided variant of almost-strong nested ear decompositions. Informally, membership of a pattern F in FA is equivalent to the existence of a canonical construction of F that extends along ears in a manner compatible with the attention accessibility witnessed by the refinement. This renders the attention restriction visible in the pattern language itself: patterns that require ``simultaneous'' branching beyond what the attention neighborhood can expose fail the decomposition criterion and therefore cannot be counted by the model, regardless of depth.
Our third contribution is maximality in the strong, algorithmic
sense: for any connected pattern F ∉ FA we
construct explicit graphs GF, HF,
computable in time polynomial in |VF|, such that
SAT-2FGNN[A] produces identical
stabilized representations on GF and HF, yet hom(F, GF) ≠ hom(F, HF).
The proof proceeds via an attention-adapted pebble game and a
modification of Fürer/CFI-type gadgets that respects the attention
constraints. This is the mechanism that upgrades an upper bound
(``the model cannot see beyond attention-guided unfoldings'') into a tight characterization
(``and nothing more can be inferred'').
Finally, we derive strict hierarchies across attention regimes. If A1(G) ⊆ A2(G) for all G, then the corresponding pattern families satisfy FA1 ⊆ FA2; moreover, for natural regimes such as r-hop power attention Ar(G) = Gr, the inclusion is strict for each r. We also identify uniform limitations under bounded-degree attention: a global degree bound Δ on A(G) forbids counting an infinite collection of patterns whose canonical decompositions necessitate arity exceeding Δ. Conversely, when attention is guided by a provided width-k tree decomposition of the input graph, sparse attention can be made sufficient (with an appropriate k-tuple extension) to recover the homomorphism expressivity of bounded-treewidth patterns, showing that sparsity does not inevitably imply weakness when coupled to the right structural oracle.
These results supply a principled answer to what sparse attention
``buys'' and ``costs'' from the viewpoint of homomorphism
counting and, by standard spasm identities, from the viewpoint of
subgraph counting. The remainder of the paper formalizes the above
notions and proves the stated theorems, beginning with preliminaries on
homomorphisms (including rooted variants) and the linear-algebraic lens
relating refinement invariants to homomorphism vectors.
We collect the homomorphism notions and decomposition viewpoints that will be used to state and prove our characterization theorems. Throughout, graphs are finite, simple, undirected, and vertex-labeled. We write G = (V, E, ℓ), where ℓ : V → Σ is a label map into a finite alphabet Σ.
Given graphs F = (VF, EF, ℓF)
and G = (VG, EG, ℓG),
a homomorphism from F to G is a map h : VF → VG
such that
{x, y} ∈ EF ⟹ {h(x), h(y)} ∈ EG, and ℓF(x) = ℓG(h(x)) for
all x ∈ VF.
We denote by hom(F, G) the number of
such homomorphisms. Homomorphisms are stable under isomorphism: if G ≃ H, then hom(F, G) = hom(F, H)
for all F; conversely, by
Lovász's classical theorem, the full homomorphism vector F ↦ hom(F, G)
determines G up to
isomorphism.
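For concreteness, here is a minimal brute-force sketch of these counts in Python; all function and variable names are our own, the pattern vertices are assumed to be 0, …, k − 1, and the enumeration is exponential in |VF|, so it is only meant for small patterns. The optional pin argument also covers the rooted counts defined next.

```python
from itertools import product

def hom(F_edges, F_labels, G_adj, G_labels, pin=None):
    """Count label-preserving homomorphisms h : V_F -> V_G by brute force.

    F_edges: iterable of pattern edges (x, y) over vertices 0..k-1.
    F_labels, G_labels: vertex -> label maps for pattern and host.
    G_adj: host adjacency, vertex -> set of neighbors.
    pin: optional {pattern vertex: host vertex}, for rooted counts.
    """
    k = len(F_labels)
    count = 0
    for h in product(list(G_adj), repeat=k):      # all maps V_F -> V_G
        if pin and any(h[x] != u for x, u in pin.items()):
            continue                               # enforce the roots
        if any(F_labels[x] != G_labels[h[x]] for x in range(k)):
            continue                               # labels must be preserved
        if all(h[y] in G_adj[h[x]] for x, y in F_edges):
            count += 1                             # every pattern edge lands on an edge
    return count

# Example: hom(K3, K4) = 4 * 3 * 2 = 24 (a single label 'a' everywhere).
K4 = {i: {j for j in range(4) if j != i} for i in range(4)}
assert hom([(0, 1), (1, 2), (0, 2)], {i: 'a' for i in range(3)},
           K4, {i: 'a' for i in range(4)}) == 24
```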
To capture node-level and edge-/pair-level information, we use rooted
pattern graphs. A node-rooted pattern is a pair (F, r) with r ∈ VF.
For a host graph G and a
vertex u ∈ VG,
we write
hom((F, r), (G, u)) := |{h : VF → VG homomorphism : h(r) = u}|.
Similarly, an edge-rooted (or ordered-pair-rooted) pattern is (F, (r1, r2))
with an ordered pair of distinguished vertices r1, r2 ∈ VF
(not necessarily adjacent). For an ordered pair (u, v) ∈ VG2
we define
hom((F, (r1, r2)), (G, (u, v))) := |{h : VF → VG homomorphism : h(r1) = u, h(r2) = v}|.
These rooted counts refine the unrooted counts by conditioning on images
of designated vertices; they are the natural currency when a model
produces stabilized features at vertices or at ordered pairs.
For a pattern F, we write
Spasm(F) for the set of homomorphic images of
F, i.e., graphs F̃ obtainable as quotients of F under vertex identifications
induced by a homomorphism (equivalently, the set of cores of all
homomorphic images, up to isomorphism). A standard consequence of
inclusion–exclusion on identifications (see, e.g.,
Curticapean–Dell–Marx) is that for each fixed F there exist rational coefficients
α(F, F̃) such
that, for every host graph G,

sub(F, G) = ∑F̃ ∈ Spasm(F) α(F, F̃) ⋅ hom(F̃, G),

where sub(F, G)
denotes the number of (not-necessarily-induced) subgraph embeddings of
F into G. This identity is the mechanism by
which homomorphism expressivity statements imply motif-counting
consequences: if a model cannot determine hom(F̃, ⋅) for some F̃ ∈ Spasm(F), then it
cannot in general determine sub(F, ⋅) either.
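As a small worked instance (standard, stated here for orientation), take F = P3, the path a–b–c on three vertices with a single label: identifying the two endpoints collapses P3 onto a single edge, so Spasm(P3) = {P3, K2} and the expansion reads

sub(P3, G) = ½ (hom(P3, G) − hom(K2, G)) = ½ (∑v ∈ VG degG(v)2 − 2|EG|),

since hom(P3, G) = ∑v ∈ VG degG(v)2 counts walks of length two, the non-injective maps (those with h(a) = h(c)) contribute hom(K2, G) = 2|EG|, and the factor ½ accounts for |Aut(P3)| = 2.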
Let ℳ be any isomorphism-invariant graph representation mechanism producing (i) a graph representation χG(G) and, when applicable, (ii) stabilized vertex representations χV(G, u) and (iii) stabilized ordered-pair representations χP(G, (u, v)). We formalize ``what ℳ can count'' by the maximal pattern families whose homomorphism vectors are exactly the invariants captured by the corresponding representations.
Concretely, a graph family F⋆ is a graph-level homomorphism basis for ℳ if for all graphs G, H,
χG(G) = χH(H) ⇔ hom(F, G) = hom(F, H) for
all F ∈ F⋆,
and F⋆ is maximal
w.r.t. set inclusion among families satisfying the equivalence.
Analogously, a family of node-rooted patterns Fn⋆
is a node-level homomorphism basis if for all G, H
and all vertices u ∈ VG,
u′ ∈ VH,
χV(G, u) = χV(H, u′) ⇔ hom((F, r), (G, u)) = hom((F, r), (H, u′)) for
all
(F, r) ∈ Fn⋆,
with maximality; and similarly for an edge-/pair-level family Fe⋆
of ordered-pair-rooted patterns using χP and the
corresponding rooted homomorphism counts. In our setting, ℳ will be SAT-2FGNN[A], and the families will be
denoted FA, FA, n, FA, e.
Our structural characterizations will be phrased in terms of canonical width-2 decompositions. A tree decomposition of a graph F is a pair (T, β) where T is a tree and β : V(T) → 2VF assigns a bag to each node such that (i) every vertex of F appears in some bag, (ii) every edge of F is contained in some bag, and (iii) for each x ∈ VF, the set {t ∈ V(T) : x ∈ β(t)} induces a connected subtree of T. The width is maxt ∈ V(T)|β(t)|−1; width 2 thus means bags of size at most 3.
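The definition is directly checkable; the following Python sketch (with naming conventions of our own) verifies conditions (i)–(iii) and the width bound for a candidate pair (T, β).

```python
def is_tree_decomposition(F_vertices, F_edges, T_edges, bags, width=None):
    """Check conditions (i)-(iii) of a tree decomposition (T, beta).

    bags: dict tree-node -> set of pattern vertices; T_edges: edges of T.
    """
    nodes = list(bags)
    # (i) every vertex of F appears in some bag
    if not all(any(x in bags[t] for t in nodes) for x in F_vertices):
        return False
    # (ii) every edge of F is contained in some bag
    if not all(any(x in bags[t] and y in bags[t] for t in nodes)
               for x, y in F_edges):
        return False
    # (iii) bags containing any fixed vertex induce a connected subtree of T
    adj = {t: set() for t in nodes}
    for s, t in T_edges:
        adj[s].add(t); adj[t].add(s)
    for x in F_vertices:
        occ = {t for t in nodes if x in bags[t]}
        if not occ:
            continue
        stack, seen = [next(iter(occ))], set()
        while stack:
            t = stack.pop()
            if t in seen:
                continue
            seen.add(t)
            stack.extend(u for u in adj[t] if u in occ)
        if seen != occ:
            return False
    # optional width check: width 2 means bags of size at most 3
    return width is None or max(len(b) for b in bags.values()) <= width + 1
```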
For the pattern families relevant to 2-tuple refinements, it is convenient to restrict attention to width-2 decompositions in which each new bag shares (typically) two vertices with previously introduced bags and introduces at most one new vertex. This viewpoint is equivalent to building F by successively attaching ears. Formally, an ear is a simple path whose internal vertices are new and whose endpoints lie in the current partial graph. An almost-strong nested ear decomposition (AS-NED) is an ear decomposition equipped with a nesting condition that forbids ``crossing'' attachments: each ear attaches within a previously created ear in a laminar manner, so that the resulting bag structure is tree-like when one expands each ear into a chain of size-3 bags along the path. In the purely local setting (pair refinement without additional attention), AS-NED is precisely the obstruction-free condition that makes all relevant extensions visible to the refinement process.
We will later define an A-guided variant of this condition, in which the admissible ear attachments are additionally constrained by the attention accessibility implicit in the refinement updates. The present section supplies the homomorphism language into which these decomposition constraints will be translated.
We formalize the refinement mechanism underlying SAT-2FGNN[A] as a 2-tuple color refinement procedure whose information flow is constrained by an auxiliary attention graph produced from the input.
Fix an isomorphism-equivariant map A : 𝒢 → 𝒢 such that, for every input graph G = (V, E, ℓ), the attention graph is A(G) = (V, EA) on the same vertex set. Isomorphism-equivariance means that any isomorphism π : G ≃ H lifts to an isomorphism of the paired structures (G, A(G)) ≃ (H, A(H)) via the same vertex bijection. We will consider several regimes for A, all of which impose a sparsity constraint ensuring that A(G) can be traversed locally: typical promises are |EA| = O(|V|), or maxu ∈ VdegA(G)(u) ≤ Δ, or A(G) = Gr for a fixed radius r, or decompositions in which EA is a sparse chordalization guided by an externally supplied tree decomposition. The specific promise does not affect the definition of the refinement, but it will determine which pattern families can be unfolded and hence counted.
The model maintains colors (features) on pairs (u, v) ∈ V2,
written χ(t)(u, v)
at round t ≥ 0. The initial
color records the atomic type of the pair in the graph G:

χ(0)(u, v) := (ℓ(u), ℓ(v), 1[u = v], 1[{u, v} ∈ E]).

Including the edge-indicator 1[{u, v} ∈ E]
is essential: even if {u, v} ∉ E but
{u, v} ∈ EA,
the procedure is allowed to ``attend'' to the pair while still being
able to distinguish true edges from merely attended non-edges. This
separation will be used implicitly when we later translate refinement
information into homomorphism counts of labeled patterns.
At each round, the update at (u, v) can only access
intermediate vertices w that
are adjacent in the attention graph to at least one endpoint of the
pair. Formally, letting NA(x)
denote the neighbors of x in
A(G), we define the
multiset message

Mt(u, v) := {{(χ(t)(w, v), χ(t)(u, w), atpG(u, v, w)) : w ∈ NA(u) ∪ NA(v)}}.

Here atpG(u, v, w)
denotes the atomic type of the triple (u, v, w) in G (e.g., the labels and equality
relations among u, v, w together
with adjacency indicators among the three unordered pairs). Any fixed
encoding of this finite relational type suffices; its role is to let the
update distinguish, for instance, whether w is adjacent to u in G as opposed to only in A(G). The next-round color
is then obtained by a perfect-hash refinement

χ(t + 1)(u, v) := hash(χ(t)(u, v), Mt(u, v)),

where hash is injective on its domain
(equivalently, we work in the idealization of infinitely wide features
with perfect discrimination). As usual for color refinement, we iterate
until stabilization, i.e., until the partition of V2 induced by χ(t) no longer
changes.
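As a reference implementation of this idealization, the following Python sketch maintains exact colors by dictionary hashing; the names are ours, labels are assumed hashable and comparable, and the pooled representations at the end mirror the equivariant pooling described in the paragraphs below.

```python
from itertools import product

def sat_2fgnn_refine(V, E, EA, labels, max_rounds=None):
    """Idealized SAT-2FGNN[A]: 2-tuple refinement whose update at (u, v)
    aggregates only over attention-visible witnesses w, with dictionary-based
    perfect hashing in place of an injective hash function.

    E, EA: sets of frozenset edges (true edges / attention edges); labels: dict.
    """
    NA = {u: set() for u in V}
    for e in EA:
        x, y = tuple(e)
        NA[x].add(y); NA[y].add(x)

    def atp(u, v, w):
        # atomic type of the triple: labels, equalities, E-adjacencies
        return (labels[u], labels[v], labels[w], u == v, u == w, v == w,
                frozenset((u, v)) in E, frozenset((u, w)) in E,
                frozenset((v, w)) in E)

    # initial color: atomic pair type, including the true-edge indicator
    chi = {(u, v): (labels[u], labels[v], u == v, frozenset((u, v)) in E)
           for u, v in product(V, repeat=2)}
    for _ in range(max_rounds or len(V) ** 2):
        table, nxt = {}, {}
        for u, v in product(V, repeat=2):
            msg = sorted((atp(u, v, w), chi[(u, w)], chi[(w, v)])
                         for w in NA[u] | NA[v])             # attention support
            key = (chi[(u, v)], tuple(msg))
            nxt[(u, v)] = table.setdefault(key, len(table))  # perfect hash
        stable = len(set(nxt.values())) == len(set(chi.values()))
        chi = nxt
        if stable:  # the partition no longer splits: stabilized
            break
    return chi

def representations(V, chi):
    """Equivariant pooling: node colors, then a graph-level multiset."""
    chi_V = {u: tuple(sorted(repr(chi[(u, v)]) for v in V)) for u in V}
    chi_G = tuple(sorted(map(repr, chi_V.values())))
    return chi_V, chi_G
```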
Two remarks are in order. First, when A(G) is dense, the update reduces to the usual 2-tuple refinement (and thus matches the folklore 2-WL/2-FGNN correspondence at the level of stable partitions). Second, when A(G) is sparse, the update enforces a genuine constraint: the update at (u, v) cannot aggregate over all w ∈ V, but only over those w made visible by attention adjacency to u or v.
By construction, the update is permutation-equivariant with respect
to relabelings of V that
preserve both G and A(G). Concretely, if π : VG → VH
is an isomorphism with (G, A(G)) ≃ (H, A(H)),
then for all rounds t and all
ordered pairs (u, v)
we have
χG(t)(u, v) = χH(t)(π(u), π(v)).
Moreover, the refinement is monotone: the equivalence relation on V2 defined by equality of
colors can only split, never merge, as t increases. Since V2 is finite,
stabilization occurs after at most |V|2 rounds (typically
far fewer), so the procedure is well-defined on all finite inputs.
From the stabilized colors we obtain representations at different
granularities via standard equivariant pooling. At the pair level we
simply set χP(G, (u, v)) := χ(∞)(u, v).
Node-level representations can be taken as
$$
\chi_V(G,u)\ :=\ \mathrm{pool}\bigl(\{\!\{\chi^{(\infty)}(u,v): v\in
V\}\!\}\bigr),
$$
for any injective multiset aggregator pool (again idealized as perfect hashing).
Finally, a graph-level representation χG(G)
is obtained by pooling the multiset of node colors, or directly the
multiset of pair colors. Any two such choices are interdefinable under
perfect hashing; in particular, the induced indistinguishability
relation on graphs is uniquely determined by the stabilized partition of
V2.
The procedure above should be read as a discrete analogue of stacked sparse self-attention layers. The set NA(u) ∪ NA(v) plays the role of the attention support (which keys/values are visible when updating the query associated with (u, v)), while the multiset Mt(u, v) in the update corresponds to aggregating value vectors indexed by visible w. Our perfect-hash update captures the maximal discriminative power of any equivariant architecture whose computations at (u, v) are restricted to this attention support and are invariant to the order of neighbors. Thus, when we later characterize expressivity via homomorphism-count families, the statements should be interpreted as upper bounds for a broad class of sparse-attention pairwise transformers, and as matching lower bounds for the idealized refinement itself.
In the next section we will convert stabilized colors into combinatorial objects—attention-guided unfoldings and canonical width-2 decompositions—which serve as the bridge from local attention access to global homomorphism invariants.
We now pass from stabilized pair-colors to explicit combinatorial
objects that expose exactly which local configurations can be explored
under attention. Throughout, we work over the paired input
structure
Ĝ := (VG, EG, EA(G), ℓ),
where EG
is the original edge relation and EA(G)
is the attention edge relation; the refinement accesses EA(G)
only through neighborhoods, while it can test EG (and
equalities) via atomic types.
Fix G and an ordered pair (u, v) ∈ VG2. We define a rooted, tree-decomposed, finite pattern 𝒰A(t)(G; u, v) of width 2 by induction on t ≥ 0. The pattern is most conveniently presented as a tree decomposition whose bags alternate between size 2 and size 3.
At depth t = 0, 𝒰A(0)(G; u, v) consists of a single root bag containing the ordered pair (u, v), labeled by the atomic type χ(0)(u, v) from the initialization.
Given 𝒰A(t)(G; ⋅, ⋅),
we form 𝒰A(t + 1)(G; u, v)
as follows. Start with a root 2-bag
Buv
containing (u, v) and
carrying the label χ(0)(u, v)
(or equivalently, the triple-atomic type restricted to {u, v}). For each
intermediate vertex
w ∈ NA(G)(u) ∪ NA(G)(v),
attach a 3-bag Buvw
adjacent to Buv
whose vertex set is (u, v, w) and
whose label is the triple atomic type atpG(u, v, w)
used in the update rule. Finally, from Buvw
we attach two child 2-bags Bwv and
Buw,
overlapping with Buvw
on the ordered pairs (w, v) and (u, w) respectively, and we
continue the construction recursively by grafting copies of 𝒰A(t)(G; w, v)
below Bwv and
of 𝒰A(t)(G; u, w)
below Buw.
The resulting decomposition is a rooted tree in which every 3-bag witnesses exactly one attention-visible
extension w from its parent
2-bag.
We view 𝒰A(t)(G; u, v) as a rooted isomorphism type: vertex names from VG are not part of the object; only the bag tree, the incidences/overlaps, and the bag labels (atomic types) are retained. An isomorphism between two unfoldings is a root-preserving isomorphism of the underlying bag trees that preserves the ordered interfaces on 2-bags and preserves the atomic-type labels on all bags.
The above definition is engineered so that the refinement update is
the usual bottom-up hashing of rooted unfolding trees. Concretely, let
type(t)(u, v)
denote the rooted isomorphism type of 𝒰A(t)(G; u, v).
Then, by induction on t, we
obtain a bijective correspondence

χ(t)(u, v) = χ(t)(u′, v′) ⟺ type(t)(u, v) = type(t)(u′, v′),

valid across all graphs and all ordered pairs. The base case t = 0 is
immediate from the initialization. For the inductive step, the multiset Mt(u, v)
records, for each attention-visible w, the ordered pair of child colors
(χ(t)(w, v), χ(t)(u, w))
together with the parent witness label atpG(u, v, w).
Under the induction hypothesis, this is exactly the multiset of rooted
child types attached below the 3-bags
Buvw
in 𝒰A(t + 1)(G; u, v).
Since hash is injective, equality of
next-round colors is equivalent to equality of these labeled multisets,
hence to equality of the rooted unfolding types.
As a consequence, the stabilized color χ(∞)(u, v) can be identified with the (finite) rooted type of 𝒰A(t)(G; u, v) for any t beyond stabilization. In particular, the graph-level representation obtained by pooling stabilized colors is determined by the multiplicities of rooted unfolding types.
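To make the color–type correspondence concrete, a memoized Python sketch follows; it assumes helper encodings pair_type(u, v) and triple_type(u, v, w) of the atomic types above (both names ours) and computes type(t)(u, v) as a canonical, name-free nested tuple. By the correspondence just established, equality of these objects across graphs coincides with equality of χ(t).

```python
def make_unfolding_type(NA, pair_type, triple_type):
    """type(t)(u, v): canonical nested-tuple encoding of U_A^(t)(G; u, v).

    pair_type(u, v) and triple_type(u, v, w) are assumed encodings of the
    atomic types from the refinement definition. Vertex names never appear
    in the returned object; only bag labels and the tree shape are kept.
    """
    memo = {}
    def typ(t, u, v):
        key = (t, u, v)
        if key in memo:
            return memo[key]
        if t == 0:
            res = ('root', pair_type(u, v))            # a single root 2-bag
        else:
            kids = tuple(sorted(                       # one 3-bag per witness w
                (triple_type(u, v, w), typ(t - 1, w, v), typ(t - 1, u, w))
                for w in NA[u] | NA[v]))
            res = ('root', pair_type(u, v), kids)
        memo[key] = res
        return res
    return typ
```

Memoization keeps the computation at O(t · |V|²) table entries, mirroring the fact that the refinement itself reuses the colors of the pairs (w, v) and (u, w).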
Let 𝒮A(t) denote
the set of all rooted isomorphism types that can occur as 𝒰A(t)(G; u, v)
over all finite G ∈ 𝒢 and all
ordered pairs (u, v) ∈ VG2,
and let
𝒮A := ⋃t ≥ 0𝒮A(t).
Each element of 𝒮A is a
finite tree decomposition of width 2
with bags of size at most 3, equipped
with the above alternation structure and atomic-type labels. We
emphasize that 𝒮A is
canonical in the sense that it is defined purely from the refinement
rule and the attention access pattern; it does not depend on any
particular input graph.
Given (F, T) ∈ 𝒮A and
a graph G, we define the
corresponding bag-isomorphism count biso((F, T), Ĝ) as
the number of root-preserving mappings from the bags of (F, T) into Ĝ that are consistent on overlaps
and that realize the prescribed atomic-type labels in Ĝ (equivalently, a homomorphism into
the relational structure Ĝ
that is tracked bag-by-bag along T). For rooted unfoldings, these
counts have a direct interpretation: if τ ∈ 𝒮A(t)
is a rooted type, then biso(τ, Ĝ) determines, and is determined by, the number of ordered pairs (u, v) ∈ VG2 with type(t)(u, v) = τ.
Combining this with the color–type correspondence, we see that the pooled multiset of stabilized colors is
equivalent information to the vector of all bag-isomorphism counts over
𝒮A (at any sufficiently
large depth). This is the precise bridge we will use in the next
section: we first characterize representation equivalence by equality of
these canonical bag-isomorphism counts, and then translate those counts
into ordinary homomorphism counts of a maximal pattern family via the
standard linear-algebraic passage from decompositions to
homomorphisms.
We write G≡AH to denote indistinguishability
of SAT-2FGNN[A] on input graphs G, H, i.e.,
G≡AH ⇔ χG(G) = χH(H),
where χG(G)
is the pooled multiset of stabilized pair-colors (and analogously for
H). By the correspondence
between colors and rooted attention-guided unfoldings established above,
this equivalence is already captured by the canonical bag-isomorphism
statistics over 𝒮A: for any
sufficiently large depth t
(beyond stabilization on the given inputs),

G ≡A H ⟺ {{type(t)(u, v) : (u, v) ∈ VG2}} = {{type(t)(u′, v′) : (u′, v′) ∈ VH2}}.

Thus the remaining step is to express the family of all bag-isomorphism
counts on width-2 decomposed patterns
as (equivalently) a family of ordinary homomorphism counts.
To state this cleanly, we regard each (F, T) ∈ 𝒮A as a finite relational structure over the signature consisting of vertex labels (unary predicates) together with two symmetric binary relations, one intended to be interpreted as EG and one as EA(G); the host structure is precisely Ĝ = (VG, EG, EA(G), ℓ). We then write hom(F, Ĝ) for the number of homomorphisms of these two-edge-colored, vertex-labeled patterns into Ĝ. (This is a notational refinement of hom(F, G) tailored to the paired input Ĝ used by the model.)
The standard linear-algebraic passage of Dell–Grohe–Rattan applies to
any fixed bounded-width decomposition formalism: bag-consistent local
homomorphisms (our biso counts) and
global homomorphisms (our hom counts)
span the same vector space of invariants, and there are explicit
triangular transformations between the two bases. Concretely, for each
decomposed pattern (F, T) ∈ 𝒮A
there exists a finite set ℱ(F, T) of
(two-edge-colored, vertex-labeled) graphs and rational coefficients
c(F, T)(⋅)
such that, for all G,

biso((F, T), Ĝ) = ∑F′ ∈ ℱ(F, T) c(F, T)(F′) ⋅ hom(F′, Ĝ).

Conversely, for each finite pattern graph F that arises as an underlying graph
of some element of 𝒮A, there
is a finite set of decomposed patterns 𝒯(F) ⊆ 𝒮A and
coefficients dF(⋅) with

hom(F, Ĝ) = ∑(F′, T′) ∈ 𝒯(F) dF((F′, T′)) ⋅ biso((F′, T′), Ĝ).

The finiteness of both expansions is uniform in G and depends only on the bounded
width (here, 2) and the bounded bag
size (here, at most 3); no property of
the specific attention generator beyond isomorphism-equivariance is
needed at this point.
We may therefore define a canonical pattern family by forgetting decompositions:
$$
F_{\mathsf{A}}^{\mathrm{can}}\ :=\ \{\ \underline{(F,T)}\ :\
(F,T)\in\mathcal{S}_{\mathsf{A}}\ \},
$$
where $\underline{(F,T)}$ denotes the
underlying two-edge-colored, vertex-labeled pattern of the decomposed
object (F, T).
Combining this definition with the two expansions above yields the desired completeness statement.
We emphasize that, while there may be many families of patterns whose
homomorphism counts suffice to determine the representation, there is a
unique maximal one. Define
FAmax := {F: ∀G, H, G≡AH ⇒ hom(F, Ĝ) = hom(F, Ĥ)}.
By definition, FAmax is
unique and inclusion-maximal among families satisfying the implication
``representation equivalence ⇒ equality
of hom-counts on the family''. The completeness theorem above shows that FAcan is
complete; hence every F ∈ FAmax
is already determined by the invariants encoded by 𝒮A and therefore lies in the
closure generated by FAcan under
the linear relations above. In particular, FAmax is
determined solely by the refinement rule and the attention access
pattern, and not by auxiliary choices (depth parameters, pooling
convention, or representatives of isomorphism types). The subsequent
maximality theorem (proved via attention-adapted gadgets) will show that
FAcan
already attains this maximum in the strong sense demanded in the problem
statement.
We now replace the unfolding-defined description of FA by an intrinsic,
purely pattern-theoretic criterion. Throughout, a pattern is a finite
vertex-labeled structure
F̂ := (VF, EF, EFA, ℓF),
where EF
is intended to match EG and EFA
is intended to match EA(G)
when mapping into Ĝ = (VG, EG, EA(G), ℓ).
An A-guided canonical decomposition of F̂ is a rooted width-2 tree decomposition in which (1) the root bag consists of a single pair of vertices, (2) every non-root bag B = {x, y, z} meets the previously introduced bags in a boundary pair (x, y) and introduces at most one new vertex z, and (3) the new vertex is attention-visible from the boundary, i.e., {z, x} ∈ EFA or {z, y} ∈ EFA. The point of (3) is that, when realizing B inside Ĝ, the refinement update for the boundary pair (x, y) is permitted to aggregate over witnesses w that lie in NA(G)(x) ∪ NA(G)(y); thus any newly introduced vertex must be exposed to at least one boundary endpoint by an EA-edge in the pattern.
We write CDA(F̂) for the assertion that F̂ admits an A-guided canonical decomposition. By construction, every decomposed pattern in 𝒮A forgets to a pattern F̂ with CDA(F̂), and conversely every F̂ with CDA(F̂) can be unfolded into some element of 𝒮A (by turning each extension bag into an unfolding step).
For maximality arguments later, it is convenient to have a sequential construction rather than a tree decomposition. We therefore introduce an A-guided variant of the almost-strong nested ear decomposition (AS-NED) from the fully local case.
An A-guided AS-NED of F̂ is a sequence of
subpatterns
F̂0 ⊆ F̂1 ⊆ ⋯ ⊆ F̂m = F̂
such that F̂0 is a
single labeled edge (two vertices with their E-adjacency and EA-adjacency
fixed), and each extension F̂i + 1 is
obtained from F̂i by adding an ear
Pi, which
is a path x = v0, v1, …, vk = y
with endpoints x, y ∈ V(F̂i)
and internal vertices v1, …, vk − 1 ∉ V(F̂i).
The ear is required to be nested in the standard sense (its endpoints lie on a
common previously created boundary), and it is required to be A-visible in the
sense that every new internal vertex is introduced along a boundary pair
from which it is attention-visible:
for each j ∈ {1, …, k − 1}, at the
moment vj
is introduced, the current boundary pair (a, b) satisfies
{vj, a} ∈ EFA or {vj, b} ∈ EFA.
When EFA = EF
(the regime A(G) = G), this reduces to
the usual visibility condition along true edges and hence to the AS-NED
characterization for the local model.
We write NEDA(F̂) for the assertion that F̂ admits an A-guided AS-NED.
We argue by mutual simulation. First, F̂ ∈ FA implies CDA(F̂) because every F̂ ∈ FA arises as $\underline{(F,T)}$ for some (F, T) ∈ 𝒮A, and the definition of 𝒮A ensures that each bag extension corresponds to introducing one new vertex z witnessed by some attention-neighbor of a boundary endpoint; this is exactly condition (3) above.
Second, CDA(F̂) implies F̂ ∈ FA by reading the decomposition top-down as an admissible unfolding program: each non-root bag {x, y, z} is introduced at the boundary pair (x, y) using the witness z, and the attention-visibility condition guarantees that this witness is accessible to the refinement update. This constructs an element of 𝒮A that forgets to F̂.
Finally, CDA(F̂) ⇔ NEDA(F̂) is obtained by converting between a width-2 bag-extension decomposition and a nested sequence of ear additions: traversing T in a depth-first order yields a sequence in which each new vertex is introduced with a boundary pair, producing ears when consecutive introductions share the same endpoints; conversely, each ear addition can be expanded into a chain of size-3 bags, each introducing one internal vertex while keeping the ear endpoints as the boundary. The A-visibility requirement is preserved in both directions.
For fixed F̂, deciding CDA(F̂) is constructive. One convenient certificate is an A-admissible elimination order v1, …, v|VF| such that each vi has at most two neighbors among {vi + 1, …} in the underlying simple graph of EF ∪ EFA, and whenever it has exactly two such later neighbors x, y, the vertex vi is attention-visible from the pair, i.e. {vi, x} ∈ EFA or {vi, y} ∈ EFA. From such an order we build the decomposition by introducing vi in a bag with its (at most two) later neighbors, and conversely any A-guided canonical decomposition yields such an order by repeatedly removing a leaf bag and deleting its unique new vertex. Thus membership in FA is decidable in polynomial time in |VF| by greedy peeling (trying all choices when multiple removable vertices exist), and when successful this procedure returns an explicit A-guided decomposition (or, equivalently, an A-guided AS-NED) witnessing F̂ ∈ FA.
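A direct transcription of this certificate test, in Python with names of our choosing, is sketched below; it backtracks over removable vertices for safety, although the greedy order suffices whenever removability is monotone under deletion, as in the peeling argument above.

```python
def admits_A_guided_decomposition(V, EF, EFA):
    """Search for an A-admissible elimination order witnessing CD_A(F-hat).

    EF, EFA: sets of frozenset edges (pattern edges / pattern attention edges).
    Returns an elimination order (a list of vertices) or None.
    """
    union = {u: set() for u in V}
    for e in EF | EFA:
        x, y = tuple(e)
        union[x].add(y); union[y].add(x)

    def removable(v, later):
        # at most two later neighbors; if exactly two, v must be EA-adjacent
        # to at least one of them (attention visibility from the boundary pair)
        if len(later) > 2:
            return False
        if len(later) == 2:
            return any(frozenset((v, x)) in EFA for x in later)
        return True

    def search(remaining, order):
        if len(remaining) <= 2:
            # base case: the last two vertices form the initial labeled edge
            return order + sorted(remaining)
        for v in sorted(remaining):
            later = union[v] & (remaining - {v})
            if removable(v, later):
                res = search(remaining - {v}, order + [v])
                if res is not None:
                    return res
        return None

    return search(set(V), [])
```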
We turn to the matching lower bound direction: the structural families characterized above are not only sufficient but also maximal for homomorphism expressivity under SAT-2FGNN[A]. Concretely, whenever a connected pattern F̂ fails to admit an A-guided decomposition (equivalently, an A-guided AS-NED), we construct explicit graphs GF̂, HF̂ such that the sparse-attention refinement produces identical stabilized representations on GF̂ and HF̂, yet hom(F, GF̂) ≠ hom(F, HF̂) for the underlying graph F.
The standard route from refinement indistinguishability to lower bounds proceeds via a bijective pebble game. In our setting we require a variant that faithfully reflects the locality restriction w ∈ NA(G)(u) ∪ NA(G)(v) in the update rule. We therefore define an A-local bijective 2-pebble game played on two input structures Ĝ = (VG, EG, EA(G), ℓG) and Ĥ = (VH, EH, EA(H), ℓH).
The game maintains a position consisting of a pair of pebbled
vertices (u1, u2) ∈ VG2
and (v1, v2) ∈ VH2
(order matters), together with the requirement that the atomic types of
ordered pairs match:
(ℓG(u1), ℓG(u2), 1[u1 = u2], 1[{u1, u2} ∈ EG]) = (ℓH(v1), ℓH(v2), 1[v1 = v2], 1[{v1, v2} ∈ EH]).
A round proceeds as follows. Spoiler chooses one of the two pebbles, say
the first, and announces that he will ``refresh'' it. Duplicator must
then provide a bijection
π: NA(G)(u1) ∪ NA(G)(u2) → NA(H)(v1) ∪ NA(H)(v2)
that preserves the current pair-color information from the previous
round (equivalently, preserves the induced partition of ordered pairs in
the neighborhoods). Spoiler then selects some w ∈ NA(G)(u1) ∪ NA(G)(u2)
and moves the chosen pebble to w; Duplicator must move the
corresponding pebble to π(w). The second pebble
remains fixed. Duplicator loses if she cannot produce an admissible
bijection or if the atomic type constraint is violated after the
move.
This is precisely the locality-limited analogue of the bijective 2-pebble game for 2-tuple refinement, with the crucial modification that the choice domain is the attention neighborhood. As in the classical case, one proves by induction on t that Duplicator has a winning strategy for t rounds if and only if the refinement colors χ(t) agree on the corresponding ordered pairs. Taking t → ∞ yields the equivalence between game indistinguishability and stabilized-color indistinguishability for SAT-2FGNN[A]. In particular, any pair of graphs on which Duplicator wins indefinitely must have identical graph representations under any downstream equivariant pooling of stabilized colors.
To witness maximality we require a generic method that, for an excluded pattern F̂ ∉ FA, produces graphs that are indistinguishable in the A-local game but separated by hom(F, ⋅). The construction follows the Fürer/CFI paradigm: we start from a base graph encoding the incidence structure of F̂ and replace certain ``constraint edges'' by parity gadgets, yielding two non-isomorphic lifts GF̂0 and GF̂1 (the untwisted and twisted versions). The gadgets are chosen so that (i) any local neighborhood accessible through attention has a canonical bijection between GF̂0 and GF̂1, enabling Duplicator to win the A-local game, while (ii) globally, the parity of twists changes the number of homomorphisms from F by a nonzero amount.
The attention aspect enters in two places. First, because the update for a boundary pair only sees witnesses in NA(⋅), we must ensure that every move Spoiler can make in the A-local game can be matched by a locally defined bijection between the corresponding gadget neighborhoods. This is achieved by obstruction-aware gadget placement: we introduce the parity gadgets only along those incidences that would have to be traversed to realize an A-guided decomposition. When F̂ fails CDA, there is an obstruction witnessing that any attempted bag-extension would require revealing a new vertex not attention-visible from the current boundary. We exploit precisely this obstruction: we place the twist so that distinguishing the two lifts would require Spoiler to ``see'' through an attention-invisible interface, which the game forbids. Second, since our model distinguishes true edges {u, v} ∈ EG from merely attended pairs {u, v} ∈ EA(G), the gadget must preserve E-adjacency types while using EA only to control information flow; we therefore keep the underlying simple graph E identical in the relevant local views and implement the twist solely in the lift structure.
With these choices one shows the two key claims: Duplicator has a winning strategy in the A-local 2-pebble game on (GF̂0, GF̂1), hence SAT-2FGNN[A] assigns them identical stabilized representations; yet the homomorphism count hom(F, ⋅) detects the parity twist, giving hom(F, GF̂0) ≠ hom(F, GF̂1). This establishes maximality at the graph level, as stated abstractly in Theorem~4.
For node-rooted and edge-rooted families, we additionally need counterexamples that force homomorphisms to respect a designated root. We use a standard clique-augmentation trick adapted to our labeled setting: to pin a root vertex r, we attach a uniquely identifiable labeled gadget (e.g., a labeled clique of size M with a unique label multiset) to r in both GF̂0 and GF̂1, choosing M large enough that any homomorphism mapping the rooted pattern into the host must send the root to the unique attachment point. For edge-rooting we similarly pin both endpoints by attaching asymmetric gadgets to each endpoint. Because these augmentations are purely local and isomorphic between the two lifts, they do not help Spoiler in the A-local game, but they eliminate spurious homomorphisms that would otherwise avoid the intended root constraints.
The maximality proof yields a uniform separation mechanism used repeatedly in the hierarchy results later. Given an attention generator A and a candidate pattern F̂, we first attempt to certify CDA(F̂) (equivalently NEDA(F̂)) via an A-admissible elimination order. If the test fails, the failure exposes a concrete obstruction: at some stage every remaining vertex either has too many ``later'' neighbors in E ∪ EA or is not attention-visible from any admissible boundary pair. From this obstruction we deterministically choose an interface on which to place the parity twist and build the pair (GF̂0, GF̂1). By construction, the refinement cannot access the obstruction interface through attention-guided witnesses, so the two graphs remain indistinguishable; nonetheless the excluded pattern detects the twist, separating hom(F, ⋅).
This procedure is the promised tightness principle: the structural characterization of FA is accompanied by an explicit, pattern-driven construction of indistinguishable-but-count-different graph pairs for every excluded pattern, and the same holds for rooted variants after clique augmentation.
We record three attention regimes in which the abstract maximality principle above yields concrete, tight hierarchies of homomorphism expressivity. In each case the upper bound is inherited from the structural characterization of FA (via A-guided decompositions), while the matching lower bound is obtained by applying the separation procedure to an explicit excluded pattern.
Let Ar(G) = Gr.
Since Gr ⊆ Gr + 1
for all G, Theorem~5 gives the
monotone inclusions
FA1 ⊆ FA2 ⊆ ⋯, FA1, n ⊆ FA2, n ⊆ ⋯, FA1, e ⊆ FA2, e ⊆ ⋯.
Strictness can be witnessed by an explicit separating family. Fix r ≥ 1 and define Fr + 1sep
as follows: start from a cycle C2r + 3 on
vertices x0, x1, …, x2r + 2
(indices modulo 2r + 3) and
add a single chord {x0, xr + 1}.
In Fr + 1sep,
the endpoints of the chord are at distance exactly r + 1 in the cycle C2r + 3 (via the shorter
arc), hence the pair {x0, xr + 1} is attention-visible
under Ar + 1 but not under Ar.
Any Ar-guided
decomposition step that would introduce this chord as an ``ear
interface'' necessarily requires attention visibility between its
endpoints, which fails under Ar but holds under Ar + 1.
Consequently,
Fr + 1sep ∈ FAr + 1 and Fr + 1sep ∉ FAr.
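The separating family and the r-hop attention relation on patterns are easy to construct explicitly; the sketch below (our names, taking EFA to be the r-th power of the pattern, the natural choice under Ar(G) = Gr) can be combined with the elimination-order test from the previous section to check membership empirically.

```python
def chorded_cycle(r):
    """F_sep^{r+1}: the cycle C_{2r+3} on vertices 0..2r+2 plus the chord
    {0, r+1}, whose endpoints are at cycle-distance exactly r+1."""
    n = 2 * r + 3
    edges = {frozenset((i, (i + 1) % n)) for i in range(n)}  # the cycle
    edges.add(frozenset((0, r + 1)))                         # the chord
    return list(range(n)), edges

def power_edges(V, E, r):
    """Pairs at graph distance <= r: the pattern-level analogue of A_r = G^r."""
    adj = {u: set() for u in V}
    for e in E:
        x, y = tuple(e)
        adj[x].add(y); adj[y].add(x)
    out = set()
    for s in V:
        dist, frontier = {s: 0}, [s]
        while frontier:                       # breadth-first search from s
            nxt = []
            for x in frontier:
                if dist[x] == r:
                    continue                  # do not expand past radius r
                for y in adj[x]:
                    if y not in dist:
                        dist[y] = dist[x] + 1
                        nxt.append(y)
            frontier = nxt
        out |= {frozenset((s, t)) for t, d in dist.items() if d >= 1}
    return out

# Expected behavior (combining with admits_A_guided_decomposition above):
# V, E = chorded_cycle(r)
# admits_A_guided_decomposition(V, E, power_edges(V, E, r))      # None: excluded
# admits_A_guided_decomposition(V, E, power_edges(V, E, r + 1))  # an order: member
```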
Applying maximality (Theorem~4) to Fr + 1sep ∉ FAr
yields graphs Gr, Hr
such that SAT-2FGNN[Ar] produces
identical stabilized representations on Gr, Hr
while hom(Fr + 1sep, Gr) ≠ hom(Fr + 1sep, Hr).
Thus increasing the attention radius strictly enlarges the homomorphism
basis that determines χG(⋅).
As a simple motif-counting corollary, note that plain cycles Cℓ already admit width-2 decompositions whose extensions are witnessed along edges, hence Cℓ ∈ FA1. In contrast, chorded-cycle motifs of the form above separate successive radii; therefore any attempt to reconstruct subgraph statistics that linearly depend on hom(Fr + 1sep, ⋅) (e.g., via spasm expansions) necessarily requires radius at least r + 1.
Assume degA(G)(u) ≤ Δ for all G and all u ∈ VG. Theorem~6 implies the existence of an infinite family of excluded patterns of small width but requiring (Δ + 1)-way branching in any A-guided unfolding. A concrete choice is the star SΔ + 1 = K1, Δ + 1 (or, if one prefers to avoid high degrees in the pattern itself, its 1-subdivision). Intuitively, hom(SΔ + 1, G) = ∑v ∈ VGdegG(v)Δ + 1 depends on a (Δ + 1)st degree moment at the image of the center. If attention reveals at most Δ neighbors of any vertex, then the refinement updates never aggregate over a witness set rich enough to reconstruct such moments uniformly across all graphs, and the A-guided decomposition condition fails for SΔ + 1 (and for infinitely many related branching patterns).
Maximality therefore yields, for each such excluded branching pattern F, explicit graphs GF, HF that are indistinguishable by SAT-2FGNN[A] yet satisfy hom(F, GF) ≠ hom(F, HF). In particular, bounded-degree attention imposes a barrier: no matter how many refinement rounds we run, there exist fixed motifs (depending only on Δ) whose homomorphism counts cannot be recovered from the stabilized representation. Via the spasm identity sub(F, G) = ∑F̃ ∈ Spasm(F)α(F, F̃)hom(F̃, G), the same barrier transfers to subgraph-counting tasks whenever Spasm(F) contains such a branching obstruction.
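The obstruction target itself is trivial to evaluate exactly, which is convenient for the experiments later; the following one-line Python helper (ours) computes the degree moment in question.

```python
def star_hom(G_adj, k):
    """hom(K_{1,k}, G) = sum_v deg(v)^k: the center maps to v and each of the
    k leaves maps independently to a neighbor of v."""
    return sum(len(nbrs) ** k for nbrs in G_adj.values())
```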
At the opposite extreme, suppose the input provides a tree decomposition of G of width k and A is chosen to connect all pairs of vertices co-occurring in a bag and to connect vertices across adjacent bags (a sparse chordalization). Then |EA(G)| = O(k|VG|), yet the attention relation exposes the separators needed for dynamic-programming style propagation. As formalized in Theorem~7, a k-tuple extension of SAT-2FGNN[A] matches the homomorphism expressivity of k-FWL on arbitrary graphs, hence determines hom(F, G) for every pattern F with tw(F) ≤ k.
This regime yields a different kind of tight hierarchy: increasing the supplied decomposition width strictly increases the recoverable pattern family (e.g., Kk + 2 separates width k from k + 1), while keeping attention sparse. Moreover, because the characterization is by equality of homomorphism counts over all tw ≤ k patterns, it immediately implies positive motif-counting statements for any fixed-width motif class (cycles, series–parallel motifs, and in general bounded-treewidth templates), in sharp contrast with the uniform obstructions under bounded-degree attention.
Taken together, these three instantiations realize the promised ``tightness principle'': weakening attention constraints (larger r, larger Δ, or richer decomposition guidance) enlarges FA monotonically, and every strict enlargement is witnessed by an explicit separating pattern family and corresponding indistinguishable-but-count-different graph pairs.
Our theorems are qualitative and worst-case, hence it is natural to complement them by experiments that isolate the role of the attention graph A(G) in motif counting and, in particular, test the predicted monotonicity and obstructions as we vary the attention regime. The guiding principle is to treat SAT-2FGNN[A] as a representation map G ↦ χG(G) and to probe which homomorphism statistics hom(F, G) are recoverable from χG(G) by a simple readout (e.g., a linear regressor on pooled stabilized colors, or its neural analogue on final embeddings). We emphasize that the relevant targets are homomorphism counts rather than induced-subgraph counts; where desired, the latter can be derived via spasm expansions and thus inherit the same limitations.
We will implement three attention generators, matching the regimes
discussed above.
First, r-hop attention Ar(G) = Gr,
where we can directly vary r
and empirically test the strict hierarchy by evaluating a fixed set of
patterns across radii. Second, bounded-degree attention: for each u we restrict NA(G)(u)
to at most Δ neighbors, using
either (i) random subsampling among NG(u)
(to test worst-case behavior), or (ii) a canonical rule such as taking
the Δ lexicographically
smallest neighbors with respect to vertex labels/IDs (to preserve
isomorphism-equivariance in the synthetic setting). Third,
decomposition-guided sparse chordalizations (when a decomposition is
available), to verify that sparsity need not preclude strong motif
recovery provided separators are exposed to attention.
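A sketch of the three generators, in the same Python conventions as before (power_edges as defined earlier; all other names ours). Note that the bounded-degree variant built from per-vertex truncation yields a symmetric relation whose degrees can exceed Δ at vertices kept by many neighbors; a stricter variant would intersect the kept sets.

```python
def attention_r_hop(V, E, r):
    """A_r(G) = G^r: connect pairs at graph distance at most r."""
    return power_edges(V, E, r)

def attention_bounded_degree(V, E, delta, rng=None):
    """Per-vertex truncation to at most delta neighbors: canonical
    (lexicographically smallest) by default, random subsampling if rng given."""
    adj = {u: set() for u in V}
    for e in E:
        x, y = tuple(e)
        adj[x].add(y); adj[y].add(x)
    kept = set()
    for u in V:
        nbrs = sorted(adj[u])
        if rng is not None and len(nbrs) > delta:
            nbrs = rng.sample(nbrs, delta)   # worst-case, non-equivariant choice
        for v in nbrs[:delta]:
            kept.add(frozenset((u, v)))
    return kept

def attention_from_tree_decomposition(bags, T_edges):
    """Sparse chordalization from a supplied width-k tree decomposition:
    connect all pairs inside each bag and across adjacent bags, so that
    separators are exposed to attention (naively O(k^2 |V(T)|) edges)."""
    EA = set()
    for b in bags.values():
        EA |= {frozenset((u, v)) for u in b for v in b if u != v}
    for s, t in T_edges:
        EA |= {frozenset((u, v)) for u in bags[s] for v in bags[t] if u != v}
    return EA
```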
For r-hop monotonicity we will use the explicit separating patterns Fr + 1sep (chorded cycles) defined earlier, together with baseline patterns known to be easy already at r = 1 (e.g., cycles Cℓ and small series–parallel motifs). The prediction is that, holding the model architecture and training protocol fixed, the error in predicting hom(Fr + 1sep, G) should drop sharply when moving from Ar to Ar + 1, while exhibiting no comparable change for patterns already in FAr. For bounded-degree attention we will include branching obstructions such as SΔ + 1 = K1, Δ + 1 (and its 1-subdivision), where the theory predicts a depth-independent failure. Concretely, we expect that for fixed Δ, increasing the number of refinement rounds (or transformer layers) should not yield systematic improvement on targets that depend on (Δ + 1)-way branching, while it should still improve performance on patterns compatible with the A-guided unfolding constraint.
We will use two complementary data sources. On the one hand, we will generate synthetic graphs with controlled local statistics (Erdős–Rényi, configuration-model, and planted-motif families), varying |V| and degree distributions to test generalization across sizes and sparsity. On the other hand, to directly witness maximality phenomena, we will include worst-case pairs (GF, HF) produced by the separation procedure of Theorem~4 for selected excluded patterns F ∉ FA. These instances serve as a crisp sanity check: under the appropriate attention regime, the representations (or their neural approximations) should be indistinguishable up to numerical tolerance, while hom(F, GF) ≠ hom(F, HF). We will report both representation distances (e.g., ℓ2 on pooled embeddings) and readout failures (inability to linearly separate/predict the differing hom-counts).
For each graph G and each pattern F in a fixed suite (typically |VF| ≤ 8 so that exact computation is feasible), we compute hom(F, G) exactly by brute-force enumeration for small |VG| and by specialized routines where available (e.g., adjacency-matrix formulas for walks/cycles, and meet-in-the-middle for small-width patterns). We then train a readout hF on top of the model representation to predict hom(F, G) (regression) or to predict thresholded events hom(F, G) ≥ τ (classification). We will favor linear readouts to better isolate representational sufficiency: if χG(G) in fact determines hom(F, G) over the distribution, a linear probe should succeed with modest sample sizes; conversely, persistent failure under linear probes is evidence (though not proof) that the representation lacks the required information.
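A minimal probing sketch follows, assuming the feature for each graph is a fixed-length histogram over a shared vocabulary of stabilized color classes (our construction; any injective pooled encoding would do). Persistent held-out failure of such a probe is the evidence, not proof, referred to above.

```python
import numpy as np
from collections import Counter

def color_histogram(chi, vocab):
    """Feature vector: multiplicity of each stabilized color class in vocab."""
    counts = Counter(chi.values())
    return [counts.get(c, 0) for c in vocab]

def linear_probe_rmse(features, targets, train_frac=0.8, seed=0):
    """Least-squares linear readout hom(F, G) ~ W phi(G) + b, held-out RMSE."""
    rng = np.random.default_rng(seed)
    X = np.asarray(features, dtype=float)
    y = np.asarray(targets, dtype=float)
    idx = rng.permutation(len(y))
    cut = int(train_frac * len(y))
    tr, te = idx[:cut], idx[cut:]
    Xb = np.hstack([X, np.ones((len(y), 1))])          # bias column
    w, *_ = np.linalg.lstsq(Xb[tr], y[tr], rcond=None)
    resid = Xb[te] @ w - y[te]
    return float(np.sqrt(np.mean(resid ** 2)))
```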
Since SAT-2FGNN[A] is an idealized refinement with perfect hashing, we will explicitly separate (i) structural limitations due to attention locality from (ii) approximation limitations due to finite-dimensional neural parameterizations. We will therefore compare three variants: (a) a discrete implementation that maintains exact colors χ(t)(u, v) via dictionary hashing (serving as the ``ground-truth’’ refinement), (b) a neuralized pair-update network that replaces hashing by an MLP with sum-multiset aggregation, and (c) a standard sparse-attention transformer block on node embeddings augmented with edge indicators for attended pairs. For each, we will ablate whether the initial atomic type includes the indicator 𝕀[{u, v} ∈ EG], since our theory relies on distinguishing true edges from merely attended non-edges. We will also ablate tuple-tracking schemes (all pairs versus only (u, v) with v ∈ NA(u)) to quantify the runtime–accuracy tradeoff suggested by the sparse-tuple complexity discussion.
Across these experiments we expect to observe: (i) monotone improvements with r on the separating chorded-cycle family, matching FAr ⊊ FAr + 1; (ii) depth-independent failures under bounded-degree attention on branching motifs beyond Δ, consistent with Theorem~6; and (iii) a gap between the discrete refinement and its neural approximations that narrows with width and training but does not eliminate the structural failures imposed by the attention regime. These outcomes would empirically corroborate that the primary driver of motif-counting power in this setting is the attention-constrained information flow captured by FA, rather than optimization artifacts.
The present analysis sits at the intersection of (i) expressivity theory for transformers and attention mechanisms, (ii) the Weisfeiler–Leman (WL) and folklore-WL (FWL) refinements for graphs, (iii) logical characterizations of invariant graph representations, and (iv) homomorphism-based and polynomial bases for graph statistics. Our main point of contact with this literature is that we treat a sparse-attention transformer block as an attention-constrained refinement whose information flow is constrained by an auxiliary attention graph A(G), and we characterize its discriminative power by an associated maximal homomorphism family FA.
A large body of work studies the representational power of attention-based architectures, both in the original sequence setting and in permutation-equivariant regimes such as DeepSets and Set Transformers . In graph learning, graph transformers replace message passing with attention over (possibly all) vertex pairs, and their expressivity is typically discussed in terms of permutation equivariance and the ability to simulate higher-order interactions . Dense attention can make 2-tuple information accessible in one step, but is quadratic in |V|; practical models therefore use sparse attention patterns (local windows, landmark nodes, random links) or structural sparsifiers. Our framework abstracts this choice by an isomorphism-equivariant generator A and isolates the consequences of the resulting locality constraint on what can be distinguished or counted.
The classical 1-WL color refinement captures the expressive envelope of message-passing GNNs under standard assumptions; higher-order WL procedures (notably k-WL) characterize the power of k-tuple refinements and are closely tied to invariant/equivariant tensor architectures . The ``folklore'' variants (FWL) refine ordered tuples using adjacency-aware multisets and are known to connect more directly to homomorphism counting and to certain families of graph polynomials . Our SAT-2FGNN[A] can be viewed as a 2-tuple refinement on the paired structure (V, EG, EA), with the crucial design choice that the initial atomic type of (u, v) includes the indicator 𝕀[{u, v} ∈ EG] so that true edges in G are not conflated with merely attended non-edges. This places our results conceptually between (a) full 2-FWL on G (when A(G) is dense) and (b) locality-restricted refinements that only aggregate along a prescribed sparse neighborhood.
Sparse attention mechanisms have been studied extensively for long sequences (e.g., BigBird, Longformer, Reformer) and for graphs, typically with the goal of reducing computational cost while retaining enough connectivity to propagate information . Most theoretical results in this line are either approximation-oriented (showing that sparse patterns can approximate dense attention under distributional assumptions) or universality-oriented (showing that, with sufficient width and appropriate tokens, sparse attention can approximate broad function classes). Our focus is different: we fix the refinement idealization (perfect hashing, multiset aggregation) and ask for structural limits imposed solely by the attention graph. In particular, we prove strict hierarchies under attention inclusion and depth-independent obstructions under bounded attention degree, which are not captured by approximation-universality statements.
WL-type equivalences admit well-known logical counterparts in counting logics, and these logics can be refined by imposing locality or guardedness conditions . From this perspective, the attention graph induces a ``guard'' relation: when updating (u, v), quantification over a witness w is restricted to NA(u) ∪ NA(v). Consequently, indistinguishability by SAT-2FGNN[A] corresponds to equivalence in a fragment of C2 in which binary extensions are permitted only along EA (while EG remains available as a predicate in atomic formulas). Although we do not develop the full logical apparatus here, this viewpoint is consistent with our decomposition characterizations: canonical decompositions and attention-guided unfoldings play the same role as guarded bisimulations and tree-like unravellings in finite model theory.
A central organizing principle for invariant graph representations is that many natural graph statistics can be expressed as linear combinations of homomorphism counts, and conversely that homomorphism counts provide canonical coordinates for graph parameters . The modern refinement of this principle, developed in particular by Dell–Grohe–Rattan and subsequent works, identifies for several WL/FWL procedures an explicit pattern family F such that two graphs are equivalent under the procedure if and only if their homomorphism counts agree on all patterns in F . Our contribution is to extend this style of characterization to attention-constrained refinements: we show that for each isomorphism-equivariant A there is a unique maximal family FA (and rooted variants) witnessing exactly the homomorphism statistics determined by the stabilized colors. Moreover, we couple completeness with maximality by constructing, for each excluded pattern, explicit indistinguishable graph pairs with differing homomorphism counts, in the spirit of CFI/Fürer-type separations adapted to the attention relation.
Another strand of expressivity theory frames GNN/transformer representations as generating algebras of invariant polynomials, such as walk-count features derived from powers of the adjacency matrix, or trace/tensor invariants arising from higher-order architectures . In the dense-attention 2-tuple setting, stabilized colors can be related to a rich collection of WL-invariant polynomials; under sparse attention, however, the closure properties of the generated invariants become constrained by the allowed extensions along EA. Our homomorphism-family viewpoint provides a concrete description of this constraint: to the extent that a target invariant can be expanded into homomorphism counts whose spasm stays within FA, it is representable; when the spasm leaks outside FA, maximality yields worst-case failures. Thus the attention graph controls not only discriminative power but also the effective polynomial basis accessible to the model.
We view the main message as follows. Sparse attention is not merely an optimization trick; when made isomorphism-equivariant, it induces a precise refinement calculus whose information flow can be characterized exactly by an A-dependent homomorphism basis and an A-guided width-2 decomposition theory. This positions our results as a bridge between practical sparse-attention design (choice of A(G)) and classical graph-parameter theory (homomorphisms, treewidth, and WL), and it provides a template for extending expressivity analyses to other structured sparsification regimes beyond those treated explicitly here.
Our results provide a worst-case and tight account of what a sparse-attention 2-tuple refinement can count, given an isomorphism-equivariant attention generator A subject to a sparsity promise. Several directions remain open; we highlight those that appear technically closest to the present framework and most relevant for practice.
In applications, the attention graph is rarely prescribed a priori.
Instead, it is often learned from intermediate features (e.g., by top-k similarities, locality-sensitive
hashing, or routing mechanisms). This raises a basic question: when can
we treat a learned attention rule as an isomorphism-equivariant
generator A in our sense? If attention
is computed by an equivariant scoring function and then sparsified by a
deterministic rule, then equivariance is threatened by tie-breaking: the
map ``keep the top-k neighbors'' is not canonical in the presence of equal scores, and any arbitrary tie resolution can break invariance. One open problem is to design canonical sparsifiers (or randomized sparsifiers with controlled failure probability) that preserve equivariance while remaining implementable. Another is to extend our characterization to feature-dependent generators A(t) that evolve with refinement depth: even if each A(t) is equivariant as a function of the current coloring, the resulting process is no longer a fixed refinement on a fixed two-relation structure (V, EG, EA). We expect that an ``on-the-fly''
unfolding analysis remains possible, but maximality should become
subtler: the excluded patterns may depend on the entire attention policy
class rather than a single A. A
concrete goal is a dichotomy of the following form: characterize those
policy classes for which the induced expressive envelope still admits a
homomorphism basis F that is
independent of the input distribution and independent of training.
We idealize updates by perfect hashing of multisets and stabilization
in the limit. Real networks operate with finite precision, continuous
activations, and often a bounded number of layers. It is therefore
natural to ask which parts of the theory survive under approximation.
One possible route is to define an approximate refinement in which the hash is
replaced by a Lipschitz map and multiset aggregation is only
approximate, and then study whether equality of representations implies
approximate equality of hom(F, ⋅) for patterns F ∈ FA (and
conversely). Here one expects a quantitative dependence on pattern size,
depth, and separation margins in feature space. A different perspective
is to replace exact refinement equivalence by a probabilistic detectability notion: if two graphs
differ on hom(F, ⋅) for some
F ∈ FA,
can a bounded-depth, bounded-width instantiation of the model detect
this difference with high probability under random initialization or
random projections? Such questions connect to probabilistic versions of
WL and to smoothed analysis of color refinement. A technically clean
open problem is to prove stability bounds of the form
∥χG(G) − χH(H)∥ ≥ ε(F) ⋅ |hom(F, G) − hom(F, H)|
for suitable norms and for restricted families of graphs, thereby
turning the existence of FA into a robust
identification statement.
Our setting uses a 2-tuple refinement with locality controlled by A(G). A natural generalization is a k-tuple variant in which states live on ordered k-tuples and aggregation extends a tuple by a single vertex, but only along attention-neighborhood constraints. There are at least two non-equivalent choices: attention may be defined on vertices and then lifted to tuples (allowing witnesses in the union of neighborhoods of the tuple coordinates), or attention may be defined directly on tuples (which is more expressive but harder to sparsify). Either way, one expects a family FA(k) that interpolates between k-FWL (dense access) and strictly weaker locality-constrained procedures. An open problem is to identify the correct analogue of our A-guided width-2 decompositions: for k ≥ 3, do we obtain an A-guided decomposition theory of width k (in the sense of tree decompositions with bags of size k) or a different combinatorial object? Closely related is the algorithmic side: can one implement a sparse k-tuple refinement with near-linear dependence on |EA| for fixed k by maintaining only a carefully chosen subset of tuples, while retaining the same homomorphism basis?
Graph transformers frequently augment nodes (and sometimes pairs) with positional information: Laplacian eigenvectors, random features, shortest-path distances, or more elaborate structural encodings. From our viewpoint, such encodings effectively enrich the input structure by additional unary or binary predicates, which can either preserve or destroy isomorphism equivariance. For instance, Laplacian eigenvectors are not canonically defined when eigenvalues have multiplicity, and even for simple eigenvalues there is a sign ambiguity; injecting such features can therefore act as symmetry breaking beyond our intended setting. An open direction is to delineate the boundary between (i) encodings (invariantly definable from G, hence safely incorporated into ℓ or into an expanded relational signature) and (ii) encodings that implicitly select an ordering and thus yield expressivity not captured by homomorphism counts. On the canonical side, one may ask how the basis FA changes when we add relations such as all-pairs distances up to radius r, or when we allow A(G) itself to be derived from such relations. On the non-canonical side, one may seek probabilistic statements: random sign choices or random projections may give a model the ability to distinguish graphs with high probability, but then the correct analogue of FA may be a distribution-dependent family of statistics rather than a worst-case homomorphism basis.
Finally, we view a conceptual lesson as follows. Once attention is restricted, additional design choice (learned sparsity, landmarks, encodings, decomposition oracles) should be treated as part of the input structure whose invariance properties must be specified. A systematic open problem is to develop a calculus in which such choices come with explicit axioms (equivariance, sparsity, bounded radius, bounded degree, decomposition access) and in which the corresponding maximal homomorphism families and separation gadgets can be derived modularly. Achieving this would turn expressivity analysis into a reusable tool for sparse-attention architecture design, rather than a collection of one-off characterizations.