Sparse attention has become the operational bottleneck of graph transformers: in essentially all regimes where |VG| is large, one cannot afford to materialize (let alone repeatedly update) dense interactions over VG2, and so practical architectures restrict message passing to an attention graph A(G) with |EA(G)| near-linear in |VG| or with uniformly bounded degree. This restriction is not merely an implementation detail. It changes which dependencies can be transported across the model in bounded depth, and it therefore changes the set of graph patterns whose homomorphism counts are, even in principle, recoverable from the stabilized representation. Our aim is to make this limitation explicit and sharp, in a form that is invariant under isomorphism and independent of training.
A large fraction of the existing expressivity literature proceeds by comparing a model to a Weisfeiler–Leman (WL) refinement, or by bounding it between k-WL variants. These comparisons are informative when the model is allowed to aggregate over the full set of neighbors in the underlying graph (or over all pairs, in a dense 2-tuple refinement). However, WL-style upper bounds can miss the decisive feature of modern transformers: the neighborhood over which an update aggregates is neither NG(u) nor VG, but rather NA(G)(u) for an attention graph that is computed from G and is typically sparse. In particular, even when the state space is indexed by ordered pairs (u, v) as in 2-tuple refinement, the update of (u, v) is constrained to query only colors involving vertices w that are attention-neighbors of u or v. This induces a distinct notion of locality (and a distinct pebble-game equivalence) that is strictly weaker than classical 2-WL whenever A(G) omits a non-negligible fraction of pairs.
We therefore study a canonical discrete idealization of sparse-attention transformers, namely the refinement procedure SAT-2FGNN[A] parameterized by an isomorphism-equivariant attention generator A. The model refines colors χ(t)(u, v) on ordered pairs and uses perfect hashing so that the refinement is maximally discriminative subject to its access pattern. The central question is then structural: exactly which homomorphism counts are determined by the stabilized representation, and which are provably not? We insist on a tight answer, meaning (i) a completeness direction (the representation determines all relevant homomorphism counts) and (ii) a maximality direction (every excluded pattern admits explicit counterexamples that the model cannot distinguish).
Our first contribution is to show that sparse attention still admits
a homomorphism-basis characterization, provided one lets the basis
depend on A. Concretely, for each
isomorphism-equivariant attention generator A we define a maximal graph family FA such that, for all
graphs G, H, the
equality of stabilized graph representations χG(G) = χH(H)
is equivalent to the conjunction
hom(F, G) = hom(F, H) for
all F ∈ FA.
We establish existence and uniqueness of this family (and its rooted
analogues FA, n and FA, e) by
constructing an attention-guided unfolding/canonical-decomposition basis
whose isomorphism types are in bijection with stabilized colors. This
recovers, as a special case, the known homomorphism characterizations
for dense 2-tuple refinement and for purely local pair-refinement when
A(G) = G, while
simultaneously separating regimes where attention is strictly
sparser.
Our second contribution is structural: we characterize FA by an A-guided width-2 decomposition condition, equivalently by an attention-guided variant of almost-strong nested ear decompositions. Informally, membership of a pattern F in FA is equivalent to the existence of a canonical construction of F that extends along ears in a manner compatible with the attention accessibility witnessed by the refinement. This renders the attention restriction visible in the pattern language itself: patterns that require ``simultaneous'' branching beyond what the attention neighborhood can expose fail the decomposition criterion and therefore cannot be counted by the model, regardless of depth.
Our third contribution is maximality in the strong, algorithmic
sense: for any connected pattern F ∉ FA we
construct explicit graphs GF, HF,
computable in time polynomial in |VF|, such that
SAT-2FGNN[A] produces identical
stabilized representations on GF and HF, yet hom(F, GF) ≠ hom(F, HF).
The proof proceeds via an attention-adapted pebble game and a
modification of Fürer/CFI-type gadgets that respects the attention
constraints. This is the mechanism that upgrades an upper bound
(``the model cannot see beyond attention-guided unfoldings'') into a tight characterization
(``and nothing more can be inferred'').
Finally, we derive strict hierarchies across attention regimes. If A1(G) ⊆ A2(G) for all G, then the corresponding pattern families satisfy FA1 ⊆ FA2; moreover, for natural regimes such as r-hop power attention Ar(G) = Gr, the inclusion is strict for each r. We also identify uniform limitations under bounded-degree attention: a global degree bound Δ on A(G) forbids counting an infinite collection of patterns whose canonical decompositions necessitate arity exceeding Δ. Conversely, when attention is guided by a provided width-k tree decomposition of the input graph, sparse attention can be made sufficient (with an appropriate k-tuple extension) to recover the homomorphism expressivity of bounded-treewidth patterns, showing that sparsity does not inevitably imply weakness when coupled to the right structural oracle.
These results supply a principled answer to what sparse attention
``buys'' and ``costs'' from the viewpoint of homomorphism
counting and, by standard spasm identities, from the viewpoint of
subgraph counting. The remainder of the paper formalizes the above
notions and proves the stated theorems, beginning with preliminaries on
homomorphisms (including rooted variants) and the linear-algebraic lens
relating refinement invariants to homomorphism vectors.
We collect the homomorphism notions and decomposition viewpoints that will be used to state and prove our characterization theorems. Throughout, graphs are finite, simple, undirected, and vertex-labeled. We write G = (V, E, ℓ), where ℓ : V → Σ is a label map into a finite alphabet Σ.
Given graphs F = (VF, EF, ℓF)
and G = (VG, EG, ℓG),
a homomorphism from F to G is a map h : VF → VG
such that
{x, y} ∈ EF ⟹ {h(x), h(y)} ∈ EG, and ℓF(x) = ℓG(h(x)) for
all x ∈ VF.
We denote by hom(F, G) the number of
such homomorphisms. Homomorphisms are stable under isomorphism: if G ≃ H, then hom(F, G) = hom(F, H)
for all F; conversely, by
Lovász's classical theorem, the full homomorphism vector F ↦ hom(F, G)
determines G up to
isomorphism.
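For concreteness, here is a minimal brute-force sketch of these counts in Python; all function and variable names are our own, the pattern vertices are assumed to be 0, …, k − 1, and the enumeration is exponential in |VF|, so it is only meant for small patterns. The optional pin argument also covers the rooted counts defined next.

```python
from itertools import product

def hom(F_edges, F_labels, G_adj, G_labels, pin=None):
    """Count label-preserving homomorphisms h : V_F -> V_G by brute force.

    F_edges: iterable of pattern edges (x, y) over vertices 0..k-1.
    F_labels, G_labels: vertex -> label maps for pattern and host.
    G_adj: host adjacency, vertex -> set of neighbors.
    pin: optional {pattern vertex: host vertex}, for rooted counts.
    """
    k = len(F_labels)
    count = 0
    for h in product(list(G_adj), repeat=k):      # all maps V_F -> V_G
        if pin and any(h[x] != u for x, u in pin.items()):
            continue                               # enforce the roots
        if any(F_labels[x] != G_labels[h[x]] for x in range(k)):
            continue                               # labels must be preserved
        if all(h[y] in G_adj[h[x]] for x, y in F_edges):
            count += 1                             # every pattern edge lands on an edge
    return count

# Example: hom(K3, K4) = 4 * 3 * 2 = 24 (a single label 'a' everywhere).
K4 = {i: {j for j in range(4) if j != i} for i in range(4)}
assert hom([(0, 1), (1, 2), (0, 2)], {i: 'a' for i in range(3)},
           K4, {i: 'a' for i in range(4)}) == 24
```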
To capture node-level and edge-/pair-level information, we use rooted
pattern graphs. A node-rooted pattern is a pair (F, r) with r ∈ VF.
For a host graph G and a
vertex u ∈ VG,
we write
hom((F, r), (G, u)) := |{h : VF → VG homomorphism : h(r) = u}|.
Similarly, an edge-rooted (or ordered-pair-rooted) pattern is (F, (r1, r2))
with an ordered pair of distinguished vertices r1, r2 ∈ VF
(not necessarily adjacent). For an ordered pair (u, v) ∈ VG2
we define
hom((F, (r1, r2)), (G, (u, v))) := |{h : VF → VG homomorphism : h(r1) = u, h(r2) = v}|.
These rooted counts refine the unrooted counts by conditioning on images
of designated vertices; they are the natural currency when a model
produces stabilized features at vertices or at ordered pairs.
For a pattern F, we write
Spasm(F) for the set of homomorphic images of
F, i.e., graphs F̃ obtainable as quotients of F under vertex identifications
induced by a homomorphism (equivalently, the set of cores of all
homomorphic images, up to isomorphism). A standard consequence of
inclusion–exclusion on identifications (see, e.g.,
Curticapean–Dell–Marx) is that for each fixed F there exist rational coefficients
α(F, F̃) such
that, for every host graph G,

sub(F, G) = ∑F̃ ∈ Spasm(F) α(F, F̃) ⋅ hom(F̃, G),

where sub(F, G)
denotes the number of (not-necessarily-induced) subgraph embeddings of
F into G. This identity is the mechanism by
which homomorphism expressivity statements imply motif-counting
consequences: if a model cannot determine hom(F̃, ⋅) for some F̃ ∈ Spasm(F), then it
cannot in general determine sub(F, ⋅) either.
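As a small worked instance (standard, stated here for orientation), take F = P3, the path a–b–c on three vertices with a single label: identifying the two endpoints collapses P3 onto a single edge, so Spasm(P3) = {P3, K2} and the expansion reads

sub(P3, G) = ½ (hom(P3, G) − hom(K2, G)) = ½ (∑v ∈ VG degG(v)2 − 2|EG|),

since hom(P3, G) = ∑v ∈ VG degG(v)2 counts walks of length two, the non-injective maps (those with h(a) = h(c)) contribute hom(K2, G) = 2|EG|, and the factor ½ accounts for |Aut(P3)| = 2.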
Let ℳ be any isomorphism-invariant graph representation mechanism producing (i) a graph representation χG(G) and, when applicable, (ii) stabilized vertex representations χV(G, u) and (iii) stabilized ordered-pair representations χP(G, (u, v)). We formalize ``what ℳ can count'' by the maximal pattern families whose homomorphism vectors are exactly the invariants captured by the corresponding representations.
Concretely, a graph family F⋆ is a graph-level homomorphism basis for ℳ if for all graphs G, H,
χG(G) = χH(H) ⇔ hom(F, G) = hom(F, H) for
all F ∈ F⋆,
and F⋆ is maximal
w.r.t. set inclusion among families satisfying the equivalence.
Analogously, a family of node-rooted patterns Fn⋆
is a node-level homomorphism basis if for all G, H
and all vertices u ∈ VG,
u′ ∈ VH,
χV(G, u) = χV(H, u′) ⇔ hom((F, r), (G, u)) = hom((F, r), (H, u′)) for
all
(F, r) ∈ Fn⋆,
with maximality; and similarly for an edge-/pair-level family Fe⋆
of ordered-pair-rooted patterns using χP and the
corresponding rooted homomorphism counts. In our setting, ℳ will be SAT-2FGNN[A], and the families will be
denoted FA, FA, n, FA, e.
Our structural characterizations will be phrased in terms of canonical width-2 decompositions. A tree decomposition of a graph F is a pair (T, β) where T is a tree and β : V(T) → 2VF assigns a bag to each node such that (i) every vertex of F appears in some bag, (ii) every edge of F is contained in some bag, and (iii) for each x ∈ VF, the set {t ∈ V(T) : x ∈ β(t)} induces a connected subtree of T. The width is maxt ∈ V(T)|β(t)|−1; width 2 thus means bags of size at most 3.
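The definition is directly checkable; the following Python sketch (with naming conventions of our own) verifies conditions (i)–(iii) and the width bound for a candidate pair (T, β).

```python
def is_tree_decomposition(F_vertices, F_edges, T_edges, bags, width=None):
    """Check conditions (i)-(iii) of a tree decomposition (T, beta).

    bags: dict tree-node -> set of pattern vertices; T_edges: edges of T.
    """
    nodes = list(bags)
    # (i) every vertex of F appears in some bag
    if not all(any(x in bags[t] for t in nodes) for x in F_vertices):
        return False
    # (ii) every edge of F is contained in some bag
    if not all(any(x in bags[t] and y in bags[t] for t in nodes)
               for x, y in F_edges):
        return False
    # (iii) bags containing any fixed vertex induce a connected subtree of T
    adj = {t: set() for t in nodes}
    for s, t in T_edges:
        adj[s].add(t); adj[t].add(s)
    for x in F_vertices:
        occ = {t for t in nodes if x in bags[t]}
        if not occ:
            continue
        stack, seen = [next(iter(occ))], set()
        while stack:
            t = stack.pop()
            if t in seen:
                continue
            seen.add(t)
            stack.extend(u for u in adj[t] if u in occ)
        if seen != occ:
            return False
    # optional width check: width 2 means bags of size at most 3
    return width is None or max(len(b) for b in bags.values()) <= width + 1
```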
For the pattern families relevant to 2-tuple refinements, it is convenient to restrict attention to width-2 decompositions in which each new bag shares (typically) two vertices with previously introduced bags and introduces at most one new vertex. This viewpoint is equivalent to building F by successively attaching ears. Formally, an ear is a simple path whose internal vertices are new and whose endpoints lie in the current partial graph. An almost-strong nested ear decomposition (AS-NED) is an ear decomposition equipped with a nesting condition that forbids ``crossing'' attachments: each ear attaches within a previously created ear in a laminar manner, so that the resulting bag structure is tree-like when one expands each ear into a chain of size-3 bags along the path. In the purely local setting (pair refinement without additional attention), AS-NED is precisely the obstruction-free condition that makes all relevant extensions visible to the refinement process.
We will later define an A-guided variant of this condition, in which the admissible ear attachments are additionally constrained by the attention accessibility implicit in the refinement updates. The present section supplies the homomorphism language into which these decomposition constraints will be translated.
We formalize the refinement mechanism underlying SAT-2FGNN[A] as a 2-tuple color refinement procedure whose information flow is constrained by an auxiliary attention graph produced from the input.
Fix an isomorphism-equivariant map A : 𝒢 → 𝒢 such that, for every input graph G = (V, E, ℓ), the attention graph is A(G) = (V, EA) on the same vertex set. Isomorphism-equivariance means that any isomorphism π : G ≃ H lifts to an isomorphism of the paired structures (G, A(G)) ≃ (H, A(H)) via the same vertex bijection. We will consider several regimes for A, all of which impose a sparsity constraint ensuring that A(G) can be traversed locally: typical promises are |EA| = O(|V|), or maxu ∈ VdegA(G)(u) ≤ Δ, or A(G) = Gr for a fixed radius r, or decompositions in which EA is a sparse chordalization guided by an externally supplied tree decomposition. The specific promise does not affect the definition of the refinement, but it will determine which pattern families can be unfolded and hence counted.
The model maintains colors (features) on pairs (u, v) ∈ V2,
written χ(t)(u, v)
at round t ≥ 0. The initial
color records the atomic type of the pair in the graph G:

χ(0)(u, v) := (ℓ(u), ℓ(v), 1[u = v], 1[{u, v} ∈ E]).

Including the edge-indicator 1[{u, v} ∈ E]
is essential: even if {u, v} ∉ E but
{u, v} ∈ EA,
the procedure is allowed to ``attend'' to the pair while still being
able to distinguish true edges from merely attended non-edges. This
separation will be used implicitly when we later translate refinement
information into homomorphism counts of labeled patterns.
At each round, the update at (u, v) can only access
intermediate vertices w that
are adjacent in the attention graph to at least one endpoint of the
pair. Formally, letting NA(x)
denote the neighbors of x in
A(G), we define the
multiset message

Mt(u, v) := {{(χ(t)(w, v), χ(t)(u, w), atpG(u, v, w)) : w ∈ NA(u) ∪ NA(v)}}.

Here atpG(u, v, w)
denotes the atomic type of the triple (u, v, w) in G (e.g., the labels and equality
relations among u, v, w together
with adjacency indicators among the three unordered pairs). Any fixed
encoding of this finite relational type suffices; its role is to let the
update distinguish, for instance, whether w is adjacent to u in G as opposed to only in A(G). The next-round color
is then obtained by a perfect-hash refinement

χ(t + 1)(u, v) := hash(χ(t)(u, v), Mt(u, v)),

where hash is injective on its domain
(equivalently, we work in the idealization of infinitely wide features
with perfect discrimination). As usual for color refinement, we iterate
until stabilization, i.e., until the partition of V2 induced by χ(t) no longer
changes.
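As a reference implementation of this idealization, the following Python sketch maintains exact colors by dictionary hashing; the names are ours, labels are assumed hashable and comparable, and the pooled representations at the end mirror the equivariant pooling described in the paragraphs below.

```python
from itertools import product

def sat_2fgnn_refine(V, E, EA, labels, max_rounds=None):
    """Idealized SAT-2FGNN[A]: 2-tuple refinement whose update at (u, v)
    aggregates only over attention-visible witnesses w, with dictionary-based
    perfect hashing in place of an injective hash function.

    E, EA: sets of frozenset edges (true edges / attention edges); labels: dict.
    """
    NA = {u: set() for u in V}
    for e in EA:
        x, y = tuple(e)
        NA[x].add(y); NA[y].add(x)

    def atp(u, v, w):
        # atomic type of the triple: labels, equalities, E-adjacencies
        return (labels[u], labels[v], labels[w], u == v, u == w, v == w,
                frozenset((u, v)) in E, frozenset((u, w)) in E,
                frozenset((v, w)) in E)

    # initial color: atomic pair type, including the true-edge indicator
    chi = {(u, v): (labels[u], labels[v], u == v, frozenset((u, v)) in E)
           for u, v in product(V, repeat=2)}
    for _ in range(max_rounds or len(V) ** 2):
        table, nxt = {}, {}
        for u, v in product(V, repeat=2):
            msg = sorted((atp(u, v, w), chi[(u, w)], chi[(w, v)])
                         for w in NA[u] | NA[v])             # attention support
            key = (chi[(u, v)], tuple(msg))
            nxt[(u, v)] = table.setdefault(key, len(table))  # perfect hash
        stable = len(set(nxt.values())) == len(set(chi.values()))
        chi = nxt
        if stable:  # the partition no longer splits: stabilized
            break
    return chi

def representations(V, chi):
    """Equivariant pooling: node colors, then a graph-level multiset."""
    chi_V = {u: tuple(sorted(repr(chi[(u, v)]) for v in V)) for u in V}
    chi_G = tuple(sorted(map(repr, chi_V.values())))
    return chi_V, chi_G
```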
Two remarks are in order. First, when A(G) is dense, the update reduces to the usual 2-tuple refinement (and thus matches the folklore 2-WL/2-FGNN correspondence at the level of stable partitions). Second, when A(G) is sparse, the update enforces a genuine constraint: the update at (u, v) cannot aggregate over all w ∈ V, but only over those w made visible by attention adjacency to u or v.
By construction, the update is permutation-equivariant with respect
to relabelings of V that
preserve both G and A(G). Concretely, if π : VG → VH
is an isomorphism with (G, A(G)) ≃ (H, A(H)),
then for all rounds t and all
ordered pairs (u, v)
we have
χG(t)(u, v) = χH(t)(π(u), π(v)).
Moreover, the refinement is monotone: the equivalence relation on V2 defined by equality of
colors can only split, never merge, as t increases. Since V2 is finite,
stabilization occurs after at most |V|2 rounds (typically
far fewer), so the procedure is well-defined on all finite inputs.
From the stabilized colors we obtain representations at different
granularities via standard equivariant pooling. At the pair level we
simply set χP(G, (u, v)) := χ(∞)(u, v).
Node-level representations can be taken as
$$
\chi_V(G,u)\ :=\ \mathrm{pool}\bigl(\{\!\{\chi^{(\infty)}(u,v): v\in
V\}\!\}\bigr),
$$
for any injective multiset aggregator pool (again idealized as perfect hashing).
Finally, a graph-level representation χG(G)
is obtained by pooling the multiset of node colors, or directly the
multiset of pair colors. Any two such choices are interdefinable under
perfect hashing; in particular, the induced indistinguishability
relation on graphs is uniquely determined by the stabilized partition of
V2.
The procedure above should be read as a discrete analogue of stacked sparse self-attention layers. The set NA(u) ∪ NA(v) plays the role of the attention support (which keys/values are visible when updating the query associated with (u, v)), while the multiset Mt(u, v) in the update corresponds to aggregating value vectors indexed by visible w. Our perfect-hash update captures the maximal discriminative power of any equivariant architecture whose computations at (u, v) are restricted to this attention support and are invariant to the order of neighbors. Thus, when we later characterize expressivity via homomorphism-count families, the statements should be interpreted as upper bounds for a broad class of sparse-attention pairwise transformers, and as matching lower bounds for the idealized refinement itself.
In the next section we will convert stabilized colors into combinatorial objects—attention-guided unfoldings and canonical width-2 decompositions—which serve as the bridge from local attention access to global homomorphism invariants.
We now pass from stabilized pair-colors to explicit combinatorial
objects that expose exactly which local configurations can be explored
under attention. Throughout, we work over the paired input
structure
Ĝ := (VG, EG, EA(G), ℓ),
where EG
is the original edge relation and EA(G)
is the attention edge relation; the refinement accesses EA(G)
only through neighborhoods, while it can test EG (and
equalities) via atomic types.
Fix G and an ordered pair (u, v) ∈ VG2. We define a rooted, tree-decomposed, finite pattern 𝒰A(t)(G; u, v) of width 2 by induction on t ≥ 0. The pattern is most conveniently presented as a tree decomposition whose bags alternate between size 2 and size 3.
At depth t = 0, 𝒰A(0)(G; u, v) consists of a single root bag containing the ordered pair (u, v), labeled by the atomic type χ(0)(u, v) from the initialization.
Given 𝒰A(t)(G; ⋅, ⋅),
we form 𝒰A(t + 1)(G; u, v)
as follows. Start with a root 2-bag
Buv
containing (u, v) and
carrying the label χ(0)(u, v)
(or equivalently, the triple-atomic type restricted to {u, v}). For each
intermediate vertex
w ∈ NA(G)(u) ∪ NA(G)(v),
attach a 3-bag Buvw
adjacent to Buv
whose vertex set is (u, v, w) and
whose label is the triple atomic type atpG(u, v, w)
used in the update rule. Finally, from Buvw
we attach two child 2-bags Bwv and
Buw,
overlapping with Buvw
on the ordered pairs (w, v) and (u, w) respectively, and we
continue the construction recursively by grafting copies of 𝒰A(t)(G; w, v)
below Bwv and
of 𝒰A(t)(G; u, w)
below Buw.
The resulting decomposition is a rooted tree in which every 3-bag witnesses exactly one attention-visible
extension w from its parent
2-bag.
We view 𝒰A(t)(G; u, v) as a rooted isomorphism type: vertex names from VG are not part of the object; only the bag tree, the incidences/overlaps, and the bag labels (atomic types) are retained. An isomorphism between two unfoldings is a root-preserving isomorphism of the underlying bag trees that preserves the ordered interfaces on 2-bags and preserves the atomic-type labels on all bags.
The above definition is engineered so that the refinement update is
the usual bottom-up hashing of rooted unfolding trees. Concretely, let
type(t)(u, v)
denote the rooted isomorphism type of 𝒰A(t)(G; u, v).
Then, by induction on t, we
obtain a bijective correspondence

χ(t)(u, v) = χ(t)(u′, v′) ⟺ type(t)(u, v) = type(t)(u′, v′),

valid across all graphs and all ordered pairs. The base case t = 0 is
immediate from the initialization. For the inductive step, the multiset Mt(u, v)
records, for each attention-visible w, the ordered pair of child colors
(χ(t)(w, v), χ(t)(u, w))
together with the parent witness label atpG(u, v, w).
Under the induction hypothesis, this is exactly the multiset of rooted
child types attached below the 3-bags
Buvw
in 𝒰A(t + 1)(G; u, v).
Since hash is injective, equality of
next-round colors is equivalent to equality of these labeled multisets,
hence to equality of the rooted unfolding types.
As a consequence, the stabilized color χ(∞)(u, v) can be identified with the (finite) rooted type of 𝒰A(t)(G; u, v) for any t beyond stabilization. In particular, the graph-level representation obtained by pooling stabilized colors is determined by the multiplicities of rooted unfolding types.
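To make the color–type correspondence concrete, a memoized Python sketch follows; it assumes helper encodings pair_type(u, v) and triple_type(u, v, w) of the atomic types above (both names ours) and computes type(t)(u, v) as a canonical, name-free nested tuple. By the correspondence just established, equality of these objects across graphs coincides with equality of χ(t).

```python
def make_unfolding_type(NA, pair_type, triple_type):
    """type(t)(u, v): canonical nested-tuple encoding of U_A^(t)(G; u, v).

    pair_type(u, v) and triple_type(u, v, w) are assumed encodings of the
    atomic types from the refinement definition. Vertex names never appear
    in the returned object; only bag labels and the tree shape are kept.
    """
    memo = {}
    def typ(t, u, v):
        key = (t, u, v)
        if key in memo:
            return memo[key]
        if t == 0:
            res = ('root', pair_type(u, v))            # a single root 2-bag
        else:
            kids = tuple(sorted(                       # one 3-bag per witness w
                (triple_type(u, v, w), typ(t - 1, w, v), typ(t - 1, u, w))
                for w in NA[u] | NA[v]))
            res = ('root', pair_type(u, v), kids)
        memo[key] = res
        return res
    return typ
```

Memoization keeps the computation at O(t · |V|²) table entries, mirroring the fact that the refinement itself reuses the colors of the pairs (w, v) and (u, w).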
Let 𝒮A(t) denote
the set of all rooted isomorphism types that can occur as 𝒰A(t)(G; u, v)
over all finite G ∈ 𝒢 and all
ordered pairs (u, v) ∈ VG2,
and let
𝒮A := ⋃t ≥ 0𝒮A(t).
Each element of 𝒮A is a
finite tree decomposition of width 2
with bags of size at most 3, equipped
with the above alternation structure and atomic-type labels. We
emphasize that 𝒮A is
canonical in the sense that it is defined purely from the refinement
rule and the attention access pattern; it does not depend on any
particular input graph.
Given (F, T) ∈ 𝒮A and
a graph G, we define the
corresponding bag-isomorphism count biso((F, T), Ĝ) as
the number of root-preserving mappings from the bags of (F, T) into Ĝ that are consistent on overlaps
and that realize the prescribed atomic-type labels in Ĝ (equivalently, a homomorphism into
the relational structure Ĝ
that is tracked bag-by-bag along T). For rooted unfoldings, these
counts have a direct interpretation: if τ ∈ 𝒮A(t)
is a rooted type, then biso(τ, Ĝ) determines, and is determined by, the number of ordered pairs (u, v) ∈ VG2 with type(t)(u, v) = τ.
Combining this with the color–type correspondence, we see that the pooled multiset of stabilized colors is
equivalent information to the vector of all bag-isomorphism counts over
𝒮A (at any sufficiently
large depth). This is the precise bridge we will use in the next
section: we first characterize representation equivalence by equality of
these canonical bag-isomorphism counts, and then translate those counts
into ordinary homomorphism counts of a maximal pattern family via the
standard linear-algebraic passage from decompositions to
homomorphisms.
We write G≡AH to denote indistinguishability
of SAT-2FGNN[A] on input graphs G, H, i.e.,
G≡AH ⇔ χG(G) = χH(H),
where χG(G)
is the pooled multiset of stabilized pair-colors (and analogously for
H). By the correspondence
between colors and rooted attention-guided unfoldings established above,
this equivalence is already captured by the canonical bag-isomorphism
statistics over 𝒮A: for any
sufficiently large depth t
(beyond stabilization on the given inputs),

G ≡A H ⟺ {{type(t)(u, v) : (u, v) ∈ VG2}} = {{type(t)(u′, v′) : (u′, v′) ∈ VH2}}.

Thus the remaining step is to express the family of all bag-isomorphism
counts on width-2 decomposed patterns
as (equivalently) a family of ordinary homomorphism counts.
To state this cleanly, we regard each (F, T) ∈ 𝒮A as a finite relational structure over the signature consisting of vertex labels (unary predicates) together with two symmetric binary relations, one intended to be interpreted as EG and one as EA(G); the host structure is precisely Ĝ = (VG, EG, EA(G), ℓ). We then write hom(F, Ĝ) for the number of homomorphisms of these two-edge-colored, vertex-labeled patterns into Ĝ. (This is a notational refinement of hom(F, G) tailored to the paired input Ĝ used by the model.)
The standard linear-algebraic passage of Dell–Grohe–Rattan applies to
any fixed bounded-width decomposition formalism: bag-consistent local
homomorphisms (our biso counts) and
global homomorphisms (our hom counts)
span the same vector space of invariants, and there are explicit
triangular transformations between the two bases. Concretely, for each
decomposed pattern (F, T) ∈ 𝒮A
there exists a finite set ℱ(F, T) of
(two-edge-colored, vertex-labeled) graphs and rational coefficients
c(F, T)(⋅)
such that, for all G,

biso((F, T), Ĝ) = ∑F′ ∈ ℱ(F, T) c(F, T)(F′) ⋅ hom(F′, Ĝ).

Conversely, for each finite pattern graph F that arises as an underlying graph
of some element of 𝒮A, there
is a finite set of decomposed patterns 𝒯(F) ⊆ 𝒮A and
coefficients dF(⋅) with

hom(F, Ĝ) = ∑(F′, T′) ∈ 𝒯(F) dF((F′, T′)) ⋅ biso((F′, T′), Ĝ).

The finiteness of both expansions is uniform in G and depends only on the bounded
width (here, 2) and the bounded bag
size (here, at most 3); no property of
the specific attention generator beyond isomorphism-equivariance is
needed at this point.
We may therefore define a canonical pattern family by forgetting decompositions:
$$
F_{\mathsf{A}}^{\mathrm{can}}\ :=\ \{\ \underline{(F,T)}\ :\
(F,T)\in\mathcal{S}_{\mathsf{A}}\ \},
$$
where $\underline{(F,T)}$ denotes the
underlying two-edge-colored, vertex-labeled pattern of the decomposed
object (F, T).
Combining this definition with the two expansions above yields the desired completeness statement.
We emphasize that, while there may be many families of patterns whose
homomorphism counts suffice to determine the representation, there is a
unique maximal one. Define
FAmax := {F: ∀G, H, G≡AH ⇒ hom(F, Ĝ) = hom(F, Ĥ)}.
By definition, FAmax is
unique and inclusion-maximal among families satisfying the implication
``representation equivalence ⇒ equality
of hom-counts on the family''. The completeness theorem above shows that FAcan is
complete; hence every F ∈ FAmax
is already determined by the invariants encoded by 𝒮A and therefore lies in the
closure generated by FAcan under
the linear relations above. In particular, FAmax is
determined solely by the refinement rule and the attention access
pattern, and not by auxiliary choices (depth parameters, pooling
convention, or representatives of isomorphism types). The subsequent
maximality theorem (proved via attention-adapted gadgets) will show that
FAcan
already attains this maximum in the strong sense demanded in the problem
statement.
We now replace the unfolding-defined description of FA by an intrinsic,
purely pattern-theoretic criterion. Throughout, a pattern is a finite
vertex-labeled structure
F̂ := (VF, EF, EFA, ℓF),
where EF
is intended to match EG and EFA
is intended to match EA(G)
when mapping into Ĝ = (VG, EG, EA(G), ℓ).
An A-guided canonical decomposition of F̂ is a rooted width-2 tree decomposition in which (1) the root bag consists of a single pair of vertices, (2) every non-root bag B = {x, y, z} meets the previously introduced bags in a boundary pair (x, y) and introduces at most one new vertex z, and (3) the new vertex is attention-visible from the boundary, i.e., {z, x} ∈ EFA or {z, y} ∈ EFA. The point of (3) is that, when realizing B inside Ĝ, the refinement update for the boundary pair (x, y) is permitted to aggregate over witnesses w that lie in NA(G)(x) ∪ NA(G)(y); thus any newly introduced vertex must be exposed to at least one boundary endpoint by an EA-edge in the pattern.
We write CDA(F̂) for the assertion that F̂ admits an A-guided canonical decomposition. By construction, every decomposed pattern in 𝒮A forgets to a pattern F̂ with CDA(F̂), and conversely every F̂ with CDA(F̂) can be unfolded into some element of 𝒮A (by turning each extension bag into an unfolding step).
For maximality arguments later, it is convenient to have a sequential construction rather than a tree decomposition. We therefore introduce an A-guided variant of the almost-strong nested ear decomposition (AS-NED) from the fully local case.
An A-guided AS-NED of F̂ is a sequence of
subpatterns
F̂0 ⊆ F̂1 ⊆ ⋯ ⊆ F̂m = F̂
such that F̂0 is a
single labeled edge (two vertices with their E-adjacency and EA-adjacency
fixed), and each extension F̂i + 1 is
obtained from F̂i by adding an ear
Pi, which
is a path x = v0, v1, …, vk = y
with endpoints x, y ∈ V(F̂i)
and internal vertices v1, …, vk − 1 ∉ V(F̂i).
The ear is required to be nested in the standard sense (its endpoints lie on a
common previously created boundary), and it is required to be A-visible in the
sense that every new internal vertex is introduced along a boundary pair
from which it is attention-visible:
for each j ∈ {1, …, k − 1}, at the
moment vj
is introduced, the current boundary pair (a, b) satisfies
{vj, a} ∈ EFA or {vj, b} ∈ EFA.
When EFA = EF
(the regime A(G) = G), this reduces to
the usual visibility condition along true edges and hence to the AS-NED
characterization for the local model.
We write NEDA(F̂) for the assertion that F̂ admits an A-guided AS-NED.
We argue by mutual simulation. First, F̂ ∈ FA implies CDA(F̂) because every F̂ ∈ FA arises as $\underline{(F,T)}$ for some (F, T) ∈ 𝒮A, and the definition of 𝒮A ensures that each bag extension corresponds to introducing one new vertex z witnessed by some attention-neighbor of a boundary endpoint; this is exactly condition (3) above.
Second, CDA(F̂) implies F̂ ∈ FA by reading the decomposition top-down as an admissible unfolding program: each non-root bag {x, y, z} is introduced at the boundary pair (x, y) using the witness z, and the attention-visibility condition guarantees that this witness is accessible to the refinement update. This constructs an element of 𝒮A that forgets to F̂.
Finally, CDA(F̂) ⇔ NEDA(F̂) is obtained by converting between a width-2 bag-extension decomposition and a nested sequence of ear additions: traversing T in a depth-first order yields a sequence in which each new vertex is introduced with a boundary pair, producing ears when consecutive introductions share the same endpoints; conversely, each ear addition can be expanded into a chain of size-3 bags, each introducing one internal vertex while keeping the ear endpoints as the boundary. The A-visibility requirement is preserved in both directions.
For fixed F̂, deciding CDA(F̂) is constructive. One convenient certificate is an A-admissible elimination order v1, …, v|VF| such that each vi has at most two neighbors among {vi + 1, …} in the underlying simple graph of EF ∪ EFA, and whenever it has exactly two such later neighbors x, y, the vertex vi is attention-visible from the pair, i.e. {vi, x} ∈ EFA or {vi, y} ∈ EFA. From such an order we build the decomposition by introducing vi in a bag with its (at most two) later neighbors, and conversely any A-guided canonical decomposition yields such an order by repeatedly removing a leaf bag and deleting its unique new vertex. Thus membership in FA is decidable in polynomial time in |VF| by greedy peeling (trying all choices when multiple removable vertices exist), and when successful this procedure returns an explicit A-guided decomposition (or, equivalently, an A-guided AS-NED) witnessing F̂ ∈ FA.
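A direct transcription of this certificate test, in Python with names of our choosing, is sketched below; it backtracks over removable vertices for safety, although the greedy order suffices whenever removability is monotone under deletion, as in the peeling argument above.

```python
def admits_A_guided_decomposition(V, EF, EFA):
    """Search for an A-admissible elimination order witnessing CD_A(F-hat).

    EF, EFA: sets of frozenset edges (pattern edges / pattern attention edges).
    Returns an elimination order (a list of vertices) or None.
    """
    union = {u: set() for u in V}
    for e in EF | EFA:
        x, y = tuple(e)
        union[x].add(y); union[y].add(x)

    def removable(v, later):
        # at most two later neighbors; if exactly two, v must be EA-adjacent
        # to at least one of them (attention visibility from the boundary pair)
        if len(later) > 2:
            return False
        if len(later) == 2:
            return any(frozenset((v, x)) in EFA for x in later)
        return True

    def search(remaining, order):
        if len(remaining) <= 2:
            # base case: the last two vertices form the initial labeled edge
            return order + sorted(remaining)
        for v in sorted(remaining):
            later = union[v] & (remaining - {v})
            if removable(v, later):
                res = search(remaining - {v}, order + [v])
                if res is not None:
                    return res
        return None

    return search(set(V), [])
```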
We turn to the matching lower bound direction: the structural families characterized above are not only sufficient but also maximal for homomorphism expressivity under SAT-2FGNN[A]. Concretely, whenever a connected pattern F̂ fails to admit an A-guided decomposition (equivalently, an A-guided AS-NED), we construct explicit graphs GF̂, HF̂ such that the sparse-attention refinement produces identical stabilized representations on GF̂ and HF̂, yet hom(F, GF̂) ≠ hom(F, HF̂) for the underlying graph F.
The standard route from refinement indistinguishability to lower bounds proceeds via a bijective pebble game. In our setting we require a variant that faithfully reflects the locality restriction w ∈ NA(G)(u) ∪ NA(G)(v) in the update rule. We therefore define an A-local bijective 2-pebble game played on two input structures Ĝ = (VG, EG, EA(G), ℓG) and Ĥ = (VH, EH, EA(H), ℓH).
The game maintains a position consisting of a pair of pebbled
vertices (u1, u2) ∈ VG2
and (v1, v2) ∈ VH2
(order matters), together with the requirement that the atomic types of
ordered pairs match:
(ℓG(u1), ℓG(u2), 1[u1 = u2], 1[{u1, u2} ∈ EG]) = (ℓH(v1), ℓH(v2), 1[v1 = v2], 1[{v1, v2} ∈ EH]).
A round proceeds as follows. Spoiler chooses one of the two pebbles, say
the first, and announces that he will ``refresh'' it. Duplicator must
then provide a bijection
π: NA(G)(u1) ∪ NA(G)(u2) → NA(H)(v1) ∪ NA(H)(v2)
that preserves the current pair-color information from the previous
round (equivalently, preserves the induced partition of ordered pairs in
the neighborhoods). Spoiler then selects some w ∈ NA(G)(u1) ∪ NA(G)(u2)
and moves the chosen pebble to w; Duplicator must move the
corresponding pebble to π(w). The second pebble
remains fixed. Duplicator loses if she cannot produce an admissible
bijection or if the atomic type constraint is violated after the
move.
This is precisely the locality-limited analogue of the bijective 2-pebble game for 2-tuple refinement, with the crucial modification that the choice domain is the attention neighborhood. As in the classical case, one proves by induction on t that Duplicator has a winning strategy for t rounds if and only if the refinement colors χ(t) agree on the corresponding ordered pairs. Taking t → ∞ yields the equivalence between game indistinguishability and stabilized-color indistinguishability for SAT-2FGNN[A]. In particular, any pair of graphs on which Duplicator wins indefinitely must have identical graph representations under any downstream equivariant pooling of stabilized colors.
To witness maximality we require a generic method that, for an excluded pattern F̂ ∉ FA, produces graphs that are indistinguishable in the A-local game but separated by hom(F, ⋅). The construction follows the Fürer/CFI paradigm: we start from a base graph encoding the incidence structure of F̂ and replace certain ``constraint edges'' by parity gadgets, yielding two non-isomorphic lifts GF̂0 and GF̂1 (the untwisted and twisted versions). The gadgets are chosen so that (i) any local neighborhood accessible through attention has a canonical bijection between GF̂0 and GF̂1, enabling Duplicator to win the A-local game, while (ii) globally, the parity of twists changes the number of homomorphisms from F by a nonzero amount.
The attention aspect enters in two places. First, because the update for a boundary pair only sees witnesses in NA(⋅), we must ensure that every move Spoiler can make in the A-local game can be matched by a locally defined bijection between the corresponding gadget neighborhoods. This is achieved by obstruction-aware gadget placement: we introduce the parity gadgets only along those incidences that would have to be traversed to realize an A-guided decomposition. When F̂ fails CDA, there is an obstruction witnessing that any attempted bag-extension would require revealing a new vertex not attention-visible from the current boundary. We exploit precisely this obstruction: we place the twist so that distinguishing the two lifts would require Spoiler to ``see'' through an attention-invisible interface, which the game forbids. Second, since our model distinguishes true edges {u, v} ∈ EG from merely attended pairs {u, v} ∈ EA(G), the gadget must preserve E-adjacency types while using EA only to control information flow; we therefore keep the underlying simple graph E identical in the relevant local views and implement the twist solely in the lift structure.
With these choices one shows the two key claims: Duplicator has a winning strategy in the A-local 2-pebble game on (GF̂0, GF̂1), hence SAT-2FGNN[A] assigns them identical stabilized representations; yet the homomorphism count hom(F, ⋅) detects the parity twist, giving hom(F, GF̂0) ≠ hom(F, GF̂1). This establishes maximality at the graph level, as stated abstractly in Theorem~4.
For node-rooted and edge-rooted families, we additionally need counterexamples that force homomorphisms to respect a designated root. We use a standard clique-augmentation trick adapted to our labeled setting: to pin a root vertex r, we attach a uniquely identifiable labeled gadget (e.g., a labeled clique of size M with a unique label multiset) to r in both GF̂0 and GF̂1, choosing M large enough that any homomorphism mapping the rooted pattern into the host must send the root to the unique attachment point. For edge-rooting we similarly pin both endpoints by attaching asymmetric gadgets to each endpoint. Because these augmentations are purely local and isomorphic between the two lifts, they do not help Spoiler in the A-local game, but they eliminate spurious homomorphisms that would otherwise avoid the intended root constraints.
The maximality proof yields a uniform separation mechanism used repeatedly in the hierarchy results later. Given an attention generator A and a candidate pattern F̂, we first attempt to certify CDA(F̂) (equivalently NEDA(F̂)) via an A-admissible elimination order. If the test fails, the failure exposes a concrete obstruction: at some stage every remaining vertex either has too many ``later'' neighbors in E ∪ EA or is not attention-visible from any admissible boundary pair. From this obstruction we deterministically choose an interface on which to place the parity twist and build the pair (GF̂0, GF̂1). By construction, the refinement cannot access the obstruction interface through attention-guided witnesses, so the two graphs remain indistinguishable; nonetheless the excluded pattern detects the twist, separating hom(F, ⋅).
This procedure is the promised tightness principle: the structural characterization of FA is accompanied by an explicit, pattern-driven construction of indistinguishable-but-count-different graph pairs for every excluded pattern, and the same holds for rooted variants after clique augmentation.
We record three attention regimes in which the abstract maximality principle above yields concrete, tight hierarchies of homomorphism expressivity. In each case the upper bound is inherited from the structural characterization of FA (via A-guided decompositions), while the matching lower bound is obtained by applying the separation procedure to an explicit excluded pattern.
Let Ar(G) = Gr.
Since Gr ⊆ Gr + 1
for all G, Theorem~5 gives the
monotone inclusions
FA1 ⊆ FA2 ⊆ ⋯, FA1, n ⊆ FA2, n ⊆ ⋯, FA1, e ⊆ FA2, e ⊆ ⋯.
Strictness can be witnessed by an explicit separating family. Fix r ≥ 1 and define Fr + 1sep
as follows: start from a cycle C2r + 3 on
vertices x0, x1, …, x2r + 2
(indices modulo 2r + 3) and
add a single chord {x0, xr + 1}.
In Fr + 1sep,
the endpoints of the chord are at distance exactly r + 1 in the cycle C2r + 3 (via the shorter
arc), hence the pair {x0, xr + 1} is attention-visible
under Ar + 1 but not under Ar.
Any Ar-guided
decomposition step that would introduce this chord as an ``ear
interface'' necessarily requires attention visibility between its
endpoints, which fails under Ar but holds under Ar + 1.
Consequently,
Fr + 1sep ∈ FAr + 1 and Fr + 1sep ∉ FAr.
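The separating family and the r-hop attention relation on patterns are easy to construct explicitly; the sketch below (our names, taking EFA to be the r-th power of the pattern, the natural choice under Ar(G) = Gr) can be combined with the elimination-order test from the previous section to check membership empirically.

```python
def chorded_cycle(r):
    """F_sep^{r+1}: the cycle C_{2r+3} on vertices 0..2r+2 plus the chord
    {0, r+1}, whose endpoints are at cycle-distance exactly r+1."""
    n = 2 * r + 3
    edges = {frozenset((i, (i + 1) % n)) for i in range(n)}  # the cycle
    edges.add(frozenset((0, r + 1)))                         # the chord
    return list(range(n)), edges

def power_edges(V, E, r):
    """Pairs at graph distance <= r: the pattern-level analogue of A_r = G^r."""
    adj = {u: set() for u in V}
    for e in E:
        x, y = tuple(e)
        adj[x].add(y); adj[y].add(x)
    out = set()
    for s in V:
        dist, frontier = {s: 0}, [s]
        while frontier:                       # breadth-first search from s
            nxt = []
            for x in frontier:
                if dist[x] == r:
                    continue                  # do not expand past radius r
                for y in adj[x]:
                    if y not in dist:
                        dist[y] = dist[x] + 1
                        nxt.append(y)
            frontier = nxt
        out |= {frozenset((s, t)) for t, d in dist.items() if d >= 1}
    return out

# Expected behavior (combining with admits_A_guided_decomposition above):
# V, E = chorded_cycle(r)
# admits_A_guided_decomposition(V, E, power_edges(V, E, r))      # None: excluded
# admits_A_guided_decomposition(V, E, power_edges(V, E, r + 1))  # an order: member
```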
Applying maximality (Theorem~4) to Fr + 1sep ∉ FAr
yields graphs Gr, Hr
such that SAT-2FGNN[Ar] produces
identical stabilized representations on Gr, Hr
while hom(Fr + 1sep, Gr) ≠ hom(Fr + 1sep, Hr).
Thus increasing the attention radius strictly enlarges the homomorphism
basis that determines χG(⋅).
As a simple motif-counting corollary, note that plain cycles Cℓ already admit width-2 decompositions whose extensions are witnessed along edges, hence Cℓ ∈ FA1. In contrast, chorded-cycle motifs of the form above separate successive radii; therefore any attempt to reconstruct subgraph statistics that linearly depend on hom(Fr + 1sep, ⋅) (e.g., via spasm expansions) necessarily requires radius at least r + 1.
Assume degA(G)(u) ≤ Δ for all G and all u ∈ VG. Theorem~6 implies the existence of an infinite family of excluded patterns of small width but requiring (Δ + 1)-way branching in any A-guided unfolding. A concrete choice is the star SΔ + 1 = K1, Δ + 1 (or, if one prefers to avoid high degrees in the pattern itself, its 1-subdivision). Intuitively, hom(SΔ + 1, G) = ∑v ∈ VGdegG(v)Δ + 1 depends on a (Δ + 1)st degree moment at the image of the center. If attention reveals at most Δ neighbors of any vertex, then the refinement updates never aggregate over a witness set rich enough to reconstruct such moments uniformly across all graphs, and the A-guided decomposition condition fails for SΔ + 1 (and for infinitely many related branching patterns).
Maximality therefore yields, for each such excluded branching pattern F, explicit graphs GF, HF that are indistinguishable by SAT-2FGNN[A] yet satisfy hom(F, GF) ≠ hom(F, HF). In particular, bounded-degree attention imposes a barrier: no matter how many refinement rounds we run, there exist fixed motifs (depending only on Δ) whose homomorphism counts cannot be recovered from the stabilized representation. Via the spasm identity sub(F, G) = ∑F̃ ∈ Spasm(F)α(F, F̃)hom(F̃, G), the same barrier transfers to subgraph-counting tasks whenever Spasm(F) contains such a branching obstruction.
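The obstruction target itself is trivial to evaluate exactly, which is convenient for the experiments later; the following one-line Python helper (ours) computes the degree moment in question.

```python
def star_hom(G_adj, k):
    """hom(K_{1,k}, G) = sum_v deg(v)^k: the center maps to v and each of the
    k leaves maps independently to a neighbor of v."""
    return sum(len(nbrs) ** k for nbrs in G_adj.values())
```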
At the opposite extreme, suppose the input provides a tree decomposition of G of width k and A is chosen to connect all pairs of vertices co-occurring in a bag and to connect vertices across adjacent bags (a sparse chordalization). Then |EA(G)| = O(k|VG|), yet the attention relation exposes the separators needed for dynamic-programming style propagation. As formalized in Theorem~7, a k-tuple extension of SAT-2FGNN[A] matches the homomorphism expressivity of k-FWL on arbitrary graphs, hence determines hom(F, G) for every pattern F with tw(F) ≤ k.
This regime yields a different kind of tight hierarchy: increasing the supplied decomposition width strictly increases the recoverable pattern family (e.g., Kk + 2 separates width k from k + 1), while keeping attention sparse. Moreover, because the characterization is by equality of homomorphism counts over all tw ≤ k patterns, it immediately implies positive motif-counting statements for any fixed-width motif class (cycles, series–parallel motifs, and in general bounded-treewidth templates), in sharp contrast with the uniform obstructions under bounded-degree attention.
Taken together, these three instantiations realize the promised ``tightness principle'': weakening attention constraints (larger r, larger Δ, or richer decomposition guidance) enlarges FA monotonically, and every strict enlargement is witnessed by an explicit separating pattern family and corresponding indistinguishable-but-count-different graph pairs.
Our theorems are qualitative and worst-case, hence it is natural to complement them by experiments that isolate the role of the attention graph A(G) in motif counting and, in particular, test the predicted monotonicity and obstructions as we vary the attention regime. The guiding principle is to treat SAT-2FGNN[A] as a representation map G ↦ χG(G) and to probe which homomorphism statistics hom(F, G) are recoverable from χG(G) by a simple readout (e.g., a linear regressor on pooled stabilized colors, or its neural analogue on final embeddings). We emphasize that the relevant targets are homomorphism counts rather than induced-subgraph counts; where desired, the latter can be derived via spasm expansions and thus inherit the same limitations.
We will implement three attention generators, matching the regimes
discussed above.
First, r-hop attention Ar(G) = Gr,
where we can directly vary r
and empirically test the strict hierarchy by evaluating a fixed set of
patterns across radii. Second, bounded-degree attention: for each u we restrict NA(G)(u)
to at most Δ neighbors, using
either (i) random subsampling among NG(u)
(to test worst-case behavior), or (ii) a canonical rule such as taking
the Δ lexicographically
smallest neighbors with respect to vertex labels/IDs (to preserve
isomorphism-equivariance in the synthetic setting). Third,
decomposition-guided sparse chordalizations (when a decomposition is
available), to verify that sparsity need not preclude strong motif
recovery provided separators are exposed to attention.
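A sketch of the three generators, in the same Python conventions as before (power_edges as defined earlier; all other names ours). Note that the bounded-degree variant built from per-vertex truncation yields a symmetric relation whose degrees can exceed Δ at vertices kept by many neighbors; a stricter variant would intersect the kept sets.

```python
def attention_r_hop(V, E, r):
    """A_r(G) = G^r: connect pairs at graph distance at most r."""
    return power_edges(V, E, r)

def attention_bounded_degree(V, E, delta, rng=None):
    """Per-vertex truncation to at most delta neighbors: canonical
    (lexicographically smallest) by default, random subsampling if rng given."""
    adj = {u: set() for u in V}
    for e in E:
        x, y = tuple(e)
        adj[x].add(y); adj[y].add(x)
    kept = set()
    for u in V:
        nbrs = sorted(adj[u])
        if rng is not None and len(nbrs) > delta:
            nbrs = rng.sample(nbrs, delta)   # worst-case, non-equivariant choice
        for v in nbrs[:delta]:
            kept.add(frozenset((u, v)))
    return kept

def attention_from_tree_decomposition(bags, T_edges):
    """Sparse chordalization from a supplied width-k tree decomposition:
    connect all pairs inside each bag and across adjacent bags, so that
    separators are exposed to attention (naively O(k^2 |V(T)|) edges)."""
    EA = set()
    for b in bags.values():
        EA |= {frozenset((u, v)) for u in b for v in b if u != v}
    for s, t in T_edges:
        EA |= {frozenset((u, v)) for u in bags[s] for v in bags[t] if u != v}
    return EA
```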
For r-hop monotonicity we will use the explicit separating patterns Fr + 1sep (chorded cycles) defined earlier, together with baseline patterns known to be easy already at r = 1 (e.g., cycles Cℓ and small series–parallel motifs). The prediction is that, holding the model architecture and training protocol fixed, the error in predicting hom(Fr + 1sep, G) should drop sharply when moving from Ar to Ar + 1, while exhibiting no comparable change for patterns already in FAr. For bounded-degree attention we will include branching obstructions such as SΔ + 1 = K1, Δ + 1 (and its 1-subdivision), where the theory predicts a depth-independent failure. Concretely, we expect that for fixed Δ, increasing the number of refinement rounds (or transformer layers) should not yield systematic improvement on targets that depend on (Δ + 1)-way branching, while it should still improve performance on patterns compatible with the A-guided unfolding constraint.
We will use two complementary data sources. On the one hand, we will generate synthetic graphs with controlled local statistics (Erdős–Rényi, configuration-model, and planted-motif families), varying |V| and degree distributions to test generalization across sizes and sparsity. On the other hand, to directly witness maximality phenomena, we will include worst-case pairs (GF, HF) produced by the separation procedure of Theorem~4 for selected excluded patterns F ∉ FA. These instances serve as a crisp sanity check: under the appropriate attention regime, the representations (or their neural approximations) should be indistinguishable up to numerical tolerance, while hom(F, GF) ≠ hom(F, HF). We will report both representation distances (e.g., ℓ2 on pooled embeddings) and readout failures (inability to linearly separate/predict the differing hom-counts).
For each graph G and each pattern F in a fixed suite (typically |VF| ≤ 8 so that exact computation is feasible), we compute hom(F, G) exactly by brute-force enumeration for small |VG| and by specialized routines where available (e.g., adjacency-matrix formulas for walks/cycles, and meet-in-the-middle for small-width patterns). We then train a readout hF on top of the model representation to predict hom(F, G) (regression) or to predict thresholded events hom(F, G) ≥ τ (classification). We will favor linear readouts to better isolate representational sufficiency: if χG(G) in fact determines hom(F, G) over the distribution, a linear probe should succeed with modest sample sizes; conversely, persistent failure under linear probes is evidence (though not proof) that the representation lacks the required information.
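A minimal probing sketch follows, assuming the feature for each graph is a fixed-length histogram over a shared vocabulary of stabilized color classes (our construction; any injective pooled encoding would do). Persistent held-out failure of such a probe is the evidence, not proof, referred to above.

```python
import numpy as np
from collections import Counter

def color_histogram(chi, vocab):
    """Feature vector: multiplicity of each stabilized color class in vocab."""
    counts = Counter(chi.values())
    return [counts.get(c, 0) for c in vocab]

def linear_probe_rmse(features, targets, train_frac=0.8, seed=0):
    """Least-squares linear readout hom(F, G) ~ W phi(G) + b, held-out RMSE."""
    rng = np.random.default_rng(seed)
    X = np.asarray(features, dtype=float)
    y = np.asarray(targets, dtype=float)
    idx = rng.permutation(len(y))
    cut = int(train_frac * len(y))
    tr, te = idx[:cut], idx[cut:]
    Xb = np.hstack([X, np.ones((len(y), 1))])          # bias column
    w, *_ = np.linalg.lstsq(Xb[tr], y[tr], rcond=None)
    resid = Xb[te] @ w - y[te]
    return float(np.sqrt(np.mean(resid ** 2)))
```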
Since SAT-2FGNN[A] is an idealized refinement with perfect hashing, we will explicitly separate (i) structural limitations due to attention locality from (ii) approximation limitations due to finite-dimensional neural parameterizations. We will therefore compare three variants: (a) a discrete implementation that maintains exact colors χ(t)(u, v) via dictionary hashing (serving as the ``ground-truth’’ refinement), (b) a neuralized pair-update network that replaces hashing by an MLP with sum-multiset aggregation, and (c) a standard sparse-attention transformer block on node embeddings augmented with edge indicators for attended pairs. For each, we will ablate whether the initial atomic type includes the indicator 𝕀[{u, v} ∈ EG], since our theory relies on distinguishing true edges from merely attended non-edges. We will also ablate tuple-tracking schemes (all pairs versus only (u, v) with v ∈ NA(u)) to quantify the runtime–accuracy tradeoff suggested by the sparse-tuple complexity discussion.
Across these experiments we expect to observe: (i) monotone improvements with r on the separating chorded-cycle family, matching FAr ⊊ FAr + 1; (ii) depth-independent failures under bounded-degree attention on branching motifs beyond Δ, consistent with Theorem~6; and (iii) a gap between the discrete refinement and its neural approximations that narrows with width and training but does not eliminate the structural failures imposed by the attention regime. These outcomes would empirically corroborate that the primary driver of motif-counting power in this setting is the attention-constrained information flow captured by FA, rather than optimization artifacts.
The present analysis sits at the intersection of (i) expressivity theory for transformers and attention mechanisms, (ii) the Weisfeiler–Leman (WL) and folklore-WL (FWL) refinements for graphs, (iii) logical characterizations of invariant graph representations, and (iv) homomorphism-based and polynomial bases for graph statistics. Our main point of contact with this literature is that we treat a sparse-attention transformer block as an attention-constrained refinement whose information flow is constrained by an auxiliary attention graph A(G), and we characterize its discriminative power by an associated maximal homomorphism family FA.
A large body of work studies the representational power of attention-based architectures, both in the original sequence setting and in permutation-equivariant regimes such as DeepSets and Set Transformers . In graph learning, graph transformers replace message passing with attention over (possibly all) vertex pairs, and their expressivity is typically discussed in terms of permutation equivariance and the ability to simulate higher-order interactions . Dense attention can make 2-tuple information accessible in one step, but is quadratic in |V|; practical models therefore use sparse attention patterns (local windows, landmark nodes, random links) or structural sparsifiers. Our framework abstracts this choice by an isomorphism-equivariant generator A and isolates the consequences of the resulting locality constraint on what can be distinguished or counted.
The classical 1-WL color refinement captures the expressive envelope of message-passing GNNs under standard assumptions; higher-order WL procedures (notably k-WL) characterize the power of k-tuple refinements and are closely tied to invariant/equivariant tensor architectures . The ``folklore'' variants (FWL) refine ordered tuples using adjacency-aware multisets and are known to connect more directly to homomorphism counting and to certain families of graph polynomials . Our SAT-2FGNN[A] can be viewed as a 2-tuple refinement on the paired structure (V, EG, EA), with the crucial design choice that the initial atomic type of (u, v) includes the indicator 𝕀[{u, v} ∈ EG] so that true edges in G are not conflated with merely attended non-edges. This places our results conceptually between (a) full 2-FWL on G (when A(G) is dense) and (b) locality-restricted refinements that only aggregate along a prescribed sparse neighborhood.
Sparse attention mechanisms have been studied extensively for long sequences (e.g., BigBird, Longformer, Reformer) and for graphs, typically with the goal of reducing computational cost while retaining enough connectivity to propagate information . Most theoretical results in this line are either approximation-oriented (showing that sparse patterns can approximate dense attention under distributional assumptions) or universality-oriented (showing that, with sufficient width and appropriate tokens, sparse attention can approximate broad function classes). Our focus is different: we fix the refinement idealization (perfect hashing, multiset aggregation) and ask for structural limits imposed solely by the attention graph. In particular, we prove strict hierarchies under attention inclusion and depth-independent obstructions under bounded attention degree, which are not captured by approximation-universality statements.
WL-type equivalences admit well-known logical counterparts in counting logics, and these logics can be refined by imposing locality or guardedness conditions . From this perspective, the attention graph induces a ``guard'' relation: when updating (u, v), quantification over a witness w is restricted to NA(u) ∪ NA(v). Consequently, indistinguishability by SAT-2FGNN[A] corresponds to equivalence in a fragment of C2 in which binary extensions are permitted only along EA (while EG remains available as a predicate in atomic formulas). Although we do not develop the full logical apparatus here, this viewpoint is consistent with our decomposition characterizations: canonical decompositions and attention-guided unfoldings play the same role as guarded bisimulations and tree-like unravellings in finite model theory.
A central organizing principle for invariant graph representations is that many natural graph statistics can be expressed as linear combinations of homomorphism counts, and conversely that homomorphism counts provide canonical coordinates for graph parameters . The modern refinement of this principle, developed in particular by Dell–Grohe–Rattan and subsequent works, identifies for several WL/FWL procedures an explicit pattern family F such that two graphs are equivalent under the procedure if and only if their homomorphism counts agree on all patterns in F . Our contribution is to extend this style of characterization to attention-constrained refinements: we show that for each isomorphism-equivariant A there is a unique maximal family FA (and rooted variants) witnessing exactly the homomorphism statistics determined by the stabilized colors. Moreover, we couple completeness with maximality by constructing, for each excluded pattern, explicit indistinguishable graph pairs with differing homomorphism counts, in the spirit of CFI/Fürer-type separations adapted to the attention relation.
Another strand of expressivity theory frames GNN/transformer representations as generating algebras of invariant polynomials, such as walk-count features derived from powers of the adjacency matrix, or trace/tensor invariants arising from higher-order architectures . In the dense-attention 2-tuple setting, stabilized colors can be related to a rich collection of WL-invariant polynomials; under sparse attention, however, the closure properties of the generated invariants become constrained by the allowed extensions along EA. Our homomorphism-family viewpoint provides a concrete description of this constraint: to the extent that a target invariant can be expanded into homomorphism counts whose spasm stays within FA, it is representable; when the spasm leaks outside FA, maximality yields worst-case failures. Thus the attention graph controls not only discriminative power but also the effective polynomial basis accessible to the model.
We view the main message as follows. Sparse attention is not merely an optimization trick; when made isomorphism-equivariant, it induces a precise refinement calculus whose information flow can be characterized exactly by an A-dependent homomorphism basis and an A-guided width-2 decomposition theory. This positions our results as a bridge between practical sparse-attention design (choice of A(G)) and classical graph-parameter theory (homomorphisms, treewidth, and WL), and it provides a template for extending expressivity analyses to other structured sparsification regimes beyond those treated explicitly here.
Our results provide a worst-case and tight account of what a sparse-attention 2-tuple refinement can count, given an isomorphism-equivariant attention generator A subject to a sparsity promise. Several directions remain open; we highlight those that appear technically closest to the present framework and most relevant for practice.
In applications, the attention graph is rarely prescribed a priori.
Instead, it is often learned from intermediate features (e.g., by top-k similarities, locality-sensitive
hashing, or routing mechanisms). This raises a basic question: when can
we treat a learned attention rule as an isomorphism-equivariant
generator A in our sense? If attention
is computed by an equivariant scoring function and then sparsified by a
deterministic rule, then equivariance is threatened by tie-breaking: the
map ``keep the top-k neighbors'' is not canonical in the presence of equal scores, and any arbitrary tie resolution can break invariance. One open problem is to design canonical sparsifiers (or randomized sparsifiers with controlled failure probability) that preserve equivariance while remaining implementable. Another is to extend our characterization to feature-dependent generators A(t) that evolve with refinement depth: even if each A(t) is equivariant as a function of the current coloring, the resulting process is no longer a fixed refinement on a fixed two-relation structure (V, EG, EA). We expect that an ``on-the-fly''
unfolding analysis remains possible, but maximality should become
subtler: the excluded patterns may depend on the entire attention policy
class rather than a single A. A
concrete goal is a dichotomy of the following form: characterize those
policy classes for which the induced expressive envelope still admits a
homomorphism basis F that is
independent of the input distribution and independent of training.
We idealize updates by perfect hashing of multisets and stabilization
in the limit. Real networks operate with finite precision, continuous
activations, and often a bounded number of layers. It is therefore
natural to ask which parts of the theory survive under approximation.
One possible route is to define an approximate refinement in which the hash is
replaced by a Lipschitz map and multiset aggregation is only
approximate, and then study whether equality of representations implies
approximate equality of hom(F, ⋅) for patterns F ∈ FA (and
conversely). Here one expects a quantitative dependence on pattern size,
depth, and separation margins in feature space. A different perspective
is to replace exact refinement equivalence by a probabilistic detectability notion: if two graphs
differ on hom(F, ⋅) for some
F ∈ FA,
can a bounded-depth, bounded-width instantiation of the model detect
this difference with high probability under random initialization or
random projections? Such questions connect to probabilistic versions of
WL and to smoothed analysis of color refinement. A technically clean
open problem is to prove stability bounds of the form
∥χG(G) − χH(H)∥ ≥ ε(F) ⋅ |hom(F, G) − hom(F, H)|
for suitable norms and for restricted families of graphs, thereby
turning the existence of FA into a robust
identification statement.
Our setting uses a 2-tuple refinement with locality controlled by A(G). A natural generalization is a k-tuple variant in which states live on ordered k-tuples and aggregation extends a tuple by a single vertex, but only along attention-neighborhood constraints. There are at least two non-equivalent choices: attention may be defined on vertices and then lifted to tuples (allowing witnesses in the union of neighborhoods of the tuple coordinates), or attention may be defined directly on tuples (which is more expressive but harder to sparsify). Either way, one expects a family FA(k) that interpolates between k-FWL (dense access) and strictly weaker locality-constrained procedures. An open problem is to identify the correct analogue of our A-guided width-2 decompositions: for k ≥ 3, do we obtain an A-guided decomposition theory of width k (in the sense of tree decompositions with bags of size k) or a different combinatorial object? Closely related is the algorithmic side: can one implement a sparse k-tuple refinement with near-linear dependence on |EA| for fixed k by maintaining only a carefully chosen subset of tuples, while retaining the same homomorphism basis?
Graph transformers frequently augment nodes (and sometimes pairs) with positional information: Laplacian eigenvectors, random features, shortest-path distances, or more elaborate structural encodings. From our viewpoint, such encodings effectively enrich the input structure by additional unary or binary predicates, which can either preserve or destroy isomorphism equivariance. For instance, Laplacian eigenvectors are not canonically defined when eigenvalues have multiplicity, and even for simple eigenvalues there is a sign ambiguity; injecting such features can therefore act as symmetry breaking beyond our intended setting. An open direction is to delineate the boundary between (i) encodings (invariantly definable from G, hence safely incorporated into ℓ or into an expanded relational signature) and (ii) encodings that implicitly select an ordering and thus yield expressivity not captured by homomorphism counts. On the canonical side, one may ask how the basis FA changes when we add relations such as all-pairs distances up to radius r, or when we allow A(G) itself to be derived from such relations. On the non-canonical side, one may seek probabilistic statements: random sign choices or random projections may give a model the ability to distinguish graphs with high probability, but then the correct analogue of FA may be a distribution-dependent family of statistics rather than a worst-case homomorphism basis.
Finally, we view a conceptual lesson as follows. Once attention is restricted, additional design choice (learned sparsity, landmarks, encodings, decomposition oracles) should be treated as part of the input structure whose invariance properties must be specified. A systematic open problem is to develop a calculus in which such choices come with explicit axioms (equivariance, sparsity, bounded radius, bounded degree, decomposition access) and in which the corresponding maximal homomorphism families and separation gadgets can be derived modularly. Achieving this would turn expressivity analysis into a reusable tool for sparse-attention architecture design, rather than a collection of one-off characterizations.