In contemporary graph learning practice we rarely deploy a ``bare’’ learning architecture on an input attributed graph G = (V, E, X). Rather, we employ a two-stage pipeline: an explicit preprocessing routine fpre first transforms G into a derived graph G″ = (V″, E″, X″), after which a downstream predictor (a message-passing GNN, a graph Transformer, a kernel method, or a retrieval-based mechanism) operates on G″. This division of labor is often motivated by engineering considerations—data cleaning, feature normalization, positional encodings, subgraph extraction, augmentation, or indexing—but it has a direct mathematical consequence: the information available to the learner is not constrained by the inductive bias of the architecture alone, but by the joint power of (fpre, learner).
We emphasize that fpre is not merely ``feature engineering'' in the weak sense of composing a fixed map \(\mathbb{R}^d\to\mathbb{R}^{d'}\) at each node. Modern preprocessors can alter topology (adding edges to represent proximity, adding supernodes, sparsifying or densifying), can attach global numerical summaries (all-pairs distances, spectral embeddings, effective resistances), and can call external tools (linear solvers, shortest-path routines, motif counters, canonicalization packages) whose computational strength is not reflected in the stated model class. In many benchmark settings these operations are silently permitted, and their cost is not accounted for. Consequently, when we attribute performance differences to ``expressiveness'' of the learner, we may in fact be measuring expressiveness of the preprocessing pipeline.
This confound becomes acute in regimes where preprocessing is tantamount to solving the task. At the extreme, if a preprocessor computes the target labels (or a succinct certificate thereof) and stores them in X″, then the downstream model need only read off this information locally. Such constructions are not pathological; they capture a spectrum of common practices. For instance, a pipeline that augments each node with its shortest-path distances to a designated set of anchors can make long-range reachability predicates locally recognizable. A pipeline that attaches spectral or diffusion-based positional encodings can expose global connectivity patterns through constant-radius neighborhoods in G″. A pipeline that stores subgraph counts or graphlet histograms can reveal global motif structure to a model that otherwise only aggregates within a few hops. From the standpoint of what the learner must compute after preprocessing, these pipelines reduce the ``effective'' difficulty of the task by pushing computation into fpre.
We therefore require a vocabulary that separates architectural limitations from preprocessing power. The natural notion, and the one we adopt, is locality after preprocessing. Informally, a node-level label is k-local on G″ if it is determined by the induced k-hop neighborhood G″[NkG″(u)]; analogous definitions apply to edge- and graph-level tasks. Once this is formalized, we can ask a concrete question: what is the minimum radius rpre(P, fpre) at which the composed task P ∘ fpre becomes local on the preprocessed instance family? A small value of rpre indicates that preprocessing has collapsed the task into a local readout problem, regardless of the downstream architecture.
This perspective also clarifies why comparisons against Weisfeiler–Leman (WL) type baselines can fail to diagnose genuine architectural advantages when preprocessing is unconstrained. The standard narrative is that message-passing GNNs are bounded (in various senses) by 1-WL, higher-order GNNs correspond to higher-dimensional WL variants, and graph Transformers may or may not exceed these limitations depending on their positional encodings and attention mechanisms. Yet WL comparisons implicitly assume that the only available input is the raw G with its original features X. If we allow an arbitrary fpre, then the WL hierarchy is no longer a meaningful ceiling: preprocessing can write into X″ any globally computed invariant, including invariants that are strictly stronger than any fixed-dimensional WL refinement on the original graph. In that case, a 1-WL-equivalent architecture run on G″ can appear to ``beat’’ 1-WL on G, not because the architecture has transcended WL, but because the task has been rewritten into a locally checkable form.
Even more pointedly, the failure mode is not merely that preprocessing can add global information; it can add exactly the information needed for the prediction objective. In a supervised setting, it is common to tune preprocessing choices on validation performance. Without an explicit accounting of the preprocessing budget and its informational content, one can inadvertently (or deliberately) bake label information into features through leakage channels such as near-duplicate retrieval, graph-level metadata joined from external sources, or computed statistics that are highly correlated with the label and require global computation to obtain. The resulting system may generalize within the benchmark but does not support claims about the intrinsic representational power of the learner.
We thus advocate that any claim of expressiveness or limitation should be stated relative to a declared preprocessing regime. Concretely, we must specify: (a) what computations fpre is allowed to perform, (b) the associated budget Tpre(G) measured in an explicit centralized model (optionally including oracle calls), and (c) the post-preprocessing locality radius rpre(P, fpre) (exact, certified upper bound, or empirical diagnostic). This reporting does not forbid powerful preprocessing; rather, it makes transparent when performance is achieved by global computation performed upstream.
Finally, we note that the above concerns apply equally to GNNs and graph Transformers, and extend beyond neural networks. Retrieval-augmented systems, for example, can be viewed as preprocessors that build indices and attach retrieved exemplars or templates as additional features; this is a form of external computation whose complexity is carried by Tpre. Similarly, tool-using pipelines that call shortest-path solvers, linear system solvers, or motif counters are preprocessors with substantial algorithmic power. Without a budgeted analysis, it is impossible to distinguish whether a downstream learner is exploiting a genuinely nonlocal reasoning mechanism or merely consuming a locally accessible encoding of a globally computed quantity.
The remainder of this work formalizes these intuitions. We treat
preprocessing as a first-class computational stage and analyze when it
can, and cannot, collapse global graph properties into local decision
rules. This leads to a principled separation between
``architectural expressiveness'' and ``preprocessing leakage,''
and provides a language in which fairness of comparisons can be
assessed.
A convenient abstraction for understanding graph learning pipelines is the analogy with round-based distributed computing. A depth-k message-passing architecture (MPNN) can be viewed as performing k synchronous communication rounds on G: in each round, every node sends a message to its neighbors, aggregates received messages, and updates its internal state. When we bound the per-edge message size to O(log n) bits (or, more generally, to polylog(n)), this abstraction aligns with the CONGEST model; if messages are unbounded, it aligns with the LOCAL model. In both cases, after k rounds, the state at a node u is a function only of the labeled induced neighborhood G[NkG(u)], hence any node-level prediction produced at u is necessarily k-local in the sense that it depends only on NkG(u) and the features therein. This observation, which underlies the standard locality-based limitations of MPNNs, is independent of training and applies equally to deterministic and randomized algorithms (conditioning on the randomness).
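This locality claim can be sanity-checked in a few lines of plain Python. The sketch below (helper names `ball` and `mp_rounds` are ours, not a library API) runs k synchronous rounds of anonymous message passing and confirms that the round-k state at a node computed on the full graph coincides with the state computed on the induced k-hop ball alone.

```python
from collections import deque

def ball(adj, src, k):
    # Nodes at distance <= k from src (breadth-first search).
    dist, q = {src: 0}, deque([src])
    while q:
        u = q.popleft()
        if dist[u] < k:
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
    return set(dist)

def mp_rounds(adj, feat, k):
    # k synchronous rounds: new state = (own state, sorted multiset of neighbor states).
    state = {u: (feat[u],) for u in adj}
    for _ in range(k):
        state = {u: (state[u], tuple(sorted(state[v] for v in adj[u])))
                 for u in adj}
    return state

# 10-cycle with small integer features; inspect node u after k = 2 rounds.
n, k, u = 10, 2, 0
adj = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
feat = {i: i % 3 for i in range(n)}

nodes = ball(adj, u, k)
sub = {v: [w for w in adj[v] if w in nodes] for v in nodes}
# The round-k state at u is fully determined by the induced k-hop ball.
assert mp_rounds(adj, feat, k)[u] == mp_rounds(sub, {v: feat[v] for v in nodes}, k)[u]
```

The assertion holds because the dependence cone of u's round-k state is exactly NkG(u), all of whose internal edges are preserved by the induced subgraph.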
The role of preprocessing is naturally interpreted as an additional stage occurring before the round-limited message passing begins. In distributed terminology, this resembles augmenting the instance with advice (node/edge labels) or with a locally checkable proof produced by a powerful prover, after which a limited-round verifier checks the property. In our setting, the preprocessor fpre plays the role of this prover: it may perform substantial centralized computation and then write the outcome into X″, or modify the topology to make relevant information accessible within small radius. Once such a stage is permitted, the effective limitation of round-limited message passing is no longer controlled solely by the learner depth; instead, it is controlled by the locality radius of the task P ∘ fpre. The point of the (Tpre, rpre)-view developed later is that it makes this separation explicit: the downstream architecture may still be k-local in the raw graph G, but it can become O(1)-local in the derived graph G″ because preprocessing has already performed the global work.
The distinction between local and global properties is standard in both distributed computing and descriptive complexity, and we recall the central mechanism: indistinguishability of local views. Fix k ≥ 0. Two rooted graphs (G, u) and (H, v) are k-locally indistinguishable if there exists an isomorphism between the rooted induced neighborhoods G[NkG(u)] and H[NkH(v)] preserving all node and edge features. Any k-local node-level rule must assign the same label to u and v on such instances. Therefore, to show that a task is not k-local on some family, it suffices to construct pairs of instances with identical k-hop views around the relevant node/edge but different ground-truth labels. For edge- and graph-level tasks, one uses analogous indistinguishability statements for neighborhoods of edges or for all nodes simultaneously. In particular, many connectivity-type properties are global: for every fixed k, one can find graphs with large girth or large diameter in which all k-hop neighborhoods look like trees, while the property of interest depends on a cycle or on a distant connection.
Biconnectivity provides a concrete example of this phenomenon. Consider the edge-level property Bridge(G) ∈ {0, 1}E indicating whether an edge is a bridge, and the node-level property Cut(G) ∈ {0, 1}V indicating whether a node is an articulation point. On raw, unannotated graphs, neither property is constant-radius local on unrestricted families. For bridges, fix k and take a cycle CL of length L ≫ k and a path PL of length L. Choose an edge e far from the endpoints of PL. Then the induced k-hop neighborhood around e in both CL and PL is a length-(2k + 1) path with all internal degrees equal to 2, hence the rooted edge neighborhoods are isomorphic; yet e is a bridge in PL and not a bridge in CL. Thus bridge detection is not k-local for any fixed k on this family. A similar construction witnesses globality for cut vertices: take a ``lollipop'' obtained by attaching a long path to a cycle at a node u, and a ``theta'' graph obtained by additionally reconnecting the far end of that path to the cycle at a node distant from u; for L ≫ k, the k-hop neighborhoods around u in the two graphs are identical, while u is a cut vertex only in the lollipop. These are representative of a general pattern: local neighborhoods cannot certify the presence or absence of a distant alternative route.
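The cycle-versus-path witness for bridges can be checked mechanically. The sketch below (plain adjacency dicts; the helper names `ball`, `induced`, `looks_like_path`, and `is_bridge` are ours) verifies that the two k-hop edge views are isomorphic — both are paths on 2k + 2 nodes — while exactly one of the two edges is a bridge.

```python
from collections import deque

def ball(adj, src, k):
    # Nodes at distance <= k from src (BFS).
    dist, q = {src: 0}, deque([src])
    while q:
        u = q.popleft()
        if dist[u] < k:
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
    return set(dist)

def induced(adj, nodes):
    return {u: [v for v in adj[u] if v in nodes] for u in nodes}

def looks_like_path(adj):
    # Degree profile of a simple path (connectivity holds by construction here).
    degs = sorted(len(vs) for vs in adj.values())
    return degs.count(1) == 2 and all(d in (1, 2) for d in degs)

def is_bridge(adj, u, v):
    # (u, v) is a bridge iff deleting it disconnects u from v.
    seen, q = {u}, deque([u])
    while q:
        x = q.popleft()
        for y in adj[x]:
            if (x, y) not in ((u, v), (v, u)) and y not in seen:
                seen.add(y)
                q.append(y)
    return v not in seen

L, k = 50, 3
cycle = {i: [(i - 1) % L, (i + 1) % L] for i in range(L)}
path = {i: [j for j in (i - 1, i + 1) if 0 <= j < L] for i in range(L)}

# k-hop view around an edge = union of the endpoints' k-balls.
view_c = induced(cycle, ball(cycle, 0, k) | ball(cycle, 1, k))
view_p = induced(path, ball(path, 25, k) | ball(path, 26, k))

# Identical k-hop edge views: both are paths on 2k + 2 nodes ...
assert looks_like_path(view_c) and looks_like_path(view_p)
assert len(view_c) == len(view_p) == 2 * k + 2
# ... yet only the path edge is a bridge.
assert is_bridge(path, 25, 26) and not is_bridge(cycle, 0, 1)
```

The two views are paths of the same length, hence isomorphic as rooted edge neighborhoods (the roots align by symmetry in this construction).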
The same tasks become locally decidable if preprocessing appends
sufficiently informative global features. Effective resistance
(resistance distance) is a canonical example because it encodes global
connectivity through a numerical invariant computable from the Laplacian
pseudoinverse. Let R(u, v) denote the
effective resistance between nodes u and v in an unweighted connected graph,
interpreted as the voltage drop induced by one unit of current injected
at u and extracted at v. Suppose the preprocessor fpre augments G with exact values R(u, v) (e.g.,
stored as edge features on the complete graph on V, or accessible on demand). Then
the bridge predicate becomes 0-local:
an existing edge (u, v) ∈ E is a
bridge if and only if R(u, v) = 1.
Consequently, after preprocessing, a local rule can label each edge by
reading a single scalar. Likewise, cut-vertex detection becomes 1-local: a node u is a cut vertex if and only if
there exist neighbors s, t ∈ N(u)
such that the resistance metric satisfies the additive condition
R(s, t) = R(s, u) + R(u, t),
which witnesses that all current (and hence all paths in an appropriate
metric sense) between s and
t must pass through u. This condition can be checked by
a 1-hop local verifier that inspects
u, its neighbors, and the
pairwise resistances among {u} ∪ N(u). In
both cases, we obtain a sharp contrast: the raw tasks are global on
natural instance families, but once a heavy global precomputation is
allowed, the same tasks collapse to constant-radius locality.
This example also clarifies why budgets must accompany locality claims. Exact all-pairs resistance distances typically require solving linear systems or computing a Laplacian pseudoinverse, with worst-case cost far beyond what is implicit in a ``k-layer message passing’’ description, and storing R explicitly requires Θ(n2) bandwidth in X″. From the standpoint of attribution, a downstream predictor that succeeds using such features is exploiting the global computation performed by fpre, not circumventing locality limitations by architectural means. The formal setup in the next section turns these observations into a measurable quantity rpre(P, fpre) paired with an explicit preprocessing budget Tpre(G), allowing us to state and compare expressiveness claims relative to declared preprocessing power.
We fix a universe of attributed graphs and make explicit which parts of a learning pipeline are treated as input, which are treated as preprocessing, and which are treated as learning. An (attributed) graph is a triple G = (V, E, X) with |V| = n and |E| = m, where X denotes the collection of node/edge features (and, when convenient, a graph-level feature vector). We allow directed or undirected edges and allow multigraphs, but our examples and locality statements are most naturally read for simple undirected graphs. An instance family 𝒢 is any set of graphs closed under the restrictions we wish to impose (e.g., connectivity, bounded degree, planarity, or a fixed feature schema). All locality and budget statements are parametrized by 𝒢; without such a family, even the meaning of ``bounded radius'' claims is typically vacuous.
A graph prediction task (or property) is a mapping P : 𝒢 → 𝒴, with three canonical
output types: node-level (𝒴 = {0, 1}V), edge-level (𝒴 = {0, 1}E), and
graph-level (𝒴 ∈ {0, 1} or ℝ).
We write P(G)u
for the node label of u, P(G)(u, v)
for the edge label, and P(G) for graph-level
outputs. Nothing in the setup requires Boolean labels; multiclass labels
or real-valued regression targets can be handled by replacing {0, 1} with an appropriate codomain. We
emphasize that P is a task
specification; it need not be efficiently computable, although the
complexity TP(G)
in a centralized model will play a role in our leakage upper bounds.
A preprocessing procedure is a function
fpre : 𝒢 → 𝒢″, G ↦ G″ = (V″, E″, X″).
We allow fpre to
change topology (add/delete edges), to add auxiliary nodes, and to
replace or augment features. In particular, fpre may encode global
information in X″
(e.g., distance matrices, spectral quantities, motif counts) or may
create shortcut edges to reduce distances. To compare outputs to the
original task P(G),
we assume that fpre
provides an explicit identification of the original prediction items
inside the derived graph, e.g., an injective map ιV : V ↪ V″
for node tasks and an injective map ιE : E ↪ E″
for edge tasks, or, equivalently, a binary
``is-original'' tag plus a stable ordering. This is a mild bookkeeping requirement: without it, one could trivially ``solve''
a task by permuting items and changing the meaning of coordinates.
For a graph G′ = (V′, E′, X′), we write NkG′(u) for the k-hop neighborhood of a node u ∈ V′, and for an edge (u, v) ∈ E′ we use NkG′((u, v)) := NkG′(u) ∪ NkG′(v). The corresponding labeled induced subgraph is denoted G′[NkG′(u)] (or G′[NkG′((u, v))]), including all node/edge features restricted to the induced subgraph. In all definitions below, isomorphisms are required to preserve features exactly; thus locality is always a statement about the feature schema X″.
We formalize a local post-preprocessing rule as a family of functions
that read only bounded-radius induced neighborhoods. For node-level
prediction on G″, a
rule gk
assigns to each u ∈ V
an output
gk(G″, u) = ψ (G″[NkG″(ιV(u))], ιV(u))
for some function ψ invariant
under rooted isomorphism of the labeled neighborhood. For edge-level
tasks we analogously require
gk(G″, (u, v)) = ψ (G″[NkG″(ιE(u, v))], ιE(u, v)).
For graph-level tasks, we adopt a uniform convention that the
preprocessor may introduce (or designate) a readout node ρ ∈ V″ and that
the output is computed as gk(G″) = ψ(G″[NkG″(ρ)], ρ).
This convention is without loss for our purposes: any constant-size
global readout (a special token, a pooled feature vector, or an appended
``graph node’’) is representable by preprocessing, and our objective is
precisely to attribute such representational power to fpre.
We say that P ∘ fpre is k-local on
𝒢 if there exists a k-local rule gk such that for
all G ∈ 𝒢, the outputs agree
on the original prediction items:
P(G)u = gk(fpre(G), u) ∀u ∈ V, P(G)(u, v) = gk(fpre(G), (u, v)) ∀(u, v) ∈ E,
and similarly for graph-level tasks. We postpone to the next section the
definition of the minimal such k,
which will be our leakage radius rpre(P, fpre).
To compare preprocessors that perform qualitatively different
operations (distance computations, spectral routines, oracle calls,
external solvers), we associate to each fpre a budget Tpre(G). Our
default choice is worst-case running time in the word-RAM model with
O(log n)-bit words,
with the understanding that feature values written to X″ must be representable
at the declared precision. When preprocessing uses additional resources,
we absorb them additively into Tpre, e.g.,
Tpre(G) = tRAM(G) + ∑iαi qi(G) + β ⋅ size(X″(G)),
where qi(G)
counts the number of queries to an oracle/subroutine i (such as a shortest-path oracle, a
linear-system solver, or a #SubGI
oracle), and size(X″(G))
measures the number of machine words written into node/edge features
(and, if relevant, the number of auxiliary nodes/edges created). The
coefficients αi, β
are bookkeeping weights that can be set to 1 when all costs are measured in a common
unit; we keep them explicit only to emphasize that
``time'' and ``feature bandwidth'' are both part of
preprocessing power. In many of our statements we suppress
polylogarithmic factors and write Õ(⋅).
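As a lightweight sketch of this accounting (all field names, weights, and the example oracle are illustrative, mirroring the displayed formula; nothing here is a standard API), the budget can be tracked as follows.

```python
from dataclasses import dataclass, field

@dataclass
class PreprocBudget:
    """T_pre(G) = t_RAM + sum_i alpha_i * q_i(G) + beta * size(X''(G))."""
    t_ram: int = 0                                      # word-RAM steps
    oracle_calls: dict = field(default_factory=dict)    # oracle id -> (q_i, alpha_i)
    words_written: int = 0                              # size(X''(G)) in machine words
    beta: float = 1.0                                   # weight on feature bandwidth

    def total(self) -> float:
        return (self.t_ram
                + sum(q * a for q, a in self.oracle_calls.values())
                + self.beta * self.words_written)

# e.g., a preprocessor making 50 shortest-path-oracle calls (weight 20 each)
# and writing 2500 feature words:
b = PreprocBudget(t_ram=10_000, oracle_calls={"sp_oracle": (50, 20)}, words_written=2_500)
assert b.total() == 10_000 + 50 * 20 + 2_500
```

Keeping oracle calls and feature bandwidth as separate fields makes it easy to report each channel of preprocessing power explicitly, as the text requires.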
Independently of learning, we fix a centralized complexity measure TP(G), defined as the worst-case running time of an algorithm computing P(G) in the same computational model. The pair (Tpre, TP) will serve as the comparison that makes leakage statements meaningful: if Tpre is allowed to match TP up to Õ(1) factors, then fpre can in principle compute the labels and encode them locally; conversely, if Tpre is restricted, then nontrivial lower bounds on post-preprocessing locality become possible.
Finally, to distinguish ``local but computationally unbounded'' readout rules from local rules that plausibly correspond to small neural modules, we parameterize local rules by a verifier class VClass (e.g., TIME(poly), TC0, or a fixed-depth MLP applied to an adjacency-encoded ego graph). This parameter will enter our reporting protocol but does not affect the definition of k-locality itself: locality constrains which information is accessible (bounded-radius induced neighborhoods), while VClass constrains how that information may be processed.
With these conventions in place, we can define leakage as the minimal post-preprocessing radius needed to decide P ∘ fpre, and we can report it together with the preprocessing budget and the verifier class, independently of the downstream architecture.
Fix an instance family 𝒢, a task
P : 𝒢 → 𝒴, and a preprocessing
function fpre : 𝒢 → 𝒢″.
The preceding definitions single out a notion of locality after preprocessing
(bounded-radius induced neighborhoods in G″) and a notion of correctness
(agreement with P(G)
on the original prediction items via the identification maps).
We now compress these into a single numerical quantity measuring how
much of the task can be decided after preprocessing.
For each k ≥ 0, recall that
``P ∘ fpre
is k-local on 𝒢’’ means that there exists a k-local decision rule gk whose outputs
coincide with P(G) on
all original nodes/edges (or on the designated graph readout) for every
G ∈ 𝒢.
We define the leakage radius as
rpre(P, fpre; 𝒢) := inf { k ∈ ℕ : P ∘ fpre
is k-local on 𝒢 },
with the convention rpre(P, fpre; 𝒢) = ∞
if no finite k suffices.
When 𝒢 is fixed throughout, we suppress
it and write rpre(P, fpre).
Two features of this definition are deliberate.
First, rpre is a worst-case
radius over 𝒢: it asserts that a single
locality bound works uniformly for all instances in the family, rather
than only on average.
Second, rpre is
measured after preprocessing, i.e., neighborhoods are taken in G″ = fpre(G)
and include the augmented feature schema X″.
Thus rpre
attributes any locality reduction caused by shortcut edges, auxiliary
nodes, or globally-computed features to the preprocessor, not to the
downstream model.
The definition quantifies over the existence of some local rule gk (with the
required rooted-isomorphism invariance on labeled ego graphs).
We do not assume that a learner will find gk, nor that
gk is
statistically estimable from limited data.
This is intentional: leakage is meant to be an existential statement.
If rpre(P, fpre)
is small, then there exists a bounded-radius procedure that recovers the
labels from G″; the
downstream architecture, optimization, and sample size only govern
whether this procedure is realized in practice.
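Whether a given k admits such a rule can also be probed empirically, in the contrapositive direction: a diagnostic for rpre searches for violations of k-locality, i.e., two prediction items with isomorphic labeled k-views but different ground-truth labels. The sketch below is a minimal self-contained version (all helper names are ours); it compares views by rooted color refinement, which is exact on tree-shaped views such as the ones arising here and a sound-but-incomplete heuristic in general. The instance family is a lollipop (cycle plus pendant path; the attachment node is a cut vertex) versus a theta graph (the path's far end reattached to the cycle; the attachment node is not).

```python
from collections import deque

def ball(adj, src, k):
    dist, q = {src: 0}, deque([src])
    while q:
        u = q.popleft()
        if dist[u] < k:
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
    return set(dist)

def rooted_invariant(adj, root, k):
    # Color refinement on the induced k-ball with the root marked.
    # Exact for tree-shaped views; a sound-but-incomplete heuristic otherwise.
    nodes = ball(adj, root, k)
    sub = {u: [v for v in adj[u] if v in nodes] for u in nodes}
    col = {u: int(u == root) for u in nodes}
    for _ in range(len(nodes)):
        col = {u: hash((col[u], tuple(sorted(col[v] for v in sub[u]))))
               for u in nodes}
    return (col[root], tuple(sorted(col.values())))

def klocal_violation(instances, k):
    # instances: (adjacency, prediction item, ground-truth label) triples.
    seen = {}
    for adj, root, label in instances:
        inv = rooted_invariant(adj, root, k)
        if inv in seen and seen[inv] != label:
            return True   # indistinguishable k-views, different labels
        seen[inv] = label
    return False

L = 40
def attach_path(close_up):
    # Cycle 0..L-1 plus a path L..2L-1 attached at node 0;
    # if close_up, the path's far end is also joined to node L//2 (theta graph).
    adj = {i: [(i - 1) % L, (i + 1) % L] for i in range(L)}
    adj.update({i: [] for i in range(L, 2 * L)})
    chain = [0] + list(range(L, 2 * L)) + ([L // 2] if close_up else [])
    for a, b in zip(chain, chain[1:]):
        adj[a].append(b); adj[b].append(a)
    return adj

instances = [(attach_path(False), 0, 1),   # node 0 is a cut vertex (lollipop)
             (attach_path(True), 0, 0)]    # node 0 is not (theta)
# A violation at every small k is evidence that cut-vertex detection
# is not k-local on this family (no certified k-local rule can exist here).
assert all(klocal_violation(instances, k) for k in range(8))
```

Finding no violation at some k yields only an empirical upper-bound estimate r̂, not a certificate; exhibiting an explicit k-local rule remains the certified route.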
Locality alone does not specify how costly it is to process the
induced neighborhood.
A rule that reads G″[Nk(⋅)]
might still solve an NP-hard subproblem within that neighborhood, or
might require arithmetic with extremely high-precision real features
written by preprocessing.
Accordingly, we parameterize our leakage statements by a verifier class VClass, a computational class in which gk is required
to lie (e.g., TIME(poly), TC0, or a fixed-size neural module
applied to an adjacency-encoded ego graph).
Formally, given VClass we may define
the restricted radius
rpreVClass(P, fpre) := inf { k : ∃ gk ∈ VClass
such that gk witnesses k-locality
of P ∘ fpre },
again taking the value ∞ if no such
witness exists.
In the absence of an explicit VClass,
our default is the existence-based rpre above.
We package the preprocessing cost, the induced locality, and the
computational nature of the local rule into a single report:
Leak(P, fpre) = (Tpre, rpre, VClass, notes).
Here Tpre(G) is the
declared preprocessing budget (time/oracle/bandwidth, as specified in
the setup), while rpre is either the exact
radius or a certified upper bound rpre ≤ k
obtained by exhibiting an explicit k-local rule.
The field notes is reserved for the
bookkeeping that determines whether the report is meaningful and
comparable across works: which instance family 𝒢 is in force, how auxiliary nodes/edges are
tagged, whether feature values are exact or approximate, and whether the
stated radius is proved or only observed empirically (e.g., via
ego-graph baselines constrained to radius k).
This tuple is the object we will later compare across preprocessors; in
particular, two pipelines with identical downstream architectures but
different fpre
should not be viewed as comparable unless both are accompanied by such a
report.
A central point is that Leak(P, fpre) is
defined independently of whether the subsequent learner is a message-passing GNN, a
graph Transformer with global attention, or any other
architecture.
The reason is structural: our leakage question is not ``can a given
model class compute P from G″?'' but rather ``is
P already locally decidable on
the representation G″?''
If the answer is yes with small radius, then any sufficiently expressive
architecture (including ones with global access) has, in principle, an
easy path: it may simply implement (or approximate) the local rule gk and ignore
long-range computation.
Conversely, if the answer is no (large rpre or ∞ within a plausible VClass), then even a globally attentive model
is not being helped by hidden label information in X″ at small radii, and
any success must come from genuine global inference or from nonlocal
access patterns not captured by our locality model.
We emphasize that this invariance is not a claim that all
architectures behave the same; rather, it is a claim about attribution.
Whatever advantage a graph Transformer gains from receiving G″ rather than G is, by definition, part of
preprocessing power whenever it is witnessed by a small rpre.
In this sense, rpre
is an architecture-agnostic measure of ``how much of the task has been
compiled into the input.’’
The quantity rpre should be read as a
locality-collapsing effect of preprocessing: rpre = 0 means the labels
are readable from single nodes/edges (or from a constant-size readout
token), while rpre = 1 means a node’s
label can be verified from its immediate neighborhood in G″, and so on.
When rpre is small
and Tpre is large,
it becomes difficult to attribute downstream performance to
``expressiveness’’ of the learner; the task may have been made trivial
by preprocessing.
The next section formalizes this intuition by showing that, when Tpre is allowed to match
the centralized complexity TP, locality
collapse is unavoidable in the strongest possible sense.
We now formalize the strongest possible version of the leakage
phenomenon: once preprocessing is permitted to spend essentially the
same resources as a centralized solver for the task, it can compile the
answers (or checkable certificates thereof) into the representation so that the
remaining decision is local in the sense of rpre.
The conclusion is not that typical feature-engineering pipelines behave
this way, but rather that budgets must be declared and bounded, because otherwise the preprocessor can
always trivialize the remaining computation.
Theorem~ makes precise the following implication: if the
preprocessing stage is permitted to spend a budget comparable to that of
solving the task centrally, then the post-preprocessing locality radius
can always be forced down to a constant.
In that regime, any subsequent ``learning architecture’’ receives an
input on which the task is already locally decidable.
Thus, downstream success cannot be attributed to long-range message
passing, global attention, or higher-order aggregation—the nontrivial
part of the computation has already occurred inside fpre.
Formally, consider any model class 𝒜
operating on G″
after preprocessing.
If we allow Tpre(G) ≥ Ω̃(TP(G)),
then there exists a pipeline in which 𝒜
is superfluous: one may replace it by the constant-radius rule gk from the
proof (with k ∈ {0, 1}
depending on task type) while preserving exact correctness on all
instances.
This is precisely the sense in which expressiveness statements become
vacuous without an explicit preprocessing budget: any architectural
advantage can be simulated by preprocessing that simply writes the
answers into the representation.
Two clarifications are worth recording.
First, the collapse does not require large neighborhoods or subtle
coding; the features required are only the label bits (or a
bounded-precision real value for regression), placed on the relevant
prediction items.
Second, the collapse is compatible with augmentations that substantially
change topology: the definition of rpre measures locality
in G″, so adding
shortcut edges, hub nodes, or globally-computed features is exactly the
mechanism by which leakage is expressed and must be accounted for.
In summary, Theorem~ provides an unconditional upper bound on rpre as a function of preprocessing resources: once Tpre is allowed to approach TP, locality can be made constant, and comparisons of downstream architectures become uninterpretable unless one simultaneously reports (and restricts) Tpre and the permitted feature/topology augmentations.
Theorem~ is an achievability statement: with a budget comparable to a centralized
solver, preprocessing can always force rpre down to a constant
by compiling answers (or checkable certificates) into the
representation.
To obtain a framework in which rpre is not automatically
vacuous, we must therefore work in regimes where such compilation would
contradict standard complexity assumptions, or where compilation would
require oracle power that we explicitly forbid or budget for.
We record two complementary barriers: NP-hardness (as a coarse-grained
obstruction) and #P/PH oracle
separations (as an obstruction to ``feature oracles’’ such as
unrestricted motif counting).
Suppose P is NP-hard under
polynomial-time many-one reductions.
If there existed a polynomial-time preprocessor fpre such that P ∘ fpre were
0-local with a polynomial-time verifier
g, then the composite
procedure
G ↦ fpre(G) ↦ g(fpre(G))
would decide P in polynomial
time.
This yields the conditional barrier formalized by Theorem~3 in our
theorem list: under the assumption P ≠ NP, no polynomial-time preprocessing
scheme can reduce an NP-hard decision task to a trivial 0-local readout with an efficient
verifier.
In other words, for NP-hard tasks, locality collapse is possible only by
paying one of the following prices: (i) super-polynomial Tpre, (ii) a verifier
g whose computational power is
itself beyond polynomial time, or (iii) a relaxation of the task notion
(approximation, promises, or distributional/average-case
correctness).
Two remarks are necessary for correct interpretation.
First, this is a worst-case obstruction only in the weak sense that it rules out a
single polynomial-time fpre that works on all
inputs G ∈ 𝒢; it does not
forbid heuristics that collapse locality on restricted families or
typical instances.
Second, the statement is about exact solving: in learning settings, it remains
possible (and common) that preprocessing makes labels predictable from small-radius
neighborhoods without making the task algorithmically easy in the worst
case.
This is precisely why our reporting protocol allows an empirical
estimate r̂ alongside certified
bounds.
The preceding NP-hardness barrier is easily circumvented if we
silently grant preprocessing access to computationally powerful
oracles.
Many ``feature engineering’’ pipelines do precisely this in practice
(e.g., calls to ILP/SAT solvers, exact substructure enumeration, or
external chemistry toolkits), even when the downstream model is
presented as a purely local learner.
Accordingly, our budget Tpre must be understood
as inclusive of all external resources: it must include not only RAM time, but also the number and type of
external oracle calls.
A particularly clean separation arises for subgraph- and
motif-counting features.
If fpre is
permitted to output exact values of #SubGI(H ↪ G) for
unrestricted pattern families ℋ, then
(by Theorem~4 in our theorem list) preprocessing is #P-hard in general; moreover, even a single
adaptive #P query together with
polynomial-time computation suffices to decide all languages in PH (via standard Toda-style
inclusions).
Thus, in the presence of such an oracle, locality claims become
uninformative: the preprocessor can encode extremely hard global
information into node/edge features, after which a constant-radius
downstream rule can appear to have ``global expressiveness’’ while in
fact consuming #P-level power
upstream.
The appropriate resolution is not to ban all global features, but to declare and
budget them: if exact motif or subgraph counts are used, the leakage report
must reflect oracle complexity (or restrict ℋ to fixed-size patterns where exact counting
is tractable on the relevant graph class).
Locality is an informational restriction (bounded radius in G″), not a computational
one.
Even with small rpre, one could in
principle place all computational hardness into the local decision rule
gk by
allowing gk to be
arbitrarily expensive.
This is why our leakage tuple includes an explicit verifier class VClass.
At the extreme, if VClass = TIME(exp),
then any property becomes 0-local by
simply copying the input G
into a large feature string at a distinguished node and letting g0 solve the problem
there.
To avoid this degeneracy, one must either (i) constrain VClass to a neural-plausible or circuit-like
class (e.g. small MLP, TC0,
low-depth circuits), or (ii) explicitly include the evaluation time
Tgk
in the accounting.
Theorem~5 captures the relevant simulation principle: an exact k-local certificate P(G) = gk(fpre(G))
implies a centralized algorithm with runtime Õ(Tpre(G) + Tgk(G)).
Consequently, any (assumed) lower bound on TP transfers to
a lower bound on Tpre for achieving small
rpre with efficient
verifiers.
The above discussion yields a minimal set of methodological
commitments.
To interpret downstream performance as evidence of model expressiveness
rather than preprocessing leakage, we must (a) specify Tpre including external
calls and oracle access; (b) bound the output bandwidth of preprocessing
(feature dimension, bit precision, and any O(n2) objects
such as SPD/RD tables) since bandwidth is itself a channel for leakage;
and (c) declare a verifier class VClass
so that
``local but computationally heavy'' readouts are not conflated with ``local
and easy'' ones.
Under these declarations, the non-triviality claims become conditional
but operational: if P is
NP-hard, exact locality collapse with polynomial budgets would imply
P = NP; if preprocessing uses #P-level primitives, then any resulting
locality collapse must be attributed to that oracle power.
These principles guide the case studies that follow, where we certify
tight rpre bounds
for concrete, commonly used preprocessors and explicitly relate those
bounds to their computational and informational costs.
We now instantiate the (Tpre, rpre)-lens on several preprocessors that are routinely used (explicitly or implicitly) to ``globalize’’ local architectures. Our goal in each case study is to (i) identify a natural task family P whose apparent non-locality can be collapsed by the preprocessor, (ii) give an explicit constant-radius decision rule on G″ = fpre(G) (hence a certified upper bound on rpre), and (iii) make the required preprocessing budget and bandwidth explicit.
Let fpre augment G with exact effective resistance values R(u, v) between relevant pairs of vertices. In the strongest form, fpre stores all-pairs resistances, i.e. a matrix R ∈ ℝn × n whose entries are written into node features (e.g. node u carries the vector R(u, ⋅)). This representation is expensive in both computation and bandwidth: exact all-pairs resistances require access to the Laplacian pseudoinverse and, in general, incur Ω(n2) output size and at least Ω(n2) work simply to write the features. Nevertheless, once such features exist, several classical global connectivity predicates become constant-radius.
Concretely, by Theorem~2, bridge detection becomes 0-local: for an edge (u, v) ∈ E, one
checks whether R(u, v) = 1. No
neighborhood exploration is necessary; the label depends only on the
edge feature R(u, v) (or
equivalently on the endpoints’ feature vectors restricted to the
coordinate v and u). Cut-vertex detection becomes
1-local: a node u is a cut vertex if and only if
there exist neighbors s, t ∈ N(u)
such that
R(s, t) = R(s, u) + R(u, t),
which can be verified by inspecting the induced subgraph on N1G″(u).
The tightness aspect is equally important: on plain graphs without such
resistance features, both bridge and cut-vertex detection are global in
the locality sense (one can witness non-locality on path-versus-cycle
families where arbitrarily large neighborhoods are indistinguishable
around a given edge/node while the global label flips). Thus the
preprocessor has collapsed rpre from Ω(D) to 0/1, and the only plausible attribution for
any downstream ``global’’ success is the resistance augmentation
itself.
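To make the certificate concrete, the rule ``bridge iff R(u, v) = 1'' can be sketched in stdlib Python, computing exact resistances from the grounded Laplacian with rational arithmetic. The helper names are ours and purely illustrative; exact arithmetic sidesteps floating-point issues at the cost of scaling only to small demonstration graphs.

```python
from fractions import Fraction

def effective_resistance(n, edges, u, v):
    """Exact effective resistance R(u, v) on a connected graph."""
    # Build the graph Laplacian over the rationals for exact arithmetic.
    L = [[Fraction(0)] * n for _ in range(n)]
    for s, t in edges:
        L[s][s] += 1
        L[t][t] += 1
        L[s][t] -= 1
        L[t][s] -= 1
    # Ground node 0 and solve the reduced system L' x = e_u - e_v.
    idx = [i for i in range(n) if i != 0]
    pos = {node: j for j, node in enumerate(idx)}
    A = [[L[i][j] for j in idx] for i in idx]
    b = [Fraction(0)] * (n - 1)
    if u != 0:
        b[pos[u]] += 1
    if v != 0:
        b[pos[v]] -= 1
    m = n - 1
    # Gauss-Jordan elimination with exact pivoting (fine for small graphs).
    for col in range(m):
        piv = next(r for r in range(col, m) if A[r][col] != 0)
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(m):
            if r != col and A[r][col] != 0:
                f = A[r][col] / A[col][col]
                for c in range(col, m):
                    A[r][c] -= f * A[col][c]
                b[r] -= f * b[col]
    x = [b[r] / A[r][r] for r in range(m)]
    xu = x[pos[u]] if u != 0 else Fraction(0)
    xv = x[pos[v]] if v != 0 else Fraction(0)
    return xu - xv

def is_bridge_0local(R_uv):
    # The 0-local certificate: an edge (u, v) is a bridge iff R(u, v) = 1.
    return R_uv == 1
```

On a 3-path every edge has R = 1 and is flagged as a bridge; on a 4-cycle adjacent vertices have R = 3/4, so the rule correctly rejects them.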
A second canonical preprocessor computes shortest-path distances. Let
D denote the all-pairs
shortest-path matrix (SPD). If fpre writes to each node
u the vector D(u, ⋅) ∈ ℕn,
then any node-level statistic that is a function of distances from u becomes 0-local. For example, eccentricity, closeness
centrality, and t-reachability
labels admit 0-local readout:
ecc(u) = maxv ∈ VD(u, v), close(u) = (∑v ∈ VD(u, v))−1, 1{∃v: D(u, v) ≤ t ∧ ϕ(v)}.
Similarly, graph-level quantities such as diameter diam(G) = maxu, vD(u, v)
become 0-local provided we permit a designated
readout location (e.g. a ``virtual’’ graph node whose features contain
the entire matrix, or a convention that graph-level labels are stored in
a global feature slot). Absent such conventions, one can still obtain
1-locality by adding a virtual node
z adjacent to all vertices and
storing global summaries at z.
The cost story is immediate: computing exact SPD is typically O(nm) in unweighted graphs by running BFS from each node, and the output bandwidth is Θ(n2) even before precision/encoding considerations. Moreover, rpre depends subtly on the verifier class. If the preprocessor stores ecc(u) directly as a scalar, the 0-local verifier can lie in a very weak class (constant-depth arithmetic on O(log n)-bit words). If instead it stores the entire vector D(u, ⋅), then the 0-local verifier must compute a maximum over n values, which is still polynomial-time but not ``neural-plausible'' in the strongest circuit-theoretic senses unless we explicitly allow such aggregation. This illustrates why rpre must be reported together with VClass: the same distance features can be either a trivial leakage channel (when they already contain the answer) or merely a convenient representation (when substantial computation remains in the verifier).
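A minimal sketch of this pipeline, with fpre writing D(u, ⋅) into each node's features via one BFS per node (the stated O(nm) cost on unweighted graphs) and 0-local readouts on top. Function names are ours, for illustration only.

```python
from collections import deque

def spd_features(adj):
    # f_pre: write the full distance vector D(u, .) into node u's features.
    # Cost: O(nm) time via one BFS per source; Theta(n^2) output bandwidth.
    n = len(adj)
    feats = []
    for s in range(n):
        dist = [None] * n
        dist[s] = 0
        q = deque([s])
        while q:
            x = q.popleft()
            for y in adj[x]:
                if dist[y] is None:
                    dist[y] = dist[x] + 1
                    q.append(y)
        feats.append(dist)
    return feats

def ecc_0local(feat_u):
    # 0-local readout: eccentricity is a function of u's own feature vector,
    # but note the verifier still aggregates over n stored values.
    return max(feat_u)

def diameter_at_virtual_node(feats):
    # Graph-level readout at an agreed location (here: a global feature slot).
    return max(max(row) for row in feats)
```

For example, on the 4-path with adjacency lists `[[1], [0, 2], [1, 3], [2]]`, node 0 carries the feature vector [0, 1, 2, 3] and the 0-local readouts recover eccentricities and the diameter without any neighborhood exploration.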
Positional encodings (PEs) are often presented as ``features,’’ but many of them behave, for locality, like topology modifications. For instance, Laplacian eigenvector encodings and related spectral coordinates require global computation; random-walk structural encodings require solving linear systems or simulating walks at scale. Once written into X″, however, they can reduce the effective radius needed to separate nodes or to verify global certificates (e.g. by giving each node a nearly unique coordinate system).
More importantly, several PE pipelines add edges determined by global
information (e.g. k-nearest-neighbor edges in the PE
space, diffusion edges, or a global
``CLS''/virtual node). In our framework these are bona fide changes from G to G″ = (V″, E″, X″). Adding a universal virtual node z connected to all u ∈ V immediately makes many graph-level predicates 1-local at z, because G″[N1G″(z)] contains all vertices (and often their full feature payload). Similarly, adding long-range PE-induced edges can reduce the diameter of G″ and hence lower rpre even if no explicit labels are compiled. Thus PE design must be audited not only for runtime but also for induced locality: an architecture may appear to have ``long-range
reasoning'' while in fact operating locally on a graph whose long-range
edges were inserted upstream.
Finally, motif-counting features provide a sharp illustration of oracle-priced leakage. If fpre outputs exact counts #SubGI(H ↪ G) for an unrestricted pattern family ℋ, then the preprocessor can encode extremely hard global information, and by Theorem~4 this capability is #P-hard in general (with standard consequences for PH under oracle access). From a locality perspective, once such counts are placed into X″, the downstream task is frequently 0-local: a node-level label ``does u participate in any copy of H?'' becomes a threshold test on a single feature coordinate, and graph-level existence/counting predicates become constant-radius tests at an agreed readout location. In contrast, when ℋ is restricted to fixed-size motifs (e.g. triangles, 4-cycles), exact counting may be tractable on specific graph classes, and the correct report is then a quantitative one: locality has been reduced, but only by paying a preprocessing cost that must be disclosed (and which may scale as n|V(H)| in the worst case if implemented naïvely).
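For the tractable end of the spectrum, the compilation and its 0-local readout can be sketched for the fixed motif H = K3; the naive enumeration below costs n|V(H)| with |V(H)| = 3, matching the cost caveat above (helper names are ours, illustrative only).

```python
from itertools import combinations

def triangle_counts(adj):
    # f_pre for a fixed-size motif H = K_3: per-node triangle participation
    # counts, by naive O(n^3) enumeration (fixed |V(H)| keeps this polynomial).
    n = len(adj)
    nbr = [set(a) for a in adj]
    counts = [0] * n
    for u, v, w in combinations(range(n), 3):
        if v in nbr[u] and w in nbr[u] and w in nbr[v]:
            counts[u] += 1
            counts[v] += 1
            counts[w] += 1
    return counts

def participates_0local(count_u):
    # 0-local readout: "does u lie in any copy of H?" is a threshold test
    # on a single precomputed feature coordinate.
    return count_u > 0
```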
Across effective resistance, SPD features, positional encodings, and motif counts, the same structural phenomenon recurs: once global invariants are written into X″ (or once E″ is augmented to shrink graph distances), the post-preprocessing radius rpre for many ``global’’ tasks becomes constant, often 0 or 1. The only scientifically meaningful question is therefore not whether a downstream learner can exploit these representations, but how much computational and informational power was consumed upstream to create them. This motivates the next section, where we present concrete leakage-auditing procedures that either certify small rpre analytically or diagnose practical locality collapse empirically via strictly local ego-graph baselines.
We now make the (Tpre, rpre)-report operational by giving auditing procedures that (a) certify small post-preprocessing locality whenever an explicit local rule exists, and (b) diagnose practical locality collapse when formal certificates are unavailable or inconvenient. Both procedures treat preprocessing as part of the computational budget and therefore allow us to attribute performance either to upstream computation or to downstream inference.
The analytic audit is a proof obligation: we exhibit an explicit
k-local decision rule gk such
that
P(G) = gk (fpre(G)) for
all G ∈ 𝒢,
where gk
is constrained to read only G″[NkG″(u)]
for node/edge labels (or an agreed readout location for graph labels)
and lies in a stated verifier class VClass. Once such a rule is written down, we
have a certified bound rpre(P, fpre) ≤ k
by definition. Crucially, we insist that the certificate uses
information present in G″; the audit is not a
proof about the original G,
but about what the preprocessor has already placed into
topology/features.
For reporting, we recommend that a paper claiming improved
``long-range'' reasoning from preprocessed features include (a) the smallest k for which a certificate is known, (b) a short description of gk, and (c) an upper bound on the total preprocessing work Tpre(G), including any calls to solvers/oracles and the cost of writing Θ(n2)-size outputs when applicable. When such a certificate exists, Theorem~5 further implies an attribution statement: if rpre is small with an efficient gk, then the pipeline implicitly contains an efficient centralized algorithm for P, and any purported ``expressiveness''
must be charged against preprocessing.
Analytic certificates are not always available for modern feature pipelines (e.g. learned positional encodings, tool-augmented preprocessors, or proprietary feature extractors). In such cases we propose an empirical audit that tests for locality collapse by training baselines that are restricted to k-hop ego-graphs of G″. The guiding principle is: if a classifier that cannot see beyond NkG″(u) achieves near-saturated performance, then the task has effectively become k-local after preprocessing, regardless of whether we can prove it.
Concretely, for node-level tasks we consider, for each k ∈ {0, 1, …, Kmax},
a predictor hk that takes as
input only the induced subgraph G″[NkG″(u)]
with features and outputs a label for u. For edge-level tasks we
analogously use G″[NkG″((u, v))].
The case k = 0 is the
baseline: no adjacency information is visible except what is encoded in
node/edge attributes. We then define an estimated collapse radius
r̂pre := min {k ≤ Kmax: Acc(hk) ≥ τ},
for a task-dependent threshold τ (e.g. within ε of the best reported model, or
within a confidence band around the best attainable performance on the
benchmark).
To ensure the baseline is genuinely local, we enforce two invariants. First, the data pipeline must materialize ego-graphs explicitly and forbid global operations (no batch normalization whose statistics mix nodes from different ego-graphs; no attention across a batch; no global positional indices that reintroduce node identities). Second, when the preprocessor produces graph-level artifacts (e.g. an n × n distance matrix distributed across nodes), we must specify what is actually visible to hk: if node u contains D(u, ⋅), then k = 0 already includes n numbers, and the baseline is allowed to use them, but this must be recorded as bandwidth-heavy leakage rather than interpreted as ``local generalization.’’
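The two core primitives of this audit, explicit ego-graph materialization and the r̂pre estimator, can be sketched as follows, assuming the accuracies Acc(hk) have already been measured (function names are ours, not from any library):

```python
def ego_graph(adj, u, k):
    # Materialize the induced subgraph on N_k(u) in G''. The predictor h_k
    # receives only this dict, so locality holds by construction.
    seen, frontier = {u}, {u}
    for _ in range(k):
        frontier = {y for x in frontier for y in adj[x]} - seen
        seen |= frontier
    return {x: [y for y in adj[x] if y in seen] for x in sorted(seen)}

def estimated_collapse_radius(acc_by_k, tau):
    # r-hat_pre := min { k <= K_max : Acc(h_k) >= tau }; None if no k qualifies.
    for k in sorted(acc_by_k):
        if acc_by_k[k] >= tau:
            return k
    return None
```

On a 5-path, `ego_graph(adj, 2, 1)` returns the induced subgraph on {1, 2, 3} with the edges to nodes 0 and 4 removed, so no information outside the 1-hop ball can leak into h1.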
A minimal audit thus consists of extracting ego-graphs, training radius-restricted predictors, and reporting the resulting collapse curve. These audits supply the bridge to our implementation guidelines: once (Tpre, rpre, VClass) are reported with the above ablations, experiments can explicitly demonstrate (or refute) locality collapse and thereby prevent conflating preprocessing power with downstream architectural expressiveness.
We now describe experimental practices that make the leakage report Leak(P, fpre) = (Tpre, rpre, VClass, notes) empirically meaningful and comparable across pipelines. The goal is not to propose new benchmarks, but to standardize ablations that (i) expose locality collapse when it occurs (as predicted by Theorem~1), (ii) quantify the extent to which performance is attributable to preprocessing rather than downstream inference, and (iii) make tool-augmented or solver-augmented preprocessors auditable.
For each reported model trained on a fixed preprocessed instance
family fpre(𝒢), we
recommend reporting a small set of baselines that isolate what
information is already present at bounded radius in G″. At a minimum, for
node- and edge-level tasks, one should include:
(i) a k = 0 baseline that uses
only node/edge attributes and no adjacency beyond what is encoded as
features;
(ii) a small set of ego-graph baselines for k ∈ {1, 2, 3} (or up to a
task-dependent cap);
and (iii) a full-graph baseline in the same VClass as a sanity check that the learner is
not underpowered.
We emphasize that the ego-graph baselines should be implemented by
explicit subgraph extraction G″[Nk(⋅)]
followed by a predictor that receives only this induced subgraph as
input, so that locality is enforced by construction rather than by an
architectural heuristic.
In addition to reporting the resulting accuracies, it is informative to report the collapse curve k ↦ Acc(hk) and the corresponding r̂pre for a fixed threshold τ. We recommend choosing τ relative to the best reported pipeline for the benchmark, e.g. τ = Acc⋆ − ε with a small ε (or an uncertainty-corrected band), and reporting sensitivity over ε. In our experience, the curve is often more diagnostic than a single number: a sharp jump from k = 0 to k = 1 suggests that preprocessing has produced a locally checkable certificate, whereas a smooth increase may indicate partial leakage through approximate global summaries.
To detect inadvertent label encoding through precomputed features, we recommend a set of controlled corruptions that preserve marginal distributions while destroying alignment with graph structure. Concretely, if X″ decomposes into blocks X″ = (Xraw, Xpre), where Xpre is produced by fpre, then one can (i) permute Xpre(u) across nodes u, (ii) permute only within degree classes or within connected components (to preserve coarse statistics), or (iii) resample Xpre from other graphs in the dataset with matched size. A large performance drop under such perturbations indicates that Xpre is carrying instance-specific information that is not recoverable from local topology alone. These tests are particularly important when fpre outputs high-bandwidth objects (e.g. per-node distance vectors, spectral coordinates, or motif histograms), where a seemingly ``local’’ readout may in fact be exploiting globally computed identifiers.
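One of these corruption controls — permuting the Xpre block across nodes, optionally only within degree classes — can be sketched as follows (a hypothetical helper of ours, seeded for reproducibility):

```python
import random

def permute_pre_block(x_pre, degrees, within_degree=False, seed=0):
    # Corruption control: shuffle precomputed features across nodes,
    # preserving the marginal multiset while destroying alignment with
    # graph structure. With within_degree=True, features only move among
    # nodes of equal degree, preserving coarse degree statistics.
    rng = random.Random(seed)
    idx = list(range(len(x_pre)))
    if within_degree:
        groups = {}
        for i in idx:
            groups.setdefault(degrees[i], []).append(i)
        perm = idx[:]
        for g in groups.values():
            shuffled = g[:]
            rng.shuffle(shuffled)
            for a, b in zip(g, shuffled):
                perm[a] = b
    else:
        perm = idx[:]
        rng.shuffle(perm)
    return [x_pre[perm[i]] for i in idx]
```

A large accuracy drop after applying this corruption, with Xraw left intact, is the operational signature of instance-specific leakage through Xpre.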
We recommend that every experimental table involving preprocessing
include an explicit budget line with at least the following
quantities:
Tpre(G) (measured wall-clock time and an asymptotic bound), Sout(G) (output size in scalars), and b (numeric precision / bits per scalar).
When fpre computes
all-pairs objects (e.g. SPD or effective resistance), one should report
whether the representation is stored explicitly (Θ(n2) scalars)
or sparsified/quantized, since this materially changes both bandwidth
and the effective power of the local readout. If the preprocessor makes
external calls (linear solvers, MILP solvers, SAT solvers, database
queries), then Tpre
must include both the number and type of calls and the total time; if
calls are nondeterministic or time-capped, this should be stated and
treated as part of the experimental condition rather than hidden as
``data preparation.’’
A growing class of pipelines computes features using tools that are
best modeled as oracles: exact/approximate solvers, retrieval systems,
or foundation-model calls. Our recommendation is to treat such tools as
part of fpre and to
report them in the leakage tuple via an itemized budget accounting:
Tpre(G) = (CPU/GPU
time, #oracle queries, total tokens or bytes, tool class).
This makes explicit when a pipeline is implicitly leveraging #P-hard or NP-hard computation
(cf. Theorem~4) or when it is simply amortizing heavy computation
outside the learning loop. In tool-augmented settings, we also recommend
a counterfactual ablation: rerun the downstream learner with the tool outputs replaced by
outputs generated without access to the test instance (e.g. generic
templates, unconditional samples, or outputs from an unrelated graph).
If performance remains high, then the tool output is not carrying
instance information; if performance collapses, then the tool output is
an instance-dependent channel and must be audited for locality and
budget.
This block is intentionally brief but forces the separation of upstream computation from downstream inference. In particular, when a pipeline achieves strong ``long-range’’ performance, the disclosure should make it evident whether the gain is explained by small rpre (locality collapse) together with large Tpre, or whether it persists in regimes where preprocessing is genuinely local and low-budget.
We conclude by isolating several directions in which the leakage tuple Leak(P, fpre) = (Tpre, rpre, VClass, notes) is presently too coarse, and by formulating open problems whose resolution would make locality audits both more faithful and harder to game.
Our definition of rpre is qualitative:
either P ∘ fpre is
k-local (exactly) or it is
not. Empirically, however, preprocessing often yields features that are
neither a full label encoding nor purely local statistics; instead, they
offer a partial signal toward the correct answer that a downstream learner can amplify.
This suggests that one should measure leakage on a continuum. One
natural extension is an ε-locality radius
rpre(ε)(P, fpre) := min {k: ∃ k-local
gk s.t.
Pr [gk(G″[Nk(⋅)]) = P(G)] ≥ 1 − ε},
with the probability taken over an explicit distribution on 𝒢 (or over a worst-case instance family if
one adopts distribution-free agnostic guarantees). This raises two
immediate questions. First, what is the correct error criterion for node/edge-level
tasks: per-label error, average error over nodes/edges, or an
adversarially chosen index? Second, what is the correct reference point for ε: should we compare to Bayes risk
under the distribution, or to the accuracy of a full-graph learner in a
fixed VClass? Even in the simplest
setting, one expects a separation between
``information leakage'' (existence of a k-local predictor with small ε) and ``computational
leakage'' (existence of such a predictor within a restricted VClass). A principled report would therefore
include a pair (rpre(ε), VClass)
rather than a single integer radius, together with an explicit
definition of the error criterion.
A complementary approach is to define an advantage of radius k, e.g.
Advk := supg ∈ 𝒢k(Pr [g(G″[Nk(⋅)]) = P(G)] − Pr [g0(G″[N0(⋅)]) = P(G)]),
where 𝒢k denotes a
fixed hypothesis class of k-local predictors. This makes
explicit that leakage is a gain attributable to increasing radius given the
same VClass, aligning with the collapse
curves advocated in Section~. The open problem is to connect such
operational quantities to information-theoretic measures (conditional
mutual information between P(G) and G″[Nk]
given G″[Nk − 1],
or conditional entropy decay), in a way that is invariant under benign
feature reparameterizations.
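Computing Advk from measured accuracies is then mechanical; the sketch below idealizes the supremum over 𝒢k by the best observed accuracy per radius and enforces monotonicity (since a k-local class subsumes all smaller radii):

```python
def advantage_curve(acc_by_k):
    # Adv_k relative to the k = 0 baseline. Because a k-local class contains
    # every (k-1)-local predictor, we take a running maximum first, so the
    # reported advantage curve is monotone in k.
    ks = sorted(acc_by_k)
    base = acc_by_k[ks[0]]
    best, out = base, {}
    for k in ks:
        best = max(best, acc_by_k[k])
        out[k] = best - base
    return out
```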
Theorems~2–4 highlight that global objects (e.g. all-pairs effective resistance, all-pairs shortest paths, or exact motif counts) can collapse locality dramatically, but exactness is rarely used in practice. Many pipelines compute approximate shortest-path distances, approximate resistances via Laplacian solvers, low-rank spectral embeddings, or quantized distance bins. Formally, these are still global functions of G; nevertheless, their information content is reduced, and their approximation error may destroy the crisp characterizations underlying constant-radius certificates.
This motivates a refined accounting in which the preprocessor output
is constrained not only by time Tpre but also by an
information budget: output size Sout(G) and
numeric precision b. One can
then ask for a tradeoff curve: for a fixed task P, what minimal (Sout, b)
suffices to achieve rpre(ε) ≤ k
for small k? Even for concrete
tasks such as bridge detection, it is unclear how robust the 0-local
rule ``R(u, v) = 1'' is under approximation. The immediate technical challenge is to control false positives/negatives when resistances are perturbed by ±δ, or when shortest-path distances are only available up to additive/multiplicative distortion. The broader open problem is to characterize which classes of approximate global summaries admit locally checkable certificates despite approximation error, and which do not. Such results would separate ``approximate
leakage'' that still yields constant-radius decision rules from
approximate features that merely help learning in a non-certificate
way.
Our definitions are deterministic: after preprocessing, a label is a function of the radius-k induced neighborhood. In distributed computing and property testing, it is standard to allow a local algorithm access to public randomness. If we allow the local verifier gk to use a shared random seed ρ, then one can define a randomized radius rprerand by requiring Prρ[gk(⋅; ρ) = P(G)] ≥ 1 − ε. This is not merely a technicality: public randomness can simulate hashing-based certificates, local sampling, or randomized sketch verification, potentially reducing the radius needed to validate a global summary computed by fpre.
The associated open questions are definitional and complexity-theoretic. Definitional: what randomness is ``free’’ in a learning pipeline audit? If the seed is fixed across all instances, it can become an implicit identifier channel; if it is per-instance, it must be budgeted as part of the output. Complexity-theoretic: can randomized locality meaningfully reduce the preprocessing budget needed to collapse locality for natural problems, without simply reintroducing Theorem~1 by another route (namely, by allowing the preprocessor to write a randomized encoding of the labels)? Any satisfactory theory must treat randomness as a resource with explicit accounting, analogously to time and output bandwidth.
A practical frontier is that fpre is increasingly implemented by a model (graph foundation model, LLM-based tool, retrieval system) that consumes a representation of the entire graph and emits node/edge/graph embeddings. In our framework, such a system is an oracle-augmented preprocessor with budget components that do not fit the word-RAM abstraction: token budgets, retrieval calls, and (often hidden) training compute and training data access. The immediate methodological requirement is to state whether the preprocessor is recomputed per test instance, whether it is transductive (trained on the test graph), and whether it can access labels or label-correlated metadata during pretraining.
A more conceptual open problem is to define an analogue of rpre when fpre is amortized across many instances. If a foundation model has memorized a distribution of graphs, then the per-instance runtime may be small while the computational budget is enormous. Leakage reports that ignore this amortization can be misleading in the opposite direction of Theorem~1: they may understate the effective preprocessing power. We therefore view it as necessary to extend Leak(P, fpre) with provenance fields (training data access, transductive vs inductive, and whether the preprocessor parameters depend on the evaluation instance) and with an amortized budget model. Establishing a standard here is an open problem in its own right, but without it, comparisons across ``feature-free'' learners and tool-augmented pipelines are not well-defined.
Finally, our strongest negative statements are conditional (Theorem~3) or complexity-based (Theorem~4). A missing piece is a set of lower bounds showing that, for specific tasks P, any preprocessor that reduces rpre(ε) below a threshold must output Ω(n) or Ω(n2) bits, regardless of time. Such bounds would formalize the intuition that certain global properties cannot be made locally decidable without transmitting essentially the whole graph (or a near-complete distance matrix) through features. Proving these bounds, even for restricted instance families, appears to require combining locality arguments with communication complexity or streaming lower bounds. We regard this as the most direct route to a theory in which (Tpre, Sout, b, rpre) jointly constrain leakage in a way that cannot be circumvented by simply relocating computation upstream.