In contemporary graph learning practice we rarely deploy a ``bare’’ learning architecture on an input attributed graph G = (V, E, X). Rather, we employ a two-stage pipeline: an explicit preprocessing routine fpre first transforms G into a derived graph G″ = (V″, E″, X″), after which a downstream predictor (a message-passing GNN, a graph Transformer, a kernel method, or a retrieval-based mechanism) operates on G″. This division of labor is often motivated by engineering considerations—data cleaning, feature normalization, positional encodings, subgraph extraction, augmentation, or indexing—but it has a direct mathematical consequence: the information available to the learner is not constrained by the inductive bias of the architecture alone, but by the joint power of (fpre, learner).
We emphasize that fpre is not merely ``feature engineering'' in the weak sense of composing a fixed map \(\mathbb{R}^d\to\mathbb{R}^{d'}\) at each node. Modern preprocessors can alter topology (adding edges to represent proximity, adding supernodes, sparsifying or densifying), can attach global numerical summaries (all-pairs distances, spectral embeddings, effective resistances), and can call external tools (linear solvers, shortest-path routines, motif counters, canonicalization packages) whose computational strength is not reflected in the stated model class. In many benchmark settings these operations are silently permitted, and their cost is not accounted for. Consequently, when we attribute performance differences to ``expressiveness'' of the learner, we may in fact be measuring expressiveness of the preprocessing pipeline.
This confound becomes acute in regimes where preprocessing is tantamount to solving the task. At the extreme, if a preprocessor computes the target labels (or a succinct certificate thereof) and stores them in X″, then the downstream model need only read off this information locally. Such constructions are not pathological; they capture a spectrum of common practices. For instance, a pipeline that augments each node with its shortest-path distances to a designated set of anchors can make long-range reachability predicates locally recognizable. A pipeline that attaches spectral or diffusion-based positional encodings can expose global connectivity patterns through constant-radius neighborhoods in G″. A pipeline that stores subgraph counts or graphlet histograms can reveal global motif structure to a model that otherwise only aggregates within a few hops. From the standpoint of what the learner must compute after preprocessing, these pipelines reduce the ``effective'' difficulty of the task by pushing computation into fpre.
We therefore require a vocabulary that separates architectural limitations from preprocessing power. The natural notion, and the one we adopt, is locality after preprocessing. Informally, a node-level label is k-local on G″ if it is determined by the induced k-hop neighborhood G″[NkG″(u)]; analogous definitions apply to edge- and graph-level tasks. Once this is formalized, we can ask a concrete question: what is the minimum radius rpre(P, fpre) at which the composed task P ∘ fpre becomes local on the preprocessed instance family? A small value of rpre indicates that preprocessing has collapsed the task into a local readout problem, regardless of the downstream architecture.
This perspective also clarifies why comparisons against Weisfeiler–Leman (WL) type baselines can fail to diagnose genuine architectural advantages when preprocessing is unconstrained. The standard narrative is that message-passing GNNs are bounded (in various senses) by 1-WL, higher-order GNNs correspond to higher-dimensional WL variants, and graph Transformers may or may not exceed these limitations depending on their positional encodings and attention mechanisms. Yet WL comparisons implicitly assume that the only available input is the raw G with its original features X. If we allow an arbitrary fpre, then the WL hierarchy is no longer a meaningful ceiling: preprocessing can write into X″ any globally computed invariant, including invariants that are strictly stronger than any fixed-dimensional WL refinement on the original graph. In that case, a 1-WL-equivalent architecture run on G″ can appear to ``beat’’ 1-WL on G, not because the architecture has transcended WL, but because the task has been rewritten into a locally checkable form.
Even more pointedly, the failure mode is not merely that preprocessing can add global information; it can add exactly the information needed for the prediction objective. In a supervised setting, it is common to tune preprocessing choices on validation performance. Without an explicit accounting of the preprocessing budget and its informational content, one can inadvertently (or deliberately) bake label information into features through leakage channels such as near-duplicate retrieval, graph-level metadata joined from external sources, or computed statistics that are highly correlated with the label and require global computation to obtain. The resulting system may generalize within the benchmark but does not support claims about the intrinsic representational power of the learner.
We thus advocate that any claim of expressiveness or limitation should be stated relative to a declared preprocessing regime. Concretely, we must specify: (a) what computations fpre is allowed to perform, (b) the associated budget Tpre(G) measured in an explicit centralized model (optionally including oracle calls), and (c) the post-preprocessing locality radius rpre(P, fpre) (exact, certified upper bound, or empirical diagnostic). This reporting does not forbid powerful preprocessing; rather, it makes transparent when performance is achieved by global computation performed upstream.
Finally, we note that the above concerns apply equally to GNNs and graph Transformers, and extend beyond neural networks. Retrieval-augmented systems, for example, can be viewed as preprocessors that build indices and attach retrieved exemplars or templates as additional features; this is a form of external computation whose complexity is carried by Tpre. Similarly, tool-using pipelines that call shortest-path solvers, linear system solvers, or motif counters are preprocessors with substantial algorithmic power. Without a budgeted analysis, it is impossible to distinguish whether a downstream learner is exploiting a genuinely nonlocal reasoning mechanism or merely consuming a locally accessible encoding of a globally computed quantity.
The remainder of this work formalizes these intuitions. We treat
preprocessing as a first-class computational stage and analyze when it
can, and cannot, collapse global graph properties into local decision
rules. This leads to a principled separation between
``architectural expressiveness'' and ``preprocessing leakage,''
and provides a language in which fairness of comparisons can be
assessed.
A convenient abstraction for understanding graph learning pipelines is the analogy with round-based distributed computing. A depth-k message-passing architecture (MPNN) can be viewed as performing k synchronous communication rounds on G: in each round, every node sends a message to its neighbors, aggregates received messages, and updates its internal state. When we bound the per-edge message size to O(log n) bits (or, more generally, to polylog(n)), this abstraction aligns with the CONGEST model; if messages are unbounded, it aligns with the LOCAL model. In both cases, after k rounds, the state at a node u is a function only of the labeled induced neighborhood G[NkG(u)], hence any node-level prediction produced at u is necessarily k-local in the sense that it depends only on NkG(u) and the features therein. This observation, which underlies the standard locality-based limitations of MPNNs, is independent of training and applies equally to deterministic and randomized algorithms (conditioning on the randomness).
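This locality claim can be sanity-checked in a few lines of plain Python. The sketch below (helper names `ball` and `mp_rounds` are ours, not a library API) runs k synchronous rounds of anonymous message passing and confirms that the round-k state at a node computed on the full graph coincides with the state computed on the induced k-hop ball alone.

```python
from collections import deque

def ball(adj, src, k):
    # Nodes at distance <= k from src (breadth-first search).
    dist, q = {src: 0}, deque([src])
    while q:
        u = q.popleft()
        if dist[u] < k:
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
    return set(dist)

def mp_rounds(adj, feat, k):
    # k synchronous rounds: new state = (own state, sorted multiset of neighbor states).
    state = {u: (feat[u],) for u in adj}
    for _ in range(k):
        state = {u: (state[u], tuple(sorted(state[v] for v in adj[u])))
                 for u in adj}
    return state

# 10-cycle with small integer features; inspect node u after k = 2 rounds.
n, k, u = 10, 2, 0
adj = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
feat = {i: i % 3 for i in range(n)}

nodes = ball(adj, u, k)
sub = {v: [w for w in adj[v] if w in nodes] for v in nodes}
# The round-k state at u is fully determined by the induced k-hop ball.
assert mp_rounds(adj, feat, k)[u] == mp_rounds(sub, {v: feat[v] for v in nodes}, k)[u]
```

The assertion holds because the dependence cone of u's round-k state is exactly NkG(u), all of whose internal edges are preserved by the induced subgraph.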
The role of preprocessing is naturally interpreted as an additional stage occurring before the round-limited message passing begins. In distributed terminology, this resembles augmenting the instance with advice (node/edge labels) or with a locally checkable proof produced by a powerful prover, after which a limited-round verifier checks the property. In our setting, the preprocessor fpre plays the role of this prover: it may perform substantial centralized computation and then write the outcome into X″, or modify the topology to make relevant information accessible within small radius. Once such a stage is permitted, the effective limitation of round-limited message passing is no longer controlled solely by the learner depth; instead, it is controlled by the locality radius of the task P ∘ fpre. The point of the (Tpre, rpre)-view developed later is that it makes this separation explicit: the downstream architecture may still be k-local in the raw graph G, but it can become O(1)-local in the derived graph G″ because preprocessing has already performed the global work.
The distinction between local and global properties is standard in both distributed computing and descriptive complexity, and we recall the central mechanism: indistinguishability of local views. Fix k ≥ 0. Two rooted graphs (G, u) and (H, v) are k-locally indistinguishable if there exists an isomorphism between the rooted induced neighborhoods G[NkG(u)] and H[NkH(v)] preserving all node and edge features. Any k-local node-level rule must assign the same label to u and v on such instances. Therefore, to show that a task is not k-local on some family, it suffices to construct pairs of instances with identical k-hop views around the relevant node/edge but different ground-truth labels. For edge- and graph-level tasks, one uses analogous indistinguishability statements for neighborhoods of edges or for all nodes simultaneously. In particular, many connectivity-type properties are global: for every fixed k, one can find graphs with large girth or large diameter in which all k-hop neighborhoods look like trees, while the property of interest depends on a cycle or on a distant connection.
Biconnectivity provides a concrete example of this phenomenon. Consider the edge-level property Bridge(G) ∈ {0, 1}E indicating whether an edge is a bridge, and the node-level property Cut(G) ∈ {0, 1}V indicating whether a node is an articulation point. On raw, unannotated graphs, neither property is constant-radius local on unrestricted families. For bridges, fix k and take a cycle CL of length L ≫ k and a path PL of length L. Choose an edge e far from the endpoints of PL. Then the induced k-hop neighborhood around e in both CL and PL is a length-(2k + 1) path with all internal degrees equal to 2, hence the rooted edge neighborhoods are isomorphic; yet e is a bridge in PL and not a bridge in CL. Thus bridge detection is not k-local for any fixed k on this family. A similar construction witnesses globality for cut vertices: take a ``lollipop'' obtained by attaching a long path to a cycle at a node u, and a ``theta'' graph obtained by additionally reconnecting the far end of that path to the cycle at a node distant from u; for L ≫ k, the k-hop neighborhoods around u in the two graphs are identical, while u is a cut vertex only in the lollipop. These are representative of a general pattern: local neighborhoods cannot certify the presence or absence of a distant alternative route.
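The cycle-versus-path witness for bridges can be checked mechanically. The sketch below (plain adjacency dicts; the helper names `ball`, `induced`, `looks_like_path`, and `is_bridge` are ours) verifies that the two k-hop edge views are isomorphic — both are paths on 2k + 2 nodes — while exactly one of the two edges is a bridge.

```python
from collections import deque

def ball(adj, src, k):
    # Nodes at distance <= k from src (BFS).
    dist, q = {src: 0}, deque([src])
    while q:
        u = q.popleft()
        if dist[u] < k:
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
    return set(dist)

def induced(adj, nodes):
    return {u: [v for v in adj[u] if v in nodes] for u in nodes}

def looks_like_path(adj):
    # Degree profile of a simple path (connectivity holds by construction here).
    degs = sorted(len(vs) for vs in adj.values())
    return degs.count(1) == 2 and all(d in (1, 2) for d in degs)

def is_bridge(adj, u, v):
    # (u, v) is a bridge iff deleting it disconnects u from v.
    seen, q = {u}, deque([u])
    while q:
        x = q.popleft()
        for y in adj[x]:
            if (x, y) not in ((u, v), (v, u)) and y not in seen:
                seen.add(y)
                q.append(y)
    return v not in seen

L, k = 50, 3
cycle = {i: [(i - 1) % L, (i + 1) % L] for i in range(L)}
path = {i: [j for j in (i - 1, i + 1) if 0 <= j < L] for i in range(L)}

# k-hop view around an edge = union of the endpoints' k-balls.
view_c = induced(cycle, ball(cycle, 0, k) | ball(cycle, 1, k))
view_p = induced(path, ball(path, 25, k) | ball(path, 26, k))

# Identical k-hop edge views: both are paths on 2k + 2 nodes ...
assert looks_like_path(view_c) and looks_like_path(view_p)
assert len(view_c) == len(view_p) == 2 * k + 2
# ... yet only the path edge is a bridge.
assert is_bridge(path, 25, 26) and not is_bridge(cycle, 0, 1)
```

The two views are paths of the same length, hence isomorphic as rooted edge neighborhoods (the roots align by symmetry in this construction).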
The same tasks become locally decidable if preprocessing appends
sufficiently informative global features. Effective resistance
(resistance distance) is a canonical example because it encodes global
connectivity through a numerical invariant computable from the Laplacian
pseudoinverse. Let R(u, v) denote the
effective resistance between nodes u and v in an unweighted connected graph,
interpreted as the voltage drop induced by one unit of current injected
at u and extracted at v. Suppose the preprocessor fpre augments G with exact values R(u, v) (e.g.,
stored as edge features on the complete graph on V, or accessible on demand). Then
the bridge predicate becomes 0-local:
an existing edge (u, v) ∈ E is a
bridge if and only if R(u, v) = 1.
Consequently, after preprocessing, a local rule can label each edge by
reading a single scalar. Likewise, cut-vertex detection becomes 1-local: a node u is a cut vertex if and only if
there exist neighbors s, t ∈ N(u)
such that the resistance metric satisfies the additive condition
R(s, t) = R(s, u) + R(u, t),
which witnesses that all current (and hence all paths in an appropriate
metric sense) between s and
t must pass through u. This condition can be checked by
a 1-hop local verifier that inspects
u, its neighbors, and the
pairwise resistances among {u} ∪ N(u). In
both cases, we obtain a sharp contrast: the raw tasks are global on
natural instance families, but once a heavy global precomputation is
allowed, the same tasks collapse to constant-radius locality.
This example also clarifies why budgets must accompany locality claims. Exact all-pairs resistance distances typically require solving linear systems or computing a Laplacian pseudoinverse, with worst-case cost far beyond what is implicit in a ``k-layer message passing’’ description, and storing R explicitly requires Θ(n2) bandwidth in X″. From the standpoint of attribution, a downstream predictor that succeeds using such features is exploiting the global computation performed by fpre, not circumventing locality limitations by architectural means. The formal setup in the next section turns these observations into a measurable quantity rpre(P, fpre) paired with an explicit preprocessing budget Tpre(G), allowing us to state and compare expressiveness claims relative to declared preprocessing power.
We fix a universe of attributed graphs and make explicit which parts of a learning pipeline are treated as input, which are treated as preprocessing, and which are treated as learning. An (attributed) graph is a triple G = (V, E, X) with |V| = n and |E| = m, where X denotes the collection of node/edge features (and, when convenient, a graph-level feature vector). We allow directed or undirected edges and allow multigraphs, but our examples and locality statements are most naturally read for simple undirected graphs. An instance family 𝒢 is any set of graphs closed under the restrictions we wish to impose (e.g., connectivity, bounded degree, planarity, or a fixed feature schema). All locality and budget statements are parametrized by 𝒢; without such a family, even the meaning of ``bounded radius'' claims is typically vacuous.
A graph prediction task (or property) is a mapping P : 𝒢 → 𝒴, with three canonical
output types: node-level (𝒴 = {0, 1}V), edge-level (𝒴 = {0, 1}E), and
graph-level (𝒴 ∈ {0, 1} or ℝ).
We write P(G)u
for the node label of u, P(G)(u, v)
for the edge label, and P(G) for graph-level
outputs. Nothing in the setup requires Boolean labels; multiclass labels
or real-valued regression targets can be handled by replacing {0, 1} with an appropriate codomain. We
emphasize that P is a task
specification; it need not be efficiently computable, although the
complexity TP(G)
in a centralized model will play a role in our leakage upper bounds.
A preprocessing procedure is a function
fpre : 𝒢 → 𝒢″, G ↦ G″ = (V″, E″, X″).
We allow fpre to
change topology (add/delete edges), to add auxiliary nodes, and to
replace or augment features. In particular, fpre may encode global
information in X″
(e.g., distance matrices, spectral quantities, motif counts) or may
create shortcut edges to reduce distances. To compare outputs to the
original task P(G),
we assume that fpre
provides an explicit identification of the original prediction items
inside the derived graph, e.g., an injective map ιV : V ↪ V″
for node tasks and an injective map ιE : E ↪ E″
for edge tasks, or, equivalently, a binary
``is-original'' tag plus a stable ordering. This is a mild bookkeeping requirement: without it, one could trivially ``solve''
a task by permuting items and changing the meaning of coordinates.
For a graph G′ = (V′, E′, X′), we write NkG′(u) for the k-hop neighborhood of a node u ∈ V′, and for an edge (u, v) ∈ E′ we use NkG′((u, v)) := NkG′(u) ∪ NkG′(v). The corresponding labeled induced subgraph is denoted G′[NkG′(u)] (or G′[NkG′((u, v))]), including all node/edge features restricted to the induced subgraph. In all definitions below, isomorphisms are required to preserve features exactly; thus locality is always a statement about the feature schema X″.
We formalize a local post-preprocessing rule as a family of functions
that read only bounded-radius induced neighborhoods. For node-level
prediction on G″, a
rule gk
assigns to each u ∈ V
an output
gk(G″, u) = ψ (G″[NkG″(ιV(u))], ιV(u))
for some function ψ invariant
under rooted isomorphism of the labeled neighborhood. For edge-level
tasks we analogously require
gk(G″, (u, v)) = ψ (G″[NkG″(ιE(u, v))], ιE(u, v)).
For graph-level tasks, we adopt a uniform convention that the
preprocessor may introduce (or designate) a readout node ρ ∈ V″ and that
the output is computed as gk(G″) = ψ(G″[NkG″(ρ)], ρ).
This convention is without loss for our purposes: any constant-size
global readout (a special token, a pooled feature vector, or an appended
``graph node’’) is representable by preprocessing, and our objective is
precisely to attribute such representational power to fpre.
We say that P ∘ fpre is k-local on
𝒢 if there exists a k-local rule gk such that for
all G ∈ 𝒢, the outputs agree
on the original prediction items:
P(G)u = gk(fpre(G), u) ∀u ∈ V, P(G)(u, v) = gk(fpre(G), (u, v)) ∀(u, v) ∈ E,
and similarly for graph-level tasks. We postpone to the next section the
definition of the minimal such k,
which will be our leakage radius rpre(P, fpre).
To compare preprocessors that perform qualitatively different
operations (distance computations, spectral routines, oracle calls,
external solvers), we associate to each fpre a budget Tpre(G). Our
default choice is worst-case running time in the word-RAM model with
O(log n)-bit words,
with the understanding that feature values written to X″ must be representable
at the declared precision. When preprocessing uses additional resources,
we absorb them additively into Tpre, e.g.,
Tpre(G) = tRAM(G) + ∑iαi qi(G) + β ⋅ size(X″(G)),
where qi(G)
counts the number of queries to an oracle/subroutine i (such as a shortest-path oracle, a
linear-system solver, or a #SubGI
oracle), and size(X″(G))
measures the number of machine words written into node/edge features
(and, if relevant, the number of auxiliary nodes/edges created). The
coefficients αi, β
are bookkeeping weights that can be set to 1 when all costs are measured in a common
unit; we keep them explicit only to emphasize that
``time'' and ``feature bandwidth'' are both part of
preprocessing power. In many of our statements we suppress
polylogarithmic factors and write Õ(⋅).
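As a lightweight sketch of this accounting (all field names, weights, and the example oracle are illustrative, mirroring the displayed formula; nothing here is a standard API), the budget can be tracked as follows.

```python
from dataclasses import dataclass, field

@dataclass
class PreprocBudget:
    """T_pre(G) = t_RAM + sum_i alpha_i * q_i(G) + beta * size(X''(G))."""
    t_ram: int = 0                                      # word-RAM steps
    oracle_calls: dict = field(default_factory=dict)    # oracle id -> (q_i, alpha_i)
    words_written: int = 0                              # size(X''(G)) in machine words
    beta: float = 1.0                                   # weight on feature bandwidth

    def total(self) -> float:
        return (self.t_ram
                + sum(q * a for q, a in self.oracle_calls.values())
                + self.beta * self.words_written)

# e.g., a preprocessor making 50 shortest-path-oracle calls (weight 20 each)
# and writing 2500 feature words:
b = PreprocBudget(t_ram=10_000, oracle_calls={"sp_oracle": (50, 20)}, words_written=2_500)
assert b.total() == 10_000 + 50 * 20 + 2_500
```

Keeping oracle calls and feature bandwidth as separate fields makes it easy to report each channel of preprocessing power explicitly, as the text requires.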
Independently of learning, we fix a centralized complexity measure TP(G), defined as the worst-case running time of an algorithm computing P(G) in the same computational model. The pair (Tpre, TP) will serve as the comparison that makes leakage statements meaningful: if Tpre is allowed to match TP up to Õ(1) factors, then fpre can in principle compute the labels and encode them locally; conversely, if Tpre is restricted, then nontrivial lower bounds on post-preprocessing locality become possible.
Finally, to distinguish ``local but computationally unbounded'' readout rules from local rules that plausibly correspond to small neural modules, we parameterize local rules by a verifier class VClass (e.g., TIME(poly), TC0, or a fixed-depth MLP applied to an adjacency-encoded ego graph). This parameter will enter our reporting protocol but does not affect the definition of k-locality itself: locality constrains which information is accessible (bounded-radius induced neighborhoods), while VClass constrains how that information may be processed.
With these conventions in place, we can define leakage as the minimal post-preprocessing radius needed to decide P ∘ fpre, and we can report it together with the preprocessing budget and the verifier class, independently of the downstream architecture.
Fix an instance family 𝒢, a task
P : 𝒢 → 𝒴, and a preprocessing
function fpre : 𝒢 → 𝒢″.
The preceding definitions single out a notion of locality after preprocessing
(bounded-radius induced neighborhoods in G″) and a notion of correctness
(agreement with P(G)
on the original prediction items via the identification maps).
We now compress these into a single numerical quantity measuring how
much of the task can be decided after preprocessing.
For each k ≥ 0, recall that
``P ∘ fpre
is k-local on 𝒢’’ means that there exists a k-local decision rule gk whose outputs
coincide with P(G) on
all original nodes/edges (or on the designated graph readout) for every
G ∈ 𝒢.
We define the leakage radius as
rpre(P, fpre; 𝒢) := inf { k ∈ ℕ : P ∘ fpre
is k-local on 𝒢 },
with the convention rpre(P, fpre; 𝒢) = ∞
if no finite k suffices.
When 𝒢 is fixed throughout, we suppress
it and write rpre(P, fpre).
Two features of this definition are deliberate.
First, rpre is a worst-case
radius over 𝒢: it asserts that a single
locality bound works uniformly for all instances in the family, rather
than only on average.
Second, rpre is
measured after preprocessing, i.e., neighborhoods are taken in G″ = fpre(G)
and include the augmented feature schema X″.
Thus rpre
attributes any locality reduction caused by shortcut edges, auxiliary
nodes, or globally-computed features to the preprocessor, not to the
downstream model.
The definition quantifies over the existence of some local rule gk (with the
required rooted-isomorphism invariance on labeled ego graphs).
We do not assume that a learner will find gk, nor that
gk is
statistically estimable from limited data.
This is intentional: leakage is meant to be an existential statement.
If rpre(P, fpre)
is small, then there exists a bounded-radius procedure that recovers the
labels from G″; the
downstream architecture, optimization, and sample size only govern
whether this procedure is realized in practice.
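Whether a given k admits such a rule can also be probed empirically, in the contrapositive direction: a diagnostic for rpre searches for violations of k-locality, i.e., two prediction items with isomorphic labeled k-views but different ground-truth labels. The sketch below is a minimal self-contained version (all helper names are ours); it compares views by rooted color refinement, which is exact on tree-shaped views such as the ones arising here and a sound-but-incomplete heuristic in general. The instance family is a lollipop (cycle plus pendant path; the attachment node is a cut vertex) versus a theta graph (the path's far end reattached to the cycle; the attachment node is not).

```python
from collections import deque

def ball(adj, src, k):
    dist, q = {src: 0}, deque([src])
    while q:
        u = q.popleft()
        if dist[u] < k:
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
    return set(dist)

def rooted_invariant(adj, root, k):
    # Color refinement on the induced k-ball with the root marked.
    # Exact for tree-shaped views; a sound-but-incomplete heuristic otherwise.
    nodes = ball(adj, root, k)
    sub = {u: [v for v in adj[u] if v in nodes] for u in nodes}
    col = {u: int(u == root) for u in nodes}
    for _ in range(len(nodes)):
        col = {u: hash((col[u], tuple(sorted(col[v] for v in sub[u]))))
               for u in nodes}
    return (col[root], tuple(sorted(col.values())))

def klocal_violation(instances, k):
    # instances: (adjacency, prediction item, ground-truth label) triples.
    seen = {}
    for adj, root, label in instances:
        inv = rooted_invariant(adj, root, k)
        if inv in seen and seen[inv] != label:
            return True   # indistinguishable k-views, different labels
        seen[inv] = label
    return False

L = 40
def attach_path(close_up):
    # Cycle 0..L-1 plus a path L..2L-1 attached at node 0;
    # if close_up, the path's far end is also joined to node L//2 (theta graph).
    adj = {i: [(i - 1) % L, (i + 1) % L] for i in range(L)}
    adj.update({i: [] for i in range(L, 2 * L)})
    chain = [0] + list(range(L, 2 * L)) + ([L // 2] if close_up else [])
    for a, b in zip(chain, chain[1:]):
        adj[a].append(b); adj[b].append(a)
    return adj

instances = [(attach_path(False), 0, 1),   # node 0 is a cut vertex (lollipop)
             (attach_path(True), 0, 0)]    # node 0 is not (theta)
# A violation at every small k is evidence that cut-vertex detection
# is not k-local on this family (no certified k-local rule can exist here).
assert all(klocal_violation(instances, k) for k in range(8))
```

Finding no violation at some k yields only an empirical upper-bound estimate r̂, not a certificate; exhibiting an explicit k-local rule remains the certified route.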
Locality alone does not specify how costly it is to process the
induced neighborhood.
A rule that reads G″[Nk(⋅)]
might still solve an NP-hard subproblem within that neighborhood, or
might require arithmetic with extremely high-precision real features
written by preprocessing.
Accordingly, we parameterize our leakage statements by a verifier class VClass, a computational class in which gk is required
to lie (e.g., TIME(poly), TC0, or a fixed-size neural module
applied to an adjacency-encoded ego graph).
Formally, given VClass we may define
the restricted radius
rpreVClass(P, fpre) := inf { k : ∃ gk ∈ VClass
such that gk witnesses k-locality
of P ∘ fpre },
again taking the value ∞ if no such
witness exists.
In the absence of an explicit VClass,
our default is the existence-based rpre above.
We package the preprocessing cost, the induced locality, and the
computational nature of the local rule into a single report:
Leak(P, fpre) = (Tpre, rpre, VClass, notes).
Here Tpre(G) is the
declared preprocessing budget (time/oracle/bandwidth, as specified in
the setup), while rpre is either the exact
radius or a certified upper bound rpre ≤ k
obtained by exhibiting an explicit k-local rule.
The field notes is reserved for the
bookkeeping that determines whether the report is meaningful and
comparable across works: which instance family 𝒢 is in force, how auxiliary nodes/edges are
tagged, whether feature values are exact or approximate, and whether the
stated radius is proved or only observed empirically (e.g., via
ego-graph baselines constrained to radius k).
This tuple is the object we will later compare across preprocessors; in
particular, two pipelines with identical downstream architectures but
different fpre
should not be viewed as comparable unless both are accompanied by such a
report.
A central point is that Leak(P, fpre) is
defined independently of whether the subsequent learner is a message-passing GNN, a
graph Transformer with global attention, or any other
architecture.
The reason is structural: our leakage question is not ``can a given
model class compute P from G″?'' but rather ``is
P already locally decidable on
the representation G″?''
If the answer is yes with small radius, then any sufficiently expressive
architecture (including ones with global access) has, in principle, an
easy path: it may simply implement (or approximate) the local rule gk and ignore
long-range computation.
Conversely, if the answer is no (large rpre or ∞ within a plausible VClass), then even a globally attentive model
is not being helped by hidden label information in X″ at small radii, and
any success must come from genuine global inference or from nonlocal
access patterns not captured by our locality model.
We emphasize that this invariance is not a claim that all
architectures behave the same; rather, it is a claim about attribution.
Whatever advantage a graph Transformer gains from receiving G″ rather than G is, by definition, part of
preprocessing power whenever it is witnessed by a small rpre.
In this sense, rpre
is an architecture-agnostic measure of ``how much of the task has been
compiled into the input.’’
The quantity rpre should be read as a
locality-collapsing effect of preprocessing: rpre = 0 means the labels
are readable from single nodes/edges (or from a constant-size readout
token), while rpre = 1 means a node’s
label can be verified from its immediate neighborhood in G″, and so on.
When rpre is small
and Tpre is large,
it becomes difficult to attribute downstream performance to
``expressiveness’’ of the learner; the task may have been made trivial
by preprocessing.
The next section formalizes this intuition by showing that, when Tpre is allowed to match
the centralized complexity TP, locality
collapse is unavoidable in the strongest possible sense.
We now formalize the strongest possible version of the leakage
phenomenon: once preprocessing is permitted to spend essentially the
same resources as a centralized solver for the task, it can compile the
answers (or checkable certificates thereof) into the representation so that the
remaining decision is local in the sense of rpre.
The conclusion is not that typical feature-engineering pipelines behave
this way, but rather that budgets must be declared and bounded, because otherwise the preprocessor can
always trivialize the remaining computation.
Theorem~ makes precise the following implication: if the
preprocessing stage is permitted to spend a budget comparable to that of
solving the task centrally, then the post-preprocessing locality radius
can always be forced down to a constant.
In that regime, any subsequent ``learning architecture’’ receives an
input on which the task is already locally decidable.
Thus, downstream success cannot be attributed to long-range message
passing, global attention, or higher-order aggregation—the nontrivial
part of the computation has already occurred inside fpre.
Formally, consider any model class 𝒜
operating on G″
after preprocessing.
If we allow Tpre(G) ≥ Ω̃(TP(G)),
then there exists a pipeline in which 𝒜
is superfluous: one may replace it by the constant-radius rule gk from the
proof (with k ∈ {0, 1}
depending on task type) while preserving exact correctness on all
instances.
This is precisely the sense in which expressiveness statements become
vacuous without an explicit preprocessing budget: any architectural
advantage can be simulated by preprocessing that simply writes the
answers into the representation.
Two clarifications are worth recording.
First, the collapse does not require large neighborhoods or subtle
coding; the features required are only the label bits (or a
bounded-precision real value for regression), placed on the relevant
prediction items.
Second, the collapse is compatible with augmentations that substantially
change topology: the definition of rpre measures locality
in G″, so adding
shortcut edges, hub nodes, or globally-computed features is exactly the
mechanism by which leakage is expressed and must be accounted for.
In summary, Theorem~ provides an unconditional upper bound on rpre as a function of preprocessing resources: once Tpre is allowed to approach TP, locality can be made constant, and comparisons of downstream architectures become uninterpretable unless one simultaneously reports (and restricts) Tpre and the permitted feature/topology augmentations.
Theorem~ is an achievability statement: with a budget comparable to a centralized
solver, preprocessing can always force rpre down to a constant
by compiling answers (or checkable certificates) into the
representation.
To obtain a framework in which rpre is not automatically
vacuous, we must therefore work in regimes where such compilation would
contradict standard complexity assumptions, or where compilation would
require oracle power that we explicitly forbid or budget for.
We record two complementary barriers: NP-hardness (as a coarse-grained
obstruction) and #P/PH oracle
separations (as an obstruction to ``feature oracles’’ such as
unrestricted motif counting).
Suppose P is NP-hard under
polynomial-time many-one reductions.
If there existed a polynomial-time preprocessor fpre such that P ∘ fpre were
0-local with a polynomial-time verifier
g, then the composite
procedure
G ↦ fpre(G) ↦ g(fpre(G))
would decide P in polynomial
time.
This yields the conditional barrier formalized by Theorem~3 in our
theorem list: under the assumption P ≠ NP, no polynomial-time preprocessing
scheme can reduce an NP-hard decision task to a trivial 0-local readout with an efficient
verifier.
In other words, for NP-hard tasks, locality collapse is possible only by
paying one of the following prices: (i) super-polynomial Tpre, (ii) a verifier
g whose computational power is
itself beyond polynomial time, or (iii) a relaxation of the task notion
(approximation, promises, or distributional/average-case
correctness).
Two remarks are necessary for correct interpretation.
First, this is a worst-case obstruction only in the weak sense that it rules out a
single polynomial-time fpre that works on all
inputs G ∈ 𝒢; it does not
forbid heuristics that collapse locality on restricted families or
typical instances.
Second, the statement is about exact solving: in learning settings, it remains
possible (and common) that preprocessing makes labels predictable from small-radius
neighborhoods without making the task algorithmically easy in the worst
case.
This is precisely why our reporting protocol allows an empirical
estimate r̂ alongside certified
bounds.
The preceding NP-hardness barrier is easily circumvented if we
silently grant preprocessing access to computationally powerful
oracles.
Many ``feature engineering’’ pipelines do precisely this in practice
(e.g., calls to ILP/SAT solvers, exact substructure enumeration, or
external chemistry toolkits), even when the downstream model is
presented as a purely local learner.
Accordingly, our budget Tpre must be understood
as inclusive of all external resources: it must include not only RAM time, but also the number and type of
external oracle calls.
A particularly clean separation arises for subgraph- and
motif-counting features.
If fpre is
permitted to output exact values of #SubGI(H ↪ G) for
unrestricted pattern families ℋ, then
(by Theorem~4 in our theorem list) preprocessing is #P-hard in general; moreover, even a single
adaptive #P query together with
polynomial-time computation suffices to decide all languages in PH (via standard Toda-style
inclusions).
Thus, in the presence of such an oracle, locality claims become
uninformative: the preprocessor can encode extremely hard global
information into node/edge features, after which a constant-radius
downstream rule can appear to have ``global expressiveness’’ while in
fact consuming #P-level power
upstream.
The appropriate resolution is not to ban all global features, but to declare and
budget them: if exact motif or subgraph counts are used, the leakage report
must reflect oracle complexity (or restrict ℋ to fixed-size patterns where exact counting
is tractable on the relevant graph class).
Locality is an informational restriction (bounded radius in G″), not a computational
one.
Even with small rpre, one could in
principle place all computational hardness into the local decision rule
gk by
allowing gk to be
arbitrarily expensive.
This is why our leakage tuple includes an explicit verifier class VClass.
At the extreme, if VClass = TIME(exp),
then any property becomes 0-local by
simply copying the input G
into a large feature string at a distinguished node and letting g0 solve the problem
there.
To avoid this degeneracy, one must either (i) constrain VClass to a neural-plausible or circuit-like
class (e.g. small MLP, TC0,
low-depth circuits), or (ii) explicitly include the evaluation time
Tgk
in the accounting.
Theorem~5 captures the relevant simulation principle: an exact k-local certificate P(G) = gk(fpre(G))
implies a centralized algorithm with runtime Õ(Tpre(G) + Tgk(G)).
Consequently, any (assumed) lower bound on TP transfers to
a lower bound on Tpre for achieving small
rpre with efficient
verifiers.
The above discussion yields a minimal set of methodological
commitments.
To interpret downstream performance as evidence of model expressiveness
rather than preprocessing leakage, we must (a) specify Tpre including external
calls and oracle access; (b) bound the output bandwidth of preprocessing
(feature dimension, bit precision, and any O(n2) objects
such as SPD/RD tables) since bandwidth is itself a channel for leakage;
and (c) declare a verifier class VClass
so that
``local but computationally heavy'' readouts are not conflated with ``local
and easy'' ones.
Under these declarations, the non-triviality claims become conditional
but operational: if P is
NP-hard, exact locality collapse with polynomial budgets would imply
P = NP; if preprocessing uses #P-level primitives, then any resulting
locality collapse must be attributed to that oracle power.
These principles guide the case studies that follow, where we certify
tight rpre bounds
for concrete, commonly used preprocessors and explicitly relate those
bounds to their computational and informational costs.
We now instantiate the (Tpre, rpre)-lens on several preprocessors that are routinely used (explicitly or implicitly) to ``globalize’’ local architectures. Our goal in each case study is to (i) identify a natural task family P whose apparent non-locality can be collapsed by the preprocessor, (ii) give an explicit constant-radius decision rule on G″ = fpre(G) (hence a certified upper bound on rpre), and (iii) make the required preprocessing budget and bandwidth explicit.
Let fpre augment G with exact effective resistance values R(u, v) between relevant pairs of vertices. In the strongest form, fpre stores all-pairs resistances, i.e. a matrix R ∈ ℝn × n whose entries are written into node features (e.g. node u carries the vector R(u, ⋅)). This representation is expensive in both computation and bandwidth: exact all-pairs resistances require access to the Laplacian pseudoinverse and, in general, incur Ω(n2) output size and at least Ω(n2) work simply to write the features. Nevertheless, once such features exist, several classical global connectivity predicates become constant-radius.
Concretely, by Theorem~2, bridge detection becomes 0-local: for an edge (u, v) ∈ E, one
checks whether R(u, v) = 1. No
neighborhood exploration is necessary; the label depends only on the
edge feature R(u, v) (or
equivalently on the endpoints’ feature vectors restricted to the
coordinate v and u). Cut-vertex detection becomes
1-local: a node u is a cut vertex if and only if
there exist neighbors s, t ∈ N(u)
such that
R(s, t) = R(s, u) + R(u, t),
which can be verified by inspecting the induced subgraph on N1G″(u).
The tightness aspect is equally important: on plain graphs without such
resistance features, both bridge and cut-vertex detection are global in
the locality sense (one can witness non-locality on path-versus-cycle
families where arbitrarily large neighborhoods are indistinguishable
around a given edge/node while the global label flips). Thus the
preprocessor has collapsed rpre from Ω(D) to 0/1, and the only plausible attribution for
any downstream ``global’’ success is the resistance augmentation
itself.
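To make the certificate concrete, the rule ``bridge iff R(u, v) = 1'' can be sketched in stdlib Python, computing exact resistances from the grounded Laplacian with rational arithmetic. The helper names are ours and purely illustrative; exact arithmetic sidesteps floating-point issues at the cost of scaling only to small demonstration graphs.

```python
from fractions import Fraction

def effective_resistance(n, edges, u, v):
    """Exact effective resistance R(u, v) on a connected graph."""
    # Build the graph Laplacian over the rationals for exact arithmetic.
    L = [[Fraction(0)] * n for _ in range(n)]
    for s, t in edges:
        L[s][s] += 1
        L[t][t] += 1
        L[s][t] -= 1
        L[t][s] -= 1
    # Ground node 0 and solve the reduced system L' x = e_u - e_v.
    idx = [i for i in range(n) if i != 0]
    pos = {node: j for j, node in enumerate(idx)}
    A = [[L[i][j] for j in idx] for i in idx]
    b = [Fraction(0)] * (n - 1)
    if u != 0:
        b[pos[u]] += 1
    if v != 0:
        b[pos[v]] -= 1
    m = n - 1
    # Gauss-Jordan elimination with exact pivoting (fine for small graphs).
    for col in range(m):
        piv = next(r for r in range(col, m) if A[r][col] != 0)
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(m):
            if r != col and A[r][col] != 0:
                f = A[r][col] / A[col][col]
                for c in range(col, m):
                    A[r][c] -= f * A[col][c]
                b[r] -= f * b[col]
    x = [b[r] / A[r][r] for r in range(m)]
    xu = x[pos[u]] if u != 0 else Fraction(0)
    xv = x[pos[v]] if v != 0 else Fraction(0)
    return xu - xv

def is_bridge_0local(R_uv):
    # The 0-local certificate: an edge (u, v) is a bridge iff R(u, v) = 1.
    return R_uv == 1
```

On a 3-path every edge has R = 1 and is flagged as a bridge; on a 4-cycle adjacent vertices have R = 3/4, so the rule correctly rejects them.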
A second canonical preprocessor computes shortest-path distances. Let
D denote the all-pairs
shortest-path matrix (SPD). If fpre writes to each node
u the vector D(u, ⋅) ∈ ℕn,
then any node-level statistic that is a function of distances from u becomes 0-local. For example, eccentricity, closeness
centrality, and t-reachability
labels admit 0-local readout:
ecc(u) = maxv ∈ VD(u, v), close(u) = (∑v ∈ VD(u, v))−1, 1{∃v: D(u, v) ≤ t ∧ ϕ(v)}.
Similarly, graph-level quantities such as diameter diam(G) = maxu, vD(u, v)
become 0-local provided we permit a designated
readout location (e.g. a ``virtual’’ graph node whose features contain
the entire matrix, or a convention that graph-level labels are stored in
a global feature slot). Absent such conventions, one can still obtain
1-locality by adding a virtual node
z adjacent to all vertices and
storing global summaries at z.
The cost story is immediate: computing exact SPD is typically O(nm) in unweighted graphs by running BFS from each node, and the output bandwidth is Θ(n2) even before precision/encoding considerations. Moreover, rpre depends subtly on the verifier class. If the preprocessor stores ecc(u) directly as a scalar, the 0-local verifier can lie in a very weak class (constant-depth arithmetic on O(log n)-bit words). If instead it stores the entire vector D(u, ⋅), then the 0-local verifier must compute a maximum over n values, which is still polynomial-time but not ``neural-plausible'' in the strongest circuit-theoretic senses unless we explicitly allow such aggregation. This illustrates why rpre must be reported together with VClass: the same distance features can be either a trivial leakage channel (when they already contain the answer) or merely a convenient representation (when substantial computation remains in the verifier).
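A minimal sketch of this pipeline, with fpre writing D(u, ⋅) into each node's features via one BFS per node (the stated O(nm) cost on unweighted graphs) and 0-local readouts on top. Function names are ours, for illustration only.

```python
from collections import deque

def spd_features(adj):
    # f_pre: write the full distance vector D(u, .) into node u's features.
    # Cost: O(nm) time via one BFS per source; Theta(n^2) output bandwidth.
    n = len(adj)
    feats = []
    for s in range(n):
        dist = [None] * n
        dist[s] = 0
        q = deque([s])
        while q:
            x = q.popleft()
            for y in adj[x]:
                if dist[y] is None:
                    dist[y] = dist[x] + 1
                    q.append(y)
        feats.append(dist)
    return feats

def ecc_0local(feat_u):
    # 0-local readout: eccentricity is a function of u's own feature vector,
    # but note the verifier still aggregates over n stored values.
    return max(feat_u)

def diameter_at_virtual_node(feats):
    # Graph-level readout at an agreed location (here: a global feature slot).
    return max(max(row) for row in feats)
```

For example, on the 4-path with adjacency lists `[[1], [0, 2], [1, 3], [2]]`, node 0 carries the feature vector [0, 1, 2, 3] and the 0-local readouts recover eccentricities and the diameter without any neighborhood exploration.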
Positional encodings (PEs) are often presented as ``features,’’ but many of them behave, for locality, like topology modifications. For instance, Laplacian eigenvector encodings and related spectral coordinates require global computation; random-walk structural encodings require solving linear systems or simulating walks at scale. Once written into X″, however, they can reduce the effective radius needed to separate nodes or to verify global certificates (e.g. by giving each node a nearly unique coordinate system).
More importantly, several PE pipelines add edges determined by global
information (e.g. k-nearest-neighbor edges in the PE
space, diffusion edges, or a global
``CLS''/virtual node). In our framework these are bona fide changes from G to G″ = (V″, E″, X″). Adding a universal virtual node z connected to all u ∈ V immediately makes many graph-level predicates 1-local at z, because G″[N1G″(z)] contains all vertices (and often their full feature payload). Similarly, adding long-range PE-induced edges can reduce the diameter of G″ and hence lower rpre even if no explicit labels are compiled. Thus PE design must be audited not only for runtime but also for induced locality: an architecture may appear to have ``long-range
reasoning'' while in fact operating locally on a graph whose long-range
edges were inserted upstream.
Finally, motif-counting features provide a sharp illustration of oracle-priced leakage. If fpre outputs exact counts #SubGI(H ↪ G) for an unrestricted pattern family ℋ, then the preprocessor can encode extremely hard global information, and by Theorem~4 this capability is #P-hard in general (with standard consequences for PH under oracle access). From a locality perspective, once such counts are placed into X″, the downstream task is frequently 0-local: a node-level label ``does u participate in any copy of H?'' becomes a threshold test on a single feature coordinate, and graph-level existence/counting predicates become constant-radius tests at an agreed readout location. In contrast, when ℋ is restricted to fixed-size motifs (e.g. triangles, 4-cycles), exact counting may be tractable on specific graph classes, and the correct report is then a quantitative one: locality has been reduced, but only by paying a preprocessing cost that must be disclosed (and which may scale as n|V(H)| in the worst case if implemented naïvely).
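For the tractable end of the spectrum, the compilation and its 0-local readout can be sketched for the fixed motif H = K3; the naive enumeration below costs n|V(H)| with |V(H)| = 3, matching the cost caveat above (helper names are ours, illustrative only).

```python
from itertools import combinations

def triangle_counts(adj):
    # f_pre for a fixed-size motif H = K_3: per-node triangle participation
    # counts, by naive O(n^3) enumeration (fixed |V(H)| keeps this polynomial).
    n = len(adj)
    nbr = [set(a) for a in adj]
    counts = [0] * n
    for u, v, w in combinations(range(n), 3):
        if v in nbr[u] and w in nbr[u] and w in nbr[v]:
            counts[u] += 1
            counts[v] += 1
            counts[w] += 1
    return counts

def participates_0local(count_u):
    # 0-local readout: "does u lie in any copy of H?" is a threshold test
    # on a single precomputed feature coordinate.
    return count_u > 0
```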
Across effective resistance, SPD features, positional encodings, and motif counts, the same structural phenomenon recurs: once global invariants are written into X″ (or once E″ is augmented to shrink graph distances), the post-preprocessing radius rpre for many ``global’’ tasks becomes constant, often 0 or 1. The only scientifically meaningful question is therefore not whether a downstream learner can exploit these representations, but how much computational and informational power was consumed upstream to create them. This motivates the next section, where we present concrete leakage-auditing procedures that either certify small rpre analytically or diagnose practical locality collapse empirically via strictly local ego-graph baselines.
We now make the (Tpre, rpre)-report operational by giving auditing procedures that (a) certify small post-preprocessing locality whenever an explicit local rule exists, and (b) diagnose practical locality collapse when formal certificates are unavailable or inconvenient. Both procedures treat preprocessing as part of the computational budget and therefore allow us to attribute performance either to upstream computation or to downstream inference.
The analytic audit is a proof obligation: we exhibit an explicit
k-local decision rule gk such
that
P(G) = gk (fpre(G)) for
all G ∈ 𝒢,
where gk
is constrained to read only G″[NkG″(u)]
for node/edge labels (or an agreed readout location for graph labels)
and lies in a stated verifier class VClass. Once such a rule is written down, we
have a certified bound rpre(P, fpre) ≤ k
by definition. Crucially, we insist that the certificate uses
information present in G″; the audit is not a
proof about the original G,
but about what the preprocessor has already placed into
topology/features.
For reporting, we recommend that a paper claiming improved
``long-range'' reasoning from preprocessed features include (a) the smallest k for which a certificate is known, (b) a short description of gk, and (c) an upper bound on the total preprocessing work Tpre(G), including any calls to solvers/oracles and the cost of writing Θ(n2)-size outputs when applicable. When such a certificate exists, Theorem~5 further implies an attribution statement: if rpre is small with an efficient gk, then the pipeline implicitly contains an efficient centralized algorithm for P, and any purported ``expressiveness''
must be charged against preprocessing.
Analytic certificates are not always available for modern feature pipelines (e.g. learned positional encodings, tool-augmented preprocessors, or proprietary feature extractors). In such cases we propose an empirical audit that tests for locality collapse by training baselines that are restricted to k-hop ego-graphs of G″. The guiding principle is: if a classifier that cannot see beyond NkG″(u) achieves near-saturated performance, then the task has effectively become k-local after preprocessing, regardless of whether we can prove it.
Concretely, for node-level tasks we consider, for each k ∈ {0, 1, …, Kmax},
a predictor hk that takes as
input only the induced subgraph G″[NkG″(u)]
with features and outputs a label for u. For edge-level tasks we
analogously use G″[NkG″((u, v))].
The case k = 0 is the
baseline: no adjacency information is visible except what is encoded in
node/edge attributes. We then define an estimated collapse radius
r̂pre := min {k ≤ Kmax: Acc(hk) ≥ τ},
for a task-dependent threshold τ (e.g. within ε of the best reported model, or
within a confidence band around the best attainable performance on the
benchmark).
To ensure the baseline is genuinely local, we enforce two invariants. First, the data pipeline must materialize ego-graphs explicitly and forbid global operations (no batch normalization whose statistics mix nodes from different ego-graphs; no attention across a batch; no global positional indices that reintroduce node identities). Second, when the preprocessor produces graph-level artifacts (e.g. an n × n distance matrix distributed across nodes), we must specify what is actually visible to hk: if node u contains D(u, ⋅), then k = 0 already includes n numbers, and the baseline is allowed to use them, but this must be recorded as bandwidth-heavy leakage rather than interpreted as ``local generalization.’’
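The two core primitives of this audit, explicit ego-graph materialization and the r̂pre estimator, can be sketched as follows, assuming the accuracies Acc(hk) have already been measured (function names are ours, not from any library):

```python
def ego_graph(adj, u, k):
    # Materialize the induced subgraph on N_k(u) in G''. The predictor h_k
    # receives only this dict, so locality holds by construction.
    seen, frontier = {u}, {u}
    for _ in range(k):
        frontier = {y for x in frontier for y in adj[x]} - seen
        seen |= frontier
    return {x: [y for y in adj[x] if y in seen] for x in sorted(seen)}

def estimated_collapse_radius(acc_by_k, tau):
    # r-hat_pre := min { k <= K_max : Acc(h_k) >= tau }; None if no k qualifies.
    for k in sorted(acc_by_k):
        if acc_by_k[k] >= tau:
            return k
    return None
```

On a 5-path, `ego_graph(adj, 2, 1)` returns the induced subgraph on {1, 2, 3} with the edges to nodes 0 and 4 removed, so no information outside the 1-hop ball can leak into h1.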
A minimal audit thus consists of extracting ego-graphs, training radius-restricted predictors, and reporting the resulting collapse curve. These audits supply the bridge to our implementation guidelines: once (Tpre, rpre, VClass) are reported with the above ablations, experiments can explicitly demonstrate (or refute) locality collapse and thereby prevent conflating preprocessing power with downstream architectural expressiveness.
We now describe experimental practices that make the leakage report Leak(P, fpre) = (Tpre, rpre, VClass, notes) empirically meaningful and comparable across pipelines. The goal is not to propose new benchmarks, but to standardize ablations that (i) expose locality collapse when it occurs (as predicted by Theorem~1), (ii) quantify the extent to which performance is attributable to preprocessing rather than downstream inference, and (iii) make tool-augmented or solver-augmented preprocessors auditable.
For each reported model trained on a fixed preprocessed instance
family fpre(𝒢), we
recommend reporting a small set of baselines that isolate what
information is already present at bounded radius in G″. At a minimum, for
node- and edge-level tasks, one should include:
(i) a k = 0 baseline that uses
only node/edge attributes and no adjacency beyond what is encoded as
features;
(ii) a small set of ego-graph baselines for k ∈ {1, 2, 3} (or up to a
task-dependent cap);
and (iii) a full-graph baseline in the same VClass as a sanity check that the learner is
not underpowered.
We emphasize that the ego-graph baselines should be implemented by
explicit subgraph extraction G″[Nk(⋅)]
followed by a predictor that receives only this induced subgraph as
input, so that locality is enforced by construction rather than by an
architectural heuristic.
In addition to reporting the resulting accuracies, it is informative to report the collapse curve k ↦ Acc(hk) and the corresponding r̂pre for a fixed threshold τ. We recommend choosing τ relative to the best reported pipeline for the benchmark, e.g. τ = Acc⋆ − ε with a small ε (or an uncertainty-corrected band), and reporting sensitivity over ε. In our experience, the curve is often more diagnostic than a single number: a sharp jump from k = 0 to k = 1 suggests that preprocessing has produced a locally checkable certificate, whereas a smooth increase may indicate partial leakage through approximate global summaries.
To detect inadvertent label encoding through precomputed features, we recommend a set of controlled corruptions that preserve marginal distributions while destroying alignment with graph structure. Concretely, if X″ decomposes into blocks X″ = (Xraw, Xpre), where Xpre is produced by fpre, then one can (i) permute Xpre(u) across nodes u, (ii) permute only within degree classes or within connected components (to preserve coarse statistics), or (iii) resample Xpre from other graphs in the dataset with matched size. A large performance drop under such perturbations indicates that Xpre is carrying instance-specific information that is not recoverable from local topology alone. These tests are particularly important when fpre outputs high-bandwidth objects (e.g. per-node distance vectors, spectral coordinates, or motif histograms), where a seemingly ``local’’ readout may in fact be exploiting globally computed identifiers.
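One of these corruption controls — permuting the Xpre block across nodes, optionally only within degree classes — can be sketched as follows (a hypothetical helper of ours, seeded for reproducibility):

```python
import random

def permute_pre_block(x_pre, degrees, within_degree=False, seed=0):
    # Corruption control: shuffle precomputed features across nodes,
    # preserving the marginal multiset while destroying alignment with
    # graph structure. With within_degree=True, features only move among
    # nodes of equal degree, preserving coarse degree statistics.
    rng = random.Random(seed)
    idx = list(range(len(x_pre)))
    if within_degree:
        groups = {}
        for i in idx:
            groups.setdefault(degrees[i], []).append(i)
        perm = idx[:]
        for g in groups.values():
            shuffled = g[:]
            rng.shuffle(shuffled)
            for a, b in zip(g, shuffled):
                perm[a] = b
    else:
        perm = idx[:]
        rng.shuffle(perm)
    return [x_pre[perm[i]] for i in idx]
```

A large accuracy drop after applying this corruption, with Xraw left intact, is the operational signature of instance-specific leakage through Xpre.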
We recommend that every experimental table involving preprocessing
include an explicit budget line with at least the following
quantities:
Tpre(G) (measured wall-clock time and an asymptotic bound), Sout(G) (output size in scalars), and b (numeric precision / bits per scalar).
When fpre computes
all-pairs objects (e.g. SPD or effective resistance), one should report
whether the representation is stored explicitly (Θ(n2) scalars)
or sparsified/quantized, since this materially changes both bandwidth
and the effective power of the local readout. If the preprocessor makes
external calls (linear solvers, MILP solvers, SAT solvers, database
queries), then Tpre
must include both the number and type of calls and the total time; if
calls are nondeterministic or time-capped, this should be stated and
treated as part of the experimental condition rather than hidden as
``data preparation.’’
A growing class of pipelines computes features using tools that are
best modeled as oracles: exact/approximate solvers, retrieval systems,
or foundation-model calls. Our recommendation is to treat such tools as
part of fpre and to
report them in the leakage tuple via an itemized budget accounting:
Tpre(G) = (CPU/GPU
time, #oracle queries, total tokens or bytes, tool class).
This makes explicit when a pipeline is implicitly leveraging #P-hard or NP-hard computation
(cf. Theorem~4) or when it is simply amortizing heavy computation
outside the learning loop. In tool-augmented settings, we also recommend
a counterfactual ablation: rerun the downstream learner with the tool outputs replaced by
outputs generated without access to the test instance (e.g. generic
templates, unconditional samples, or outputs from an unrelated graph).
If performance remains high, then the tool output is not carrying
instance information; if performance collapses, then the tool output is
an instance-dependent channel and must be audited for locality and
budget.
This block is intentionally brief but forces the separation of upstream computation from downstream inference. In particular, when a pipeline achieves strong ``long-range’’ performance, the disclosure should make it evident whether the gain is explained by small rpre (locality collapse) together with large Tpre, or whether it persists in regimes where preprocessing is genuinely local and low-budget.
We conclude by isolating several directions in which the leakage tuple Leak(P, fpre) = (Tpre, rpre, VClass, notes) is presently too coarse, and by formulating open problems whose resolution would make locality audits both more faithful and harder to game.
Our definition of rpre is qualitative:
either P ∘ fpre is
k-local (exactly) or it is
not. Empirically, however, preprocessing often yields features that are
neither a full label encoding nor purely local statistics; instead, they
offer a partial signal toward the correct answer that a downstream learner can amplify.
This suggests that one should measure leakage on a continuum. One
natural extension is an ε-locality radius
rpre(ε)(P, fpre) := min {k: ∃ k-local
gk s.t.
Pr [gk(G″[Nk(⋅)]) = P(G)] ≥ 1 − ε},
with the probability taken over an explicit distribution on 𝒢 (or over a worst-case instance family if
one adopts distribution-free agnostic guarantees). This raises two
immediate questions. First, what is the correct error criterion for node/edge-level
tasks: per-label error, average error over nodes/edges, or an
adversarially chosen index? Second, what is the correct reference point for ε: should we compare to Bayes risk
under the distribution, or to the accuracy of a full-graph learner in a
fixed VClass? Even in the simplest
setting, one expects a separation between
``information leakage'' (existence of a k-local predictor with small ε) and ``computational
leakage'' (existence of such a predictor within a restricted VClass). A principled report would therefore
include a pair (rpre(ε), VClass)
rather than a single integer radius, together with an explicit
definition of the error criterion.
A complementary approach is to define an advantage of radius k, e.g.
Advk := supg ∈ 𝒢k(Pr [g(G″[Nk(⋅)]) = P(G)] − Pr [g0(G″[N0(⋅)]) = P(G)]),
where 𝒢k denotes a
fixed hypothesis class of k-local predictors. This makes
explicit that leakage is a gain attributable to increasing radius given the
same VClass, aligning with the collapse
curves advocated in Section~. The open problem is to connect such
operational quantities to information-theoretic measures (conditional
mutual information between P(G) and G″[Nk]
given G″[Nk − 1],
or conditional entropy decay), in a way that is invariant under benign
feature reparameterizations.
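Computing Advk from measured accuracies is then mechanical; the sketch below idealizes the supremum over 𝒢k by the best observed accuracy per radius and enforces monotonicity (since a k-local class subsumes all smaller radii):

```python
def advantage_curve(acc_by_k):
    # Adv_k relative to the k = 0 baseline. Because a k-local class contains
    # every (k-1)-local predictor, we take a running maximum first, so the
    # reported advantage curve is monotone in k.
    ks = sorted(acc_by_k)
    base = acc_by_k[ks[0]]
    best, out = base, {}
    for k in ks:
        best = max(best, acc_by_k[k])
        out[k] = best - base
    return out
```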
Theorems~2–4 highlight that global objects (e.g. all-pairs effective resistance, all-pairs shortest paths, or exact motif counts) can collapse locality dramatically, but exactness is rarely used in practice. Many pipelines compute approximate shortest-path distances, approximate resistances via Laplacian solvers, low-rank spectral embeddings, or quantized distance bins. Formally, these are still global functions of G; nevertheless, their information content is reduced, and their approximation error may destroy the crisp characterizations underlying constant-radius certificates.
This motivates a refined accounting in which the preprocessor output
is constrained not only by time Tpre but also by an
information budget: output size Sout(G) and
numeric precision b. One can
then ask for a tradeoff curve: for a fixed task P, what minimal (Sout, b)
suffices to achieve rpre(ε) ≤ k
for small k? Even for concrete
tasks such as bridge detection, it is unclear how robust the 0-local
rule ``R(u, v) = 1'' is under approximation. The immediate technical challenge is to control false positives/negatives when resistances are perturbed by ±δ, or when shortest-path distances are only available up to additive/multiplicative distortion. The broader open problem is to characterize which classes of approximate global summaries admit locally checkable certificates despite approximation error, and which do not. Such results would separate ``approximate
leakage'' that still yields constant-radius decision rules from
approximate features that merely help learning in a non-certificate
way.
Our definitions are deterministic: after preprocessing, a label is a function of the radius-k induced neighborhood. In distributed computing and property testing, it is standard to allow a local algorithm access to public randomness. If we allow the local verifier gk to use a shared random seed ρ, then one can define a randomized radius rprerand by requiring Prρ[gk(⋅; ρ) = P(G)] ≥ 1 − ε. This is not merely a technicality: public randomness can simulate hashing-based certificates, local sampling, or randomized sketch verification, potentially reducing the radius needed to validate a global summary computed by fpre.
The associated open questions are definitional and complexity-theoretic. Definitional: what randomness is ``free’’ in a learning pipeline audit? If the seed is fixed across all instances, it can become an implicit identifier channel; if it is per-instance, it must be budgeted as part of the output. Complexity-theoretic: can randomized locality meaningfully reduce the preprocessing budget needed to collapse locality for natural problems, without simply reintroducing Theorem~1 by another route (namely, by allowing the preprocessor to write a randomized encoding of the labels)? Any satisfactory theory must treat randomness as a resource with explicit accounting, analogously to time and output bandwidth.
A practical frontier is that fpre is increasingly implemented by a model (graph foundation model, LLM-based tool, retrieval system) that consumes a representation of the entire graph and emits node/edge/graph embeddings. In our framework, such a system is an oracle-augmented preprocessor with budget components that do not fit the word-RAM abstraction: token budgets, retrieval calls, and (often hidden) training compute and training data access. The immediate methodological requirement is to state whether the preprocessor is recomputed per test instance, whether it is transductive (trained on the test graph), and whether it can access labels or label-correlated metadata during pretraining.
A more conceptual open problem is to define an analogue of rpre when fpre is amortized across many instances. If a foundation model has memorized a distribution of graphs, then the per-instance runtime may be small while the computational budget is enormous. Leakage reports that ignore this amortization can be misleading in the opposite direction of Theorem~1: they may understate the effective preprocessing power. We therefore view it as necessary to extend Leak(P, fpre) with provenance fields (training data access, transductive vs inductive, and whether the preprocessor parameters depend on the evaluation instance) and with an amortized budget model. Establishing a standard here is an open problem in its own right, but without it, comparisons across ``feature-free'' learners and tool-augmented pipelines are not well-defined.
Finally, our strongest negative statements are conditional (Theorem~3) or complexity-based (Theorem~4). A missing piece is a set of lower bounds showing that, for specific tasks P, any preprocessor that reduces rpre(ε) below a threshold must output Ω(n) or Ω(n2) bits, regardless of time. Such bounds would formalize the intuition that certain global properties cannot be made locally decidable without transmitting essentially the whole graph (or a near-complete distance matrix) through features. Proving these bounds, even for restricted instance families, appears to require combining locality arguments with communication complexity or streaming lower bounds. We regard this as the most direct route to a theory in which (Tpre, Sout, b, rpre) jointly constrain leakage in a way that cannot be circumvented by simply relocating computation upstream.