Probabilistic programming systems in current use are rarely deployed with a semantics that is both fully exact and fully preserved by the surrounding toolchain. Inference is typically approximate (variational objectives, sequential Monte Carlo with resampling, MCMC with finite burn-in), compilation passes may replace subroutines by faster surrogates, and implementations rely on finite precision and randomized kernels whose idealized counterparts are not literally realized. For these reasons, a notion of equivalence that is strictly extensional (equality of denotations) is often too brittle to support program reasoning as practiced: small, localized approximations should induce small changes in observable behavior, and the relevant mathematical problem is to control such changes uniformly over program contexts. We therefore regard robust, quantitative observational comparison as a primary semantic interface: the semantics should provide a metric in which common transformations are nonexpansive, and in which convergence of approximations implies convergence of induced observations.
Random graph models provide a particularly clear setting in which to make this point. A typical program samples a finite number of vertices and queries adjacency relations between sampled vertices, possibly combining these queries with additional probabilistic choices and deterministic control. Even in this restricted fragment, exact contextual equivalence is subtle: two models may agree on all finite observations while differing on sets of measure zero, and syntactic equality of generators is far from informative. The work we build on develops an exact classification theorem for the undirected simple graph interface, showing that distributive Markov semantics for such programs correspond to graphons, unique up to the usual notion of weak isomorphism. In that picture, the semantic content of a theory is not a particular measurable function W : [0, 1]2 → [0, 1] but its equivalence class under measure-preserving relabelings, exactly matching the standard representation of dense exchangeable random graphs. This identification explains why graphon geometry is the natural ambient structure for program equivalence in the dense regime.
The present work takes the same classification result as a starting point and asks for a quantitative refinement compatible with program observation. Our guiding principle is that the semantic distance between two theories should be defined by a family of tests that are themselves programs, and that are stratified by the amount of sampling they perform. Concretely, we test theories using closed boolean programs that sample at most k vertices via new and then query edge among the sampled vertices. Such tests are operationally meaningful—they correspond to experiments that inspect a bounded number of sampled vertices—and they align directly with the statistics used throughout graph limit theory. We measure discrepancies between test outcomes in total variation, so that postprocessing (deterministic or stochastic) is automatically nonexpansive and so that the resulting distances admit a direct probabilistic interpretation as the maximal distinguishing advantage of a bounded experiment.
The resulting hierarchy dk of observational distances and its aggregate dprog serve two simultaneous purposes. First, they provide a robust, program-defined metric on theories, suitable for reasoning about approximate implementations and approximate compilation. Second, they connect the semantics of probabilistic programs to the established analytic notions on graphons, notably induced subgraph densities and the cut metric. This connection is not merely qualitative: the basic tests realized by k-bounded programs generate exactly the same σ-algebra of finite observations as induced k-vertex subgraph events, and the standard stability bounds for subgraph counts yield explicit Lipschitz control of program observables by cut distance. Conversely, convergence in program distance implies convergence of all finite induced laws, and hence (by classical compactness and separation arguments in dense graph limits) convergence in cut distance up to equivalence.
The net effect of our contributions is that random graph theories acquire a canonical, implementation-relevant notion of approximate equivalence: two theories are close exactly when no bounded sampling experiment can distinguish them with significant advantage, and this closeness is equivalent (in the topological sense) to closeness in the standard cut metric on graphons. The remainder of the paper develops the necessary preliminaries and then proves the stated identifications and continuity properties. Our emphasis throughout is that the metric structure arises from the programming interface itself, while the graphon correspondence provides the analytic tools needed to relate these program metrics to established notions in dense graph limit theory.
We collect the analytic and probabilistic background used throughout. Our two main ambient structures are the category FinStoch, in which program denotations on numeral types live, and the space of graphons equipped with the cut metric, which governs dense exchangeable random graphs.
We take FinStoch to have objects the natural numbers n ∈ ℕ, viewed as finite sets [n] = {1, …, n}. A morphism M : m → n is a row-stochastic matrix M ∈ [0, 1]m × n with $\sum_{j=1}^n M_{ij}=1$ for each i ∈ [m]. Composition is matrix multiplication: for K : n → r, $(K\circ M)_{i\ell}=\sum_{j=1}^n M_{ij}K_{j\ell}$. A deterministic map f : [m] → [n] is identified with the stochastic matrix δf given by (δf)ij = 1 if f(i) = j and 0 otherwise. A morphism p : 1 → n is equivalently a probability distribution on [n]; in particular, programs 1 → 2 denote Bernoulli laws.
For distributions p, q : 1 → n we
use the total variation metric
$$
d_{\mathrm{TV}}(p,q)\;:=\;\frac12\sum_{j=1}^n |p(j)-q(j)|,
$$
which also admits the equivalent variational characterization dTV(p, q) = supA ⊆ [n]|p(A) − q(A)|.
For stochastic matrices M, N : m → n
we equip each hom-set with the induced sup-row metric
$$
d_{\mathrm{TV}}(M,N)\;:=\;\max_{i\in[m]}
d_{\mathrm{TV}}\big(M(i,-),N(i,-)\big)
\;=\;\max_{i\in[m]}\frac12\sum_{j=1}^n|M_{ij}-N_{ij}|.
$$
This choice is convenient for program semantics because it is precisely
the worst-case distinguishing advantage over inputs i ∈ [m], and because it
interacts well with composition.
We shall repeatedly use the standard nonexpansiveness properties of
TV under stochastic postprocessing. Concretely, if K : n → r is
stochastic, then for each fixed row i the map p ↦ pK is an
affine map on the simplex, and TV is contractive under such Markov
kernels. In matrix form, this yields
$$
d_{\mathrm{TV}}(K\circ M,\,K\circ N)\;\le\;d_{\mathrm{TV}}(M,N).
$$
Similarly, if f : [ℓ] → [m] is
deterministic, then precomposition simply selects rows, hence
dTV(M ∘ δf, N ∘ δf) ≤ dTV(M, N).
We will use these contraction properties to justify that observational distances defined by suprema
of TV discrepancies behave well under program contexts.
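As a minimal numerical sketch of the sup-row TV metric and the contraction property just stated: the function names below (tv_rows, random_stochastic) are ours and purely illustrative.

```python
# Sketch: sup-row total variation on FinStoch hom-sets and its contraction
# under stochastic postprocessing (data processing). Illustrative names only.
import numpy as np

rng = np.random.default_rng(0)

def tv_rows(M, N):
    """Sup-row TV between row-stochastic matrices M, N : m -> n."""
    return 0.5 * np.abs(M - N).sum(axis=1).max()

def random_stochastic(m, n):
    A = rng.random((m, n))
    return A / A.sum(axis=1, keepdims=True)

M, N = random_stochastic(3, 5), random_stochastic(3, 5)
K = random_stochastic(5, 4)  # postprocessing kernel K : n -> r

# In the row-stochastic convention used here, K ∘ M is the product M @ K.
assert tv_rows(M @ K, N @ K) <= tv_rows(M, N) + 1e-12
print(tv_rows(M, N), tv_rows(M @ K, N @ K))
```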
A (dense) graphon is a symmetric measurable function W : [0, 1]2 → [0, 1],
with W(x, y) = W(y, x)
a.e. Given W, we obtain an
exchangeable random graph on vertex set [k] by sampling i.i.d. latent
variables U1, …, Uk ∼ Unif[0, 1]
and then, conditional on (Ui), sampling
edges {i, j}
independently with
ℙ({i, j} ∈ E ∣ U1, …, Uk) = W(Ui, Uj) (1 ≤ i < j ≤ k).
This defines a distribution on 𝒢k, the set of simple
graphs on [k]. For F ∈ 𝒢k, the
induced subgraph density is the probability of seeing exactly F under this procedure:
$$
t_{\mathrm{ind}}(F,W)\;:=\;\int_{[0,1]^k}\ \prod_{\{i,j\}\in E(F)} W(x_i,x_j)\prod_{\{i,j\}\notin E(F)} \bigl(1-W(x_i,x_j)\bigr)\, dx_1\cdots dx_k.
$$
Equivalently, (tind(F, W))F ∈ 𝒢k
is the law of the k-vertex
labeled graph sampled from W;
we will often regard these numbers as the canonical finite observables
of W.
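The sampling procedure and the reading of induced densities as acceptance probabilities can be made concrete by a Monte Carlo sketch; it assumes the graphon is given as a Python function, and all names are ours.

```python
# Sketch: sample the k-vertex exchangeable random graph from W and estimate
# t_ind(F, W) as the probability of seeing exactly F. Illustrative code only.
import itertools
import numpy as np

rng = np.random.default_rng(1)

def sample_graph(W, k):
    """One draw of the k-vertex random graph; returns its edge set."""
    U = rng.random(k)  # latent U_1, ..., U_k ~ Unif[0,1]
    return frozenset(
        (i, j)
        for i, j in itertools.combinations(range(k), 2)
        if rng.random() < W(U[i], U[j])  # edge {i,j} w.p. W(U_i, U_j)
    )

def t_ind_mc(W, F_edges, k, n_samples=100_000):
    """Estimate t_ind(F, W) = P(sampled labeled graph equals F exactly)."""
    F = frozenset(F_edges)
    return sum(sample_graph(W, k) == F for _ in range(n_samples)) / n_samples

W_half = lambda x, y: 0.5
# Triangle K_3 on [3]: the induced density is (1/2)^3 = 0.125.
print(t_ind_mc(W_half, {(0, 1), (0, 2), (1, 2)}, k=3))
```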
Graphons are considered up to measure-preserving relabeling. If φ : [0, 1] → [0, 1] is measure-preserving, we write Wφ(x, y) := W(φ(x), φ(y)). Such relabelings do not change the sampled graph law, hence leave all quantities tind(F, W) invariant.
We measure graphon proximity using the cut norm
∥W − U∥▫ := supS, T ⊆ [0, 1]|∫S × T(W(x, y) − U(x, y)) dx dy|,
where S, T range over
measurable sets. The cut distance on equivalence classes is defined
by
δ▫(W, U) := infφ ∥Wφ − U∥▫,
where φ ranges over
measure-preserving bijections (or, equivalently, measure-preserving maps
in the usual completion). This is a pseudometric on concrete
representatives and descends to a metric on weak isomorphism classes; we
use it as the standard topology on dense graph limits.
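For step graphons the cut norm can be computed exactly: the integrand is constant on cells of the product partition, so the supremum over measurable S, T is attained at unions of parts, and brute force over subsets suffices for small part counts. A minimal sketch (cut_norm_step is our own naming):

```python
# Sketch: exact cut norm of a step kernel D given by block values D[a][b]
# on parts of measures mu[a]; brute force over unions of parts is exact here.
from itertools import product
import numpy as np

def cut_norm_step(D, mu):
    D, mu = np.asarray(D, float), np.asarray(mu, float)
    B = mu[:, None] * mu[None, :] * D  # cell masses mu_a * mu_b * D_ab
    r = len(mu)
    best = 0.0
    for s in product([0, 1], repeat=r):      # S = union of parts with s_a = 1
        for t in product([0, 1], repeat=r):  # T likewise
            best = max(best, abs(np.array(s) @ B @ np.array(t)))
    return best

print(cut_norm_step([[0.3, -0.2], [-0.2, 0.1]], [0.5, 0.5]))  # -> 0.075
```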
We will use the fact that cut proximity controls finite subgraph
statistics. The relevant inequalities are classical; we record a
convenient induced variant with explicit (though not optimized)
constants. Let k ≥ 1 and F ∈ 𝒢k, and
write $m=\binom{k}{2}$. Expanding the
integrand in the definition of tind as a multilinear polynomial in the m variables {W(xi, xj)}i < j
and bounding termwise, one obtains
$$
|t_{\mathrm{ind}}(F,W)-t_{\mathrm{ind}}(F,U)|\;\le\;2^m\, m\, \|W-U\|_\square.
$$
A standard route to this bound is to first bound differences of non-induced
homomorphism densities by |E(H)| ∥W − U∥▫ for each supergraph H of F,
and then express induced densities as alternating sums of homomorphism
densities over such supergraphs, producing the factor $2^m$. In particular, after
optimizing over relabelings we get the cut-distance form
$$
|t_{\mathrm{ind}}(F,W)-t_{\mathrm{ind}}(F,U)|\;\le\;2^m\, m\,\delta_\square(W,U).
$$
We will only need the qualitative content that, for fixed k, all induced k-vertex statistics vary Lipschitzly
with δ▫, with
constants depending solely on k.
The converse direction—that agreement of all finite induced statistics forces δ▫ = 0—is also standard in dense graph limit theory: the family {tind(F, −)}F separates graphon equivalence classes, and the cut-metric space is compact. We will appeal to these facts when relating program-defined distances to the usual graphon topology.
Our goal is to compare two random-graph theories Ψ, Φ : 𝒰ℕ → FinStoch by the observable behavior of numeral-typed programs. Since 𝒰ℕ is purely probabilistic/combinatorial (objects are numerals n, morphisms are programs with n-valued outputs), it is natural to equip each hom-set FinStoch(m, n) with a metric that measures statistical distinguishability. We take total variation, as recalled in the preliminaries, and regard it as an observation metric rather than merely a property of distributions.
Fix m, n ∈ ℕ. For
M, N : m → n
in FinStoch, recall
$$
d_{\mathrm{TV}}(M,N)\;=\;\max_{i\in[m]}\frac12\sum_{j\in[n]}|M_{ij}-N_{ij}|.
$$
This choice admits an explicit ``testing’’ interpretation: a
deterministic predicate on [n]
is a function b : [n] → [2], identified
with δb : n → 2
in FinStoch. Postcomposing M : m → n with
δb yields
the induced Bernoulli experiment δb ∘ M : m → 2.
Using the variational characterization of TV rowwise, we have the
equivalence
$$
d_{\mathrm{TV}}(M,N)\;=\;\max_{i\in[m]}\;\max_{b:[n]\to[2]}\;\bigl|(\delta_b\circ M)_{i1}-(\delta_b\circ N)_{i1}\bigr|,
$$
i.e. dTV(M, N)
is the maximum distinguishing advantage of any (deterministic) yes/no
observation applied to the output of M versus N, in the worst case over inputs
i ∈ [m].
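The optimal deterministic test is explicit: accept exactly on A* = {j : p(j) > q(j)}. A short numerical check (names are ours):

```python
# Sketch: the best deterministic yes/no test between two distributions on [n]
# accepts on {j : p(j) > q(j)}, and its advantage equals d_TV(p, q).
import numpy as np

rng = np.random.default_rng(2)

def tv(p, q):
    return 0.5 * np.abs(p - q).sum()

def best_test_advantage(p, q):
    A = p > q  # optimal acceptance set A*
    return abs(p[A].sum() - q[A].sum())

p, q = rng.dirichlet(np.ones(6)), rng.dirichlet(np.ones(6))
assert np.isclose(tv(p, q), best_test_advantage(p, q))
print(tv(p, q))
```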
Moreover, TV is stable under program contexts: if K : n → r is any stochastic postprocessing, then dTV(K ∘ M, K ∘ N) ≤ dTV(M, N), and similarly precomposition by deterministic maps is nonexpansive. We will use these contraction properties to justify that distances defined by taking suprema of dTV-discrepancies are robust under adding surrounding code.
Let Ψ : 𝒰ℕ → FinStoch be a
theory. For each program p : m → n in 𝒰ℕ we write Ψ⟦p⟧ ∈ FinStoch(m, n)
for its denotation. In particular, if p : 1 → n is closed then
Ψ⟦p⟧ is a
distribution on [n]. For
boolean outputs p : 1 → 2, the
total variation distance reduces to absolute difference of acceptance
probabilities:
dTV(Ψ⟦p⟧, Φ⟦p⟧) = |Ψ⟦p⟧(1) − Φ⟦p⟧(1)|.
Thus boolean programs are canonical statistical tests for distinguishing
theories.
While one could attempt to compare Ψ and Φ by taking a supremum over all boolean programs p : 1 → 2, this is too coarse for two reasons: (i) the space of tests is infinite and syntactically unstructured, and (ii) in the graph interface the only source of unbounded information is the number of vertex samples. We therefore introduce a bounded testing hierarchy aligned with the number of calls to new.
We say that a closed program p : 1 → 2 in 𝒰ℕ is k-bounded if its syntax contains at
most k occurrences of the
vertex-sampling primitive new. Intuitively, such a program
can only inspect the random graph through the induced subgraph on at
most k sampled vertices,
together with any internal probabilistic choices that do not introduce
additional vertices. We write Progk(1, 2) for the
collection of closed k-bounded
boolean programs (considered modulo the equational theory of 𝒰, so that program rewrites do not change
membership).
The restriction to closed programs 1 → 2 is deliberate: by the variational characterization above, any observable property of an n-valued output can be reduced to a boolean post-test, and by the contraction of TV under postcomposition it suffices to bound boolean observations. In other words, we adopt the standard ``testing semantics’’ stance that observational power is measured by acceptance probabilities of predicates.
Given two theories Ψ, Φ : 𝒰ℕ → FinStoch,
we define
dk(Ψ, Φ) := supp ∈ Progk(1, 2) dTV(Ψ⟦p⟧, Φ⟦p⟧).
Since each denotation Ψ⟦p⟧ : 1 → 2 is a Bernoulli
law, we have 0 ≤ dk(Ψ, Φ) ≤ 1
and dk(Ψ, Φ) = dk(Φ, Ψ).
Monotonicity in the bound is immediate: if k ≤ k′ then
Progk(1, 2) ⊆ Progk′(1, 2)
and hence dk(Ψ, Φ) ≤ dk′(Ψ, Φ).
The point of this definition is that the supremum ranges over a conceptually simple family of experiments: choose up to k vertices using new, query the edge predicate among them as desired, and finally output a bit. The next section makes this intuition precise by proving a normal form in which all new calls are hoisted and the remainder of the program is deterministic in the induced adjacency matrix.
To aggregate all finite observational levels into a single bounded
metric, we define
dprog(Ψ, Φ) := ∑k ≥ 1 2−k min {1, dk(Ψ, Φ)}.
This is a standard weighted construction: each min {1, dk} is a
bounded pseudometric, the weights 2−k ensure convergence,
and the resulting dprog is bounded by 1. Operationally, dprog measures
distinguishability by allowing tests of all finite vertex complexities
but discounting higher complexities exponentially.
Two further remarks clarify why dprog is the appropriate notion of distance for our purposes. First, because min {1, −} and weighted sums preserve the triangle inequality, dprog is a pseudometric on theories; it becomes a genuine metric after quotienting by observational equivalence (i.e. identifying theories that agree on all finite tests). Second, dprog is designed to interact with the graphon correspondence: the subsequent identification of dk with induced k-vertex graph statistics will imply that dprog factors through graphon weak isomorphism classes and is topologically equivalent to cut distance.
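Anticipating the identification of dk with the discrepancy of induced k-vertex laws (Theorem~A below), the following sketch estimates dk for two graphons as the total variation between their induced laws and aggregates a truncated dprog; graphons are assumed given as vectorized Python functions, and all names are illustrative.

```python
# Sketch: estimate d_k as the TV distance between induced k-vertex laws,
# and a truncation of d_prog (the tail beyond K contributes at most 2^-K).
import itertools
import numpy as np

rng = np.random.default_rng(3)

def induced_law(W, k, n_samples=20_000):
    """Estimated law (t_ind(F, W))_F over adjacency patterns on [k]."""
    pairs = list(itertools.combinations(range(k), 2))
    U = rng.random((n_samples, k))
    # Conditional edge probabilities W(U_i, U_j), one column per pair.
    P = (np.column_stack([W(U[:, i], U[:, j]) for i, j in pairs])
         if pairs else np.ones((n_samples, 0)))
    law = np.empty(2 ** len(pairs))
    for code in range(law.size):  # each adjacency pattern; bit b = b-th pair
        bits = [(code >> b) & 1 for b in range(len(pairs))]
        law[code] = np.prod(np.where(bits, P, 1 - P), axis=1).mean()
    return law

def d_k(W1, W2, k):
    return 0.5 * np.abs(induced_law(W1, k) - induced_law(W2, k)).sum()

def d_prog_truncated(W1, W2, K):
    return sum(2.0 ** -k * min(1.0, d_k(W1, W2, k)) for k in range(1, K + 1))

W_half = lambda x, y: np.full_like(x, 0.5)
U_bip  = lambda x, y: ((x < 0.5) != (y < 0.5)).astype(float)
print(d_k(W_half, U_bip, 2))             # ~ 0: equal edge densities
print(d_k(W_half, U_bip, 3))             # > 0: triangles separate them
print(d_prog_truncated(W_half, U_bip, 4))
```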
The definition of dk quantifies
over all closed boolean programs that use at most k calls to new. In order to relate such
programs to classical finite-graph statistics, we isolate the precise
syntactic dependence of a k-bounded program on the random
graph: up to the equations of 𝒰, a
k-bounded program can only
depend on the induced adjacency pattern among the sampled vertices.
Concretely, we show that all occurrences of new can be hoisted to the
outside, after which the remainder factors through the k × k adjacency matrix
(edge(xi, xj))i, j ≤ k.
We begin with a purely structural normalization step: vertex samples
can be pulled outward across the other program constructs. This is an
instance of the usual ``Fubini/let-exchange’’ principles in a
commutative Markov setting, specialized to a closed sampler new : 1 → vertex. Intuitively,
sampling a vertex is independent of all computations that do not
themselves resample vertices; hence we may always reorder and regroup
vertex samples.
For later use it is convenient to allow padding: if r < k we may adjoin unused samples xr + 1, …, xk and ignore them. Thus, without loss of generality, we may work with exactly k samples, at the cost of notational redundancy.
After hoisting, the only way that the sampled vertices x1, …, xk can influence the outcome is through applications of edge. Since edge is deterministic and satisfies the simple undirected axioms (irreflexive and symmetric), its observable content on a k-tuple of vertices is exactly a simple labeled graph on [k], equivalently a symmetric {0, 1}-matrix with zero diagonal.
Fix once and for all an enumeration of unordered pairs
$$
E_k \;:=\;\{\,\{i,j\}\mid 1\le i<j\le k\,\},\qquad
|E_k|=\binom{k}{2}.
$$
We write Ak for the
finite set {0, 1}Ek
of adjacency patterns on [k],
and we identify it with a numeral object of size $|A_k|=2^{\binom{k}{2}}$ in 𝒰ℕ (any fixed encoding suffices).
There is then a canonical deterministic program
adjk : vertexk → Ak
which, given (x1, …, xk),
returns the bit-vector (edge(xi, xj)){i, j} ∈ Ek.
Determinism of edge ensures that adjk is deterministic,
and symmetry/irreflexivity ensure that this vector exactly corresponds
to the induced simple graph on the sampled vertices.
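A concrete encoding of adjk as a numeral may clarify the construction; the enumeration of Ek and the function names below are our own choices.

```python
# Sketch: adj_k maps k sampled vertices to an adjacency pattern in A_k,
# encoded as an integer whose bit b records the b-th pair in E_k.
import itertools

def adj_k(vertices, edge):
    """Deterministic map vertex^k -> A_k; edge is a symmetric, irreflexive oracle."""
    code = 0
    for b, (i, j) in enumerate(itertools.combinations(range(len(vertices)), 2)):
        if edge(vertices[i], vertices[j]):
            code |= 1 << b
    return code

# Example with three vertices and edges {0,1}, {1,2} under a toy oracle:
edges = {frozenset({0, 1}), frozenset({1, 2})}
print(adj_k([0, 1, 2], lambda u, v: frozenset({u, v}) in edges))  # 0b101 = 5
```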
We now show that once new has been hoisted, every
remaining program depends on the vertices only through adjk. The key point is
that in the universal graph interface there are no vertex-valued
observations other than feeding vertices into edge; thus vertices have no
observable structure beyond their induced adjacency relations.
Combining the two hoisting and factorization lemmas above, every closed k-bounded boolean program p : 1 → 2 is equal to one of the
form
p = q̃ ∘ adjk ∘ newk
for some q̃ : Ak → 2.
Here newk denotes
the iterated sampling program that returns k vertices.
For the identification results in the next section, we further refine this normal form by observing that q̃ can be taken deterministic without loss of distinguishing power. Indeed, q̃ is precisely a (possibly randomized) statistical test applied to the finite outcome space Ak. By the variational characterization of total variation (cf. above), the optimal distinguishing advantage between two distributions on Ak is achieved by a deterministic predicate f : Ak → 2, i.e. by choosing a subset S ⊆ Ak and accepting exactly on S. Thus, although our definition of dk allows arbitrary k-bounded programs (including internal randomness after sampling vertices), in computing the supremum we may restrict to programs whose final stage is deterministic in the adjacency pattern.
In summary, the k-bounded
testing hierarchy is equivalent to testing, by deterministic predicates,
the distribution of the induced labeled graph on [k] generated by newk
together with the deterministic edge oracle. The next section exploits
this normal form to identify dk(Ψ, Φ)
with the total variation distance between the induced k-vertex graph laws of WΨ and WΦ, and to
exhibit explicit programs realizing the individual induced subgraph
densities tind(F, ⋅).
We now relate the observational distances dk to the graphon cut metric. Since by Theorem~A we may identify dk(Ψ, Φ) with the maximum discrepancy of induced k-vertex laws, it suffices to bound induced subgraph densities in terms of δ▫.
We recall the standard counting-lemma estimate for (labeled)
homomorphism densities. For a simple graph H on vertex set [k], write
t(H, W) := ∫[0, 1]k ∏{i, j} ∈ E(H)W(xi, xj) dx1⋯dxk.
The following inequality is classical (see e.g. Lovász) and is the
basic quantitative interface between cut norm and subgraph
statistics:
$$
|t(H,W)-t(H,U)|\;\le\;|E(H)|\,\|W-U\|_\square.
$$
For F ∈ 𝒢k, the
induced density tind(F, W)
may be expressed by inclusion–exclusion in terms of the homomorphism
densities t(H, W) as H ranges over supergraphs of F on the same vertex set. Writing
$m=\binom{k}{2}$ and letting Sup(F) denote the family of graphs
H on [k] with E(F) ⊆ E(H) ⊆ E(Kk),
we have
$$
t_{\mathrm{ind}}(F,W)\;=\;\sum_{H\in\mathrm{Sup}(F)} (-1)^{|E(H)|-|E(F)|}\, t(H,W).
$$
Consequently, the counting lemma implies the uniform cut-norm stability
estimate
$$
|t_{\mathrm{ind}}(F,W)-t_{\mathrm{ind}}(F,U)|\;\le\;2^m\, m\, \|W-U\|_\square,
$$
since |Sup(F)| ≤ $2^m$
and |E(H)| ≤ m for all
H ⊆ Kk.
This gives an explicit admissible choice $C_k:=2^{\binom{k}{2}}\binom{k}{2}$ in
Proposition~B.
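The inclusion–exclusion expansion can be checked numerically; the sketch below estimates homomorphism densities by Monte Carlo and recovers an induced density as the alternating sum over supergraphs (all names ours).

```python
# Sketch: t_ind(F, W) as the alternating sum of t(H, W) over supergraphs H
# with E(F) <= E(H) <= E(K_k); densities estimated by Monte Carlo.
import itertools
import numpy as np

rng = np.random.default_rng(4)

def t_hom(W, H_edges, k, n_samples=100_000):
    """Labeled homomorphism density t(H, W)."""
    X = rng.random((n_samples, k))
    vals = np.ones(n_samples)
    for i, j in H_edges:
        vals *= W(X[:, i], X[:, j])
    return vals.mean()

def t_ind_via_ie(W, F_edges, k):
    all_pairs = list(itertools.combinations(range(k), 2))
    extra = [e for e in all_pairs if e not in F_edges]
    total = 0.0
    for r in range(len(extra) + 1):
        for added in itertools.combinations(extra, r):  # supergraph H
            total += (-1) ** r * t_hom(W, list(F_edges) + list(added), k)
    return total

W = lambda x, y: (x + y) / 2  # an illustrative symmetric graphon
print(t_ind_via_ie(W, [(0, 1)], k=3))  # induced density of one edge on [3]
```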
Let Ψ, Φ be
theories with associated graphons WΨ, WΦ.
By Theorem~A and the stability estimate above,
dk(Ψ, Φ) = maxF ∈ 𝒢k|tind(F, WΨ) − tind(F, WΦ)| ≤ Ck ∥WΨ − WΦ∥▫.
Finally, since δ▫(WΨ, WΦ) = infφ∥WΨφ − WΦ∥▫
over measure-preserving relabelings φ, we may apply the same bound to
WΨφ
and take the infimum in φ,
obtaining
dk(Ψ, Φ) ≤ Ck δ▫(WΨ, WΦ).
Fix Ψ and let Φn be any
sequence with δ▫(WΦn, WΨ) → 0.
The inequality above yields dk(Φn, Ψ) → 0
for each fixed k. To pass to
dprog, we use the
summability built into its definition: given ε > 0, choose K so that ∑k > K2−k < ε/2.
Then pick N such that for all
n ≥ N and all 1 ≤ k ≤ K, we have dk(Φn, Ψ) < ε/2
(possible since K is finite).
For such n,
dprog(Φn, Ψ) ≤ ∑k ≤ K2−kdk(Φn, Ψ) + ∑k > K2−k < (ε/2)∑k ≤ K2−k + ε/2 ≤ ε,
so dprog is
continuous in δ▫.
For the converse direction (Theorem~C and Corollary~D), we require that sufficiently many induced densities control cut distance. Qualitatively, this is a standard consequence of (i) compactness of graphon equivalence classes under δ▫ and (ii) separation by the family {tind(F, −)}F (Lemma~5). Quantitatively, one may route this through a weak regularity lemma: for each ε > 0, every graphon W admits a step approximation W(ε) with at most r(ε) parts such that δ▫(W, W(ε)) ≤ ε/3, where r(ε) may be taken on the order of $2^{O(\varepsilon^{-2})}$. The parameters of a step graphon with r parts are a finite list of edge-densities between parts; these may be recovered (up to an error depending on ε) from finitely many induced densities tind(F, ⋅) with |V(F)| ≤ k(ε), for an explicit (but very large) function k(ε) depending on r(ε).
Thus, given ε, there exist
k(ε) and η(ε) > 0 such that
if
maxF ∈ 𝒢k(ε)|tind(F, W) − tind(F, U)| < η(ε),
then δ▫(W, U) < ε.
Translating the left-hand side via Theorem~A, it suffices to require
dk(ε)(Ψ, Φ) < η(ε),
which in turn is implied by dprog(Ψ, Φ)
being small enough since dprog dominates 2−k(ε)min {1, dk(ε)}.
Defining
ω(δ) := inf {ε > 0: δ < 2−k(ε)min {1, η(ε)}},
we obtain a monotone modulus with δ▫(WΨ, WΦ) ≤ ω(dprog(Ψ, Φ)).
As noted earlier, any such explicit ω extracted from weak regularity
will have poor (typically tower-type) growth, and improving it requires
additional structural hypotheses beyond the dense graphon regime.
To control the k-bounded observational discrepancy in terms of the cut geometry, we reduce to the usual analytic bounds relating the cut norm to subgraph statistics. Throughout, we regard graphons W, U : [0, 1]2 → [0, 1] as representatives of their weak isomorphism classes, and we write ∥ ⋅ ∥▫ for the cut norm on [0, 1]2. Since δ▫ is obtained by taking an infimum of cut norms over measure-preserving relabelings, it is enough to derive estimates in ∥ ⋅ ∥▫ and then infimize.
Let H be a simple graph on
vertex set [k] with edge set
E(H) and m := |E(H)|. We
use the labeled homomorphism density
t(H, W) := ∫[0, 1]k ∏{i, j} ∈ E(H)W(xi, xj) dx1⋯dxk.
The next estimate is the standard ``counting lemma’’ in its simplest
form; the dependence on H is
only through the number of edges.
Fix k ≥ 1 and F ∈ 𝒢k. Write
$m:=\binom{k}{2}$ and denote by Sup(F) the family of graphs H on [k] satisfying E(F) ⊆ E(H) ⊆ E(Kk).
The induced density is
tind(F, W) = ∫[0, 1]k ∏{i, j} ∈ E(F)W(xi, xj) ∏{i, j} ∉ E(F)(1 − W(xi, xj)) dx1⋯dxk,
and expanding the ∏(1 − W) by
inclusion–exclusion yields the identity
tind(F, W) = ∑H ∈ Sup(F)(−1)|E(H)|−|E(F)| t(H, W).
Combining this with the counting lemma, we obtain the uniform estimate
$$
|t_{\mathrm{ind}}(F,W)-t_{\mathrm{ind}}(F,U)|\;\le\;2^m\, m\,\|W-U\|_\square.
$$
In particular, we may take $C_k:=2^{\binom{k}{2}}\binom{k}{2}$ in the
stability statement dk ≤ Ck δ▫;
no optimization of Ck is needed for
our purposes, and any explicit Ck depending
only on k suffices.
Let Ψ, Φ : 𝒰ℕ → FinStoch
be theories and let WΨ, WΦ
be their associated graphons. By the identification of k-bounded program tests with induced
k-vertex statistics
(Theorem~A), we have
dk(Ψ, Φ) = maxF ∈ 𝒢k|tind(F, WΨ) − tind(F, WΦ)|.
Applying the induced-density cut-norm bound to each F gives
dk(Ψ, Φ) ≤ Ck ∥WΨ − WΦ∥▫.
To pass from ∥ ⋅ ∥▫ to δ▫, we use that tind(F, ⋅) is
invariant under measure-preserving relabelings: for any
measure-preserving φ, we have
tind(F, WΨ) = tind(F, WΨφ).
Hence the same inequality holds with WΨ replaced by
WΨφ,
and taking the infimum over φ
yields
dk(Ψ, Φ) ≤ Ck δ▫(WΨ, WΦ).
This is Proposition~B.
The definition
dprog(Ψ, Φ) = ∑k ≥ 12−kmin {1, dk(Ψ, Φ)}
builds summability into the observable hierarchy. If δ▫(WΦn, WΨ) → 0,
then Proposition~B implies dk(Φn, Ψ) → 0
for each fixed k. Given ε > 0, choose K so that ∑k > K2−k < ε/2.
Then for n sufficiently large,
we have dk(Φn, Ψ) < ε/2
for all k ≤ K, and
therefore
dprog(Φn, Ψ) ≤ ∑k ≤ K2−kdk(Φn, Ψ) + ∑k > K2−k < (ε/2)∑k ≤ K2−k + ε/2 ≤ ε.
Thus δ▫-convergence
implies dprog-convergence.
For the reverse implication, the qualitative statement is that the family of all induced densities {tind(F, ⋅)}F separates graphon equivalence classes, and the cut-metric space of equivalence classes is compact; consequently, convergence of all induced densities forces convergence in δ▫. In our setting, dprog(Ψn, Ψ) → 0 implies dk(Ψn, Ψ) → 0 for every fixed k, hence convergence of induced k-vertex laws for every k, and therefore convergence of all induced densities.
To obtain a quantitative modulus, we appeal to a weak regularity
lemma. Fix ε ∈ (0, 1]. There
exists a function r(ε) (e.g. r(ε) = $2^{O(\varepsilon^{-2})}$
suffices by Frieze–Kannan) such that every graphon W admits a step graphon W(ε) with at
most r(ε) parts
and
δ▫(W, W(ε)) ≤ ε/3.
A step graphon with r parts is
determined by finitely many parameters (the part measures and the r × r edge-density matrix).
For fixed r, these parameters
can be estimated, with an error tolerance depending on ε, from finitely many induced
densities on at most k(ε) vertices, for some
explicit (though rapidly growing) k(ε) depending on r(ε). Consequently, there
exists η(ε) > 0
such that whenever
maxF ∈ 𝒢k(ε)|tind(F, W) − tind(F, U)| < η(ε),
we can couple the step approximants W(ε) and U(ε) so that
δ▫(W(ε), U(ε)) ≤ ε/3,
and hence by the triangle inequality δ▫(W, U) < ε.
Translating the hypothesis through Theorem~A, the displayed maximum
equals dk(ε)(Ψ, Φ)
for the corresponding theories. Moreover, since
dprog(Ψ, Φ) ≥ 2−k(ε)min {1, dk(ε)(Ψ, Φ)},
it suffices to require dprog(Ψ, Φ) < 2−k(ε)η(ε)
(and η(ε) ≤ 1 if
desired). We therefore obtain a monotone modulus by setting
ω(δ) := inf {ε ∈ (0, 1]: δ < 2−k(ε)min {1, η(ε)}},
which ensures δ▫(WΨ, WΦ) ≤ ω(dprog(Ψ, Φ)).
The dependence hidden in k(ε) and η(ε) is necessarily poor
when derived solely from weak regularity; sharpening ω is a separate quantitative problem
rather than a conceptual obstruction.
The family {dk}k ≥ 1 is intentionally stratified: fixing k restricts us to observables expressible by sampling at most k vertices and then applying a deterministic Boolean predicate to the induced adjacency matrix. Consequently, for any fixed k, the pseudometric dk is strictly weaker than cut distance and does not determine the underlying graphon up to weak isomorphism. We record concrete instances of this phenomenon and briefly indicate where the constants in the quantitative comparison necessarily deteriorate.
For k = 2 we have $\mathcal G_2=\{\overline{K_2},K_2\}$, hence
the induced 2-vertex law is determined
by a single parameter, the edge density
ρ(W) := ∫[0, 1]2W(x, y) dx dy.
Indeed tind(K2, W) = ρ(W)
and $t_{\mathrm{ind}}(\overline{K_2},W)=1-\rho(W)$,
so
d2(Ψ, Φ) = max {|ρ(WΨ) − ρ(WΦ)|, |(1 − ρ(WΨ)) − (1 − ρ(WΦ))|} = |ρ(WΨ) − ρ(WΦ)|.
Thus any two theories/graphons with the same overall edge density are
indistinguishable by 2-bounded tests,
even if their higher-order structure is far apart.
Fix α ∈ (0, 1) and let
Wα be the
constant graphon Wα ≡ α.
Let U be the balanced complete
bipartite graphon given by the partition [0, 1] = A ⊔ B with λ(A) = λ(B) = 1/2
and
$$
U(x,y)\;=\;\begin{cases}
0,& (x,y)\in A^2\cup B^2,\\
1,& (x,y)\in A\times B\cup B\times A.
\end{cases}
$$
Then ρ(W1/2) = ρ(U) = 1/2,
so d2(W1/2, U) = 0.
However, the triangle density separates them:
$$
t(K_3,W_{1/2})=\Bigl(\frac12\Bigr)^3=\frac18,
\qquad
t(K_3,U)=0,
$$
hence d3(W1/2, U) ≥ |tind(K3, W1/2) − tind(K3, U)| = 1/8.
In program terms, there exists a 3-bounded Boolean program that samples three
vertices and returns exactly when they form a triangle; its acceptance
probability differs by 1/8 under these
two theories despite complete agreement on all 2-vertex experiments.
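This experiment is easy to simulate; the sketch below averages the conditional triangle probability over the latents and exhibits the 1/8 gap (names ours).

```python
# Sketch: the 3-bounded triangle test. Sample three latents; accept iff all
# three edges are present. The acceptance gap between the two models is 1/8.
import numpy as np

rng = np.random.default_rng(5)

def triangle_accept_prob(W, n_samples=400_000):
    X = rng.random((n_samples, 3))
    p01, p02, p12 = (W(X[:, 0], X[:, 1]), W(X[:, 0], X[:, 2]),
                     W(X[:, 1], X[:, 2]))
    # P(all three edges | latents), averaged over the latents.
    return (p01 * p02 * p12).mean()

W_half = lambda x, y: np.full_like(x, 0.5)
U_bip  = lambda x, y: ((x < 0.5) != (y < 0.5)).astype(float)
print(triangle_accept_prob(W_half) - triangle_accept_prob(U_bip))  # -> 0.125
```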
The previous example already shows that d2 = 0 does not imply
small cut distance. In fact δ▫(W1/2, U)
is bounded below by an absolute constant. Writing D := U − W1/2,
we have D = +1/2 on A × B ∪ B × A
and D = −1/2 on A2 ∪ B2.
Taking S = T = A
gives
$$
\Bigl|\int_{S\times T} D\Bigr|
=\Bigl|\int_{A\times A} \Bigl(-\frac12\Bigr)\Bigr|
=\frac12\cdot \lambda(A)^2
=\frac18,
$$
so ∥U − W1/2∥▫ ≥ 1/8
and therefore δ▫(W1/2, U) ≥ 1/8
as well. Hence no inequality of the form δ▫ ≤ f(d2)
can hold with f(δ) → 0 as δ → 0; any modulus necessarily
requires access to all k,
i.e. to dprog.
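Numerically, the brute-force step-graphon routine cut_norm_step sketched in the preliminaries confirms the bound, and in fact shows it is tight for this pair (recall that for step kernels the optimum is attained at unions of parts):

```python
# Sketch: D = U - W_{1/2} is a step kernel on parts A, B of measure 1/2;
# brute force over unions of parts gives the exact cut norm.
D = [[-0.5, 0.5], [0.5, -0.5]]       # block values of U - W_{1/2}
print(cut_norm_step(D, [0.5, 0.5]))  # -> 0.125, so the 1/8 bound is tight
```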
More generally, for a fixed k the finite vector
(tind(F, W))F ∈ 𝒢k
cannot separate graphon equivalence classes: the space of graphons
(modulo weak isomorphism) is infinite-dimensional, while 𝒢k is finite. Concretely,
already within the class of step graphons with a moderate number of
parts, one can vary the parameters (part measures and edge
probabilities) along nontrivial directions while preserving the finitely
many constraints tind(F, ⋅) for
F ∈ 𝒢k.
Such perturbations yield distinct graphons W ≆ U with tind(F, W) = tind(F, U)
for all F ∈ 𝒢k, hence
dk(W, U) = 0,
yet typically tind(F′, W) ≠ tind(F′, U)
for some F′ ∈ 𝒢k + 1.
This is the precise sense in which bounded testing is a strictly
incomplete observational principle: k-bounded programs recover exactly
the induced k-vertex law and
nothing beyond it.
Our choice $C_k=2^{\binom{k}{2}}\binom{k}{2}$ comes from a direct inclusion–exclusion expansion of induced densities into ordinary homomorphism densities together with the edge-counting bound |t(H, W) − t(H, U)| ≤ |E(H)|∥W − U∥▫. The factor |E(H)| in the latter estimate is essentially sharp in its dependence on H: if H is a single edge, then t(H, W) − t(H, U) = ∫(W − U), and one can realize |∫(W − U)| on a near-extremal cut by taking W − U approximately constant on a rectangle S × T. For induced densities, the combinatorial factor $2^{\binom{k}{2}}$ is certainly not optimized and can be improved in specific regimes (e.g. for particular F or for restricted classes of graphons), but we do not expect polynomial dependence on k uniformly over all F ∈ 𝒢k: the inclusion–exclusion expansion inherently sums over all supergraphs H of F, whose number is $2^{\binom{k}{2}-|E(F)|}$.
The modulus ω obtained via weak regularity is correspondingly coarse. Intuitively, to force δ▫(W, U) < ε from finitely many subgraph statistics, one must first approximate both W and U by step graphons of complexity r(ε) and then identify (up to relabeling) their step parameters from a finite list of induced densities. Even with Frieze–Kannan weak regularity, one has r(ε) = 2O(ε−2), and any reconstruction argument that is uniform over all step graphons with r parts leads to rapidly growing k(ε) and correspondingly tiny thresholds η(ε). This reflects a genuine limitation rather than a technical artifact: without restricting to a smaller model class (e.g. step graphons with a fixed number of parts, Lipschitz graphons, or other parametrically controlled families), low-order statistics can remain stable under substantial cut perturbations, and conversely small changes in cut metric may require inspecting very high-order induced laws to certify.
From the programmatic viewpoint, these examples say that bounded-k approximate equivalence is a meaningful but deliberately local notion: it certifies agreement of all experiments that inspect at most k sampled vertices, but it cannot preclude global rearrangements visible only at larger sample sizes. The aggregated distance dprog remedies this by weighting all k, but its converse control of δ▫ necessarily inherits the poor quantitative behavior of regularity-type arguments in the absence of additional hypotheses.
We briefly indicate how the distances dk and dprog can be used as correctness/robustness criteria for probabilistic transformations, and we record several natural extension directions.
A common situation in probabilistic compilation and approximate
inference is that a source-to-source rewrite replaces a subprogram by an
approximation whose denotation is close in total variation. The TV
metric on FinStoch is designed
precisely so that such local errors do not amplify under surrounding
program structure: by Lemma~1, stochastic postcomposition is 1-Lipschitz, and deterministic wiring is
nonexpansive as well. Concretely, if q, q′ : m → n
are two subterms with
dTV(Ψ⟦q⟧, Ψ⟦q′⟧) ≤ ε,
then for any context C[−]
built from composition, products/coproducts, and the distributive Markov
structure of 𝒰ℕ, we
have
dTV(Ψ⟦C[q]⟧, Ψ⟦C[q′]⟧) ≤ ε.
When C[−] is moreover k-bounded in the
sense that C[q] uses
at most k occurrences of new, we
obtain a uniform bound on the effect of the rewrite on all k-vertex experiments. In practice,
this permits a modular proof methodology: we certify local
approximations in TV, and the global observational deviation for all
bounded tests follows for free. This is particularly useful for
transformations that change sampling order, eliminate latent variables,
or replace an expensive conditional distribution by a cheaper surrogate,
since the denotational error bound is stable under further optimization
passes.
By Theorem~A, each induced density is realized by an explicit k-bounded program pF whose acceptance probability satisfies ℙ[pF = true] = tind(F, W) under graphon semantics; bounded observational agreement therefore reduces to checking finitely many induced statistics. For the aggregated metric dprog, we may truncate: since ∑k > K2−k = 2−K, checking dk only up to k = K controls dprog up to an additive error 2−K. Hence, for a target tolerance η, it suffices to check finitely many induced laws up to K = ⌈log2(1/η)⌉, together with the desired thresholds for each dk.
Because dprog factors through weak isomorphism (Theorem~C), it is insensitive to measure-preserving relabelings and therefore compares the induced graph distributions themselves rather than arbitrary latent parametrizations. This makes dprog a natural objective or evaluation metric in settings where one learns graphon-like latent structure from data and wishes to compare two fitted models, two training runs, or a fitted model against ground truth. Proposition~B provides a one-sided stability guarantee: small cut distance implies small program distance, so convergence in graphon geometry implies convergence of all finite experiments expressible in the language.
The graph interface can be generalized by replacing the binary predicate with an r-ary relation symbol (or multiple relation symbols), yielding theories of uniform hypergraphs or, more generally, relational structures. On the limit side, one expects the corresponding semantics to be governed by hypergraphons or by Aldous–Hoover exchangeable array representations. The same blueprint should persist: k-bounded programs observe the induced law on the labeled k-element substructure, and a suitable analogue of Theorem~A identifies dk with the maximum discrepancy over induced k-vertex hypergraphs/structures. The main technical work is to import the appropriate limit topology (hypergraph cut norms and their variants) and to re-establish a counterpart of Theorem~C, which is known to be more delicate for hypergraphs than for graphs.
Our setup is intentionally dense: new produces a vertex, and edge yields a Bernoulli adjacency with no additional scaling, so graphons are the appropriate limits. For sparse random graphs, a direct graphon semantics collapses many distinct models, and one must replace W by a graphex-type object and replace δ▫ by an appropriate sparse metric. From the programming side, two issues arise: (i) a sparse interface typically requires an explicit intensity parameter (or Poisson sampling of vertices/edges), and (ii) bounded-k vertex sampling may not reveal sparse structure unless the observation language also exposes counts or degrees at the correct scale. A satisfactory sparse analogue therefore likely needs a refined notion of boundedness (e.g. bounded expected number of revealed edges) and an observation class rich enough to express sparse statistics.
We have focused on closed Boolean programs p : 1 → 2 and TV distance, since this makes dk a clean supremum over events and gives an exact identification with induced subgraph laws. Two enrichments are natural. First, one may allow finite outputs p : 1 → n and use the same matrix TV metric on FinStoch(1, n), thereby measuring discrepancies of multiway observables (not only predicates). Second, one may replace TV by another divergence/metric (e.g. Wasserstein on ordered outputs, or an f-divergence when absolute continuity is available). Such choices can be better aligned with downstream tasks, but they typically forfeit the simple ``maximum over coordinates’’ characterization and may require additional analytic input to relate the resulting program distances to cut-type graphon metrics.
The distances dk and dprog provide a semantics-driven interface between program equivalence and graph limit geometry: they support modular robustness arguments for compiler transformations, reduce bounded observational equivalence to finitely many induced statistics, and admit principled generalizations to richer relational signatures, to sparse models, and to alternative notions of observable discrepancy.