In graph neural networks, attention weights are frequently invoked as a proxy for how information flows between nodes. This interpretation is compelling but, in its most common instantiation, mathematically incomplete: the usual normalization rules enforce only local constraints (e.g. a simplex constraint per node), whereas many domains of interest are governed by conservation laws and hard throughput limits. Our starting point is the observation that, once attention is read as an extensive quantity (a routed mass, current, or commodity), the mismatch between local softmax normalization and physically meaningful feasibility conditions becomes unavoidable.
To make the tension precise, consider outgoing-normalized attention, where each node distributes unit mass across its outgoing neighbors. If we denote by fvu the weight assigned to edge (v, u), the constraint ∑u ∈ Nout(v)fvu = 1 forces a node to emit a fixed amount irrespective of what it receives. Consequently, nodes with small incoming total weight must create mass to satisfy the outgoing constraint, while nodes with large incoming weight must destroy mass. From the perspective of conservation, this is the wrong invariance: Kirchhoff-type balance requires that, except at designated terminals, net inflow equals net outflow. The issue does not disappear if one instead normalizes incoming edges at each receiver (as in other attention variants); then each node absorbs a fixed amount regardless of what is supplied upstream. In either case, a purely local simplex normalization cannot encode a global supply–demand pattern, nor can it enforce that interior nodes are neither sources nor sinks.
Capacity constraints add a second, equally structural, obstacle. In a capacitated network, each edge admits only a bounded throughput. Standard attention mechanisms can be made to use low-capacity edges less often by learning scores, but they cannot prevent a learned normalization from placing arbitrarily large mass on a single edge when the scores dictate it. The resulting model may be statistically adequate in some prediction settings, yet it cannot be interpreted as a feasible routing, and it cannot support statements of the form ``this layer respects the network constraints by construction.’’ When the graph represents a power grid, a circuit, or a transport network, the distinction between preference and feasibility is not cosmetic: infeasible routings correspond to violations of conservation or overloads that are meaningful failure modes.
We therefore propose to treat attention as a constrained pseudo-flow and to compute it by a projection principle. Concretely, the neural scoring function produces edge log-weights e, hence positive reference weights w = exp (e). We then choose an edge mass vector f by minimizing a generalized Kullback–Leibler divergence to w subject to (i) conservation with respect to a supply vector b and (ii) edge capacities. This choice has three advantages that are simultaneously algorithmic and conceptual. First, it yields a pseudo-flow whenever the feasible set is nonempty, so the layer is well-defined. Second, it converts feasibility into a differentiable normalization subroutine (a KL projection) that can be implemented by sparse, GPU-friendly iterative scaling and box-projection steps. Third, it recovers classical attention as a limiting special case: if one removes global balance and replaces it by per-node simplex constraints, the KL projection reduces to a softmax computed independently at each node.
Our contribution is thus not merely to add constraints, but to identify the appropriate mathematical object that interpolates between attention and flow. A KL projection onto a flow polytope is an information projection (I-projection): among all feasible pseudo-flows, it selects the one that minimally perturbs the model’s learned edge preferences in an information-geometric sense. This is the appropriate analogue of softmax, which itself can be characterized as a KL projection onto a simplex. In particular, the projection viewpoint cleanly separates the roles of learning and feasibility: the GNN learns scores e without needing to ``rediscover’’ conservation in its parameters, while the projection enforces conservation and capacity compliance at the level of the produced pseudo-flow.
We provide guarantees at three levels of abstraction, corresponding to the three requirements one expects from a normalization layer: existence, computability, and quantitative control under truncation. At the level of well-posedness, strict convexity of the KL objective over nonnegative flows implies that, under feasibility and positivity of w, there exists a unique minimizer f⋆ satisfying the balance constraints exactly. At the level of computation, we instantiate the projection by alternating Bregman projections: a generalized Sinkhorn/IPFP step enforces balance in a dual-potential form, while a Dykstra-style correction enforces box constraints for capacities. This yields a sparse iterative method whose per-iteration cost is linear in |E|, matching the natural input-reading lower bound for any nontrivial dependence on all edges. At the level of approximation, we make explicit that practical implementations return f(K) after K iterations; we connect a KL suboptimality tolerance to explicit bounds on conservation residual and capacity violation, so that ``approximately normalized’’ has a precise meaning.
Finally, the projection layer is not an isolated optimization primitive: it is designed to be used as a drop-in replacement for attention weights in message passing. The resulting aggregation uses f(K) as edge coefficients, yielding node updates that can be interpreted as routing information through the graph in a manner consistent with supplies, demands, and (optionally) capacities. This furnishes an interpretability statement that is otherwise unavailable: to the extent that the solver is run to tolerance, the produced edge weights satisfy Kirchhoff conservation up to an explicit residual bound, and capacity violations are likewise controlled. The remaining parts of our development elaborate these claims, relate them to standard attention mechanisms, and evaluate empirically when enforcing feasibility improves prediction and when it imposes an unnecessary inductive bias.
We briefly recall the standard attention mechanism as it appears in
message passing and make explicit the sense in which it encodes only a
redistribution rule. Let G = (V, E) be
directed and let a scoring function produce logits evu ∈ ℝ
for each edge (v, u) ∈ E. A
common outgoing-normalized attention defines coefficients
$$
\alpha_{vu}
:=\frac{\exp(e_{vu})}{\sum_{u'\in N_{\mathrm{out}}(v)} \exp(e_{vu'})},
\qquad (v,u)\in E,
$$
so that ∑u ∈ Nout(v)αvu = 1
for every v with Nout(v) ≠ ∅. In
a typical layer, these coefficients modulate aggregation by
mu := ∑(v, u) ∈ Eαvu M(hv, edge_featvu), hu′ := U(hu, mu),
for some message map M and
update map U. From a
probabilistic viewpoint, α is
a row-stochastic transition rule: conditioning on being at node v, one moves to an outgoing neighbor
u with probability αvu.
This interpretation is coherent and often useful, but it should not be
confused with a conservation law: α prescribes only the split of whatever mass is
available at v, and the layer
as written does not specify how much mass resides at each node.
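For concreteness, the following is a minimal numpy sketch of this per-source (outgoing) softmax on an edge-list representation; the function and array names (outgoing_softmax, src, logits) are illustrative rather than taken from any particular library.

```python
import numpy as np

def outgoing_softmax(src, logits, num_nodes):
    # per-source softmax: for each node v, the weights on its outgoing edges sum to 1
    max_per_src = np.full(num_nodes, -np.inf)
    np.maximum.at(max_per_src, src, logits)      # per-source max for numerical stability
    z = np.exp(logits - max_per_src[src])
    denom = np.zeros(num_nodes)
    np.add.at(denom, src, z)                     # per-source normalizer
    return z / denom[src]

# toy graph with edges 0->1, 0->2, 1->2
src = np.array([0, 0, 1])
alpha = outgoing_softmax(src, np.array([0.5, 1.5, 0.0]), num_nodes=3)
print(alpha[:2].sum(), alpha[2])                 # both equal 1.0: one simplex per source node
```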
To see the missing degree of freedom, suppose we attempted to
interpret attention as an actual routed quantity. Introduce an (unknown)
nonnegative node mass vector x ∈ ℝ ≥ 0V,
where xv
is the amount available at v
to be sent along outgoing edges. If α determines only the split, then
the induced edge mass is
fvu := xvαvu.
Kirchhoff-type balance at a node v (with supply/demand bv) would
require
∑(u, v) ∈ Efuv − ∑(v, u) ∈ Efvu = bv, i.e. ∑u ∈ Nin(v)xuαuv − xv = bv.
Even with fixed α, this is a
system coupling all nodes through the graph. In particular, the common
choice implicit in many layers—that each node emits a fixed total
amount, effectively xv ≡ 1 (or xv determined
only by a local feature transform)—is incompatible with the above
balance unless b and α satisfy special identities. Thus,
the per-node simplex constraint on α is best understood as defining
relative routing preferences, not a feasible flow.
One may similarly consider incoming-normalized variants, in which coefficients βvu satisfy ∑v ∈ Nin(u)βvu = 1 for each u and aggregation is mu = ∑vβvuM(⋅). This normalization can be read as prescribing how each receiver u averages information from its predecessors. However, if we try to interpret β as edge mass directly, we again obtain a model that enforces a local constraint (a unit inflow) irrespective of what upstream nodes can supply. In the flow picture, inflow should be determined by upstream outflows and external demands, not fixed by a local averaging rule. The distinction is not semantic: conservation asserts that, away from terminals, nodes do not act as net sources or sinks, whereas local normalizations can force precisely such net creation or absorption.
It is helpful to state explicitly the conservation law we have in
mind. Given an edge mass vector f ∈ ℝ ≥ 0E,
Kirchhoff balance with supply vector b ∈ ℝV is the
linear constraint
(Af)v = ∑(u, v) ∈ Efuv − ∑(v, u) ∈ Efvu = bv ∀v ∈ V,
where ∑v ∈ Vbv = 0
ensures global consistency. Nodes with bv > 0 inject
mass, nodes with bv < 0 absorb
mass, and interior nodes satisfy (Af)v = 0.
Two features are immediate. First, the constraint couples incoming and
outgoing edges at each node, hence cannot be enforced by normalizing
only outgoing edges (or only incoming edges) independently. Second, the
constraint is inherently global because the feasibility of balancing at
one node depends on the mass routed through others. Consequently, a
conservation-respecting ``attention’’ must be computed by a procedure
that has access to the entire incidence structure, at least through
local aggregations repeated across the graph.
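As a small illustration of this convention, the sketch below (illustrative names) builds the dense incidence matrix of a toy graph and evaluates the node imbalances Af for a candidate edge mass vector; a practical implementation would use sparse segment reductions instead.

```python
import numpy as np

def incidence_matrix(src, tgt, num_nodes):
    # column e has +1 at its target and -1 at its source, so (Af)_v = inflow(v) - outflow(v)
    A = np.zeros((num_nodes, len(src)))
    A[tgt, np.arange(len(src))] += 1.0
    A[src, np.arange(len(src))] -= 1.0
    return A

src = np.array([0, 0, 1]); tgt = np.array([1, 2, 2])   # edges 0->1, 0->2, 1->2
A = incidence_matrix(src, tgt, num_nodes=3)
f = np.array([0.3, 0.7, 0.3])                          # candidate edge masses
print(A @ f)                                           # [-1., 0., 1.]; feasibility requires A @ f == b
```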
A related point concerns the scale of the edge masses. Standard attention is invariant under adding a constant to all logits leaving a node, since αvu depends only on relative differences. In contrast, the absolute magnitude of a flow is meaningful: multiplying all edge masses by a constant changes the supplies and demands that can be satisfied, and it changes whether capacities are violated. This mismatch clarifies why learned scores alone cannot encode feasibility. Even if the network learns to favor certain edges, without an additional normalization principle it has no mechanism to select the unique global scaling and coupling needed to satisfy Af = b.
Capacities introduce an additional constraint that is likewise nonlocal in effect. In many domains, each edge (u, v) admits a maximum throughput cuv ∈ (0, ∞], requiring 0 ≤ fuv ≤ cuv. Local softmax-based attention does not interact with such bounds in any direct way: because αuv ∈ (0, 1) but is later multiplied implicitly by an uncontrolled node-dependent scale (e.g. the magnitude of messages, or the number of attention heads, or downstream linear transforms), the resulting effective mass on an edge can exceed any prescribed cuv. One may attempt to address this by saturating or clipping edge weights, but doing so after a local normalization generally destroys whatever balance properties one might have hoped to preserve. In other words, conservation and capacities should be treated as constraints on a single object—an absolute edge mass vector—rather than as separate heuristics applied at different stages.
These observations suggest a clean separation: logits e should be interpreted as a learned score expressing edge preferences, while feasibility (conservation and capacity compliance) should be imposed by a global normalization that maps preferences to a pseudo-flow. The mathematical question then becomes: given positive base weights w = exp (e) on edges, how do we choose a nonnegative edge mass f that is as close as possible to w while satisfying Kirchhoff balance (and, when present, capacities)? In the next section we make this question precise by specifying the feasible polytope induced by (G, b, c) and a projection objective that selects a unique pseudo-flow compatible with the graph constraints.
Fix a directed graph G = (V, E). We
regard the output of the scoring network as a vector of edge logits
e ∈ ℝE and
define strictly positive w ∈ ℝ > 0E
by
we := exp (ee), e ∈ E.
The object that will replace attention coefficients is a nonnegative
edge mass (pseudo-flow) f ∈ ℝ ≥ 0E.
Its net imbalance at each node is encoded by the node–edge incidence
operator A ∈ ℝV × E,
(Af)v := ∑(u, v) ∈ Efuv − ∑(v, u) ∈ Efvu, v ∈ V.
We also fix a supply/demand vector b ∈ ℝV
satisfying ∑v ∈ Vbv = 0,
where bv > 0
injects mass and bv < 0
absorbs mass. Finally, each edge e may be equipped with an upper
capacity ce ∈ (0, ∞],
where ce = ∞ indicates
no upper bound.
The constraints we seek to enforce are (i) Kirchhoff balance with
supplies and (ii) capacity compliance. We therefore define the feasible
set
ℱ(G, b, c) := {f ∈ ℝ ≥ 0E : Af = b, 0 ≤ f ≤ c},
where inequalities are interpreted componentwise on edges.
The global consistency condition ∑vbv = 0
is necessary for feasibility, but not sufficient: the capacities and
graph topology can render ℱ(G, b, c) = ∅. In
our development we separate exact feasibility from relaxation: the CCFA layer is defined as a
projection onto ℱ(G, b, c)
whenever the polytope is nonempty, and we discuss below standard
relaxations when it is not.
Two limiting cases are instructive. If b ≡ 0, then Af = 0 enforces conservation at every node, meaning that interior nodes cannot create or destroy mass. If, further, c ≡ ∞, then the only constraints are f ≥ 0 and Af = 0, yielding a circulation polytope. Conversely, if one retains capacities but chooses b supported only on designated sources and sinks, then ℱ(G, b, c) describes capacitated single-commodity flow routing with a prescribed net injection/absorption profile.
Given the base weights w
and constraints ℱ(G, b, c), we
select the pseudo-flow as the unique minimizer of a generalized KL
divergence,
$$
f^\star
\;\in\;
\arg\min_{f\in\mathcal{F}(G,b,c)}
\mathrm{KL}(f\|w),
\qquad
\mathrm{KL}(f\|w)
:=
\sum_{e\in E}
\Bigl(
f_e\log\frac{f_e}{w_e}-f_e+w_e
\Bigr).
$$
We emphasize the role of w as
a base (reference) measure: it encodes the network’s edge preferences in an absolute, positive
scale, while the projection enforces feasibility with minimal distortion
in the KL geometry. This choice yields three properties that are central
for a GNN layer. First, the objective is strictly convex on ℝ ≥ 0E when
w > 0, hence (under
feasibility) the solution is unique. Second, the optimizer depends
smoothly on w in regimes of
interest, enabling stable backpropagation through approximate solvers.
Third, the KL form is compatible with multiplicative reweighting and
sparse iterative Bregman projection algorithms, which we exploit
computationally in the next section.
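For reference, a direct numpy transcription of this objective might look as follows; the helper name generalized_kl and the convention 0 log 0 = 0 are illustrative choices.

```python
import numpy as np

def generalized_kl(f, w):
    # sum over edges of f_e*log(f_e/w_e) - f_e + w_e, with the convention 0*log 0 = 0
    f = np.asarray(f, float); w = np.asarray(w, float)
    terms = np.where(f > 0, f * np.log(np.where(f > 0, f, 1.0) / w), 0.0) - f + w
    return float(terms.sum())

w = np.exp(np.array([0.2, -0.1, 1.0]))       # base weights from edge logits
print(generalized_kl(w, w))                  # 0.0: the divergence vanishes at f = w
print(generalized_kl([0.0, 0.5, 2.0], w))    # > 0 for any other nonnegative f
```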
In some applications b and
c are physical and chosen so
that ℱ(G, b, c) ≠ ∅ by
construction. In others, one wishes to impose ``soft’’ conservation or
capacities as an inductive bias even when strict feasibility is not
guaranteed. A standard remedy is to enlarge the feasible set by
introducing slack variables. For example, one may permit node imbalances
s ∈ ℝV and
solve
minf ≥ 0, 0 ≤ f ≤ c, s ∈ ℝVKL(f∥w) + λ∥s∥1 s.t. Af = b + s,
or, equivalently, augment the graph with a super-source/super-sink
connected by high-capacity edges whose usage is penalized. Such
relaxations preserve the interpretation that the layer attempts to route
mass according to w, but now
pays an explicit price for violating conservation or exceeding
prescribed net injections. In the remainder we focus on the
exact-feasibility setting to state clean theorems; the relaxed variants
follow by standard convex-analytic modifications.
The vector b can be
provided in multiple ways, depending on supervision and domain
knowledge.
(i) If sources and sinks are known, one may set bv = 0 for
interior nodes and distribute a total mass m > 0 across sources and sinks,
e.g. ∑v : bv > 0bv = m
and ∑v : bv < 0bv = −m.
(ii) If only some terminals are known, one may fix b on those nodes and learn the
remainder under the constraint ∑vbv = 0.
(iii) One may parameterize b
as an output of a neural map g
on node embeddings, followed by centering to enforce global
consistency:
$$
\tilde b_v := g(h_v),
\qquad
b_v := \tilde b_v - \frac{1}{|V|}\sum_{u\in V}\tilde b_u,
$$
optionally combined with a scale constraint (e.g. ∥b∥1 = m) to
avoid the trivial near-zero regime. This yields an end-to-end trainable
layer in which the network chooses both where mass is injected/absorbed and how it
is routed, with the routing still respecting the incidence
constraints.
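A minimal sketch of option (iii), assuming a nodewise head has already produced raw values g(h_v); the helper center_supplies and the optional total_mass argument are illustrative.

```python
import numpy as np

def center_supplies(g_out, total_mass=None):
    b = g_out - g_out.mean()                     # enforce sum_v b_v = 0
    if total_mass is not None and np.abs(b).sum() > 0:
        b = b * (total_mass / np.abs(b).sum())   # optional scale constraint ||b||_1 = m
    return b

b = center_supplies(np.array([1.0, 0.2, -0.5, 0.1]), total_mass=2.0)
print(b.sum(), np.abs(b).sum())                  # ~0.0 and 2.0
```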
When the input graph is undirected, we embed it into the directed
setting by replacing each undirected edge {u, v} with two directed
edges (u, v) and
(v, u), yielding a
bidirected graph. Capacities may be assigned per direction (two
parameters), or tied to represent a single physical limit. If a single
undirected capacity c{u, v}
is desired, one convenient convex encoding is
0 ≤ fuv, fvu, fuv + fvu ≤ c{u, v},
which can be handled by extending the box constraints with one
additional linear inequality per undirected edge. For simplicity of
exposition we keep the per-directed-edge box constraint 0 ≤ fe ≤ ce,
which already captures many settings (e.g. asymmetric conductances,
one-way valves, directed communication links), and we note that the
projection framework extends to these coupled-capacity variants with the
same Bregman projection toolkit.
The CCFA normalization is the mapping
CCFA: (G, b, c, w) ↦ f⋆, f⋆ := arg minf ∈ ℱ(G, b, c)KL(f∥w),
and, in practice, its truncated approximation f(K) produced by
a fixed number of sparse Bregman projection iterations. The subsequent
section instantiates this mapping as a differentiable GNN layer: edge
scores generate w, the
projection produces f(K), and f(K) is then
used as the edge weight in message passing and readout.
We consider a single GNN layer that receives node embeddings h ∈ ℝV × d
(and optionally edge features xe) and returns
updated embeddings h′ ∈ ℝV × d.
The CCFA layer intervenes precisely at the point where one would
ordinarily compute attention coefficients: instead of producing locally
normalized weights (e.g. per-node softmax), we produce a global
pseudo-flow f ∈ ℝ ≥ 0E
by KL projection onto ℱ(G, b, c).
Concretely, the layer decomposes into three maps
(h, x) ↦ e ↦ w = exp (e) ↦ f(K) ≈ f⋆,
followed by weighted aggregation using f(K) as edge
weights. The supply vector b
and capacities c may be fixed
from domain knowledge or predicted by auxiliary heads; once chosen, they
define the constraint geometry that the learned scores must respect.
Given embeddings hu, hv ∈ ℝd
and edge features xuv, we
define an edge logit euv ∈ ℝ
by a differentiable scoring function. Typical choices include an MLP on
concatenated features or a bilinear form,
euv = ϕ ([hu, hv, xuv]), or euv = hu⊤Whv + a⊤xuv,
where ϕ is a shared network
and W, a are
learnable parameters. Multi-head variants are obtained by producing
H logits e(1), …, e(H)
and projecting each head separately, yielding flows f(K, 1), …, f(K, H)
that parameterize multiple constrained routing patterns. We then set
wuv = exp (euv) > 0
as the base measure for the projection; the exponential map is not
merely conventional, but matches the multiplicative structure of KL
geometry and ensures positivity without ad hoc clipping.
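The bilinear variant can be sketched as follows (illustrative shapes and names; any differentiable scorer producing one logit per edge plays the same role).

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_e = 8, 4                                  # node / edge feature dimensions (illustrative)
W = rng.normal(size=(d, d))                    # learnable bilinear weight
a = rng.normal(size=d_e)                       # learnable edge-feature weight

def edge_logits(h, x_edge, src, tgt):
    # e_uv = h_u^T W h_v + a^T x_uv, evaluated for all edges at once
    return np.einsum("ed,df,ef->e", h[src], W, h[tgt]) + x_edge @ a

h = rng.normal(size=(3, d))
x_edge = rng.normal(size=(3, d_e))
src = np.array([0, 0, 1]); tgt = np.array([1, 2, 2])
w = np.exp(edge_logits(h, x_edge, src, tgt))   # strictly positive base weights
```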
Given w, the CCFA normalization returns f(K) approximating the unique I-projection f⋆ onto ℱ(G, b, c). In contrast to local normalizations, the defining constraints couple edges across the entire graph via the incidence operator. In particular, conservation Af = b encodes that every node can only emit or absorb mass according to its prescribed supply, and capacities 0 ≤ f ≤ c enforce edgewise feasibility. The effect is that increasing the score of one edge can indirectly decrease mass on distant edges in order to maintain global balance. This coupling is the mechanism by which CCFA expresses inductive biases such as conservation laws, flow bottlenecks, and multi-path routing. When b ≡ 0, the pseudo-flow is a circulation up to capacities; when b is supported on a small terminal set, the pseudo-flow concentrates on paths that connect sources to sinks while respecting congestion limits.
Once f(K) is
computed, we use it as the edge weight in aggregation. A generic form
is
mv := ∑(u, v) ∈ Efuv(K) M (hu, xuv), hv′ := U (hv, mv),
with learnable maps M
(message) and U (update),
e.g. linear layers with nonlinearities and residual connections. We
emphasize that f(K) is mass
rather than a convex combination, so the scale of mv is controlled
by the scale of the feasible set (through b and c) rather than by a simplex
constraint. This is desirable in physical settings where the total
throughput is meaningful; when a fixed scale is preferred (for numerical
conditioning or architectural parity), we may normalize supplies so that
∥b∥1 = m
is constant across graphs, or include a learned global scalar α and use αf(K)
inside mv.
In either case, conservation provides an interpretable guarantee: for
nodes with bv = 0, the
incoming and outgoing mass of f(K) nearly
match, with the deviation controlled by the solver tolerance.
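A minimal sketch of this flow-weighted aggregation with the identity message map M(h_u) = h_u (illustrative names; a learned message/update pair would replace the identity).

```python
import numpy as np

def aggregate(h, f, src, tgt, num_nodes):
    msgs = f[:, None] * h[src]          # messages M(h_u) = h_u, scaled by edge mass f_uv
    m = np.zeros((num_nodes, h.shape[1]))
    np.add.at(m, tgt, msgs)             # m_v = sum over incoming edges (u, v)
    return m                            # an update map U(h_v, m_v) would follow

h = np.eye(3)
f = np.array([0.3, 0.7, 0.3])           # edge masses on 0->1, 0->2, 1->2
src = np.array([0, 0, 1]); tgt = np.array([1, 2, 2])
print(aggregate(h, f, src, tgt, num_nodes=3))
```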
For graph-level prediction, we apply any standard readout to the final node embeddings, e.g. sum/mean pooling, attention pooling, or Set2Set. Additionally, CCFA provides a structured object f(K) that can be exploited directly. For instance, we may define edge-aggregated features such as total utilized capacity ∑efe(K)/ce (with the convention 1/∞ = 0), cut flows across a designated partition, or per-node throughputs ∑(u, v)fuv(K). These quantities can be concatenated to the pooled embedding to form a hybrid readout that combines learned representations with physically interpretable statistics. When supervision includes flow-related targets, one may also add auxiliary losses on Af(K) or on specific edge subsets, thereby aligning the learned attention with domain constraints without requiring exact flow labels.
CCFA recovers several familiar mechanisms as limiting or specialized cases. If we drop all constraints and simply set f = w, then the layer reduces to unnormalized edge gating. If we retain only local outgoing simplex constraints (and no other couplings), then the KL projection decouples by node and yields the standard outgoing softmax attention; this is precisely the sense in which softmax attention is a KL projection, but onto a local polytope rather than ℱ(G, b, c). If we set c ≡ ∞ but enforce Af = b, we obtain a globally balanced reweighting of w that behaves like a circulation-based attention mechanism: the layer can only route mass along cycles unless sources/sinks are introduced via b. Many ``flow-based’’ GNN constructions may be interpreted as choosing a constraint family and a divergence; CCFA fixes the divergence to generalized KL (for strict convexity and multiplicative structure) and allows the practitioner to interpolate between unconstrained attention, locally normalized attention, and globally constrained flow attention by selecting (b, c) and, when desired, introducing slack relaxations. The next section addresses how we compute f(K) efficiently on sparse graphs.
We first treat the case without finite capacities, i.e. c ≡ ∞, where the feasible set
reduces to the affine slice {f ≥ 0 : Af = b}.
The KL I-projection onto an incidence-defined affine constraint admits
the familiar exponential-family form: there exists a potential (dual)
vector p ∈ ℝV, unique
up to an additive constant on each weakly connected component, such that
the balanced flow induced by p
is
$$
f_{uv}(p) := w_{uv}\,\exp\bigl((A^{\top}p)_{uv}\bigr) = w_{uv}\,\exp(p_v - p_u),
\qquad (u,v)\in E.
$$
Indeed, if we form the Lagrangian for minf ≥ 0KL(f∥w)
subject to Af = b, the
first-order optimality conditions yield log (fe/we) = (A⊤p)e,
and with our incidence convention (A⊤p)uv = pv − pu
we recover the exponential form above. Thus, enforcing balance reduces to finding p such that Af(p) = b.
We accomplish this by iterative proportional fitting in the dual,
implemented as repeated sparse aggregations of incoming/outgoing mass
followed by a potential update.
Concretely, define the node residual
r(p) := Af(p) − b ∈ ℝV, rv(p) = ∑(u, v) ∈ Efuv(p) − ∑(v, u) ∈ Efvu(p) − bv.
We view p ↦ r(p) as a
nonlinear map with a sparse Jacobian structure inherited from G. In practice we employ damped
first- or second-order updates. A simple and effective choice is
gradient ascent on the concave dual objective
maxp ∈ ℝV ⟨b, p⟩ − ∑(u, v) ∈ Ewuvexp (pv − pu),
whose gradient is b − Af(p) = −r(p);
hence one may update
$$
p \;\leftarrow\; p + \eta\,\bigl(b - Af(p)\bigr) \;=\; p - \eta\, r(p),
$$
with a stepsize η > 0
(possibly scheduled or learned). To reduce stepsize sensitivity, we may
also apply a diagonal Newton-like preconditioner using the local
curvature
Hvv(p) ≈ ∑(v, u) ∈ Efvu(p) + ∑(u, v) ∈ Efuv(p),
and update pv ← pv − η rv/Hvv
for nodes with Hvv > 0.
We remove the gauge freedom by fixing pv0 = 0
for a designated node v0 per component (or by
subtracting the mean of p
after each iteration), which is numerically important to prevent
drift.
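The following numpy sketch implements this damped, diagonally preconditioned dual update for the uncapacitated case, assuming the parametrization f_uv(p) = w_uv exp(p_v − p_u) displayed above; the function name balance_project, the iteration count, and the step size are illustrative defaults rather than the exact solver settings used later.

```python
import numpy as np

def balance_project(w, src, tgt, b, num_nodes, num_iters=200, step=0.5):
    # approximate KL projection onto {f >= 0 : Af = b} via node potentials p,
    # with f_uv(p) = w_uv * exp(p_v - p_u) and the damped update p <- p - eta * r / H
    p = np.zeros(num_nodes)
    for _ in range(num_iters):
        f = w * np.exp(p[tgt] - p[src])              # flows induced by the potentials
        Af = np.zeros(num_nodes)
        np.add.at(Af, tgt, f)
        np.add.at(Af, src, -f)
        r = Af - b                                   # conservation residual
        H = np.zeros(num_nodes)
        np.add.at(H, tgt, f)
        np.add.at(H, src, f)                         # diagonal curvature (total incident mass)
        p -= step * r / np.maximum(H, 1e-12)         # damped diagonal-Newton step
        p -= p.mean()                                # remove the additive gauge freedom
    return w * np.exp(p[tgt] - p[src])

# toy instance: edges 0->1, 0->2, 1->2 with unit base weights
src = np.array([0, 0, 1]); tgt = np.array([1, 2, 2])
f = balance_project(np.ones(3), src, tgt, b=np.array([-1.0, 0.0, 1.0]), num_nodes=3)
```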
We now incorporate finite capacities 0 ≤ f ≤ c. The feasible
region is the intersection of two closed convex sets in $\mathbb{R}^E_{\ge 0}$: the affine balance
set 𝒜 := {f ≥ 0 : Af = b}
and the box set ℬ := {f : 0 ≤ f ≤ c}.
We compute the I-projection onto 𝒜 ∩ ℬ
by alternating Bregman projections with Dykstra corrections. At a high
level, we iterate
$$
f \xrightarrow{\ \Pi_{\mathcal{A}}^{\mathrm{KL}}\ } \tilde f
\xrightarrow{\ \Pi_{\mathcal{B}}^{\mathrm{KL}}\ } f^+,
$$
but because ΠℬKL and Π𝒜KL do not
commute, Dykstra’s method maintains edgewise correction variables to
ensure convergence to the projection onto the intersection rather than
to a limit cycle.
The KL projection onto the box set is edge-separable. Given a
positive vector g ∈ ℝ > 0E,
the minimizer of KL(f∥g) subject to 0 ≤ f ≤ c is simply
$$
\bigl(\Pi_{\mathcal{B}}^{\mathrm{KL}}(g)\bigr)_e \;=\; \min\{g_e,\, c_e\},
\qquad e\in E,
$$
since the unconstrained minimizer is f = g and the objective is
strictly convex with monotone derivative in each coordinate. The
nontrivial part is accounting for the interaction between the box projection and the
balance projection. A convenient multiplicative form of KL-Dykstra
maintains a correction factor q ∈ ℝ > 0E
(equivalently, an additive log-correction) so that the current iterate
is represented as a reweighted base measure. One practical variant
is:
$$
\text{(i) } \hat w_e \leftarrow w_e\, q_e,\qquad
\text{(ii) } \tilde f \leftarrow \Pi_{\mathcal{A}}^{\mathrm{KL}}(\hat w)\ \ \text{(via potentials, as in the incidence exponential form above)},
$$
$$
\text{(iii) } f \leftarrow \Pi_{\mathcal{B}}^{\mathrm{KL}}(\tilde f)\ \
\text{(via the clipping rule above)},
\qquad
\text{(iv) } q_e \leftarrow q_e\, \frac{\tilde f_e}{f_e}.
$$
Step (iv) is the Dykstra correction: it ensures that mass removed by the
box projection is “remembered” when we subsequently rebalance. When
ce = ∞ we
may omit that edge from the capacity step by treating the clip as
identity, and we keep qe ≡ 1.
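Steps (i)–(iv) can be sketched as the following loop, reusing the balance_project routine from the earlier sketch for step (ii); this is an illustrative outline of the cycle rather than a production Dykstra solver.

```python
import numpy as np

def ccfa_project(w, src, tgt, b, c, num_nodes, num_cycles=20):
    # alternating KL projections with a multiplicative Dykstra correction q
    q = np.ones_like(w)
    f = np.minimum(w, c)                                          # initial iterate
    for _ in range(num_cycles):
        w_hat = w * q                                             # (i) corrected base weights
        f_tilde = balance_project(w_hat, src, tgt, b, num_nodes)  # (ii) enforce Af = b
        f = np.minimum(f_tilde, c)                                # (iii) clip to the box [0, c]
        q = q * f_tilde / np.maximum(f, 1e-300)                   # (iv) remember clipped mass
    return f
```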
The core computational kernels are sparse edgewise transforms and
per-node reductions. We store each graph (or a batch of graphs) in
adjacency-list form with contiguous edge arrays and index vectors for
sources and targets. Given node potentials p, we evaluate for all edges in
parallel:
log fe = log we + psrc(e) − ptgt(e), fe = exp (log fe),
followed by segment-reductions to compute incoming and outgoing totals
for each node, hence the residual r = Af − b.
This is O(|E|) work
per iteration with favorable memory access (two gathers for potentials,
one fused multiply-add/exponentiation, two scatters for in/out sums).
For multiple graphs, we concatenate their edge lists and offset node
indices; all reductions become segmented by graph, and we optionally fix
one gauge node per graph to remove additive invariances.
Because the flow parametrization is multiplicative, we control overflow/underflow by operating in the log domain as long as possible. We keep log w = e and represent Dykstra corrections as log q; then log ŵ = e + log q and the edge log-flow is computed as e + log q + pu − pv. We may clamp log fe to a finite interval [−L, L] before exponentiation (with L chosen according to the floating-point format) to prevent NaNs in early training when scores are poorly scaled. Gauge normalization of p (e.g. subtracting the per-graph mean) prevents unbounded drift that would otherwise be invisible to the constraints but harmful numerically. Finally, we damp potential updates and, if needed, clip large residuals rv to stabilize training under backpropagation through K iterations.
We typically run a fixed number of iterations K for predictable runtime, but we monitor feasibility diagnostics ∥Af − b∥ and capacity violation ∥(f − c)+∥ during development to select K and damping. The entire procedure is differentiable by unrolling; all operations are compositions of smooth maps (except for the clip in the capacity (box) projection, which is piecewise smooth and works well with subgradients). When memory is constrained, we checkpoint only the final potentials and recompute intermediate iterates, or we employ implicit differentiation through the fixed-point conditions when higher accuracy is required. These algorithmic choices produce a sparse, GPU-friendly normalization subroutine whose per-iteration cost is linear in |E| and whose numerical behavior is consistent with the KL geometry of the projection.
We study the map from positive edge weights w = exp (e) to a capacity-constrained pseudo-flow f⋆ defined as the unique KL (I-)projection onto ℱ(G, b, c). Throughout, we assume we > 0 for all e ∈ E and that the feasible polytope ℱ(G, b, c) is nonempty.
Consider the convex program
$$
\min_{f\in\mathcal{F}(G,b,c)} \;\mathrm{KL}(f\|w).
$$
The objective KL(⋅∥w) is
proper, lower semicontinuous, and strictly convex on ℝ ≥ 0E when
w > 0 (strictness follows
from strict convexity of x ↦ xlog x on
(0, ∞) together with separability). The
constraint set ℱ(G, b, c) is a
closed convex polytope. Hence the program admits a unique minimizer f⋆ ∈ ℱ(G, b, c).
In particular, Kirchhoff balance holds at optimality: Af⋆ = b,
so for every node v with bv = 0 we have
∑ufuv⋆ = ∑ufvu⋆.
We record the associated optimality system. Introducing Lagrange
multipliers p ∈ ℝV for Af = b and
multipliers α, β ∈ ℝ ≥ 0E
for f ≤ c and f ≥ 0 respectively, the KKT conditions
for this program imply that, for each edge e = (u, v),
$$
\log\frac{f^\star_{uv}}{w_{uv}} + (p_v - p_u) + \alpha_{uv} - \beta_{uv} = 0,
\qquad
\alpha_{uv}\bigl(f^\star_{uv}-c_{uv}\bigr)=0,
\quad
\beta_{uv}\, f^\star_{uv}=0.
$$
When capacities are inactive on an edge (i.e. 0 < fuv⋆ < cuv),
we have αuv = βuv = 0
and recover the pure incidence exponential form fuv⋆ = wuvexp (pu − pv).
The multipliers (α, β) thus play the role
of edgewise corrections that ``turn on’’ precisely at saturated or
extinguished edges, matching the geometry exploited by KL-Dykstra
updates.
The KL projection view strictly generalizes standard attention
normalizations. To make this precise, fix b ≡ 0 and replace global balance
Af = 0 by
constraints
$$
\sum_{u\in N_{\mathrm{out}}(v)} f_{vu} = 1
\qquad\text{for every } v \text{ with } N_{\mathrm{out}}(v)\neq\emptyset,
$$
with no coupling across nodes and no upper bounds. The minimization of
KL(f∥w) subject to these per-node constraints
separates across v. For each
node v, introducing a scalar
multiplier λv for the
equality constraint yields the stationarity condition
$$
\log\frac{f_{vu}}{w_{vu}} = \lambda_v\qquad\Longrightarrow\qquad
f_{vu}=w_{vu}\exp(\lambda_v).
$$
Enforcing ∑ufvu = 1
gives exp (λv) = (∑uwvu)−1
and hence
$$
f_{vu}
= \frac{w_{vu}}{\sum_{u'\in N_{\mathrm{out}}(v)} w_{vu'}}
= \frac{\exp(e_{vu})}{\sum_{u'\in N_{\mathrm{out}}(v)} \exp(e_{vu'})}
= \alpha_{vu}.
$$
Thus standard outgoing-softmax attention is exactly the KL I-projection
onto per-node simplices, whereas CCFA performs the I-projection onto the
globally coupled polytope ℱ(G, b, c).
When c ≡ ∞, the method reduces to enforcing the affine constraints in the dual incidence family, i.e. the generalized Sinkhorn/IPFP scaling for a log-linear model on edges. In this setting, convergence follows from strict convexity of the dual and smoothness of the map p ↦ Af(p); with capacities, Dykstra corrections restore convergence for the intersection.
In learning and inference we run a fixed number K of iterations, producing an
approximate solution f(K). We
therefore require quantitative statements connecting iteration count (or
objective error) to constraint violation. A convenient starting point is
that KL is strongly convex on bounded boxes: if f and g lie in [ℓ, u]E
for some 0 < ℓ ≤ u < ∞,
then
$$
\mathrm{KL}(f\|w)\;\ge\;\mathrm{KL}(g\|w)
+\bigl\langle \nabla\,\mathrm{KL}(g\|w),\, f-g\bigr\rangle
+\frac{1}{2u}\,\|f-g\|_2^2 .
$$
Assuming the iterates are kept in such a bounded region (e.g. by finite
capacities and/or log-clamping in implementation), an objective gap
bound KL(f(K)∥w) − KL(f⋆∥w) ≤ ε
implies $\|f^{(K)}-f^\star\|_2\le
\sqrt{2u\varepsilon}$. Since Af⋆ = b,
we obtain a conservation bound
$$
\|Af^{(K)}-b\|_2
=\|A\,(f^{(K)}-f^\star)\|_2
\;\le\;\|A\|_{2\to 2}\sqrt{2u\varepsilon},
$$
and similarly in other norms using the corresponding operator bounds for
A. For capacities, if the
algorithm includes an explicit projection onto ℬ at each cycle, then 0 ≤ f(K) ≤ c
holds exactly; otherwise, one may bound ∥(f(K) − c)+∥
by comparing to the clipped vector and controlling the resulting KL
increase. These inequalities formalize the sense in which Kirchhoff’s
law and capacity compliance hold approximately at finite K, with an explicit ε-dependent violation. In practice,
we use these bounds to select K and damping so that the residual
is commensurate with downstream predictive noise, ensuring that CCFA’s
structural bias is present without over-solving the inner
projection.
We summarize the computational cost of a CCFA layer and relate it to unavoidable lower bounds, both for reading the sparse graph input and for attaining a prescribed accuracy ε in the KL projection.
A single forward layer decomposes into (i) edge scoring, (ii)
constrained normalization (approximate KL projection), and (iii)
weighted message passing. For typical GNN scorers (bilinear forms or
small MLPs) the edge scoring step evaluates euv = score(hu, hv, edge_featuv)
for each (u, v) ∈ E, which
costs
Tscore = O(|E|⋅d)
up to constants depending on the number of heads and the scorer
architecture. The message passing step computes mv = ∑(u, v) ∈ Efuv M(hu, edge_featuv)
and then applies an update map U; this similarly costs
Tmp = O(|E|⋅d)
for standard linear or MLP messages.
The normalization step is linear in |E| per iteration. Indeed, in the
dual-incidence parametrization fuv = wuvexp (pu − pv)
(with additional edgewise corrections when capacities are active), each
update of the node potentials p requires computing nodewise
aggregates over incident edges. Concretely, evaluating the balance
residual r = Af − b
requires one pass over edges to accumulate incoming and outgoing sums,
hence O(|E|) time,
plus O(|V|) to
combine with b. Updating p (by IPFP-like scaling, damped
Newton, or a gradient step in the smooth dual when c ≡ ∞) is then O(|V|) on top of these
aggregates. If we include a Dykstra-style capacity substep, we
additionally perform edge-local box projections and correction updates,
again a single pass over edges. Consequently,
Tnorm(K) = O(K ⋅ |E|) and Tlayer = O(|E|⋅d + K ⋅ |E|).
In practice the O(|V|) terms are dominated
by |E| on sparse graphs with
moderate average degree. Moreover, the operations are streaming over
adjacency lists and thus amenable to fused GPU kernels: the dominant
primitives are segment-reductions (per node) and pointwise
exponentiation/multiplication (per edge). Numerical stabilization
(e.g. log-domain accumulation, clamping of p, or damping) increases constant
factors but does not change the O(K|E|)
scaling.
The solver maintains the edge weights w, the current pseudo-flow f, and auxiliary vectors (dual
potentials p ∈ ℝV and, with
capacities, correction variables in ℝE). Hence the memory
footprint is
Slayer = O(|E|+|V|) (solver
buffers) and O(|V|⋅d + |E|⋅de) (features),
again up to constants for multiple heads and optimizer state.
The iteration count K required to reach a target tolerance ε is governed by the convergence results for alternating KL projections (Theorem~3) and by the translation from objective error to feasibility residuals (Theorem~4). While the precise complexity depends on the chosen update rule for p and on the conditioning of the instance (in particular on the dynamic range of w and on the graph structure through A), the general form is that K is polynomial in 1/ε and logarithmic in the relevant scale parameters (e.g. log (maxewe/minewe)), in line with known entropic scaling and Sinkhorn-type analyses.
For downstream use, it is often more meaningful to select K by specifying a conservation
tolerance η rather than an
objective tolerance. Under boundedness of iterates (e.g. by finite
capacities or implementation clamping), the strong convexity estimate
and the residual inequality yield the implication
$$
\mathrm{KL}(f^{(K)}\|w)-\mathrm{KL}(f^\star\|w)\le \varepsilon
\quad\Longrightarrow\quad
\|Af^{(K)}-b\|_2 \le \|A\|_{2\to 2}\sqrt{2u\varepsilon}.
$$
Thus achieving ∥Af(K) − b∥2 ≤ η
is ensured by taking ε ≤ η2/(2u∥A∥2 → 22)
and then running K = K(ε)
iterations as prescribed by Theorem~3. If we enforce 0 ≤ f(t) ≤ c
at each cycle (by an exact KL projection onto the box constraints in the
Dykstra scheme), capacity compliance holds identically for all iterates,
and the only residual of interest is conservation.
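As a small worked example of this translation (the helper required_kl_gap and the numeric values are illustrative), one can pick ε from a target conservation tolerance η together with crude bounds on u and ∥A∥2→2:

```python
import numpy as np

def required_kl_gap(eta, u, A_opnorm):
    # KL suboptimality gap that guarantees ||A f^(K) - b||_2 <= eta under the bound above
    return eta ** 2 / (2.0 * u * A_opnorm ** 2)

# crude bound for an incidence matrix: ||A||_2^2 <= ||A||_1 * ||A||_inf
# = (max column abs sum) * (max row abs sum) = 2 * (max total degree)
max_total_degree = 4
eps = required_kl_gap(eta=1e-2, u=5.0, A_opnorm=np.sqrt(2 * max_total_degree))
print(eps)   # the objective tolerance to hand to the inner solver
```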
The linear dependence on |E| is not merely an artifact of our implementation: it is information-theoretically necessary in the sparse input model. Any procedure that outputs a nontrivial pseudo-flow f which depends on the edge data (w, c) must, in the worst case, inspect Ω(|E|) edges. Formally, an adversary may modify a single edge weight we or capacity ce so that the unique projection f⋆ changes by a fixed amount on that edge (and potentially propagates through the balance constraints). An algorithm that does not read e cannot distinguish the two instances and therefore cannot be correct on both; this yields the Ω(|E|) lower bound stated in Theorem~5. Consequently, the per-iteration cost O(|E|) is optimal up to constants among algorithms that must react to all edges.
Beyond input reading, there are also lower bounds on the number of iterations needed by Sinkhorn/IPFP-like methods to reach ε-accurate constraints in worst-case instances, even in classical entropic optimal transport and matrix scaling. These results imply that one should not expect a uniformly logarithmic dependence on 1/ε for first-order multiplicative scaling schemes; rather, a polynomial dependence on 1/ε is intrinsic for broad instance classes (see, e.g., lower-bound discussions in the entropic OT literature). Since our balance constraints are encoded by the sparse incidence operator A, CCFA inherits the same fundamental limitation: any method that enforces Af = b via repeated local multiplicative corrections must propagate information across the graph, and the number of passes required scales at least with the desired precision and with the graph’s mixing/conditioning properties. In this sense, Theorem~3 is best interpreted as guaranteeing convergence and a polynomial iteration bound under feasibility and positivity, while Theorem~5 explains why the dominant O(K|E|) runtime is essentially unavoidable on sparse inputs.
Taken together, these bounds justify the practical regime in which we operate CCFA: we fix a moderate K (constant across batches) so that Tnorm(K) is comparable to Tscore and Tmp, and we appeal to Theorem~4 to certify that the resulting pseudo-flow satisfies Kirchhoff balance up to an explicit, tunable residual. The lower bounds indicate that, absent additional structure or stronger assumptions, asymptotically faster normalization than O(|E|) per iteration or dramatically fewer than polynomial-in-1/ε iterations cannot be guaranteed in the worst case, reinforcing the design choice of sparse, batched, approximate projections.
We design experiments to isolate three questions: (a) whether enforcing global flow constraints improves predictive accuracy on tasks with underlying conservation laws, (b) whether the returned pseudo-flow is faithful to physically meaningful flows produced by simulators, and (c) how sensitive the method is to distribution shift and to potentially misspecified constraints.
We target two families of benchmark problems in which directed flow is a natural latent variable.
We consider graphs representing power networks, where nodes are buses
and directed edges are transmission lines. Node features include
exogenous injections/loads, voltage setpoints (when available), and
static attributes; edge features include line parameters and thermal
ratings (used as capacities ce). Typical
prediction targets include (i) per-bus voltage magnitude/angle, (ii)
per-line active/reactive power flow magnitudes, and (iii) feasibility or
contingency labels (e.g. whether an operating point violates line
limits). In these tasks, conservation is naturally expressed through a
supply vector b derived from
injections/loads (possibly augmented with slack-bus conventions).
We consider directed circuit graphs in which nodes are circuit
components or connection points, edges represent directed current
pathways, and capacities encode maximum allowable current (or are set to
∞ when not applicable). Targets include
simulated node potentials, currents on designated edges, and pass/fail
labels for over-current or connectivity constraints. Although real
circuits are not purely directed-flow objects, these tasks provide a
controlled setting in which Kirchhoff-type constraints (on appropriate
graph abstractions) are meaningful.
We compare CCFA against attention and message-passing baselines that
differ only in the normalization step:
(i) softmax attention, i.e. per-node simplex normalization on outgoing edges (Theorem~2
identifies this as a local KL projection).
(ii) unnormalized edge gating, i.e. using w = exp (e) directly as
edge weights.
(iii) Sinkhorn (doubly stochastic) normalization on bipartite or layered variants when applicable, to separate the
effect of global constraints from mere scaling.
(iv) standard message-passing baselines (e.g. GCN/GAT variants) matched in depth and parameter count.
For CCFA, we include ablations that remove individual constraints: (a) balance only (Af = b with c ≡ ∞), (b) capacities only (box projection without enforcing Af = b), and (c) full constraints (Af = b and 0 ≤ f ≤ c). We also vary the iteration budget K to expose the compute–accuracy tradeoff.
We report standard supervised metrics (MSE/MAE for regression targets and NLL/AUROC for classification targets) averaged over multiple random seeds. To ensure a fair comparison, we hold constant the encoder, scorer architecture, and message/update maps, and change only the normalization mechanism. Hyperparameters (learning rate, weight decay, dropout, and K) are tuned on validation splits. Since CCFA introduces an additional algorithmic component, we also report wall-clock time per epoch and throughput (graphs/sec) to quantify the computational overhead relative to softmax attention.
A central claim is that the learned pseudo-flow f(K) is
interpretable as a physically constrained transport of attention mass.
To evaluate this, we align the pseudo-flow with simulator outputs when
the latter are available:
for PowerGraph-like tasks, we compare f(K) (after
appropriate scaling) to AC or DC power-flow solutions on each line; for
circuit tasks, we compare to simulated currents on designated directed
edges. We report correlation measures (Spearman and Pearson), relative
error on magnitudes (e.g. ∥f − f̂∥1/∥f̂∥1),
and rank-based agreement on the top-k highest-flow edges. When the
simulator flow is only defined up to sign conventions or undirected
magnitudes, we compare |fe| to the
corresponding magnitudes.
To connect fidelity to the optimization guarantees, we also report
the conservation residual and capacity residual directly:
$$
\mathrm{Res}_A := \frac{\|Af^{(K)}-b\|_1}{\|b\|_1+\epsilon_0},
\qquad
\mathrm{Res}_c := \frac{\|(f^{(K)}-c)_+\|_1}{\|c\|_1+\epsilon_0},
$$
for a small ϵ0 > 0 to avoid
degeneracy when b ≡ 0 or c ≡ ∞ on many edges. These
diagnostics permit a quantitative check that the solver behaves as
intended and allow us to correlate downstream error with constraint
violation.
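For completeness, the diagnostics above can be computed as in the following sketch (illustrative names); restricting the capacity normalizer to edges with finite c_e is one convenient way to handle c_e = ∞.

```python
import numpy as np

def residual_diagnostics(f, src, tgt, b, c, eps0=1e-8):
    num_nodes = len(b)
    Af = np.zeros(num_nodes)
    np.add.at(Af, tgt, f)
    np.add.at(Af, src, -f)
    res_A = np.abs(Af - b).sum() / (np.abs(b).sum() + eps0)       # conservation residual
    finite = np.isfinite(c)                                       # ignore c_e = inf in the normalizer
    overflow = np.clip(f[finite] - c[finite], 0.0, None)
    res_c = overflow.sum() / (c[finite].sum() + eps0)             # capacity residual
    return res_A, res_c
```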
Even when we enforce box constraints via a Dykstra substep, finite precision and implementation details (e.g. mixed precision, log-domain stabilization, or damping) can produce small violations. We therefore report: (i) the maximum relative violation maxe(fe − ce)+/( ce + ϵ0), (ii) the fraction of edges with violation exceeding a threshold τ, and (iii) the total overflow mass ∥(f − c)+∥1 normalized by ∥f∥1. These summaries are presented per dataset and stratified by degree and by capacity scale to reveal whether violations concentrate on hubs or on tight bottlenecks.
Since the normalization step propagates information globally through the incidence constraints, we test robustness to structural shift. We construct out-of-distribution splits by (i) training on graphs with bounded average degree and testing on graphs with higher degree (or vice versa), (ii) perturbing topology by random edge deletions/additions while preserving node features, and (iii) training on one family of motifs (e.g. radial networks) and testing on another (e.g. meshed networks). We report the same predictive metrics as in-distribution, together with residual diagnostics ResA, Resc, to distinguish failures due to representation shift from failures due to nonconvergence of the inner projection.
To avoid presenting constraints as uniformly beneficial, we include a task in which conservation is either irrelevant or misspecified. Concretely, we consider a dataset in which the true signal is localized (e.g. edge classification based on local features) and the supplied b is uninformative or noisy; alternatively, we inject controlled noise into b and measure performance degradation as a function of noise level. We expect CCFA to underperform softmax attention in this regime, since the projection enforces a global coupling that may suppress useful local evidence. We report the crossover point (noise level or task type) at which CCFA ceases to help, and we relate this observation to a modeling recommendation: when b or c are uncertain, one should either (i) learn them with appropriate regularization, or (ii) relax constraints by penalization rather than hard projection, both of which we address next.
We view CCFA primarily as a normalization principle: given scores e and a constraint specification (A, b, c), we replace ad hoc local normalizations by the unique I-projection onto ℱ(G, b, c). This perspective suggests several extensions in which the combinatorial object (the feasible polytope) is modified while the variational form minf ∈ ℱKL(f∥w) is preserved.
In some applications b is
observed (e.g. injections/loads), but in others it is latent or only
partially specified. A direct approach is to parameterize b = bθ(x)
as a nodewise map from features xv (and possibly
global context) and to train θ
jointly with the scorer producing e. Since feasibility requires ∑vbv = 0,
we may enforce this by a differentiable centering step
$$
\tilde b_v := b_{\theta}(x)_v - \frac{1}{|V|}\sum_{u\in V}
b_{\theta}(x)_u,
$$
or, when sources/sinks are known up to a slack convention, by fixing a
designated slack node s and
setting b̃s := −∑v ≠ sbθ(x)v.
If magnitudes should be bounded (e.g. due to physical limits), one may
also constrain b via bv = B ⋅ tanh (b̄v)
with learned b̄v. We emphasize
an identifiability issue: if both b and e are learned without supervision on
flows, then multiple (b, e) pairs can yield
similar downstream predictions. In such regimes we expect that
regularization is necessary, for example penalizing ∥b∥1 or encouraging
sparsity of nonzero supplies, and we recommend reporting the induced
residual ∥Af(K) − b∥
together with predictive metrics to ensure that the learned b is not being ``explained away’’ by
solver nonconvergence. Finally, if strict feasibility of (G, b, c) is
uncertain, one may introduce a slack variable s ∈ ℝV and solve
with relaxed balance Af = b + s
and a penalty λ∥s∥1 (or ∥s∥22), which
retains convexity and yields an explicit knob controlling the
conservation–fit tradeoff.
Many domains involve multiple interacting conserved quantities
(e.g. phases, species, or parallel demands). A natural extension
introduces commodities indexed by k ∈ {1, …, m} with
pseudo-flows f(k) ∈ ℝ ≥ 0E,
supplies b(k), and
(optionally) commodity-specific scores e(k) and base
weights w(k) = exp (e(k)).
The simplest decoupled variant enforces Af(k) = b(k)
and 0 ≤ f(k) ≤ c(k)
independently, with objective ∑kKL(f(k)∥w(k));
this amounts to running m CCFA
layers in parallel. The more interesting case is a coupled capacity constraint:
$$
Af^{(k)} = b^{(k)}\quad\forall k,
\qquad
\sum_{k=1}^m f^{(k)}_e \le c_e \quad \forall e\in E,
$$
with f(k) ≥ 0. The
feasible set remains convex, and the KL objective remains strictly
convex in the joint variable (f(1), …, f(m)),
hence the I-projection is still unique when feasible. Algorithmically,
alternating Bregman projections extend by adding an edgewise projection
onto the coupled constraint {z ∈ ℝ ≥ 0m : ∑kzk ≤ ce}
for each edge e, which is a
low-dimensional KL projection that can be computed efficiently (via a
single scalar dual variable per edge). This yields an implementation
whose per-iteration cost remains O(m|E|), with the
same qualitative convergence guarantees as in Theorem~3 under standard
regularity conditions.
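The edgewise coupled projection admits a closed form, sketched below for a single edge (illustrative names): if the shared capacity is inactive the masses are unchanged, and otherwise the scalar dual variable amounts to a uniform rescaling onto the capacity.

```python
import numpy as np

def project_coupled_capacity(g, c):
    # KL projection of per-commodity masses g = (g_1, ..., g_m) onto {z >= 0 : sum_k z_k <= c}
    total = g.sum()
    if total <= c:
        return g.copy()            # constraint inactive: the projection leaves g unchanged
    return g * (c / total)         # active: z_k = g_k * exp(-nu), with nu fixed so sum_k z_k = c

print(project_coupled_capacity(np.array([0.5, 0.8, 0.3]), c=1.0))   # rescaled to sum to 1.0
```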
Although our presentation is in message-passing language, CCFA can be inserted wherever a model forms nonnegative edge weights and aggregates along directed edges. For Graph Transformers, a head typically defines scores on a set of admissible pairs (often all pairs, or a sparsified pattern) and then applies a rowwise softmax. Replacing this with a CCFA projection enforces global conservation across the attention graph, thereby coupling what are otherwise independent per-query normalizations. Concretely, for each head we may form a directed graph on tokens with edges restricted to a sparse pattern (e.g. k-NN, locality, or routing edges) and apply CCFA to obtain f(K) used as attention weights. When the attention graph is dense, the Ω(|E|) lower bound (Theorem~5) becomes prohibitive, so practicality requires sparsification or factorization; from our viewpoint this is not an artifact of the solver but a structural consequence of reading the input. We therefore regard CCFA as most appropriate for intrinsically sparse relational data, or for Transformers already using sparse attention mechanisms.
The method presupposes that ℱ(G, b, c) ≠ ∅. Infeasibility can arise from inconsistent supplies across connected components, from overly tight capacities, or from misspecified directions. We can detect this empirically via nonvanishing residuals after large K, but for robust deployment we recommend either (i) preprocessing to enforce ∑v ∈ Cbv = 0 on each weakly connected component C, or (ii) adopting a slack-penalized formulation as above. A second limitation is numerical: if e has large magnitude, then w = exp (e) can be poorly scaled, and multiplicative updates may underflow/overflow. Log-domain implementations mitigate this but introduce additional constant factors; moreover, capacity projections can create near-active constraints that yield small gradients through clipped edges. Third, finite-K truncation yields approximate feasibility; while Theorem~4 formalizes the resulting violations in terms of ε, in practice the map (e, b, c) ↦ f(K) can be sensitive on ill-conditioned graphs (e.g. long directed chains with tight bottlenecks), requiring damping or adaptive stopping. Finally, the global coupling is not universally beneficial: when b or c are noisy, or when the task signal is genuinely local, enforcing Af = b may suppress predictive evidence. We therefore view CCFA as a modeling choice that should be justified by domain structure, and we advocate reporting both predictive metrics and the diagnostic residuals as part of any evaluation.
For completeness and reproducibility, we collect in the appendices the formal arguments deferred from the main text, together with engineering details needed to implement CCFA efficiently on sparse graphs, and additional experimental results (datasets, ablations, and diagnostics) supporting the claims made in the discussion.
We provide self-contained proofs of Theorems~1–5 under the standing
assumptions stated in the global context (in particular we > 0 and
ℱ(G, b, c) ≠ ∅
when feasibility is required). For Theorem~1 (existence and uniqueness),
we write the objective as a strictly convex, lower semicontinuous
function on ℝ ≥ 0E and use
closedness and convexity of ℱ(G, b, c) to
obtain existence; strict convexity then yields uniqueness. We also
record the KKT conditions: there exist node potentials p ∈ ℝV and
edgewise multipliers αe, βe ≥ 0
for the constraints fe ≤ ce
and fe ≥ 0
such that
$$
\log\frac{f^\star_e}{w_e} + (A^\top p)_e + \alpha_e - \beta_e = 0,
\qquad
\alpha_e(f^\star_e-c_e)=0,\quad \beta_e f^\star_e=0,
$$
which clarifies when the incidence-form parametrization fe = weexp (−(A⊤p)e)
holds exactly (namely away from active box constraints).
For Theorem~2 (containment of softmax), we give the per-node
Lagrangian derivation on the local simplex constraints and explicitly
show separability over v ∈ V. For Theorem~3
(convergence), we state the alternating Bregman projection scheme as a
Dykstra iteration on the intersection of an affine set and a box, and we
cite the relevant convergence theorem for generalized KL
(I-)projections, then specialize to the incidence-structured affine
constraints. In the specialization, we also record conditions ensuring
well-posedness of the affine projection in the dual (e.g. handling
disconnected components by fixing one potential per component, or by
restricting to the subspace {p : ∑v ∈ Cpv = 0}
per component). For Theorem~4 (truncation guarantees), we spell out the
duality argument relating KL suboptimality to approximate stationarity
and hence to constraint residual; we include a lemma giving a
quantitative bound of the form
$$
\|Af^{(K)}-b\|_1 \le L_A
\sqrt{2\bigl(\mathrm{KL}(f^{(K)}\|w)-\mathrm{KL}(f^\star\|w)\bigr)},
$$
for an explicit constant LA depending on
boundedness of iterates (ensured, for instance, by finite capacities or
by a mild regularization). Finally, for Theorem~5 we provide the
adversarial construction in full detail, including the reduction that
forces any algorithm producing a nontrivial dependence on (w, c) to read an Ω(|E|) fraction of the
input in the worst case.
We describe the precise solver used in our experiments as an unrolled K-step map (e, b, c) ↦ f(K). The appendices specify (i) the parametrization of primal iterates via node potentials p (affine step) and correction variables (Dykstra step), (ii) damping choices that ensure numerical stability, and (iii) stopping criteria when we allow adaptive K. We also discuss differentiation through the solver. Since the updates are compositions of smooth operations (exponentials, sparse reductions) and simple projections (box clipping, or KL box projections with corrections), backpropagation through the unrolled iterations is straightforward; nevertheless, we record practical measures that control exploding/vanishing gradients, including log-domain evaluation of we = exp (ee) and gradient clipping on p. When ce = ∞, we treat the corresponding capacity projection as a no-op and omit its correction state. When infeasibility is suspected, we optionally replace the hard constraint Af = b by a slack-penalized variant and show how this modifies the dual update.
We give pseudocode and tensor shapes for a batched implementation. The key representation is a COO/CSR edge list together with segment indices enabling reductions over Nin(v) and Nout(v). The appendices enumerate the per-iteration kernels: (a) compute residuals rv = (Af)v − bv by two segmented sums, (b) update node potentials p by a simple rule (e.g. a dual gradient step or IPFP-style scaling), (c) form updated flows fe = weexp (pu − pv) edgewise, and (d) apply the capacity projection with Dykstra corrections. We report the memory cost of storing (f, w) and the auxiliary buffers (p, corr), and we record the exact asymptotic scaling O(K|E|) as realized by sparse primitives. We also list default hyperparameters (initialization f(0) = w, damping coefficient, maximum K, and ε-based early stopping) and provide guidance for choosing K to balance predictive performance against residuals ∥Af(K) − b∥ and ∥(f(K) − c)+∥.
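A minimal sketch of the batching convention described here (illustrative names): node indices are offset per graph so that a single concatenated edge list supports segmented reductions.

```python
import numpy as np

def batch_graphs(graphs):
    # graphs: list of dicts with keys 'src', 'tgt', 'num_nodes'
    src, tgt, offset = [], [], 0
    for g in graphs:
        src.append(np.asarray(g["src"]) + offset)
        tgt.append(np.asarray(g["tgt"]) + offset)
        offset += g["num_nodes"]
    return np.concatenate(src), np.concatenate(tgt), offset   # offset = total node count

g1 = {"src": [0, 1], "tgt": [1, 2], "num_nodes": 3}
g2 = {"src": [0], "tgt": [1], "num_nodes": 2}
src, tgt, n = batch_graphs([g1, g2])   # src = [0, 1, 3], tgt = [1, 2, 4], n = 5
```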
We document the datasets used in our evaluation, including
preprocessing and graph construction. For power-grid graphs, we specify
how buses/lines become nodes/edges, how directions are chosen (or
symmetrized into antiparallel directed edges), and how supplies b and capacities c are obtained or normalized. For
circuit-style graphs, we describe component-level features, edge types,
and any imposed conservation conventions. For the informational control
dataset, we specify the source/sink semantics and how we build sparse
attention graphs. We provide train/validation/test splits, metrics (task
loss and accuracy/MSE, as appropriate), and the solver diagnostics we
report alongside predictive metrics:
ResA := ∥Af(K) − b∥1, Resc := ∥(f(K) − c)+∥1, GapKL := KL(f(K)∥w) − KL(f(K − 1)∥w).
These diagnostics are used both to monitor convergence and to detect
infeasibility or numerical issues.
We include ablations isolating the contributions of (i) enforcing balance only (Af = b without capacities), (ii) enforcing capacities only (box projection without balance), and (iii) the full intersection. We vary K and report the tradeoff between task performance and residuals, and we study warm starts versus cold starts across layers. We also analyze the learned-b setting (when enabled) by plotting the distribution of bv and correlating it with residuals, and we evaluate the slack-penalized formulation as a robustness mechanism under misspecified b or c. Finally, we report runtime scaling with |E| and compare against local softmax attention to quantify the constant-factor overhead of constrained normalization under the same sparse access pattern.