In graph neural networks, attention weights are frequently invoked as a proxy for how information flows between nodes. This interpretation is compelling but, in its most common instantiation, mathematically incomplete: the usual normalization rules enforce only local constraints (e.g. a simplex constraint per node), whereas many domains of interest are governed by conservation laws and hard throughput limits. Our starting point is the observation that, once attention is read as an extensive quantity (a routed mass, current, or commodity), the mismatch between local softmax normalization and physically meaningful feasibility conditions becomes unavoidable.
To make the tension precise, consider outgoing-normalized attention, where each node distributes unit mass across its outgoing neighbors. If we denote by fvu the weight assigned to edge (v, u), the constraint ∑u ∈ Nout(v)fvu = 1 forces a node to emit a fixed amount irrespective of what it receives. Consequently, nodes with small incoming total weight must create mass to satisfy the outgoing constraint, while nodes with large incoming weight must destroy mass. From the perspective of conservation, this is the wrong invariance: Kirchhoff-type balance requires that, except at designated terminals, net inflow equals net outflow. The issue does not disappear if one instead normalizes incoming edges at each receiver (as in other attention variants); then each node absorbs a fixed amount regardless of what is supplied upstream. In either case, a purely local simplex normalization cannot encode a global supply–demand pattern, nor can it enforce that interior nodes are neither sources nor sinks.
Capacity constraints add a second, equally structural, obstacle. In a capacitated network, each edge admits only a bounded throughput. Standard attention mechanisms can be made to use low-capacity edges less often by learning scores, but they cannot prevent a learned normalization from placing arbitrarily large mass on a single edge when the scores dictate it. The resulting model may be statistically adequate in some prediction settings, yet it cannot be interpreted as a feasible routing, and it cannot support statements of the form ``this layer respects the network constraints by construction.’’ When the graph represents a power grid, a circuit, or a transport network, the distinction between preference and feasibility is not cosmetic: infeasible routings correspond to violations of conservation or overloads that are meaningful failure modes.
We therefore propose to treat attention as a constrained pseudo-flow and to compute it by a projection principle. Concretely, the neural scoring function produces edge log-weights e, hence positive reference weights w = exp (e). We then choose an edge mass vector f by minimizing a generalized Kullback–Leibler divergence to w subject to (i) conservation with respect to a supply vector b and (ii) edge capacities. This choice has three advantages that are simultaneously algorithmic and conceptual. First, it yields a pseudo-flow whenever the feasible set is nonempty, so the layer is well-defined. Second, it converts feasibility into a differentiable normalization subroutine (a KL projection) that can be implemented by sparse, GPU-friendly iterative scaling and box-projection steps. Third, it recovers classical attention as a limiting special case: if one removes global balance and replaces it by per-node simplex constraints, the KL projection reduces to a softmax computed independently at each node.
Our contribution is thus not merely to add constraints, but to identify the appropriate mathematical object that interpolates between attention and flow. A KL projection onto a flow polytope is an information projection (I-projection): among all feasible pseudo-flows, it selects the one that minimally perturbs the model’s learned edge preferences in an information-geometric sense. This is the appropriate analogue of softmax, which itself can be characterized as a KL projection onto a simplex. In particular, the projection viewpoint cleanly separates the roles of learning and feasibility: the GNN learns scores e without needing to ``rediscover’’ conservation in its parameters, while the projection enforces conservation and capacity compliance at the level of the produced pseudo-flow.
We provide guarantees at three levels of abstraction, corresponding to the three requirements one expects from a normalization layer: existence, computability, and quantitative control under truncation. At the level of well-posedness, strict convexity of the KL objective over nonnegative flows implies that, under feasibility and positivity of w, there exists a unique minimizer f⋆ satisfying the balance constraints exactly. At the level of computation, we instantiate the projection by alternating Bregman projections: a generalized Sinkhorn/IPFP step enforces balance in a dual-potential form, while a Dykstra-style correction enforces box constraints for capacities. This yields a sparse iterative method whose per-iteration cost is linear in |E|, matching the natural input-reading lower bound for any nontrivial dependence on all edges. At the level of approximation, we make explicit that practical implementations return f(K) after K iterations; we connect a KL suboptimality tolerance to explicit bounds on conservation residual and capacity violation, so that ``approximately normalized’’ has a precise meaning.
Finally, the projection layer is not an isolated optimization primitive: it is designed to be used as a drop-in replacement for attention weights in message passing. The resulting aggregation uses f(K) as edge coefficients, yielding node updates that can be interpreted as routing information through the graph in a manner consistent with supplies, demands, and (optionally) capacities. This furnishes an interpretability statement that is otherwise unavailable: to the extent that the solver is run to tolerance, the produced edge weights satisfy Kirchhoff conservation up to an explicit residual bound, and capacity violations are likewise controlled. The remaining parts of our development elaborate these claims, relate them to standard attention mechanisms, and evaluate empirically when enforcing feasibility improves prediction and when it imposes an unnecessary inductive bias.
We briefly recall the standard attention mechanism as it appears in
message passing and make explicit the sense in which it encodes only a
redistribution rule. Let G = (V, E) be
directed and let a scoring function produce logits evu ∈ ℝ
for each edge (v, u) ∈ E. A
common outgoing-normalized attention defines coefficients
$$
\alpha_{vu}
:=\frac{\exp(e_{vu})}{\sum_{u'\in N_{\mathrm{out}}(v)} \exp(e_{vu'})},
\qquad (v,u)\in E,
$$
so that ∑u ∈ Nout(v)αvu = 1
for every v with Nout(v) ≠ ∅. In
a typical layer, these coefficients modulate aggregation by
mu := ∑(v, u) ∈ Eαvu M(hv, edge_featvu), hu′ := U(hu, mu),
for some message map M and
update map U. From a
probabilistic viewpoint, α is
a row-stochastic transition rule: conditioning on being at node v, one moves to an outgoing neighbor
u with probability αvu.
This interpretation is coherent and often useful, but it should not be
confused with a conservation law: α prescribes only the split of whatever mass is
available at v, and the layer
as written does not specify how much mass resides at each node.
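For concreteness, the following is a minimal numpy sketch of this per-source (outgoing) softmax on an edge-list representation; the function and array names (outgoing_softmax, src, logits) are illustrative rather than taken from any particular library.

```python
import numpy as np

def outgoing_softmax(src, logits, num_nodes):
    # per-source softmax: for each node v, the weights on its outgoing edges sum to 1
    max_per_src = np.full(num_nodes, -np.inf)
    np.maximum.at(max_per_src, src, logits)      # per-source max for numerical stability
    z = np.exp(logits - max_per_src[src])
    denom = np.zeros(num_nodes)
    np.add.at(denom, src, z)                     # per-source normalizer
    return z / denom[src]

# toy graph with edges 0->1, 0->2, 1->2
src = np.array([0, 0, 1])
alpha = outgoing_softmax(src, np.array([0.5, 1.5, 0.0]), num_nodes=3)
print(alpha[:2].sum(), alpha[2])                 # both equal 1.0: one simplex per source node
```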
To see the missing degree of freedom, suppose we attempted to
interpret attention as an actual routed quantity. Introduce an (unknown)
nonnegative node mass vector x ∈ ℝ ≥ 0V,
where xv
is the amount available at v
to be sent along outgoing edges. If α determines only the split, then
the induced edge mass is
fvu := xvαvu.
Kirchhoff-type balance at a node v (with supply/demand bv) would
require
∑(u, v) ∈ Efuv − ∑(v, u) ∈ Efvu = bv, i.e. ∑u ∈ Nin(v)xuαuv − xv = bv.
Even with fixed α, this is a
system coupling all nodes through the graph. In particular, the common
choice implicit in many layers—that each node emits a fixed total
amount, effectively xv ≡ 1 (or xv determined
only by a local feature transform)—is incompatible with the above
balance unless b and α satisfy special identities. Thus,
the per-node simplex constraint on α is best understood as defining
relative routing preferences, not a feasible flow.
One may similarly consider incoming-normalized variants, in which coefficients βvu satisfy ∑v ∈ Nin(u)βvu = 1 for each u and aggregation is mu = ∑vβvuM(⋅). This normalization can be read as prescribing how each receiver u averages information from its predecessors. However, if we try to interpret β as edge mass directly, we again obtain a model that enforces a local constraint (a unit inflow) irrespective of what upstream nodes can supply. In the flow picture, inflow should be determined by upstream outflows and external demands, not fixed by a local averaging rule. The distinction is not semantic: conservation asserts that, away from terminals, nodes do not act as net sources or sinks, whereas local normalizations can force precisely such net creation or absorption.
It is helpful to state explicitly the conservation law we have in
mind. Given an edge mass vector f ∈ ℝ ≥ 0E,
Kirchhoff balance with supply vector b ∈ ℝV is the
linear constraint
(Af)v = ∑(u, v) ∈ Efuv − ∑(v, u) ∈ Efvu = bv ∀v ∈ V,
where ∑v ∈ Vbv = 0
ensures global consistency. Nodes with bv > 0 inject
mass, nodes with bv < 0 absorb
mass, and interior nodes satisfy (Af)v = 0.
Two features are immediate. First, the constraint couples incoming and
outgoing edges at each node, hence cannot be enforced by normalizing
only outgoing edges (or only incoming edges) independently. Second, the
constraint is inherently global because the feasibility of balancing at
one node depends on the mass routed through others. Consequently, a
conservation-respecting ``attention’’ must be computed by a procedure
that has access to the entire incidence structure, at least through
local aggregations repeated across the graph.
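As a small illustration of this convention, the sketch below (illustrative names) builds the dense incidence matrix of a toy graph and evaluates the node imbalances Af for a candidate edge mass vector; a practical implementation would use sparse segment reductions instead.

```python
import numpy as np

def incidence_matrix(src, tgt, num_nodes):
    # column e has +1 at its target and -1 at its source, so (Af)_v = inflow(v) - outflow(v)
    A = np.zeros((num_nodes, len(src)))
    A[tgt, np.arange(len(src))] += 1.0
    A[src, np.arange(len(src))] -= 1.0
    return A

src = np.array([0, 0, 1]); tgt = np.array([1, 2, 2])   # edges 0->1, 0->2, 1->2
A = incidence_matrix(src, tgt, num_nodes=3)
f = np.array([0.3, 0.7, 0.3])                          # candidate edge masses
print(A @ f)                                           # [-1., 0., 1.]; feasibility requires A @ f == b
```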
A related point concerns the scale of the edge masses. Standard attention is invariant under adding a constant to all logits leaving a node, since αvu depends only on relative differences. In contrast, the absolute magnitude of a flow is meaningful: multiplying all edge masses by a constant changes the supplies and demands that can be satisfied, and it changes whether capacities are violated. This mismatch clarifies why learned scores alone cannot encode feasibility. Even if the network learns to favor certain edges, without an additional normalization principle it has no mechanism to select the unique global scaling and coupling needed to satisfy Af = b.
Capacities introduce an additional constraint that is likewise nonlocal in effect. In many domains, each edge (u, v) admits a maximum throughput cuv ∈ (0, ∞], requiring 0 ≤ fuv ≤ cuv. Local softmax-based attention does not interact with such bounds in any direct way: because αuv ∈ (0, 1) but is later multiplied implicitly by an uncontrolled node-dependent scale (e.g. the magnitude of messages, or the number of attention heads, or downstream linear transforms), the resulting effective mass on an edge can exceed any prescribed cuv. One may attempt to address this by saturating or clipping edge weights, but doing so after a local normalization generally destroys whatever balance properties one might have hoped to preserve. In other words, conservation and capacities should be treated as constraints on a single object—an absolute edge mass vector—rather than as separate heuristics applied at different stages.
These observations suggest a clean separation: logits e should be interpreted as a learned score expressing edge preferences, while feasibility (conservation and capacity compliance) should be imposed by a global normalization that maps preferences to a pseudo-flow. The mathematical question then becomes: given positive base weights w = exp (e) on edges, how do we choose a nonnegative edge mass f that is as close as possible to w while satisfying Kirchhoff balance (and, when present, capacities)? In the next section we make this question precise by specifying the feasible polytope induced by (G, b, c) and a projection objective that selects a unique pseudo-flow compatible with the graph constraints.
Fix a directed graph G = (V, E). We
regard the output of the scoring network as a vector of edge logits
e ∈ ℝE and
define strictly positive w ∈ ℝ > 0E
by
we := exp (ee), e ∈ E.
The object that will replace attention coefficients is a nonnegative
edge mass (pseudo-flow) f ∈ ℝ ≥ 0E.
Its net imbalance at each node is encoded by the node–edge incidence
operator A ∈ ℝV × E,
(Af)v := ∑(u, v) ∈ Efuv − ∑(v, u) ∈ Efvu, v ∈ V.
We also fix a supply/demand vector b ∈ ℝV
satisfying ∑v ∈ Vbv = 0,
where bv > 0
injects mass and bv < 0
absorbs mass. Finally, each edge e may be equipped with an upper
capacity ce ∈ (0, ∞],
where ce = ∞ indicates
no upper bound.
The constraints we seek to enforce are (i) Kirchhoff balance with
supplies and (ii) capacity compliance. We therefore define the feasible
set
ℱ(G, b, c) := {f ∈ ℝ ≥ 0E : Af = b, 0 ≤ f ≤ c},
where inequalities are interpreted componentwise on edges.
The global consistency condition ∑vbv = 0
is necessary for feasibility, but not sufficient: the capacities and
graph topology can render ℱ(G, b, c) = ∅. In
our development we separate exact feasibility from relaxation: the CCFA layer is defined as a
projection onto ℱ(G, b, c)
whenever the polytope is nonempty, and we discuss below standard
relaxations when it is not.
Two limiting cases are instructive. If b ≡ 0, then Af = 0 enforces conservation at every node, meaning that interior nodes cannot create or destroy mass. If, further, c ≡ ∞, then the only constraints are f ≥ 0 and Af = 0, yielding a circulation polytope. Conversely, if one retains capacities but chooses b supported only on designated sources and sinks, then ℱ(G, b, c) describes capacitated single-commodity flow routing with a prescribed net injection/absorption profile.
Given the base weights w
and constraints ℱ(G, b, c), we
select the pseudo-flow as the unique minimizer of a generalized KL
divergence,
$$
f^\star
\;\in\;
\arg\min_{f\in\mathcal{F}(G,b,c)}
\mathrm{KL}(f\|w),
\qquad
\mathrm{KL}(f\|w)
:=
\sum_{e\in E}
\Bigl(
f_e\log\frac{f_e}{w_e}-f_e+w_e
\Bigr).
$$
We emphasize the role of w as
a base (reference) measure: it encodes the network’s edge preferences in an absolute, positive
scale, while the projection enforces feasibility with minimal distortion
in the KL geometry. This choice yields three properties that are central
for a GNN layer. First, the objective is strictly convex on ℝ ≥ 0E when
w > 0, hence (under
feasibility) the solution is unique. Second, the optimizer depends
smoothly on w in regimes of
interest, enabling stable backpropagation through approximate solvers.
Third, the KL form is compatible with multiplicative reweighting and
sparse iterative Bregman projection algorithms, which we exploit
computationally in the next section.
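For reference, a direct numpy transcription of this objective might look as follows; the helper name generalized_kl and the convention 0 log 0 = 0 are illustrative choices.

```python
import numpy as np

def generalized_kl(f, w):
    # sum over edges of f_e*log(f_e/w_e) - f_e + w_e, with the convention 0*log 0 = 0
    f = np.asarray(f, float); w = np.asarray(w, float)
    terms = np.where(f > 0, f * np.log(np.where(f > 0, f, 1.0) / w), 0.0) - f + w
    return float(terms.sum())

w = np.exp(np.array([0.2, -0.1, 1.0]))       # base weights from edge logits
print(generalized_kl(w, w))                  # 0.0: the divergence vanishes at f = w
print(generalized_kl([0.0, 0.5, 2.0], w))    # > 0 for any other nonnegative f
```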
In some applications b and
c are physical and chosen so
that ℱ(G, b, c) ≠ ∅ by
construction. In others, one wishes to impose ``soft’’ conservation or
capacities as an inductive bias even when strict feasibility is not
guaranteed. A standard remedy is to enlarge the feasible set by
introducing slack variables. For example, one may permit node imbalances
s ∈ ℝV and
solve
minf ≥ 0, 0 ≤ f ≤ c, s ∈ ℝVKL(f∥w) + λ∥s∥1 s.t. Af = b + s,
or, equivalently, augment the graph with a super-source/super-sink
connected by high-capacity edges whose usage is penalized. Such
relaxations preserve the interpretation that the layer attempts to route
mass according to w, but now
pays an explicit price for violating conservation or exceeding
prescribed net injections. In the remainder we focus on the
exact-feasibility setting to state clean theorems; the relaxed variants
follow by standard convex-analytic modifications.
The vector b can be
provided in multiple ways, depending on supervision and domain
knowledge.
(i) If sources and sinks are known, one may set bv = 0 for
interior nodes and distribute a total mass m > 0 across sources and sinks,
e.g. ∑v : bv > 0bv = m
and ∑v : bv < 0bv = −m.
(ii) If only some terminals are known, one may fix b on those nodes and learn the
remainder under the constraint ∑vbv = 0.
(iii) One may parameterize b
as an output of a neural map g
on node embeddings, followed by centering to enforce global
consistency:
$$
\tilde b_v := g(h_v),
\qquad
b_v := \tilde b_v - \frac{1}{|V|}\sum_{u\in V}\tilde b_u,
$$
optionally combined with a scale constraint (e.g. ∥b∥1 = m) to
avoid the trivial near-zero regime. This yields an end-to-end trainable
layer in which the network chooses both where mass is injected/absorbed and how it
is routed, with the routing still respecting the incidence
constraints.
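A minimal sketch of option (iii), assuming a nodewise head has already produced raw values g(h_v); the helper center_supplies and the optional total_mass argument are illustrative.

```python
import numpy as np

def center_supplies(g_out, total_mass=None):
    b = g_out - g_out.mean()                     # enforce sum_v b_v = 0
    if total_mass is not None and np.abs(b).sum() > 0:
        b = b * (total_mass / np.abs(b).sum())   # optional scale constraint ||b||_1 = m
    return b

b = center_supplies(np.array([1.0, 0.2, -0.5, 0.1]), total_mass=2.0)
print(b.sum(), np.abs(b).sum())                  # ~0.0 and 2.0
```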
When the input graph is undirected, we embed it into the directed
setting by replacing each undirected edge {u, v} with two directed
edges (u, v) and
(v, u), yielding a
bidirected graph. Capacities may be assigned per direction (two
parameters), or tied to represent a single physical limit. If a single
undirected capacity c{u, v}
is desired, one convenient convex encoding is
0 ≤ fuv, fvu, fuv + fvu ≤ c{u, v},
which can be handled by extending the box constraints with one
additional linear inequality per undirected edge. For simplicity of
exposition we keep the per-directed-edge box constraint 0 ≤ fe ≤ ce,
which already captures many settings (e.g. asymmetric conductances,
one-way valves, directed communication links), and we note that the
projection framework extends to these coupled-capacity variants with the
same Bregman projection toolkit.
The CCFA normalization is the mapping
CCFA: (G, b, c, w) ↦ f⋆, f⋆ := arg minf ∈ ℱ(G, b, c)KL(f∥w),
and, in practice, its truncated approximation f(K) produced by
a fixed number of sparse Bregman projection iterations. The subsequent
section instantiates this mapping as a differentiable GNN layer: edge
scores generate w, the
projection produces f(K), and f(K) is then
used as the edge weight in message passing and readout.
We consider a single GNN layer that receives node embeddings h ∈ ℝV × d
(and optionally edge features xe) and returns
updated embeddings h′ ∈ ℝV × d.
The CCFA layer intervenes precisely at the point where one would
ordinarily compute attention coefficients: instead of producing locally
normalized weights (e.g. per-node softmax), we produce a global
pseudo-flow f ∈ ℝ ≥ 0E
by KL projection onto ℱ(G, b, c).
Concretely, the layer decomposes into three maps
(h, x) ↦ e ↦ w = exp (e) ↦ f(K) ≈ f⋆,
followed by weighted aggregation using f(K) as edge
weights. The supply vector b
and capacities c may be fixed
from domain knowledge or predicted by auxiliary heads; once chosen, they
define the constraint geometry that the learned scores must respect.
Given embeddings hu, hv ∈ ℝd
and edge features xuv, we
define an edge logit euv ∈ ℝ
by a differentiable scoring function. Typical choices include an MLP on
concatenated features or a bilinear form,
euv = ϕ ([hu, hv, xuv]), or euv = hu⊤Whv + a⊤xuv,
where ϕ is a shared network
and W, a are
learnable parameters. Multi-head variants are obtained by producing
H logits e(1), …, e(H)
and projecting each head separately, yielding flows f(K, 1), …, f(K, H)
that parameterize multiple constrained routing patterns. We then set
wuv = exp (euv) > 0
as the base measure for the projection; the exponential map is not
merely conventional, but matches the multiplicative structure of KL
geometry and ensures positivity without ad hoc clipping.
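The bilinear variant can be sketched as follows (illustrative shapes and names; any differentiable scorer producing one logit per edge plays the same role).

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_e = 8, 4                                  # node / edge feature dimensions (illustrative)
W = rng.normal(size=(d, d))                    # learnable bilinear weight
a = rng.normal(size=d_e)                       # learnable edge-feature weight

def edge_logits(h, x_edge, src, tgt):
    # e_uv = h_u^T W h_v + a^T x_uv, evaluated for all edges at once
    return np.einsum("ed,df,ef->e", h[src], W, h[tgt]) + x_edge @ a

h = rng.normal(size=(3, d))
x_edge = rng.normal(size=(3, d_e))
src = np.array([0, 0, 1]); tgt = np.array([1, 2, 2])
w = np.exp(edge_logits(h, x_edge, src, tgt))   # strictly positive base weights
```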
Given w, the CCFA normalization returns f(K) approximating the unique I-projection f⋆ onto ℱ(G, b, c). In contrast to local normalizations, the defining constraints couple edges across the entire graph via the incidence operator. In particular, conservation Af = b encodes that every node can only emit or absorb mass according to its prescribed supply, and capacities 0 ≤ f ≤ c enforce edgewise feasibility. The effect is that increasing the score of one edge can indirectly decrease mass on distant edges in order to maintain global balance. This coupling is the mechanism by which CCFA expresses inductive biases such as conservation laws, flow bottlenecks, and multi-path routing. When b ≡ 0, the pseudo-flow is a circulation up to capacities; when b is supported on a small terminal set, the pseudo-flow concentrates on paths that connect sources to sinks while respecting congestion limits.
Once f(K) is
computed, we use it as the edge weight in aggregation. A generic form
is
mv := ∑(u, v) ∈ Efuv(K) M (hu, xuv), hv′ := U (hv, mv),
with learnable maps M
(message) and U (update),
e.g. linear layers with nonlinearities and residual connections. We
emphasize that f(K) is mass
rather than a convex combination, so the scale of mv is controlled
by the scale of the feasible set (through b and c) rather than by a simplex
constraint. This is desirable in physical settings where the total
throughput is meaningful; when a fixed scale is preferred (for numerical
conditioning or architectural parity), we may normalize supplies so that
∥b∥1 = m
is constant across graphs, or include a learned global scalar α and use αf(K)
inside mv.
In either case, conservation provides an interpretable guarantee: for
nodes with bv = 0, the
incoming and outgoing mass of f(K) nearly
match, with the deviation controlled by the solver tolerance.
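A minimal sketch of this flow-weighted aggregation with the identity message map M(h_u) = h_u (illustrative names; a learned message/update pair would replace the identity).

```python
import numpy as np

def aggregate(h, f, src, tgt, num_nodes):
    msgs = f[:, None] * h[src]          # messages M(h_u) = h_u, scaled by edge mass f_uv
    m = np.zeros((num_nodes, h.shape[1]))
    np.add.at(m, tgt, msgs)             # m_v = sum over incoming edges (u, v)
    return m                            # an update map U(h_v, m_v) would follow

h = np.eye(3)
f = np.array([0.3, 0.7, 0.3])           # edge masses on 0->1, 0->2, 1->2
src = np.array([0, 0, 1]); tgt = np.array([1, 2, 2])
print(aggregate(h, f, src, tgt, num_nodes=3))
```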
For graph-level prediction, we apply any standard readout to the final node embeddings, e.g. sum/mean pooling, attention pooling, or Set2Set. Additionally, CCFA provides a structured object f(K) that can be exploited directly. For instance, we may define edge-aggregated features such as total utilized capacity ∑efe(K)/ce (with the convention 1/∞ = 0), cut flows across a designated partition, or per-node throughputs ∑(u, v)fuv(K). These quantities can be concatenated to the pooled embedding to form a hybrid readout that combines learned representations with physically interpretable statistics. When supervision includes flow-related targets, one may also add auxiliary losses on Af(K) or on specific edge subsets, thereby aligning the learned attention with domain constraints without requiring exact flow labels.
CCFA recovers several familiar mechanisms as limiting or specialized cases. If we drop all constraints and simply set f = w, then the layer reduces to unnormalized edge gating. If we retain only local outgoing simplex constraints (and no other couplings), then the KL projection decouples by node and yields the standard outgoing softmax attention; this is precisely the sense in which softmax attention is a KL projection, but onto a local polytope rather than ℱ(G, b, c). If we set c ≡ ∞ but enforce Af = b, we obtain a globally balanced reweighting of w that behaves like a circulation-based attention mechanism: the layer can only route mass along cycles unless sources/sinks are introduced via b. Many ``flow-based’’ GNN constructions may be interpreted as choosing a constraint family and a divergence; CCFA fixes the divergence to generalized KL (for strict convexity and multiplicative structure) and allows the practitioner to interpolate between unconstrained attention, locally normalized attention, and globally constrained flow attention by selecting (b, c) and, when desired, introducing slack relaxations. The next section addresses how we compute f(K) efficiently on sparse graphs.
We first treat the case without finite capacities, i.e. c ≡ ∞, where the feasible set
reduces to the affine slice {f ≥ 0 : Af = b}.
The KL I-projection onto an incidence-defined affine constraint admits
the familiar exponential-family form: there exists a potential (dual)
vector p ∈ ℝV, unique
up to an additive constant on each weakly connected component, such that
the balanced flow induced by p
is
$$
f_{uv}(p) := w_{uv}\,\exp\bigl((A^{\top}p)_{uv}\bigr) = w_{uv}\,\exp(p_v - p_u),
\qquad (u,v)\in E.
$$
Indeed, if we form the Lagrangian for minf ≥ 0KL(f∥w)
subject to Af = b, the
first-order optimality conditions yield log (fe/we) = (A⊤p)e,
and with our incidence convention (A⊤p)uv = pv − pu
we recover the exponential form above. Thus, enforcing balance reduces to finding p such that Af(p) = b.
We accomplish this by iterative proportional fitting in the dual,
implemented as repeated sparse aggregations of incoming/outgoing mass
followed by a potential update.
Concretely, define the node residual
r(p) := Af(p) − b ∈ ℝV, rv(p) = ∑(u, v) ∈ Efuv(p) − ∑(v, u) ∈ Efvu(p) − bv.
We view p ↦ r(p) as a
nonlinear map with a sparse Jacobian structure inherited from G. In practice we employ damped
first- or second-order updates. A simple and effective choice is
gradient ascent on the concave dual objective
maxp ∈ ℝV ⟨b, p⟩ − ∑(u, v) ∈ Ewuvexp (pv − pu),
whose gradient is b − Af(p) = −r(p);
hence one may update
$$
p \;\leftarrow\; p + \eta\,\bigl(b - Af(p)\bigr) \;=\; p - \eta\, r(p),
$$
with a stepsize η > 0
(possibly scheduled or learned). To reduce stepsize sensitivity, we may
also apply a diagonal Newton-like preconditioner using the local
curvature
Hvv(p) ≈ ∑(v, u) ∈ Efvu(p) + ∑(u, v) ∈ Efuv(p),
and update pv ← pv − η rv/Hvv
for nodes with Hvv > 0.
We remove the gauge freedom by fixing pv0 = 0
for a designated node v0 per component (or by
subtracting the mean of p
after each iteration), which is numerically important to prevent
drift.
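The following numpy sketch implements this damped, diagonally preconditioned dual update for the uncapacitated case, assuming the parametrization f_uv(p) = w_uv exp(p_v − p_u) displayed above; the function name balance_project, the iteration count, and the step size are illustrative defaults rather than the exact solver settings used later.

```python
import numpy as np

def balance_project(w, src, tgt, b, num_nodes, num_iters=200, step=0.5):
    # approximate KL projection onto {f >= 0 : Af = b} via node potentials p,
    # with f_uv(p) = w_uv * exp(p_v - p_u) and the damped update p <- p - eta * r / H
    p = np.zeros(num_nodes)
    for _ in range(num_iters):
        f = w * np.exp(p[tgt] - p[src])              # flows induced by the potentials
        Af = np.zeros(num_nodes)
        np.add.at(Af, tgt, f)
        np.add.at(Af, src, -f)
        r = Af - b                                   # conservation residual
        H = np.zeros(num_nodes)
        np.add.at(H, tgt, f)
        np.add.at(H, src, f)                         # diagonal curvature (total incident mass)
        p -= step * r / np.maximum(H, 1e-12)         # damped diagonal-Newton step
        p -= p.mean()                                # remove the additive gauge freedom
    return w * np.exp(p[tgt] - p[src])

# toy instance: edges 0->1, 0->2, 1->2 with unit base weights
src = np.array([0, 0, 1]); tgt = np.array([1, 2, 2])
f = balance_project(np.ones(3), src, tgt, b=np.array([-1.0, 0.0, 1.0]), num_nodes=3)
```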
We now incorporate finite capacities 0 ≤ f ≤ c. The feasible
region is the intersection of two closed convex sets in $\mathbb{R}^E_{\ge 0}$: the affine balance
set 𝒜 := {f ≥ 0 : Af = b}
and the box set ℬ := {f : 0 ≤ f ≤ c}.
We compute the I-projection onto 𝒜 ∩ ℬ
by alternating Bregman projections with Dykstra corrections. At a high
level, we iterate
$$
f \xrightarrow{\ \Pi_{\mathcal{A}}^{\mathrm{KL}}\ } \tilde f
\xrightarrow{\ \Pi_{\mathcal{B}}^{\mathrm{KL}}\ } f^+,
$$
but because ΠℬKL and Π𝒜KL do not
commute, Dykstra’s method maintains edgewise correction variables to
ensure convergence to the projection onto the intersection rather than
to a limit cycle.
The KL projection onto the box set is edge-separable. Given a
positive vector g ∈ ℝ > 0E,
the minimizer of KL(f∥g) subject to 0 ≤ f ≤ c is simply
$$
\bigl(\Pi_{\mathcal{B}}^{\mathrm{KL}}(g)\bigr)_e \;=\; \min\{g_e,\, c_e\},
\qquad e\in E,
$$
since the unconstrained minimizer is f = g and the objective is
strictly convex with monotone derivative in each coordinate. The
nontrivial part is accounting for the interaction between the box projection and the
balance projection. A convenient multiplicative form of KL-Dykstra
maintains a correction factor q ∈ ℝ > 0E
(equivalently, an additive log-correction) so that the current iterate
is represented as a reweighted base measure. One practical variant
is:
$$
\text{(i) } \hat w_e \leftarrow w_e\, q_e,\qquad
\text{(ii) } \tilde f \leftarrow \Pi_{\mathcal{A}}^{\mathrm{KL}}(\hat w)\ \ \text{(via potentials, as in the incidence exponential form above)},
$$
$$
\text{(iii) } f \leftarrow \Pi_{\mathcal{B}}^{\mathrm{KL}}(\tilde f)\ \
\text{(via the clipping rule above)},
\qquad
\text{(iv) } q_e \leftarrow q_e\, \frac{\tilde f_e}{f_e}.
$$
Step (iv) is the Dykstra correction: it ensures that mass removed by the
box projection is “remembered” when we subsequently rebalance. When
ce = ∞ we
may omit that edge from the capacity step by treating the clip as
identity, and we keep qe ≡ 1.
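Steps (i)–(iv) can be sketched as the following loop, reusing the balance_project routine from the earlier sketch for step (ii); this is an illustrative outline of the cycle rather than a production Dykstra solver.

```python
import numpy as np

def ccfa_project(w, src, tgt, b, c, num_nodes, num_cycles=20):
    # alternating KL projections with a multiplicative Dykstra correction q
    q = np.ones_like(w)
    f = np.minimum(w, c)                                          # initial iterate
    for _ in range(num_cycles):
        w_hat = w * q                                             # (i) corrected base weights
        f_tilde = balance_project(w_hat, src, tgt, b, num_nodes)  # (ii) enforce Af = b
        f = np.minimum(f_tilde, c)                                # (iii) clip to the box [0, c]
        q = q * f_tilde / np.maximum(f, 1e-300)                   # (iv) remember clipped mass
    return f
```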
The core computational kernels are sparse edgewise transforms and
per-node reductions. We store each graph (or a batch of graphs) in
adjacency-list form with contiguous edge arrays and index vectors for
sources and targets. Given node potentials p, we evaluate for all edges in
parallel:
log fe = log we + psrc(e) − ptgt(e), fe = exp (log fe),
followed by segment-reductions to compute incoming and outgoing totals
for each node, hence the residual r = Af − b.
This is O(|E|) work
per iteration with favorable memory access (two gathers for potentials,
one fused multiply-add/exponentiation, two scatters for in/out sums).
For multiple graphs, we concatenate their edge lists and offset node
indices; all reductions become segmented by graph, and we optionally fix
one gauge node per graph to remove additive invariances.
Because the flow parametrization is multiplicative, we control overflow/underflow by operating in the log domain as long as possible. We keep log w = e and represent Dykstra corrections as log q; then log ŵ = e + log q and the edge log-flow is computed as e + log q + pu − pv. We may clamp log fe to a finite interval [−L, L] before exponentiation (with L chosen according to the floating-point format) to prevent NaNs in early training when scores are poorly scaled. Gauge normalization of p (e.g. subtracting the per-graph mean) prevents unbounded drift that would otherwise be invisible to the constraints but harmful numerically. Finally, we damp potential updates and, if needed, clip large residuals rv to stabilize training under backpropagation through K iterations.
We typically run a fixed number of iterations K for predictable runtime, but we monitor feasibility diagnostics ∥Af − b∥ and capacity violation ∥(f − c)+∥ during development to select K and damping. The entire procedure is differentiable by unrolling; all operations are compositions of smooth maps (except for the clip in the capacity (box) projection, which is piecewise smooth and works well with subgradients). When memory is constrained, we checkpoint only the final potentials and recompute intermediate iterates, or we employ implicit differentiation through the fixed-point conditions when higher accuracy is required. These algorithmic choices produce a sparse, GPU-friendly normalization subroutine whose per-iteration cost is linear in |E| and whose numerical behavior is consistent with the KL geometry of the projection.
We study the map from positive edge weights w = exp (e) to a capacity-constrained pseudo-flow f⋆ defined as the unique KL (I-)projection onto ℱ(G, b, c). Throughout, we assume we > 0 for all e ∈ E and that the feasible polytope ℱ(G, b, c) is nonempty.
Consider the convex program
$$
\min_{f\in\mathcal{F}(G,b,c)} \;\mathrm{KL}(f\|w).
$$
The objective KL(⋅∥w) is
proper, lower semicontinuous, and strictly convex on ℝ ≥ 0E when
w > 0 (strictness follows
from strict convexity of x ↦ xlog x on
(0, ∞) together with separability). The
constraint set ℱ(G, b, c) is a
closed convex polytope. Hence the program admits a unique minimizer f⋆ ∈ ℱ(G, b, c).
In particular, Kirchhoff balance holds at optimality: Af⋆ = b,
so for every node v with bv = 0 we have
∑ufuv⋆ = ∑ufvu⋆.
We record the associated optimality system. Introducing Lagrange
multipliers p ∈ ℝV for Af = b and
multipliers α, β ∈ ℝ ≥ 0E
for f ≤ c and f ≥ 0 respectively, the KKT conditions
for this program imply that, for each edge e = (u, v),
$$
\log\frac{f^\star_{uv}}{w_{uv}} + (p_v - p_u) + \alpha_{uv} - \beta_{uv} = 0,
\qquad
\alpha_{uv}\bigl(f^\star_{uv}-c_{uv}\bigr)=0,
\quad
\beta_{uv}\, f^\star_{uv}=0.
$$
When capacities are inactive on an edge (i.e. 0 < fuv⋆ < cuv),
we have αuv = βuv = 0
and recover the pure incidence exponential form fuv⋆ = wuvexp (pu − pv).
The multipliers (α, β) thus play the role
of edgewise corrections that ``turn on’’ precisely at saturated or
extinguished edges, matching the geometry exploited by KL-Dykstra
updates.
The KL projection view strictly generalizes standard attention
normalizations. To make this precise, fix b ≡ 0 and replace global balance
Af = 0 by
constraints
$$
\sum_{u\in N_{\mathrm{out}}(v)} f_{vu} = 1
\qquad\text{for every } v \text{ with } N_{\mathrm{out}}(v)\neq\emptyset,
$$
with no coupling across nodes and no upper bounds. The minimization of
KL(f∥w) subject to these per-node constraints
separates across v. For each
node v, introducing a scalar
multiplier λv for the
equality constraint yields the stationarity condition
$$
\log\frac{f_{vu}}{w_{vu}} = \lambda_v\qquad\Longrightarrow\qquad
f_{vu}=w_{vu}\exp(\lambda_v).
$$
Enforcing ∑ufvu = 1
gives exp (λv) = (∑uwvu)−1
and hence
$$
f_{vu}
= \frac{w_{vu}}{\sum_{u'\in N_{\mathrm{out}}(v)} w_{vu'}}
= \frac{\exp(e_{vu})}{\sum_{u'\in N_{\mathrm{out}}(v)} \exp(e_{vu'})}
= \alpha_{vu}.
$$
Thus standard outgoing-softmax attention is exactly the KL I-projection
onto per-node simplices, whereas CCFA performs the I-projection onto the
globally coupled polytope ℱ(G, b, c).
When c ≡ ∞, the method reduces to enforcing the affine constraints in the dual incidence family, i.e. the generalized Sinkhorn/IPFP scaling for a log-linear model on edges. In this setting, convergence follows from strict convexity of the dual and smoothness of the map p ↦ Af(p); with capacities, Dykstra corrections restore convergence for the intersection.
In learning and inference we run a fixed number K of iterations, producing an
approximate solution f(K). We
therefore require quantitative statements connecting iteration count (or
objective error) to constraint violation. A convenient starting point is
that KL is strongly convex on bounded boxes: if f and g lie in [ℓ, u]E
for some 0 < ℓ ≤ u < ∞,
then
$$
\mathrm{KL}(f\|w)\;\ge\;\mathrm{KL}(g\|w)
+\bigl\langle \nabla\,\mathrm{KL}(g\|w),\, f-g\bigr\rangle
+\frac{1}{2u}\,\|f-g\|_2^2 .
$$
Assuming the iterates are kept in such a bounded region (e.g. by finite
capacities and/or log-clamping in implementation), an objective gap
bound KL(f(K)∥w) − KL(f⋆∥w) ≤ ε
implies $\|f^{(K)}-f^\star\|_2\le
\sqrt{2u\varepsilon}$. Since Af⋆ = b,
we obtain a conservation bound
$$
\|Af^{(K)}-b\|_2
=\|A\,(f^{(K)}-f^\star)\|_2
\;\le\;\|A\|_{2\to 2}\sqrt{2u\varepsilon},
$$
and similarly in other norms using the corresponding operator bounds for
A. For capacities, if the
algorithm includes an explicit projection onto ℬ at each cycle, then 0 ≤ f(K) ≤ c
holds exactly; otherwise, one may bound ∥(f(K) − c)+∥
by comparing to the clipped vector and controlling the resulting KL
increase. These inequalities formalize the sense in which Kirchhoff’s
law and capacity compliance hold approximately at finite K, with an explicit ε-dependent violation. In practice,
we use these bounds to select K and damping so that the residual
is commensurate with downstream predictive noise, ensuring that CCFA’s
structural bias is present without over-solving the inner
projection.
We summarize the computational cost of a CCFA layer and relate it to unavoidable lower bounds, both for reading the sparse graph input and for attaining a prescribed accuracy ε in the KL projection.
A single forward layer decomposes into (i) edge scoring, (ii)
constrained normalization (approximate KL projection), and (iii)
weighted message passing. For typical GNN scorers (bilinear forms or
small MLPs) the edge scoring step evaluates euv = score(hu, hv, edge_featuv)
for each (u, v) ∈ E, which
costs
Tscore = O(|E|⋅d)
up to constants depending on the number of heads and the scorer
architecture. The message passing step computes mv = ∑(u, v) ∈ Efuv M(hu, edge_featuv)
and then applies an update map U; this similarly costs
Tmp = O(|E|⋅d)
for standard linear or MLP messages.
The normalization step is linear in |E| per iteration. Indeed, in the
dual-incidence parametrization fuv = wuvexp (pu − pv)
(with additional edgewise corrections when capacities are active), each
update of the node potentials p requires computing nodewise
aggregates over incident edges. Concretely, evaluating the balance
residual r = Af − b
requires one pass over edges to accumulate incoming and outgoing sums,
hence O(|E|) time,
plus O(|V|) to
combine with b. Updating p (by IPFP-like scaling, damped
Newton, or a gradient step in the smooth dual when c ≡ ∞) is then O(|V|) on top of these
aggregates. If we include a Dykstra-style capacity substep, we
additionally perform edge-local box projections and correction updates,
again a single pass over edges. Consequently,
Tnorm(K) = O(K ⋅ |E|) and Tlayer = O(|E|⋅d + K ⋅ |E|).
In practice the O(|V|) terms are dominated
by |E| on sparse graphs with
moderate average degree. Moreover, the operations are streaming over
adjacency lists and thus amenable to fused GPU kernels: the dominant
primitives are segment-reductions (per node) and pointwise
exponentiation/multiplication (per edge). Numerical stabilization
(e.g. log-domain accumulation, clamping of p, or damping) increases constant
factors but does not change the O(K|E|)
scaling.
The solver maintains the edge weights w, the current pseudo-flow f, and auxiliary vectors (dual
potentials p ∈ ℝV and, with
capacities, correction variables in ℝE). Hence the memory
footprint is
Slayer = O(|E|+|V|) (solver
buffers) and O(|V|⋅d + |E|⋅de) (features),
again up to constants for multiple heads and optimizer state.
The iteration count K required to reach a target tolerance ε is governed by the convergence results for alternating KL projections (Theorem~3) and by the translation from objective error to feasibility residuals (Theorem~4). While the precise complexity depends on the chosen update rule for p and on the conditioning of the instance (in particular on the dynamic range of w and on the graph structure through A), the general form is that K is polynomial in 1/ε and logarithmic in the relevant scale parameters (e.g. log (maxewe/minewe)), in line with known entropic scaling and Sinkhorn-type analyses.
For downstream use, it is often more meaningful to select K by specifying a conservation
tolerance η rather than an
objective tolerance. Under boundedness of iterates (e.g. by finite
capacities or implementation clamping), the strong convexity estimate
and the residual inequality yield the implication
$$
\mathrm{KL}(f^{(K)}\|w)-\mathrm{KL}(f^\star\|w)\le \varepsilon
\quad\Longrightarrow\quad
\|Af^{(K)}-b\|_2 \le \|A\|_{2\to 2}\sqrt{2u\varepsilon}.
$$
Thus achieving ∥Af(K) − b∥2 ≤ η
is ensured by taking ε ≤ η2/(2u∥A∥2 → 22)
and then running K = K(ε)
iterations as prescribed by Theorem~3. If we enforce 0 ≤ f(t) ≤ c
at each cycle (by an exact KL projection onto the box constraints in the
Dykstra scheme), capacity compliance holds identically for all iterates,
and the only residual of interest is conservation.
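As a small worked example of this translation (the helper required_kl_gap and the numeric values are illustrative), one can pick ε from a target conservation tolerance η together with crude bounds on u and ∥A∥2→2:

```python
import numpy as np

def required_kl_gap(eta, u, A_opnorm):
    # KL suboptimality gap that guarantees ||A f^(K) - b||_2 <= eta under the bound above
    return eta ** 2 / (2.0 * u * A_opnorm ** 2)

# crude bound for an incidence matrix: ||A||_2^2 <= ||A||_1 * ||A||_inf
# = (max column abs sum) * (max row abs sum) = 2 * (max total degree)
max_total_degree = 4
eps = required_kl_gap(eta=1e-2, u=5.0, A_opnorm=np.sqrt(2 * max_total_degree))
print(eps)   # the objective tolerance to hand to the inner solver
```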
The linear dependence on |E| is not merely an artifact of our implementation: it is information-theoretically necessary in the sparse input model. Any procedure that outputs a nontrivial pseudo-flow f which depends on the edge data (w, c) must, in the worst case, inspect Ω(|E|) edges. Formally, an adversary may modify a single edge weight we or capacity ce so that the unique projection f⋆ changes by a fixed amount on that edge (and potentially propagates through the balance constraints). An algorithm that does not read e cannot distinguish the two instances and therefore cannot be correct on both; this yields the Ω(|E|) lower bound stated in Theorem~5. Consequently, the per-iteration cost O(|E|) is optimal up to constants among algorithms that must react to all edges.
Beyond input reading, there are also lower bounds on the number of iterations needed by Sinkhorn/IPFP-like methods to reach ε-accurate constraints in worst-case instances, even in classical entropic optimal transport and matrix scaling. These results imply that one should not expect a uniformly logarithmic dependence on 1/ε for first-order multiplicative scaling schemes; rather, a polynomial dependence on 1/ε is intrinsic for broad instance classes (see, e.g., lower-bound discussions in the entropic OT literature). Since our balance constraints are encoded by the sparse incidence operator A, CCFA inherits the same fundamental limitation: any method that enforces Af = b via repeated local multiplicative corrections must propagate information across the graph, and the number of passes required scales at least with the desired precision and with the graph’s mixing/conditioning properties. In this sense, Theorem~3 is best interpreted as guaranteeing convergence and a polynomial iteration bound under feasibility and positivity, while Theorem~5 explains why the dominant O(K|E|) runtime is essentially unavoidable on sparse inputs.
Taken together, these bounds justify the practical regime in which we operate CCFA: we fix a moderate K (constant across batches) so that Tnorm(K) is comparable to Tscore and Tmp, and we appeal to Theorem~4 to certify that the resulting pseudo-flow satisfies Kirchhoff balance up to an explicit, tunable residual. The lower bounds indicate that, absent additional structure or stronger assumptions, asymptotically faster normalization than O(|E|) per iteration or dramatically fewer than polynomial-in-1/ε iterations cannot be guaranteed in the worst case, reinforcing the design choice of sparse, batched, approximate projections.
We design experiments to isolate three questions: (a) whether enforcing global flow constraints improves predictive accuracy on tasks with underlying conservation laws, (b) whether the returned pseudo-flow is faithful to physically meaningful flows produced by simulators, and (c) how sensitive the method is to distribution shift and to potentially misspecified constraints.
We target two families of benchmark problems in which directed flow is a natural latent variable.
We consider graphs representing power networks, where nodes are buses
and directed edges are transmission lines. Node features include
exogenous injections/loads, voltage setpoints (when available), and
static attributes; edge features include line parameters and thermal
ratings (used as capacities ce). Typical
prediction targets include (i) per-bus voltage magnitude/angle, (ii)
per-line active/reactive power flow magnitudes, and (iii) feasibility or
contingency labels (e.g. whether an operating point violates line
limits). In these tasks, conservation is naturally expressed through a
supply vector b derived from
injections/loads (possibly augmented with slack-bus conventions).
We consider directed circuit graphs in which nodes are circuit
components or connection points, edges represent directed current
pathways, and capacities encode maximum allowable current (or are set to
∞ when not applicable). Targets include
simulated node potentials, currents on designated edges, and pass/fail
labels for over-current or connectivity constraints. Although real
circuits are not purely directed-flow objects, these tasks provide a
controlled setting in which Kirchhoff-type constraints (on appropriate
graph abstractions) are meaningful.
We compare CCFA against attention and message-passing baselines that
differ only in the normalization step:
(i) softmax attention, i.e. per-node simplex normalization on outgoing edges (Theorem~2
identifies this as a local KL projection).
(ii) unnormalized edge gating, i.e. using w = exp (e) directly as
edge weights.
(iii) Sinkhorn (doubly stochastic) normalization on bipartite or layered variants when applicable, to separate the
effect of global constraints from mere scaling.
(iv) standard message-passing baselines (e.g. GCN/GAT variants) matched in depth and parameter count.
For CCFA, we include ablations that remove individual constraints: (a) balance only (Af = b with c ≡ ∞), (b) capacities only (box projection without enforcing Af = b), and (c) full constraints (Af = b and 0 ≤ f ≤ c). We also vary the iteration budget K to expose the compute–accuracy tradeoff.
We report standard supervised metrics (MSE/MAE for regression targets and NLL/AUROC for classification targets) averaged over multiple random seeds. To ensure a fair comparison, we hold constant the encoder, scorer architecture, and message/update maps, and change only the normalization mechanism. Hyperparameters (learning rate, weight decay, dropout, and K) are tuned on validation splits. Since CCFA introduces an additional algorithmic component, we also report wall-clock time per epoch and throughput (graphs/sec) to quantify the computational overhead relative to softmax attention.
A central claim is that the learned pseudo-flow f(K) is
interpretable as a physically constrained transport of attention mass.
To evaluate this, we align the pseudo-flow with simulator outputs when
the latter are available:
for PowerGraph-like tasks, we compare f(K) (after
appropriate scaling) to AC or DC power-flow solutions on each line; for
circuit tasks, we compare to simulated currents on designated directed
edges. We report correlation measures (Spearman and Pearson), relative
error on magnitudes (e.g. ∥f − f̂∥1/∥f̂∥1),
and rank-based agreement on the top-k highest-flow edges. When the
simulator flow is only defined up to sign conventions or undirected
magnitudes, we compare |fe| to the
corresponding magnitudes.
To connect fidelity to the optimization guarantees, we also report
the conservation residual and capacity residual directly:
$$
\mathrm{Res}_A := \frac{\|Af^{(K)}-b\|_1}{\|b\|_1+\epsilon_0},
\qquad
\mathrm{Res}_c := \frac{\|(f^{(K)}-c)_+\|_1}{\|c\|_1+\epsilon_0},
$$
for a small ϵ0 > 0 to avoid
degeneracy when b ≡ 0 or c ≡ ∞ on many edges. These
diagnostics permit a quantitative check that the solver behaves as
intended and allow us to correlate downstream error with constraint
violation.
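For completeness, the diagnostics above can be computed as in the following sketch (illustrative names); restricting the capacity normalizer to edges with finite c_e is one convenient way to handle c_e = ∞.

```python
import numpy as np

def residual_diagnostics(f, src, tgt, b, c, eps0=1e-8):
    num_nodes = len(b)
    Af = np.zeros(num_nodes)
    np.add.at(Af, tgt, f)
    np.add.at(Af, src, -f)
    res_A = np.abs(Af - b).sum() / (np.abs(b).sum() + eps0)       # conservation residual
    finite = np.isfinite(c)                                       # ignore c_e = inf in the normalizer
    overflow = np.clip(f[finite] - c[finite], 0.0, None)
    res_c = overflow.sum() / (c[finite].sum() + eps0)             # capacity residual
    return res_A, res_c
```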
Even when we enforce box constraints via a Dykstra substep, finite precision and implementation details (e.g. mixed precision, log-domain stabilization, or damping) can produce small violations. We therefore report: (i) the maximum relative violation maxe(fe − ce)+/( ce + ϵ0), (ii) the fraction of edges with violation exceeding a threshold τ, and (iii) the total overflow mass ∥(f − c)+∥1 normalized by ∥f∥1. These summaries are presented per dataset and stratified by degree and by capacity scale to reveal whether violations concentrate on hubs or on tight bottlenecks.
Since the normalization step propagates information globally through the incidence constraints, we test robustness to structural shift. We construct out-of-distribution splits by (i) training on graphs with bounded average degree and testing on graphs with higher degree (or vice versa), (ii) perturbing topology by random edge deletions/additions while preserving node features, and (iii) training on one family of motifs (e.g. radial networks) and testing on another (e.g. meshed networks). We report the same predictive metrics as in-distribution, together with residual diagnostics ResA, Resc, to distinguish failures due to representation shift from failures due to nonconvergence of the inner projection.
To avoid presenting constraints as uniformly beneficial, we include a task in which conservation is either irrelevant or misspecified. Concretely, we consider a dataset in which the true signal is localized (e.g. edge classification based on local features) and the supplied b is uninformative or noisy; alternatively, we inject controlled noise into b and measure performance degradation as a function of noise level. We expect CCFA to underperform softmax attention in this regime, since the projection enforces a global coupling that may suppress useful local evidence. We report the crossover point (noise level or task type) at which CCFA ceases to help, and we relate this observation to a modeling recommendation: when b or c are uncertain, one should either (i) learn them with appropriate regularization, or (ii) relax constraints by penalization rather than hard projection, both of which we address next.
We view CCFA primarily as a normalization principle: given scores e and a constraint specification (A, b, c), we replace ad hoc local normalizations by the unique I-projection onto ℱ(G, b, c). This perspective suggests several extensions in which the combinatorial object (the feasible polytope) is modified while the variational form minf ∈ ℱKL(f∥w) is preserved.
In some applications b is
observed (e.g. injections/loads), but in others it is latent or only
partially specified. A direct approach is to parameterize b = bθ(x)
as a nodewise map from features xv (and possibly
global context) and to train θ
jointly with the scorer producing e. Since feasibility requires ∑vbv = 0,
we may enforce this by a differentiable centering step
$$
\tilde b_v := b_{\theta}(x)_v - \frac{1}{|V|}\sum_{u\in V}
b_{\theta}(x)_u,
$$
or, when sources/sinks are known up to a slack convention, by fixing a
designated slack node s and
setting b̃s := −∑v ≠ sbθ(x)v.
If magnitudes should be bounded (e.g. due to physical limits), one may
also constrain b via bv = B ⋅ tanh (b̄v)
with learned b̄v. We emphasize
an identifiability issue: if both b and e are learned without supervision on
flows, then multiple (b, e) pairs can yield
similar downstream predictions. In such regimes we expect that
regularization is necessary, for example penalizing ∥b∥1 or encouraging
sparsity of nonzero supplies, and we recommend reporting the induced
residual ∥Af(K) − b∥
together with predictive metrics to ensure that the learned b is not being ``explained away’’ by
solver nonconvergence. Finally, if strict feasibility of (G, b, c) is
uncertain, one may introduce a slack variable s ∈ ℝV and solve
with relaxed balance Af = b + s
and a penalty λ∥s∥1 (or ∥s∥22), which
retains convexity and yields an explicit knob controlling the
conservation–fit tradeoff.
Many domains involve multiple interacting conserved quantities
(e.g. phases, species, or parallel demands). A natural extension
introduces commodities indexed by k ∈ {1, …, m} with
pseudo-flows f(k) ∈ ℝ ≥ 0E,
supplies b(k), and
(optionally) commodity-specific scores e(k) and base
weights w(k) = exp (e(k)).
The simplest decoupled variant enforces Af(k) = b(k)
and 0 ≤ f(k) ≤ c(k)
independently, with objective ∑kKL(f(k)∥w(k));
this amounts to running m CCFA
layers in parallel. The more interesting case is a coupled capacity constraint:
$$
Af^{(k)} = b^{(k)}\quad\forall k,
\qquad
\sum_{k=1}^m f^{(k)}_e \le c_e \quad \forall e\in E,
$$
with f(k) ≥ 0. The
feasible set remains convex, and the KL objective remains strictly
convex in the joint variable (f(1), …, f(m)),
hence the I-projection is still unique when feasible. Algorithmically,
alternating Bregman projections extend by adding an edgewise projection
onto the coupled constraint {z ∈ ℝ ≥ 0m : ∑kzk ≤ ce}
for each edge e, which is a
low-dimensional KL projection that can be computed efficiently (via a
single scalar dual variable per edge). This yields an implementation
whose per-iteration cost remains O(m|E|), with the
same qualitative convergence guarantees as in Theorem~3 under standard
regularity conditions.
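The edgewise coupled projection admits a closed form, sketched below for a single edge (illustrative names): if the shared capacity is inactive the masses are unchanged, and otherwise the scalar dual variable amounts to a uniform rescaling onto the capacity.

```python
import numpy as np

def project_coupled_capacity(g, c):
    # KL projection of per-commodity masses g = (g_1, ..., g_m) onto {z >= 0 : sum_k z_k <= c}
    total = g.sum()
    if total <= c:
        return g.copy()            # constraint inactive: the projection leaves g unchanged
    return g * (c / total)         # active: z_k = g_k * exp(-nu), with nu fixed so sum_k z_k = c

print(project_coupled_capacity(np.array([0.5, 0.8, 0.3]), c=1.0))   # rescaled to sum to 1.0
```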
Although our presentation is in message-passing language, CCFA can be inserted wherever a model forms nonnegative edge weights and aggregates along directed edges. For Graph Transformers, a head typically defines scores on a set of admissible pairs (often all pairs, or a sparsified pattern) and then applies a rowwise softmax. Replacing this with a CCFA projection enforces global conservation across the attention graph, thereby coupling what are otherwise independent per-query normalizations. Concretely, for each head we may form a directed graph on tokens with edges restricted to a sparse pattern (e.g. k-NN, locality, or routing edges) and apply CCFA to obtain f(K) used as attention weights. When the attention graph is dense, the Ω(|E|) lower bound (Theorem~5) becomes prohibitive, so practicality requires sparsification or factorization; from our viewpoint this is not an artifact of the solver but a structural consequence of reading the input. We therefore regard CCFA as most appropriate for intrinsically sparse relational data, or for Transformers already using sparse attention mechanisms.
The method presupposes that ℱ(G, b, c) ≠ ∅. Infeasibility can arise from inconsistent supplies across connected components, from overly tight capacities, or from misspecified directions. We can detect this empirically via nonvanishing residuals after large K, but for robust deployment we recommend either (i) preprocessing to enforce ∑v ∈ Cbv = 0 on each weakly connected component C, or (ii) adopting a slack-penalized formulation as above. A second limitation is numerical: if e has large magnitude, then w = exp (e) can be poorly scaled, and multiplicative updates may underflow/overflow. Log-domain implementations mitigate this but introduce additional constant factors; moreover, capacity projections can create near-active constraints that yield small gradients through clipped edges. Third, finite-K truncation yields approximate feasibility; while Theorem~4 formalizes the resulting violations in terms of ε, in practice the map (e, b, c) ↦ f(K) can be sensitive on ill-conditioned graphs (e.g. long directed chains with tight bottlenecks), requiring damping or adaptive stopping. Finally, the global coupling is not universally beneficial: when b or c are noisy, or when the task signal is genuinely local, enforcing Af = b may suppress predictive evidence. We therefore view CCFA as a modeling choice that should be justified by domain structure, and we advocate reporting both predictive metrics and the diagnostic residuals as part of any evaluation.
For completeness and reproducibility, we collect in the appendices the formal arguments deferred from the main text, together with engineering details needed to implement CCFA efficiently on sparse graphs, and additional experimental results (datasets, ablations, and diagnostics) supporting the claims made in the discussion.
We provide self-contained proofs of Theorems~1–5 under the standing
assumptions stated in the global context (in particular we > 0 and
ℱ(G, b, c) ≠ ∅
when feasibility is required). For Theorem~1 (existence and uniqueness),
we write the objective as a strictly convex, lower semicontinuous
function on ℝ ≥ 0E and use
closedness and convexity of ℱ(G, b, c) to
obtain existence; strict convexity then yields uniqueness. We also
record the KKT conditions: there exist node potentials p ∈ ℝV and
edgewise multipliers αe, βe ≥ 0
for the constraints fe ≤ ce
and fe ≥ 0
such that
$$
\log\frac{f^\star_e}{w_e} + (A^\top p)_e + \alpha_e - \beta_e = 0,
\qquad
\alpha_e(f^\star_e-c_e)=0,\quad \beta_e f^\star_e=0,
$$
which clarifies when the incidence-form parametrization fe = weexp (−(A⊤p)e)
holds exactly (namely away from active box constraints).
For Theorem~2 (containment of softmax), we give the per-node
Lagrangian derivation on the local simplex constraints and explicitly
show separability over v ∈ V. For Theorem~3
(convergence), we state the alternating Bregman projection scheme as a
Dykstra iteration on the intersection of an affine set and a box, and we
cite the relevant convergence theorem for generalized KL
(I-)projections, then specialize to the incidence-structured affine
constraints. In the specialization, we also record conditions ensuring
well-posedness of the affine projection in the dual (e.g. handling
disconnected components by fixing one potential per component, or by
restricting to the subspace {p : ∑v ∈ Cpv = 0}
per component). For Theorem~4 (truncation guarantees), we spell out the
duality argument relating KL suboptimality to approximate stationarity
and hence to constraint residual; we include a lemma giving a
quantitative bound of the form
$$
\|Af^{(K)}-b\|_1 \le L_A
\sqrt{2\bigl(\mathrm{KL}(f^{(K)}\|w)-\mathrm{KL}(f^\star\|w)\bigr)},
$$
for an explicit constant LA depending on
boundedness of iterates (ensured, for instance, by finite capacities or
by a mild regularization). Finally, for Theorem~5 we provide the
adversarial construction in full detail, including the reduction that
forces any algorithm producing a nontrivial dependence on (w, c) to read an Ω(|E|) fraction of the
input in the worst case.
We describe the precise solver used in our experiments as an unrolled K-step map (e, b, c) ↦ f(K). The appendices specify (i) the parametrization of primal iterates via node potentials p (affine step) and correction variables (Dykstra step), (ii) damping choices that ensure numerical stability, and (iii) stopping criteria when we allow adaptive K. We also discuss differentiation through the solver. Since the updates are compositions of smooth operations (exponentials, sparse reductions) and simple projections (box clipping, or KL box projections with corrections), backpropagation through the unrolled iterations is straightforward; nevertheless, we record practical measures that control exploding/vanishing gradients, including log-domain evaluation of we = exp (ee) and gradient clipping on p. When ce = ∞, we treat the corresponding capacity projection as a no-op and omit its correction state. When infeasibility is suspected, we optionally replace the hard constraint Af = b by a slack-penalized variant and show how this modifies the dual update.
We give pseudocode and tensor shapes for a batched implementation. The key representation is a COO/CSR edge list together with segment indices enabling reductions over Nin(v) and Nout(v). The appendices enumerate the per-iteration kernels: (a) compute residuals rv = (Af)v − bv by two segmented sums, (b) update node potentials p by a simple rule (e.g. a dual gradient step or IPFP-style scaling), (c) form updated flows fe = weexp (pu − pv) edgewise, and (d) apply the capacity projection with Dykstra corrections. We report the memory cost of storing (f, w) and the auxiliary buffers (p, corr), and we record the exact asymptotic scaling O(K|E|) as realized by sparse primitives. We also list default hyperparameters (initialization f(0) = w, damping coefficient, maximum K, and ε-based early stopping) and provide guidance for choosing K to balance predictive performance against residuals ∥Af(K) − b∥ and ∥(f(K) − c)+∥.
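A minimal sketch of the batching convention described here (illustrative names): node indices are offset per graph so that a single concatenated edge list supports segmented reductions.

```python
import numpy as np

def batch_graphs(graphs):
    # graphs: list of dicts with keys 'src', 'tgt', 'num_nodes'
    src, tgt, offset = [], [], 0
    for g in graphs:
        src.append(np.asarray(g["src"]) + offset)
        tgt.append(np.asarray(g["tgt"]) + offset)
        offset += g["num_nodes"]
    return np.concatenate(src), np.concatenate(tgt), offset   # offset = total node count

g1 = {"src": [0, 1], "tgt": [1, 2], "num_nodes": 3}
g2 = {"src": [0], "tgt": [1], "num_nodes": 2}
src, tgt, n = batch_graphs([g1, g2])   # src = [0, 1, 3], tgt = [1, 2, 4], n = 5
```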
We document the datasets used in our evaluation, including
preprocessing and graph construction. For power-grid graphs, we specify
how buses/lines become nodes/edges, how directions are chosen (or
symmetrized into antiparallel directed edges), and how supplies b and capacities c are obtained or normalized. For
circuit-style graphs, we describe component-level features, edge types,
and any imposed conservation conventions. For the informational control
dataset, we specify the source/sink semantics and how we build sparse
attention graphs. We provide train/validation/test splits, metrics (task
loss and accuracy/MSE, as appropriate), and the solver diagnostics we
report alongside predictive metrics:
ResA := ∥Af(K) − b∥1, Resc := ∥(f(K) − c)+∥1, GapKL := KL(f(K)∥w) − KL(f(K − 1)∥w).
These diagnostics are used both to monitor convergence and to detect
infeasibility or numerical issues.
We include ablations isolating the contributions of (i) enforcing balance only (Af = b without capacities), (ii) enforcing capacities only (box projection without balance), and (iii) the full intersection. We vary K and report the tradeoff between task performance and residuals, and we study warm starts versus cold starts across layers. We also analyze the learned-b setting (when enabled) by plotting the distribution of bv and correlating it with residuals, and we evaluate the slack-penalized formulation as a robustness mechanism under misspecified b or c. Finally, we report runtime scaling with |E| and compare against local softmax attention to quantify the constant-factor overhead of constrained normalization under the same sparse access pattern.