Digital persuasion in 2026 increasingly takes place inside user interfaces that are explicitly constrained. Platforms do not merely choose to disclose; they face binding constraints on how disclosures may be formatted and how many distinct messages may be deployed. Product pages are organized around standardized ``nutrition labels,’’ app stores impose fixed explanation templates, and risk-sensitive domains (finance, employment, health) require short-form warnings with prescribed language and placement. Even when regulation is absent, design systems and A/B testing pipelines create de facto budgets: only a small number of UI components can be varied without breaking consistency, accessibility, or compliance. These constraints are naturally modeled either as a message budget (at most K distinct disclosures) or as an information-capacity cap (at most κ bits of mutual information about the underlying state can be conveyed per interaction).
A second, equally important shift is that the ``receiver’’ in persuasion is often no longer a human deliberating from scratch. Many decisions are delegated to learning agents: browser assistants that auto-fill forms and select defaults, procurement bots that choose suppliers based on past performance, ad-bidding agents that update bidding rules in real time, and enterprise copilots that learn an organization’s preferences over repeated requests. These systems are typically trained or fine-tuned online and evaluated by regret-style performance criteria. As a result, the receiver’s behavior is neither fully rational in the one-shot Bayesian sense nor arbitrary; it is disciplined by guarantees such as contextual no-regret, internal regret, or swap regret. This mixture of structure and boundedness is precisely what makes the setting conceptually and practically distinct: a platform can exploit dynamics and learning, but it cannot assume the receiver will tolerate systematically dominated responses forever.
Our goal is to illuminate a specific tension created by these two trends. On the one hand, limiting the space of admissible disclosures is often advocated as a remedy against manipulation: fewer message types, or less informative messages, should reduce the sender’s ability to steer behavior. On the other hand, when the receiver is a learning system whose actions are observable and feed back into the interaction, the action itself becomes a communicative object. The receiver’s action is chosen after seeing the message, and therefore can encode information about the receiver’s posterior and its learned response mapping. From the sender’s perspective, the observable pair ``(message sent, action taken)’’ may behave like a richer alphabet than the physical message space alone. This paper formalizes that intuition and studies its implications for the effectiveness of attention constraints.
The core phenomenon we isolate is what we call action-induced channel expansion. Suppose the sender can only choose among K UI messages each round. Classical Bayesian persuasion would treat this as a hard K-signal limit. In a repeated interaction with a no-swap-regret receiver, however, the realized action a can act as an additional label: conditioning on the joint outcome (s, a) partitions histories more finely than conditioning on s alone. In effect, the sender may be able to implement outcomes comparable to a one-shot scheme with as many as K ⋅ |A| joint signals, where A is the receiver’s action set. Under an information-capacity constraint, an analogous logic applies in bits: even if the message s carries at most κ bits about the state, the receiver’s action can carry up to log |A| additional bits, so the relevant benchmark becomes κ + log |A|. This is not a statement about the receiver ``helping’’ the sender; rather, it reflects that a learning receiver’s behavioral responses create observable variation that the sender can anticipate and exploit.
Why does a regret guarantee not eliminate this effect? Regret minimization ensures that, in hindsight, the receiver cannot significantly improve by systematically swapping one action for another as a function of the observed message and its own action. This is a powerful consistency property: it implies that the receiver’s action choices are approximately optimal relative to the empirical distribution induced by the sender’s behavior. Yet that very empirical distribution is shaped by the sender’s adaptive choice of policies, and it is shaped in a way that depends on the receiver’s mapping. When we treat the joint object (s, a) as the ``signal,’’ swap regret implies approximate best-response behavior to an induced (and potentially richer) signal structure. The sender’s adaptivity can therefore be constrained not by the original message channel but by the expanded channel created endogenously by the receiver’s actions.
This perspective helps reconcile a practical puzzle. Policy
discussions often presume that restricting disclosure formats (fewer
variants, shorter explanations, bounded ``dark pattern’’ surface area) directly bounds the sender’s persuasive power. But platforms today routinely observe, and optimize against, downstream behavioral traces: clicks, dwell time, purchases, opt-outs, and subsequent choices. When the receiver is an automated assistant, these traces can be even more structured, because the assistant’s action set is discrete and stable (e.g., accept/reject/ask-for-more-info, choose among ranked options, select a default). A platform that only has a few allowed disclosures may still be able to induce a wide range of action-conditioned posteriors by varying policies over time and letting the learner’s action mapping ``tag’’
the interaction. Our results suggest that attention constraints aimed
solely at the sender’s message space can be weaker than intended in
dynamic environments.
At a high level, we provide three types of contributions. First, we develop an upper bound on what any adaptive sender can achieve against a receiver with contextual swap regret: the sender’s long-run payoff is bounded by the commitment optimum in a corresponding one-shot persuasion problem, but with the channel size effectively expanded from K to K ⋅ |A|, or from κ to κ + log |A|, up to an additive term that vanishes with average swap regret. This bound is informative because it cleanly separates learning effects (captured by regret) from channel-expansion effects (captured by the expanded benchmark). Second, we show that the difference between the original constrained benchmark and the expanded benchmark can be bounded away from zero in finite instances: there are environments in which limiting the sender to K messages meaningfully reduces commitment persuasion, yet does not prevent an adaptive sender from approaching the K ⋅ |A| benchmark in the repeated game. Third, we complement the upper bound with instance-tight constructions: there exist explicit sender strategies that, against standard no-swap-regret learners, asymptotically attain the expanded benchmark, demonstrating that action-induced expansion is not merely an artifact of proof technique.
Beyond the formal results, the framework clarifies what kinds of interventions might or might not work. If a regulator limits the number of disclosure templates but leaves the receiver’s action space large and observable, the effective channel may remain rich. Conversely, constraining the action interface (e.g., reducing the granularity of choices, limiting contingent actions, or adding friction that collapses distinct actions into fewer observable categories) can directly shrink the endogenous expansion term log |A| or the factor |A|. Similarly, privacy interventions that reduce observability of actions to the sender can sever the link between actions and induced posteriors, potentially restoring the intended force of message caps. These observations do not prescribe a single policy; rather, they highlight that the design of attention constraints must account for the full feedback loop, not just the sender’s outbound communication.
We also emphasize limitations. Our baseline model treats states as i.i.d. across rounds and imposes per-round constraints on signaling; real systems may face long-run budgets, correlated environments, and strategic experimentation costs. Moreover, we abstract from multi-agent interactions, competition among senders, and delegation chains in which a human supervises an assistant. The point is not that any specific platform can always exploit action-induced expansion, but that the possibility is inherent once (i) the sender is adaptive and uncommitted, (ii) the receiver learns with internal-consistency guarantees, and (iii) the receiver’s actions are sufficiently rich and observable. The model therefore serves as a diagnostic: it identifies when and why standard ``limit the message’’ intuitions can fail.
The remainder of the paper formalizes the repeated persuasion environment with either message budgets or mutual-information caps, characterizes the relevant one-shot benchmarks, and then derives the expanded-channel bounds and matching constructions. We close by discussing how these theoretical insights map onto practical design constraints in disclosures, assistant interfaces, and auditing regimes, and how the tradeoff between usability, transparency, and manipulability changes when the receiver is a learning agent rather than a Bayesian decision maker.
Our analysis sits at the intersection of Bayesian persuasion, online
learning (in particular internal and swap regret), and a growing body of
work on platform design and manipulation under UI and attention
constraints. The closest conceptual starting point is the ``generalized principal--agent’’ framework with learning receivers introduced by Lin and Chen \cite{LinChen2023}. In that model, a principal interacts repeatedly with an agent who adapts via a regret-minimizing algorithm, while the principal chooses policies without commitment. Lin and Chen show that when the agent achieves contextual no-swap regret, the principal cannot asymptotically do better than the one-shot commitment optimum in a suitably defined induced problem. We adopt their regret notion and timing discipline, but our focus differs: we study environments in which the principal is \emph{ex ante} constrained in the message channel (either by a finite message budget or a mutual-information cap) and ask whether regret-based learning restores the usual intuition that such constraints limit persuasion. The key departure is that we deliberately \emph{do not} apply the standard ``revelation’’
step that collapses joint objects back to the original message space;
instead, we treat the realized action as part of an effective signal and
quantify the resulting expansion of the relevant benchmark. In this way,
our results can be read as a refinement of the Lin–Chen reduction
tailored to attention-limited disclosure channels.
On the persuasion side, our benchmarks build on the classic Bayesian persuasion model of Kamenica and Gentzkow and the subsequent information-design literature (see, e.g., , ). The commitment persuasion value U⋆ is characterized by concavification of the sender’s interim value as a function of beliefs, and finite-signal constraints appear naturally as restrictions on the number of posteriors implementable via Bayes-plausible splittings. There is a substantial literature studying persuasion with bounded signal alphabets or other communication restrictions, including finite-message persuasion and menu-size constraints (e.g., , , ). Our message-budget formulation is aligned with these works, but our contribution is not a new concavification result under a fixed alphabet. Rather, we show that in repeated interactions with adaptive (non-committing) senders and swap-regret receivers, the effective constraint is weaker: the relevant comparison point is the one-shot optimum under an alphabet whose size scales with the receiver’s action set.
A parallel strand of work imposes information-capacity constraints through mutual information, typically motivated by rational inattention or communication complexity. In economics, rational inattention originates with Sims and is developed in discrete choice settings by Matějka and McKay; related treatments appear in information design with attention costs or entropy constraints (e.g., , ). In computer science and information theory, mutual-information constraints are ubiquitous as a way to cap the informativeness of signaling rules. Our per-round cap I(ω; s) ≤ κ is in this spirit and corresponds to bounding the capacity of the sender’s outbound channel. The novelty in our repeated setting is the observation that even if the sender’s channel is capped at κ bits, the receiver’s realized action can contribute up to log |A| additional bits, implying that the tight benchmark becomes κ + log |A| when the sender can anticipate and exploit action-conditioned behavior. This perspective complements rational inattention: whereas that literature typically treats the decision maker’s attention choice as internal and costly, we treat attention limits as exogenous constraints on what the sender may transmit, and we highlight how feedback can partially undo such limits.
The learning-theoretic component of our model relies on internal and
swap regret, and its well-known connection to correlated equilibrium.
The classical results of Foster and Vohra and Hart and Mas-Colell show
that no-internal-regret (and a fortiori no-swap-regret) dynamics
converge to the set of correlated equilibria in repeated games. In
contextual settings, swap regret can be viewed as ensuring that for each
``context’’ (here, the realized message, and in our deviation class even the realized message--action pair), the learner’s action distribution is approximately stable against systematic relabelings. Lin and Chen’s contribution is to bring this equilibrium-selection logic into a principal--agent setting in which the principal chooses the ``game’’
(via information policy) each round. Our work leverages the same
internal-consistency property but emphasizes a different implication:
because deviations may condition on both the message and the realized
action, the induced object governing incentives is naturally the joint
outcome (s, a). This
is precisely the step at which the effective channel can expand, and it
is also the step at which the correlation-device intuition from
correlated equilibrium becomes especially apt: the realized action
behaves like an endogenous correlating variable that can refine the
information structure beyond the exogenous message labels.
There is also a growing literature on strategic interaction with learning agents and on the incentives of a sophisticated player facing a regret-minimizing opponent. Work on ``gaming’’ no-regret learners, adversarial bandits with strategic feedback, and manipulation of online learning dynamics (e.g., , ) shares the broad theme that regret guarantees constrain long-run averages but do not generally prevent exploitation via nonstationarity or carefully designed sequences. In economics, related themes appear in dynamic mechanism design and delegation with boundedly rational or learning agents, where a principal may exploit the agent’s adjustment process. Our results can be interpreted within this theme: attention limits that would be binding under commitment can be circumvented in part because the principal can use adaptivity to induce action-contingent variation that a swap-regret learner cannot systematically reject without sacrificing its own performance criterion.
Finally, our motivation relates to the empirical and conceptual literature on platform manipulation, interface design, and ``dark patterns.’’ Legal and policy discussions often emphasize limiting persuasive surface area by standardizing disclosures and bounding complexity (e.g., mandated labels, constrained templates, short-form warnings), and the HCI literature documents how subtle UI variations can steer behavior even without changing substantive information. Our contribution to this conversation is theoretical and diagnostic: even when regulators successfully constrain the sender’s message repertoire or message informativeness, the feedback loop created by observing user (or assistant) actions can reintroduce a richer effective signaling channel. This suggests that a regulatory focus solely on outbound disclosures may be incomplete in settings where downstream actions are both discrete and observable—as is common for automated assistants operating through APIs, structured decision interfaces, or standardized workflows. At the same time, we view this as a limitation as well as an insight: our results are strongest when the action is observable and the principal can condition future policies on the learner’s mapping. Settings with limited observability, privacy protections, or delayed/noisy action feedback may substantially attenuate the channel-expansion effect, and understanding such frictions connects naturally to the literatures on privacy, auditing, and informational externalities.
In sum, we combine (i) constrained information design (finite messages or mutual-information caps), (ii) repeated interaction without commitment, and (iii) internally consistent learning (contextual swap regret). The resulting picture is not that attention constraints are futile, but that their bite depends on the entire interaction protocol—including the granularity of the receiver’s action space and what the sender can observe and condition on. This lens complements existing persuasion and learning results by identifying a concrete, quantifiable mechanism—action-induced channel expansion—through which dynamic environments can weaken static communication limits.
We study a repeated persuasion (equivalently, repeated information-design) environment with a sender (the principal/platform) interacting with a learning receiver (the agent/assistant). Time is indexed by rounds t ∈ {1, …, T}. In each round a state ωt is realized from a finite state space Ω according to a common prior μ0 ∈ Δ(Ω). Our baseline assumption is that the sequence ω1, …, ωT is i.i.d. with law μ0, so that the statistical environment is stationary and any dynamics arise from strategic adaptation rather than drifting fundamentals. (Most of our arguments extend to mild nonstationarities, but the i.i.d. baseline keeps the role of the signal constraint transparent.)
The agent chooses an action at from a finite
action set A with |A| ≥ 2. Payoffs are per-round and
depend only on the realized state and action: the principal obtains
u(ωt, at)
and the agent obtains v(ωt, at).
We impose a uniform boundedness condition: there exists B < ∞ such that
|u(ω, a)| ≤ B and |v(ω, a)| ≤ B ∀(ω, a) ∈ Ω × A.
This is standard in online-learning analyses and ensures that regret
terms translate into additive utility perturbations at rate O(B δT),
where δT
denotes average swap regret defined below.
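As a concrete illustration of these primitives, the following Python sketch instantiates a small environment; the specific numbers correspond to the aligned ``matching’’ example used later, and all variable names are illustrative rather than part of the formal model.
\begin{verbatim}
# A minimal sketch of the model primitives (illustrative names, not part of the formal model).
import numpy as np

n_states, n_actions = 3, 3                 # |Omega| and |A|
mu0 = np.full(n_states, 1.0 / n_states)    # common prior mu_0 over states

# Per-round payoffs u(omega, a) for the principal and v(omega, a) for the agent,
# here the aligned "matching" payoffs used later as a running example.
u = np.eye(n_states)                       # u(omega, a) = 1{omega == a}
v = np.eye(n_states)                       # v(omega, a) = 1{omega == a}
B = max(np.abs(u).max(), np.abs(v).max())  # uniform bound on |u| and |v|
\end{verbatim}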
In each round the principal sends a message (or disclosure) st from a set
S. The principal’s per-round
signaling policy is a mapping
πt : Ω → Δ(S),
chosen without commitment and potentially as a function of the entire
public history. Importantly, we impose a signaling constraint, capturing
exogenous limits on the principal’s ``UI surface area’’ or on how
informative the principal may be.
We consider two canonical constraints.
The principal is restricted to an alphabet of size at most K ≥ 2 each round, so that |S| ≤ K. We can, without loss, take S = [K] and interpret πt(ω) as a distribution over K labels. The principal may relabel the K messages across time through its choice of πt; the constraint is purely cardinal.
The principal may choose an arbitrary (finite) message set S, but the informativeness of the
induced channel is capped by a mutual-information bound κ ≥ 0:
Iμ0, πt(ω; s) ≤ κ for
all t.
Here Iμ0, πt(ω; s)
is computed under the joint distribution in which ω ∼ μ0 and s ∣ ω ∼ πt(ω).
This formulation mirrors rational-inattention and communication-capacity
constraints: the sender may choose the alphabet and garbling structure,
but cannot transmit more than κ bits (on average) about the state
in a single round.
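For concreteness, the following Python sketch (numpy only; the function name and interface are hypothetical) checks whether a given per-round policy πt respects the mutual-information cap:
\begin{verbatim}
# Check the per-round information-capacity constraint I_{mu0, pi_t}(omega; s) <= kappa.
import numpy as np

def respects_cap(mu0, pi, kappa):
    """mu0: prior over states; pi: |Omega| x |S| array with rows pi(.|omega); kappa in nats."""
    mu0 = np.asarray(mu0, float)
    joint = mu0[:, None] * np.asarray(pi, float)   # Pr[omega, s] = mu0(omega) * pi(s|omega)
    p_s = joint.sum(axis=0, keepdims=True)         # marginal Pr[s]
    mask = joint > 0
    mi = float((joint[mask] * np.log(joint[mask] / (mu0[:, None] * p_s)[mask])).sum())
    return mi <= kappa + 1e-12, mi

# Example: full revelation on three equally likely states uses log 3 ~ 1.0986 nats.
ok, mi = respects_cap([1/3, 1/3, 1/3], np.eye(3), kappa=np.log(3))
print(ok, mi)
\end{verbatim}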
A key modeling choice is how the agent maps observed messages into actions when the principal does not commit. Following Lin–Chen’s generalized principal–agent timing, each round proceeds as follows: the principal observes the agent’s current response mapping ρt and chooses πt; the state ωt ∼ μ0 is realized; a message st ∼ πt(ωt) is sent; the agent selects at according to ρt; and the per-round payoffs u(ωt, at) and v(ωt, at) are realized, with feedback delivered to the agent’s learning algorithm.
Two aspects are worth emphasizing. First, the principal’s ``move’’ in a round is the choice of πt before ωt is realized, but the message itself is chosen after observing ωt via st ∼ πt(ωt)—exactly as in one-shot persuasion. Second, the principal observes ρt when selecting πt; this captures the idea that the receiver’s algorithm is known (or, equivalently, that the principal can infer the receiver’s current mapping from interaction data). This observability is the channel through which the principal can exploit adaptivity, and later we discuss how limited observability would attenuate our effects.
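The timing just described can be summarized by the following simulation-loop sketch; the principal and agent objects and their methods are assumptions introduced only to make the order of moves explicit.
\begin{verbatim}
# One round of the interaction, in the order described above (interfaces are hypothetical).
import numpy as np

rng = np.random.default_rng(0)

def play_round(mu0, u, v, principal, agent):
    rho_t = agent.current_mapping()            # agent's current response mapping (observed)
    pi_t = principal.choose_policy(rho_t)      # chosen BEFORE omega_t is realized
    omega_t = rng.choice(len(mu0), p=mu0)      # state drawn i.i.d. from the prior
    s_t = rng.choice(pi_t.shape[1], p=pi_t[omega_t])   # message s_t ~ pi_t(omega_t)
    a_t = agent.act(s_t)                       # action chosen after seeing the message
    agent.update(s_t, a_t, v[omega_t, a_t])    # feedback to the learning algorithm
    principal.observe(s_t, a_t)                # principal sees the realized pair (s_t, a_t)
    return u[omega_t, a_t], v[omega_t, a_t]
\end{verbatim}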
We do not assume that the agent best-responds to πt in each
round; instead, the agent is a learning algorithm that guarantees low
swap regret relative to deviations that can depend on the observed
context. Concretely, let d : S × A → A
be an arbitrary deviation rule that, after seeing the realized message
s and the realized action
a, recommends replacing a with d(s, a). The agent
achieves contextual no-swap regret if, for all such d,
$$
\mathbb{E}\Big[\sum_{t=1}^{T}\big(v(\omega_t,d(s_t,a_t))-v(\omega_t,a_t)\big)\Big]\le \mathrm{CSReg}(T),
$$
where the expectation is over all randomization and over the states. We
define the average swap regret level δT := CSReg(T)/T,
and we will be interested in algorithms with CSReg(T) = o(T),
i.e., δT → 0.
The deviation class in this guarantee is deliberately strong: it permits the agent
to ``second-guess’’ its own realized action contingent on the message and on the action itself. This is the internal-consistency notion that underlies convergence to correlated equilibrium and is exactly the tool used in Lin--Chen to reduce the repeated interaction to an induced one-shot problem. At the same time, our signal constraints mean that the relevant ``induced’’
object is subtle: the empirical joint outcomes (st, at)
can behave as if they were drawn from a larger effective alphabet than
the physical message set S.
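This deviation class is also easy to audit empirically: because the maximizing deviation decomposes across realized joint contexts (s, a), the contextual swap regret of a logged transcript can be computed directly, as in the following sketch (illustrative code, not part of the formal development).
\begin{verbatim}
# Empirical contextual swap regret of a transcript, for deviations d : S x A -> A.
import numpy as np
from collections import defaultdict

def contextual_swap_regret(transcript, v):
    """transcript: iterable of (omega_t, s_t, a_t); v: |Omega| x |A| agent payoff matrix."""
    per_context = defaultdict(lambda: np.zeros(v.shape[1]))
    realized = 0.0
    for omega, s, a in transcript:
        per_context[(s, a)] += v[omega]    # cumulative payoff of each replacement action
        realized += v[omega, a]
    # Best deviation: independently pick the best replacement for each realized pair (s, a).
    best = sum(vec.max() for vec in per_context.values())
    return best - realized                 # CSReg(T); divide by T to obtain delta_T
\end{verbatim}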
To evaluate what a non-committing principal can achieve against such
a learner, we compare outcomes to classical persuasion values under the
same explicit signal constraints. Under a message budget K, the standard benchmark is
$$
U^\star_K
:=\max_{\pi:\Omega\to\Delta([K])}\;
\sum_{s\in[K]} \Pr[s]\; u\!\big(\mu_s, a^\*(\mu_s)\big)
\quad\text{s.t.}\quad \sum_{s\in[K]}\Pr[s]\,\mu_s=\mu_0,
$$
where μs
is the posterior induced by π,
Pr [s] is the induced message
probability, and the receiver best-responds according to
$$
a^\*(\mu)\in \arg\max_{a\in A}\mathbb{E}_{\omega\sim\mu}[v(\omega,a)],
$$
with ties broken in the principal’s favor. This is the Kamenica–Gentzkow
objective with a finite-signal constraint, and it can be computed by
linear programming over Bayes-plausible posteriors.
Under a mutual-information cap κ, the analogous benchmark is
$$
U^\star_{\kappa}
:=\max_{\pi:\Omega\to\Delta(S)}\;
\mathbb{E}\big[u(\omega,a^\*(\mu_s))\big]
\quad\text{s.t.}\quad I_{\mu_0,\pi}(\omega;s)\le \kappa,
$$
where the maximization ranges over all finite message sets S and all channels π satisfying the information
constraint. In general this problem is a convex program (after suitable
reformulation), and in applications one may compute it numerically.
Our central objects are benchmarks that anticipate the possibility
that the realized action effectively refines the principal’s message.
Under a message budget K, we
define
UK ⋅ |A|⋆
to be the one-shot persuasion value when the principal can commit to a
scheme with at most K|A| distinct signals.
Under a mutual-information cap κ, we define
Uκ + log |A|⋆
to be the one-shot persuasion value when the mutual-information
constraint is relaxed to κ + log |A|.
These expansions are motivated by a simple accounting identity that
will be formalized in our reduction. In the repeated game, the analyst
observes a triple (ωt, st, at).
Even if the physical message alphabet is limited to K, the pair (st, at)
lives in an alphabet of size at most K|A|; and even if I(ω; s) ≤ κ,
we always have
I(ω; s, a) ≤ I(ω; s) + H(a) ≤ κ + log |A|.
Thus, the joint outcome (s, a) can be viewed as an
``effective signal’’ with a strictly weaker constraint. Our main results
show that these expanded quantities are the appropriate upper bounds for
what a non-committing principal can extract from a no-swap-regret agent,
and that the gap between UK⋆
and UK|A|⋆
(or between Uκ⋆
and Uκ + log |A|⋆)
can be economically meaningful.
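The accounting identity can be checked numerically on any joint distribution over (ω, s, a); the following sketch (natural logarithms, illustrative only) verifies I(ω; s, a) ≤ I(ω; s) + log |A| on a random joint law.
\begin{verbatim}
# Numerical sanity check of I(omega; s, a) <= I(omega; s) + log|A| on a random joint pmf.
import numpy as np

rng = np.random.default_rng(1)

def mutual_information(joint):
    joint = joint / joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float((joint[mask] * np.log(joint[mask] / (px @ py)[mask])).sum())

p = rng.random((4, 3, 5))                              # joint over (omega, s, a)
p /= p.sum()
I_s = mutual_information(p.sum(axis=2))                # I(omega; s)
I_sa = mutual_information(p.reshape(p.shape[0], -1))   # I(omega; (s, a))
assert I_sa <= I_s + np.log(p.shape[2]) + 1e-9         # the kappa + log|A| accounting bound
\end{verbatim}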
In the next section we make this precise by reconstructing the Lin–Chen joint-signal reduction under explicit signal constraints, and by isolating the exact step at which a revelation-style collapse of (s, a) back to s is no longer innocuous when S is constrained.
Our analysis follows the logic of Lin–Chen’s joint-signal reduction, but we keep track of the sender’s per-round signal constraint. The key idea is to treat the realized pair (st, at) as the relevant ``context’’ for internal consistency. This is innocuous in unconstrained persuasion because a revelation-style argument can often collapse (st, at) back into a single message. Under a message budget or an information-capacity cap, that collapse is exactly where the argument ceases to be without loss.
Fix an arbitrary (possibly adaptive) principal strategy and a
contextual no-swap-regret agent. We package the repeated play into a
single distribution by sampling a random round. Let τ be uniform on {1, …, T}, independent of all play,
and define the random variables
ω := ωτ, s := sτ, a := aτ, j := (s, a).
Let ψ denote the induced joint
law of (ω, s, a)
(equivalently, the empirical mixture of per-round laws). Because ωt ∼ μ0
i.i.d. and independent of the past, we have the Bayes-plausibility
identity
$$
\sum_{j}\Pr_{\psi}[j]\,\mu_j=\mu_0,
$$
where μj ∈ Δ(Ω)
is the posterior Prψ[ω ∣ j].
The principal’s average expected payoff can be rewritten exactly as
the one-shot payoff under ψ:
$$
\frac{1}{T}\,\mathbb{E}\Big[\sum_{t=1}^{T}u(\omega_t,a_t)\Big]
=\mathbb{E}_{\psi}\big[u(\omega,a)\big].
$$
Thus, the entire repeated interaction induces a one-shot Bayesian
persuasion ``outcome’’ in which the effective signal is j = (s, a) and the
realized action is the second coordinate.
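Empirically, the random-round object ψ and its posteriors can be formed directly from a logged transcript, as in this sketch (a hypothetical helper, illustrative only); the averaged posteriors recover the empirical state frequencies, which is the finite-sample form of the Bayes-plausibility identity above.
\begin{verbatim}
# Form the induced joint law psi over (omega, s, a) and the posteriors mu_j for j = (s, a).
import numpy as np
from collections import defaultdict

def induced_outcome(transcript, n_states):
    """transcript: list of (omega_t, s_t, a_t) triples."""
    T = len(transcript)
    counts = defaultdict(lambda: np.zeros(n_states))
    for omega, s, a in transcript:
        counts[(s, a)][omega] += 1.0
    pr_j = {j: c.sum() / T for j, c in counts.items()}   # Pr_psi[j]
    mu_j = {j: c / c.sum() for j, c in counts.items()}   # posterior over omega given j
    avg_posterior = sum(pr_j[j] * mu_j[j] for j in pr_j) # equals empirical state frequencies
    return pr_j, mu_j, avg_posterior
\end{verbatim}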
The agent’s contextual no-swap-regret guarantee is stated for
deviations d : S × A → A.
Under the identification j = (s, a), these
are exactly deviations of the form d : j ↦ d(j).
Taking the uniform random round τ as above and dividing by T, we obtain, for every d,
$$
\mathbb{E}_{\psi}\big[v(\omega,d(s,a))-v(\omega,a)\big]\le \delta_T.
$$
This is an approximate obedience (or approximate correlated-equilibrium) condition for the
one-shot distribution ψ with
signal j and recommendation
a. Intuitively, this condition says that once
we condition on the realized joint outcome (s, a), the agent cannot
systematically gain (by more than δT in
expectation) by relabeling the realized action as a function of (s, a). This is precisely
the strength of swap regret: it controls deviations that depend on the
action itself, which is what makes j = (s, a) the
natural effective signal.
A convenient way to use this obedience condition is to compare the realized action to a best
response to the induced posterior. For each j,
let
$$
a^\*(\mu_j)\in\arg\max_{a'\in
A}\mathbb{E}_{\omega\sim\mu_j}\big[v(\omega,a')\big]
\quad\text{(ties broken for the principal).}
$$
Consider the deviation $d^\*$ defined
by $d^\*(j):=a^\*(\mu_j)$. Plugging
$d^\*$ into the obedience condition yields
$$
\mathbb{E}_{\psi}\big[v(\omega,a^\*(\mu_j))-v(\omega,a)\big]\le \delta_T.
$$
Thus, on average over j, the
realized response is within δT (in the
agent’s payoff units) of the best response to the posterior induced by
the effective signal j.
Because utilities are bounded, this approximate-best-response property transfers to the principal’s objective with an additive loss at rate O(B δT). Formally, one can upper bound the principal’s realized value by the value under the same induced posteriors (μj)j but with the agent replaced by a posterior best responder, plus an O(B δT) perturbation term.
In Lin–Chen’s unconstrained setting, one typically proceeds from a joint outcome (s, a) to a ``direct’’ scheme in which the sender simply recommends an action, and then appeals to a revelation principle to argue that the sender loses no power by using a signal alphabet indexed by A (or by a small refinement thereof). That step is harmless when signals are unconstrained: any refinement of the message space is feasible.
Under explicit signal constraints, the alphabet size (or information content) is not free. The pair j = (s, a) may carry strictly more ``labels’’ than s alone, even though a is chosen by the agent. In our repeated game, the principal can exploit adaptivity to steer the learner into systematically different action realizations following the physical message, so that conditioning on (s, a) yields a strictly finer partition of beliefs than conditioning on s. The revelation-style collapse would require the sender to encode the refinement that is currently implemented via action realizations into the physical signal itself, which can violate the message budget K or the mutual-information cap κ.
This observation leads to the central notion of an expanded effective channel: the correct one-shot object against which to compare the principal’s repeated-game performance is the commitment optimum under the constraint that applies to j = (s, a), not to s alone.
If the principal is restricted each round to |S| ≤ K, then the joint
alphabet satisfies
|{(s, a) : s ∈ S, a ∈ A}| ≤ K|A|.
Therefore, the induced one-shot outcome ψ can be implemented by a one-shot
scheme with at most K|A| distinct effective
signals j. Combining (i) the
Bayes-plausibility condition above, (ii) the fact that ψ uses at most K|A| signal realizations,
and (iii) the approximate obedience condition (which makes the receiver an O(δT)-approximate
best responder on the induced signal space), we obtain the
message-budget upper bound in Proposition~1:
𝔼ψ[u(ω, a)] ≤ UK|A|⋆ + O(B δT).
The substantive difference from the classical benchmark UK⋆
is that we do not attempt to collapse the induced signal space back to K messages. Doing so would
implicitly assume that the refinement provided by the realized action
can be replaced by additional physical messages, which is exactly what
the budget forbids.
Suppose instead that the principal satisfies Iμ0, πt(ω; s) ≤ κ
each round. Conditional on τ = t, the pair (ω, s) is distributed
according to μ0 and
the channel πt, hence
I(ω; s ∣ τ = t) ≤ κ.
Averaging over the uniform random τ and using ω ⟂ τ yields
$$
I(\omega;s,\tau)=I(\omega;\tau)+I(\omega;s\mid\tau)=\frac{1}{T}\sum_{t=1}^{T}I(\omega;s\mid\tau=t)\le\kappa.
$$
Now apply the chain rule to the effective signal j = (s, a). Since
τ is independent of ω and (s, a) is measurable with
respect to (s, a, τ), we
have
I(ω; s, a) ≤ I(ω; s, a, τ) = I(ω; s, τ) + I(ω; a ∣ s, τ).
The first term is at most κ by
the preceding display. For the second term, we use the entropy bound I(ω; a ∣ s, τ) ≤ H(a ∣ s, τ) ≤ H(a) ≤ log |A|,
obtaining
I(ω; s, a) ≤ κ + log |A|.
Thus the induced one-shot outcome ψ is feasible for the
information-cap persuasion problem with cap κ + log |A|. Combining this
feasibility with the same approximate-obedience argument as above yields
Proposition~2:
𝔼ψ[u(ω, a)] ≤ Uκ + log |A|⋆ + O(B δT).
The reduction shows that the learning requirement we impose—contextual no-swap regret—controls incentives conditional on the joint outcome (s, a). When the sender’s constraint is formulated on the physical message s alone, the sender can still benefit from the additional ``degrees of freedom’’ created by the receiver’s endogenous action, up to a limit determined by K|A| (cardinality) or κ + log |A| (information). In the next section we show that the resulting channel gaps can be Ω(1) in simple finite instances, so the expanded-channel benchmarks are not merely technical artifacts.
We now give explicit finite instances in which the ``expanded-channel’’ benchmarks in Propositions~1–2 are larger than the classical constrained commitment optima. The point is not that these instances are exotic; rather, they isolate a simple geometric fact: many persuasion objectives require (at least) |A| distinct posterior regions to attain the concavification optimum, while a small physical message set forces pooling. Once the realized action is allowed to function as an additional label, the relevant constraint becomes K|A| (or κ + log |A|), and a constant gap can reappear.
Consider the ``matching’’ instance
$$
\Omega=\{1,2,3\},\qquad A=\{1,2,3\},\qquad \mu_0(\omega)=\frac13\ \
\forall \omega,
$$
with aligned utilities (and hence B = 1):
v(ω, a) = 1{ω = a}, u(ω, a) = 1{ω = a}.
In words, the receiver chooses a label a, and both players get payoff 1 if the label matches the state.
With 3 signals, the sender can fully
reveal the state (one message per state). The receiver best responds by
choosing a = ω,
yielding expected payoff 1.
Therefore
U3⋆ = 1 ⇒ UK|A|⋆ ≥ U3⋆ = 1 when
K|A| ≥ 3.
In particular, when K = 2 and
|A| = 3, we have K|A| = 6 and hence U6⋆ = 1.
We claim that U2⋆ = 2/3, so
the channel gap is
$$
\mathrm{Gap}(2)=U^\star_{6}-U^\star_{2}=1-\frac23=\frac13.
$$
The lower bound U2⋆ ≥ 2/3 is
achieved by a simple partition scheme: send s = 1 when ω = 1, and send s = 2 when ω ∈ {2, 3}. Upon s = 1 the posterior is degenerate on
state 1, so the receiver plays a = 1 and is correct with
probability 1. Upon s = 2 the posterior is uniform on
{2, 3}, so any best response guesses
either 2 or 3 and is correct with probability 1/2. Thus the overall success probability
is
$$
\Pr[\text{correct}]=\frac13\cdot 1+\frac23\cdot\frac12=\frac23.
$$
For the matching upper bound U2⋆ ≤ 2/3,
observe that in this instance the receiver’s best response to any
posterior μ ∈ Δ(Ω) is to
choose a mode of μ (ties are
broken, but that only selects among modes). Hence any two-signal scheme
induces at most two actions as strict best responses across the two
posteriors. Concretely, let the two signals be s ∈ {1, 2} and let a(s) ∈ A be a
best-response action after signal s. Then the set {a(1), a(2)} contains at
most two actions, so there exists some action ā ∈ A that is never chosen.
But whenever ω = ā
occurs, the receiver’s realized action cannot equal ω (since ā is never played), implying that
the scheme fails in that state. Therefore, regardless of how the sender
randomizes signals as a function of ω,
$$
\Pr[\text{correct}]
\le 1-\Pr[\omega=\bar a]
=1-\frac13=\frac23.
$$
This proves U2⋆ = 2/3 and
hence establishes a constant Ω(1) message-budget gap when K < |A|.
This example is deliberately simple: persuasion is ``easy’’ absent constraints (full revelation is optimal), but two messages are not enough to elicit three distinct receiver actions in a posterior-optimal way. The concavification picture is that the value function g(μ) := maxa𝔼μ[1{ω = a}] = maxωμ(ω) has three linear pieces, and attaining the optimum at the uniform prior requires three supporting posteriors—one for each action. A K = 2 constraint forces pooling of at least two of these regions and creates a discrete loss of 1/3.
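The claim U2⋆ = 2/3 can also be verified by brute force: with a uniform prior and matching payoffs, the success probability of any two-message scheme equals ∑s maxω Pr[ω, s], which the following sketch maximizes over a grid of channels (illustrative code).
\begin{verbatim}
# Brute-force check that two messages cap the success probability at 2/3 in the matching
# instance (uniform prior over three states, receiver plays a mode of the posterior).
import itertools
import numpy as np

mu0 = np.full(3, 1.0 / 3.0)
grid = np.linspace(0.0, 1.0, 21)
best = 0.0
for p in itertools.product(grid, repeat=3):        # p[w] = Pr[s = 1 | omega = w]
    pi = np.column_stack([p, 1.0 - np.array(p)])   # 3 x 2 channel omega -> {1, 2}
    value = sum((mu0 * pi[:, s]).max() for s in range(2))   # sum_s Pr[s] * max_w mu_s(w)
    best = max(best, value)
print(best)                                        # 2/3 (attained, e.g., by pooling {2, 3})
\end{verbatim}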
We reuse the same matching instance with Ω = A = {1, 2, 3} and uniform prior, but now impose an information-cap constraint I(ω; s) ≤ κ in the one-shot benchmark.
If I(ω; s) = 0, then
s is independent of ω, and the receiver’s posterior
after observing s equals the
prior μ0. Any best
response then chooses a fixed action (up to tie-breaking) and is correct
with probability 1/3. Thus
$$
U^\star_{\kappa}\big|_{\kappa=0}=\frac13.
$$
In this instance, the entropy of the state is H(ω) = log 3 (with log in nats). The mutual information cannot
exceed H(ω), and full
revelation attains I(ω; s) = H(ω) = log 3.
Hence under cap log 3 full revelation
is feasible and yields value 1,
i.e.,
Ulog 3⋆ = 1.
Therefore, at κ = 0 we obtain
the constant gap
$$
\mathrm{Gap}(0)
=
U^\star_{\kappa+\log|A|}-U^\star_{\kappa}
=
U^\star_{\log 3}-U^\star_0
=
1-\frac13
=
\frac23.
$$
The same qualitative separation persists whenever κ is bounded away from log 3: the value Uκ⋆ is strictly below 1 because the receiver cannot reliably identify all three states with limited information, while Uκ + log |A|⋆ is already at (or near) the full-information optimum once κ + log |A| ≥ H(ω). For general (Ω, A, u, v), obtaining a sharp closed form for Uκ⋆ typically requires solving a convex program (e.g., via a Lagrangian/variational representation of mutual information), so our emphasis here is on the existence of an Ω(1) separation rather than on the exact functional form of Uκ⋆.
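As a simple numerical illustration of the information-cap benchmark in this instance, one can sweep a symmetric noisy channel that reveals the state with probability p and garbles it uniformly otherwise; this traces out a feasible (not necessarily optimal) value-versus-information curve, showing how the attainable value falls below 1 as I(ω; s) drops below log 3 (illustrative code, natural logarithms).
\begin{verbatim}
# Value vs. mutual information for a symmetric noisy channel in the matching instance.
import numpy as np

def entropy(q):
    q = np.asarray(q, dtype=float)
    q = q[q > 0]
    return float(-(q * np.log(q)).sum())

for p in np.linspace(1.0 / 3.0, 1.0, 5):
    # With a uniform prior the posteriors mirror the channel rows, the receiver's best
    # response to message s is action s, and the sender's value equals p.
    info = np.log(3) - entropy([p, (1 - p) / 2, (1 - p) / 2])   # I(omega; s) in nats
    print(f"p = {p:.2f}   I = {info:.3f} nats   value = {p:.2f}")
\end{verbatim}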
These two examples support a common interpretation. Define the
channel gaps
Gap(K) = UK|A|⋆ − UK⋆, Gap(κ) = Uκ + log |A|⋆ − Uκ⋆.
Both gaps quantify how sensitive the sender’s achievable value is to a relaxation
of an explicit attention constraint once the receiver’s action is
allowed to function as an auxiliary label. In that sense, Gap(K) and Gap(κ) behave like an ``attention
condition number’’: when they are large, controlling persuasion by
limiting message formats (small K) or raw information flow (small
κ) is fragile, because
endogenous behavior can recreate the missing labels/bits through the
joint outcome (s, a).
We also emphasize a limitation of the closed-form constructions above: they use aligned payoffs to keep the benchmarks transparent. The same separation logic extends to genuinely misaligned sender–receiver objectives (indeed, classic concavification arguments often make the need for multiple posterior regions even sharper), but the exact constants then depend on the geometry of the value function and are most cleanly illustrated numerically in small-state examples. In the next section we complement these existence results by showing that the expanded-channel upper bounds are not merely artifacts: there are instances in which an adaptive sender can attain (up to o(1)) the expanded-channel value against a standard no-swap-regret learner.
The separation constructions in the preceding section show that the benchmarks UK|A|⋆ and Uκ + log |A|⋆ can be larger than the classical constrained optima UK⋆ and Uκ⋆. We now explain why these expanded-channel values are not merely proof artifacts: for suitable finite instances, an adaptive (non-committing) principal can in fact attain them (up to vanishing loss) against a standard contextual no-swap-regret learner. Conceptually, the sender can translate a one-shot persuasion scheme on an enlarged alphabet into a repeated-time manipulation that uses the receiver’s endogenous action as an auxiliary label.
Fix a one-shot optimal solution for the expanded benchmark. Under a
message budget, let
π̃⋆ : Ω → Δ(S̃), |S̃| ≤ K|A|,
achieve UK|A|⋆,
with induced posteriors {μs̃}s̃ ∈ S̃
and receiver best responses a*(μs̃).
Since |S̃| ≤ K|A|, we may
index (possibly after relabeling) each ``virtual’’ signal as a
pair
s̃ = (s, a) ∈ [K] × A,
with the intended receiver response being precisely the second
coordinate a. The expanded
benchmark is then the value of a persuasion scheme in which the sender
can directly choose a label (s, a) and the receiver
best-responds by playing the recommended second coordinate. Our repeated-game objective is to make the joint outcome
(st, at)
behave as if it were drawn from this optimal virtual scheme.
Under a mutual-information cap, we similarly fix an optimal one-shot solution achieving Uκ + log |A|⋆, which can be written as a joint distribution over (ω, s̃) such that I(ω; s̃) ≤ κ + log |A| and the receiver’s action is (w.l.o.g.) a deterministic best response a(s̃). The interpretation is again that the joint object s̃ contains an ``action label’’ worth up to log |A| bits, over and above the raw message capacity.
The central difficulty is that the sender does not control the receiver’s action; the receiver chooses at after seeing st. The repeated-game leverage comes from two design choices that are absent in the one-shot commitment model: the sender is not committed and may change its signaling policy across rounds, and it observes the realized pair (st, at) (and hence the learner’s evolving response mapping ρt), so future policies can condition on the learner’s behavior.
We exploit these by running the interaction in phases (epochs). In each phase, the sender uses a fixed signaling rule designed to make a particular action a the uniquely (or margin-)optimal response to a particular physical message s for the receiver, given the phase’s induced empirical distribution of states conditional on s. When phases are long, a standard no-swap-regret algorithm behaves almost as if it were best-responding to each frequently occurring context—here, the relevant contexts are the joint labels (s, a) that occur in that phase. In effect, the realized action becomes a stable marker of the phase, so that conditioning on (st, at) selects (approximately) the intended posterior region from the target expanded scheme.
To keep notation transparent, consider the message-budget case with
|S| = K and target
virtual alphabet S̃ ⊆ [K] × A. For
each virtual label s̃ = (s, a) we
choose a phase policy πs, a : Ω → Δ([K])
with the following property: under πs, a,
the posterior induced by observing message s (call it μs, a)
makes a a strict receiver best
response,
a ∈ arg maxa′ ∈ A𝔼ω ∼ μs, a[v(ω, a′)], with
a positive margin.
Such strictness is without loss for achievability: if an optimal
expanded scheme involves ties, we can perturb payoffs by an arbitrarily
small amount (or mix in an arbitrarily small ``tie-breaking’’ signal) to
obtain a nearby instance in which best responses are unique while
changing all values by o(1).
The sender then cycles through phases indexed by s̃ = (s, a), with phase lengths L1, L2, … growing so that the number of phase changes is o(T). In phase (s, a), the sender plays πt = πs, a on every round in that phase. Intuitively, within a long phase the environment faced by the learner is close to stationary, so the learner’s swap regret within the subpopulation of rounds where message s is sent forces its realized action after s to be close to the best response to μs, a, i.e., close to a. More precisely, if we write 𝒯s, a for the set of rounds in the phase where the realized pair equals (s, a), then the no-swap-regret condition implies that, up to an additive O(CSReg(T)) term, the learner cannot improve by systematically remapping (s, a) to a different action. When a is strictly optimal for the posterior induced in that phase, this pins down the learner’s realized action distribution on those rounds.
Two additional ingredients make the construction track the target outcome distribution of the
virtual scheme:
(i) the sender allocates the fraction of horizon spent in each phase
(s, a) to match Prπ̃⋆[s̃ = (s, a)]
up to o(1), and
(ii) within each phase, the phase policy πs, a
is chosen so that the induced posterior μs, a
matches (or closely approximates) the target posterior μs̃
prescribed by π̃⋆ for s̃ = (s, a).
The latter step uses the usual Bayes-plausibility feasibility logic: we
design πs, a
so that, conditional on s, the
posterior is μs, a,
while the other K − 1 messages
absorb the remaining mass to preserve the prior μ0. This is a standard
``splitting’’ construction and can be implemented explicitly in finite
instances by solving a small linear program for each phase.
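A minimal version of this splitting step is sketched below, assuming the target posterior is absolutely continuous with respect to the prior and spreading the leftover probability mass uniformly over the remaining messages; the function name and interface are illustrative.
\begin{verbatim}
# Build a K-message phase policy whose posterior after message s equals a target posterior.
import numpy as np

def phase_policy(mu0, mu_target, K, s, scale=1.0):
    """Return pi of shape (|Omega|, K) with Pr[omega | s] = mu_target under prior mu0."""
    mu0 = np.asarray(mu0, float)
    mu_target = np.asarray(mu_target, float)
    likelihood = np.divide(mu_target, mu0, out=np.zeros_like(mu_target), where=mu0 > 0)
    c = scale / likelihood.max()             # largest feasible weight on message s
    pi = np.zeros((len(mu0), K))
    pi[:, s] = c * likelihood                # Pr[s | omega] proportional to mu_target / mu0
    for m in range(K):                       # spread leftover mass over the other messages
        if m != s:
            pi[:, m] = (1.0 - pi[:, s]) / (K - 1)
    return pi

pi = phase_policy([1/3, 1/3, 1/3], [0.0, 1.0, 0.0], K=2, s=0)
post = np.array([1/3, 1/3, 1/3]) * pi[:, 0]
print(post / post.sum())                     # recovers the target posterior [0, 1, 0]
\end{verbatim}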
Putting these pieces together, the empirical distribution of (ωt, st, at)
converges (in total variation, along a subsequence of T) to the target distribution
induced by the virtual scheme π̃⋆, and the principal’s
average payoff satisfies
$$
\frac1T\mathbb{E}\Big[\sum_{t=1}^T u(\omega_t,a_t)\Big]
\ge
U^\star_{K|A|}-o(1),
$$
where the o(1) term collects
(a) the receiver’s per-round swap-regret level δT, (b)
phase-transition losses, and (c) statistical fluctuations controlled by
martingale concentration (since ωt are i.i.d.).
This yields the instance-tightness claim in Proposition~4 for the
message-budget model.
The mutual-information setting is analogous, but we must ensure that each per-round policy πt satisfies Iμ0, πt(ω; s) ≤ κ. The phase-coding idea still applies: we pick a family of phase policies {π(m)}m each individually satisfying the information cap, run them in long phases, and rely on the receiver’s endogenous action to index phases, thereby creating a label (s, a) whose effective informativeness can reach κ + log |A|. Formally, within each phase m, the receiver’s action has entropy as high as H(at ∣ st) ≈ log |A| during the adaptation window and becomes nearly deterministic once the phase stabilizes; across phases, the stable action itself serves as an additional tag. Because the constraint is imposed on I(ω; s) rather than on I(ω; s, a), we can keep each π(m) within the cap while using phase variation to increase the informativeness of the realized joint outcome.
The same bookkeeping as above then yields an achievability
bound
$$
\frac1T\mathbb{E}\Big[\sum_{t=1}^T u(\omega_t,a_t)\Big]
\ge
U^\star_{\kappa+\log|A|}-o(1)
$$
for suitable instances (again, typically imposing a strict-best-response
margin to obtain stability of the learner’s mapping within phases).
In the message-budget model, both UK⋆ and UK|A|⋆ can be computed by linear programming in finite instances (equivalently by concavification). Concretely, one can optimize over a finite collection of posteriors {μs} and weights {Pr [s]} satisfying ∑sPr [s]μs = μ0, with the objective ∑sPr [s] ⋅ u(μs, a*(μs)). This makes the expanded benchmark operational: for small |Ω|,|A|,K, we can compute π̃⋆ directly and then feed its posteriors into the phase policies described above.
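A self-contained sketch of this computation (assuming scipy is available; the function name is illustrative) enumerates which action each message recommends and solves a small LP over the joint law Pr[s, ω] with obedience and prior-marginal constraints; on the matching instance it returns roughly 2/3 for K = 2 and 1 for K = 6.
\begin{verbatim}
# Compute the K-message commitment benchmark U*_K by enumerating message-to-action
# recommendations and solving a linear program over q(s, omega) = Pr[s, omega].
import itertools
import numpy as np
from scipy.optimize import linprog

def persuasion_value(mu0, u, v, K):
    n, m = u.shape                                         # |Omega|, |A|
    best = -np.inf
    for assign in itertools.product(range(m), repeat=K):   # action recommended by each message
        c = np.zeros((K, n))
        for s in range(K):
            c[s] = -u[:, assign[s]]                        # maximize sender payoff
        A_ub, b_ub = [], []
        for s in range(K):                                 # obedience: assign[s] beats every a'
            for a_alt in range(m):
                row = np.zeros((K, n))
                row[s] = v[:, a_alt] - v[:, assign[s]]
                A_ub.append(row.ravel()); b_ub.append(0.0)
        A_eq = np.zeros((n, K * n))                        # sum_s q(s, omega) = mu0(omega)
        for w in range(n):
            A_eq[w, w::n] = 1.0
        res = linprog(c.ravel(), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                      A_eq=A_eq, b_eq=np.asarray(mu0, float), bounds=(0, None), method="highs")
        if res.success:
            best = max(best, -res.fun)
    return best

mu0, u, v = np.full(3, 1/3), np.eye(3), np.eye(3)          # the matching instance
print(persuasion_value(mu0, u, v, K=2), persuasion_value(mu0, u, v, K=6))
\end{verbatim}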
Under mutual-information caps, the benchmark Uκ⋆
(and likewise Uκ + log |A|⋆)
is typically the value of a capacity-constrained information-design problem, but it rarely
admits a clean closed form outside of very symmetric cases. Practically,
one computes it via numerical optimization: for example, by optimizing
over conditional distributions π(s ∣ ω) with the
convex constraint I(ω; s) ≤ κ,
or via a Lagrangian formulation in which one searches over multipliers
λ ≥ 0 and solves
maxπ 𝔼[u(ω, a*(μs))] − λ I(ω; s),
then tunes λ to hit the
desired information level. Algorithms in the spirit of Blahut–Arimoto
(or more general convex solvers) are natural here. Our achievability
message is therefore qualitative but sharp: whenever Uκ⋆
must be obtained numerically, the numerical object—the optimal capped
channel—provides the natural target for the principal’s phased
implementation, and the resulting repeated-game value approaches the
expanded-capacity benchmark rather than the naive cap.
Economically, the achievability constructions underscore that attention constraints enforced on the message alphabet (bounded K) or on the outbound information flow I(ω; s) do not automatically control the informativeness of the realized joint outcome (s, a) once the receiver is a learning system whose policy evolves endogenously. At the same time, our tightness arguments rely on two modeling choices: sufficiently rich feedback for the learner to achieve contextual no-swap regret, and enough time (long phases) for the sender to ``train’’ the learner’s response mapping. In environments with severely delayed feedback, hard exploration constraints, or rapidly changing state distributions, the sender’s ability to implement the expanded benchmark may be substantially reduced. These considerations motivate the extensions we turn to next.
Although we have framed the analysis in the language of repeated Bayesian persuasion, the underlying mechanism is more general: whenever (i) the platform can adapt its information policy over time, and (ii) the agent is evaluated by a regret notion defined on the interaction data, the agent’s endogenous choice can become an auxiliary label that effectively enlarges the communicative alphabet. In this section we sketch four extensions that clarify what is essential for the action-induced channel expansion phenomenon, and what is an artifact of the baseline persuasion benchmark.
A convenient way to see the scope is to separate two ingredients in the persuasion model: Bayes plausibility (the sender can induce any distribution over posteriors consistent with μ0) and receiver optimality (the receiver best-responds to the posterior). Many constrained principal–agent problems have the same logic, but with different objects.
Suppose the principal chooses a signaling rule each round that maps the observed state into a message, and the agent then chooses an action that affects both parties’ utilities, with the same boundedness. The constraint on the principal may be combinatorial (finite message set, limited menu size) or informational (mutual-information cap), and it may also include auxiliary restrictions (e.g., monotonicity, disclosure rules, or regulatory ``do not say’’ constraints). If the agent’s learning guarantee is still contextual no-swap regret with respect to deviations d : S × A → A, then the proof of the expanded-channel bound goes through with essentially no change: the relevant one-shot relaxation is the commitment optimum under a channel whose effective label is (s, a), so the appropriate benchmark becomes the best commitment value achievable on that enlarged label space.
What changes relative to textbook persuasion is that, in a generic principal–agent problem, the receiver’s response to a posterior may not be a simple best reply to v(⋅, a); the agent may face additional feasibility constraints, adjustment costs, or a dynamic objective. The reduction is therefore best stated in terms of the agent’s optimal response correspondence to a context distribution. Concretely, if for each distribution ν over states the agent has a well-defined (possibly set-valued) optimal action correspondence BR(ν), then the swap-regret guarantee implies that, on frequently realized joint labels (s, a), the realized a must be approximately stable against deviations and hence approximately consistent with BR(νs, a) for the induced conditional state law νs, a. The platform’s achievable payoff is then bounded by the best one-shot value among mechanisms that directly choose labels in the joint space, subject to the same feasibility constraints on the principal’s side.
A useful rule of thumb is: any realized, observable choice with m possible values can contribute up to log m additional bits to the effective channel (or a factor-m expansion under a hard alphabet cap), provided the regret notion is defined on that realized choice. Actions are one instance; accept/reject decisions, selected tools, or chosen ``modes’’ of an assistant are others.
Real interfaces rarely have a single homogeneous message budget. Instead, the platform may face different disclosure affordances across contexts: different pages, different query types, or different user segments. We can model this by an exogenous context xt ∈ 𝒳 observed by both parties at round t, with either (a) a per-context budget |Sx| ≤ Kx, or (b) a shared global budget with a requirement that only K distinct messages can be used across all contexts.
With per-context budgets, the joint-signal logic applies each
context: the realized label is (xt, st, at),
and conditional on x the
effective alphabet is bounded by Kx|A|.
One therefore expects an upper bound of the form
$$
\frac{1}{T}\mathbb{E}\Big[\sum_{t=1}^T u(\omega_t,a_t)\Big]
\;\le\;
\sum_{x\in\mathcal{X}} \Pr[x]\cdot U^\star_{K_x|A|}(x)\;+\;O(\delta_T),
$$
where UKx|A|⋆(x)
denotes the one-shot persuasion value in the subproblem restricted to
context x (formally, the
concavification is done with the appropriate conditional prior μ0(⋅ ∣ x) if the
state distribution depends on x). The important point is that the
benchmark scales with Kx|A|
context-by-context, not with Kx.
With a shared global budget, the interaction across contexts becomes
more subtle. A single physical message label reused in many contexts can
be disambiguated by the agent’s context-dependent action rule, creating
a form of behavioral multiplexing. In practice, this is exactly what designers observe when a
single UI element (say, one generic warning) acquires different
behavioral meanings across workflows. Analytically, the natural object
is an enlarged alphabet on (x, s, a) whose
cardinality can be as large as
∑x ∈ 𝒳|S|⋅|A| = |𝒳| K|A|
if contexts are disjoint and effectively allow separate ``codebooks.’’
This suggests that global message caps may be considerably weaker than
they look once one permits context-dependent behavior, even holding
K fixed. The precise bound
depends on whether the budget restricts syntax (distinct strings) or semantics (distinct
meanings)—a distinction that is immaterial in one-shot persuasion but
becomes material in repeated interaction with learning.
A second family of extensions introduces normative or regulatory constraints on the platform’s signaling policy. Many such constraints can be expressed as convex restrictions on the induced joint distribution over observables, which makes them compatible with information-design methods.
For example, suppose the principal must choose π from a set Π defined by linear constraints (or more generally convex constraints) on the distribution of (ω, s), such as mandated disclosure of designated states, caps on the probability with which particular messages may be sent, or parity requirements across user groups. In a one-shot commitment problem, these restrictions simply change the feasible set in the concavification/LP. In the repeated no-commitment problem, the same reductions as in Propositions~1–2 still apply, but the feasibility set must be stated on the induced object the learner responds to. When the receiver’s swap-regret benchmark is defined on (s, a), the relevant relaxation is a constrained commitment problem on the enlarged label space, with feasibility determined by the distribution that π induces through the interaction.
Two caveats matter in applications. First, if the constraint is stated only on I(ω; s) or on the distribution of s conditional on ω, then it does not directly control the distribution of (s, a); the action-induced expansion can therefore bypass the intent of the constraint unless it is formulated on outcomes that include the agent’s response. Second, some fairness constraints are no longer convex once one conditions on the agent’s action (e.g., constraints on error rates of a downstream classifier), and in those cases the clean convex-analytic one-shot benchmark may not be available. The methodological lesson is that constraints meant to control informativeness or disparate impact should be written, where possible, as convex constraints on the relevant joint distribution—and the relevant joint distribution in a learning system is often defined over the joint outcome (s, a), not purely on messages.
Finally, the baseline model assumes the agent receives feedback rich enough to achieve contextual no-swap regret, and that the state distribution is stationary i.i.d. across rounds. Both assumptions are substantively important.
On the agent side, if feedback is bandit (only v(ωt, at) is observed) rather than full-information, swap regret is still achievable but typically at slower rates, which enlarges δT and weakens the quantitative force of the upper bounds. With delayed feedback, the difficulty is sharper: during the delay window the learner cannot update the mapping ρt, which both (i) raises regret and (ii) may prevent the principal from ``steering’’ behavior quickly. In such environments, the action-as-label channel may still exist in principle, but the principal’s ability to exploit it is limited by how quickly the learner can form stable context-action associations. From a modeling perspective, the cleanest way to incorporate these frictions is to treat δT as endogenous to the feedback model (and potentially bounded away from zero over practical horizons).
On the environment side, if ωt is nonstationary (e.g., drifting priors, regime changes, or adversarially chosen states), then both the definition and enforcement of per-round information constraints require care: mutual information is defined relative to a reference distribution, and Bayes plausibility arguments rely on stable priors. One can still formulate per-round constraints Iμt, πt(ω; s) ≤ κ relative to a time-varying μt, or impose worst-case information constraints over a set of possible priors, but the corresponding one-shot benchmarks become robust information-design problems. More importantly for our theme, the phase-based implementations that witness tightness become fragile when the data-generating process changes faster than the time needed for the learner’s policy to stabilize.
Taken together, these extensions suggest that the expanded-channel perspective is not tied to any one persuasion formalism; rather, it is a general reminder that regulating the outbound message channel may not regulate the effective communication channel once learning dynamics and endogenous actions are part of the observable transcript. The next section translates this into concrete design and auditing implications.
Our main results have an immediate regulatory interpretation: caps written solely on the outbound message channel are generally not caps on the effective communication channel once the agent’s endogenous action becomes part of the observable transcript and the agent’s learning objective is evaluated on that transcript. Under a hard alphabet restriction |S| ≤ K, Propositions~1 and~4 say that what matters for worst-case manipulation is not K but (roughly) K|A|, because the realized pair (st, at) can act like a composite label. Under a mutual-information cap I(ω; s) ≤ κ, Proposition~2 similarly indicates that the relevant cap for the induced channel is κ + log |A|. In other words, a rule that limits what the platform can say need not limit what the system can convey when the agent’s behavior is itself a measurable output of the interaction.
This observation clarifies what message limits guarantee. If the platform can commit to a stationary signaling policy, then standard persuasion logic applies and a cap such as |S| ≤ K or I(ω; s) ≤ κ is meaningful. Likewise, if the agent’s response is fixed exogenously (or if the agent commits to a mapping ρ that the platform cannot observe or exploit), then the action does not function as an auxiliary label. But in the learning-and-adaptation regimes that motivate platform governance—A/B testing, iterative prompt and UI tuning, adaptive disclosure, and sequential recommender design—the constraints must be interpreted as governing only one edge of a larger feedback system. Put bluntly, if a regulator hopes that limiting message richness will bound persuasion power, then the object that must be controlled is closer to the channel ω ↦ (s, a) than the pre-interaction channel ω ↦ s.
The same point yields a practical distinction between limiting messages and limiting meanings. A rule that restricts the number of message strings (a literal K) is easy to satisfy while still inducing a rich set of meanings through context- and history-dependent responses. By contrast, a rule that restricts the set of meanings would need to be stated in terms of the effect of messages on actions (and perhaps downstream outcomes), which is substantially harder to formalize but is closer to the economically relevant object.
From an engineering perspective, the action-induced expansion is not an inevitability; it is enabled by the particular stability notion we ask of the agent. Contextual no-swap regret is attractive because it is local, behavioral, and implementable under partial feedback, but it is also permissive: it enforces approximate internal consistency only with respect to deviations d : S × A → A evaluated on realized data. This is exactly the structure that allows the principal to treat (s, a) as the operative label.
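For concreteness, one standard way to write the guarantee for this deviation class (the exact normalization used in the formal sections may differ) is
\[
\max_{d:\,S\times A \to A}\ \frac{1}{T}\sum_{t=1}^{T}\Big( v\big(\omega_t, d(s_t, a_t)\big) - v\big(\omega_t, a_t\big) \Big) \;\le\; \delta_T ,
\]
with δT → 0 as T grows. The comparator reassigns actions as a function of the realized pair (st, at), which is exactly why that pair, and not the message alone, is the operative label.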
One design lever is therefore to modify the agent’s objective so that the relevant regret notion does certify optimality on the joint label (s, a). Several possibilities arise, each with tradeoffs that are familiar in online learning but take on a new interpretation here, for example strengthening the learning criterion toward policy regret, counterfactual stability, or calibrated forecasting coupled to conservative action selection; we return to these tradeoffs among the open questions below. We do not claim a universal fix: stronger learning objectives can be expensive in sample complexity, require more feedback than is available, or lead to conservatism that users dislike. The key implication is diagnostic: when an assistant is trained or evaluated by a regret metric defined on realized transcripts, we should expect that transcripts themselves become a manipulable communication channel. If the goal is to limit persuasion power, one must align the assistant’s learning criterion with that governance goal, rather than assume that ``no regret’’ is automatically protective.
Because the phenomenon is fundamentally about an expanded effective channel, audits should target effective information flow and effective persuasion value rather than stated message constraints. Concretely, a useful audit measures the realized transcript rather than the stated interface, for example by estimating how much information the joint pair (s, a) carries about the state relative to the message alone, and how much of the one-shot commitment value the principal actually attains; a minimal sketch of the first measurement follows below. These recommendations also help clarify what regulators can realistically write into rules. A message budget K is easily stated but may be systematically circumvented by behavioral multiplexing. By contrast, restricting adaptivity (e.g., limiting how quickly πt may change), requiring disclosure of policy changes, or imposing constraints on the joint outcome distribution of (s, a) are closer to the operative objects in the repeated setting, even if they are more demanding to specify and enforce.
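The sketch below is illustrative only: variable names and the plug-in estimator are our own choices, not part of the formal model. It compares the realized information carried by the message alone with the information carried by the joint pair (s, a) from logged rounds; a large gap between the two estimates is a direct symptom of behavioral multiplexing.

from collections import Counter
from math import log2

def plugin_mi(xs, ys):
    # Plug-in estimate of mutual information I(X; Y) in bits from paired samples.
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * log2((c * n) / (px[x] * py[y]))
               for (x, y), c in joint.items())

def audit_gap(states, msgs, acts):
    # Returns (estimate of I(omega; s), estimate of I(omega; (s, a))).
    return plugin_mi(states, msgs), plugin_mi(states, list(zip(msgs, acts)))

# Toy transcript: the message alone is uninformative, but the (message, action)
# pair reveals the state, so the second estimate exceeds the first.
states = [0, 1, 0, 1, 0, 1, 0, 1]
msgs = ["s1"] * 8
acts = ["buy", "wait", "buy", "wait", "buy", "wait", "buy", "wait"]
print(audit_gap(states, msgs, acts))   # approximately (0.0, 1.0)

In practice the same comparison would be run on windows of production logs, with the usual caveats about finite-sample bias of plug-in estimators.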
We close this discussion with a limitation that is also an opportunity: our analysis isolates a worst-case mechanism that exploits internal-consistency learning, but it does not by itself rank alternative learning rules on welfare grounds, nor does it provide a complete blueprint for regulation under partial observability, multi-agent externalities, or strategic user populations. What it does provide is a sharp warning: if governance focuses on the surface-level message channel while ignoring how agents learn from and act on those messages, then the system may satisfy the letter of an attention constraint while violating its spirit.
We studied a repeated persuasion environment in which a non-committing principal adapts disclosures over time and a learning agent updates behavior using contextual no-swap regret. The central conceptual takeaway is that attention constraints written on the sender’s channel do not, by themselves, control the sender’s ability to communicate once the receiver’s realized action becomes part of the publicly observable transcript. Formally, when the principal can condition on the learner’s current response mapping and can vary policies over time, the relevant object is the induced joint channel ω ↦ (s, a). This joint channel has a larger alphabet (under a message budget) or larger effective capacity (under a mutual-information cap) than the original ω ↦ s channel. Our upper bounds make this precise by relating the principal’s long-run payoff to one-shot commitment benchmarks computed under expanded resources, while our instance constructions show that the resulting ``expanded-channel’’ benchmarks are not merely artifacts of analysis but can be approached by explicit adaptive strategies against standard no-swap-regret learners.
At a higher level, the results offer a sharp lens on a recurring tension in platform and AI governance. On the one hand, attention constraints are a natural regulatory primitive: they are legible, enforceable on interfaces, and often correlate with user welfare in static settings. On the other hand, adaptive systems create feedback loops in which the receiver’s behavior carries information and thereby becomes part of the communication protocol. Our analysis identifies one minimal mechanism for this phenomenon—internal-consistency learning evaluated on realized transcripts—and shows that it is sufficient to reopen a constant manipulability gap even when the learner is ``no regret’’ in a strong, contextual swap sense. In that sense, the model illuminates a tradeoff: learning guarantees that are attractive for performance and robustness in generic online environments may be exactly the guarantees that enlarge the sender’s feasible persuasion set when the environment is strategic and adaptive.
Several open questions follow naturally. We view them as opportunities to connect persuasion, information theory, and online learning more tightly, and also as a guide for what future governance proposals should be explicit about.
Our timing assumption allows the principal to observe the receiver’s response mapping ρt before choosing πt, which cleanly captures settings where the platform can infer (or directly observe) current policy parameters, prompt templates, or behavioral rules. In practice, observability may be partial: the platform might only see realized actions, or only lagged aggregates, or might face privacy and access controls. It remains open to characterize how much of the expanded-channel effect survives under weaker observability models. A natural conjecture is that the phenomenon is robust whenever the principal can estimate Pr [a ∣ s] quickly enough to implement ``phase coding,’’ but the quantitative bounds (and the relevant expansion term, e.g. log |A|) may change with statistical constraints.
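To make the conjecture concrete, the statistical primitive in question is just a windowed estimate of the learner’s current response mapping; the following minimal sketch (our own illustration, with placeholder window and smoothing choices) shows what a principal with transcript access alone could compute.

from collections import defaultdict

def estimate_response_map(transcript, n_actions, window=200, alpha=1.0):
    # Laplace-smoothed estimate of Pr[a | s] from the last `window` rounds of
    # logged (message, action) pairs; actions are indexed 0..n_actions-1.
    counts = defaultdict(lambda: [alpha] * n_actions)
    for s, a in transcript[-window:]:
        counts[s][a] += 1.0
    return {s: [c / sum(row) for c in row] for s, row in counts.items()}

# Example: after enough rounds the estimate tracks the learner's current
# (possibly mixed) response, which is the quantity phase-coding strategies rely on.
transcript = [("s1", 0)] * 90 + [("s1", 1)] * 10 + [("s2", 2)] * 50
print(estimate_response_map(transcript, n_actions=3, window=150))

How quickly such estimates stabilize, relative to how quickly the learner itself adapts, is exactly the statistical constraint referred to above.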
We adopted a stationary i.i.d. prior μ0 and enforced signaling constraints per round. Both are analytically convenient and policy-relevant, but many applications involve drift, Markov structure, or adversarially selected states. If ωt is persistent, then history itself carries state information and can either amplify or dampen the sender’s ability to multiplex through (s, a). Likewise, many governance tools are cumulative (e.g. a total privacy or information budget) rather than per-round. An important extension is to replace I(ω; s) ≤ κ with a horizon budget $\sum_{t=1}^T I(\omega_t;s_t)\le \kappa_{\mathrm{tot}}$ or a constraint on $I(\omega_{1:T}; s_{1:T})$. In such models, it is unclear whether the right ``action expansion’’ term scales as T log |A| or whether tighter intertemporal inequalities can limit how much information actions can contribute.
Our bounds are stated for the principal’s payoff and use a tie-breaking convention in the principal’s favor in the one-shot benchmark. While standard in persuasion, this highlights that the expanded-channel effect is fundamentally about the principal’s attainable payoff and not necessarily about the receiver’s welfare. In applications, the agent may be a user-facing assistant with objectives aligned (or misaligned) with end users, and welfare comparisons may depend on how v relates to user outcomes. A useful next step is to characterize the joint welfare frontier achievable in the repeated game under learning dynamics, and to identify when expanded-channel persuasion increases, decreases, or redistributes welfare. In particular, can one bound the receiver’s welfare loss (relative to the best commitment outcome under the constraint) as a function of δT and the channel gap, or can the principal’s gain come ``for free’’ from the receiver’s perspective in some environments?
We emphasized that contextual no-swap regret is permissive in exactly the way a strategic principal can exploit. This suggests a more general research program: treat the receiver’s learning objective as a design lever that shapes the effective persuasion set. Which online guarantees are ``persuasion-robust’’ in the sense that they prevent the sender from leveraging (s, a) as a larger alphabet? Are there learning dynamics that preserve good performance in benign environments while limiting exploitation by adaptive senders? This question sits between learning theory and implementation: stronger notions (e.g. policy regret, counterfactual stability, calibrated forecasting coupled to conservative actions) may require more feedback or induce inefficiencies. Mapping these tradeoffs in a persuasion-aware way remains open.
Our expansion terms are clearest for finite A, yielding a log |A| contribution under information constraints. Many systems choose among structured, high-dimensional actions: ranked lists, multi-field responses, or continuous control signals. In such settings, H(a) can be large or even unbounded absent regularization. One direction is to model a cost of action complexity (e.g. entropy penalties, communication costs, or Lipschitz constraints) and ask how such costs interact with persuasion and learning. Another is to treat downstream behavior (clicks, dwell time, user follow-through) as part of the action channel; then the ``auxiliary label’’ is not the assistant’s action alone but the entire behavioral footprint, raising measurement and governance questions that go beyond the interface.
Even if one accepts $U^{\star}_{K|A|}$ and $U^{\star}_{\kappa+\log|A|}$ as the right worst-case yardsticks, computing them can be nontrivial at scale. Under message budgets, one-shot persuasion reduces to a linear program over posteriors when |Ω| is moderate, but the number of variables grows with the signal alphabet. Under mutual information constraints, the optimization is convex but often requires variational approximations or numerical methods. An open computational question is whether the expanded-channel benchmarks admit efficient approximations with explicit error guarantees in large problems, and whether such approximations can be integrated into practical audit pipelines.
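For the message-budget case, the following minimal sketch (our own illustration, not the paper’s code) computes the standard one-shot benchmark in obedient-recommendation form, assuming finite Ω and A and a signal budget large enough that signals can be identified with recommended actions (K ≥ |A|); the prior and payoff matrices are placeholders.

import numpy as np
from scipy.optimize import linprog

def persuasion_value(mu0, u, v):
    # Max sender payoff over obedient recommendation schemes.
    # mu0: (n_states,) prior; u, v: (n_states, n_actions) sender/receiver payoffs.
    # Decision variable x[w, a] = Pr(state = w, recommendation = a), flattened row-major.
    n_w, n_a = u.shape
    c = -u.flatten()  # linprog minimizes, so negate the sender objective.

    # Consistency: sum_a x[w, a] = mu0[w] for every state w.
    A_eq = np.zeros((n_w, n_w * n_a))
    for w in range(n_w):
        A_eq[w, w * n_a:(w + 1) * n_a] = 1.0

    # Obedience: for every recommendation a and deviation a',
    # sum_w x[w, a] * (v[w, a] - v[w, a']) >= 0, written in <= 0 form.
    rows = []
    for a in range(n_a):
        for a_dev in range(n_a):
            if a_dev == a:
                continue
            row = np.zeros(n_w * n_a)
            for w in range(n_w):
                row[w * n_a + a] = -(v[w, a] - v[w, a_dev])
            rows.append(row)
    A_ub = np.array(rows)
    b_ub = np.zeros(len(rows))

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=mu0,
                  bounds=[(0, None)] * (n_w * n_a), method="highs")
    return -res.fun

# Toy instance: 2 states, 3 actions (all numbers illustrative).
mu0 = np.array([0.6, 0.4])
u = np.array([[1.0, 0.5, 0.0], [1.0, 0.2, 0.0]])   # sender prefers action 0 in both states
v = np.array([[0.0, 0.4, 1.0], [1.0, 0.4, 0.0]])   # receiver payoffs depend on the state
print(persuasion_value(mu0, u, v))

For K < |A|, one can solve the same LP restricted to each size-K subset of recommendable actions (keeping obedience against all deviations) and take the best, which is one concrete place where the combinatorial growth mentioned above appears.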
Our model is intentionally minimal: one sender, one receiver, and a regulator that sets K or κ. Real platforms face competition, multi-agent externalities, and strategic users who may adapt their own behavior. Introducing multiple senders (e.g. advertisers) raises the possibility that one sender’s manipulation operates through another sender’s induced action distribution, complicating channel accounting. Endogenizing the platform’s objective (e.g. revenue plus compliance costs) also invites mechanism-design questions about which constraints are incentive-compatible to enforce and how they shift equilibrium investment in experimentation and personalization.
A unifying open problem is to identify constraint classes that remain meaningful once we account for feedback through actions. The message budget |S| ≤ K and the cap I(ω; s) ≤ κ are not closed in this sense because the effective channel can be larger. What constraints on the mapping ω ↦ (s, a), or on the adaptivity of πt, are both (i) enforceable in practice and (ii) stable under repeated interaction with learning receivers? Answering this would provide a principled foundation for governance proposals that aim to limit persuasion power without relying on fragile, surface-level restrictions.
We conclude with a methodological point. The analysis here is deliberately worst-case: it shows that a broad and appealing class of learning guarantees does not, by itself, immunize agents against strategic information design under attention constraints. The natural next step is not to abandon no-regret learning, but to integrate persuasion considerations into how we specify objectives, what we measure, and which system interfaces we treat as the relevant ``channel.’’ Doing so will require both sharper theory—to understand when expansion is inevitable and when it can be prevented—and careful engineering—to ensure that the guarantees we can implement correspond to the governance goals we actually have.