Meta-Learned Mixture-of-Annotators for Unsupervised Meta-Learning: Tight Bounds via Pairwise Constraint Channels

Table of Contents

  1. Introduction and Motivation: why task construction should fuse multiple weak signals in 2026; limitations of single-source clustering in DHM-UHT; contributions and guarantees.
  2. Problem Setup: tasks/domains, latent labels, weak annotators as pairwise channels; formal objectives for unsupervised meta-learning with dynamic adapters.
  3. From Pseudo-Labels to Pairwise Constraints: representing clustering, captions, metadata, and temporal continuity as constraint graphs; avoiding label-permutation issues in multi-source fusion.
  4. Algorithm: MetaMix-UHT: bilevel meta-learning with (i) domain-conditioned mixture weights over annotators, (ii) graph partitioning to construct pseudo-labels, (iii) dynamic per-task adapters, and (iv) outer-loop objectives.
  5. Statistical Theory: (a) identifiability of annotator reliabilities/mixture weights; (b) tight recovery bounds under planted partition; (c) translating misclustering to few-shot error after adaptation.
  6. Computational Theory: complexity of task construction (graph building + partitioning); approximation algorithms; hardness via correlation clustering; conditions enabling polynomial-time recovery in planted models.
  7. Experimental Design (to strengthen claims): domain shift (DomainNet-like), web-metadata corruption, long-tail rare classes; ablations showing learned weights shift across domains and robustness improves.
  8. Related Work: DHM-UHT, CACTUs/UMTRA/PsCo; weak supervision (Dawid–Skene, Snorkel); multi-view clustering; VLM/LLM pseudo-labeling; meta-learning robustness.
  9. Discussion and Limitations: correlated annotators, adversarial metadata, scaling and privacy; future directions (online mixture learning, active selection of annotator queries).

Content

1. Introduction and Motivation: why task construction should fuse multiple weak signals in 2026; limitations of single-source clustering in DHM-UHT; contributions and guarantees.

In 2026, the prevailing operational constraint for representation learning is not the absence of supervision per se, but the abundance of weak signals whose reliability is both nonstationary and difficult to calibrate. Contemporary data pipelines routinely provide multiple side channels—metadata fields, temporal co-occurrence, product or user graphs, near-duplicate detectors, retrieval indices, and foundation-model judgements—each of which supplies partial information about whether two examples belong to the same latent class. None of these channels is uniformly trustworthy: a heuristic that is precise for one subpopulation may be brittle for another; an embedding-based nearest-neighbor graph may be informative in one domain and nearly random in another; a large language model may be consistent on short texts but unstable on noisy OCR. The central methodological question is therefore not whether to use weak signals, but how to combine them in a manner that is adaptive to domain shift, robust to noise, and compatible with episodic training objectives that target downstream few-shot adaptation.

A common failure mode of unsupervised clustering pipelines is the implicit assumption that one can select a single ``best’’ source of similarity or constraints and then treat the resulting pseudo-labels as if they were ground truth. This assumption is particularly fragile in the regime we care about: tasks are small, class cardinality varies, and the latent semantics of clusters drift across domains. In such conditions, a single-source graph (or a single distance function) can be simultaneously under- and over-constrained: it may miss many true within-class links (leading to fragmentation) while also introducing systematic false links (leading to merges that are hard to undo). Moreover, even when the source is unbiased on average, its realized noise in a finite episode can be adversarial relative to the clustering objective. Consequently, the downstream learner may be forced to adapt to pseudo-label artifacts rather than to the latent structure that matters at evaluation.

The limitations of the single-source paradigm appear sharply in methods of the DHM-UHT type, where task construction is effectively driven by one dominant heuristic (e.g., a fixed similarity computed from the current representation) and where the meta-learner is then asked to absorb the discrepancy between pseudo-labels and latent labels. We may phrase the issue in the language of pairwise channels: if one uses only one annotator (or one constraint generator), then the induced graph can be below the recovery threshold in some domains even when other available sources would collectively suffice. When this happens, no amount of downstream adaptation can recover the missing information, because the pseudo-task itself is mis-specified at the level of cluster membership. Empirically, such methods either collapse to overly coarse partitions (few large clusters) or devolve to nearly singleton clusters, both of which can yield deceptively low inner-loop loss while harming transfer. The more subtle issue is that a single-source constructor provides no principled mechanism to downweight a source when its domain-conditional reliability degrades; indeed, the constructor is typically coupled to the representation, creating feedback loops in which early representation errors corrupt the pseudo-labels, which then steer representation learning in the wrong direction.

The remedy we pursue is conceptually simple: since we have multiple weak signals, we should combine them; since their quality depends on the task/domain, the combination should be adaptive; and since the ultimate goal is not clustering accuracy on the training data but downstream performance after adaptation, the combination should be learned jointly with the representation. Concretely, we treat each weak signal as an annotator producing pairwise constraints and we meta-learn a mixture over annotators. The mixture is computed by a gating mechanism conditioned on task descriptors (which may include side information and statistics derived from the current representation), and it outputs weights in the simplex. The mixed constraint matrix defines a pseudo-task partition, which is then used to adapt a task-specific head. This construction separates concerns: the shared backbone is trained to produce representations for which the available weak channels become easier to reconcile, while the dynamic adapter absorbs the inevitable task-wise variability in pseudo-label geometry and class count.

Two technical considerations motivate the specific architecture of a mixture-based task constructor. First, pseudo-labels must be permutation-invariant: there is no consistent label identity across tasks, and the weak sources do not agree on label names. Pairwise constraints and partitions avoid this ambiguity. Second, the task constructor contains inherently non-differentiable components (notably graph partitioning), so the outer objective must be formulated to provide informative gradients to the gating network and the backbone without relying on differentiating through the partitioner. This leads naturally to outer-loop surrogates based on consistency across views/sources and on downstream performance of the adapted head. The result is a bilevel procedure in which we use the partitioner as a black box while still learning to choose mixtures that, in expectation, produce high-quality partitions.

Our contributions are therefore organized around the question: when does such adaptive fusion provably help, and when is it information-theoretically impossible? Concretely, we establish identifiability of annotator reliabilities and mixture weights from unlabeled pairwise observations, tight recovery bounds for the mixed constraint graph under planted partitions, a translation of misclustering into post-adaptation few-shot error, and matching hardness and impossibility results for regimes in which fusion cannot help.

These results justify, at a structural level, why multi-source fusion is not merely an engineering trick but a necessary ingredient for reliable unsupervised task construction. The identifiability requirement highlights a point that is often glossed over in practice: if all weak signals are derived from the same underlying heuristic (or from a single embedding model with minor perturbations), then they may be too correlated to permit calibration, and mixture learning degenerates into reweighting noise. Conversely, when weak sources arise from genuinely distinct mechanisms (e.g. text similarity, temporal adjacency, user behavior co-occurrence), their conditional independence is a reasonable approximation, and cross-source disagreement becomes informative rather than problematic.

Finally, we emphasize the role of dynamic adapters in stabilizing the meta-learning objective. Even when the mixed constraints are above threshold on average, finite-task fluctuations and heterogeneous cluster counts can produce episodic partitions that are imperfect. A task-specific head trained in the inner loop can accommodate such imperfections without forcing the shared representation to overfit to idiosyncratic pseudo-label artifacts. In effect, the backbone is incentivized to learn features that are broadly compatible with the family of pseudo-tasks induced by the mixture policy, while the adapter handles the per-task instantiation. This division of labor is essential for the end-to-end guarantees: it is precisely what allows us to translate improvements in partition quality (via mixture learning) into improvements in downstream risk, rather than into brittle memorization of the pseudo-label generator.

The remainder of this work makes these claims precise. We next define the episodic setting, the weak annotator model as pairwise channels, and the bilevel objective for unsupervised meta-learning with dynamic adapters, after which we describe the algorithm and state the accompanying guarantees and lower bounds.


2. Problem Setup: tasks/domains, latent labels, weak annotators as pairwise channels; formal objectives for unsupervised meta-learning with dynamic adapters.

We work in an episodic, domain-shifted unsupervised meta-learning setting. Let 𝒟 denote a (finite or countable) set of domains, and let d ∈ 𝒟 be a latent or observed domain identifier. For each d we posit an unlabeled distribution Dd over examples with optional side information, (x, m) ∼ Dd. A task (or episode) is a batch
$$ T = \{(x_i, m_i)\}_{i=1}^{n} $$
obtained by first drawing d ∼ π for some meta-distribution over domains and then sampling $(x_i,m_i)\stackrel{\mathrm{i.i.d.}}{\sim}\mathbf{D}_d$. Each task admits latent ground-truth class labels yi ∈ [r], where the number of classes r = r(T) may vary with the task (and hence with d). We emphasize that during training we never observe yi; labels are used only for evaluation of downstream performance.

The learner has access to K weak supervision sources (``annotators'') $\{\lambda_k\}_{k=1}^K$. In task $T$, annotator $k$ returns a (possibly sparse) binary pairwise constraint matrix
$$ A^{(k)}(T)\in\{0,1\}^{n\times n},\qquad A^{(k)}_{ij}=1 \ \text{interpreted as a must-link suggestion,} $$
and $A^{(k)}_{ij}=0$ interpreted as a cannot-link suggestion. For convenience we take $A^{(k)}_{ii}=1$ and $A^{(k)}_{ij}=A^{(k)}_{ji}$; none of our conclusions depend on these conventions. When annotator outputs are sparse, we may regard $A^{(k)}$ as defined on an edge set $E^{(k)}\subseteq [n]\times[n]$ together with a missingness mask; equivalently, we consider an observed graph and treat unobserved pairs as ``no vote.'' We do not require that all annotators provide constraints on the same edge set.

To connect to standard identifiability and recovery analyses, we assume a pairwise channel model: for each domain d and annotator k, there exist parameters
$$ p_k(d) = \Pr\bigl(A^{(k)}_{ij} = 1 \mid y_i = y_j,\, d\bigr), \qquad q_k(d) = \Pr\bigl(A^{(k)}_{ij} = 1 \mid y_i \neq y_j,\, d\bigr), $$
with pk(d) ≠ qk(d) for at least one k in each domain. In the regimes of interest we typically have pk(d) > qk(d), but we allow arbitrary ordering in the definition and let the algorithm learn to downweight or invert unhelpful sources via calibration. We further assume that, conditional on the latent pairwise truth 𝟙[yi = yj] (and on d), the random variables $\{A^{(k)}_{ij}\}_{k=1}^{K}$ are independent across annotators. This conditional independence is the structural hypothesis that makes it possible to estimate relative reliabilities from unlabeled disagreements; it is also the minimal condition under which ``more sources'' can be informationally better than ``one source.''
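As a point of reference for this channel model, the following minimal NumPy sketch (our own illustration; the function name and parameter values are not from the text) samples K conditionally independent pairwise annotators for one task with planted labels.

```python
import numpy as np

def sample_annotators(y, p, q, rng):
    """Sample K conditionally independent pairwise channels.

    y    : (n,) latent class labels of the task
    p, q : (K,) within-class / cross-class vote probabilities p_k(d), q_k(d)
    Returns a list of K symmetric {0,1} matrices A^(k) with unit diagonal.
    """
    n, K = len(y), len(p)
    same = (y[:, None] == y[None, :])          # Y_ij = 1[y_i = y_j]
    annotators = []
    for k in range(K):
        probs = np.where(same, p[k], q[k])     # Pr(A_ij = 1 | Y_ij, d)
        A = (rng.random((n, n)) < probs).astype(int)
        A = np.triu(A, 1); A = A + A.T         # symmetrize
        np.fill_diagonal(A, 1)                 # convention A_ii = 1
        annotators.append(A)
    return annotators

rng = np.random.default_rng(0)
y = rng.integers(0, 5, size=60)                # r = 5 latent classes (illustrative)
A_list = sample_annotators(y, p=np.array([0.8, 0.7, 0.55]),
                              q=np.array([0.1, 0.2, 0.45]), rng=rng)
```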

The representation learner is composed of a shared backbone and a per-task adapter. The backbone is a map
$$ f_\theta : \mathcal{X} \to \mathbb{R}^{d}, \qquad z_i = f_\theta(x_i), $$
shared across tasks, and the adapter/head hϕ is reinitialized per task. The adapter may be a linear classifier, a prototype module, or any parametric family suitable for fast inner-loop fitting; we only require that it can be trained using pseudo-supervision induced from the weak constraints and that it supports an inner adaptation operator. Given a task T and a pseudo-partition Π over [n] (to be defined presently), we define the inner update abstractly as
$$ \phi' = \mathrm{Adapt}(\phi;\, T, \Pi), $$
for example one or a few gradient steps on an inner loss $\mathcal{L}_{\mathrm{in}}$:
$$ \phi' = \phi - \alpha \nabla_{\phi}\, \mathcal{L}_{\mathrm{in}}(f_\theta, h_\phi, T, \Pi). $$
The point of introducing $\phi'$ explicitly is that task heterogeneity is not merely nuisance noise: both the effective label geometry (number of clusters, cluster balance) and the mapping from representation space to decision regions can change substantially across episodes, and we do not wish to force the backbone to encode these variations.

Task construction proceeds by mixing the K weak sources. A gating policy with parameters ψ produces mixture weights in the simplex,
$$ w_\psi(T) \in \Delta_K, \qquad w_\psi(T) = \mathrm{softmax}\bigl(g_\psi(d_T)\bigr), $$
where dT is a task descriptor (potentially including observed domain metadata, summary statistics of {zi}, or properties of the annotator graphs). Given w = wψ(T), we form a combined weighted constraint matrix
$$ S(T) \;=\; \sum_{k=1}^K w_k\,\mathrm{Cal}\!\left(A^{(k)}(T)\right), $$
where Cal(⋅) denotes an optional calibration transform (e.g. debiasing, log-odds reweighting, or normalization to correct for annotator sparsity). The combined matrix S is then passed to a (possibly non-differentiable) partitioning routine Part which returns a partition Π of [n] into a variable number of clusters,
Π(T; θ, ψ) = Part(S(T)).
We allow Part to include outlier handling (assigning some indices to a reject set) and to operate on sparsified graphs; these are algorithmic choices that do not alter the mathematical role of Π as the pseudo-label object consumed by the inner loop.

Our end goal is downstream performance under the latent labels after fast adaptation. Formally, if $\mathcal{R}(\theta,\psi;T)$ denotes the episodic risk of the adapted model with respect to the latent labels (e.g. classification error on a held-out query split within the episode), then the ideal bilevel objective is

$$ \min_{\theta,\psi}\;\mathbb{E}_{T}\bigl[\mathcal{R}(\theta,\psi;T)\bigr], \qquad \text{with } \phi'(T)=\mathrm{Adapt}\bigl(\phi;\,T,\,\Pi(T;\theta,\psi)\bigr), $$

where the expectation is taken over the task-generation process (including domains) and over annotator randomness. Since $\mathcal{R}$ depends on unobserved labels, it cannot be optimized directly in training.

Accordingly, we replace $\mathcal{R}$ by an unsupervised outer surrogate $\mathcal{L}_{\mathrm{out}}$ which is computable from (T, Π, {A(k)}) and encourages pseudo-tasks that are both learnable in the inner loop and consistent with the weak sources. The general form of our training criterion is

$$ \min_{\theta,\psi}\;\mathbb{E}_{T}\Bigl[\mathcal{L}_{\mathrm{out}}\bigl(\theta,\psi;\,T,\,\Pi(T;\theta,\psi),\,\{A^{(k)}(T)\}\bigr)\Bigr], $$

with $\phi'(T)$ obtained by inner adaptation on $\mathcal{L}_{\mathrm{in}}$. The precise choice of $\mathcal{L}_{\mathrm{out}}$ is not essential at the level of the problem definition; what matters is that it correlates with the latent task quality induced by Π, and hence with the effective separation (signal-to-noise) of the mixed channel. The mixture policy and the representation are thus trained jointly: ψ controls the task constructor through wψ(T), while θ affects both the descriptor used by the gate and the performance of the adapted head on the pseudo-task.

Finally, we state explicitly the two evaluation modes that motivate this setup. First, in partition recovery, one uses S(T) and Part (optionally with representation-derived affinities) to recover a partition intended to approximate the latent clustering. Second, in few-shot adaptation, one uses Π(T) to fit the adapter hϕ and then evaluates on latent labels. The purpose of meta-learning is to choose (θ, ψ) so that, across domains, the induced pseudo-partitions are sufficiently accurate for both modes, and so that the adapter absorbs residual episodic variation without forcing the backbone to memorize artifacts of any single weak source.


3. From Pseudo-Labels to Pairwise Constraints: representing clustering, captions, metadata, and temporal continuity as constraint graphs; avoiding label-permutation issues in multi-source fusion.

The central design choice in our setting is to represent all weak task constructors—including those that naturally emit ``clusters,'' ``similarities,'' or ``rules''—as pairwise constraint graphs. This unification serves two purposes. First, it provides a common currency in which heterogeneous sources can be compared, calibrated, and mixed. Second, it avoids label-permutation pathologies: when sources output discrete labels or cluster identifiers, those identifiers are not aligned across sources (nor across tasks), whereas pairwise relations are intrinsically permutation-invariant.

Given any partition (clustering) Π of [n], we can associate to it the binary matrix
$$ A^{\Pi}_{ij} = \mathbb{1}[\, i \sim_\Pi j \,], $$
where $i \sim_\Pi j$ denotes that i and j lie in the same block of Π. The map Π ↦ AΠ is invariant to any relabeling of the blocks of Π; in particular, there is no label alphabet to align. Conversely, any binary matrix A ∈ {0, 1}n × n can be interpreted as a set of suggestions about which pairs should be co-clustered, even if A is not transitive (as will be the case under noise). Recovering a partition from such suggestions is precisely the role of Part in our pipeline.

It is often conceptually cleaner to regard an annotator output as a labeled graph G(k) = (V, E(k), a(k)) with V = [n], where E(k) ⊆ {(i, j) : 1 ≤ i < j ≤ n} is the set of queried/returned pairs and aij(k) ∈ {0, 1} is the vote on that edge; missing edges are abstentions rather than ``cannot-link'' votes. This distinction matters in sparse regimes: if we were to treat an unobserved pair as 0, then sparsity would act as an overwhelming negative bias. Formally, we may write Aij(k) only for (i, j) ∈ E(k), and define the combined score Sij only on the union edge set E = ∪kE(k), with default value 0 (no evidence) off E.

Many practical weak sources naturally produce pseudo-labels (cluster ids) rather than pairwise votes. Examples include: a caption-based clustering routine, a metadata deduplication system that assigns record ids, or a heuristic segmenter that assigns temporal track ids. Any such source can be converted to pairwise constraints by setting A(k) = AΠ̂(k) for the emitted pseudo-partition Π̂(k). If the source also provides confidence, one can further sparsify to high-confidence edges, e.g.,
$$ E^{(k)} = \bigl\{(i,j) : \mathrm{conf}^{(k)}(i,j) \ge \tau_k\bigr\}, \qquad A^{(k)}_{ij} = \mathbb{1}\bigl[\, \hat y^{(k)}_i = \hat y^{(k)}_j \,\bigr] \quad \text{for } (i,j)\in E^{(k)}. $$
This representation explicitly accommodates abstention and yields a graph whose density is controlled by τk (or by a fixed kNN budget). The key point is that once we pass to pairwise form, we no longer require the label alphabet $\{\hat y^{(k)}_i\}$ to have any meaning beyond within-task equality tests.
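A small sketch of this label-to-constraint conversion may make the interface concrete; the helper name and the confidence callable are illustrative, not part of the formal setup.

```python
import numpy as np
from itertools import combinations

def labels_to_constraints(y_hat, conf, tau):
    """Convert a pseudo-label vector into sparse pairwise votes.

    y_hat : (n,) cluster ids emitted by one weak source (arbitrary alphabet)
    conf  : callable (i, j) -> confidence of the source on that pair
    tau   : confidence threshold controlling sparsity
    Returns (edges, votes): queried pairs and their {0,1} must-link votes.
    """
    edges, votes = [], []
    for i, j in combinations(range(len(y_hat)), 2):
        if conf(i, j) >= tau:                    # abstain below threshold
            edges.append((i, j))
            votes.append(int(y_hat[i] == y_hat[j]))
    return np.array(edges), np.array(votes)

# Label identity is irrelevant: permuting the label alphabet leaves the
# pairwise equality votes unchanged.
y_hat = np.array([3, 3, 7, 7, 1])
edges, votes = labels_to_constraints(y_hat, conf=lambda i, j: 1.0, tau=0.5)
```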

The reverse direction, converting pairwise constraints into pseudo-labels, is precisely a graph partitioning problem. Under our planted assumptions, the existence of a near-correct Π is guaranteed when the effective separation of the mixed channel is above threshold (cf. Thm.~2); algorithmically, we may use spectral clustering, thresholded connected components, or correlation clustering relaxations depending on the regime and the presence of negative evidence.

We now record several canonical constructions which instantiate the abstract annotators λk.


Embedding-similarity (kNN) annotator. Given embeddings $\{z_i\}_{i=1}^{n}$, an annotator may define a kNN graph in embedding space and vote must-link on sufficiently similar pairs:
$$ E^{(k)} \;=\; \{(i,j): j\in \mathrm{kNN}_k(i)\},\qquad A^{(k)}_{ij} \;=\; \mathbb{1}\!\left[\ \frac{\langle z_i,z_j\rangle}{\|z_i\|\|z_j\|}\ge \tau_k\ \right]. $$
This can be viewed as a self-annotator whose noise parameters (pk(d), qk(d)) depend on the current backbone quality in domain d. Importantly, such a source is not redundant: even if it is imperfect early in training, it can be weighted down by the mixture and later become a strong contributor as fθ improves.
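A minimal sketch of such an embedding self-annotator, assuming cosine similarity, a kNN candidate set, and an illustrative threshold τ:

```python
import numpy as np

def knn_cosine_annotator(Z, k=10, tau=0.7):
    """Sparse must-link votes from cosine similarity in embedding space.

    Z : (n, d) backbone embeddings z_i = f_theta(x_i)
    Returns (edges, votes) on the kNN candidate set E^(k).
    """
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    sim = Zn @ Zn.T
    np.fill_diagonal(sim, -np.inf)                 # exclude self-pairs
    nbrs = np.argsort(-sim, axis=1)[:, :k]         # k nearest neighbors per item
    seen = {}
    for i in range(Z.shape[0]):
        for j in nbrs[i]:
            a, b = (i, j) if i < j else (j, i)
            seen[(a, b)] = int(sim[i, j] >= tau)   # vote 1 iff similarity >= tau
    edges = np.array(list(seen.keys()))
    votes = np.array(list(seen.values()))
    return edges, votes
```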


Caption/text annotator. Suppose each item has a caption or text field mi. A simple annotator computes TF–IDF or sentence-embedding similarity and proposes must-links for highly similar text pairs, optionally restricted to a candidate set generated by approximate nearest neighbors. In domains where captions are reliable, we expect pk(d) to be large and qk(d) small; in domains with templated or noisy text, both may deteriorate, and the gate should compensate.


Metadata-equality annotator. If mi contains categorical identifiers (user id, product id, camera id, coarse category), then equality (or a known compatibility relation) yields immediate pairwise constraints:
Aij(k) = 𝟙[ id(mi) = id(mj) ].
Such rules can be extremely high-precision but low-recall, hence naturally sparse. They also illustrate why we distinguish abstention from negative votes: failure of metadata equality should not be interpreted as a cannot-link unless the rule is known to be exhaustive.


Temporal-continuity annotator. In sequential data, adjacent (or nearby) time indices provide a weak must-link prior. One may set E(k) = {(i, j) : |ti − tj| ≤ Δ} and vote Aij(k) = 1 for temporally local neighbors. This source typically has moderate pk(d) and nontrivial qk(d), as proximity does not guarantee identity; nonetheless, when mixed with complementary signals it can greatly stabilize partitions by enforcing local smoothness.


Negative evidence and calibration. Some domains admit explicit negative evidence (e.g. mutually exclusive metadata, non-overlapping tracks). While we encode annotators as binary Aij(k) ∈ {0, 1}, negative evidence is representable either by (a) allowing an annotator whose pk(d) < qk(d) (so that Aij(k) = 1 is anti-correlated with sameness) and correcting via calibration, or (b) using a calibrated transform Cal(A(k)) that maps {0, 1} to signed scores. A canonical choice, when (pk, qk) are known or estimated, is the log-likelihood ratio score for the pairwise hypothesis test:
$$ \mathrm{Cal}\!\left(A^{(k)}_{ij}\right) \;=\; \log\frac{\Pr(A^{(k)}_{ij}\mid y_i=y_j)}{\Pr(A^{(k)}_{ij}\mid y_i\neq y_j)} \;=\; A^{(k)}_{ij}\log\frac{p_k}{q_k} + (1-A^{(k)}_{ij})\log\frac{1-p_k}{1-q_k}. $$
This calibration makes the mixed score Sij additive in evidences and clarifies how an anti-correlated annotator is automatically inverted.
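In code, this calibration is a one-line entrywise transform; the sketch below assumes estimated reliabilities and clips them away from 0 and 1 for numerical safety.

```python
import numpy as np

def llr_calibrate(votes, p_k, q_k, eps=1e-6):
    """Log-likelihood-ratio calibration of binary votes on observed edges.

    votes    : (m,) array of {0,1} votes of annotator k on its edge set E^(k)
    p_k, q_k : estimated within/cross-class vote probabilities
    An anti-correlated annotator (p_k < q_k) is automatically inverted,
    since both log-ratios change sign.
    """
    p = np.clip(p_k, eps, 1 - eps)
    q = np.clip(q_k, eps, 1 - eps)
    return votes * np.log(p / q) + (1 - votes) * np.log((1 - p) / (1 - q))
```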

If an annotator outputs a label vector $\hat y^{(k)} \in \mathbb{N}^n$, then any permutation of its label names is observationally equivalent. Hence, attempting to fuse sources at the label level (e.g. by majority vote on $\hat y^{(k)}_i$) is ill-posed without an alignment step, which itself is combinatorial and unstable under noise. By contrast, pairwise equality events $\mathbb{1}[\hat y^{(k)}_i = \hat y^{(k)}_j]$ are well-defined and directly comparable across sources. This is the reason our mixture is performed at the level of constraint matrices:
$$ S \;=\; \sum_{k=1}^K w_k\,\mathrm{Cal}\!\left(A^{(k)}\right), $$
rather than at the level of labels. In particular, the gate needs only to learn which sources produce pairwise relations that are predictive of latent sameness in the current domain; it never needs to discover correspondences between arbitrary label vocabularies.

Given S, the partitioner Part returns Π = Part(S), which serves as the pseudo-label object for inner-loop adaptation. We stress that we do not require Π to be the optimizer of any particular objective, but it is helpful to keep in mind a standard target such as correlation clustering: for a signed score matrix S one may seek Π maximizing within-cluster agreement and across-cluster disagreement, equivalently minimizing a weighted disagreement count. In planted regimes, this objective aligns with the latent partition and admits efficient approximate solvers with sharp thresholds (Thm.~2). The resulting Π can then be used either as hard pseudo-labels (e.g. cross-entropy over cluster ids within the task) or as an induced set of positive/negative pairs for contrastive training, both of which are naturally permutation-invariant.

These constructions reduce the diverse space of ``weak task definitions’’ to a single mathematical interface: each source produces sparse pairwise constraints, the gate chooses a mixture, and a partitioner produces pseudo-label structure. With this interface in place, we may state the complete bilevel procedure—including dynamic per-task adapters and a computable outer objective—without committing to any particular modality-specific heuristic, which is the subject of the next section.


4. Algorithm: MetaMix-UHT: bilevel meta-learning with (i) domain-conditioned mixture weights over annotators, (ii) graph partitioning to construct pseudo-labels, (iii) dynamic per-task adapters, and (iv) outer-loop objectives.

We now describe the bilevel procedure that simultaneously learns (i) a shared representation fθ, and (ii) a task-construction policy which selects, on a per-task basis, a mixture wψ(T) ∈ ΔK over weak annotators. The defining feature is that the inner loop optimizes task-local parameters against pseudo-supervision induced from mixed pairwise constraints, while the outer loop optimizes θ and ψ using a label-free surrogate objective designed to correlate with downstream performance.

An episode (task) is a batch $T = \{(x_i, m_i)\}_{i=1}^{n}$ sampled from a domain-conditioned distribution. We compute embeddings
$$ z_i = f_\theta(x_i) \in \mathbb{R}^{d}, \qquad i \in [n], $$
and query each weak source λk to obtain a (possibly sparse) constraint graph A(k)(T). We emphasize that in sparse regimes we interpret missing edges as abstentions; thus, for each k we may represent its output as (E(k), a(k)) with E(k) ⊆ {(i, j) : i < j} and aij(k) ∈ {0, 1} for (i, j) ∈ E(k). This convention prevents sparsity from being misread as systematic negative evidence.

To choose which sources to trust on the present task, we construct a descriptor
$$ d = \mathrm{Desc}\bigl(T, \{z_i\}_{i=1}^{n}, \{m_i\}_{i=1}^{n}\bigr), $$
intended to capture domain information (explicit if available, otherwise implicit statistics such as embedding dispersion, caption entropy, or annotator agreement rates on sampled pairs). A gating network gψ maps d to logits, and we set
wψ(T) = softmax(gψ(d)) ∈ ΔK.
This parameterization ensures $w_k \ge 0$ and $\sum_k w_k = 1$ for every task, and it makes the mixture selection differentiable in ψ.
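A minimal PyTorch sketch of the gate (the two-layer architecture, hidden width, and descriptor dimension are illustrative choices, not prescribed by the text):

```python
import torch
import torch.nn as nn

class AnnotatorGate(nn.Module):
    """Maps a task descriptor d to mixture weights w in the simplex Delta_K."""

    def __init__(self, descriptor_dim: int, num_annotators: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(descriptor_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_annotators),
        )

    def forward(self, descriptor: torch.Tensor) -> torch.Tensor:
        logits = self.net(descriptor)
        return torch.softmax(logits, dim=-1)   # w_k >= 0, sum_k w_k = 1

gate = AnnotatorGate(descriptor_dim=16, num_annotators=4)
w = gate(torch.randn(16))                      # one weight vector per task
```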

Annotators differ not only in accuracy but also in semantics (precision–recall tradeoffs, density, and possible anti-correlation). We therefore optionally pass each A(k) through a calibration map Cal, which yields a real-valued score on observed edges. When reliability parameters (pk, qk) are known or estimated, a canonical choice is the log-likelihood ratio score
$$ \mathrm{Cal}\!\left(A^{(k)}_{ij}\right) \;=\; A^{(k)}_{ij}\log\frac{p_k}{q_k} + (1-A^{(k)}_{ij})\log\frac{1-p_k}{1-q_k}, $$
with the understanding that this is applied only on edges (i, j) ∈ E(k). More generally, Cal may be a learned monotone transform, a temperature-scaled mapping, or a centering operation that equalizes source magnitudes.

The combined score matrix is then
$$ S(T) \;=\; \sum_{k=1}^K w_k\,\mathrm{Cal}\!\left(A^{(k)}(T)\right), $$
defined on the union edge set E = ∪kE(k) and set to 0 (no evidence) off E. The mixture acts at the level of pairwise evidence, hence remains invariant to any source-specific label permutations.
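A sketch of the mixing step with SciPy sparse matrices, assuming calibration has already been applied to each annotator's scores; abstentions simply never enter the sum.

```python
import numpy as np
import scipy.sparse as sp

def mix_annotators(edge_lists, score_lists, w, n):
    """Combine calibrated annotator scores into the mixed graph S(T).

    edge_lists  : list of (m_k, 2) integer arrays of observed pairs E^(k)
    score_lists : list of (m_k,) calibrated scores Cal(A^(k)) on those pairs
    w           : (K,) mixture weights from the gate
    Missing pairs contribute nothing (abstention), so sparsity is preserved.
    """
    S = sp.csr_matrix((n, n))
    for w_k, edges, scores in zip(w, edge_lists, score_lists):
        rows, cols = edges[:, 0], edges[:, 1]
        A_k = sp.csr_matrix((scores, (rows, cols)), shape=(n, n))
        S = S + w_k * (A_k + A_k.T)            # symmetric evidence
    return S
```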

Given S, we invoke a partitioner Part to produce a partition Π = Part(S) of (a subset of) [n] into a variable number of clusters. The algorithmic choice of Part depends on regime: spectral clustering on (signed or nonnegative) weighted adjacency is effective in planted regimes; thresholded connected components can be used in high-precision settings; correlation-clustering relaxations can accommodate explicit negative evidence. We permit an outlier-rejection mechanism, in which a subset of indices is assigned to a null block and excluded or downweighted in subsequent losses. The output Π is treated as a pseudo-label object: it may be encoded as hard cluster ids $\hat y_i \in \{1, \dots, \hat r\}$ for non-outliers, or equivalently as induced positive/negative pair sets.
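One possible instantiation of Part is sketched below: clip to positive evidence, reject isolated nodes as outliers, and run normalized spectral clustering with k-means rounding. The clipping rule, outlier criterion, and the assumption that r is given are simplifications for illustration.

```python
import numpy as np
from scipy.sparse.linalg import eigsh
from sklearn.cluster import KMeans

def spectral_partition(S, r, min_degree=1e-8):
    """A simple spectral instantiation of the black-box partitioner Part(S).

    S : (n, n) symmetric mixed-score matrix (dense or scipy sparse)
    r : number of clusters (in practice estimated from the eigengap)
    Nodes with no positive evidence are assigned to an outlier block (-1).
    """
    W = np.asarray(S.todense()) if hasattr(S, "todense") else np.asarray(S, dtype=float)
    W = np.maximum(W, 0.0)                             # keep positive evidence only
    deg = W.sum(axis=1)
    keep = deg > min_degree                            # reject isolated nodes
    labels = -np.ones(W.shape[0], dtype=int)
    if keep.sum() > r:
        Wk = W[np.ix_(keep, keep)]
        d = np.sqrt(Wk.sum(axis=1)) + 1e-12
        L = Wk / d[:, None] / d[None, :]               # normalized adjacency
        vals, vecs = eigsh(L, k=r, which="LA")         # top-r eigenvectors
        emb = vecs / (np.linalg.norm(vecs, axis=1, keepdims=True) + 1e-12)
        labels[keep] = KMeans(n_clusters=r, n_init=10).fit_predict(emb)
    return labels
```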

To prevent the shared representation from being forced to absorb task-specific labelings induced by noisy pseudo-partitions, we attach a task-local adapter/head hϕ whose parameters are reinitialized per task. Concretely, hϕ may be a linear classifier over z, a prototype-based head whose prototypes are initialized from cluster means under Π, or a lightweight MLP. The inner loop performs one or a few gradient steps
$$ \phi' = \mathrm{Adapt}(\phi;\, T, \Pi) = \phi - \alpha \nabla_{\phi}\, \mathcal{L}_{\mathrm{in}}(f_\theta, h_\phi, T, \Pi), $$
with step size α. A standard inner loss is cross-entropy on pseudo-labels:
$$ \mathcal{L}_{\mathrm{in}} \;=\; \frac{1}{|\mathcal{I}|}\sum_{i\in\mathcal{I}} \mathrm{CE}\!\left(h_\phi(z_i),\hat y_i\right), $$
where ℐ ⊆ [n] excludes outliers. Alternatively, we may avoid committing to cluster ids and use a pairwise contrastive loss induced by Π (or directly by S), which can be preferable when the estimated number of clusters is unstable.
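A minimal PyTorch sketch of this inner loop with a fresh linear head and a few explicit gradient steps (step count, learning rate, and the linear parameterization are illustrative):

```python
import torch
import torch.nn.functional as F

def adapt_head(z, pseudo_labels, num_steps=5, lr=0.1):
    """Inner-loop adaptation: fit a fresh linear head on pseudo-labels.

    z             : (n, d) backbone features (detached or differentiable)
    pseudo_labels : (n,) LongTensor of cluster ids from Part(S); -1 marks outliers
    Returns the adapted head parameters (W, b), differentiable w.r.t. z.
    """
    keep = pseudo_labels >= 0                         # drop the outlier block
    z_in, y_in = z[keep], pseudo_labels[keep]
    r_hat = int(y_in.max().item()) + 1
    W = torch.zeros(r_hat, z.shape[1], requires_grad=True)
    b = torch.zeros(r_hat, requires_grad=True)
    for _ in range(num_steps):
        logits = z_in @ W.t() + b
        loss = F.cross_entropy(logits, y_in)          # L_in on pseudo-labels
        gW, gb = torch.autograd.grad(loss, (W, b), create_graph=True)
        W, b = W - lr * gW, b - lr * gb               # phi' = phi - alpha * grad
    return W, b
```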

Since ground-truth labels are unavailable during meta-training, we choose an outer-loop objective $\mathcal{L}_{\mathrm{out}}$ that is computable from (T, Π, {A(k)}) and is informative about the quality of the induced pseudo-task. A generic and effective construction is a held-out edge-prediction objective: for each annotator k, split $E^{(k)} = E^{(k)}_{\mathrm{sup}} \cup E^{(k)}_{\mathrm{qry}}$, build S (and thus Π) using only the support edges, adapt ϕ using Π, and evaluate agreement on held-out query edges. For instance, if the adapted model induces a similarity $\sigma_{\theta,\phi'}(i, j)$ (e.g. cosine similarity of adapted features or predicted same-cluster probability), we may set
$$ \mathcal{L}_{\mathrm{out}} \;=\; -\sum_{k=1}^K \gamma_k \sum_{(i,j)\in E^{(k)}_{\mathrm{qry}}} \log \Pr\!\left(A^{(k)}_{ij}\mid \sigma_{\theta,\phi'}(i,j)\right), $$
for suitable link function Pr (⋅ ∣ ⋅) and weights γk. This objective encourages the representation and adaptation to predict held-out pairwise relations, and it simultaneously encourages the gate to choose mixtures whose induced partitions generalize across edges.
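A sketch of this outer surrogate, assuming the link function is a Bernoulli likelihood on a scaled cosine similarity of adapted features; the scale and the per-annotator weights are illustrative.

```python
import torch
import torch.nn.functional as F

def heldout_edge_loss(feats, query_edges, query_votes, gamma=None, scale=5.0):
    """Outer surrogate: negative log-likelihood of held-out pairwise votes.

    feats       : (n, d) adapted features (depend on theta and phi')
    query_edges : list over annotators of (m_k, 2) LongTensors E_qry^(k)
    query_votes : list over annotators of (m_k,) float votes A^(k)_ij in {0,1}
    gamma       : optional per-annotator weights gamma_k
    """
    f = F.normalize(feats, dim=1)
    K = len(query_edges)
    gamma = gamma if gamma is not None else [1.0] * K
    loss = feats.new_zeros(())
    for g, E, a in zip(gamma, query_edges, query_votes):
        sim = (f[E[:, 0]] * f[E[:, 1]]).sum(dim=1)     # cosine similarity per pair
        logits = scale * sim                           # link: Bernoulli(sigmoid)
        loss = loss + g * F.binary_cross_entropy_with_logits(logits, a, reduction="sum")
    return loss
```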

A complementary choice is a cross-source consistency loss: we form two mixed graphs S(a), S(b) either by independent edge subsampling or by using disjoint annotator subsets, obtain partitions Π(a), Π(b), adapt two heads ϕ(a), ϕ(b), and penalize disagreement between the adapted predictions (or between the induced pairwise relations). This targets robustness to annotator idiosyncrasies and empirically correlates with higher effective separation of the mixed channel.

Given a meta-batch of tasks $\{T_i\}_{i=1}^{B}$, we update (θ, ψ) by gradient descent on $\sum_i \mathcal{L}_{\mathrm{out}}^{(i)}$. The inner adaptation is differentiable, and we may use MAML-style higher-order gradients, ANIL (adapter-only differentiation), or first-order approximations. The partitioner Part is typically non-differentiable; in our implementation and analysis we treat Π as a piecewise-constant function of S and stop gradients through Part. This is consistent with the planted-regime viewpoint: when the effective separation is above threshold, Π is stable to small perturbations of S, and the outer loop can be driven by losses that depend smoothly on w through the held-out-edge terms and through the adapter's prediction function.

We optionally add an entropy penalty τH(w) to avoid premature collapse to a single annotator when evidence is weak, and we may enforce a minimum-mass constraint to preserve exploration. In sparse settings, we control the per-task cost by restricting each E(k) to $\tilde O(n)$ edges (e.g. via approximate nearest neighbors), ensuring that graph construction and spectral routines are nearly linear in n. Finally, because annotator reliability may depend on the evolving backbone, we may maintain running estimates of calibration parameters (or learn Cal jointly) to stabilize the effective scale of scores across training.

This completes the definition of MetaMix-UHT: a unified, permutation-invariant task-construction mechanism coupled to a bilevel meta-learner, whose statistical behavior we analyze next by relating mixture identifiability, partition recovery, and downstream excess risk.


5. Statistical Theory: (a) identifiability of annotator reliabilities/mixture weights; (b) tight recovery bounds under planted partition; (c) translating misclustering to few-shot error after adaptation.

We formalize the statistical behavior of the MetaMix-UHT construction by isolating three questions: (i) when the per-domain reliabilities of the weak annotators (and hence the optimal mixture) are identifiable from unlabeled pairwise observations; (ii) when the mixed constraint matrix admits sharp recovery guarantees under a planted partition model; and (iii) how the induced misclustering rate propagates to downstream few-shot error after inner-loop adaptation.

Fix a domain d and consider a generic task T sampled from that domain. For any pair (i, j), let the latent binary variable be
Yij = 𝟙[yi = yj] ∈ {0, 1},
and let annotator outputs be Aij(k) ∈ {0, 1}. Under the conditional independence assumption,
$$ A^{(k)}_{ij} \;\perp\!\!\!\perp\; \bigl\{A^{(\ell)}_{ij}\bigr\}_{\ell \neq k} \;\Bigm|\; Y_{ij}, $$
each annotator k is a binary channel with parameters pk(d) = Pr (Aij(k) = 1 ∣ Yij = 1, d) and qk(d) = Pr (Aij(k) = 1 ∣ Yij = 0, d). The key point is that although Yij is unobserved, the joint law of (Aij(1), …, Aij(K)) contains enough information to recover (pk(d), qk(d)) up to degeneracies provided K ≥ 3 and not all channels are uninformative. Concretely, if we denote πd = Pr (Yij = 1 ∣ d) (a function of the within-task class proportions), then for each k the mean obeys
$$ \mathbb{E}\bigl[A^{(k)}_{ij} \mid d\bigr] = \pi_d\, p_k(d) + (1 - \pi_d)\, q_k(d), $$
and for distinct k ≠ ℓ the pairwise cross-moment is
$$ \mathbb{E}\bigl[A^{(k)}_{ij} A^{(\ell)}_{ij} \mid d\bigr] = \pi_d\, p_k(d)\, p_\ell(d) + (1 - \pi_d)\, q_k(d)\, q_\ell(d). $$
These second moments alone do not suffice to separate (pk, qk) from πd without additional assumptions; however, with K ≥ 3 one may use third-order moments such as $\mathbb{E}\bigl[A^{(k)}_{ij} A^{(\ell)}_{ij} A^{(m)}_{ij}\bigr]$ to identify the latent two-component mixture corresponding to Yij ∈ {0, 1}. This is the pairwise analogue of Dawid–Skene identifiability, and it yields method-of-moments estimators for (pk(d), qk(d)) that concentrate at the parametric rate as the number of observed (task, pair) samples grows.
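For concreteness, in a fixed domain d (writing π = πd, pk = pk(d), qk = qk(d)) the moment system inverted by such an estimator for K = 3 reads

$$ \begin{aligned} \mathbb{E}[A^{(k)}_{ij}] &= \pi\, p_k + (1-\pi)\, q_k, && k \in \{1,2,3\},\\ \mathbb{E}[A^{(k)}_{ij} A^{(\ell)}_{ij}] &= \pi\, p_k p_\ell + (1-\pi)\, q_k q_\ell, && k \neq \ell,\\ \mathbb{E}[A^{(1)}_{ij} A^{(2)}_{ij} A^{(3)}_{ij}] &= \pi\, p_1 p_2 p_3 + (1-\pi)\, q_1 q_2 q_3, \end{aligned} $$

i.e. seven equations in the seven unknowns (π, p1, p2, p3, q1, q2, q3), the pairwise counterpart of the classical Dawid–Skene third-moment argument.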

Given reliabilities, an ``optimal’’ mixture is defined relative to a recovery criterion for the downstream partitioner. In the simplest uncalibrated setting S = ∑kwkA(k), the induced channel on a pair has effective parameters
$$ p_w(d) \;=\; \sum_{k=1}^K w_k p_k(d), \qquad q_w(d) \;=\; \sum_{k=1}^K w_k q_k(d), $$
so separation depends on the gap pw − qw and the overall edge density pw (or average degree in sparse graphs). More generally, for calibrated scores Cal(Aij(k)) (e.g. log-likelihood ratios), the relevant quantity is the signal-to-noise ratio of the resulting weighted SBM; we write it abstractly as SNR(w; d), which is maximized by a Bayes-optimal mixture w*(d) ∈ ΔK. Since SNR(w; d) is a known functional of (pk(d), qk(d)) under the model, identifiability of reliabilities implies identifiability of w*(d) except on measure-zero parameter sets where mixtures are non-unique.
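As a toy illustration of how the optimal mixture could be located once reliabilities are known, the sketch below searches the simplex for the weights maximizing the uncalibrated proxy $(p_w - q_w)^2/(r\,p_w)$; the Dirichlet sampling is a crude stand-in for the gate and is not part of the algorithm.

```python
import numpy as np

def best_mixture_snr(p, q, r, num_samples=20000, rng=None):
    """Search the simplex for weights maximizing the uncalibrated SNR proxy.

    p, q : (K,) per-annotator reliabilities p_k(d), q_k(d) in a domain d
    r    : number of planted clusters
    Returns (w_star, snr_star).
    """
    rng = rng or np.random.default_rng(0)
    W = rng.dirichlet(np.ones(len(p)), size=num_samples)   # points in Delta_K
    p_w = W @ p
    q_w = W @ q
    snr = (p_w - q_w) ** 2 / (r * np.maximum(p_w, 1e-12))
    best = int(np.argmax(snr))
    return W[best], float(snr[best])

w_star, snr_star = best_mixture_snr(
    p=np.array([0.8, 0.7, 0.55]), q=np.array([0.1, 0.2, 0.45]), r=5)
```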

Finally, we connect this to the meta-learned gate: the outer objective defined via held-out edge prediction (or cross-source consistency) provides an unbiased or approximately unbiased estimate of a risk functional that is minimized when w yields partitions that generalize across edges. Under the planted model, this generalization is monotone in SNR(w; d) once above the recovery threshold, so minimizing the outer loss drives wψ(T) toward w*(d) in a domain-dependent manner.

We now fix a single task with n items and latent partition into r clusters (for clarity, equal-sized clusters; extensions follow by standard perturbation arguments). For each annotator k, the observed graph is an SBM with within-cluster edge probability pk and cross-cluster probability qk, with pk > qk. The mixed score matrix S = ∑kwkA(k) is then a weighted adjacency matrix whose expectation has rank-r structure: pairs in the same cluster have mean pw, pairs in different clusters have mean qw (or the analogous calibrated means). Thus, recovery reduces to a spectral separation problem: the leading eigenspace of S must be stable under noise.

In the moderately sparse regime where expected degrees scale at least logarithmically, concentration of S around its mean yields a clean threshold. For the uncalibrated mixture, a convenient sufficient condition is
$$ \frac{(p_w-q_w)^2}{r\,p_w} \;\gtrsim\; \frac{\log n}{n}, $$
which ensures that the informative eigenvalues separate from the bulk and that spectral clustering on S (or on a sparsified normalization) returns a partition with vanishing misclustering fraction. Moreover, one obtains exponential tails of the form
$$ \mathrm{err}\bigl(\hat\Pi, \Pi^*\bigr) \;\le\; C \exp\bigl(-c\, n\, \mathrm{SNR}(w)\bigr), $$
for explicit constants C, c > 0 depending on the regime and the algorithmic details. The proof follows the standard SBM route: (i) bound S − 𝔼S by matrix Bernstein (adapted to weighted and possibly sparse edges), (ii) apply Davis–Kahan to relate eigenspace perturbation to this deviation, and (iii) translate eigenspace error into a misclustering bound via k-means stability. When S is built from calibrated scores, the same argument applies with (pw, qw) replaced by the corresponding first two moments of the calibrated edge weights.

Conversely, below the information-theoretic threshold, no method can recover the planted partition. In particular, when the effective separation is too small (e.g. $(p_w - q_w)^2/(r\, p_w) \lesssim 1/n$ in the above scaling), the distribution of graphs under different planted partitions becomes contiguous, and Fano/Le Cam arguments imply a constant lower bound on expected misclustering for any estimator. This establishes that, up to logarithmic factors (and up to the usual distinction between exact recovery and vanishing-error recovery), the condition on SNR(w) is sharp.

The main implication for our setting is that the meta-learner is not required to solve an arbitrary clustering instance: once the gate produces weights w such that SNR(w) is above threshold for the prevailing domain, the partitioner Part recovers a high-quality pseudo-partition. Conversely, if no convex combination of the available annotators yields sufficient effective separation, then no unsupervised procedure can produce vanishing misclustering, and the outer loop cannot circumvent this limitation.

We next relate the quality of Π to downstream adaptation. Let εT ∈ [0, 1] denote the misclustering rate of the partitioner output on task T relative to the true partition Π* (measured as the minimum fraction of misassigned points over cluster permutations, ignoring outliers if present). The inner loop uses Π to define pseudo-supervision, so εT functions as a label-noise parameter: a fraction εT of examples are assigned to incorrect pseudo-clusters, and the induced positive/negative pair sets contain a comparable fraction of corrupted relations.

Assume the inner adaptation map Adapt (equivalently, the learning algorithm for hϕ) is uniformly stable with coefficient β(n) under such noise, and that the outer objective is L-Lipschitz with respect to the adapted predictions. Then one can decompose the post-adaptation episodic risk into three terms: (a) an approximation term proportional to εT capturing the loss incurred by training on corrupted pseudo-supervision; (b) an estimation/stability term O(β(n)) capturing the finite-sample sensitivity of the adapter fit; and (c) a meta-generalization term controlled by the effective complexity of (θ, ψ) over N meta-training tasks. Formally, for an appropriate notion of episodic risk $\mathcal{R}(\theta,\psi)$ and optimum $\mathcal{R}^*$ achievable with the best representation and mixture, we obtain a bound of the schematic form
$$ \mathbb{E}[\mathcal{R}(\theta,\psi)] - \mathcal{R}^* \;\le\; C\,\mathbb{E}[\varepsilon_T] \;+\; O(\beta(n)) \;+\; O\!\left(\sqrt{\frac{\mathrm{Comp}(\theta,\psi)}{N}}\right), $$
where Comp(θ, ψ) may be instantiated via uniform stability of bilevel optimization, PAC-Bayes bounds, or Rademacher-type controls on the outer gradients. The key operational consequence is that improving 𝔼[εT] is the dominant lever once n and N are sufficiently large, and 𝔼[εT] is itself controlled by SNR(wψ(T)) through the recovery results above.

Dynamic per-task adapters are essential in this translation: since Π is only defined up to a within-task relabeling, a shared head would require global label alignment that is unavailable. Reinitializing hϕ per task avoids this obstruction and confines pseudo-label noise to a task-local fit, leaving the backbone to be shaped primarily by cross-task regularities that are stable under label permutations. In particular, in the planted regime where εT decays exponentially in n SNR(w), the bound above yields an excess risk that matches the corresponding lower bound scaling (up to constants and log factors) once the gate has learned near-optimal mixtures.

The preceding guarantees rely on identifiability of reliabilities. If K ≤ 2, or if annotators are perfectly correlated conditional on Yij, then there exist distinct parameter settings producing the same distribution over unlabeled observations but inducing different Bayes-optimal mixtures. In such cases, no unsupervised algorithm can consistently recover w*(d), and one obtains a constant lower bound on the achievable excess risk in the worst case. This impossibility result explains the structural requirement for at least three conditionally independent views (or an equivalent source of asymmetry), and it delineates precisely which aspects of task construction can and cannot be learned without labeled feedback.


6. Computational Theory: complexity of task construction (graph building + partitioning); approximation algorithms; hardness via correlation clustering; conditions enabling polynomial-time recovery in planted models.

We isolate the computational content of task construction in MetaMix-UHT, i.e., the mapping
$$ T \;\mapsto\; \bigl(A^{(1)}(T),\dots,A^{(K)}(T)\bigr) \;\mapsto\; S(T)=\sum_{k=1}^K w_k\,\mathrm{Cal}\!\left(A^{(k)}(T)\right) \;\mapsto\; \Pi(T), $$
and we make explicit which aspects admit polynomial-time guarantees and which are intractable in the worst case. Throughout, we distinguish (a) costs that scale with the backbone evaluation and annotator queries, and (b) costs incurred by the combinatorial partitioning subroutine.

Consider a task $T = \{(x_i, m_i)\}_{i=1}^{n}$. The embedding pass costs O(nCf), where Cf denotes the per-example cost of fθ. The remaining costs depend on how each weak annotator λk materializes its constraints. We emphasize that our analysis permits sparse constraint graphs: for each k we may obtain a set of edges Ek ⊆ [n] × [n] with $|E_k| = \tilde O(n)$, rather than a dense $n^2$ matrix. This regime is both computationally and statistically natural when annotators are derived from approximate nearest neighbors in embedding space, temporal adjacency, shared metadata keys, or other sparse heuristics.

Given sparse edges, mixture formation is linear in the total number of returned constraints. Writing E = ⋃kEk and storing each annotator channel as edge weights on Ek, we can compute the mixed score graph S in time
$$ O\!\left(\sum_{k=1}^K |E_k|\right) $$
(up to calibration overhead), and memory $O\bigl(\sum_k |E_k|\bigr)$. Calibration Cal can be as simple as an affine rescaling or as principled as a log-likelihood ratio transform; in the latter case one either plugs in estimated reliabilities $(\hat p_k, \hat q_k)$ or uses a monotone proxy. In either case, calibration is entrywise and does not alter asymptotic costs beyond constant factors.

The partitioning step dominates once embeddings are cached and E is moderately large. For spectral partitioners applied to sparse graphs, one typically computes a small number of leading eigenvectors of a (normalized) weighted adjacency or Laplacian. Using iterative methods (e.g. Lanczos), the cost is
$$ \tilde O\bigl(|E| \cdot r\bigr) $$
to obtain an r-dimensional embedding when r is known. When r is unknown, one may estimate it from the eigengap or apply a model-selection criterion; the added cost is lower-order relative to the iterative eigensolve, provided we restrict to the top $r_{\max}$ components. Subsequent k-means on n points in $\mathbb{R}^r$ costs $O(nr^2)$ per Lloyd iteration. In contrast, if one insists on dense constraints, both storage and partitioning incur $\Theta(n^2)$ costs and quickly become infeasible; our proposed interface is therefore explicitly sparse-first.

Finally, inner-loop adaptation with s gradient steps over the task costs O(snCh), where Ch is the per-example cost of the adapter/head. Importantly, the partitioning cost is amortized over these inner steps: once Π is computed, pseudo-supervision is fixed for the episode.

The partitioner Part is, at its core, a solver for a weighted agreement/disagreement objective induced by S. A prototypical formulation is correlation clustering: given signed edge weights (or, equivalently, must-link/cannot-link preferences with confidence), output a partition Π minimizing total disagreement. Even on complete graphs this problem is NP-hard, and hence any implementable method must either (i) settle for approximation guarantees, or (ii) operate under distributional assumptions (as in planted models).

Approximation algorithms provide a principled middle ground when S is not assumed to be generated by a planted partition. For example, in the unweighted complete-graph setting, simple randomized pivoting procedures achieve constant-factor approximations for minimizing disagreements, and several LP/SDP-based relaxations yield improved constants at increased computational cost. In our setting, the graph is typically sparse and weighted, and one may deploy variants of the same ideas: (a) greedy pivoting on a sparsified support, (b) local search procedures that iteratively move vertices to reduce disagreement, or (c) relaxations over the Laplacian embedding followed by rounding. The key point is that these methods provide worst-case polynomial time with explicit approximation ratios (constant in the best-known results), thereby converting an intractable exact objective into a tractable surrogate whose minimizer is still meaningful for task construction. In practice, spectral methods and greedy pivoting occupy complementary points in the cost–robustness trade-off: spectral methods exploit low-rank structure and are exceptionally effective under planted models, whereas greedy methods can be preferable under heavy-tailed noise or highly irregular degree distributions.
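For reference, the randomized pivoting scheme alluded to above is only a few lines; the sketch below applies it heuristically to a signed score matrix (the classical constant-factor guarantee is stated for the unweighted complete-graph setting).

```python
import numpy as np

def kwik_cluster(S, rng=None):
    """Randomized pivoting for correlation clustering on a signed score matrix.

    S : (n, n) symmetric matrix; S_ij > 0 is treated as a must-link preference.
    Repeatedly pick a random unassigned pivot and absorb every unassigned
    vertex with positive evidence toward it.
    """
    rng = rng or np.random.default_rng(0)
    n = S.shape[0]
    labels = -np.ones(n, dtype=int)
    unassigned = list(range(n))
    cluster = 0
    while unassigned:
        pivot = unassigned[rng.integers(len(unassigned))]
        members = [pivot] + [j for j in unassigned if j != pivot and S[pivot, j] > 0]
        labels[np.array(members)] = cluster
        unassigned = [j for j in unassigned if labels[j] < 0]
        cluster += 1
    return labels
```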

We now record the computational obstruction: without distributional assumptions on S, exact task construction is intractable. Formally, consider the decision version of correlation clustering (minimum disagreements) on a weighted graph. We can encode such an instance as a single episode by taking n items, defining annotator channels so that their weighted mixture equals the target matrix S (e.g. by letting one ``annotator’’ output the given weights and setting other channels to zero, or by distributing weights across channels and using w to recombine them). If a polynomial-time routine Part were guaranteed to return an optimal partition for every such S, then we would solve correlation clustering exactly in polynomial time. Since correlation clustering is NP-hard, this is unlikely. Consequently, no end-to-end meta-learner—regardless of how wψ is trained—can bypass the worst-case intractability of exact partitioning, because the combinatorial core already contains an NP-hard problem.

This hardness statement also clarifies the role of the bilevel structure. The outer loop does not ``learn to solve’’ NP-hard instances in the worst case; rather, it learns to select mixtures and representations so that, on the task distribution of interest, the induced S(T) lies in a computationally benign regime where standard polynomial-time partitioners succeed. Thus, the gate and backbone are not merely statistical devices; they are computational regularizers that shape the instance distribution seen by Part.

Under the planted-partition/SBM-type assumptions used in our statistical theory, the computational landscape changes: the typical instances are not adversarial, and efficient algorithms achieve sharp recovery thresholds. For weighted sparse SBMs, spectral methods (possibly after degree normalization and trimming) recover the planted partition with vanishing misclustering whenever the effective separation exceeds the corresponding threshold, and they do so in nearly linear time in |E| (up to factors depending on r and the eigensolver accuracy). Semidefinite programs can also achieve recovery in certain regimes, but at substantially higher computational cost; given that our algorithm must solve many tasks, the spectral route is the appropriate default.

We emphasize that the mixture operation preserves planted structure: if each A(k) is an SBM with within/between parameters (pk, qk) (or their sparse analogues), then S = ∑kwkCal(A(k)) remains a weighted SBM with parameters that are affine (or otherwise explicit) functions of w and of the annotator reliabilities. Hence, once the gate selects w such that the induced effective signal dominates noise, we are in a regime where both (i) statistically meaningful recovery is possible, and (ii) computationally efficient recovery is available. This is precisely the circumstance under which it is meaningful to claim end-to-end guarantees for a practical meta-learning system.

From a systems perspective, the planted-model tractability justifies two additional design choices: first, sparsification is not merely an engineering convenience but part of the recovery theory (spectral methods require at least logarithmic expected degree for concentration); second, approximate partitioners with predictable runtime are preferable to exact solvers with unpredictable behavior, since the latter provide no advantage on planted instances but incur prohibitive costs over N episodes.

The foregoing computational theory motivates the experimental protocols in the sequel: we should stress-test the method precisely in settings that interpolate between planted-like structure (where spectral recovery is expected) and adversarial or corrupted constraints (where approximation robustness and mixture selection are critical).


7. Experimental Design (to strengthen claims): domain shift (DomainNet-like), web-metadata corruption, long-tail rare classes; ablations showing learned weights shift across domains and robustness improves.

We now specify experimental protocols intended to stress precisely the two claims implicit in our theory: (i) the gate can learn domain-conditioned mixtures that track the most informative annotators, and (ii) the resulting task constructor is robust to realistic violations (domain shift, corrupted side information, and long-tail structure) in a way that fixed-mixture or single-annotator constructions are not. Throughout, we evaluate both quality of Π(T) and performance after inner-loop fitting on pseudo-supervision derived from Π(T).

We consider datasets with explicit domain labels and heterogeneous appearance statistics (e.g. DomainNet-style domains such as real, sketch, clipart, and quickdraw), as well as standard digit shifts (e.g. MNIST→SVHN/USPS). We meta-train on a subset of domains and meta-test on held-out domains, so that the mixture policy must extrapolate from the training domains using only task descriptors Desc(T, z, m) (e.g. summary statistics of embeddings, metadata availability indicators, degree profiles of annotator graphs, and uncertainty measures derived from disagreement among annotators).

Each episode T is formed by sampling r latent classes and n examples with heterogeneous class sizes (unless otherwise stated). For each annotator k ∈ [K], we construct a sparse constraint graph A(k)(T) with $\tilde O(n)$ edges. Concretely, we include a diverse suite of weak annotators: (a) embedding similarity: must-link proposals based on $\|z_i - z_j\|$; (b) metadata keys (shared web tags, product IDs, author IDs, etc.); (c) text similarity when captions exist (bag-of-words or pretrained text embeddings); (d) temporal adjacency or co-occurrence when tasks are drawn from logs. Domain shift naturally changes the reliability profile (pk(d), qk(d)) of these sources: for example, caption quality may degrade in certain domains, while visual similarity may become less informative under style shift.

We report (i) clustering metrics for Π(T), including ARI/NMI and pairwise F1 on must-link/cannot-link decisions induced by Π(T), and (ii) downstream few-shot classification accuracy after adaptation, where we treat Π(T) as pseudo-labels for support examples and evaluate on held-out query examples within the same episode using ground truth labels only for evaluation. To connect with the planted-model narrative, we additionally compute an empirical separation statistic for each episode,
$$ \widehat{\Delta}(T) \;=\; \frac{\mathrm{avg}_{(i,j)\in\mathcal{N}} S_{ij} \;-\; \mathrm{avg}_{(i,j)\in\mathcal{F}} S_{ij}}{\sqrt{\mathrm{Var}_{(i,j)}(S_{ij})}}, $$
where $\mathcal{N}$ and $\mathcal{F}$ are respectively near and far pairs under embedding distance; we then stratify performance as a function of Δ̂(T) to test whether sharp transitions occur in the manner predicted by SBM-style thresholds.
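A sketch of how Δ̂(T) could be computed per episode, with near/far pairs taken as the bottom/top quantiles of pairwise embedding distance (the quantile cutoffs are illustrative):

```python
import numpy as np

def empirical_snr_proxy(S, Z, near_q=0.1, far_q=0.9):
    """Compute the per-episode separation statistic Delta_hat(T).

    S : (n, n) mixed score matrix (dense or scipy sparse)
    Z : (n, d) embeddings, used only to define near/far pairs
    """
    Sd = np.asarray(S.todense()) if hasattr(S, "todense") else np.asarray(S, dtype=float)
    n = Z.shape[0]
    iu = np.triu_indices(n, k=1)
    dist = np.linalg.norm(Z[iu[0]] - Z[iu[1]], axis=1)
    s = Sd[iu]
    near = dist <= np.quantile(dist, near_q)       # N: nearest pairs
    far = dist >= np.quantile(dist, far_q)         # F: farthest pairs
    return (s[near].mean() - s[far].mean()) / (s.std() + 1e-12)
```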

To test robustness to side-information failure modes, we introduce controlled corruption into metadata-derived annotators. For an annotator based on a metadata key ui (e.g. tag, uploader, or category field), we corrupt with probability ρ by replacing ui with an independently sampled key (or by permuting keys within the task), thereby inducing spurious must-links. This yields a predictable increase in false-positive rates qk(d) while leaving other channels unchanged. We also consider missingness: with probability μ, an item has metadata removed, causing the corresponding annotator to output no edges for that node (degree collapse). Both perturbations are chosen to preserve the sparse regime (edges are removed or rewired, not densified).
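A sketch of this corruption and missingness protocol on one metadata field (the helper name is ours; ρ and μ are the probabilities defined above):

```python
import numpy as np

def corrupt_metadata(keys, rho=0.2, mu=0.1, rng=None):
    """Apply key-swap corruption and missingness to a metadata field.

    keys : (n,) array of metadata keys u_i
    rho  : probability of replacing a key with one sampled within the task
    mu   : probability of dropping the key entirely (annotator abstains)
    Returns the corrupted keys, with None marking missing metadata.
    """
    rng = rng or np.random.default_rng(0)
    keys = np.array(keys, dtype=object)
    n = len(keys)
    swap = rng.random(n) < rho
    keys[swap] = rng.choice(keys, size=swap.sum(), replace=True)  # spurious must-links
    drop = rng.random(n) < mu
    keys[drop] = None                                             # degree collapse
    return keys
```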

We compare: (a) MetaMix-UHT with a learned gate wψ(T) conditioned on descriptor features that include per-annotator degree summaries and cross-annotator disagreement; (b) a fixed uniform mixture wk ≡ 1/K; (c) best single-annotator selection (chosen in hindsight per domain as an upper bound on non-mixture baselines); and (d) an ``oracle'' mixture computed using evaluation labels to estimate (pk, qk), serving only as a ceiling. The main diagnostic is whether the learned wψ downweights the corrupted channel as ρ increases and whether this reweighting preserves clustering quality and adaptation accuracy. We also report the entropy of wψ(T) and the correlation between wψ(T) and reliability proxies computed from unlabeled cross-annotator agreement (motivated by the identifiability mechanism underlying our mixture learning claim).

A central practical challenge is that tasks seldom have balanced clusters. We therefore generate episodes with class sizes drawn from a Zipf law (or a Dirichlet with small concentration), yielding a long tail of rare classes and occasional singletons. This setting interacts with both the partitioner (which may over-merge small clusters) and the gate (since some annotators may be systematically biased toward frequent classes or produce degree patterns that drown rare clusters).

We evaluate three aspects. First, we include an outlier option in Part and measure precision/recall for outlier assignment on true rare classes and singleton points. Second, we evaluate few-shot adaptation when supports for rare classes are extremely small; here the dynamic adapter is critical, since a fixed head would entangle cluster identities across tasks. Third, we measure tail performance explicitly by conditioning accuracy on class frequency within the episode (e.g. top-20% vs. bottom-20% by support size). The intended conclusion is not that any method solves the tail perfectly, but that mixture learning and calibration reduce systematic over-merging compared to naive similarity-only pseudo-labeling.

To attribute improvements to specific components, we perform ablations aligned with the algorithmic structure. We remove or modify one element at a time: (i) no learned gate: fix w to uniform or to a learned global vector independent of T; (ii) no calibration: set Cal to identity and treat all annotators as commensurate; (iii) no dynamic adapter: replace the per-task head reinitialization by a shared head trained across tasks, which should degrade under label permutation and heterogeneous r; (iv) simplified partitioner: replace Part(S) by thresholding S at a constant and taking connected components, which is fast but brittle; (v) reduced annotator diversity: intentionally limit to K = 2 or construct near-duplicates of an annotator to test the failure mode suggested by non-identifiability. For each ablation, we report not only final accuracy, but also the intermediate misclustering rate (with labels used only for evaluation), since our upper bounds conceptually decompose downstream error into task-construction error and adaptation error.

Finally, to verify that learned weights genuinely shift across domains, we report wψ aggregated by domain (mean and variance over episodes) and visualize trajectories as domains vary. A necessary sanity check is that weights do not merely collapse to a single annotator globally; rather, they should concentrate on different channels when domain statistics change, and such shifts should align with independently measured reliability changes (e.g. increasing corruption ρ or increasing missingness μ). This suite of experiments is designed to make the causal chain—domain-dependent reliabilities → gate-induced mixture → improved partitions → improved adaptation—empirically inspectable rather than inferred.

8. Related Work: DHM-UHT, CACTUs/UMTRA/PsCo; weak supervision (Dawid–Skene, Snorkel); multi-view clustering; VLM/LLM pseudo-labeling; meta-learning robustness.

Our setting sits at the intersection of unsupervised task construction, weak supervision, and meta-learning under heterogeneous noise. The closest line of work is unsupervised or weakly supervised hypergraph/task construction for meta-learning, including approaches sometimes framed as learning from unlabeled data with structured relations (e.g. DHM-UHT-style formulations). These methods typically assume a fixed procedure for constructing an episode-specific supervision signal (a similarity graph, a hypergraph, or pseudo-labels) and then optimize a representation such that rapid within-episode adaptation is possible. In contrast, our focus is on constructing that supervision signal as a domain-conditioned mixture over multiple weak relation sources, explicitly modeling that different annotators can be informative in different regimes and that the mapping from a task descriptor to a reliability profile can be learned from unlabeled data via cross-source agreement statistics.

A second related thread is unsupervised meta-learning via pseudo-labeling, notably CACTUs, UMTRA, and PsCo. CACTUs forms pseudo-tasks by clustering an embedding space and then applies meta-learning on the induced labels; UMTRA samples augmentations to create implicit class identity; PsCo uses prototypical self-supervision to build contrastive tasks. These methods provide strong baselines for label-free meta-learning, but they typically rely on a single pseudo-labeling mechanism (e.g. one chosen clustering algorithm on learned embeddings) and do not directly address the case where pseudo-supervision comes from multiple, noisy, and domain-dependent relation channels. Our construction, through S(T) = ∑kwkA(k)(T) and a partitioner Part, can be seen as generalizing clustering-based pseudo-labeling: rather than clustering solely on z = fθ(x), we cluster on a mixture of relation graphs that may incorporate metadata, text, and other signals orthogonal to vision embeddings.
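
To make the contrast concrete, here is a minimal sketch of the mixed task-construction step, with spectral clustering standing in for Part; the function names are ours, and spectral clustering is only one possible instantiation of the partitioner.

```python
import numpy as np
from scipy.sparse.csgraph import laplacian
from sklearn.cluster import KMeans

def mix_and_partition(A_list, w, n_clusters):
    """Task-construction sketch: mix K symmetric constraint graphs into
    S(T) = sum_k w_k A^(k)(T), then recover a pseudo-partition with spectral
    clustering as one simple stand-in for the partitioner Part."""
    S = sum(wk * np.asarray(Ak, dtype=float) for wk, Ak in zip(w, A_list))
    S = np.maximum(S, 0.0)                      # keep mixed edge weights nonnegative
    L = laplacian(S, normed=True)
    # The bottom eigenvectors of the normalized Laplacian encode cluster geometry.
    _, eigvecs = np.linalg.eigh(L)
    U = eigvecs[:, :n_clusters]
    U = U / (np.linalg.norm(U, axis=1, keepdims=True) + 1e-12)
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(U)
```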

Our mixture-learning component is closely related in spirit to classical work on aggregating noisy labelers. The Dawid–Skene model establishes identifiability and maximum-likelihood estimation of annotator confusion matrices from unlabeled data under conditional independence assumptions. Subsequent weak supervision systems, such as Snorkel, operationalize this idea using labeling functions and generative models to infer latent labels and source accuracies. We share the high-level goal of estimating source reliabilities without ground truth, but our observation model is pairwise rather than itemwise: each source provides constraints Aij(k) intended to indicate whether two examples share a latent class within an episode. This distinction matters both statistically and computationally. Statistically, identifiability arguments must be adapted to pairwise channels and to the fact that the latent variable is 𝟙[yi = yj] rather than yi itself; computationally, supervision enters through graph partitioning rather than independent label denoising. Our analysis formalizes when such pairwise reliabilities and their optimal mixtures are recoverable from unlabeled cross-source correlations, aligning the weak-supervision perspective with the planted-partition viewpoint common in graph inference.
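
As an illustration of how pairwise reliabilities can be read off from unlabeled cross-source moments, the following sketch uses the classical triple-product identity under a simplified symmetric-noise pairwise channel (a stronger assumption than the general (pk, qk) model, adopted here only to show the mechanism); it needs at least three annotators, consistent with the non-identifiability at K ≤ 2 discussed later.

```python
import numpy as np
from itertools import combinations

def triple_product_reliabilities(S_list):
    """Method-of-moments sketch for pairwise channels under a simplified
    symmetric-noise model: annotator k reports the true same-class sign
    t_ij in {-1,+1} flipped independently with probability eps_k, so
    E[s^k s^l] = d_k d_l with d_k = 1 - 2*eps_k, and d_k can be estimated
    from triple products of cross-annotator moments without any labels."""
    K = len(S_list)
    iu = np.triu_indices(S_list[0].shape[0], k=1)
    s = np.stack([np.asarray(S, dtype=float)[iu] for S in S_list])  # (K, n_pairs)
    rho = s @ s.T / s.shape[1]                                      # agreement moments
    d = np.zeros(K)
    for k in range(K):
        ests = []
        for l, m in combinations([j for j in range(K) if j != k], 2):
            if rho[l, m] > 0:
                ests.append(np.sqrt(max(rho[k, l] * rho[k, m] / rho[l, m], 0.0)))
        d[k] = float(np.mean(ests)) if ests else 0.0                # needs K >= 3
    return d
```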

The graph-based perspective also connects our work to multi-view clustering and co-training. In multi-view clustering, multiple similarity graphs or feature views are combined (often via weighted sums or co-regularization) to obtain robust partitions. Many methods treat view weights as global parameters, tuned by heuristics or by optimizing clustering objectives. Our setting differs in two ways. First, we require task-conditional or domain-conditional mixing: the appropriate weight on a text-based similarity graph may depend on whether captions exist in the task, and the appropriate weight on metadata keys may depend on corruption or missingness patterns. Second, we couple graph mixing to meta-learning: the mixture is not learned solely to optimize an unsupervised clustering criterion, but to minimize a downstream surrogate of post-adaptation risk. In this sense, the gate wψ(T) acts as a learned policy for selecting and combining views so that the induced pseudo-tasks are maximally useful for representation learning and rapid within-task adaptation.

A separate but increasingly relevant line of work uses foundation models (VLMs/LLMs) to generate pseudo-labels or relations, either by direct classification, captioning, or pairwise entailment judgments. These models can be treated as powerful weak annotators, sometimes with prompt-dependent reliability and strong domain effects. Our framework is compatible with this view: a prompted LLM that answers ``same class?’’ questions can instantiate an annotator A(k), and multiple prompts (or multiple models) yield a set of annotators whose reliabilities vary with domain and task. The contribution of our mixture learning is then to provide a principled mechanism to combine such sources, and to detect when certain sources should be downweighted based on unlabeled disagreement patterns. Moreover, the pairwise format naturally accommodates partial or sparse queries (e.g. only ask the LLM about a subset of candidate pairs), aligning with practical constraints on cost and latency.

Within meta-learning, our use of dynamic per-task adapters relates to work emphasizing modularity and robustness under task heterogeneity. Methods such as MAML and its variants (e.g. ANIL) separate shared representations from rapidly adaptable task-specific parameters; prototype-based meta-learners and adapter-based fine-tuning similarly reinitialize or condition a small head per episode. In unsupervised or weakly supervised settings, an additional difficulty is label permutation and variable class cardinality across tasks, which motivates using permutation-invariant supervision signals (partitions or constraints) and heads that can be re-instantiated per episode (e.g. prototype sets). Our design mirrors these principles: the backbone fθ learns transferable features, while the adapter absorbs episode-specific structure induced by Π(T), reducing the risk that the shared model overfits to the idiosyncrasies of any single pseudo-labeling channel.

Finally, our emphasis on robustness to noise and domain shift aligns with a broad literature on meta-learning under distribution shift and corrupted supervision. Empirically, meta-learning can be brittle when tasks are constructed with systematic bias, when pseudo-labels contain structured errors (e.g. over-merging under long-tail class sizes), or when auxiliary signals shift across domains. Prior work studies robust objectives, uncertainty-aware adaptation, and meta-regularization to mitigate such effects, but often assumes access to clean labels at meta-train time or focuses on feature shift rather than shift in the supervision channel itself. Our perspective highlights that, in unlabeled meta-training, a dominant source of failure is the task-construction channel: if the effective constraint graph falls below its recoverability threshold, the induced pseudo-partition discards information that no amount of downstream optimization can recover. The mixture mechanism is therefore not merely an engineering detail but a statistically necessary component when multiple weak sources exist and their reliabilities vary across domains.


9. Discussion and Limitations: correlated annotators, adversarial metadata, scaling and privacy; future directions (online mixture learning, active selection of annotator queries).

Our framework isolates a specific bottleneck in unlabeled meta-learning: the quality of the induced within-episode supervision signal. In the present work we formalize this via a mixed pairwise-constraint channel $S(T)=\sum_{k=1}^K w_k A^{(k)}(T)$ and a subsequent partitioner Part whose output Π(T) drives inner-loop adaptation. The positive message of the analysis is that, under a planted model and conditional independence, meta-learning the mixture can push the effective channel above the sharp recovery threshold, at which point downstream risk is controlled by the misclustering rate. The negative message is equally structural: when the effective channel is below threshold, neither representation learning nor bilevel optimization can reconstruct information that is absent from S(T). Thus, while our algorithmic contributions are constructive, the chief limitation is that they necessarily inherit the information-theoretic and identifiability constraints of the underlying weak relation sources.

Theorem~4 makes explicit that mixture learning is not well-posed in full generality: with K ≤ 2 or with strong conditional correlations among annotators, there exist distinct reliability configurations inducing the same unlabeled distribution over {A(k)} but different Bayes-optimal mixtures. In practice, correlation is not an edge case but rather the default. For example, two similarity sources built on the same embedding model (even if post-processed differently) can be nearly deterministic functions of one another; multiple prompts to the same LLM may share systematic biases; and metadata-derived constraints often share common missingness patterns. Under such correlations, method-of-moments estimators based on cross-source agreement can exhibit high variance or converge to an arbitrary point on an equivalence class, and the gate wψ(T) can overfit to spurious agreement that does not correspond to correctness.

A principled way forward is to enlarge the observation model. One option is to replace conditional independence by a latent-factor correlation model, e.g. Aij(k) conditionally independent given (𝟙[yi = yj], uij) for a low-dimensional latent nuisance uij capturing shared artifacts. Another option is to require an anchor condition: for each domain d there exists at least one annotator whose errors are weakly correlated with the rest, or whose (pk(d) − qk(d)) is known to be bounded away from 0. Both relaxations move the burden from algorithm design to verifiable assumptions; nevertheless they are necessary if one aims to apply the mixture-learning guarantee in regimes with heavy source reuse.

Our gate wψ(T) is conditioned on a task descriptor Desc(T, z, m), which may include metadata m and embedding statistics of z = fθ(x). This conditioning is essential for domain-conditional mixtures, but it introduces an attack surface: if m is corrupted, strategically manipulated, or simply nonstationary, then the gating network can be driven to select an inferior mixture even when the underlying annotators are individually informative. This issue is not resolved by the planted-partition analysis, since identifiability is proven for fixed domains and honest descriptors. A minimal mitigation is to constrain Desc to be derived from x and z only (thereby coupling it to the representation), or to enforce invariances by adversarial training so that the gate is insensitive to known nuisance directions in m. A stronger mitigation is to treat the descriptor as uncertain and learn a distribution over mixtures, e.g. via Bayesian gating or ensembles, selecting mixtures by robust risk (worst-case over plausible descriptors) rather than by point estimates.
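
A minimal sketch of the robust-risk variant, with the gate, the descriptor perturbation model, and the unlabeled proxy risk all supplied as caller-defined callables (these interfaces are hypothetical, not part of the method as stated):

```python
import numpy as np

def robust_mixture(gate, desc, perturb, proxy_risk, n_samples=32, rng=None):
    """Worst-case gating sketch: sample plausible perturbations of the task
    descriptor, treat each gate output as a candidate mixture, and return
    the candidate whose worst-case unlabeled proxy risk over the perturbed
    descriptors is smallest. `gate(desc) -> w`, `perturb(desc, rng) -> desc`,
    and `proxy_risk(w, desc) -> float` are caller-supplied callables."""
    rng = np.random.default_rng() if rng is None else rng
    descs = [perturb(desc, rng) for _ in range(n_samples)]
    candidates = [gate(d) for d in descs] + [gate(desc)]
    worst = [max(proxy_risk(w, d) for d in descs) for w in candidates]
    return candidates[int(np.argmin(worst))]
```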

Nonstationarity also arises without adversaries: the reliability profile (pk(d), qk(d)) can drift over time as data collection changes or as an LLM backend is updated. In this setting, learning a static wψ is not the correct objective; one requires continual estimation of reliabilities and continual adaptation of the gate, ideally with explicit change-point detection so that outdated mixture rules are not propagated through the meta-learner.

Theorems~2–3 are stated for an SBM-style channel with equal cluster sizes and homogeneous edge probabilities. Real constraint graphs violate these assumptions in multiple ways: clusters are imbalanced; edges are not conditionally independent across pairs; and annotators may produce structured errors, such as systematically merging fine-grained classes or splitting classes by viewpoint. Moreover, A(k) may be better modeled as a weighted similarity rather than a Bernoulli adjacency, in which case calibration Cal(⋅) is not merely a technicality but a central modeling choice. While spectral methods and correlation clustering objectives are empirically robust under mild deviations, the sharpness of the recovery threshold is not guaranteed under arbitrary heterogeneity. A natural extension is to analyze mixed graphs under degree-corrected SBMs, overlapping communities, or latent-space models; doing so would better align the theory with common similarity sources (e.g. embedding-based kNN graphs), at the cost of more complex thresholds and potentially weaker converse results.

Our algorithm treats Part as a black box whose output is ``good’’ whenever SNR(w) exceeds a threshold. Computationally, however, partitioning dominates at scale if graphs are dense, and statistically it can be brittle if outliers or hubs are present. Dense graphs are prohibitive for large n since O(n²) storage and time are unacceptable; sparse graphs alleviate this, but then the effective recovery threshold depends on average degree, and aggressive sparsification can erase weak but consistent signal. In practice one must balance (i) the number of queried edges per annotator, (ii) the quality of the partitioner under sparsity, and (iii) the variance of the outer-loop signal used to update ψ. These trade-offs suggest studying end-to-end schemes where graph construction and partitioning are co-designed (e.g. degree-capped kNN, locality-sensitive hashing, or multi-stage partitioning that refines an initial coarse clustering), together with theoretical guarantees that track the entire pipeline rather than an idealized Part.
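
For episode-scale n, a degree-capped mutual-kNN construction of the kind mentioned above can be sketched as follows (at corpus scale the dense distance computation would be replaced by approximate nearest-neighbor search or LSH; names and defaults are illustrative):

```python
import numpy as np

def degree_capped_knn(Z, k=10, cap=20):
    """Mutual-kNN constraint graph with a hard degree cap. Capping degrees
    bounds memory and limits hub nodes that would otherwise dominate the
    partitioner; the dense distance matrix is acceptable only for small n."""
    n = Z.shape[0]
    D = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    nn = np.argsort(D, axis=1)[:, :k]          # k nearest neighbors per node
    A = np.zeros((n, n), dtype=bool)
    for i in range(n):
        A[i, nn[i]] = True
    A = A & A.T                                # keep only mutual neighbors
    for i in range(n):                         # enforce the per-node degree cap
        nbrs = np.flatnonzero(A[i])
        if len(nbrs) > cap:
            drop = nbrs[np.argsort(D[i, nbrs])[cap:]]
            A[i, drop] = False
            A[drop, i] = False
    return A.astype(float)
```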

Even if each A(k) is sparse, querying all K annotators per task can be expensive when some annotators correspond to paid APIs (e.g. LLM calls) or heavy retrieval operations. Our current presentation assumes at most K calls per task; this is already a constraint, but it does not model heterogeneous costs. A practical variant is to introduce costs ck and optimize a constrained mixture, e.g. maximize a proxy for SNR(w) subject to ∑kwkck ≤ C, or to allow the gate to skip annotators entirely. This moves the problem toward budgeted multi-view learning, where the algorithm must decide not only how to combine sources but which ones to acquire.
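
One simple greedy instantiation of the budgeted variant, assuming an unlabeled reliability estimate dk (e.g. from cross-source agreement moments): acquire annotators by estimated signal per unit cost, then weight the acquired ones proportionally to dk.

```python
import numpy as np

def budgeted_mixture(d, costs, budget):
    """Greedy budgeted acquisition: acquire annotators in order of estimated
    signal per unit cost (d_k^2 / c_k) while the budget allows, then weight
    acquired annotators proportionally to d_k, which maximizes the proxy SNR
    (sum_k w_k d_k) / sqrt(sum_k w_k^2) over the acquired set (Cauchy-Schwarz).
    Unacquired annotators get weight zero."""
    d = np.asarray(d, dtype=float)
    costs = np.asarray(costs, dtype=float)
    w, spent = np.zeros_like(d), 0.0
    for k in np.argsort(-(d ** 2) / costs):
        if d[k] > 0 and spent + costs[k] <= budget:
            w[k] = d[k]
            spent += costs[k]
    return w / w.sum() if w.sum() > 0 else w
```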

Pairwise constraints can leak information even when labels are not disclosed. In particular, must-link/cannot-link edges reveal relative similarity structure, which may be sensitive for user data, medical records, or proprietary corpora. If annotators are external services, sending pairs of examples (or their embeddings) can violate privacy policies; even sending derived metadata (e.g. timestamps or IDs) can enable linkage attacks. Addressing this requires explicit privacy mechanisms: local differential privacy for constraint generation, secure enclaves for annotator execution, or federated estimation of mixture statistics where only aggregated cross-annotator moments are shared. Each mechanism perturbs the effective (pk, qk), and thus directly impacts SNR(w) and recoverability; hence privacy is not orthogonal to performance in our setting, and must be incorporated into the recovery analysis.
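
For instance, a local-DP mechanism based on binary randomized response perturbs each reported constraint and shifts the effective channel parameters in a way that plugs directly into the recovery analysis (a sketch, not the paper's prescribed mechanism):

```python
import numpy as np

def randomized_response_edges(A, epsilon, rng=None):
    """Binary randomized response on reported constraints: each off-diagonal
    entry of a {0,1} constraint matrix is flipped independently with
    probability f = 1 / (1 + e^epsilon), which satisfies epsilon-local DP.
    If the clean channel has parameters (p, q), the privatized channel has
    p' = p(1-f) + (1-p)f and q' = q(1-f) + (1-q)f, so the privacy level
    enters SNR(w) and the recovery threshold directly."""
    rng = np.random.default_rng() if rng is None else rng
    f = 1.0 / (1.0 + np.exp(epsilon))
    flips = rng.random(A.shape) < f
    flips = np.triu(flips, k=1)
    flips = flips | flips.T                      # symmetric, diagonal untouched
    return np.where(flips, 1 - A, A)
```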

A central extension is to shift from batch meta-training over N episodes to an online process where tasks arrive sequentially and the gate must adapt in real time. Formally, one may treat the domain-conditioned optimal mixture w*(d) as time-varying and study regret relative to a comparator class of slowly drifting mixtures. This suggests algorithms that combine (i) fast, low-variance updates for ψ from unlabeled agreement signals, and (ii) slower updates for θ driven by the outer objective. A key technical question is whether one can obtain guarantees analogous to Theorem~3 with an additional term capturing tracking error of wψ under drift, while maintaining stable representation learning.
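
One plausible instantiation of the fast gate update is a multiplicative-weights step driven by an unlabeled per-annotator loss proxy (e.g. disagreement with the current mixed partition); the update below is a standard exponentiated-gradient step, shown only as a sketch of the online variant.

```python
import numpy as np

def online_mixture_update(w, loss_proxy, eta=0.1):
    """One multiplicative-weights (exponentiated-gradient) step for tracking
    a drifting mixture: loss_proxy[k] is an unlabeled per-episode loss for
    annotator k, and eta trades tracking speed against variance under drift."""
    w = np.asarray(w, dtype=float) * np.exp(-eta * np.asarray(loss_proxy, dtype=float))
    return w / w.sum()
```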

Our current access pattern queries each annotator for (a subset of) pairs and then mixes their returned graphs. When annotator queries are expensive, it is natural to actively choose which pairs to query and which annotators to consult. This can be framed as a contextual bandit: context given by Desc(T, z, m) and candidate pairs (i, j), action given by selecting (k, i, j), and reward given by an unlabeled proxy such as cross-source consistency or downstream outer loss improvement. Active querying can increase the effective degree where it matters (near cluster boundaries) and reduce redundant edges in high-confidence regions, thereby improving the recovery probability at fixed budget. The main open difficulty is to design unbiased or low-bias estimators for the outer objective under adaptive sampling, so that the gate does not exploit its own measurement process.
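
A deliberately simple epsilon-greedy stand-in for this bandit, splitting a per-task query budget between uniform exploration and greedy exploitation of an estimated value for each (annotator, pair) action (all names and interfaces are illustrative):

```python
import numpy as np

def select_queries(scores, uncertainty, budget, eps=0.1, rng=None):
    """Epsilon-greedy active querying: scores[k, p] estimates the value of
    asking annotator k about candidate pair p (e.g. expected consistency
    gain), and uncertainty[p] upweights pairs near current cluster
    boundaries. An eps fraction of the budget goes to uniform exploration,
    the rest to the highest estimated values; returns (annotator, pair)
    index tuples."""
    rng = np.random.default_rng() if rng is None else rng
    K, P = scores.shape
    value = (scores * np.asarray(uncertainty)[None, :]).ravel()
    n_explore = min(int(eps * budget), K * P)
    explore = rng.choice(K * P, size=n_explore, replace=False)
    taken = {int(i) for i in explore}
    greedy = [int(i) for i in np.argsort(-value) if int(i) not in taken]
    chosen = list(explore) + greedy[: budget - n_explore]
    return [divmod(int(i), P) for i in chosen]
```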

The framework clarifies when unsupervised meta-learning from weak pairwise signals is possible and when it is provably impossible. The remaining obstacles are primarily about relaxing independence assumptions, handling adversarial/nonstationary descriptors, and scaling the constraint-acquisition pipeline under cost and privacy constraints. These are not peripheral details: they determine whether the effective channel S(T) crosses the recoverability threshold, and thus whether the meta-learner has any hope of learning transferable structure.