How do LLMs change the human knowledge graph?
Gavin Leech asks: "What fraction of knowledge is inside the affine hull of existing knowledge?"
This is a question about LLMs: supposing LLMs do only recombination, and the affine hull is the set of true recombinations, how much lies inside it for a given training set?
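Taking the metaphor at face value (the formula below is just the standard mathematical definition, not something from the original question): the affine hull of a set of points $S$ is the set of all affine combinations of its elements,

$$
\operatorname{aff}(S) = \left\{ \sum_{i=1}^{k} \lambda_i x_i \;:\; x_i \in S,\ \lambda_i \in \mathbb{R},\ \sum_{i=1}^{k} \lambda_i = 1 \right\},
$$

so "recombination" here means forming weighted combinations of pieces of knowledge already in the training set, and the question is how much of what's true falls inside that hull.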
It's an interesting formulation. It raises a few questions, to my mind:
- What percentage of the hull is currently accessible?
- What is the economic value of the never-accessed or currently inaccessible interior of the hull?
- How much does each LLM generation reduce the cost of access, and how much value does that unlock (the increase in accessible space from GPT-4 → 5, etc.)?
We can model this as a weighted graph. Each node represents a unit of potential knowledge. At any given time, a node falls into one of three states:
1. Disconnected — not connected to current knowledge by any known derivation path. Requires genuine discovery to become reachable.
2. Connected but expensive — a path exists, but the traversal cost exceeds current capacity. Requires cost-lowering technology to become affordable.
3. Accessible — connected and affordable. We might think of this as the "knowledge footprint" or "occupied knowledge space".
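A minimal sketch of this classification, assuming knowledge units are nodes in a weighted graph, "current knowledge" is a set of seed nodes, and affordability is a single cost threshold (all names and parameters below are illustrative, not part of the original model):

```python
import heapq

def classify(nodes, edges, known, budget):
    """Classify each node as 'accessible', 'expensive', or 'disconnected'.

    nodes:  iterable of node ids
    edges:  dict mapping (u, v) -> traversal cost (undirected)
    known:  set of nodes representing current knowledge (cost 0 to reach)
    budget: cost threshold below which traversal is affordable
    """
    nodes = list(nodes)
    adj = {n: [] for n in nodes}
    for (u, v), cost in edges.items():
        adj[u].append((v, cost))
        adj[v].append((u, cost))

    # Multi-source Dijkstra from the known set: cheapest derivation path to each node.
    dist = {n: float("inf") for n in nodes}
    heap = []
    for n in known:
        dist[n] = 0.0
        heapq.heappush(heap, (0.0, n))
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue
        for v, cost in adj[u]:
            if d + cost < dist[v]:
                dist[v] = d + cost
                heapq.heappush(heap, (dist[v], v))

    # Map each node's cheapest path cost onto the three states.
    states = {}
    for n in nodes:
        if dist[n] == float("inf"):
            states[n] = "disconnected"   # needs genuine discovery
        elif dist[n] > budget:
            states[n] = "expensive"      # needs cost-lowering technology
        else:
            states[n] = "accessible"     # inside the knowledge footprint
    return states

# Toy example: "d" has no path at all, "c" is reachable but too costly.
nodes = ["a", "b", "c", "d"]
edges = {("a", "b"): 1.0, ("b", "c"): 10.0}
print(classify(nodes, edges, known={"a"}, budget=5.0))
# {'a': 'accessible', 'b': 'accessible', 'c': 'expensive', 'd': 'disconnected'}
```

The cheapest derivation path from the known set, compared against the budget, is what sorts nodes into the three states.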
The accessible graph expands through two coupled mechanisms. Technology reduces traversal costs deterministically, making expensive nodes affordable; discovery finds new edges stochastically, connecting isolated regions. These reinforce each other: cheaper traversal enables more frontier exploration, which increases the discovery rate.
The dashed circle represents the accessibility horizon, the cost threshold below which traversal is affordable.
Clicking Technology reduces all edge costs, potentially pulling expensive nodes within reach. Clicking Discovery has a chance to find new edges connecting isolated regions.
Watch how the mechanisms couple: as technology makes paths cheaper, more nodes become accessible, which increases the frontier and the discovery rate, which connects more regions, which technology then makes affordable. This is the engine of knowledge expansion.
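A rough sketch of how those two moves might work under the hood (the decay factor, discovery probability, and frontier-scaling rule are assumptions for illustration, not a description of the actual demo):

```python
import random

def technology_step(edges, decay=0.9):
    """Deterministic cost reduction: scale every edge cost down."""
    return {e: cost * decay for e, cost in edges.items()}

def discovery_step(nodes, edges, states, p_base=0.02):
    """Stochastic edge finding: a larger accessible frontier means a
    higher chance of stumbling on an edge into an isolated region."""
    frontier = [n for n in nodes if states[n] == "accessible"]
    isolated = [n for n in nodes if states[n] == "disconnected"]
    if frontier and isolated and random.random() < p_base * len(frontier):
        u, v = random.choice(frontier), random.choice(isolated)
        new_edges = dict(edges)
        new_edges[(u, v)] = random.uniform(5.0, 20.0)  # new path, initially pricey
        return new_edges
    return edges
```

Each tick would then reclassify the graph (with something like `classify` above): cheaper edges pull expensive nodes inside the horizon, while a newly found edge turns a disconnected region into an expensive one for the next technology step to work on.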
Three regimes of knowledge increase
How do LLMs fit in here? As cost-decreasing technologies. Consider, e.g., Claude Code or GPT Codex: "infinite code generation" sounds like creation, but it's mostly cost collapse, because the unrealized code was already inside the affine hull of existing knowledge. The interesting question is how cost reduction interacts with discovery. Here are three regimes:
The third regime matters because it describes a lot of human knowledge historically: we knew things were connected, but the path was too expensive to walk. LLMs collapse traversal costs, converting regime 3 situations into regime 2.
In practice, cost reduction is continuous and discovery is stochastic. The two interact messily: cheaper traversal expands the frontier, which raises discovery probability, which connects new regions, which gives cost reduction new territory to work on.
When a discovery connects a new region, the ceiling jumps, but realized value takes time to catch up as cost reduction makes the new paths affordable. Sometimes discoveries cluster; sometimes there are long droughts. The system is path-dependent.
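One way to make the ceiling-versus-realized-value lag concrete, continuing the same toy model (the node values and the two definitions below are my own illustrative choices):

```python
def ceiling_and_realized(nodes, states, value):
    """Ceiling: total value of every connected node (accessible or merely
    expensive). Realized: value of the accessible nodes only."""
    ceiling = sum(value[n] for n in nodes if states[n] != "disconnected")
    realized = sum(value[n] for n in nodes if states[n] == "accessible")
    return ceiling, realized
```

When a discovery fires, `ceiling` jumps immediately; `realized` only catches up over later technology steps, once the new paths become cheap enough to cross.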
When we look at the aggregate statistics, we'll likely say something like: LLMs boosted productivity by X%, leading to Y% growth in GDP. But it's more interesting to decompose that into the value of cheap code generation, a cost-lowering, within-hull operation, and the discoveries that code enables, which are connection-enabling and hull-expanding.
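As a rough identity in the toy model's terms (my notation, not a measurable statistic),

$$
\Delta V_{\text{total}} \approx \Delta V_{\text{within-hull}} + \Delta V_{\text{hull-expanding}},
$$

where the first term counts nodes made affordable along already-known edges (cheap code generation) and the second counts nodes reachable only through newly discovered edges (the discoveries that code enables).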