Information Gravity
A Novel Application to Neural Network Architecture and Training
ZeroDriveX Research
Pre-Training Theory Paper — v1.1 | Training Results Addendum Forthcoming
Abstract
Information gravity — the principle that dense information regions exert pull on surrounding systems — is not a new concept. It appears in physics, economics, and information theory. What we present here is novel: a unified framework for applying information gravity across all three levels of neural computation simultaneously, and an architecture and training system built around it.
We identify that information density is non-uniform across training corpora, attention contexts, and inference states. Existing neural systems acknowledge this implicitly — through attention mechanisms, learning rate schedules, and importance sampling — but treat each level as a separate optimization problem. We argue they are the same problem at different scales, and that treating them as such produces qualitatively different systems.
We present the Branched Attractor Network (BAN), an architecture designed from first principles around this unified principle, and the Information Gravity Well training system that operationalizes it at the learning level. Empirical results during active training exceeded theoretical predictions, suggesting compounding interactions between components that conservative analysis did not anticipate.
1. Introduction
The concept that dense information regions attract computational resources is not new. Search engines weight results by information density. Attention mechanisms in neural networks weight tokens by relevance. Curriculum learning weights training examples by difficulty. Each of these is an application of the same underlying principle — information gravity — at a specific level of computation.
What has not been done is to unify these applications into a single coherent framework, to ask why the same principle should govern computation at all three levels simultaneously, and to design a neural architecture and training system that reflects this from the ground up.
This paper presents that unification. We do not claim to have invented information gravity. We claim to have identified it as the common thread beneath three classes of neural optimization that have been treated as separate problems, and to have built the first architecture and training system explicitly designed around it at all levels.
The practical motivation came from empirical observation. During training of a novel architecture on a Wikipedia corpus, a consistent pattern emerged: at epoch boundaries, loss would temporarily increase before declining again. The optimizer was drifting — losing ground it had already gained. Standard solutions address this through learning rate schedules. We asked a different question: why does the training data not simply pull the optimizer back to where learning was productive? The answer was that standard training does not give data gravitational structure. Our approach does.
2. What Is New Here
To be precise about the contribution, we distinguish what is established from what is novel.
2.1 What Is Established
- Information gravity as a concept — the tendency of systems to cluster around information-dense regions — is well documented in physics, economics, and information theory.
- Attention mechanisms weight computation by relevance — an implicit application of information gravity at the token level.
- Curriculum learning and importance sampling weight training examples by value — implicit applications at the data level.
- KV cache eviction policies prioritize recent or frequently accessed states — an implicit application at the inference level.
2.2 What Is Novel
- The explicit identification of information gravity as the unifying principle beneath attention, training weighting, and inference caching — and the formal treatment of these as the same problem at different scales.
- The Gravity Well training system: wells that form emergently from the loss signal itself, requiring no predefined importance function, no manual labeling, and no external curriculum. The gradient reveals the topology. The data writes its own curriculum.
- The Branched Attractor Network architecture: the first architecture designed from first principles around information gravity at both the inference and architectural levels simultaneously.
- The empirical observation that applying information gravity explicitly at all three levels produces compounding, superadditive effects — the components amplify each other in ways not predicted by individual component analysis.
3. Information Gravity Applied to Neural Systems
3.1 The Core Reframing
Standard neural training frames the optimizer as an agent navigating a loss landscape. The tools of the field — learning rate schedules, momentum, gradient clipping, warmup — are navigation tools. They improve how the optimizer moves through a landscape that is treated as given.
Information gravity reframes this. The landscape is not given. It has structure. High-density regions of the training corpus curve the loss landscape toward themselves. A system that feels this curvature does not need external navigation — it responds to the geometry of the information space.
The question is not: how do we navigate the loss landscape better? The question is: why does the landscape not pull us where we need to go? Information gravity is the mechanism by which it should.
3.2 Three Levels of Application
Information gravity manifests at three distinct levels of neural computation. We have implemented or observed it at each level.
Level One — Inference: KV Cache Scoring
During inference, the model maintains a cache of prior context states. Standard implementations treat all cached states equally: they keep everything or, under memory pressure, evict by recency. Information gravity at the inference level asks: not all cached context is equally valuable. States that carry dense semantic information, the load-bearing context, should be retained. Syntactic filler, connective tissue, and noise should be evicted first, regardless of recency.
KV cache scoring assigns gravitational weight to each cached state based on information density. The cache self-organizes around what matters without being told what matters.
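As an illustration only, the eviction rule described above could be sketched as follows. The paper does not specify how information density is measured, so the `density` field is a hypothetical score assumed to be supplied from elsewhere, and `CacheEntry` and `evict_lowest_gravity` are illustrative names rather than part of any described implementation.

```python
from dataclasses import dataclass


@dataclass
class CacheEntry:
    position: int    # token position in the context (higher = more recent)
    density: float   # hypothetical information-density score, assumed given


def evict_lowest_gravity(cache, capacity):
    """Keep the `capacity` highest-density entries, regardless of recency.

    Ties are broken in favour of more recent positions. Survivors are
    returned in their original positional order.
    """
    if len(cache) <= capacity:
        return list(cache)
    ranked = sorted(cache, key=lambda e: (e.density, e.position), reverse=True)
    kept = ranked[:capacity]
    return sorted(kept, key=lambda e: e.position)


cache = [
    CacheEntry(0, 0.9),   # old but dense: retained
    CacheEntry(1, 0.1),   # filler: evicted
    CacheEntry(2, 0.7),   # dense: retained
    CacheEntry(3, 0.2),   # recent filler: evicted despite recency
]
kept = evict_lowest_gravity(cache, capacity=2)
kept_positions = [e.position for e in kept]
```

In this toy run the oldest entry survives and the most recent one does not, which is the behaviour a purely recency-based policy would reverse.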
Level Two — Attention: Context Tagging
Within a single forward pass, standard attention distributes computation across all positions weighted only by learned query-key similarity. Information gravity at the attention level asks: not all tokens carry equal semantic weight. High-gravity tokens — those that anchor meaning for surrounding text — should receive more computational focus.
Context tagging identifies and weights high-gravity tokens. Attention becomes information-density-aware, not just position-aware.
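One possible way to make attention density-aware, offered as a sketch rather than the mechanism used in the BAN, is to add the logarithm of a per-token gravity weight to the query-key scores before the softmax. This is equivalent to multiplying each unnormalized attention weight by its gravity. The `gravity` values and the function name are assumptions for illustration; how tokens are actually tagged is not specified here.

```python
import math


def gravity_weighted_attention(scores, gravity):
    """Softmax over query-key scores with a log-gravity bias per key.

    scores  : raw query-key similarities for one query, one per key position
    gravity : hypothetical per-token gravity weights (> 0); 1.0 is neutral
    """
    biased = [s + math.log(g) for s, g in zip(scores, gravity)]
    peak = max(biased)  # subtract the max for numerical stability
    exps = [math.exp(b - peak) for b in biased]
    total = sum(exps)
    return [e / total for e in exps]


scores = [0.0, 0.0, 0.0]    # identical similarity for all three keys
gravity = [1.0, 2.0, 1.0]   # the middle token is tagged as high-gravity
weights = gravity_weighted_attention(scores, gravity)
```

With equal similarity scores, the token tagged with twice the gravity receives twice the attention weight of its neighbours.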
Level Three — Training: Gravity Wells
At the training level, information gravity asks: not all examples yield equal learning signal. Some sit in information-dense regions — producing large, consistent, generalizing gradient updates. Others sit in sparse regions — noise, redundancy, degenerate cases. A system that feels this structure does not treat all examples equally.
Gravity wells form dynamically from the loss signal. No predefined importance function is needed. When the loss drops faster than a threshold rate, that batch is identified as high-gravity. Its indices are strengthened. In subsequent steps, high-gravity samples are re-injected for additional gradient updates. Wells decay at epoch boundaries to prevent overfitting. The data writes its own curriculum from the gradient signal.
| Level | Novel Application of Information Gravity |
|---|---|
| Inference | KV Cache Scoring: high-gravity states retained, low-gravity evicted first |
| Attention | Context Tagging: high-gravity tokens receive weighted attention |
| Training | Gravity Wells: optimizer pulled back to high-signal regions emergently |
4. The Branched Attractor Network (BAN)
The BAN is an architecture designed from first principles around information gravity. It is not a transformer variant. It begins from a different question: what does productive reasoning look like, and how should computation reflect it?
4.1 Design Axioms
- Thought branches. Multiple candidate interpretations exist simultaneously.
- Focus is selective. The highest-gravity branch receives deep computation. Others remain alive peripherally.
- Peripheral awareness informs without overwhelming. Non-focus branches contribute signal but cannot dominate the focus.
- Convergence, not feedforward. The model settles toward stable representations through iteration — not a single pass.
- Hardware is not an assumption. The architecture must run on anything from a $10 device to a server.
4.2 The AttractorBlock — Core Primitive
The AttractorBlock implements fixed-point iterative convergence. Rather than computing once and passing forward, it refines a state through repeated updates until it settles:
z_{t+1} = (1 - α) · z_t + α · σ(W · z_t + h + γ · influence)
Where z_t is the state at iteration t, W is the recurrent weight matrix, h is the input projection, α controls the convergence rate, σ is the activation, γ scales the peripheral contribution, and influence is the gated peripheral signal. The iteration runs for a fixed number of steps, after which the final state is taken as a stable representation of the input.
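The update rule can be sketched in a few lines. This is a minimal illustration under stated assumptions: σ is taken to be tanh, the state starts at zero, and the values of `alpha`, `gamma`, `steps`, `W`, `h`, and `influence` are toy choices, none of which come from the paper. The weights are deliberately small so that the update is a contraction and the iteration settles.

```python
import math


def attractor_step(z, W, h, influence, alpha, gamma):
    """One update: z_next = (1 - alpha) * z + alpha * tanh(W z + h + gamma * influence)."""
    new_z = []
    for i in range(len(z)):
        pre = sum(W[i][j] * z[j] for j in range(len(z))) + h[i] + gamma * influence[i]
        new_z.append((1.0 - alpha) * z[i] + alpha * math.tanh(pre))
    return new_z


def attractor_block(h, W, influence, alpha=0.5, gamma=0.1, steps=50):
    """Run a fixed number of updates from z_0 = 0 and return the final state."""
    z = [0.0] * len(h)
    for _ in range(steps):
        z = attractor_step(z, W, h, influence, alpha, gamma)
    return z


# Toy values (assumptions), chosen so the update is a contraction.
W = [[0.1, -0.2], [0.3, 0.05]]
h = [0.5, -0.25]
influence = [0.2, 0.0]

z_final = attractor_block(h, W, influence)
# One extra step should barely move the state if it has settled.
z_next = attractor_step(z_final, W, h, influence, alpha=0.5, gamma=0.1)
residual = max(abs(a - b) for a, b in zip(z_final, z_next))
```

The `residual` measures how far one additional step moves the state. A fixed step count does not by itself guarantee convergence, so a check of this kind is one way to confirm that the chosen number of steps is sufficient.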
This mechanism reflects how cortical columns are believed to operate — iteratively refining toward stable states rather than computing once and forwarding. The AttractorBlock provides representational depth without proportional parameter cost.
4.3 Branching and Focus
The BranchGenerator produces N parallel reasoning branches from the input — each initialized from independent learned projections, creating N diverse starting states for the attractor. This is wide perception: all possibilities visible simultaneously.
The SalienceScorer evaluates each branch cheaply — without full convergence — assigning a salience score per branch per position. The highest-salience branch receives deep attractor convergence. Others remain alive.
The InfluenceGate lets peripheral branches nudge the focus. The signal is threshold-gated — below threshold, peripheral input is zeroed. Above threshold, it contributes at a hard-capped maximum (15%). The model stays informed by the periphery without being overwhelmed by it.
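The focus selection and gating described in this section might look like the sketch below. Several details are assumptions made for illustration: salience is reduced to one score per branch (the text describes a score per branch per position), the `threshold` value of 0.3 is invented, gated branches are combined by a salience-weighted average, and the 15% ceiling is applied as a fixed scale factor `cap`.

```python
def select_focus(salience):
    """Return the index of the highest-salience branch."""
    return max(range(len(salience)), key=lambda i: salience[i])


def influence_gate(branches, salience, focus, threshold=0.3, cap=0.15):
    """Combine peripheral branch states into one gated influence vector.

    Peripheral branches below `threshold` contribute nothing. The rest are
    averaged with salience weights and scaled by `cap` (the 15% ceiling).
    """
    dim = len(branches[focus])
    active = [i for i in range(len(branches))
              if i != focus and salience[i] >= threshold]
    if not active:
        return [0.0] * dim
    total = sum(salience[i] for i in active)
    mixed = [sum(salience[i] * branches[i][d] for i in active) / total
             for d in range(dim)]
    return [cap * m for m in mixed]


branches = [[1.0, 0.0], [0.0, 1.0], [4.0, 4.0]]   # toy branch states
salience = [0.9, 0.6, 0.1]                        # toy salience scores
focus = select_focus(salience)
influence = influence_gate(branches, salience, focus)
```

Here the third branch has large values but low salience, so it is zeroed by the gate, and the second branch reaches the focus at no more than 15% strength.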
4.4 Causal Sequence Output and Training Signal
The BAN produces per-position logits across the full sequence, with shape [B, T, vocab_size], rather than a single next-token prediction. Every token in every sequence contributes to the loss simultaneously. For a sequence length of 256, this yields 256 loss terms per sequence, one per position, rather than the single term obtained by predicting only the final next token.
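To make the per-position loss concrete, the sketch below computes a mean cross-entropy over every position of one sequence. It assumes a standard cross-entropy objective, which the paper does not state explicitly, and the function names are illustrative.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one row of vocabulary scores."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def sequence_loss(logits, targets):
    """Mean cross-entropy over every position of one sequence.

    logits  : T rows of vocab_size scores (one row per position)
    targets : T target token ids
    Returns the mean loss and the list of T per-position losses.
    """
    per_position = [-log_softmax(row)[t] for row, t in zip(logits, targets)]
    return sum(per_position) / len(per_position), per_position


# Toy case: T = 3 positions, vocabulary of 2, uninformative (all-zero) logits.
logits = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
targets = [0, 1, 0]
loss, per_position = sequence_loss(logits, targets)
```

Each of the T positions contributes its own term, so the gradient carries one error signal per token rather than one per sequence.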
This is not incidental. Dense gradient signal is itself an expression of information gravity at the training level — more signal means more information about where the model is wrong, which means more precise well formation, which means stronger gravitational pull toward productive learning regions.
4.5 Memory Architecture for Constrained Hardware
At inference, only the focus branch lives in RAM. All peripheral branches are stored on disk via memory-mapped files, and the OS handles paging automatically. RAM usage is therefore O(1) in the branch count. This contrasts with a transformer KV cache, which grows linearly with sequence length and must stay resident in memory (it is the attention computation, not the cache, that scales quadratically).

| Component | Memory |
|---|---|
| Focus branch (active) | ~1 KB in RAM |
| 16 peripheral branches | ~24 KB on disk via mmap |
| Transformer KV cache (equiv.) | O(seq), must fit in RAM entirely |
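The disk-backed layout can be sketched with Python's standard `mmap` module. The record format here (fixed-size little-endian float32 vectors, four values per branch) and the file name are assumptions chosen for the example. The paper does not describe the on-disk format.

```python
import mmap
import os
import struct
import tempfile

STATE_DIM = 4                              # toy size; real states would be larger
RECORD = struct.Struct(f"<{STATE_DIM}f")   # one branch state as float32 values


def write_branches(path, branches):
    """Write all peripheral branch states to disk as fixed-size records."""
    with open(path, "wb") as f:
        for state in branches:
            f.write(RECORD.pack(*state))


def read_branch(path, index):
    """Map the file and decode only the requested branch record."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return list(RECORD.unpack_from(mm, index * RECORD.size))


peripheral = [[float(b), 0.5, -1.0, 2.0] for b in range(16)]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "branches.bin")   # hypothetical file name
    write_branches(path, peripheral)
    file_size = os.path.getsize(path)
    branch_7 = read_branch(path, 7)
```

Because records have a fixed size, promoting a peripheral branch to focus requires reading a single offset, and the operating system pages in only the parts of the file that are touched.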
5. Gravity Wells: Emergent Training Curriculum
5.1 The Drift Problem
Standard training exhibits a pathology at epoch boundaries: when learning rate schedules reset or decay, the optimizer loses ground temporarily. Loss increases before declining again. This drift occurs because the optimizer has no memory of productive regions. Each batch is treated identically. Nothing pulls it back.
5.2 Well Formation — No Labels Required
Gravity wells form from the loss signal itself:
- Every step: record loss and batch indices.
- If loss drops faster than a threshold rate: those indices are high-gravity. Their well strength increases proportionally to the drop rate.
- Each subsequent step: sample a well batch weighted by strength. Apply an additional gradient update.
- Each epoch end: decay all well strengths toward baseline. Prevents overfitting to a fixed example set.
- No predefined importance function. No manual labeling. The gradient reveals the topology.
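The steps above can be sketched as a small bookkeeping class. This is an illustration under assumptions, not the training system itself: the drop rate is measured as the difference between consecutive step losses, the `drop_threshold` and `decay` values are invented, the baseline strength is zero, and the extra gradient update on the sampled well batch is left to the surrounding training loop.

```python
import random
from collections import defaultdict


class GravityWells:
    """Tracks per-example well strength from the loss signal alone."""

    def __init__(self, drop_threshold=0.05, decay=0.5):
        self.drop_threshold = drop_threshold   # hypothetical minimum loss drop per step
        self.decay = decay                     # hypothetical epoch-end decay factor
        self.strength = defaultdict(float)     # example index -> well strength
        self.prev_loss = None

    def record(self, loss, batch_indices):
        """Strengthen wells for this batch if the loss fell fast enough."""
        if self.prev_loss is not None:
            drop = self.prev_loss - loss
            if drop > self.drop_threshold:
                for i in batch_indices:
                    self.strength[i] += drop   # proportional to the drop
        self.prev_loss = loss

    def sample(self, k, rng=random):
        """Draw a well batch of size k, weighted by strength (with replacement)."""
        indices = [i for i, s in self.strength.items() if s > 0]
        if not indices:
            return []
        weights = [self.strength[i] for i in indices]
        return rng.choices(indices, weights=weights, k=k)

    def end_epoch(self):
        """Decay every well toward the zero baseline."""
        for i in self.strength:
            self.strength[i] *= self.decay


wells = GravityWells()
wells.record(2.00, [0, 1])   # first step: nothing to compare against
wells.record(1.50, [2, 3])   # large drop: examples 2 and 3 become wells
wells.record(1.49, [4, 5])   # small drop: below threshold, no wells formed
well_batch = wells.sample(4, rng=random.Random(0))
wells.end_epoch()
```

In a training loop, `record` would be called after every step, `sample` would supply the indices for the additional update, and `end_epoch` would run at each epoch boundary.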
5.3 The Stabilization Effect
When the optimizer wanders into a low-signal region, the well injection immediately counteracts it — pulling a high-signal batch and applying an additional update in the same step. The model never fully loses its way. The epoch-boundary drift pattern is eliminated because the wells continuously anchor the optimizer to productive regions across boundary events.
5.4 Deepening Curriculum
Wells are not static. Early wells form around obviously high-signal examples. As the model learns those regions, their gradient signal decreases and those wells weaken. New wells form at the frontier — subtler patterns, longer-range dependencies, cross-domain structure. The curriculum deepens naturally without external design.
6. Compounding — Why Results Exceeded Predictions
Theoretical predictions for loss reduction were based on individual component analysis. Empirical results during active training exceeded those predictions. This section explains why.
The three components (dense causal sequence loss, gravity wells, and attractor convergence) do not combine additively. They compound through a feedback loop:
- Dense causal sequence loss produces richer gradient signal per batch.
- Richer gradient signal means well formation is faster and more precise — the loss topology is clearer.
- Stronger wells mean the optimizer revisits high-signal regions more frequently.
- Frequent revisitation means the attractor's per-position convergence is repeatedly refined on the most information-dense examples.
- Refined attractor states produce even richer gradient signal on the next pass.
The system is self-reinforcing. Each component amplifies the others. The compounding rate was not captured in conservative individual component estimates, which assumed additive rather than multiplicative interaction.
When empirical results exceed theoretical predictions, the theory was conservative — not wrong. The direction was correct. The magnitude of compounding was underestimated.
7. Implications
7.1 The Parameter Count Question
The dominant scaling assumption in the field is that capability grows with parameter count. Information gravity suggests this is a consequence of computational waste at scale — larger models compensate for inefficient resource allocation by having more resources to waste. A system that allocates computation by information density at all three levels does not need to compensate with scale.
Early results show a 5M parameter model trained with information gravity principles reaching loss values that prior approaches needed significantly more parameters and compute to reach. This does not disprove scaling laws. It suggests they describe systems that ignore information gravity, not systems that implement it.
7.2 Multilingual and Cross-Domain Training
Information gravity has a particularly useful property in multilingual training. The same concept expressed in different languages creates overlapping high-signal regions. Wells forming around these regions are language-agnostic — they reflect information density, not surface form. A model trained this way develops representations that track meaning across languages rather than language-specific patterns.
7.3 Autonomous Agent Systems
The BAN architecture and information gravity framework extend naturally to autonomous agents. An agent operating in an environment faces the same resource allocation problem at all three levels: what prior experience to retain (inference), what features of the current state to attend to (attention), what experiences to learn from most deeply (training). Information gravity provides a principled answer at each level without requiring separate solutions.
7.4 Training Results Addendum
This paper presents the theoretical framework and architectural specification. A companion addendum presenting full empirical results — loss curves, perplexity comparisons between baseline and gravity well training runs, convergence rates, and hardware benchmarks — is forthcoming upon completion of active training experiments. Preliminary results are consistent with the predictions of the framework and exceed conservative estimates.
8. Conclusion
Information gravity is not a new idea. The pull of dense information regions on surrounding systems appears throughout nature, economics, and information theory. What we have done is identify it as the unifying principle beneath three classes of neural optimization that the field has treated as separate problems — inference caching, attention weighting, and training data selection — and build the first architecture and training system that explicitly implements it at all three levels simultaneously.
The results suggest that this unification produces compounding effects. Each level amplifies the others. The system self-organizes around information density in ways that externally controlled approaches cannot match, because the organization emerges from the structure of the information itself rather than from imposed schedules or heuristics.
The field has long known that not all computation is equally valuable. Information gravity is the name for why, and the framework for doing something about it.
ZeroDriveX Research · zerodrivex.com · v1.1