No. 04 · Evaluation
Project Pigeon: a small model that holds long context
Three early checkpoints, v5 to v7, from our post-transformer line.
Overview
We present three checkpoints of Project Pigeon, v5 through v7. Pigeon is a post-transformer line: a sub-quadratic sequence model trained with a lookahead objective, so its training target includes the trajectory of upcoming output, not only the next token. v7 is the smallest of the three. At 865M parameters it trains in 6.7 GB of GPU memory, and it is the only checkpoint that passes the long-context retrieval test.
Key features
- Post-transformer architecture: a sub-quadratic sequence model rather than an attention-only Transformer.
- Lookahead objective: the training target includes the trajectory of upcoming output, not only next-token prediction.
- Long-context retrieval at small scale: v7 retrieves a needle from roughly 2k–5k tokens of context at 865M parameters.
- Low footprint: v7 trains in 6.7 GB of GPU memory, about a fifth of v5's 34.4 GB.
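This note does not specify how the lookahead objective is computed. As a rough illustration only, one common way to make a training target cover more than the next token is a weighted multi-horizon cross-entropy, where position t is also scored on the tokens at t+2, t+3, and so on. The function name `lookahead_loss`, the per-horizon logits layout, and the weights below are all assumptions for the sketch, not Pigeon's actual loss.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def lookahead_loss(logits_per_horizon, tokens, weights):
    """Weighted multi-horizon cross-entropy (hypothetical formulation).

    logits_per_horizon[k][t] holds the logits predicted at position t for
    the token at position t + k + 1. weights[k] is the weight of horizon k;
    weights = [1.0] reduces to ordinary next-token cross-entropy.
    """
    total = 0.0
    for k, weight in enumerate(weights):
        offset = k + 1
        # Only positions whose target t + offset exists contribute.
        n_positions = len(tokens) - offset
        if n_positions <= 0:
            continue
        nll = 0.0
        for t in range(n_positions):
            log_probs = log_softmax(logits_per_horizon[k][t])
            nll -= log_probs[tokens[t + offset]]
        total += weight * nll / n_positions
    return total

# Toy example: vocabulary of 3, sequence of 4 tokens, uniform logits.
tokens = [0, 2, 1, 0]
uniform = [[0.0, 0.0, 0.0] for _ in tokens]
next_token_only = lookahead_loss([uniform], tokens, [1.0])
with_lookahead = lookahead_loss([uniform, uniform], tokens, [1.0, 0.5])
print(next_token_only, with_lookahead)
```

With uniform logits every target costs ln 3 nats, so the next-token-only loss is ln 3 and the two-horizon loss is 1.5 times that; the second horizon simply adds a down-weighted term for the token after next.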
Key highlights
- v7, at 865M parameters, trails v5 (9.34B) by 1.6 to 6.9 points on four of the five shared benchmarks and leads on Winogrande by 5.1 points.
- v7 passes 10 of 10 long-context needle-in-haystack trials; v5 passes 0 and v6 passes 1 (context ~2k–5k tokens).
- Validation perplexity falls from 9.91 (v5) to 8.17 (v6); v7 reaches 8.94 at 865M parameters and 5B tokens.
| | Pigeon v5 (9.34B) | Pigeon v6 (7.31B) | Pigeon v7 (865M) |
|---|---|---|---|
| Params | 9.34B | 7.31B | 865M |
| Tokens seen | 8B | 12B | 5B |
| GPU memory | 34.4 GB | 27.2 GB | 6.7 GB |
| Val perplexity | 9.91 | 8.17 | 8.94 |
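The size and memory comparisons can be checked directly from the table. The short sketch below computes each checkpoint's parameter count and GPU memory relative to v5, and converts validation perplexity to a per-token loss under the standard assumption that perplexity is the exponential of the mean cross-entropy in nats. The dictionary layout and the `summarize` helper are illustrative names, not part of any Pigeon tooling.

```python
import math

# Figures copied from the summary table.
checkpoints = {
    "v5": {"params": 9.34e9, "gpu_gb": 34.4, "val_ppl": 9.91},
    "v6": {"params": 7.31e9, "gpu_gb": 27.2, "val_ppl": 8.17},
    "v7": {"params": 865e6, "gpu_gb": 6.7, "val_ppl": 8.94},
}

def summarize(base_name):
    """Express size and memory relative to one checkpoint, plus loss in nats."""
    base = checkpoints[base_name]
    return {
        name: {
            "params": c["params"] / base["params"],
            "gpu_gb": c["gpu_gb"] / base["gpu_gb"],
            # Assumes perplexity = exp(mean per-token cross-entropy in nats).
            "loss_nats": math.log(c["val_ppl"]),
        }
        for name, c in checkpoints.items()
    }

summary = summarize("v5")
for name, row in summary.items():
    print(f"{name}: {row['params']:.2f}x params, "
          f"{row['gpu_gb']:.2f}x memory, {row['loss_nats']:.2f} nats/token")
```

On these figures v7 has about 0.09x the parameters and 0.19x the GPU memory of v5, which is where "roughly a tenth" and "about a fifth" come from.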
Benchmarks
All three checkpoints report the same five tasks. v6 is the strongest on four of them (PIQA, ARC-Easy, HellaSwag, and ARC-Challenge); v7, at roughly a tenth of v5's parameters, trails v5 by 1.6 to 6.9 points on those four and leads both larger checkpoints on Winogrande.
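To make the comparison concrete, the sketch below takes the five shared scores from the full results table and computes the v7 minus v5 gap and the best checkpoint per task. The variable names are illustrative only.

```python
# Accuracy (percent) on the five shared tasks, copied from the full results table.
shared = {
    "PIQA":          {"v5": 69.1, "v6": 70.7, "v7": 64.0},
    "ARC-Easy":      {"v5": 54.9, "v6": 57.8, "v7": 48.0},
    "Winogrande":    {"v5": 51.4, "v6": 53.4, "v7": 56.5},
    "HellaSwag":     {"v5": 41.6, "v6": 46.9, "v7": 40.0},
    "ARC-Challenge": {"v5": 32.8, "v6": 34.4, "v7": 29.5},
}

# Gap of v7 relative to v5 in points; negative means v7 is behind.
gaps = {task: round(s["v7"] - s["v5"], 1) for task, s in shared.items()}

# Checkpoint with the highest score on each task.
best = {task: max(s, key=s.get) for task, s in shared.items()}

for task in shared:
    print(f"{task:14s} v7 - v5 = {gaps[task]:+.1f}   best: {best[task]}")
```

The gaps come out as -5.1, -6.9, +5.1, -1.6, and -3.3 points in table order, with v6 best on every task except Winogrande.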
Long-context retrieval
On a needle-in-haystack test, v7 passes all ten trials; v5 passes none and v6 one. The case for sub-quadratic sequence models rests on long context, and here the smallest checkpoint is the one that delivers it. Whether the result holds at longer contexts or across a full training run is still open.
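For readers unfamiliar with the setup, a needle-in-haystack trial can be sketched as follows: hide one short fact at a random depth in filler text, ask for it back, and count the trial as passed if the answer contains the fact. This is a minimal illustration, not the harness used for the numbers above. The filler sentence, prompt wording, pass criterion, and the `generate` callable are all assumptions, and the haystack length here is counted in sentences rather than tokens.

```python
import random

FILLER = "The sky was clear and the road was quiet."  # hypothetical filler text

def build_prompt(secret, n_filler, rng):
    """Hide one 'needle' sentence at a random depth inside filler text."""
    needle = f"The secret code is {secret}."
    sentences = [FILLER] * n_filler
    sentences.insert(rng.randrange(n_filler + 1), needle)
    return " ".join(sentences) + " Question: what is the secret code? Answer:"

def run_trials(generate, n_trials=10, n_filler=300, seed=0):
    """Count trials in which the model's output contains the hidden code.

    `generate` is any callable mapping a prompt string to a completion string.
    """
    rng = random.Random(seed)
    passed = 0
    for _ in range(n_trials):
        secret = str(rng.randrange(100000, 1000000))
        prompt = build_prompt(secret, n_filler, rng)
        if secret in generate(prompt):
            passed += 1
    return passed

# Stand-in "models" for demonstration only; a real run would call a checkpoint.
def oracle(prompt):
    # Reads the code straight out of the prompt, so it always passes.
    return prompt.split("The secret code is ")[1].split(".")[0]

def forgetful(prompt):
    return "I do not know."

print(run_trials(oracle), run_trials(forgetful))
```

Swapping `oracle` for a wrapper around a real checkpoint, and sizing `n_filler` to hit a target token count, would turn this into a usable pass/fail count out of ten.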
Full results
Beyond the five shared tasks, v5 and v6 also report MMLU; v7 instead reports SciQ, OpenBookQA, and LAMBADA. A dash marks a benchmark the checkpoint did not report.
| Benchmark | Pigeon v5 (9.34B) | Pigeon v6 (7.31B) | Pigeon v7 (865M) |
|---|---|---|---|
| PIQA | 69.1 | 70.7 | 64.0 |
| ARC-Easy | 54.9 | 57.8 | 48.0 |
| Winogrande | 51.4 | 53.4 | 56.5 |
| HellaSwag | 41.6 | 46.9 | 40.0 |
| ARC-Challenge | 32.8 | 34.4 | 29.5 |
| MMLU (5-shot) | 26.0 | 27.3 | — |
| SciQ | — | — | 74.5 |
| OpenBookQA | — | — | 30.5 |
| LAMBADA | — | — | 22.5 |
| Val perplexity | 9.91 | 8.17 | 8.94 |
Reading the results
These are early checkpoints: 5 to 12 billion training tokens, far short of a full run, with v7 the least trained at 5 billion. Read the scores against chance and the token budget, not against finished models.
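One way to read the scores against chance is to subtract each task's random-guess accuracy. The sketch below does that under the assumption that the tasks use their usual multiple-choice formats: two options for PIQA and Winogrande, four for ARC, HellaSwag, MMLU, SciQ, and OpenBookQA. Those chance levels are assumptions about the evaluation setup, not figures from this report, and LAMBADA is left out because open-vocabulary word prediction has no comparable baseline.

```python
# Assumed random-guess accuracy (percent) for each multiple-choice task.
CHANCE = {
    "PIQA": 50.0, "Winogrande": 50.0,
    "ARC-Easy": 25.0, "ARC-Challenge": 25.0, "HellaSwag": 25.0,
    "MMLU": 25.0, "SciQ": 25.0, "OpenBookQA": 25.0,
}

# Reported accuracy (percent), copied from the full results table.
scores = {
    "v5": {"PIQA": 69.1, "ARC-Easy": 54.9, "Winogrande": 51.4,
           "HellaSwag": 41.6, "ARC-Challenge": 32.8, "MMLU": 26.0},
    "v6": {"PIQA": 70.7, "ARC-Easy": 57.8, "Winogrande": 53.4,
           "HellaSwag": 46.9, "ARC-Challenge": 34.4, "MMLU": 27.3},
    "v7": {"PIQA": 64.0, "ARC-Easy": 48.0, "Winogrande": 56.5,
           "HellaSwag": 40.0, "ARC-Challenge": 29.5,
           "SciQ": 74.5, "OpenBookQA": 30.5, "LAMBADA": 22.5},
}

def margin_over_chance(checkpoint):
    """Points above the assumed chance level, for tasks that have one."""
    return {
        task: round(acc - CHANCE[task], 1)
        for task, acc in scores[checkpoint].items()
        if task in CHANCE
    }

for name in scores:
    print(name, margin_over_chance(name))
```

Under these assumptions several scores sit only a few points above chance, for example MMLU for v5 and v6 and ARC-Challenge for v7, so small differences between checkpoints on those tasks should be read with caution.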
Project Pigeon is active research. These checkpoints are internal and have not been released. The architecture direction is described in the research note.
Questions and corrections are welcome at contact@auxerta.com.