Proven on real data. Every figure measured.
Spiderbrain turns a project into a dependency graph and scores every node for importance, blast radius and risk. For those scores to be worth acting on, two things must hold: the ranking has to concentrate what matters, and the engine must never silently corrupt a node's identity or give a different answer on the same input. This report documents both, measured on 14 real datasets: 13 external corpora across six domains, plus Spiderbrain's own codebase.
Summary
Across 13 external corpora, signal lift ranged 1.48× to 2.41× (importance concentrated 48% to 141% above a typical item). Identity silent-corruption was zero; 94 real same-hash collision groups were tested with zero silent misattachment; every decision score was finite; and every result was bit-identical under suppression and deterministic under shuffling. One defect was found by this validation and fixed before release. Run on its own codebase, Spiderbrain found 7 masters across 1,525 nodes, with clean community structure (modularity 0.46) and metrics that its shortcut audit reported as independent.
| Domain | Dataset | Nodes | Signal lift | Collisions → misattach | Deterministic |
|---|---|---|---|---|---|
| Software & code | Saroir | 1,672 | 2.25× +125% | 78 → 0 | ✓ |
| Software & code | perform.digital | 219 | 1.90× +90% | 13 → 0 | ✓ |
| People analytics | employee-attrition | 1,470 | 2.41× +141% | none | ✓ |
| Insurance | insurance | 1,338 | 2.03× +103% | 1 → 0 | ✓ |
| Clinical trials | oncology | 1,000 | 1.57× +57% | none | ✓ |
| Clinical trials | cardiology | 1,000 | 1.59× +59% | 1 → 0 | ✓ |
| Clinical trials | diabetes | 1,000 | 1.60× +60% | none | ✓ |
| Clinical trials | mental_health | 1,000 | 1.56× +56% | none | ✓ |
| Clinical trials | neurology | 1,000 | 1.54× +54% | 1 → 0 | ✓ |
| Clinical trials | vaccines | 1,000 | 1.48× +48% | none | ✓ |
| Diagnostics | breast-cancer | 569 | 1.91× +91% | none | ✓ |
| Geophysics & materials | earthquakes | 2,000 | 1.86× +86% | none | ✓ |
| Geophysics & materials | batteries | 339 | 1.94× +94% | none | ✓ |
| Developer tooling (self) | Spiderbrain | 1,525 | 1.11× webscore† | self-analysis | ✓ |
† The blind-spot signal lift is not computable on the self-brain (metric variance too low, see Assessment); webscore concentration is shown instead. Identity and collision guarantees apply to the decision-layer mutation tests on the 13 external corpora.
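The percentage beside each lift follows from the ratio itself: a lift of 2.41× is 141% above a typical item. A minimal sketch of that arithmetic, using three lift values from the validation log; the function name is illustrative and not part of Spiderbrain.

```python
# Signal lift values copied from this report's validation log.
LIFTS = {
    "Saroir": 2.2467,
    "employee-attrition": 2.4143,
    "vaccines": 1.4800,
}


def percent_above_typical(lift: float) -> int:
    """Convert a lift ratio (e.g. 2.41x) into a whole-percent gain (e.g. +141%)."""
    return round((lift - 1.0) * 100)


for name, lift in LIFTS.items():
    print(f"{name}: {lift:.2f}x = +{percent_above_typical(lift)}%")
# Saroir: 2.25x = +125%
# employee-attrition: 2.41x = +141%
# vaccines: 1.48x = +48%
```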
Metrics, names and definitions
Every metric used in this report, in plain language.
The five decision-layer guarantees
| # | Guarantee | What it tests | Target |
|---|---|---|---|
| A | Identity safety | Rename, edit, delete and correlated-churn mutations applied to a 60-node sample. | 0 silent corruptions |
| B | Forced collision | For every real same-hash group, judge one member, delete it, and prove the judgment never silently re-attaches to a surviving twin. | 0 silent misattach |
| C | Decision scores | Reach (node and cluster scope) and necessity computed on real data. | 0 non-finite |
| D | Determinism | The matcher digest must be stable when node order is shuffled, run three times. | identical ×3 |
| E | Suppression invariant | Partitioning the ranked output must leave the signal lift unchanged. | 0 bits changed |
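Guarantee D can be pictured as a small harness: shuffle the node order, recompute a digest, and require the same value every time. The sketch below is an assumption about the shape of such a check, not Spiderbrain's matcher. `matcher_digest` is a hypothetical stand-in that hashes a canonical serialization, and the `hash` values are placeholders.

```python
import hashlib
import json
import random


def matcher_digest(nodes: list[dict]) -> str:
    """Hypothetical digest: hash a canonical, order-independent serialization."""
    canonical = sorted(json.dumps(n, sort_keys=True) for n in nodes)
    return hashlib.sha256("\n".join(canonical).encode("utf-8")).hexdigest()


def is_deterministic(nodes: list[dict], runs: int = 3, seed: int = 0) -> bool:
    """Shuffle node order `runs` times and require the same digest each time."""
    rng = random.Random(seed)  # fixed seed so the check itself is repeatable
    baseline = matcher_digest(nodes)
    for _ in range(runs):
        shuffled = nodes[:]
        rng.shuffle(shuffled)
        if matcher_digest(shuffled) != baseline:
            return False
    return True


# File names come from the report; the hash values are placeholders.
nodes = [
    {"id": "src/graph/types.ts", "hash": "a1"},
    {"id": "package.json", "hash": "b2"},
    {"id": "bridge/server.mjs", "hash": "c3"},
]
print(is_deterministic(nodes))  # True
```

Sorting before hashing is what makes this stand-in digest independent of input order; a real matcher would need the same property at every step that iterates over nodes.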
Validation log
Per-dataset results, exactly as the harness recorded them. Signal lift is the discriminationRatio; rename-auto is the share of renames auto-resolved (the rest correctly defer); reach and necessity are decision-layer scores on the top-ranked node.
| Dataset | Type | Nodes | Edges | Signal lift | Rename-auto | Collisions → misattach | Reach n / c | Necessity |
|---|---|---|---|---|---|---|---|---|
| Saroir | Real codebase | 1,672 | 2,286 | 2.2467× | 44/60 | 78 → 0 | 0.19 / 0.20 | 0.96 |
| perform.digital | Real codebase | 219 | 254 | 1.9046× | 52/60 | 13 → 0 | 0.64 / 0.58 | 0.99 |
| employee-attrition | Public dataset | 1,470 | 2,316 | 2.4143× | 60/60 | none | 0.36 / 0.83 | 0.96 |
| insurance | Public dataset | 1,338 | 1,271 | 2.0319× | 60/60 | 1 → 0 | 0.67 / 0.65 | 0.96 |
| oncology | ClinicalTrials.gov | 1,000 | 4,836 | 1.5673× | 60/60 | none | 0.54 / 0.54 | 0.96 |
| cardiology | ClinicalTrials.gov | 1,000 | 3,788 | 1.5908× | 60/60 | 1 → 0 | 0.78 / 0.60 | 0.96 |
| diabetes | ClinicalTrials.gov | 1,000 | 3,232 | 1.6012× | 60/60 | none | 0.75 / 0.57 | 0.96 |
| mental_health | ClinicalTrials.gov | 1,000 | 4,641 | 1.5604× | 60/60 | none | 0.57 / 0.52 | 0.96 |
| neurology | ClinicalTrials.gov | 1,000 | 4,099 | 1.5414× | 59/60 | 1 → 0 | 0.00 / 0.67 | 0.96 |
| vaccines | ClinicalTrials.gov | 1,000 | 4,138 | 1.4800× | 60/60 | none | 0.74 / 0.82 | 0.96 |
| breast-cancer | Public dataset | 569 | 978 | 1.9135× | 60/60 | none | 0.00 / 0.71 | 0.94 |
| earthquakes | Public dataset | 2,000 | 5,028 | 1.8562× | 60/60 | none | 0.78 / 0.91 | 0.96 |
| batteries | Public dataset | 339 | 1,139 | 1.9402× | 60/60 | none | 0.81 / 0.93 | 0.96 |
Every corpus also passed, uniformly: silent-corruption 0, decision scores finite, determinism identical ×3, and signal lift bit-identical before and after suppression. Suppression removed exactly 2 adjudicated nodes per corpus without changing the ranking of the rest.
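Two of the uniform checks above are simple to state precisely: "finite" means no NaN or infinity, and "bit-identical" is stricter than "approximately equal". A minimal sketch, assuming scores are IEEE 754 doubles; the helper names are illustrative and not taken from the harness.

```python
import math
import struct


def all_finite(scores: list[float]) -> bool:
    """True when no score is NaN or +/-infinity."""
    return all(math.isfinite(s) for s in scores)


def bits_changed(before: float, after: float) -> int:
    """Count differing bits between two IEEE 754 doubles (0 means bit-identical)."""
    a = struct.unpack(">Q", struct.pack(">d", before))[0]
    b = struct.unpack(">Q", struct.pack(">d", after))[0]
    return bin(a ^ b).count("1")


# Reach (node, cluster) and necessity for Saroir, taken from the log above.
print(all_finite([0.19, 0.20, 0.96]))    # True
print(bits_changed(2.2467, 2.2467))      # 0
print(bits_changed(0.1 + 0.2, 0.3) > 0)  # True: close is not bit-identical
```

Comparing raw bit patterns rather than using a tolerance is what lets a check report "0 bits changed" instead of "within epsilon".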
14th dataset: Spiderbrain on Spiderbrain
We run the engine on its own source. This is the brain of the shipping app, generated on 2026-06-04.
The 7 masters it found: src/graph/types.ts, package.json, public/brain.json, src-tauri/capabilities/default.json, supabase/referral.sql, bridge/server.mjs, scripts/vendor-engine.mjs. The top-scoring node is src/graph/types.ts (webscore 8.7): the type system is the keystone the rest of the app leans on, which is exactly what you would expect, and a useful sanity check that the score tracks reality.
Rationale
Why across domains. An importance score that only works on code proves little. Code, people, insurance, medicine, geophysics and materials share nothing except structure, so passing all of them shows the method keys on structure, not on the quirks of one domain.
Why these guarantees. A ranking is only useful if the thing it ranks keeps its identity as the project changes, and if the same input always yields the same answer. So we test identity under churn, the worst case of a deleted content twin, finiteness, determinism, and invariance under suppression.
Why dogfood. We run Spiderbrain on Spiderbrain. If the team would not trust it on its own code, neither should you.
Why determinism is the bar. Agents act on these scores. A score that changes between runs cannot be audited or trusted. Same input, same brain, every time.
Assessment
All 13 external corpora pass every guarantee after one fix, and the self-analysis behaved exactly as designed, including where it declined to score.
The defect this validation caught. The first run was not clean, which is the point of testing on real data. Neurology and insurance each showed one silent misattachment. The root cause: the matcher assumed a coincidental content twin has different neighbors, but on dense record graphs two records can share both the content hash and most of their neighbors. When one was deleted and exactly one twin survived, the judgment auto-applied. The fix captures whether the content hash was unique at the moment of decision; a judgment made on a non-unique hash never auto-applies and always defers to a human. Code brains never hit this because their twins differ structurally, so only the dense real record graphs exposed it.
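The fix described above can be sketched as a flag captured when the judgment is made and consulted when it would be re-attached. This is an illustration under stated assumptions, not the shipped matcher: the `Judgment` type, both function names, and the node ids are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Judgment:
    node_id: str
    content_hash: str
    hash_was_unique: bool  # captured at the moment of decision


def record_judgment(node_id: str, content_hash: str, all_hashes: list[str]) -> Judgment:
    """Store the judgment together with whether its hash was unique right now."""
    unique = all_hashes.count(content_hash) == 1
    return Judgment(node_id, content_hash, unique)


def reattach(judgment: Judgment, surviving: dict[str, str]) -> Optional[str]:
    """Return the node id to auto-apply to, or None to defer to a human.

    `surviving` maps node id -> content hash after the judged node was deleted.
    """
    if not judgment.hash_was_unique:
        return None  # judged on a non-unique hash: never auto-apply
    matches = [nid for nid, h in surviving.items() if h == judgment.content_hash]
    return matches[0] if len(matches) == 1 else None


# Two content twins share hash "h1"; one is judged, then deleted.
twin = record_judgment("trial-A", "h1", ["h1", "h1", "h2"])
print(reattach(twin, {"trial-B": "h1", "trial-C": "h2"}))  # None (defers)

# A genuine rename: the hash was unique when judged and reappears once.
renamed = record_judgment("old.ts", "h2", ["h1", "h1", "h2"])
print(reattach(renamed, {"new.ts": "h2"}))  # new.ts
```

Without the `hash_was_unique` guard, the first case finds exactly one surviving match and auto-applies to the twin, which is the silent misattachment the first run exposed.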
Honest notes. Across the six clinical areas, signal lift spans 1.48× (vaccines) to 1.60× (diabetes); we show that exact range rather than rounding it to a tidy 1.5× to 1.6×. The self-brain's blind-spot index could not be computed: its variance was too low because the brain was built without commit history and the codebase is structurally uniform, so the engine returned a diagnostic instead of a fabricated number. Rename-auto rates below 60/60 (44/60 and 52/60 on the two code brains, 59/60 on neurology) are collision-group members correctly deferring, not failures: you cannot auto-resolve a rename while an identical twin still exists.
Scope and limits. Mutations are synthetic but applied systematically (60-sample stride, no clock, no randomness). Each corpus went through the harness once, which is sufficient because every result is deterministic and bit-identical under shuffle; the determinism check itself re-runs the matcher three times on shuffled input. Public datasets are modeled as dependency graphs, not used in their original tabular form.
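A stride sample with no clock and no randomness might look like the sketch below. The exact stride rule used by the harness is not stated in this report, so the even-spacing formula here is an assumption, as are the placeholder node ids.

```python
def stride_sample(node_ids: list[str], k: int = 60) -> list[str]:
    """Pick up to k nodes at an even stride over the sorted ids (no RNG, no clock)."""
    ordered = sorted(node_ids)  # sorting removes any dependence on input order
    n = len(ordered)
    if n <= k:
        return ordered
    return [ordered[(i * n) // k] for i in range(k)]


ids = [f"node-{i:04d}" for i in range(1672)]  # Saroir-sized placeholder ids
sample = stride_sample(ids)
print(len(sample), sample[:3])
# 60 ['node-0000', 'node-0027', 'node-0055']
```

Because the sample depends only on the sorted ids, running it again, or on a shuffled copy of the same ids, selects the same 60 nodes.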
Impact
Most relevance and importance scores are heuristic and drift between runs. Spiderbrain's score does not. It is deterministic, it is identity-safe under churn, and it generalizes from source code to clinical trials to earthquakes on the same engine. That is the property an AI agent needs from its memory: a structural map it can act on, replay and audit, that says the same thing today and next week. This is the foundation under context engineering, the difference between a model that re-reads everything and one that knows what matters.
What this means, by domain
Software & code
On three real codebases (Saroir, perform.digital and Spiderbrain itself) the engine concentrated importance on the handful of files that carry the system. On the two external codebases it held identity safe through 91 real content-hash collision groups (78 in Saroir, 13 in perform.digital) with zero silent misattachment. For agents and engineers, that means attention lands where failure actually hurts, and the map does not drift.
People analytics
On 1,470 employee records, signal lift was the highest we measured (+141%), pulling the few factors that truly drive attrition out of hundreds of fields.
Insurance
On 1,338 policy records, the engine ranked the high-leverage risk factors (+103%) so underwriting attention lands where it counts, with identity safe through a real collision group.
Clinical trials
Across six therapeutic areas and 6,000 trial records, lift held in a tight 1.48× to 1.60× band, isolating the pivotal criteria consistently instead of spiking on one area and missing another.
Diagnostics
On 569 breast-cancer cases, lift was +91%, surfacing hidden-critical features rather than only the obvious ones.
Geophysics & materials
On the largest graph we ran (2,000 earthquake nodes) and a 339-node materials set, lift held at +86% and +94%, showing the method scales without losing its edge.
Spiderbrain itself
Run on its own 1,525-node codebase, the engine found 7 masters and clean community structure (modularity 0.46), and its shortcut audit returned INDEPENDENT, confirming the metrics are not proxies of one another. The blind-spot index correctly declined to score: with no commit history loaded the variance was too low, so the diagnostic refused to emit noise rather than invent a number. We ship a tool we run on ourselves.
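Declining to score on low variance can be pictured as a guard placed in front of the metric. The sketch below is a hypothetical illustration: the threshold, the `Diagnostic` type and the stand-in `spread` metric are assumptions, and the real blind-spot index is not defined in this report.

```python
from dataclasses import dataclass
from statistics import pvariance
from typing import Callable, Sequence, Union


@dataclass(frozen=True)
class Diagnostic:
    reason: str


def guarded_score(
    values: Sequence[float],
    compute: Callable[[Sequence[float]], float],
    min_variance: float = 1e-6,  # hypothetical threshold
) -> Union[float, Diagnostic]:
    """Return a score only when the input varies enough to be meaningful."""
    if len(values) < 2 or pvariance(values) < min_variance:
        return Diagnostic("variance too low to score")
    return compute(values)


def spread(values: Sequence[float]) -> float:
    """Stand-in metric for the demo; not the blind-spot index."""
    return max(values) - min(values)


print(guarded_score([0.5, 0.5, 0.5], spread))  # a Diagnostic, not a number
print(guarded_score([0.1, 0.9, 0.4], spread))  # a float
```

Returning a distinct type instead of a sentinel number means a caller cannot mistake "declined to score" for a real score.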