Validation & research

Proven on real data. Every figure measured.

Spiderbrain turns a project into a dependency graph and scores every node for importance, blast radius and risk. For those scores to be worth acting on, two things must hold: the ranking has to concentrate what matters, and the engine must never silently corrupt a node's identity or give a different answer on the same input. This report documents both, measured on 14 real datasets: 13 external corpora across six domains, plus Spiderbrain's own codebase.

External corpora validated 2026-05-30 with validate-real-corpora.mjs on decision layer v5.0.7. Self-analysis generated 2026-06-04 with the v5 engine. Consolidated in v5.1.0: 30 gate scripts, 353 assertions green. Nothing here is a target or a projection.

Summary

Across 13 external corpora, signal lift ranged 1.48× to 2.41× (importance concentrated 48% to 141% above a typical item). Identity silent-corruption was zero; 94 real same-hash collision groups were tested with zero silent misattachment; every decision score was finite; and every result was bit-identical under suppression and deterministic under shuffling. One defect was found by this validation and fixed before release. Run on its own codebase, Spiderbrain found 7 masters across 1,525 nodes, with clean community structure and metrics that its shortcut audit rated INDEPENDENT.

14 real-world datasets · 13,607 nodes scored · 0 silent corruptions · 94 collision groups → 0 misattach · bit-for-bit deterministic

Domain	Dataset	Nodes	Signal lift	Collisions → misattach	Deterministic
Software & code	`Saroir`	1,672	2.25× +125%	78 → 0	✓
Software & code	`perform.digital`	219	1.90× +90%	13 → 0	✓
People analytics	`employee-attrition`	1,470	2.41× +141%	none	✓
Insurance	`insurance`	1,338	2.03× +103%	1 → 0	✓
Clinical trials	`oncology`	1,000	1.57× +57%	none	✓
Clinical trials	`cardiology`	1,000	1.59× +59%	1 → 0	✓
Clinical trials	`diabetes`	1,000	1.60× +60%	none	✓
Clinical trials	`mental_health`	1,000	1.56× +56%	none	✓
Clinical trials	`neurology`	1,000	1.54× +54%	1 → 0	✓
Clinical trials	`vaccines`	1,000	1.48× +48%	none	✓
Diagnostics	`breast-cancer`	569	1.91× +91%	none	✓
Geophysics & materials	`earthquakes`	2,000	1.86× +86%	none	✓
Geophysics & materials	`batteries`	339	1.94× +94%	none	✓
Developer tooling (self)	`Spiderbrain`	1,525	1.11× webscore†	self-analysis	✓

† The blind-spot signal lift is not computable on the self-brain (metric variance too low, see Assessment); webscore concentration is shown instead. Identity and collision guarantees apply to the decision-layer mutation tests on the 13 external corpora.

Metrics, names and definitions

Every metric used in this report, in plain language.

webscore

Severity. How much it would hurt if this one node failed on its own.

spikescore

Blast radius. If this breaks, how much breaks along with it.

blastVolume

Total risk in one number, combining severity and blast radius.

Masters

Keystones. The few hubs everything leans on, which also act as firebreaks that stop damage spreading.

Blind-spot index

Important but unexamined. High criticality with little review or documentation.

Drift

Going stale. A node has changed in ways that no longer match its past.

Signal lift (discriminationRatio)

The mean of the top 10% of nodes divided by the population mean. It measures concentration, not count, so it is corpus-size invariant. 2.0× means the important few score twice a typical item.
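The definition above can be sketched in a few lines. This is an illustration rather than the engine's code: the function name `signalLift` and the rule that the top slice holds ceil(n / 10) nodes are assumptions, and the scores are made up.

```typescript
// Signal lift sketch: mean of the top 10% of scores over the population mean.
// Assumption: the top slice holds ceil(n / 10) nodes, so it is never empty.
function signalLift(scores: number[]): number {
  if (scores.length === 0) {
    throw new Error("signalLift needs at least one score");
  }
  const mean = (xs: number[]): number =>
    xs.reduce((sum, x) => sum + x, 0) / xs.length;

  // Sort a copy in descending order so the caller's array is left untouched.
  const ranked = [...scores].sort((a, b) => b - a);
  const topCount = Math.ceil(ranked.length / 10);
  const populationMean = mean(ranked);
  if (populationMean === 0) {
    throw new Error("signalLift is undefined when the population mean is zero");
  }
  return mean(ranked.slice(0, topCount)) / populationMean;
}

// Made-up scores: one dominant node among ten.
const toyScores = [10, 1, 1, 1, 1, 1, 1, 1, 1, 2];
console.log(signalLift(toyScores)); // 5

// Doubling the corpus leaves the ratio unchanged: it measures concentration, not count.
console.log(signalLift([...toyScores, ...toyScores])); // 5
```

A perfectly flat score distribution gives a lift of exactly 1.0×, which is the baseline every figure in this report is read against.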

Reach

How far a node’s influence spreads, scored at node scope and cluster scope (0 to 1).

Necessity

How irreplaceable a node is (0 to 1). 1.0 means nothing else can stand in for it.

Modularity (Leiden)

Quality of the community structure the engine finds (0 to 1). Higher means cleaner, more separable clusters.

Shortcut audit

Checks that no metric is a trivial proxy of another. INDEPENDENT means the scores carry distinct signal, not one number wearing different hats.
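One way to picture such an audit is to measure how much of one metric's variance another explains. The sketch below computes the squared Pearson correlation (R²) between two metric vectors. The function names, the "PROXY" label and the 0.9 cutoff are hypothetical choices made for this illustration; the report does not state the threshold the engine uses.

```typescript
type AuditVerdict = "INDEPENDENT" | "PROXY"; // "PROXY" is a hypothetical label

// Squared Pearson correlation between two equally long metric vectors.
function rSquared(xs: number[], ys: number[]): number {
  if (xs.length !== ys.length || xs.length < 2) {
    throw new Error("need two vectors of equal length (at least 2 values)");
  }
  const n = xs.length;
  const meanX = xs.reduce((sum, x) => sum + x, 0) / n;
  const meanY = ys.reduce((sum, y) => sum + y, 0) / n;
  let sxy = 0;
  let sxx = 0;
  let syy = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - meanX;
    const dy = ys[i] - meanY;
    sxy += dx * dy;
    sxx += dx * dx;
    syy += dy * dy;
  }
  if (sxx === 0 || syy === 0) {
    throw new Error("R² is undefined for a constant metric");
  }
  return (sxy * sxy) / (sxx * syy);
}

// Assumption: a pair counts as a trivial proxy at or above a cutoff of 0.9.
function auditPair(xs: number[], ys: number[], cutoff = 0.9): AuditVerdict {
  return rSquared(xs, ys) >= cutoff ? "PROXY" : "INDEPENDENT";
}

// Made-up metric vectors: the second is exactly twice the first.
console.log(auditPair([1, 2, 3, 4], [2, 4, 6, 8])); // PROXY
// Made-up vectors with no linear relationship at all.
console.log(auditPair([1, 2, 3, 4], [1, -1, -1, 1])); // INDEPENDENT
```

Under this illustrative cutoff, the R² of 0.547 reported for the self-analysis would fall on the INDEPENDENT side.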

The five decision-layer guarantees

#	Guarantee	What it tests	Target
A	Identity safety	Rename, edit, delete and correlated-churn mutations applied to a 60-node sample.	0 silent corruptions
B	Forced collision	For every real same-hash group, judge one member, delete it, and prove the judgment never silently re-attaches to a surviving twin.	0 silent misattach
C	Decision scores	Reach (node and cluster scope) and necessity computed on real data.	0 non-finite
D	Determinism	The matcher digest must be stable when node order is shuffled, run three times.	identical ×3
E	Suppression invariant	Partitioning the ranked output must leave the signal lift unchanged.	0 bits changed
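Guarantee D can be pictured as a small harness: feed the same nodes in several orders and require an identical fingerprint each time. The sketch below is an assumption-laden illustration, not the harness itself. It checks a ranker rather than the matcher, uses three fixed reorderings in place of a shuffle, and compares the ordered id list directly where a real harness would presumably hash it. All names are hypothetical.

```typescript
interface ScoredNode {
  id: string;
  score: number;
}
type Ranker = (nodes: ScoredNode[]) => ScoredNode[];

// Three fixed reorderings stand in for the shuffle, so the sketch uses no randomness.
function reorderings<T>(items: T[]): T[][] {
  return [
    [...items], // original order
    [...items].reverse(), // reversed
    [...items.slice(1), ...items.slice(0, 1)], // rotated by one
  ];
}

// Fingerprint of a ranked output: the ordered list of ids.
const fingerprint = (ranked: ScoredNode[]): string =>
  ranked.map((n) => n.id).join("\n");

// True only if every reordering of the input yields the same fingerprint.
function isOrderIndependent(rank: Ranker, nodes: ScoredNode[]): boolean {
  const prints = reorderings(nodes).map((order) => fingerprint(rank(order)));
  return prints.every((p) => p === prints[0]);
}

const byId = (a: ScoredNode, b: ScoredNode): number =>
  a.id < b.id ? -1 : a.id > b.id ? 1 : 0;

// Ties broken by id: the output never depends on input order.
const stableRank: Ranker = (nodes) =>
  [...nodes].sort((a, b) => b.score - a.score || byId(a, b));

// Ties left to input order (Array.prototype.sort is stable since ES2019).
const naiveRank: Ranker = (nodes) =>
  [...nodes].sort((a, b) => b.score - a.score);

// Made-up nodes with one tie.
const sampleNodes: ScoredNode[] = [
  { id: "a", score: 1 },
  { id: "b", score: 1 },
  { id: "c", score: 2 },
];
console.log(isOrderIndependent(stableRank, sampleNodes)); // true
console.log(isOrderIndependent(naiveRank, sampleNodes)); // false
```

The failing case shows why the check matters: a single unbroken tie is enough to make the output depend on the order in which nodes happened to arrive.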

Validation log

Per-dataset results, exactly as the harness recorded them. Signal lift is the discriminationRatio; rename-auto is the share of renames auto-resolved (the rest correctly defer); reach and necessity are decision-layer scores on the top-ranked node.

Dataset	Type	Nodes	Edges	Signal lift	Rename-auto	Collisions → misattach	Reach (node / cluster)	Necessity
`Saroir`	Real codebase	1,672	2,286	2.2467×	44/60	78 → 0	0.19 / 0.20	0.96
`perform.digital`	Real codebase	219	254	1.9046×	52/60	13 → 0	0.64 / 0.58	0.99
`employee-attrition`	Public dataset	1,470	2,316	2.4143×	60/60	none	0.36 / 0.83	0.96
`insurance`	Public dataset	1,338	1,271	2.0319×	60/60	1 → 0	0.67 / 0.65	0.96
`oncology`	ClinicalTrials.gov	1,000	4,836	1.5673×	60/60	none	0.54 / 0.54	0.96
`cardiology`	ClinicalTrials.gov	1,000	3,788	1.5908×	60/60	1 → 0	0.78 / 0.60	0.96
`diabetes`	ClinicalTrials.gov	1,000	3,232	1.6012×	60/60	none	0.75 / 0.57	0.96
`mental_health`	ClinicalTrials.gov	1,000	4,641	1.5604×	60/60	none	0.57 / 0.52	0.96
`neurology`	ClinicalTrials.gov	1,000	4,099	1.5414×	59/60	1 → 0	0.00 / 0.67	0.96
`vaccines`	ClinicalTrials.gov	1,000	4,138	1.4800×	60/60	none	0.74 / 0.82	0.96
`breast-cancer`	Public dataset	569	978	1.9135×	60/60	none	0.00 / 0.71	0.94
`earthquakes`	Public dataset	2,000	5,028	1.8562×	60/60	none	0.78 / 0.91	0.96
`batteries`	Public dataset	339	1,139	1.9402×	60/60	none	0.81 / 0.93	0.96

Every corpus also passed, uniformly: silent-corruption 0, decision scores finite, determinism identical ×3, and signal lift bit-identical before and after suppression. Suppression removed exactly 2 adjudicated nodes per corpus without changing the ranking of the rest.

14th dataset: Spiderbrain on Spiderbrain

We run the engine on its own source. This is the brain of the shipping app, generated on 2026-06-04.

1,525 nodes scored

7 masters (keystones)

7 clusters

0.4646 Leiden modularity

1.11× webscore lift (top 10% / mean)

1.30× blastVolume lift

INDEPENDENT shortcut audit (R² 0.547)

N/A blind-spot index (low variance)

The 7 masters it found: src/graph/types.ts, package.json, public/brain.json, src-tauri/capabilities/default.json, supabase/referral.sql, bridge/server.mjs, scripts/vendor-engine.mjs. The top-scoring node is src/graph/types.ts (webscore 8.7): the type system is the keystone the rest of the app leans on, which is exactly what you would expect, and a useful sanity check that the score tracks reality.

Rationale

Why across domains. An importance score that only works on code proves little. Code, people, insurance, medicine, geophysics and materials share nothing except structure, so passing all of them shows the method keys on structure, not on the quirks of one domain.

Why these guarantees. A ranking is only useful if the thing it ranks keeps its identity as the project changes, and if the same input always yields the same answer. So we test identity under churn, the worst case of a deleted content twin, finiteness, determinism, and invariance under suppression.

Why dogfood. We run Spiderbrain on Spiderbrain. If the team would not trust it on its own code, neither should you.

Why determinism is the bar. Agents act on these scores. A score that changes between runs cannot be audited or trusted. Same input, same brain, every time.

Assessment

All 13 external corpora pass every guarantee after one fix, and the self-analysis behaved exactly as designed, including where it declined to score.

The defect this validation caught. The first run was not clean, which is the point of testing on real data. Neurology and insurance each showed one silent misattachment. The root cause: the matcher assumed a coincidental content twin has different neighbors, but on dense record graphs two records can share both the content hash and most of their neighbors. When one was deleted and exactly one twin survived, the judgment auto-applied. The fix captures whether the content hash was unique at the moment of decision; a judgment made on a non-unique hash never auto-applies and always defers to a human. Code brains never hit this because their twins differ structurally, so only the dense real record graphs exposed it.
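The shape of that fix can be sketched as follows. Every name here (`Judgment`, `recordJudgment`, `resolveAfterDelete`, the `hashWasUnique` field) is hypothetical and chosen for illustration. The only behavior taken from the report is the rule itself: uniqueness is captured when the judgment is made, and a judgment made on a non-unique hash always defers.

```typescript
type Resolution = "auto-apply" | "defer-to-human";

interface Judgment {
  nodeId: string;
  contentHash: string;
  // Captured at decision time: was this hash held by exactly one node?
  hashWasUnique: boolean;
}

// Record a judgment and capture hash uniqueness at the moment of decision.
function recordJudgment(nodeId: string, hashesById: Map<string, string>): Judgment {
  const contentHash = hashesById.get(nodeId);
  if (contentHash === undefined) {
    throw new Error(`unknown node: ${nodeId}`);
  }
  let holders = 0;
  hashesById.forEach((hash) => {
    if (hash === contentHash) holders++;
  });
  return { nodeId, contentHash, hashWasUnique: holders === 1 };
}

// Decide what happens once the judged node is gone and other nodes carry its hash.
function resolveAfterDelete(judgment: Judgment, survivorsWithHash: string[]): Resolution {
  // A judgment made on a non-unique hash never auto-applies, even when exactly
  // one twin survives (the case that produced the silent misattachment).
  if (!judgment.hashWasUnique) return "defer-to-human";
  return survivorsWithHash.length === 1 ? "auto-apply" : "defer-to-human";
}

// Made-up records: r1 and r2 are content twins, r3 is unique.
const hashes = new Map<string, string>([
  ["r1", "h1"],
  ["r2", "h1"],
  ["r3", "h2"],
]);

const onTwin = recordJudgment("r1", hashes);
console.log(resolveAfterDelete(onTwin, ["r2"])); // defer-to-human

const onUnique = recordJudgment("r3", hashes);
console.log(resolveAfterDelete(onUnique, ["r3-renamed"])); // auto-apply
```

Capturing uniqueness at decision time matters because the graph looks unambiguous again after the deletion: one hash, one survivor. Only the recorded flag remembers that the judgment was made while a twin existed.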

Honest notes. Across the six clinical areas, signal lift spans 1.48× (vaccines) to 1.60× (diabetes); we show that exact range rather than rounding it to a tidy 1.5× to 1.6×. The self-brain's blind-spot index could not be computed: its variance was too low because the brain was built without commit history and the codebase is structurally uniform, so the engine returned a diagnostic instead of a fabricated number. Rename-auto rates below 60/60 (44/60 and 52/60 on the two code brains, 59/60 on neurology) are collision-group members correctly deferring, not failures: you cannot auto-resolve a rename while an identical twin still exists.

Scope and limits. Mutations are synthetic but applied systematically (60-sample stride, no clock, no randomness). Each corpus was run once, which is sufficient because every result is deterministic and bit-identical under shuffle. Public datasets are modeled as dependency graphs, not used in their original tabular form.
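A systematic sample of this kind needs neither a clock nor a random source. The sketch below picks a fixed number of items at an even stride. The report does not spell out the exact selection rule, so the function name `strideSample` and the index formula are assumptions.

```typescript
// Pick `sampleSize` items at an even stride: same input, same sample, every time.
function strideSample<T>(items: T[], sampleSize: number): T[] {
  if (sampleSize <= 0 || items.length === 0) return [];
  // Small inputs are returned whole (as a copy).
  if (items.length <= sampleSize) return [...items];

  // Assumption: index k of the sample maps to floor(k * n / sampleSize).
  const stride = items.length / sampleSize;
  const picked: T[] = [];
  for (let k = 0; k < sampleSize; k++) {
    picked.push(items[Math.floor(k * stride)]);
  }
  return picked;
}

// Ten made-up node indices, sampled down to five.
console.log(strideSample([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], 5)); // [ 0, 2, 4, 6, 8 ]

// A corpus the size of Saroir (1,672 nodes) yields 60 distinct positions.
const corpus = Array.from({ length: 1672 }, (_, i) => i);
console.log(strideSample(corpus, 60).length); // 60
```

Because the stride exceeds one whenever the corpus is larger than the sample, no position is picked twice and the sample spans the whole corpus rather than clustering at the front.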

Impact

Most relevance and importance scores are heuristic and drift between runs. Spiderbrain's scoring does not. It is deterministic, it is identity-safe under churn, and it generalizes from source code to clinical trials to earthquakes on the same engine. That is the property an AI agent needs from its memory: a structural map it can act on, replay and audit, that says the same thing today and next week. This is the foundation under context engineering, the difference between a model that re-reads everything and one that knows what matters.

What this means, by domain

Software & code

On three real codebases (Saroir, perform.digital and Spiderbrain itself) the engine concentrated importance on the handful of files that carry the system. On the two external codebases it held identity safe through 91 real content-hash collision groups (78 in Saroir, 13 in perform.digital) with zero silent misattachment. For agents and engineers, that means attention lands where failure actually hurts, and the map does not drift.

People analytics

On 1,470 employee records, signal lift was the highest we measured (+141%), pulling the few factors that truly drive attrition out of hundreds of fields.

Insurance

On 1,338 policy records, the engine ranked the high-leverage risk factors (+103%) so underwriting attention lands where it counts, with identity safe through a real collision group.

Clinical trials

Across six therapeutic areas and 6,000 trial records, lift held in a tight 1.48× to 1.60× band, isolating the pivotal criteria consistently instead of spiking on one area and missing another.

Diagnostics

On 569 breast-cancer cases, lift was +91%, surfacing hidden-critical features rather than only the obvious ones.

Geophysics & materials

On the largest graph we ran (2,000 earthquake nodes) and a 339-node materials set, lift held at +86% and +94%, showing the method scales without losing its edge.

Spiderbrain itself

Run on its own 1,525-node codebase, the engine found 7 masters and clean community structure (modularity 0.46), and its shortcut audit returned INDEPENDENT, confirming the metrics are not proxies of one another. The blind-spot index correctly declined to score: with no commit history loaded the variance was too low, so the diagnostic refused to emit noise rather than invent a number. We ship a tool we run on ourselves.
