Benchmark · June 8, 2026 · 7 min read

We benchmarked our own tool on EGFR — and we're publishing the mediocre number

An honest dogfooding study: a browser docking pipeline on DUD-E EGFR. ROC-AUC 0.65, why receptor relaxation didn't help, and what we're doing about it.

Every AI drug-discovery tool claims to work. Almost none show you a number. So before we ask anyone to trust MolHub, we ran it against itself on a target with decades of public data — EGFR — and we're publishing what we found, including the parts that aren't flattering.

1. The generator embarrassed us first

We asked the Copilot to design novel EGFR inhibitor leads. It returned a confident report — but the molecules were a red flag: eight near-identical 2,5-dimethylpyrrole scaffolds, several joined by reactive vinyl bridges (think dyes / Michael acceptors). The PAINS and Brenk filters didn't catch them; a medicinal chemist would have on sight. We only saw it because we were using the tool, not demoing it.

The fix shipped the same day: a reactive-motif reject filter (for the liabilities the catalogs miss) plus a Bemis–Murcko scaffold-diversity cap. The same EGFR run now returns clean, diverse piperazine amides, with synthetic-accessibility scores of 1.4–2.5 and zero alerts. Lesson: a transparent baseline is valuable precisely because you can watch it fail.
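The diversity-cap half of that fix is simple to illustrate. The sketch below is not the shipped filter: it assumes each candidate already carries a Bemis–Murcko scaffold string computed upstream, and the cap of two per scaffold, the function name, and the molecule IDs are all hypothetical.

```python
from collections import defaultdict


def cap_by_scaffold(candidates, max_per_scaffold=2):
    """Keep at most `max_per_scaffold` candidates per scaffold, in input order.

    `candidates` is an iterable of (molecule_id, scaffold) pairs, assumed to be
    sorted best-first. Scaffold strings are assumed to be computed upstream.
    """
    kept = []
    seen = defaultdict(int)  # scaffold -> how many candidates kept so far
    for molecule_id, scaffold in candidates:
        if seen[scaffold] < max_per_scaffold:
            kept.append(molecule_id)
            seen[scaffold] += 1
    return kept


# Hypothetical generator output dominated by one scaffold.
candidates = [
    ("mol_1", "pyrrole"),
    ("mol_2", "pyrrole"),
    ("mol_3", "pyrrole"),
    ("mol_4", "piperazine_amide"),
    ("mol_5", "pyrrole"),
    ("mol_6", "benzamide"),
]
diverse = cap_by_scaffold(candidates, max_per_scaffold=2)
print(diverse)  # ['mol_1', 'mol_2', 'mol_4', 'mol_6']
```

A cap like this cannot catch reactive motifs on its own; it only stops one scaffold from filling the report, which is why it is paired with a separate reject filter.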

2. The docking benchmark: ROC-AUC 0.65

We benchmarked the product's actual docking protocol on a labelled DUD-E EGFR set (40 known actives plus 160 property-matched decoys) and measured how well it ranks actives above decoys. The setup was a whole-protein blind box with AutoDock Vina at exhaustiveness=8:

ROC-AUC ≈ 0.65. Better than random (0.50), well short of a strong 0.80+. The single best-scoring hits were enriched, but property-matched decoys crowded the rest of the top. That is the well-known limitation of blind docking into a raw AlphaFold model — and our own data confirms it instead of hiding it.
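For readers who want to check a number like this on their own scores, ROC-AUC is the probability that a randomly chosen active ranks above a randomly chosen decoy. The following is a minimal sketch, not the benchmark harness: it assumes Vina-style scores where lower is better, and the toy scores and labels are made up.

```python
def roc_auc(scores, labels):
    """Probability that a random active outranks a random decoy (ties count half).

    `scores`: docking scores where LOWER is better (assumed Vina-style).
    `labels`: 1 for an active, 0 for a decoy.
    """
    actives = [s for s, y in zip(scores, labels) if y == 1]
    decoys = [s for s, y in zip(scores, labels) if y == 0]
    if not actives or not decoys:
        raise ValueError("need at least one active and one decoy")
    wins = 0.0
    for a in actives:
        for d in decoys:
            if a < d:  # the active scored better than the decoy
                wins += 1.0
            elif a == d:
                wins += 0.5
    return wins / (len(actives) * len(decoys))


# Toy data: two actives, four decoys.
scores = [-9.1, -8.4, -7.9, -7.5, -7.0, -6.2]
labels = [1, 0, 1, 0, 0, 0]
print(roc_auc(scores, labels))  # 0.875
```

The pairwise loop is quadratic, which is fine at this scale: 40 actives against 160 decoys is only 6,400 comparisons.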

3. Pocket-targeting helped where it matters; relaxation didn't

Narrowing the box to the ATP site roughly doubled enrichment in the top 5–10% (the slice you actually screen), but overall AUC stayed flat. That told us the box wasn't the bottleneck; the raw pocket's side-chain packing was. So we tried the obvious next fix: energy-minimizing the receptor (PDBFixer + OpenMM, backbone restrained). We re-benchmarked on the same 200 molecules and pocket box:

| Protocol | ROC-AUC | EF 5% | EF 10% | Δ act−dec |
|---|---|---|---|---|
| Raw AlphaFold · blind box (default) | 0.648 | 1.0 | 1.25 | 0.48 |
| Raw AlphaFold · pocket box | 0.645 | 2.0 | 2.25 | 0.61 |
| Vacuum-minimized · pocket box | 0.583 | 2.0 | 2.0 | −0.02 |
| GBSA-minimized · pocket box | 0.635 | 1.5 | 1.75 | 0.78 |
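The EF columns are easier to read as raw counts. The post does not spell out its formula, so this sketch assumes the usual definition of enrichment factor: the active rate in the top slice divided by the active rate in the whole set. Under that assumption, with 40 actives in 200 molecules, an EF 5% of 2.0 corresponds to 4 actives in the top 10, and an EF 10% of 2.25 to 9 actives in the top 20. The ranking below is synthetic, built only to reproduce those counts.

```python
def enrichment_factor(scores, labels, fraction):
    """EF at `fraction`: active rate in the top slice over the overall active rate.

    Assumes LOWER scores are better; labels are 1 (active) or 0 (decoy).
    """
    n_total = len(scores)
    n_actives = sum(labels)
    if n_actives == 0:
        raise ValueError("need at least one active")
    n_top = max(1, round(n_total * fraction))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0])
    hits = sum(label for _, label in ranked[:n_top])
    # Same as (hits / n_top) / (n_actives / n_total), with a single division.
    return hits * n_total / (n_top * n_actives)


# Synthetic ranking, best first: 4 actives in the top 10, 9 in the top 20,
# and 40 actives among 200 molecules overall.
labels = [1] * 4 + [0] * 6 + [1] * 5 + [0] * 5 + [1] * 31 + [0] * 149
scores = [-10.0 + 0.01 * i for i in range(len(labels))]

print(enrichment_factor(scores, labels, 0.05))  # 2.0
print(enrichment_factor(scores, labels, 0.10))  # 2.25
```

Read this way, the gap between the blind-box and pocket-box rows is a handful of actives in the top 10 and top 20, which is why small benchmarks like this one deserve cautious interpretation.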

Vacuum minimization made it worse (0.58 — the apo pocket collapses inward). Implicit solvent recovered to ~0.635 with the largest mean active-vs-decoy gap, but it never beat the raw structure on the ranking metrics — and cost ~28 minutes of compute per receptor. Single-structure relaxation does not improve enrichment here, so we're not shipping it. The data points where practitioners said it would: the next real lever is ensemble docking over multiple conformers, not one minimized snapshot.
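Ensemble docking still has to reduce several scores per ligand to one. A common choice, assumed here rather than taken from the pipeline, is to keep each ligand's best score across receptor conformers. The ligand names, conformer names, and scores below are hypothetical.

```python
def ensemble_best_scores(per_conformer_scores):
    """Collapse {ligand: {conformer: score}} to {ligand: best score}.

    Assumes LOWER scores are better, so "best" means the minimum.
    """
    return {
        ligand: min(scores.values())
        for ligand, scores in per_conformer_scores.items()
    }


# Hypothetical docking scores against three receptor conformers.
per_conformer = {
    "lig_A": {"conf_1": -7.2, "conf_2": -8.9, "conf_3": -6.5},
    "lig_B": {"conf_1": -7.8, "conf_2": -7.1, "conf_3": -7.4},
}
best = ensemble_best_scores(per_conformer)
ranking = sorted(best, key=best.get)  # best ligand first
print(best)     # {'lig_A': -8.9, 'lig_B': -7.8}
print(ranking)  # ['lig_A', 'lig_B']
```

In this toy case, docking against conf_1 alone would rank lig_B first; the ensemble flips the order because lig_A fits conf_2 much better. Taking the minimum also gives decoys more chances to score well, so whether it helps enrichment is an empirical question that needs the same benchmark.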

4. What we actually believe about docking

Docking — especially blind, into an unrefined AlphaFold model — is a triage signal, not binding truth. The honest use is enrichment: rank a library so your wet-lab effort goes to the most promising candidates first. We'd rather show you the 0.65 and the work to improve it than hand you a press release.

Why we publish this

The whole benchmark is reproducible (harness and raw results are in our repo). We think the scientific audience is tired of tools that claim accuracy and show none — so the number, and the limitations, are the product pitch. If a result looks too clean, distrust it. If a tool won't show its ROC-AUC, ask why.

Want to run your own target? It's free for academics — and if the docking or generation falls short for you, that's exactly the feedback we want.
