User Guide
Exporting datasets
Turn your molecules, docking results, and QM properties into a dataset your ML stack can consume.
Export formats
- HuggingFace Datasets (Parquet) — drop-in for datasets-cli
- PyTorch DataLoader — streaming or in-memory
- TDC benchmark group — TDC-compatible row order & labels
- CSV (SMILES + labels)
- SDF with 3D coordinates
- RDF / Turtle (OntoCompChem semantic export)
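As a rough illustration of consuming the CSV export outside the SDK, the sketch below parses a SMILES-plus-label file with Python's standard library. The column names `smiles` and `label` mirror the batch keys used in the SDK example, but they are assumptions about the exported header, and the inline sample rows are made up for illustration.

```python
import csv
import io

# Hypothetical sample standing in for an exported file.
# The real header and label encoding may differ.
SAMPLE_CSV = """smiles,label
CCO,0
c1ccccc1,1
CC(=O)O,0
"""


def read_smiles_csv(handle):
    """Return (smiles, label) pairs from a CSV with 'smiles' and 'label' columns."""
    reader = csv.DictReader(handle)
    return [(row["smiles"], float(row["label"])) for row in reader]


records = read_smiles_csv(io.StringIO(SAMPLE_CSV))
print(records)
```

To read a file on disk, pass `open(path, newline="")` in place of the `io.StringIO` object.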
Train / valid / test splits
- Scaffold (default) — 70 / 15 / 15, Bemis–Murcko scaffolds
- Random — 80 / 10 / 10, seeded
- Cold target — 60 / 20 / 20, held-out receptors
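Scaffold and cold-target splits share one idea: records that belong to the same group (a scaffold, a receptor) must all land in the same subset. One common way to get this is a greedy, largest-group-first assignment, sketched below. This illustrates the general technique and is not necessarily how `ds.split` is implemented. The function `group_split` and the toy records are hypothetical. In practice a scaffold key could come from RDKit's `MurckoScaffold.MurckoScaffoldSmiles`, and a receptor ID would play the same role for a cold-target split.

```python
from collections import defaultdict


def group_split(items, key, fractions=(0.70, 0.15, 0.15)):
    """Split items into train/valid/test so that no group spans two subsets.

    `key` maps an item to its group label (a scaffold string, a receptor ID, ...).
    Groups are visited largest first; each goes to train if it still fits,
    otherwise to valid if it fits there, otherwise to test.
    """
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)

    n = len(items)
    train_cap = round(fractions[0] * n)
    valid_cap = round(fractions[1] * n)

    train, valid, test = [], [], []
    for members in sorted(groups.values(), key=len, reverse=True):
        if len(train) + len(members) <= train_cap:
            train.extend(members)
        elif len(valid) + len(members) <= valid_cap:
            valid.extend(members)
        else:
            test.extend(members)
    return train, valid, test


# Toy records of the form (molecule_id, scaffold_label); labels are placeholders.
group_sizes = {"A": 8, "B": 6, "C": 3, "D": 2, "E": 1}
toy = [
    (f"mol_{name}{i}", f"scaffold_{name}")
    for name, size in group_sizes.items()
    for i in range(size)
]

train, valid, test = group_split(toy, key=lambda record: record[1])
print(len(train), len(valid), len(test))  # 14 3 3
```

Because whole groups are moved at once, the realized ratios only approximate the targets when groups are large relative to the dataset. Swapping `fractions` to `(0.60, 0.20, 0.20)` and keying on a receptor ID gives the cold-target variant of the same sketch.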
SDK example
```python
from pymolhub import MolHub

mh = MolHub()
ds = mh.datasets.get("ds_01")
train, valid, test = ds.split(method="scaffold")

# As HuggingFace dataset
hf = ds.to_huggingface()

# As PyTorch DataLoader
loader = ds.to_torch(batch_size=32, split="train")
for batch in loader:
    print(batch["smiles"], batch["label"])
    break
```
Citation & DOI
Click "Get DOI from Zenodo" on the dataset detail page to mint a permanent identifier suitable for publication.