
Exporting datasets

Turn your molecules, docking results, and QM properties into a dataset your ML stack can consume.

Export formats

  • HuggingFace Datasets (Parquet) — drop-in for datasets-cli
  • PyTorch DataLoader — streaming or in-memory
  • TDC benchmark group — TDC-compatible row order & labels
  • CSV (SMILES + labels)
  • SDF with 3D coordinates
  • RDF / Turtle (OntoCompChem semantic export)
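As a quick check on the simplest format in the list above, the sketch below reads a CSV (SMILES + labels) export with the Python standard library only. The header names `smiles` and `label`, the numeric label type, and the helper `read_smiles_csv` are assumptions for illustration; check the header of your own export before relying on them.

```python
import csv
import io

# Stand-in for an exported file. The column names are an assumption;
# replace this with open("your_export.csv", newline="") for a real file.
exported = io.StringIO(
    "smiles,label\n"
    "CCO,0\n"
    "c1ccccc1,1\n"
    "CC(=O)O,0\n"
)


def read_smiles_csv(handle, smiles_col="smiles", label_col="label"):
    """Return (smiles, label) pairs, skipping rows with an empty SMILES."""
    rows = []
    for row in csv.DictReader(handle):
        smiles = (row.get(smiles_col) or "").strip()
        if not smiles:
            continue  # ignore blank lines or incomplete records
        rows.append((smiles, float(row[label_col])))
    return rows


pairs = read_smiles_csv(exported)
for smiles, label in pairs:
    print(smiles, label)
```

If your labels are categorical strings rather than numbers, drop the `float(...)` conversion.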

Train / valid / test splits

  • Scaffold (default) — 70 / 15 / 15, Bemis–Murcko scaffolds
  • Random — 80 / 10 / 10, seeded
  • Cold target — 60 / 20 / 20, held-out receptors
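The scaffold and cold-target strategies above share one idea: whole groups (scaffolds or receptors) go into a single split, so nothing in the test set shares a group with the training set. The sketch below illustrates that idea with a common greedy heuristic (largest groups first). It is not necessarily how MolHub implements its splits. The function `grouped_split`, the record layout, and the precomputed `scaffold` field are hypothetical; in practice the scaffold key would come from a cheminformatics toolkit.

```python
from collections import defaultdict


def grouped_split(records, group_of, frac_train=0.70, frac_valid=0.15):
    """Split records so that every group lands in exactly one split.

    Groups are visited largest first and added to train until the train
    quota would be exceeded, then to valid, then to test.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[group_of(rec)].append(rec)

    # Largest groups first; ties broken by key so the result is deterministic.
    ordered = sorted(groups.items(), key=lambda kv: (-len(kv[1]), str(kv[0])))

    n = len(records)
    train_cutoff = frac_train * n
    valid_cutoff = (frac_train + frac_valid) * n

    train, valid, test = [], [], []
    for _, members in ordered:
        if len(train) + len(members) <= train_cutoff:
            train.extend(members)
        elif len(train) + len(valid) + len(members) <= valid_cutoff:
            valid.extend(members)
        else:
            test.extend(members)
    return train, valid, test


# Toy data: 20 records across 5 precomputed scaffold keys (illustrative only).
sizes = {"c1ccccc1": 8, "c1ccncc1": 5, "C1CCCCC1": 3, "c1ccoc1": 2, "c1ccsc1": 2}
records = [
    {"id": f"{scaffold}-{i}", "scaffold": scaffold}
    for scaffold, count in sizes.items()
    for i in range(count)
]

train, valid, test = grouped_split(records, group_of=lambda r: r["scaffold"])
print(len(train), len(valid), len(test))
```

Because groups are indivisible, the realized proportions only approximate the requested 70 / 15 / 15. For a cold-target split, the same function could be called with a `group_of` that returns the receptor identifier and with `frac_train=0.60, frac_valid=0.20`.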

SDK example

python
from pymolhub import MolHub
mh = MolHub()

# Fetch a dataset by ID, then split it by Bemis–Murcko scaffold (the default method)
ds = mh.datasets.get("ds_01")
train, valid, test = ds.split(method="scaffold")

# As HuggingFace dataset
hf = ds.to_huggingface()

# As PyTorch DataLoader
loader = ds.to_torch(batch_size=32, split="train")
for batch in loader:
    print(batch["smiles"], batch["label"])
    break

Citation & DOI

On the dataset detail page, click "Get DOI from Zenodo" to mint a permanent identifier that you can cite in publications.