PLATFORM ACTIVE

DOXLAS

A machine learning pipeline that predicts how strongly drug-like molecules bind to disease targets — turning raw chemical data into actionable drug discovery intelligence.

5,124 · Compounds Analyzed
0.712 · Best Prediction R²
8 · Models Trained
2,056 · Molecular Features
What DOXLAS Does
The Problem

Drug discovery is slow and expensive — testing every possible molecule against a disease target in a lab costs millions and takes years. Most candidate drugs fail. Researchers need a way to predict which molecules are worth testing before synthesizing them.

The Solution

DOXLAS uses machine learning to predict how potently a molecule will bind to a target protein. Given a molecule's chemical structure, it outputs a potency score — letting researchers prioritize the most promising candidates. Its best model currently explains about 71% of the variance in measured potency (R² = 0.712) on its easiest target.

Drug Targets Under Analysis
EGFR
CHEMBL220 · Kinase
Easier Target

Epidermal Growth Factor Receptor — a key protein in cell growth. When it mutates, cells grow uncontrollably, causing cancers (especially lung cancer). Drugs like gefitinib and erlotinib target EGFR. Its binding site is well-characterized, making it more predictable for ML.

Compounds: 2,199 · Best R²: 0.712 · Retention: 44% (5,000 raw → 2,199 clean)
DRD2
CHEMBL217 · GPCR
Harder Target

Dopamine D2 Receptor — controls mood, motivation, and movement. Implicated in schizophrenia, Parkinson's, and addiction. Antipsychotics like haloperidol block DRD2. Its flexible binding site makes structure-activity relationships more complex and harder to model.

Compounds: 2,925 · Best R²: 0.537 · Retention: 59% (5,000 raw → 2,925 clean)
Model Performance — Higher is Better
EGFR — CANCER TARGET
RF    0.712
XGB   0.709
MLP   0.694
MLP+  0.681

DRD2 — NEUROLOGICAL TARGET
RF    0.537
MLP+  0.492
XGB   0.491
MLP   0.468
Key Takeaways
Random Forest Wins

At ~2-3k compounds, Random Forest consistently outperforms neural networks. Tree models handle sparse binary fingerprints very well without heavy tuning.

Target Difficulty Matters

EGFR (R²=0.71) is significantly easier than DRD2 (R²=0.54). Kinase binding sites are more rigid and predictable, while GPCR sites are flexible and complex.

Neural Nets Need More Data

The MLP shows promise (val R² = 0.762) but its holdout performance lags its validation score — a sign of overfitting. With 10k+ compounds and scaffold splitting, deep learning should close the gap.

Model Comparison
Full Leaderboard
8 MODELS
# | Model | Target | What It Is | RMSE | Train | Test | Notes
How Each Model Works
RF · Random Forest

Builds 500 decision trees, each trained on a random subset of the data. Final prediction = average of all trees. Like asking 500 experts and taking their consensus — robust and hard to fool.

500 trees · min_samples_leaf=3 · 5-fold CV
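A minimal sketch of this setup using scikit-learn, with synthetic binary fingerprints standing in for the real data (the array shapes and labels here are illustrative, not DOXLAS's actual dataset):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: 300 binary 2048-bit fingerprints, pChEMBL-like labels
rng = np.random.default_rng(42)
X = rng.integers(0, 2, size=(300, 2048)).astype(np.float32)
y = X[:, :16].sum(axis=1) + rng.normal(0.0, 0.5, size=300)  # learnable signal + noise

# Hyperparameters matching the config line above: 500 trees, min_samples_leaf=3
rf = RandomForestRegressor(n_estimators=500, min_samples_leaf=3,
                           n_jobs=-1, random_state=42)

scores = cross_val_score(rf, X, y, cv=5, scoring="r2")  # 5-fold CV
rf.fit(X, y)
preds = rf.predict(X[:5])
```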
XGB · XGBoost

Builds trees sequentially — each new tree fixes the mistakes of the previous ones. Like a student who reviews wrong answers and improves. Stops early when performance plateaus.

depth=6 · lr=0.1 · subsample=0.8 · early stopping
MLP · Neural Network

A deep learning model with 3 hidden layers (512→256→128 neurons). Learns complex nonlinear patterns. Uses dropout to prevent memorizing training data and batch normalization for stability.

512→256→128→1 · dropout=0.3 · Adam · patience=15
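The architecture above can be sketched in PyTorch roughly as follows (class and variable names are hypothetical; the layer sizes, dropout, and batch norm follow the config line):

```python
import torch
import torch.nn as nn

class PotencyMLP(nn.Module):
    """2048-dim fingerprint in -> 512 -> 256 -> 128 -> 1 potency score out."""
    def __init__(self, in_dim=2048, p_drop=0.3):
        super().__init__()
        dims = [in_dim, 512, 256, 128]
        layers = []
        for d_in, d_out in zip(dims, dims[1:]):
            layers += [
                nn.Linear(d_in, d_out),
                nn.BatchNorm1d(d_out),  # stabilizes training
                nn.ReLU(),
                nn.Dropout(p_drop),     # discourages memorizing the training set
            ]
        layers.append(nn.Linear(128, 1))  # regression head
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x).squeeze(-1)

model = PotencyMLP()
out = model(torch.randn(4, 2048))  # batch of 4 fingerprints -> 4 scores
```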
MLP+ · Tuned Neural Net

Same architecture but trained on fingerprints plus 8 molecular descriptors. Uses lower learning rate and less dropout for more careful optimization. Slightly better on DRD2.

2056 features · dropout=0.2 · lr=5e-4 · patience=20
MLP Training Progress — EGFR
Neural Network Learning Over 81 Epochs
EARLY STOPPED

Each "epoch" = one full pass through training data. Loss (error) decreases as the model learns. Training auto-stopped when performance plateaued.

Epoch | Train Loss | Val R² | Status
1     | 40.840     | -22.30 | Started
10    | 2.189      | 0.419  | Rapid gain
20    | 1.219      | 0.667  | Good progress
30    | 1.063      | 0.734  | Strong
50    | 0.821      | 0.745  | Near peak
60    | 0.820      | 0.762  | Best val
81    | 0.747      | 0.760  | Auto-stopped
Compound Library

Each compound is a SMILES string — text notation encoding molecular structure. pChEMBL measures binding potency (log scale): ≥7 is very potent, 5-7 is moderate, <5 is weak.
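The pChEMBL scale is just the negative base-10 log of the measured potency in molar units, so nanomolar measurements convert with one line (a stdlib sketch; the function name is illustrative):

```python
import math

def pchembl(ic50_nm: float) -> float:
    """pChEMBL = -log10(potency in molar); input IC50 is in nanomolar."""
    return -math.log10(ic50_nm * 1e-9)

print(pchembl(1.0))      # 1 nM   -> ≈ 9.0 (very potent)
print(pchembl(100.0))    # 100 nM -> ≈ 7.0 (the "very potent" threshold)
print(pchembl(50_000))   # 50 µM  -> ≈ 4.3 (weak)
```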

ID | SMILES | MW | LogP | HBA | pChEMBL | Potency
Molecular Feature Engineering

Each molecule is converted into a numeric vector for ML. DOXLAS uses two types: a molecular "fingerprint" capturing structural patterns, plus 8 physical/chemical properties.

Morgan Fingerprints (ECFP4)

Scans each atom and records what's nearby (within 2 bonds). Each unique pattern gets hashed into a 2048-bit binary vector. Present pattern = 1, absent = 0. A "digital fingerprint" of the molecule.

Bits
2,048
Radius
2
ECFP4 standard
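The hashing idea can be illustrated with a toy stdlib sketch — this is a conceptual stand-in, not RDKit's actual ECFP4 algorithm, which enumerates real atom neighborhoods up to radius 2:

```python
import zlib

def to_bitvector(patterns, n_bits=2048):
    """Toy fingerprint: hash each substructure pattern to a bit position.
    Present pattern -> bit set to 1; everything else stays 0."""
    bits = [0] * n_bits
    for p in patterns:
        idx = zlib.crc32(p.encode()) % n_bits  # deterministic hash -> bit index
        bits[idx] = 1
    return bits

# Hypothetical patterns a radius-2 scan might find in one molecule
fp = to_bitvector(["c1ccccc1", "C(=O)N", "c:c-N"])
```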
Physicochemical Descriptors

Eight properties computed from the structure using RDKit. Describe the molecule's size, shape, and drug-likeness. Complement the fingerprint with global properties.

Descriptors
8
Total Features
2,056
2048 FP + 8 desc
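Assembling the final feature vector is a simple concatenation — a NumPy sketch with placeholder values (the descriptor numbers below are just the EGFR means from the table, used as dummies):

```python
import numpy as np

fingerprint = np.random.default_rng(7).integers(0, 2, size=2048)  # 2048-bit ECFP4
descriptors = np.array([354.4, 3.16, 0.90, 3.89, 51.08, 4.2, 2.1, 24.8])  # MW, LogP, HBD, HBA, TPSA, RotBonds, AroRings, HeavyAtoms

# 2048 fingerprint bits + 8 descriptors = 2,056-dim model input
features = np.concatenate([fingerprint.astype(np.float64), descriptors])
```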
Descriptor Statistics — EGFR
What Each Descriptor Means
Descriptor | Measures | Why It Matters | EGFR Mean
MW | How heavy the molecule is | Larger = more binding surface but harder to absorb | 354.4 ± 138.6
LogP | Oil vs water preference | Controls absorption — too high and it won't dissolve in blood | 3.16 ± 2.49
HBD | H-bond donors | Key for protein binding; too many hurt absorption | 0.90 ± 0.89
HBA | H-bond acceptors | More acceptors can mean stronger binding | 3.89 ± 1.76
TPSA | Polar surface area | Predicts ability to cross cell membranes | 51.08 ± 25.89
RotBonds | Flexible bonds | More flexibility = harder to bind tightly | 4.2 ± 2.8
AroRings | Aromatic rings | Common in drugs — stack with protein residues | 2.1 ± 1.2
HeavyAtoms | Non-hydrogen atoms | Proxy for molecular size/complexity | 24.8 ± 8.9
Data Processing Pipeline

Four stages, each a dedicated Python module. One command processes a new drug target end-to-end.

01
Ingest
chembl_fetch.py
Pulls bioactivity data from ChEMBL — the world's largest open drug-target database. 1,000 records/request with rate limiting.
Fetches IC50, Ki, EC50, Kd
02
Standardize
smiles_clean.py
Validates every molecule with RDKit. Drops errors, missing data, ambiguous measurements. Removes duplicates.
Retention: 44-59% of raw
03
Featurize
fingerprints.py
Converts molecules into numeric vectors: 2048-bit fingerprint (molecular barcode) + 8 physicochemical descriptors.
Output: 2,056-dim vector
04
Model
rf / xgb / mlp
Trains multiple ML models on 80% of data, tests on held-out 20%. Saves metrics and trained model files.
Best: R²=0.712
One-Command Execution
python -m src.pipeline --target CHEMBL220 --max-records 5000

Downloads data, cleans it, computes features, and trains the RF baseline. ~75 seconds on an M3 MacBook Pro.
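A minimal sketch of how such a CLI entry point might be wired with argparse — the flags mirror the command shown above, but the stage wiring is a hypothetical skeleton, not the actual `src/pipeline.py`:

```python
import argparse

def build_parser():
    # Flags match the one-command invocation shown above
    p = argparse.ArgumentParser(prog="src.pipeline")
    p.add_argument("--target", required=True,
                   help="ChEMBL target ID, e.g. CHEMBL220")
    p.add_argument("--max-records", type=int, default=5000,
                   help="cap on raw records fetched from ChEMBL")
    return p

# Parse the documented example command's arguments
args = build_parser().parse_args(["--target", "CHEMBL220", "--max-records", "5000"])
# A real pipeline would now run: ingest -> standardize -> featurize -> model
```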

Project Architecture
doxlas/
├── src/
│   ├── ingest/       chembl_fetch.py
│   ├── standardize/  smiles_clean.py
│   ├── featurize/    fingerprints.py
│   ├── model/        rf_baseline.py
│   │                 mlp_model.py
│   │                 mlp_tuned.py
│   │                 xgb_model.py
│   ├── report/       summary.py
│   └── pipeline.py
├── data/
│   ├── raw/          chembl_*.csv
│   └── processed/    clean_*  fp_*  meta_*
└── models/           *.pkl  *.pt  *.json
Tech Stack
Python 3.11 · RDKit · PyTorch 2.10 · scikit-learn · XGBoost · pandas · NumPy · ChEMBL API · M3 ARM64
Design Principles
Modular: Each stage is an independent package
Reproducible: Pinned deps, fixed random seeds
CPU-complete: Full CPU support, GPU optional
Version controlled: Clean git history, 11 commits
Development Timeline

Built from scratch over two sessions. Every commit = a working pipeline state.

Git Commit History
11 COMMITS
What's Next
Ensemble Models

Combine RF + XGBoost predictions for better accuracy than either alone.
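The simplest version of this is an unweighted average of the two models' predictions — a NumPy sketch with made-up prediction values:

```python
import numpy as np

# Hypothetical held-out predictions from the two trained models
rf_preds  = np.array([6.8, 7.2, 5.9])
xgb_preds = np.array([6.6, 7.4, 6.1])

# Averaging tends to cancel the models' uncorrelated errors
ensemble = 0.5 * rf_preds + 0.5 * xgb_preds
```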

Scaffold Splits

Split by molecular scaffolds instead of randomly — more realistic test of generalization.
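The core idea is a group-aware split: every compound sharing a scaffold lands on the same side, so the test set contains scaffolds the model never saw. A stdlib sketch with placeholder scaffold IDs (in practice these would come from e.g. Bemis-Murcko scaffolds via RDKit):

```python
from collections import defaultdict

def scaffold_split(scaffold_ids, test_frac=0.2):
    """Assign whole scaffold groups to the test set until ~test_frac is reached."""
    groups = defaultdict(list)
    for i, s in enumerate(scaffold_ids):
        groups[s].append(i)
    test, n_target = [], int(len(scaffold_ids) * test_frac)
    # Fill the test set from the rarest scaffolds first
    for s in sorted(groups, key=lambda s: len(groups[s])):
        if len(test) >= n_target:
            break
        test.extend(groups[s])
    train = [i for i in range(len(scaffold_ids)) if i not in set(test)]
    return train, test

# Letters stand in for scaffold IDs of 10 hypothetical compounds
train, test = scaffold_split(["A", "A", "B", "C", "C", "C", "D", "D", "E", "E"])
```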

Generative Chemistry

Use AI to propose entirely new molecules optimized for a target.

DOXLAS v0.2.0 · CPU-complete
Built by Mathew Stamper · Stamper Lab LLC · 2026