PLATFORM ACTIVE

DOXLAS

A machine learning pipeline that predicts how strongly drug-like molecules bind to disease targets — turning raw chemical data into actionable drug discovery intelligence.

5,124 · Compounds Analyzed
0.712 · Best Prediction R²
8 · Models Trained
2,056 · Molecular Features
What DOXLAS Does
The Problem

Drug discovery is slow and expensive — testing every possible molecule against a disease target in a lab costs millions and takes years. Most candidate drugs fail. Researchers need a way to predict which molecules are worth testing before synthesizing them.

The Solution

DOXLAS uses machine learning to predict how potently a molecule will bind to a target protein. Given a molecule's chemical structure, it outputs a potency score — letting researchers prioritize the most promising candidates. Its best model currently explains about 71% of the variance in measured potency (R² = 0.712) on its easiest target.

Drug Targets Under Analysis
EGFR
CHEMBL220 · Kinase
Easier Target

Epidermal Growth Factor Receptor — a key protein in cell growth. When it mutates, cells grow uncontrollably, causing cancers (especially lung cancer). Drugs like gefitinib and erlotinib target EGFR. Its binding site is well-characterized, making it more predictable for ML.

Compounds: 2,199 · Best R²: 0.712 · Retention: 44% (5,000 raw → 2,199 clean)
DRD2
CHEMBL217 · GPCR
Harder Target

Dopamine D2 Receptor — controls mood, motivation, and movement. Implicated in schizophrenia, Parkinson's, and addiction. Antipsychotics like haloperidol block DRD2. Its flexible binding site makes structure-activity relationships more complex and harder to model.

Compounds: 2,925 · Best R²: 0.537 · Retention: 59% (5,000 raw → 2,925 clean)
Model Performance — Higher is Better
EGFR — CANCER TARGET
RF    0.712
XGB   0.709
MLP   0.694
MLP+  0.681

DRD2 — NEUROLOGICAL TARGET
RF    0.537
MLP+  0.492
XGB   0.491
MLP   0.468
Key Takeaways
Random Forest Wins

At ~2-3k compounds, Random Forest consistently outperforms neural networks. Tree models handle sparse binary fingerprints very well without heavy tuning.

Target Difficulty Matters

EGFR (R²=0.71) is significantly easier than DRD2 (R²=0.54). Kinase binding sites are more rigid and predictable, while GPCR sites are flexible and complex.

Neural Nets Need More Data

The MLP shows promise (val R² = 0.762) but its holdout performance lags its validation score — a sign of overfitting. With 10k+ compounds and scaffold splitting, deep learning should close the gap.

Model Comparison
Full Leaderboard
8 MODELS
# | Model | Target | What It Is | RMSE | Train | Test | Notes
How Each Model Works
RF · Random Forest

Builds 500 decision trees, each trained on a random subset of the data. Final prediction = average of all trees. Like asking 500 experts and taking their consensus — robust and hard to fool.

500 trees · min_samples_leaf=3 · 5-fold CV
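A minimal sketch of this setup using scikit-learn, with synthetic binary fingerprints standing in for the real data (the array shapes and labels here are illustrative, not DOXLAS's actual dataset):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: 300 binary 2048-bit fingerprints, pChEMBL-like labels
rng = np.random.default_rng(42)
X = rng.integers(0, 2, size=(300, 2048)).astype(np.float32)
y = X[:, :16].sum(axis=1) + rng.normal(0.0, 0.5, size=300)  # learnable signal + noise

# Hyperparameters matching the config line above: 500 trees, min_samples_leaf=3
rf = RandomForestRegressor(n_estimators=500, min_samples_leaf=3,
                           n_jobs=-1, random_state=42)

scores = cross_val_score(rf, X, y, cv=5, scoring="r2")  # 5-fold CV
rf.fit(X, y)
preds = rf.predict(X[:5])
```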
XGB · XGBoost

Builds trees sequentially — each new tree fixes the mistakes of the previous ones. Like a student who reviews wrong answers and improves. Stops early when performance plateaus.

depth=6 · lr=0.1 · subsample=0.8 · early stopping
MLP · Neural Network

A deep learning model with 3 hidden layers (512→256→128 neurons). Learns complex nonlinear patterns. Uses dropout to prevent memorizing training data and batch normalization for stability.

512→256→128→1 · dropout=0.3 · Adam · patience=15
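The architecture above can be sketched in PyTorch roughly as follows (class and variable names are hypothetical; the layer sizes, dropout, and batch norm follow the config line):

```python
import torch
import torch.nn as nn

class PotencyMLP(nn.Module):
    """2048-dim fingerprint in -> 512 -> 256 -> 128 -> 1 potency score out."""
    def __init__(self, in_dim=2048, p_drop=0.3):
        super().__init__()
        dims = [in_dim, 512, 256, 128]
        layers = []
        for d_in, d_out in zip(dims, dims[1:]):
            layers += [
                nn.Linear(d_in, d_out),
                nn.BatchNorm1d(d_out),  # stabilizes training
                nn.ReLU(),
                nn.Dropout(p_drop),     # discourages memorizing the training set
            ]
        layers.append(nn.Linear(128, 1))  # regression head
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x).squeeze(-1)

model = PotencyMLP()
out = model(torch.randn(4, 2048))  # batch of 4 fingerprints -> 4 scores
```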
MLP+ · Tuned Neural Net

Same architecture but trained on fingerprints plus 8 molecular descriptors. Uses lower learning rate and less dropout for more careful optimization. Slightly better on DRD2.

2056 features · dropout=0.2 · lr=5e-4 · patience=20
MLP Training Progress — EGFR
Neural Network Learning Over 81 Epochs
EARLY STOPPED

Each "epoch" = one full pass through training data. Loss (error) decreases as the model learns. Training auto-stopped when performance plateaued.

Epoch | Train Loss | Val R² | Status
1     | 40.840     | -22.30 | Started
10    | 2.189      | 0.419  | Rapid gain
20    | 1.219      | 0.667  | Good progress
30    | 1.063      | 0.734  | Strong
50    | 0.821      | 0.745  | Near peak
60    | 0.820      | 0.762  | Best val
81    | 0.747      | 0.760  | Auto-stopped
Compound Library

Each compound is a SMILES string — text notation encoding molecular structure. pChEMBL measures binding potency (log scale): ≥7 is very potent, 5-7 is moderate, <5 is weak.
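The pChEMBL scale is just the negative base-10 log of the measured potency in molar units, so nanomolar measurements convert with one line (a stdlib sketch; the function name is illustrative):

```python
import math

def pchembl(ic50_nm: float) -> float:
    """pChEMBL = -log10(potency in molar); input IC50 is in nanomolar."""
    return -math.log10(ic50_nm * 1e-9)

print(pchembl(1.0))      # 1 nM   -> ≈ 9.0 (very potent)
print(pchembl(100.0))    # 100 nM -> ≈ 7.0 (the "very potent" threshold)
print(pchembl(50_000))   # 50 µM  -> ≈ 4.3 (weak)
```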

ID | SMILES | MW | LogP | HBA | pChEMBL | Potency
Molecular Feature Engineering

Each molecule is converted into a numeric vector for ML. DOXLAS uses two types: a molecular "fingerprint" capturing structural patterns, plus 8 physical/chemical properties.

Morgan Fingerprints (ECFP4)

Scans each atom and records what's nearby (within 2 bonds). Each unique pattern gets hashed into a 2048-bit binary vector. Present pattern = 1, absent = 0. A "digital fingerprint" of the molecule.

Bits
2,048
Radius
2
ECFP4 standard
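The hashing idea can be illustrated with a toy stdlib sketch — this is a conceptual stand-in, not RDKit's actual ECFP4 algorithm, which enumerates real atom neighborhoods up to radius 2:

```python
import zlib

def to_bitvector(patterns, n_bits=2048):
    """Toy fingerprint: hash each substructure pattern to a bit position.
    Present pattern -> bit set to 1; everything else stays 0."""
    bits = [0] * n_bits
    for p in patterns:
        idx = zlib.crc32(p.encode()) % n_bits  # deterministic hash -> bit index
        bits[idx] = 1
    return bits

# Hypothetical patterns a radius-2 scan might find in one molecule
fp = to_bitvector(["c1ccccc1", "C(=O)N", "c:c-N"])
```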
Physicochemical Descriptors

Eight properties computed from the structure using RDKit. Describe the molecule's size, shape, and drug-likeness. Complement the fingerprint with global properties.

Descriptors
8
Total Features
2,056
2048 FP + 8 desc
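Assembling the final feature vector is a simple concatenation — a NumPy sketch with placeholder values (the descriptor numbers below are just the EGFR means from the table, used as dummies):

```python
import numpy as np

fingerprint = np.random.default_rng(7).integers(0, 2, size=2048)  # 2048-bit ECFP4
descriptors = np.array([354.4, 3.16, 0.90, 3.89, 51.08, 4.2, 2.1, 24.8])  # MW, LogP, HBD, HBA, TPSA, RotBonds, AroRings, HeavyAtoms

# 2048 fingerprint bits + 8 descriptors = 2,056-dim model input
features = np.concatenate([fingerprint.astype(np.float64), descriptors])
```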
Descriptor Statistics — EGFR
What Each Descriptor Means
Descriptor | Measures | Why It Matters | EGFR Mean
MW | How heavy the molecule is | Larger = more binding surface but harder to absorb | 354.4 ± 138.6
LogP | Oil vs water preference | Controls absorption — too high and it won't dissolve in blood | 3.16 ± 2.49
HBD | H-bond donors | Key for protein binding; too many hurt absorption | 0.90 ± 0.89
HBA | H-bond acceptors | More acceptors can mean stronger binding | 3.89 ± 1.76
TPSA | Polar surface area | Predicts ability to cross cell membranes | 51.08 ± 25.89
RotBonds | Flexible bonds | More flexibility = harder to bind tightly | 4.2 ± 2.8
AroRings | Aromatic rings | Common in drugs — stack with protein residues | 2.1 ± 1.2
HeavyAtoms | Non-hydrogen atoms | Proxy for molecular size/complexity | 24.8 ± 8.9
Data Processing Pipeline

Four stages, each a dedicated Python module. One command processes a new drug target end-to-end.

01
Ingest
chembl_fetch.py
Pulls bioactivity data from ChEMBL — the world's largest open drug-target database. 1,000 records/request with rate limiting.
Fetches IC50, Ki, EC50, Kd
02
Standardize
smiles_clean.py
Validates every molecule with RDKit. Drops errors, missing data, ambiguous measurements. Removes duplicates.
Retention: 44-59% of raw
03
Featurize
fingerprints.py
Converts molecules into numeric vectors: 2048-bit fingerprint (molecular barcode) + 8 physicochemical descriptors.
Output: 2,056-dim vector
04
Model
rf / xgb / mlp
Trains multiple ML models on 80% of data, tests on held-out 20%. Saves metrics and trained model files.
Best: R²=0.712
One-Command Execution
python -m src.pipeline --target CHEMBL220 --max-records 5000

Downloads data, cleans it, computes features, and trains the RF baseline. ~75 seconds on an M3 MacBook Pro.
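A minimal sketch of how such a CLI entry point might be wired with argparse — the flags mirror the command shown above, but the stage wiring is a hypothetical skeleton, not the actual `src/pipeline.py`:

```python
import argparse

def build_parser():
    # Flags match the one-command invocation shown above
    p = argparse.ArgumentParser(prog="src.pipeline")
    p.add_argument("--target", required=True,
                   help="ChEMBL target ID, e.g. CHEMBL220")
    p.add_argument("--max-records", type=int, default=5000,
                   help="cap on raw records fetched from ChEMBL")
    return p

# Parse the documented example command's arguments
args = build_parser().parse_args(["--target", "CHEMBL220", "--max-records", "5000"])
# A real pipeline would now run: ingest -> standardize -> featurize -> model
```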

Project Architecture
doxlas/
├── src/
│   ├── ingest/       chembl_fetch.py
│   ├── standardize/  smiles_clean.py
│   ├── featurize/    fingerprints.py
│   ├── model/        rf_baseline.py
│   │                 mlp_model.py
│   │                 mlp_tuned.py
│   │                 xgb_model.py
│   ├── report/       summary.py
│   └── pipeline.py
├── data/
│   ├── raw/          chembl_*.csv
│   └── processed/    clean_*  fp_*  meta_*
└── models/           *.pkl  *.pt  *.json
Tech Stack
Python 3.11 · RDKit · PyTorch 2.10 · scikit-learn · XGBoost · pandas · NumPy · ChEMBL API · M3 ARM64
Design Principles
Modular: Each stage is an independent package
Reproducible: Pinned deps, fixed random seeds
CPU-complete: Full CPU support, GPU optional
Version controlled: Clean git history, 11 commits
Development Timeline

Built from scratch over two sessions. Every commit = a working pipeline state.

Git Commit History
11 COMMITS
What's Next
Ensemble Models

Combine RF + XGBoost predictions for better accuracy than either alone.
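The simplest version of this is an unweighted average of the two models' predictions — a NumPy sketch with made-up prediction values:

```python
import numpy as np

# Hypothetical held-out predictions from the two trained models
rf_preds  = np.array([6.8, 7.2, 5.9])
xgb_preds = np.array([6.6, 7.4, 6.1])

# Averaging tends to cancel the models' uncorrelated errors
ensemble = 0.5 * rf_preds + 0.5 * xgb_preds
```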

Scaffold Splits

Split by molecular scaffolds instead of randomly — more realistic test of generalization.
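The core idea is a group-aware split: every compound sharing a scaffold lands on the same side, so the test set contains scaffolds the model never saw. A stdlib sketch with placeholder scaffold IDs (in practice these would come from e.g. Bemis-Murcko scaffolds via RDKit):

```python
from collections import defaultdict

def scaffold_split(scaffold_ids, test_frac=0.2):
    """Assign whole scaffold groups to the test set until ~test_frac is reached."""
    groups = defaultdict(list)
    for i, s in enumerate(scaffold_ids):
        groups[s].append(i)
    test, n_target = [], int(len(scaffold_ids) * test_frac)
    # Fill the test set from the rarest scaffolds first
    for s in sorted(groups, key=lambda s: len(groups[s])):
        if len(test) >= n_target:
            break
        test.extend(groups[s])
    train = [i for i in range(len(scaffold_ids)) if i not in set(test)]
    return train, test

# Letters stand in for scaffold IDs of 10 hypothetical compounds
train, test = scaffold_split(["A", "A", "B", "C", "C", "C", "D", "D", "E", "E"])
```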

Generative Chemistry

Use AI to propose entirely new molecules optimized for a target.

DOXLAS v0.2.0 · CPU-complete
Built by Mathew Stamper · Stamper Lab LLC · 2026