DOXLAS
A machine learning pipeline that predicts how strongly drug-like molecules bind to disease targets — turning raw chemical data into actionable drug discovery intelligence.
Drug discovery is slow and expensive — testing every possible molecule against a disease target in a lab costs millions and takes years. Most candidate drugs fail. Researchers need a way to predict which molecules are worth testing before synthesizing them.
DOXLAS uses machine learning to predict how potently a molecule will bind to a target protein. Given a molecule's chemical structure, it outputs a potency score — letting researchers prioritize the most promising candidates. On its best target it currently reaches R² = 0.71, i.e. it explains about 71% of the variance in measured potency.
Epidermal Growth Factor Receptor — a key protein in cell growth. When it mutates, cells grow uncontrollably, causing cancers (especially lung cancer). Drugs like gefitinib and erlotinib target EGFR. Its binding site is well-characterized, making it more predictable for ML.
Dopamine D2 Receptor — controls mood, motivation, and movement. Implicated in schizophrenia, Parkinson's, and addiction. Antipsychotics like haloperidol block DRD2. Its flexible binding site makes structure-activity relationships more complex and harder to model.
At ~2-3k compounds, Random Forest consistently outperforms neural networks. Tree models handle sparse binary fingerprints very well without heavy tuning.
EGFR (R²=0.71) is significantly easier than DRD2 (R²=0.54). Kinase binding sites are more rigid and predictable, while GPCR sites are flexible and complex.
The MLP shows promise (val R²=0.76) but overfits on holdout. With 10k+ compounds and scaffold splitting, deep learning should close the gap.
| # | Model | Target | What It Is | R² | RMSE | Train | Test | Notes |
|---|---|---|---|---|---|---|---|---|
Builds 500 decision trees, each trained on a random subset of the data. Final prediction = average of all trees. Like asking 500 experts and taking their consensus — robust and hard to fool.
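A minimal sketch of this setup with scikit-learn (the fingerprint data and labels here are synthetic stand-ins, not the project's actual code):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)

# Toy stand-ins for 2048-bit Morgan fingerprints and pChEMBL labels.
X = rng.integers(0, 2, size=(300, 2048)).astype(np.float32)
y = X[:, :10].sum(axis=1) + rng.normal(0, 0.3, size=300)

# 500 trees, each fit on a bootstrap sample of the data;
# the final prediction is the mean over all trees.
model = RandomForestRegressor(n_estimators=500, n_jobs=-1, random_state=42)
model.fit(X, y)

preds = model.predict(X[:5])
```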
Builds trees sequentially — each new tree fixes the mistakes of the previous ones. Like a student who reviews wrong answers and improves. Stops early when performance plateaus.
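The sequential-correction idea, sketched with scikit-learn's `GradientBoostingRegressor` as a stand-in (the data and hyperparameters here are illustrative): each new tree is fit to the residual errors of the ensemble so far, and `n_iter_no_change` stops training once the validation score plateaus.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(400, 64)).astype(np.float32)
y = X[:, :8].sum(axis=1) + rng.normal(0, 0.2, size=400)

model = GradientBoostingRegressor(
    n_estimators=1000,        # upper cap on trees
    learning_rate=0.1,
    validation_fraction=0.2,  # held out to watch for a plateau
    n_iter_no_change=10,      # stop after 10 rounds with no improvement
    random_state=0,
)
model.fit(X, y)

n_trees_used = model.n_estimators_  # usually well below the 1000-tree cap
```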
A deep learning model with 3 hidden layers (512→256→128 neurons). Learns complex nonlinear patterns. Uses dropout to prevent memorizing training data and batch normalization for stability.
Same architecture but trained on fingerprints plus 8 molecular descriptors. Uses lower learning rate and less dropout for more careful optimization. Slightly better on DRD2.
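A sketch of the shared architecture in PyTorch (assuming PyTorch; the 512→256→128 layer sizes come from the description above, while the dropout rate is illustrative):

```python
import torch
import torch.nn as nn

class PotencyMLP(nn.Module):
    """Three hidden layers (512 -> 256 -> 128) with batch norm and dropout."""

    def __init__(self, in_dim: int = 2048, dropout: float = 0.3):
        super().__init__()
        layers, prev = [], in_dim
        for width in (512, 256, 128):
            layers += [
                nn.Linear(prev, width),
                nn.BatchNorm1d(width),  # stabilizes training
                nn.ReLU(),
                nn.Dropout(dropout),    # discourages memorizing training data
            ]
            prev = width
        layers.append(nn.Linear(prev, 1))  # single potency output
        self.net = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

model = PotencyMLP().eval()
scores = model(torch.randn(4, 2048))  # batch of 4 fingerprints
```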
Each "epoch" = one full pass through training data. Loss (error) decreases as the model learns. Training auto-stopped when performance plateaued.
| Epoch | Train Loss | Val R² | Status |
|---|---|---|---|
| 1 | 40.840 | -22.30 | Started |
| 10 | 2.189 | 0.419 | Rapid gain |
| 20 | 1.219 | 0.667 | Good progress |
| 30 | 1.063 | 0.734 | Strong |
| 50 | 0.821 | 0.745 | Near peak |
| 60 | 0.820 | 0.762 | Best val |
| 81 | 0.747 | 0.760 | Auto-stopped |
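The auto-stop in the log above is standard early stopping: keep training while the validation score improves, halt once it stalls. A minimal sketch of the logic (the patience value is illustrative):

```python
def run_with_early_stopping(val_scores, patience=20):
    """Stop once val R² has not improved for `patience` epochs.

    `val_scores` stands in for one score per epoch; in the real loop
    each score would come from evaluating the model on validation data.
    """
    best, best_epoch = float("-inf"), 0
    epoch = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best:
            best, best_epoch = score, epoch   # new best checkpoint
        elif epoch - best_epoch >= patience:
            break                             # plateau -> auto-stop
    return best, best_epoch, epoch
```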
Each compound is a SMILES string — text notation encoding molecular structure. pChEMBL measures binding potency (log scale): ≥7 is very potent, 5-7 is moderate, <5 is weak.
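pChEMBL is the negative log10 of the measured activity concentration in molar units, so an IC50 of 100 nM maps to exactly 7.0. A quick sketch of the conversion:

```python
import math

def pchembl(activity_nm: float) -> float:
    """Convert an activity value in nM to the pChEMBL scale.

    pChEMBL = -log10(activity in molar); 1 nM = 1e-9 M.
    """
    return -math.log10(activity_nm * 1e-9)

pchembl(100.0)    # 100 nM -> 7.0  (very potent)
pchembl(1000.0)   # 1 µM   -> 6.0  (moderate)
pchembl(50000.0)  # 50 µM  -> ~4.3 (weak)
```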
Each molecule is converted into a numeric vector for ML. DOXLAS uses two types: a molecular "fingerprint" capturing structural patterns, plus 8 physical/chemical properties.
Scans each atom and records what's nearby (within 2 bonds). Each unique pattern gets hashed into a 2048-bit binary vector. Present pattern = 1, absent = 0. A "digital fingerprint" of the molecule.
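This is the Morgan (ECFP-like) fingerprint. With RDKit — assumed here to be the featurization library, as it is for the descriptors below — it looks like:

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Aspirin as an example molecule.
mol = Chem.MolFromSmiles("CC(=O)OC1=CC=CC=C1C(=O)O")

# radius=2 -> atom environments up to 2 bonds away,
# hashed into a 2048-bit binary vector.
fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)

bits = list(fp)              # 0/1 features for the ML model
on_bits = fp.GetNumOnBits()  # substructure patterns present in aspirin
```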
Eight properties computed from the structure using RDKit. Describe the molecule's size, shape, and drug-likeness. Complement the fingerprint with global properties.
| Descriptor | Measures | Why It Matters | EGFR Mean |
|---|---|---|---|
| MW | How heavy the molecule is | Larger = more binding surface but harder to absorb | 354.4 ± 138.6 |
| LogP | Oil vs water preference | Controls absorption — too high and it won't dissolve in blood | 3.16 ± 2.49 |
| HBD | H-bond donors | Key for protein binding; too many hurt absorption | 0.90 ± 0.89 |
| HBA | H-bond acceptors | More acceptors can mean stronger binding | 3.89 ± 1.76 |
| TPSA | Polar surface area | Predicts ability to cross cell membranes | 51.08 ± 25.89 |
| RotBonds | Flexible bonds | More flexibility = harder to bind tightly | 4.2 ± 2.8 |
| AroRings | Aromatic rings | Common in drugs — stack with protein residues | 2.1 ± 1.2 |
| HeavyAtoms | Non-hydrogen atoms | Proxy for molecular size/complexity | 24.8 ± 8.9 |
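The eight descriptors in the table above can be computed with RDKit roughly like this (a sketch; the project's exact descriptor code may differ):

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

mol = Chem.MolFromSmiles("CC(=O)OC1=CC=CC=C1C(=O)O")  # aspirin

descriptors = {
    "MW": Descriptors.MolWt(mol),                    # molecular weight
    "LogP": Descriptors.MolLogP(mol),                # oil/water preference
    "HBD": Descriptors.NumHDonors(mol),              # H-bond donors
    "HBA": Descriptors.NumHAcceptors(mol),           # H-bond acceptors
    "TPSA": Descriptors.TPSA(mol),                   # polar surface area
    "RotBonds": Descriptors.NumRotatableBonds(mol),  # flexible bonds
    "AroRings": Descriptors.NumAromaticRings(mol),   # aromatic rings
    "HeavyAtoms": mol.GetNumHeavyAtoms(),            # non-hydrogen atoms
}
```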
Four stages, each a dedicated Python module. One command processes a new drug target end-to-end.
Downloads data, cleans it, computes features, and trains the Random Forest — about 75 seconds end-to-end on an M3 MacBook Pro.
Reproducible: Pinned deps, fixed random seeds
CPU-friendly: Runs end-to-end on CPU, GPU optional
Version controlled: Clean git history, 11 commits
Built from scratch over two sessions. Every commit = a working pipeline state.
Combine RF + XGBoost predictions for better accuracy than either alone.
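The simplest version of this is an unweighted average of the two models' predictions; a sketch on synthetic data, with scikit-learn's gradient boosting standing in for XGBoost:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(300, 32)).astype(np.float32)
y = X[:, :4].sum(axis=1) + rng.normal(0, 0.2, size=300)

rf = RandomForestRegressor(n_estimators=100, random_state=1).fit(X, y)
gb = GradientBoostingRegressor(random_state=1).fit(X, y)  # XGBoost stand-in

# Average the predictions; errors of the two models partially cancel.
ensemble_pred = (rf.predict(X) + gb.predict(X)) / 2
```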
Split by molecular scaffolds instead of randomly — more realistic test of generalization.
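Scaffold splitting groups molecules by their Bemis-Murcko scaffold (the core ring system) so train and test sets never share one; a sketch with RDKit (the SMILES list is illustrative):

```python
from collections import defaultdict
from rdkit.Chem.Scaffolds import MurckoScaffold

smiles = ["Cc1ccccc1", "CCc1ccccc1", "C1CCNCC1", "CC1CCNCC1"]

# Group molecules by scaffold; assign whole groups to train or test
# so the model is scored on ring systems it has never seen.
groups = defaultdict(list)
for smi in smiles:
    scaffold = MurckoScaffold.MurckoScaffoldSmiles(smiles=smi)
    groups[scaffold].append(smi)
```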
Use AI to propose entirely new molecules optimized for a target.