Large scale classification of chemical reactions from patent data
Gregory Landrum
NIBR Informatics, Basel
Novartis Institutes for BioMedical Research
10th International Conference on Chemical Structures/
10th German Conference on Chemoinformatics
Outline
§  Public data sources and reactions
§  Fingerprints for reactions
§  Validation:
•  Machine learning
•  Clustering
§  Application: models for predicting yield
Public data sources in cheminformatics
an aside at the beginning
§  Publicly available data sources for small molecules and
their biological activities/interactions:
•  PDB, PubChem, ChEMBL, etc.
§  Publicly available data sources for the chemistry behind
how those molecules were actually made (i.e. reactions):
•  pretty much nothing until recently
§  Plenty of data is locked up in large commercial databases
and pharmaceutical companies’ ELNs; very little is in
the open
The “public/open” point is important for
collaboration and reproducibility
A large, public source of chemical reactions
Not just what we made, but how we made it
§  Text-mining applied to open patent data to extract chemical reactions:
1.12 million reactions [1]
§  Reactions classified using namerxn, when possible, into 318 standard
types: >599000 classified reactions [2]
[1] Lowe DM: “Extraction of chemical structures and reactions from the literature.” PhD
thesis. University of Cambridge: Cambridge, UK; 2012.
[2] Reaction classification from Roger Sayle and Daniel Lowe (NextMove Software)
http://nextmovesoftware.com/blog/2014/02/27/unleashing-over-a-million-reactions-into-the-wild/
More about the classes
Frequency of reaction classes:
20 most common classes:
Count   Class
44675   2.1.2 Carboxylic acid + amine reaction
39297   1.7.9 Williamson ether synthesis
28194   2.1.1 Amide Schotten-Baumann
26739   1.3.7 Chloro N-arylation
22400   1.6.2 Bromo N-alkylation
20465   7.1.1 Nitro to amino
20405   1.6.4 Chloro N-alkylation
17226   6.2.2 CO2H-Me deprotection
16602   6.1.1 N-Boc deprotection
16021   6.2.1 CO2H-Et deprotection
12952   1.2.1 Aldehyde reductive amination
12250   2.2.3 Sulfonamide Schotten-Baumann
10659   11.9 Separation
8538    3.1.5 Bromo Suzuki-type coupling
7261    1.7.7 Mitsunobu aryl ether synthesis
7102    6.3.7 Methoxy to hydroxy
7071    3.3.1 Sonogashira coupling
6472    3.1.1 Bromo Suzuki coupling
6383    1.8.5 Thioether synthesis
5791    9.1.6 Hydroxy to chloro
Got the reactions, what about reaction fingerprints?
Criteria for them to be useful
§  Question 1: do they contain bits that are helpful in
distinguishing reactions from one another?
Test: can we use them with a machine-learning approach to build a
reaction classifier?
§  Question 2: are similar reactions similar according to the
fingerprints?
Test: do related reactions cluster together?
Our toolbox: the RDKit
§  Open-source C++ toolkit for cheminformatics
§  Wrappers for Python (2.x), Java, C#
§  Functionality:
•  2D and 3D molecular operations
•  Descriptor generation for machine learning
•  PostgreSQL database cartridge for substructure and similarity searching
•  KNIME nodes
•  IPython integration
•  Lucene integration (experimental)
•  Supports Mac/Windows/Linux
§  Releases every 6 months
§  business-friendly BSD license
§  Code: https://github.com/rdkit
§  http://www.rdkit.org
Similarity and reactions
What are we talking about?
§  These two reactions are both type: “1.2.5 Ketone reductive amination”
It’s obvious that these are the same, right?
Got the reactions, what about reaction fingerprints?
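Start simple: use difference fingerprints:
$$FP_{\mathrm{Reacts}} = \sum_{i \in \mathrm{Reactants}} FP_i$$
$$FP_{\mathrm{Products}} = \sum_{i \in \mathrm{Products}} FP_i$$
$$FP_{\mathrm{Rxn}} = FP_{\mathrm{Products}} - FP_{\mathrm{Reacts}}$$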
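The difference-fingerprint idea can be illustrated with a minimal sketch. It assumes each molecule's fingerprint is a sparse set of feature counts; the feature names below are hand-written toy values, not real atom-pair output, and the helper names are hypothetical.

```python
from collections import Counter


def sum_fingerprints(fingerprints):
    """Add count fingerprints feature by feature."""
    total = Counter()
    for fp in fingerprints:
        total.update(fp)
    return total


def difference_fingerprint(reactant_fps, product_fps):
    """Return FP_Rxn = FP_Products - FP_Reacts as a sparse dict (zeros dropped)."""
    reacts = sum_fingerprints(reactant_fps)
    prods = sum_fingerprints(product_fps)
    keys = set(reacts) | set(prods)
    diff = {k: prods[k] - reacts[k] for k in keys}
    return {k: v for k, v in diff.items() if v != 0}


# Toy feature counts for an amide coupling (hypothetical values)
acid = Counter({"C=O": 1, "C-OH": 1, "C-C": 1})
amine = Counter({"C-NH2": 1, "C-C": 1})
amide = Counter({"C=O": 1, "C-N": 1, "C-C": 2})

rxn_fp = difference_fingerprint([acid, amine], [amide])
print(rxn_fp)
```

Features that are unchanged by the reaction cancel out, so only what is lost (negative counts) and what is formed (positive counts) remains. With the RDKit, `rdChemReactions.CreateDifferenceFingerprintForReaction` serves a similar purpose on real reaction objects.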
Similar idea here:
1) Ridder, L. & Wagener, M. SyGMa: Combining Expert Knowledge and Empirical Scoring in the Prediction of
Metabolites. ChemMedChem 3, 821–832 (2008).
2) Patel, H., Bodkin, M. J., Chen, B. & Gillet, V. J. Knowledge-Based Approach to de Novo Design Using Reaction
Vectors. J. Chem. Inf. Model. 49, 1163–1184 (2009).
Refine the fingerprints a bit
Text-mined reactions often include catalysts,
reagents, or solvents in the reactants
Explore two options for handling this:
1.  Decrease the weight of reactant molecules where too many
of the bits are not present in the product fingerprint
2.  Decrease the weight of reactant molecules where too many
atoms are unmapped
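Both weighting options can be sketched in a few lines. This is only an illustration: the threshold, the reduced weight, and the convention that an atom-map number of 0 means "unmapped" are assumptions, not the values or conventions used in the talk.

```python
def fraction_missing_from_product(reactant_fp, product_fp):
    """Fraction of a reactant's features that do not occur in the product fingerprint."""
    if not reactant_fp:
        return 0.0
    missing = sum(1 for feature in reactant_fp if feature not in product_fp)
    return missing / len(reactant_fp)


def fraction_unmapped(atom_map_numbers):
    """Fraction of atoms without an atom-map number (0 is assumed to mean unmapped)."""
    if not atom_map_numbers:
        return 0.0
    return sum(1 for n in atom_map_numbers if n == 0) / len(atom_map_numbers)


def reactant_weight(fraction, threshold=0.5, reduced_weight=0.1):
    """Full weight up to the threshold, reduced weight above it.

    threshold and reduced_weight are illustrative guesses.
    """
    return reduced_weight if fraction > threshold else 1.0


# Toy fingerprints (hypothetical feature names)
product = {"C=O": 1, "C-N": 1, "C-C": 2}
amine = {"C-N": 1, "C-C": 1}
solvent = {"C-Cl": 2, "Cl-Cl": 1}  # e.g. a solvent listed among the reactants

w_amine = reactant_weight(fraction_missing_from_product(amine, product))
w_solvent = reactant_weight(fraction_missing_from_product(solvent, product))
w_unmapped = reactant_weight(fraction_unmapped([0, 0, 0, 1]))
```

The weight would then multiply the reactant's fingerprint before it enters the reactant sum, so that solvents and catalysts contribute little to the reaction fingerprint.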
Are the fingerprints useful?
§  Question 1: do they contain bits that are helpful in
distinguishing reactions from one another?
Test: can we use them with a machine-learning approach to build a
reaction classifier?
§  Question 2: are similar reactions similar according to the
fingerprints?
Test: do related reactions cluster together?
Machine learning and chemical reactions
§  Validation set:
•  The 68 reaction types with at least 2000 instances from the patent
data set
-  “Resolution” reaction types removed (e.g. 11.9 Separation and 11.1 Chiral
separation)
-  Final: 66 reaction types
§  Process:
•  Training set is 200 random instances of each reaction type
•  Test set is 800 random instances of each reaction type
•  Learning: random forest (scikit-learn)
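The sampling step (200 training and 800 test instances per reaction type) might look roughly like the sketch below. The function name, the seed, and the synthetic labels are assumptions made for illustration; only the 200/800 sizes come from the slide.

```python
import random
from collections import defaultdict


def split_per_class(labels, n_train=200, n_test=800, seed=42):
    """Pick disjoint random train/test index lists with fixed sizes per class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, test = [], []
    for label, indices in sorted(by_class.items()):
        if len(indices) < n_train + n_test:
            raise ValueError(f"class {label!r} has too few instances")
        picked = rng.sample(indices, n_train + n_test)
        train.extend(picked[:n_train])
        test.extend(picked[n_train:])
    return train, test


# Synthetic labels: three hypothetical classes with 2000 instances each
labels = [name for name in ("1.2.5", "2.1.2", "3.1.1") for _ in range(2000)]
train_idx, test_idx = split_per_class(labels)
```

The resulting index lists would select the fingerprint rows used to fit and evaluate the random forest, for example scikit-learn's `RandomForestClassifier`.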
Learning reaction classes
Results for test data
Overall:
•  Recall: 0.94
•  Precision: 0.94
•  Accuracy: 0.94
For a 66-class classifier, this looks pretty good!
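For reference, here is one way such overall numbers can be derived from a confusion matrix. The slide does not say how recall and precision were averaged; the sketch assumes macro-averaging (the unweighted mean over classes) and uses a made-up 3-class matrix.

```python
def overall_metrics(confusion):
    """confusion[i][j] = number of class-i test examples predicted as class j."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))
    recalls, precisions = [], []
    for i in range(n):
        row_total = sum(confusion[i])
        col_total = sum(confusion[r][i] for r in range(n))
        recalls.append(confusion[i][i] / row_total if row_total else 0.0)
        precisions.append(confusion[i][i] / col_total if col_total else 0.0)
    return {
        "accuracy": correct / total,
        "recall": sum(recalls) / n,        # macro-averaged (assumption)
        "precision": sum(precisions) / n,  # macro-averaged (assumption)
    }


# Made-up confusion matrix with 10 test examples per class
toy = [
    [9, 1, 0],
    [1, 8, 1],
    [0, 0, 10],
]
metrics = overall_metrics(toy)
```

When every class has the same number of test examples, as in the 800-per-class test set, macro-averaged recall and accuracy are the same number.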
Learning reaction classes
Confusion matrix for test data
~94% accuracy; much of the confusion is between related types
[Confusion matrix labels: Bromo Suzuki coupling, Bromo Suzuki-type coupling, Bromo N-arylation]
Are the fingerprints useful?
§  Question 1: do they contain bits that are helpful in
distinguishing reactions from one another?
Test: can we use them with a machine-learning approach to build a
reaction classifier?
§  Question 2: are similar reactions similar according to the
fingerprints?
Test: do related reactions cluster together?
Clustering reactions
§  Reaction similarity validation set:
•  The 66 most common reaction types from the patent data set
•  Look at the homogeneity of clusters with at least 10 members
[Figure residue: three panels labelled “1.2.5 Ketone reductive amination”; one label “Integration”]
Interpretation: <40% of clusters are <80% homogeneous
Interpretation: <30% of clusters are <90% homogeneous
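A homogeneity statistic of this kind could be computed as in the sketch below. The slides do not define homogeneity; the sketch assumes it is the share of a cluster's members that belong to the cluster's most common reaction class. The clusters are invented.

```python
from collections import Counter


def cluster_homogeneity(member_labels):
    """Fraction of cluster members belonging to the cluster's most common class."""
    counts = Counter(member_labels)
    return counts.most_common(1)[0][1] / len(member_labels)


def fraction_below(clusters, cutoff, min_size=10):
    """Fraction of clusters with >= min_size members whose homogeneity is below cutoff."""
    big = [c for c in clusters if len(c) >= min_size]
    if not big:
        return 0.0
    return sum(1 for c in big if cluster_homogeneity(c) < cutoff) / len(big)


# Invented clusters, each a list of reaction-class labels
clusters = [
    ["1.2.5"] * 10,
    ["1.2.5"] * 9 + ["1.2.1"],
    ["3.1.1"] * 7 + ["3.1.5"] * 5,
    ["2.1.2"] * 4,  # fewer than 10 members, ignored
]
below_80 = fraction_below(clusters, 0.8)
below_95 = fraction_below(clusters, 0.95)
```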
Using the fingerprints
Can we help classify the remaining 600K reactions?
§  Apply the 66-class random forest to generate class predictions for the
unclassified reactions in order to find reactions we missed
§  Cluster the unclassified reactions, look for big clusters, and
(manually) assign classes to them.
§  Both of these approaches have been successful
Predicting yields
§  The data set includes text-mined yield information as well as
calculated yields.
§  For modeling: prefer the text-mined value, but take the calculated one
if that’s the only thing available
§  Look at stats for the 93 reaction classes that have at least 500
members with yields, a min yield > 0 and a max yield < 110 %:
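The yield-selection rule can be written down as a small sketch. The record field names are hypothetical, and applying the 0 to 110 % limits to each individual yield is only one reading of the slide.

```python
def modeling_yield(text_mined, calculated):
    """Prefer the text-mined yield; fall back to the calculated one; None if neither."""
    if text_mined is not None:
        return text_mined
    return calculated


def usable(yield_percent):
    """Keep yields strictly between 0 and 110 percent (an assumed reading)."""
    return yield_percent is not None and 0 < yield_percent < 110


# Invented records; "text" and "calc" are hypothetical field names
records = [
    {"text": 85.0, "calc": 80.2},
    {"text": None, "calc": 62.5},
    {"text": None, "calc": None},
    {"text": 140.0, "calc": 95.0},
]
yields = [modeling_yield(r["text"], r["calc"]) for r in records]
kept = [y for y in yields if usable(y)]
```

Note that the last record is dropped under this rule: the text-mined value takes precedence even though the calculated value would have passed the filter.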
Predicting yields
§  Look at the most populated classes:
Try building models for yield
§  Start with class 7.1.1 “nitro to amino”
§  Break into low-yield (<50%) and high-yield (>70%)
classes.
14% are low-yield
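The labelling step can be sketched as follows. The cutoffs come from the slide; leaving reactions with yields between 50% and 70% unlabelled is an assumption, since the slide does not say what happens to them. The yields are invented.

```python
def yield_class(yield_percent, low_cutoff=50.0, high_cutoff=70.0):
    """Return 'low' below low_cutoff, 'high' above high_cutoff, otherwise None."""
    if yield_percent < low_cutoff:
        return "low"
    if yield_percent > high_cutoff:
        return "high"
    return None  # intermediate yields are left unlabelled (assumption)


yields = [12.0, 45.0, 50.0, 65.0, 70.0, 88.0, 93.0, 99.0]
labels = [yield_class(y) for y in yields]
labelled = [label for label in labels if label is not None]
low_fraction = labelled.count("low") / len(labelled)
```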
Try building models for yield
things that don’t work
§  Try building a random forest using the atom-pair based
reaction fingerprints
That’s performance on the training set
Try building models for yield
things that don’t work
§  Try building a random forest using the atom-pair based
reactant fingerprints
That’s performance on the training set
Try building models for yield
things that don’t work?
§  Look at the ROC curve for the training-set data
[ROC curve annotations: first wrong “low-yield” prediction; nine wrong “low-yield” predictions]
The model is doing a great job of ordering compounds, but a bad job of
classifying compounds
Unbalanced data and ensemble classifiers
an aside
§  Usual decision rule for a two-class ensemble classifier:
take the result that the majority of the models (decision
trees for random forests) vote for.
§  That’s a decision boundary = 0.5
§  If the dataset is unbalanced, why should we expect
balanced behavior from the classifier?
§  Idea: use the composition of the training set to decide
what the decision boundary should be.
For example: if the data set is ~20% “low yield”, then assign “low
yield” to any example where at least 20% of the trees say “low yield”
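A minimal sketch of the shifted decision rule, assuming we already have, for each example, the fraction of trees that voted "low yield". The vote fractions and training labels below are invented.

```python
def low_yield_fraction(training_labels):
    """Share of 'low' examples in the training set, used as the decision boundary."""
    return sum(1 for label in training_labels if label == "low") / len(training_labels)


def classify(vote_fraction_low, boundary=0.5):
    """Assign 'low' when at least `boundary` of the trees vote 'low'."""
    return "low" if vote_fraction_low >= boundary else "high"


training_labels = ["low"] * 2 + ["high"] * 8    # 20% low yield
boundary = low_yield_fraction(training_labels)

votes = [0.05, 0.25, 0.45, 0.60]  # hypothetical fraction of trees voting "low"
default_calls = [classify(v) for v in votes]
shifted_calls = [classify(v, boundary) for v in votes]
```

With the default boundary only the last example is called low-yield; with the boundary at 0.2 the last three are. In scikit-learn, `RandomForestClassifier.predict_proba` returns the mean of the per-tree class probabilities, which could stand in for the vote fraction here.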
Try building models for yield
Getting close to working
§  Try building a random forest using the atom-pair based
reactant fingerprints
That’s performance on the training set
§  What about moving the decision boundary to 0.2 to reflect
the unbalanced data set?
Starting to look ok. What about the test set?
Try building models for yield
Getting close to working
§  Results from a random forest using the atom-pair based
reactant fingerprints with the shifted decision boundary
test set
Not too terrible.
Try building models for yield
Some more models
§  Aldehyde reductive amination (no shift):
test set
§  Williamson ether synthesis (boundary 0.3)
test set
Try building models for yield
Some more models
§  Chloro N-Alkylation (no shift):
test set
§  Chloro N-Alkylation (0.4 shift)
test set
Wrapping up
§  Dataset: 1+ million reactions text-mined from patents
(publicly available) with reaction classes assigned
§  Fingerprints: weighted atom-pair delta and functional-group
delta fingerprints implemented using the RDKit
§  Fingerprint Validation:
•  Multiclass random-forest classifier ~94% accurate
•  Similarity measure works: similar reactions cluster together
§  Combination of clustering + functional group analysis
allows identification of new reaction classes
§  We’re also able to use the fingerprints to build reasonable
models for yield
Acknowledgements
§ NextMove Software:
• Roger Sayle
• Daniel Lowe
§ NIBR:
• Anna Pelliccioli
• Sereina Riniker
• Mike Tarselli
Advertising
3rd RDKit User Group Meeting
22-24 October 2014
Merck KGaA, Darmstadt, Germany
Talks, “talktorials”, lightning talks, social activities, and a hackathon on
the 24th.
Registration: http://goo.gl/z6QzwD
Full announcement: http://goo.gl/ZUm2wm
We’re looking for speakers. Please contact greg.landrum@gmail.com