- Previous Article
- Next Article

## I. INTRODUCTION

Ii. methods, a. operator quantum machine learning, b. datasets, c. geometry optimization, d. transition state search, e. validation, iii. results and discussion, a. geometry optimization, b. transition state search, iv. conclusion, acknowledgments, author declarations, conflict of interest, author contributions, data availability, transition state search and geometry relaxation throughout chemical compound space with quantum machine learning.

Note: This paper is part of the JCP Special Topic on Modern Semiempirical Electronic Structure Methods.

- Split-Screen
- Article contents
- Figures & tables
- Supplementary Data
- Peer Review
- Open the PDF for in another window
- Reprints and Permissions
- Cite Icon Cite
- Search Site

Stefan Heinen , Guido Falk von Rudorff , O. Anatole von Lilienfeld; Transition state search and geometry relaxation throughout chemical compound space with quantum machine learning. J. Chem. Phys. 14 December 2022; 157 (22): 221102. https://doi.org/10.1063/5.0112856

Download citation file:

- Ris (Zotero)
- Reference Manager

We use energies and forces predicted within response operator based quantum machine learning (OQML) to perform geometry optimization and transition state search calculations with legacy optimizers but without the need for subsequent re-optimization with quantum chemistry methods. For randomly sampled initial coordinates of small organic query molecules, we report systematic improvement of equilibrium and transition state geometry output as training set sizes increase. Out-of-sample S N 2 reactant complexes and transition state geometries have been predicted using the LBFGS and the QST2 algorithms with an root-mean-square deviation (RMSD) of 0.16 and 0.4 Å—after training on up to 200 reactant complex relaxations and transition state search trajectories from the QMrxn20 dataset, respectively. For geometry optimizations, we have also considered relaxation paths up to 5’595 constitutional isomers with sum formula C 7 H 10 O 2 from the QM9-database. Using the resulting OQML models with an LBFGS optimizer reproduces the minimum geometry with an RMSD of 0.14 Å, only using ∼6000 training points obtained from normal mode sampling along the optimization paths of the training compounds without the need for active learning. For converged equilibrium and transition state geometries, subsequent vibrational normal mode frequency analysis indicates deviation from MP2 reference results by on average 14 and 26 cm −1 , respectively. While the numerical cost for OQML predictions is negligible in comparison to density functional theory or MP2, the number of steps until convergence is typically larger in either case. The success rate for reaching convergence, however, improves systematically with training set size, underscoring OQML’s potential for universal applicability.

One of the fundamental challenges in quantum chemistry is the understanding of reaction mechanisms in order to predict chemical processes. To this end, numerous neural networks (reaction predictors) have been introduced, proposing the most likely reaction path way 1–7 for a given product. These models were trained on data obtained from experimental studies 8 only containing the molecular graph (as SMILES strings 9,10 ) and their corresponding yields. However, a crucial property of a chemical reaction is the activation energy (i.e., the difference between reactant and transition state energy), which is linked to the kinetics of the reaction. To predict activation energies with conventional electronic structure methods, both the reactant complex geometry and the transition state geometry need to be obtained. This is commonly done by iteratively following gradients of the potential energy surface (PES) toward the minimum or the saddle point, respectively. Due to the iterative nature of these schemes, imposing the repeated need to perform self-consistent field calculations to obtain updated forces, the computational burden is as large as it is predictable. 11 Furthermore, finding saddle points remains an additional challenge because, often enough, considerable manual work is required beforehand in order to generate reasonable initial structure guesses. Consequently, it is not surprising that so far only few reaction datasets, which contain transition state geometries as well as corresponding energies have been published in 2020 12,13 and 2021. 14

Only very recently, attempts have been made to use machine learning models to speed up transition state predictions. In 2019, Bligaard and co-workers used the nudged elastic band (NEB) 15,16 method to find transition states, relying on the neural network based Δ-ML model 17 together with a low level of theory as a baseline. 18 More recently, Mortensen et al. contributed the “atomistic structure learning algorithm” (ASLA), 19 enabling autonomous structure determination with a much reduced need for costly first-principles total energy calculations. Lemm et al. 20 introduced the graph to structure (G2S) machine learning model, predicting reactant complexes and transition state geometries for the QMrxn20 12 dataset without any account for energy considerations, solely using molecular graphs as input. Moreover, for 30 small organic molecules, neural networks predicting energies and forces to accelerate the geometry optimization in between ab initio iterations were introduced by Meyer and Hauser, 21 and by Born and Kästner. 22 Similar to G2S, Makós et al. 23 propose a “transition state generative adversarial neural network” (TS-GAN), which estimates transition state geometries using information from reactants and products only. This procedure allows for better initial geometries for a transition state search, reducing the number of steps toward a saddle point. Jackson et al. 14 developed a neural network (TSNet) predicting transition states for a small (∼50) S N 2 reaction dataset, as well as the geometries of the QM9 24 dataset.

Recent work from the group of Liu shows machine-learned potentials for small organic reactions (C 4 H 8 ) or the investigation of glucose pyrolysis reactions using the stochastic surface walking method neural network approach for global sampling of the potential energy surface and subsequent minimal energy path search using double-ended surface walking. 25,26 The resulting minimal energy paths were then further refined using density functional theory (DFT) calculations. Furthermore, using the same methods, Liu et al. investigated similar reactions using a heterogeneous catalytic system. 27,28

However, to the best of our knowledge, there is no machine learning model yet using, in strict analogy to the conventional quantum chemistry based protocol, predicted energies and forces only within the conventional optimization algorithms in order to relax geometries or find transition states. To tackle this challenge and show proof of principle, we have used the response operator-based quantum machine learning (OQML) 29,30 model with the Faber-Christensen-Huang-Lilienfeld (FCHL) representation 31,32 from our previous work and trained on energies and forces across chemical compound space in order to speed up geometry relaxations as well as transition state searches for new, out-of-sample compounds (see Fig. 1 ). As for any properly trained QML model, prediction errors decay systematically with training set size, and we demonstrate for the chemistries presented that encouraging levels of accuracy can be reached.

Schematic potential energy surfaces in chemical compound space. Arrows show the working principle of operator-based quantum machine learning (OQML)-based iterative structural optimization: Training on reaction profiles of different chemical systems (purple), the OQML model is able to interpolate forces and energies throughout chemical compound space, enabling the relaxation of the reactant and the search for the transition state (orange). Input geometries (squares) are easily obtained, e.g., from universal force field predictions.

To reduce the extensive amount of training data from global sampling as in Ref. 25 and 26 , we use normal mode sampling of optimization paths of relaxations and transition state searches of 200/500 training compounds, yielding ∼4000/6000 training data. Using out-of-sample predictions, we were able to optimize constitutional isomers, reactant complexes, and transition states. Furthermore, global optimization algorithms looking for minimal energy paths can assist in locating the transition state, but it should still be followed by an additional optimization (for example, Berny optimization) to find the “exact” saddle point. Therefore, using legacy optimizers (LBFGS and Berny) together with numerical frequencies obtained from our force predictions allows for a thorough validation of converged geometries without the need for further refinement using quantum chemical calculations.

For proof of principle, we have investigated geometry optimizations for all constitutional isomers with the C 7 H 10 O 2 sum formula drawn from QM9. 24 After training the OQML model on random geometries along the optimization path of 5500 calculations, going from a universal force field (UFF) minimum energy geometry to the B3LYP/6-31G(2df) minimum geometry, we optimized the remaining 500 constitutional isomers, resulting in a total root-mean-square deviation (RMSD) of only 0.14 Å. To probe transition states, we have trained OQML models on the QMrxn20 dataset 12 with thousands of examples for the S N 2 text book reaction at the MP2/6-311G(d) level of theory, enabling the relaxation of reactant complexes and the search for transition states, which both compare well to common density functional theory (DFT) results. As shown in Fig. 1 , this means training the OQML model on quantum chemistry reference energies and forces along the optimization trajectory obtained for relaxation and transition state search runs of training systems. Starting with universal force field (UFF) 33 geometries, OQML subsequently predicts energies and forces for 200 out-of-sample query systems, thereby enabling the application of legacy relaxation and transition state search algorithms throughout chemical compound space.

We have relied on a standard quantum machine learning approach, operator quantum machine learning (OQML), from our previous work as introduced by Christensen et al., 29,30 which is a kernel ridge regression (KRR) model, which explicitly encodes target functions and their derivatives. A detailed derivation can be found in Christensen et al., 32 Sec. II (Operator quantum machine learning). To train a model, the regression coefficients α , the following cost function is minimized using an Singular Value Decomposition (SVD) approach, as implemented in the QMLcode: 34

with K being the training kernel, U being the energies, and F being the forces. To predict the energies, the following matrix equation can be used:

and similarly for the forces,

where K s is the test kernel containing training and test instances.

The representation used throughout this work is the FCHL19 representation. 32 FCHL19 makes use of interatomic distances in its two-body terms and includes interatomic angles in the three-body term. FCHL19 was selected because of its remarkable performance for QM9 related datasets, 32 and due to its being the best structure-based representation in direct learning of activation energies in QMrxn20. 35 In Refs. 29 and 30 , the authors introduced the OQML approach using FCHL19. They proposed the use of two sets of hyperparameters, one for energies only and one for energies and forces. The parameters were optimized using the following cost function:

where U i is the energy of molecule i in the test set and F i and n i are the forces and number of atoms of the same molecule. In this work, we used the same optimized hyperparameters.

In this work, three datasets were used; the constitutional isomers from QM9 24 and the reactant complexes, as well as the transition states from QMrxn20 12,36 of the S N 2 reaction.

## 1. Constitutional isomers

The constitutional isomers are part of the QM9 dataset 24 with the sum formula C 7 O 2 H 10 . The dataset contains 6095 compounds at the B3LYP/6-31G(p,2df) level of theory (Gaussian09 37 ). The dataset was split into 500 out of sample compounds and 5595 training compounds. For this work, all initial geometries were optimized with OpenBabel 38 using the UFF force field 33 (truncated after 200 steps). Subsequently, the geometries were re-optimized using the ORCA 4.0 39 electronic structure code at the B3LYP/6-31G(2df) level of theory. For the training, a normal mode scan of these optimization steps was performed, and 6000 randomly chosen geometries from this set were chosen as training data.

## 2. Nucleophilic substitution reaction (S N 2)

This dataset is a subset of the QMrxn20 12,36 dataset. The scaffold of the molecules is ethane, which was substituted with leaving groups –F, –Cl, –Br, nucleophiles H − , F − , Cl − , Br − , and functional groups –H, –NO 2 , –CN, –CH 3 , and –NH 2 . The dataset contains 1807 reactions consisting of reactant complexes and transition states on an MP2/6-311G(d) level of theory. Out of those reactions, 200 were randomly chosen for training, and the test set contains 300 out of sample compounds (reactant complexes and transition states). For the reaction dataset in QMrxn20, the reactant complexes (training set) were optimized with the UFF force field (truncated after 200 steps), and subsequent geometry optimizations on the MP2/6-311G(d) 40–44 level of theory were performed using the ORCA 4.0 code. For the reactant complexes used in training, random geometries along the optimization paths were chosen and displaced along their normal modes (using Gaussian09 freq = hpmodes ), yielding 3753 training instances for the reactant complexes. Energies and gradients were calculated using the engrad keyword from the ORCA 4.0 package.

For the transition state training data, reactant and product complexes from the QMrxn20 12 dataset were optimized using Openbabel’s UFF force field (truncated after 200 steps). Then, a transition state search for the 200 training instances was performed using the Gaussian09 QST2 keyword (Berny algorithm 45 ) and the loose keyword. On the geometries along the transition state search path, normal mode calculations using the freq = hpmodes from the Gaussian09 code were performed, yielding 3812 training instances for the transition states. Energies and gradients were obtained using the ORCA 4.0 engrad keyword.

For the 500 and 300 out of sample constitutional isomers and reactant complexes, the LBFGS 46 algorithm as implemented in Atomic Simulation Environment (ASE) 47 with the ORCA 4.0 calculator was used on the B3LYP/6-31G(2df) and MP2/6-311G(d) level of theory, respectively, to receive the reference data. For the OQML models, a machine learning calculator yielding forces and energies when given a geometry was implemented in ASE. Example scripts can be found in Ref. 48 .

In addition, for these 300 out of sample reactant complexes, DFT geometry optimizations were performed for comparison. The three functionals used were B3LYP, 49,50 PBE0, 51 and ω B97X 52 with the 6-311G(d) 53–55 basis set.

For every converged geometry (QM and ML), numerical normal modes (as implemented in ASE) were calculated.

To find transition states for the 300 out of sample compounds, the Gaussian09 QST2 56 algorithm with loose convergence criteria was used, which also allows for external energies and forces, in this case provided by OQML models. Note that no explicit Hessian is required by this method, nor is one available from our model. The QST2 algorithm uses reactant and product geometries for an initial transition state guess, from which the Berny algorithm 45 is used to converge to a saddle point.

For every normally terminated run (QM and ML), numerical frequencies as implemented in ASE were calculated for comparison and further validation of the transition states. Example scripts can be found in Ref. 48 .

The validation criterion for the constitutional isomers was the convergence after 50 LBFGS steps with the thresh hold of fmax = 0.05 eV/Å as implemented in ASE. For the reactant complexes, the same criteria as for the constitutional isomers were used. In addition to the convergence criteria for the LBFGS optimization, the fragments for the reactant complexes were analyzed. For every reactant complex, there should be only two fragments, the main molecule and the nucleophile Y − containing only one atom. The transition state validation contains multiple tests (derived from 12 ):

normal termination of the Gaussian09 code,

1 imaginary frequency <100 cm −1 ,

Y–C–X (nucleophile–reaction center–leaving group) angle >155°,

minimal distance of 0.9 Å between atoms,

correct movement of the reaction center for the first normal mode, and

movement of remaining normal modes <0.5 Å.

Displaced geometries for 5 and 6 were obtained using the vibration package from the ASE code to calculate numerical frequencies. Only if a compound passes all tests, it is considered in the subsequent analysis of RMSD’s and frequencies. A more detailed description of the validation can be found in the SI.

In the context of statistical learning theory, cross-validated learning curves amount to numerical proof of the robustness and applicability of a machine learning model, and they provide quantitative measures of the data-efficiency obtained. For the three OQML models studied herein (geometries of constitutional isomers, of reactant complexes, and of transition states), Fig. 2(a) displays the OQML-based learning curves for energies (top) and atomic forces (bottom), which indicate the systematic improvement of energy and force predictions as training set size increases. For the reactant complexes and the transition states, DFT reference calculations were performed for comparison. Furthermore, we calculated frequencies (QM and ML) for all converged geometries for further analysis and, in the case of the transition states, also for validation.

(a) Learning curves for energies (top) and atomic forces (bottom), (b) RMSD performance curves of geometry vs training set size N , and (c) success rates of the optimization and transition state searches. Colors correspond to results based on three distinct datasets: constitutional isomers (QM9), reactant complexes (QMrxn20), and transition states (QMrxn20). The dashed green and dotted orange horizontal lines correspond to the success rates of the reference calculations for the TS searches and the geometry optimization, respectively.

The learning curve for the constitutional isomers is in line with the results by Christensen and von Lilienfeld 30 Surprisingly, although FCHL19 was optimized for small organic closed shell molecules, the learning curves for the reactant complexes and the transition states have a faster learning rate. A possible reason for this trend could be that the reactions in the QMrxn20 dataset share a common scaffold with only the substituents changing, which represents a lower effective dimensionality of the problem, which typically leads to faster learning. Moreover, relaxations for only 200 reactions were considered in the training set, which implies an overall smaller subset of the chemical universe. By contrast, for the constitutional isomers, geometries from 5,500 different compounds were chosen, covering a much broader chemical space.

While accurate OQML-based estimates of forces and energies are necessary for subsequent relaxation and transition state search, the eventual key figure of merit, the RMSD with respect to query reference coordinates for increasing training set size, amounts to a performance curve, as shown in panel (b) of Fig. 2 . We observe strong systematic improvements with increasing training set size of the RMSD for the constitutional isomers and reactant complexes. By contrast, RMSD performance curves for transition states, while also monotonically increasing with training set size, exhibit substantially smaller learning rates. Differences in learning for different datasets while using the same representations and model architectures imply that the target function is more complicated. One can argue that the constitutional isomers are less pathological since they consist of small organic and closed shell molecules, whereas the transition states include charged compounds and non-covalent binding to leaving and attacking groups. The relatively flat progress made for the transition states might also be simply due to the more complex optimization problems toward a saddle point compared to the simple downhill search of a geometry optimization. More specifically, due to the underpinning high dimensionality, the training set grows much more rapidly when adding a new reactive system, including optimization steps along the way to the saddle point. This implies that the training will be less efficient. Possible ways to mitigate such a bottleneck could include the use of the Amons approach, 57 which decomposes molecules into sub-structures, drastically reducing the effective dimensionality of the problem. Moreover Δ-ML, 17 multi-level grid combination techniques, 58 or transfer learning 59,60 could lead to significant speed-ups and would render the models more transferable.

Regarding the performance curve for the transition states, it is encouraging to note that the slope is substantially steeper than for the equilibrium geometry, also indicating that the OQML based energies and gradients also work well for locating saddle-points, which is unprecedented in the literature, to the best of our knowledge. A direct one-to-one comparison to the equilibrium geometry relaxations, however, is not possible as the differences might also be due to the use of two very different optimizers (LBFGS vs QST2).

Performance curves for success rates have also been included in Fig. 2 panel (c). We note that for all models and datasets the success rate of the optimization runs systematically increases with training set size. We find that even for OQML models trained on small training set sizes resulting in relatively high RMSDs (∼0.4 Å), the success rate increases from 7% to 28% and from 20% to 65% for the reactant complexes and the transition states, respectively, as shown in Fig. 2 , closing to the MP2 success rates (horizontal lines). Surprisingly, even though the RMSD performance curve for the constitutional isomer set is the best, the success performance curve is the worst. This could be due to the higher dimensionality in the QM9-based datasets, where the optimizer has to locate the minimum for substantially larger systems encoding more degrees of freedom. The systematic increase in success rate, however, represents strong evidence in favor of the proposed model, as one can always improve it through the mere addition of training instances, apparently resulting in increasingly smooth potential energy surfaces with fewer and fewer artifacts—an important prerequisite for successful optimization runs using algorithms such as LBFGS. 61

For further analysis of reactant complexes and transition states, we used 300 out-of-sample compounds. Table I shows a summary of the predictions of the reactant complex optimization of the ground states (GS) and transition state (TS) searches, as well as the comparison to the three DFT methods (B3LYP, PBE0, and ω B97X) with the 6-311G(d) basis set. The RMSDs for the geometries are around 0.1 Å for the DFT methods and 0.4 Å for the OQML method considering the transition states. The performance of the ML model reaches the same accuracy for the reactant complexes as the DFT geometry optimization, resulting in RMSD’s on the order of 0.05 to 0.14 Å. Using the same model, we calculated numerical frequencies and reached a mean absolute error over the 300 test transition states of 33.63 and 14.09 cm −1 for transition states and reactant complexes, respectively, which is comparable to the DFT errors.

For the activation energy E a , the ML model reaches slightly a higher MAE of 5.851 kcal/mol compared to the MAE of ∼1 kcal/mol of the DFT methods. Although the error of ∼6 kcal/mol is still high, other ML models could be used to learn the activation energies, e.g., the R2B model, 35 which was applied to this dataset and solely uses the molecular graph as input for the ML model.

We note that a direct comparison of the OQML and DFT results in Table I would not be fair as OQML was fitted on data similar to the query compounds, while DFT methods and basis sets are universal in nature and were fitted against much more diverse chemistries.

Summary of results for 300 out of sample test cases for relaxation of reactants (left) and transition state searches (right) for OQML models and three DFT methods for comparison with MP2/6-311G(d) as reference. The table shows the difference in geometry (RMSD), in frequency ( ν ), and in activation energy ( E a ) for each method. N corresponds to the training set size of the ML models (MP2/6-311G(d)) and to the dataset size for parametrization of the DFT functionals. Geometry optimizations using the LBFGS algorithm from the ASE package were truncated after 50 iterations and the default threshold ( fmax = 0.05 eV/Å) was used. The limit for the transition state search was the default iteration limit of 100 steps. Frequency values in parentheses for the transition states are the errors of the first (imaginary frequency).

Finally, we showcase the OQML predicted results for the transition state of one randomly drawn exemplary reaction, involving [H(CN)C–C(CH 3 )(NH2)] with Cl and F as leaving group and nucleophile, respectively. In Fig. 3 , the calculated transition state normal modes are shown, with energies once predicted by OQML and once obtained from MP2 for comparison. Even though the RMSD of the predicted geometries is off by 0.4 Å, the curvature is described reasonably well by the OQML model, which is supported by the relatively small errors in frequencies as well as by the high success rate of the transition state search.

Example normal mode scan showing energy changes as a function of distortion along TS modes for the transition state of the S N 2 reaction of [H(CN)C–C(CH 3 ) (NH2)] with Cl and F as leaving group and nucleophile, respectively. Geometry of an MP2/6-311G(d) converged and validated TS was used and distorted along its normal modes. Subsequently, single point calculation (MP2) as well as ML predictions were plotted for the first 23 normal modes and their displacements. The x-axis describes the index of the distorted geometry and the y-axis describes the energy relative to the MP2 equilibrium geometry. Both energies, MP2 (blue) and ML (orange), are scaled by the equilibrium geometry (index 10).

Our findings indicate that OQML can be trained across chemical compound space in order to optimize geometries and search for saddle points (transition states) for new out-of-sample query compounds. As such, we have demonstrated the applicability of OQML to serve as a surrogate model of conventional quantum-based energies and forces and can be employed within legacy optimizers. Most notably, prediction errors of OQML in terms of RMSDs, frequencies, and activation energies systematically improve as training set sizes increase. Similarly, the success rates of convergence of optimizers also improve as training set sizes grow.

The reported learning curves exhibit linear decay as a function of training set size on log–log scales, indicating the completeness of the model and that even further improvements in predictive power can be achieved through the mere addition of further training data. Corresponding performance curves of RMSDs suggest that the optimization process (RMSD as well as success rate) based on these models could also be further improved by increasing the training set size. Especially for the constitutional isomers (small, organic, and closed shell molecules), the description of the potential energy surfaces improves steadily by adding more training data. For example, the success rate improves from 2% to 52% for the smallest and largest training set size, respectively. Small deviations in predicted vibrational frequencies from reference MP2 frequencies further corroborate this point.

In the future, the exploration of out-of-equilibrium geometries farther away from a local minima in the QM7x dataset 63 could be investigated. To make OQML more transferable and also applicable to larger reactants, an Amon based extension 57 could also be implemented. OQML could also be helpful for the generation of larger and more consistent datasets in quantum chemistry, especially for the study of reactions.

We thank A. S. Christensen for the discussions. This project received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (Grant Agreement No. 772834). This result only reflects the author’s view, and the EU is not responsible for any use that may be made of the information it contains. This research was also supported by the NCCR MARVEL, a National Centre of Competence in Research, funded by the Swiss National Science Foundation (Grant No. 182892).

The authors have no conflicts to disclose.

Stefan Heinen : Data curation (lead); Methodology (equal); Software (lead); Validation (lead); Visualization (lead); Writing – original draft (lead); Writing – review & editing (equal). Guido Falk von Rudorff : Conceptualization (equal); Formal analysis (equal); Supervision (equal). O. Anatole von Lilienfeld : Conceptualization (equal); Funding acquisition (lead); Supervision (lead).

The data that support the findings of this study are openly available in “Supplementary Information of Geometry Relaxation and Transition State Search with Quantum Machine Learning” at http://doi.org/10.5281/zenodo.6823150 , reference number 48 .

## Citing articles via

Submit your article.

## Sign up for alerts

- Online ISSN 1089-7690
- Print ISSN 0021-9606
- For Researchers
- For Librarians
- For Advertisers
- Our Publishing Partners
- Physics Today
- Conference Proceedings
- Special Topics

## pubs.aip.org

- Privacy Policy
- Terms of Use

## Connect with AIP Publishing

This feature is available to subscribers only.

Sign In or Create an Account

## IMAGES

## VIDEO

## COMMENTS

Only very recently, attempts have been made to use machine learning models to speed up transition state predictions. In 2019, Bligaard and co-workers used the nudged elastic band (NEB) 15,16 method to find transition states, relying on the neural network based Δ-ML model 17 together with a low level of theory as a baseline. 18 More recently, Mortensen et al. contributed the “atomistic ...