Less is more: sampling chemical space with active learning

by Justin S. Smith, et al.

The development of accurate and transferable machine learning (ML) potentials for predicting molecular energetics is a challenging task. The process of generating data to train such ML potentials is neither well understood nor researched in detail. In this work, we present a fully automated approach for generating datasets with the intent of training universal ML potentials. It is based on the concept of active learning (AL) via Query by Committee (QBC), which uses the disagreement between the members of an ensemble of ML potentials to infer the reliability of the ensemble's prediction. QBC allows our AL algorithm to automatically sample regions of chemical space where the machine-learned potential fails to accurately predict the potential energy. AL improves the overall fitness of ANAKIN-ME (ANI) deep learning potentials in rigorous test cases by mitigating human biases in deciding what new training data to use. AL also reduces the training set size to a fraction of the data required when using naive random sampling techniques. To validate our AL approach, we develop the COMP6 benchmark (publicly available on GitHub), which contains a diverse set of organic molecules. We show that our proposed AL technique yields a universal ANI potential (ANI-1x), which provides very accurate energy and force predictions on the entire COMP6 benchmark. This universal potential achieves a level of accuracy on par with the best ML potentials for single molecules or materials while remaining applicable to the general class of organic molecules composed of the elements CHNO.
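The Query by Committee idea described above can be illustrated with a minimal sketch. It assumes the committee's energy predictions are already available as an array, uses the plain standard deviation across models as the disagreement measure, and picks an arbitrary threshold; the function names, the toy numbers, and the threshold are all hypothetical and are not taken from the paper, whose actual selection criterion may differ.

```python
import numpy as np


def committee_disagreement(energies):
    """Per-structure disagreement of a committee of models.

    energies: array-like of shape (n_models, n_structures), one row per model.
    Returns the standard deviation across models for each structure.
    (Assumption: plain std is used here purely for illustration.)
    """
    energies = np.asarray(energies, dtype=float)
    return energies.std(axis=0)


def select_for_labeling(energies, threshold):
    """Indices of structures whose disagreement exceeds the threshold.

    These are the structures an active-learning loop would send for
    reference calculations and add to the training set.
    """
    return np.flatnonzero(committee_disagreement(energies) > threshold)


# Toy committee: three hypothetical models, four candidate structures.
predictions = np.array([
    [-1.00, -2.00, -3.00, -4.00],
    [-1.01, -2.50, -3.00, -4.02],
    [-0.99, -1.50, -3.00, -3.98],
])

# Hypothetical threshold, in the same (arbitrary) energy units as above.
selected = select_for_labeling(predictions, threshold=0.1)
print(selected)
```

In this toy data the models agree closely on every structure except the second one, so only that structure would be flagged for new reference data. In a full loop, the flagged structures would be labeled, added to the training set, and the committee retrained before sampling again.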


