Synthetic data enable experiments in atomistic machine learning

by John L. A. Gardner, et al.

Machine-learning models are increasingly used to predict properties of atoms in chemical systems. There have been major advances in developing descriptors and regression frameworks for this task, typically starting from (relatively) small sets of quantum-mechanical reference data. Larger datasets of this kind are becoming available, but remain expensive to generate. Here we demonstrate the use of a large dataset that we have "synthetically" labelled with per-atom energies from an existing ML potential model. The cheapness of this process, compared to the quantum-mechanical ground truth, allows us to generate millions of datapoints, in turn enabling rapid experimentation with atomistic ML models from the small- to the large-data regime. This approach allows us here to compare regression frameworks in depth, and to explore visualisation based on learned representations. We also show that learning synthetic data labels can be a useful pre-training task for subsequent fine-tuning on small datasets. In the future, we expect that our open-sourced dataset, and similar ones, will be useful in rapidly exploring deep-learning models in the limit of abundant chemical data.
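The workflow described above (cheap synthetic labels from an existing model, pre-training on them, then fine-tuning on scarce reference data) can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and not the authors' code: the one-dimensional `ground_truth` and `teacher` functions stand in for the quantum-mechanical reference and the existing ML potential, fixed random Fourier features stand in for an atomic descriptor, and ridge regression shrunk towards the pre-trained weights stands in for fine-tuning a deep model.

```python
import numpy as np

rng = np.random.default_rng(0)

def ground_truth(x):
    # Hypothetical stand-in for expensive quantum-mechanical reference labels.
    return np.sin(3 * x) + 0.5 * x

def teacher(x):
    # Hypothetical stand-in for an existing ML potential: cheap, slightly biased.
    return ground_truth(x) + 0.1 * np.cos(5 * x)

# Fixed random Fourier features play the role of a descriptor (an assumption).
W = rng.normal(scale=3.0, size=64)
B = rng.uniform(0.0, 2.0 * np.pi, size=64)

def features(x):
    return np.cos(np.outer(x, W) + B)

def ridge_fit(X, y, lam, w_prior=None):
    # Ridge regression that shrinks the weights towards w_prior (zero if None).
    if w_prior is None:
        w_prior = np.zeros(X.shape[1])
    A = X.T @ X + lam * np.eye(X.shape[1])
    return w_prior + np.linalg.solve(A, X.T @ (y - X @ w_prior))

def rmse(w, x, y):
    return float(np.sqrt(np.mean((features(x) @ w - y) ** 2)))

# 1) Label a large dataset cheaply with the teacher and pre-train on it.
x_syn = rng.uniform(-2.0, 2.0, size=20_000)
w_pre = ridge_fit(features(x_syn), teacher(x_syn), lam=1e-3)

# 2) Fine-tune on a small set of "expensive" labels, starting from w_pre.
x_small = rng.uniform(-2.0, 2.0, size=10)
y_small = ground_truth(x_small)
w_ft = ridge_fit(features(x_small), y_small, lam=1.0, w_prior=w_pre)

# 3) Baseline: the same model trained on the small set only.
w_scratch = ridge_fit(features(x_small), y_small, lam=1.0)

x_test = np.linspace(-2.0, 2.0, 200)
y_test = ground_truth(x_test)
rmse_ft = rmse(w_ft, x_test, y_test)
rmse_scratch = rmse(w_scratch, x_test, y_test)
print(f"fine-tuned RMSE:   {rmse_ft:.3f}")
print(f"from-scratch RMSE: {rmse_scratch:.3f}")
```

In this toy setting the pre-trained starting point would be expected to help when the reference set is small, but the actual numbers depend entirely on the stand-in functions and hyperparameters chosen here.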

