On Data-Driven Equation Discovery | by George Miloshevich


Describing nature with analytical expressions verified through experiment has been a hallmark of the success of science, especially in physics, from the fundamental law of gravitation to quantum mechanics and beyond. As challenges such as climate change, fusion, and computational biology pivot our focus toward more compute, there is a growing need for concise yet robust reduced models that maintain physical consistency at a lower cost. Scientific machine learning is an emerging field that promises to provide such solutions. This article is a short review of recent data-driven equation discovery methods, aimed at scientists and engineers familiar with the very basics of machine learning or statistics.

Simply fitting the data well has proven to be a short-sighted endeavour, as demonstrated by Ptolemy's geocentric model, which was the most observationally accurate model until Kepler's heliocentric one. Thus, combining observations with fundamental physical principles plays a big role in science. However, in physics we often forget the extent to which our models of the world are already data-driven. Take the example of the Standard Model of particle physics, with 19 parameters whose numerical values are established by experiment. Earth system models used for meteorology and climate, while running on a physically consistent core based on fluid dynamics, also require careful calibration of many of their sensitive parameters against observations. Finally, reduced-order modelling is gaining traction in the fusion and space weather communities and will likely remain relevant in the future. In fields such as biology and the social sciences, where first-principles approaches are less effective, statistical system identification already plays a significant role.

There are various methods in machine learning that allow predicting the evolution of a system directly from data. More recently, deep neural networks have achieved significant advances in weather forecasting, as demonstrated by Google DeepMind's team and others. This is partly due to the enormous resources available to them, as well as the general availability of meteorological data and of physical numerical weather prediction models, which have interpolated this data over the whole globe thanks to data assimilation. However, if the conditions under which the data were generated change (as with climate change), there is a risk that such fully data-driven models would generalise poorly. This means that applying such black-box approaches to climate modelling, and to other situations where data are lacking, could be suspect. Thus, in this article I will emphasise methods which extract equations from data, since equations are more interpretable and suffer less from overfitting. In machine learning speak, we can refer to such paradigms as high bias, low variance.

The first method which deserves a mention is the seminal work by Schmidt and Lipson, which used Genetic Programming (GP) for symbolic regression and extracted equations from trajectory data of simple dynamical systems such as the double pendulum. The procedure consists of generating candidate symbolic functions, deriving the partial derivatives involved in these expressions, and comparing them with derivatives estimated numerically from the data. The procedure is repeated until sufficient accuracy is reached. Importantly, as there is a very large number of candidate expressions which are relatively accurate, one chooses the ones which satisfy the principle of "parsimony". Parsimony is measured as the inverse of the number of terms in the expression, whereas predictive accuracy is measured as the error on withheld experimental data used only for validation. This principle of parsimonious modelling forms the bedrock of equation discovery.
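To make the trade-off between parsimony and accuracy concrete, here is a minimal sketch in Python. It is not the Schmidt and Lipson algorithm: the genetic search is replaced by a fixed, hand-written list of three candidate expressions, and the synthetic data, the noise level, and the selection rule (keep the most parsimonious candidate whose validation error is within a factor of 1.5 of the best) are all assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Synthetic 'experiment' (assumed for illustration): y = 2x + 0.5x^3 + noise."""
    x = rng.uniform(-2.0, 2.0, n)
    y = 2.0 * x + 0.5 * x**3 + 0.05 * rng.normal(size=n)
    return x, y

x_train, y_train = make_data(100)
x_val, y_val = make_data(100)  # withheld data, used only for validation

# Hand-written candidates standing in for the expressions a GP search would propose.
candidates = {
    "a*x": [lambda x: x],
    "a*x + b*x^3": [lambda x: x, lambda x: x**3],
    "degree-5 polynomial": [lambda x, p=p: x**p for p in range(6)],
}

def design(terms, x):
    """Stack the candidate terms evaluated on x into a design matrix."""
    return np.column_stack([term(x) for term in terms])

results = []
for name, terms in candidates.items():
    # Fit the coefficients on the training data by ordinary least squares.
    coef, *_ = np.linalg.lstsq(design(terms, x_train), y_train, rcond=None)
    # Accuracy: mean squared error on the withheld validation data.
    val_error = np.mean((design(terms, x_val) @ coef - y_val) ** 2)
    # Parsimony: inverse of the number of terms, as described in the text.
    parsimony = 1.0 / len(terms)
    results.append((name, parsimony, val_error))

# Hypothetical selection rule: among candidates that are nearly as accurate
# as the best one, prefer the most parsimonious.
best_error = min(err for _, _, err in results)
accurate = [r for r in results if r[2] <= 1.5 * best_error]
selected = max(accurate, key=lambda r: r[1])

for name, parsimony, val_error in results:
    print(f"{name:22s} parsimony={parsimony:.2f} validation MSE={val_error:.4f}")
print("selected:", selected[0])
```

The single-term model is the most parsimonious but misses the cubic term, while the degree-5 polynomial fits about as well as the two-term model at three times the number of terms. A rule of this kind therefore favours the two-term expression.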

The idea of Genetic Programming (GP) consists of exploring the space of possible analytical expressions by trying out a family of potential terms. Each expression is encoded as a tree, as in the accompanying figure, whose structure can be represented as a sort of "gene". New trees are obtained by mutating sequences of these genes and by selecting and crossing over the best candidates. For instance, to obtain the equation shown in the box in the figure, just follow the arrows through the hierarchy of the tree.
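The tree encoding, mutation, and crossover described above can be sketched in a few dozen lines of Python. This is a toy illustration under stated assumptions, not the implementation used in any of the cited works: the nested-tuple encoding, the three operators, the size penalty of 0.01 per node, the cap on tree size, and all function names are hypothetical choices.

```python
import random
import numpy as np

# A tree is a leaf ("x" or a float constant) or a tuple (operator, left, right).
OPS = {"+": np.add, "-": np.subtract, "*": np.multiply}

def evaluate(tree, x):
    """Recursively evaluate an expression tree on the array x."""
    if isinstance(tree, tuple):
        op, left, right = tree
        return OPS[op](evaluate(left, x), evaluate(right, x))
    if isinstance(tree, str):  # the variable "x"
        return x
    return np.full_like(x, tree, dtype=float)  # a constant leaf

def size(tree):
    """Number of nodes, used here as a simple measure of complexity."""
    if isinstance(tree, tuple):
        return 1 + size(tree[1]) + size(tree[2])
    return 1

def random_tree(rng, depth=2):
    """Grow a random tree of at most the given depth."""
    if depth == 0 or rng.random() < 0.3:
        return "x" if rng.random() < 0.5 else round(rng.uniform(-2.0, 2.0), 1)
    op = rng.choice(sorted(OPS))
    return (op, random_tree(rng, depth - 1), random_tree(rng, depth - 1))

def random_subtree(tree, rng):
    """Walk down random branches and return the subtree reached."""
    while isinstance(tree, tuple) and rng.random() < 0.7:
        tree = tree[rng.choice([1, 2])]
    return tree

def replace_random(tree, new, rng):
    """Copy of tree with one randomly chosen subtree replaced by new."""
    if not isinstance(tree, tuple) or rng.random() < 0.3:
        return new
    op, left, right = tree
    if rng.random() < 0.5:
        return (op, replace_random(left, new, rng), right)
    return (op, left, replace_random(right, new, rng))

def mutate(tree, rng):
    """Mutation: swap a random subtree for a freshly grown one."""
    return replace_random(tree, random_tree(rng), rng)

def crossover(a, b, rng):
    """Crossover: graft a random subtree of b into a."""
    return replace_random(a, random_subtree(b, rng), rng)

def fitness(tree, x, y, penalty=0.01):
    """Mean squared error plus a penalty on tree size (lower is better)."""
    with np.errstate(all="ignore"):
        mse = float(np.mean((evaluate(tree, x) - y) ** 2))
    return mse + penalty * size(tree) if np.isfinite(mse) else float("inf")

def evolve(x, y, generations=40, pop_size=60, max_size=50, seed=0):
    rng = random.Random(seed)
    population = [random_tree(rng) for _ in range(pop_size)]
    history = []  # best fitness in each generation
    for _ in range(generations):
        population.sort(key=lambda t: fitness(t, x, y))
        history.append(fitness(population[0], x, y))
        parents = population[: pop_size // 2]  # selection: keep the better half
        children = []
        while len(parents) + len(children) < pop_size:
            child = mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            if size(child) <= max_size:  # keep trees from growing without bound
                children.append(child)
        population = parents + children
    population.sort(key=lambda t: fitness(t, x, y))
    return population[0], history

# Toy target (assumed for illustration): y = x^2 + x.
x = np.linspace(-1.0, 1.0, 50)
y = x**2 + x
best, history = evolve(x, y)
print("best tree:", best)
print("fitness:", fitness(best, x, y))
```

Because the better half of each generation is carried over unchanged, the best fitness never gets worse from one generation to the next. Whether the search recovers the exact target expression depends on the random seed and the settings, which is a small-scale illustration of the trial-and-error nature of the approach.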

This method has the advantage of exploring various possible combinations of analytical expressions. It has been tried on various systems; in particular, I will highlight AI Feynman, which with the help of GP and neural networks identified 100 equations from the Feynman Lectures on Physics from data. Another interesting application of GP is discovering ocean parameterisations in climate models, where a higher-fidelity model is run to provide the training data, and a correction for a cheaper, lower-fidelity model is discovered from that data. That said, GP is not without its faults, and a human in the loop was indispensable to ensure that the parameterisations worked well. In addition, it can be very inefficient because it follows the recipe of evolution: trial and error. Are there other possibilities? This brings us to the method which has dominated the field of equation discovery in recent years.

Sparse Identification of Nonlinear Dynamics (SINDy) belongs to the family of conceptually simple yet powerful methods. It was introduced by the group of Steven L. Brunton, alongside other groups, and comes with a well-documented, well-supported repository and YouTube tutorials. To get some practical hands-on experience, just try out their Jupyter notebooks.