Data science is coming of age – here we take a look at its roots

As a consultant who has designed and built Big Data projects across a number of fields, seen data science in action across even more, and who meets regularly with data leaders in some of the world’s leading organisations, I have been asked more than once what the future holds for data science. In this first instalment of a two-part blog, I look into the past to try to predict the future.

What does the future of data science hold? The honest answer is… I don’t know. And I don’t think any futurist, visionary or data guru can say either. At least, not beyond a fairly low level of confidence. Data science is a very young and rapidly expanding field. It did not exist until around 10 years ago, when advances in computing power under Moore’s law, the growth of the internet, algorithmic developments and the sheer scale of cloud computing combined to make advanced analytics widely accessible on very large amounts of very granular data in commercial settings.

Extrapolating from this, I think the future of data science will be bound up with trends in technology, society and the economy over the next few decades. Advances in materials science, chemistry, energy, transport, biotech, medicine and other areas will certainly change life as we know it, and the possibilities in data science will change with them. It is possible to see the general direction of these trends, harder to judge exactly how far each will go, and harder still to understand how they will interrelate and affect each other. Moore’s law looks like it is breaking down, and it is unclear whether it will recover. Even if it does not, quantum computing is a wildcard with the potential to change everything. It is possible we won’t really talk of data science or Big Data in 20 years’ time, as the field fragments into sub-specialisms.

However, there is one area where I think I can predict with some certainty. This is partly because it is the area I am most familiar with, and partly because a number of initiatives, driven by some of the most innovative minds in the field, are already underway to make these potential advances happen. It is the area of algorithmic developments.

First, a little history…

Let me set the scene: by algorithmic developments, I mean advances in a few specific areas of machine learning, mainly ensemble learning and deep learning, made during the decade or so between the early 1990s and the early 2000s.

Much of the work was on modern ensemble methods: essentially simple but very powerful meta-learning frameworks that aggregate the predictions of many learning algorithms. AdaBoost, the forerunner of modern gradient boosting, was first published in 1997, and random forests and bootstrap aggregation (bagging) were developed between 1994 and 2001. Both built upon somewhat earlier work: Classification and Regression Trees (CART), published by Leo Breiman and colleagues in 1984. These tree-based techniques are some of the most robust, usable and scalable methods around, and are responsible for much of the growth in everyday machine learning across industry. That is not to say the field has been stagnant for 20 years: the extreme gradient boosting (XGBoost) algorithm, which has been a popular choice among winners of machine learning competitions, was published in 2014.
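
To make the contrast concrete, here is a minimal sketch, assuming scikit-learn is available, of how a bagging-style ensemble (a random forest) and a boosting-style ensemble (gradient boosting) can be fitted on one of the library’s bundled toy datasets; the dataset and hyperparameters are purely illustrative.

```python
# Minimal sketch: two tree-based ensembles on a toy dataset.
# Assumes scikit-learn is installed; dataset and hyperparameters are illustrative only.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Bagging-style ensemble: many decorrelated trees whose predictions are averaged.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Boosting-style ensemble: shallow trees added sequentially, each one
# correcting the errors of the ensemble built so far.
boosted = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                     random_state=0).fit(X_train, y_train)

for name, model in [("random forest", forest), ("gradient boosting", boosted)]:
    print(name, accuracy_score(y_test, model.predict(X_test)))
```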

Deep learning refers to the use of ‘deep’ (multi-layered) artificial neural networks to learn patterns. These networks are particularly good at learning non-linear decision boundaries, but the downsides are that they tend to be more complex to build and maintain, more expensive to train (usually requiring specialised hardware), and need more data to reach their full potential.
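
As a rough illustration of what ‘deep’ means in practice, the sketch below, assuming PyTorch is installed and using a toy two-dimensional problem, stacks a few layers separated by non-linearities to learn a decision boundary that no single linear layer could represent.

```python
# Rough illustration of a small multi-layer network.
# Assumes PyTorch is installed; data and architecture are purely illustrative.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy non-linear problem: label is 1 when the two inputs share a sign (XOR-like).
X = torch.randn(1000, 2)
y = ((X[:, 0] * X[:, 1]) > 0).float().unsqueeze(1)

# 'Deep' simply means several layers, each followed by a non-linearity.
model = nn.Sequential(
    nn.Linear(2, 16), nn.ReLU(),
    nn.Linear(16, 16), nn.ReLU(),
    nn.Linear(16, 1),
)

optimiser = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()

for epoch in range(200):
    optimiser.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()   # backpropagation: gradients flow back through every layer
    optimiser.step()

accuracy = ((model(X) > 0).float() == y).float().mean().item()
print(f"training accuracy: {accuracy:.2f}")
```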

The core of the deep learning technologies we routinely use today was developed between the mid-1980s and the early 2000s. The idea of backpropagation, which makes the training of a multi-layer network efficient by propagating errors backwards through its layers, was popularised in 1986. Long Short-Term Memory networks (LSTMs), a substantial improvement on traditional recurrent neural networks for learning from sequence data, were developed in the mid-1990s by Hochreiter and Schmidhuber. Convolutional Neural Networks (CNNs), now the go-to method for image recognition, were developed in the late 1980s and early 1990s. Again, this is not to say the field has been stagnant since: the generative adversarial network (GAN), introduced by Ian Goodfellow in 2014, was a genuinely pathbreaking innovation, as was the Transformer architecture in 2017. However, a lot of the developments of the past decade have focused on improving the usability and accessibility of neural networks rather than on fundamental change. Examples include the TensorFlow and PyTorch libraries, released in 2015 and 2016 respectively, the Gated Recurrent Unit (GRU), a simplified variant of the LSTM, in 2014, and AlexNet in 2012.
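
To give a flavour of how usable these architectures have become, the short sketch below, again assuming PyTorch and with purely illustrative sizes, runs a batch of toy sequences through an LSTM layer and a GRU layer; thanks to the usability work mentioned above, swapping one for the other is a one-line change.

```python
# Sketch only: LSTM and GRU layers as drop-in alternatives for sequence data.
# Assumes PyTorch; shapes and sizes are illustrative.
import torch
import torch.nn as nn

batch, seq_len, n_features, hidden = 4, 20, 8, 32
x = torch.randn(batch, seq_len, n_features)   # a batch of toy sequences

lstm = nn.LSTM(input_size=n_features, hidden_size=hidden, batch_first=True)
gru = nn.GRU(input_size=n_features, hidden_size=hidden, batch_first=True)

lstm_out, _ = lstm(x)   # one hidden state per timestep, carried forward through gates
gru_out, _ = gru(x)     # the GRU uses fewer gates, but the interface is identical

print(lstm_out.shape, gru_out.shape)  # both: torch.Size([4, 20, 32])
```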

Although ensemble learning and deep learning are the most prominent examples of this phenomenon, other techniques such as Support Vector Machines (developed in the mid-90s at Bell Labs) and DBSCAN (first published in 1996) followed the same pattern. Many of these were, of course, simply algorithmic optimisations of mathematical ideas with much longer histories. In computer science, however, the algorithmics are essential to usability, as they determine runtime.

Looking back – how data science has progressed over the years

The unreasonable effectiveness of these advances has been visible in our lives over the past 20 years, albeit sometimes subtly. Only a few years before Sebastian Thrun’s Stanford team won the DARPA Grand Challenge in 2005, many experts thought it unlikely that autonomous vehicles would emerge within decades. By 2015, Google’s self-driving car, built by a team led by Thrun, had passed 1 million autonomous miles.

On a more pedestrian note, if you used Apple’s Siri or Google Translate in the early 2010s, you will remember how much difficulty they had with even moderately complex words or phrases, especially common slang or vernacular, and word order (which differs between languages) was a perennial problem. The developments in recurrent neural networks described above enabled enormous improvements in such ML-driven services. I won’t labour the point by going into the multitude of advances in data science over the past decade at Netflix, Amazon, Uber, Facebook and many other companies, or the wonders of DeepMind’s AlphaGo. The important point is that these were all made possible by deploying technology that was actually developed 10 or 15 years earlier, and the public’s sense of the speed of advances in AI is warped by this lag.

So that is a (brief) history of machine learning and AI up to the present, and the not inconsiderable impact it is having on our lives. In part 2, I will explore where I think all this leads us in the near future.

To speak to our expert team about our data solutions, contact us.


About the author

Finn Wheatley, Director of Data Science

Finn has over a decade of experience working in lead data science and quantitative roles in both the public and private sectors. Following his undergraduate degree from King’s College London, Finn worked for several years in the hedge fund industry in risk management and portfolio management roles. Subsequent to an MSc in Computer Science from University College London, he joined the civil service and helped to establish the data science team at the Department for Work and Pensions (DWP), delivering innovative analytical projects for senior departmental leaders. Since joining Whitehat Analytics, he has been involved in establishing the data science team at EDF Energy.