The Machine Learning Validation Challenge: What it is and How to Overcome the ‘Black Box’ Criticism
Cases where ML-powered systems made incomprehensible choices or significant mistakes have gained prominence and media coverage, especially in the context of autonomous vehicles. In such safety-critical applications, where an error can cost human lives, the quality and reliability of the data-driven algorithms used, and the transparency of the ML-based decision-making process, are critical factors. And because it is tempting for those who are less informed to overgeneralise from these examples, the issue of trust is key to the successful deployment of ML for many applications.
Figure 1: Traffic light recognition – Identifying Regions of Interest in Video, Kenan Alkiek 2018 (image source: https://github.com/KenanA95/tl-detector)
Lack of interpretative framework: the correlation vs causality issue
Over the past few decades, a number of powerful new ML approaches have been proposed, with some successful results reported, but with little theoretical framework for interpreting the models and the results in their application domain. This issue is especially important where there is a risk of confusing correlation with causality, or forgetting that “to predict is not to explain” (René Thom). For instance, Google Flu Trends, using personal online Google searches for doctors, pharmacies, products, etc., outperformed the US Centers for Disease Control and Prevention in 2009 in tracking the yearly propagation of an influenza epidemic. However, within a few years, it was unable to interpret a change in the distribution of search queries, and it started to fail to accurately predict the spread and severity of flu outbreaks. In other words, although the Google Flu Trends model had a remarkable fit with the data in 2009, it was unable to capture a subsequent shift in behaviour, which made it underperform. This shows the limits of relying exclusively on “theory-free” data-driven answers and not grounding the results in domain knowledge; as Tim Harford wrote in the FT (2014): “If you have no idea what is behind a correlation, you have no idea what might cause that correlation to break down”.
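One way to notice this kind of shift before it silently degrades a model is to compare the distribution of recent inputs with the distribution seen at training time. The sketch below uses the Population Stability Index (PSI) on categorical inputs. It is only an illustration: the query categories, the smoothing constant and the 0.25 alert threshold (a common rule of thumb) are assumptions, not details of how Google Flu Trends worked.

```python
import math
from collections import Counter


def population_stability_index(reference, recent, smoothing=1e-6):
    """Return the PSI between two samples of categorical values.

    A value near 0 means the two samples have similar distributions;
    larger values mean the recent data has drifted from the reference.
    """
    if not reference or not recent:
        raise ValueError("both samples must be non-empty")
    ref_counts = Counter(reference)
    new_counts = Counter(recent)
    psi = 0.0
    for category in set(ref_counts) | set(new_counts):
        # Smoothing avoids log(0) when a category is absent from one sample.
        p = ref_counts[category] / len(reference) + smoothing
        q = new_counts[category] / len(recent) + smoothing
        psi += (q - p) * math.log(q / p)
    return psi


# Hypothetical search-query categories, for illustration only.
training_window = ["flu symptoms"] * 70 + ["pharmacy"] * 30
recent_window = ["flu symptoms"] * 30 + ["pharmacy"] * 40 + ["flu news"] * 30

ALERT_THRESHOLD = 0.25  # assumed rule of thumb, to be tuned per application
drift = population_stability_index(training_window, recent_window)
if drift > ALERT_THRESHOLD:
    print("Input distribution has shifted; review the model before trusting it.")
```

A check like this only flags that the inputs have changed; deciding whether the change matters still requires domain knowledge.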
The “Black Box” issue
Another problem is that the more sophisticated a system gets, the harder it is to understand. Increased sophistication is precisely what seems to characterise the direction of travel of ML. Some ML approaches have been compared to a “black box” by critics for their lack of transparency, notably the application of Artificial Neural Networks and Deep Learning approaches. Indeed, whereas the logic of the decision process of a rules-based system can be linearly followed by a human reviewer, the way the hidden layers of a neural network adjust their own parameters during their training phase to best optimise their forecasts does not give any insight into what constitutes important factors for them. In other words, neural networks focus on performance, not on meaning. There are applications for which this is more a strength than a weakness, for example when ML is asked to discover new patterns that escape identification by human analysts, but there are many cases where knowing the “why” is as important as, or more important than, predicting the “what”.
Figure 2: XKCD on Machine Learning
The dependency on data issue
More fundamentally, ML systems are totally dependent on data, be it the dataset they are trained on, the data used to validate them, or the fresh data they must constantly access to maintain their relevance and continue to ‘learn’. The consequence is that ML algorithms fundamentally reflect the flaws and limitations of the underlying data they receive as input. Many classification errors stem directly from either a lack of data or a lack of data representativeness. In many instances where ML algorithms gave biased or inappropriate results (for example Google’s photo app, Google News and Wikipedia accused of racist or sexist classifications, errors with the facial recognition of criminals), it was found that the ML algorithm in question was merely bringing out and amplifying the biases already present in the training data.
Figure 3: Mobile Rail LiDAR (SSI’s Mobile LiDAR system was utilised on a Hi-Rail vehicle to scan the railway and surrounding structures. This scanned information produced a detailed model.)
To properly train a predictive model, historical data must be not only correct (in the sense of no measurement or recording error) and properly labelled, but also devoid of such biases. Data scientists can spend up to 80% of their time “cleansing” the data before training the predictive model. Even with such efforts, the data cleansing phase neither detects nor corrects all the errors, and the impact of these errors on the predictive strength of the model is difficult to assess or anticipate.
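As a rough illustration of what a first-pass data check can look like, the sketch below counts unlabelled records and reports how labels are distributed within each group of a dataset. The field names (`label`, `group`) and the example records are hypothetical, and a real review would go much further than this.

```python
from collections import defaultdict


def audit_records(records, label_key="label", group_key="group"):
    """Count records with no label and compute label shares per group.

    Returns (missing, shares), where shares maps each group to a dict of
    {label: fraction of that group's labelled records}.
    """
    missing = 0
    counts = defaultdict(lambda: defaultdict(int))
    for row in records:
        label = row.get(label_key)
        if label is None:
            missing += 1
            continue
        counts[row.get(group_key)][label] += 1

    shares = {}
    for group, label_counts in counts.items():
        total = sum(label_counts.values())
        shares[group] = {lab: n / total for lab, n in label_counts.items()}
    return missing, shares


# Hypothetical inspection records, for illustration only.
example_records = [
    {"group": "route_a", "label": "fault"},
    {"group": "route_a", "label": "ok"},
    {"group": "route_b", "label": "ok"},
    {"group": "route_b", "label": "ok"},
    {"group": "route_b", "label": None},
]
missing_labels, label_shares = audit_records(example_records)
print(missing_labels, label_shares)
```

Large differences in label shares between groups do not prove bias, but they are a prompt to ask whether the data represents each group fairly.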
Yet the picture is far from dark: Network Rail’s Geo-RINM Viewer for example captures imagery and detail of all 20,000 miles of track and surrounding infrastructure, revealing the railway in a clarity never seen before. Based on the data gathered, Asset Information Services now manages and maintains the Viewer, liaising with engineers across the country to keep the data layers up to date. Remote failure detection of earthwork movement based on the data gathered and multiple monitoring stations connected to the data repository has been procured to help avoid derailments through early detection and intervention.
The validation journey and principles
The three key limitations mentioned in this article can be kept in check using an appropriate validation approach.
ML validation should strive to ensure fairness (identifying and mitigating bias), privacy (protecting sensitive information) and robustness (to small changes in input data). It should also account for domain-specific context and questioning, and be able to explain the causal relations between input and output.
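Robustness to small changes in input data can be probed empirically. The sketch below adds small random noise to each input and measures how often the prediction stays the same. The `toy_predict` function, the noise scale and the number of trials are all assumptions chosen for illustration; any callable that maps a list of numbers to a label could be substituted.

```python
import random


def prediction_stability(predict, inputs, noise_scale=0.01, trials=20, seed=0):
    """Return the fraction of perturbed predictions that match the original.

    Each input is a list of numbers; Gaussian noise with standard deviation
    noise_scale is added to every feature, `trials` times per input.
    """
    if not inputs or trials < 1:
        raise ValueError("need at least one input and one trial")
    rng = random.Random(seed)
    unchanged = 0
    total = 0
    for features in inputs:
        baseline = predict(features)
        for _ in range(trials):
            perturbed = [v + rng.gauss(0.0, noise_scale) for v in features]
            unchanged += predict(perturbed) == baseline
            total += 1
    return unchanged / total


def toy_predict(features):
    """Hypothetical stand-in for a trained classifier: a simple threshold."""
    return int(sum(features) > 1.0)


far_from_boundary = [[0.0, 0.0], [2.0, 2.0]]
on_boundary = [[0.5, 0.5]]
print(prediction_stability(toy_predict, far_from_boundary))
print(prediction_stability(toy_predict, on_boundary))
```

Inputs whose predictions flip under tiny perturbations sit close to a decision boundary and are natural candidates for closer review.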
Figure 4: Driving Scene segmentation, Lex Fridman (MIT) 2018
Some auditing principles
As we have seen, the problem with ML validation is less with the algorithms than with the data itself. The validation process must therefore centre on the data, its in-depth review and testing. Here are a few principles, identified in an excellent article by the Smith Institute.
- Data review: the data must be reviewed to ensure that it is free of implicit and explicit biases, has suitable coverage across expected inputs, and has sufficient coverage of edge cases.
- Validation testing: (i) the performance of the algorithm must be reviewed to ensure that the algorithm is fit for purpose. (ii) Overfitting (ie when a model is tailored to fit the quirks and random noise in a specific data sample rather than reflect the overall population) can be checked with a testing data set that is distinct from the training data set. (iii) Sensitivity analysis of the model (changing variable inputs using domain knowledge) should be carried out.
Figure 5: Wikimedia Commons, Overfitted data. The blue polynomial curve tries too hard to pass through every data point (overfit) and has little predictive power for the next points in the series, whereas the black regression line is a sufficient approximation of the trend present in the data.
- Stress test: a comprehensive suite of test cases should be carried out with the aim of ‘breaking’ the algorithm and probing the bounds of its use (“riskiest inputs”).
- Context integration: run time must be fast enough, and the algorithm should be hosted on a system that is accessible to users and automatically integrated into the organisation’s data flows.
- Future proofing: the algorithm should be retrained periodically so that it evolves over time and remains up to date.
- Independent review: an independent review should be carried out to increase trust and overcome industry blind spots.
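The overfitting check described under validation testing can be sketched as follows: fit two models on a training set, then compare their errors on a separate test set. The synthetic data, the straight-line fit and the deliberately overfitted "memoriser" (which simply returns the value of the nearest training point) are illustrative assumptions, not a recommended modelling approach.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of a straight line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_memoriser(xs, ys):
    """Overfitted model: returns the y of the nearest training x."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def mse(predict, xs, ys):
    """Mean squared error of a model on a dataset."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def make_data(n, rng):
    """Synthetic data: a linear trend plus random noise."""
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [2.0 * x + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys


def generalisation_report(seed=0, n=100):
    """Train both models, then report train error, test error and the gap."""
    rng = random.Random(seed)
    train_x, train_y = make_data(n, rng)
    test_x, test_y = make_data(n, rng)  # distinct from the training set
    report = {}
    for name, fit in (("line", fit_line), ("memoriser", fit_memoriser)):
        model = fit(train_x, train_y)
        train_err = mse(model, train_x, train_y)
        test_err = mse(model, test_x, test_y)
        report[name] = {"train": train_err, "test": test_err,
                        "gap": test_err - train_err}
    return report


for name, errors in generalisation_report().items():
    print(name, errors)
```

The memoriser fits the training data perfectly yet does worse on unseen data, which is the signature of overfitting: a large gap between training and test error is the warning sign, not the training error on its own.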
ML and its place within the knowledge chain
There is an abundance of data in rail. To maximise the value of this data, we need to combine data science with domain knowledge: industry expertise, operational understanding and familiarity with rail datasets, as well as skills in data processing, data cleansing and data testing, and the ability to understand the respective virtues and limits of different statistical approaches.
Figure 6: Handwriting recognition by a Neural Network: handwritten digits from the MNIST dataset (70,000 images) reduced to their basic B/W features through data cleansing and passed to a neural network, which assigns each of them to a digit between 0 and 9, MIT Deep Learning (2019)
Keeping a place for Cartesian doubt is also important, especially with systems that essentially deal with uncertainty and infer results rather than deduce them. More importantly, ML is still modelling: just as scientists are trained to always question the data, assumptions and hypotheses of their models, data scientists and ML users should together devise validation frameworks relevant to their application domain, set up checks and safeguards, and design mitigation features into their systems.
ML-based foresight (predicting what will happen) can take place with confidence, whether it is used for safety-critical systems or delay predictions, only when we also have hindsight (trusted reports of what happened) and insight (theory of why it happened) to learn from. We need to both predict and explain, and to do this rail needs to retain its domain knowledge experts and attract more data scientists.
Next week, Liz Davis - Professional Lead Data & Modelling R&D - and Giulia Lorenzini - Senior Partnerships and Grants Manager - will look into this very challenge: the skill transformation we need to make the most of ML.