The Red Pill or the Blue Pill? Machine Learning vs Statistical Modelling

Over the years I’ve helped a number of organisations, large and small, public and private, build up their Data Science capabilities and derive value from data using various analytical techniques.

However, one key concern I’ve had a number of times is the confusion that can exist between means and ends, i.e. solutions searching for problems as opposed to genuine problems that need solving.

This is often a result of an inherent lack of understanding of the use of analytics, and is sometimes unfortunately related to organisations simply jumping on the Data Science bandwagon.

True value arises from understanding the business problem, and then choosing the most appropriate approach to find a solution, with emphasis placed on the quality of the results.

As I’ve written before, there is no magic bullet, but I often see Data Scientists dive straight into the latest and greatest algorithm to quickly try to solve a problem.

Not only should they begin by implementing the simplest method after developing a deep understanding of the business domain, but more importantly, they should first consider whether or not to use Machine Learning at all! And if they do, it’s important to understand the limitations of the algorithms and what lies within the ‘black-box’.

Deus ex Machina

Statistical Models are sometimes a better option than fancy Machine Learning algorithms (such as Random Forests, Support Vector Machines and Artificial Neural Networks), especially in certain fields and applications, such as medicine. The most commonly used Statistical Models are regression models, including ordinary, Bayesian and penalized regression.

Machine Learning can be great, and fun to use, but it is primarily used to make predictions. In order to obtain uncertainty estimates along with predictions, which are sometimes desirable, if not necessary, one often needs to resort to Statistical Modelling.
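To make the point about uncertainty concrete, here is a minimal sketch of a simple linear regression that returns an interval alongside each prediction. It is only an illustration under stated assumptions: the data are simulated, the function names fit_simple_ols and predict_with_interval are my own, and the interval uses a normal approximation (a t quantile would be more appropriate for small samples).

```python
import math
import random
from statistics import NormalDist, mean


def fit_simple_ols(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual standard error, with n - 2 degrees of freedom.
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    sigma = math.sqrt(rss / (n - 2))
    return {"n": n, "x_bar": x_bar, "sxx": sxx,
            "slope": slope, "intercept": intercept, "sigma": sigma}


def predict_with_interval(fit, x_new, level=0.95):
    """Return (prediction, lower, upper) for a single new observation.

    Assumption: uses a normal quantile rather than a t quantile, which is
    a reasonable approximation only when the sample is not small.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)
    y_hat = fit["intercept"] + fit["slope"] * x_new
    # Standard error of a new observation: noise plus estimation error.
    se_pred = fit["sigma"] * math.sqrt(
        1 + 1 / fit["n"] + (x_new - fit["x_bar"]) ** 2 / fit["sxx"]
    )
    return y_hat, y_hat - z * se_pred, y_hat + z * se_pred


# Simulated data, made up purely for illustration.
rng = random.Random(0)
x = [rng.uniform(0, 10) for _ in range(200)]
y = [2.0 + 0.5 * xi + rng.gauss(0, 1) for xi in x]

fit = fit_simple_ols(x, y)
y_hat, lower, upper = predict_with_interval(fit, x_new=5.0)
print(f"prediction {y_hat:.2f}, 95% interval [{lower:.2f}, {upper:.2f}]")
```

The interval comes directly from the model's assumed structure (a straight line plus noise of constant variance), which is exactly the kind of preconceived structure a Statistical Model imposes. It also widens as x_new moves away from the bulk of the data, something a bare point prediction does not reveal.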

Here I’m creating a clear distinction between the two, even though there is a high degree of overlap, in order to facilitate discussion, thought and understanding. In doing so, I’m following Frank Harrell’s delineation between the fundamental attributes of the two, as he outlines in his fantastic post.

Below is a summary of Frank’s guidelines on which approach to use. It is driven by the fundamental difference between employing an algorithmic approach versus a data model that incorporates probabilities and has a preconceived structure imposed on the relationship between predictors and outcomes:

Machine Learning may be preferable if:

  • The sample size is huge;
  • Interpretability of the model is not important;
  • Overall prediction is the goal, without the need to describe the impact of any single variable;
  • Non-additivity/complexity is expected to be strong; or
  • The signal:noise ratio is large and the outcome being predicted doesn’t have a large component of randomness.

However, Statistical Models may be the better choice if:

  • The signal:noise ratio is not large and uncertainty is inherent;
  • Perfect training data is unavailable;
  • Isolation of the effects of a small number of variables is required;
  • The sample size isn’t huge;
  • Model interpretability is important;
  • Estimation of the uncertainty in forecasts is sought; or
  • Additivity is the dominant way that predictors affect the outcome.

Data Science/Analytics should not exclusively mean Machine Learning – it is fundamentally the process of extracting meaning from data, by using the most appropriate method, in order to inform and improve decision making.

 
