This post is a collection of valuable lessons that I’ve learned over the last year doing machine learning experiments for fun, taking a course in Machine Learning, and, for the most part, working on my first commercial machine learning project at Touchtech. Some of them are really obvious, some were discovered through trial and error. Whether you’re starting a career as a machine learning engineer, doing it for fun, or just following the hype around artificial intelligence - here are some things you might find applicable to your work, or simply entertaining. So let’s begin.
The key difference between a development project and a machine learning project is that the latter inevitably requires a considerable amount of research, which you can’t really estimate beforehand. So as you can imagine, the worst approach for a machine learning project is a fixed price. Agile works slightly better, but at the early stages of the project it is also hard to break down that research work into sprints, and deliver something (other than just intermediate research results) to the client at the end of each sprint. From our experience, a T&M approach and regular workshops with the client work best. The client most likely knows their data better than you, so asking them the right questions sometimes saves you hours of digging deeper into the dataset. That’s why communication is as important for a machine learning project as it is for a development one.
From our experience, there are three important risks when dealing with machine learning. Well, there’s probably an infinite number of risks, but these three provide the foundation for that initial research phase.
Is it theoretically possible to train a usable model to solve the problem? Take, for example, earthquakes. We have tons of precise historical data over the past several decades, but the best we can do is say something like: ‘the probability of a magnitude 7 or greater earthquake, in this area, over the next year, is 15%’. Sure, it is better than nothing, but definitely not impressive given the amount and the quality of the data. But this is not because our models are poor, it’s the nature of the data itself: what happened in the past does not fully determine what will happen in the future. The answer to this question is, of course, individual for each project, but there are some good reads on the topic.
Is it practically possible to train a usable model on the existing data? Does the data actually represent the real world, or some artificial pattern created by the logic behind the data collection? It is important to work together with the client and investigate how the data you’ll be using for the model’s training was collected and processed.
Forget about Machine Learning for a minute (I know, it’s hard), take a step back, and imagine you already have a model and you’re happy with its characteristics. What are the use-cases of the model and its environment? Most likely, it will be integrated into the client’s infrastructure or running as an API service. If so, are all the features necessary for the model’s input available at the moment of an API call? It is possible that you have trained a pretty good model on the client’s historical data, which was post-processed and brought up to a decent quality standard - but the raw data the model has to work with in production is different, and simply doesn’t have half the features your model is using. Yes, another point about the importance of communication with the client.
The most obvious metric for a classification model is its accuracy, which is simply the number of correct decisions divided by all the decisions made by the model. But is it always applicable? Let’s look at the example of cancer diagnostics with Machine Learning. In this problem, the model tries to recognise whether the tumour on an MRI image is malignant or benign. Let’s assume, for the sake of the example, that 1 in 10,000 people actually has cancer. Let’s also assume that our model always makes a negative decision (the tumour is benign, the patient is healthy) without actually finding patterns in the data. The accuracy of such a model would be 99.99%, even though it’s obviously useless in practice.
You should define the metrics and procedures to measure the performance of a model before starting to work on the model. Otherwise you might be pursuing the wrong goals and simply wasting your time and your client’s.
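The accuracy trap from the cancer diagnostics example above can be sketched in a few lines of plain Python (the patient counts are the hypothetical numbers from the example, not real data):

```python
# A toy illustration of the accuracy trap on imbalanced data.
# Assume 10,000 patients, exactly 1 of whom actually has a malignant tumour.
y_true = [0] * 9999 + [1]   # 0 = benign, 1 = malignant
y_pred = [0] * 10000        # a "model" that always predicts benign

# Accuracy: correct decisions divided by all decisions.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the malignant class: of the patients who actually have
# cancer, how many did the model catch?
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(accuracy)  # 0.9999 - looks great on paper
print(recall)    # 0.0    - the model never catches a single cancer case
```

A class-aware metric such as recall (or precision, or F1) immediately exposes what accuracy hides here.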
Feature engineering is a big part of the job, and it’s also beyond the scope of today’s post. But I wanted to share one thought which might help you decide whether it’s necessary to spend a lot of time on it, or whether it’s possible to let your awesome, deep, feed-forward, fully-connected neural network do it for you. There are two types of classification problems: objective and subjective. In an objective problem, the outcome is known for a fact and can easily be verified. Image recognition and cancer diagnostics are objective. We have the labelled data, and the class of each example from the dataset is known.
Content approval, on the other hand, is subjective. The decision is made by a human, and it is possible that two humans will make different decisions on the same content example. In an objective problem, a machine learning model can outperform a human because it can potentially learn to extract features from data that humans can’t. This isn’t always the case for a subjective problem, because quite often we’re assuming that the human decision is always correct.
For subjective kinds of problems it is definitely important to do some research on how humans actually solve them and what features they use, and to remove those features that might add bias to the model rather than help it perform better. So guess what? You got it: research and communication with the client again.
As best practice suggests, split your dataset not into two parts (training and testing) but into three (training, validation, and testing). Use the validation data to measure the accuracy (and/or other metrics) of the model iteratively while adjusting the model’s and training parameters, experimenting with different neural net architectures, feature sets, etc. Finally, test the model on the testing data to actually verify its performance. Most likely it’ll perform slightly worse on the testing set than on the validation set, just as it performs worse on the validation set than on the training set. If you use only a training/testing split, you risk overfitting the model to the testing data, because what you’re essentially doing is trying to maximise its performance on it. The validation set is sometimes also called the development set. So do as the names suggest: train and develop on the training and development data, and test the final model on the testing data.
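The three-way split above is easy to sketch with the standard library alone (libraries like scikit-learn offer ready-made helpers for this, but the mechanics fit in a few lines; the 70/15/15 proportions are just an illustrative choice):

```python
import random

def train_val_test_split(data, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle a dataset and split it into training, validation, and test sets."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

train, val, test = train_val_test_split(list(range(1000)))
print(len(train), len(val), len(test))  # 700 150 150
```

The important part is discipline, not the helper itself: the test partition is set aside once and only touched to verify the final model.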
There are lots of neat techniques that might help you along your way to the perfect model for the problem you’re trying to solve: Anomaly Detection, Principal Component Analysis, Dimensionality Reduction, data visualisations, and good old statistics.
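To make one of those techniques concrete, here is a minimal PCA sketch using NumPy, via eigendecomposition of the covariance matrix (the toy 5-dimensional random dataset and the choice of 2 components are assumptions for illustration only):

```python
import numpy as np

# Toy dataset: 200 samples with 5 features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))

# Centre the data, then diagonalise its covariance matrix.
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: covariance is symmetric

# Sort components by explained variance (descending) and keep the top 2.
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order[:2]]

# Project the data onto the two principal components.
X_reduced = X_centered @ components
print(X_reduced.shape)  # (200, 2)
```

A projection like this is handy both for reducing the feature count and for plotting a high-dimensional dataset in two dimensions.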
Research, feature engineering, and training process adjustment can sometimes be more art/magic than exact science. So sometimes it is surprisingly helpful to simply play with the data without any clear purpose: visualising the distribution of a certain feature, or calculating means, medians, and correlations might not directly answer your questions, but it will at least give you more insight into the structure of your datasets and more ideas about which approaches are worth trying.
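That kind of aimless poking around needs nothing fancy - the standard library already covers the basics (the two feature columns below are made-up numbers purely for illustration):

```python
import statistics

# Two hypothetical feature columns from a dataset.
feature_a = [1.2, 2.4, 2.9, 3.1, 4.8, 5.0, 5.3, 6.7, 7.1, 9.9]
feature_b = [0.9, 2.1, 3.2, 2.8, 5.1, 4.7, 5.6, 6.3, 7.4, 9.1]

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

print(statistics.mean(feature_a))    # 4.84
print(statistics.median(feature_a))  # 4.9
print(pearson(feature_a, feature_b))
```

A strong correlation between two features, or a mean far from the median (a skewed distribution), is exactly the kind of cheap insight that suggests what to investigate next.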