It is always good to cover the basics when studying something new, as they form the foundation for everything that follows. In this post, I want to share some basics of machine learning terminology that I have learned.
Machine learning is a subfield of AI that draws on linear algebra, probability theory and statistics to build models that learn from the data at hand. Python has become synonymous with machine learning because of the many libraries available for building those models. NumPy, Pandas, SciPy, Scikit-learn and TensorFlow are a few of the most commonly used Python libraries. As I discussed in a previous post, managing all these dependencies is even easier with Anaconda and the conda package manager.
The main goal of machine learning is to make predictions based on what has been learned from historical data. Machine learning tasks can be divided into four categories:
- Unsupervised learning
Deals mostly with unlabelled data, where the goal is to find the underlying structure of the data and extract the information we need from it. Examples include fraud detection and clustering customers for marketing campaigns.
- Supervised learning
We have data with full descriptions and desired outputs. The goal here is to build a general model that works well with the inputs and maps them to the desired outputs. Examples include speech recognition and movie or shopping recommendations.
Supervised learning can be further broken down into regression and classification.
- Semi-supervised learning
In this case, not all samples are labelled; generally you would have a large amount of unlabelled data alongside the labelled data.
- Reinforcement learning
In this instance, the system adapts its behaviour dynamically based on a defined end goal, driven by a reward-and-feedback loop. Examples include AlphaGo, which beat the best Go players in the world, and self-driving cars.
Generally, building a machine learning model involves the following steps:
- Understanding the business
- Understanding the data
- Data pre-processing
- Training the model
- Evaluating the model
Steps 3 to 5 are usually done iteratively and fine-tuned as we go along. If there are issues in understanding the data, even more work is needed to go back and cleanse, understand and validate it. As with any software application, the process should end with a standard way of promoting the system to a production environment.
Training your model
The dataset you use to train your model is usually broken down into three parts:
- Training set
- Test set
- Validation set
You will start off training your model with the training set. At this point, in pursuit of the generalisation you want from your model, two issues can arise, known as overfitting and underfitting.
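To make the three-way split concrete, here is a minimal sketch using NumPy. The 60/20/20 ratios and the seed are arbitrary choices for illustration, not prescribed values:

```python
import numpy as np

def split_dataset(n_samples, train=0.6, val=0.2, seed=42):
    """Shuffle sample indices and split them into train/validation/test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_train = int(n_samples * train)
    n_val = int(n_samples * val)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train_idx, val_idx, test_idx = split_dataset(100)
print(len(train_idx), len(val_idx), len(test_idx))  # 60 20 20
```

Shuffling before splitting matters: if the data is ordered (say, by date or by class), a straight slice would give the model a biased view of each subset.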
Overfitting is where the model tries so hard to satisfy every sample in the training set that it is no longer generalised and is instead specific to the training set. Because the model learns too much from the training set, this results in what is called low bias. The model will have high variance when tested against any sample set other than the training set.
Underfitting, on the other hand, is the exact opposite. The model does not work well even with the training samples, which in turn means it will not perform well on any other samples. This is usually the result of using too small a set of samples for training. In contrast to overfitting, this results in high bias and low variance.
The error of a learning model can be decomposed into bias and variance, and we need a way of trading one off against the other, known as the bias-variance trade-off. To quantify it, we use the mean squared error (MSE), which measures the error of the estimation.
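MSE itself is just the average of the squared differences between predictions and true values. A tiny sketch:

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """MSE: average of squared differences between targets and predictions."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean((y_true - y_pred) ** 2)

print(mean_squared_error([3.0, 1.0, 2.0], [2.0, 1.0, 4.0]))  # (1 + 0 + 4) / 3
```

Squaring penalises large errors more heavily than small ones, which is exactly why MSE is sensitive to a model whose variance is high.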
Overfitting is more of an impostor than underfitting. With underfitting, you know your model is simply not working and you can work to make it better. With overfitting, you might end up complacent because the model works so well on the training set.
So how does one go about overcoming the issues related to overfitting?
Cross-validation is all about partitioning your dataset between training, testing and validation so that almost every sample passes through the model during training. Exhaustive and non-exhaustive are the two schemes used for carrying out cross-validation.
The exhaustive scheme involves leaving out a fixed number of samples for testing and training on the rest, across every possible split. Leave-One-Out Cross-Validation (LOOCV) is one such approach, where each sample takes a turn as the test set.
Note: The exhaustive scheme is not recommended when the dataset is large, as training the model over so many rounds is computationally too costly.
The non-exhaustive scheme takes a different approach, as the name implies. K-fold cross-validation is one of the mechanisms used here: the data is randomly split into k equal-sized folds, and on each iteration one fold is held out for testing while the model is trained on the remaining k-1 folds. A picture would help visualise this better:
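The fold rotation can also be sketched in a few lines of NumPy; this hand-rolled version is just for illustration (Scikit-learn ships a `KFold` helper that does the same job):

```python
import numpy as np

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each fold is the test set exactly once."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    folds = np.array_split(idx, k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, test_idx

for train_idx, test_idx in k_fold_indices(10, k=5):
    print(len(train_idx), len(test_idx))  # 8 2 on every iteration
```

After the k rounds, the model's score is usually reported as the average of the k test-fold scores, which gives a far more stable estimate than a single train/test split.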
According to the principle of Occam's razor, simpler methods are favoured over more complex ones. Overfitting is usually the result of a complex model, and what regularisation does is add a penalty term to the error function (more on this in later posts).
There are different ways of controlling the complexity of a model. Early stopping is one mechanism: training is halted early so that it produces a simpler model rather than a complex one that is susceptible to overfitting.
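A minimal sketch of the early-stopping idea, using plain gradient descent on synthetic linear data. The learning rate, patience value and data here are arbitrary assumptions made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# hold out a validation set to monitor generalisation during training
X_train, y_train = X[:150], y[:150]
X_val, y_val = X[150:], y[150:]

w = np.zeros(3)
best_w, best_val = w.copy(), np.inf
patience, stale = 5, 0

for epoch in range(1000):
    grad = 2 * X_train.T @ (X_train @ w - y_train) / len(y_train)
    w -= 0.05 * grad
    val_mse = np.mean((X_val @ w - y_val) ** 2)
    if val_mse < best_val:
        best_val, best_w, stale = val_mse, w.copy(), 0
    else:
        stale += 1
        if stale >= patience:   # validation error stopped improving: stop early
            break
```

The key point is that the stopping decision is driven by the validation error, not the training error; training error alone would keep decreasing and never trigger a stop.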
When you go through your initial dataset, you will find that not all of the data is relevant to the problem you are trying to solve. In fact, including irrelevant data when training your model just adds more randomness to the process, which leads to overfitting. Feature selection is about understanding which of the features actually matter for the problem. In general, you can approach it in two ways: one is to start with all the features and remove them as needed on each iteration, and the other is to start with a bare minimum set of features and add to it iteratively as you progress.
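The second approach (start small and grow) can be sketched as a greedy forward selection, scoring each candidate subset by validation MSE. The synthetic data below, where only two of six features matter, is an assumption made up for the example:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6))
# only features 0 and 3 actually influence the target; the rest are noise
y = 2.0 * X[:, 0] - 1.0 * X[:, 3] + rng.normal(scale=0.1, size=300)

def val_mse(features):
    """Least-squares fit on a feature subset, scored on a held-out half."""
    Xs = X[:, features]
    w, *_ = np.linalg.lstsq(Xs[:150], y[:150], rcond=None)
    return np.mean((Xs[150:] @ w - y[150:]) ** 2)

selected, remaining, best_score = [], list(range(6)), np.inf
while remaining:
    scores = {f: val_mse(selected + [f]) for f in remaining}
    f, score = min(scores.items(), key=lambda kv: kv[1])
    if score >= best_score:   # adding another feature no longer helps
        break
    selected.append(f)
    remaining.remove(f)
    best_score = score

print(sorted(selected))  # should recover the informative features
```

Backward elimination is the mirror image: start with all six features and greedily drop the one whose removal hurts the validation score least.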
Data is usually represented as a matrix, and with certain types of data such as text and images, the dimensions can be quite large. The problem with high dimensionality is that the data is not easy to visualise, and the added complexity leads to overfitting.
One common approach to reducing the complexity, and thereby the overfitting, is to transform the data into a lower-dimensional space (more about this in later posts).
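As a small taste of what is coming, here is one classic way to do that transformation, principal component analysis, sketched via NumPy's SVD (the data and the choice of two components are arbitrary for illustration):

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project data onto its top principal components via SVD."""
    Xc = X - X.mean(axis=0)              # centre each feature first
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T      # coordinates in the reduced space

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
X_low = pca_reduce(X, 2)
print(X_low.shape)  # (100, 2)
```

A 20-dimensional dataset collapsed to 2 dimensions like this can be plotted directly, which addresses the visualisation problem mentioned above.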
That is about it for this post. I will share more in coming posts on some of the aspects I could not cover in detail here. Thanks again for reading and, as always, comments are welcome.