Machine learning problems are like any other scientific problem: only by applying the correct method will you get the desired result. That means you must be able to define the problem you want to solve in the clearest and simplest terms possible.

Problem definition means stating why you intend to solve the issue and the best way to solve it. You might also want to consider whether building a model is the only way the problem can be solved. This is often referred to as the problem statement.

**Data collection**

In the process of defining your problem, you’ll have already figured out the kind of data that you’ll need to obtain the best result.

Data is at the center of every model you might want to build; I would say it is the backbone of any model. There is a common saying, "garbage in, garbage out": if you feed a ton of messy data into your model, all you get is inaccurate and unreliable results.

Data collection can be a time-consuming and pretty expensive process, so before you embark on it, make sure all the requirements and assumptions have been carefully considered.

A few of the factors you might want to consider are cost, time, the availability of human resources, the tools needed, and finally the method you will use to collect the data.

The reliability and validity of your source(s) of data are also important factors to consider alongside any legal constraints that may come into play.

**Preprocessing and Preparing Data**

If the data is already collected, the next step is to preprocess it to get it ready for use in your modeling. It is at this step that the data is subjected to methods such as:

- Error detection and correction
- Outlier detection
- Dropping (dumping) unneeded data
- Normalization
- Deduping, which is the process of removing duplicates
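A minimal sketch of a few of these cleaning steps using pandas (the column names and values below are made up purely for illustration):

```python
import pandas as pd

# Toy records standing in for collected data (values are made up).
df = pd.DataFrame({
    "age": [25, 30, 30, 29, 120],          # 120 is an obvious outlier
    "salary": [40000, 52000, 52000, 48000, 50000],
})

# Deduping: remove exact duplicate rows.
df = df.drop_duplicates()

# Outlier detection: drop ages outside 1.5 * IQR of the quartiles.
q1, q3 = df["age"].quantile(0.25), df["age"].quantile(0.75)
iqr = q3 - q1
df = df[df["age"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# Normalization: min-max scale every column into the 0-1 range.
df_norm = (df - df.min()) / (df.max() - df.min())
```

The IQR rule shown here is just one common choice for flagging outliers; on real data you would pick thresholds that match your domain.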

According to Wikipedia, data validation is the process of ensuring that data has the desired level of quality and correctness, which is needed to produce accurate models.

There are several validation methods, and choosing one depends on how you wish your data to be structured. A few of them are listed below:

- Consistency validation
- Structured validation
- Range validation
- Data type validation
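A rough sketch of what range, data type, and structured checks can look like with pandas (the field names, the 0-120 age range, and the email pattern are illustrative assumptions, not fixed rules):

```python
import pandas as pd

# Toy records to validate (hypothetical fields and values).
df = pd.DataFrame({
    "age": [34, -5, 200, 41],
    "email": ["a@example.com", "b@example.com", "not-an-email", "d@example.com"],
})

# Data type validation: the age column must be numeric.
assert pd.api.types.is_numeric_dtype(df["age"])

# Range validation: ages must fall within a plausible range.
valid_age = df["age"].between(0, 120)

# Structured validation: emails must match a minimal pattern.
valid_email = df["email"].str.contains(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

# Keep only the rows that pass every check.
clean = df[valid_age & valid_email]
```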

When preparing data for modeling, Python has very good libraries that make the process almost effortless; at the top of the list are pandas, NumPy, and Matplotlib. Pandas is particularly good at analysis and manipulation, NumPy provides numerous scientific tools for calculations, and Matplotlib enables you to produce 2D plots that can be essential in identifying outliers.

Once your data is ready, you need to split it into training and testing datasets. The most common ratio is 80% for training and 20% for testing. The training dataset, as the name suggests, is used to train the model on the different aspects of the data; the testing dataset is used to evaluate how well the model has learned.

You may also have thought about the validation dataset, which is usually held out from both training and testing. It is used to determine how well the model performs while you adjust hyperparameters. Scikit-learn is good at splitting data into these categories.
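Here is a minimal sketch of that split using scikit-learn (the synthetic data and the 60/20/20 split are illustrative; any ratios that suit your problem work the same way):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic feature matrix and labels: 100 samples, 4 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = rng.integers(0, 2, size=100)

# First carve off 20% as the held-out test set (the common 80/20 split).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Then carve a validation set out of the training portion:
# 25% of the remaining 80% gives a 60/20/20 train/validation/test split.
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42
)
```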

If your data is not enough to support all three of these sets, techniques such as k-fold cross-validation enable you to split the data into different folds that take turns serving for training and validation.
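A short sketch of 5-fold cross-validation with scikit-learn (the synthetic data and the choice of logistic regression are just for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: 100 samples whose label depends on the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = (X[:, 0] > 0).astype(int)

# 5-fold cross-validation: each fold is held out once for scoring
# while the model trains on the remaining four folds.
scores = cross_val_score(LogisticRegression(), X, y, cv=5)
mean_accuracy = scores.mean()
```

Every sample is used for both training and evaluation across the five rounds, which is exactly why this helps when data is scarce.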

**Choosing the Right Algorithm**

Here we go again: as mentioned earlier, understanding the problem you want to solve is key to choosing the right model. Different algorithms serve different purposes, so you might need to do some research to determine the one that best suits your needs.

Other important factors that come into play when choosing the right machine learning algorithm include:

- The level of accuracy needed
- The time required to train the model(s)
- The number of features in your dataset
- The linearity of your data
- And finally, whether you might need to combine more than one algorithm (ensemble methods)

Some of the most common algorithms to choose from include:

- Logistic Regression
- Linear Regression
- Decision Trees
- Random Forest
- Support Vector Machine
- K-means Clustering

**Training and evaluation of the Algorithm**

Training a machine learning algorithm is an iterative process that stops when the desired accuracy has been achieved.

Feeding the training data into the model enables the algorithm to learn the patterns in the data by itself and, in the process, learn how to predict the output when exposed to a new set of data.

You’ll need to feed the training data into the model, test the model’s performance, and update it continually.

Once you feel that your model is ready, it is time to expose it to a new dataset and see how it performs. This is where the test dataset is used.
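The train-then-evaluate loop can be sketched with scikit-learn like this (the synthetic dataset and the random-forest choice are assumptions for the example):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic classification data standing in for a real dataset.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit (train) the model on the training portion only.
model = RandomForestClassifier(random_state=0)
model.fit(X_train, y_train)

# Evaluate on data the model has never seen: the test set.
test_accuracy = accuracy_score(y_test, model.predict(X_test))
```

If `test_accuracy` falls short of your target, you loop back: adjust the model, retrain, and evaluate again.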

**Improving Your model**

Suppose the performance of the model is a little below your expectations; then you can do some tuning.

Hyperparameters are key in achieving this. They are settings of the learning algorithm itself, chosen before training rather than learned from the data, and their values can be changed to produce a more accurate model.

Of course, you don’t have to do this manually; there are efficient methods that can help you select the best hyperparameters, and these include:

- Bayesian optimization
- Grid search
- Random search
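A minimal grid-search sketch with scikit-learn (the dataset, the random-forest model, and the tiny two-by-two grid are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for your prepared dataset.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Candidate hyperparameter values to try.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, None],
}

# Grid search cross-validates every combination and keeps the best one.
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
```

`RandomizedSearchCV` has the same interface but samples combinations instead of trying them all, which scales better to large grids.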

Once you’re confident that your model is good enough, it’s time to deploy it to production; it could be part of a web application or some other application. After deployment, the model needs to be maintained and monitored regularly to ensure that it performs as expected.

This is how the entire process of building ML models works.

Follow me for more such blogs.