Machine Learning: Question Set – 05
Explain Bias-Variance trade-off.
Bias is a type of error caused by incorrect or overly simplistic assumptions in the learning algorithm you’re using. As a result, the model may underfit the data, making it difficult to achieve high predictive accuracy and to generalize from the training set to the test set.
Variance is an error caused by an overly complex learning algorithm. The algorithm becomes extremely sensitive to small variations in the training data, which can cause the model to overfit: it carries far too much noise from the training data to be useful on the test data.
The bias-variance decomposition expresses an algorithm’s expected learning error as the sum of the bias, the variance, and a bit of irreducible error due to noise in the underlying dataset. As the model becomes more complex and more variables are added, bias decreases but variance increases; to achieve the optimally low error, you must trade off bias against variance. You don’t want your model to have either a high bias or a high variance.
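For squared loss, this decomposition is commonly written as:

Expected error = Bias² + Variance + Irreducible error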
What is underfitting? How to reduce it?
Underfitting occurs when a model is too simple to capture the underlying pattern, or when it attempts to learn from an insufficient dataset; the less data available, the higher the risk of underfitting.
Cross-validation can be used to avoid underfitting on small datasets. In this approach, the dataset is divided into two parts: a training set and a test set. The training set is used to train the model, and the test set is used to evaluate the model on new inputs. In k-fold cross-validation, this split is repeated k times so that every sample is used for both training and testing.
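As a minimal sketch of k-fold cross-validation, assuming scikit-learn and its built-in Iris dataset (both illustrative choices, not part of the original answer):

```python
# 5-fold cross-validation: the data is split into 5 parts, and each
# part takes a turn as the test set while the rest is used for training.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=5)  # accuracy on each fold
print(scores.mean())                         # average accuracy across folds
```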
Explain the terms: Training set, Validation set, and Test set
The training set is used to build the model and adjust its parameters. However, we cannot rely on the model’s correctness based on the training set alone; when new inputs are fed into the model, it may produce inaccurate results.
A validation set is used to examine the model’s response on samples that do not exist in the training dataset. The model’s performance on the validation data then serves as a benchmark for tuning hyperparameters.
The test dataset is a subset of the original dataset that has not been used to train the model; it is unseen by the model. Using the test dataset, we can therefore compute the model’s response on unseen data and assess its performance.
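A minimal sketch of such a three-way split, assuming scikit-learn and a toy dataset (the 60/20/20 ratio is an illustrative choice):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(100).reshape(50, 2), np.arange(50)  # toy data

# Hold out 20% of the samples as the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Carve a validation set out of the remaining 80% (0.25 * 0.8 = 0.2)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42)
```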
Explain the various components of a confusion matrix
True Positive (TP): A True Positive value is recorded when the Machine Learning model correctly predicts the positive class or condition.
True Negative (TN): A True Negative value is recorded when the model correctly predicts the negative class or condition.
False Positive (FP): A False Positive value is recorded when the model predicts the positive class but the actual class is negative.
False Negative (FN): A False Negative value is recorded when the model predicts the negative class but the actual class is positive.
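As an illustrative sketch (the labels below are hypothetical), these four components can be read off a confusion matrix computed with scikit-learn:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # hypothetical actual classes
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # hypothetical predicted classes

# For binary labels [0, 1], sklearn lays the matrix out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, tn, fp, fn)  # -> 3 3 1 1
```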
What is a confusion matrix?
The confusion matrix is used to describe the performance of a model and provides a summary of predictions on classification problems. It helps show the extent to which the model confuses one class with another.
A confusion matrix displays the counts of correct and incorrect predictions as well as the types of errors. For a classification model, it consists of True Positive, True Negative, False Positive, and False Negative values. Consider a model with TP = 50, TN = 100, FP = 10, and FN = 5. Its accuracy can be calculated as follows:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
= (50 + 100) / (50 + 100 + 10 + 5) = 150 / 165 ≈ 0.91
This model’s accuracy is therefore approximately 91%.
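A quick check of the arithmetic above, using the example’s counts (TP = 50, TN = 100, FP = 10, FN = 5):

```python
tp, tn, fp, fn = 50, 100, 10, 5
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"{accuracy:.2%}")  # -> 90.91%
```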
What is Principal Component Analysis (PCA)? How does it work?
We deal with multidimensional data in the real world. As the dimensionality of the data grows, visualization and computation become more difficult. In such cases, we may need to reduce the dimensions in order to analyze and visualize the data easily. We accomplish this by removing irrelevant dimensions and keeping only the most relevant ones. This is where Principal Component Analysis (PCA) comes into play.
The goal of Principal Component Analysis is to find a new set of uncorrelated (orthogonal) dimensions and rank them by the variance they capture.
Steps for PCA:
- Mean-center the data by subtracting the mean of each feature from the original dataset
- Compute the covariance matrix of the mean-centered data
- Compute the eigenvectors and eigenvalues of the covariance matrix and sort them in descending order of eigenvalue
- Project the data onto the first few eigenvectors to get the reduced representation (see the sketch below)
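A minimal NumPy sketch of these steps (the function name `pca` and the toy data are illustrative assumptions):

```python
import numpy as np

def pca(X, k):
    # Step 1: mean-center the data
    X_centered = X - X.mean(axis=0)
    # Step 2: covariance matrix of the features
    cov = np.cov(X_centered, rowvar=False)
    # Step 3: eigenvalues/eigenvectors, sorted by descending eigenvalue
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    eigvecs = eigvecs[:, order]
    # Step 4: project onto the first k eigenvectors
    return X_centered @ eigvecs[:, :k]

X = np.random.rand(100, 5)   # 100 samples, 5 features
X_reduced = pca(X, k=2)      # reduced to 2 dimensions
```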
Additional Reading: Intuitive explanation of the bias-variance tradeoff.