Machine Learning: Question Set – 12
What is the difference between regression and classification?
Classification and regression are sometimes confusing to new readers. Classification produces discrete outcomes by sorting data into specified categories. Classifying emails into spam and non-spam groups is one example.
Regression, on the other hand, works with continuous data. Predicting a stock price at a specific point in time is one example.
Geometric representation of both the concepts is shown below:
The term “classification” refers to the process of assigning the output to one of a set of categories. Is it going to be hot or cold tomorrow, for example?
Regression, on the other hand, is used to predict a numeric value from the relationship that the data reflects. What will the temperature be tomorrow, for example?
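The hot-or-cold versus temperature example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the daily temperatures are made up, the regression is a plain least-squares line, and the 20-degree threshold used for the "hot"/"cold" label is a hypothetical hand-picked rule rather than a trained classifier.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_temperature(model, day):
    """Regression: the output is a continuous number."""
    slope, intercept = model
    return slope * day + intercept


def classify_hot_or_cold(temperature, threshold=20.0):
    """Classification: the output is one of a fixed set of labels."""
    return "hot" if temperature >= threshold else "cold"


# Made-up data: temperature rises by one degree per day.
days = [1, 2, 3, 4, 5]
temps = [16.0, 17.0, 18.0, 19.0, 20.0]

model = fit_line(days, temps)
tomorrow = predict_temperature(model, 6)   # a number
label = classify_hot_or_cold(tomorrow)     # a category
```

The regression answers "what will the temperature be?", while the classification answers "which category does tomorrow fall into?".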
What is a Neural Network and How Does It Work?
A neural network is a simplified model loosely inspired by the human brain. It is made up of artificial neurons that activate in response to their inputs, much as neurons in the brain do.
The various neurons are linked by connections that allow information to travel from one neuron to the next.
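To make the idea of information travelling from neuron to neuron concrete, here is a minimal sketch of a forward pass through a tiny fully connected network. The layer sizes, the weights, and the choice of a sigmoid activation are all assumptions made for illustration; nothing here is trained.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(inputs, weights, biases):
    """Each neuron computes a weighted sum of its inputs plus a bias,
    then applies the activation function."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]


def network_forward(inputs, layers):
    """Pass the activations of one layer on as the inputs of the next."""
    activations = inputs
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations


# Hypothetical hand-picked parameters: 2 inputs -> 2 hidden neurons -> 1 output.
layers = [
    ([[0.5, -0.5], [1.0, 1.0]], [0.0, -1.0]),
    ([[1.0, -1.0]], [0.0]),
]
output = network_forward([1.0, 1.0], layers)
```

Training a real network means adjusting the weights and biases so that the output matches known examples, which this sketch leaves out.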
What is a Random Forest, exactly? What is the mechanism behind it?
Random forest is a machine learning method that can be used for both regression and classification. It is an ensemble method: like bagging and boosting, it operates by combining a number of different tree models.
Random forest builds each tree from a random sample of the training data, considering only a random subset of the columns when choosing a split.
The following are the stages that a random forest takes to create trees:
- Choose a sample size and draw a random sample of that size from the training data.
- Start with just one node.
- Begin with the start node and run the following algorithm:
  - Stop if the number of observations is less than the node size.
  - Select a random subset of variables.
  - Determine which variable performs the “best” job of separating the observations.
  - Split the observations into two nodes.
  - Repeat these steps on each of the two new nodes.
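The tree-building steps above can be sketched in plain Python for a small classification problem. This is a rough illustration, not a production implementation: the function names are hypothetical, Gini impurity is assumed as the measure of the "best" split, trees are grown on bootstrap samples, and an extra stopping rule for pure nodes is added for convenience.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels, features):
    """Return the (feature, threshold) with the lowest weighted impurity, or None."""
    best, best_score = None, float("inf")
    for f in features:
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, threshold), score
    return best


def build_tree(rows, labels, node_size, n_features, rng):
    majority = Counter(labels).most_common(1)[0][0]
    # Stop if there are too few observations (or, as an added assumption, the node is pure).
    if len(rows) < node_size or len(set(labels)) == 1:
        return majority
    # Select a random subset of variables, then find the best split among them.
    features = rng.sample(range(len(rows[0])), n_features)
    split = best_split(rows, labels, features)
    if split is None:
        return majority
    f, threshold = split
    left = [i for i, row in enumerate(rows) if row[f] <= threshold]
    right = [i for i, row in enumerate(rows) if row[f] > threshold]
    # Split the observations into two nodes and repeat on each of them.
    return {
        "feature": f,
        "threshold": threshold,
        "left": build_tree([rows[i] for i in left], [labels[i] for i in left],
                           node_size, n_features, rng),
        "right": build_tree([rows[i] for i in right], [labels[i] for i in right],
                            node_size, n_features, rng),
    }


def predict_tree(tree, row):
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree


def build_forest(rows, labels, n_trees=25, node_size=2, n_features=1, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Each tree sees a bootstrap sample (drawn with replacement) of the training data.
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(build_tree([rows[i] for i in idx], [labels[i] for i in idx],
                                 node_size, n_features, rng))
    return forest


def predict_forest(forest, row):
    """Majority vote over the individual trees."""
    return Counter(predict_tree(t, row) for t in forest).most_common(1)[0][0]


# Made-up, well-separated toy data.
rows = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.5], [8.0, 9.0], [9.0, 8.0], [8.5, 8.5]]
labels = ["a", "a", "a", "b", "b", "b"]
forest = build_forest(rows, labels)
```

For regression, the same structure applies with the majority vote replaced by an average and the impurity measure replaced by something like variance.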
What is Unsupervised Learning and How Does It Work?
Unsupervised learning is a sort of machine learning method that searches for patterns in a set of data. There is no dependent variable or label to forecast in this case.
For example, a collection of T-shirts might be grouped into clusters such as “collar style and V neck style,” “crew neck style,” and “sleeve types,” without anyone supplying those labels in advance.
Algorithms for Unsupervised Learning:
- Anomaly detection
- Neural networks (for example, autoencoders)
- Support Vector Machine (in its one-class form; the standard SVM is a supervised method)
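As a small illustration of anomaly detection, the first item in the list above, here is a sketch that flags values lying far from the mean. The z-score rule, the threshold of 2.0, and the sensor readings are all assumptions chosen for the example; real systems usually rely on more robust methods.

```python
import statistics


def find_anomalies(values, z_threshold=2.0):
    """Return the values whose distance from the mean exceeds
    z_threshold population standard deviations. No labels are needed."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mean) / stdev > z_threshold]


# Made-up readings with one obvious outlier.
readings = [10.1, 9.8, 10.0, 10.2, 9.9, 10.0, 25.0, 10.1]
anomalies = find_anomalies(readings)
```

The function never sees a "normal" or "abnormal" label; it infers what is unusual from the data alone, which is what makes the approach unsupervised.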
There are numerous machine learning algorithms available now. How does one decide which algorithm to apply when given a data set?
The choice of Machine Learning algorithm depends largely on the type of data in a given dataset. When the relationship in the data is linear, linear regression is a natural choice. If the data shows non-linearity, bagging methods tend to perform better. If we need to analyze or interpret the data for business goals, we can use decision trees or SVM. If the dataset contains images, videos, or audio, neural networks are usually a good way to obtain an accurate solution.
Even so, there is no single metric for determining which method to use in a given situation or data set.
We must use EDA (Exploratory Data Analysis) to investigate the data and understand the goal of using the dataset in order to choose the best-fitting method. For that reason, it is important to study the available algorithms thoroughly.
What is the difference between bias and variance?
The bias-variance decomposition expresses any algorithm’s expected error as the sum of squared bias, variance, and a bit of irreducible error due to noise in the underlying dataset. If you make the model more complex and include more variables, you will reduce bias but gain variance. To achieve the lowest possible level of error, you must trade off bias and variance. Neither a high bias nor a high variance is desirable.
Algorithms with high bias and low variance train models that are consistent but wrong on average.
Algorithms with a high variance and a low bias train models that are accurate but inconsistent.
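Written out, the decomposition takes the following standard form. The notation is an assumption introduced here: the data follow $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$, $\hat{f}$ is the trained model, and the expectation is taken over training sets and noise.

```latex
\mathbb{E}\left[\big(y - \hat{f}(x)\big)^2\right]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\left[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```

The first term measures how far the average model is from the truth, the second how much individual models scatter around that average, and the third is noise that no model can remove.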
Define Precision and Recall.
Precision: Precision is the ratio of the number of events you correctly recall to the total number of events you recall (a mix of correct and wrong recalls).
Precision = (True Positive) / (True Positive + False Positive)
Recall: Recall is the ratio of the number of events you correctly recall to the total number of events.
Recall = (True Positive) / (True Positive + False Negative)
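Both formulas can be computed directly from a list of true labels and a list of predictions. The sketch below assumes binary labels with 1 as the positive class, and the example labels are made up; the helper name is hypothetical.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall from paired true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when nothing is predicted or nothing is positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up example: 4 actual positives, 3 predicted positives, 2 of them correct.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
precision, recall = precision_recall(y_true, y_pred)
```

Here two of the three positive predictions are correct, so precision is 2/3, while only two of the four actual positives are found, so recall is 1/2.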
How would you deal with an unbalanced dataset?
When you have a classification task, for example, and 90% of the data is in one class, you have an imbalanced dataset. This causes issues: a 90% accuracy can be misleading if the model has little predictive power on the other class. Here are a few strategies for getting over the hump:
- Gather extra data to equal out the dataset’s imbalances.
- Resample the dataset to account for any imbalances.
- On your dataset, try a different approach entirely.
What matters here is that you have a deep understanding of the damage that an unbalanced dataset can do, as well as how to balance it.
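The sketch below illustrates both points with made-up data: a model that always predicts the majority class still scores 90% accuracy, and random oversampling is one simple way to resample. The helper names are hypothetical, and duplicating minority rows at random is only one of several resampling strategies.

```python
import random
from collections import Counter


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def oversample_minority(rows, labels, seed=0):
    """Duplicate randomly chosen rows of the smaller classes
    until every class is as large as the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Made-up dataset: 90 examples of class 0 and 10 of class 1.
labels = [0] * 90 + [1] * 10
rows = [[i] for i in range(100)]

# A "model" that ignores its input and always predicts the majority class.
always_majority = [0] * 100
baseline_accuracy = accuracy(labels, always_majority)

balanced_rows, balanced_labels = oversample_minority(rows, labels)
balanced_counts = Counter(balanced_labels)
```

The baseline reaches 0.9 accuracy without ever identifying a single minority example. Resampling should be applied to the training split only, so that the evaluation data keeps its original class distribution.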
When should classification be preferred over regression?
Classification yields discrete values and restricts the dataset to strict categories, whereas regression yields continuous results that let you distinguish between individual points more finely. If you want your results to reflect which explicit category each data point belongs to, you would prefer classification over regression. For example, classification fits when you want to know whether a name is male or female, rather than how strongly it correlates with male and female names.
What approach do you use to avoid overfitting?
Overfitting occurs when a model learns the noise in its training data instead of the underlying pattern, which is especially likely when the dataset is small. Training on a large amount of data helps avoid it. However, if we only have a small dataset and are obliged to build a model on it, we can use a technique known as cross-validation.
The overfitting scenario is described in the following diagram.
In this method, a model is typically given a dataset of known data on which it is trained and a dataset of unknown data against which it is evaluated. The fundamental goal of cross-validation is to set aside a dataset that will be used to “test” the model during the training phase. If there is enough data, ‘Isotonic Regression’ is employed to avoid overfitting.
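One common form of cross-validation is k-fold: the data is cut into k parts, and each part takes a turn as the held-out "test" set while the model is trained on the rest. The sketch below is a minimal illustration; the function names are hypothetical, and the "model" is simply the mean of the training targets so that the example stays self-contained.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def cross_validate(xs, ys, k, fit, score):
    """Train on k-1 folds, score on the held-out fold, and average the scores."""
    scores = []
    for train, test in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(score(model, [xs[i] for i in test], [ys[i] for i in test]))
    return sum(scores) / len(scores)


def fit_mean(xs, ys):
    """Toy model: always predict the mean of the training targets."""
    return sum(ys) / len(ys)


def mean_abs_error(model, xs, ys):
    return sum(abs(y - model) for y in ys) / len(ys)


folds = list(k_fold_indices(10, 3))
cv_error = cross_validate(list(range(10)), [3.0] * 10, 5, fit_mean, mean_abs_error)
```

Every sample is used for testing exactly once, so the averaged score reflects performance on data the model did not see during training. In practice the data is usually shuffled before the folds are cut.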
What is the difference between Machine Learning and Deep Learning?
Machine learning is all about algorithms that read data, learn from it, and then apply what they’ve learned to make sound decisions.
Deep learning is a subset of machine learning that is inspired by the structure of the human brain and is especially good for feature detection.