Introduction to Data Science Algorithms
This article gives a high-level description of the essential algorithms used in data science. As you may already know, data science is a field of study where decisions are made based on insights drawn from data rather than on classic rule-based, deterministic approaches. Typically, we can divide a machine learning task into three parts:
- Obtaining the data and mapping the business problem
- Applying machine learning techniques and observing the performance metric
- Testing and deploying the model
Throughout this life cycle, we use various data science algorithms to solve the task at hand. In this article, we group the most commonly used algorithms by their learning type and discuss each at a high level.
Types of Data Science Algorithms
Based on the learning methodology, we can divide machine learning or data science algorithms into the following types:
- Supervised Algorithms
- Unsupervised Algorithms
1. Supervised Algorithms
As the name suggests, supervised algorithms are a class of machine learning algorithms where the model is trained on labeled data. For example, based on historical data, you may want to predict whether a customer will default on a loan. After preprocessing and feature engineering of the labeled data, a supervised algorithm is trained on the structured data and then tested on new data points, in this case to predict whether a new customer will default. Let’s dive into the most popular supervised machine learning algorithms.
K Nearest Neighbors
K nearest neighbors (KNN) is one of the simplest yet most powerful machine learning algorithms. It is a supervised algorithm where classification is done based on the k nearest data points. The idea behind KNN is that similar points tend to lie close together, so by looking at the labels of the nearest data points we can classify a test data point. For example, suppose we are solving a standard classification problem where we want to predict whether a data point belongs to class A or class B. Let k=3. We look at the 3 data points nearest to the test data point; if at least two of them belong to class A, we declare the test data point to be class A, otherwise class B. The right value of k is found through cross-validation. A brute-force prediction has to compute the distance to every training point, so its cost grows linearly with the size of the training set, which makes KNN a poor fit for low-latency applications.
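The voting rule described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function name `knn_predict`, the use of Euclidean distance, and the toy data are all assumptions made for the example.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Brute force: one distance computation per training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours and count their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two small groups labelled "A" and "B".
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(points, labels, (1.1, 1.0), k=3))
print(knn_predict(points, labels, (4.9, 5.0), k=3))
```

Note that every call scans the whole training set, which is the source of the latency concern mentioned above.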
Linear Regression
Linear regression is a supervised data science algorithm whose output variable is continuous. The idea is to find the hyperplane that best fits the data points, typically by minimizing the sum of squared differences between the observed values and the values predicted by the hyperplane. For example, predicting the amount of rain is a standard regression problem where linear regression can be used. Linear regression assumes that the relation between the independent and dependent variables is linear and that there is very little or no multicollinearity.
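As a rough illustration, here is a least-squares fit for the simplest case of a single input variable, where the hyperplane reduces to a line. The function name `fit_line` and the humidity and rainfall numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); this minimizes squared error.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data that lies exactly on the line y = 2x + 1.
humidity = [1.0, 2.0, 3.0, 4.0]
rain_mm = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(humidity, rain_mm)
print(slope, intercept)
```

With more than one input variable the same idea applies, but the fit is usually computed with a linear algebra library rather than by hand.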
Logistic Regression
Though the name says regression, logistic regression is a supervised classification algorithm. The output variable of logistic regression is categorical. The geometric intuition is that we can separate the different class labels using a linear decision boundary. Please note that mean squared error is not a suitable cost function for logistic regression, because combined with the sigmoid function it is non-convex; log loss (cross-entropy) is used instead.
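The sketch below trains a one-feature logistic regression with plain gradient descent on the log loss (cross-entropy), the usual convex alternative to mean squared error for this model. The function names, learning rate, step count, and toy data are assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(w, b, xs, ys):
    """Average cross-entropy between labels and predicted probabilities."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(xs)

def train(xs, ys, lr=0.1, steps=2000):
    """Gradient descent on the log loss for a single feature."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= lr * sum(e * x for e, x in zip(errors, xs)) / n
        b -= lr * sum(errors) / n
    return w, b

def predict(w, b, x):
    # The decision boundary is where the predicted probability equals 0.5.
    return 1 if sigmoid(w * x + b) >= 0.5 else 0

# Hypothetical toy data: class 0 on the left, class 1 on the right.
xs = [-2.0, -1.5, -1.0, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 1, 1, 1]

w, b = train(xs, ys)
print([predict(w, b, x) for x in xs])
```

The learned boundary is linear in the input: a point is assigned to class 1 exactly when `w * x + b` is non-negative.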
Support Vector Machine
In logistic regression, our main goal was to find a separating linear surface. We can consider the support vector machine (SVM) as an extension of this idea, where we need to find the hyperplane that maximizes the margin. But what is a margin? For a decision surface defined by a weight vector W, we draw two lines parallel to it, one on each side, each touching the nearest data points of one class. The distance between these two lines is called the margin. The basic SVM assumes the data is linearly separable, though we can also use SVM for nonlinear data through the kernel trick.
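To make the margin concrete, the sketch below assumes the common convention that the two parallel lines are `w·x + b = +1` and `w·x + b = -1`, in which case the margin width is `2 / ||w||`. The weight vector, bias, and points are hypothetical values picked by hand rather than the output of an SVM solver.

```python
import math

def margin_width(w):
    """Distance between the lines w.x + b = +1 and w.x + b = -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

def functional_margin(w, b, x, y):
    """y * (w.x + b); a value >= 1 means x is on the correct side, outside the margin."""
    return y * (sum(wi * xi for wi, xi in zip(w, x)) + b)

# Hypothetical separating hyperplane and labelled points (labels are -1 or +1).
w, b = (1.0, 1.0), -3.0
data = [((1.0, 1.0), -1), ((0.0, 1.0), -1), ((2.0, 2.0), 1), ((3.0, 2.0), 1)]

print(margin_width(w))
print([functional_margin(w, b, x, y) for x, y in data])
```

Points whose functional margin equals exactly 1 sit on one of the two parallel lines; these are the support vectors. Maximizing the margin amounts to making `||w||` as small as possible while keeping every functional margin at or above 1.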
Decision Tree
A decision tree is a classifier built from nested if-else rules that uses a tree-like graph structure to make a decision. Decision trees are among the most widely used supervised machine learning algorithms in data science. They are easy to interpret, fairly robust to outliers, and often give good accuracy, although no single algorithm is best in every case. The output variable of a decision tree is usually categorical, but decision trees can also be used to solve regression problems.
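The nested if-else view can be shown directly. The tree below is written by hand for the loan-default example used earlier in this article; the feature names and thresholds are hypothetical, whereas a real decision tree learns its splits from the training data.

```python
def predict_default(income, debt_ratio, missed_payments):
    """A hand-written stand-in for a learned tree; each `if` is one internal node."""
    if missed_payments > 2:
        return "default"
    else:
        if debt_ratio > 0.5:
            if income < 30000:
                return "default"
            else:
                return "no default"
        else:
            return "no default"

print(predict_default(income=25000, debt_ratio=0.7, missed_payments=0))
print(predict_default(income=60000, debt_ratio=0.7, missed_payments=0))
```

Each path from the first condition down to a `return` corresponds to one root-to-leaf path in the tree, which is why the prediction for any input is easy to explain.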
Ensembles
Ensembles are a popular category of data science algorithms where multiple models are used together to get better performance. If you are familiar with Kaggle (a platform owned by Google for practicing and competing in data science challenges), you will find that most winning solutions use some kind of ensemble.
We can roughly divide ensembles into the following categories:
- Bagging
- Boosting
- Stacking
- Cascading
Random Forest and Gradient Boosting Decision Trees are examples of popular ensemble algorithms.
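As a small illustration of bagging, the first category in the list above, the sketch below trains several very simple models on bootstrap samples (drawn with replacement) and combines them by majority vote. The base learner, a one-dimensional threshold "stump", and all names and data are assumptions made for the example; Random Forest applies the same idea with decision trees as base learners.

```python
import random
from collections import Counter

def fit_stump(xs, ys):
    """Base learner: a threshold halfway between the two class means.

    Assumes class 1 tends to have larger x values than class 0.
    """
    zeros = [x for x, y in zip(xs, ys) if y == 0]
    ones = [x for x, y in zip(xs, ys) if y == 1]
    if not zeros or not ones:
        # A bootstrap sample may contain only one class; predict that class.
        constant = 1 if ones else 0
        return lambda x: constant
    threshold = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2
    return lambda x: 1 if x > threshold else 0

def bagging_fit(xs, ys, n_models=11, seed=0):
    """Train each model on its own bootstrap sample of the data."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return models

def bagging_predict(models, x):
    """Combine the individual predictions by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Hypothetical one-dimensional data with two well-separated classes.
xs = [1.0, 1.5, 2.0, 2.5, 8.0, 8.5, 9.0, 9.5]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

models = bagging_fit(xs, ys)
print(bagging_predict(models, 2.0), bagging_predict(models, 9.0))
```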
2. Unsupervised Algorithms
Unsupervised algorithms are used for tasks where the data is unlabelled. The most popular use case of unsupervised algorithms is clustering. Clustering is the task of grouping together similar data points without manual intervention. Let’s discuss some of the popular unsupervised machine learning algorithms here.
K Means
K Means is a randomized unsupervised algorithm used for clustering. K Means follows the steps below:
1. Initialize K centroids randomly (c1, c2, ..., ck).
2. For each point Xi in the data set:
   - Select the nearest centroid Cj (j = 1, 2, ..., k)
   - Add Xi to the cluster of Cj
3. Recompute each centroid as the mean of the points assigned to it, which reduces the intracluster distance.
4. Repeat steps 2 and 3 until the assignments converge.
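The steps above can be sketched as follows. This is a minimal version under a few assumptions: Euclidean distance, initial centroids drawn from the data points themselves, and convergence defined as "no assignment changed". The function name `k_means`, its parameters, and the toy data are chosen for the example.

```python
import math
import random

def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Step 4: stop once the assignments no longer change.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignments

# Hypothetical toy data: two well-separated groups of three points.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]

centroids, assignments = k_means(points, k=2)
print(sorted(centroids))
print(assignments)
```

On data this cleanly separated the result does not depend on the starting points, but in general different seeds can give different clusterings.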
K Means++
The initialization step in K Means is purely random, and the resulting clustering can change drastically depending on that initialization. K Means++ addresses this problem by choosing the initial centroids in a probabilistic way, favoring points that are far from the centroids already chosen, instead of pure randomization. As a result, K Means++ is more stable than classic K Means.
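A sketch of the seeding step is shown below. It follows the standard K Means++ rule, where each new centroid is sampled with probability proportional to its squared distance from the nearest centroid chosen so far. The function name `kmeans_pp_init` and the toy data are assumptions for the example.

```python
import math
import random

def kmeans_pp_init(points, k, seed=0):
    """Choose k initial centroids that tend to be spread out over the data."""
    rng = random.Random(seed)
    # The first centroid is still picked uniformly at random.
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Squared distance from each point to its nearest chosen centroid.
        weights = [min(math.dist(p, c) ** 2 for c in centroids) for p in points]
        # Far-away points get a higher chance; chosen points have weight 0.
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids

# Hypothetical toy data: two well-separated groups of three points.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]

print(kmeans_pp_init(points, k=2))
```

The returned centroids would replace the random choice in step 1 of K Means; the remaining steps stay the same.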
K Medoids
K Medoids is also a clustering algorithm, closely related to K Means. The main difference between the two is that the centroids of K Means do not necessarily exist in the data set, whereas the cluster centers of K Medoids (the medoids) are always actual data points. For this reason, K Medoids offers better interpretability of clusters. K Means minimizes the total squared error, while K Medoids minimizes the total dissimilarity between the points and their medoid.
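The difference between a centroid and a medoid can be seen on a single cluster. In this sketch, which uses Euclidean distance as the dissimilarity and hypothetical data, the mean is pulled toward an outlier and is not a data point, while the medoid is the member with the smallest total dissimilarity to the others.

```python
import math

def mean_point(cluster):
    """K Means style center: the coordinate-wise mean (may not be a data point)."""
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))

def medoid(cluster):
    """K Medoids style center: the member with the smallest total dissimilarity."""
    return min(cluster, key=lambda p: sum(math.dist(p, q) for q in cluster))

# Hypothetical cluster with one outlier at x = 10.
cluster = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (10.0, 0.0)]

print(mean_point(cluster))
print(medoid(cluster))
```

Because the medoid is a real observation, it can be shown to a business user as a representative example of the cluster.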
Conclusion
In this article, we discussed the most popular machine learning algorithms used in the field of data science. After all this, you may ask, ‘Which algorithm is the best?’ There is no clear winner. It depends on the task at hand and the business requirements. As a best practice, always start with the simplest algorithm and increase the complexity gradually.