Introduction to Data Science Techniques
In today’s world, where data is often called the new gold, a business can choose from many kinds of analysis. The outcome of a data science project depends heavily on the type of data available, so its impact varies as well. Because so many kinds of analysis exist, it is important to understand which baseline techniques to select. The essential goal of data science techniques is not only to find relevant information but also to detect the weak links that make a model perform poorly.
What is Data Science?
Data science is a field that spans several disciplines. It uses scientific methods, processes, algorithms, and systems to extract knowledge from data and act on it. The field brings together concepts from statistics, data analysis, and machine learning. Theoretical knowledge of statistics, real-time data, and machine learning techniques work hand in hand to produce useful outcomes for the business. Using the techniques of data science, we can make better decisions by surfacing patterns that the human eye and mind might otherwise miss. Remember, the machine never forgets! In a data-driven world, data science is a necessary tool for maximizing profit.
Different Types of Data Science Techniques
In the next few paragraphs, we look at common data science techniques used in most projects. A technique can sometimes be specific to a business problem and fall outside the categories below; it is perfectly fine to treat such techniques as miscellaneous. At a high level, we divide the techniques into supervised (the target variable is known) and unsupervised (there is no known target variable). At the next level, the techniques can be divided in terms of:
- The output we expect, that is, the intent of the business problem.
- The type of data used.
Let us first look at segregation based on intent.
1. Unsupervised Learning
- Anomaly Detection
In this technique, we identify unexpected occurrences in the dataset. Since an anomaly behaves differently from the rest of the data, the underlying assumptions are:
- The occurrence of these instances is very small in number.
- The difference in behaviour is significant.
Anomaly detection algorithms such as Isolation Forest, a tree-based model, assign an anomaly score to each record in a dataset. Because of its popularity, this type of detection is used in various business cases, for example, on web page views, churn rate, and revenue per click. The graph below shows what an anomaly looks like.
Here, the points in blue represent anomalies in the dataset. They deviate from the regular trend line and occur rarely.
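To make the scoring idea concrete, here is a minimal pure-Python sketch of an Isolation Forest. It is an illustration under assumptions, not a production implementation: the function names (c_factor, build_tree, path_length, anomaly_scores), the number of trees, and the synthetic page-view data are all invented for this example. The intuition is that an anomaly is isolated by fewer random splits than a normal point, so a short average path through the trees gives a score close to 1.

```python
import math
import random

def c_factor(n):
    """Average path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + 0.5772156649  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def build_tree(sample, depth, max_depth, rng):
    """Recursively split the sample on a random feature at a random threshold."""
    if depth >= max_depth or len(sample) <= 1:
        return ("leaf", len(sample))
    feature = rng.randrange(len(sample[0]))
    values = [row[feature] for row in sample]
    low, high = min(values), max(values)
    if low == high:
        return ("leaf", len(sample))
    split = rng.uniform(low, high)
    left = [row for row in sample if row[feature] < split]
    right = [row for row in sample if row[feature] >= split]
    return ("node", feature, split,
            build_tree(left, depth + 1, max_depth, rng),
            build_tree(right, depth + 1, max_depth, rng))

def path_length(point, tree, depth=0):
    """Number of splits needed to reach the leaf that holds the point."""
    if tree[0] == "leaf":
        return depth + c_factor(tree[1])
    _, feature, split, left, right = tree
    child = left if point[feature] < split else right
    return path_length(point, child, depth + 1)

def anomaly_scores(data, n_trees=100, sample_size=64, seed=0):
    """Score every record; assumes at least two records. Higher means more anomalous."""
    rng = random.Random(seed)
    sample_size = min(sample_size, len(data))
    max_depth = math.ceil(math.log2(sample_size))
    trees = [build_tree(rng.sample(data, sample_size), 0, max_depth, rng)
             for _ in range(n_trees)]
    norm = c_factor(sample_size)
    scores = []
    for point in data:
        mean_depth = sum(path_length(point, t) for t in trees) / n_trees
        scores.append(2.0 ** (-mean_depth / norm))
    return scores

# Hypothetical data: 60 ordinary days of page views plus one extreme spike.
data_rng = random.Random(42)
page_views = [(data_rng.gauss(1000, 50),) for _ in range(60)] + [(5000.0,)]
scores = anomaly_scores(page_views)
most_anomalous = max(range(len(page_views)), key=lambda i: scores[i])
print("most anomalous record:", most_anomalous, "score:", round(scores[most_anomalous], 3))
```

Each record is a tuple, so the same sketch works with more than one feature per record. In a real project we would normally use a tested library implementation instead of hand-written trees.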
- Clustering Analysis
The main task in this analysis is to segregate the dataset into groups so that the data points within a group share similar trends or traits. In data science terminology, these groups are called clusters. For example, suppose a retail business plans to expand into a new region and needs to know how new customers there would behave, based on the past data we have. It is impractical to devise a strategy for each individual in a population, but bucketing the population into clusters makes a strategy effective within each group and scalable.
Here, the blue and orange colors mark different clusters, each with its own distinct traits.
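As a small illustration of bucketing customers into clusters, the sketch below uses k-means. Note that k-means is our choice for the example and is not named in the text above; the function name k_means, the customer figures, and the choice of two clusters are likewise assumptions. The algorithm alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its members.

```python
import math
import random

def k_means(points, k, n_iter=20, seed=0):
    """Plain k-means: returns one cluster label per point and the final centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids

# Hypothetical customers: (annual spend, store visits per year).
customers = [(200, 2), (220, 3), (210, 2), (900, 12), (950, 11), (920, 13)]
labels, centroids = k_means(customers, k=2)
print(labels)
print(centroids)
```

The cluster numbers themselves are arbitrary; what matters is which customers end up sharing a label. With features on very different scales, we would usually standardize them first so that one feature does not dominate the distance.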
- Association Analysis
This analysis helps us find interesting relationships between items in a dataset. It uncovers hidden relationships and represents them as association rules or sets of frequent items. Association rule mining is broken down into two steps:
- Frequent Itemset Generation: In this step, we find the sets of items that frequently occur together.
- Rule Generation: The frequent itemsets found above are passed through different layers of rule formation to expose the hidden relationships between the items. For example, the itemsets can be grouped under conceptual, implementation, or application issues, which are then branched down into their respective trees to build the association rules.
For example, Apriori is an algorithm for building association rules.
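The two steps can be sketched in a few lines of Python. This is a simplified, Apriori-style illustration rather than a reference implementation: the function names (frequent_itemsets, association_rules), the shopping baskets, and the support and confidence thresholds are all assumptions made for the example. Support is the share of transactions containing an itemset, and the confidence of a rule A -> B is support(A and B) divided by support(A).

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Step 1: level-wise search for itemsets whose support meets the threshold."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([i]) for i in items]
    frequent = {}
    size = 1
    while candidates:
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        size += 1
        # Join: merge frequent itemsets into candidates that are one item larger.
        joined = {a | b for a in level for b in level if len(a | b) == size}
        # Prune: keep a candidate only if all of its smaller subsets are frequent.
        candidates = [c for c in joined
                      if all(frozenset(s) in level for s in combinations(c, size - 1))]
    return frequent

def association_rules(frequent, min_confidence):
    """Step 2: turn each frequent itemset into rules antecedent -> consequent."""
    rules = []
    for itemset, support in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                confidence = support / frequent[antecedent]
                if confidence >= min_confidence:
                    rules.append((set(antecedent), set(itemset - antecedent), confidence))
    return rules

# Hypothetical shopping baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]
frequent = frequent_itemsets(baskets, min_support=0.4)
rules = association_rules(frequent, min_confidence=0.7)
for antecedent, consequent, confidence in rules:
    print(antecedent, "->", consequent, round(confidence, 2))
```

The pruning step is what keeps the search manageable: an itemset can only be frequent if every one of its subsets is frequent, so most large candidates are discarded without being counted.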
2. Supervised Learning
- Regression Analysis
In regression analysis, we define one variable as the dependent (target) variable and the remaining variables as independent variables, and then hypothesize how one or more independent variables influence the target. Regression with one independent variable is called simple (univariate) regression, and regression with more than one is called multiple regression, referred to here as multivariate. Let us understand the univariate case first and then scale it to the multivariate case.
For example, let y be the target variable and x1 the independent variable. From the equation of a straight line, we can write y = mx1 + c. Here, “m” determines how strongly y is influenced by x1. If “m” is very close to zero, a change in x1 barely affects y. If the magnitude of “m” is greater than 1, the impact is stronger, and a small change in x1 leads to a large variation in y. The multivariate case can similarly be written as y = m1x1 + m2x2 + m3x3 + … + c, where the impact of each independent variable is determined by its corresponding “m”.
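As a short sketch of how “m” and “c” can be obtained from data in the univariate case, the code below fits the line by ordinary least squares. The text does not say which fitting method is used, so least squares is an assumption here, as are the function name fit_line and the sample values.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = m * x1 + c with a single independent variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # m is the covariance of x and y divided by the variance of x.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    m = cov_xy / var_x
    c = mean_y - m * mean_x  # the fitted line passes through the point of means
    return m, c

# Hypothetical data lying exactly on the line y = 2 * x1 + 1.
x1 = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]
m, c = fit_line(x1, y)
print(m, c)  # prints 2.0 1.0
```

Here each unit change in x1 moves y by two units, which is exactly what the slope “m” expresses.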
- Classification Analysis
Like clustering analysis, classification assigns data points to groups, but here the target variable is given in the form of classes. The difference is that in clustering we do not know in advance which group a data point falls in, whereas in classification the training data tells us which group each point belongs to. Classification differs from regression in that the target takes one of a fixed number of classes, whereas in regression the target is continuous. There are many classification algorithms, for example, Support Vector Machines, Logistic Regression, and Decision Trees.
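To show what a fixed set of classes looks like in practice, here is a minimal sketch of logistic regression with one feature, trained by plain gradient descent. The function names (sigmoid, fit_logistic, predict), the learning rate, the iteration count, and the support-ticket data are all assumptions for illustration. The model outputs a probability, and we turn it into a class by thresholding at 0.5.

```python
import math

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, labels, lr=0.1, n_iter=2000):
    """Fit the weight w and bias b by gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_iter):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, labels)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def predict(x, w, b):
    """Return class 1 if the predicted probability is at least 0.5, else class 0."""
    return 1 if sigmoid(w * x + b) >= 0.5 else 0

# Hypothetical data: support tickets raised by a customer and whether they churned (1) or not (0).
tickets = [1, 2, 3, 4, 6, 7, 8, 9]
churned = [0, 0, 0, 0, 1, 1, 1, 1]
w, b = fit_logistic(tickets, churned)
print(predict(2, w, b), predict(8, w, b))
```

Unlike the regression line, whose output can be any number, the prediction here is always one of the two known classes.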
In conclusion, each type of analysis is vast in itself, and here we have given only a small flavor of the different techniques. In the next few notes, we will take each of them separately and go into detail on the sub-techniques employed within each parent technique.