Introduction to Data Mining Methods
The volume of data grows daily on an enormous scale, but not all of the data collected is useful. Meaningful data must be separated from noisy (meaningless) data, and data mining is the process that performs this separation.
What is Data Mining?
Data mining is the process of extracting useful information or knowledge from very large amounts of data (big data). Data mining tools help narrow the gap between raw data and usable information. Data mining is also referred to as knowledge discovery from data, or KDD.
Data mining can be performed on many kinds of databases and information repositories, such as relational databases, data warehouses, transactional databases, and data streams.
Different Data Mining Methods:
Many methods are used for data mining, and the crucial step is selecting the one that fits the business need or problem statement. These methods help predict future outcomes and support decisions based on those predictions. They also help in analyzing market trends and increasing company revenue.
Some Data Mining Methods are:
- Association
- Classification
- Clustering Analysis
- Prediction
- Sequential Patterns or Pattern Tracking
- Decision Trees
- Outlier Analysis or Anomaly Analysis
- Neural Network
Let us go through these data mining methods one by one.
1. Association:
Association is a method used to find a correlation between two or more items by identifying hidden patterns in the data set, which is why it is also called relation analysis. It is used in market basket analysis to predict customer behavior.
Suppose the marketing manager of a supermarket wants to determine which products are frequently purchased together.
As an example,
buys(x, "beer") -> buys(x, "chips") [support = 1%, confidence = 50%]
- Here x represents a customer buying beer and chips together.
- Confidence measures the certainty of the rule: if a customer buys beer, there is a 50% chance that he or she will also buy chips.
- Support means that 1% of all the transactions under analysis showed beer and chips being bought together.
Similar examples include bread and butter, or computers and software.
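To make support and confidence concrete, here is a minimal sketch that computes both for the beer and chips rule. The `transactions` list and the helper names `support` and `confidence` are made up for illustration; the numbers are toy values, not the 1% and 50% from the rule above.

```python
# Hypothetical toy data: each set is the basket of one transaction.
transactions = [
    {"beer", "chips"},
    {"beer", "chips", "bread"},
    {"beer", "bread"},
    {"bread", "butter"},
    {"beer", "butter"},
    {"chips", "bread"},
    {"bread", "butter", "milk"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of all transactions that contain every item in itemset."""
    hits = sum(1 for basket in transactions if itemset <= basket)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Share of transactions containing the antecedent that also contain the consequent."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)

rule_support = support({"beer", "chips"}, transactions)
rule_confidence = confidence({"beer"}, {"chips"}, transactions)
print(f"support = {rule_support:.0%}, confidence = {rule_confidence:.0%}")
```

In this toy data, beer and chips appear together in 2 of 8 baskets, and beer appears in 4 of 8, so half of the beer buyers also bought chips.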
There are two types of Association Rules:
- Single-dimensional association rule: These rules involve a single attribute (for example, buys) that appears more than once in the rule.
- Multidimensional association rule: These rules involve two or more attributes (for example, age, income, and buys).
2. Classification:
This data mining method is used to assign the items in a data set to classes or groups. It helps predict the behavior of items within each group. It is a two-step process:
- Learning step (training phase): In this, a classification algorithm builds the classifier by analyzing a training set.
- Classification step: Test data are used to estimate the accuracy or precision of the classification rules.
For example, a bank uses classification to identify loan applicants as low, medium, or high credit risks. Similarly, a medical researcher analyzes cancer data to predict which medicine to prescribe to a patient.
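The two steps can be sketched with a very small classifier. This example uses a 1-nearest-neighbour rule as a stand-in for whatever classification algorithm a bank would actually use. The applicant features (income and debt, in thousands), the labels, and the function names are all hypothetical.

```python
import math

# Hypothetical labelled applicants: (income, debt) -> credit risk class.
training_set = [
    ((90, 5), "low"), ((85, 10), "low"),
    ((60, 20), "medium"), ((55, 25), "medium"),
    ((30, 40), "high"), ((25, 45), "high"),
]
# Held-out test data with known labels, used only to estimate accuracy.
test_set = [
    ((88, 8), "low"), ((58, 22), "medium"),
    ((28, 42), "high"), ((70, 15), "medium"),
]

def build_classifier(training_set):
    """Learning step: 1-nearest-neighbour simply memorises the training set."""
    def classify(features):
        nearest = min(training_set, key=lambda row: math.dist(row[0], features))
        return nearest[1]
    return classify

def accuracy(classifier, test_set):
    """Classification step: share of test records whose predicted label is correct."""
    correct = sum(1 for features, label in test_set if classifier(features) == label)
    return correct / len(test_set)

classifier = build_classifier(training_set)
test_accuracy = accuracy(classifier, test_set)
print(f"accuracy on test data: {test_accuracy:.0%}")
```

Keeping the test records out of the training set is what makes the accuracy figure a fair estimate of how the classifier behaves on new applicants.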
3. Clustering Analysis:
Clustering is similar to classification, but here the groups (clusters) are formed from the similarities between data items rather than from predefined classes. Objects in different clusters are dissimilar or unrelated. Clustering is also called data segmentation because it partitions large data sets into clusters according to similarity.
There are various clustering methods that are used:
- Hierarchical Agglomerative methods
- Grid-Based Methods
- Partitioning Methods
- Model-Based Methods
- Density-Based Methods
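As one illustration of a partitioning method, here is a minimal k-means sketch. The sample points and the choice to seed the centroids with the first `k` points are assumptions made to keep the example short and reproducible; real implementations usually choose starting centroids more carefully.

```python
import math

def k_means(points, k, iterations=10):
    """Plain k-means: assign points to the nearest centroid, then recompute centroids."""
    centroids = list(points[:k])  # assumption: seed with the first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        # New centroid = coordinate-wise mean; keep the old one if a cluster is empty.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical 2-D data with two visibly separate groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 0.5), (9.5, 8.0)]
centroids, clusters = k_means(points, k=2)
for centroid, members in zip(centroids, clusters):
    print(centroid, members)
```

No labels are supplied anywhere: the two groups emerge only from the distances between points.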
The loan applicant example can be considered here as well. Some differences from classification are depicted in the figure below.
4. Prediction:
This method is used to predict future values based on past and present trends in the data. Prediction is mostly used in combination with other data mining methods such as classification, pattern matching, trend analysis, and relation analysis.
For example, the sales manager of a supermarket may want to predict the revenue that each item will generate based on past sales data. Prediction models a continuous-valued function that estimates missing numeric values.
Regression analysis is a common choice for prediction. It models the relationship between one or more independent variables and a dependent variable.
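A minimal sketch of regression for prediction is a straight line fitted by ordinary least squares. The monthly revenue figures below are made up, and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical past sales: month number (independent) vs. revenue (dependent).
months = [1, 2, 3, 4, 5]
revenue = [10.0, 12.0, 15.0, 15.0, 18.0]

slope, intercept = fit_line(months, revenue)
forecast = slope * 6 + intercept  # predicted revenue for month 6
print(f"revenue = {slope:.2f} * month + {intercept:.2f}; month 6 forecast = {forecast:.2f}")
```

Here the month is the independent variable and revenue is the dependent variable; the fitted line is then extended one step to estimate a value that has not been observed yet.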
5. Sequential patterns or Pattern tracking:
This data mining method is used to identify patterns that occur frequently over a certain period of time.
For example, the sales manager of a clothing company may see that jacket sales increase just before the winter season, or a bakery may see sales rise around Christmas or New Year's Eve.
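A very simple way to spot such a recurring pattern is to check which months rank among the best sellers year after year. The sketch below is only a recurring-peak check under made-up sales figures, not a full sequential pattern mining algorithm; `recurring_peak_months` and `top_n` are hypothetical names.

```python
from collections import Counter

# Hypothetical monthly jacket sales (units), January through December.
sales_by_year = {
    2021: [40, 35, 30, 20, 15, 10, 10, 15, 30, 80, 95, 60],
    2022: [42, 33, 28, 22, 14, 12, 11, 18, 35, 85, 90, 65],
    2023: [38, 36, 31, 19, 16, 11, 12, 17, 33, 78, 99, 58],
}

def recurring_peak_months(sales_by_year, top_n=2):
    """Count how often each month is among the top_n sales months of its year."""
    counts = Counter()
    for monthly in sales_by_year.values():
        ranked = sorted(range(12), key=lambda m: monthly[m], reverse=True)
        counts.update(m + 1 for m in ranked[:top_n])  # months numbered 1..12
    return counts

peaks = recurring_peak_months(sales_by_year)
# Months that were a peak in every year are treated as the recurring pattern.
frequent = sorted(month for month, hits in peaks.items() if hits == len(sales_by_year))
print("months that peak every year:", frequent)
```

With these figures, October and November are peak months in all three years, which matches the "just before winter" observation.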
Let’s look at an example with a graph
6. Decision Trees:
A decision tree is a tree structure (as its name suggests), where
- Each internal node represents a test on an attribute.
- Each branch denotes an outcome of the test.
- Each terminal (leaf) node holds a class label.
- The topmost node is the root node, which poses a simple question with two or more possible answers. The tree grows from there, producing a flowchart-like structure.
In this decision tree, the government classifies citizens as below or above age 18. This helps decide whether a license should be issued to a particular citizen.
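The structure can be sketched as nested dictionaries, with one test per internal node and a class label at each leaf. The age test follows the license example; treating exactly 18 as eligible and the second attribute, `passed_driving_test`, are assumptions added only to show more than one level.

```python
# Internal nodes hold a test and its branches; leaves are plain class labels.
license_tree = {
    "test": lambda citizen: citizen["age"] >= 18,  # assumption: 18 counts as eligible
    "branches": {
        False: "do not issue license",
        True: {
            # Hypothetical second attribute, not part of the original example.
            "test": lambda citizen: citizen["passed_driving_test"],
            "branches": {
                True: "issue license",
                False: "do not issue license",
            },
        },
    },
}

def classify(node, record):
    """Walk from the root, following the branch for each test outcome, until a leaf."""
    while isinstance(node, dict):
        outcome = node["test"](record)
        node = node["branches"][outcome]
    return node

print(classify(license_tree, {"age": 16, "passed_driving_test": True}))
print(classify(license_tree, {"age": 30, "passed_driving_test": True}))
```

In practice the tests and their order are learned from training data rather than written by hand; this sketch only shows how a finished tree is read.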
7. Outlier Analysis or Anomaly Analysis:
This data mining method is used to identify data items that do not comply with the expected pattern or behavior. These unexpected data items are considered outliers or noise. Detecting them is helpful in many domains, such as credit card fraud detection, intrusion detection, and fault detection. This method is also called outlier mining.
For example, let's assume the graph below is plotted using some data sets in our database.
A best-fit line is drawn through the data. The points lying near the line show expected behavior, while a point far from the line is an outlier.
This helps to detect anomalies and take appropriate action.
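The best-fit-line idea can be sketched as follows: fit a line by least squares, measure how far each point lies from it, and flag points whose distance is unusually large. The data points and the cutoff of two standard deviations are assumptions chosen for illustration.

```python
import statistics

def fit_line(points):
    """Least-squares slope and intercept for a list of (x, y) points."""
    mean_x = statistics.fmean(x for x, _ in points)
    mean_y = statistics.fmean(y for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def find_outliers(points, threshold=2.0):
    """Flag points whose residual exceeds threshold * standard deviation of residuals."""
    slope, intercept = fit_line(points)
    residuals = [y - (slope * x + intercept) for x, y in points]
    spread = statistics.pstdev(residuals)
    return [p for p, r in zip(points, residuals) if abs(r) > threshold * spread]

# Hypothetical data: y is roughly 2x, except for one suspicious reading.
points = [(1, 2), (2, 4), (3, 6), (4, 8), (5, 30), (6, 12), (7, 14), (8, 16)]
outliers = find_outliers(points)
print("outliers:", outliers)
```

One caveat: the outlier itself pulls the fitted line toward it, so a strict cutoff can miss milder anomalies.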
8. Neural Network:
This data mining method, or model, is inspired by biological neural networks. It is a collection of neuron-like processing units with weighted connections between them, used to model the relationship between inputs and outputs. Neural networks are used for classification, regression analysis, data processing, and similar tasks. This technique works on three pillars:
- Learning Algorithm (supervised or unsupervised)
- Activation function
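To show how weighted connections, a learning algorithm, and an activation function fit together, here is a minimal sketch of a single neuron (a perceptron) trained with supervised learning. The step activation, the AND-gate training data, and the values of `epochs` and `learning_rate` are all assumptions for illustration; real networks use many units and other activation functions.

```python
def step(z):
    """Activation function: output 1 only when the weighted sum is positive."""
    return 1 if z > 0 else 0

def predict(weights, bias, inputs):
    """Weighted sum of the inputs plus bias, passed through the activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return step(z)

def train_perceptron(samples, epochs=20, learning_rate=1.0):
    """Supervised learning: nudge the weights whenever the prediction is wrong."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias

# Hypothetical training data: the logical AND of two binary inputs.
and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_gate)
outputs = [predict(weights, bias, inputs) for inputs, _ in and_gate]
print("weights:", weights, "bias:", bias, "outputs:", outputs)
```

After training, the learned weights and bias reproduce the AND relationship between inputs and output without that rule ever being written into the code.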
This has been a guide to data mining methods. We have discussed what data mining is and the different types of data mining methods, with examples.