Overview of Data Mining Task Primitives
Data is one of the most valuable assets in any industry, and understanding it means finding its characteristics, patterns, and trends. Data mining provides the methods and functions for these operations. Let’s look at those methods and understand them through simple, real-life scenarios.
Table of Contents
- Overview of Data Mining Task Primitives
- Descriptive Functions
- Classification and Prediction
- Data Mining Task Primitives in Processes
- Example of Data Mining Task Primitive
- Advantages of Data Mining Task Primitive
These methods are divided into two categories: descriptive functions, and classification and prediction.
Descriptive Functions
Descriptive functions summarize the general properties and structure of the data in a database. Common techniques are:
1) Class/Concept Description
Class/concept descriptions are used to describe and understand data. A class description defines the properties of a specific class, while a concept description defines the characteristics shared by one or more classes.
Take the example of a shopping mall whose employees keep a complete record of customers. The records use classes such as “high-spending customers” and “regular customers”, and concepts described by shared behavior, such as “visiting on weekends” and “visiting during sale seasons”. Both kinds of description are derived in two ways:
- Data Characterization: The characteristics of data in the target class are summarized in forms such as pie charts and bar charts, or in rule form, which is called a characteristic rule.
- Data Discrimination: The features of a target class are compared with those of one or more predefined contrasting classes.
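The two derivations can be illustrated with a minimal Python sketch. The customer records, field names, and segment labels below are hypothetical, invented only to show the idea: characterization summarizes one class, and discrimination compares that summary with a contrasting class.

```python
from statistics import mean

# Hypothetical customer records; field names and values are illustrative only.
customers = [
    {"segment": "high-spending", "monthly_spend": 520.0, "visits": 9},
    {"segment": "high-spending", "monthly_spend": 610.0, "visits": 11},
    {"segment": "regular", "monthly_spend": 140.0, "visits": 3},
    {"segment": "regular", "monthly_spend": 180.0, "visits": 5},
]

def characterize(records, segment):
    """Data characterization: summarize the records of one target class."""
    rows = [r for r in records if r["segment"] == segment]
    return {
        "count": len(rows),
        "avg_spend": mean(r["monthly_spend"] for r in rows),
        "avg_visits": mean(r["visits"] for r in rows),
    }

def discriminate(records, target, contrast):
    """Data discrimination: difference between target and contrasting class."""
    t = characterize(records, target)
    c = characterize(records, contrast)
    return {key: t[key] - c[key] for key in ("avg_spend", "avg_visits")}

target_summary = characterize(customers, "high-spending")
difference = discriminate(customers, "high-spending", "regular")
print(target_summary)
print(difference)
```

In practice the summary would usually be shown as a chart or a characteristic rule; a plain dictionary is used here only to keep the sketch short.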
2) Mining of Frequent Patterns
Mining algorithms find frequent patterns, which can be used to increase sales and improve customer satisfaction. Suppose there is a large data set of customers visiting shopping malls and buying products. The algorithm finds that a purchase of trousers is often accompanied by a purchase of a belt. This is called a frequent pattern. The kinds of frequent patterns are:
- Frequent Item Set: a set of items that frequently appear together in a dataset. Products that many customers buy together can be placed together to increase sales.
- Frequent Substructure: a structural form, such as a graph, tree, or lattice, that occurs frequently in a dataset.
- Frequent Subsequence: a sequence of items or actions that occurs frequently in a dataset. It is used to analyze customer behavior by tracking the order of actions taken while shopping.
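As a rough illustration of frequent item sets, the sketch below counts how often each pair of products appears in the same basket and keeps the pairs that reach a minimum count. The baskets and the `min_count` value are assumptions made for the example, and counting every pair directly is a brute-force approach that suits only small data sets.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, invented for illustration.
transactions = [
    {"trousers", "belt", "wallet"},
    {"trousers", "belt"},
    {"trousers", "shirt"},
    {"shirt", "tie"},
    {"trousers", "belt", "shirt"},
]

def frequent_pairs(baskets, min_count):
    """Count every pair of items bought together; keep the frequent ones."""
    counts = Counter()
    for basket in baskets:
        # Sorting gives each pair one canonical order, e.g. ("belt", "trousers").
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}

pairs = frequent_pairs(transactions, min_count=3)
print(pairs)  # {('belt', 'trousers'): 3}
```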
3) Mining of Association
Association mining finds relationships between different items in a large dataset. An association rule uses if-then statements to describe the pattern between purchases of different products. For example, a mall owner tracks product purchases and finds that belts are purchased with trousers 60% of the time, and wallets are purchased with trousers 20% of the time. The first relationship can be written as:
buys(X, “trousers”) => buys(X, “belt”) [support = 1%, confidence = 60%]
Here X is a variable representing a customer. A confidence of 60% means that a customer who buys trousers has a 60% chance of also buying a belt, and a support of 1% means that trousers and a belt are bought together in 1% of all transactions.
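To make the two measures concrete, here is a small sketch of how support and confidence could be computed for the trousers and belt rule. The baskets are invented and chosen so that the confidence comes out at 60%; the support in this toy data is 30%, not the 1% quoted in the rule.

```python
# Hypothetical baskets, invented for illustration.
baskets = (
    [{"trousers", "belt"}] * 3
    + [{"trousers"}] * 2
    + [{"shirt"}] * 5
)

def support(transactions, items):
    """Fraction of all transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= t) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions with the antecedent, the share that also has the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

rule_support = support(baskets, {"trousers", "belt"})
rule_confidence = confidence(baskets, {"trousers"}, {"belt"})
print(f"support = {rule_support:.0%}, confidence = {rule_confidence:.0%}")
```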
4) Mining of Correlations
Correlation mining identifies how two aspects of a business affect each other, and whether the relationship is positive, negative, or absent. For instance, mall owners run promotions such as advertisements and social media campaigns to increase sales. If sales from the mall increase after the promotion, this suggests that the campaign had a positive impact on sales. Based on these results, the owners can decide whether the promotion campaign was worth the money and whether to invest in such a campaign again.
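One common way to quantify such a relationship is the Pearson correlation coefficient, which ranges from -1 (negative) through 0 (none) to +1 (positive). The sketch below is only an illustration: the advertising and sales figures are made up, and the choice of Pearson correlation is an assumption, since the text does not name a specific measure.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

# Hypothetical monthly figures (in thousands), invented for illustration.
ad_spend = [10, 20, 30, 40, 50]
sales = [110, 135, 160, 170, 205]

r = pearson(ad_spend, sales)
print(f"correlation between ad spend and sales: {r:.2f}")
```

A value close to +1 supports the idea that sales rise with promotion spending, but correlation alone does not prove that the promotion caused the increase.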
5) Mining of clusters
Cluster mining identifies groups of records that share similar characteristics. For example, people aged 20 to 40 may have a particular taste in clothing: this group spends more, prefers clothes from expensive brands, and shops more often than other groups, such as people aged 40 to 60, who prefer simple clothing and do not consider the brand when buying clothes. Knowing this, brands can target the first group through emails and social media to increase their sales.
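A simple way to find such groups is k-means clustering. The sketch below is a minimal one-dimensional version that groups customers by age only; the ages and the two starting centroids are assumptions chosen for illustration, and k-means is one of several clustering techniques rather than the only option.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Minimal 1-D k-means: assign each value to its nearest centroid, then re-average."""
    groups = [[] for _ in centroids]
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            groups[nearest].append(v)
        # Move each centroid to the mean of its group (keep it if the group is empty).
        centroids = [sum(g) / len(g) if g else c for g, c in zip(groups, centroids)]
    return centroids, groups

# Hypothetical customer ages, invented for illustration.
ages = [22, 25, 31, 38, 45, 52, 58, 60]
centroids, groups = kmeans_1d(ages, centroids=[20, 60])
print(centroids)  # [29.0, 53.75]
print(groups)     # [[22, 25, 31, 38], [45, 52, 58, 60]]
```

Real customer segmentation would normally use several attributes at once, such as age, spending, and shopping frequency.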
Classification and Prediction
Classification means dividing the data set into groups and assigning each group a class or label, based on certain characteristics of the data. Prediction means estimating future values based on data available from the past. Some techniques used are:
1) Classification (IF – THEN) rule
As the name suggests, IF a certain condition is satisfied, THEN a certain decision is made. For instance, a loan company uses IF-THEN rules to decide who should be given a loan. The company has applicants’ data such as age, income, past financial records, and credit score. When someone applies for a loan, the algorithm checks the applicant’s past record. If all the conditions, such as ‘good credit score’ and ‘good income’, are fulfilled, then the application is sent for further processing. Otherwise, it is rejected at that initial step.
IF-THEN classification rules are also commonly used to separate emails into spam and not spam. If the algorithm detects specific patterns in the mail, such as “Surprise Offer” or “Lucky Winner”, then the mail is marked as “spam”; otherwise it is marked as “not spam”.
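Both examples can be written directly as IF-THEN rules in code. This is a minimal sketch: the function names, the list of spam phrases, and the numeric thresholds for a “good” credit score and income are hypothetical values chosen for illustration.

```python
# Phrases taken from the example above; a real filter would use many more signals.
SPAM_PHRASES = ("surprise offer", "lucky winner")

def classify_email(text):
    """IF the mail contains a known spam phrase THEN 'spam' ELSE 'not spam'."""
    lowered = text.lower()
    if any(phrase in lowered for phrase in SPAM_PHRASES):
        return "spam"
    return "not spam"

def screen_loan(credit_score, annual_income, min_score=700, min_income=40000):
    """IF every condition holds THEN forward the application ELSE reject it.

    The thresholds are assumptions, not real lending criteria.
    """
    if credit_score >= min_score and annual_income >= min_income:
        return "forward for further processing"
    return "reject"

print(classify_email("Congratulations, you are our Lucky Winner!"))
print(screen_loan(credit_score=720, annual_income=55000))
print(screen_loan(credit_score=650, annual_income=55000))
```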
2) Decision Trees
Based on past data, the decision tree algorithm predicts future results as positive, negative, or null. Let’s consider a scenario. A gym has 500 members, and the owner wants to set up new plans for the coming year. For this, the owner needs to know how many of the existing 500 members will continue next year. For each member, say “Alex”, the algorithm checks criteria such as: Which subscription plan has he taken (monthly, quarterly, etc.)? Does he pay for cardio and spa along with the gym? How many times a week does he come to the gym? Does he buy protein shakes and other accessories from the gym itself? Does he pay for a personal trainer? By checking these criteria for all the members, the decision tree algorithm produces a result that helps the gym owner decide whether to invest in gym renovation, gym promotions, a new area for the gym, a larger inventory of gym accessories, more personal trainers, etc.
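A decision tree has to choose which question to ask at each step. One widely used criterion is Gini impurity: the tree prefers the split that leaves each branch as pure as possible. The sketch below finds the best threshold for a single question (“how many visits per week?”) on invented membership data. It shows only one split, not a full tree, and the choice of Gini impurity is an assumption, since the text does not name a criterion.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means perfectly pure)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_threshold(values, labels):
    """Try each value as a 'value < t' split; return (threshold, weighted impurity)."""
    best = (None, float("inf"))
    for t in sorted(set(values)):
        left = [lab for v, lab in zip(values, labels) if v < t]
        right = [lab for v, lab in zip(values, labels) if v >= t]
        if not left or not right:
            continue  # a split must put records on both sides
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical members: weekly visits and whether they renewed (1) or not (0).
visits_per_week = [1, 1, 2, 3, 4, 5, 5, 6]
renewed = [0, 0, 0, 1, 1, 1, 1, 1]

threshold, impurity = best_threshold(visits_per_week, renewed)
print(threshold, impurity)  # 3 0.0
```

In this toy data, members who visit at least 3 times a week all renewed, so the split “visits < 3” separates the two groups perfectly.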
3) Mathematical formulae
This approach uses mathematical models and equations to analyze past data and make future predictions accordingly. Such models and equations are used for forecasting, real estate pricing, sales prediction, etc. Let us look at a commonly used expression with a real-life business example.
1) Linear regression – This equation is used to predict property prices from data such as the property’s location and its area.
Price = β1 + β2 * sq. foot
Here, Price is the dependent variable, and sq. foot is the independent variable.
β1 is the intercept: the starting value of Price when the sq. foot value is 0. β2 is the slope: how much the dependent variable changes for each unit change in the independent variable.
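As an illustration, the intercept and slope of such a line can be estimated from past sales with ordinary least squares. The areas and prices below are invented and lie exactly on a straight line, so the fitted values are easy to check by hand; real data would be noisier.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical past sales: area in sq. foot and sale price.
areas = [500, 1000, 1500, 2000]
prices = [125000, 200000, 275000, 350000]

intercept, slope = fit_line(areas, prices)
predicted = intercept + slope * 1200  # estimated price of a 1200 sq. foot property
print(intercept, slope, predicted)
```

With this data the fit gives an intercept of 50000 and a slope of 150 per sq. foot, so a 1200 sq. foot property is estimated at 230000.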
4) Neural networks
A neural network is inspired by the way the human brain works. Its main components are neurons, layers, and edges (connections).
A) Neurons: the basic units that receive, hold, and pass on information.
B) Layers: Neurons are arranged in layers, of which there are three types: input, hidden, and output. The input layer receives the initial data, the hidden layers process it, and the output layer produces the result.
C) Edges: Neurons are interconnected by edges, and each edge has a weight.
D) Process (training): The initial data is received and processed by the different layers. During training, the model adjusts its edge weights to improve its prediction accuracy. Once training is complete, the model can make predictions on new data.
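The idea of adjusting weights during training can be shown with the smallest possible case: a single neuron with two weighted inputs and a bias. This is a deliberately simplified sketch, not a multi-layer network; the logical-OR training data, learning rate, and epoch count are assumptions chosen so the example stays short.

```python
import math

def sigmoid(z):
    """Squash any number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def train_neuron(samples, epochs=2000, learning_rate=0.5):
    """Train one sigmoid neuron with gradient descent on log loss."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            output = sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)
            error = output - target  # how far the prediction is from the label
            # Nudge each weight against the error, scaled by its input.
            weights = [w - learning_rate * error * x for w, x in zip(weights, inputs)]
            bias -= learning_rate * error
    return weights, bias

def predict(weights, bias, inputs):
    """Return 1 if the neuron's output is at least 0.5, else 0."""
    activation = sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)
    return 1 if activation >= 0.5 else 0

# Toy training data: the logical OR of two inputs.
samples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights, bias = train_neuron(samples)
predictions = [predict(weights, bias, inputs) for inputs, _ in samples]
print(predictions)
```

A full network stacks many such neurons into input, hidden, and output layers and updates all their weights in the same spirit.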
Data Mining Task Primitives in Processes
Task primitives are used to construct a data mining task, which is input to the system as a data mining query. Through these primitives, the user communicates with the data mining system to direct the mining process. Data mining task primitives provide an efficient and reusable approach, and are as follows:
- Set of task-relevant data:
It includes the database attributes and data warehouse dimensions of interest. Only the relevant data is extracted, through the following steps:
- Data selection: Selecting the appropriate data, matching the task requirement, from the different data sets and sources.
- Data gathering: Data can be selected from more than one data set or data source. After selection, it is crucial to gather the data in one place.
- Data integration: Once the data is collected and stored in one place, it must be brought into one form. Data from different sources is converted into a single format so that processing runs smoothly.
- Data cleaning: Once the integration stage is completed, unwanted parts, errors, and irrelevant components are removed from the data.
- Data sampling: After all the above steps, the data is in a proper format, but subsets of it are still needed so that only the required portion of the data is used and processed at a time.
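The preparation steps above can be sketched as a small pipeline. Everything in this example is hypothetical: the two sources, their field names, and the rule that a row with a missing amount counts as an error are assumptions made only to show gathering, integration, cleaning, and sampling in order.

```python
import random

# Two hypothetical sources that store the same kind of data in different formats.
store_sales = [
    {"customer": "C1", "amount": "120.50", "date": "2024-01-05"},
    {"customer": "C2", "amount": "", "date": "2024-01-06"},  # missing amount
]
online_sales = [
    {"cust_id": "C3", "total": 89.0, "day": "2024-01-05"},
    {"cust_id": "C1", "total": 45.25, "day": "2024-01-07"},
]

def integrate(store_rows, online_rows):
    """Gather both sources in one list and map them to one common format."""
    unified = []
    for r in store_rows:
        unified.append({"customer": r["customer"], "amount": r["amount"],
                        "date": r["date"], "channel": "store"})
    for r in online_rows:
        unified.append({"customer": r["cust_id"], "amount": r["total"],
                        "date": r["day"], "channel": "online"})
    return unified

def clean(rows):
    """Drop rows with a missing amount and convert every amount to float."""
    cleaned = []
    for r in rows:
        if r["amount"] in ("", None):
            continue
        cleaned.append({**r, "amount": float(r["amount"])})
    return cleaned

def sample(rows, k, seed=0):
    """Draw a reproducible random subset of at most k rows."""
    rng = random.Random(seed)
    return rng.sample(rows, min(k, len(rows)))

prepared = clean(integrate(store_sales, online_sales))
subset = sample(prepared, k=2)
print(len(prepared), len(subset))  # 3 2
```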
- Kind of knowledge:
After preparing the data, it is crucial to identify the kind of knowledge to be extracted from it, so that the right data mining functions are performed on the relevant data: characterization, discrimination, association and correlation analysis, classification, prediction, or clustering. Let us look at a few types:
- Descriptive knowledge: From the available dataset, a mall owner wants to find out the characteristics of the mall’s customers, such as what they buy, what their expense range is, and whether they prefer branded or non-branded items. This type of information is called descriptive knowledge.
- Predictive knowledge: Based on an applicant’s past financial records, a bank analyzes whether the person applying for a loan is capable of repaying it, and accordingly accepts or rejects the loan application. This type of information is called predictive knowledge.
- Associative knowledge: Based on available customer data, a mall owner wants to find out which products go well together, so that if they are kept in a single section, there is a high probability that customers will buy both. This type of information is called associative knowledge.
- Background knowledge to be used in the discovery process:
During the data mining process, existing knowledge, such as concept hierarchies and user beliefs about relationships in the data, is used to guide the discovery. Some of the knowledge and resources used are as follows:
- Domain knowledge: While extracting data, the person should have proper knowledge of the domain from which the data is being extracted. This includes understanding standard industry terms, the business, and its working methodology.
- Past data: Data that has already been extracted can serve as a foundation and help in understanding the type of data and the industry.
- Existing models and equations: Existing models also help in understanding how the data behaves and what outcomes to expect from it.
- Interestingness measures and thresholds for pattern evaluation:
Interestingness measures: Measures such as support (how frequently a pattern occurs), confidence (the strength of the association between elements of the data), lift (how much more often the elements occur together than would be expected if they were independent), and an overall interest score combining support, confidence, and lift are used to identify the patterns in the data set.
Thresholds for pattern evaluation: After the patterns are scored for interestingness, thresholds are used as filters to keep only the desired ones. Thresholds are predefined values, set in advance by data miners according to the use case and industry. If high thresholds are set, only a few patterns will meet them, so fewer patterns are extracted.
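The sketch below shows how lift can be computed alongside support and confidence, and how thresholds then filter the candidate rules. The baskets, the candidate rules, and the threshold values are all assumptions chosen for illustration. Lift is calculated here as the rule’s confidence divided by the overall support of the consequent, so a value above 1 means the two products appear together more often than chance would suggest.

```python
# Hypothetical baskets, invented for illustration.
mall_baskets = (
    [{"trousers", "belt"}] * 4
    + [{"trousers", "wallet"}]
    + [{"trousers"}]
    + [{"shirt", "wallet"}] * 2
    + [{"shirt"}] * 2
)

def rule_metrics(baskets, antecedent, consequent):
    """Support, confidence, and lift for the rule antecedent => consequent."""
    n = len(baskets)
    both = sum(1 for b in baskets if antecedent in b and consequent in b)
    ante = sum(1 for b in baskets if antecedent in b)
    cons = sum(1 for b in baskets if consequent in b)
    confidence = both / ante if ante else 0.0
    lift = confidence / (cons / n) if cons else 0.0
    return {"support": both / n, "confidence": confidence, "lift": lift}

def interesting_rules(baskets, candidates, min_support, min_confidence, min_lift=1.0):
    """Keep only the candidate rules that pass every threshold."""
    kept = {}
    for antecedent, consequent in candidates:
        m = rule_metrics(baskets, antecedent, consequent)
        if (m["support"] >= min_support
                and m["confidence"] >= min_confidence
                and m["lift"] > min_lift):
            kept[(antecedent, consequent)] = m
    return kept

candidates = [("trousers", "belt"), ("trousers", "wallet")]
kept_rules = interesting_rules(mall_baskets, candidates,
                               min_support=0.3, min_confidence=0.6)
print(list(kept_rules))  # [('trousers', 'belt')]
```

Raising `min_support` or `min_confidence` further would leave no rules at all, which is the trade-off described above: stricter thresholds yield fewer patterns.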
- Representation for visualizing the discovered pattern:
Discovered patterns are sometimes hard to understand, but they become much easier to grasp when viewed as diagrams. Visualization techniques are used to represent the data in a way that helps reveal important relationships and patterns within it. These techniques include rules, tables, reports, charts, graphs, decision trees, and cubes.
Example of Data Mining Task Primitive
Scenario – Mall, sales data, and its analysis.
- Set of task-relevant data: Over a period of time, the mall has collected a large amount of data, such as data on its employees, mall infrastructure, logistics, customer information, sales records, and festive offers. From this data, find the relevant data that will help analyze the sales trend. The most important data will be the sales made in the last few months and the customer information. This data is gathered, unwanted components are removed, and the data is then formatted.
- Kind of knowledge: From the data, experts will find patterns in customer priorities with respect to product type, product price range, shopping frequency, etc.
- Background knowledge: Experts will apply their knowledge of how mall sales function and what extra measures must be taken to improve sales. They can also draw on any past experience with other malls and on any specific models through which sales can be improved.
- Interestingness measures and thresholds for pattern evaluation:
Support – Calculating how frequently each item is sold, to find the most-sold and least-sold items in the mall.
Confidence – Identifying how often product B is sold when product A is sold.
Lift – Calculating how much more often product B is sold together with product A than would be expected from product B’s overall sales.
- Data visualization: The extracted patterns are presented using charts and diagrams for a better understanding of the relationships in the data.
[Figure: sales revenue in a year, shown as a line chart and a bar chart]
Advantages of Data Mining Task Primitive
Data mining task primitives offer a structured approach that gives a broad perspective on data extraction and analysis. Starting from extraction, the data is selected, formatted, and cleaned so that only precise, required data remains. Experts with domain-specific knowledge and experience then find patterns that explain the sales of a particular business. These patterns show which sections need attention and which are already working well. Data analysis makes it possible to view the data from different angles and understand different aspects of the business. This improves efficiency and gives the industry a path to long-term improvement.
In any type of industry, data is the most crucial component. It allows us to chart the state of the industry and its future, so it must be used with care. To do this, data mining task primitives are used to extract data from different sources, then format and clean it for further use. The patterns extracted can be descriptive, predictive, associative, etc., and experts use different techniques and methods to extract them. Once the patterns are known, experts analyze them with the help of charts and diagrams for better understanding, and decide what steps should be taken to enhance the business and its strategies. Overall, data mining plays an important role in predicting the outcomes of ongoing practices and in showing how to improve when something is going wrong with the system.
This has been a guide to Data Mining Task Primitives. Here we have discussed the Classification, Descriptive Functions, advantages, and Processes.