Text Mining Introduction
Text is the most common means through which information is exchanged today, but extracting meaning from text is not an easy job. A good business intelligence tool can help make that information easier to understand.
What is Text Mining
Text Mining, also known as Text Analytics, is the process of extracting information from a set of texts. It is designed to help businesses discover valuable knowledge in text-based content, which can take the form of Word documents, emails or postings on social media.
In other words, Text Mining is the use of automated methods to understand the knowledge available in text documents.
Text Mining can also be used to help computers process both structured and unstructured data. Unstructured (qualitative) data cannot readily be measured in numbers; it usually carries information such as colour, texture and text. Structured (quantitative) data can be measured easily.
Text mining is an interdisciplinary field that draws on information retrieval, data mining, machine learning, statistics and other areas. It is a slightly different field from data mining.
Advantages of Text Mining
Text Mining offers many advantages. They are listed below.
- It saves time and resources and works through text more efficiently than human readers
- It helps to track opinions over time
- Text Mining helps to summarize documents
- Text analytics helps to extract concepts from text and present them in a simpler way
- Text indexed through Text Mining can be used in predictive analytics
- You can plug in any vocabulary to use the terminology of your area of interest
Uses of Text Mining
- The names of different entities and the relationships between them can be found in the text using various techniques.
- It helps to extract patterns from large amounts of unstructured data.
- Systematic review of literature: Text Mining supports in-depth research of text, finding key themes and highlighting repeated terms and the topics that are popular over a period of time.
- Testing of hypotheses: through text mining a particular hypothesis can be tested to see whether the documents confirm or refute it. Usually an established belief is tested against the documents first.
Importance of Text Mining
- Text Mining enables better and smarter decision making
- It helps to solve knowledge discovery problems in different areas of business
- Through text mining you can visualize the data in many ways, such as HTML tables, charts and graphs
- It is a strong productivity tool that can deliver results faster than manual analysis
- Text mining tools are used by knowledge-driven organizations, both large and small
Applications of Text Mining
Analyzing open ended survey responses
Open-ended survey questions let respondents give their views or opinions without any constraints. This reveals more about customers’ opinions than relying on structured questionnaires alone. Text mining can be used to analyze such free-text responses.
Automatic processing of messages, emails
Text Mining is also widely used to classify text. It can filter unwanted mail by looking for certain words or phrases, and such mail is then automatically moved to spam. The same kind of automatic system can classify selected mail and route it to the corresponding department. A Text Mining system can also alert the email user so that mail containing offending words or content can be removed.
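The filtering and routing described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming a hand-written phrase list and routing table (SPAM_PHRASES, DEPARTMENT_KEYWORDS and route_message are hypothetical names, not part of any product); a real system would usually learn such rules from labelled mail.

```python
import re

# Hypothetical trigger phrases; a real filter would learn these from labelled mail.
SPAM_PHRASES = {"free money", "act now", "winner"}

# Hypothetical routing table: keyword -> department.
DEPARTMENT_KEYWORDS = {
    "invoice": "billing",
    "refund": "billing",
    "password": "support",
    "login": "support",
}


def normalize(text):
    """Lowercase the text and collapse every run of non-letters into one space."""
    return re.sub(r"[^a-z]+", " ", text.lower()).strip()


def route_message(text):
    """Return 'spam', a department name, or 'general' for a message body."""
    cleaned = normalize(text)
    padded = f" {cleaned} "  # padding lets us match whole words and phrases only
    if any(f" {phrase} " in padded for phrase in SPAM_PHRASES):
        return "spam"
    for word in cleaned.split():
        if word in DEPARTMENT_KEYWORDS:
            return DEPARTMENT_KEYWORDS[word]
    return "general"


label = route_message("I need a refund for my last order.")  # routed to billing
```

Spam is checked before routing so that an unwanted message never reaches a department.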
Analyzing warranty or insurance claims
In most business organizations, information is collected mainly in the form of text. For example, in a hospital the patient interviews may be summarized briefly in text form, and the reports are also text. Nowadays these notes are collected electronically so that they can be fed easily into text mining algorithms. The records can then be used to diagnose the actual situation.
Investigating competitors by crawling their web sites
Another important application of Text Mining is processing the contents of web pages in a particular domain. The text mining system automatically builds a list of the terms used on a site, which shows the most important terms the website relies on. In this way you can learn about your competitors’ capabilities, which can help you run your own business more effectively.
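As a rough sketch of the term list mentioned above, the Python snippet below counts the most frequent terms in a piece of page text. The page text, the short stop-word list and the function name top_terms are assumptions made for illustration; crawling and HTML cleaning are left out.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "for", "in", "our", "we", "is"}


def top_terms(page_text, n=3):
    """Return the n most frequent non-stop-word terms with their counts."""
    words = re.findall(r"[a-z]+", page_text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)


# Made-up page text standing in for a crawled competitor page.
page = (
    "We offer cloud hosting and cloud backup. "
    "Our hosting plans include backup for the cloud."
)
terms = top_terms(page)  # "cloud" is the most frequent term in this sample
```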
Other applications of Text Mining include the following
- Business Intelligence
- E-Discovery
- Records Management
- National security or intelligence work
- Social Media Monitoring
Techniques used in Text Mining
There are five basic techniques used in a Text Mining system: information extraction, categorization, clustering, visualization and summarization. They are discussed in detail below.
Information extraction analyzes unstructured text by finding the important words and the relationships between them. It uses pattern matching to find ordered sequences in the text, which helps to transform unstructured text into a structured form. The technique involves language processing modules and is mostly used where there is a large amount of data.
[Picture: the Information Extraction process]
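To make the pattern-matching idea concrete, here is a minimal Python sketch that turns a free-text note into a structured record. The two regular expressions, the sample note and the name extract_record are illustrative assumptions; real information extraction combines many more rules with language processing.

```python
import re

# Illustrative patterns only; they do not cover every valid email or date format.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def extract_record(text):
    """Turn free text into a structured record of emails and ISO-style dates."""
    return {
        "emails": EMAIL_PATTERN.findall(text),
        "dates": DATE_PATTERN.findall(text),
    }


note = "Contact jane.doe@example.com before 2024-03-15 about the claim."
record = extract_record(note)
# record["emails"] holds the address and record["dates"] holds the date found in the note
```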
Categorization classifies a text document under one or more categories. It learns from input-output examples, that is, documents whose categories are already known. The categorization process includes pre-processing, indexing, dimensionality reduction and classification. Text can be categorized using techniques such as the Naive Bayes classifier, decision trees, the Nearest Neighbour classifier and Support Vector Machines.
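The Naive Bayes classifier named above can be sketched in plain Python as follows. This is a simplified illustration, assuming whitespace tokenization and a made-up four-document training set; the class name NaiveBayesTextClassifier is hypothetical, and the pre-processing, indexing and dimensionality reduction steps are omitted.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Very simple tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            # log P(class) + sum of log P(word | class)
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                if word not in self.vocabulary:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(self.vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up input-output examples (text, category).
train_texts = [
    "win a free prize now",
    "free offer claim your prize",
    "meeting agenda for monday",
    "please review the project report",
]
train_labels = ["spam", "spam", "ham", "ham"]
model = NaiveBayesTextClassifier().fit(train_texts, train_labels)
prediction = model.predict("claim your free prize")
```

Working with log probabilities avoids the numerical underflow that multiplying many small probabilities would cause on longer documents.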
Clustering is used to group text documents that have similar contents. It produces partitions called clusters, and each cluster holds a number of documents with similar contents. Clustering aims to ensure that no document is omitted from a search and that all documents with similar contents are retrieved. K-means is a frequently used clustering technique. Clustering also compares the clusters and measures how closely the documents are connected to each other. Companies use this technique to organize databases containing thousands of similar documents.
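A minimal sketch of K-means on word-count vectors is shown below. It assumes a made-up set of four short documents and a deterministic choice of starting centroids (evenly spaced documents) so that the run is repeatable; practical implementations usually use random or smarter initialization and weighted features such as TF-IDF.

```python
from collections import Counter


def vectorize(docs):
    """Turn each document into a list of term counts over a shared vocabulary."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    counts = [Counter(d.lower().split()) for d in docs]
    return [[c[w] for w in vocab] for c in counts]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=10):
    """Plain k-means; returns one cluster index per vector."""
    # Deterministic start (an assumption of this sketch): evenly spaced documents.
    centroids = [list(vectors[i * len(vectors) // k]) for i in range(k)]
    assignment = []
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


docs = [
    "the team won the football match",
    "football team scores in the final match",
    "the bank raised interest rates",
    "interest rates and bank loans rise",
]
labels = kmeans(vectorize(docs), k=2)
# The two football documents share one cluster and the two banking documents the other.
```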
Visualization is used to simplify the process of finding relevant information. It uses text flags to represent documents or groups of documents and uses colours to indicate their compactness. Visualization helps to display textual information in a more attractive way.
[Picture: the Visualization technique]
Summarization reduces the length of a document and presents its details in brief. It helps users decide whether a document is worth reading and lets them understand the content at a glance. The summary stands in for the entire set of documents. Large text documents can be summarized easily and quickly: humans take much longer to read and then summarize a document, whereas this technique is very fast. It also helps to highlight the major points in a document.
[Picture: the Summarization process]
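One simple way to summarize automatically is extractive: score each sentence by how frequent its words are in the whole text and keep the best ones. The Python sketch below assumes that approach, with a small hand-made stop-word list and a made-up sample text; it is not the only way to summarize, and the names summarize and content_words are hypothetical.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "that", "on", "for"}


def content_words(text):
    """Lowercased words of the text with stop words removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def summarize(text, max_sentences=1):
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    frequency = Counter(content_words(text))

    def score(sentence):
        # Average document-wide frequency of the sentence's content words.
        tokens = content_words(sentence)
        return sum(frequency[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in keep)


text = (
    "Text mining finds patterns in text. "
    "The weather was pleasant yesterday. "
    "Text mining turns text into structured data."
)
summary = summarize(text)
```

Averaging the word scores, rather than summing them, keeps long sentences from winning purely because of their length.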
Methods and Models Used in Text Mining
Based on information retrieval, Text Mining has four main methods.
Term Based Method (TBM)
A term in a document is a word that has semantic meaning. In this method the entire set of documents is analyzed on the basis of terms. One main disadvantage of this method is the problem of synonymy and polysemy. Synonymy is where multiple words have the same meaning. Polysemy is where a single word has more than one meaning.
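The synonymy and polysemy problem can be seen with a very small term-matching example in Python. The overlap measure used here (Jaccard overlap of term sets) and the sample phrases are assumptions chosen only to illustrate the point.

```python
def term_overlap(query, document):
    """Jaccard overlap between the sets of terms in two texts."""
    q = set(query.lower().split())
    d = set(document.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0


# Synonymy: same meaning, no shared terms, so the overlap is zero.
synonym_score = term_overlap("cheap car", "affordable automobile")

# Polysemy: "bank" means different things, yet the texts still overlap.
polysemy_score = term_overlap("river bank", "bank loan")
```

A purely term-based method therefore misses the first pair and wrongly links the second.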
Phrase Based Method (PBM)
In this method the document is analyzed on the basis of phrases, which are less ambiguous and more discriminative than single terms. The disadvantages of this method include
- Phrases have inferior statistical properties compared with terms
- Phrases have a low frequency of occurrence
- There is a large number of noisy phrases
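The low frequency of phrases compared with single terms can be illustrated with a short Python sketch that counts terms and two-word phrases (bigrams) in the same text. The sample sentence is made up, and treating every adjacent word pair as a phrase is a simplifying assumption.

```python
from collections import Counter


def ngrams(tokens, n):
    """All runs of n consecutive tokens, joined into phrases."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "text mining uses data mining methods and text mining uses statistics".split()

term_counts = Counter(tokens)
phrase_counts = Counter(ngrams(tokens, 2))

top_term = term_counts.most_common(1)[0]  # the single most frequent term and its count
text_mining_count = phrase_counts["text mining"]  # how often the phrase occurs
```

Even in this tiny sample the most frequent term occurs more often than any phrase, and most of the bigrams (such as "methods and") are noise rather than meaningful phrases.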
Concept Based Method (CBM)
In this method the document is analyzed at the sentence level and the document level. The method has three main components. The first component examines the meaningful parts of the sentences. The second component produces a conceptual ontological graph to describe the structures. The third component extracts the top concepts based on the first two components. This method can differentiate between important and unimportant words.
Pattern Taxonomy Method (PTM)
In this method the document is analyzed on the basis of patterns. Patterns in a document can be found using data mining techniques such as association rule mining, sequential pattern mining, frequent item set mining and closed pattern mining. This method uses two processes: pattern deploying and pattern evolving. It is reported to perform better than the other methods described here.
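Frequent item set mining, one of the techniques named above, can be sketched in Python by counting how many documents contain each set of terms. This is a brute-force illustration on made-up documents, assuming each document is treated as a set of words; it does not implement the pattern deploying or pattern evolving processes, and the name frequent_term_sets is hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_term_sets(docs, size, min_support):
    """Count term sets of a given size and keep those found in >= min_support documents."""
    counts = Counter()
    for doc in docs:
        terms = sorted(set(doc.lower().split()))  # each term counts once per document
        counts.update(combinations(terms, size))
    return {terms: n for terms, n in counts.items() if n >= min_support}


docs = [
    "global warming climate change",
    "climate change policy",
    "global climate change report",
]
patterns = frequent_term_sets(docs, size=2, min_support=3)
# Only the pair ("change", "climate") appears in all three documents.
```

Enumerating every combination is fine for a toy corpus, but it grows quickly with document length, which is why practical miners prune candidates instead.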
How does Text Mining work
By now it should be clear that text mining helps you understand text better. A Text Mining system converts the words in unstructured data into numerical values. This helps to identify patterns and relationships that exist within a large amount of text. Text mining often uses computational algorithms to read and analyze textual information; without it, understanding text easily and quickly would be difficult. Text can be mined in a systematic and comprehensive way, and information about the business can be captured automatically. The steps in the text mining process are listed below.
Step 1 : Information Retrieval
This is the first step in the process of text mining. It uses a search engine to find the collection of texts, also known as a corpus, which might need some conversion. The texts should also be brought together in a particular format that is helpful for users to understand. XML is commonly used as the format for text mining.
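As a small illustration of bringing a corpus into one XML format, the Python sketch below uses the standard library module xml.etree.ElementTree. The element names corpus and document, the id attribute and the sample texts are assumptions for this example, not a standard schema.

```python
import xml.etree.ElementTree as ET


def corpus_to_xml(documents):
    """Serialize a mapping of document id -> text into one XML string."""
    root = ET.Element("corpus")
    for doc_id, text in documents.items():
        node = ET.SubElement(root, "document", id=doc_id)
        node.text = text
    return ET.tostring(root, encoding="unicode")


# Made-up retrieved texts keyed by a hypothetical document id.
retrieved = {"d1": "First retrieved text", "d2": "Second retrieved text"}
corpus_xml = corpus_to_xml(retrieved)

# Later steps can parse the same string back into a tree.
parsed = ET.fromstring(corpus_xml)
```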
Step 2 : Natural Language Processing
This step allows the system to perform grammatical analysis of each sentence in order to read the text. It also analyzes the structure of the text.
Step 3 : Information extraction
This is the third stage, in which the text is marked up in order to identify its meaning. In this stage metadata about the text is added to the database, including the names or locations mentioned in it. This lets the search engine retrieve the information and find relationships between texts using their metadata.
Step 4 : Data Mining
The final stage is data mining using different tools. This step finds similarities between pieces of information that have the same meaning, which would otherwise be difficult to find. Text Mining is a tool that speeds up the research process and helps to test queries.
Text Mining includes the following elements
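One common way to measure similarity between texts is the cosine similarity of their word-count vectors. The Python sketch below assumes that measure and whitespace tokenization; it compares surface terms only, so texts with the same meaning but different wording would need the metadata or concepts from the earlier steps.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the word-count vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


related = cosine_similarity("text mining tools", "text mining methods")
unrelated = cosine_similarity("text mining tools", "weather forecast today")
```

A score near 1 means the two texts use almost the same words, and a score of 0 means they share none.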
- Text Categorization
- Text Clustering
- Concept/entity extraction
- Granular taxonomies
- Sentiment Analysis
- Document Summarization
- Entity Relation Modelling
Challenges of Text Mining
The main challenge faced by a Text Mining system is natural language, which suffers from ambiguity. Ambiguity means that one term can have several meanings or one phrase can be interpreted in various ways, so different meanings are obtained.
Another limitation is that Information Extraction systems involve semantic analysis. As a result the full text is not presented; only a limited part of the text is shown to users. These days, however, there is a need for deeper text understanding.
Text Mining is also limited by copyright legislation. There are many restrictions on text mining a document, most of which concern the rights of the copyright holders. Most texts are not openly available, and in such cases permission is required from the respective authors, publishers and other related parties.
One more limitation is that text mining does not generate new facts, and it is not an end process in itself.
Text mining, or text analytics, is a booming technology, but the results and depth of analysis still vary from business to business. An organization can use text mining to gain knowledge about content-specific values.