Differences Between Text Mining and Text Analytics
Structured data has been around since the early 1900s, but what makes text mining and text analytics special is that they leverage information from unstructured data through Natural Language Processing (NLP). Once this unstructured text is converted into semi-structured or structured data, all the usual data mining algorithms, such as statistical and machine learning algorithms, can be applied to it.
Even Donald Trump was able to leverage data and convert it into information that helped him win the US presidential election; strictly speaking, his staff did the work rather than Trump himself. A very good article on the topic is https://fivethirtyeight.com/features/the-real-story-of-2016/.
Many businesses have started using text mining to extract valuable input from publicly available text. For example, a product-based company can use Twitter or Facebook data to find out how well or badly its product is doing, using sentiment analysis. In the early days, processing the text or even running the machine learning algorithms could take days, but with the introduction of tools such as Hadoop, Azure, KNIME, and other big data processing software, text mining has gained enormous popularity in the market. One of the best examples of text analytics using association mining is Amazon’s recommendation engine, which automatically shows customers what other people bought along with a particular product.
One of the biggest challenges in applying text mining tools to material that is not in a digital format is the process of digitizing it. Old archives and many important documents that exist only on paper are sometimes read through OCR (Optical Character Recognition), which introduces many errors, and sometimes the data is entered manually, which is prone to human mistakes. The effort is worthwhile because the digitized text may reveal insights that are not visible from traditional reading.
Some of the steps of text mining are listed below:
- Information Retrieval
- Data Preparation and Cleaning
- Stop-word, number, and punctuation removal
- Convert to lowercase
- POS tagging
- Create text corpus
- Term-Document matrix
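To make the steps above concrete, here is a minimal Python sketch of the cleaning and term-document matrix steps. It is only an illustration under some assumptions: the stop-word list and the two sample documents are made up, POS tagging is skipped, and `clean` and `term_document_matrix` are hypothetical helper names, not functions from any library.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"is", "the", "and", "a", "an", "of", "to", "in", "it"}

def clean(text):
    """Lowercase, strip numbers and punctuation, then drop stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # keep only letters and whitespace
    return [tok for tok in text.split() if tok not in STOP_WORDS]

def term_document_matrix(docs):
    """Return (sorted vocabulary, one row of term counts per document)."""
    counts = [Counter(clean(doc)) for doc in docs]
    vocab = sorted(set().union(*counts))
    rows = [[c[term] for term in vocab] for c in counts]
    return vocab, rows

# Made-up example documents (the text corpus).
docs = [
    "The battery is great, and the screen is great too!",
    "Battery died in 2 days.",
]
vocab, rows = term_document_matrix(docs)
print(vocab)  # ['battery', 'days', 'died', 'great', 'screen', 'too']
for row in rows:
    print(row)  # [1, 0, 0, 2, 1, 1] then [1, 1, 1, 0, 0, 0]
```

Here each row is one document and each column is one term; transposing the rows gives the terms-as-rows layout instead.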
The steps in text analytics, which are applied after the term-document matrix is prepared, are listed below:
- Modeling (this may include inferential, predictive, or prescriptive models)
- Training and evaluation of models
- Application of these Models
- Visualizing the Models
The key point to remember is that text mining always precedes text analytics.
Head to Head Comparison Between Text Mining and Text Analytics (Infographics)
Below are the 5 comparisons between Text Mining and Text Analytics:
Key Differences Between Text Mining and Text Analytics
Let’s differentiate text mining and text analytics based on the steps involved in a few applications where both are applied:
• Classification of documents
Here, the text mining steps are tokenization, stemming and lemmatization, removing stop words and punctuation, and finally computing the term frequency matrix or document frequency matrices.
Tokenization – The process of splitting the whole data (corpus) into smaller chunks, usually single words, is known as tokenization (as used in the N-gram model or the bag-of-words model).
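As a rough illustration of tokenization and N-grams, the sketch below splits a sentence into word tokens and then builds bigrams from them. The regular expression and the helper names `tokenize` and `ngrams` are assumptions chosen for this example; real tokenizers handle many more cases (hyphens, numbers, emoji and so on).

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("Text mining precedes text analytics")
bigrams = ngrams(tokens, 2)
print(tokens)   # ['text', 'mining', 'precedes', 'text', 'analytics']
print(bigrams)  # [('text', 'mining'), ('mining', 'precedes'), ('precedes', 'text'), ('text', 'analytics')]
```

With `n=1` the result is just the single-word tokens, which is what a bag-of-words model counts.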
Stemming and Lemmatization – For example, the words big, bigger, and biggest all share the same root, and keeping them separate would produce duplicate features. To avoid this redundancy we apply stemming or lemmatization, which links each word to its root word.
Removing stop words – Stop words such as “is”, “the”, and “and” carry little meaning for analytics, so they are removed.
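The idea of mapping big, bigger, and biggest to one root can be sketched with a toy suffix stripper. This is deliberately naive and is not the Porter stemmer or any library lemmatizer: the suffix list and the `toy_stem` name are assumptions made for this example, and it will mangle plenty of real words.

```python
def toy_stem(word):
    """Very naive suffix stripper, for illustration only."""
    word = word.lower()
    for suffix in ("est", "er", "ing", "ed", "s"):
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= 3:
            # Undo consonant doubling left behind by the suffix: "bigg" -> "big".
            if stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word

print([toy_stem(w) for w in ["big", "bigger", "biggest"]])  # ['big', 'big', 'big']
```

In practice you would rely on a tested stemmer or lemmatizer from an NLP library rather than hand-written rules like these.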
Term frequencies – This is a matrix whose row headers are the document names, whose columns are the terms (words), and whose entries are the frequencies of the words in those particular documents. Below is a sample screenshot.
In the above figure, the matrix is shown transposed: the words (attributes) are in the rows, the document numbers are the columns, and the word frequencies are the data.
Coming to text analytics, the following steps need to be considered:
Clustering – Using K-means or another clustering algorithm, we can cluster the documents based on the features that were generated (the features here being the words). If labeled documents are available, supervised classifiers such as CART (classification and regression trees) or neural networks can be used instead.
Evaluation and Visualization – We can plot the clusters in two dimensions and see how they differ from each other. If the model holds up on test data, we can deploy it in production as a document classifier: given any new document as input, it names the cluster into which the document falls.
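To show what clustering documents on word features can look like, here is a small pure-Python K-means sketch. The term-frequency rows and their column meanings are made up, `kmeans` is a hypothetical helper, and seeding the centroids with the first k points is a simplification chosen to keep the example deterministic; library implementations use better initialization.

```python
import math

def kmeans(points, k, iterations=10):
    """Plain k-means on lists of numbers; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # leave an empty cluster's centroid where it is
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Made-up term-frequency rows; columns count ["battery", "screen", "price", "refund"].
doc_vectors = [
    [3, 2, 0, 0],
    [0, 0, 2, 3],
    [2, 3, 0, 0],
    [0, 1, 3, 2],
]
labels, centroids = kmeans(doc_vectors, k=2)
print(labels)  # [0, 1, 0, 1]
```

Documents 1 and 3 (mostly about the device) end up in one cluster and documents 2 and 4 (mostly about price and refunds) in the other.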
• Sentiment Analysis
Sentiment analysis is one of the most powerful tools on the market for processing Twitter data, Facebook data, or any other data to derive whether the sentiment towards a particular process, product, or person is good, bad, or neutral.
The data can be obtained easily by using the Twitter API or Facebook API to get the tweets, comments, likes, etc. on a company’s tweet or post. The major problem is that this data is hard to structure. The data also contains various advertisements, so the data scientist who works for the company has to make sure that the data is selected in the right way and that only the selected tweets or posts go through to the pre-processing stages.
Other tools include web scraping, a part of text mining in which you scrape data from websites using crawlers.
The text mining process remains the same: tokenization, stemming and lemmatization, removing stop words and punctuation, and finally computing the term frequency matrix or document frequency matrices. The only difference comes when applying the sentiment analysis.
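As a rough sketch of the text-extraction half of web scraping, the example below pulls the visible text out of an HTML string with Python’s standard `html.parser` module. The HTML snippet is made up and `TextExtractor` is a hypothetical class name; a real crawler would also need to download pages, follow links, and respect each site’s terms and robots.txt.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

# Made-up page content standing in for a downloaded web page.
html = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>Review</h1><p>Great <b>battery</b> life.</p></body></html>"
)
parser = TextExtractor()
parser.feed(html)
parser.close()
page_text = " ".join(parser.parts)
print(page_text)  # Review Great battery life.
```

The extracted text can then go through the same cleaning and tokenization steps as any other document.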
Usually, we give a score to each post or tweet. For example, when you buy a product and review it, you are typically given the option to rate it with stars and to post a comment. Google, Amazon, and other websites use the stars to rate the comment. In addition, they give the tweets or posts to human raters, who label them as good, bad, or neutral, and by combining these two scores they generate a new score for each tweet or post.
Sentiment analysis results can be visualized using a word cloud or bar charts of the term frequency matrix.
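The article does not say how the star rating and the human label are actually combined, so the sketch below is only one plausible way to do it: rescale the stars to the range -1 to 1, map the label to -1, 0, or 1, and take a weighted average. The mapping, the default weight of 0.5, and the function names are all assumptions made for illustration.

```python
# Hypothetical mapping from a human label to a numeric score.
LABEL_SCORE = {"bad": -1.0, "neutral": 0.0, "good": 1.0}

def star_score(stars, max_stars=5):
    """Rescale a 1..max_stars rating linearly onto [-1, 1]."""
    return 2 * (stars - 1) / (max_stars - 1) - 1

def combined_score(stars, human_label, star_weight=0.5):
    """Weighted average of the star score and the human label score."""
    label_part = LABEL_SCORE[human_label]
    return star_weight * star_score(stars) + (1 - star_weight) * label_part

print(combined_score(5, "good"))     # 1.0
print(combined_score(3, "neutral"))  # 0.0
print(combined_score(4, "bad"))      # -0.25
```

The last call shows why combining the two signals can help: a 4-star rating attached to a comment that a human rater labeled as bad ends up slightly negative instead of positive.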
• Association Mining Analysis
One application that has been worked on is the “Adverse Drug Event Probabilistic model”, which checks which adverse events may lead to other adverse events when a patient takes a particular medicine.
The text mining followed the workflow below.
From the above figure, we can see that all the steps up to data mining belong to text mining: identifying the sources of data, extracting the data, and then preparing it for analysis.
Applying association mining then gives the model below.
In the figure, several arrows point towards an orange circle, and one arrow then points from the circle towards one particular ADE (adverse drug event). Taking the example at the bottom left of the image, apathy, asthenia, and feeling abnormal lead to feeling guilty. One could say that this is obvious, but it is obvious only because a human can interpret and relate these events; here a machine is interpreting them and giving us the next adverse drug event.
An example of a word cloud is shown below.
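A rule such as “apathy, asthenia, and feeling abnormal lead to feeling guilty” is usually judged by its support and confidence. The sketch below computes both on a handful of made-up patient reports; the data is not real, and `support` and `confidence` are hypothetical helper names rather than the method used in the model described above.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate P(consequent | antecedent) from the transactions."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)

# Made-up patient reports, each a set of adverse events (not real data).
reports = [
    {"apathy", "asthenia", "feeling abnormal", "feeling guilty"},
    {"apathy", "asthenia", "feeling abnormal", "feeling guilty"},
    {"apathy", "asthenia", "feeling abnormal"},
    {"nausea", "headache"},
]
lhs = {"apathy", "asthenia", "feeling abnormal"}
rhs = {"feeling guilty"}
rule_support = support(lhs | rhs, reports)
rule_confidence = confidence(lhs, rhs, reports)
print(rule_support)     # 0.5
print(rule_confidence)  # about 0.667
```

In this toy data the rule holds in two of the three reports that contain all three antecedent events, which is where the confidence of roughly two thirds comes from. Note that `confidence` divides by the antecedent’s support, so it fails if the antecedent never occurs.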
Text Mining and Text Analytics Comparison Table
The points below describe the comparison between Text Mining and Text Analytics:
| Basis for Comparison | Text Mining | Text Analytics |
|---|---|---|
| Definition | Text mining is basically cleaning up data so that it is available for text analytics. | Text analytics is the application of statistical and machine learning techniques to predict, prescribe, or infer information from the text-mined data. |
| Role | Text mining is a tool that helps in getting the data cleaned up. | Text analytics is the process of applying the algorithms. |
| Framework | In terms of framework, text mining is similar to ETL (Extract, Transform, Load): these steps are carried out so that the data can be inserted into a database. | In text analytics, this data is used to add value to the business, for example by creating word clouds, bi-gram frequency charts, and in some cases N-grams. |
| Tools | Python and R are the most popular tools for text mining. | Once the data is available at the database level, any analytics software can be used, including Python and R. Other software includes Power BI, Azure, KNIME, etc. |
Text mining and text analytics are not limited to English: with continuous advancements in linguistic tools, other languages are also being considered for analysis.
The scope of text mining should keep growing, since the resources available for analyzing languages other than English are still limited.
Text analytics can be applied in a very broad range of industries. Some examples are:
- Social Media Monitoring
- Pharma/Biotech Applications
- Business and Marketing Applications