What is Exploratory Data Analysis?
Exploratory Data Analysis (EDA) is a set of methodologies or philosophies to uncover the insights or patterns of the data. Exploratory data analysis helps us to explore underlying patterns, extract feature importance, identify outliers and anomalies of a data set, etc. Exploratory data analysis is mainly a set of visualization techniques that can help to maximize the insights from the data which can be used to implement the whole analytics pipeline. There is no hard and fast rule or ‘One size fits all’ techniques in EDA. One needs to apply various graphical methods to explore the data that can vary from use case to use case. For example, the exploratory techniques for a regression and classification problem will vary.
How Exploratory Data Analysis is Performed?
Let us see how the exploratory data analysis is performed:
1. Univariate Analysis
‘Uni’ means ‘One’. As the name suggests univariate analysis is the data analysis where only a single variable is involved. The variable can be either a ‘Categorical’ variable or ‘Numerical’ variable. Let us discuss the most commonly used graphical methods used for exploratory data analysis of univariate analysis.
2. Frequency Table and Count Plot
Frequency table or count plots are used to identify the frequency or how many times a value occurs. For example, we are tossing an unbiased coin 5 times (H, T, H, H, T). The frequency or count of the head here is 3. Let us see how the count plot looks from a movie review data set. Count plot is also referred to as a bar plot because of the rectangular bars. We generate bar plot in python using Seaborn library
Count plot of Movie Review Dataset
Source Link: https://stackoverflow.com/questions/48043365/
Histograms are the smoothen version of Kernel density estimation. Histograms help us to get knowledge about the underlying distribution of the data. For example, the preprocessing methodologies of a normal (bell-shaped curve) distribution will significantly different than other skewed distribution like Pareto distribution. Let’s see how the distribution of flight arrival displays in the form of a histogram.
Source Link: Wikipedia
4. Pie Chart
A pie chart is a circle which is divided into parts based on relative count or frequency of a sample or population. Suppose we want to compare the relative performance or sales or multiple products, a pie chart is a useful graphical way to visualize it.
5. Box Plot with Whisker and Violin Plot
Box plot with whisker is used to graphically display the 25-50-75 percentile values of the variable. Box plot gives us a clear picture of where 50%, 25% or 95% of the values lies in our data. Violin plot is the enhanced plot of boxplot which includes some more information (distribution of the variable) of the variable. Let us show how the boxplot and violin plot looks
4.5 (2,959 ratings)
6. Multivariate Analysis
Multivariate analysis is the methodology of comparative analysis between multiple variables. If we compare the two variables it is called bi-variate analysis. Scatter plots, contour plots, multivariate probability density plots are the most commonly used graphical methods to analyze multi-dimensional data. Let us show how a to scatter plot looks like
Advantages and Disadvantages of Exploratory Data Analysis
Below are given the advantages and disadvantages of Exploratory Data Analysis:
Advantages of EDA
- It gives us valuable insights about the data
- It helps us for feature selection (i.e using PCA)
- Visualization is an effective way of detecting outliers
Disadvantages of EDA
- If not perform properly EDA can misguide a problem
- EDA does not effective when we deal with high dimensional data
Applications of Exploratory Data Analysis
Let’s analyze the applications of Exploratory Data Analysis with a use case of univariate analysis where we will seek the measurement of the central tendency of the data.
Measurement of central tendency gives us an overview of the univariate variable. Central tendency is the measurement of Mean, Median, and Mode. Mean is the simple average where the median is the 50% percentile and Mode is the most frequently occurring value. Suppose we want the get the knowledge about the salary of a data scientist. Also, suppose we have carefully collected data of the data scientist with similar expertise and experience range.
Now if we want to get the average it is simply the total salary of all the data scientists of the sample divided by the number of data scientists in the sample or population. But if you think carefully the average salary is not a proper term because in the presence of some extreme values the result will be skewed. Suppose for maximum cases the salary is between 8-10 LPA and for one or two cases it is 32 LPA. Now adding all these the average will be skewed. Median is more suitable for such situations, it is more robust to outliers.
In this article, we have discussed the various methodologies involved in exploratory data analysis, the applications, advantages, and disadvantages of it. We also walked through the sample codes to generate the plots in python using seaborn and Matplotlib libraries. EDA is the art part of data science literature which helps to get valuable insights and visualize the data.
This is a guide to Exploratory Data Analysis. Here we discuss the Introduction to EDA, how Exploratory Data Analysis is Performed? along with applications of EDA and the advantages and disadvantages. You can also go through our other suggested articles –