Introduction to Pandas Find Duplicates
Real-world data is rarely clean. It commonly contains outliers, missing values, and duplicate records. A frequently quoted rule of thumb in the data science world is that data scientists and data analysts spend around 80% of their time cleaning and preparing data for a machine learning algorithm. In this article, we cover one very common problem: how to find and remove duplicate values or records in a pandas dataframe. The pandas module in Python provides built-in functions for this, namely dataframe.duplicated() to find duplicate values and dataframe.drop_duplicates() to drop them. We will discuss these functions in detail in the subsequent sections.
Syntax and Parameters:
The basic syntax for dataframe.duplicated() function is as follows :
dataframe.duplicated(subset = 'column_name', keep = {'first', 'last', False})
The parameters used in the above mentioned function are as follows :
- Dataframe : Name of the dataframe in which we have to find duplicate values.
- Subset : Column label, or list of column labels, on which the comparison is based. If it is omitted, all columns are compared.
- Keep : Decides which occurrence in a group of duplicates is left unmarked. With 'first' (the default), the first occurrence is not marked and every later occurrence is marked as duplicate. With 'last', the last occurrence is not marked and every earlier one is. With False (the boolean, not a string), all occurrences are marked as duplicate.
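The three keep options are easiest to compare side by side. The following is a minimal sketch on a small, hypothetical data frame called toy (it is not the data frame used later in this article):

```python
import pandas as pd

# Hypothetical toy frame, used only to illustrate the keep argument
toy = pd.DataFrame({"Country": ["India", "USA", "India", "India"]})

# keep='first': every occurrence except the first is flagged
first = toy.duplicated(subset="Country", keep="first")
# keep='last': every occurrence except the last is flagged
last = toy.duplicated(subset="Country", keep="last")
# keep=False: every member of a repeated group is flagged
every = toy.duplicated(subset="Country", keep=False)

print(first.tolist())  # [False, False, True, True]
print(last.tolist())   # [True, False, True, False]
print(every.tolist())  # [True, False, True, True]
```

Note that False here is the Python boolean. Passing the string 'false' is not one of the accepted values.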
Both the subset and keep arguments are optional. Having understood the dataframe.duplicated() function to find duplicate records, let us discuss dataframe.drop_duplicates() to remove duplicate values in the dataframe.
The basic syntax for dataframe.drop_duplicates() function is similar to duplicated() function. It can be written as follows :
dataframe.drop_duplicates(subset = 'column_name', keep = {'first', 'last', False}, inplace = {True, False})
Most of the arguments mentioned in this function have been discussed above in dataframe.duplicated() function. The one which is not discussed is :
Inplace : Determines whether the duplicates are removed from the original data frame itself (True), or a new data frame is returned and the original is left unchanged (False, the default).
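The difference between the two inplace settings can be sketched as follows, again on a hypothetical toy data frame rather than the one used in the examples:

```python
import pandas as pd

# Hypothetical toy frame with one repeated value
toy = pd.DataFrame({"Country": ["India", "USA", "India"]})

# Default (inplace=False): a new data frame is returned, toy is untouched
deduped = toy.drop_duplicates()
print(len(toy), len(deduped))  # 3 2

# inplace=True: toy itself is modified and the call returns None
result = toy.drop_duplicates(inplace=True)
print(result, len(toy))  # None 2
```

Because the inplace call returns None, writing toy = toy.drop_duplicates(inplace=True) would replace the data frame with None, so the two styles should not be mixed.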
Examples
We have now discussed the syntax and arguments of the functions used for dealing with duplicate records in pandas. No learning is complete without some practical examples, so let's try a few based on these functions. To do that, we first have to create a dataframe with duplicate records. You may use the following data frame for the purpose.
Code:
#importing pandas
import pandas as pd
#input data
data = {'Country': ['India','India','USA','USA','UK','Germany','India','Germany', 'USA', 'China', 'Japan'],
'Personality': ['Sachin Tendulkar','Sania Mirza','Serena Williams','Venus Williams',
'Morgan Freeman','Michael Schumacher','Priyanka Chopra','Michael Schumacher',
'Serena Williams','Jack Ma','Sakamoto Ryoma']
}
#create a dataframe from the data
df = pd.DataFrame(data, columns = ['Country','Personality'])
#print dataframe
df
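The output of the given code snippet is a data frame called 'df' with 11 records and two columns, Country and Personality.
It contains a few duplicate values: the records at index 7 (Germany, Michael Schumacher) and index 8 (USA, Serena Williams) repeat earlier records, and several countries appear more than once.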
1. Finding Duplicate Values in the Entire Dataset
To find duplicate values in pandas, we use the df.duplicated() function. It returns a series of boolean values indicating whether each record is a duplicate or not.
df.duplicated()
By default, it compares entire records and marks only the subsequent occurrences as duplicates, i.e. the first occurrence of a record is not marked as duplicate but every later occurrence is.
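Since the result is a boolean series, it can also be used as a mask to look at the duplicated records themselves. The sketch below rebuilds the article's data frame so that it runs on its own; the variable names repeats and groups are only illustrative:

```python
import pandas as pd

data = {
    'Country': ['India', 'India', 'USA', 'USA', 'UK', 'Germany',
                'India', 'Germany', 'USA', 'China', 'Japan'],
    'Personality': ['Sachin Tendulkar', 'Sania Mirza', 'Serena Williams',
                    'Venus Williams', 'Morgan Freeman', 'Michael Schumacher',
                    'Priyanka Chopra', 'Michael Schumacher', 'Serena Williams',
                    'Jack Ma', 'Sakamoto Ryoma'],
}
df = pd.DataFrame(data)

# Records that repeat an earlier record (default keep='first')
repeats = df[df.duplicated()]
print(repeats.index.tolist())  # [7, 8]

# keep=False flags every member of a duplicated group, originals included
groups = df[df.duplicated(keep=False)]
print(groups.index.tolist())  # [2, 5, 7, 8]
```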
2. Finding Duplicate Values in a Specific Column
In the previous example, we used the duplicated() function without any arguments. Here, we use the function with the subset argument to find duplicate values in the 'Country' column only.
df.duplicated(subset = 'Country')
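The subset argument also accepts a list of column labels, in which case records count as duplicates only when they match on every listed column. A minimal sketch, assuming a hypothetical toy data frame with an extra Year column that is not part of the article's data:

```python
import pandas as pd

# Hypothetical toy frame; the Year column exists only for this illustration
toy = pd.DataFrame({
    "Country": ["India", "India", "USA", "India"],
    "Personality": ["Sachin Tendulkar", "Sania Mirza",
                    "Serena Williams", "Sachin Tendulkar"],
    "Year": [2010, 2011, 2012, 2013],
})

# Compare on one column
by_country = toy.duplicated(subset="Country")
print(by_country.tolist())  # [False, True, False, True]

# Compare on a pair of columns
by_pair = toy.duplicated(subset=["Country", "Personality"])
print(by_pair.tolist())  # [False, False, False, True]

# Compare on all columns: Year differs everywhere, so nothing is flagged
whole = toy.duplicated()
print(whole.tolist())  # [False, False, False, False]
```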
3. Finding Duplicate Values in a Specific Column and Marking Last Occurrence as Not Duplicate
df.duplicated(subset = 'Country', keep = 'last')
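To see how keep = 'last' changes the result, the index labels flagged under each setting can be listed. This sketch uses only the Country column of the article's data so that it is self-contained:

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['India', 'India', 'USA', 'USA', 'UK', 'Germany',
                'India', 'Germany', 'USA', 'China', 'Japan'],
})

# keep='first' (default): later occurrences of a country are flagged
flagged_first = df.index[df.duplicated(subset='Country')].tolist()
print(flagged_first)  # [1, 3, 6, 7, 8]

# keep='last': earlier occurrences are flagged, the last one is left unmarked
flagged_last = df.index[df.duplicated(subset='Country', keep='last')].tolist()
print(flagged_last)  # [0, 1, 2, 3, 5]
```

The number of flagged records is the same in both cases; only which member of each group is left unmarked changes.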
4. Finding the Count of Duplicate Records in the Entire Dataset
To find the total number of duplicate records, we can perform a sum operation on the results obtained from the duplicated() function, as shown below. This works because each True value counts as 1 when summed.
df.duplicated().sum()
5. Finding the Count of Duplicate Values in a Specific Column
df.duplicated(subset='Country').sum()
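For reference, the two counts can be reproduced with the self-contained sketch below, which rebuilds the article's data frame. The names total_duplicates and country_duplicates are only illustrative:

```python
import pandas as pd

data = {
    'Country': ['India', 'India', 'USA', 'USA', 'UK', 'Germany',
                'India', 'Germany', 'USA', 'China', 'Japan'],
    'Personality': ['Sachin Tendulkar', 'Sania Mirza', 'Serena Williams',
                    'Venus Williams', 'Morgan Freeman', 'Michael Schumacher',
                    'Priyanka Chopra', 'Michael Schumacher', 'Serena Williams',
                    'Jack Ma', 'Sakamoto Ryoma'],
}
df = pd.DataFrame(data)

# Whole-record duplicates: the records at index 7 and 8
total_duplicates = int(df.duplicated().sum())
print(total_duplicates)  # 2

# Repeated countries: India twice more, USA twice more, Germany once more
country_duplicates = int(df.duplicated(subset='Country').sum())
print(country_duplicates)  # 5
```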
6. Removing Duplicate Records in the Dataset
df.drop_duplicates(keep = 'first')
The function has removed the records at index 7 and 8, as they duplicated earlier records. Note that the drop_duplicates() function does not make inplace changes by default. That is, the records are not removed from the original dataframe, as printing it again shows.
df
The original data frame is unchanged and still contains the duplicate records. To save the changes to the original dataframe, we have to use the inplace argument, as shown in the next example.
7. Removing Duplicate Records in the Dataset Inplace
df.drop_duplicates(keep = 'first', inplace = True)
df
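After duplicates are dropped, the remaining records keep their original index labels, so the index has gaps. If a continuous index is wanted, it can be renumbered. The following is a sketch on a hypothetical toy data frame; the ignore_index argument is assumed to be available, which requires pandas 1.0 or newer:

```python
import pandas as pd

# Hypothetical toy frame with one repeated value at index 2
toy = pd.DataFrame({"Country": ["India", "USA", "India", "UK"]})

deduped = toy.drop_duplicates()
print(deduped.index.tolist())  # [0, 1, 3]  (label 2 was dropped)

# Option 1: renumber afterwards, discarding the old labels
renumbered = deduped.reset_index(drop=True)
print(renumbered.index.tolist())  # [0, 1, 2]

# Option 2: let drop_duplicates renumber (assumes pandas 1.0 or newer)
direct = toy.drop_duplicates(ignore_index=True)
print(direct.index.tolist())  # [0, 1, 2]
```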
Conclusion
Finding and removing duplicate values can seem like a daunting task for large datasets, but pandas makes it easy by providing built-in functions such as dataframe.duplicated() to find duplicate values and dataframe.drop_duplicates() to remove them.