EDUCBA

Pandas.Dropna()

By Priya Pedamkar

Introduction to Pandas.Dropna()

Pandas, Python's open-source data analysis library, is one of the most widely used libraries for data science and analysis. It is also a preferred package for ad-hoc data manipulation. The credit goes to its flexible data representation, the DataFrame, and the large set of functions it exposes for manipulating the data held in DataFrames. Almost any real-life data problem involves missing data, and it is imperative that such data points are handled in the right way. The dropna() function supports one common way of handling them: dropping the rows or columns that contain missing values.

What Exactly is Pandas.Dropna()?

The official pandas documentation (pandas.pydata.org) can be referred to for the function definition.

The function header is as follows (along with the default parameter values):

DataFrame.dropna(self, axis=0, how='any', thresh=None, subset=None, inplace=False)

The parameters shown in the function definition (excluding self, the DataFrame object itself) are as follows:

  • axis: The orientation (row or column) in which data is dropped. Possible values are 0 or 1 (also 'index' or 'columns' respectively). 0/'index' drops rows and 1/'columns' drops columns.
  • how: Specifies when a row/column containing null values is dropped. The values are 'any' or 'all'. 'any' drops the row/column when at least one of its values is null. 'all' drops the row/column only if all of its values are null.
  • thresh: Specifies the minimum number of non-NA values a row/column must have in order to be kept in the final result. Any row/column with fewer than thresh non-NA values is removed. When thresh=None, this filter is ignored. Note that pandas 2.0 and later reject a call that sets both how and thresh, so use one or the other.
  • subset: Restricts the null check to the given labels along the other axis. When dropping rows (axis=0), subset takes a list of column names to be searched for null/NA values; when dropping columns (axis=1), it takes a list of row labels. Without subset, all columns/rows are searched.
  • inplace: By default (inplace=False), the original DataFrame is not modified, which is good practice; a separate copy with the changes (i.e. dropped rows/columns) is returned instead. Setting inplace=True gives you the flexibility to modify the original data structure itself.
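To make the inplace behavior concrete, here is a minimal sketch. The small DataFrame df_demo and its column names (a, b) are made up for illustration and are not part of the examples that follow.

```python
import numpy as np
import pandas as pd

# Hypothetical two-column frame; column "b" has one missing value.
df_demo = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, np.nan, 6.0]})

# Default (inplace=False): a new DataFrame is returned, df_demo is untouched.
cleaned = df_demo.dropna()
print(len(df_demo), len(cleaned))  # 3 rows in the original, 2 in the copy

# inplace=True: df_demo itself is modified, so the result is not assigned.
df_demo.dropna(inplace=True)
print(len(df_demo))  # now 2 rows
```

Assigning the returned copy to a new name is usually the safer habit, since the original data stays available for comparison.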

Now that we have a general idea of the parameters exposed by dropna(), let’s see some possible scenarios of missing data and how we tackle them.

Example Use-cases of Pandas.Dropna()

Below are the examples of pandas.dropna():

Import pandas: To use dropna(), there needs to be a DataFrame. To create a DataFrame, the pandas library needs to be imported (no surprise here). We will import it with the alias pd to reference objects under the module conveniently. For defining null values, we will stick to numpy.nan, so we will also import the numpy library with the alias np:

Code:

In [1]: import pandas as pd
In [2]: import numpy as np

1. Create a DataFrame Object for Manipulation

Once pandas is imported, its methods, functions, and constructors are available in your workspace through the pd alias. So let's create a DataFrame that can help us demonstrate the uses of dropna().

Code:

In [3]: df = pd.DataFrame({
            'Company': ['Google', 'Amazon', 'Infosys', 'Directi'],
            'Age': ['21', '23', '38', '22'],
            'NetWorth ($ bn)': [300, np.nan, np.nan, 1.3],
            'Founder': [np.nan, np.nan, np.nan, np.nan],
            'Headquarter-Country': ['United States', np.nan, 'India', 'India']})
In [4]: print(df)
   Company Age  NetWorth ($ bn)  Founder Headquarter-Country
0   Google  21            300.0      NaN       United States
1   Amazon  23              NaN      NaN                 NaN
2  Infosys  38              NaN      NaN               India
3  Directi  22              1.3      NaN               India

The printed DataFrame will be manipulated in our demonstration below.

2. Dropping Rows vs Columns

The axis parameter is used to drop rows or columns as shown below:

Code:

In [5]: df.dropna(axis=1)

Output:

Out[5]:
   Company Age
0   Google  21
1   Amazon  23
2  Infosys  38
3  Directi  22

Any column containing at least one NaN as a cell value is dropped. Let's see how dropping rows (axis=0) works.
Note: axis=0 is the default behavior when the axis is not explicitly specified.

Code:

In [6]: df.dropna(axis=0)

Output:

Out[6]:
Empty DataFrame
Columns: [Company, Age, NetWorth ($ bn), Founder, Headquarter-Country]
Index: []

Hmm, so there's no data in the returned DataFrame anymore! This is obviously not the intended behavior. Let's see how to fix this.

3. Using the Subset Parameter

The previous operation considered all columns when dropping rows with axis=0. Since the Founder column has only null values, every row contains at least one NaN and all rows are dropped. Let's specify a column to be used for filtering:

Code:

In [7]: df.dropna(axis=0,subset=['NetWorth ($ bn)'])

Output:

Out[7]:
   Company Age  NetWorth ($ bn)  Founder Headquarter-Country
0   Google  21            300.0      NaN       United States
3  Directi  22              1.3      NaN               India

Now, as we see, only records with a NaN value in the NetWorth ($ bn) column are dropped. The returned DataFrame can be modified further by applying dropna() once again, this time passing axis=1 to filter out columns.
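The two-step idea described above can be sketched as follows. The snippet rebuilds the same DataFrame so that it runs on its own; the variable names rows_kept and result are chosen here for illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Company': ['Google', 'Amazon', 'Infosys', 'Directi'],
    'Age': ['21', '23', '38', '22'],
    'NetWorth ($ bn)': [300, np.nan, np.nan, 1.3],
    'Founder': [np.nan, np.nan, np.nan, np.nan],
    'Headquarter-Country': ['United States', np.nan, 'India', 'India']})

# Step 1: keep only the rows that have a NetWorth value.
rows_kept = df.dropna(axis=0, subset=['NetWorth ($ bn)'])

# Step 2: from what remains, drop every column that still contains a NaN.
# Only Founder qualifies, because it is null in every row.
result = rows_kept.dropna(axis=1)
print(result)
```

Since each call returns a new DataFrame, the two steps can also be chained in a single expression.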

4. Using How Parameter

By default, dropna() drops the complete row/column even if only one value is missing. The alternative is to drop a row/column only when all of its values are null. This is achieved by setting how='all' instead of how='any' (the default behavior).

Code:

In [8]: df.dropna(axis=1,how='all')

Output:

Out[8]:
   Company Age  NetWorth ($ bn) Headquarter-Country
0   Google  21            300.0       United States
1   Amazon  23              NaN                 NaN
2  Infosys  38              NaN               India
3  Directi  22              1.3               India

Now the resultant DataFrame can be used for dropping rows/columns with a more complex logic if required.
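As one sketch of such combined logic, how='all' can be paired with subset so that a row is dropped only when every listed column is missing. The choice of columns below is an assumption made for illustration; the DataFrame is rebuilt so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Company': ['Google', 'Amazon', 'Infosys', 'Directi'],
    'Age': ['21', '23', '38', '22'],
    'NetWorth ($ bn)': [300, np.nan, np.nan, 1.3],
    'Founder': [np.nan, np.nan, np.nan, np.nan],
    'Headquarter-Country': ['United States', np.nan, 'India', 'India']})

# Drop a row only if BOTH NetWorth and Headquarter-Country are missing.
# Amazon (row 1) lacks both and is dropped; Infosys (row 2) lacks only
# NetWorth and is kept.
result = df.dropna(axis=0, how='all',
                   subset=['NetWorth ($ bn)', 'Headquarter-Country'])
print(result)
```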

5. Getting Control through Thresh

The thresh parameter is probably the most powerful tool when combined appropriately with the other parameters such as axis and subset.

Code:

In [17]: df.dropna(axis=1,thresh=2)

Output:

Out[17]:
   Company Age  NetWorth ($ bn) Headquarter-Country
0   Google  21            300.0       United States
1   Amazon  23              NaN                 NaN
2  Infosys  38              NaN               India
3  Directi  22              1.3               India

By setting axis=1 and thresh=2, only those columns with at least 2 non-NaN values are retained. Here the Founder column, which has no non-NaN values, is dropped.
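The same parameter works on rows. The sketch below, which rebuilds the DataFrame so that it runs on its own, applies thresh along axis=0; the threshold values 3 and 4 are picked here purely for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Company': ['Google', 'Amazon', 'Infosys', 'Directi'],
    'Age': ['21', '23', '38', '22'],
    'NetWorth ($ bn)': [300, np.nan, np.nan, 1.3],
    'Founder': [np.nan, np.nan, np.nan, np.nan],
    'Headquarter-Country': ['United States', np.nan, 'India', 'India']})

# Non-NaN values per row: Google 4, Amazon 2, Infosys 3, Directi 4.
print(df.notna().sum(axis=1))

# Keep rows with at least 3 non-NaN values: Amazon is dropped.
at_least_3 = df.dropna(axis=0, thresh=3)

# Keep rows with at least 4 non-NaN values: Amazon and Infosys are dropped.
at_least_4 = df.dropna(axis=0, thresh=4)
print(at_least_4)
```

Unlike how='any', which returned an empty DataFrame for these rows, thresh tolerates the all-null Founder column while still filtering out sparse records.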

Conclusion

The examples shown above are simple, yet they are enough to deal with the majority of the missing-data problems you might stumble upon in real-life situations. Nonetheless, one should practice combining different parameters to gain a clear understanding of their usage and build speed in applying them.

© 2020 - EDUCBA. ALL RIGHTS RESERVED. THE CERTIFICATION NAMES ARE THE TRADEMARKS OF THEIR RESPECTIVE OWNERS.
