Updated March 21, 2023

Introduction to Pandas dropna()

Pandas, Python’s open-source library, is one of the most widely used libraries for data science and analysis. It’s also a preferred package for ad-hoc data manipulation. The credit goes to its extremely flexible data representation using DataFrames and the arsenal of functions it exposes for manipulating the data in those DataFrames. Any real-life data problem will involve missing data, and it is imperative that such data points are handled the right way. The dropna() method supports dropping missing data in whatever manner suits the problem.

What Exactly is Pandas dropna()?

Refer to the pandas documentation on pydata.org for the official function definition.

The function header shown is as follows (along with the default parameter values):

DataFrame.dropna(self, axis=0, how='any', thresh=None, subset=None, inplace=False)

The parameters shown in the function definition (excluding self, the DataFrame object itself) are as follows:

axis: The orientation (row or column) along which data is dropped. Possible values are 0 or 1 (equivalently 'index' or 'columns'). 0/'index' drops rows and 1/'columns' drops columns.
how: Specifies when a row/column containing null values is dropped. The values are 'any' or 'all'. 'any' drops the row/column when at least one of its values is null. 'all' drops the row/column only if all of its values are null.
thresh: Specifies the minimum number of non-NA values a row/column must have to be kept in the final result. Any row/column with fewer than thresh non-NA values is removed. When thresh=None, this filter is ignored. Note that pandas 2.0 and later raise a TypeError if how and thresh are both passed in the same call.
subset: Restricts the null check to the given labels along the other axis. When dropping rows (axis=0), subset takes a list of columns to be searched for null/NA values; when dropping columns (axis=1), it takes a list of row labels. Without subset, every column/row is searched.
inplace: By default, the original DataFrame is not modified; a separate copy with the changes (i.e. the dropped rows/columns) is returned. Passing inplace=True modifies the original DataFrame itself, in which case the method returns None.
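As a rough illustration of how these parameters behave, here is a minimal sketch on a small made-up DataFrame. The name toy and its columns a, b and c are hypothetical and are not part of the example used later in this article. The sketch deliberately avoids passing how and thresh in the same call.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only for this illustration.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
    "c": ["x", "y", "z"],
})

# Defaults (axis=0, how='any'): drop every row that contains a null.
rows_any = toy.dropna()

# how='all': drop a row only if every value in it is null (none here).
rows_all = toy.dropna(how="all")

# thresh=2: keep rows that have at least two non-null values.
rows_thresh = toy.dropna(thresh=2)

# subset: only column "a" is checked for nulls when dropping rows.
rows_subset = toy.dropna(subset=["a"])

# inplace=True changes the object it is called on and returns None.
work = toy.copy()
returned = work.dropna(inplace=True)

print(rows_any.index.tolist())     # [2]
print(rows_all.index.tolist())     # [0, 1, 2]
print(rows_thresh.index.tolist())  # [0, 2]
print(rows_subset.index.tolist())  # [0, 2]
print(returned, work.index.tolist())  # None [2]
```

The original toy DataFrame is left untouched by every call except the inplace one, which is why that call is made on a copy here.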

Now that we have a general idea of the parameters exposed by dropna(), let’s see some possible scenarios of missing data and how we tackle them.

Example Use-cases of Pandas dropna()

Below are examples of dropna() in use:

Import pandas: To use dropna(), there needs to be a DataFrame. To create a DataFrame, the pandas library needs to be imported (no surprise here). We will import it with the alias pd to reference objects under the module conveniently. For defining null values, we will stick to numpy.nan, so we will also import the numpy library with the alias np:

Code:

In [1]: import pandas as pd

In [2]: import numpy as np

1. Create a DataFrame Object for Manipulation

Once pandas is imported, its functions and constructors are available through the pd alias. So let’s create a DataFrame that helps us demonstrate the uses of dropna().

Code:

In [3]: df = pd.DataFrame(
   ...:     {'Company': ['Google', 'Amazon', 'Infosys', 'Directi'],
   ...:      'Age': ['21', '23', '38', '22'],
   ...:      'NetWorth ($ bn)': [300, np.nan, np.nan, 1.3],
   ...:      'Founder': [np.nan, np.nan, np.nan, np.nan],
   ...:      'Headquarter-Country': ['United States', np.nan, 'India', 'India']
   ...:     })

In [4]: print(df)
   Company Age  NetWorth ($ bn)  Founder Headquarter-Country
0   Google  21            300.0      NaN       United States
1   Amazon  23              NaN      NaN                 NaN
2  Infosys  38              NaN      NaN               India
3  Directi  22              1.3      NaN               India

The printed DataFrame will be manipulated in our demonstration below.

2. Dropping Rows vs Columns

The axis parameter is used to drop rows or columns as shown below:

Code:

In [5]: df.dropna(axis=1)

Output:

Out[5]:
   Company Age
0   Google  21
1   Amazon  23
2  Infosys  38
3  Directi  22

Any column containing at least one NaN cell value is dropped. Let’s see how dropping rows (axis=0) works.
Note: axis=0 is the default behavior when the axis is not explicitly specified.

Code:

In [6]: df.dropna(axis=0)

Output:

Out[6]:
Empty DataFrame
Columns: [Company, Age, NetWorth ($ bn), Founder, Headquarter-Country]
Index: []

Hmm, so there’s no data left in the returned DataFrame! This is obviously not the intended behavior. Let’s see how to fix this.
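Before fixing it, it can help to confirm why every row disappeared. The following is a minimal sketch, not part of the original walkthrough, that rebuilds the example DataFrame and counts the missing values with isna(); the variable names null_counts and rows_with_null are chosen here purely for illustration.

```python
import numpy as np
import pandas as pd

# Rebuild the article's example DataFrame so this snippet runs on its own.
df = pd.DataFrame({
    "Company": ["Google", "Amazon", "Infosys", "Directi"],
    "Age": ["21", "23", "38", "22"],
    "NetWorth ($ bn)": [300, np.nan, np.nan, 1.3],
    "Founder": [np.nan, np.nan, np.nan, np.nan],
    "Headquarter-Country": ["United States", np.nan, "India", "India"],
})

# Number of missing values in each column.
null_counts = df.isna().sum()

# True for every row that contains at least one missing value.
rows_with_null = df.isna().any(axis=1)

print(null_counts)
print(rows_with_null.all())  # True: every row has a null, so dropna() removes them all
```

The Founder column is missing in all four rows, which is enough on its own for the default how='any' check to discard every row.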

3. Using the Subset Parameter

The previous operation checked every column for nulls when axis=0. Since the Founder column has only null values, all rows were dropped. Let’s specify a column to be used for filtering:

Code:

In [7]: df.dropna(axis=0, subset=['NetWorth ($ bn)'])

Output:

Out[7]:
   Company Age  NetWorth ($ bn)  Founder Headquarter-Country
0   Google  21            300.0      NaN       United States
3  Directi  22              1.3      NaN               India

Now, as we see, only records with a NaN value in the NetWorth ($ bn) column are dropped. The returned DataFrame can be modified further by applying dropna() once again, this time passing axis=1 to filter out columns.
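The two-step filtering described above can be sketched as a chained call. This is one possible way to write it, assuming the same example DataFrame; the variable name filtered is introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Rebuild the article's example DataFrame so this snippet runs on its own.
df = pd.DataFrame({
    "Company": ["Google", "Amazon", "Infosys", "Directi"],
    "Age": ["21", "23", "38", "22"],
    "NetWorth ($ bn)": [300, np.nan, np.nan, 1.3],
    "Founder": [np.nan, np.nan, np.nan, np.nan],
    "Headquarter-Country": ["United States", np.nan, "India", "India"],
})

# Step 1: drop rows whose NetWorth value is missing.
# Step 2: drop any column that still contains a null in the remaining rows.
filtered = df.dropna(axis=0, subset=["NetWorth ($ bn)"]).dropna(axis=1)

print(filtered.index.tolist())    # [0, 3]
print(filtered.columns.tolist())  # ['Company', 'Age', 'NetWorth ($ bn)', 'Headquarter-Country']
```

Only the Founder column is removed in the second step, because it is the only column that still holds nulls once rows 1 and 2 are gone.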

4. Using How Parameter

By default, dropna() drops the complete row/column even if only one value is missing. The alternative is to drop only when all the values in a row/column are null. This is achieved by setting how='all' instead of how='any' (the default behavior).

Code:

In [8]: df.dropna(axis=1, how='all')

Output:

Out[8]:
   Company Age  NetWorth ($ bn) Headquarter-Country
0   Google  21            300.0       United States
1   Amazon  23              NaN                 NaN
2  Infosys  38              NaN               India
3  Directi  22              1.3               India

The resulting DataFrame can then be used for dropping rows/columns with more complex logic if required.
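As one hypothetical example of such follow-up logic, the sketch below first removes the all-null columns and then keeps only the rows that have a known headquarter country. The choice of column and the variable names no_empty_cols and with_country are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Rebuild the article's example DataFrame so this snippet runs on its own.
df = pd.DataFrame({
    "Company": ["Google", "Amazon", "Infosys", "Directi"],
    "Age": ["21", "23", "38", "22"],
    "NetWorth ($ bn)": [300, np.nan, np.nan, 1.3],
    "Founder": [np.nan, np.nan, np.nan, np.nan],
    "Headquarter-Country": ["United States", np.nan, "India", "India"],
})

# Drop columns in which every value is null (only Founder qualifies).
no_empty_cols = df.dropna(axis=1, how="all")

# Then drop rows that lack a headquarter country; other nulls are tolerated.
with_country = no_empty_cols.dropna(axis=0, subset=["Headquarter-Country"])

print(no_empty_cols.columns.tolist())
print(with_country.index.tolist())  # [0, 2, 3]
```

Row 2 (Infosys) survives even though its net worth is missing, because only the Headquarter-Country column is inspected in the second step.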

5. Getting Control through Thresh

The thresh parameter is probably the most powerful tool when combined appropriately with the other parameters.

Code:

In [17]: df.dropna(axis=1, thresh=2)

Output:

Out[17]:
   Company Age  NetWorth ($ bn) Headquarter-Country
0   Google  21            300.0       United States
1   Amazon  23              NaN                 NaN
2  Infosys  38              NaN               India
3  Directi  22              1.3               India

By setting axis=1 and thresh=2, only those columns with at least 2 non-NaN values are retained. Here that removes only the Founder column, which has no non-NaN values at all.
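The same idea applies to rows. The sketch below is an illustration rather than part of the original walkthrough: it uses thresh with axis=0, first across all columns and then restricted to two columns through subset. The threshold values and the variable names rows_kept and subset_kept are arbitrary choices.

```python
import numpy as np
import pandas as pd

# Rebuild the article's example DataFrame so this snippet runs on its own.
df = pd.DataFrame({
    "Company": ["Google", "Amazon", "Infosys", "Directi"],
    "Age": ["21", "23", "38", "22"],
    "NetWorth ($ bn)": [300, np.nan, np.nan, 1.3],
    "Founder": [np.nan, np.nan, np.nan, np.nan],
    "Headquarter-Country": ["United States", np.nan, "India", "India"],
})

# Keep rows with at least 3 non-null values across all five columns.
# Non-null counts per row are 4, 2, 3 and 4, so only row 1 is dropped.
rows_kept = df.dropna(axis=0, thresh=3)

# Count non-null values only within the two listed columns and
# keep rows where both of them are present.
subset_kept = df.dropna(
    axis=0,
    thresh=2,
    subset=["NetWorth ($ bn)", "Headquarter-Country"],
)

print(rows_kept.index.tolist())    # [0, 2, 3]
print(subset_kept.index.tolist())  # [0, 3]
```

Combining thresh with subset in this way gives a tolerance for missing values that neither how='any' nor how='all' can express.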

Conclusion

The examples shown above are simple, yet they are powerful enough to deal with the majority of the missing-data problems you might stumble upon in real-life situations. Nonetheless, one should practice combining the different parameters to gain a crystal-clear understanding of their usage and to build speed in applying them.