Working with missing data in Pandas

Abiakshar
6 min readJan 15, 2021

--

In this blog, I have shared, how to handle missing values in Pandas.

1. What it missing value? Why it is occurs?

The concept of missing values is important to understand in order to successfully manage data. If the missing values are not handled properly, then we may end up drawing an inaccurate inference about the data. Due to improper handling, the result obtained will differ from ones where the missing values are present.

Missing data, or missing values, occur when no data value is stored for the variable in an observation.

2. Reason for missing data

  • Sometimes missing values are caused by the researcher — for example, when data collection is done improperly or mistakes are made in data entry.
  • People do not respond to survey (or specific questions in a survey).
  • Species are rare and cannot be found or sampled.
  • The individual dies or drops out before sampling.
  • Some things are easier to measure than others.
  • Many others!

3. Values considered “missing”

As data comes in many shapes and forms, pandas aims to be flexible with regard to handling missing data. While NaN is the default missing value marker for reasons of computational speed and convenience, we need to be able to easily detect this value with data of different types: floating point, integer, boolean, and general object.

In many cases, however, the Python None will arise and we wish to also consider that “missing” or “not available” or “NA”.

There are several useful functions for detecting, removing, and replacing null values in Pandas DataFrame. They are:

  • isnull()
  • notnull()
  • dropna()
  • fillna()
  • replace()
  • interpolate()

Create a DataFrame

Checking for Missing Values using isnull()

In order to check null values in Pandas DataFrame, we use isnull() function this function return dataframe of Boolean values which are True for NaN values.

Here we can see, Height columns has 2 null values (indicated by value “True”) and Test Score has 1 null value.

We can find the sum of null values in our dataframe using sum()

Checking for missing values using notnull()

In order to check null values in Pandas Dataframe, we use notnull() function this function return dataframe of Boolean values which are False for NaN values.

4. Cleaning/Filling missing data

Pandas objects are equipped with various data manipulation methods for dealing with missing data.

In order to fill null values in a datasets, we use fillna(), replace() and interpolate() function these function replace NaN values with some value of their own. All these function help in filling a null values in datasets of a DataFrame. Interpolate() function is basically used to fill NA values in the dataframe but it uses various interpolation technique to fill the missing values rather than hard-coding the value.

i) Filling missing values using fillna()

ii) Fill gaps forward or backward

Using the same filling arguments as reindexing, we can propagate non-NA values forward or backward.

Available filling method

  • pad / ffill — Fill values forward
  • bfill / backfill — Fill values backward

With time series data, using pad/ffill is extremely common so that the “last known value” is available at every time point.

ffill() = (method=’ffill’) and bfill() = (method=’bfill’)

iii) Fill using mean of the column

iv) Dropping axis labels with missing data: dropna

We may wish to simply exclude labels from a data set which refer to missing data. To do this, use dropna():

Here we can see the row contains missing values (NaN) is removed even it has one NaN value. We can accomplish the same using the below code:

If we want to remove NaN values on columns, simply pass the axis=1 on dropna function.

v) Interpolation

Both Series and DataFrame objects have interpolate(), that, by default, performs linear interpolation at missing data points. This will take the average of the number above and below the missing value in the dataframe. Note that Linear method ignore the index and treat the values as equally spaced.

vi) Replacing generic values

Often times we want to replace arbitrary values with other values.

replace() in Series and replace() in DataFrame provides an efficient yet flexible way to perform such replacements.

For a Series, we can replace a single value or a list of values by another value:

For a DataFrame, we can specify individual values by column:

Here, we replaced a’s value 1 by 15 and b’s value 8 by 15.

5. Dealing with String/Regular expression

Sometimes the missing values in our data may not come up when we do our check for missing values. When checking for missing values, the isnull() function only picks up ‘Nan’ and not other types of missing values such as a dash(‘-‘) or even ‘na’. Let’s create a new dataframe to show this.

Here we can see 5 missing values, one in 1st column, one in 2nd column, and three in 4th column. Let’s see the isnull() method detect the missing values or not:

As we can see, the isnull() method returns 0 which means even though our dataframe has 5 null values it indicates there is no missing values in our dataframe. We can deal with this by first converting it into ‘NaN’ by using the replace function, and then deal with the missing values however we want.

Sometimes we might have to deal with data in the wrong format. For example, in the ‘First Score’ and ‘Third Score’ column we might have a word instead of a number.

As we can see, in the 1st and 3rd column, there is a non number entry. This can be deal with the same way as we did with non standard missing values. We use the replace function to change it to missing value or ‘ NaN ’.

References

https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html#values-considered-missing

https://www.geeksforgeeks.org/working-with-missing-data-in-pandas/

https://www.codementor.io

--

--