Thursday, April 4, 2019

Pandas - 8 (NaN Data)

Missing values are quite common in data analysis. In pandas data structures they are usually represented by the NaN (Not a Number) value, and the library is designed to manage this eventuality well. For example, when pandas calculates descriptive statistics it implicitly excludes NaN values.
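As a quick illustration of this behavior, here is a minimal sketch of my own (not one of the original listings) that computes a few statistics on a series containing a NaN; the missing element is simply skipped:

import pandas as pd
import numpy as np

ser = pd.Series([1, 2, np.nan, 4])
print(ser.mean())    # 2.333... -> average of 1, 2 and 4 only
print(ser.sum())     # 7.0      -> the NaN element is skipped
print(ser.count())   # 3        -> counts only the non-NaN elements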

1. How to assign a NaN value

To specifically assign a NaN value to an element in a data structure, use the np.NaN (or np.nan) value of the NumPy library. See the following program:

import pandas as pd
import numpy as np
ser = pd.Series([0,1,2,np.NaN,9],index=['red','blue','yellow','white','green'])

print('\nseries\n')
print(ser)
ser['blue'] = None
print('\nModified series\n')
print(ser) 


The output of the program is shown below:

series

red       0.0
blue      1.0
yellow    2.0
white     NaN
green     9.0
dtype: float64

Modified series

red       0.0
blue      NaN
yellow    2.0
white     NaN
green     9.0
dtype: float64
------------------
(program exited with code: 0)

Press any key to continue . . . 
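Note that the series is printed with dtype: float64 even though we created it from integer values. This happens because NaN is a floating-point value, so as soon as a missing value is present pandas upcasts the integers to floats (None is converted to NaN in the same way). A small sketch to verify this, assuming the same kind of data as above:

import pandas as pd
import numpy as np

# without missing values the integer dtype is preserved
print(pd.Series([0, 1, 2]).dtype)          # int64

# with a NaN (or None) the values are upcast to float64
print(pd.Series([0, 1, np.nan, 2]).dtype)  # float64
print(pd.Series([0, 1, None, 2]).dtype)    # float64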


2. Filtering Out NaN Values

pandas has the dropna() function to eliminate the NaN values during data analysis. 

See the following program:

import pandas as pd
import numpy as np


ser = pd.Series([0,1,2,np.NaN,9],index=['red','blue','yellow','white','green'])

print('\nseries\n')
print(ser)

print('\nModified series\n')
print(ser.dropna())


The output of the program is shown below:

series

red       0.0
blue      1.0
yellow    2.0
white     NaN
green     9.0
dtype: float64

Modified series

red       0.0
blue      1.0
yellow    2.0
green     9.0
dtype: float64
------------------
(program exited with code: 0)

Press any key to continue . . .
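Note that dropna() returns a new series and leaves the original untouched. A quick sketch to check this:

import pandas as pd
import numpy as np

ser = pd.Series([0,1,2,np.nan,9],index=['red','blue','yellow','white','green'])
cleaned = ser.dropna()

print(len(ser))       # 5 -> the original series still contains the NaN element
print(len(cleaned))   # 4 -> the returned series has it removed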


As you can see from the output, the element with the NaN value has been eliminated from the series. We can also perform the filtering directly by using notnull() as the selection condition, as shown in the following program:

import pandas as pd
import numpy as np


ser = pd.Series([0,1,2,np.NaN,9],index=['red','blue','yellow','white','green'])

print('\nseries\n')
print(ser)

print('\nModified series\n')
print(ser[ser.notnull()])


The output of the program is shown below:

series

red       0.0
blue      1.0
yellow    2.0
white     NaN
green     9.0
dtype: float64

Modified series

red       0.0
blue      1.0
yellow    2.0
green     9.0
dtype: float64
------------------
(program exited with code: 0)

Press any key to continue . . .
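The complement of notnull() is isnull() (also available as isna()), which marks the NaN elements instead. As a small side sketch, it can be used to select or count the missing values:

import pandas as pd
import numpy as np

ser = pd.Series([0,1,2,np.nan,9],index=['red','blue','yellow','white','green'])

print(ser[ser.isnull()])     # only the 'white' element, which is NaN
print(ser.isnull().sum())    # 1 -> number of missing values in the series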


Eliminating NaN elements from a dataframe is a bit more complex. See the following program:

import pandas as pd
import numpy as np

frame = pd.DataFrame([[6,np.nan,6],[np.nan,np.nan,np.nan],[2,np.nan,5]],
                        index = ['blue','green','red'],
                        columns = ['ball','mug','pen'])


print('\nOriginal dataframe\n')
print(frame)

print('\nModified dataframe\n')
print(frame.dropna()) 


The output of the program is shown below: 

Original dataframe

       ball  mug  pen
blue    6.0  NaN  6.0
green   NaN  NaN  NaN
red     2.0  NaN  5.0

Modified dataframe

Empty DataFrame
Columns: [ball, mug, pen]
Index: []
------------------
(program exited with code: 0)

Press any key to continue . . .


We used the dropna() function, which by default eliminates any row that contains even a single NaN value. Since every row of our dataframe contains at least one NaN (the mug column is entirely NaN), the result is an empty DataFrame. To avoid having entire rows or columns disappear because of a single NaN, pandas allows us to specify the how option, assigning the value 'all' to it. This tells the dropna() function to delete only the rows or columns in which all elements are NaN. See the following program:

import pandas as pd
import numpy as np


frame = pd.DataFrame([[6,np.nan,6],[np.nan,np.nan,np.nan],[2,np.nan,5]],
                        index = ['blue','green','red'],
                        columns = ['ball','mug','pen'])
print('\nOriginal dataframe\n')
print(frame)

print('\nModified dataframe\n')
print(frame.dropna(how='all'))


The output of the program is shown below:

Original dataframe

       ball  mug  pen
blue    6.0  NaN  6.0
green   NaN  NaN  NaN
red     2.0  NaN  5.0

Modified dataframe

      ball  mug  pen
blue   6.0  NaN  6.0
red    2.0  NaN  5.0
------------------
(program exited with code: 0)

Press any key to continue . . .
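The how='all' option can also be combined with the axis option to work column-wise instead of row-wise. The following short sketch (not part of the original listings) would drop the mug column, since it is the only column made up entirely of NaN values:

import pandas as pd
import numpy as np

frame = pd.DataFrame([[6,np.nan,6],[np.nan,np.nan,np.nan],[2,np.nan,5]],
                        index = ['blue','green','red'],
                        columns = ['ball','mug','pen'])

# drop only the columns whose elements are all NaN (here: 'mug')
print(frame.dropna(axis=1, how='all'))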


3. Filling in NaN Occurrences

If we don't want to risk losing data by filtering out the NaN values within the data structures, we can replace them with other values instead. For this, pandas provides the fillna() function. It takes as its argument the value with which to replace the NaN values: either a single value, or a dict mapping each column to its own replacement value. See the following program:

import pandas as pd
import numpy as np

frame = pd.DataFrame([[6,np.nan,6],[np.nan,np.nan,np.nan],[2,np.nan,5]],
                        index = ['blue','green','red'],
                        columns = ['ball','mug','pen'])


print('\nOriginal dataframe\n')
print(frame)

print('\nModified dataframe\n')
print(frame.fillna(0))

print('\nModified dataframe with provided values\n')
print(frame.fillna({'ball':1,'mug':0,'pen':99}))


The output of the program is shown below:

Original dataframe

       ball  mug  pen
blue    6.0  NaN  6.0
green   NaN  NaN  NaN
red     2.0  NaN  5.0

Modified dataframe

       ball  mug  pen
blue    6.0  0.0  6.0
green   0.0  0.0  0.0
red     2.0  0.0  5.0

Modified dataframe with provided values

       ball  mug   pen
blue    6.0  0.0   6.0
green   1.0  0.0  99.0
red     2.0  0.0   5.0
------------------
(program exited with code: 0)

Press any key to continue . . .



As seen in the output, we first replace every NaN value with 0 using fillna(0). Then we replace the NaN values with a different value for each column, specifying the column names and the associated replacement values one by one with fillna({'ball':1,'mug':0,'pen':99}).
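Since fillna() accepts either a single value or a per-column mapping, a common pattern (shown here as a small sketch of my own, not from the original listings) is to fill each column's missing entries with that column's mean:

import pandas as pd
import numpy as np

frame = pd.DataFrame([[6,np.nan,6],[np.nan,np.nan,np.nan],[2,np.nan,5]],
                        index = ['blue','green','red'],
                        columns = ['ball','mug','pen'])

# fill each column's NaN values with that column's mean;
# 'mug' stays NaN because a column containing only NaN values has no mean
print(frame.fillna(frame.mean()))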


Here I am ending today’s post. Until we meet again keep practicing and learning Python, as Python is easy to learn!