Sunday, April 28, 2019

Pandas - 24 (Data Transformation)

By now we have a decent background on how to prepare data for analysis and ready to begin the second stage of data manipulation: the data transformation. After we arrange the form of data and their disposal within the data structure, it is important to transform their values. We will see some common issues and the steps required to overcome them using functions of the pandas library.

Some of these operations involve the presence of duplicate or invalid values, with possible removal or replacement. Other operations relate instead by modifying the indexes. Other steps include handling and processing the numerical values of the data and strings. Let's start with:

1. Removing Duplicates

Duplicate rows might be present in a dataframe for various reasons. In dataframes of enormous size, the detection of these rows can be very problematic. In this case, pandas provides a series of tools to analyze the duplicate data present in large data structures. See the following program:

import pandas as pd
import numpy as np

mydataframe = pd.DataFrame({ 'color': ['white','white','red','red','white'],
                                    'value': [2,1,3,3,2]})
print('The dataframe \n')
print(mydataframe)


print('\nDetect the duplicate rows \n')
print(mydataframe.duplicated())


print('\nKnow which are the duplicate rows\n')
print(mydataframe[mydataframe.duplicated()]) 


In the above program we first create a simple dataframe with some duplicate rows. Then the duplicated() function is applied to the dataframe which can detect the rows that appear to be duplicated. It returns a series of Booleans where each element corresponds to a row, with True if the row is duplicated (i.e., only the other occurrences, not the first), and with False if there are no duplicates in the previous elements.

Finally we check which are the duplicate rows in our dataframe. The output of the program is shown below:

The dataframe

      color  value
0    white      2
1    white      1
2    red          3
3    red          3
4    white      2

Detect the duplicate rows

0    False
1    False
2    False
3     True
4     True
dtype: bool

Know which are the duplicate rows

      color      value
3    red          3
4    white      2

-----------------
(program exited with code: 0)

Press any key to continue . . .


pandas provides the drop_duplicates() function, which returns the dataframes without duplicate rows as shown in the following program:

import pandas as pd
import numpy as np

mydataframe = pd.DataFrame({ 'color': ['white','white','red','red','white'],
                                    'value': [2,1,3,3,2]})
print('The dataframe \n')
print(mydataframe)

print('\nThe duplicate rows\n')
print(mydataframe[mydataframe.duplicated()])

print('\nThe dataframe after deleting the duplicate rows \n')
print(mydataframe.drop_duplicates())


The drop_duplicates() deletes the other occurrences of the rows. The output of the program is shown below:

The dataframe

   color  value
0  white      2
1  white      1
2    red      3
3    red      3
4  white      2

The duplicate rows

     color  value
3    red      3
4   white      2

The dataframe after deleting the duplicate rows

   color  value
0  white      2
1  white      1
2    red       3
------------------
(program exited with code: 0)

Press any key to continue . . .


2. Mapping

Mapping is the creation of a list of matches between two different values, with the ability to bind a value to a particular label or string. The pandas library provides a set of functions which, exploit mapping to perform some operations. The dict objects are used to define mapping as shown below:

mymapping =  {
                         'label1' : 'value1,
                         'label2' : 'value2,
                          ...
                         }

The pandas library provides the following set of specific functions:

• replace()—Replaces values
• map()—Creates a new column
• rename()—Replaces the index values

Let's see how to use these functions:

Replacing Values via Mapping

Sometimes in our data structure there are values that do not meet our needs. For example, the text may be in a foreign language, or may be a synonym of another value, or may not be expressed in the desired shape. In such cases, a replace operation of various values is often a necessary process.

See the following program:

import pandas as pd
import numpy as np

mydataframe = pd.DataFrame({ 'item':['ball','mug','pen','pencil','ashtray'],
                                    'color':['white','lal','hara','black','yellow'],
                                    'price':[5.56,4.20,1.30,0.56,2.75]})
print('The dataframe \n')
print(mydataframe)

newcolors = {'lal': 'red','hara': 'green'}

print('\nThe dataframe after replacing the values \n')
print(mydataframe.replace(newcolors))


In the above shown program we have created a dataframe containing various objects and colors, including two colors, lal and hara, that are not in English. To be able to replace the incorrect values with new values, we define a mapping of correspondences, containing as a key the new values which are English equivalents of the two colors. Finally we use the replace() function with the mapping as an argument and get the dataframe with the correct values. 

The output of the program is shown below:

The dataframe

      item   color  price
0     ball   white   5.56
1      mug     lal   4.20
2      pen    hara   1.30
3   pencil   black   0.56
4  ashtray  yellow   2.75

The dataframe after replacing the values

      item   color  price
0     ball   white   5.56
1      mug     red   4.20
2      pen   green   1.30
3   pencil   black   0.56
4  ashtray  yellow   2.75


------------------
(program exited with code: 0)

Press any key to continue . . .


A common use of the replace() is in the replacement of NaN values with another value, for example 0. See the following program:

import pandas as pd
import numpy as np

myseries = pd.Series([1,3,np.nan,4,6,np.nan,3])

print('\nOriginal Series\n')
print(myseries)

print('\nModified Series\n')
print(myseries.replace(np.nan,0))


In the above program we have replaced all the occurrences on NaN with 0. The output of the program is shown below:

Original Series

0    1.0
1    3.0
2    NaN
3    4.0
4    6.0
5    NaN
6    3.0
dtype: float64

Modified Series

0    1.0
1    3.0
2    0.0
3    4.0
4    6.0
5    0.0
6    3.0
dtype: float64


------------------
(program exited with code: 0)

Press any key to continue . . .

Adding Values via Mapping

We also sometimes need to add values in a column depending on the values contained in another. This is accomplished use the map() as shown in the following program: 

import pandas as pd
import numpy as np

mydataframe = pd.DataFrame({ 'item':['ball','mug','pen','pencil','ashtray'],
                             'color':['white','red','green','black','yellow']})

print('The dataframe \n')
print(mydataframe)

prices = { 'ball' : 5.56, 
           'mug' : 4.20,
           'bottle' : 1.30,
           'scissors' : 3.41,
           'pen' : 1.30,
           'pencil' : 0.56,
           'ashtray' : 2.75
           }

mydataframe['price'] = mydataframe['item'].map(prices)

print('\nThe dataframe with prices added\n')
print(mydataframe) 


In the above program we first create a dataframe with two colomns. Later we want to add a column to indicate the price of the item shown in the dataframe. The prices are defined for each item in the prices dictionary. We then applied the mapping of the prices on the column item using the map(), making sure to add a column to the price dataframe. The map() function applied to a series or to a column of a dataframe accepts a function or an object containing a dict with mapping.

The output of the program is shown below:

The dataframe

      item   color
0     ball   white
1      mug     red
2      pen   green
3   pencil   black
4  ashtray  yellow

The dataframe with prices added

      item   color  price
0     ball   white   5.56
1      mug     red   4.20
2      pen   green   1.30
3   pencil   black   0.56
4  ashtray  yellow   2.75
------------------
(program exited with code: 0)

Press any key to continue . . . .


Rename the Indexes of the Axes

Just like the values contained within the series and the dataframe, the axis label can be transformed in a very similar way using the mapping. To replace the label indexes, pandas provides the rename() function, which takes the mapping as an argument, that is, a dict object. Let's replace the label index in our previous program  so that instead of 01234 it displays first second third fourth. Add the following code to the previous program:

reindex = {0: 'first',1: 'second',2: 'third',3: 'fourth',4: 'fifth'}

print('\nThe dataframe with label renamed\n')
print(mydataframe.rename(reindex))  

The output of the program is shown below:

The dataframe with label renamed

              item       color     price
first        ball       white     5.56
second   mug      red         4.20
third       pen       green     1.30
fourth    pencil    black     0.56
fifth       ashtray  yellow   2.75
------------------
(program exited with code: 0)

Press any key to continue . . .

As we can see  the indexes are renamed. If we want to rename columns we must use the columns option. In this case we assign various mapping explicitly to the two index and columns options as shown below:

recolumn = {'item':'object','price': 'value'}

print('\nThe dataframe with column renamed\n')
print(mydataframe.rename(index=reindex, columns=recolumn))


Add this code to the previous program and see the output which should be:

The dataframe with column renamed

              object    color    value
first        ball       white    5.56
second   mug      red        4.20
third       pen       green    1.30
fourth    pencil    black     0.56
fifth       ashtray  yellow   2.75
------------------
(program exited with code: 0)

Press any key to continue . . .


It is also possible to rename single values as shown in  the following program:

import pandas as pd
import numpy as np

mydataframe = pd.DataFrame({ 'item':['ball','mug','pen','pencil','ashtray'],
                             'color':['white','red','green','black','yellow']})
prices = { 'ball' : 5.56, 
           'mug' : 4.20,
           'bottle' : 1.30,
           'scissors' : 3.41,
           'pen' : 1.30,
           'pencil' : 0.56,
           'ashtray' : 2.75
           }

mydataframe['price'] = mydataframe['item'].map(prices)

print('\nThe original dataframe\n')
print(mydataframe) 

print('\nThe dataframe with label renamed\n')
print(mydataframe.rename(index={1:'first'}, columns={'item':'object'})) 


The output of the program is shown below:


The original dataframe

      item   color  price
0     ball   white   5.56
1      mug     red   4.20
2      pen   green   1.30
3   pencil   black   0.56
4  ashtray  yellow   2.75

The dataframe with label renamed

        object   color  price
0         ball   white   5.56
first      mug     red   4.20
2          pen   green   1.30
3       pencil   black   0.56
4      ashtray  yellow   2.75

The original dataframe

      item   color  price
0     ball   white   5.56
1      mug     red   4.20
2      pen   green   1.30
3   pencil   black   0.56
4  ashtray  yellow   2.75
------------------
(program exited with code: 0)

Press any key to continue . . .


As was desired, the index 1 is renames to first. As you have seen that the rename() function returns a dataframe with the changes, leaving unchanged the original dataframe. If we want the changes to take
effect on the object on which you call the function, you will set the inplace option to True. Add the following code to the previous program:

mydataframe.rename(columns={'item':'object'}, inplace=True)

print('\nThe dataframe with change in item object\n')
print(mydataframe)


The output of the program is shown below:

The original dataframe

      item   color  price
0     ball   white   5.56
1      mug     red   4.20
2      pen   green   1.30
3   pencil   black   0.56
4  ashtray  yellow   2.75

The dataframe with label renamed

        object   color  price
0         ball   white   5.56
first      mug     red   4.20
2          pen   green   1.30
3       pencil   black   0.56
4      ashtray  yellow   2.75

The original dataframe

      item   color  price
0     ball   white   5.56
1      mug     red   4.20
2      pen   green   1.30
3   pencil   black   0.56
4  ashtray  yellow   2.75

The dataframe with change in item object

     object   color  price
0     ball   white   5.56
1      mug     red   4.20
2      pen   green   1.30
3   pencil   black   0.56
4  ashtray  yellow   2.75
------------------
(program exited with code: 0)

Press any key to continue . . .


If you print the original data frame again what will be the output?

Here I am ending today’s post. Until we meet again keep practicing and learning Python, as Python is easy to learn!
Share:

0 comments:

Post a Comment