Wednesday, April 24, 2019

Pandas - 21 (Data Manipulation)

In the previous posts we learned how to acquire data from data sources such as databases and files. Once we have the data in the dataframe format, they are ready to be manipulated. It’s important to prepare the data so that they can be more easily subjected to analysis and manipulation. Especially in preparation for the next phase, the data must be ready for visualization.

In the coming posts we'll learn about the functionality that the pandas library offers for this stage of data analysis. The data manipulation has three phases which are:

• Data preparation
• Data transformation
• Data aggregation

The three phases of data manipulation will be treated individually, illustrating the various operations with a series of examples and on how best to use the functions of this library for carrying out such operations. Let's discuss about the first phase in this post which is -

Data Preparation

Data Preparation is necessary before we start manipulating the data. The data must be assembled in the form data structures such that they can be manipulated later with the tools made available by the pandas library. The different procedures for data preparation are:

• Loading
• Assembling
     • Merging
     • Concatenating
     • Combining
• Reshaping (pivoting)
• Removing

In our previous posts we have covered loading where we learned about the conversion from many different formats into a data structure such as a dataframe. But even after we have the data, probably from different sources and formats, and unified it into a dataframe, we need to perform further
operations of preparation. The data contained in the pandas objects can be assembled in different ways:

• Merging—The pandas.merge() function connects the rows in a dataframe based on one or more keys. This mode is very familiar to those who are confident with the SQL language, since it also
implements join operations.

• Concatenating—The pandas.concat() function concatenates the objects along an axis.

• Combining—The pandas.DataFrame.combine_first() function is a method that allows you to connect overlapped data in order to fill in missing values in a data structure by taking data from another structure.

In addition to these, a part of the preparation process is also pivoting, which consists of the
exchange between rows and columns.

Merging

The merging operation consists of a combination of data through the connection of rows using one or more keys, this corresponds to the JOIN operation in SQL.

Usually one uses the JOIN query with SQL to get data from different tables using some reference values (keys) shared between them. On the basis of these keys, it is possible to obtain new data in a tabular form as the result of the combination of other tables. This operation with the library pandas is called merging, and merge() is the function to perform this kind of operation.

Let's make a program which shows merging of two dataframes:

import pandas as pd
import numpy as np

frame1 = pd.DataFrame( {'id':['ball','pencil','pen','mug','ashtray'],
                        'price': [12.33,11.44,33.21,13.23,33.62]})

print('First dataframe\n')
print(frame1)

frame2 = pd.DataFrame( {'id': ['pencil','pencil','ball','pen'],
                        'color':['white','red','red','black']})

print('\nSecond dataframe\n')
print(frame2)
print('\nMerged dataframe\n')
print(pd.merge(frame1,frame2))


In our program we first define two dataframes and print them. Then we carry out the merging by applying the merge() function to the two dataframe objects. The output of the program is shown below:

First dataframe

        id         price
0     ball      12.33
1   pencil    11.44
2      pen     33.21
3      mug   13.23
4  ashtray  33.62

Second dataframe

       id         color
0  pencil     white
1  pencil     red
2    ball      red
3     pen     black

Merged dataframe

       id       price  color
0    ball    12.33    red
1  pencil  11.44  white
2  pencil  11.44    red
3     pen   33.21  black
------------------
(program exited with code: 0)

Press any key to continue . . .


As we can see from the output, the returned dataframe consists of all rows that have an ID in common. In addition to the common column, the columns from the first and the second dataframe are added. We used the merge() function without specifying any column explicitly. In fact, in most cases we need to decide which is the column on which to base the merging.

To do this, add the on option with the column name as the key for the merging. In the following program we'll implement decision based merging by specifying the column :

import pandas as pd
import numpy as np

frame1 = pd.DataFrame( {'id':['ball','pencil','pen','mug','ashtray'],
                        'price': [12.33,11.44,33.21,13.23,33.62],
                        'color':['white','red','red','black','green']})

print('First dataframe\n')
print(frame1)

frame2 = pd.DataFrame( {'id': ['pencil','pencil','ball','pen'],
                        'color':['white','red','red','black']})

print('\nSecond dataframe\n')
print(frame2)
print('\nMerged dataframe\n')
print(pd.merge(frame1,frame2))

print('\nMerged dataframe on basis of specified column = id\n')
print(pd.merge(frame1,frame2,on='id'))

print('\nMerged dataframe on basis of specified column = color\n')
print(pd.merge(frame1,frame2,on='color'))



It is necessary to explicitly define the criteria for merging that pandas must follow,
specifying the name of the key column in the on option. The output of the program is shown below:

First dataframe

        id  price  color
0     ball  12.33  white
1   pencil  11.44    red
2      pen  33.21    red
3      mug  13.23  black
4  ashtray  33.62  green

Second dataframe

       id  color
0  pencil  white
1  pencil    red
2    ball    red
3     pen  black

Merged dataframe

       id  price color
0  pencil  11.44   red

Merged dataframe on basis of specified column = id

       id  price color_x color_y
0    ball  12.33   white     red
1  pencil  11.44     red   white
2  pencil  11.44     red     red
3     pen  33.21     red   black

Merged dataframe on basis of specified column = color

     id_x  price  color    id_y
0    ball  12.33  white  pencil
1  pencil  11.44    red  pencil
2  pencil  11.44    red    ball
3     pen  33.21    red  pencil
4     pen  33.21    red    ball
5     mug  13.23  black     pen
------------------
(program exited with code: 0)

Press any key to continue . . .


Sometimes we have two dataframes in which the key columns do not have the same name. In this situation, we have to use the left_on and right_on options, which specify the key column for the first and for the second dataframe. See the following program :

import pandas as pd
import numpy as np

frame1 = pd.DataFrame( {'id':['ball','pencil','pen','mug','ashtray'],
                        'price': [12.33,11.44,33.21,13.23,33.62],
                        'color':['white','red','red','black','green']})

print('First dataframe\n')
print(frame1)

frame2 = pd.DataFrame( {'id': ['pencil','pencil','ball','pen'],
                        'color':['white','red','red','black']})
frame2.columns = ['sid','color']
print('\nSecond dataframe\n')
print(frame2)
print('\nMerged dataframe\n')
print(pd.merge(frame1,frame2))

print('\nMerged dataframe on basis of specified options\n')
print(pd.merge(frame1,frame2,left_on='id', right_on='sid'))


The output of the program is shown below:

First dataframe

        id  price  color
0     ball  12.33  white
1   pencil  11.44    red
2      pen  33.21    red
3      mug  13.23  black
4  ashtray  33.62  green

Second dataframe

      sid  color
0  pencil  white
1  pencil    red
2    ball    red
3     pen  black

Merged dataframe

       id  price  color     sid
0    ball  12.33  white  pencil
1  pencil  11.44    red  pencil
2  pencil  11.44    red    ball
3     pen  33.21    red  pencil
4     pen  33.21    red    ball
5     mug  13.23  black     pen

Merged dataframe on basis of specified options

       id  price color_x     sid color_y
0    ball  12.33   white    ball     red
1  pencil  11.44     red  pencil   white
2  pencil  11.44     red  pencil     red
3     pen  33.21     red     pen   black

------------------
(program exited with code: 0)

Press any key to continue . . . 


By default, the merge() function performs an inner join; the keys in the result are the result of an intersection. Other possible options are the left join, the right join, and the outer join. The outer
join produces the union of all keys, combining the effect of a left join with a right join. To select the type of join you have to use the how option. See the following program :

import pandas as pd
import numpy as np

frame1 = pd.DataFrame( {'id':['ball','pencil','pen','mug','ashtray'],
                        'price': [12.33,11.44,33.21,13.23,33.62],
                        'color':['white','red','red','black','green']})

print('First dataframe\n')
print(frame1)

frame2 = pd.DataFrame( {'id': ['pencil','pencil','ball','pen'],
                        'color':['white','red','red','black']})
frame2.columns = ['id','color']
print('\nSecond dataframe\n')
print(frame2)
print('\nMerged dataframe\n')
print(pd.merge(frame1,frame2))

print('\nMerged dataframe on basis of outer join\n')
print(pd.merge(frame1,frame2,on='id',how='outer'))

print('\nMerged dataframe on basis of left join\n')
print(pd.merge(frame1,frame2,on='id',how='left'))

print('\nMerged dataframe on basis of right join\n')
print(pd.merge(frame1,frame2,on='id',how='right'))


The output of the program is shown below:

First dataframe

        id  price  color
0     ball  12.33  white
1   pencil  11.44    red
2      pen  33.21    red
3      mug  13.23  black
4  ashtray  33.62  green

Second dataframe

       id  color
0  pencil  white
1  pencil    red
2    ball    red
3     pen  black

Merged dataframe

       id  price color
0  pencil  11.44   red

Merged dataframe on basis of outer join

        id  price color_x color_y
0     ball  12.33   white     red
1   pencil  11.44     red   white
2   pencil  11.44     red     red
3      pen  33.21     red   black
4      mug  13.23   black     NaN
5  ashtray  33.62   green     NaN

Merged dataframe on basis of left join

        id  price color_x color_y
0     ball  12.33   white     red
1   pencil  11.44     red   white
2   pencil  11.44     red     red
3      pen  33.21     red   black
4      mug  13.23   black     NaN
5  ashtray  33.62   green     NaN

Merged dataframe on basis of right join

       id  price color_x color_y
0    ball  12.33   white     red
1  pencil  11.44     red   white
2  pencil  11.44     red     red
3     pen  33.21     red   black

------------------
(program exited with code: 0)

Press any key to continue . . . 


To merge multiple keys, we simply add a list to the on option as shown in the following program :

import pandas as pd
import numpy as np

frame1 = pd.DataFrame( {'id':['ball','pencil','pen','mug','ashtray'],
                        'price': [12.33,11.44,33.21,13.23,33.62],
                        'color':['white','red','red','black','green']})

print('First dataframe\n')
print(frame1)

frame2 = pd.DataFrame( {'id': ['pencil','pencil','ball','pen'],
                        'color':['white','red','red','black']})

print('\nSecond dataframe\n')
print(frame2)
print('\nMerged dataframe\n')
print(pd.merge(frame1,frame2))

print('\nMerged dataframe on basis of outer join\n')
print(pd.merge(frame1,frame2,on=['id','color'],how='outer'))


The output of the program is shown below:

First dataframe

        id  price  color
0     ball  12.33  white
1   pencil  11.44    red
2      pen  33.21    red
3      mug  13.23  black
4  ashtray  33.62  green

Second dataframe

       id  color
0  pencil  white
1  pencil    red
2    ball    red
3     pen  black

Merged dataframe

       id  price color
0  pencil  11.44   red

Merged dataframe on basis of outer join

        id  price  color
0     ball  12.33  white
1   pencil  11.44    red
2      pen  33.21    red
3      mug  13.23  black
4  ashtray  33.62  green
5   pencil    NaN  white
6     ball    NaN    red
7      pen    NaN  black
------------------
(program exited with code: 0)

Press any key to continue . . . 


Merging on an Index

Sometimes instead of considering the columns of a dataframe as keys, indexes could be used as keys for merging. Then in order to decide which indexes to consider, we set the left_index or right_index options to True to activate them, with the ability to activate them both. Another way to do the merging by indexes is by using the join() function on the dataframe objects. It can also be used to combine many dataframe objects having the same or the same indexes but with no columns overlapping. See the following program which shows merging by indexes :

import pandas as pd
import numpy as np

frame1 = pd.DataFrame( {'id':['ball','pencil','pen','mug','ashtray'],
                        'price': [12.33,11.44,33.21,13.23,33.62],
                        'color':['white','red','red','black','green']})

print('First dataframe\n')
print(frame1)

frame2 = pd.DataFrame( {'id': ['pencil','pencil','ball','pen'],
                        'color':['white','red','red','black']})
frame2.columns = ['id','color']
print('\nSecond dataframe\n')
print(frame2)
print('\nMerged dataframe\n')
print(pd.merge(frame1,frame2))

print('\nMerged dataframe on basis of indexes\n')
print(pd.merge(frame1,frame2,right_index=True, left_index=True))

frame2.columns = ['color2','id2']
print('\nMerged dataframe using join()\n')
print(frame1.join(frame2))


The output of the program is shown below:

First dataframe

        id  price  color
0     ball  12.33  white
1   pencil  11.44    red
2      pen  33.21    red
3      mug  13.23  black
4  ashtray  33.62  green

Second dataframe

       id  color
0  pencil  white
1  pencil    red
2    ball    red
3     pen  black

Merged dataframe

       id  price color
0  pencil  11.44   red

Merged dataframe on basis of indexes

     id_x  price color_x    id_y color_y
0    ball  12.33   white  pencil   white
1  pencil  11.44     red  pencil     red
2     pen  33.21     red    ball     red
3     mug  13.23   black     pen   black

Merged dataframe using join()

        id  price  color  color2    id2
0     ball  12.33  white  pencil  white
1   pencil  11.44    red  pencil    red
2      pen  33.21    red    ball    red
3      mug  13.23  black     pen  black
4  ashtray  33.62  green     NaN    NaN
------------------
(program exited with code: 0)

Press any key to continue . . .


In the above program we have used frame2.columns = ['color2','id2']  if we don't rename the columns then the program gives error as some columns in frame1 have the same name as frame2. Thus we rename the columns in frame2 before launching the join() function. 

Here I am ending today’s post. Until we meet again keep practicing and learning Python, as Python is easy to learn!
Share:

0 comments:

Post a Comment