Pandas - 21 (Data Manipulation) ~ Python is easy to learn

In the previous posts we learned how to acquire data from data sources such as databases and files. Once we have the data in the dataframe format, they are ready to be manipulated. It’s important to prepare the data so that they can be more easily subjected to analysis and manipulation. Especially in preparation for the next phase, the data must be ready for visualization.

In the coming posts we'll learn about the functionality that the pandas library offers for this stage of data analysis. The data manipulation has three phases which are:

• Data preparation
• Data transformation
• Data aggregation

The three phases of data manipulation will be treated individually, illustrating the various operations with a series of examples and on how best to use the functions of this library for carrying out such operations. Let's discuss about the first phase in this post which is -

Data Preparation

Data Preparation is necessary before we start manipulating the data. The data must be assembled in the form data structures such that they can be manipulated later with the tools made available by the pandas library. The different procedures for data preparation are:

• Loading
• Assembling
     • Merging
     • Concatenating
     • Combining
• Reshaping (pivoting)
• Removing

In our previous posts we have covered loading where we learned about the conversion from many different formats into a data structure such as a dataframe. But even after we have the data, probably from different sources and formats, and unified it into a dataframe, we need to perform further
operations of preparation. The data contained in the pandas objects can be assembled in different ways:

• Merging—The pandas.merge() function connects the rows in a dataframe based on one or more keys. This mode is very familiar to those who are confident with the SQL language, since it also
implements join operations.

• Concatenating—The pandas.concat() function concatenates the objects along an axis.

• Combining—The pandas.DataFrame.combine_first() function is a method that allows you to connect overlapped data in order to fill in missing values in a data structure by taking data from another structure.

In addition to these, a part of the preparation process is also pivoting, which consists of the
exchange between rows and columns.

Merging

The merging operation consists of a combination of data through the connection of rows using one or more keys, this corresponds to the JOIN operation in SQL.

Usually one uses the JOIN query with SQL to get data from different tables using some reference values (keys) shared between them. On the basis of these keys, it is possible to obtain new data in a tabular form as the result of the combination of other tables. This operation with the library pandas is called merging, and merge() is the function to perform this kind of operation.

Let's make a program which shows merging of two dataframes:

import pandas as pd
import numpy as np

frame1 = pd.DataFrame( {'id':['ball','pencil','pen','mug','ashtray'],
                        'price': [12.33,11.44,33.21,13.23,33.62]})

print('First dataframe\n')
print(frame1)

frame2 = pd.DataFrame( {'id': ['pencil','pencil','ball','pen'],
                        'color':['white','red','red','black']})

print('\nSecond dataframe\n')
print(frame2)
print('\nMerged dataframe\n')
print(pd.merge(frame1,frame2))

In our program we first define two dataframes and print them. Then we carry out the merging by applying the merge() function to the two dataframe objects. The output of the program is shown below:

First dataframe

        id         price
0     ball      12.33
1   pencil    11.44
2      pen     33.21
3      mug   13.23
4 ashtray 33.62

Second dataframe

       id         color
0 pencil     white
1 pencil   red
2    ball      red
3     pen     black

Merged dataframe

       id       price color
0    ball    12.33    red
1 pencil 11.44 white
2 pencil 11.44    red
3     pen   33.21 black
------------------
(program exited with code: 0)

Press any key to continue . . .

As we can see from the output, the returned dataframe consists of all rows that have an ID in common. In addition to the common column, the columns from the first and the second dataframe are added. We used the merge() function without specifying any column explicitly. In fact, in most cases we need to decide which is the column on which to base the merging.

To do this, add the on option with the column name as the key for the merging. In the following program we'll implement decision based merging by specifying the column :

import pandas as pd
import numpy as np

frame1 = pd.DataFrame( {'id':['ball','pencil','pen','mug','ashtray'],
                        'price': [12.33,11.44,33.21,13.23,33.62],
                        'color':['white','red','red','black','green']})

print('First dataframe\n')
print(frame1)

frame2 = pd.DataFrame( {'id': ['pencil','pencil','ball','pen'],
                        'color':['white','red','red','black']})

print('\nSecond dataframe\n')
print(frame2)
print('\nMerged dataframe\n')
print(pd.merge(frame1,frame2))

print('\nMerged dataframe on basis of specified column = id\n')
print(pd.merge(frame1,frame2,on='id'))

print('\nMerged dataframe on basis of specified column = color\n')
print(pd.merge(frame1,frame2,on='color'))

It is necessary to explicitly define the criteria for merging that pandas must follow,
specifying the name of the key column in the on option. The output of the program is shown below:

First dataframe

        id price color
0     ball 12.33 white
1   pencil 11.44    red
2      pen 33.21    red
3      mug 13.23 black
4 ashtray 33.62 green

Second dataframe

       id color
0 pencil white
1 pencil    red
2    ball    red
3     pen black

Merged dataframe

       id price color
0 pencil 11.44   red

Merged dataframe on basis of specified column = id

       id price color_x color_y
0    ball 12.33   white     red
1 pencil 11.44     red   white
2 pencil 11.44     red     red
3     pen 33.21     red   black

Merged dataframe on basis of specified column = color

     id_x price color    id_y
0    ball 12.33 white pencil
1 pencil 11.44    red pencil
2 pencil 11.44    red    ball
3     pen 33.21    red pencil
4     pen 33.21    red    ball
5     mug 13.23 black     pen
------------------
(program exited with code: 0)

Press any key to continue . . .

Sometimes we have two dataframes in which the key columns do not have the same name. In this situation, we have to use the left_on and right_on options, which specify the key column for the first and for the second dataframe. See the following program :

import pandas as pd
import numpy as np

frame1 = pd.DataFrame( {'id':['ball','pencil','pen','mug','ashtray'],
                        'price': [12.33,11.44,33.21,13.23,33.62],
                        'color':['white','red','red','black','green']})

print('First dataframe\n')
print(frame1)

frame2 = pd.DataFrame( {'id': ['pencil','pencil','ball','pen'],
                        'color':['white','red','red','black']})
frame2.columns = ['sid','color']
print('\nSecond dataframe\n')
print(frame2)
print('\nMerged dataframe\n')
print(pd.merge(frame1,frame2))

print('\nMerged dataframe on basis of specified options\n')
print(pd.merge(frame1,frame2,left_on='id', right_on='sid'))

The output of the program is shown below:

First dataframe

        id price color
0     ball 12.33 white
1   pencil 11.44    red
2      pen 33.21    red
3      mug 13.23 black
4 ashtray 33.62 green

Second dataframe

      sid color
0 pencil white
1 pencil    red
2    ball    red
3     pen black

Merged dataframe

       id price color     sid
0    ball 12.33 white pencil
1 pencil 11.44    red pencil
2 pencil 11.44    red    ball
3     pen 33.21    red pencil
4     pen 33.21    red    ball
5     mug 13.23 black     pen

Merged dataframe on basis of specified options

       id price color_x     sid color_y
0    ball 12.33   white    ball     red
1 pencil 11.44     red pencil   white
2 pencil 11.44     red pencil     red
3     pen 33.21     red     pen   black

------------------
(program exited with code: 0)

Press any key to continue . . .

By default, the merge() function performs an inner join; the keys in the result are the result of an intersection. Other possible options are the left join, the right join, and the outer join. The outer
join produces the union of all keys, combining the effect of a left join with a right join. To select the type of join you have to use the how option. See the following program :

import pandas as pd
import numpy as np

frame1 = pd.DataFrame( {'id':['ball','pencil','pen','mug','ashtray'],
                        'price': [12.33,11.44,33.21,13.23,33.62],
                        'color':['white','red','red','black','green']})

print('First dataframe\n')
print(frame1)

frame2 = pd.DataFrame( {'id': ['pencil','pencil','ball','pen'],
                        'color':['white','red','red','black']})
frame2.columns = ['id','color']
print('\nSecond dataframe\n')
print(frame2)
print('\nMerged dataframe\n')
print(pd.merge(frame1,frame2))

print('\nMerged dataframe on basis of outer join\n')
print(pd.merge(frame1,frame2,on='id',how='outer'))

print('\nMerged dataframe on basis of left join\n')
print(pd.merge(frame1,frame2,on='id',how='left'))

print('\nMerged dataframe on basis of right join\n')
print(pd.merge(frame1,frame2,on='id',how='right'))

The output of the program is shown below:

First dataframe

        id price color
0     ball 12.33 white
1   pencil 11.44    red
2      pen 33.21    red
3      mug 13.23 black
4 ashtray 33.62 green

Second dataframe

       id color
0 pencil white
1 pencil    red
2    ball    red
3     pen black

Merged dataframe

       id price color
0 pencil 11.44   red

Merged dataframe on basis of outer join

        id price color_x color_y
0     ball 12.33   white     red
1   pencil 11.44     red   white
2   pencil 11.44     red     red
3      pen 33.21     red   black
4      mug 13.23   black     NaN
5 ashtray 33.62   green     NaN

Merged dataframe on basis of left join

        id price color_x color_y
0     ball 12.33   white     red
1   pencil 11.44     red   white
2   pencil 11.44     red     red
3      pen 33.21     red   black
4      mug 13.23   black     NaN
5 ashtray 33.62   green     NaN

Merged dataframe on basis of right join

       id price color_x color_y
0    ball 12.33   white     red
1 pencil 11.44     red   white
2 pencil 11.44     red     red
3     pen 33.21     red   black

------------------
(program exited with code: 0)

Press any key to continue . . .

To merge multiple keys, we simply add a list to the on option as shown in the following program :

import pandas as pd
import numpy as np

frame1 = pd.DataFrame( {'id':['ball','pencil','pen','mug','ashtray'],
                        'price': [12.33,11.44,33.21,13.23,33.62],
                        'color':['white','red','red','black','green']})

print('First dataframe\n')
print(frame1)

frame2 = pd.DataFrame( {'id': ['pencil','pencil','ball','pen'],
                        'color':['white','red','red','black']})

print('\nSecond dataframe\n')
print(frame2)
print('\nMerged dataframe\n')
print(pd.merge(frame1,frame2))

print('\nMerged dataframe on basis of outer join\n')
print(pd.merge(frame1,frame2,on=['id','color'],how='outer'))

The output of the program is shown below:

First dataframe

        id price color
0     ball 12.33 white
1   pencil 11.44    red
2      pen 33.21    red
3      mug 13.23 black
4 ashtray 33.62 green

Second dataframe

       id color
0 pencil white
1 pencil    red
2    ball    red
3     pen black

Merged dataframe

       id price color
0 pencil 11.44   red

Merged dataframe on basis of outer join

        id price color
0     ball 12.33 white
1   pencil 11.44    red
2      pen 33.21    red
3      mug 13.23 black
4 ashtray 33.62 green
5   pencil    NaN white
6     ball    NaN    red
7      pen    NaN black
------------------
(program exited with code: 0)

Press any key to continue . . .

Merging on an Index

Sometimes instead of considering the columns of a dataframe as keys, indexes could be used as keys for merging. Then in order to decide which indexes to consider, we set the left_index or right_index options to True to activate them, with the ability to activate them both. Another way to do the merging by indexes is by using the join() function on the dataframe objects. It can also be used to combine many dataframe objects having the same or the same indexes but with no columns overlapping. See the following program which shows merging by indexes :

import pandas as pd
import numpy as np

frame1 = pd.DataFrame( {'id':['ball','pencil','pen','mug','ashtray'],
                        'price': [12.33,11.44,33.21,13.23,33.62],
                        'color':['white','red','red','black','green']})

print('First dataframe\n')
print(frame1)

frame2 = pd.DataFrame( {'id': ['pencil','pencil','ball','pen'],
                        'color':['white','red','red','black']})
frame2.columns = ['id','color']
print('\nSecond dataframe\n')
print(frame2)
print('\nMerged dataframe\n')
print(pd.merge(frame1,frame2))

print('\nMerged dataframe on basis of indexes\n')
print(pd.merge(frame1,frame2,right_index=True, left_index=True))

frame2.columns = ['color2','id2']
print('\nMerged dataframe using join()\n')
print(frame1.join(frame2))

The output of the program is shown below:

First dataframe

        id price color
0     ball 12.33 white
1   pencil 11.44    red
2      pen 33.21    red
3      mug 13.23 black
4 ashtray 33.62 green

Second dataframe

       id color
0 pencil white
1 pencil    red
2    ball    red
3     pen black

Merged dataframe

       id price color
0 pencil 11.44   red

Merged dataframe on basis of indexes

     id_x price color_x    id_y color_y
0    ball 12.33   white pencil   white
1 pencil 11.44     red pencil     red
2     pen 33.21     red    ball     red
3     mug 13.23   black     pen   black

Merged dataframe using join()

        id price color color2    id2
0     ball 12.33 white pencil white
1   pencil 11.44    red pencil    red
2      pen 33.21    red    ball    red
3      mug 13.23 black     pen black
4 ashtray 33.62 green     NaN    NaN
------------------
(program exited with code: 0)

Press any key to continue . . .

In the above program we have used frame2.columns = ['color2','id2'] if we don't rename the columns then the program gives error as some columns in frame1 have the same name as frame2. Thus we rename the columns in frame2 before launching the join() function.

Here I am ending today’s post. Until we meet again keep practicing and learning Python, as Python is easy to learn!

Python is easy to learn

Wednesday, April 24, 2019

Pandas - 21 (Data Manipulation)

0 comments:

Post a Comment