Sunday, April 7, 2019

Pandas - 9 (Hierarchical Indexing and Leveling)

Hierarchical Indexing allows us to have multiple levels of indexes on a single axis. It gives a way to work with data in multiple dimensions while continuing to work in a two-dimensional structure.
Let's write a program that creates a series with two arrays of index labels, that is, a structure with two levels:

import numpy as np
import pandas as pd

# np.random.rand() needs numpy, so it is imported alongside pandas
ser = pd.Series(np.random.rand(8),
                index=[['white','white','white','blue','blue','red','red','red'],
                       ['up','down','right','up','down','up','down','left']])

print('\nOriginal series\n')
print(ser)

print('\nIndex of series\n')
print(ser.index)


The output of the program is shown below: 

Original series

white  up       0.072994
       down     0.934169
       right    0.224944
blue   up       0.368885
       down     0.502154
red    up       0.714370
       down     0.900322
       left     0.626923
dtype: float64

Index of series

MultiIndex(levels=[['blue', 'red', 'white'], ['down', 'left', 'right', 'up']],
           codes=[[2, 2, 2, 0, 0, 1, 1, 1], [3, 0, 2, 3, 0, 3, 0, 1]])


------------------
(program exited with code: 0)

Press any key to continue . . .
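
To see how pandas stores the two levels, we can inspect the index object directly. The following is only a minimal sketch and not part of the program above: it builds the same labels with pd.MultiIndex.from_arrays(), and the level names 'color' and 'position' as well as the fixed values from np.arange() are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Same labels as in the program above
colors = ['white', 'white', 'white', 'blue', 'blue', 'red', 'red', 'red']
positions = ['up', 'down', 'right', 'up', 'down', 'up', 'down', 'left']

# The level names are hypothetical; the original series leaves them unnamed
index = pd.MultiIndex.from_arrays([colors, positions], names=['color', 'position'])
ser = pd.Series(np.arange(8, dtype=float), index=index)

print(ser.index.nlevels)                    # number of levels in the index
print(list(ser.index.names))                # names of the levels
print(ser.index.get_level_values('color'))  # outer label of every row
print(list(ser.index.levels[1]))            # unique labels of the inner level
```

Naming the levels is optional, but it lets later calls refer to a level by name instead of by position.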


The output shows a series whose index has two levels. Hierarchical indexing simplifies selecting subsets of values: we can select all the values for a given label of the first level, all the values for a given label of the second level, or a single value by specifying both labels, as shown in the following program:

import pandas as pd
import numpy as np

ser = pd.Series( np.random.rand(8),index=[['white','white','white','blue','blue','red','red','red'],['up','down','right','up','down','up','down','left']])

print('\nValues for a given value of the first index\n')
print(ser['white'])

print('\nValues for a given value of the second index\n')
print(ser[:,'up'])

print('\nA specific value by specifying both indexes\n')
print(ser['white','up'])


The output of the program is shown below:

Values for a given value of the first index

up       0.829193
down     0.066195
right    0.403016
dtype: float64

Values for a given value of the second index

white    0.829193
blue     0.148104
red      0.558666
dtype: float64

A specific value by specifying both indexes

0.8291926932314726
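
The same three selections can also be written with .loc and xs(), which some readers find more explicit. This is a minimal sketch under the assumption of fixed values (0.1 to 0.8) instead of random ones, so that the results are reproducible.

```python
import pandas as pd

colors = ['white', 'white', 'white', 'blue', 'blue', 'red', 'red', 'red']
positions = ['up', 'down', 'right', 'up', 'down', 'up', 'down', 'left']

# Fixed values are an assumption made for reproducibility
ser = pd.Series([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8], index=[colors, positions])

outer = ser.loc['white']            # equivalent to ser['white']
inner = ser.xs('up', level=1)       # equivalent to ser[:, 'up']
single = ser.loc[('white', 'up')]   # equivalent to ser['white', 'up']

print(outer)
print(inner)
print(single)
```

xs() takes the level as an argument, so it reads the same way regardless of how many levels the index has.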


Hierarchical indexing plays a critical role in reshaping data and in group-based operations such as building a pivot table. In the following program we'll use the unstack() and stack() functions. The unstack() function converts a series with a hierarchical index into a simple dataframe: by default the innermost level of the index becomes the new set of columns, and combinations that do not exist in the series are filled with NaN. To perform the reverse operation, which is to convert a dataframe to a series, we use the stack() function. See the following program:

import pandas as pd
import numpy as np

ser = pd.Series( np.random.rand(8),index=[['white','white','white','blue','blue','red','red','red'],['up','down','right','up','down','up','down','left']])

f1=ser.unstack()

print('\nConverting the series with a hierarchical index to a simple dataframe\n')
print(f1)

print('\nConverting a dataframe to a series\n')
print(f1.stack())


The output of the program is shown below: 

Converting the series with a hierarchical index to a simple dataframe

           down      left     right        up
blue   0.656229       NaN       NaN  0.063722
red    0.828567  0.735408       NaN  0.118687
white  0.938660       NaN  0.120901  0.687507

Converting a dataframe to a series

blue   down     0.656229
       up       0.063722
red    down     0.828567
       left     0.735408
       up       0.118687
white  down     0.938660
       right    0.120901
       up       0.687507
dtype: float64
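
unstack() also accepts a level argument, so we can move the outer level to the columns instead of the inner one. The sketch below shows that, and then checks that stack() brings us back to the original series. It assumes fixed values instead of random ones, and it calls dropna() after stack() because, depending on the pandas version, the NaN cells of the dataframe may or may not be dropped automatically.

```python
import pandas as pd

colors = ['white', 'white', 'white', 'blue', 'blue', 'red', 'red', 'red']
positions = ['up', 'down', 'right', 'up', 'down', 'up', 'down', 'left']

# Fixed values are an assumption made for reproducibility
ser = pd.Series([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8], index=[colors, positions])

wide = ser.unstack()                   # inner level ('up', 'down', ...) becomes columns
wide_by_color = ser.unstack(level=0)   # outer level ('white', 'blue', 'red') becomes columns

# stack() reverses unstack(); dropna() removes cells for combinations that
# never existed, and sort_index() puts both series in the same order
roundtrip = wide.stack().dropna().sort_index()

print(wide)
print(wide_by_color)
print(roundtrip.equals(ser.sort_index()))
```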


With the dataframe we can define a hierarchical index both for the rows and for the columns. To do so, at the time the dataframe is declared, define an array of arrays for the index and columns options as shown in the following program:

import pandas as pd
import numpy as np

frame1 = pd.DataFrame(np.arange(16).reshape((4,4)),
                    index=[['red','blue','yellow','white'],['up','down','up','down']],
                    columns=[['ball','pen','pencil','paper'],[1,2,1,2]])


print('\nThe dataframe\n')
print(frame1)


The output of the program is shown below:

The dataframe

             ball pen pencil paper
                1   2      1     2
red    up       0   1      2     3
blue   down     4   5      6     7
yellow up       8   9     10    11
white  down    12  13     14    15
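
Once both axes are hierarchical, selection works on the columns in the same way as on the rows. The following is a minimal sketch using the same dataframe; the variable names (pen, ups, ones, cell) are only illustrative.

```python
import numpy as np
import pandas as pd

frame1 = pd.DataFrame(np.arange(16).reshape((4, 4)),
                      index=[['red', 'blue', 'yellow', 'white'],
                             ['up', 'down', 'up', 'down']],
                      columns=[['ball', 'pen', 'pencil', 'paper'],
                               [1, 2, 1, 2]])

pen = frame1['pen']                             # outer column label: sub-frame for 'pen'
ups = frame1.xs('up', level=1)                  # rows whose inner label is 'up'
ones = frame1.xs(1, level=1, axis=1)            # columns whose inner label is 1
cell = frame1.loc[('red', 'up'), ('ball', 1)]   # a single cell, both keys given in full

print(pen)
print(ups)
print(ones)
print(cell)
```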



Sometimes we need to rearrange the order of the levels on an axis, or sort the data by the labels of a specific level. To rearrange the levels we use the swaplevel() function, which accepts the names (or the integer positions) of the two levels that we want to interchange and returns a new object with those two levels swapped, while leaving the data unmodified. The following program shows how to use this function:

import pandas as pd
import numpy as np

frame1 = pd.DataFrame(np.arange(16).reshape((4,4)),
                    index=[['red','blue','yellow','white'],['up','down','up','down']],
                    columns=[['ball','pen','pencil','paper'],[1,2,1,2]])


frame1.columns.names = ['objects','id']

frame1.index.names = ['colors','status']

print('\nThe dataframe\n')
print(frame1)

print('\nUsing swaplevel()\n')
print(frame1.swaplevel('colors','status'))



The output of the program is shown below:


The dataframe

objects        ball pen pencil paper
id                1   2      1     2
colors status
red    up         0   1      2     3
blue   down       4   5      6     7
yellow up         8   9     10    11
white  down      12  13     14    15

Using swaplevel()

objects        ball pen pencil paper
id                1   2      1     2
status colors
up     red        0   1      2     3
down   blue       4   5      6     7
up     yellow     8   9     10    11
down   white     12  13     14    15
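
swaplevel() is not limited to the rows. As a small sketch, the same call with axis=1 swaps the two column levels, and integer positions can be used in place of level names.

```python
import numpy as np
import pandas as pd

frame1 = pd.DataFrame(np.arange(16).reshape((4, 4)),
                      index=[['red', 'blue', 'yellow', 'white'],
                             ['up', 'down', 'up', 'down']],
                      columns=[['ball', 'pen', 'pencil', 'paper'],
                               [1, 2, 1, 2]])
frame1.columns.names = ['objects', 'id']
frame1.index.names = ['colors', 'status']

swapped_cols = frame1.swaplevel('objects', 'id', axis=1)  # swap the column levels
swapped_rows = frame1.swaplevel(0, 1)                     # row levels, given by position

print(swapped_cols)
print(swapped_rows)
```

In both cases only the labels move; the values in the body of the dataframe stay where they were.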


The sort_index() function sorts the data by the labels of the index. When we pass a level name (or position) through the level parameter, the rows are ordered by the labels of that level. This is used in the following program:

import pandas as pd
import numpy as np

frame1 = pd.DataFrame(np.arange(16).reshape((4,4)),
                    index=[['red','blue','yellow','white'],['up','down','up','down']],
                    columns=[['ball','pen','pencil','paper'],[1,2,1,2]])


frame1.columns.names = ['objects','id']

frame1.index.names = ['colors','status']

print('\nThe dataframe\n')
print(frame1)

print('\nUsing sort_index()\n')
print(frame1.sort_index(level='colors'))


The output of the program is shown below: 

The dataframe

objects        ball pen pencil paper
id                1   2      1     2
colors status
red    up         0   1      2     3
blue   down       4   5      6     7
yellow up         8   9     10    11
white  down      12  13     14    15

Using sort_index()

objects        ball pen pencil paper
id                1   2      1     2
colors status
blue   down       4   5      6     7
red    up         0   1      2     3
white  down      12  13     14    15
yellow up         8   9     10    11
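
The level parameter can name any level, and with axis=1 the columns are sorted instead of the rows. Here is a short sketch on the same dataframe; the variable names by_status and cols_sorted are only illustrative.

```python
import numpy as np
import pandas as pd

frame1 = pd.DataFrame(np.arange(16).reshape((4, 4)),
                      index=[['red', 'blue', 'yellow', 'white'],
                             ['up', 'down', 'up', 'down']],
                      columns=[['ball', 'pen', 'pencil', 'paper'],
                               [1, 2, 1, 2]])
frame1.columns.names = ['objects', 'id']
frame1.index.names = ['colors', 'status']

# Sort the rows by the inner level; ties are broken by the remaining level
by_status = frame1.sort_index(level='status')

# Sort the columns by the outer column level
cols_sorted = frame1.sort_index(axis=1, level='objects')

print(by_status)
print(cols_sorted)
```

With level='status' the two 'down' rows come first (blue, then white), followed by the two 'up' rows (red, then yellow).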



Many descriptive and summary statistics on a dataframe or on a series can be computed per index level. In pandas versions before 2.0 these functions accepted a level option for this purpose (for example, frame1.sum(level='colors')). That option was removed in pandas 2.0; the equivalent is to group by the level with groupby(level=...) and then apply the statistic.

In the following program we'll create a summary statistic at row level, for which we simply specify the level name (level='colors'):

import pandas as pd
import numpy as np

frame1 = pd.DataFrame(np.arange(16).reshape((4,4)),
                    index=[['red','blue','yellow','white'],['up','down','up','down']],
                    columns=[['ball','pen','pencil','paper'],[1,2,1,2]])


frame1.columns.names = ['objects','id']

frame1.index.names = ['colors','status']

print('\nThe dataframe\n')
print(frame1)

print('\nSummary Statistic by Level\n')
# Replaces frame1.sum(level='colors') from pandas < 2.0;
# sort=False keeps the colors in their original order
print(frame1.groupby(level='colors', sort=False).sum())


The output of the program is shown below:

The dataframe

objects        ball pen pencil paper
id                1   2      1     2
colors status
red    up         0   1      2     3
blue   down       4   5      6     7
yellow up         8   9     10    11
white  down      12  13     14    15

Summary Statistic by Level

objects ball pen pencil paper
id         1   2      1     2
colors
red        0   1      2     3
blue       4   5      6     7
yellow     8   9     10    11
white     12  13     14    15
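
Because every color appears only once in frame1, the sums per color are identical to the original rows, which hides what the aggregation does. The sketch below uses a hypothetical dataframe, frame2, in which each color appears twice, and groups with groupby(level=...), which works on current pandas versions.

```python
import numpy as np
import pandas as pd

# frame2 is a hypothetical variant of frame1 with repeated colors
frame2 = pd.DataFrame(np.arange(16).reshape((4, 4)),
                      index=[['red', 'red', 'blue', 'blue'],
                             ['up', 'down', 'up', 'down']],
                      columns=[['ball', 'pen', 'pencil', 'paper'],
                               [1, 2, 1, 2]])
frame2.columns.names = ['objects', 'id']
frame2.index.names = ['colors', 'status']

# One row per color: the 'up' and 'down' rows of each color are added together
per_color = frame2.groupby(level='colors', sort=False).sum()

print(frame2)
print(per_color)
```

Here the 'red' row of the result is the sum of rows 0 1 2 3 and 4 5 6 7, that is 4 6 8 10.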



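In the next program we create a statistic for a given level of the columns, the id. In pandas versions before 2.0 this was written as frame1.sum(level='id', axis=1), where the axis option set to 1 selects the second axis. On current versions we get the same result by transposing the dataframe, grouping by the column level, and transposing back:

import pandas as pd
import numpy as np

frame1 = pd.DataFrame(np.arange(16).reshape((4,4)),
                    index=[['red','blue','yellow','white'],['up','down','up','down']],
                    columns=[['ball','pen','pencil','paper'],[1,2,1,2]])


frame1.columns.names = ['objects','id']

frame1.index.names = ['colors','status']

print('\nThe dataframe\n')
print(frame1)

print('\nA statistic for a level=id\n')
# Replaces frame1.sum(level='id', axis=1) from pandas < 2.0:
# transpose, group the former columns by 'id', sum, and transpose back
print(frame1.T.groupby(level='id').sum().T)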


The output of the program is shown below: 

The dataframe

objects        ball pen pencil paper
id                1   2      1     2
colors status
red    up         0   1      2     3
blue   down       4   5      6     7
yellow up         8   9     10    11
white  down      12  13     14    15

A statistic for a level=id

id              1   2
colors status
red    up       2   4
blue   down    10  12
yellow up      18  20
white  down    26  28



Here I am ending today’s post. Until we meet again keep practicing and learning Python, as Python is easy to learn!