Friday, April 26, 2019

Pandas - 23 (Pivoting)

Pivoting is a common operation  just like data assembling. In the context of pivoting, we have two basic operations:

• Stacking—Rotates or pivots the data structure converting columns to rows
• Unstacking—Converts rows into columns

Using the stack() function on the dataframe, we will get the pivoting of the columns in rows, thus producing a series. From this hierarchically indexed series, we can reassemble the dataframe into a
pivoted table by use of the unstack() function. We can also do the unstack on a different level, specifying the number of levels or its name as the argument of the function. The following program shows the stacking and unstacking:

import pandas as pd
import numpy as np

frame1 = pd.DataFrame(np.arange(9).reshape(3,3),index=['white','black','red'],columns=['ball','pen','pencil'])

print('frame 1\n')
print(frame1)

ser5 = frame1.stack()
print('\nHierarchically indexed Series\n')
print(ser5)

print('\nReassemble the dataframe into a pivoted table\n')
print(ser5.unstack())

print('\nUnstack on a different level\n')
print(ser5.unstack(0))


The output of the program is shown below:

frame 1

       ball  pen  pencil
white     0    1       2
black     3    4       5
red       6    7       8

Hierarchically indexed Series

white  ball      0
       pen       1
       pencil    2
black  ball      3
       pen       4
       pencil    5
red    ball      6
       pen       7
       pencil    8
dtype: int32

Reassemble the dataframe into a pivoted table

       ball  pen  pencil
white     0    1       2
black     3    4       5
red       6    7       8

Unstack on a different level

        white  black  red
ball        0      3    6
pen         1      4    7
pencil      2      5    8
------------------
(program exited with code: 0)

Press any key to continue . . .


The most common way to store datasets is in CSV format or a table of a database. This happens especially when you have instrumental readings, calculation results iterated over time, or the simple manual input of a series of values. A similar case of these files is for example the logs file, which is filled line by line by accumulating data in it.

The peculiar characteristic of this type of dataset is to have entries on various columns, often duplicated in subsequent lines. Always remaining in tabular format of data, when you are in such cases you can refer them to as long or stacked format. To get a clearer idea about that, consider the following program:

import pandas as pd
import numpy as np

frame2 = pd.DataFrame({ 'color':['white','white','white','red','red','red','black','black','black'],
                                    'item':['ball','pen','mug','ball','pen','mug','ball','pen','mug'],
                                    'value': np.random.rand(9)})
print('Long frame \n')
print(frame2)


The output of the program is shown below:

Long frame

   color  item     value
0  white  ball  0.460043
1  white   pen  0.929175
2  white   mug  0.069400
3    red  ball  0.148175
4    red   pen  0.798596
5    red   mug  0.771690
6  black  ball  0.820520
7  black   pen  0.735380
8  black   mug  0.557580
------------------
(program exited with code: 0)

Press any key to continue . . .


The output shows that this mode of data recording has some disadvantages. One, for example, is the
multiplicity and repetition of some fields. Considering the columns as keys, the data in this format will be difficult to read, especially in fully understanding the relationships between the key values and the rest of the columns.

Instead of the long format, there is another way to arrange the data in a table that is called wide. This mode is easier to read, allowing easy connection with other tables, and it occupies much less space. So in general it is a more efficient way of storing the data, although less practical, especially if during the filling of the data.

As a criterion, select a column, or a set of them, as the primary key; then, the values contained in it must be unique.

In this regard, pandas gives you a function that allows you to make a transformation of a dataframe from the long type to the wide type. This function is pivot() and it accepts as arguments the column, or columns, which will assume the role of key.

Starting from the previous example, we choose to create a dataframe in wide format by choosing the color column as the key, and item as a second key, the values of which will form the new columns of the dataframe. See the following program:

import pandas as pd
import numpy as np

frame2 = pd.DataFrame({ 'color':['white','white','white','red','red','red','black','black','black'],
                                    'item':['ball','pen','mug','ball','pen','mug','ball','pen','mug'],
                                    'value': np.random.rand(9)})
print('Long frame \n')
print(frame2)

wideframe = frame2.pivot('color','item')
print('\nWide frame \n')
print(wideframe)


The output of the program is shown below:

Long frame

   color  item     value
0  white  ball  0.475684
1  white   pen  0.147188
2  white   mug  0.426831
3    red  ball  0.400304
4    red   pen  0.203074
5    red   mug  0.739819
6  black  ball  0.824657
7  black   pen  0.889884
8  black   mug  0.244726

Wide frame

          value
item       ball       mug       pen
color
black  0.824657  0.244726  0.889884
red    0.400304  0.739819  0.203074
white  0.475684  0.426831  0.147188
------------------
(program exited with code: 0)

Press any key to continue . . . 


In the output we can now see, in this format, the dataframe is much more compact and the data contained in it are much more readable.

The last stage of data preparation is the removal of columns and rows. Let's see this through our next program:

import pandas as pd
import numpy as np

frame3 = pd.DataFrame(np.arange(9).reshape(3,3),
                                    index=['white','black','red'],
                                    columns=['ball','pen','pencil'])
print('frame \n')
print(frame3)

del frame3['ball']

print('\nModified frame by removing column \n')
print(frame3)


print('\nModified frame by removing row \n')
print(frame3.drop('white'))


In order to remove a column, we simply use the del command applied to the dataframe with the column name specified. To remove an unwanted row, we have to use the drop() function with the label of the corresponding index as an argument. The output of the program is shown below:

frame

       ball  pen  pencil
white     0    1       2
black     3    4       5
red       6    7       8

Modified frame by removing column

       pen  pencil
white    1       2
black    4       5
red      7       8

Modified frame by removing row

       pen  pencil
black    4       5
red      7       8
------------------
(program exited with code: 0)

Press any key to continue . . .


Here I am ending today’s post. Until we meet again keep practicing and learning Python, as Python is easy to learn!
Share:

0 comments:

Post a Comment