Pivoting is a common operation just like data assembling. In the context of pivoting, we have two basic operations:
• Stacking—Rotates or pivots the data structure converting columns to rows
• Unstacking—Converts rows into columns
Using the stack() function on the dataframe, we will get the pivoting of the columns in rows, thus producing a series. From this hierarchically indexed series, we can reassemble the dataframe into a
pivoted table by use of the unstack() function. We can also do the unstack on a different level, specifying the number of levels or its name as the argument of the function. The following program shows the stacking and unstacking:
import pandas as pd
import numpy as np
frame1 = pd.DataFrame(np.arange(9).reshape(3,3),index=['white','black','red'],columns=['ball','pen','pencil'])
print('frame 1\n')
print(frame1)
ser5 = frame1.stack()
print('\nHierarchically indexed Series\n')
print(ser5)
print('\nReassemble the dataframe into a pivoted table\n')
print(ser5.unstack())
print('\nUnstack on a different level\n')
print(ser5.unstack(0))
The output of the program is shown below:
frame 1
ball pen pencil
white 0 1 2
black 3 4 5
red 6 7 8
Hierarchically indexed Series
white ball 0
pen 1
pencil 2
black ball 3
pen 4
pencil 5
red ball 6
pen 7
pencil 8
dtype: int32
Reassemble the dataframe into a pivoted table
ball pen pencil
white 0 1 2
black 3 4 5
red 6 7 8
Unstack on a different level
white black red
ball 0 3 6
pen 1 4 7
pencil 2 5 8
------------------
(program exited with code: 0)
Press any key to continue . . .
The most common way to store datasets is in CSV format or a table of a database. This happens especially when you have instrumental readings, calculation results iterated over time, or the simple manual input of a series of values. A similar case of these files is for example the logs file, which is filled line by line by accumulating data in it.
The peculiar characteristic of this type of dataset is to have entries on various columns, often duplicated in subsequent lines. Always remaining in tabular format of data, when you are in such cases you can refer them to as long or stacked format. To get a clearer idea about that, consider the following program:
import pandas as pd
import numpy as np
frame2 = pd.DataFrame({ 'color':['white','white','white','red','red','red','black','black','black'],
'item':['ball','pen','mug','ball','pen','mug','ball','pen','mug'],
'value': np.random.rand(9)})
print('Long frame \n')
print(frame2)
The output of the program is shown below:
Long frame
color item value
0 white ball 0.460043
1 white pen 0.929175
2 white mug 0.069400
3 red ball 0.148175
4 red pen 0.798596
5 red mug 0.771690
6 black ball 0.820520
7 black pen 0.735380
8 black mug 0.557580
------------------
(program exited with code: 0)
Press any key to continue . . .
The output shows that this mode of data recording has some disadvantages. One, for example, is the
multiplicity and repetition of some fields. Considering the columns as keys, the data in this format will be difficult to read, especially in fully understanding the relationships between the key values and the rest of the columns.
Instead of the long format, there is another way to arrange the data in a table that is called wide. This mode is easier to read, allowing easy connection with other tables, and it occupies much less space. So in general it is a more efficient way of storing the data, although less practical, especially if during the filling of the data.
As a criterion, select a column, or a set of them, as the primary key; then, the values contained in it must be unique.
In this regard, pandas gives you a function that allows you to make a transformation of a dataframe from the long type to the wide type. This function is pivot() and it accepts as arguments the column, or columns, which will assume the role of key.
Starting from the previous example, we choose to create a dataframe in wide format by choosing the color column as the key, and item as a second key, the values of which will form the new columns of the dataframe. See the following program:
import pandas as pd
import numpy as np
frame2 = pd.DataFrame({ 'color':['white','white','white','red','red','red','black','black','black'],
'item':['ball','pen','mug','ball','pen','mug','ball','pen','mug'],
'value': np.random.rand(9)})
print('Long frame \n')
print(frame2)
wideframe = frame2.pivot('color','item')
print('\nWide frame \n')
print(wideframe)
The output of the program is shown below:
Long frame
color item value
0 white ball 0.475684
1 white pen 0.147188
2 white mug 0.426831
3 red ball 0.400304
4 red pen 0.203074
5 red mug 0.739819
6 black ball 0.824657
7 black pen 0.889884
8 black mug 0.244726
Wide frame
value
item ball mug pen
color
black 0.824657 0.244726 0.889884
red 0.400304 0.739819 0.203074
white 0.475684 0.426831 0.147188
------------------
(program exited with code: 0)
Press any key to continue . . .
In the output we can now see, in this format, the dataframe is much more compact and the data contained in it are much more readable.
The last stage of data preparation is the removal of columns and rows. Let's see this through our next program:
import pandas as pd
import numpy as np
frame3 = pd.DataFrame(np.arange(9).reshape(3,3),
index=['white','black','red'],
columns=['ball','pen','pencil'])
print('frame \n')
print(frame3)
del frame3['ball']
print('\nModified frame by removing column \n')
print(frame3)
print('\nModified frame by removing row \n')
print(frame3.drop('white'))
In order to remove a column, we simply use the del command applied to the dataframe with the column name specified. To remove an unwanted row, we have to use the drop() function with the label of the corresponding index as an argument. The output of the program is shown below:
frame
ball pen pencil
white 0 1 2
black 3 4 5
red 6 7 8
Modified frame by removing column
pen pencil
white 1 2
black 4 5
red 7 8
Modified frame by removing row
pen pencil
black 4 5
red 7 8
------------------
(program exited with code: 0)
Press any key to continue . . .
Here I am ending today’s post. Until we meet again keep practicing and learning Python, as Python is easy to learn!
• Stacking—Rotates or pivots the data structure converting columns to rows
• Unstacking—Converts rows into columns
Using the stack() function on the dataframe, we will get the pivoting of the columns in rows, thus producing a series. From this hierarchically indexed series, we can reassemble the dataframe into a
pivoted table by use of the unstack() function. We can also do the unstack on a different level, specifying the number of levels or its name as the argument of the function. The following program shows the stacking and unstacking:
import pandas as pd
import numpy as np
frame1 = pd.DataFrame(np.arange(9).reshape(3,3),index=['white','black','red'],columns=['ball','pen','pencil'])
print('frame 1\n')
print(frame1)
ser5 = frame1.stack()
print('\nHierarchically indexed Series\n')
print(ser5)
print('\nReassemble the dataframe into a pivoted table\n')
print(ser5.unstack())
print('\nUnstack on a different level\n')
print(ser5.unstack(0))
The output of the program is shown below:
frame 1
ball pen pencil
white 0 1 2
black 3 4 5
red 6 7 8
Hierarchically indexed Series
white ball 0
pen 1
pencil 2
black ball 3
pen 4
pencil 5
red ball 6
pen 7
pencil 8
dtype: int32
Reassemble the dataframe into a pivoted table
ball pen pencil
white 0 1 2
black 3 4 5
red 6 7 8
Unstack on a different level
white black red
ball 0 3 6
pen 1 4 7
pencil 2 5 8
------------------
(program exited with code: 0)
Press any key to continue . . .
The most common way to store datasets is in CSV format or a table of a database. This happens especially when you have instrumental readings, calculation results iterated over time, or the simple manual input of a series of values. A similar case of these files is for example the logs file, which is filled line by line by accumulating data in it.
The peculiar characteristic of this type of dataset is to have entries on various columns, often duplicated in subsequent lines. Always remaining in tabular format of data, when you are in such cases you can refer them to as long or stacked format. To get a clearer idea about that, consider the following program:
import pandas as pd
import numpy as np
frame2 = pd.DataFrame({ 'color':['white','white','white','red','red','red','black','black','black'],
'item':['ball','pen','mug','ball','pen','mug','ball','pen','mug'],
'value': np.random.rand(9)})
print('Long frame \n')
print(frame2)
The output of the program is shown below:
Long frame
color item value
0 white ball 0.460043
1 white pen 0.929175
2 white mug 0.069400
3 red ball 0.148175
4 red pen 0.798596
5 red mug 0.771690
6 black ball 0.820520
7 black pen 0.735380
8 black mug 0.557580
------------------
(program exited with code: 0)
Press any key to continue . . .
The output shows that this mode of data recording has some disadvantages. One, for example, is the
multiplicity and repetition of some fields. Considering the columns as keys, the data in this format will be difficult to read, especially in fully understanding the relationships between the key values and the rest of the columns.
Instead of the long format, there is another way to arrange the data in a table that is called wide. This mode is easier to read, allowing easy connection with other tables, and it occupies much less space. So in general it is a more efficient way of storing the data, although less practical, especially if during the filling of the data.
As a criterion, select a column, or a set of them, as the primary key; then, the values contained in it must be unique.
In this regard, pandas gives you a function that allows you to make a transformation of a dataframe from the long type to the wide type. This function is pivot() and it accepts as arguments the column, or columns, which will assume the role of key.
Starting from the previous example, we choose to create a dataframe in wide format by choosing the color column as the key, and item as a second key, the values of which will form the new columns of the dataframe. See the following program:
import pandas as pd
import numpy as np
frame2 = pd.DataFrame({ 'color':['white','white','white','red','red','red','black','black','black'],
'item':['ball','pen','mug','ball','pen','mug','ball','pen','mug'],
'value': np.random.rand(9)})
print('Long frame \n')
print(frame2)
wideframe = frame2.pivot('color','item')
print('\nWide frame \n')
print(wideframe)
The output of the program is shown below:
Long frame
color item value
0 white ball 0.475684
1 white pen 0.147188
2 white mug 0.426831
3 red ball 0.400304
4 red pen 0.203074
5 red mug 0.739819
6 black ball 0.824657
7 black pen 0.889884
8 black mug 0.244726
Wide frame
value
item ball mug pen
color
black 0.824657 0.244726 0.889884
red 0.400304 0.739819 0.203074
white 0.475684 0.426831 0.147188
------------------
(program exited with code: 0)
Press any key to continue . . .
In the output we can now see, in this format, the dataframe is much more compact and the data contained in it are much more readable.
The last stage of data preparation is the removal of columns and rows. Let's see this through our next program:
import pandas as pd
import numpy as np
frame3 = pd.DataFrame(np.arange(9).reshape(3,3),
index=['white','black','red'],
columns=['ball','pen','pencil'])
print('frame \n')
print(frame3)
del frame3['ball']
print('\nModified frame by removing column \n')
print(frame3)
print('\nModified frame by removing row \n')
print(frame3.drop('white'))
In order to remove a column, we simply use the del command applied to the dataframe with the column name specified. To remove an unwanted row, we have to use the drop() function with the label of the corresponding index as an argument. The output of the program is shown below:
frame
ball pen pencil
white 0 1 2
black 3 4 5
red 6 7 8
Modified frame by removing column
pen pencil
white 1 2
black 4 5
red 7 8
Modified frame by removing row
pen pencil
black 4 5
red 7 8
------------------
(program exited with code: 0)
Press any key to continue . . .
Here I am ending today’s post. Until we meet again keep practicing and learning Python, as Python is easy to learn!
0 comments:
Post a Comment