The GroupBy object discussed in the previous post supports iteration: looping over it yields a sequence of two-tuples, each containing the name of a group together with the portion of the data belonging to that group. See the following program for an example:
import pandas as pd
import numpy as np
mydataframe = pd.DataFrame({ 'color': ['white','red','green','red','green'],'object': ['pen','pencil','pencil','ashtray','pen'],
'price1' : [5.56,4.20,1.30,0.56,2.75],
'price2' : [4.75,4.12,1.60,0.75,3.15]})
print('\nThe original dataframe\n')
print(mydataframe)
for name, group in mydataframe.groupby('color'):
    print('\nName and group\n')
    print(name)
    print(group)
The output of the program is shown below:
The original dataframe
color object price1 price2
0 white pen 5.56 4.75
1 red pencil 4.20 4.12
2 green pencil 1.30 1.60
3 red ashtray 0.56 0.75
4 green pen 2.75 3.15
Name and group
green
color object price1 price2
2 green pencil 1.30 1.60
4 green pen 2.75 3.15
Name and group
red
color object price1 price2
1 red pencil 4.20 4.12
3 red ashtray 0.56 0.75
Name and group
white
color object price1 price2
0 white pen 5.56 4.75
------------------
(program exited with code: 0)
Press any key to continue . . .
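As a small variation on the loop, here is a minimal sketch that performs a calculation on each group instead of printing it. The names df and highest_price1 are only illustrative and do not come from the program above.

```python
import pandas as pd

df = pd.DataFrame({
    'color': ['white', 'red', 'green', 'red', 'green'],
    'object': ['pen', 'pencil', 'pencil', 'ashtray', 'pen'],
    'price1': [5.56, 4.20, 1.30, 0.56, 2.75],
    'price2': [4.75, 4.12, 1.60, 0.75, 3.15],
})

# Each iteration yields (group name, sub-DataFrame with that group's rows).
highest_price1 = {}
for name, group in df.groupby('color'):
    # float() converts the NumPy scalar to a plain Python float.
    highest_price1[name] = float(group['price1'].max())

print(highest_price1)
```

Because each group is an ordinary dataframe, any dataframe or series method can be used inside the loop body.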
In the program that printed each group, print was used only for illustration. In practice, we replace the print call with whatever function should be applied to the group. We have seen that when a grouping is subjected to a calculation or another operation, the result, regardless of how the grouping was obtained and which selection criteria were used, is either a series (if we selected a single column) or a dataframe, which keeps its index and its column names. See the following program:
import pandas as pd
import numpy as np
mydataframe = pd.DataFrame({ 'color': ['white','red','green','red','green'],'object': ['pen','pencil','pencil','ashtray','pen'],
'price1' : [5.56,4.20,1.30,0.56,2.75],
'price2' : [4.75,4.12,1.60,0.75,3.15]})
print('\nThe original dataframe\n')
print(mydataframe)
result1 = mydataframe['price1'].groupby(mydataframe['color']).mean()
print(type(result1))
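# numeric_only=True skips the string column 'object'; pandas 2.0+ raises a TypeError without it
result2 = mydataframe.groupby(mydataframe['color']).mean(numeric_only=True)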
print(type(result2))
The output of the program is shown below:
The original dataframe
color object price1 price2
0 white pen 5.56 4.75
1 red pencil 4.20 4.12
2 green pencil 1.30 1.60
3 red ashtray 0.56 0.75
4 green pen 2.75 3.15
<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>
------------------
(program exited with code: 0)
Press any key to continue . . .
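The same distinction can be seen when the column is selected on the GroupBy object itself. The sketch below is an illustration under the assumption that a single label in brackets gives a series and a list of labels gives a dataframe; the variable names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'color': ['white', 'red', 'green', 'red', 'green'],
    'object': ['pen', 'pencil', 'pencil', 'ashtray', 'pen'],
    'price1': [5.56, 4.20, 1.30, 0.56, 2.75],
    'price2': [4.75, 4.12, 1.60, 0.75, 3.15],
})

grouped = df.groupby('color')

# A single column label produces a Series indexed by group name.
as_series = grouped['price1'].mean()

# A list of labels (even a list of one) produces a DataFrame.
as_frame = grouped[['price1']].mean()

print(type(as_series))
print(type(as_frame))
```

In both cases the index of the result is named color and holds the group names.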
Thus it is possible to select a single column at any stage of this process. The following program shows three cases, each selecting the column at a different stage, and illustrates the great flexibility of the grouping system provided by pandas:
import pandas as pd
import numpy as np
mydataframe = pd.DataFrame({ 'color': ['white','red','green','red','green'],'object': ['pen','pencil','pencil','ashtray','pen'],
'price1' : [5.56,4.20,1.30,0.56,2.75],
'price2' : [4.75,4.12,1.60,0.75,3.15]})
print('\nThe original dataframe\n')
print(mydataframe,'\n')
print(mydataframe['price1'].groupby(mydataframe['color']).mean(),'\n')
print(mydataframe.groupby(mydataframe['color'])['price1'].mean(),'\n')
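# numeric_only=True skips the string column 'object'; pandas 2.0+ raises a TypeError without it
print((mydataframe.groupby(mydataframe['color']).mean(numeric_only=True))['price1'],'\n')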
The output of the program is shown below:
The original dataframe
color object price1 price2
0 white pen 5.56 4.75
1 red pencil 4.20 4.12
2 green pencil 1.30 1.60
3 red ashtray 0.56 0.75
4 green pen 2.75 3.15
color
green 2.025
red 2.380
white 5.560
Name: price1, dtype: float64
color
green 2.025
red 2.380
white 5.560
Name: price1, dtype: float64
color
green 2.025
red 2.380
white 5.560
Name: price1, dtype: float64
------------------
(program exited with code: 0)
Press any key to continue . . .
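To confirm that the three selection orders agree, one option is to compare the results programmatically rather than by eye. This is a sketch only; the variable names are hypothetical, and pd.testing.assert_series_equal is used because it compares floats with a tolerance.

```python
import pandas as pd

df = pd.DataFrame({
    'color': ['white', 'red', 'green', 'red', 'green'],
    'object': ['pen', 'pencil', 'pencil', 'ashtray', 'pen'],
    'price1': [5.56, 4.20, 1.30, 0.56, 2.75],
    'price2': [4.75, 4.12, 1.60, 0.75, 3.15],
})

# 1. Select the column first, then group it by the color column.
select_first = df['price1'].groupby(df['color']).mean()

# 2. Group first, select the column on the GroupBy object, then aggregate.
select_middle = df.groupby('color')['price1'].mean()

# 3. Aggregate every numeric column, then select from the result.
select_last = df.groupby('color').mean(numeric_only=True)['price1']

# Each call raises an AssertionError if the two series differ.
pd.testing.assert_series_equal(select_first, select_middle)
pd.testing.assert_series_equal(select_middle, select_last)
print('All three selections give the same series')
```

The third form computes the mean of every numeric column before discarding all but one, so on wide dataframes the first two forms do less work.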
Sometimes, after an aggregation, the names of the resulting columns are not very meaningful. It is often useful to add a prefix to each column name that describes the type of aggregation performed. Adding a prefix, instead of replacing the name completely, keeps track of the source column from which each aggregate value derives. This matters when we apply a chain of transformations (each series or dataframe is generated from the previous one) and need to keep a reference to the source data. See the following program:
import pandas as pd
import numpy as np
mydataframe = pd.DataFrame({ 'color': ['white','red','green','red','green'],'object': ['pen','pencil','pencil','ashtray','pen'],
'price1' : [5.56,4.20,1.30,0.56,2.75],
'price2' : [4.75,4.12,1.60,0.75,3.15]})
print('\nThe original dataframe\n')
print(mydataframe,'\n')
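# numeric_only=True skips the string column 'object'; pandas 2.0+ raises a TypeError without it
print(mydataframe.groupby('color').mean(numeric_only=True).add_prefix('mean_'),'\n')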
The output of the program is shown below:
The original dataframe
color object price1 price2
0 white pen 5.56 4.75
1 red pencil 4.20 4.12
2 green pencil 1.30 1.60
3 red ashtray 0.56 0.75
4 green pen 2.75 3.15
mean_price1 mean_price2
color
green 2.025 2.375
red 2.380 2.435
white 5.560 4.750
------------------
(program exited with code: 0)
Press any key to continue . . .
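Prefixes become more useful once several aggregations are combined, because the column names would otherwise collide. The following sketch assumes we want the mean and the maximum side by side; the names numeric, means, maxes and summary are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'color': ['white', 'red', 'green', 'red', 'green'],
    'object': ['pen', 'pencil', 'pencil', 'ashtray', 'pen'],
    'price1': [5.56, 4.20, 1.30, 0.56, 2.75],
    'price2': [4.75, 4.12, 1.60, 0.75, 3.15],
})

# Work only with the two price columns.
numeric = df.groupby('color')[['price1', 'price2']]

# Label each aggregate with the operation that produced it.
means = numeric.mean().add_prefix('mean_')
maxes = numeric.max().add_prefix('max_')

# Both results share the same index (the group names), so join aligns them.
summary = means.join(maxes)
print(summary)
```

Without the prefixes, both results would contain columns called price1 and price2, and the join would fail on the overlapping names.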
Many methods were not implemented specifically for use with GroupBy but still work correctly on the series obtained from it. We have already seen how to get a series from a GroupBy object by specifying the name of the column and then applying a method to perform the calculation. For example, we can calculate quantiles with the quantile() function, as shown in the following program:
import pandas as pd
import numpy as np
mydataframe = pd.DataFrame({ 'color': ['white','red','green','red','green'],'object': ['pen','pencil','pencil','ashtray','pen'],
'price1' : [5.56,4.20,1.30,0.56,2.75],
'price2' : [4.75,4.12,1.60,0.75,3.15]})
print('\nThe original dataframe\n')
print(mydataframe,'\n')
group = mydataframe.groupby('color')
print(group['price1'].quantile(0.6),'\n')
The output of the program is shown below:
The original dataframe
color object price1 price2
0 white pen 5.56 4.75
1 red pencil 4.20 4.12
2 green pencil 1.30 1.60
3 red ashtray 0.56 0.75
4 green pen 2.75 3.15
color
green 2.170
red 2.744
white 5.560
Name: price1, dtype: float64
------------------
(program exited with code: 0)
Press any key to continue . . .
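The values 2.170 and 2.744 may look surprising for groups of two rows. They follow from linear interpolation, which is the default for quantile(): with sorted values 1.30 and 2.75, the 0.6 quantile is 1.30 + 0.6 * (2.75 - 1.30) = 2.17. The helper linear_quantile below is a hypothetical hand-written version, given only to make the arithmetic explicit.

```python
import math
import pandas as pd

def linear_quantile(values, q):
    """Quantile by linear interpolation between the two nearest sorted values."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)          # fractional index into the sorted list
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)   # stay in bounds for the last element
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

df = pd.DataFrame({
    'color': ['white', 'red', 'green', 'red', 'green'],
    'price1': [5.56, 4.20, 1.30, 0.56, 2.75],
})

by_pandas = df.groupby('color')['price1'].quantile(0.6)

# Compare the hand calculation with pandas for every group.
for color, prices in df.groupby('color')['price1']:
    by_hand = linear_quantile(prices.tolist(), 0.6)
    print(color, round(by_hand, 3), math.isclose(by_hand, by_pandas[color]))
```

For the white group there is a single value, so every quantile is that value.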
We can also define our own aggregation functions. Define the function separately and then pass it as an argument to the agg() function. For example, see the following program:
import pandas as pd
import numpy as np
mydataframe = pd.DataFrame({ 'color': ['white','red','green','red','green'],'object': ['pen','pencil','pencil','ashtray','pen'],
'price1' : [5.56,4.20,1.30,0.56,2.75],
'price2' : [4.75,4.12,1.60,0.75,3.15]})
print('\nThe original dataframe\n')
print(mydataframe,'\n')
group = mydataframe.groupby('color')
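# Note: this name shadows Python's built-in range(); it is kept so the column label matches the output
def range(series):
    return series.max() - series.min()
print(group['price1'].agg(range),'\n')
# Restrict to the price columns: the string column 'object' cannot be subtracted
print(group[['price1','price2']].agg(range),'\n')
print(group['price1'].agg(['mean','std',range]),'\n')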
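In the above program we first calculate the range of the price1 values in each group using agg(). Next we apply agg() to every price column of the grouped dataframe at once. Finally we pass agg() a list of operations, each of which becomes a column of the result. In that last result the std of the white group is NaN because the group contains a single row, and the sample standard deviation is undefined for one observation.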
The output of the program is shown below:
The original dataframe
color object price1 price2
0 white pen 5.56 4.75
1 red pencil 4.20 4.12
2 green pencil 1.30 1.60
3 red ashtray 0.56 0.75
4 green pen 2.75 3.15
color
green 1.45
red 3.64
white 0.00
Name: price1, dtype: float64
price1 price2
color
green 1.45 1.55
red 3.64 3.37
white 0.00 0.00
mean std range
color
green 2.025 1.025305 1.45
red 2.380 2.573869 3.64
white 5.560 NaN 0.00
------------------
(program exited with code: 0)
Press any key to continue . . .
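When the custom function is applied to several columns, it can help to choose the output column names directly. The sketch below uses the keyword form of agg(), where each keyword is an output name and each value is a (column, function) pair. The function name price_range and the output names are hypothetical, and price_range is used instead of range to avoid shadowing the built-in.

```python
import pandas as pd

def price_range(series):
    """Difference between the largest and smallest value in a group."""
    return series.max() - series.min()

df = pd.DataFrame({
    'color': ['white', 'red', 'green', 'red', 'green'],
    'object': ['pen', 'pencil', 'pencil', 'ashtray', 'pen'],
    'price1': [5.56, 4.20, 1.30, 0.56, 2.75],
    'price2': [4.75, 4.12, 1.60, 0.75, 3.15],
})

# keyword = (source column, aggregation); built-in aggregations are given by name.
summary = df.groupby('color').agg(
    price1_mean=('price1', 'mean'),
    price1_range=('price1', price_range),
    price2_range=('price2', price_range),
)
print(summary)
```

Only the columns named in the pairs are touched, so the string column object does not need to be filtered out first.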
Here I am ending today’s post. Until we meet again keep practicing and learning Python, as Python is easy to learn!