Advanced Data Aggregation uses functions like transform() and apply() which allow us to perform many kinds of group operations, some very complex. See the following program which aims to bring together in the same dataframe the following: the dataframe of origin (the one containing the data) and that obtained by the calculation of group aggregation, for example, the sum:
import pandas as pd
import numpy as np
mydataframe = pd.DataFrame({ 'color': ['white','red','green','red','green'],
'price1' : [5.56,4.20,1.30,0.56,2.75],
'price2' : [4.75,4.12,1.60,0.75,3.15]})
print('\nThe original dataframe\n')
print(mydataframe,'\n')
sums = mydataframe.groupby('color').sum().add_prefix('tot_')
print(sums,'\n')
print(pd.merge(mydataframe,sums,left_on='color',right_index=True),'\n')
print(mydataframe.groupby('color').transform(np.sum).add_prefix('tot_'),'\n')
The output of the program is shown below:
The original dataframe
color price1 price2
0 white 5.56 4.75
1 red 4.20 4.12
2 green 1.30 1.60
3 red 0.56 0.75
4 green 2.75 3.15
tot_price1 tot_price2
color
green 4.05 4.75
red 4.76 4.87
white 5.56 4.75
color price1 price2 tot_price1 tot_price2
0 white 5.56 4.75 5.56 4.75
1 red 4.20 4.12 4.76 4.87
3 red 0.56 0.75 4.76 4.87
2 green 1.30 1.60 4.05 4.75
4 green 2.75 3.15 4.05 4.75
tot_price1 tot_price2
0 5.56 4.75
1 4.76 4.87
2 4.05 4.75
3 4.76 4.87
4 4.05 4.75
------------------
(program exited with code: 0)
Press any key to continue . . .
As seen in the output, using the merge() can add the results of the aggregation in each line of the
dataframe to start. This can also be achieved by using transform(). This function performs aggregation as we have seen before, but at the same time shows the values calculated based on the key value on each line of the dataframe to start.
The transform() method is a more specialized function that has very specific requirements: the function passed as an argument must produce a single scalar value (aggregation) to be broadcasted.
The method to cover more general GroupBy is applicable to apply(). This method applies in its entirety the split-apply-combine scheme. In fact, this function divides the object into parts in order to be manipulated, invokes the passage of functions on each piece, and then tries to chain together the various parts. See the following program:
import pandas as pd
import numpy as np
mydataframe = pd.DataFrame({ 'color': ['white','black','white','white','black','black'],
'status':['up','up','down','down','down','up'],
'value1':[12.33,14.55,22.34,27.84,23.40,18.33],
'value2':[11.23,31.80,29.99,31.18,18.25,22.44]})
print('\nThe original dataframe\n')
print(mydataframe,'\n')
print(mydataframe.groupby(['color','status']).apply( lambda x: x.max()),'\n')
reindex = {0: 'first',1: 'second',2: 'third',3: 'fourth',4: 'fifth'}
recolumn = {'item':'object','price': 'value'}
print(mydataframe.rename(index=reindex, columns=recolumn),'\n')
temp = pd.date_range('1/1/2015', periods=10, freq= 'H')
print(temp,'\n')
print('time series')
timeseries = pd.Series(np.random.rand(10), index=temp)
print(timeseries,'\n')
print('time table')
timetable = pd.DataFrame( {'date': temp, 'value1' : np.random.rand(10),'value2' : np.random.rand(10)})
print(timetable,'\n')
timetable['cat'] = ['up','down','left','left','up','up','down','right','right','up']
print(timetable,'\n')
The output of the program is shown below:
The original dataframe
color status value1 value2
0 white up 12.33 11.23
1 black up 14.55 31.80
2 white down 22.34 29.99
3 white down 27.84 31.18
4 black down 23.40 18.25
5 black up 18.33 22.44
color status value1 value2
color status
black down black down 23.40 18.25
up black up 18.33 31.80
white down white down 27.84 31.18
up white up 12.33 11.23
color status value1 value2
first white up 12.33 11.23
second black up 14.55 31.80
third white down 22.34 29.99
fourth white down 27.84 31.18
fifth black down 23.40 18.25
5 black up 18.33 22.44
DatetimeIndex(['2015-01-01 00:00:00', '2015-01-01 01:00:00',
'2015-01-01 02:00:00', '2015-01-01 03:00:00',
'2015-01-01 04:00:00', '2015-01-01 05:00:00',
'2015-01-01 06:00:00', '2015-01-01 07:00:00',
'2015-01-01 08:00:00', '2015-01-01 09:00:00'],
dtype='datetime64[ns]', freq='H')
time series
2015-01-01 00:00:00 0.044260
2015-01-01 01:00:00 0.652472
2015-01-01 02:00:00 0.746970
2015-01-01 03:00:00 0.749644
2015-01-01 04:00:00 0.814090
2015-01-01 05:00:00 0.441604
2015-01-01 06:00:00 0.175240
2015-01-01 07:00:00 0.389164
2015-01-01 08:00:00 0.697356
2015-01-01 09:00:00 0.571612
Freq: H, dtype: float64
time table
date value1 value2
0 2015-01-01 00:00:00 0.353713 0.914037
1 2015-01-01 01:00:00 0.333735 0.088877
2 2015-01-01 02:00:00 0.792908 0.552222
3 2015-01-01 03:00:00 0.150644 0.479201
4 2015-01-01 04:00:00 0.469318 0.786390
5 2015-01-01 05:00:00 0.854175 0.743490
6 2015-01-01 06:00:00 0.932783 0.667295
7 2015-01-01 07:00:00 0.304123 0.579275
8 2015-01-01 08:00:00 0.238325 0.117312
9 2015-01-01 09:00:00 0.017224 0.225015
date value1 value2 cat
0 2015-01-01 00:00:00 0.353713 0.914037 up
1 2015-01-01 01:00:00 0.333735 0.088877 down
2 2015-01-01 02:00:00 0.792908 0.552222 left
3 2015-01-01 03:00:00 0.150644 0.479201 left
4 2015-01-01 04:00:00 0.469318 0.786390 up
5 2015-01-01 05:00:00 0.854175 0.743490 up
6 2015-01-01 06:00:00 0.932783 0.667295 down
7 2015-01-01 07:00:00 0.304123 0.579275 right
8 2015-01-01 08:00:00 0.238325 0.117312 right
9 2015-01-01 09:00:00 0.017224 0.225015 up
------------------
(program exited with code: 0)
Press any key to continue . . .
In the above program the through the line
timetable['cat'] = ['up','down','left','left','up','up','down','right','right','up']
we add to the dataframe preceding a column that represents a set of text values that you will use as key values.
Here I am ending today’s post. We learned about the three basic parts that divide the data manipulation: preparation, processing, and data aggregation and about a set of library functions that allow pandas to perform these operations.
Until we meet again keep practicing and learning Python, as Python is easy to learn!
import pandas as pd
import numpy as np
mydataframe = pd.DataFrame({ 'color': ['white','red','green','red','green'],
'price1' : [5.56,4.20,1.30,0.56,2.75],
'price2' : [4.75,4.12,1.60,0.75,3.15]})
print('\nThe original dataframe\n')
print(mydataframe,'\n')
sums = mydataframe.groupby('color').sum().add_prefix('tot_')
print(sums,'\n')
print(pd.merge(mydataframe,sums,left_on='color',right_index=True),'\n')
print(mydataframe.groupby('color').transform(np.sum).add_prefix('tot_'),'\n')
The output of the program is shown below:
The original dataframe
color price1 price2
0 white 5.56 4.75
1 red 4.20 4.12
2 green 1.30 1.60
3 red 0.56 0.75
4 green 2.75 3.15
tot_price1 tot_price2
color
green 4.05 4.75
red 4.76 4.87
white 5.56 4.75
color price1 price2 tot_price1 tot_price2
0 white 5.56 4.75 5.56 4.75
1 red 4.20 4.12 4.76 4.87
3 red 0.56 0.75 4.76 4.87
2 green 1.30 1.60 4.05 4.75
4 green 2.75 3.15 4.05 4.75
tot_price1 tot_price2
0 5.56 4.75
1 4.76 4.87
2 4.05 4.75
3 4.76 4.87
4 4.05 4.75
------------------
(program exited with code: 0)
Press any key to continue . . .
As seen in the output, using the merge() can add the results of the aggregation in each line of the
dataframe to start. This can also be achieved by using transform(). This function performs aggregation as we have seen before, but at the same time shows the values calculated based on the key value on each line of the dataframe to start.
The transform() method is a more specialized function that has very specific requirements: the function passed as an argument must produce a single scalar value (aggregation) to be broadcasted.
The method to cover more general GroupBy is applicable to apply(). This method applies in its entirety the split-apply-combine scheme. In fact, this function divides the object into parts in order to be manipulated, invokes the passage of functions on each piece, and then tries to chain together the various parts. See the following program:
import pandas as pd
import numpy as np
mydataframe = pd.DataFrame({ 'color': ['white','black','white','white','black','black'],
'status':['up','up','down','down','down','up'],
'value1':[12.33,14.55,22.34,27.84,23.40,18.33],
'value2':[11.23,31.80,29.99,31.18,18.25,22.44]})
print('\nThe original dataframe\n')
print(mydataframe,'\n')
print(mydataframe.groupby(['color','status']).apply( lambda x: x.max()),'\n')
reindex = {0: 'first',1: 'second',2: 'third',3: 'fourth',4: 'fifth'}
recolumn = {'item':'object','price': 'value'}
print(mydataframe.rename(index=reindex, columns=recolumn),'\n')
temp = pd.date_range('1/1/2015', periods=10, freq= 'H')
print(temp,'\n')
print('time series')
timeseries = pd.Series(np.random.rand(10), index=temp)
print(timeseries,'\n')
print('time table')
timetable = pd.DataFrame( {'date': temp, 'value1' : np.random.rand(10),'value2' : np.random.rand(10)})
print(timetable,'\n')
timetable['cat'] = ['up','down','left','left','up','up','down','right','right','up']
print(timetable,'\n')
The output of the program is shown below:
The original dataframe
color status value1 value2
0 white up 12.33 11.23
1 black up 14.55 31.80
2 white down 22.34 29.99
3 white down 27.84 31.18
4 black down 23.40 18.25
5 black up 18.33 22.44
color status value1 value2
color status
black down black down 23.40 18.25
up black up 18.33 31.80
white down white down 27.84 31.18
up white up 12.33 11.23
color status value1 value2
first white up 12.33 11.23
second black up 14.55 31.80
third white down 22.34 29.99
fourth white down 27.84 31.18
fifth black down 23.40 18.25
5 black up 18.33 22.44
DatetimeIndex(['2015-01-01 00:00:00', '2015-01-01 01:00:00',
'2015-01-01 02:00:00', '2015-01-01 03:00:00',
'2015-01-01 04:00:00', '2015-01-01 05:00:00',
'2015-01-01 06:00:00', '2015-01-01 07:00:00',
'2015-01-01 08:00:00', '2015-01-01 09:00:00'],
dtype='datetime64[ns]', freq='H')
time series
2015-01-01 00:00:00 0.044260
2015-01-01 01:00:00 0.652472
2015-01-01 02:00:00 0.746970
2015-01-01 03:00:00 0.749644
2015-01-01 04:00:00 0.814090
2015-01-01 05:00:00 0.441604
2015-01-01 06:00:00 0.175240
2015-01-01 07:00:00 0.389164
2015-01-01 08:00:00 0.697356
2015-01-01 09:00:00 0.571612
Freq: H, dtype: float64
time table
date value1 value2
0 2015-01-01 00:00:00 0.353713 0.914037
1 2015-01-01 01:00:00 0.333735 0.088877
2 2015-01-01 02:00:00 0.792908 0.552222
3 2015-01-01 03:00:00 0.150644 0.479201
4 2015-01-01 04:00:00 0.469318 0.786390
5 2015-01-01 05:00:00 0.854175 0.743490
6 2015-01-01 06:00:00 0.932783 0.667295
7 2015-01-01 07:00:00 0.304123 0.579275
8 2015-01-01 08:00:00 0.238325 0.117312
9 2015-01-01 09:00:00 0.017224 0.225015
date value1 value2 cat
0 2015-01-01 00:00:00 0.353713 0.914037 up
1 2015-01-01 01:00:00 0.333735 0.088877 down
2 2015-01-01 02:00:00 0.792908 0.552222 left
3 2015-01-01 03:00:00 0.150644 0.479201 left
4 2015-01-01 04:00:00 0.469318 0.786390 up
5 2015-01-01 05:00:00 0.854175 0.743490 up
6 2015-01-01 06:00:00 0.932783 0.667295 down
7 2015-01-01 07:00:00 0.304123 0.579275 right
8 2015-01-01 08:00:00 0.238325 0.117312 right
9 2015-01-01 09:00:00 0.017224 0.225015 up
------------------
(program exited with code: 0)
Press any key to continue . . .
In the above program the through the line
timetable['cat'] = ['up','down','left','left','up','up','down','right','right','up']
we add to the dataframe preceding a column that represents a set of text values that you will use as key values.
Here I am ending today’s post. We learned about the three basic parts that divide the data manipulation: preparation, processing, and data aggregation and about a set of library functions that allow pandas to perform these operations.
Until we meet again keep practicing and learning Python, as Python is easy to learn!
0 comments:
Post a Comment