Monday, May 6, 2019

Pandas - 30 (Advanced Data Aggregation)

Advanced Data Aggregation uses functions like transform() and apply() which allow us to perform many kinds of group operations, some very complex.  See the following program which aims to bring together in the same dataframe the following: the dataframe of origin (the one containing the data) and that obtained by the calculation of group aggregation, for example, the sum:

import pandas as pd
import numpy as np

mydataframe = pd.DataFrame({ 'color': ['white','red','green','red','green'],
                                'price1' : [5.56,4.20,1.30,0.56,2.75],
                                'price2' : [4.75,4.12,1.60,0.75,3.15]})
                               
print('\nThe original dataframe\n')
print(mydataframe,'\n')

sums = mydataframe.groupby('color').sum().add_prefix('tot_')

print(sums,'\n')
print(pd.merge(mydataframe,sums,left_on='color',right_index=True),'\n')

print(mydataframe.groupby('color').transform(np.sum).add_prefix('tot_'),'\n')


The output of the program is shown below:

The original dataframe

   color  price1  price2
0  white    5.56    4.75
1    red    4.20    4.12
2  green    1.30    1.60
3    red    0.56    0.75
4  green    2.75    3.15

       tot_price1  tot_price2
color
green        4.05        4.75
red          4.76        4.87
white        5.56        4.75

   color  price1  price2  tot_price1  tot_price2
0  white    5.56    4.75        5.56        4.75
1    red    4.20    4.12        4.76        4.87
3    red    0.56    0.75        4.76        4.87
2  green    1.30    1.60        4.05        4.75
4  green    2.75    3.15        4.05        4.75

   tot_price1  tot_price2
0        5.56        4.75
1        4.76        4.87
2        4.05        4.75
3        4.76        4.87
4        4.05        4.75

------------------
(program exited with code: 0)

Press any key to continue . . .


As seen in the output, using the merge() can add the results of the aggregation in each line of the
dataframe to start. This can also be achieved by using transform(). This function performs aggregation as we have seen before, but at the same time shows the values calculated based on the key value on each line of the dataframe to start.

The transform() method is a more specialized function that has very specific requirements: the function passed as an argument must produce a single scalar value (aggregation) to be broadcasted.


The method to cover more general GroupBy is applicable to apply(). This method applies in its entirety the split-apply-combine scheme. In fact, this function divides the object into parts in order to be manipulated, invokes the passage of functions on each piece, and then tries to chain together the various parts. See the following program:

import pandas as pd
import numpy as np

mydataframe = pd.DataFrame({ 'color': ['white','black','white','white','black','black'],
                                'status':['up','up','down','down','down','up'],
                                'value1':[12.33,14.55,22.34,27.84,23.40,18.33],
                                'value2':[11.23,31.80,29.99,31.18,18.25,22.44]})                              
                              
                              
print('\nThe original dataframe\n')
print(mydataframe,'\n')

print(mydataframe.groupby(['color','status']).apply( lambda x: x.max()),'\n')
reindex = {0: 'first',1: 'second',2: 'third',3: 'fourth',4: 'fifth'}
recolumn = {'item':'object','price': 'value'}
print(mydataframe.rename(index=reindex, columns=recolumn),'\n')

temp = pd.date_range('1/1/2015', periods=10, freq= 'H')
print(temp,'\n')
print('time series')
timeseries = pd.Series(np.random.rand(10), index=temp)
print(timeseries,'\n')
print('time table')
timetable = pd.DataFrame( {'date': temp, 'value1' : np.random.rand(10),'value2' : np.random.rand(10)})
print(timetable,'\n')

timetable['cat'] = ['up','down','left','left','up','up','down','right','right','up']
print(timetable,'\n')


The output of the program is shown below:

The original dataframe

   color status  value1  value2
0  white     up   12.33   11.23
1  black     up   14.55   31.80
2  white   down   22.34   29.99
3  white   down   27.84   31.18
4  black   down   23.40   18.25
5  black     up   18.33   22.44

                            color status  value1  value2
color   status
black   down        black   down   23.40   18.25
           up             black     up   18.33   31.80
white  down        white   down   27.84   31.18
           up            white     up   12.33   11.23

              color status  value1  value2
first       white     up   12.33   11.23
second   black     up   14.55   31.80
third      white   down   22.34   29.99
fourth    white   down   27.84   31.18
fifth       black   down   23.40   18.25
5           black     up   18.33   22.44

DatetimeIndex(['2015-01-01 00:00:00', '2015-01-01 01:00:00',
               '2015-01-01 02:00:00', '2015-01-01 03:00:00',
               '2015-01-01 04:00:00', '2015-01-01 05:00:00',
               '2015-01-01 06:00:00', '2015-01-01 07:00:00',
               '2015-01-01 08:00:00', '2015-01-01 09:00:00'],
              dtype='datetime64[ns]', freq='H')

time series


2015-01-01 00:00:00    0.044260
2015-01-01 01:00:00    0.652472
2015-01-01 02:00:00    0.746970
2015-01-01 03:00:00    0.749644
2015-01-01 04:00:00    0.814090
2015-01-01 05:00:00    0.441604
2015-01-01 06:00:00    0.175240
2015-01-01 07:00:00    0.389164
2015-01-01 08:00:00    0.697356
2015-01-01 09:00:00    0.571612
Freq: H, dtype: float64

time table
                 date    value1    value2
0 2015-01-01 00:00:00  0.353713  0.914037
1 2015-01-01 01:00:00  0.333735  0.088877
2 2015-01-01 02:00:00  0.792908  0.552222
3 2015-01-01 03:00:00  0.150644  0.479201
4 2015-01-01 04:00:00  0.469318  0.786390
5 2015-01-01 05:00:00  0.854175  0.743490
6 2015-01-01 06:00:00  0.932783  0.667295
7 2015-01-01 07:00:00  0.304123  0.579275
8 2015-01-01 08:00:00  0.238325  0.117312
9 2015-01-01 09:00:00  0.017224  0.225015

                      date           value1      value2        cat
0 2015-01-01 00:00:00  0.353713  0.914037     up
1 2015-01-01 01:00:00  0.333735  0.088877   down
2 2015-01-01 02:00:00  0.792908  0.552222   left
3 2015-01-01 03:00:00  0.150644  0.479201   left
4 2015-01-01 04:00:00  0.469318  0.786390     up
5 2015-01-01 05:00:00  0.854175  0.743490     up
6 2015-01-01 06:00:00  0.932783  0.667295   down
7 2015-01-01 07:00:00  0.304123  0.579275  right
8 2015-01-01 08:00:00  0.238325  0.117312  right
9 2015-01-01 09:00:00  0.017224  0.225015     up

------------------
(program exited with code: 0)

Press any key to continue . . .



In the above program the through the line

timetable['cat'] = ['up','down','left','left','up','up','down','right','right','up']

we add to the dataframe preceding a column that represents a set of text values that you will use as key values.

Here I am ending today’s post. We learned about the three basic parts that divide the data manipulation: preparation, processing, and data aggregation and about a set of library functions that allow pandas to perform these operations.

Until we meet again keep practicing and learning Python, as Python is easy to learn!
Share:

0 comments:

Post a Comment