Pandas - 30 (Advanced Data Aggregation) ~ Python is easy to learn

Advanced Data Aggregation uses functions like transform() and apply() which allow us to perform many kinds of group operations, some very complex. See the following program which aims to bring together in the same dataframe the following: the dataframe of origin (the one containing the data) and that obtained by the calculation of group aggregation, for example, the sum:

import pandas as pd
import numpy as np

mydataframe = pd.DataFrame({ 'color': ['white','red','green','red','green'],
                                'price1' : [5.56,4.20,1.30,0.56,2.75],
                                'price2' : [4.75,4.12,1.60,0.75,3.15]})

print('\nThe original dataframe\n')
print(mydataframe,'\n')

sums = mydataframe.groupby('color').sum().add_prefix('tot_')

print(sums,'\n')
print(pd.merge(mydataframe,sums,left_on='color',right_index=True),'\n')

print(mydataframe.groupby('color').transform(np.sum).add_prefix('tot_'),'\n')

The output of the program is shown below:

The original dataframe

   color price1 price2
0 white    5.56    4.75
1    red    4.20    4.12
2 green    1.30    1.60
3    red    0.56    0.75
4 green    2.75    3.15

       tot_price1 tot_price2
color
green        4.05        4.75
red          4.76        4.87
white        5.56        4.75

   color price1 price2 tot_price1 tot_price2
0 white    5.56    4.75        5.56        4.75
1    red    4.20    4.12        4.76        4.87
3    red    0.56    0.75        4.76        4.87
2 green    1.30    1.60        4.05        4.75
4 green    2.75    3.15        4.05        4.75

   tot_price1 tot_price2
0        5.56        4.75
1        4.76        4.87
2        4.05        4.75
3        4.76        4.87
4        4.05        4.75

------------------
(program exited with code: 0)

Press any key to continue . . .

As seen in the output, using the merge() can add the results of the aggregation in each line of the
dataframe to start. This can also be achieved by using transform(). This function performs aggregation as we have seen before, but at the same time shows the values calculated based on the key value on each line of the dataframe to start.

The transform() method is a more specialized function that has very specific requirements: the function passed as an argument must produce a single scalar value (aggregation) to be broadcasted.

The method to cover more general GroupBy is applicable to apply(). This method applies in its entirety the split-apply-combine scheme. In fact, this function divides the object into parts in order to be manipulated, invokes the passage of functions on each piece, and then tries to chain together the various parts. See the following program:

import pandas as pd
import numpy as np

mydataframe = pd.DataFrame({ 'color': ['white','black','white','white','black','black'],
                               'status':['up','up','down','down','down','up'],
                               'value1':[12.33,14.55,22.34,27.84,23.40,18.33],
                               'value2':[11.23,31.80,29.99,31.18,18.25,22.44]})


print('\nThe original dataframe\n')
print(mydataframe,'\n')

print(mydataframe.groupby(['color','status']).apply( lambda x: x.max()),'\n')
reindex = {0: 'first',1: 'second',2: 'third',3: 'fourth',4: 'fifth'}
recolumn = {'item':'object','price': 'value'}
print(mydataframe.rename(index=reindex, columns=recolumn),'\n')

temp = pd.date_range('1/1/2015', periods=10, freq= 'H')
print(temp,'\n')
print('time series')
timeseries = pd.Series(np.random.rand(10), index=temp)
print(timeseries,'\n')
print('time table')
timetable = pd.DataFrame( {'date': temp, 'value1' : np.random.rand(10),'value2' : np.random.rand(10)})
print(timetable,'\n')

timetable['cat'] = ['up','down','left','left','up','up','down','right','right','up']
print(timetable,'\n')

The output of the program is shown below:

The original dataframe

   color status value1 value2
0 white     up   12.33   11.23
1 black     up   14.55   31.80
2 white   down   22.34   29.99
3 white   down   27.84   31.18
4 black   down   23.40   18.25
5 black     up   18.33   22.44

                    color status value1 value2
color   status
black   down        black   down   23.40   18.25
         up           black     up   18.33   31.80
white down      white   down   27.84   31.18
         up            white     up   12.33   11.23

              color status value1 value2
first       white     up   12.33   11.23
second   black     up   14.55   31.80
third    white   down   22.34   29.99
fourth    white   down   27.84   31.18
fifth       black   down   23.40   18.25
5           black     up   18.33   22.44

DatetimeIndex(['2015-01-01 00:00:00', '2015-01-01 01:00:00',
               '2015-01-01 02:00:00', '2015-01-01 03:00:00',
               '2015-01-01 04:00:00', '2015-01-01 05:00:00',
               '2015-01-01 06:00:00', '2015-01-01 07:00:00',
               '2015-01-01 08:00:00', '2015-01-01 09:00:00'],
              dtype='datetime64[ns]', freq='H')

time series

2015-01-01 00:00:00    0.044260
2015-01-01 01:00:00    0.652472
2015-01-01 02:00:00    0.746970
2015-01-01 03:00:00    0.749644
2015-01-01 04:00:00    0.814090
2015-01-01 05:00:00    0.441604
2015-01-01 06:00:00    0.175240
2015-01-01 07:00:00    0.389164
2015-01-01 08:00:00    0.697356
2015-01-01 09:00:00    0.571612
Freq: H, dtype: float64

time table
                 date    value1    value2
0 2015-01-01 00:00:00 0.353713 0.914037
1 2015-01-01 01:00:00 0.333735 0.088877
2 2015-01-01 02:00:00 0.792908 0.552222
3 2015-01-01 03:00:00 0.150644 0.479201
4 2015-01-01 04:00:00 0.469318 0.786390
5 2015-01-01 05:00:00 0.854175 0.743490
6 2015-01-01 06:00:00 0.932783 0.667295
7 2015-01-01 07:00:00 0.304123 0.579275
8 2015-01-01 08:00:00 0.238325 0.117312
9 2015-01-01 09:00:00 0.017224 0.225015

                      date           value1      value2        cat
0 2015-01-01 00:00:00 0.353713 0.914037     up
1 2015-01-01 01:00:00 0.333735 0.088877   down
2 2015-01-01 02:00:00 0.792908 0.552222   left
3 2015-01-01 03:00:00 0.150644 0.479201   left
4 2015-01-01 04:00:00 0.469318 0.786390     up
5 2015-01-01 05:00:00 0.854175 0.743490     up
6 2015-01-01 06:00:00 0.932783 0.667295   down
7 2015-01-01 07:00:00 0.304123 0.579275 right
8 2015-01-01 08:00:00 0.238325 0.117312 right
9 2015-01-01 09:00:00 0.017224 0.225015     up

------------------
(program exited with code: 0)

Press any key to continue . . .

In the above program the through the line

timetable['cat'] = ['up','down','left','left','up','up','down','right','right','up']

we add to the dataframe preceding a column that represents a set of text values that you will use as key values.

Here I am ending today’s post. We learned about the three basic parts that divide the data manipulation: preparation, processing, and data aggregation and about a set of library functions that allow pandas to perform these operations.

Until we meet again keep practicing and learning Python, as Python is easy to learn!

Python is easy to learn

Monday, May 6, 2019

Pandas - 30 (Advanced Data Aggregation)

0 comments:

Post a Comment