Pandas - 28 (Data Aggregation) ~ Python is easy to learn

Data aggregation, a transformation that produces a single value from an array, is the last stage of data manipulation.

Some simple examples of data aggregation operations are the sum(), mean(), and count() functions. These functions operate on a set of data and perform a calculation whose result is a single value. A more formal approach to data aggregation, and one that gives more control, involves categorizing the set first.
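As a quick illustration of these simple aggregations, here is a minimal sketch. The prices Series is made-up sample data, used only to show that each function reduces the whole set to one value.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
prices = pd.Series([5.56, 4.20, 1.30, 0.56, 2.75])

total = prices.sum()        # sum of all values
average = prices.mean()     # arithmetic mean
n_values = prices.count()   # number of non-missing values

print(total, average, n_values)
```

Each call returns a single scalar, no matter how many values the Series holds.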

The categorization of a set of data carried out for grouping is often a critical stage in the process of data analysis. It is a process of transformation since, after the division into different groups, we apply a function that converts or transforms the data in some way depending on the group they belong to. Very often the two phases of grouping and application of a function are performed in a single step.

For categorization pandas provides a very flexible, high-performance tool: GroupBy. Let's analyze in detail how GroupBy works. Its internal mechanism is generally described as a process called split-apply-combine, which you can think of as three phases, each expressed by one operation:

• Splitting—Division into groups of datasets
• Applying—Application of a function on each group
• Combining—Combination of all the results obtained by different groups
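The three phases listed above can be traced by hand. The following is only a sketch for understanding, not how pandas implements GroupBy internally; the small dataframe and the names pieces, means, and manual are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'color': ['white', 'red', 'green', 'red', 'green'],
                   'price1': [5.56, 4.20, 1.30, 0.56, 2.75]})

# Splitting: one sub-Series of prices per distinct color
pieces = {color: df.loc[df['color'] == color, 'price1']
          for color in sorted(df['color'].unique())}

# Applying: reduce each piece to a single value
means = {color: piece.mean() for color, piece in pieces.items()}

# Combining: collect the per-group results into one new object
manual = pd.Series(means, name='price1')
manual.index.name = 'color'

# The same result in a single step with groupby
grouped = df.groupby('color')['price1'].mean()

print(manual)
print(grouped)
```

Both printed Series hold one mean per color, which is why grouping and applying are usually written as one step.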

In the first phase, splitting, the data contained within a data structure, such as a series or a dataframe, are divided into several groups according to given criteria, which are often linked to the index or to the values in a certain column. In SQL jargon, the values contained in this column are referred to as keys. Furthermore, if you are working with two-dimensional objects such as a dataframe, the grouping criterion may be applied either to the rows (axis = 0) or to the columns (axis = 1), although column-wise grouping with axis = 1 is deprecated in recent pandas releases.
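The two kinds of grouping criterion, index labels and column values, might look like this in practice. This is a small sketch with made-up data; the names by_index and by_column are only for illustration.

```python
import pandas as pd

# Hypothetical data: the colors serve as index labels in the Series
# and as an ordinary column in the dataframe
by_index = pd.Series([5.56, 4.20, 1.30, 0.56, 2.75],
                     index=['white', 'red', 'green', 'red', 'green'])
by_column = pd.DataFrame({'color': ['white', 'red', 'green', 'red', 'green'],
                          'price1': [5.56, 4.20, 1.30, 0.56, 2.75]})

# Criterion linked to the index: group on index level 0
sum_by_index = by_index.groupby(level=0).sum()

# Criterion linked to the values of a column: 'color' holds the keys
sum_by_column = by_column.groupby('color')['price1'].sum()

print(sum_by_index)
print(sum_by_column)
```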

The second phase, applying, consists of running a function (that is, a calculation expressed by a function) on each group to produce a new, single value specific to that group.

The last phase, combining, collects the results obtained from each group and combines them to form a new object. Together, these three phases make up the split-apply-combine process of data aggregation in pandas. Let's understand this by means of an example:

import pandas as pd

mydataframe = pd.DataFrame({'color': ['white','red','green','red','green'],
                            'object': ['pen','pencil','pencil','ashtray','pen'],
                            'price1': [5.56,4.20,1.30,0.56,2.75],
                            'price2': [4.75,4.12,1.60,0.75,3.15]})

print('\nThe original dataframe\n')
print(mydataframe)

# Group the price1 column using the values of the color column as keys
group = mydataframe['price1'].groupby(mydataframe['color'])
print('\nThe created group\n')
print(group)
# Show which row labels were assigned to each group
print('\nDetail how the dataframe was divided into groups of rows\n')
print(group.groups)
# Apply an aggregation function to every group
print('\nThe mean value\n')
print(group.mean())
print('\nThe sum\n')
print(group.sum())

In the above program we define a dataframe containing numeric and string values. Next we group the price1 column using the labels listed in the color column. We do this by accessing the price1 column and calling its groupby() function with the color column as the argument.

The object returned by groupby() is a GroupBy object. Creating it does not calculate anything; it only collects the information needed to compute the average later. What we have done so far is the grouping itself: all rows having the same value of color are gathered into a single group.
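To see what the GroupBy object holds before any aggregation runs, you can iterate over it or pull out a single group. The following is a minimal sketch on a reduced version of the dataframe; the variable names are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'color': ['white', 'red', 'green', 'red', 'green'],
                   'price1': [5.56, 4.20, 1.30, 0.56, 2.75]})

group = df['price1'].groupby(df['color'])

# Iterating yields (key, sub-Series) pairs, one per group
for color, prices in group:
    print(color, list(prices))

# get_group() returns the rows that belong to a single key
red_prices = group.get_group('red')
print(red_prices)

# size() counts the rows in each group
group_sizes = group.size()
print(group_sizes)
```

No aggregation function such as mean() or sum() has been called here; the groups are simply inspected.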

Next we look at how the dataframe was divided into groups of rows by reading the groups attribute of the GroupBy object, which lists each group and the row labels of the dataframe assigned to it. Finally we apply the mean() and sum() operations to the group object to obtain the result for each individual group, as shown in the output below:

The original dataframe

   color   object  price1  price2
0  white      pen    5.56    4.75
1    red   pencil    4.20    4.12
2  green   pencil    1.30    1.60
3    red  ashtray    0.56    0.75
4  green      pen    2.75    3.15

The created group

<pandas.core.groupby.generic.SeriesGroupBy object at 0x071DEBD0>

Detail how the dataframe was divided into groups of rows

{'green': Int64Index([2, 4], dtype='int64'), 'red': Int64Index([1, 3], dtype='int64'), 'white': Int64Index([0], dtype='int64')}

The mean value

color
green    2.025
red      2.380
white    5.560
Name: price1, dtype: float64

The sum

color
green    4.05
red      4.76
white    5.56
Name: price1, dtype: float64

The output shows how to group the data using the values of a single column as the key. The same approach extends to multiple columns, that is, a hierarchical grouping on multiple keys. See the following program:

import pandas as pd

mydataframe = pd.DataFrame({'color': ['white','red','green','red','green'],
                            'object': ['pen','pencil','pencil','ashtray','pen'],
                            'price1': [5.56,4.20,1.30,0.56,2.75],
                            'price2': [4.75,4.12,1.60,0.75,3.15]})

print('\nThe original dataframe\n')
print(mydataframe)

# Group price1 on two keys: first color, then object
ggroup = mydataframe['price1'].groupby([mydataframe['color'],mydataframe['object']])
print('\nThe created group\n')
print(ggroup)
# Each group is now identified by a (color, object) tuple
print('\nDetail how the dataframe was divided into groups of rows\n')
print(ggroup.groups)
print('\nThe mean value\n')
print(ggroup.mean())
print('\nThe sum\n')
print(ggroup.sum())

The output of the program is shown below:

The original dataframe

   color   object  price1  price2
0  white      pen    5.56    4.75
1    red   pencil    4.20    4.12
2  green   pencil    1.30    1.60
3    red  ashtray    0.56    0.75
4  green      pen    2.75    3.15

The created group

<pandas.core.groupby.generic.SeriesGroupBy object at 0x002DFF90>

Detail how the dataframe was divided into groups of rows

{('green', 'pen'): Int64Index([4], dtype='int64'), ('green', 'pencil'): Int64Index([2], dtype='int64'), ('red', 'ashtray'): Int64Index([3], dtype='int64'), ('red', 'pencil'): Int64Index([1], dtype='int64'), ('white', 'pen'): Int64Index([0], dtype='int64')}

The mean value

color  object
green  pen        2.75
       pencil     1.30
red    ashtray    0.56
       pencil     4.20
white  pen        5.56
Name: price1, dtype: float64

The sum

color  object
green  pen        2.75
       pencil     1.30
red    ashtray    0.56
       pencil     4.20
white  pen        5.56
Name: price1, dtype: float64
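The result of a grouping on two keys is a Series with a hierarchical index. As a small sketch of how such a result can be read or reshaped, the following uses the same data; the names means, table, and flat are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'color': ['white', 'red', 'green', 'red', 'green'],
                   'object': ['pen', 'pencil', 'pencil', 'ashtray', 'pen'],
                   'price1': [5.56, 4.20, 1.30, 0.56, 2.75]})

means = df['price1'].groupby([df['color'], df['object']]).mean()

# Select one combination of keys with a tuple
red_pencil = means.loc[('red', 'pencil')]
print(red_pencil)

# unstack() pivots the inner level ('object') into columns;
# combinations that do not occur in the data become NaN
table = means.unstack()
print(table)

# reset_index() turns both key levels back into ordinary columns
flat = means.reset_index()
print(flat)
```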

So far we have aggregated a single column. The same grouping can also be applied to several columns, or to the entire dataframe, at once. See the following program:

import pandas as pd

mydataframe = pd.DataFrame({'color': ['white','red','green','red','green'],
                            'object': ['pen','pencil','pencil','ashtray','pen'],
                            'price1': [5.56,4.20,1.30,0.56,2.75],
                            'price2': [4.75,4.12,1.60,0.75,3.15]})

print('\nThe original dataframe\n')
print(mydataframe)

# Select both price columns, group them by color, and take the mean
ggroup = mydataframe[['price1','price2']].groupby(mydataframe['color']).mean()
print('\nThe created group using multiple columns\n')
print(ggroup)

# Group the entire dataframe; numeric_only=True skips the string column
# 'object', which newer pandas versions would otherwise refuse to average
print('\nThe created group using entire dataframe\n')
print(mydataframe.groupby(mydataframe['color']).mean(numeric_only=True))

The output of the program is shown below:

The original dataframe

   color   object  price1  price2
0  white      pen    5.56    4.75
1    red   pencil    4.20    4.12
2  green   pencil    1.30    1.60
3    red  ashtray    0.56    0.75
4  green      pen    2.75    3.15

The created group using multiple columns

       price1  price2
color
green   2.025   2.375
red     2.380   2.435
white   5.560   4.750

The created group using entire dataframe

       price1  price2
color
green   2.025   2.375
red     2.380   2.435
white   5.560   4.750
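The grouping key can also be given as a column name instead of a Series, with the columns to aggregate selected after the groupby() call. This is a sketch of that alternative spelling using the same data; the names by_name and by_series are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'color': ['white', 'red', 'green', 'red', 'green'],
                   'object': ['pen', 'pencil', 'pencil', 'ashtray', 'pen'],
                   'price1': [5.56, 4.20, 1.30, 0.56, 2.75],
                   'price2': [4.75, 4.12, 1.60, 0.75, 3.15]})

# Pass the column name as the key, then select the price columns
# so that only they are aggregated
by_name = df.groupby('color')[['price1', 'price2']].mean()

# The form used in the program above: select first, group by a Series
by_series = df[['price1', 'price2']].groupby(df['color']).mean()

print(by_name)
print(by_series)
```

Both spellings give one row per color with the mean of each price column; which one to use is mostly a matter of readability.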

Here I am ending today’s post. Until we meet again keep practicing and learning Python, as Python is easy to learn!

Thursday, May 2, 2019