In this post we will focus on operations that can be performed between the two pandas data structures (series and dataframe).
1. Flexible Arithmetic Methods
In the previous post we saw how to use mathematical operators directly on the pandas data structures. The same operations can also be performed using appropriate methods, called flexible arithmetic methods. Some of these methods are:
• add()
• sub()
• div()
• mul()
These methods are called differently from the operators we are used to. For example, to add two series s1 and s2, instead of writing s1+s2 we write s1.add(s2). The following program adds two dataframes:
import pandas as pd
import numpy as np
frame1 = pd.DataFrame(np.arange(16).reshape((4,4)),
index=['red','blue','yellow','white'],
columns=['ball','pen','pencil','paper'])
frame2 = pd.DataFrame(np.arange(12).reshape((4,3)),
index=['blue','green','white','yellow'],
columns=['mug','pen','ball'])
print('dataframe 1\n')
print(frame1)
print('\ndataframe 2\n')
print(frame2)
print('\nAdding the dataframe using operator\n')
print(frame1+frame2)
print('\nAdding the dataframe using add() method\n')
print(frame1.add(frame2))
The output of the program is shown below:
dataframe 1
ball pen pencil paper
red 0 1 2 3
blue 4 5 6 7
yellow 8 9 10 11
white 12 13 14 15
dataframe 2
mug pen ball
blue 0 1 2
green 3 4 5
white 6 7 8
yellow 9 10 11
Adding the dataframe using operator
ball mug paper pen pencil
blue 6.0 NaN NaN 6.0 NaN
green NaN NaN NaN NaN NaN
red NaN NaN NaN NaN NaN
white 20.0 NaN NaN 20.0 NaN
yellow 19.0 NaN NaN 19.0 NaN
Adding the dataframe using add() method
ball mug paper pen pencil
blue 6.0 NaN NaN 6.0 NaN
green NaN NaN NaN NaN NaN
red NaN NaN NaN NaN NaN
white 20.0 NaN NaN 20.0 NaN
yellow 19.0 NaN NaN 19.0 NaN
------------------
(program exited with code: 0)
Press any key to continue . . .
As you can see from the output, the results are the same as what you’d get using the addition operator +.
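The methods also accept optional parameters that the operators cannot take. As a small sketch (the use of fill_value here is an addition to the example above, not part of the original program), add() can treat a value that is missing from only one of the two frames as 0 instead of producing NaN:

```python
import numpy as np
import pandas as pd

frame1 = pd.DataFrame(np.arange(16).reshape((4, 4)),
                      index=['red', 'blue', 'yellow', 'white'],
                      columns=['ball', 'pen', 'pencil', 'paper'])
frame2 = pd.DataFrame(np.arange(12).reshape((4, 3)),
                      index=['blue', 'green', 'white', 'yellow'],
                      columns=['mug', 'pen', 'ball'])

# A cell that exists in only one frame is filled with 0 before adding
filled = frame1.add(frame2, fill_value=0)
print(filled)
```

Cells that are missing from both frames, such as row red in column mug, still come out as NaN.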
2. Operations Between DataFrame and Series
The following program shows an operation between a dataframe and a series:
import pandas as pd
import numpy as np
frame1 = pd.DataFrame(np.arange(16).reshape((4,4)),
index=['red','blue','yellow','white'],
columns=['ball','pen','pencil','paper'])
ser = pd.Series(np.arange(4), index=['ball','pen','pencil','paper'])
print('dataframe \n')
print(frame1)
print('\nseries\n')
print(ser)
print('\nSubstracting the series from dataframe using - operator\n')
print(frame1-ser)
print('\nSubstracting the series from dataframe using sub() method\n')
print(frame1.sub(ser))
In the above program, the two data structures have been defined so that the index labels of the series match the column names of the dataframe. This way, we can apply the operation directly. Each element of the series is subtracted from the dataframe values in the column with the same label, and this happens for every row of that column, regardless of the row index.
The output of the program is shown below:
dataframe
ball pen pencil paper
red 0 1 2 3
blue 4 5 6 7
yellow 8 9 10 11
white 12 13 14 15
series
ball 0
pen 1
pencil 2
paper 3
dtype: int32
Substracting the series from dataframe using - operator
ball pen pencil paper
red 0 0 0 0
blue 4 4 4 4
yellow 8 8 8 8
white 12 12 12 12
Substracting the series from dataframe using sub() method
ball pen pencil paper
red 0 0 0 0
blue 4 4 4 4
yellow 8 8 8 8
white 12 12 12 12
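The sub() method can also match the series against the row index instead of the column names. The following is a minimal sketch, assuming a series named row_ser with made-up values (it is not part of the original program), that uses the axis parameter of sub():

```python
import numpy as np
import pandas as pd

frame1 = pd.DataFrame(np.arange(16).reshape((4, 4)),
                      index=['red', 'blue', 'yellow', 'white'],
                      columns=['ball', 'pen', 'pencil', 'paper'])

# Hypothetical series whose labels match the row index of the dataframe
row_ser = pd.Series([0, 4, 8, 12], index=['red', 'blue', 'yellow', 'white'])

# axis='index' subtracts one series element from every value in the matching row
by_row = frame1.sub(row_ser, axis='index')
print(by_row)
```

This is something the - operator alone cannot express, since it always matches the series labels against the columns.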
If a label is not present in one of the two data structures, the result contains a new column for that label, with all its elements set to NaN. See the following program:
import pandas as pd
import numpy as np
frame1 = pd.DataFrame(np.arange(16).reshape((4,4)),
index=['red','blue','yellow','white'],
columns=['ball','pen','pencil','paper'])
ser = pd.Series(np.arange(4), index=['ball','pen','pencil','paper'])
print('dataframe \n')
print(frame1)
print('\nseries\n')
print(ser)
ser['mug'] = 9
print('\nmodified series\n')
print(ser)
print('\nSubstracting the series from dataframe using - operator\n')
print(frame1-ser)
The output of the program is shown below:
dataframe
ball pen pencil paper
red 0 1 2 3
blue 4 5 6 7
yellow 8 9 10 11
white 12 13 14 15
series
ball 0
pen 1
pencil 2
paper 3
dtype: int32
modified series
ball 0
pen 1
pencil 2
paper 3
mug 9
dtype: int64
Substracting the series from dataframe using - operator
ball mug paper pen pencil
red 0 NaN 0 0 0
blue 4 NaN 4 4 4
yellow 8 NaN 8 8 8
white 12 NaN 12 12 12
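If the extra NaN column is not wanted, one possible approach (a sketch, not part of the original program) is to restrict the series to the labels of the dataframe columns with reindex() before subtracting:

```python
import numpy as np
import pandas as pd

frame1 = pd.DataFrame(np.arange(16).reshape((4, 4)),
                      index=['red', 'blue', 'yellow', 'white'],
                      columns=['ball', 'pen', 'pencil', 'paper'])
ser = pd.Series([0, 1, 2, 3, 9],
                index=['ball', 'pen', 'pencil', 'paper', 'mug'])

# Keep only the labels that exist as columns in frame1 ('mug' is dropped)
aligned = ser.reindex(frame1.columns)
result = frame1 - aligned
print(result)
```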
3. Functions by Element
We know that the pandas library is built on top of NumPy and extends many of its features by adapting them to new data structures such as series and dataframe. Among these features are the universal functions, called ufuncs. This class of functions operates element by element on the data structure. In the following program we calculate the square root of each value in the dataframe using NumPy's np.sqrt():
import pandas as pd
import numpy as np
frame1 = pd.DataFrame(np.arange(16).reshape((4,4)),
index=['red','blue','yellow','white'],
columns=['ball','pen','pencil','paper'])
ser = pd.Series(np.arange(4), index=['ball','pen','pencil','paper'])
print('dataframe \n')
print(frame1)
print('\nSquare root of each value in the dataframe\n')
print(np.sqrt(frame1))
The output of the program is shown below:
dataframe
ball pen pencil paper
red 0 1 2 3
blue 4 5 6 7
yellow 8 9 10 11
white 12 13 14 15
Square root of each value in the dataframe
ball pen pencil paper
red 0.000000 1.000000 1.414214 1.732051
blue 2.000000 2.236068 2.449490 2.645751
yellow 2.828427 3.000000 3.162278 3.316625
white 3.464102 3.605551 3.741657 3.872983
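The same works on a series, and the labels are preserved in the result. A short sketch with a hypothetical series of perfect squares:

```python
import numpy as np
import pandas as pd

# Hypothetical series, chosen so that the square roots are whole numbers
ser = pd.Series([1, 4, 9, 16], index=['ball', 'pen', 'pencil', 'paper'])

# The ufunc is applied element by element and returns a new series
roots = np.sqrt(ser)
print(roots)
```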
4. Functions by Row or Column
The application of functions is not limited to ufuncs; it also includes functions defined by the user, which are passed to the apply() method. The important point is that they operate on a one-dimensional array and return a single number. For example, you can define a function that calculates the range covered by the elements in an array. See the following program:
import pandas as pd
import numpy as np
frame1 = pd.DataFrame(np.arange(16).reshape((4,4)),
index=['red','blue','yellow','white'],
columns=['ball','pen','pencil','paper'])
# f returns the range (max - min) of a one-dimensional array;
# the equivalent lambda form is: f = lambda x: x.max() - x.min()
def f(x):
    return x.max() - x.min()
print('dataframe \n')
print(frame1)
print('\nUsing the apply()function on the dataframe row\n')
print(frame1.apply(f))
print('\nUsing the apply()function on the dataframe column\n')
print(frame1.apply(f,axis=1))
The output of the program is shown below:
dataframe
ball pen pencil paper
red 0 1 2 3
blue 4 5 6 7
yellow 8 9 10 11
white 12 13 14 15
Using the apply()function on the dataframe row
ball 12
pen 12
pencil 12
paper 12
dtype: int64
Using the apply()function on the dataframe column
red 3
blue 3
yellow 3
white 3
dtype: int64
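For this particular function the same numbers can be obtained without apply(), by combining the max() and min() methods of the dataframe. The sketch below compares the two approaches (the variable names are only illustrative):

```python
import numpy as np
import pandas as pd

frame1 = pd.DataFrame(np.arange(16).reshape((4, 4)),
                      index=['red', 'blue', 'yellow', 'white'],
                      columns=['ball', 'pen', 'pencil', 'paper'])

# Range of each column, computed with apply() and a user-defined function
by_apply = frame1.apply(lambda x: x.max() - x.min())

# Same result using the dataframe methods directly
direct = frame1.max() - frame1.min()

# Range of each row, using axis=1
direct_rows = frame1.max(axis=1) - frame1.min(axis=1)

print(by_apply)
print(direct)
print(direct_rows)
```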
So far we have seen the apply() method used with a function that returns a scalar value. The function can also return a series. A useful case is applying several functions at once, so that we get two or more values for each row or column. This can be done by defining a function as shown in the following program:
import pandas as pd
import numpy as np
frame1 = pd.DataFrame(np.arange(16).reshape((4,4)),
index=['red','blue','yellow','white'],
columns=['ball','pen','pencil','paper'])
# f returns a series holding the minimum and maximum of a one-dimensional array
def f(x):
    return pd.Series([x.min(), x.max()], index=['min','max'])
print('dataframe \n')
print(frame1)
print('\nUsing the apply()function on the dataframe row\n')
print(frame1.apply(f))
print('\nUsing the apply()function on the dataframe column\n')
print(frame1.apply(f,axis=1))
The output of the program is shown below:
dataframe
ball pen pencil paper
red 0 1 2 3
blue 4 5 6 7
yellow 8 9 10 11
white 12 13 14 15
Using the apply()function on the dataframe row
ball pen pencil paper
min 0 1 2 3
max 12 13 14 15
Using the apply()function on the dataframe column
min max
red 0 3
blue 4 7
yellow 8 11
white 12 15
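When the functions to combine are already available as dataframe methods, the agg() method offers a shorter route to the column-wise table. This is a sketch of an alternative, not what the program above uses:

```python
import numpy as np
import pandas as pd

frame1 = pd.DataFrame(np.arange(16).reshape((4, 4)),
                      index=['red', 'blue', 'yellow', 'white'],
                      columns=['ball', 'pen', 'pencil', 'paper'])

# One row per function name, one column per dataframe column
summary = frame1.agg(['min', 'max'])
print(summary)
```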
5. Statistics Functions
Most of the statistical functions for arrays are still valid for dataframe, so using the apply() function is no longer necessary. For example, functions such as sum() and mean() can calculate the sum and the average, respectively, of the elements contained within a dataframe. There is also a function called describe() that allows you to obtain summary statistics at once. See the following program:
import pandas as pd
import numpy as np
frame1 = pd.DataFrame(np.arange(16).reshape((4,4)),
index=['red','blue','yellow','white'],
columns=['ball','pen','pencil','paper'])
print('dataframe \n')
print(frame1)
print('\nUsing the sum()function on the dataframe \n')
print(frame1.sum())
print('\nUsing the mean()function on the dataframe \n')
print(frame1.mean())
print('\nUsing the describe()function on the dataframe \n')
print(frame1.describe())
The output of the program is shown below:
dataframe
ball pen pencil paper
red 0 1 2 3
blue 4 5 6 7
yellow 8 9 10 11
white 12 13 14 15
Using the sum()function on the dataframe
ball 24
pen 28
pencil 32
paper 36
dtype: int64
Using the mean()function on the dataframe
ball 6.0
pen 7.0
pencil 8.0
paper 9.0
dtype: float64
Using the describe()function on the dataframe
ball pen pencil paper
count 4.000000 4.000000 4.000000 4.000000
mean 6.000000 7.000000 8.000000 9.000000
std 5.163978 5.163978 5.163978 5.163978
min 0.000000 1.000000 2.000000 3.000000
25% 3.000000 4.000000 5.000000 6.000000
50% 6.000000 7.000000 8.000000 9.000000
75% 9.000000 10.000000 11.000000 12.000000
max 12.000000 13.000000 14.000000 15.000000
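By default sum() and mean() work column by column. As a small sketch, passing axis=1 computes the same statistics for each row instead:

```python
import numpy as np
import pandas as pd

frame1 = pd.DataFrame(np.arange(16).reshape((4, 4)),
                      index=['red', 'blue', 'yellow', 'white'],
                      columns=['ball', 'pen', 'pencil', 'paper'])

# axis=1 collapses the columns, giving one value per row
row_sums = frame1.sum(axis=1)
row_means = frame1.mean(axis=1)
print(row_sums)
print(row_means)
```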
6. Sorting and Ranking
Sorting the data is often a necessity and it is very important to be able to do it easily. pandas provides the sort_index() function, which returns a new object identical to the original except that its elements are ordered by index label. In the following program we'll see how you can sort items in a series. The operation is simple since there is only one index to sort:
import pandas as pd
import numpy as np
ser = pd.Series([5,0,3,8,4],index=['red','blue','yellow','white','green'])
print('The series:\n')
print(ser)
print('\nThe sorted() series:\n')
print(ser.sort_index())
print('\nThe sorted() series in descending order:\n')
print(ser.sort_index(ascending=False))
The output of the program is shown below:
The series:
red 5
blue 0
yellow 3
white 8
green 4
dtype: int64
The sorted() series:
blue 0
green 4
red 5
white 8
yellow 3
dtype: int64
The sorted() series in descending order:
yellow 3
white 8
red 5
green 4
blue 0
dtype: int64
As we can see, the items were sorted in ascending alphabetical order based on their labels (from A to Z). This is the default behavior, but you can set the opposite order by setting the ascending option to False.
In the next program we'll see how to sort items in a dataframe. With a dataframe, sorting can be performed independently on each of its two axes. To order the rows by their index labels, keep using the sort_index() function without arguments as seen before; to order the columns by their names, set the axis option to 1. See the following program:
import pandas as pd
import numpy as np
frame1 = pd.DataFrame(np.arange(16).reshape((4,4)),
index=['red','blue','yellow','white'],
columns=['ball','pen','pencil','paper'])
print('The dataframe:\n')
print(frame1)
print('\nThe sorted() dataframe order by row:\n')
print(frame1.sort_index())
print('\nThe sorted() dataframe order by columns:\n')
print(frame1.sort_index(axis=1))
The output of the program is shown below:
The dataframe:
ball pen pencil paper
red 0 1 2 3
blue 4 5 6 7
yellow 8 9 10 11
white 12 13 14 15
The sorted() dataframe order by row:
ball pen pencil paper
blue 4 5 6 7
red 0 1 2 3
white 12 13 14 15
yellow 8 9 10 11
The sorted() dataframe order by columns:
ball paper pen pencil
red 0 3 1 2
blue 4 7 5 6
yellow 8 11 9 10
white 12 15 13 14
We have seen how to sort the values according to the indexes. But very often we may need to sort the values contained in the data structure. In this case, we have to differentiate depending on whether we have to sort the values of a series or a dataframe.
To sort the values of a series, use the sort_values() function. To sort the values of a dataframe, use the same sort_values() function with the by option, specifying the name of the column to sort on. If the sorting criterion is based on two or more columns, you can assign a list containing the column names to the by option.
See the following program:
import pandas as pd
import numpy as np
frame1 = pd.DataFrame(np.arange(16).reshape((4,4)),
index=['red','blue','yellow','white'],
columns=['ball','pen','pencil','paper'])
ser = pd.Series([5,0,3,8,4],index=['red','blue','yellow','white','green'])
print('Sort the values of a series:\n')
print(ser.sort_values())
print('\nThe sorted dataframe:\n')
print(frame1.sort_values(by='pen'))
print('\nThe sorted dataframe based on two or more columns:\n')
print(frame1.sort_values(by=['pen','pencil']))
The output of the program is shown below:
Sort the values of a series:
blue 0
yellow 3
green 4
red 5
white 8
dtype: int64
The sorted dataframe:
ball pen pencil paper
red 0 1 2 3
blue 4 5 6 7
yellow 8 9 10 11
white 12 13 14 15
The sorted dataframe based on two or more columns:
ball pen pencil paper
red 0 1 2 3
blue 4 5 6 7
yellow 8 9 10 11
white 12 13 14 15
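Since frame1 is already in ascending order on the pen column, the sorted dataframe looks the same as the original. As a sketch, passing ascending=False to sort_values() makes the effect of the sort visible:

```python
import numpy as np
import pandas as pd

frame1 = pd.DataFrame(np.arange(16).reshape((4, 4)),
                      index=['red', 'blue', 'yellow', 'white'],
                      columns=['ball', 'pen', 'pencil', 'paper'])
ser = pd.Series([5, 0, 3, 8, 4],
                index=['red', 'blue', 'yellow', 'white', 'green'])

# Largest values first
ser_desc = ser.sort_values(ascending=False)
frame_desc = frame1.sort_values(by='pen', ascending=False)
print(ser_desc)
print(frame_desc)
```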
Ranking is an operation closely related to sorting. It consists of assigning a rank (that is, a value that starts at 1 and increases with the position of the element in sorted order) to each element of the series. Ranks are assigned from the lowest value to the highest.
By default, elements with equal values share the average of their ranks. To break such ties according to the order in which the data already appear in the data structure, set the method option to 'first'. By default the ranking follows an ascending sort. To reverse this criterion, set the ascending option to False.
See the following program:
import pandas as pd
import numpy as np
frame1 = pd.DataFrame(np.arange(16).reshape((4,4)),
index=['red','blue','yellow','white'],
columns=['ball','pen','pencil','paper'])
ser = pd.Series([5,0,3,8,4],index=['red','blue','yellow','white','green'])
print('assigning a rank to each element of a series:\n')
print(ser.rank())
print('\nThe rank assigned in the order of data:\n')
print(ser.rank(method='first'))
print('\nThe ranking assigned in descending order:\n')
print(ser.rank(ascending=False))
The output of the program is shown below:
assigning a rank to each element of a series:
red 4.0
blue 1.0
yellow 2.0
white 5.0
green 3.0
dtype: float64
The rank assigned in the order of data:
red 4.0
blue 1.0
yellow 2.0
white 5.0
green 3.0
dtype: float64
The ranking assigned in descending order:
red 2.0
blue 5.0
yellow 4.0
white 1.0
green 3.0
dtype: float64
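In the series above every value is distinct, so rank() and rank(method='first') give the same result. A small sketch with a hypothetical series that contains a tie shows where the two differ:

```python
import pandas as pd

# Hypothetical series in which the value 7 appears twice
tied = pd.Series([7, 3, 7, 1], index=['a', 'b', 'c', 'd'])

# Default method: tied elements share the average of their ranks
avg_rank = tied.rank()

# method='first': ties are broken by the order of appearance
first_rank = tied.rank(method='first')

print(avg_rank)
print(first_rank)
```

With the default method both occurrences of 7 get rank 3.5, while with method='first' they get ranks 3.0 and 4.0.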
7. Correlation and Covariance
The correlation and covariance calculations are expressed in pandas by the corr() and cov() functions. These kinds of calculations normally involve two series. See the following program:
import pandas as pd
import numpy as np
seq2 = pd.Series([3,4,3,4,5,4,3,2],['2006','2007','2008','2009','2010','2011','2012','2013'])
seq = pd.Series([1,2,3,4,4,3,2,1],['2006','2007','2008','2009','2010','2011','2012','2013'])
print('Correlation\n')
print(seq.corr(seq2))
print('\nCovariance\n')
print(seq.cov(seq2))
The output of the program is shown below:
Correlation
0.7745966692414835
Covariance
0.8571428571428571
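To see where these two numbers come from, the sketch below recomputes them by hand with NumPy, using the sample covariance (dividing by n - 1) and the Pearson correlation. The names cov_manual and corr_manual are only illustrative:

```python
import numpy as np
import pandas as pd

years = ['2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013']
seq2 = pd.Series([3, 4, 3, 4, 5, 4, 3, 2], years)
seq = pd.Series([1, 2, 3, 4, 4, 3, 2, 1], years)

x = seq.to_numpy(dtype=float)
y = seq2.to_numpy(dtype=float)

# Sample covariance: mean product of the deviations, divided by n - 1
cov_manual = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)

# Pearson correlation: covariance divided by the two sample standard deviations
corr_manual = cov_manual / (x.std(ddof=1) * y.std(ddof=1))

print(cov_manual, seq.cov(seq2))
print(corr_manual, seq.corr(seq2))
```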
Covariance and correlation can also be applied to a single dataframe. In this case, they return their corresponding matrices in the form of two new dataframe objects. See the following program:
import pandas as pd
import numpy as np
frame2 = pd.DataFrame([[1,4,3,6],[4,5,6,1],[3,3,1,5],[4,1,6,4]],index=['red','blue','yellow','white'],columns=['ball','pen','pencil','paper'])
print('Correlation\n')
print(frame2.corr())
print('\nCovariance\n')
print(frame2.cov())
The output of the program is shown below:
Correlation
ball pen pencil paper
ball 1.000000 -0.276026 0.577350 -0.763763
pen -0.276026 1.000000 -0.079682 -0.361403
pencil 0.577350 -0.079682 1.000000 -0.692935
paper -0.763763 -0.361403 -0.692935 1.000000
Covariance
ball pen pencil paper
ball 2.000000 -0.666667 2.000000 -2.333333
pen -0.666667 2.916667 -0.333333 -1.333333
pencil 2.000000 -0.333333 6.000000 -3.666667
paper -2.333333 -1.333333 -3.666667 4.666667
We can calculate the pairwise correlations between the columns or rows of a dataframe and a series or another dataframe using the corrwith() method. See the following program:
import pandas as pd
import numpy as np
frame1 = pd.DataFrame(np.arange(16).reshape((4,4)),
index=['red','blue','yellow','white'],
columns=['ball','pen','pencil','paper'])
frame2 = pd.DataFrame([[1,4,3,6],[4,5,6,1],[3,3,1,5],[4,1,6,4]],index=['red','blue','yellow','white'],columns=['ball','pen','pencil','paper'])
ser = pd.Series([0,1,2,3,9],index=['red','blue','yellow','white','green'])
print('Correlation with series\n')
print(frame2.corrwith(ser))
print('\nCorrelation with dataframe\n')
print(frame2.corrwith(frame1))
The output of the program is shown below:
Correlation with series
ball 0.730297
pen -0.831522
pencil 0.210819
paper -0.119523
dtype: float64
Correlation with dataframe
ball 0.730297
pen -0.831522
pencil 0.210819
paper -0.119523
dtype: float64
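The two results coincide because every column of frame1 grows linearly with the values 0, 1, 2, 3 of the series. As a sketch of what corrwith() does with a series, the same numbers can be obtained by calling corr() on each column in turn; the label green, which exists only in the series, is left out of the calculation:

```python
import numpy as np
import pandas as pd

frame2 = pd.DataFrame([[1, 4, 3, 6], [4, 5, 6, 1], [3, 3, 1, 5], [4, 1, 6, 4]],
                      index=['red', 'blue', 'yellow', 'white'],
                      columns=['ball', 'pen', 'pencil', 'paper'])
ser = pd.Series([0, 1, 2, 3, 9],
                index=['red', 'blue', 'yellow', 'white', 'green'])

# Correlate each column with the series, one column at a time
per_column = pd.Series({col: frame2[col].corr(ser) for col in frame2.columns})

print(per_column)
print(frame2.corrwith(ser))
```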
Here I am ending today’s post. Until we meet again keep practicing and learning Python, as Python is easy to learn!
1. Flexible Arithmetic Methods
In the previous post we saw how to use mathematical operators directly on the pandas data structures. The same operations can also be performed using appropriate methods, called flexible arithmetic methods. Some of these methods are:
• add()
• sub()
• div()
• mul()
Using these functions needs a different specification than what we're used to dealing with mathematical operators for example, if we add two series s1 and s2, then instead of writing s1+s2 we have to use s1.add(s2). The following program adds two frames:
import pandas as pd
import numpy as np
frame1 = pd.DataFrame(np.arange(16).reshape((4,4)),
index=['red','blue','yellow','white'],
columns=['ball','pen','pencil','paper'])
frame2 = pd.DataFrame(np.arange(12).reshape((4,3)),
index=['blue','green','white','yellow'],
columns=['mug','pen','ball'])
print('dataframe 1\n')
print(frame1)
print('\ndataframe 2\n')
print(frame2)
print('\nAdding the dataframe using operator\n')
print(frame1+frame2)
print('\nAdding the dataframe using add() method\n')
print(frame1.add(frame2))
The output of the program is shown below:
dataframe 1
ball pen pencil paper
red 0 1 2 3
blue 4 5 6 7
yellow 8 9 10 11
white 12 13 14 15
dataframe 2
mug pen ball
blue 0 1 2
green 3 4 5
white 6 7 8
yellow 9 10 11
Adding the dataframe using operator
ball mug paper pen pencil
blue 6.0 NaN NaN 6.0 NaN
green NaN NaN NaN NaN NaN
red NaN NaN NaN NaN NaN
white 20.0 NaN NaN 20.0 NaN
yellow 19.0 NaN NaN 19.0 NaN
Adding the dataframe using add() method
ball mug paper pen pencil
blue 6.0 NaN NaN 6.0 NaN
green NaN NaN NaN NaN NaN
red NaN NaN NaN NaN NaN
white 20.0 NaN NaN 20.0 NaN
yellow 19.0 NaN NaN 19.0 NaN
------------------
(program exited with code: 0)
Press any key to continue . . .
As you can see from the output, the results are the same as what you’d get using the addition operator +.
2. Operations Between DataFrame and Series
The following program shows transaction between a dataframe and a series:
import pandas as pd
import numpy as np
frame1 = pd.DataFrame(np.arange(16).reshape((4,4)),
index=['red','blue','yellow','white'],
columns=['ball','pen','pencil','paper'])
ser = pd.Series(np.arange(4), index=['ball','pen','pencil','paper'])
print('dataframe \n')
print(frame1)
print('\nseries\n')
print(ser)
print('\nSubstracting the series from dataframe using - operator\n')
print(frame1-ser)
print('\nSubstracting the series from dataframe using sub() method\n')
print(frame1.sub(ser))
In the above program two newly defined data structures have been created specifically so that the
indexes of series match the names of the columns of the dataframe. This way, we can apply a direct operation. The elements of the series are subtracted from the values of the dataframe corresponding to the same index on the column. The value is subtracted for all values of the column, regardless of their index.
The output of the program is shown below:
dataframe
ball pen pencil paper
red 0 1 2 3
blue 4 5 6 7
yellow 8 9 10 11
white 12 13 14 15
series
ball 0
pen 1
pencil 2
paper 3
dtype: int32
Substracting the series from dataframe using - operator
ball pen pencil paper
red 0 0 0 0
blue 4 4 4 4
yellow 8 8 8 8
white 12 12 12 12
Substracting the series from dataframe using sub() method
ball pen pencil paper
red 0 0 0 0
blue 4 4 4 4
yellow 8 8 8 8
white 12 12 12 12
------------------
(program exited with code: 0)
Press any key to continue . . .
If an index is not present in one of the two data structures, the result will be a new column with that index only that all its elements will be NaN. See the following program:
import pandas as pd
import numpy as np
frame1 = pd.DataFrame(np.arange(16).reshape((4,4)),
index=['red','blue','yellow','white'],
columns=['ball','pen','pencil','paper'])
ser = pd.Series(np.arange(4), index=['ball','pen','pencil','paper'])
print('dataframe \n')
print(frame1)
print('\nseries\n')
print(ser)
ser['mug'] = 9
print('\nmodified series\n')
print(ser)
print('\nSubstracting the series from dataframe using - operator\n')
print(frame1-ser)
The output of the program is shown below:
dataframe
ball pen pencil paper
red 0 1 2 3
blue 4 5 6 7
yellow 8 9 10 11
white 12 13 14 15
series
ball 0
pen 1
pencil 2
paper 3
dtype: int32
modified series
ball 0
pen 1
pencil 2
paper 3
mug 9
dtype: int64
Substracting the series from dataframe using - operator
ball mug paper pen pencil
red 0 NaN 0 0 0
blue 4 NaN 4 4 4
yellow 8 NaN 8 8 8
white 12 NaN 12 12 12
------------------
(program exited with code: 0)
Press any key to continue . . .
3. Functions by Element
We know that the pandas library is built on the foundations of NumPy and then extends many of its
features by adapting them to new data structures as series and dataframe. Among these are the universal functions, called ufunc. This class of functions operates by element in the data structure. In the following program we calculate the square root of each value in the dataframe using the NumPy np.sqrt():
import pandas as pd
import numpy as np
frame1 = pd.DataFrame(np.arange(16).reshape((4,4)),
index=['red','blue','yellow','white'],
columns=['ball','pen','pencil','paper'])
ser = pd.Series(np.arange(4), index=['ball','pen','pencil','paper'])
print('dataframe \n')
print(frame1)
print('\nSquare root of each value in the dataframe\n')
print(np.sqrt(frame1))
The output of the program is shown below:
dataframe
ball pen pencil paper
red 0 1 2 3
blue 4 5 6 7
yellow 8 9 10 11
white 12 13 14 15
Square root of each value in the dataframe
ball pen pencil paper
red 0.000000 1.000000 1.414214 1.732051
blue 2.000000 2.236068 2.449490 2.645751
yellow 2.828427 3.000000 3.162278 3.316625
white 3.464102 3.605551 3.741657 3.872983
------------------
(program exited with code: 0)
Press any key to continue . . .
4. Functions by Row or Column
The application of the functions is not limited to the ufunc functions, but also includes those defined by the user. The important point is that they operate on a one-dimensional array, giving a single number as a result. For example, you can define a lambda function that calculates the range covered by the elements in an array. See the following program:
import pandas as pd
import numpy as np
frame1 = pd.DataFrame(np.arange(16).reshape((4,4)),
index=['red','blue','yellow','white'],
columns=['ball','pen','pencil','paper'])
f = lambda x: x.max() - x.min()
def f(x):
return x.max() - x.min()
print('dataframe \n')
print(frame1)
print('\nUsing the apply()function on the dataframe\n')
print(frame1.apply(f))
print('\nUsing the apply()function on the dataframe column\n')
print(frame1.apply(f,axis=1))
The output of the program is shown below:
dataframe
ball pen pencil paper
red 0 1 2 3
blue 4 5 6 7
yellow 8 9 10 11
white 12 13 14 15
Using the apply()function on the dataframe row
ball 12
pen 12
pencil 12
paper 12
dtype: int64
Using the apply()function on the dataframe column
red 3
blue 3
yellow 3
white 3
dtype: int64
------------------
(program exited with code: 0)
Press any key to continue . . .
So far we have seen that the method apply() return a scalar value. It can also return a series. A useful case would be to extend the application to many functions simultaneously. In this case, we will have two or more values for each feature applied. This can be done by defining a function as shown in the following program:
import pandas as pd
import numpy as np
frame1 = pd.DataFrame(np.arange(16).reshape((4,4)),
index=['red','blue','yellow','white'],
columns=['ball','pen','pencil','paper'])
f = lambda x: x.max() - x.min()
def f(x):
return pd.Series([x.min(), x.max()], index=['min','max'])
print('dataframe \n')
print(frame1)
print('\nUsing the apply()function on the dataframe row\n')
print(frame1.apply(f))
print('\nUsing the apply()function on the dataframe column\n')
print(frame1.apply(f,axis=1))
The output of the program is shown below:
dataframe
ball pen pencil paper
red 0 1 2 3
blue 4 5 6 7
yellow 8 9 10 11
white 12 13 14 15
Using the apply()function on the dataframe row
ball pen pencil paper
min 0 1 2 3
max 12 13 14 15
Using the apply()function on the dataframe column
min max
red 0 3
blue 4 7
yellow 8 11
white 12 15
------------------
(program exited with code: 0)
Press any key to continue . . .
5. Statistics Functions
Most of the statistical functions for arrays are still valid for dataframe, so using the apply() function is no longer necessary. For example, functions such as sum() and mean() can calculate the sum and the average, respectively, of the elements contained within a dataframe. There is also a function called describe() that allows you to obtain summary statistics at once. See the following program:
import pandas as pd
import numpy as np
frame1 = pd.DataFrame(np.arange(16).reshape((4,4)),
index=['red','blue','yellow','white'],
columns=['ball','pen','pencil','paper'])
print('dataframe \n')
print(frame1)
print('\nUsing the sum()function on the dataframe \n')
print(frame1.sum())
print('\nUsing the mean()function on the dataframe \n')
print(frame1.mean())
print('\nUsing the describe()function on the dataframe \n')
print(frame1.describe())
The output of the program is shown below:
dataframe
ball pen pencil paper
red 0 1 2 3
blue 4 5 6 7
yellow 8 9 10 11
white 12 13 14 15
Using the sum()function on the dataframe
ball 24
pen 28
pencil 32
paper 36
dtype: int64
Using the mean()function on the dataframe
ball 6.0
pen 7.0
pencil 8.0
paper 9.0
dtype: float64
Using the describe()function on the dataframe
ball pen pencil paper
count 4.000000 4.000000 4.000000 4.000000
mean 6.000000 7.000000 8.000000 9.000000
std 5.163978 5.163978 5.163978 5.163978
min 0.000000 1.000000 2.000000 3.000000
25% 3.000000 4.000000 5.000000 6.000000
50% 6.000000 7.000000 8.000000 9.000000
75% 9.000000 10.000000 11.000000 12.000000
max 12.000000 13.000000 14.000000 15.000000
------------------
(program exited with code: 0)
Press any key to continue . . .
6. Sorting and Ranking
Sorting the data is often a necessity and it is very important to be able to do it easily. pandas provides the sort_index() function, which returns a new object that’s identical to the start, but in which the elements are ordered. In the following program we'll see how you can sort items in a series. The operation is quite trivial since the list of indexes to be ordered is only one:
import pandas as pd
import numpy as np
ser = pd.Series([5,0,3,8,4],index=['red','blue','yellow','white','green'])
print('The series:\n')
print(ser)
print('\nThe sorted() series:\n')
print(ser.sort_index())
print('\nThe sorted() series in descending order:\n')
print(ser.sort_index(ascending=False))
The output of the program is shown below:
The series:
red 5
blue 0
yellow 3
white 8
green 4
dtype: int64
The sorted() series:
blue 0
green 4
red 5
white 8
yellow 3
dtype: int64
The sorted() series in descending order:
yellow 3
white 8
red 5
green 4
blue 0
dtype: int64
------------------
(program exited with code: 0)
Press any key to continue . . .
As we can see, the items were sorted in ascending alphabetical order based on their labels (from A to Z). This is the default behavior, but you can set the opposite order by setting the ascending option to False.
In the next program we'll see how you can sort items in a dataframe. With the dataframe, the sorting can be performed independently on each of its two axes. So if you want to order by row following the indexes, you just continue to use the sort_index() function without arguments as you’ve seen before, or if you prefer to order by columns, you need to set the axis options to 1. See the following program:
import pandas as pd
import numpy as np
frame1 = pd.DataFrame(np.arange(16).reshape((4,4)),
index=['red','blue','yellow','white'],
columns=['ball','pen','pencil','paper'])
print('The dataframe:\n')
print(frame1)
print('\nThe sorted() dataframe order by row:\n')
print(frame1.sort_index())
print('\nThe sorted() dataframe order by columns:\n')
print(frame1.sort_index(axis=1))
The output of the program is shown below:
The series:
red 5
blue 0
yellow 3
white 8
green 4
dtype: int64
The sorted() series:
blue 0
green 4
red 5
white 8
yellow 3
dtype: int64
The sorted() series in descending order:
yellow 3
white 8
red 5
green 4
blue 0
dtype: int64
------------------
(program exited with code: 0)
Press any key to continue . . .
We have seen how to sort the values according to the indexes. But very often we may need to sort the values contained in the data structure. In this case, we have to differentiate depending on whether we have to sort the values of a series or a dataframe.
If we want to order the series, you need to use the sort_values() function. If you need to order the values in a dataframe, use the sort_values() function seen previously but with the by option. Then you have to specify the name of the column on which to sort. If the sorting criteria will be based on two or more columns, you can assign an array containing the names of the columns to the by option.
See the following program:
import pandas as pd
import numpy as np
frame1 = pd.DataFrame(np.arange(16).reshape((4,4)),
index=['red','blue','yellow','white'],
columns=['ball','pen','pencil','paper'])
ser = pd.Series([5,0,3,8,4],index=['red','blue','yellow','white','green'])
print('Sort the values of a series:\n')
print(ser.sort_values())
print('\nThe sorted dataframe:\n')
print(frame1.sort_values(by='pen'))
print('\nThe sorted dataframe based on two or more columns:\n')
print(frame1.sort_values(by=['pen','pencil']))
The output of the program is shown below:
Sort the values of a series:
blue 0
yellow 3
green 4
red 5
white 8
dtype: int64
The sorted dataframe:
ball pen pencil paper
red 0 1 2 3
blue 4 5 6 7
yellow 8 9 10 11
white 12 13 14 15
The sorted dataframe based on two or more columns:
ball pen pencil paper
red 0 1 2 3
blue 4 5 6 7
yellow 8 9 10 11
white 12 13 14 15
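Since frame1 is already in ascending order on every column, the sorted output looks the same as the input. As a small sketch, assuming a made-up dataframe called scores (not part of the original program) with repeated values in the pen column, we can see both a descending sort and the effect of a second sort column:

```python
import pandas as pd

# Hypothetical data: 'pen' contains ties, so the second sort key matters.
scores = pd.DataFrame(
    {'pen': [5, 1, 5, 1], 'pencil': [2, 9, 0, 4]},
    index=['red', 'blue', 'yellow', 'white'],
)

# Descending sort on a single column.
by_pencil_desc = scores.sort_values(by='pencil', ascending=False)
print(by_pencil_desc)

# Sort by 'pen' first; rows with the same 'pen' value are ordered by 'pencil'.
by_two_columns = scores.sort_values(by=['pen', 'pencil'])
print(by_two_columns)
```

The second column in the by list is only consulted when two rows have the same value in the first column.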
Ranking is an operation closely related to sorting. It assigns a rank (a value that starts at 1 and increases with each position) to each element of the series, going from the lowest value to the highest.
By default (method='average'), equal values share the average of the ranks they would occupy. With the method option set to 'first', equal values are instead ranked in the order in which they appear in the data. Since the series used here has no repeated values, rank() and rank(method='first') return the same result. Ranking is ascending by default; to reverse it, set the ascending option to False.
See the following program:
import pandas as pd
import numpy as np
frame1 = pd.DataFrame(np.arange(16).reshape((4,4)),
index=['red','blue','yellow','white'],
columns=['ball','pen','pencil','paper'])
ser = pd.Series([5,0,3,8,4],index=['red','blue','yellow','white','green'])
print('assigning a rank to each element of a series:\n')
print(ser.rank())
print('\nThe rank assigned in the order of data:\n')
print(ser.rank(method='first'))
print('\nThe ranking assigned in descending order:\n')
print(ser.rank(ascending=False))
The output of the program is shown below:
assigning a rank to each element of a series:
red 4.0
blue 1.0
yellow 2.0
white 5.0
green 3.0
dtype: float64
The rank assigned in the order of data:
red 4.0
blue 1.0
yellow 2.0
white 5.0
green 3.0
dtype: float64
The ranking assigned in descending order:
red 2.0
blue 5.0
yellow 4.0
white 1.0
green 3.0
dtype: float64
7. Correlation and Covariance
Correlation and covariance are calculated in pandas with the corr() and cov() methods. These kinds of calculations normally involve two series. See the following program:
import pandas as pd
import numpy as np
seq2 = pd.Series([3,4,3,4,5,4,3,2],['2006','2007','2008','2009','2010','2011','2012','2013'])
seq = pd.Series([1,2,3,4,4,3,2,1],['2006','2007','2008','2009','2010','2011','2012','2013'])
print('Correlation\n')
print(seq.corr(seq2))
print('\nCovariance\n')
print(seq.cov(seq2))
The output of the program is shown below:
Correlation
0.7745966692414835
Covariance
0.8571428571428571
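To see where these two numbers come from, here is a rough sketch that recomputes them by hand. It assumes the usual sample definitions (dividing by n - 1), which is what cov(), corr() and std() use by default; the names manual_cov and manual_corr are only for illustration:

```python
import pandas as pd

years = ['2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013']
seq = pd.Series([1, 2, 3, 4, 4, 3, 2, 1], index=years)
seq2 = pd.Series([3, 4, 3, 4, 5, 4, 3, 2], index=years)

# Sample covariance: products of the mean-centred values, summed,
# then divided by n - 1.
n = len(seq)
manual_cov = ((seq - seq.mean()) * (seq2 - seq2.mean())).sum() / (n - 1)

# Pearson correlation: the covariance scaled by both sample standard deviations.
manual_corr = manual_cov / (seq.std() * seq2.std())

print(manual_cov, seq.cov(seq2))
print(manual_corr, seq.corr(seq2))
```

Each printed pair should agree up to floating point rounding, which shows that the correlation is simply the covariance rescaled to lie between -1 and 1.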
Covariance and correlation can also be applied to a single dataframe. In this case, they return their corresponding matrices in the form of two new dataframe objects. See the following program:
import pandas as pd
import numpy as np
frame2 = pd.DataFrame([[1,4,3,6],[4,5,6,1],[3,3,1,5],[4,1,6,4]],index=['red','blue','yellow','white'],columns=['ball','pen','pencil','paper'])
print('Correlation\n')
print(frame2.corr())
print('\nCovariance\n')
print(frame2.cov())
The output of the program is shown below:
Correlation
ball pen pencil paper
ball 1.000000 -0.276026 0.577350 -0.763763
pen -0.276026 1.000000 -0.079682 -0.361403
pencil 0.577350 -0.079682 1.000000 -0.692935
paper -0.763763 -0.361403 -0.692935 1.000000
Covariance
ball pen pencil paper
ball 2.000000 -0.666667 2.000000 -2.333333
pen -0.666667 2.916667 -0.333333 -1.333333
pencil 2.000000 -0.333333 6.000000 -3.666667
paper -2.333333 -1.333333 -3.666667 4.666667
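Each cell of these matrices is just the result for one pair of columns. As a quick sketch using the same frame2 (the variable names corr_matrix, pair_from_matrix and pair_direct are only for illustration), we can pick one cell out of the correlation matrix and compare it with the series-to-series calculation:

```python
import pandas as pd

frame2 = pd.DataFrame(
    [[1, 4, 3, 6], [4, 5, 6, 1], [3, 3, 1, 5], [4, 1, 6, 4]],
    index=['red', 'blue', 'yellow', 'white'],
    columns=['ball', 'pen', 'pencil', 'paper'],
)

corr_matrix = frame2.corr()

# Read a single pair out of the matrix by row and column label.
pair_from_matrix = corr_matrix.loc['ball', 'pen']

# The same pair computed directly between two columns (two series).
pair_direct = frame2['ball'].corr(frame2['pen'])

print(pair_from_matrix, pair_direct)
```

The matrix is symmetric, so corr_matrix.loc['pen', 'ball'] holds the same value, and the diagonal is always 1 because every column is perfectly correlated with itself.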
We can calculate the pairwise correlations between the columns or rows of a dataframe and a series or another dataframe using the corrwith() method. See the following program:
import pandas as pd
import numpy as np
frame1 = pd.DataFrame(np.arange(16).reshape((4,4)),
index=['red','blue','yellow','white'],
columns=['ball','pen','pencil','paper'])
frame2 = pd.DataFrame([[1,4,3,6],[4,5,6,1],[3,3,1,5],[4,1,6,4]],index=['red','blue','yellow','white'],columns=['ball','pen','pencil','paper'])
ser = pd.Series([0,1,2,3,9],index=['red','blue','yellow','white','green'])
print('Correlation with series\n')
print(frame2.corrwith(ser))
print('\nCorrelation with dataframe\n')
print(frame2.corrwith(frame1))
The output of the program is shown below:
Correlation with series
ball 0.730297
pen -0.831522
pencil 0.210819
paper -0.119523
dtype: float64
Correlation with dataframe
ball 0.730297
pen -0.831522
pencil 0.210819
paper -0.119523
dtype: float64
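The two results above are identical, which may look surprising at first. The sketch below suggests why: corrwith() matches rows by index label, so the extra green entry in ser is ignored, and every column of frame1 is a straight-line function of the remaining values 0, 1, 2, 3, which leaves the Pearson correlation unchanged. The names aligned and manual_ball are only for illustration:

```python
import pandas as pd
import numpy as np

frame1 = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=['red', 'blue', 'yellow', 'white'],
    columns=['ball', 'pen', 'pencil', 'paper'],
)
frame2 = pd.DataFrame(
    [[1, 4, 3, 6], [4, 5, 6, 1], [3, 3, 1, 5], [4, 1, 6, 4]],
    index=['red', 'blue', 'yellow', 'white'],
    columns=['ball', 'pen', 'pencil', 'paper'],
)
ser = pd.Series([0, 1, 2, 3, 9], index=['red', 'blue', 'yellow', 'white', 'green'])

with_series = frame2.corrwith(ser)
with_frame = frame2.corrwith(frame1)

# Keep only the labels that frame2 also has ('green' is dropped),
# then correlate a single column by hand.
aligned = ser.reindex(frame2.index)
manual_ball = frame2['ball'].corr(aligned)

print(manual_ball, with_series['ball'], with_frame['ball'])
```

All three printed values should match. With data that is not linearly related in this way, the series and dataframe results would generally differ.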
Here I am ending today’s post. Until we meet again, keep practicing and learning Python, as Python is easy to learn!