Tuesday, April 30, 2019

Pandas - 26 (Permutation and Random Sampling)

Permutation is the random reordering of a series or of the rows of a dataframe. This operation is easy to perform using the numpy.random.permutation() function. Let's see an example of how a permutation is performed. See the following program:

import pandas as pd
import numpy as np

mydataframe = pd.DataFrame(np.arange(25).reshape(5,5))

print('\nThe original dataframe\n')
print(mydataframe)

new_order = np.random.permutation(5)
print('\nThe new order in which to set the values of a row of the dataframe.\n')
print(new_order) 

print('\nApply new order on all lines of the dataframe\n')

print(mydataframe.take(new_order)) 

print('\nSubmitting a portion of the entire dataframe to a permutation\n')
new_order = [3,4,2]

print(mydataframe.take(new_order)) 


The output of the program is shown below:

The original dataframe

    0   1   2   3   4
0   0   1   2   3   4
1   5   6   7   8   9
2  10  11  12  13  14
3  15  16  17  18  19
4  20  21  22  23  24

The new order in which to set the values of a row of the dataframe.

[3 4 0 2 1]

Apply new order on all lines of the dataframe

    0   1   2   3   4
3  15  16  17  18  19
4  20  21  22  23  24
0   0   1   2   3   4
2  10  11  12  13  14
1   5   6   7   8   9

Submitting a portion of the entire dataframe to a permutation

    0   1   2   3   4
3  15  16  17  18  19
4  20  21  22  23  24
2  10  11  12  13  14
------------------
(program exited with code: 0)

Press any key to continue . . .


If we have a huge dataframe, we might need to sample it randomly, and the quickest way to do this is by using the np.random.randint() function. The following program performs random sampling:

import pandas as pd
import numpy as np

mydataframe = pd.DataFrame(np.arange(25).reshape(5,5))

print('\nThe original dataframe\n')
print(mydataframe)

mysample = np.random.randint(0, len(mydataframe), size=3)

print('\nThe created sample \n')
print(mysample)


print('\nRandom sample\n')
print(mydataframe.take(mysample))


The output of the program is shown below:


The original dataframe

    0   1   2   3   4
0   0   1   2   3   4
1   5   6   7   8   9
2  10  11  12  13  14
3  15  16  17  18  19
4  20  21  22  23  24

The created sample

[2 2 1]

Random sample

    0   1   2   3   4
2  10  11  12  13  14
2  10  11  12  13  14
1   5   6   7   8   9
------------------
(program exited with code: 0)

Press any key to continue . . .
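As an aside, pandas also offers the DataFrame.sample() method for this task; a small sketch, with replace=True so that, like randint(), the same row may be drawn more than once:

```python
import pandas as pd
import numpy as np

mydataframe = pd.DataFrame(np.arange(25).reshape(5,5))

# sample() draws rows directly; replace=True allows the same row
# to be drawn more than once, as np.random.randint() can
mysample = mydataframe.sample(n=3, replace=True)
print(mysample)
```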


Here I am ending today’s post. Until we meet again keep practicing and learning Python, as Python is easy to learn!

Monday, April 29, 2019

Pandas - 25 (Discretization and Binning)

Discretization is a transformation process used in experimental settings to handle large quantities of data generated in sequence. To carry out an analysis of the data, it is often necessary to transform the values into discrete categories, for example:

1. By dividing the range of values of the readings into smaller intervals and counting the occurrences or statistics within them.

2. When we have a huge number of samples due to precise readings on a population; even here, to facilitate analysis of the data, it is necessary to divide the range of values into categories and then analyze the occurrences and statistics related to each.

Our following program will explain how Discretization and Binning is performed:

import pandas as pd
import numpy as np

#List containing experimental values
results = [12,34,67,55,28,90,99,12,3,56,74,44,87,23,49,89,87]

#array containing the values of separation of bin
bins = [0,25,50,75,100]
#binning
cat = pd.cut(results, bins)

print('Bins\n')
print(cat)
print('\nCategories array indicating the names of the different internal categories\n')
print(cat.categories)
print('\nCodes array containing a list of numbers equal to the elements of results list\n')
print(cat.codes)
print('\nOccurrences for each bin\n')
print(pd.value_counts(cat))



Our program deals with data which is a reading of an experimental value between 0 and 100. These data are collected in a list named results. We divide the interval of data into four equal parts which we call bins. The first contains the values between 0 and 25, the second between 26 and 50, the third between 51 and 75, and the last between 76 and 100.

To do this binning with pandas, we define an array containing the values of separation of the bins (0, 25, 50, 75, 100). Then we apply the cut() function to the results list, also passing in the bins. The object returned by the cut() function is a special object of Categorical type, which we can consider as an array of strings indicating the name of the bin. Internally it contains a categories array indicating the names of the different internal categories, and a codes array containing one number per element of results (i.e., the array subjected to binning); each number indicates the bin to which the corresponding element of results is assigned.

Finally to know the occurrences for each bin, that is, how many results fall into each category, we use the value_counts() function.  The output of the program is shown below:

Bins

[(0, 25], (25, 50], (50, 75], (50, 75], (25, 50], ..., (75, 100], (0, 25], (25, 50], (75, 100], (75, 100]]
Length: 17
Categories (4, interval[int64]): [(0, 25] < (25, 50] < (50, 75] < (75, 100]]

Categories array indicating the names of the different internal categories

IntervalIndex([(0, 25], (25, 50], (50, 75], (75, 100]],
              closed='right',
              dtype='interval[int64]')

Codes array containing a list of numbers equal to the elements of results list

[0 1 2 2 1 3 3 0 0 2 2 1 3 0 1 3 3]

Occurrences for each bin

(75, 100]    5
(50, 75]     4
(25, 50]     4
(0, 25]      4
dtype: int64
------------------
(program exited with code: 0)

Press any key to continue . . .


From the output we can see that each class has its lower limit marked with a parenthesis and its upper limit with a bracket. This notation is consistent with the mathematical notation used to indicate intervals: if the delimiter is a square bracket, the endpoint belongs to the range (closed limit), and if it is a parenthesis, it does not (open limit).
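As an aside, the side on which each interval is closed can be changed with the right option of cut(); a small sketch with right=False, which closes the lower limit instead:

```python
import pandas as pd

results = [12,34,67,55,28,90,99,12,3,56,74,44,87,23,49,89,87]
bins = [0,25,50,75,100]

# right=False closes the lower limit instead: each bin becomes [a, b)
cat_left = pd.cut(results, bins, right=False)
print(cat_left.categories)
```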

We can give names to the various bins by defining them first in an array of strings and then passing that array to the labels option of the cut() function we use to create the Categorical object. If cut() is passed an integer instead of explicit bin edges, it divides the range of values of the array into that many equal-width intervals.

The limits of the intervals are then taken from the minimum and maximum of the sample data, that is, of the array subjected to binning, in our case 'results'. See the following program:

import pandas as pd
import numpy as np

#List containing experimental values
results = [12,34,67,55,28,90,99,12,3,56,74,44,87,23,49,89,87]

#array containing the values of separation of bin
bins = [0,25,50,75,100]

bin_names = ['unlikely','less likely','likely','highly likely']
#binning
cat = pd.cut(results, bins,labels=bin_names)

print('Bins\n')
print(cat)

print('\nDividing the range of values of the array in 5 intervals\n')
print(pd.cut(results, 5))


The output of the program is shown below:

Bins

[unlikely, less likely, likely, likely, less likely, ..., highly likely, unlikely, less likely, highly likely, highly likely]
Length: 17
Categories (4, object): [unlikely < less likely < likely < highly likely]

Dividing the range of values of the array in 5 intervals

[(2.904, 22.2], (22.2, 41.4], (60.6, 79.8], (41.4, 60.6], (22.2, 41.4], ..., (79.8, 99.0], (22.2, 41.4], (41.4, 60.6], (79.8, 99.0], (79.8, 99.0]]
Length: 17
Categories (5, interval[float64]): [(2.904, 22.2] < (22.2, 41.4] < (41.4, 60.6] < (60.6, 79.8] < (79.8, 99.0]]
------------------
(program exited with code: 0)

Press any key to continue . . .
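As an aside, when cut() computes the edges itself we can recover them by passing retbins=True, which returns the bin boundaries along with the Categorical object. A small sketch:

```python
import pandas as pd

results = [12,34,67,55,28,90,99,12,3,56,74,44,87,23,49,89,87]

# retbins=True also returns the array of computed edges; the lowest
# edge is nudged slightly below the minimum so that it is included
cat, edges = pd.cut(results, 5, retbins=True)
print(edges)
```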


Apart from cut(), pandas provides another binning method, qcut(), which divides the sample into quantiles (here, quintiles, since we ask for five groups). With cut(), the number of occurrences can differ from bin to bin. qcut() instead ensures that the number of occurrences for each bin is as equal as possible, at the cost of bin edges that vary. Let's use this method in our previous program:

import pandas as pd
import numpy as np

#List containing experimental values
results = [12,34,67,55,28,90,99,12,3,56,74,44,87,23,49,89,87]

#array containing the values of separation of bin
bins = [0,25,50,75,100]

bin_names = ['unlikely','less likely','likely','highly likely']
#binning
cat = pd.cut(results, bins,labels=bin_names)

print('Bins\n')
print(cat)

print('\ndivide the range of values of the array in 5 intervals\n')
print(pd.cut(results, 5))

quintiles = pd.qcut(results, 5)

print('\nUsing qcut() dividing the range of values of the array in 5 intervals\n')
print(quintiles)

print('\nOccurrences for each bin\n')
print(pd.value_counts(quintiles))


The output of the program is shown below:

Bins

[unlikely, less likely, likely, likely, less likely, ..., highly likely, unlikely, less likely, highly likely, highly likely]
Length: 17
Categories (4, object): [unlikely < less likely < likely < highly likely]

divide the range of values of the array in 5 intervals

[(2.904, 22.2], (22.2, 41.4], (60.6, 79.8], (41.4, 60.6], (22.2, 41.4], ..., (79.8, 99.0], (22.2, 41.4], (41.4, 60.6], (79.8, 99.0], (79.8, 99.0]]
Length: 17
Categories (5, interval[float64]): [(2.904, 22.2] < (22.2, 41.4] < (41.4, 60.6] < (60.6, 79.8] < (79.8, 99.0]]

Using qcut() dividing the range of values of the array in 5 intervals

[(2.999, 24.0], (24.0, 46.0], (62.6, 87.0], (46.0, 62.6], (24.0, 46.0], ..., (62.6, 87.0], (2.999, 24.0], (46.0, 62.6], (87.0, 99.0], (62.6, 87.0]]
Length: 17
Categories (5, interval[float64]): [(2.999, 24.0] < (24.0, 46.0] < (46.0, 62.6] < (62.6, 87.0] < (87.0, 99.0]]

Occurrences for each bin

(62.6, 87.0]     4
(2.999, 24.0]    4
(87.0, 99.0]     3
(46.0, 62.6]     3
(24.0, 46.0]     3
dtype: int64
------------------
(program exited with code: 0)

Press any key to continue . . .


We can see from the output that with qcut() the intervals bounding the bins differ from those generated by the cut() function. Moreover, qcut() tries to equalize the occurrences per bin, but in this case two of the bins have one occurrence more because the number of results is not divisible by five.
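As an aside not shown above, qcut() also accepts an explicit list of quantiles instead of an integer; a small sketch isolating the bottom and top deciles:

```python
import pandas as pd

results = [12,34,67,55,28,90,99,12,3,56,74,44,87,23,49,89,87]

# cut the sample at the 0th, 10th, 50th, 90th and 100th percentiles
deciles = pd.qcut(results, [0, 0.1, 0.5, 0.9, 1.0])
print(deciles.categories)
```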

When there are abnormal values in a data structure, we often need to detect their presence during the data analysis process. See the following program, in which we consider as outliers the values whose absolute value is greater than three times the standard deviation of their column:

import pandas as pd
import numpy as np

mydataframe = pd.DataFrame(np.random.randn(1000,3))

print('\nThe original dataframe\n')
print(mydataframe)

print('\nThe statistics for each column\n')
print(mydataframe.describe()) 

print('\nThe standard deviation of each column of the dataframe\n')

print(mydataframe.std()) 

print('\nThe filtering of all the values of the dataframe\n')

print(mydataframe[(np.abs(mydataframe) > (3*mydataframe.std())).any(axis=1)])


First we created a dataframe of 1,000 rows and three columns of completely random values. Then with the describe() function we viewed the statistics for each column and checked the standard deviation of each column of the dataframe using the std() function. Finally we filtered the dataframe, comparing each column against its corresponding standard deviation and using the any() function with axis=1 to keep every row in which at least one column exceeds the threshold.

The output of the program is shown below:

The original dataframe

            0         1         2
0   -0.152871 -0.763588  0.550351
1   -0.329748  0.202107 -0.555331
2   -0.517345  1.790445 -1.158925
3   -0.972289 -0.312655  0.620838
4    1.450099 -0.507097 -2.944250
5    0.535946  0.007717 -0.390331
6   -0.487387 -0.471448  0.177973
7   -0.526594  0.467879  0.340540
8   -0.534051 -1.004680  0.544254
9    0.927932  1.201972 -0.804130
10   0.490467 -0.524667 -0.699485
11  -0.458283 -0.549288  0.299852
12   0.462777 -0.568852 -0.925806
13   0.426844 -1.712511 -0.780843
14  -0.663651  1.311056  0.979108
15   0.294022 -0.797623  0.730315
16   1.274876 -0.000637  0.286369
17   1.315956  0.067872  0.773538
18   0.106650 -0.511677 -0.437176
19  -0.627332  0.193505 -0.049096
20   0.181071  0.477801 -1.509857
21   0.468760 -1.005808  0.328267
22   1.568992  2.211600 -1.403844
23   0.177481  0.826748  0.310399
24   0.789889 -0.663966  0.556157
25   1.664440  0.468722  1.284905
26  -1.104418 -1.266112  2.053315
27   0.037905  1.034867 -0.992572
28  -2.607207  0.362349 -1.825882
29   0.390756  1.633788 -0.370098
..        ...       ...       ...
970 -0.431666  0.161989  1.098937
971 -0.020122  0.551296  1.081225
972 -0.505658 -0.298048  0.023238
973  0.138252 -1.028921 -0.124180
974 -1.064977 -0.000879  0.156231
975  1.347509 -0.021861  1.280861
976 -0.225524 -0.583704  0.005301
977 -0.263022 -2.116113 -1.257308
978 -1.019497 -0.244579 -1.429471
979 -1.283034  1.166787  0.713066
980  0.979168  0.057361  0.397983
981 -0.555054  0.496199 -0.658068
982  0.051657  0.196189 -0.083374
983 -1.578053 -0.229885 -0.413917
984  0.990382 -1.547720 -1.001030
985 -1.073932  0.470117 -1.726342
986 -0.013742 -0.784292 -0.686692
987  0.915873  1.920051  1.674205
988 -0.672419 -0.606013  0.129781
989  1.516790  0.578385 -0.540154
990  0.597486 -0.177357 -0.012550
991  0.216080 -1.731623  1.315886
992  0.776450 -2.359688 -1.205302
993 -0.094598 -0.211266 -0.752690
994 -0.335907 -0.634471 -1.062571
995 -0.004971 -1.916150  0.566218
996  0.585543 -0.212457  0.366224
997  0.167019 -1.194672  0.774392
998 -0.831502  0.307548 -2.015205
999  0.847513 -0.921022  0.425666

[1000 rows x 3 columns]

The statistics for each column

                 0            1            2
count  1000.000000  1000.000000  1000.000000
mean      0.038385    -0.038058     0.012744
std       1.002982     1.007076     1.015312
min      -2.854868    -3.133775    -3.002108
25%      -0.659852    -0.716565    -0.653306
50%       0.054970    -0.019524     0.021691
75%       0.740922     0.629396     0.661967
max       3.224009     3.179825     3.793527

The standard deviation of each column of the dataframe

0    1.002982
1    1.007076
2    1.015312
dtype: float64

The filtering of all the values of the dataframe

            0         1         2
160  3.224009 -1.695575 -0.720423
230 -0.590097  3.179825 -0.789051
334 -1.347662 -3.133775  0.727448
367 -0.522820 -3.059741 -1.905784
374 -0.905473 -0.220910  3.793527
------------------
(program exited with code: 0)

Press any key to continue . . .


Here I am ending today’s post. Until we meet again keep practicing and learning Python, as Python is easy to learn!

Sunday, April 28, 2019

Pandas - 24 (Data Transformation)

By now we have a decent background on how to prepare data for analysis and are ready to begin the second stage of data manipulation: data transformation. After arranging the form of the data and their disposition within the data structure, it is important to transform their values. We will see some common issues and the steps required to overcome them using the functions of the pandas library.

Some of these operations involve the presence of duplicate or invalid values, with their possible removal or replacement. Other operations involve modifying the indexes instead. Further steps include handling and processing the numerical values of the data and of the strings. Let's start with:

1. Removing Duplicates

Duplicate rows might be present in a dataframe for various reasons. In dataframes of enormous size, the detection of these rows can be very problematic. In this case, pandas provides a series of tools to analyze the duplicate data present in large data structures. See the following program:

import pandas as pd
import numpy as np

mydataframe = pd.DataFrame({ 'color': ['white','white','red','red','white'],
                                    'value': [2,1,3,3,2]})
print('The dataframe \n')
print(mydataframe)


print('\nDetect the duplicate rows \n')
print(mydataframe.duplicated())


print('\nKnow which are the duplicate rows\n')
print(mydataframe[mydataframe.duplicated()]) 


In the above program we first create a simple dataframe with some duplicate rows. Then the duplicated() function is applied to the dataframe to detect the rows that appear to be duplicated. It returns a series of Booleans where each element corresponds to a row, with True if the row is a duplicate of an earlier one (i.e., only the later occurrences, not the first) and False otherwise.

Finally we check which are the duplicate rows in our dataframe. The output of the program is shown below:

The dataframe

   color  value
0  white      2
1  white      1
2    red      3
3    red      3
4  white      2

Detect the duplicate rows

0    False
1    False
2    False
3     True
4     True
dtype: bool

Know which are the duplicate rows

   color  value
3    red      3
4  white      2

-----------------
(program exited with code: 0)

Press any key to continue . . .
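As an aside not covered above, duplicated() also accepts a keep option: keep='last' marks every occurrence except the last, while keep=False marks all occurrences of a duplicated row, which is handy for inspecting every row involved in a duplication. A small sketch:

```python
import pandas as pd

mydataframe = pd.DataFrame({ 'color': ['white','white','red','red','white'],
                             'value': [2,1,3,3,2]})

# keep=False flags every member of a duplicated group,
# not only the repeats that follow the first occurrence
print(mydataframe[mydataframe.duplicated(keep=False)])
```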


pandas provides the drop_duplicates() function, which returns the dataframe without the duplicate rows, as shown in the following program:

import pandas as pd
import numpy as np

mydataframe = pd.DataFrame({ 'color': ['white','white','red','red','white'],
                                    'value': [2,1,3,3,2]})
print('The dataframe \n')
print(mydataframe)

print('\nThe duplicate rows\n')
print(mydataframe[mydataframe.duplicated()])

print('\nThe dataframe after deleting the duplicate rows \n')
print(mydataframe.drop_duplicates())


The drop_duplicates() function deletes the later occurrences of the duplicated rows. The output of the program is shown below:

The dataframe

   color  value
0  white      2
1  white      1
2    red      3
3    red      3
4  white      2

The duplicate rows

   color  value
3    red      3
4  white      2

The dataframe after deleting the duplicate rows

   color  value
0  white      2
1  white      1
2    red      3
------------------
(program exited with code: 0)

Press any key to continue . . .
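drop_duplicates() accepts the same keep option, plus a subset option to restrict the comparison to certain columns (the choice of the color column below is only an illustration). A small sketch:

```python
import pandas as pd

mydataframe = pd.DataFrame({ 'color': ['white','white','red','red','white'],
                             'value': [2,1,3,3,2]})

# treat two rows as duplicates when they share the same color,
# ignoring value, and keep the last occurrence of each color
print(mydataframe.drop_duplicates(subset=['color'], keep='last'))
```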


2. Mapping

Mapping is the creation of a list of matches between two different sets of values, with the ability to bind a value to a particular label or string. The pandas library provides a set of functions which exploit mapping to perform certain operations. A dict object is used to define the mapping, as shown below:

mymapping =  {
              'label1' : 'value1',
              'label2' : 'value2',
              ...
              }

The pandas library provides the following set of specific functions:

• replace()—Replaces values
• map()—Creates a new column
• rename()—Replaces the index values

Let's see how to use these functions:

Replacing Values via Mapping

Sometimes in our data structure there are values that do not meet our needs. For example, the text may be in a foreign language, may be a synonym of another value, or may not be expressed in the desired form. In such cases, a replace operation on the various values is often a necessary process.

See the following program:

import pandas as pd
import numpy as np

mydataframe = pd.DataFrame({ 'item':['ball','mug','pen','pencil','ashtray'],
                                    'color':['white','lal','hara','black','yellow'],
                                    'price':[5.56,4.20,1.30,0.56,2.75]})
print('The dataframe \n')
print(mydataframe)

newcolors = {'lal': 'red','hara': 'green'}

print('\nThe dataframe after replacing the values \n')
print(mydataframe.replace(newcolors))


In the above program we have created a dataframe containing various objects and colors, including two colors, lal and hara, that are not in English. To replace the incorrect values with new ones, we define a mapping of correspondences, containing the old values as keys and their English equivalents as values. Finally we use the replace() function with the mapping as an argument and get a dataframe with the correct values.

The output of the program is shown below:

The dataframe

      item   color  price
0     ball   white   5.56
1      mug     lal   4.20
2      pen    hara   1.30
3   pencil   black   0.56
4  ashtray  yellow   2.75

The dataframe after replacing the values

      item   color  price
0     ball   white   5.56
1      mug     red   4.20
2      pen   green   1.30
3   pencil   black   0.56
4  ashtray  yellow   2.75


------------------
(program exited with code: 0)

Press any key to continue . . .


A common use of the replace() is in the replacement of NaN values with another value, for example 0. See the following program:

import pandas as pd
import numpy as np

myseries = pd.Series([1,3,np.nan,4,6,np.nan,3])

print('\nOriginal Series\n')
print(myseries)

print('\nModified Series\n')
print(myseries.replace(np.nan,0))


In the above program we have replaced all the occurrences of NaN with 0. The output of the program is shown below:

Original Series

0    1.0
1    3.0
2    NaN
3    4.0
4    6.0
5    NaN
6    3.0
dtype: float64

Modified Series

0    1.0
1    3.0
2    0.0
3    4.0
4    6.0
5    0.0
6    3.0
dtype: float64


------------------
(program exited with code: 0)

Press any key to continue . . .
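For the specific case of NaN, pandas also provides the dedicated fillna() function, which expresses the same intent a little more directly than replace(). A small sketch:

```python
import pandas as pd
import numpy as np

myseries = pd.Series([1,3,np.nan,4,6,np.nan,3])

# fillna() touches only the missing values, leaving the rest as-is
filled = myseries.fillna(0)
print(filled)
```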

Adding Values via Mapping

We also sometimes need to add values to a column depending on the values contained in another. This is accomplished using the map() function, as shown in the following program:

import pandas as pd
import numpy as np

mydataframe = pd.DataFrame({ 'item':['ball','mug','pen','pencil','ashtray'],
                             'color':['white','red','green','black','yellow']})

print('The dataframe \n')
print(mydataframe)

prices = { 'ball' : 5.56, 
           'mug' : 4.20,
           'bottle' : 1.30,
           'scissors' : 3.41,
           'pen' : 1.30,
           'pencil' : 0.56,
           'ashtray' : 2.75
           }

mydataframe['price'] = mydataframe['item'].map(prices)

print('\nThe dataframe with prices added\n')
print(mydataframe) 


In the above program we first create a dataframe with two columns. We then want to add a column indicating the price of each item in the dataframe. The prices are defined for each item in the prices dictionary. We apply the mapping of the prices to the item column using map(), assigning the result to a new price column of the dataframe. The map() function, applied to a series or to a column of a dataframe, accepts either a function or a dict object containing the mapping.

The output of the program is shown below:

The dataframe

      item   color
0     ball   white
1      mug     red
2      pen   green
3   pencil   black
4  ashtray  yellow

The dataframe with prices added

      item   color  price
0     ball   white   5.56
1      mug     red   4.20
2      pen   green   1.30
3   pencil   black   0.56
4  ashtray  yellow   2.75
------------------
(program exited with code: 0)

Press any key to continue . . . .
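Since map() accepts a function as well as a dict, the same mechanism can compute a new column; a small sketch that flags the more expensive items (the 2.0 threshold is only an illustration):

```python
import pandas as pd

mydataframe = pd.DataFrame({ 'item':['ball','mug','pen','pencil','ashtray'],
                             'price':[5.56,4.20,1.30,0.56,2.75]})

# map() calls the function once per element of the column
mydataframe['expensive'] = mydataframe['price'].map(lambda p: p >= 2.0)
print(mydataframe)
```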


Rename the Indexes of the Axes

Just like the values contained within a series or a dataframe, the axis labels can be transformed in a very similar way using mapping. To replace the label indexes, pandas provides the rename() function, which takes the mapping as an argument, that is, a dict object. Let's replace the label indexes in our previous program so that instead of 0 1 2 3 4 they display as first, second, third, fourth, and fifth. Add the following code to the previous program:

reindex = {0: 'first',1: 'second',2: 'third',3: 'fourth',4: 'fifth'}

print('\nThe dataframe with label renamed\n')
print(mydataframe.rename(reindex))  

The output of the program is shown below:

The dataframe with label renamed

           item   color  price
first      ball   white   5.56
second      mug     red   4.20
third       pen   green   1.30
fourth   pencil   black   0.56
fifth   ashtray  yellow   2.75
------------------
(program exited with code: 0)

Press any key to continue . . .

As we can see, the indexes are renamed. If we want to rename the columns we must use the columns option. In this case we pass the two mappings explicitly to the index and columns options, as shown below:

recolumn = {'item':'object','price': 'value'}

print('\nThe dataframe with column renamed\n')
print(mydataframe.rename(index=reindex, columns=recolumn))


Add this code to the previous program and see the output which should be:

The dataframe with column renamed

         object   color  value
first      ball   white   5.56
second      mug     red   4.20
third       pen   green   1.30
fourth   pencil   black   0.56
fifth   ashtray  yellow   2.75
------------------
(program exited with code: 0)

Press any key to continue . . .


It is also possible to rename single values, as shown in the following program:

import pandas as pd
import numpy as np

mydataframe = pd.DataFrame({ 'item':['ball','mug','pen','pencil','ashtray'],
                             'color':['white','red','green','black','yellow']})
prices = { 'ball' : 5.56, 
           'mug' : 4.20,
           'bottle' : 1.30,
           'scissors' : 3.41,
           'pen' : 1.30,
           'pencil' : 0.56,
           'ashtray' : 2.75
           }

mydataframe['price'] = mydataframe['item'].map(prices)

print('\nThe original dataframe\n')
print(mydataframe) 

print('\nThe dataframe with label renamed\n')
print(mydataframe.rename(index={1:'first'}, columns={'item':'object'})) 


The output of the program is shown below:


The original dataframe

      item   color  price
0     ball   white   5.56
1      mug     red   4.20
2      pen   green   1.30
3   pencil   black   0.56
4  ashtray  yellow   2.75

The dataframe with label renamed

        object   color  price
0         ball   white   5.56
first      mug     red   4.20
2          pen   green   1.30
3       pencil   black   0.56
4      ashtray  yellow   2.75

The original dataframe

      item   color  price
0     ball   white   5.56
1      mug     red   4.20
2      pen   green   1.30
3   pencil   black   0.56
4  ashtray  yellow   2.75
------------------
(program exited with code: 0)

Press any key to continue . . .


As was desired, the index 1 is renamed to first. As you have seen, the rename() function returns a dataframe with the changes, leaving the original dataframe unchanged. If we want the changes to take effect on the object on which we call the function, we set the inplace option to True. Add the following code to the previous program:

mydataframe.rename(columns={'item':'object'}, inplace=True)

print('\nThe dataframe with change in item object\n')
print(mydataframe)


The output of the program is shown below:

The original dataframe

      item   color  price
0     ball   white   5.56
1      mug     red   4.20
2      pen   green   1.30
3   pencil   black   0.56
4  ashtray  yellow   2.75

The dataframe with label renamed

        object   color  price
0         ball   white   5.56
first      mug     red   4.20
2          pen   green   1.30
3       pencil   black   0.56
4      ashtray  yellow   2.75

The original dataframe

      item   color  price
0     ball   white   5.56
1      mug     red   4.20
2      pen   green   1.30
3   pencil   black   0.56
4  ashtray  yellow   2.75

The dataframe with change in item object

     object   color  price
0     ball   white   5.56
1      mug     red   4.20
2      pen   green   1.30
3   pencil   black   0.56
4  ashtray  yellow   2.75
------------------
(program exited with code: 0)

Press any key to continue . . .


If you print the original data frame again what will be the output?

Here I am ending today’s post. Until we meet again keep practicing and learning Python, as Python is easy to learn!

Friday, April 26, 2019

Pandas - 23 (Pivoting)

Pivoting is a common operation, just like data assembling. In the context of pivoting, we have two basic operations:

• Stacking—Rotates or pivots the data structure converting columns to rows
• Unstacking—Converts rows into columns

Using the stack() function on the dataframe, we get a pivoting of the columns into rows, producing a series. From this hierarchically indexed series we can reassemble the dataframe into a pivoted table with the unstack() function. We can also unstack on a different level, specifying the number of the level or its name as the argument of the function. The following program shows stacking and unstacking:

import pandas as pd
import numpy as np

frame1 = pd.DataFrame(np.arange(9).reshape(3,3),index=['white','black','red'],columns=['ball','pen','pencil'])

print('frame 1\n')
print(frame1)

ser5 = frame1.stack()
print('\nHierarchically indexed Series\n')
print(ser5)

print('\nReassemble the dataframe into a pivoted table\n')
print(ser5.unstack())

print('\nUnstack on a different level\n')
print(ser5.unstack(0))


The output of the program is shown below:

frame 1

       ball  pen  pencil
white     0    1       2
black     3    4       5
red       6    7       8

Hierarchically indexed Series

white  ball      0
       pen       1
       pencil    2
black  ball      3
       pen       4
       pencil    5
red    ball      6
       pen       7
       pencil    8
dtype: int32

Reassemble the dataframe into a pivoted table

       ball  pen  pencil
white     0    1       2
black     3    4       5
red       6    7       8

Unstack on a different level

        white  black  red
ball        0      3    6
pen         1      4    7
pencil      2      5    8
------------------
(program exited with code: 0)

Press any key to continue . . .


The most common way to store datasets is in CSV format or in a table of a database. This happens especially when you have instrumental readings, calculation results iterated over time, or the simple manual input of a series of values. A similar case is, for example, a log file, which is filled line by line as data accumulates in it.

The peculiar characteristic of this type of dataset is that its entries, spread over various columns, are often duplicated in subsequent lines. Remaining in a tabular format, data arranged in this way is referred to as being in long or stacked format. To get a clearer idea, consider the following program:

import pandas as pd
import numpy as np

frame2 = pd.DataFrame({ 'color':['white','white','white','red','red','red','black','black','black'],
                                    'item':['ball','pen','mug','ball','pen','mug','ball','pen','mug'],
                                    'value': np.random.rand(9)})
print('Long frame \n')
print(frame2)


The output of the program is shown below:

Long frame

   color  item     value
0  white  ball  0.460043
1  white   pen  0.929175
2  white   mug  0.069400
3    red  ball  0.148175
4    red   pen  0.798596
5    red   mug  0.771690
6  black  ball  0.820520
7  black   pen  0.735380
8  black   mug  0.557580
------------------
(program exited with code: 0)

Press any key to continue . . .


The output shows that this mode of data recording has some disadvantages. One, for example, is the multiplicity and repetition of some fields. Considering the columns as keys, data in this format can be difficult to read, especially when it comes to fully understanding the relationships between the key values and the rest of the columns.

Instead of the long format, there is another way to arrange the data in a table, called wide. This mode is easier to read, allows easy connection with other tables, and occupies much less space. In general it is a more efficient way of storing the data, although less practical, especially during data entry.

As a criterion, select a column, or a set of them, as the primary key; the values contained in it must then be unique.

In this regard, pandas gives you a function to transform a dataframe from the long type to the wide type. This function is pivot(), and its index and columns arguments take the column, or columns, that will assume the role of key.

Starting from the previous example, we choose to create a dataframe in wide format by choosing the color column as the key, and item as a second key, the values of which will form the new columns of the dataframe. See the following program:

import pandas as pd
import numpy as np

frame2 = pd.DataFrame({'color': ['white','white','white','red','red','red','black','black','black'],
                       'item': ['ball','pen','mug','ball','pen','mug','ball','pen','mug'],
                       'value': np.random.rand(9)})
print('Long frame \n')
print(frame2)

wideframe = frame2.pivot(index='color', columns='item')
print('\nWide frame \n')
print(wideframe)


The output of the program is shown below:

Long frame

   color  item     value
0  white  ball  0.475684
1  white   pen  0.147188
2  white   mug  0.426831
3    red  ball  0.400304
4    red   pen  0.203074
5    red   mug  0.739819
6  black  ball  0.824657
7  black   pen  0.889884
8  black   mug  0.244726

Wide frame

          value
item       ball       mug       pen
color
black  0.824657  0.244726  0.889884
red    0.400304  0.739819  0.203074
white  0.475684  0.426831  0.147188
------------------
(program exited with code: 0)

Press any key to continue . . . 


The output shows that in this format the dataframe is much more compact and the data it contains much more readable.
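If we ever need to go back from the wide format to the long one, pandas offers the stack() method, which is essentially the inverse of pivot(): it moves the column labels into a level of the row index. Here is a minimal sketch with a small hand-made wide frame (the values are arbitrary):

```python
import pandas as pd

# a small wide-format frame: colors as rows, items as columns
wide = pd.DataFrame([[0.1, 0.2], [0.3, 0.4]],
                    index=pd.Index(['black', 'white'], name='color'),
                    columns=pd.Index(['ball', 'pen'], name='item'))

# stack() moves the column labels into a second index level,
# giving back a long-format series with a (color, item) MultiIndex
long_series = wide.stack()
print(long_series)
```

Each (color, item) pair again labels a single value, just as in the long frame we started from.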

The last stage of data preparation is the removal of columns and rows. Let's see this through our next program:

import pandas as pd
import numpy as np

frame3 = pd.DataFrame(np.arange(9).reshape(3,3),
                      index=['white','black','red'],
                      columns=['ball','pen','pencil'])
print('frame \n')
print(frame3)

del frame3['ball']

print('\nModified frame by removing column \n')
print(frame3)


print('\nModified frame by removing row \n')
print(frame3.drop('white'))


In order to remove a column, we simply use the del command applied to the dataframe with the column name specified. To remove an unwanted row, we have to use the drop() function with the label of the corresponding index as an argument. The output of the program is shown below:

frame

       ball  pen  pencil
white     0    1       2
black     3    4       5
red       6    7       8

Modified frame by removing column

       pen  pencil
white    1       2
black    4       5
red      7       8

Modified frame by removing row

       pen  pencil
black    4       5
red      7       8
------------------
(program exited with code: 0)

Press any key to continue . . .
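As an aside, columns can also be removed with drop() by passing axis=1; unlike del, it returns a new dataframe and leaves the original untouched. A minimal sketch:

```python
import pandas as pd
import numpy as np

frame3 = pd.DataFrame(np.arange(9).reshape(3, 3),
                      index=['white', 'black', 'red'],
                      columns=['ball', 'pen', 'pencil'])

# drop() with axis=1 removes a column and returns a new dataframe;
# frame3 itself is left unchanged
reduced = frame3.drop('ball', axis=1)
print(reduced)
```

This is handy when we want to keep the full dataframe around while working on a trimmed copy.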


Here I am ending today’s post. Until we meet again keep practicing and learning Python, as Python is easy to learn!
Share:

Thursday, April 25, 2019

Pandas - 22 (Concatenating)

Concatenation is another type of data combination and NumPy provides a concatenate() function to do this kind of operation with arrays. See the following program :

import pandas as pd
import numpy as np

array1 = np.arange(9).reshape((3,3))
print('Array 1\n')
print(array1)

array2 = np.arange(9).reshape((3,3))+6
print('\nArray 2\n')
print(array2)

print('\nConcatenated array axis=1\n')
print(np.concatenate([array1,array2],axis=1))

print('\nConcatenated array axis=0\n')
print(np.concatenate([array1,array2],axis=0))



The output of the program is shown below:

Array 1

[[0 1 2]
 [3 4 5]
 [6 7 8]]

Array 2

[[ 6  7  8]
 [ 9 10 11]
 [12 13 14]]

Concatenated array axis=1

[[ 0  1  2  6  7  8]
 [ 3  4  5  9 10 11]
 [ 6  7  8 12 13 14]]

Concatenated array axis=0

[[ 0  1  2]
 [ 3  4  5]
 [ 6  7  8]
 [ 6  7  8]
 [ 9 10 11]
 [12 13 14]]
------------------
(program exited with code: 0)

Press any key to continue . . .


The pandas library and its data structures, series and dataframe, have labeled axes, which lets you further generalize the concatenation of arrays. pandas provides the concat() function for this kind of operation. See the following program:

import pandas as pd
import numpy as np

ser1 = pd.Series(np.random.rand(4), index=[1,2,3,4])
print('Series 1\n')
print(ser1)

ser2 = pd.Series(np.random.rand(4), index=[5,6,7,8])
print('\nSeries 2\n')
print(ser2)

print('\nConcatenated series axis=1\n')
print(pd.concat([ser1,ser2], axis=1))

print('\nConcatenated series axis=0\n')
print(pd.concat([ser1,ser2]))


By default, the concat() function works on axis = 0, having as a returned object a series. If you set the axis = 1, then the result will be a dataframe. The output of the program is shown below:

Series 1

1    0.936029
2    0.194529
3    0.448288
4    0.952875
dtype: float64

Series 2

5    0.392544
6    0.978594
7    0.453258
8    0.661619
dtype: float64

Concatenated series axis=1

          0         1
1  0.936029       NaN
2  0.194529       NaN
3  0.448288       NaN
4  0.952875       NaN
5       NaN  0.392544
6       NaN  0.978594
7       NaN  0.453258
8       NaN  0.661619

Concatenated series axis=0

1    0.936029
2    0.194529
3    0.448288
4    0.952875
5    0.392544
6    0.978594
7    0.453258
8    0.661619
dtype: float64
------------------
(program exited with code: 0)

Press any key to continue . . .
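The NaN values above come from the fact that concat() performs an outer join by default, keeping the union of the index labels. When the indexes partially overlap, passing join='inner' keeps only the labels common to all the objects. A short sketch with two made-up series:

```python
import pandas as pd

# two series whose indexes overlap only on labels 2 and 3
s1 = pd.Series([1.0, 2.0, 3.0], index=[1, 2, 3])
s2 = pd.Series([4.0, 5.0, 6.0], index=[2, 3, 4])

# default outer join: union of the labels, NaN where a value is missing
outer = pd.concat([s1, s2], axis=1)

# inner join: only the labels present in both series survive
inner = pd.concat([s1, s2], axis=1, join='inner')
print(inner)
```

With join='inner' the resulting dataframe has only rows 2 and 3 and contains no NaN at all.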


When we concatenate the series with axis=1, the concatenated parts are not identifiable in the result. Let's say we want to create a hierarchical index on the axis of concatenation. To do this, we use the keys option, as shown in the following program:

import pandas as pd
import numpy as np

ser1 = pd.Series(np.random.rand(4), index=[1,2,3,4])
print('Series 1\n')
print(ser1)

ser2 = pd.Series(np.random.rand(4), index=[5,6,7,8])
print('\nSeries 2\n')
print(ser2)

print('\nConcatenated series using the keys option\n')
print(pd.concat([ser1,ser2], keys=[1,2]))

print('\nConcatenated series using the keys option along axis=1\n')
print(pd.concat([ser1,ser2], axis=1, keys=[1,2]))


The output of the program is shown below:

Series 1

1    0.034474
2    0.984395
3    0.912107
4    0.543064
dtype: float64

Series 2

5    0.864616
6    0.231658
7    0.875177
8    0.400951
dtype: float64

Concatenated series using the keys option

1  1    0.034474
   2    0.984395
   3    0.912107
   4    0.543064
2  5    0.864616
   6    0.231658
   7    0.875177
   8    0.400951
dtype: float64

Concatenated series using the keys option along axis=1

          1         2
1  0.034474       NaN
2  0.984395       NaN
3  0.912107       NaN
4  0.543064       NaN
5       NaN  0.864616
6       NaN  0.231658
7       NaN  0.875177
8       NaN  0.400951
------------------
(program exited with code: 0)

Press any key to continue . . .


As you may have noticed, in the case of combinations between series along axis = 1, the keys become the column headers of the dataframe.

Just as with series, concatenation can also be applied to dataframes. The following program shows it:

import pandas as pd
import numpy as np

frame1 = pd.DataFrame(np.random.rand(9).reshape(3,3), index=[1,2,3], columns=['A','B','C'])
print('Frame 1\n')
print(frame1)

frame2 = pd.DataFrame(np.random.rand(9).reshape(3,3), index=[4,5,6], columns=['A','B','C'])
print('\nFrame 2\n')
print(frame2)

print('\nConcatenated frames\n')
print(pd.concat([frame1, frame2]))

print('\nConcatenated frames along axis=1\n')
print(pd.concat([frame1, frame2], axis=1))



The output of the program is shown below:

Frame 1

          A         B         C
1  0.216094  0.206833  0.565031
2  0.278919  0.311937  0.410026
3  0.262882  0.487224  0.489479

Frame 2

          A         B         C
4  0.660482  0.491644  0.411970
5  0.511529  0.394583  0.475184
6  0.638702  0.849363  0.190679

Concatenated frames

          A         B         C
1  0.216094  0.206833  0.565031
2  0.278919  0.311937  0.410026
3  0.262882  0.487224  0.489479
4  0.660482  0.491644  0.411970
5  0.511529  0.394583  0.475184
6  0.638702  0.849363  0.190679

Concatenated frames along axis=1

          A         B         C         A         B         C
1  0.216094  0.206833  0.565031       NaN       NaN       NaN
2  0.278919  0.311937  0.410026       NaN       NaN       NaN
3  0.262882  0.487224  0.489479       NaN       NaN       NaN
4       NaN       NaN       NaN  0.660482  0.491644  0.411970
5       NaN       NaN       NaN  0.511529  0.394583  0.475184
6       NaN       NaN       NaN  0.638702  0.849363  0.190679
------------------
(program exited with code: 0)

Press any key to continue . . .
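If the original row labels are not meaningful, the ignore_index=True option of concat() discards them and renumbers the rows from 0. A minimal sketch reusing the same kind of frames:

```python
import pandas as pd
import numpy as np

frame1 = pd.DataFrame(np.random.rand(9).reshape(3, 3),
                      index=[1, 2, 3], columns=['A', 'B', 'C'])
frame2 = pd.DataFrame(np.random.rand(9).reshape(3, 3),
                      index=[4, 5, 6], columns=['A', 'B', 'C'])

# ignore_index=True throws away the original row labels
# and renumbers the concatenated rows from 0
stacked = pd.concat([frame1, frame2], ignore_index=True)
print(stacked)
```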


Let's consider a scenario in which we want to combine two datasets whose indexes overlap entirely or at least partially. This combination of data cannot be obtained either with merging or with concatenation. One function applicable to series is combine_first(), which performs this kind of operation along with data alignment. See the following program:

import pandas as pd
import numpy as np

ser1 = pd.Series(np.random.rand(5),index=[1,2,3,4,5])
print('Series 1\n')
print(ser1)

ser2 = pd.Series(np.random.rand(4),index=[2,4,5,6])
print('\nSeries 2\n')
print(ser2)

print('\nCombined series with ser2 as an argument\n')
print(ser1.combine_first(ser2))

print('\nCombined series with ser1 as an argument\n')
print(ser2.combine_first(ser1))

print('\nCombined series with partial overlap\n')
print(ser1[:3].combine_first(ser2[:3]))



The output of the program is shown below:

Series 1

1    0.546086
2    0.855131
3    0.975251
4    0.159282
5    0.778717
dtype: float64

Series 2

2    0.420990
4    0.883285
5    0.483201
6    0.848290
dtype: float64

Combined series with ser2 as an argument

1    0.546086
2    0.855131
3    0.975251
4    0.159282
5    0.778717
6    0.848290
dtype: float64

Combined series with ser1 as an argument

1    0.546086
2    0.420990
3    0.975251
4    0.883285
5    0.483201
6    0.848290
dtype: float64

Combined series with partial overlap

1    0.546086
2    0.855131
3    0.975251
4    0.883285
5    0.483201
dtype: float64
------------------
(program exited with code: 0)

Press any key to continue . . . 
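combine_first() also exists as a dataframe method, patching the NaN holes of one frame with the corresponding cells of another. A minimal sketch with small hand-made frames:

```python
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'a': [1.0, np.nan], 'b': [np.nan, 4.0]})
df2 = pd.DataFrame({'a': [9.0, 2.0], 'b': [3.0, 9.0]})

# values of df1 win where present; its NaN cells are filled from df2
patched = df1.combine_first(df2)
print(patched)
```

The alignment is done cell by cell on both the row and the column labels.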


Here I am ending today’s post. Until we meet again keep practicing and learning Python, as Python is easy to learn!
Share:

Wednesday, April 24, 2019

Pandas - 21 (Data Manipulation)

In the previous posts we learned how to acquire data from data sources such as databases and files. Once we have the data in the dataframe format, they are ready to be manipulated. It’s important to prepare the data so that they can be more easily subjected to analysis and manipulation. Especially in preparation for the next phase, the data must be ready for visualization.

In the coming posts we'll learn about the functionality that the pandas library offers for this stage of data analysis. The data manipulation has three phases which are:

• Data preparation
• Data transformation
• Data aggregation

The three phases of data manipulation will be treated individually, illustrating the various operations with a series of examples and showing how best to use the functions of this library to carry them out. Let's discuss the first phase in this post, which is:

Data Preparation

Data preparation is necessary before we start manipulating the data. The data must be assembled in the form of data structures so that they can be manipulated later with the tools made available by the pandas library. The different procedures for data preparation are:

• Loading
• Assembling
     • Merging
     • Concatenating
     • Combining
• Reshaping (pivoting)
• Removing

In our previous posts we covered loading, where we learned about converting many different formats into a data structure such as a dataframe. But even after we have the data, probably from different sources and formats, unified into a dataframe, we need to perform further preparation operations. The data contained in pandas objects can be assembled in different ways:

• Merging—The pandas.merge() function connects the rows in a dataframe based on one or more keys. This mode is very familiar to those who are confident with the SQL language, since it also
implements join operations.

• Concatenating—The pandas.concat() function concatenates the objects along an axis.

• Combining—The pandas.DataFrame.combine_first() function is a method that allows you to connect overlapped data in order to fill in missing values in a data structure by taking data from another structure.

In addition to these, a part of the preparation process is also pivoting, which consists of the
exchange between rows and columns.

Merging

The merging operation consists of combining data through the connection of rows using one or more keys; this corresponds to the JOIN operation in SQL.

Usually one uses the JOIN query with SQL to get data from different tables using some reference values (keys) shared between them. On the basis of these keys, it is possible to obtain new data in a tabular form as the result of the combination of other tables. This operation with the library pandas is called merging, and merge() is the function to perform this kind of operation.

Let's make a program which shows merging of two dataframes:

import pandas as pd
import numpy as np

frame1 = pd.DataFrame( {'id':['ball','pencil','pen','mug','ashtray'],
                        'price': [12.33,11.44,33.21,13.23,33.62]})

print('First dataframe\n')
print(frame1)

frame2 = pd.DataFrame( {'id': ['pencil','pencil','ball','pen'],
                        'color':['white','red','red','black']})

print('\nSecond dataframe\n')
print(frame2)
print('\nMerged dataframe\n')
print(pd.merge(frame1,frame2))


In our program we first define two dataframes and print them. Then we carry out the merging by applying the merge() function to the two dataframe objects. The output of the program is shown below:

First dataframe

        id  price
0     ball  12.33
1   pencil  11.44
2      pen  33.21
3      mug  13.23
4  ashtray  33.62

Second dataframe

       id  color
0  pencil  white
1  pencil    red
2    ball    red
3     pen  black

Merged dataframe

       id  price  color
0    ball  12.33    red
1  pencil  11.44  white
2  pencil  11.44    red
3     pen  33.21  black
------------------
(program exited with code: 0)

Press any key to continue . . .


As we can see from the output, the returned dataframe consists of all rows that have an ID in common. In addition to the common column, the columns from the first and the second dataframe are added. We used the merge() function without specifying any column explicitly, but in most cases we need to decide which column to base the merging on.

To do this, we add the on option with the column name as the key for the merging. In the following program we implement merging on an explicitly specified column:

import pandas as pd
import numpy as np

frame1 = pd.DataFrame( {'id':['ball','pencil','pen','mug','ashtray'],
                        'price': [12.33,11.44,33.21,13.23,33.62],
                        'color':['white','red','red','black','green']})

print('First dataframe\n')
print(frame1)

frame2 = pd.DataFrame( {'id': ['pencil','pencil','ball','pen'],
                        'color':['white','red','red','black']})

print('\nSecond dataframe\n')
print(frame2)
print('\nMerged dataframe\n')
print(pd.merge(frame1,frame2))

print('\nMerged dataframe on basis of specified column = id\n')
print(pd.merge(frame1,frame2,on='id'))

print('\nMerged dataframe on basis of specified column = color\n')
print(pd.merge(frame1,frame2,on='color'))



It is necessary to explicitly define the criteria for merging that pandas must follow,
specifying the name of the key column in the on option. The output of the program is shown below:

First dataframe

        id  price  color
0     ball  12.33  white
1   pencil  11.44    red
2      pen  33.21    red
3      mug  13.23  black
4  ashtray  33.62  green

Second dataframe

       id  color
0  pencil  white
1  pencil    red
2    ball    red
3     pen  black

Merged dataframe

       id  price color
0  pencil  11.44   red

Merged dataframe on basis of specified column = id

       id  price color_x color_y
0    ball  12.33   white     red
1  pencil  11.44     red   white
2  pencil  11.44     red     red
3     pen  33.21     red   black

Merged dataframe on basis of specified column = color

     id_x  price  color    id_y
0    ball  12.33  white  pencil
1  pencil  11.44    red  pencil
2  pencil  11.44    red    ball
3     pen  33.21    red  pencil
4     pen  33.21    red    ball
5     mug  13.23  black     pen
------------------
(program exited with code: 0)

Press any key to continue . . .
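The color_x and color_y names above are pandas defaults for overlapping non-key columns; the suffixes option of merge() lets us pick more readable ones. A minimal sketch reusing the two frames from the program above:

```python
import pandas as pd

frame1 = pd.DataFrame({'id': ['ball', 'pencil', 'pen', 'mug', 'ashtray'],
                       'price': [12.33, 11.44, 33.21, 13.23, 33.62],
                       'color': ['white', 'red', 'red', 'black', 'green']})
frame2 = pd.DataFrame({'id': ['pencil', 'pencil', 'ball', 'pen'],
                       'color': ['white', 'red', 'red', 'black']})

# suffixes replaces the default _x/_y on the overlapping color columns
merged = pd.merge(frame1, frame2, on='id', suffixes=('_left', '_right'))
print(merged)
```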


Sometimes we have two dataframes in which the key columns do not have the same name. In this situation, we have to use the left_on and right_on options, which specify the key column for the first and for the second dataframe. See the following program :

import pandas as pd
import numpy as np

frame1 = pd.DataFrame( {'id':['ball','pencil','pen','mug','ashtray'],
                        'price': [12.33,11.44,33.21,13.23,33.62],
                        'color':['white','red','red','black','green']})

print('First dataframe\n')
print(frame1)

frame2 = pd.DataFrame( {'id': ['pencil','pencil','ball','pen'],
                        'color':['white','red','red','black']})
frame2.columns = ['sid','color']
print('\nSecond dataframe\n')
print(frame2)
print('\nMerged dataframe\n')
print(pd.merge(frame1,frame2))

print('\nMerged dataframe on basis of specified options\n')
print(pd.merge(frame1,frame2,left_on='id', right_on='sid'))


The output of the program is shown below:

First dataframe

        id  price  color
0     ball  12.33  white
1   pencil  11.44    red
2      pen  33.21    red
3      mug  13.23  black
4  ashtray  33.62  green

Second dataframe

      sid  color
0  pencil  white
1  pencil    red
2    ball    red
3     pen  black

Merged dataframe

       id  price  color     sid
0    ball  12.33  white  pencil
1  pencil  11.44    red  pencil
2  pencil  11.44    red    ball
3     pen  33.21    red  pencil
4     pen  33.21    red    ball
5     mug  13.23  black     pen

Merged dataframe on basis of specified options

       id  price color_x     sid color_y
0    ball  12.33   white    ball     red
1  pencil  11.44     red  pencil   white
2  pencil  11.44     red  pencil     red
3     pen  33.21     red     pen   black

------------------
(program exited with code: 0)

Press any key to continue . . . 


By default, the merge() function performs an inner join; the keys in the result come from an intersection. Other possible options are the left join, the right join, and the outer join. The outer join produces the union of all keys, combining the effect of a left join with a right join. To select the type of join, you use the how option. See the following program:

import pandas as pd
import numpy as np

frame1 = pd.DataFrame( {'id':['ball','pencil','pen','mug','ashtray'],
                        'price': [12.33,11.44,33.21,13.23,33.62],
                        'color':['white','red','red','black','green']})

print('First dataframe\n')
print(frame1)

frame2 = pd.DataFrame( {'id': ['pencil','pencil','ball','pen'],
                        'color':['white','red','red','black']})
frame2.columns = ['id','color']
print('\nSecond dataframe\n')
print(frame2)
print('\nMerged dataframe\n')
print(pd.merge(frame1,frame2))

print('\nMerged dataframe on basis of outer join\n')
print(pd.merge(frame1,frame2,on='id',how='outer'))

print('\nMerged dataframe on basis of left join\n')
print(pd.merge(frame1,frame2,on='id',how='left'))

print('\nMerged dataframe on basis of right join\n')
print(pd.merge(frame1,frame2,on='id',how='right'))


The output of the program is shown below:

First dataframe

        id  price  color
0     ball  12.33  white
1   pencil  11.44    red
2      pen  33.21    red
3      mug  13.23  black
4  ashtray  33.62  green

Second dataframe

       id  color
0  pencil  white
1  pencil    red
2    ball    red
3     pen  black

Merged dataframe

       id  price color
0  pencil  11.44   red

Merged dataframe on basis of outer join

        id  price color_x color_y
0     ball  12.33   white     red
1   pencil  11.44     red   white
2   pencil  11.44     red     red
3      pen  33.21     red   black
4      mug  13.23   black     NaN
5  ashtray  33.62   green     NaN

Merged dataframe on basis of left join

        id  price color_x color_y
0     ball  12.33   white     red
1   pencil  11.44     red   white
2   pencil  11.44     red     red
3      pen  33.21     red   black
4      mug  13.23   black     NaN
5  ashtray  33.62   green     NaN

Merged dataframe on basis of right join

       id  price color_x color_y
0    ball  12.33   white     red
1  pencil  11.44     red   white
2  pencil  11.44     red     red
3     pen  33.21     red   black

------------------
(program exited with code: 0)

Press any key to continue . . . 
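When checking the effect of the different join types, the indicator=True option of merge() is handy: it adds a _merge column that reports, for each row, whether the key was found in the left frame only, the right frame only, or both. A minimal sketch:

```python
import pandas as pd

frame1 = pd.DataFrame({'id': ['ball', 'pencil', 'pen', 'mug', 'ashtray'],
                       'price': [12.33, 11.44, 33.21, 13.23, 33.62]})
frame2 = pd.DataFrame({'id': ['pencil', 'pencil', 'ball', 'pen'],
                       'color': ['white', 'red', 'red', 'black']})

# indicator=True adds a _merge column marking each row as
# 'left_only', 'right_only', or 'both'
merged = pd.merge(frame1, frame2, on='id', how='outer', indicator=True)
print(merged)
```

Here mug and ashtray come out marked 'left_only', since their ids never appear in the second frame.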


To merge on multiple keys, we simply pass a list to the on option, as shown in the following program:

import pandas as pd
import numpy as np

frame1 = pd.DataFrame( {'id':['ball','pencil','pen','mug','ashtray'],
                        'price': [12.33,11.44,33.21,13.23,33.62],
                        'color':['white','red','red','black','green']})

print('First dataframe\n')
print(frame1)

frame2 = pd.DataFrame( {'id': ['pencil','pencil','ball','pen'],
                        'color':['white','red','red','black']})

print('\nSecond dataframe\n')
print(frame2)
print('\nMerged dataframe\n')
print(pd.merge(frame1,frame2))

print('\nMerged dataframe on basis of outer join\n')
print(pd.merge(frame1,frame2,on=['id','color'],how='outer'))


The output of the program is shown below:

First dataframe

        id  price  color
0     ball  12.33  white
1   pencil  11.44    red
2      pen  33.21    red
3      mug  13.23  black
4  ashtray  33.62  green

Second dataframe

       id  color
0  pencil  white
1  pencil    red
2    ball    red
3     pen  black

Merged dataframe

       id  price color
0  pencil  11.44   red

Merged dataframe on basis of outer join

        id  price  color
0     ball  12.33  white
1   pencil  11.44    red
2      pen  33.21    red
3      mug  13.23  black
4  ashtray  33.62  green
5   pencil    NaN  white
6     ball    NaN    red
7      pen    NaN  black
------------------
(program exited with code: 0)

Press any key to continue . . . 


Merging on an Index

Sometimes, instead of considering the columns of a dataframe as keys, indexes can be used as keys for merging. To decide which indexes to consider, we set the left_index or right_index option to True (one of them, or both). Another way to merge on indexes is to use the join() function on the dataframe objects. It can also be used to combine many dataframe objects that have the same, or similar, indexes but no overlapping columns. See the following program, which shows merging by indexes:

import pandas as pd
import numpy as np

frame1 = pd.DataFrame( {'id':['ball','pencil','pen','mug','ashtray'],
                        'price': [12.33,11.44,33.21,13.23,33.62],
                        'color':['white','red','red','black','green']})

print('First dataframe\n')
print(frame1)

frame2 = pd.DataFrame( {'id': ['pencil','pencil','ball','pen'],
                        'color':['white','red','red','black']})
frame2.columns = ['id','color']
print('\nSecond dataframe\n')
print(frame2)
print('\nMerged dataframe\n')
print(pd.merge(frame1,frame2))

print('\nMerged dataframe on basis of indexes\n')
print(pd.merge(frame1,frame2,right_index=True, left_index=True))

frame2.columns = ['id2','color2']
print('\nMerged dataframe using join()\n')
print(frame1.join(frame2))


The output of the program is shown below:

First dataframe

        id  price  color
0     ball  12.33  white
1   pencil  11.44    red
2      pen  33.21    red
3      mug  13.23  black
4  ashtray  33.62  green

Second dataframe

       id  color
0  pencil  white
1  pencil    red
2    ball    red
3     pen  black

Merged dataframe

       id  price color
0  pencil  11.44   red

Merged dataframe on basis of indexes

     id_x  price color_x    id_y color_y
0    ball  12.33   white  pencil   white
1  pencil  11.44     red  pencil     red
2     pen  33.21     red    ball     red
3     mug  13.23   black     pen   black

Merged dataframe using join()

        id  price  color     id2 color2
0     ball  12.33  white  pencil  white
1   pencil  11.44    red  pencil    red
2      pen  33.21    red    ball    red
3      mug  13.23  black     pen  black
4  ashtray  33.62  green     NaN    NaN
------------------
(program exited with code: 0)

Press any key to continue . . .


In the above program we renamed the columns of frame2 before calling join(). If we don't rename them, the program raises an error, because some columns in frame2 have the same name as columns in frame1.
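Instead of renaming the columns by hand, join() also accepts the lsuffix and rsuffix options, which append a suffix to the clashing column names automatically. A minimal sketch:

```python
import pandas as pd

frame1 = pd.DataFrame({'id': ['ball', 'pencil', 'pen', 'mug', 'ashtray'],
                       'price': [12.33, 11.44, 33.21, 13.23, 33.62],
                       'color': ['white', 'red', 'red', 'black', 'green']})
frame2 = pd.DataFrame({'id': ['pencil', 'pencil', 'ball', 'pen'],
                       'color': ['white', 'red', 'red', 'black']})

# the suffixes are appended to the clashing column names (id and color),
# so the join works without renaming anything beforehand
joined = frame1.join(frame2, lsuffix='_1', rsuffix='_2')
print(joined)
```

Note that non-clashing columns, such as price, keep their original names.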

Here I am ending today’s post. Until we meet again keep practicing and learning Python, as Python is easy to learn!
Share:

Tuesday, April 23, 2019

Pandas - 20 (Interacting with Databases)

Usually data are stored in databases, both SQL-based relational databases and NoSQL databases. To load data from SQL into a dataframe, pandas provides some functions that simplify the process.

The pandas.io.sql module provides a unified interface, independent of the specific database, built on the SQLAlchemy library. This interface simplifies the connection mode, since regardless of the DB the commands will always be the same.

To make a connection we use the create_engine() function (from sqlalchemy import create_engine). With it we can configure all the properties necessary to use the driver, such as the user, password, host, port, and database name. Following is a list of examples for the various types of databases:

For PostgreSQL:
>>> engine = create_engine('postgresql://scott:tiger@localhost:5432/mydatabase')
For MySQL
>>> engine = create_engine('mysql+mysqldb://scott:tiger@localhost/foo')
For Oracle
>>> engine = create_engine('oracle://scott:tiger@127.0.0.1:1521/sidname')
For MSSQL
>>> engine = create_engine('mssql+pyodbc://mydsn')
For SQLite
>>> engine = create_engine('sqlite:///foo.db')

Loading and Writing Data with SQLite3

Let's see how to interact with SQLite3, one of the most frequently used databases, especially in development environments. We will use a SQLite database through Python's built-in sqlite3 driver. SQLite3 is a tool that implements a SQL DBMS in a very simple and lightweight way, so it can be incorporated into any application implemented with the Python language. It is ideal for anyone who wants the functions of a database without having to install a real one: a good choice for practicing before moving on to a real database, or for anyone who needs database functionality to collect data while staying within a single program, without interfacing with an external database. See the following program:

import pandas as pd
import numpy as np
from sqlalchemy import create_engine

frame = pd.DataFrame(np.arange(20).reshape(4,5),
                    columns=['white','red','blue','black','green'])

print('The frame\n')
print(frame)
engine = create_engine('sqlite:///mydata.db')
frame.to_sql('colors',engine)

print('\nReading the database\n')
print(pd.read_sql('colors',engine))



Make sure you have installed the sqlalchemy module. First we create a dataframe that we will use to create a new table in the SQLite3 database. Next we establish the connection to the SQLite3 database with create_engine('sqlite:///mydata.db'). Then we convert the dataframe into a table within the database using frame.to_sql('colors',engine). To read the database, we use the read_sql() function with the name of the table and the engine (pd.read_sql('colors',engine)).

The output of the program is shown below:

The frame

   white  red  blue  black  green
0      0    1     2      3      4
1      5    6     7      8      9
2     10   11    12     13     14
3     15   16    17     18     19

Reading the database

   index  white  red  blue  black  green
0      0      0    1     2      3      4
1      1      5    6     7      8      9
2      2     10   11    12     13     14
3      3     15   16    17     18     19

Using the I/O APIs available in the pandas library, the writing operation on the database becomes very simple. To appreciate how much they simplify things, let's perform the same task without them, using the sqlite3 module directly. See the following program:

import pandas as pd
import numpy as np
from sqlalchemy import create_engine
import sqlite3
query = """
         CREATE TABLE test
         (a VARCHAR(20), b VARCHAR(20),
          c REAL,        d INTEGER
         );"""
con = sqlite3.connect(':memory:')
con.execute(query)
con.commit()
data = [('white','up',1,3),
        ('black','down',2,8),
        ('green','up',4,4),
        ('red','down',5,5)]
stmt = "INSERT INTO test VALUES(?,?,?,?)"
con.executemany(stmt, data)
con.commit()
cursor = con.execute('select * from test')
rows = cursor.fetchall()
print('\nRecorded data\n')
print(rows)
print('\nFinding column names\n')
print(cursor.description)
print('\nThe table\n')
print(pd.DataFrame(rows, columns=list(zip(*cursor.description))[0]))


First we establish a connection to the DB and create a table, defining the correct data types so as to accommodate the data to be loaded. Next we insert data using the SQL INSERT statement. After loading the data into the table, we query the database to get the data we just recorded, using an SQL SELECT statement. We then find the names of the columns in the description attribute of the cursor. Finally, we pass the list of tuples to the constructor of the dataframe. The output of the program is shown below:

Recorded data

[('white', 'up', 1.0, 3), ('black', 'down', 2.0, 8), ('green', 'up', 4.0, 4), ('red', 'down', 5.0, 5)]

Finding column names

(('a', None, None, None, None, None, None), ('b', None, None, None, None, None,
None), ('c', None, None, None, None, None, None), ('d', None, None, None, None,
None, None))

The table

       a     b    c  d
0  white    up  1.0  3
1  black  down  2.0  8
2  green    up  4.0  4
3    red  down  5.0  5

------------------
(program exited with code: 0)

Press any key to continue . . .
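As a middle ground, it is worth noting that for SQLite pandas can also talk directly to a plain sqlite3 connection, without SQLAlchemy, and that read_sql_query() accepts an arbitrary SELECT statement. A minimal sketch (the table and column names are made up for the example):

```python
import sqlite3
import pandas as pd

# for SQLite, pandas accepts a plain DBAPI connection object
con = sqlite3.connect(':memory:')

df = pd.DataFrame({'color': ['white', 'black', 'red'],
                   'value': [1, 2, 3]})
df.to_sql('colors', con, index=False)

# read back only the rows we want, with an ordinary SQL query
result = pd.read_sql_query('SELECT * FROM colors WHERE value > 1', con)
print(result)
```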


As seen from the program, we can easily conclude that this approach is quite laborious, and that with the I/O APIs available in the pandas library the writing operation on the database is much easier. As I only have the SQLite DB, I am not discussing interaction with other databases here. Whenever I start using other databases, I'll create a post on how to interact with them using the I/O APIs available in the pandas library.


Here I am ending today’s post. Until we meet again keep practicing and learning Python, as Python is easy to learn!
Share: