Discretization is a transformation process used in experimental settings to handle large quantities of data generated in sequence. To analyze such data, it is often necessary to transform it into discrete categories, for example:
1. by dividing the range of values of the readings into smaller intervals and counting the occurrences or statistics within each of them;
2. when we have a huge number of samples due to precise readings on a population; here too, to facilitate analysis, it is necessary to divide the range of values into categories and then analyze the occurrences and statistics related to each.
The following program shows how discretization and binning are performed:
import pandas as pd
import numpy as np
#List containing experimental values
results = [12,34,67,55,28,90,99,12,3,56,74,44,87,23,49,89,87]
#array containing the bin edges (the values separating the bins)
bins = [0,25,50,75,100]
#binning
cat = pd.cut(results, bins)
print('Bins\n')
print(cat)
print('\nCategories array indicating the names of the different internal categories\n')
print(cat.categories)
print('\nCodes array containing a list of numbers equal to the elements of results list\n')
print(cat.codes)
print('\nOccurrences for each bin\n')
print(pd.value_counts(cat))
Our program deals with data representing experimental readings between 0 and 100, collected in a list named results. We divide this range into four equal parts, which we call bins: the first contains the values between 0 and 25, the second between 25 and 50, the third between 50 and 75, and the last between 75 and 100 (each bin includes its upper edge but not its lower one).
To perform this binning with pandas, we define an array containing the bin edges (0, 25, 50, 75, 100). Then we apply the cut() function to the results list, also passing the bins. The object returned by cut() is a special object of Categorical type, which we can think of as an array of strings naming the bin each value falls into. Internally it contains a categories array listing the different categories and a codes array with one number per element of results (i.e., the array subjected to binning); each number indicates the bin to which the corresponding element of results is assigned.
Finally, to count the occurrences for each bin, that is, how many results fall into each category, we use the value_counts() function. The output of the program is shown below:
Bins
[(0, 25], (25, 50], (50, 75], (50, 75], (25, 50], ..., (75, 100], (0, 25], (25,
50], (75, 100], (75, 100]]
Length: 17
Categories (4, interval[int64]): [(0, 25] < (25, 50] < (50, 75] < (75, 100]]
Categories array indicating the names of the different internal categories
IntervalIndex([(0, 25], (25, 50], (50, 75], (75, 100]],
closed='right',
dtype='interval[int64]')
Codes array containing a list of numbers equal to the elements of results list
[0 1 2 2 1 3 3 0 0 2 2 1 3 0 1 3 3]
Occurrences for each bin
(75, 100] 5
(50, 75] 4
(25, 50] 4
(0, 25] 4
dtype: int64
------------------
(program exited with code: 0)
Press any key to continue . . .
From the output we can see that each interval has its lower limit marked with a round parenthesis and its upper limit with a square bracket. This is consistent with standard mathematical notation for intervals: a square bracket means the endpoint belongs to the interval (closed limit), while a round parenthesis means it does not (open limit).
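By default cut() produces right-closed intervals, but this behavior can be changed. As a quick sketch using the standard right option of cut() (not part of the programs in this post), we can make the bins closed on the left instead:

```python
import pandas as pd

# same experimental values as in the program above
results = [12, 34, 67, 55, 28, 90, 99, 12, 3, 56, 74, 44, 87, 23, 49, 89, 87]
bins = [0, 25, 50, 75, 100]

# right=False closes each bin on the left: [0, 25), [25, 50), ...
cat_left = pd.cut(results, bins, right=False)
print(cat_left.categories)
```

With this option a value of exactly 25 would fall in the second bin rather than the first.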
We can give names to the various bins by first listing them in an array of strings and then assigning that array to the labels option of the cut() function used to create the Categorical object. If cut() is passed an integer instead of explicit bin edges, it divides the range of values of the array into that many equal-width intervals.
The limits of the intervals are taken from the minimum and maximum of the sample data, that is, the array subjected to binning, in our case results. See the following program:
import pandas as pd
import numpy as np
#List containing experimental values
results = [12,34,67,55,28,90,99,12,3,56,74,44,87,23,49,89,87]
#array containing the bin edges (the values separating the bins)
bins = [0,25,50,75,100]
bin_names = ['unlikely','less likely','likely','highly likely']
#binning
cat = pd.cut(results, bins,labels=bin_names)
print('Bins\n')
print(cat)
print('\nDividing the range of values of the array into 5 intervals\n')
print(pd.cut(results, 5))
The output of the program is shown below:
Bins
[unlikely, less likely, likely, likely, less likely, ..., highly likely, unlikel
y, less likely, highly likely, highly likely]
Length: 17
Categories (4, object): [unlikely < less likely < likely < highly likely]
Dividing the range of values of the array into 5 intervals
[(2.904, 22.2], (22.2, 41.4], (60.6, 79.8], (41.4, 60.6], (22.2, 41.4], ..., (79
.8, 99.0], (22.2, 41.4], (41.4, 60.6], (79.8, 99.0], (79.8, 99.0]]
Length: 17
Categories (5, interval[float64]): [(2.904, 22.2] < (22.2, 41.4] < (41.4, 60.6]
< (60.6, 79.8] < (79.8, 99.0]]
------------------
(program exited with code: 0)
Press any key to continue . . .
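When cut() computes the edges itself, we can also ask it to return them. A minimal sketch using the standard retbins option of cut() (not used in the programs above):

```python
import pandas as pd

results = [12, 34, 67, 55, 28, 90, 99, 12, 3, 56, 74, 44, 87, 23, 49, 89, 87]

# retbins=True also returns the edges pandas derived from the min and max
cat, edges = pd.cut(results, 5, retbins=True)
print(edges)
```

The first edge is slightly below the minimum value (3) because pandas widens the range a little so that the smallest value is included in the first right-closed bin.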
Apart from cut(), pandas provides another method for binning, qcut(), which divides the sample directly into quantiles (here, quintiles). With cut(), the number of occurrences can differ from bin to bin. qcut(), instead, tries to make the number of occurrences in each bin as equal as possible, but the edges of each bin vary. Let's use this method in our previous program:
import pandas as pd
import numpy as np
#List containing experimental values
results = [12,34,67,55,28,90,99,12,3,56,74,44,87,23,49,89,87]
#array containing the bin edges (the values separating the bins)
bins = [0,25,50,75,100]
bin_names = ['unlikely','less likely','likely','highly likely']
#binning
cat = pd.cut(results, bins,labels=bin_names)
print('Bins\n')
print(cat)
print('\nDividing the range of values of the array into 5 intervals\n')
print(pd.cut(results, 5))
quintiles = pd.qcut(results, 5)
print('\nUsing qcut() to divide the range of values of the array into 5 intervals\n')
print(quintiles)
print('\nOccurrences for each bin\n')
print(pd.value_counts(quintiles))
The output of the program is shown below:
Bins
[unlikely, less likely, likely, likely, less likely, ..., highly likely, unlikel
y, less likely, highly likely, highly likely]
Length: 17
Categories (4, object): [unlikely < less likely < likely < highly likely]
Dividing the range of values of the array into 5 intervals
[(2.904, 22.2], (22.2, 41.4], (60.6, 79.8], (41.4, 60.6], (22.2, 41.4], ..., (79
.8, 99.0], (22.2, 41.4], (41.4, 60.6], (79.8, 99.0], (79.8, 99.0]]
Length: 17
Categories (5, interval[float64]): [(2.904, 22.2] < (22.2, 41.4] < (41.4, 60.6]
< (60.6, 79.8] <
(79.8, 99.0]]
Using qcut() to divide the range of values of the array into 5 intervals
[(2.999, 24.0], (24.0, 46.0], (62.6, 87.0], (46.0, 62.6], (24.0, 46.0], ..., (62
.6, 87.0], (2.999, 24.0], (46.0, 62.6], (87.0, 99.0], (62.6, 87.0]]
Length: 17
Categories (5, interval[float64]): [(2.999, 24.0] < (24.0, 46.0] < (46.0, 62.6]
< (62.6, 87.0] <
(87.0, 99.0]]
Occurrences for each bin
(62.6, 87.0] 4
(2.999, 24.0] 4
(87.0, 99.0] 3
(46.0, 62.6] 3
(24.0, 46.0] 3
dtype: int64
------------------
(program exited with code: 0)
Press any key to continue . . .
We can see from the output that in the case of quintiles the intervals bounding the bins differ from those generated by the cut() function. Moreover, qcut() tried to standardize the occurrences across the bins, but two bins have one occurrence more than the others because the number of results (17) is not divisible by five.
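qcut() also accepts an explicit list of quantiles instead of a number of bins. As a small variation on the program above (not part of the original, and using quartiles instead of quintiles):

```python
import pandas as pd

results = [12, 34, 67, 55, 28, 90, 99, 12, 3, 56, 74, 44, 87, 23, 49, 89, 87]

# passing explicit quantiles (here the quartiles: 0%, 25%, 50%, 75%, 100%)
quartiles = pd.qcut(results, [0, 0.25, 0.5, 0.75, 1.0])
print(pd.Series(quartiles).value_counts())
```

As before, every element of results falls into exactly one of the four quantile bins.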
During data analysis there is often a need to detect abnormal values in a data structure. See the following program, in which we consider as outliers the values whose absolute value is greater than three times the standard deviation:
import pandas as pd
import numpy as np
mydataframe = pd.DataFrame(np.random.randn(1000,3))
print('\nThe original dataframe\n')
print(mydataframe)
print('\nThe statistics for each column\n')
print(mydataframe.describe())
print('\nThe standard deviation of each column of the dataframe\n')
print(mydataframe.std())
print('\nThe filtering of all the values of the dataframe\n')
print(mydataframe[(np.abs(mydataframe) > (3*mydataframe.std())).any(axis=1)])
First we created a dataframe with three columns of 1,000 completely random values each. Then with the describe() function we viewed the statistics for each column and checked the standard deviation of each column of the dataframe using the std() function. Finally we filtered the values of the dataframe, comparing each value against three times the standard deviation of its column, and used the any() function with axis=1 to keep every row containing at least one such value.
The output of the program is shown below:
The original dataframe
0 1 2
0 -0.152871 -0.763588 0.550351
1 -0.329748 0.202107 -0.555331
2 -0.517345 1.790445 -1.158925
3 -0.972289 -0.312655 0.620838
4 1.450099 -0.507097 -2.944250
5 0.535946 0.007717 -0.390331
6 -0.487387 -0.471448 0.177973
7 -0.526594 0.467879 0.340540
8 -0.534051 -1.004680 0.544254
9 0.927932 1.201972 -0.804130
10 0.490467 -0.524667 -0.699485
11 -0.458283 -0.549288 0.299852
12 0.462777 -0.568852 -0.925806
13 0.426844 -1.712511 -0.780843
14 -0.663651 1.311056 0.979108
15 0.294022 -0.797623 0.730315
16 1.274876 -0.000637 0.286369
17 1.315956 0.067872 0.773538
18 0.106650 -0.511677 -0.437176
19 -0.627332 0.193505 -0.049096
20 0.181071 0.477801 -1.509857
21 0.468760 -1.005808 0.328267
22 1.568992 2.211600 -1.403844
23 0.177481 0.826748 0.310399
24 0.789889 -0.663966 0.556157
25 1.664440 0.468722 1.284905
26 -1.104418 -1.266112 2.053315
27 0.037905 1.034867 -0.992572
28 -2.607207 0.362349 -1.825882
29 0.390756 1.633788 -0.370098
.. ... ... ...
970 -0.431666 0.161989 1.098937
971 -0.020122 0.551296 1.081225
972 -0.505658 -0.298048 0.023238
973 0.138252 -1.028921 -0.124180
974 -1.064977 -0.000879 0.156231
975 1.347509 -0.021861 1.280861
976 -0.225524 -0.583704 0.005301
977 -0.263022 -2.116113 -1.257308
978 -1.019497 -0.244579 -1.429471
979 -1.283034 1.166787 0.713066
980 0.979168 0.057361 0.397983
981 -0.555054 0.496199 -0.658068
982 0.051657 0.196189 -0.083374
983 -1.578053 -0.229885 -0.413917
984 0.990382 -1.547720 -1.001030
985 -1.073932 0.470117 -1.726342
986 -0.013742 -0.784292 -0.686692
987 0.915873 1.920051 1.674205
988 -0.672419 -0.606013 0.129781
989 1.516790 0.578385 -0.540154
990 0.597486 -0.177357 -0.012550
991 0.216080 -1.731623 1.315886
992 0.776450 -2.359688 -1.205302
993 -0.094598 -0.211266 -0.752690
994 -0.335907 -0.634471 -1.062571
995 -0.004971 -1.916150 0.566218
996 0.585543 -0.212457 0.366224
997 0.167019 -1.194672 0.774392
998 -0.831502 0.307548 -2.015205
999 0.847513 -0.921022 0.425666
[1000 rows x 3 columns]
The statistics for each column
0 1 2
count 1000.000000 1000.000000 1000.000000
mean 0.038385 -0.038058 0.012744
std 1.002982 1.007076 1.015312
min -2.854868 -3.133775 -3.002108
25% -0.659852 -0.716565 -0.653306
50% 0.054970 -0.019524 0.021691
75% 0.740922 0.629396 0.661967
max 3.224009 3.179825 3.793527
The standard deviation of each column of the dataframe
0 1.002982
1 1.007076
2 1.015312
dtype: float64
The filtering of all the values of the dataframe
0 1 2
160 3.224009 -1.695575 -0.720423
230 -0.590097 3.179825 -0.789051
334 -1.347662 -3.133775 0.727448
367 -0.522820 -3.059741 -1.905784
374 -0.905473 -0.220910 3.793527
------------------
(program exited with code: 0)
Press any key to continue . . .
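The filter above selects the rows that contain an outlier. To discard those rows instead and keep only the clean data, the same mask can be negated with the ~ operator. A minimal sketch (the seed is an arbitrary choice added only so the example is reproducible):

```python
import pandas as pd
import numpy as np

np.random.seed(0)  # arbitrary fixed seed, only for reproducibility
mydataframe = pd.DataFrame(np.random.randn(1000, 3))

# ~ inverts the mask: keep only rows with no value beyond 3 standard deviations
clean = mydataframe[~(np.abs(mydataframe) > (3 * mydataframe.std())).any(axis=1)]
print(len(clean))
```

The result is the original dataframe with the outlier rows dropped, which is often the next step after detecting them.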
Here I am ending today's post. Until we meet again, keep practicing and learning Python, as Python is easy to learn!