Wednesday, May 22, 2019

Pandas - 42 Supervised Learning with scikit-learn (Diabetes Dataset)

The diabetes dataset is one of several datasets bundled with the scikit-learn library. To load it, we first import the datasets module of scikit-learn and then call the load_diabetes() function, storing the result in a variable named diabetes.

from sklearn import datasets
diabetes = datasets.load_diabetes()


This dataset contains physiological data collected from 442 patients and, as the corresponding
target, an indicator of disease progression one year later. The physiological data occupy
the 10 columns of the data array, which hold, in order, the following:

• Age
• Sex
• Body mass index
• Blood pressure
• S1, S2, S3, S4, S5, and S6 (six blood serum measurements)
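As a quick check of the layout described above, the following sketch prints the shape of the data array and the column labels. It assumes the feature_names attribute of the object returned by load_diabetes(), which scikit-learn provides alongside data and target; the variable name feature_names is just for illustration.

```python
from sklearn import datasets

diabetes = datasets.load_diabetes()

# One row per patient, one column per physiological measurement.
print(diabetes.data.shape)

# Column labels, in the same order as the columns of diabetes.data.
feature_names = list(diabetes.feature_names)
print(feature_names)
```

The labels are abbreviated (for example bmi for body mass index and bp for blood pressure), but they follow the order of the list above.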

These measurements can be obtained through the data attribute. When we check the values, however, they look very different from what we might expect: they are small numbers around zero rather than, say, ages in years. The following program prints the 10 values for the first patient:

from sklearn import datasets

diabetes = datasets.load_diabetes()
print(diabetes.data[0])


The output of the program is shown below:

[ 0.03807591  0.05068012  0.06169621  0.02187235 -0.0442235  -0.03482076
 -0.04340085 -0.00259226  0.01990842 -0.01764613]


In the output above, each of the 10 values has been mean-centered and then scaled by the standard deviation times the square root of the number of samples. As a result, the sum of squares of each column is equal to 1. Let's check this with the age measurements; we should obtain a value very close to 1. See the following program:

import numpy as np
from sklearn import datasets

diabetes = datasets.load_diabetes()
print(np.sum(diabetes.data[:,0]**2))


The output of the program is shown below:

1.0000000000000746


Even though these values are normalized and therefore difficult to read, they still express the 10 physiological characteristics. Centering and scaling change only the units of each column, so no statistical information is lost.
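To make this kind of normalization concrete, here is a minimal sketch that applies mean-centering and division by the standard deviation times the square root of the number of samples to a small made-up column. The numbers in raw are hypothetical and are not taken from the dataset; the point is only that the result has a sum of about 0 and a sum of squares of about 1.

```python
import numpy as np

# Made-up measurements for five hypothetical patients (not from the dataset).
raw = np.array([48.0, 59.0, 72.0, 24.0, 50.0])

# Subtract the mean, then divide by std * sqrt(n_samples).
centered = raw - raw.mean()
scaled = centered / (raw.std() * np.sqrt(raw.size))

column_sum = np.sum(scaled)
sum_of_squares = np.sum(scaled ** 2)
print(column_sum)       # approximately 0
print(sum_of_squares)   # approximately 1
```

Note that raw.std() uses NumPy's default population standard deviation (ddof=0), which is what makes the sum of squares come out to 1.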

As for the indicators of disease progression, that is, the values that our predictions must match, they are available through the target attribute, as shown in the following program:

from sklearn import datasets

diabetes = datasets.load_diabetes()
print(diabetes.target)


The output of the program is shown below:

[151.  75. 141. 206. 135.  97. 138.  63. 110. 310. 101.  69. 179. 185.
 118. 171. 166. 144.  97. 168.  68.  49.  68. 245. 184. 202. 137.  85.
 131. 283. 129.  59. 341.  87.  65. 102. 265. 276. 252.  90. 100.  55.
.......

....... 
72. 140. 189. 181. 209. 136. 261. 113. 131. 174. 257.  55.  84.  42.
 146. 212. 233.  91. 111. 152. 120.  67. 310.  94. 183.  66. 173.  72.
  49.  64.  48. 178. 104. 132. 220.  57.]
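Since the full target array is long, a short summary can be easier to read. The sketch below prints its shape together with the minimum, maximum, and mean, using only standard NumPy array methods. The variable name target is just for illustration.

```python
from sklearn import datasets

diabetes = datasets.load_diabetes()
target = diabetes.target

# One progression value per patient, matching the rows of diabetes.data.
print(target.shape)

# Range and average of the progression indicator.
print(target.min(), target.max(), target.mean())
```

Unlike the columns of data, the target values are not normalized, so they can be read directly as the progression indicator.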


Here I am ending today’s post. Until we meet again keep practicing and learning Python, as Python is easy to learn!