Thursday, May 23, 2019

Pandas - 43 Supervised Learning with scikit-learn (Linear Regression)


Linear regression uses the data contained in the training set to build a linear model. The simplest model is based on the equation of a straight line, with two parameters a and b that characterize it. These parameters are calculated so as to make the sum of the squared residuals (the differences between the observed targets and the values predicted by the line) as small as possible.

y = a*x + b

In this expression, x is the training data, y is the target, a is the slope, and b is the intercept of the line represented by the model. In scikit-learn, to use the predictive model for linear regression, we must import the linear_model module and then use the LinearRegression() constructor to create the predictive model, which we call linreg.
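Before applying this to a real dataset, here is a minimal sketch of the idea using a couple of made-up points and NumPy's least-squares line fit (the data below are purely illustrative and are not from the diabetes dataset):

import numpy as np

# illustrative data: y is roughly 2*x + 1 with a little noise
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# fit y = a*x + b by minimizing the sum of squared residuals
a, b = np.polyfit(x, y, 1)
print('slope a:', a)      # close to 2
print('intercept b:', b)  # close to 1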

In the following program we'll use the diabetes dataset described in the previous post:

from sklearn import datasets, linear_model

linreg = linear_model.LinearRegression()

diabetes = datasets.load_diabetes()
# break the 442 patients into a training set
x_train = diabetes.data[:-20]
y_train = diabetes.target[:-20]
# break the 442 patients into a test set
x_test = diabetes.data[-20:]
y_test = diabetes.target[-20:]

linreg.fit(x_train,y_train)#apply the training set to the predictive model
print('\n10 b coefficients\n')
print(linreg.coef_)#get the 10 b coefficients
print('\nA series of targets to be compared with the values actually observed\n')
print(linreg.predict(x_test))#apply the test set to the linreg prediction model
print(y_test)
print('\nVariance\n')
print(linreg.score(x_test, y_test))#calculate the variance (the R² score)



In our program we first break the 442 patients into a training set (composed of the first 422 patients) and a test set (the last 20 patients). Next we apply the training set to the predictive model using the fit() function. Once the model is trained we can get the 10 b coefficients calculated for each physiological variable, using the coef_ attribute of the predictive model.
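As a quick check (not part of the original program), a prediction from a trained LinearRegression is just the intercept plus the dot product of the coefficients with the feature values, so we could verify the first test prediction by hand:

import numpy as np

# manual prediction for the first test patient: intercept + sum(coef * features)
manual = linreg.intercept_ + np.dot(x_test[0], linreg.coef_)
print(manual)                      # same value as below
print(linreg.predict(x_test[:1]))  # prediction from the model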

Later we apply the test set to the linreg prediction model to get a series of predicted targets to be compared with the values actually observed. Finally we calculate the variance (the score() method returns the R² coefficient of determination), which is a good indicator of the quality of the prediction: the closer it is to 1, the better the prediction.
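For reference, the same value could be computed by hand from the predictions; this is just the standard R² formula, not something the original program does:

import numpy as np

y_pred = linreg.predict(x_test)
ss_res = np.sum((y_test - y_pred) ** 2)           # sum of squared residuals
ss_tot = np.sum((y_test - np.mean(y_test)) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot
print(r2)  # should match linreg.score(x_test, y_test)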

The output of the program is shown below:

10 b coefficients

[ 3.03499549e-01 -2.37639315e+02  5.10530605e+02  3.27736980e+02
 -8.14131709e+02  4.92814588e+02  1.02848452e+02  1.84606489e+02
  7.43519617e+02  7.60951722e+01]

A series of targets to be compared with the values actually observed

[197.61846908 155.43979328 172.88665147 111.53537279 164.80054784
 131.06954875 259.12237761 100.47935157 117.0601052  124.30503555
 218.36632793  61.19831284 132.25046751 120.3332925   52.54458691
 194.03798088 102.57139702 123.56604987 211.0346317   52.60335674]
[233.  91. 111. 152. 120.  67. 310.  94. 183.  66. 173.  72.  49.  64.
  48. 178. 104. 132. 220.  57.]

Variance

0.5850753022690571
------------------
(program exited with code: 0)

Press any key to continue . . .


Now let's perform the linear regression taking into account a single physiological factor, the age. See the following program:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets, linear_model

diabetes = datasets.load_diabetes()
x_train = diabetes.data[:-20]
y_train = diabetes.target[:-20]
x_test = diabetes.data[-20:]
y_test = diabetes.target[-20:]
x0_test = x_test[:,0]#select only the first feature (the age)
x0_train = x_train[:,0]
x0_test = x0_test[:,np.newaxis]#reshape to a column vector, as required by fit()
x0_train = x0_train[:,np.newaxis]
linreg = linear_model.LinearRegression()
linreg.fit(x0_train,y_train)
y = linreg.predict(x0_test)
plt.scatter(x0_test,y_test,color='k')
plt.plot(x0_test,y,color='b',linewidth=3)
plt.show()


The output of the program is shown below, where the blue line represents the linear correlation between the ages of the patients and the disease progression:
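The slope and intercept of this single-feature model can also be read from the fitted object; these two print statements are an optional check, not part of the original program:

print(linreg.coef_)#slope of the fitted line (a single coefficient, for the age)
print(linreg.intercept_)#intercept of the fitted line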
We have 10 physiological factors within the diabetes dataset. Therefore, to have a more complete picture of the whole training set, we can perform a linear regression for every physiological feature, creating 10 models and examining the result of each through a linear chart.

See the following program:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets, linear_model

diabetes = datasets.load_diabetes()
x_train = diabetes.data[:-20]
y_train = diabetes.target[:-20]
x_test = diabetes.data[-20:]
y_test = diabetes.target[-20:]
plt.figure(figsize=(8,12))

linreg = linear_model.LinearRegression()

for f in range(0,10):
    # select the f-th physiological feature and reshape it to a column vector
    xi_test = x_test[:,f]
    xi_train = x_train[:,f]
    xi_test = xi_test[:,np.newaxis]
    xi_train = xi_train[:,np.newaxis]
    # fit a single-feature model and draw its chart in the f-th subplot
    linreg.fit(xi_train,y_train)
    y = linreg.predict(xi_test)
    plt.subplot(5,2,f+1)
    plt.scatter(xi_test,y_test,color='k')
    plt.plot(xi_test,y,color='b',linewidth=3)
plt.show() 
   

The output of the program shows 10 linear charts, each of which represents the correlation between a physiological factor and the progression of diabetes:
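To put a number next to each chart, we could also compute the score of every single-feature model, reusing the variables from the program above; this loop is a sketch added here for illustration and is not part of the original post:

for f in range(0,10):
    xi_train = x_train[:,f][:,np.newaxis]
    xi_test = x_test[:,f][:,np.newaxis]
    linreg.fit(xi_train,y_train)
    print('feature',f,'score:',linreg.score(xi_test,y_test))#R² of the single-feature model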
 

Here I am ending today’s post. Until we meet again, keep practicing and learning Python, as Python is easy to learn!