
Which is the best model for predicting car price from the dataset?

DaTALK 2021. 6. 5.

(1) import libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

(2) get the data

# path of data 
path = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-SkillsNetwork/labs/Data%20files/automobileEDA.csv'
df = pd.read_csv(path)
df.head()

(3) load the modules & create the linear regression object

from sklearn.linear_model import LinearRegression
lm = LinearRegression()

<1> Simple Linear Regression

 

- Simple Linear Regression = a method to help us understand the relationship between two variables: the predictor/independent variable (X) and the response/dependent variable (Y, the one we want to predict)

- the result of linear regression is a linear function that predicts the response (dependent) variable as a function of the predictor (independent) variable:

Yhat = a + b*X

a: the intercept / b: the slope

Q) How could highway-mpg help us predict car price?

- we will create a linear function with "highway-mpg" as the predictor variable and "price" as the response variable

 

(4) create two types of variables & fit the linear model

X = df[['highway-mpg']]
Y = df['price']

lm.fit(X,Y)

(5) the value of the intercept & the slope

lm.intercept_
#prints 38423.305858157386

lm.coef_
#prints array([-821.73337832])

 * the FINAL estimated linear model: price = 38423.31 - 821.73 * highway-mpg
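
- as a quick sanity check (a sketch, assuming the X and lm objects from the cells above), predicting manually with Yhat = a + b*X should match lm.predict:

# manual prediction using the learned parameters
a = lm.intercept_
b = lm.coef_[0]
manual_yhat = a + b * X['highway-mpg']

# should agree with the model's own predictions
print(np.allclose(manual_yhat, lm.predict(X)))  # True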

<2> Multiple Linear Regression

 

- very similar to Simple Linear Regression

- this method is used to explain the relationship between one continuous response (dependent) variable & two or more predictor (independent) variables

- most real-world regression models involve multiple predictors

the equation is given by:

Yhat = a + b1*X1 + b2*X2 + b3*X3 + b4*X4

(a: the intercept / b1 ~ b4: the coefficients of the four predictor variables)

(4) ~ (5) create the predictor variables, fit the linear model & check the intercept and the coefficients

Z = df[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']]
lm.fit(Z, df['price'])
lm.intercept_
#prints -15806.624626329198
lm.coef_
#prints array([53.49574423,  4.70770099, 81.53026382, 36.05748882])

 * the FINAL estimated linear model: price = -15806.62 + 53.50 * horsepower + 4.71 * curb-weight + 81.53 * engine-size + 36.06 * highway-mpg
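
- a small sketch (assuming the Z and lm objects above) that pairs each coefficient with its feature name, so the fitted equation can be read off directly:

# pair each predictor with its learned coefficient
for name, coef in zip(Z.columns, lm.coef_):
    print(f'{name}: {coef:.2f}')
print(f'intercept: {lm.intercept_:.2f}')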

<3> Model evaluation using visualization

# import the visualization package: seaborn
import seaborn as sns
%matplotlib inline 

(6) Regression plot

- gives a reasonable estimate of the relationship between the two variables: the strength of the correlation, as well as its direction (positive or negative)

 

* visualize highway-mpg as potential predictor variable of price

width = 12
height = 10
plt.figure(figsize=(width, height))
sns.regplot(x="highway-mpg", y="price", data=df)
plt.ylim(0,)

- price is negatively correlated with highway-mpg (the regression slope is negative)

- how scattered the data points are around the regression line is a good indication of the variance of the data, and of whether a linear model would be the best fit or not

- if the data points are too far off from the line, this linear model might not be the best model for this data
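
- to put a number on that strength (a sketch, assuming df from above), the Pearson correlation coefficient is a quick check:

# Pearson correlation between highway-mpg and price (negative = inverse relationship)
print(df[['highway-mpg', 'price']].corr())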

 

(7) Residual Plot

 

- If the points in a residual plot are randomly spread out around the x-axis, then a linear model is appropriate for the data. Why? Randomly spread residuals mean that the variance is constant, and thus the linear model is a good fit for this data.

width = 12
height = 10
plt.figure(figsize=(width, height))
sns.residplot(x=df['highway-mpg'], y=df['price'])  # keyword arguments; positional data arguments are deprecated in newer seaborn
plt.show()

-> the residuals are not randomly spread around the x-axis, which suggests that a non-linear model may be more appropriate for this data
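
- one way to probe this (a sketch, assuming df from above): residplot accepts an order parameter, so we can look at the residuals of a higher-order fit and see whether the structure disappears:

# residuals after fitting a 2nd-order polynomial; a flatter, more random band
# would suggest the curvature was the problem
sns.residplot(x=df['highway-mpg'], y=df['price'], order=2)
plt.show()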

 

(8) Distribution Plot

Q) Visualizing a model for Multiple Linear Regression

A) Looking at the distribution plot

Y_hat = lm.predict(Z)

plt.figure(figsize=(width, height))


ax1 = sns.distplot(df['price'], hist=False, color="r", label="Actual Value")
sns.distplot(Y_hat, hist=False, color="b", label="Fitted Values" , ax=ax1)


plt.title('Actual vs Fitted Values for Price')
plt.xlabel('Price (in dollars)')
plt.ylabel('Proportion of Cars')

plt.show()
plt.close()
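
- note: distplot is deprecated in recent seaborn releases; a minimal sketch of the same comparison using kdeplot instead:

# kernel density estimates of actual vs fitted prices (current seaborn API)
ax1 = sns.kdeplot(df['price'], color="r", label="Actual Value")
sns.kdeplot(Y_hat, color="b", label="Fitted Values", ax=ax1)
plt.legend()
plt.show()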

<4> Polynomial Regression

- We saw earlier that a linear model did not provide the best fit when using highway-mpg as the predictor variable. Let's try fitting a polynomial model to the data instead

def PlotPolly(model, independent_variable, dependent_variable, Name):
    # evaluate the fitted polynomial on a smooth grid across the mpg range
    x_new = np.linspace(15, 55, 100)
    y_new = model(x_new)

    plt.plot(independent_variable, dependent_variable, '.', x_new, y_new, '-')
    plt.title('Polynomial Fit with Matplotlib for Price ~ ' + Name)
    ax = plt.gca()
    ax.set_facecolor((0.898, 0.898, 0.898))
    plt.xlabel(Name)
    plt.ylabel('Price of Cars')

    plt.show()
    plt.close()

- we fit a cubic (3rd-order) polynomial

x = df['highway-mpg']
y = df['price']

# Here we use a polynomial of the 3rd order (cubic) 
f = np.polyfit(x, y, 3)
p = np.poly1d(f)
print(p)

PlotPolly(p, x, y, 'highway-mpg')

-> this polynomial model performs better than the linear model. This is because the generated polynomial function "hits" more of the data points.

 

<5> Pipelines

= data pipelines simplify the steps of processing data

- we use the module Pipeline to create a pipeline & we also use StandardScaler as a step in the pipeline

 

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

- We create the pipeline, by creating a list of tuples including the name of the model or estimator and its corresponding constructor.

Input = [('scale', StandardScaler()),
         ('polynomial', PolynomialFeatures(include_bias=False)),
         ('model', LinearRegression())]

- we input the list as an argument to the pipeline constructor

pipe=Pipeline(Input)

- We can normalize the data, perform a transform and fit the model simultaneously

pipe.fit(Z, y)  # y = df['price'], defined in the polynomial section above
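
- the fitted pipeline then scales, transforms, and predicts in a single call (a sketch, assuming the pipe and Z objects above):

# the pipeline applies the scaler and the polynomial transform before predicting
ypipe = pipe.predict(Z)
print(ypipe[0:4])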

<6> Measures for In-Sample Evaluation

 

(9) R-squared & MSE

 

1) Simple Linear Regression

#highway_mpg_fit
lm.fit(X, Y)
# Find the R^2
print('The R-square is: ', lm.score(X, Y))

-> The R-square is:  0.4965911884339175

(We can say that ~ 49.659% of the variation of the price is explained by this simple linear model "highway_mpg_fit")

Yhat = lm.predict(X)
print('The output of the first four predicted value is: ', Yhat[0:4])

-> The output of the first four predicted value is:  [16236.50464347 16236.50464347 17058.23802179 13771.3045085]

from sklearn.metrics import mean_squared_error

- we compare the predicted results with the actual results, and get MSE

mse = mean_squared_error(df['price'], Yhat)
print('The mean square error of price and predicted value is: ', mse)

-> The mean square error of price and predicted value is:  31635042.944639895

 

2) Multiple Linear Regression

# fit the model 
lm.fit(Z, df['price'])
# Find the R^2
print('The R-square is: ', lm.score(Z, df['price']))

Y_predict_multifit = lm.predict(Z)

print('The mean square error of price and predicted value using multifit is: ', \
      mean_squared_error(df['price'], Y_predict_multifit))

 

-> The R-square is: 0.8093562806577457

(We can say that ~ 80.936 % of the variation of price is explained by this multiple linear regression "multifit".)

-> The mean square error of price and predicted value using multifit is: 11980366.87072649

 

3) Polynomial Fit

from sklearn.metrics import r2_score

r_squared = r2_score(y, p(x))
print('The R-square value is: ', r_squared)

print('The mean square error of price and predicted value is: ', mean_squared_error(df['price'], p(x)))

-> The R-square value is: 0.6741946663906513

-> The mean square error of price and predicted value is:  20474146.42636125

 

<7> Prediction & Decision Making

 

(10) Prediction

import matplotlib.pyplot as plt
import numpy as np

%matplotlib inline 

# candidate highway-mpg values from 1 to 99, reshaped to a single column for sklearn
new_input = np.arange(1, 100, 1).reshape(-1, 1)

# refit the simple model on highway-mpg vs price
lm.fit(X, Y)

# predict the price for each new input value
yhat = lm.predict(new_input)

plt.plot(new_input, yhat)
plt.show()
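
- for a single hypothetical car, say one that gets 30 mpg on the highway (an assumed example value), the same model gives:

# predicted price for a car with highway-mpg = 30
# (matches a + b*30 = 38423.31 - 821.73*30, roughly 13771)
print(lm.predict(np.array([[30]])))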

(11) Decision Making - Determining a Good Model Fit

 

* When comparing models, the model with the higher R-squared value is a better fit for the data.

* When comparing models, the model with the lower MSE value is a better fit for the data.

 

<Simple Linear Regression Model (SLR) vs. Multiple Linear Regression Model (MLR)>

- Usually, the more variables you have, the better your model is at predicting, but this is not always true. Sometimes you may not have enough data, you may run into numerical problems, or many of the variables may not be useful, or may even act as noise. As a result, you should always check the MSE and R^2
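
- pulling the numbers reported above into one place (a sketch; the values are copied from the outputs of the earlier cells):

# in-sample metrics collected from the outputs above
results = {
    'SLR':        {'R2': 0.4966, 'MSE': 31635042.94},
    'MLR':        {'R2': 0.8094, 'MSE': 11980366.87},
    'Polynomial': {'R2': 0.6742, 'MSE': 20474146.43},
}

# higher R^2 and lower MSE both point to the better in-sample fit
best = max(results, key=lambda name: results[name]['R2'])
print(best)  # MLR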

 

** MLR is a better model than SLR in this case: it has a higher R^2 (0.809 vs 0.497) and a lower MSE (about 1.20 x 10^7 vs 3.16 x 10^7).

- MLR also beats the polynomial fit on both measures (R^2 0.809 vs 0.674, MSE 1.20 x 10^7 vs 2.05 x 10^7).

 

A) Comparing these three models, we conclude that the MLR model is the best model for predicting price from our dataset. This result makes sense: we have 27 variables in total, and we know that more than one of them is a potential predictor of the final car price.
