Projects

Analyzing US Economic Data - House Sales in King County, USA

DaTALK 2021. 6. 28.

1. Import libraries

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler,PolynomialFeatures
%matplotlib inline

2. Import the data and show the first five rows

# importing the data
file_name='https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/coursera/project/kc_house_data_NaN.csv'
df=pd.read_csv(file_name)
df.head()

* Question 1) Display the data types of each column using the attribute dtypes

df.dtypes

* Question 2) Drop the columns "id" and "Unnamed: 0" from axis 1 using the method drop(), then use the method describe() to obtain a statistical summary of the data.

df.drop(columns = ['Unnamed: 0','id'],inplace = True)
df.describe()

* Question 3) Use the method value_counts to count the number of houses with unique floor values, then use the method .to_frame() to convert the result to a dataframe.

df['floors'].value_counts().to_frame()
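As a minimal sketch of what value_counts().to_frame() produces (on a small synthetic 'floors' series, not the real dataset), the counts come back sorted in descending order, indexed by the unique values:

```python
import pandas as pd

# Synthetic stand-in for the 'floors' column
s = pd.Series([1.0, 2.0, 1.0, 3.0, 1.0, 2.0], name="floors")
counts = s.value_counts().to_frame()  # DataFrame indexed by floor value, sorted by count
print(counts.index[0])   # 1.0 -> the most common floor value
print(counts.iloc[0, 0]) # 3   -> it appears three times
```

Note that in recent pandas versions the resulting column is named "count"; in older versions it inherits the series name.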

3. Check for null values and convert them into 0
(alternatively, you can replace them with the mean of the remaining data, or simply drop those rows; the choice is yours)

df.isnull().sum()

df['bedrooms'] = df['bedrooms'].fillna(0)
df['bathrooms'] = df['bathrooms'].fillna(0)
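The alternative mentioned above, filling with the mean of the remaining data, looks like this as a minimal sketch on a synthetic single-column frame (not the real dataset):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a column with a missing value
toy = pd.DataFrame({"bedrooms": [2.0, 4.0, np.nan]})
mean_val = toy["bedrooms"].mean()            # NaN is skipped: (2 + 4) / 2 = 3.0
toy["bedrooms"] = toy["bedrooms"].fillna(mean_val)
print(toy["bedrooms"].tolist())              # [2.0, 4.0, 3.0]
```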

* Question 4) Use the function boxplot in the seaborn library to produce a plot that can be used to determine whether houses with a waterfront view or without a waterfront view have more price outliers.

sns.boxplot(x='waterfront',y='price',data=df)

-> Houses without a waterfront view have more price outliers.
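A boxplot flags points beyond Q3 + 1.5*IQR (or below Q1 - 1.5*IQR) as outliers, so the same conclusion can be checked numerically. A minimal sketch on a synthetic price series (not the real dataset):

```python
import pandas as pd

# Synthetic price series with one obvious outlier
prices = pd.Series([100, 110, 120, 130, 140, 500])
q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
# Points outside the 1.5*IQR whiskers are the boxplot's outliers
outliers = prices[(prices < q1 - 1.5 * iqr) | (prices > q3 + 1.5 * iqr)]
print(len(outliers))  # 1 -> only the 500 value
```

Applying the same filter to each waterfront group of the real data would give the per-group outlier counts behind the boxplot.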

 

* Question 5) Use the function regplot in the seaborn library to determine if the feature sqft_above is negatively or positively correlated with price.

sns.regplot(x='sqft_above',y='price',data=df)

-> The feature sqft_above is positively correlated with price.
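The direction of the regplot's fitted line can also be confirmed with the Pearson correlation coefficient. A minimal sketch on synthetic, clearly increasing data (the values are made up, not from the dataset):

```python
import pandas as pd

# Synthetic stand-in: price rises with sqft_above
toy = pd.DataFrame({"sqft_above": [800, 1200, 1600, 2000],
                    "price":      [150, 210, 260, 330]})
r = toy["sqft_above"].corr(toy["price"])  # Pearson correlation
print(r > 0)  # True -> positively correlated
```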

 

* Question 6) Fit a linear regression model to predict the price using the feature 'sqft_living' then calculate the R^2.

from sklearn.linear_model import LinearRegression
lm = LinearRegression()

X = df[['sqft_living']]
Y = df['price']
lm.fit(X,Y)
lm.score(X,Y)

ans) 0.4928532179037931
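For reference, lm.score returns the coefficient of determination R^2 = 1 - SS_res / SS_tot. A minimal sketch verifying this identity on synthetic data (the ranges are arbitrary, not from the dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic living-area-like feature and noisy linear target
rng = np.random.default_rng(0)
X = rng.uniform(500, 4000, size=(50, 1))
Y = 300.0 * X[:, 0] + rng.normal(0, 50_000, size=50)

lm = LinearRegression().fit(X, Y)
pred = lm.predict(X)
ss_res = ((Y - pred) ** 2).sum()           # residual sum of squares
ss_tot = ((Y - Y.mean()) ** 2).sum()       # total sum of squares
print(np.isclose(lm.score(X, Y), 1 - ss_res / ss_tot))  # True
```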

 

* Question 7) Fit a linear regression model to predict the 'price' using the list of features: floors, waterfront, lat, bedrooms, sqft_basement, view, bathrooms, sqft_living15, sqft_above, grade, sqft_living

features = ["floors", "waterfront","lat" ,"bedrooms" ,"sqft_basement" ,"view" ,"bathrooms","sqft_living15","sqft_above","grade","sqft_living"]

# after filling the NaN values in the 'bedrooms' & 'bathrooms' columns with 0 above..
X2 = df[features]
Y2 = df['price']
lm.fit(X2,Y2)
lm.score(X2,Y2)

ans) 0.6577163489341582
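With a multi-feature fit, lm.coef_ lines up position-by-position with the feature columns, which is handy for inspecting the model. A minimal sketch on synthetic data ('a' and 'b' are hypothetical feature names, not columns of the dataset):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Noiseless synthetic target so the recovered coefficients are exact
rng = np.random.default_rng(1)
toy = pd.DataFrame({"a": rng.normal(size=100), "b": rng.normal(size=100)})
y = 2.0 * toy["a"] - 3.0 * toy["b"]

lm = LinearRegression().fit(toy, y)
coefs = pd.Series(lm.coef_, index=toy.columns)  # coefficient per feature name
print(coefs.round(6).to_dict())  # approximately {'a': 2.0, 'b': -3.0}
```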

 

* Question 8) Create a pipeline object that scales the data, performs a polynomial transform, and fits a linear regression model. Fit the pipeline using the features in the question above, then calculate the R^2.

Input=[('scale',StandardScaler()),('polynomial', PolynomialFeatures()),('model',LinearRegression())]
pipe=Pipeline(Input)
pipe.fit(X2,Y2)
pipe.score(X2,Y2)

ans) 0.7513456626638001
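The pipeline is just shorthand for applying the steps in order; chaining them manually gives the same score. A minimal sketch on synthetic data with a non-linear target (shapes and values are made up):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(80, 3))
y = X[:, 0] * X[:, 1] + X[:, 2] ** 2   # target the polynomial terms can capture

pipe = Pipeline([("scale", StandardScaler()),
                 ("polynomial", PolynomialFeatures()),
                 ("model", LinearRegression())])
pipe.fit(X, y)

# The same three steps applied by hand
Xs = StandardScaler().fit_transform(X)
Xp = PolynomialFeatures().fit_transform(Xs)
manual = LinearRegression().fit(Xp, y)
print(np.isclose(pipe.score(X, y), manual.score(Xp, y)))  # True
```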

 

* Question 9) Create and fit a Ridge regression object using the training data, setting the regularization parameter to 0.1 and calculate the R^2 using the test data.

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split

# set the test size to 30% of the entire samples
x_train, x_test, y_train, y_test = train_test_split(X2, Y2, test_size=0.3, random_state=1)

print("number of test samples :", x_test.shape[0])
print("number of training samples:", x_train.shape[0])

from sklearn.linear_model import Ridge

RidgeModel = Ridge(alpha=0.1) 
RidgeModel.fit(x_train, y_train)
RidgeModel.score(x_test, y_test)

number of test samples : 6484
number of training samples: 15129

ans) 0.6505480504085062
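The regularization parameter alpha controls how strongly Ridge shrinks the coefficients toward zero; alpha=0.1 is a very light penalty. A minimal sketch of the shrinkage effect on synthetic data (the coefficients and alpha values are arbitrary):

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic linear data with known coefficients
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))
y = X @ np.array([5.0, -3.0, 2.0, 1.0]) + rng.normal(0, 0.5, size=100)

small = Ridge(alpha=0.1).fit(X, y)     # light penalty, near-OLS coefficients
large = Ridge(alpha=1000.0).fit(X, y)  # heavy penalty, strong shrinkage
print(np.abs(large.coef_).sum() < np.abs(small.coef_).sum())  # True
```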

 

* Question 10) Perform a second-order polynomial transform on both the training data and the testing data. Create and fit a Ridge regression object using the training data, setting the regularization parameter to 0.1. Calculate the R^2 using the test data.

pr = PolynomialFeatures(degree=2)
# fit_transform on the training data only; the test data is only transformed
x_train_pr = pr.fit_transform(x_train[features])
x_test_pr = pr.transform(x_test[features])

RidgeModel = Ridge(alpha=0.1) 
RidgeModel.fit(x_train_pr, y_train)
RidgeModel.score(x_test_pr, y_test)

= Performed the second-order polynomial transform only on the x_train & x_test data
(not on y_train & y_test, since the polynomial transform applies only to the feature data, x)

ans) 0.7475192573736549
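The fit-on-train, transform-on-test pattern above matters in general: a fitted transform must only learn its parameters from the training data. For PolynomialFeatures specifically, fit just records the number of input features, so fit_transform and transform give identical output; the habit still pays off with transforms like StandardScaler. A minimal sketch with tiny synthetic arrays:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two tiny synthetic splits with 2 features each
x_train = np.array([[1.0, 2.0], [3.0, 4.0]])
x_test = np.array([[5.0, 6.0]])

pr = PolynomialFeatures(degree=2)
x_train_pr = pr.fit_transform(x_train)  # fit + transform on the training split
x_test_pr = pr.transform(x_test)        # transform only on the test split
print(x_test_pr.shape)  # (1, 6): columns 1, x1, x2, x1^2, x1*x2, x2^2
```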

 

 

 

 
