
GitHub Repository: DanielBarnes18/IBM-Data-Science-Professional-Certificate
Path: blob/main/07. Data Analysis with Python/Final Assignment - House Price Predictions.ipynb
Kernel: Python 3.7

Data Analysis with Python

House Sales in King County, USA

This dataset contains house sale prices for King County, which includes Seattle. It includes homes sold between May 2014 and May 2015.

| Variable | Description |
| --- | --- |
| id | A notation for a house |
| date | Date house was sold |
| price | Price (prediction target) |
| bedrooms | Number of bedrooms |
| bathrooms | Number of bathrooms |
| sqft_living | Square footage of the home |
| sqft_lot | Square footage of the lot |
| floors | Total floors (levels) in house |
| waterfront | House which has a view to a waterfront |
| view | Has been viewed |
| condition | How good the condition is overall |
| grade | Overall grade given to the housing unit, based on King County grading system |
| sqft_above | Square footage of house apart from basement |
| sqft_basement | Square footage of the basement |
| yr_built | Built year |
| yr_renovated | Year when house was renovated |
| zipcode | Zip code |
| lat | Latitude coordinate |
| long | Longitude coordinate |
| sqft_living15 | Living room area in 2015 (implies some renovations); this might or might not have affected the lot size area |
| sqft_lot15 | Lot size area in 2015 (implies some renovations) |

You will require the following libraries:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression
%matplotlib inline

Module 1: Importing Data Sets

Load the csv:

file_name = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-SkillsNetwork/labs/FinalModule_Coursera/data/kc_house_data_NaN.csv'
df = pd.read_csv(file_name)

We use the method head to display the first 5 rows of the dataframe.

df.head()

Question 1

Display the data types of each column using the attribute dtypes, then take a screenshot and submit it; include your code in the image.

df.dtypes
Unnamed: 0         int64
id                 int64
date              object
price            float64
bedrooms         float64
bathrooms        float64
sqft_living        int64
sqft_lot           int64
floors           float64
waterfront         int64
view               int64
condition          int64
grade              int64
sqft_above         int64
sqft_basement      int64
yr_built           int64
yr_renovated       int64
zipcode            int64
lat              float64
long             float64
sqft_living15      int64
sqft_lot15         int64
dtype: object

We use the method describe to obtain a statistical summary of the dataframe.

df.describe()

Module 2: Data Wrangling

Question 2

Drop the columns "id" and "Unnamed: 0" from axis 1 using the method drop(), then use the method describe() to obtain a statistical summary of the data. Take a screenshot and submit it; make sure the inplace parameter is set to True.

df.drop(['id', 'Unnamed: 0'], axis=1, inplace=True)
df.describe()

We can see we have missing values for the columns bedrooms and bathrooms.

print("number of NaN values for the column bedrooms :", df['bedrooms'].isnull().sum())
print("number of NaN values for the column bathrooms :", df['bathrooms'].isnull().sum())
number of NaN values for the column bedrooms : 13
number of NaN values for the column bathrooms : 10

We can replace the missing values of the column 'bedrooms' with the mean of the column 'bedrooms' using the method replace(). Don't forget to set the inplace parameter to True

mean = df['bedrooms'].mean()
df['bedrooms'].replace(np.nan, mean, inplace=True)

We also replace the missing values of the column 'bathrooms' with the mean of the column 'bathrooms' using the method replace(). Don't forget to set the inplace parameter to True.

mean = df['bathrooms'].mean()
df['bathrooms'].replace(np.nan, mean, inplace=True)

print("number of NaN values for the column bedrooms :", df['bedrooms'].isnull().sum())
print("number of NaN values for the column bathrooms :", df['bathrooms'].isnull().sum())
number of NaN values for the column bedrooms : 0
number of NaN values for the column bathrooms : 0
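Note: an equivalent, more idiomatic way to impute the means is fillna(). The sketch below uses a small synthetic dataframe (not the housing data) purely to illustrate the pattern:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in with the same column names as the lab's dataframe
demo = pd.DataFrame({"bedrooms":  [3.0, np.nan, 2.0, np.nan],
                     "bathrooms": [1.0, 2.0, np.nan, 1.5]})

# Replace NaN in each column with that column's mean
for col in ["bedrooms", "bathrooms"]:
    demo[col] = demo[col].fillna(demo[col].mean())

print(demo.isnull().sum().sum())  # 0 -- no missing values remain
```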

Module 3: Exploratory Data Analysis

Question 3

Use the method value_counts to count the number of houses with unique floor values, use the method .to_frame() to convert it to a dataframe.

df['floors'].value_counts().to_frame()

Question 4

Use the function boxplot in the seaborn library to determine whether houses with a waterfront view or without a waterfront view have more price outliers.

sns.boxplot(x=df['waterfront'], y=df['price'])
<matplotlib.axes._subplots.AxesSubplot at 0x7feb201d0a10>
Image in a Jupyter notebook

Question 5

Use the function regplot in the seaborn library to determine if the feature sqft_above is negatively or positively correlated with price.

sns.regplot(x=df['sqft_above'], y=df['price'])
<matplotlib.axes._subplots.AxesSubplot at 0x7feb1b964950>
Image in a Jupyter notebook

We can use the Pandas method corr() to find the feature other than price that is most correlated with price.

df.corr()['price'].sort_values()
zipcode         -0.053203
long             0.021626
condition        0.036362
yr_built         0.054012
sqft_lot15       0.082447
sqft_lot         0.089661
yr_renovated     0.126434
floors           0.256794
waterfront       0.266369
lat              0.307003
bedrooms         0.308797
sqft_basement    0.323816
view             0.397293
bathrooms        0.525738
sqft_living15    0.585379
sqft_above       0.605567
grade            0.667434
sqft_living      0.702035
price            1.000000
Name: price, dtype: float64

Module 4: Model Development

We can fit a linear regression model using the longitude feature 'long' and calculate the R^2.

X = df[['long']]
Y = df['price']
lm = LinearRegression()
lm.fit(X, Y)
lm.score(X, Y)
0.00046769430149007363
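As a reminder, lm.score returns the coefficient of determination, R^2 = 1 - SS_res / SS_tot. A tiny check on synthetic data (illustrative values, not the housing set) confirms the two agree:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Small synthetic regression: y is roughly linear in x
x = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([1.1, 1.9, 3.2, 3.8])

m = LinearRegression().fit(x, y)

# Compute R^2 by hand from the residual and total sums of squares
ss_res = np.sum((y - m.predict(x)) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)

print(np.isclose(m.score(x, y), 1 - ss_res / ss_tot))  # True
```

This is why the 'long' model's score near zero means longitude alone explains almost none of the variance in price.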

Question 6

Fit a linear regression model to predict the 'price' using the feature 'sqft_living' then calculate the R^2. Take a screenshot of your code and the value of the R^2.

X1 = df[['sqft_living']]
lm.fit(X1, Y)
lm.score(X1, Y)
0.4928532179037931

Question 7

Fit a linear regression model to predict the 'price' using the list of features:

features = ["floors", "waterfront", "lat", "bedrooms", "sqft_basement", "view",
            "bathrooms", "sqft_living15", "sqft_above", "grade", "sqft_living"]

Then calculate the R^2. Take a screenshot of your code.

lm.fit(df[features], Y)
lm.score(df[features], Y)
0.657679183672129

This will help with Question 8

Create a list of tuples, where the first element of each tuple is the name of the estimator:

'scale'

'polynomial'

'model'

and the second element is the model constructor:

StandardScaler()

PolynomialFeatures(include_bias=False)

LinearRegression()

Input = [('scale', StandardScaler()),
         ('polynomial', PolynomialFeatures(include_bias=False)),
         ('model', LinearRegression())]

Question 8

Use the list to create a pipeline object to predict the 'price', fit the object using the features in the list features, and calculate the R^2.

Pipe = Pipeline(Input)
Pipe.fit(df[features], Y)
Pipe.score(df[features], Y)
0.7513408553309376

Module 5: Model Evaluation and Refinement

Import the necessary modules:

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
print("done")
done

We will split the data into training and testing sets:

features = ["floors", "waterfront", "lat", "bedrooms", "sqft_basement", "view",
            "bathrooms", "sqft_living15", "sqft_above", "grade", "sqft_living"]
X = df[features]
Y = df['price']
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.15, random_state=1)
print("number of test samples:", x_test.shape[0])
print("number of training samples:", x_train.shape[0])
number of test samples: 3242
number of training samples: 18371

Question 9

Create and fit a Ridge regression object using the training data, set the regularization parameter to 0.1, and calculate the R^2 using the test data.

from sklearn.linear_model import Ridge

RM = Ridge(alpha=0.1)
RM.fit(x_train, y_train)
RM.score(x_test, y_test)
0.6478759163939122
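The cross_val_score function imported above can also estimate Ridge performance more robustly than a single train/test split, by averaging R^2 over several folds. A minimal sketch on synthetic data (standing in for the housing features, so it runs on its own):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression data: 200 samples, 3 features, known linear signal
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# One R^2 score per fold; the mean is the cross-validated estimate
scores = cross_val_score(Ridge(alpha=0.1), X_demo, y_demo, cv=4)
print(scores.mean())
```

Replacing X_demo and y_demo with df[features] and df['price'] would give a cross-validated counterpart to the single-split score above.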

Question 10

Perform a second-order polynomial transform on both the training data and testing data. Create and fit a Ridge regression object using the training data, set the regularization parameter to 0.1, and calculate the R^2 using the test data. Take a screenshot of your code and the R^2.

pr = PolynomialFeatures(degree=2)
x_train_pr = pr.fit_transform(x_train)
# Use transform (not fit_transform) on the test set, so it is projected
# with the transformer fitted on the training data
x_test_pr = pr.transform(x_test)
RM = Ridge(alpha=0.1)
RM.fit(x_train_pr, y_train)
RM.score(x_test_pr, y_test)
0.7002744279896707