
GitHub Repository: DanielBarnes18/IBM-Data-Science-Professional-Certificate
Path: blob/main/07. Data Analysis with Python/Final Assignment - House Price Predictions.ipynb
Kernel: Python 3.7

Data Analysis with Python

House Sales in King County, USA

This dataset contains house sale prices for King County, which includes Seattle. It includes homes sold between May 2014 and May 2015.

| Variable | Description |
| --- | --- |
| id | A notation for a house |
| date | Date house was sold |
| price | Price (prediction target) |
| bedrooms | Number of bedrooms |
| bathrooms | Number of bathrooms |
| sqft_living | Square footage of the home |
| sqft_lot | Square footage of the lot |
| floors | Total floors (levels) in house |
| waterfront | House which has a view to a waterfront |
| view | Has been viewed |
| condition | How good the condition is overall |
| grade | Overall grade given to the housing unit, based on King County grading system |
| sqft_above | Square footage of house apart from basement |
| sqft_basement | Square footage of the basement |
| yr_built | Built year |
| yr_renovated | Year when house was renovated |
| zipcode | Zip code |
| lat | Latitude coordinate |
| long | Longitude coordinate |
| sqft_living15 | Living room area in 2015 (implies some renovations); this might or might not have affected the lot size area |
| sqft_lot15 | Lot size area in 2015 (implies some renovations) |

You will require the following libraries:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression
%matplotlib inline

Module 1: Importing Data Sets

Load the csv:

file_name = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-SkillsNetwork/labs/FinalModule_Coursera/data/kc_house_data_NaN.csv'
df = pd.read_csv(file_name)

We use the method head to display the first 5 rows of the dataframe.

df.head()

Question 1

Display the data types of each column using the attribute dtypes, then take a screenshot and submit it; include your code in the image.

df.dtypes
Unnamed: 0         int64
id                 int64
date              object
price            float64
bedrooms         float64
bathrooms        float64
sqft_living        int64
sqft_lot           int64
floors           float64
waterfront         int64
view               int64
condition          int64
grade              int64
sqft_above         int64
sqft_basement      int64
yr_built           int64
yr_renovated       int64
zipcode            int64
lat              float64
long             float64
sqft_living15      int64
sqft_lot15         int64
dtype: object

We use the method describe to obtain a statistical summary of the dataframe.

df.describe()

Module 2: Data Wrangling

Question 2

Drop the columns "id" and "Unnamed: 0" from axis 1 using the method drop(), then use the method describe() to obtain a statistical summary of the data. Take a screenshot and submit it; make sure the inplace parameter is set to True.

df.drop(['id', 'Unnamed: 0'], axis=1, inplace=True)
df.describe()

We can see we have missing values for the columns bedrooms and bathrooms.

print("number of NaN values for the column bedrooms :", df['bedrooms'].isnull().sum())
print("number of NaN values for the column bathrooms :", df['bathrooms'].isnull().sum())
number of NaN values for the column bedrooms : 13
number of NaN values for the column bathrooms : 10

We can replace the missing values of the column 'bedrooms' with the mean of the column 'bedrooms' using the method replace(). Don't forget to set the inplace parameter to True

mean = df['bedrooms'].mean()
df['bedrooms'].replace(np.nan, mean, inplace=True)

We also replace the missing values of the column 'bathrooms' with the mean of the column 'bathrooms' using the method replace(). Don't forget to set the inplace parameter to True.

mean = df['bathrooms'].mean()
df['bathrooms'].replace(np.nan, mean, inplace=True)

print("number of NaN values for the column bedrooms :", df['bedrooms'].isnull().sum())
print("number of NaN values for the column bathrooms :", df['bathrooms'].isnull().sum())
number of NaN values for the column bedrooms : 0
number of NaN values for the column bathrooms : 0
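Note: an equivalent, more idiomatic way to impute the means is fillna(). The sketch below uses a small synthetic dataframe (not the housing data) purely to illustrate the pattern:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in with the same column names as the lab's dataframe
demo = pd.DataFrame({"bedrooms":  [3.0, np.nan, 2.0, np.nan],
                     "bathrooms": [1.0, 2.0, np.nan, 1.5]})

# Replace NaN in each column with that column's mean
for col in ["bedrooms", "bathrooms"]:
    demo[col] = demo[col].fillna(demo[col].mean())

print(demo.isnull().sum().sum())  # 0 -- no missing values remain
```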

Module 3: Exploratory Data Analysis

Question 3

Use the method value_counts to count the number of houses with unique floor values, use the method .to_frame() to convert it to a dataframe.

df['floors'].value_counts().to_frame()

Question 4

Use the function boxplot in the seaborn library to determine whether houses with a waterfront view or without a waterfront view have more price outliers.

sns.boxplot(x=df['waterfront'], y=df['price'])
<matplotlib.axes._subplots.AxesSubplot at 0x7feb201d0a10>
Image in a Jupyter notebook

Question 5

Use the function regplot in the seaborn library to determine if the feature sqft_above is negatively or positively correlated with price.

sns.regplot(x=df['sqft_above'], y=df['price'])
<matplotlib.axes._subplots.AxesSubplot at 0x7feb1b964950>
Image in a Jupyter notebook

We can use the Pandas method corr() to find the feature other than price that is most correlated with price.

df.corr()['price'].sort_values()
zipcode         -0.053203
long             0.021626
condition        0.036362
yr_built         0.054012
sqft_lot15       0.082447
sqft_lot         0.089661
yr_renovated     0.126434
floors           0.256794
waterfront       0.266369
lat              0.307003
bedrooms         0.308797
sqft_basement    0.323816
view             0.397293
bathrooms        0.525738
sqft_living15    0.585379
sqft_above       0.605567
grade            0.667434
sqft_living      0.702035
price            1.000000
Name: price, dtype: float64

Module 4: Model Development

We can fit a linear regression model using the longitude feature 'long' and calculate the R^2.

X = df[['long']]
Y = df['price']
lm = LinearRegression()
lm.fit(X, Y)
lm.score(X, Y)
0.00046769430149007363
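As a reminder, lm.score returns the coefficient of determination, R^2 = 1 - SS_res / SS_tot. A tiny check on synthetic data (illustrative values, not the housing set) confirms the two agree:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Small synthetic regression: y is roughly linear in x
x = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([1.1, 1.9, 3.2, 3.8])

m = LinearRegression().fit(x, y)

# Compute R^2 by hand from the residual and total sums of squares
ss_res = np.sum((y - m.predict(x)) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)

print(np.isclose(m.score(x, y), 1 - ss_res / ss_tot))  # True
```

This is why the 'long' model's score near zero means longitude alone explains almost none of the variance in price.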

Question 6

Fit a linear regression model to predict the 'price' using the feature 'sqft_living' then calculate the R^2. Take a screenshot of your code and the value of the R^2.

X1 = df[['sqft_living']]
lm.fit(X1, Y)
lm.score(X1, Y)
0.4928532179037931

Question 7

Fit a linear regression model to predict the 'price' using the list of features:

features = ["floors", "waterfront", "lat", "bedrooms", "sqft_basement", "view",
            "bathrooms", "sqft_living15", "sqft_above", "grade", "sqft_living"]

Then calculate the R^2. Take a screenshot of your code.

lm.fit(df[features], Y)
lm.score(df[features], Y)
0.657679183672129

This will help with Question 8

Create a list of tuples, where the first element of each tuple is the name of the estimator:

'scale'

'polynomial'

'model'

and the second element is the model constructor:

StandardScaler()

PolynomialFeatures(include_bias=False)

LinearRegression()

Input = [('scale', StandardScaler()),
         ('polynomial', PolynomialFeatures(include_bias=False)),
         ('model', LinearRegression())]

Question 8

Use the list to create a pipeline object to predict the 'price', fit the object using the features in the list features, and calculate the R^2.

Pipe = Pipeline(Input)
Pipe.fit(df[features], Y)
Pipe.score(df[features], Y)
0.7513408553309376

Module 5: Model Evaluation and Refinement

Import the necessary modules:

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
print("done")
done

We will split the data into training and testing sets:

features = ["floors", "waterfront", "lat", "bedrooms", "sqft_basement", "view",
            "bathrooms", "sqft_living15", "sqft_above", "grade", "sqft_living"]
X = df[features]
Y = df['price']
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.15, random_state=1)
print("number of test samples:", x_test.shape[0])
print("number of training samples:", x_train.shape[0])
number of test samples: 3242
number of training samples: 18371

Question 9

Create and fit a Ridge regression object using the training data, set the regularization parameter to 0.1, and calculate the R^2 using the test data.

from sklearn.linear_model import Ridge

RM = Ridge(alpha=0.1)
RM.fit(x_train, y_train)
RM.score(x_test, y_test)
0.6478759163939122
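The cross_val_score function imported above can also estimate Ridge performance more robustly than a single train/test split, by averaging R^2 over several folds. A minimal sketch on synthetic data (standing in for the housing features, so it runs on its own):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression data: 200 samples, 3 features, known linear signal
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# One R^2 score per fold; the mean is the cross-validated estimate
scores = cross_val_score(Ridge(alpha=0.1), X_demo, y_demo, cv=4)
print(scores.mean())
```

Replacing X_demo and y_demo with df[features] and df['price'] would give a cross-validated counterpart to the single-split score above.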

Question 10

Perform a second-order polynomial transform on both the training data and testing data. Create and fit a Ridge regression object using the training data, set the regularization parameter to 0.1, and calculate the R^2 using the test data. Take a screenshot of your code and the R^2.

pr = PolynomialFeatures(degree=2)
x_train_pr = pr.fit_transform(x_train)
# Use transform (not fit_transform) on the test set, so it is projected
# with the transformer fitted on the training data
x_test_pr = pr.transform(x_test)
RM = Ridge(alpha=0.1)
RM.fit(x_train_pr, y_train)
RM.score(x_test_pr, y_test)
0.7002744279896707