Real-time collaboration for Jupyter Notebooks, Linux Terminals, LaTeX, VS Code, R IDE, and more,
all in one place. Commercial Alternative to JupyterHub.
Path: blob/main/07. Data Analysis with Python/Final Assignment - House Price Predictions.ipynb
Data Analysis with Python
House Sales in King County, USA
This dataset contains house sale prices for King County, which includes Seattle. It includes homes sold between May 2014 and May 2015.
Variable | Description |
---|---|
id | Identifier for a house |
date | Date house was sold |
price | Sale price (the prediction target) |
bedrooms | Number of bedrooms |
bathrooms | Number of bathrooms |
sqft_living | Square footage of the home |
sqft_lot | Square footage of the lot |
floors | Total floors (levels) in house |
waterfront | Whether the house has a view of a waterfront |
view | Has been viewed |
condition | How good the condition is overall |
grade | Overall grade given to the housing unit, based on the King County grading system |
sqft_above | Square footage of the house, excluding the basement |
sqft_basement | Square footage of the basement |
yr_built | Year the house was built |
yr_renovated | Year the house was renovated |
zipcode | Zip code |
lat | Latitude coordinate |
long | Longitude coordinate |
sqft_living15 | Living room area in 2015 (implies some renovations); this may or may not have affected the lot size |
sqft_lot15 | Lot size in 2015 (implies some renovations) |
You will require the following libraries:
Module 1: Importing Data Sets
Load the csv:
We use the method head to display the first 5 rows of the dataframe.
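The loading cell is not shown here, so the following is only a minimal sketch of the two steps above. The inline CSV text is a small illustrative stand-in for the course file (an assumption, not the real data source), read through `io.StringIO` so the example is self-contained.

```python
import io

import pandas as pd

# Illustrative stand-in for the course CSV: a few rows and a subset of columns.
csv_text = """id,date,price,bedrooms,bathrooms,sqft_living,floors
1,20141013T000000,221900.0,3,1.0,1180,1.0
2,20141209T000000,538000.0,3,2.25,2570,2.0
3,20150225T000000,180000.0,2,1.0,770,1.0
4,20141209T000000,604000.0,4,3.0,1960,1.0
5,20150218T000000,510000.0,3,2.0,1680,1.0
6,20140512T000000,1225000.0,4,4.5,5420,1.0
"""

# read_csv accepts a path, a URL, or a file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# head() returns the first 5 rows by default.
first_rows = df.head()
print(first_rows)
```

With the real file, the `io.StringIO(...)` argument would be replaced by the path or URL of the CSV.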
Question 1
Display the data types of each column using the attribute dtypes, then take a screenshot and submit it. Include your code in the image.
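As a rough illustration of how `dtypes` behaves, here is a sketch on a tiny hand-built dataframe (the rows are illustrative, not the assignment data). Note that `dtypes` is an attribute, so it is used without parentheses.

```python
import pandas as pd

# Tiny illustrative frame; the real dataframe has many more columns.
df = pd.DataFrame({
    "date": ["20141013T000000", "20141209T000000"],
    "price": [221900.0, 538000.0],
    "bedrooms": [3, 3],
})

# dtypes returns a Series mapping each column name to its data type.
column_types = df.dtypes
print(column_types)
```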
We use the method describe to obtain a statistical summary of the dataframe.
Module 2: Data Wrangling
Question 2
Drop the columns "id" and "Unnamed: 0" from axis 1 using the method drop(), then use the method describe() to obtain a statistical summary of the data. Take a screenshot and submit it; make sure the inplace parameter is set to True.
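The call pattern can be sketched on a toy dataframe as follows. The rows are illustrative only; the two dropped column names are the ones named in the question.

```python
import pandas as pd

# Illustrative rows; "Unnamed: 0" mimics a leftover index column from the CSV.
df = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "id": [7129300520, 6414100192, 5631500400],
    "price": [221900.0, 538000.0, 180000.0],
    "sqft_living": [1180, 2570, 770],
})

# axis=1 means "drop columns"; inplace=True modifies df and returns None.
df.drop(["id", "Unnamed: 0"], axis=1, inplace=True)

# describe() summarizes the remaining numeric columns.
summary = df.describe()
print(summary)
```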
We can see we have missing values for the columns bedrooms and bathrooms.
We can replace the missing values of the column 'bedrooms' with the mean of the column 'bedrooms' using the method replace(). Don't forget to set the inplace parameter to True.
We also replace the missing values of the column 'bathrooms' with the mean of the column 'bathrooms' using the method replace(). Don't forget to set the inplace parameter to True.
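A minimal sketch of mean imputation with `replace()` on toy data follows. One deliberate difference from the text: instead of `inplace=True` on a selected column, the result is assigned back to the column, because in-place modification of a column selection is unreliable in recent pandas versions. That substitution is a choice made for this sketch, not the notebook's original cell.

```python
import numpy as np
import pandas as pd

# Illustrative frame with one missing value per column.
df = pd.DataFrame({
    "bedrooms": [3.0, np.nan, 2.0, 4.0],
    "bathrooms": [1.0, 2.0, np.nan, 3.0],
})

for column in ["bedrooms", "bathrooms"]:
    column_mean = df[column].mean()  # mean() skips NaN by default
    # Assign back rather than using inplace=True on the column selection.
    df[column] = df[column].replace(np.nan, column_mean)

# Count the remaining missing values per column.
print(df.isnull().sum())
```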
Module 3: Exploratory Data Analysis
Question 3
Use the method value_counts to count the number of houses with unique floor values, and use the method .to_frame() to convert it to a dataframe.
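The chaining of the two methods might look like this on a toy column (the floor values are illustrative):

```python
import pandas as pd

# Illustrative floor values.
df = pd.DataFrame({"floors": [1.0, 2.0, 1.0, 1.5, 2.0, 1.0]})

# value_counts() returns a Series sorted by count (descending);
# to_frame() turns that Series into a one-column dataframe.
floor_counts = df["floors"].value_counts().to_frame()
print(floor_counts)
```

The name of the resulting column depends on the pandas version, so it is safer to refer to it by position than by name.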
Question 4
Use the function boxplot in the seaborn library to determine whether houses with a waterfront view or without a waterfront view have more price outliers.
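A sketch of the plotting call on toy data is shown below. The prices are illustrative and say nothing about the real answer; the `matplotlib.use("Agg")` line is only there so the sketch also runs outside a notebook and can be dropped in Jupyter.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; not needed inside Jupyter

import pandas as pd
import seaborn as sns

# Illustrative data: 0 = no waterfront view, 1 = waterfront view.
df = pd.DataFrame({
    "waterfront": [0, 0, 0, 0, 1, 1, 1, 1],
    "price": [221900.0, 538000.0, 180000.0, 604000.0,
              1225000.0, 2400000.0, 950000.0, 3100000.0],
})

# One box per waterfront value; points beyond the whiskers are drawn as outliers.
ax = sns.boxplot(x="waterfront", y="price", data=df)
ax.set_title("Price by waterfront view")
```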
Question 5
Use the function regplot in the seaborn library to determine if the feature sqft_above is negatively or positively correlated with price.
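The `regplot` call can be sketched the same way on toy data (illustrative values only). The slope of the fitted line indicates the sign of the correlation; the headless backend line is again optional in Jupyter.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; not needed inside Jupyter

import pandas as pd
import seaborn as sns

# Illustrative data only.
df = pd.DataFrame({
    "sqft_above": [1180, 2170, 770, 1050, 1680, 3890],
    "price": [221900.0, 538000.0, 180000.0, 604000.0, 510000.0, 1225000.0],
})

# Scatter plot with a fitted regression line overlaid.
ax = sns.regplot(x="sqft_above", y="price", data=df)
ax.set_ylim(0, None)  # start the price axis at zero
```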
We can use the Pandas method corr() to find the feature other than price that is most correlated with price.
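One way to read the answer off the correlation matrix is sketched below on toy data. Selecting numeric columns first and ranking by absolute correlation are choices made for this sketch, not steps stated in the text.

```python
import pandas as pd

# Illustrative rows with three numeric columns.
df = pd.DataFrame({
    "price": [221900.0, 538000.0, 180000.0, 604000.0, 510000.0],
    "sqft_living": [1180, 2570, 770, 1960, 1680],
    "long": [-122.257, -122.319, -122.233, -122.393, -122.045],
})

# Keep numeric columns only (a string column such as date would otherwise
# cause trouble), then take the "price" column of the correlation matrix.
price_corr = df.select_dtypes("number").corr()["price"]

# Drop price itself and pick the feature with the largest absolute correlation.
best_feature = price_corr.drop("price").abs().idxmax()
print(price_corr.sort_values())
print(best_feature)
```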
Module 4: Model Development
We can fit a linear regression model using the longitude feature 'long' and calculate the R^2.
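A minimal sketch of this fit with scikit-learn, on illustrative rows, could look like the following. The variable names are assumptions. For a single-feature linear regression with an intercept, the R^2 on the training data equals the squared Pearson correlation, which gives a quick sanity check.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Illustrative rows only.
df = pd.DataFrame({
    "long": [-122.257, -122.319, -122.233, -122.393, -122.045],
    "price": [221900.0, 538000.0, 180000.0, 604000.0, 510000.0],
})

X = df[["long"]]  # double brackets keep X two-dimensional, as scikit-learn expects
Y = df["price"]

lm = LinearRegression()
lm.fit(X, Y)

# score() returns the coefficient of determination R^2 on the given data.
r2 = lm.score(X, Y)
print(r2)
```

The same pattern applies to any other single feature by changing the column name in `X`.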
Question 6
Fit a linear regression model to predict the 'price' using the feature 'sqft_living', then calculate the R^2. Take a screenshot of your code and the value of the R^2.
Question 7
Fit a linear regression model to predict the 'price' using the list of features:
Then calculate the R^2. Take a screenshot of your code.
This will help with Question 8.
Create a list of tuples. The first element of each tuple is the name of the estimator, and the second element is the model constructor:
Name | Constructor |
---|---|
'scale' | StandardScaler() |
'polynomial' | PolynomialFeatures(include_bias=False) |
'model' | LinearRegression() |
Question 8
Use the list to create a pipeline object to predict the 'price', fit the object using the features in the list features, and calculate the R^2.
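The list of tuples and the pipeline built from it can be sketched as follows. The estimator names and constructors are the ones given above; the data rows and the two-column `features` list are illustrative assumptions, since the assignment's feature list is not shown here.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Illustrative rows only.
df = pd.DataFrame({
    "sqft_living": [1180, 2570, 770, 1960, 1680, 5420, 1715, 1060],
    "grade": [7, 7, 6, 7, 8, 11, 7, 7],
    "price": [221900.0, 538000.0, 180000.0, 604000.0,
              510000.0, 1225000.0, 257500.0, 291850.0],
})
features = ["sqft_living", "grade"]  # hypothetical subset for this sketch

# Each tuple is (step name, estimator); the steps run in list order.
steps = [
    ("scale", StandardScaler()),
    ("polynomial", PolynomialFeatures(include_bias=False)),
    ("model", LinearRegression()),
]
pipe = Pipeline(steps)

# fit() scales, expands to polynomial terms (degree 2 by default), then fits the regression.
pipe.fit(df[features], df["price"])
r2 = pipe.score(df[features], df["price"])
print(r2)
```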
Module 5: Model Evaluation and Refinement
Import the necessary modules:
We will split the data into training and testing sets:
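The import and the split can be sketched as below. The data rows, the `test_size` of 0.25, and the `random_state` of 0 are arbitrary choices for this sketch, not the values used in the assignment.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative rows only.
df = pd.DataFrame({
    "sqft_living": [1180, 2570, 770, 1960, 1680, 5420, 1715, 1060],
    "price": [221900.0, 538000.0, 180000.0, 604000.0,
              510000.0, 1225000.0, 257500.0, 291850.0],
})
X = df[["sqft_living"]]
Y = df["price"]

# Hold out 25% of the rows for testing; random_state makes the split repeatable.
x_train, x_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.25, random_state=0
)
print("number of test samples:", x_test.shape[0])
print("number of training samples:", x_train.shape[0])
```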
Question 9
Create and fit a Ridge regression object using the training data, set the regularization parameter to 0.1, and calculate the R^2 using the test data.
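A sketch of the Ridge workflow on synthetic data follows. The data is generated, not taken from the King County file, and the split parameters are arbitrary; only the regularization value 0.1 comes from the question. In scikit-learn the regularization parameter is named `alpha`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: price grows linearly with size, plus noise.
rng = np.random.default_rng(0)
sqft_living = rng.uniform(500, 4000, size=40)
price = 250 * sqft_living + rng.normal(0, 50_000, size=40)
df = pd.DataFrame({"sqft_living": sqft_living, "price": price})

x_train, x_test, y_train, y_test = train_test_split(
    df[["sqft_living"]], df["price"], test_size=0.25, random_state=0
)

# Fit on the training data only, then evaluate R^2 on the held-out test data.
ridge = Ridge(alpha=0.1)
ridge.fit(x_train, y_train)
r2_test = ridge.score(x_test, y_test)
print(r2_test)
```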
Question 10
Perform a second-order polynomial transform on both the training data and the testing data. Create and fit a Ridge regression object using the training data, set the regularization parameter to 0.1, and calculate the R^2 using the test data. Take a screenshot of your code and the R^2.
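The combination of a degree-2 polynomial transform and Ridge regression might be sketched like this. The data is synthetic and the two feature names are assumptions. The transformer is fitted on the training data and then reused on the test data, which for `PolynomialFeatures` gives the same columns as fitting it again.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data with two features.
rng = np.random.default_rng(0)
sqft_living = rng.uniform(500, 4000, size=40)
bedrooms = rng.integers(1, 6, size=40)
price = 250 * sqft_living + 10_000 * bedrooms + rng.normal(0, 50_000, size=40)
df = pd.DataFrame({"sqft_living": sqft_living, "bedrooms": bedrooms, "price": price})

x_train, x_test, y_train, y_test = train_test_split(
    df[["sqft_living", "bedrooms"]], df["price"], test_size=0.25, random_state=0
)

# Degree-2 expansion of two features: 1, a, b, a^2, a*b, b^2 (six columns).
pr = PolynomialFeatures(degree=2)
x_train_pr = pr.fit_transform(x_train)
x_test_pr = pr.transform(x_test)

# Fit Ridge on the transformed training data and score on the transformed test data.
ridge = Ridge(alpha=0.1)
ridge.fit(x_train_pr, y_train)
r2_test = ridge.score(x_test_pr, y_test)
print(r2_test)
```

Because the expanded columns have very different scales, scikit-learn may warn about an ill-conditioned matrix; scaling the features first is a common remedy.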