GitHub Repository: suyashi29/python-su
Path: blob/master/ML/Notebook/Random Forest Continuous data.ipynb
Kernel: Python 3

Problem: Predict the maximum temperature for the next day in a city using one year of past weather data.

# Pandas is used for data manipulation
import pandas as pd

# Read in data and display first 5 rows
features = pd.read_csv('https://raw.githubusercontent.com/suyashi29/python-su/master/temps.csv')
features.head(5)
print('The shape of our features is:', features.shape)
The shape of our features is: (348, 12)
# Descriptive statistics for each column
features.describe()

No data points immediately appear anomalous in the summary statistics, and none of the measurement columns contain zeros.
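The same check can also be done programmatically rather than by eye. The following is a minimal sketch on a small made-up table (the `sample` frame and its values are hypothetical, not the notebook's data); applying the same two lines to `features` would be the analogous step.

```python
import pandas as pd

# Hypothetical stand-in for the weather table; values are invented for illustration.
sample = pd.DataFrame({
    "temp_1": [45, 44, 0, 41],
    "average": [45.6, 45.7, 45.8, None],
    "actual": [45, 44, 41, 40],
})

# Count exact zeros and missing values in each column.
zero_counts = (sample == 0).sum()
missing_counts = sample.isna().sum()

# Show only the columns that have a problem.
print(zero_counts[zero_counts > 0])
print(missing_counts[missing_counts > 0])
```

Here the zero in `temp_1` and the missing value in `average` are planted on purpose so that both checks have something to report.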

# One-hot encode the data using pandas get_dummies
features = pd.get_dummies(features)

# Display the first 5 rows of every column from position 5 onward
features.iloc[:,5:].head(5)
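To make the effect of one-hot encoding concrete, here is a small sketch on a hypothetical three-row frame (the `toy` frame is an assumption for illustration). It passes `columns=["week"]` explicitly so that only the categorical column is expanded; the notebook relies on `get_dummies` detecting it automatically.

```python
import pandas as pd

# Hypothetical three-row frame; only the 'week' column is categorical.
toy = pd.DataFrame({
    "temp_1": [45, 44, 41],
    "week": ["Fri", "Sat", "Fri"],
})

# Each distinct value of 'week' becomes its own indicator column.
encoded = pd.get_dummies(toy, columns=["week"])
print(list(encoded.columns))
```

The numeric column passes through unchanged, while `week` is replaced by one indicator column per day name, which is why the feature count grows after encoding.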
# Use numpy to convert to arrays
import numpy as np

# Labels are the values we want to predict
labels = np.array(features['actual'])

# Remove the labels from the features
# axis 1 refers to the columns
features = features.drop('actual', axis = 1)

# Saving feature names for later use
feature_list = list(features.columns)

# Convert to numpy array
features = np.array(features)
# Using scikit-learn to split data into training and testing sets
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size = 0.25, random_state = 42)
print('Training Features Shape:', train_features.shape)
print('Training Labels Shape:', train_labels.shape)
print('Testing Features Shape:', test_features.shape)
print('Testing Labels Shape:', test_labels.shape)
Training Features Shape: (261, 17)
Training Labels Shape: (261,)
Testing Features Shape: (87, 17)
Testing Labels Shape: (87,)
# The baseline predictions are the historical averages
baseline_preds = test_features[:, feature_list.index('average')]

# Baseline errors, and display average baseline error
baseline_errors = abs(baseline_preds - test_labels)
print('Average baseline error: ', round(np.mean(baseline_errors), 2))
Average baseline error: 5.06
# Import the model we are using
from sklearn.ensemble import RandomForestRegressor

# Instantiate model with 1000 decision trees
rf = RandomForestRegressor(n_estimators = 1000, random_state = 42)

# Train the model on training data
rf.fit(train_features, train_labels);
# Use the forest's predict method on the test data
predictions = rf.predict(test_features)

# Calculate the absolute errors
errors = abs(predictions - test_labels)

# Print out the mean absolute error (mae)
print('Mean Absolute Error:', round(np.mean(errors), 2), 'degrees')
Mean Absolute Error: 3.87 degrees
# Calculate mean absolute percentage error (MAPE)
mape = 100 * (errors / test_labels)

# Calculate and display accuracy
accuracy = 100 - np.mean(mape)
print('Accuracy:', round(accuracy, 2), '%.')
Accuracy: 93.93 %.
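The "accuracy" reported here is 100 minus the mean absolute percentage error, not a classification accuracy. A small worked sketch with three made-up temperatures (the `toy_labels` and `toy_preds` arrays are hypothetical) shows the arithmetic:

```python
import numpy as np

# Hypothetical true and predicted temperatures, chosen for easy arithmetic.
toy_labels = np.array([50.0, 60.0, 80.0])
toy_preds = np.array([55.0, 57.0, 76.0])

# Absolute errors are 5, 3 and 4 degrees, so the mean absolute error is 4.
toy_errors = np.abs(toy_preds - toy_labels)
toy_mae = toy_errors.mean()

# Percentage errors are 10%, 5% and 5%; their mean is the MAPE.
toy_mape = 100 * (toy_errors / toy_labels)
toy_accuracy = 100 - toy_mape.mean()

print(round(toy_mae, 2), round(toy_accuracy, 2))
```

Because each error is divided by its label, this measure becomes unstable when true values are at or near zero. That does not appear to be an issue for this dataset, but it is worth keeping in mind before reusing the formula elsewhere.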
# Get numerical feature importances
importances = list(rf.feature_importances_)

# List of tuples with variable and importance
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(feature_list, importances)]

# Sort the feature importances by most important first
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)

# Print out the feature and importances
[print('Variable: {:20} Importance: {}'.format(*pair)) for pair in feature_importances];
Variable: temp_1               Importance: 0.66
Variable: average              Importance: 0.15
Variable: forecast_noaa        Importance: 0.05
Variable: forecast_acc         Importance: 0.03
Variable: day                  Importance: 0.02
Variable: temp_2               Importance: 0.02
Variable: forecast_under       Importance: 0.02
Variable: friend               Importance: 0.02
Variable: month                Importance: 0.01
Variable: year                 Importance: 0.0
Variable: week_Fri             Importance: 0.0
Variable: week_Mon             Importance: 0.0
Variable: week_Sat             Importance: 0.0
Variable: week_Sun             Importance: 0.0
Variable: week_Thurs           Importance: 0.0
Variable: week_Tues            Importance: 0.0
Variable: week_Wed             Importance: 0.0
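When reading these numbers, it helps to know that scikit-learn normalizes `feature_importances_` so the unrounded values sum to 1; the printed values are rounded to two decimals, so their total can fall slightly short. The sketch below illustrates this on synthetic data (the arrays `X` and `y` and the model `toy_rf` are assumptions for illustration, not the notebook's data), where the target depends almost entirely on the first column.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target is driven by column 0 plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 5.0 * X[:, 0] + 0.1 * rng.normal(size=200)

# A small forest is enough for this illustration.
toy_rf = RandomForestRegressor(n_estimators=50, random_state=0)
toy_rf.fit(X, y)

# Importances are normalized, and the informative column should rank first.
toy_importances = toy_rf.feature_importances_
top_feature = int(np.argmax(toy_importances))
print(toy_importances.sum(), top_feature)
```

The same pattern shows up in the weather results: a feature that carries most of the signal, such as `temp_1`, takes most of the importance, while uninformative columns like the day-of-week indicators end up near zero.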
# Import matplotlib for plotting and use magic command for Jupyter Notebooks
import matplotlib.pyplot as plt
%matplotlib inline

# Set the style
plt.style.use('fivethirtyeight')

# list of x locations for plotting
x_values = list(range(len(importances)))

# Make a bar chart
plt.bar(x_values, importances, orientation = 'vertical')

# Tick labels for x axis
plt.xticks(x_values, feature_list, rotation='vertical')

# Axis labels and title
plt.ylabel('Importance'); plt.xlabel('Variable'); plt.title('Variable Importances');
[Figure: bar chart titled "Variable Importances", with Variable on the x-axis and Importance on the y-axis]