Data Modelling using Python
Statistical Description using Pandas
Let's get the data ready for the machine learning model
Correlation visualization between features. We can see that Phosphorus levels and Potassium levels are highly correlated.
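To make "highly correlated" concrete, here is a minimal plain-Python sketch of the Pearson coefficient, the number a correlation heatmap typically displays for each pair of features. The values below are made up for illustration and are not taken from the crop dataset; the `pearson` helper is a hypothetical name.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and the two sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up values for illustration only (assumption, not the real data).
phosphorus = [10.0, 20.0, 30.0, 40.0, 50.0]
potassium = [12.0, 19.0, 33.0, 41.0, 48.0]
rainfall = [5.0, 1.0, 4.0, 2.0, 3.0]

r_pk = pearson(phosphorus, potassium)  # close to +1: the two move together
r_pr = pearson(phosphorus, rainfall)   # much closer to 0: weak relationship
```

A coefficient near +1 or -1 means one feature carries almost the same information as the other, which is why such pairs are candidates for dropping one of the two.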
FEATURE SCALING
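Feature scaling is required before creating the training data and feeding it to the model, because distance-based models such as KNN and SVM are sensitive to the magnitude of each feature.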
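As a rough illustration of what scaling does, the sketch below standardizes each column to zero mean and unit standard deviation. This assumes standardization is the scaling in use (the same idea as scikit-learn's `StandardScaler`); the notebook may have used a different scaler. The column values and the `standardize` helper are made up for the example.

```python
import statistics

def standardize(column):
    """Rescale a list of numbers to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    return [(value - mean) / std for value in column]

# Made-up feature columns on very different scales (assumption, not the real data).
rainfall_mm = [200.0, 220.0, 180.0, 260.0]
soil_ph = [6.5, 7.0, 5.5, 6.0]

scaled_rainfall = standardize(rainfall_mm)
scaled_ph = standardize(soil_ph)
```

After scaling, both columns live on a comparable range, so a 40 mm difference in rainfall no longer swamps a 1.0 difference in pH when distances are computed.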
Data Modelling
KNN Classifier for Crop prediction.
KNN Introduction
K nearest neighbors is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g., distance functions).
KNN algorithm
KNN is one of the simplest supervised machine learning algorithms. It simply calculates the distance from a new data point to every training data point.
K can be any positive integer. K=3 means finding the 3 nearest points.
The KNN algorithm starts by calculating the distance (Euclidean or Manhattan) of point X from all the training points, then selects the K points nearest to X.
Finally it assigns the data point to the class to which the majority of the K data points belong.
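The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the scikit-learn implementation; the function names, the toy points, and the crop labels are all made up for the example.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences between two points."""
    return sum(abs(x - y) for x, y in zip(a, b))

def knn_predict(train_points, train_labels, query, k=3, distance=euclidean):
    # Step 1: distance from the query point to every training point.
    distances = [(distance(p, query), label)
                 for p, label in zip(train_points, train_labels)]
    # Step 2: keep the k closest training points.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 3: majority vote among their labels.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data for illustration only (assumption, not the crop dataset).
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5),
                (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
train_labels = ["rice", "rice", "rice", "maize", "maize", "maize"]

prediction = knn_predict(train_points, train_labels, (2.0, 2.0), k=3)
prediction_far = knn_predict(train_points, train_labels, (6.0, 7.0),
                             k=3, distance=manhattan)
```

Note that there is no training step at all: every prediction scans the stored training points, which is why KNN gets slower as the dataset grows.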
Note
The model for KNN is the entire training dataset. When a prediction is required for an unseen data instance, the KNN algorithm searches through the training dataset for the K most similar instances. The prediction attribute of those instances is summarized and returned as the prediction for the unseen instance.
Common starting points for K: the number of classes, or sqrt(total data points).
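Reading the line above as a rule of thumb for picking a starting K, the square-root heuristic can be sketched as follows. The helper name `suggest_k` is hypothetical, and nudging the result to an odd number is an added convention (it reduces tied votes), not something stated in the notebook.

```python
import math

def suggest_k(n_samples):
    """Starting value for K: the rounded square root of the number of samples."""
    k = max(1, round(math.sqrt(n_samples)))
    # Assumption: prefer an odd K so that votes are less likely to tie.
    return k if k % 2 == 1 else k + 1

k_small = suggest_k(25)    # sqrt(25) = 5, already odd
k_large = suggest_k(100)   # sqrt(100) = 10, bumped to 11
```

This only gives a first guess; the value still needs to be checked against validation accuracy.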
Confusion Matrix
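To show how a confusion matrix is built, here is a small plain-Python sketch. Rows are the actual classes and columns are the predicted classes, the same layout scikit-learn's `confusion_matrix` uses. The labels and predictions are made up for the example and are not the notebook's results.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Count (actual, predicted) pairs; rows are actual, columns are predicted."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for actual, predicted in zip(y_true, y_pred):
        matrix[index[actual]][index[predicted]] += 1
    return matrix

# Made-up labels and predictions (assumption, for illustration only).
labels = ["maize", "rice", "wheat"]
y_true = ["rice", "rice", "maize", "wheat", "wheat", "maize"]
y_pred = ["rice", "maize", "maize", "wheat", "rice", "maize"]

matrix = confusion_matrix(y_true, y_pred, labels)

# Correct predictions sit on the diagonal, so accuracy is diagonal / total.
accuracy = sum(matrix[i][i] for i in range(len(labels))) / len(y_true)
```

Off-diagonal cells tell us which crops get mistaken for which, something a single accuracy number hides.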
Let's try different values of n_neighbors to fine-tune the model and get better results
Classification using Support Vector Classifier (SVC)
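One way to try several values of `n_neighbors` is a simple loop. The sketch below uses a synthetic dataset from `make_classification` as a stand-in, since the crop data is not shown here; the sample sizes, the range of K values, and the variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 3 classes, 6 numeric features.
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_redundant=0, n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Record held-out accuracy for each odd K from 1 to 15.
scores = {}
for k in range(1, 16, 2):
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train_s, y_train)
    scores[k] = model.score(X_test_s, y_test)

best_k = max(scores, key=scores.get)
```

Picking K on the test split makes the reported accuracy slightly optimistic; cross-validation on the training data is a safer way to choose it.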
SVM works by mapping data to a high-dimensional feature space so that data points can be categorized, even when the data are not otherwise linearly separable. A separator between the categories is found, then the data is transformed in such a way that the separator could be drawn as a hyperplane. Following this, characteristics of new data can be used to predict the group to which a new record should belong.
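The idea of mapping data into a higher-dimensional space can be seen in a tiny hand-made example. The points, the feature map, and the threshold below are all chosen by hand for illustration; a real SVM learns the separator (and picks the one with the largest margin) instead.

```python
# 1-D points: class 1 sits between the class 0 points, so no single
# threshold on x can separate the two classes.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
labels = [0, 0, 1, 1, 1, 0, 0]

def feature_map(x):
    """Map a 1-D point into 2-D by adding x squared as a second coordinate."""
    return (x, x * x)

mapped = [feature_map(x) for x in xs]

def classify(point, threshold=2.5):
    """In the mapped space, the line (second coordinate = threshold) separates the classes."""
    return 1 if point[1] < threshold else 0

predictions = [classify(p) for p in mapped]
```

In one dimension the classes are tangled, but after adding the squared coordinate a straight line splits them cleanly.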
The SVM algorithm offers a choice of kernel functions for performing its processing. Basically, mapping data into a higher dimensional space is called kernelling. The mathematical function used for the transformation is known as the kernel function, and can be of different types, such as linear, polynomial, radial basis function (RBF), and sigmoid.
Each of these functions has its characteristics, its pros and cons, and its equation, but as there's no easy way of knowing which function performs best with any given dataset, we usually choose different functions in turn and compare the results.
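Trying the kernels in turn can look like the sketch below, which compares scikit-learn's four built-in `SVC` kernels with 5-fold cross-validation. The data is a synthetic stand-in from `make_classification` (an assumption, since the crop data is not shown here), so the scores say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): 3 classes, 6 numeric features.
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_redundant=0, n_classes=3, random_state=0)

# Mean cross-validated accuracy for each built-in kernel.
kernel_scores = {}
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    # Scaling lives inside the pipeline so each fold is scaled on its own training part.
    pipeline = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    kernel_scores[kernel] = cross_val_score(pipeline, X, y, cv=5).mean()

best_kernel = max(kernel_scores, key=kernel_scores.get)
```

Each kernel here runs with its default parameters, so a kernel that loses this comparison may still win after tuning.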
Let's try to increase the accuracy of the linear SVC model by parameter tuning.
GridSearchCV can help us find the best parameters.
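A minimal sketch of a grid search over the regularization parameter `C` of a linear SVC follows. The grid values, the pipeline step names, and the synthetic data are assumptions for illustration; the notebook's actual grid is not shown here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): 3 classes, 6 numeric features.
X, y = make_classification(n_samples=200, n_features=6, n_informative=4,
                           n_redundant=0, n_classes=3, random_state=0)

# Scale, then classify; step names are used to address parameters in the grid.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("svc", SVC(kernel="linear")),
])

# Hypothetical grid: three candidate values for C.
param_grid = {"svc__C": [0.1, 1, 10]}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

best_params = search.best_params_   # the winning parameter combination
best_score = search.best_score_     # its mean cross-validated accuracy
```

`GridSearchCV` refits the best combination on all the data by default, so `search.predict(...)` can be used directly afterwards.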
Insights
Both KNN and SVC give about 75% accuracy
We can prefer SVC for this prediction
Make sure we do not use the K (Potassium) and P (Phosphorus) features together, as they are highly correlated