
Data Modelling using Python

image.png

import pandas as pd
import numpy as np
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=UserWarning)
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.read_csv("crop_recommendation.csv")
df.head()
df.tail()

Statistical Description using Pandas

df.describe()
df.describe(include="object")

Let's make the data ready for the machine learning model

c = df.label.astype('category')
targets = dict(enumerate(c.cat.categories))
df['target'] = c.cat.codes
y = df.target
X = df[['N','P','K','temperature','humidity','ph','rainfall']]
df['target'].tail()
2195    5
2196    5
2197    5
2198    5
2199    5
Name: target, dtype: int8
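Since the crop names are encoded as integer codes, it can help to peek at the targets dictionary built above to see which crop each code corresponds to. A minimal sketch (the actual names depend on the CSV):

# targets maps each integer code to its original crop name
for code, name in targets.items():
    print(code, "->", name)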

Correlation visualization between features. We can see that phosphorus (P) and potassium (K) levels are highly correlated.

plt.figure(figsize=(20, 5))
sns.heatmap(X.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
Image in a Jupyter notebook
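To back up that claim numerically, the same correlations can be printed directly; a quick check using the X defined above:

# numeric view of the correlations shown in the heatmap
print(X.corr().round(2))
# correlation between phosphorus and potassium specifically
print("P-K correlation:", round(X.corr().loc['P', 'K'], 2))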

Feature Scaling

Feature scaling is recommended before creating the training data and feeding it to distance-based models such as KNN and SVM.

10/1000, 100/1000, 1000/1000  # quick illustration: dividing by the maximum brings values onto a comparable scale
(0.01, 0.1, 1.0)
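The MinMaxScaler used below applies (x - min) / (max - min), which maps every feature into the range [0, 1]. A minimal sketch on a toy array (the numbers are illustrative only):

values = np.array([10, 100, 1000], dtype=float)
scaled = (values - values.min()) / (values.max() - values.min())
print(scaled)  # [0.         0.09090909 1.        ]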
df
Modelling Process

  1. X - inputs/features (K, P, temperature, ...)
  2. y - output/target (the crop label)

  • Split the data into X_train, X_test, y_train, y_test
  • Train the model: crop_model = train(X_train, y_train)
  • Predict: y_pred = crop_model(X_test)
  • Error = difference between y_test and y_pred
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, test_size=0.2)
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
# apply the scaling fitted on the training set to the test set as well
X_test_scaled = scaler.transform(X_test)
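As a quick sanity check, the scaled training features should all lie between 0 and 1, while the scaled test features can fall slightly outside that range because the scaler was fitted on the training set only:

print("Train range:", X_train_scaled.min(), "-", X_train_scaled.max())
print("Test range: ", X_test_scaled.min(), "-", X_test_scaled.max())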

Data Modelling

image.png

KNN Classifier for Crop Prediction

KNN Introduction

  • K nearest neighbors is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g., distance functions).

KNN algorithm

  • One of the simplest of all the supervised machine learning algorithms. It simply calculates the distance of a new data point to all other training data points.

  • K can be any positive integer. K=3 means: find the 3 nearest points.

  • The KNN algorithm starts by calculating the distance (Euclidean or Manhattan) of the new point X from all the training points.

  • Finally, it assigns the data point to the class to which the majority of the K nearest points belong.

Note

The model for KNN is the entire training dataset. When a prediction is required for an unseen data instance, the KNN algorithm searches through the training dataset for the k most similar instances. The target attribute of the most similar instances is summarized and returned as the prediction for the unseen instance.

Common heuristics for choosing K: the number of classes, or the square root of the total number of data points (see the sketch below).
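To make the idea concrete, here is a minimal from-scratch sketch of the KNN prediction step (Euclidean distance plus a majority vote); the toy points and the choice of k are illustrative only:

from collections import Counter

def knn_predict(X_train_pts, y_train_lbls, x_new, k=3):
    # distance of the new point to every training point (Euclidean)
    distances = [np.sqrt(np.sum((np.array(p) - np.array(x_new)) ** 2)) for p in X_train_pts]
    # indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # majority vote among their labels
    votes = [y_train_lbls[i] for i in nearest]
    return Counter(votes).most_common(1)[0][0]

# toy data: two 1-D clusters, in the style of the example below
pts = [[0], [1], [2], [7], [8], [9]]
lbls = [0, 0, 0, 1, 1, 1]
print(knn_predict(pts, lbls, [6], k=3))  # -> 1
# rule of thumb: k is around sqrt(number of training points)
print("suggested k ~", int(np.sqrt(len(pts))))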

image.png

image.png

a = [[0], [1], [2], [3], [4], [5], [6], [7], [8], [10], [12], [13], [16], [15]]
b = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
from sklearn.neighbors import KNeighborsClassifier
dummy = KNeighborsClassifier(n_neighbors=4)
dummy.fit(a, b)
print("Prediction=",dummy.predict([[4.5],[9]]))
Prediction= [0 1]
print("Prediction Probability = ",dummy.predict_proba([[4.5],[9]]))
Prediction Probability = [[1. 0. ] [0.25 0.75]]
print("Closed Neighbours ", dummy.kneighbors([[9]]))
Closed Neighbours (array([[1., 1., 2., 3.]]), array([[8, 9, 7, 6]], dtype=int64))
df
y = df.target
# note: 'K' is left out here because it is highly correlated with 'P'
X = df[['N','P','temperature','humidity','ph','rainfall']]
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)
from sklearn.neighbors import KNeighborsClassifier
crop_p = KNeighborsClassifier(n_neighbors=7)
crop_p.fit(X_train, y_train)
crop_p.score(X_test, y_test)
0.9424242424242424
y_pred = crop_p.predict(X_test)

Confusion Matrix

from sklearn.metrics import multilabel_confusion_matrix
cnf_matrix = multilabel_confusion_matrix(
    y_test, y_pred,
    labels=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 22])

image.png image-3.png

cnf_matrix
array([[[622, 0], [ 0, 38]], [[638, 0], [ 0, 22]], [[628, 0], [ 0, 32]], [[624, 7], [ 0, 29]], [[633, 0], [ 0, 27]], [[634, 0], [ 1, 25]], [[625, 3], [ 0, 32]], [[630, 0], [ 0, 30]], [[623, 5], [ 1, 31]], [[624, 2], [ 7, 27]], [[633, 1], [ 0, 26]], [[625, 0], [ 3, 32]], [[627, 0], [ 0, 33]], [[636, 0], [ 1, 23]], [[629, 1], [ 0, 30]], [[641, 0], [ 0, 19]], [[625, 3], [ 15, 17]], [[632, 0], [ 1, 27]], [[623, 0], [ 2, 35]], [[619, 15], [ 3, 23]], [[628, 1], [ 4, 27]], [[660, 0], [ 0, 0]]], dtype=int64)
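Each 2x2 block above is the confusion matrix for one label, laid out as [[TN, FP], [FN, TP]]. A minimal sketch for turning those blocks into per-label precision and recall, using the cnf_matrix computed above (labels with no support will simply show zero):

for i, m in enumerate(cnf_matrix):
    tn, fp = m[0]
    fn, tp = m[1]
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    print(f"label index {i}: precision={precision:.2f}, recall={recall:.2f}")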

Let's try different values of n_neighbors to fine-tune the model and get better results

from sklearn.metrics import accuracy_score
for K in range(0, 7):
    K_value = K + 1
    neigh = KNeighborsClassifier(n_neighbors=K_value)
    neigh.fit(X_train, y_train)
    y_pred = neigh.predict(X_test)
    print("Accuracy is ", accuracy_score(y_test, y_pred)*100, "% for K-Value:", K_value)
Accuracy is  94.39393939393939 % for K-Value: 1
Accuracy is  93.63636363636364 % for K-Value: 2
Accuracy is  94.84848484848484 % for K-Value: 3
Accuracy is  94.39393939393939 % for K-Value: 4
Accuracy is  93.78787878787878 % for K-Value: 5
Accuracy is  93.78787878787878 % for K-Value: 6
Accuracy is  94.24242424242424 % for K-Value: 7
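It can also help to visualize these accuracies when picking K; an optional sketch that re-runs the same loop but collects the scores for plotting:

k_values = range(1, 8)
scores = []
for k in k_values:
    model_k = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    scores.append(model_k.score(X_test, y_test))
plt.plot(list(k_values), scores, marker='o')
plt.xlabel("K")
plt.ylabel("Test accuracy")
plt.show()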
import numpy as np
X_new = [80.000000, 45.000000, 10.598693, 120.473146, 12.425045, 120.867624]
X_new = pd.Series(X_new)
Prediction = crop_p.predict([X_new])
print("Prediction: {}".format(Prediction))
print("Predicted target name: {}".format(df['label'][Prediction]))
Prediction: [17]
Predicted target name: 17    rice
Name: label, dtype: object
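Note that df['label'][Prediction] looks up a row of the dataframe by its index rather than by the encoded target value, so it is safer to use the targets dictionary built earlier, which maps each code straight to its crop name. A small sketch:

# targets was built earlier as dict(enumerate(c.cat.categories))
predicted_code = int(Prediction[0])
print("Predicted crop:", targets[predicted_code])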

Classification using Support Vector Classifier (SVC)

SVM works by mapping data to a high-dimensional feature space so that data points can be categorized, even when the data are not otherwise linearly separable. A separator between the categories is found, then the data is transformed in such a way that the separator can be drawn as a hyperplane. Following this, the characteristics of new data can be used to predict the group to which a new record should belong.

The SVM algorithm offers a choice of kernel functions for performing its processing. Basically, mapping data into a higher dimensional space is called kernelling. The mathematical function used for the transformation is known as the kernel function, and can be of different types, such as:

  1. Linear
  2. Polynomial
  3. Radial basis function (RBF)
  4. Sigmoid

Each of these functions has its own characteristics, pros and cons, and equation, but since there is no easy way of knowing which function performs best on a given dataset, we usually try different functions in turn and compare the results.

image.png
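As a concrete illustration of "kernelling", the RBF kernel measures similarity as K(x, z) = exp(-gamma * ||x - z||^2); a minimal sketch (the gamma value is illustrative only):

def rbf_kernel(x, z, gamma=0.1):
    # similarity decays with squared Euclidean distance
    return np.exp(-gamma * np.sum((np.array(x) - np.array(z)) ** 2))

print(rbf_kernel([1.0, 2.0], [1.0, 2.0]))   # identical points -> 1.0
print(rbf_kernel([1.0, 2.0], [4.0, 6.0]))   # distant points -> close to 0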

from sklearn.svm import SVC
svc_linear = SVC(kernel='linear').fit(X_train, y_train)
print("Linear Kernel Accuracy: ", svc_linear.score(X_test, y_test))
svc_rbf = SVC(kernel='rbf').fit(X_train, y_train)
print("RBF Kernel Accuracy: ", svc_rbf.score(X_test, y_test))
svc_poly = SVC(kernel='poly').fit(X_train, y_train)
print("Poly Kernel Accuracy: ", svc_poly.score(X_test, y_test))

Let's try to increase SVC Linear model accuracy by parameter tuning.

GridSearchCV can help us find the best parameters.

from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV

parameters = {'C': np.logspace(-3, 2, 6).tolist(),
              'gamma': np.logspace(-3, 2, 6).tolist()}
# other options to search: 'degree': np.arange(0, 5, 1).tolist(), 'kernel': ['linear', 'rbf', 'poly']
model = GridSearchCV(estimator=SVC(kernel="linear"), param_grid=parameters, n_jobs=-1, cv=4)
model.fit(X_train, y_train)
print(model.best_score_)
print(model.best_params_)
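The tuned model can then be evaluated on the held-out test set; a short sketch (GridSearchCV refits the best estimator on the full training data by default, so model.score uses the best parameters):

# score of the best parameter combination on unseen data
print("Test accuracy with best params:", model.score(X_test, y_test))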

Insights

  • Both the KNN and SVC algorithms achieve high accuracy on this dataset; the KNN model scores about 94% on the test set.

  • We can prefer SVC for this prediction task.

  • Make sure we do not use K and P together as features, since they are highly correlated.