suyashi29
GitHub Repository: suyashi29/python-su
Path: blob/master/Data Science using Python/Lab 4 EDA on Crop Growth Prediction.ipynb
Kernel: Python 3 (ipykernel)

EDA on Crop Recommendation Using Weather and Soil Content


What is EDA?

EDA stands for Exploratory Data Analysis. It is an approach used in data science and statistics to summarize, visualize, and understand the main characteristics, patterns, and relationships present in a dataset. The primary goal of EDA is to gain insights into the data and to uncover potential patterns or anomalies that might not be immediately apparent.

EDA involves a variety of techniques and methods, including:

  • Summary Statistics: Calculating basic statistics such as mean, median, mode, standard deviation, and percentiles to understand the central tendency and variability of the data.

  • Data Visualization: Creating graphs, charts, and plots to visually represent the data. Common types of visualizations include histograms, scatter plots, box plots, bar charts, and heatmaps.

  • Distribution Analysis: Examining the distribution of variables to understand their shape, skewness, and spread. This helps identify potential outliers and anomalies.

  • Correlation Analysis: Investigating the relationships between different variables to understand how they are related and whether there are any strong correlations or dependencies.

  • Data Cleaning and Preprocessing: Identifying missing values, outliers, and inconsistencies in the data and deciding how to handle them before further analysis.

  • Feature Engineering: Exploring ways to create new features or transformations of existing features that might better capture the underlying patterns in the data.

  • Hypothesis Generation: Formulating initial hypotheses about the relationships between variables and then using EDA to validate or disprove these hypotheses.
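
Several of the techniques listed above map onto one-line pandas calls. The following is a minimal sketch on a small made-up DataFrame; the column names and values are hypothetical and are not taken from the crop dataset.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the calls
toy = pd.DataFrame({
    "rainfall": [100.0, 120.0, 140.0, 160.0, None],
    "humidity": [50.0, 60.0, 70.0, 80.0, 90.0],
    "crop": ["a", "a", "b", "b", "b"],
})

summary = toy.describe()                      # summary statistics for numeric columns
missing = toy.isnull().sum()                  # missing values per column
skewness = toy["humidity"].skew()             # distribution shape (0 for symmetric data)
corr = toy[["rainfall", "humidity"]].corr()   # pairwise Pearson correlation
counts = toy["crop"].value_counts()           # frequency of each category
```

Each result is itself a pandas object, so it can be inspected, filtered, or plotted in the next step of the analysis.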

Why EDA?

  • It is an approach to summarize, visualize, and become intimately familiar with the important characteristics of a dataset.

  • It defines and refines the selection of feature variables that will be used for machine learning.

  • It helps to find hidden insights.

  • It provides the context needed to develop an appropriate model with minimum errors.

import pandas as pd
import numpy as np
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=UserWarning)
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

df = pd.read_csv("crop_recommendation.csv")
df.head()
df.shape
(2200, 8)
df["label"].unique()
array(['rice', 'maize', 'chickpea', 'kidneybeans', 'pigeonpeas', 'mothbeans', 'mungbean', 'blackgram', 'lentil', 'pomegranate', 'banana', 'mango', 'grapes', 'watermelon', 'muskmelon', 'apple', 'orange', 'papaya', 'coconut', 'cotton', 'jute', 'coffee'], dtype=object)

Insights:

  • The dataset contains observations for 22 crops.

Let us check for null values.

df.isnull().sum()
N              0
P              0
K              0
temperature    0
humidity       0
ph             0
rainfall       0
label          0
dtype: int64

Insight:

  • There are no null values in the data.

Statistical Description using Pandas

## To check statistical properties of numerical data
df.describe()

Insights:

  • The mean nitrogen level across the given set of crops is approximately 51.

  • The mean phosphorus level across the given set of crops is approximately 53.

  • The mean potassium level across the given set of crops is approximately 48.

  • The mean temperature is 25.6, which suggests that a moderate temperature is favorable for the given set of crops.

# To check statistical properties of categorical data
df.describe(include="object")

The data contains 22 crops, and each crop has 100 data points.

# Series.value_counts() replaces the deprecated top-level pd.value_counts()
c = df['label'].value_counts().to_frame().reset_index()
c.head(22)
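
Class balance can also be checked programmatically rather than by reading the table. A minimal sketch, using a hypothetical three-crop label column as a stand-in for `df['label']`:

```python
import pandas as pd

# Hypothetical stand-in for df['label']
labels = pd.Series(["rice"] * 4 + ["maize"] * 4 + ["jute"] * 4, name="label")

counts = labels.value_counts()

# True when every class has the same number of rows
is_balanced = counts.nunique() == 1

# Share of the dataset taken up by each class
proportions = labels.value_counts(normalize=True)
```

A balanced target like this means plain accuracy is a reasonable first metric later on, since no class dominates.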

Visualization Part

  • Total number of crops and their average suitable temperature

Bar plot

A bar plot or bar chart is a graph that represents categories of data with rectangular bars whose lengths or heights are proportional to the values they represent. Bar plots can be drawn horizontally or vertically. A bar chart shows comparisons between discrete categories.

fig = plt.figure(figsize=(28, 12))
# Vertical bar plot: one bar is drawn per row, so bars for the same crop overlap
plt.bar(df['label'], df['temperature'], color="pink")
# Show plot
plt.show()
Image in a Jupyter notebook
plt.figure(figsize=(25, 8))  # Set the figure size

# Create the bar plot
bars = plt.bar(df['label'], df['temperature'], color='skyblue')

# Add annotations above each bar
for bar in bars:
    yval = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2, yval + 1, round(yval, 1), ha='center', va='bottom')

# Adding labels and title
plt.xlabel('Crop_Category')
plt.ylabel('Temperature')
plt.title('Temperature vs Crops')

# Show the plot
plt.tight_layout()
plt.show()
Image in a Jupyter notebook
# Calculate mean temperature for each label
mean_temperatures = df.groupby('label')['temperature'].mean().reset_index()

plt.figure(figsize=(30, 15))  # Set the figure size

# Create the bar plot
bars = plt.bar(mean_temperatures['label'], mean_temperatures['temperature'], color='lightgreen')

# Add annotations above each bar
for bar in bars:
    yval = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2, yval+1, f'{yval:.1f}', ha='center', va='bottom')

# Adding labels and title
plt.xlabel('Crops')
plt.ylabel('Mean Temperature')
plt.title('Annotated Bar Plot of Mean Temperatures')

# Show the plot
plt.tight_layout()
plt.show()
Image in a Jupyter notebook
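
When raw columns are passed to `plt.bar`, one bar is drawn per row, so the bars for a crop sit on top of each other and the visible height is that crop's maximum. Aggregating first makes the plotted quantity explicit. A minimal sketch with hypothetical values (not taken from the dataset):

```python
import pandas as pd

# Hypothetical rows standing in for df[['label', 'temperature']]
toy = pd.DataFrame({
    "label": ["rice", "rice", "maize", "maize"],
    "temperature": [20.0, 26.0, 18.0, 24.0],
})

# One row per crop, with the statistic to plot chosen explicitly
stats = toy.groupby("label")["temperature"].agg(["mean", "min", "max"])
```

Any of the resulting columns can then be passed to `plt.bar` together with `stats.index`.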

Distribution of temperature, humidity and pH.

The distributions of temperature and pH are roughly symmetrical and bell shaped, so most observations lie near the average and only occasionally deviate by large amounts. These two distributions also closely resemble each other.

plt.figure(figsize=(12, 6))
# histplot with kde=True replaces the deprecated distplot
plt.subplot(1, 3, 1)
sns.histplot(df['temperature'], color="darkred", bins=15, kde=True, stat="density", alpha=0.4)
plt.subplot(1, 3, 2)
sns.histplot(df['ph'], color="yellow", bins=15, kde=True, stat="density", alpha=0.4)
plt.subplot(1, 3, 3)
sns.histplot(df['humidity'], color="darkgreen", bins=15, kde=True, stat="density", alpha=0.4)
<AxesSubplot:xlabel='humidity', ylabel='Density'>
Image in a Jupyter notebook
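
The visual impression of symmetry can be backed by a number. The sketch below uses synthetic columns (hypothetical names `bell` and `skewed`, generated with NumPy) to show how pandas reports skewness and excess kurtosis; the same calls could be applied to `df[['temperature', 'ph', 'humidity']]`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic data: one symmetric column and one right-skewed column
toy = pd.DataFrame({
    "bell": rng.normal(loc=25.0, scale=5.0, size=5000),
    "skewed": rng.exponential(scale=5.0, size=5000),
})

skew = toy.skew()              # close to 0 for symmetric data
excess_kurtosis = toy.kurt()   # close to 0 for a normal distribution
```

A skewness near zero supports a "bell shaped" reading, while a clearly positive or negative value points to a long tail on one side.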

Countplot

  • Shows value counts for a single categorical variable when only one variable is passed.

  • Shows value counts for two categorical variables when the hue parameter is used.

plt.figure(figsize=(22, 15))
sns.countplot(y='label', data=df, palette="Set2")
# Sets the default figure size for subsequent plots
sns.set(rc={'figure.figsize': (20, 8)})
Image in a Jupyter notebook

Pairplot

A pairplot visualizes the pairwise relationships between the variables in a dataset, with each variable's own distribution on the diagonal. It is a useful tool for exploratory data analysis and statistical graphics.

# pairplot creates its own figure, so a preceding plt.figure() call is not needed
sns.pairplot(df, hue='label')
<seaborn.axisgrid.PairGrid at 0x1e15a84a250>
Image in a Jupyter notebook

Insights:

During the rainy season, average rainfall is high (about 120 mm) and the temperature is mildly cool (below 30 °C).

Rain affects soil moisture, which in turn affects the pH of the soil. The following crops are likely to be planted during this season:

  • Rice needs heavy rainfall (>200 mm) and humidity above 80%. This is consistent with major rice production in India coming from the east coast, which averages 220 mm of rainfall every year.

  • Coconut is a tropical crop that needs high humidity, which explains the massive exports from coastal areas around the country.
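
Claims of this kind can be checked against per-crop averages instead of being read off the pairplot. A minimal sketch, with hypothetical values standing in for the real rows of `df`:

```python
import pandas as pd

# Hypothetical rows; the real check would use df directly
toy = pd.DataFrame({
    "label": ["rice", "rice", "coconut", "coconut", "chickpea", "chickpea"],
    "rainfall": [230.0, 240.0, 170.0, 180.0, 70.0, 90.0],
    "humidity": [82.0, 84.0, 94.0, 96.0, 16.0, 18.0],
})

# Mean rainfall and humidity per crop
means = toy.groupby("label")[["rainfall", "humidity"]].mean()

# Crops whose averages meet both thresholds quoted in the text
wet_crops = means[(means["rainfall"] > 200) & (means["humidity"] > 80)].index.tolist()
```

Running the same two lines on `df` would list every crop in the dataset that meets the rainfall and humidity thresholds.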

Joint plots

Allow you to plot a relationship between two variables (also known as a bivariate relationship), while simultaneously exploring the distribution of each underlying variable.

Check crops for which the favorable temperature is greater than 25 and rainfall is greater than 130

sns.jointplot(x="rainfall",y="humidity",data=df[(df['temperature']>25) & (df['rainfall']>130)],hue="label")
<seaborn.axisgrid.JointGrid at 0x1e15af27d00>
Image in a Jupyter notebook

Insights

  • Papaya and pigeon peas grow at higher temperature and high rainfall.
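
The crops that survive a filter like the one passed to `sns.jointplot` can also be listed directly. A minimal sketch with hypothetical rows (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical rows standing in for df
toy = pd.DataFrame({
    "label": ["papaya", "pigeonpeas", "chickpea", "rice"],
    "temperature": [33.0, 28.0, 18.0, 23.0],
    "rainfall": [140.0, 150.0, 80.0, 230.0],
})

# Both conditions must hold; each comparison needs its own parentheses
mask = (toy["temperature"] > 25) & (toy["rainfall"] > 130)

# Distinct crops remaining after the filter
crops = sorted(toy.loc[mask, "label"].unique())
```

Listing the labels this way avoids having to match legend colors to points in a crowded plot.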

Check crops for which the favorable temperature is less than 30 and humidity is less than 50

sns.jointplot(x="rainfall",y="humidity",data=df[(df['temperature']<30) & (df['humidity']<50)],hue="label")

Insights

  • Chickpea, kidney beans, mango, etc. are the crops favored by moderate soil temperature and humidity.

sns.jointplot(x="K",y="N",data=df[(df['N']>40)&(df['K']>40)],hue="label")

Let's plot one specific pair from the pairplot: `humidity` against `K` (potassium level in the soil).

sns.jointplot() can be used for bivariate analysis, here plotting humidity against K levels, colored by label. It also draws the distribution of each class along both features in the margins.

# The size parameter was renamed to height in newer seaborn versions
sns.jointplot(x="K",y="humidity",data=df,hue='label',height=8,s=30,alpha=0.7)

We can see that pH values are critical when it comes to soil. A stable value between 6 and 7 is preferred.

sns.boxplot(y='label',x='ph',data=df)

Box Plot

A box plot is a type of chart that depicts a group of numerical data through its quartiles. It is a simple way to visualize the shape of the data.
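
The quantities a box plot draws can be computed by hand. The sketch below uses a hypothetical list of pH-like values and the 1.5 × IQR whisker rule, which is the default in matplotlib and seaborn:

```python
import pandas as pd

# Hypothetical values, chosen so that one point lies far from the rest
values = pd.Series([5.5, 6.0, 6.2, 6.5, 6.8, 7.0, 9.5])

# The box spans Q1 to Q3, with a line at the median
q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Points beyond these fences are drawn as individual outlier markers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]
```

Here only the value 9.5 falls outside the fences, so it is the one point a box plot would draw as an outlier.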

sns.boxplot(y='label',x='P',data=df[df['rainfall']>150])
<AxesSubplot:xlabel='P', ylabel='label'>
Image in a Jupyter notebook

Further analyzing phosphorus levels.

When humidity is less than 65, almost the same phosphorus levels (approximately 14 to 25) are required for 6 crops, which could be grown based just on the amount of rain expected over the next few weeks.

LinePlot

A line plot displays data points connected by line segments, showing how one variable changes with another.

sns.lineplot(data = df[(df['humidity']<65)], x = "K", y = "rainfall",hue="label")
<AxesSubplot:xlabel='K', ylabel='rainfall'>
Image in a Jupyter notebook

Insights

  • Maize, chickpea, coffee etc need high soil humidity

  • Soil with phosphorus content in the range 12-25 is suitable for most of the crops, provided there is a good amount of moisture in the soil.

DATA PRE-PROCESSING

Let's make the data ready for a machine learning model.

# Encode each crop name as an integer code
c = df.label.astype('category')
targets = dict(enumerate(c.cat.categories))
df['target'] = c.cat.codes
df['target']

## Defining targets and features
y = df.target
X = df[['N', 'P', 'K', 'temperature', 'humidity', 'ph', 'rainfall']]
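
To make the encoding step concrete, here is a minimal sketch on a hypothetical four-row label column. It shows that the codes follow the sorted order of the categories and that the `targets` dictionary maps each code back to its crop name:

```python
import pandas as pd

# Hypothetical stand-in for df['label']
labels = pd.Series(["rice", "maize", "rice", "jute"])

c = labels.astype("category")

# code -> crop name; categories are sorted alphabetically
targets = dict(enumerate(c.cat.categories))

# Integer code for each row
codes = c.cat.codes

# Map codes back to crop names, e.g. to read model predictions
decoded = codes.map(targets)
```

Keeping `targets` around is what allows integer predictions from a model to be translated back into crop names.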

Correlation between features. The heatmap shows that phosphorus and potassium levels are highly correlated.

plt.figure(figsize=(20, 5))
sns.heatmap(X.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()

  • We will not consider phosphorus and potassium together, as they are correlated.
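
Highly correlated pairs can also be extracted from the correlation matrix instead of being read off the heatmap. A minimal sketch on hypothetical values; the 0.7 cutoff is an arbitrary illustrative choice, not a value from the notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical feature values standing in for X
toy = pd.DataFrame({
    "P": [10.0, 20.0, 30.0, 40.0, 50.0],
    "K": [12.0, 19.0, 33.0, 41.0, 48.0],
    "ph": [6.5, 6.1, 6.9, 6.2, 6.6],
})

corr = toy.corr().abs()

# Keep only the upper triangle so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.7  # assumed cutoff for "highly correlated"
pairs = upper.stack()
high_pairs = pairs[pairs > threshold].index.tolist()
```

Each tuple in `high_pairs` names two features from which one could be dropped before modeling.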

Train: Xtrain, Ytrain (training data, 80 percent)
Test: Xtest, Ytest, where Ytest holds the actual values and predictions are made from Xtest
Error: Actual - Predicted
New data: predictions for Xnew are compared against the correct Y values

FEATURE SCALING

Feature scaling is applied before feeding the data to the model. The scaler is fit on the training set only and then applied unchanged to the test set.

As we saw earlier, two of our features (temperature and pH) are approximately Gaussian distributed. We scale the features to the range 0 to 1 with MinMaxScaler.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# test_size is not given, so scikit-learn's default of 25% test data applies
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

scaler = MinMaxScaler()
# Learn the per-feature minimum and maximum from the training set and scale it
X_train_scaled = scaler.fit_transform(X_train)

# Apply the scaling computed on the training set to the test set as well
X_test_scaled = scaler.transform(X_test)
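
The arithmetic behind min-max scaling is simple enough to write out by hand. The sketch below reproduces it in plain NumPy on hypothetical arrays (it does not call MinMaxScaler itself) to show why the minimum and maximum must come from the training set only:

```python
import numpy as np

# Hypothetical training and test features (two columns)
X_train = np.array([[10.0, 5.0], [20.0, 6.0], [30.0, 7.0]])
X_test = np.array([[25.0, 7.5]])

# Statistics are learned from the training set only
col_min = X_train.min(axis=0)
col_max = X_train.max(axis=0)

# x_scaled = (x - min) / (max - min), applied column by column
X_train_scaled = (X_train - col_min) / (col_max - col_min)

# The test set reuses the training min and max
X_test_scaled = (X_test - col_min) / (col_max - col_min)
```

Because the test set is scaled with the training statistics, a test value outside the training range (7.5 in the second column here) maps outside the interval from 0 to 1. That is expected and keeps information about the test set from leaking into preprocessing.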