EDA on Crop Recommendation Using Weather and Soil Content
What is EDA?
EDA stands for Exploratory Data Analysis. It is an approach used in data science and statistics to summarize, visualize, and understand the main characteristics, patterns, and relationships present in a dataset. The primary goal of EDA is to gain insights into the data and to uncover potential patterns or anomalies that might not be immediately apparent.
EDA involves a variety of techniques and methods, including:
Summary Statistics: Calculating basic statistics such as mean, median, mode, standard deviation, and percentiles to understand the central tendency and variability of the data.
Data Visualization: Creating graphs, charts, and plots to visually represent the data. Common types of visualizations include histograms, scatter plots, box plots, bar charts, and heatmaps.
Distribution Analysis: Examining the distribution of variables to understand their shape, skewness, and spread. This helps identify potential outliers and anomalies.
Correlation Analysis: Investigating the relationships between different variables to understand how they are related and whether there are any strong correlations or dependencies.
Data Cleaning and Preprocessing: Identifying missing values, outliers, and inconsistencies in the data and deciding how to handle them before further analysis.
Feature Engineering: Exploring ways to create new features or transformations of existing features that might better capture the underlying patterns in the data.
Hypothesis Generation: Formulating initial hypotheses about the relationships between variables and then using EDA to validate or disprove these hypotheses.
Why EDA?
It is an approach to summarize, visualize, and become intimately familiar with the important characteristics of a data set.
It defines and refines the selection of feature variables that will be used for machine learning.
It helps to find hidden insights.
It provides the context needed to develop an appropriate model with minimum errors.
Insights:
The dataset contains observations for 22 crops.
Let us check for null values.
Insight:
There are no null values in the data.
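The two checks above can be sketched as follows. This is a minimal illustration on a hypothetical toy frame (the real notebook loads its own data file, which is not shown here); one missing value is planted deliberately so the output shows what a detected null looks like.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the crop dataset; values are made up for illustration.
df = pd.DataFrame({
    "temperature": [20.9, 21.8, np.nan, 26.5],  # one NaN planted on purpose
    "humidity": [82.0, 80.3, 82.3, 80.2],
    "label": ["rice", "rice", "maize", "maize"],
})

# Number of distinct crops in the label column
n_crops = df["label"].nunique()

# Missing values per column; all zeros would mean there is nothing to impute
null_counts = df.isnull().sum()

print(n_crops)
print(null_counts)
```

On the real data, the same two calls would be expected to report 22 crops and zero nulls in every column.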
Statistical Description using Pandas
Insights:
The mean nitrogen level across the given crops is approximately 51.
The mean phosphorus level across the given crops is approximately 53.
The mean potassium level across the given crops is approximately 48.
The mean temperature is 25.6, which suggests that a moderate temperature is favorable for the given set of crops.
We have a total of 22 crops in our data, and each crop has 100 data points.
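A statistical description like the one summarized above might be produced as in the sketch below. The column names `N`, `P`, `K` and `label` are assumptions about the dataset layout, and the numbers are toy values, not the real data.

```python
import pandas as pd

# Hypothetical toy frame; column names are assumed, values are made up.
df = pd.DataFrame({
    "N": [90, 85, 60, 71],
    "P": [42, 58, 55, 54],
    "K": [43, 41, 44, 16],
    "label": ["rice", "rice", "maize", "maize"],
})

# count, mean, std, min, quartiles and max for every numeric column
summary = df.describe()

# Number of rows per crop, to check whether the classes are balanced
per_crop = df["label"].value_counts()

print(summary.loc["mean"])
print(per_crop)
```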
Visualization Part
Total number of crops and their average suitable temperature
Bar plot
A bar plot or bar chart is a graph that represents categories of data with rectangular bars whose lengths or heights are proportional to the values they represent. Bar plots can be drawn horizontally or vertically. A bar chart shows comparisons between discrete categories.
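As a rough sketch of the bar plot described here, the average temperature per crop can be computed with a group-by and passed to `sns.barplot`. The data below is a hypothetical toy sample, and the column names `temperature` and `label` are assumed.

```python
import pandas as pd
import seaborn as sns

# Hypothetical toy data; values are made up for illustration.
df = pd.DataFrame({
    "temperature": [20.0, 24.0, 18.0, 22.0, 30.0, 34.0],
    "label": ["rice", "rice", "maize", "maize", "papaya", "papaya"],
})

# Average temperature per crop, sorted so the bars read from coolest to warmest
mean_temp = df.groupby("label")["temperature"].mean().sort_values()
bars = mean_temp.reset_index()

# One bar per crop; bar height is the mean temperature
ax = sns.barplot(data=bars, x="label", y="temperature")
ax.set_xlabel("crop")
ax.set_ylabel("mean temperature")
```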
Distribution of temperature, humidity and pH.
The temperature and pH distributions are symmetrical and bell shaped, showing that values usually fall near the average but occasionally deviate by large amounts. It is also striking how closely these two resemble each other!
Countplot
Shows value counts for a single categorical variable when only one variable is passed.
Shows value counts for two categorical variables when the hue parameter is used.
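Both uses of the count plot can be sketched side by side. The `season` column below is an invented second categorical variable, added only so that `hue` has something to split on; it is not part of the source data.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data; "season" is an invented column for illustration.
df = pd.DataFrame({
    "label": ["rice", "rice", "maize", "maize", "maize", "coconut"],
    "season": ["rainy", "rainy", "rainy", "dry", "dry", "rainy"],
})

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# One categorical variable: one bar per crop
sns.countplot(data=df, x="label", ax=ax1)

# Two categorical variables: bars for each crop are split by season
sns.countplot(data=df, x="label", hue="season", ax=ax2)

# The same counts as tables, useful for checking the plots
counts = df["label"].value_counts()
by_season = pd.crosstab(df["label"], df["season"])
print(by_season)
```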
Pairplot
Pairplot can be used to visualize the relationship between continuous or categorical variables. It is a useful tool for exploratory data analysis and statistical graphics.
Insights:
During the rainy season, average rainfall is high (about 120 mm) and the temperature is mildly cool (below 30 °C).
Rain affects soil moisture, which in turn affects the pH of the soil. Here are the crops that are likely to be planted during this season.
Rice needs heavy rainfall (>200 mm) and humidity above 80%. No wonder major rice production in India comes from the east coast, which has an average of 220 mm of rainfall every year!
Coconut is a tropical crop and needs high humidity, which explains the massive exports from coastal areas around the country.
Joint plots
Joint plots allow you to plot the relationship between two variables (also known as a bivariate relationship) while simultaneously exploring the distribution of each underlying variable.
Check crops for which the favorable temperature is greater than 35 and rainfall is greater than 130
Insights
Papaya and pigeon peas grow at high temperature and high rainfall.
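A check like this is a boolean filter on the frame. The sketch below assumes both conditions are "greater than" thresholds and uses a hypothetical toy frame whose rows were chosen so that the filter has something to select; the same pattern, with the comparisons flipped, would cover the low-temperature, low-humidity check.

```python
import pandas as pd

# Hypothetical toy rows; values are made up for illustration.
df = pd.DataFrame({
    "temperature": [36.5, 38.2, 24.0, 41.0, 29.0],
    "rainfall": [150.0, 140.0, 220.0, 90.0, 60.0],
    "label": ["papaya", "pigeonpeas", "rice", "mango", "chickpea"],
})

# Assumption: both thresholds are lower bounds (temperature > 35 and rainfall > 130).
# Each condition needs its own parentheses because & binds tighter than >.
hot_and_wet = df[(df["temperature"] > 35) & (df["rainfall"] > 130)]

# Crops that have at least one observation in that region
crops = sorted(hot_and_wet["label"].unique())
print(crops)
```

The filtered frame can be passed straight to `sns.jointplot(data=hot_and_wet, ...)` to plot only that subset.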
Check crops for which the favorable temperature is less than 30 and humidity is less than 50
Insights
Chickpea, kidney beans, mango, etc. are the crops that favor moderate temperature and humidity.
Let's try to plot a specific case of pairplot between `humidity` and `K` (potassium levels in the soil).
`sns.jointplot()` can be used for bivariate analysis to plot humidity against K levels, colored by label type. It also generates the frequency distribution of each class with respect to both features.
We can see that pH values are critical when it comes to soil. A stable value between 6 and 7 is preferred.
Box Plot
It is a type of chart that depicts a group of numerical data through their quartiles. It is a simple way to visualize the shape of our data.
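To make the quartile idea concrete, the sketch below draws one box per crop for `ph` and computes the same quartiles numerically. The values are a hypothetical toy sample.

```python
import pandas as pd
import seaborn as sns

# Hypothetical toy data; values are made up for illustration.
df = pd.DataFrame({
    "ph": [5.6, 6.0, 6.4, 6.8, 7.2, 5.5, 6.0, 6.5, 7.0, 7.5],
    "label": ["rice"] * 5 + ["maize"] * 5,
})

# One box per crop: the box spans the first to third quartile, the line inside is the median
ax = sns.boxplot(data=df, x="label", y="ph")

# The same quartiles as a table (rows: crops, columns: 0.25, 0.5, 0.75)
quartiles = df.groupby("label")["ph"].quantile([0.25, 0.5, 0.75]).unstack()
print(quartiles)
```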
Further analyzing phosphorus levels.
When humidity is below 65, six crops require nearly the same phosphorus levels (approximately 14 to 25), so the choice among them could be made based simply on the amount of rain expected over the next few weeks.
LinePlot
A line plot represents data as points connected by line segments, which makes it easy to see how one variable changes as another increases.
Insights
Maize, chickpea, coffee, etc. need high soil humidity.
Soil with phosphorus content in the range 12-25 is suitable for most of the crops, provided the soil has a good amount of moisture.
DATA PRE-PROCESSING
Let's make the data ready for machine learning model
Correlation visualization between features. We can see that phosphorus and potassium levels are highly correlated.
We will not use phosphorus and potassium together, as they are correlated.
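The correlation check and the resulting feature drop can be sketched as below. The data is synthetic, with `K` constructed on purpose to track `P`; the 0.7 cutoff and the choice to drop `K` rather than `P` are illustrative assumptions, not something the source specifies.

```python
import numpy as np
import pandas as pd

# Hypothetical data: K is built from P plus noise so the pair is strongly correlated.
rng = np.random.default_rng(2)
p = rng.normal(53, 30, 200)
df = pd.DataFrame({
    "P": p,
    "K": 0.9 * p + rng.normal(0, 5, 200),
    "N": rng.normal(51, 35, 200),  # independent of P and K
})

# Pairwise Pearson correlation between all numeric columns
corr = df.corr()

# Assumed cutoff for "highly correlated"
threshold = 0.7
cols = list(corr.columns)
pairs = [
    (a, b)
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(pairs)

# Keep only one feature from the correlated pair (dropping K is an arbitrary choice here)
reduced = df.drop(columns=["K"])
```

To draw the matrix rather than print it, `sns.heatmap(corr, annot=True)` produces an annotated heatmap.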
FEATURE SCALING
Feature scaling is required before creating training data and feeding it to the model.
As we saw earlier, two of our features (temperature and pH) are Gaussian distributed, so we scale them to the range 0 to 1 with MinMaxScaler.
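A minimal sketch of the scaling step follows. It uses a hypothetical toy frame and one common variant of the workflow: the scaler is fitted on the training split only and then applied to the test split, so that test values do not influence the learned minimum and maximum. The split ratio and `random_state` are arbitrary choices for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data; values are made up for illustration.
df = pd.DataFrame({
    "temperature": [20.9, 21.8, 23.0, 26.5, 30.1, 35.2, 18.4, 27.7],
    "ph": [6.5, 7.0, 7.8, 6.9, 5.8, 6.2, 6.0, 7.1],
    "label": ["rice", "rice", "maize", "maize", "papaya", "papaya", "rice", "maize"],
})

X = df[["temperature", "ph"]]
y = df["label"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

scaler = MinMaxScaler()  # default feature_range is (0, 1)

# Learn per-column min and max from the training data, then rescale it
X_train_scaled = scaler.fit_transform(X_train)

# Reuse the training min and max; test values may fall slightly outside [0, 1]
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.min(axis=0), X_train_scaled.max(axis=0))
```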