GitHub Repository: suyashi29/python-su
Path: blob/master/Lab on Key Python Lib.ipynb
Kernel: Python 3

Analyze given data set

  • Check data shape

  • Describe data

  • Check for null values, then drop or replace them

  • Drop columns that you find unnecessary

  • What is the revenue distribution for Bekins in 2005

  • What is the growth% for Data/Technology in 2005

  • What is the Revenue distribution for Education in 2005

  • What is the number of companies in each industry

  • Draw a correlation plot to check feature relationships.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set()
from subprocess import check_output
xls = pd.ExcelFile('CompanyDetails.xlsx')
overview = pd.read_excel(xls, 'Overview')
comp = pd.read_excel(xls, 'Financials')
# data[data["2005 Revenue"] == "3 Round"]  # leftover cell: `data` is never defined in this notebook
# xls.shape
overview.head()
comp.head()
df = overview.join(comp, lsuffix='', rsuffix='_right')
df.shape
(529, 18)
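`DataFrame.join` aligns rows by index, not by the `ID` column, so it pairs the two sheets correctly only when both list the companies in the same order. A minimal sketch of a key-based alternative follows. The two small frames and their values are hypothetical stand-ins, not data from `CompanyDetails.xlsx`.

```python
import pandas as pd

# Hypothetical stand-ins for the 'Overview' and 'Financials' sheets.
overview_demo = pd.DataFrame({"ID": [1, 2, 3], "Name": ["A", "B", "C"]})
financials_demo = pd.DataFrame({"ID": [3, 1, 2], "2005 Revenue": [30.0, 10.0, 20.0]})

# Merge on the shared key so each row is matched by ID, not by row position.
merged_demo = overview_demo.merge(financials_demo, on="ID", how="left")
print(merged_demo)
```

If both sheets also carry a `Name` column, merging on `["ID", "Name"]` would avoid producing suffixed duplicates that have to be dropped afterwards.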
df.describe(include="all")
df = df.drop(['ID_right' , 'Name_right'], axis=1)
df.columns
Index(['ID', 'Name', 'Industry', 'Year Founded', 'Employees', 'State', 'City', 'Zip Code', '2005 Revenue', '2005 Expenses', '2005 Growth%', '2004 Revenue', '2004 Expenses', '2004 Growth%', '2003 Revenue', '2003 Expenses'], dtype='object')
df.head()
df.isnull().sum()
ID                 0
Name               0
Industry           0
Year Founded       1
Employees          0
State              0
City              33
Zip Code          37
2005 Revenue       0
2005 Expenses      0
2005 Growth%       0
2004 Revenue       0
2004 Expenses      0
2004 Growth%      41
2003 Revenue      41
2003 Expenses    394
dtype: int64
df = df.drop(['City' , 'Zip Code' , '2004 Growth%','2003 Revenue','2003 Expenses'], axis=1)
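One way to make the choice of columns to drop explicit is to compute the share of missing values per column and compare it against a cutoff. The sketch below uses a hypothetical frame, and the 0.5 threshold is an assumption for illustration, not the rule applied in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical data with different amounts of missing values per column.
demo = pd.DataFrame({
    "Employees": [10, 20, 30, 40],
    "City": ["X", None, "Y", "Z"],
    "2003 Expenses": [np.nan, np.nan, np.nan, 5.0],
})

null_share = demo.isnull().mean()   # fraction of missing values per column
threshold = 0.5                     # assumed cutoff, chosen only for this example
to_drop = null_share[null_share > threshold].index.tolist()
cleaned = demo.drop(columns=to_drop)

print(null_share)
print(cleaned.columns.tolist())
```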
df["Year Founded"] = df["Year Founded"].fillna(df["Year Founded"].median())
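Filling with the median requires calling `median()`; passing the bound method `median` itself does not supply a numeric fill value. A small sketch on a hypothetical Series (the years are invented):

```python
import numpy as np
import pandas as pd

years = pd.Series([1990.0, 2000.0, np.nan, 2010.0], name="Year Founded")

# Call median() so that a number, not the method object, is used as the fill value.
filled = years.fillna(years.median())
print(filled.tolist())
```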
df = df.replace("Data/Technology,", "Data/Technology")
df = df.replace("Housing/Real Estate,", "Housing/Real Estate")
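Labels that differ only by a trailing comma can also be normalized in one pass rather than one `replace` per label. A sketch on a hypothetical Series, assuming the stray character is always a trailing comma:

```python
import pandas as pd

industry = pd.Series(["Data/Technology,", "Data/Technology", "Housing/Real Estate,"])

# Strip trailing commas so variants of the same label collapse into one value.
cleaned_industry = industry.str.rstrip(",")
print(cleaned_industry.unique())
```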
df.to_csv('file_name.csv')
df_copy = df.copy()  # independent copy, so later edits do not touch df
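Plain assignment (`b = a`) binds a second name to the same DataFrame, whereas `DataFrame.copy()` creates an independent object. A sketch with a hypothetical frame:

```python
import pandas as pd

original = pd.DataFrame({"Employees": [10, 20]})

alias = original               # same object under a second name
independent = original.copy()  # separate object with its own data

# A change made through the alias is visible in the original, but not in the copy.
alias.loc[0, "Employees"] = 99
print(original.loc[0, "Employees"])
print(independent.loc[0, "Employees"])
```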

What is the growth% for Data/Technology in 2005

df_copy.groupby("Industry")["2005 Growth%"].mean()["Data/Technology"]
2.952745423823052
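Selecting the numeric column before aggregating keeps text columns such as `Name` out of the mean, which recent pandas versions refuse to average. A sketch on hypothetical data (the growth figures are invented):

```python
import pandas as pd

demo = pd.DataFrame({
    "Industry": ["Data/Technology", "Data/Technology", "Education"],
    "Name": ["A", "B", "C"],
    "2005 Growth%": [2.0, 4.0, 1.0],
})

# Pick the column first, then aggregate per industry.
growth_by_industry = demo.groupby("Industry")["2005 Growth%"].mean()
print(growth_by_industry["Data/Technology"])
```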

What is the Revenue distribution for Education in 2005

x = df_copy.loc[df_copy["Industry"] == "Education", "2005 Revenue"]
ax = sns.boxplot(x=x)
Image in a Jupyter notebook
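A box plot shows the distribution visually; `Series.describe()` gives the same quartiles as numbers. A sketch on a hypothetical frame (the revenue values are invented):

```python
import pandas as pd

demo = pd.DataFrame({
    "Industry": ["Education"] * 4 + ["Energy"],
    "2005 Revenue": [100.0, 200.0, 300.0, 400.0, 999.0],
})

# Filter to one industry, then summarize: count, mean, std, min, quartiles, max.
edu_revenue = demo.loc[demo["Industry"] == "Education", "2005 Revenue"]
summary = edu_revenue.describe()
print(summary)
```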

What is the number of companies in each industry

df_copy.groupby(['Industry']).count()["ID"]
Industry
Aerospace and Defense         1
Business & Legal Services    45
Data/Technology              98
Education                    19
Energy                       28
Environment & Weather        12
Finance & Investment         75
Food & Agriculture            6
Geospatial/Mapping           30
Governance                   43
Healthcare                   40
Housing/Real Estate          21
Insurance                    11
Lifestyle & Consumer         25
Media                         1
Research & Consulting        28
Scientific Research          17
Software                      1
Transportation               28
Name: ID, dtype: int64
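The same tally can be obtained with `Series.value_counts()`, which counts rows per category and sorts by frequency instead of by label. A sketch on hypothetical data:

```python
import pandas as pd

demo = pd.DataFrame({"Industry": ["Energy", "Education", "Energy", "Media"]})

# value_counts tallies rows per category, most frequent first.
counts = demo["Industry"].value_counts()
print(counts)
```

Unlike `groupby(...).count()["ID"]`, this does not depend on `ID` being non-null in every row.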

Draw a correlation plot to check feature relationships

ch = df_copy.groupby('Industry').mean(numeric_only=True)
ch
ch.columns
Index(['ID', 'Employees', '2005 Revenue', '2005 Expenses', '2005 Growth%', '2004 Revenue', '2004 Expenses'], dtype='object')
#df = df.drop(['City' , 'Zip Code' , '2004 Growth%','2003 Revenue','2003 Expenses'], axis=1)
coordata = ch.drop(['ID'],axis=1)
correlations = coordata.corr()
plt.matshow(coordata.corr())
#fig = plt.figure(figsize=(11,9))
plt.xticks(range(len(coordata.columns)), coordata.columns)
plt.yticks(range(len(coordata.columns)), coordata.columns)
plt.colorbar()
plt.show()
Image in a Jupyter notebook
# Compute the correlation matrix
corr = coordata.corr()
# Generate a mask for the upper triangle
#mask = np.triu(np.ones_like(corr, dtype=np.bool))
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, cmap=cmap, vmax=.3, center=0,annot=True,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})
<matplotlib.axes._subplots.AxesSubplot at 0x21dd9cec288>
Image in a Jupyter notebook
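A correlation matrix is symmetric, so the upper triangle repeats the lower one. The sketch below builds a boolean mask for the upper triangle on a hypothetical frame (values invented) and applies it to the matrix directly, without plotting.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "Employees": [10, 20, 30, 40],
    "2005 Revenue": [1.0, 2.0, 3.0, 4.5],
    "2005 Expenses": [4.0, 3.0, 2.5, 1.0],
})

corr = demo.corr()

# True on and above the diagonal: the redundant half of a symmetric matrix.
mask = np.triu(np.ones(corr.shape, dtype=bool))

# Masked cells become NaN, leaving only the lower triangle.
lower_only = corr.mask(mask)
print(lower_only)
```

The same boolean array can be passed to `sns.heatmap` through its `mask` parameter to hide those cells in the plot.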