CoCalc -- Lab on Key Python Lib.ipynb

GitHub Repository: suyashi29/python-su
Path: blob/master/Lab on Key Python Lib.ipynb
³⁰⁶⁴ views

Kernel: Python 3

Analyze given data set

Check data shape
Describe data
Check for null values and drop null values or replace it
Drop colums that you find unnecessary
What is revenue distribution for Bekins in 2005
What is the growth% for Data/Technology in 2006
What is the Revenue distribution for Education in 2005
What is number of companies in Each industry
Draw a correlation plot to check on feature realtionship.

In [1]:

import numpy as  np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set()
from subprocess import check_output

In [2]:

xls = pd.ExcelFile('CompanyDetails.xlsx')
overview = pd.read_excel(xls, 'Overview')
comp = pd.read_excel(xls, 'Financials')

In [ ]:

data[data["2005 Revenue"]="3 Round"]]

In [3]:

#xls.shape
overview.head()

Out[3]:

In [4]:

comp.head()

Out[4]:

In [5]:

df = overview.join(comp, lsuffix='', rsuffix='_right')
df.shape

Out[5]:

(529, 18)

In [6]:

df.describe(include="all")

Out[6]:

In [7]:

df = df.drop(['ID_right' , 'Name_right'], axis=1)

In [8]:

df.columns

Out[8]:

Index(['ID', 'Name', 'Industry', 'Year Founded', 'Employees', 'State', 'City',
       'Zip Code', '2005 Revenue', '2005 Expenses', '2005 Growth%',
       '2004 Revenue', '2004 Expenses', '2004 Growth%', '2003 Revenue',
       '2003 Expenses'],
      dtype='object')

In [9]:

df.head()

Out[9]:

In [10]:

df.isnull().sum()

Out[10]:

ID                 0
Name               0
Industry           0
Year Founded       1
Employees          0
State              0
City              33
Zip Code          37
2005 Revenue       0
2005 Expenses      0
2005 Growth%       0
2004 Revenue       0
2004 Expenses      0
2004 Growth%      41
2003 Revenue      41
2003 Expenses    394
dtype: int64

In [11]:

df = df.drop(['City' , 'Zip Code' , '2004 Growth%','2003 Revenue','2003 Expenses'], axis=1)

In [12]:

df["Year Founded"]= df["Year Founded"].fillna(df["Year Founded"].median)

In [13]:

df = df.replace("Data/Technology,", "Data/Technology")

In [14]:



df = df.replace("Housing/Real Estate,", "Housing/Real Estate")

In [15]:

df.to_csv('file_name.csv')

In [16]:

df_copy = df

What is the growth% for Data/Technology in 2005

In [17]:

df_copy.groupby(['Industry']).mean()["2005 Growth%"]["Data/Technology"]

Out[17]:

2.952745423823052

What is the Revenue distribution for Education in 2005

In [36]:

x = df_copy.loc[df_copy["Industry"] == "Education", "2005 Revenue"]
ax = sns.boxplot(x)

Out[36]:

What is number of companies in Each industry

In [19]:

df_copy.groupby(['Industry']).count()["ID"]

Out[19]:

Industry
Aerospace and Defense         1
Business & Legal Services    45
Data/Technology              98
Education                    19
Energy                       28
Environment & Weather        12
Finance & Investment         75
Food & Agriculture            6
Geospatial/Mapping           30
Governance                   43
Healthcare                   40
Housing/Real Estate          21
Insurance                    11
Lifestyle & Consumer         25
Media                         1
Research & Consulting        28
Scientific Research          17
Software                      1
Transportation               28
Name: ID, dtype: int64

Draw a correlation plot to check on feature realtionship

In [20]:

ch = df_copy.groupby(['Industry']).mean()
ch

Out[20]:

In [21]:

ch.columns

Out[21]:

Index(['ID', 'Employees', '2005 Revenue', '2005 Expenses', '2005 Growth%',
       '2004 Revenue', '2004 Expenses'],
      dtype='object')

In [22]:

#df = df.drop(['City' , 'Zip Code' , '2004 Growth%','2003 Revenue','2003 Expenses'], axis=1)

coordata = ch.drop(['ID'],axis=1)

In [23]:

correlations = coordata.corr()

In [24]:

plt.matshow(coordata.corr())
#fig = plt.figure(figsize=(11,9))
plt.xticks(range(len(coordata.columns)), coordata.columns)
plt.yticks(range(len(coordata.columns)), coordata.columns)
plt.colorbar()
plt.show()

Out[24]:

In [29]:

# Compute the correlation matrix
corr = coordata.corr()

# Generate a mask for the upper triangle
#mask = np.triu(np.ones_like(corr, dtype=np.bool))

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, cmap=cmap, vmax=.3, center=0,annot=True,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

Out[29]:

<matplotlib.axes._subplots.AxesSubplot at 0x21dd9cec288>

In [ ]:

In [ ]:

In [ ]:

Analyze given data set

What is the growth% for Data/Technology in 2005

What is the Revenue distribution for Education in 2005

What is number of companies in Each industry

Draw a correlation plot to check on feature realtionship

Product

Resources

Company