GitHub Repository: YStrano/DataScience_GA
Path: blob/master/projects/project_1/Project-1 - Yair Strano.ipynb

Kernel: Python 3

Project 1

In this first project you will implement a few python functions and generally familiarize yourself with the pandas library.

Please refer to the numpy-and-pandas.ipynb notebook in lesson2, the for loop section of the python-controlflow.ipynb notebook in the python_foundations folder, and the pandas documentation here for assistance.

I have written the numerical answers you are looking for below - please show me the code you used to generate those answers.

Note! You will need to look within that documentation or use other search results on the internet to complete this assignment!

Question 1: Multiples of Three and Five

If we list all of the natural numbers below 10 that are multiples of 3 or 5, we get 3, 5, 6, and 9. The sum of these multiples is 23. Find the sum of all the multiples of 3 or 5 below 1,000.

In [89]:

mylist = []
for x in range(1, 1000):
    if x % 3 == 0:
        mylist.append(x)
    elif x % 5 == 0:
        mylist.append(x)
print(mylist[:5])

Out[89]:

[3, 5, 6, 9, 10]

In [90]:

sum(mylist)

Out[90]:

233168

Answer: 233,168

Note: you may find yourself with the answer 266,333! Think carefully what is going on with this question and what may be driving the difference between your answer and the correct value! A hint can be found in the control flow notebook.

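Using elif ensures that a number that is a multiple of both 3 and 5 (such as 15) is appended only once. Summing the multiples of 3 and the multiples of 5 separately counts those numbers twice, which is where 266,333 comes from.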
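The double counting can also be checked arithmetically. The sketch below is an illustration, not part of the original submission: it recomputes the sum with a single `or` condition, then shows that adding the two sums separately overshoots by exactly the sum of the multiples of 15. The helper name `sum_multiples` is hypothetical.

```python
def sum_multiples(k, limit=1000):
    """Sum of the positive multiples of k strictly below limit."""
    return sum(range(k, limit, k))

# One pass with `or`: each number is tested once, so nothing is counted twice.
total = sum(x for x in range(1, 1000) if x % 3 == 0 or x % 5 == 0)

# Adding the two sums separately counts every multiple of 15 twice.
double_counted = sum_multiples(3) + sum_multiples(5)

# Inclusion-exclusion: subtract the multiples of 15 once to correct the total.
corrected = double_counted - sum_multiples(15)

print(total, double_counted, corrected)  # 233168 266333 233168
```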

Question 2: Pandas Intro

In [91]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

2.1 Load the Citibike-Feb-24 dataset into memory and assign it to the variable "df"

The data are in /data/citibike_feb2014.csv

Use pd.read_csv function. Please refer to the documentation here if you are having trouble.

In [92]:

df = pd.read_csv('data/citibike_feb2014.csv')

2.2 How many rows and how many columns are there in the dataset?

In [93]:

print(df.shape)

Out[93]:

(224736, 15)

A: 224,736 rows, 15 columns

2.3 Please print out the first five rows of the dataset

In [94]:

df.head(5)

Out[94]:

2.4 What is the average trip duration? (In seconds)

In [95]:

df['tripduration'].mean()

Out[95]:

874.5198099102947

A: 874.5198 seconds

2.5 What is the total trip duration in this entire dataset in hours?

In [96]:

total_secs = df['tripduration'].sum()
hours = total_secs / 60 / 60
print(hours)

Out[96]:

54593.35666666667

A: 54593.3567 hours
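As a quick consistency check, the total can be rebuilt from two numbers already printed above: the mean trip duration (Out[95]) times the row count (Out[93]) should equal the sum of all durations. This is only a sanity-check sketch; the variable names are illustrative.

```python
mean_secs = 874.5198099102947  # from Out[95]
n_rows = 224736                # from Out[93]

# mean * count recovers the column sum (up to floating-point rounding).
total_secs = mean_secs * n_rows

# 60 seconds per minute, 60 minutes per hour.
hours = total_secs / 3600

print(round(hours, 4))  # 54593.3567
```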

2.6 What is the most popular start station? And how many rides started at that station in Feb 2014?

Note, the pandas cookbook may come in handy for this (look at chapter 1 & 2): https://pandas.pydata.org/pandas-docs/stable/tutorials.html

In [97]:

list(df.columns)

Out[97]:

['tripduration',
 'starttime',
 'stoptime',
 'start station id',
 'start station name',
 'start station latitude',
 'start station longitude',
 'end station id',
 'end station name',
 'end station latitude',
 'end station longitude',
 'bikeid',
 'usertype',
 'birth year',
 'gender']

In [98]:

df['start station name'].value_counts()

Out[98]:

Lafayette St & E 8 St              2920
Pershing Square N                  2719
E 17 St & Broadway                 2493
W 21 St & 6 Ave                    2403
8 Ave & W 31 St                    2171
8 Ave & W 33 St                    1944
W 41 St & 8 Ave                    1916
University Pl & E 14 St            1808
Cleveland Pl & Spring St           1796
Broadway & E 14 St                 1795
Broadway & E 22 St                 1694
E 43 St & Vanderbilt Ave           1688
W 33 St & 7 Ave                    1570
Broadway & W 24 St                 1562
W 31 St & 7 Ave                    1538
Greenwich Ave & 8 Ave              1500
Christopher St & Greenwich St      1500
Great Jones St                     1471
W 13 St & 6 Ave                    1392
W 27 St & 7 Ave                    1391
2 Ave & E 31 St                    1371
Lexington Ave & E 24 St            1363
W 17 St & 8 Ave                    1361
W 18 St & 6 Ave                    1359
E 32 St & Park Ave                 1352
W 38 St & 8 Ave                    1337
West St & Chambers St              1324
1 Ave & E 15 St                    1319
Lawrence St & Willoughby St        1298
6 Ave & W 33 St                    1290
                                   ... 
Sands St & Gold St                  136
Macon St & Nostrand Ave             134
Clinton Ave & Flushing Ave          134
Fulton St & Clermont Ave            130
S Portland Ave & Hanson Pl          128
Broadway & Berry St                 126
Clermont Ave & Park Ave             121
Cadman Plaza E & Tillary St         110
Myrtle Ave & St Edwards St          100
Monroe St & Classon Ave              94
Carlton Ave & Park Ave               93
Lexington Ave & Classon Ave          92
Hancock St & Bedford Ave             91
Fulton St & Rockwell Pl              89
Nassau St & Navy St                  89
3 Ave & Schermerhorn St              89
Avenue D & E 12 St                   82
Flushing Ave & Carlton Ave           76
Gallatin Pl & Livingston St          76
7 Ave & Farragut St                  75
Columbia Heights & Cranberry St      72
W 13 St & 5 Ave                      69
Franklin Ave & Myrtle Ave            57
Park Ave & St Edwards St             57
Front St & Gold St                   56
Hanover Pl & Livingston St           54
Concord St & Bridge St               45
Bedford Ave & S 9th St               41
Railroad Ave & Kay Ave               36
Church St & Leonard St                4
Name: start station name, Length: 329, dtype: int64

In [99]:

df['start station id'].value_counts()

Out[99]:

   2920
   2719
   2493
   2403
   2171
   1944
   1916
   1808
   1796
   1795
   1694
   1688
   1570
   1562
   1538
   1500
   1500
   1471
   1392
   1391
   1371
   1363
   1361
   1359
   1352
   1337
   1324
   1319
   1298
   1290
        ... 
    136
    134
    134
    130
    128
    126
    121
    110
    100
     94
     93
     92
     91
     89
     89
     89
     82
     76
     76
    75
     72
     69
     57
     57
     56
     54
     45
     41
    36
      4
Name: start station id, Length: 329, dtype: int64

In [100]:

df['start station id'].value_counts().loc[293]

Out[100]:

2920

A: Lafayette St & E 8 St (station id 293), number of rides: 2,920
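The top station and its count can also be read off programmatically rather than by scanning the printed output. A minimal sketch on made-up data, assuming a Series shaped like df['start station name'] (the station labels here are hypothetical):

```python
import pandas as pd

# Hypothetical toy data standing in for df['start station name'].
starts = pd.Series(["A St", "B Ave", "A St", "C Pl", "A St", "B Ave"])

counts = starts.value_counts()  # one row per station, sorted by count (descending)
top_station = counts.idxmax()   # index label with the highest count
top_rides = counts.max()        # the highest count itself

print(top_station, top_rides)   # A St 3
```

On the real data, the same two calls on df['start station name'].value_counts() would be expected to return the first row of Out[98].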

2.7 What percentage of the total riders are of usertype "Subscriber"?

In [101]:

df['usertype'].value_counts()

Out[101]:

Subscriber    218019
Customer        6717
Name: usertype, dtype: int64

In [102]:

subs = sum(df['usertype'] == 'Subscriber') / len(df)
print(subs)

Out[102]:

0.9701115976078599

A: 97.0112%
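An alternative is to let value_counts compute the shares directly with normalize=True. The sketch below rebuilds a Series from the two counts in Out[101] purely for illustration; on the real data the call would be made on df['usertype'].

```python
import pandas as pd

# Rebuild the usertype column from the counts shown in Out[101].
usertype = pd.Series(["Subscriber"] * 218019 + ["Customer"] * 6717)

# normalize=True returns fractions of the total instead of raw counts.
shares = usertype.value_counts(normalize=True)
pct_subscriber = shares["Subscriber"] * 100

print(round(pct_subscriber, 4))  # 97.0112
```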

2.8 What is the average age (in 2014) of the riders in this dataset?

Note, this requires creating a new column and then taking the difference between 2014 and the rider's birth year, then taking the average!

In [104]:

2014 - df['birth year'][0]

Out[104]:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-104-1b2a498ce611> in <module>()
----> 1 2014 - df['birth year'][0]

TypeError: unsupported operand type(s) for -: 'int' and 'str'

In [105]:

df['birth year'].head()

Out[105]:

0    1991
1    1979
2    1948
3    1981
4    1990
Name: birth year, dtype: object

In [106]:

df['birth year'].astype(int)

Out[106]:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-106-ff839553694e> in <module>()
----> 1 df['birth year'].astype(int)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\util\_decorators.py in wrapper(*args, **kwargs)
    116                 else:
    117                     kwargs[new_arg_name] = new_arg_value
--> 118             return func(*args, **kwargs)
    119         return wrapper
    120     return _deprecate_kwarg
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py in astype(self, dtype, copy, errors, **kwargs)
   4002         # else, only a single dtype is given
   4003         new_data = self._data.astype(dtype=dtype, copy=copy, errors=errors,
-> 4004                                      **kwargs)
   4005         return self._constructor(new_data).__finalize__(self)
   4006 
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals.py in astype(self, dtype, **kwargs)
   3460 
   3461     def astype(self, dtype, **kwargs):
-> 3462         return self.apply('astype', dtype=dtype, **kwargs)
   3463 
   3464     def convert(self, **kwargs):
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals.py in apply(self, f, axes, filter, do_integrity_check, consolidate, **kwargs)
   3327 
   3328             kwargs['mgr'] = self
-> 3329             applied = getattr(b, f)(**kwargs)
   3330             result_blocks = _extend_blocks(applied, result_blocks)
   3331 
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals.py in astype(self, dtype, copy, errors, values, **kwargs)
    542     def astype(self, dtype, copy=False, errors='raise', values=None, **kwargs):
    543         return self._astype(dtype, copy=copy, errors=errors, values=values,
--> 544                             **kwargs)
    545 
    546     def _astype(self, dtype, copy=False, errors='raise', values=None,
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals.py in _astype(self, dtype, copy, errors, values, klass, mgr, **kwargs)
    623 
    624                 # _astype_nansafe works fine with 1-d only
--> 625                 values = astype_nansafe(values.ravel(), dtype, copy=True)
    626                 values = values.reshape(self.shape)
    627 
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\dtypes\cast.py in astype_nansafe(arr, dtype, copy)
    690     elif arr.dtype == np.object_ and np.issubdtype(dtype.type, np.integer):
    691         # work around NumPy brokenness, #1987
--> 692         return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
    693 
    694     if dtype.name in ("datetime64", "timedelta64"):
pandas/_libs/lib.pyx in pandas._libs.lib.astype_intsafe()
pandas/_libs/src/util.pxd in util.set_value_at_unsafe()
ValueError: invalid literal for int() with base 10: '\\N'

In [107]:

df['birth year'].astype(str).astype(int)

Out[107]:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-107-009a67367ef8> in <module>()
----> 1 df['birth year'].astype(str).astype(int)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\util\_decorators.py in wrapper(*args, **kwargs)
    116                 else:
    117                     kwargs[new_arg_name] = new_arg_value
--> 118             return func(*args, **kwargs)
    119         return wrapper
    120     return _deprecate_kwarg
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py in astype(self, dtype, copy, errors, **kwargs)
   4002         # else, only a single dtype is given
   4003         new_data = self._data.astype(dtype=dtype, copy=copy, errors=errors,
-> 4004                                      **kwargs)
   4005         return self._constructor(new_data).__finalize__(self)
   4006 
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals.py in astype(self, dtype, **kwargs)
   3460 
   3461     def astype(self, dtype, **kwargs):
-> 3462         return self.apply('astype', dtype=dtype, **kwargs)
   3463 
   3464     def convert(self, **kwargs):
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals.py in apply(self, f, axes, filter, do_integrity_check, consolidate, **kwargs)
   3327 
   3328             kwargs['mgr'] = self
-> 3329             applied = getattr(b, f)(**kwargs)
   3330             result_blocks = _extend_blocks(applied, result_blocks)
   3331 
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals.py in astype(self, dtype, copy, errors, values, **kwargs)
    542     def astype(self, dtype, copy=False, errors='raise', values=None, **kwargs):
    543         return self._astype(dtype, copy=copy, errors=errors, values=values,
--> 544                             **kwargs)
    545 
    546     def _astype(self, dtype, copy=False, errors='raise', values=None,
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals.py in _astype(self, dtype, copy, errors, values, klass, mgr, **kwargs)
    623 
    624                 # _astype_nansafe works fine with 1-d only
--> 625                 values = astype_nansafe(values.ravel(), dtype, copy=True)
    626                 values = values.reshape(self.shape)
    627 
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\dtypes\cast.py in astype_nansafe(arr, dtype, copy)
    690     elif arr.dtype == np.object_ and np.issubdtype(dtype.type, np.integer):
    691         # work around NumPy brokenness, #1987
--> 692         return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
    693 
    694     if dtype.name in ("datetime64", "timedelta64"):
pandas/_libs/lib.pyx in pandas._libs.lib.astype_intsafe()
pandas/_libs/src/util.pxd in util.set_value_at_unsafe()
ValueError: invalid literal for int() with base 10: '\\N'

In [108]:

df['birth_year'] = pd.to_numeric(df['birth year'])

Out[108]:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
pandas/_libs/src/inference.pyx in pandas._libs.lib.maybe_convert_numeric()
ValueError: Unable to parse string "\N"

During handling of the above exception, another exception occurred:
ValueError                                Traceback (most recent call last)
<ipython-input-108-cd37727500f1> in <module>()
----> 1 df['birth_year'] = pd.to_numeric(df['birth year'])

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\tools\numeric.py in to_numeric(arg, errors, downcast)
    131             coerce_numeric = False if errors in ('ignore', 'raise') else True
    132             values = lib.maybe_convert_numeric(values, set(),
--> 133                                                coerce_numeric=coerce_numeric)
    134 
    135     except Exception:
pandas/_libs/src/inference.pyx in pandas._libs.lib.maybe_convert_numeric()
ValueError: Unable to parse string "\N" at position 31

In [109]:

df['birth year'].loc[31]

Out[109]:

'\\N'
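The value at position 31 is the literal two-character string \N, which is why the integer conversions fail. One way to handle such a placeholder, sketched here on a made-up three-row sample rather than the real file, is to declare it as a missing-value marker when reading the CSV. The sample rows and their values are hypothetical.

```python
import io
import pandas as pd

# Hypothetical three-row sample in the same style as the real file.
sample_csv = io.StringIO(
    "tripduration,usertype,birth year\n"
    "382,Subscriber,1991\n"
    "664,Customer,\\N\n"
    "372,Subscriber,1979\n"
)

# Treat the literal string \N as missing while parsing.
sample = pd.read_csv(sample_csv, na_values=["\\N"])

print(sample["birth year"].dtype)         # numeric (float), because NaN is present
print(sample["birth year"].isna().sum())  # 1
```

With the placeholder parsed as NaN, the column is numeric from the start and no string-to-int conversion is needed.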

In [110]:

df['birth year'].value_counts()

Out[110]:

  9305
  9139
  8779
  8208
  8109
  8048
  7968
  7771
  7661
  7472
  6876
  6843
\N      6717
  5957
  5848
  5736
  5697
  5579
  5455
  5028
  4974
  4923
  4884
  4650
  4476
  4249
  4229
  4005
  3644
  3641
        ... 
   725
   454
   451
   334
   311
   251
   214
   182
   164
   130
    84
    75
    68
    43
    32
    31
    24
    21
    14
    11
    10
     9
     8
     5
     5
     4
     3
     2
     1
     1
Name: birth year, Length: 78, dtype: int64

In [111]:

df.iloc[31]

Out[111]:

tripduration                                   664
starttime                      2014-02-01 00:08:47
stoptime                       2014-02-01 00:19:51
start station id                               237
start station name                 E 11 St & 2 Ave
start station latitude                     40.7305
start station longitude                   -73.9867
end station id                                 349
end station name           Rivington St & Ridge St
end station latitude                       40.7185
end station longitude                     -73.9833
bikeid                                       17540
usertype                                  Customer
birth year                                      \N
gender                                           0
Name: 31, dtype: object

In [112]:

new_df = df[df['birth year']!='\\N']
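A boolean filter like this returns a subset of the original frame, and assigning a column on that subset can trigger pandas' SettingWithCopyWarning. A common way to avoid it, sketched here on toy data with hypothetical names, is to take an explicit copy before modifying the result.

```python
import pandas as pd

# Toy frame with the same \N placeholder as the real 'birth year' column.
toy = pd.DataFrame({"birth year": ["1991", "\\N", "1979"]})

# .copy() makes the filtered frame independent of `toy`,
# so assigning a column on it is unambiguous.
valid = toy[toy["birth year"] != "\\N"].copy()
valid["birth year"] = valid["birth year"].astype(int)

print(valid["birth year"].tolist())  # [1991, 1979]
```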

In [113]:

new_df['birth year'] = new_df['birth year'].astype(int)

Out[113]:

C:\Users\ystrano\AppData\Local\Continuum\anaconda3\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.

In [114]:

df['int_birthyear'] = 2014 - new_df['birth year']

In [115]:

df['int_birthyear'].mean()

Out[115]:

38.50249290199478
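Because the assignment in In [114] aligns on the index, rows that were filtered out receive NaN, and mean() skips them, so the average above is taken over riders with a recorded birth year. The same result can be reached without a second DataFrame by coercing unparseable values to NaN. This is a sketch on toy data under that assumption; the column names birth_year_num and age_2014 are hypothetical.

```python
import pandas as pd

# Toy frame with the same \N placeholder as the real 'birth year' column.
toy = pd.DataFrame({"birth year": ["1991", "\\N", "1979", "1948"]})

# errors="coerce" turns unparseable strings such as \N into NaN.
toy["birth_year_num"] = pd.to_numeric(toy["birth year"], errors="coerce")

# Age in 2014; rows without a birth year stay NaN.
toy["age_2014"] = 2014 - toy["birth_year_num"]

# Series.mean() skips NaN by default, so only known ages are averaged.
mean_age = toy["age_2014"].mean()
print(mean_age)
```

Here the known ages are 23, 35, and 66, so the mean is their average and the \N row does not contribute.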