YStrano
GitHub Repository: YStrano/DataScience_GA
Path: blob/master/projects/project_1/Project-1 - Yair Strano.ipynb
Kernel: Python 3

Project 1

In this first project you will implement a few Python functions and generally familiarize yourself with the pandas library.

Please refer to the numpy-and-pandas.ipynb notebook in lesson2, the for loop section of the python-controlflow.ipynb notebook in the python_foundations folder, and the pandas documentation here for assistance.

I have written the numerical answers you are looking for below - please show me the code you used to generate those answers.

Note! You will need to look within that documentation or use other search results on the internet to complete this assignment!

Question 1: Multiples of Three and Five

If we list all of the natural numbers below 10 that are multiples of 3 or 5, we get 3, 5, 6, and 9. The sum of these multiples is 23. Find the sum of all the multiples of 3 or 5 below 1,000.

mylist = []
for x in range(1, 1000):
    if x % 3 == 0:
        mylist.append(x)
    elif x % 5 == 0:
        mylist.append(x)
print(mylist[:5])
[3, 5, 6, 9, 10]
sum(mylist)
233168

Answer: 233,168

Note: you may find yourself with the answer 266,333! Think carefully about what is going on with this question and what may be driving the difference between your answer and the correct value! A hint can be found in the control flow notebook.

Using elif avoids counting a number that is a multiple of both 3 and 5 (such as 15) twice: once the x % 3 test succeeds, the x % 5 branch is skipped for that number.
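To make the difference between 266,333 and 233,168 concrete, here is a minimal sketch that computes the sum three ways: with two independent `if` tests (which double-counts multiples of 15), with a single `or` condition, and with an inclusion-exclusion formula. The names `LIMIT` and `sum_multiples` are illustrative and not part of the assignment.

```python
LIMIT = 1000

# Two independent `if` tests: a multiple of 15 passes both and is added twice.
double_counted = 0
for x in range(1, LIMIT):
    if x % 3 == 0:
        double_counted += x
    if x % 5 == 0:
        double_counted += x

# A single `or` condition adds each qualifying number exactly once.
correct = sum(x for x in range(1, LIMIT) if x % 3 == 0 or x % 5 == 0)


def sum_multiples(k, limit):
    """Sum of the positive multiples of k strictly below limit."""
    n = (limit - 1) // k          # how many multiples of k are below limit
    return k * n * (n + 1) // 2   # k * (1 + 2 + ... + n)


# Inclusion-exclusion: subtract the multiples of 15 that were counted twice.
closed_form = (
    sum_multiples(3, LIMIT) + sum_multiples(5, LIMIT) - sum_multiples(15, LIMIT)
)

print(double_counted, correct, closed_form)  # 266333 233168 233168
```

The gap between the two answers, 33,165, is exactly the sum of the multiples of 15 below 1,000.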

Question 2: Pandas Intro

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

2.1 Load the Citibike-Feb-24 dataset into memory and assign it to the variable "df"

The data are in /data/citibike_feb2014.csv

Use the pd.read_csv function. Please refer to the documentation here if you are having trouble.

df = pd.read_csv('data/citibike_feb2014.csv')

2.2 How many rows and how many columns are there in the dataset?

print(df.shape)
(224736, 15)

A: 224,736 rows, 15 columns

2.3 Please print out the first five rows of the dataset

df.head(5)

2.4 What is the average trip duration? (In seconds)

df['tripduration'].mean()
874.5198099102947

A: 874.5198 seconds

2.5 What is the total trip duration in this entire dataset in hours?

total_secs = df['tripduration'].sum()
hours = total_secs / 60 / 60
print(hours)
54593.35666666667

A: 54593.3567 hours
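As a quick sanity check on the two answers above, the total should equal the mean trip duration times the number of rows, converted from seconds to hours. The sketch below uses only the figures already reported in this notebook (224,736 rows and a mean of 874.5198099102947 seconds); the variable names are illustrative.

```python
import math

n_rows = 224_736                 # from df.shape
mean_secs = 874.5198099102947    # from df['tripduration'].mean()

# total seconds = mean * count; 3600 seconds per hour
total_hours = mean_secs * n_rows / 3600

print(round(total_hours, 4))  # 54593.3567
```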

Note, the pandas cookbook may come in handy for this (look at chapter 1 & 2): https://pandas.pydata.org/pandas-docs/stable/tutorials.html

list(df.columns)
['tripduration', 'starttime', 'stoptime', 'start station id', 'start station name', 'start station latitude', 'start station longitude', 'end station id', 'end station name', 'end station latitude', 'end station longitude', 'bikeid', 'usertype', 'birth year', 'gender']
df['start station name'].value_counts()
Lafayette St & E 8 St              2920
Pershing Square N                  2719
E 17 St & Broadway                 2493
W 21 St & 6 Ave                    2403
8 Ave & W 31 St                    2171
8 Ave & W 33 St                    1944
W 41 St & 8 Ave                    1916
University Pl & E 14 St            1808
Cleveland Pl & Spring St           1796
Broadway & E 14 St                 1795
Broadway & E 22 St                 1694
E 43 St & Vanderbilt Ave           1688
W 33 St & 7 Ave                    1570
Broadway & W 24 St                 1562
W 31 St & 7 Ave                    1538
Greenwich Ave & 8 Ave              1500
Christopher St & Greenwich St      1500
Great Jones St                     1471
W 13 St & 6 Ave                    1392
W 27 St & 7 Ave                    1391
2 Ave & E 31 St                    1371
Lexington Ave & E 24 St            1363
W 17 St & 8 Ave                    1361
W 18 St & 6 Ave                    1359
E 32 St & Park Ave                 1352
W 38 St & 8 Ave                    1337
West St & Chambers St              1324
1 Ave & E 15 St                    1319
Lawrence St & Willoughby St        1298
6 Ave & W 33 St                    1290
                                   ...
Sands St & Gold St                  136
Macon St & Nostrand Ave             134
Clinton Ave & Flushing Ave          134
Fulton St & Clermont Ave            130
S Portland Ave & Hanson Pl          128
Broadway & Berry St                 126
Clermont Ave & Park Ave             121
Cadman Plaza E & Tillary St         110
Myrtle Ave & St Edwards St          100
Monroe St & Classon Ave              94
Carlton Ave & Park Ave               93
Lexington Ave & Classon Ave          92
Hancock St & Bedford Ave             91
Fulton St & Rockwell Pl              89
Nassau St & Navy St                  89
3 Ave & Schermerhorn St              89
Avenue D & E 12 St                   82
Flushing Ave & Carlton Ave           76
Gallatin Pl & Livingston St          76
7 Ave & Farragut St                  75
Columbia Heights & Cranberry St      72
W 13 St & 5 Ave                      69
Franklin Ave & Myrtle Ave            57
Park Ave & St Edwards St             57
Front St & Gold St                   56
Hanover Pl & Livingston St           54
Concord St & Bridge St               45
Bedford Ave & S 9th St               41
Railroad Ave & Kay Ave               36
Church St & Leonard St                4
Name: start station name, Length: 329, dtype: int64
df['start station id'].value_counts()
293     2920
519     2719
497     2493
435     2403
521     2171
490     1944
477     1916
382     1808
151     1796
285     1795
402     1694
318     1688
492     1570
444     1562
379     1538
284     1500
358     1500
229     1471
345     1392
442     1391
528     1371
537     1363
116     1361
168     1359
472     1352
523     1337
426     1324
504     1319
323     1298
505     1290
        ...
282      136
437      134
343      134
397      130
353      128
389      126
421      121
232      110
245      100
289       94
419       93
120       92
436       91
243       89
144       89
298       89
339       82
242       76
218       76
2001      75
216       72
253       69
372       57
119       57
418       56
431       54
278       45
443       41
2005      36
320        4
Name: start station id, Length: 329, dtype: int64
df['start station id'].value_counts().loc[293]
2920

A: Station id: 293, Number of rides: 2920
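Reading the top row off the `value_counts()` output works because the result is sorted in descending order. A slightly more direct route, sketched here on a small made-up frame (the `rides` data is invented for illustration, not taken from the Citibike file), is to ask the counts Series for its largest entry with `idxmax()` and `max()`.

```python
import pandas as pd

# Hypothetical toy data standing in for the 'start station id' column.
rides = pd.DataFrame({"start station id": [293, 519, 293, 497, 293, 519]})

counts = rides["start station id"].value_counts()

busiest_id = counts.idxmax()   # index label (station id) with the highest count
busiest_rides = counts.max()   # the count itself

print(busiest_id, busiest_rides)  # 293 3
```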

2.7 What percentage of the total riders are of usertype "Subscriber"?

df['usertype'].value_counts()
Subscriber    218019
Customer        6717
Name: usertype, dtype: int64
subs = sum(df['usertype'] == 'Subscriber') / len(df)
print(subs)
0.9701115976078599

A: 97.0112%
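The same share can be obtained in a couple of other ways, sketched below. The first line only re-does the arithmetic from the counts printed above; the rest uses a small invented Series (`usertype` here is toy data, not the real column) to show that the mean of a boolean mask and `value_counts(normalize=True)` give the same fraction.

```python
import pandas as pd

# Arithmetic check using the counts reported by value_counts() above.
pct_from_counts = 218019 / (218019 + 6717) * 100
print(round(pct_from_counts, 4))  # 97.0112

# Hypothetical toy column: three subscribers and one customer.
usertype = pd.Series(
    ["Subscriber", "Subscriber", "Subscriber", "Customer"], name="usertype"
)

# The mean of a boolean mask is the fraction of True values.
share_bool = (usertype == "Subscriber").mean() * 100

# normalize=True returns proportions instead of raw counts.
share_norm = usertype.value_counts(normalize=True)["Subscriber"] * 100

print(share_bool, share_norm)  # 75.0 75.0
```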

What is the average age (in 2014) of the riders in this dataset?

Note, this requires creating a new column and then taking the difference between 2014 and the rider's birth year, then taking the average!

2014 - df['birth year'][0]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-104-1b2a498ce611> in <module>()
----> 1 2014 - df['birth year'][0]

TypeError: unsupported operand type(s) for -: 'int' and 'str'
df['birth year'].head()
0    1991
1    1979
2    1948
3    1981
4    1990
Name: birth year, dtype: object
df['birth year'].astype(int)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-106-ff839553694e> in <module>()
----> 1 df['birth year'].astype(int)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\util\_decorators.py in wrapper(*args, **kwargs)
    116                 else:
    117                     kwargs[new_arg_name] = new_arg_value
--> 118             return func(*args, **kwargs)
    119         return wrapper
    120     return _deprecate_kwarg

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py in astype(self, dtype, copy, errors, **kwargs)
   4002             # else, only a single dtype is given
   4003             new_data = self._data.astype(dtype=dtype, copy=copy, errors=errors,
-> 4004                                          **kwargs)
   4005             return self._constructor(new_data).__finalize__(self)
   4006

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals.py in astype(self, dtype, **kwargs)
   3460
   3461     def astype(self, dtype, **kwargs):
-> 3462         return self.apply('astype', dtype=dtype, **kwargs)
   3463
   3464     def convert(self, **kwargs):

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals.py in apply(self, f, axes, filter, do_integrity_check, consolidate, **kwargs)
   3327
   3328             kwargs['mgr'] = self
-> 3329             applied = getattr(b, f)(**kwargs)
   3330             result_blocks = _extend_blocks(applied, result_blocks)
   3331

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals.py in astype(self, dtype, copy, errors, values, **kwargs)
    542     def astype(self, dtype, copy=False, errors='raise', values=None, **kwargs):
    543         return self._astype(dtype, copy=copy, errors=errors, values=values,
--> 544                             **kwargs)
    545
    546     def _astype(self, dtype, copy=False, errors='raise', values=None,

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals.py in _astype(self, dtype, copy, errors, values, klass, mgr, **kwargs)
    623
    624                 # _astype_nansafe works fine with 1-d only
--> 625                 values = astype_nansafe(values.ravel(), dtype, copy=True)
    626                 values = values.reshape(self.shape)
    627

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\dtypes\cast.py in astype_nansafe(arr, dtype, copy)
    690     elif arr.dtype == np.object_ and np.issubdtype(dtype.type, np.integer):
    691         # work around NumPy brokenness, #1987
--> 692         return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
    693
    694     if dtype.name in ("datetime64", "timedelta64"):

pandas/_libs/lib.pyx in pandas._libs.lib.astype_intsafe()

pandas/_libs/src/util.pxd in util.set_value_at_unsafe()

ValueError: invalid literal for int() with base 10: '\\N'
df['birth year'].astype(str).astype(int)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-107-009a67367ef8> in <module>()
----> 1 df['birth year'].astype(str).astype(int)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\util\_decorators.py in wrapper(*args, **kwargs)
    116                 else:
    117                     kwargs[new_arg_name] = new_arg_value
--> 118             return func(*args, **kwargs)
    119         return wrapper
    120     return _deprecate_kwarg

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py in astype(self, dtype, copy, errors, **kwargs)
   4002             # else, only a single dtype is given
   4003             new_data = self._data.astype(dtype=dtype, copy=copy, errors=errors,
-> 4004                                          **kwargs)
   4005             return self._constructor(new_data).__finalize__(self)
   4006

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals.py in astype(self, dtype, **kwargs)
   3460
   3461     def astype(self, dtype, **kwargs):
-> 3462         return self.apply('astype', dtype=dtype, **kwargs)
   3463
   3464     def convert(self, **kwargs):

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals.py in apply(self, f, axes, filter, do_integrity_check, consolidate, **kwargs)
   3327
   3328             kwargs['mgr'] = self
-> 3329             applied = getattr(b, f)(**kwargs)
   3330             result_blocks = _extend_blocks(applied, result_blocks)
   3331

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals.py in astype(self, dtype, copy, errors, values, **kwargs)
    542     def astype(self, dtype, copy=False, errors='raise', values=None, **kwargs):
    543         return self._astype(dtype, copy=copy, errors=errors, values=values,
--> 544                             **kwargs)
    545
    546     def _astype(self, dtype, copy=False, errors='raise', values=None,

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals.py in _astype(self, dtype, copy, errors, values, klass, mgr, **kwargs)
    623
    624                 # _astype_nansafe works fine with 1-d only
--> 625                 values = astype_nansafe(values.ravel(), dtype, copy=True)
    626                 values = values.reshape(self.shape)
    627

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\dtypes\cast.py in astype_nansafe(arr, dtype, copy)
    690     elif arr.dtype == np.object_ and np.issubdtype(dtype.type, np.integer):
    691         # work around NumPy brokenness, #1987
--> 692         return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
    693
    694     if dtype.name in ("datetime64", "timedelta64"):

pandas/_libs/lib.pyx in pandas._libs.lib.astype_intsafe()

pandas/_libs/src/util.pxd in util.set_value_at_unsafe()

ValueError: invalid literal for int() with base 10: '\\N'
df['birth_year'] = pd.to_numeric(df['birth year'])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
pandas/_libs/src/inference.pyx in pandas._libs.lib.maybe_convert_numeric()

ValueError: Unable to parse string "\N"

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-108-cd37727500f1> in <module>()
----> 1 df['birth_year'] = pd.to_numeric(df['birth year'])

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\tools\numeric.py in to_numeric(arg, errors, downcast)
    131             coerce_numeric = False if errors in ('ignore', 'raise') else True
    132             values = lib.maybe_convert_numeric(values, set(),
--> 133                                                coerce_numeric=coerce_numeric)
    134
    135     except Exception:

pandas/_libs/src/inference.pyx in pandas._libs.lib.maybe_convert_numeric()

ValueError: Unable to parse string "\N" at position 31
df['birth year'].loc[31]
'\\N'
df['birth year'].value_counts()
1985    9305
1984    9139
1983    8779
1981    8208
1986    8109
1988    8048
1982    7968
1979    7771
1980    7661
1987    7472
1978    6876
1989    6843
\N      6717
1977    5957
1974    5848
1970    5736
1990    5697
1976    5579
1969    5455
1972    5028
1973    4974
1971    4923
1975    4884
1967    4650
1968    4476
1964    4249
1963    4229
1966    4005
1965    3644
1962    3641
        ...
1949     725
1946     454
1947     451
1996     334
1944     311
1997     251
1945     214
1942     182
1941     164
1943     130
1940      84
1938      75
1900      68
1939      43
1922      32
1936      31
1937      24
1934      21
1935      14
1901      11
1933      10
1899       9
1932       8
1907       5
1926       5
1910       4
1917       3
1927       2
1921       1
1913       1
Name: birth year, Length: 78, dtype: int64
df.iloc[31]
tripduration                                  664
starttime                     2014-02-01 00:08:47
stoptime                      2014-02-01 00:19:51
start station id                              237
start station name                E 11 St & 2 Ave
start station latitude                    40.7305
start station longitude                  -73.9867
end station id                                349
end station name           Rivington St & Ridge St
end station latitude                      40.7185
end station longitude                    -73.9833
bikeid                                      17540
usertype                                 Customer
birth year                                     \N
gender                                          0
Name: 31, dtype: object
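The `\N` entries explain why the column loaded as strings: the 6,717 `\N` values match the 6,717 Customer rows, so they appear to mark a missing birth year. One option, sketched here as an alternative to the approach taken in this notebook, is to tell `pd.read_csv` to treat that marker as missing at load time via its `na_values` parameter. The CSV text below is a tiny invented sample, not the real file.

```python
import io

import pandas as pd

# Hypothetical two-row sample mimicking the file's use of \N for missing values.
csv_text = (
    "tripduration,usertype,birth year\n"
    "664,Customer,\\N\n"
    "382,Subscriber,1991\n"
)

# na_values lists extra strings that should be parsed as NaN.
trips = pd.read_csv(io.StringIO(csv_text), na_values=["\\N"])

print(trips["birth year"])
print(trips["birth year"].isna().sum())  # 1
```

With the marker parsed as NaN, the column comes back numeric (float, because NaN cannot be stored in a plain integer column), so arithmetic such as `2014 - trips["birth year"]` works without a separate conversion step.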
new_df = df[df['birth year']!='\\N']
new_df['birth year'] = new_df['birth year'].astype(int)
C:\Users\ystrano\AppData\Local\Continuum\anaconda3\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
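The warning appears because `new_df` was created by filtering `df`, and pandas cannot tell whether the later assignment is meant to modify the original frame. A common way to make the intent explicit is to take an independent copy of the filtered rows with `.copy()` before assigning to it. The sketch below uses a small invented frame (`toy` and `valid` are illustrative names) rather than the real data.

```python
import pandas as pd

# Hypothetical toy frame with one missing-value marker.
toy = pd.DataFrame({"birth year": ["1991", "\\N", "1979"]})

# .copy() makes `valid` an independent DataFrame, so assigning to it
# is unambiguous and does not touch `toy`.
valid = toy[toy["birth year"] != "\\N"].copy()
valid["birth year"] = valid["birth year"].astype(int)

print(valid["birth year"].tolist())  # [1991, 1979]
print(toy["birth year"].tolist())    # ['1991', '\\N', '1979']
```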
df['int_birthyear'] = 2014 - new_df['birth year']
df['int_birthyear'].mean()
38.50249290199478
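The result above relies on index alignment: rows filtered out of `new_df` get NaN in `df['int_birthyear']`, and `.mean()` skips NaN by default. A shorter route to the same kind of result, sketched here on invented data, is `pd.to_numeric` with `errors="coerce"`, which turns unparseable strings such as `\N` into NaN in one step. The frame `toy` and the column names `birth_year_num` and `age_2014` are illustrative, not from the notebook.

```python
import pandas as pd

# Hypothetical toy column: three valid birth years and one \N marker.
toy = pd.DataFrame({"birth year": ["1991", "1979", "\\N", "1948"]})

# errors="coerce" converts anything that is not a number into NaN.
toy["birth_year_num"] = pd.to_numeric(toy["birth year"], errors="coerce")

# New column: age in 2014. NaN birth years stay NaN.
toy["age_2014"] = 2014 - toy["birth_year_num"]

# Series.mean() ignores NaN by default, so only the three valid ages count.
mean_age = toy["age_2014"].mean()

print(toy["age_2014"].tolist())  # [23.0, 35.0, nan, 66.0]
print(mean_age)
```

Here the mean is taken over ages 23, 35, and 66, so it equals 124 / 3.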