GitHub Repository: YStrano/DataScience_GA
Path: blob/master/projects/project_1/Project-1 - my notes w' different method process.ipynb

Kernel: Python 3

Project 1

In this first project you will implement a few Python functions and generally familiarize yourself with the pandas library.

Please refer to the numpy-and-pandas.ipynb notebook in lesson2, the for loop section of the python-controlflow.ipynb notebook in the python_foundations folder, and the pandas documentation here for assistance.

I have written the numerical answers you are looking for below - please show me the code you used to generate those answers.

Note! You will need to look within that documentation or use other search results on the internet to complete this assignment!

Question 1: Multiples of Three and Five

If we list all of the natural numbers below 10 that are multiples of 3 or 5, we get 3, 5, 6, and 9. The sum of these multiples is 23. Find the sum of all the multiples of 3 or 5 below 1,000.

In [10]:

mylist = []
for x in range(1, 1000): # range(start, stop) yields start up to, but not including, stop
    if x % 3 == 0:
        mylist.append(x)
    elif x % 5 == 0: # only reached when x is not a multiple of 3
        mylist.append(x)
print(mylist[:5])

Out[10]:

[3, 5, 6, 9, 10]

In [11]:

sum(mylist)

Out[11]:

233168

Answer: 233,168

Note: you may find yourself with the answer 266,333! Think carefully about what is going on with this question and what may be driving the difference between your answer and the correct value! A hint can be found in the control flow notebook.

Using elif means a number that is a multiple of both 3 and 5 (such as 15) is appended only once, so it is not counted twice in the sum.
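
To see where the 266,333 figure comes from, here is a minimal sketch (the function names are illustrative and not part of the assignment). It compares a single `or` test with adding the two sets of multiples separately. The separate sums count every multiple of 15 twice, and subtracting those once recovers the correct total.

```python
def sum_multiples(limit):
    """Sum the natural numbers below `limit` divisible by 3 or 5, each counted once."""
    return sum(x for x in range(1, limit) if x % 3 == 0 or x % 5 == 0)


def sum_with_double_counting(limit):
    """Add multiples of 3 and multiples of 5 separately, so multiples of 15 appear twice."""
    return sum(range(3, limit, 3)) + sum(range(5, limit, 5))


correct = sum_multiples(1000)
double_counted = sum_with_double_counting(1000)
overlap = sum(range(15, 1000, 15))  # numbers that are multiples of both 3 and 5

print(correct)                   # 233168
print(double_counted)            # 266333
print(double_counted - overlap)  # 233168
```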

Question 2: Pandas Intro

In [12]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

2.1 Load the Citibike-Feb-24 dataset into memory and assign it to the variable "df"

The data are in /data/citibike_feb2014.csv

Use pd.read_csv function. Please refer to the documentation here if you are having trouble.

In [13]:

df = pd.read_csv('data/citibike_feb2014.csv')

2.2 How many rows and how many columns are there in the dataset?

In [16]:

x = len(df)
y = len(list(df.columns))
z = len(df.columns)

print(x)
print(y)
print(z)

print(df.shape)

Out[16]:

224736
15
15
(224736, 15)

A: 224,736 rows, 15 columns
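
Since `df.shape` is a `(rows, columns)` tuple, it can be unpacked directly instead of calling `len` twice. The sketch below uses a small made-up frame in place of the Citi Bike data, so the values are purely illustrative.

```python
import pandas as pd

# Toy frame standing in for the Citi Bike data (values are made up).
toy = pd.DataFrame({
    "tripduration": [382, 372, 591],
    "usertype": ["Subscriber", "Subscriber", "Customer"],
})

n_rows, n_cols = toy.shape  # shape is a (rows, columns) tuple
print(f"{n_rows:,} rows, {n_cols} columns")  # 3 rows, 2 columns
```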

2.3 Please print out the first five rows of the dataset

In [17]:

df.head(5)

Out[17]:

2.4 What is the average trip duration? (In seconds)

In [18]:

df['tripduration'].mean()

Out[18]:

874.5198099102947

In [19]:

df['tripduration'].sum() / len(df)

Out[19]:

874.5198099102947

In [20]:

df['starttime_dt'] = df['starttime'].apply(pd.to_datetime)
df['stoptime_dt'] = df['stoptime'].apply(pd.to_datetime)

In [21]:

df['triptime'] = df['stoptime_dt'] - df['starttime_dt']

In [23]:

df['triptime'].head()

Out[23]:

 00:06:22
 00:06:12
 00:09:51
 00:09:43
 00:03:43
Name: triptime, dtype: timedelta64[ns]

In [24]:

df['triptime_seconds'] = df['triptime'].apply(lambda x: x.total_seconds())

In [25]:

df['triptime_seconds'].mean()

Out[25]:

874.5198099102947

A: 874.5198 seconds
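
The datetime route above can also be written without `apply`: `pd.to_datetime` accepts a whole column, and the `.dt` accessor exposes `total_seconds()` on a timedelta column. This is a sketch on two made-up trips, not the real data.

```python
import pandas as pd

# Two made-up trips in the same timestamp format as the dataset.
toy = pd.DataFrame({
    "starttime": ["2014-02-01 00:00:00", "2014-02-01 00:00:03"],
    "stoptime": ["2014-02-01 00:06:22", "2014-02-01 00:06:15"],
})

start = pd.to_datetime(toy["starttime"])  # converts the whole column at once
stop = pd.to_datetime(toy["stoptime"])
toy["triptime_seconds"] = (stop - start).dt.total_seconds()

print(toy["triptime_seconds"].tolist())  # [382.0, 372.0]
print(toy["triptime_seconds"].mean())    # 377.0
```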

2.5 What is the total trip duration in this entire dataset in hours?

In [26]:

382 / (6 + 22/60)

Out[26]:

60.00000000000001

In [27]:

total_secs = df['tripduration'].sum()
hours = total_secs / 60 / 60
print(hours)

Out[27]:

54593.35666666667

In [28]:

hours = df['triptime_seconds'].sum() / 3600
print(hours)

Out[28]:

54593.35666666667

In [20]:

df['triptime_hours'] = df['triptime_seconds']/3600
df['triptime_hours'].sum()

Out[20]:

54593.356666666674

A: 54593.3567 hours
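
Another way to get hours, sketched here on made-up durations, is to sum the timedelta column itself and divide by a one-hour `pd.Timedelta`. Dividing one timedelta by another gives a plain float, so no manual 3600 factor is needed.

```python
import pandas as pd

# Made-up durations: 30 minutes and 90 minutes, given in seconds.
durations = pd.Series(pd.to_timedelta([1800, 5400], unit="s"))

total = durations.sum()                # a single Timedelta
hours = total / pd.Timedelta(hours=1)  # timedelta / timedelta -> float
print(hours)  # 2.0
```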

2.6 What is the most popular start station? And how many rides started at that station in Feb 2014?

Note, the pandas cookbook may come in handy for this (look at chapter 1 & 2): https://pandas.pydata.org/pandas-docs/stable/tutorials.html

In [26]:

list(df.columns)

Out[26]:

['tripduration',
 'starttime',
 'stoptime',
 'start station id',
 'start station name',
 'start station latitude',
 'start station longitude',
 'end station id',
 'end station name',
 'end station latitude',
 'end station longitude',
 'bikeid',
 'usertype',
 'birth year',
 'gender',
 'newstarttime',
 'newstoptime',
 'triptime',
 'triptime_seconds',
 'triptime_hours']

In [100]:

df['start station name'].value_counts()

Out[100]:

Lafayette St & E 8 St              2920
Pershing Square N                  2719
E 17 St & Broadway                 2493
W 21 St & 6 Ave                    2403
8 Ave & W 31 St                    2171
8 Ave & W 33 St                    1944
W 41 St & 8 Ave                    1916
University Pl & E 14 St            1808
Cleveland Pl & Spring St           1796
Broadway & E 14 St                 1795
Broadway & E 22 St                 1694
E 43 St & Vanderbilt Ave           1688
W 33 St & 7 Ave                    1570
Broadway & W 24 St                 1562
W 31 St & 7 Ave                    1538
Christopher St & Greenwich St      1500
Greenwich Ave & 8 Ave              1500
Great Jones St                     1471
W 13 St & 6 Ave                    1392
W 27 St & 7 Ave                    1391
2 Ave & E 31 St                    1371
Lexington Ave & E 24 St            1363
W 17 St & 8 Ave                    1361
W 18 St & 6 Ave                    1359
E 32 St & Park Ave                 1352
W 38 St & 8 Ave                    1337
West St & Chambers St              1324
1 Ave & E 15 St                    1319
Lawrence St & Willoughby St        1298
6 Ave & W 33 St                    1290
                                   ... 
Grand St & Havemeyer St             136
Clinton Ave & Flushing Ave          134
Macon St & Nostrand Ave             134
Fulton St & Clermont Ave            130
S Portland Ave & Hanson Pl          128
Broadway & Berry St                 126
Clermont Ave & Park Ave             121
Cadman Plaza E & Tillary St         110
Myrtle Ave & St Edwards St          100
Monroe St & Classon Ave              94
Carlton Ave & Park Ave               93
Lexington Ave & Classon Ave          92
Hancock St & Bedford Ave             91
Nassau St & Navy St                  89
Fulton St & Rockwell Pl              89
3 Ave & Schermerhorn St              89
Avenue D & E 12 St                   82
Flushing Ave & Carlton Ave           76
Gallatin Pl & Livingston St          76
7 Ave & Farragut St                  75
Columbia Heights & Cranberry St      72
W 13 St & 5 Ave                      69
Franklin Ave & Myrtle Ave            57
Park Ave & St Edwards St             57
Front St & Gold St                   56
Hanover Pl & Livingston St           54
Concord St & Bridge St               45
Bedford Ave & S 9th St               41
Railroad Ave & Kay Ave               36
Church St & Leonard St                4
Name: start station name, Length: 329, dtype: int64

In [103]:

df['start station id'].value_counts()

Out[103]:

   2920
   2719
   2493
   2403
   2171
   1944
   1916
   1808
   1796
   1795
   1694
   1688
   1570
   1562
   1538
   1500
   1500
   1471
   1392
   1391
   1371
   1363
   1361
   1359
   1352
   1337
   1324
   1319
   1298
   1290
        ... 
    136
    134
    134
    130
    128
    126
    121
    110
    100
     94
     93
     92
     91
     89
     89
     89
     82
     76
     76
    75
     72
     69
     57
     57
     56
     54
     45
     41
    36
      4
Name: start station id, Length: 329, dtype: int64

In [102]:

df['start station id'].value_counts().loc[293]

Out[102]:

2920

In [34]:

import datetime

In [39]:

d_ex = datetime.datetime.strptime('2014-02-01 00:00:00', "%Y-%m-%d %H:%M:%S")

In [40]:

d_ex

Out[40]:

datetime.datetime(2014, 2, 1, 0, 0)

In [42]:

d_ex.month

Out[42]:

2

In [43]:

df['starttime_dt'][0].month

Out[43]:

2

In [45]:

df['month_start'] = df['starttime_dt'].apply(lambda x: x.month)

In [46]:

df['month_start'].value_counts()

Out[46]:

2    224736
Name: month_start, dtype: int64

All of the rides start in February, so no filtering by month is needed.
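
The month check can also be done with the `.dt` accessor instead of `apply` with a lambda. A small sketch on made-up timestamps:

```python
import pandas as pd

# Made-up start times, all in February 2014.
starts = pd.to_datetime(pd.Series([
    "2014-02-01 00:00:00",
    "2014-02-14 08:30:00",
    "2014-02-28 23:59:59",
]))

months = starts.dt.month  # month number for every row
print(months.unique().tolist())  # [2]
```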

In [47]:

df2 = df[['start station id','start station name']]

In [56]:

time1 = time.time()
dict2 = {}
for i in range(len(df2)): # loop over row positions; iterating over a DataFrame directly yields its column labels, not its rows
    row = df2.iloc[i] # grab row i of the df
    k = row['start station id']
    if k in dict2: # a membership test on a dict checks its keys
        dict2[k] += 1 # key already exists: increment its count
    else:
        dict2[k] = 1 # new key: add it with a starting count of 1
time2 = time.time()
total_time = time2 - time1
print ("time in seconds: %s" % str(total_time))

Out[56]:

time in seconds: 35.311187982559204

In [97]:

time3 = time.time()
dict3 = {}
for u_id in df2['start station id'].unique():
    dict3[u_id] = len(df2[df2['start station id']==u_id])
time4 = time.time()
tot_time = time4 - time3
print('Time in seconds: %s' % str(tot_time))

Out[97]:

Time in seconds: 0.304030179977417

In [55]:

time5 = time.time()
print(df2['start station id'].value_counts().index[0])
time6 = time.time()
tot_time = time6 - time5
print('Time in seconds: %s' % str(tot_time))

Out[55]:

293
Time in seconds: 0.0047607421875

In [52]:

import time

In [95]:

time.time()

Out[95]:

1524245702.8950233

In [59]:

time5 = time.time()
dict4 = df2['start station id'].value_counts().to_dict()
time6 = time.time()
tot_time = time6 - time5
print('Time in seconds: %s' % str(tot_time))

Out[59]:

Time in seconds: 0.00567173957824707
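
The same count-per-key pattern exists in the standard library as `collections.Counter`, and `time.perf_counter()` is the clock the Python documentation suggests for measuring short durations. The sketch below uses a short made-up list of station ids, so the timings are not comparable with the ones above.

```python
import time
from collections import Counter

station_ids = [293, 519, 293, 497, 293, 519]  # made-up ids

# Manual dictionary counting, as in the row-by-row loop.
start = time.perf_counter()
manual = {}
for sid in station_ids:
    manual[sid] = manual.get(sid, 0) + 1
manual_elapsed = time.perf_counter() - start

# Counter does the same bookkeeping in one call.
start = time.perf_counter()
counted = Counter(station_ids)
counter_elapsed = time.perf_counter() - start

print(counted.most_common(1))  # [(293, 3)]
```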

In [60]:

import pandas as pd

In [61]:

pd.Series(dict2)

Out[61]:

     676
     632
     296
     301
   1361
     57
     92
   1019
   1163
    508
    184
     89
    530
    935
    549
   1796
    388
   1051
    316
    633
   1201
    650
    958
   1359
    963
    795
    713
   1010
     72
    151
        ... 
   1020
   1371
   1007
    676
    410
    354
    388
    961
   1363
    569
    352
   1288
   1037
    960
   339
    75
   222
  1093
   580
    36
   429
   407
   439
   560
   706
   359
   720
   427
   330
   781
Length: 329, dtype: int64

In [62]:

my_ser = pd.Series(dict2)

In [63]:

my_ser.sort_values(ascending = False, inplace = True)

In [65]:

my_ser.iloc[0]

Out[65]:

2920

In [87]:

my_ser.index

Out[87]:

Int64Index([ 293,  519,  497,  435,  521,  490,  477,  382,  151,  285,
            ...
             216,  253,  119,  372,  418,  431,  278,  443, 2005,  320],
           dtype='int64', length=329)

In [88]:

my_ser.index[0]

Out[88]:

293

In [69]:

row = df2.iloc[0]

In [70]:

row

Out[70]:

start station id                      294
start station name    Washington Square E
Name: 0, dtype: object

In [68]:

row['start station id']

Out[68]:

294

In [59]:

df['start station id'].to_dict() # Series.to_dict() maps each index label to its value

Out[59]:

{0: 294,
 ...}

A: Station id 293 (Lafayette St & E 8 St), number of rides: 2,920
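
The id and the ride count can be read straight off `value_counts()` with `idxmax()` and `max()`, and grouping on id and name together returns the station name in the same step. This is a sketch on made-up stations ("Station A" and so on are placeholders, not real station names).

```python
import pandas as pd

# Made-up rides; station names are placeholders.
toy = pd.DataFrame({
    "start station id": [293, 519, 293, 497, 293, 519],
    "start station name": ["Station A", "Station B", "Station A",
                           "Station C", "Station A", "Station B"],
})

counts = toy["start station id"].value_counts()
top_id = counts.idxmax()   # index label with the largest count
top_rides = counts.max()   # the largest count itself
print(top_id, top_rides)   # 293 3

# Group on id and name together to get both in one result.
by_station = toy.groupby(["start station id", "start station name"]).size()
top_key = by_station.idxmax()  # an (id, name) tuple
print(top_key)
```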

2.7 What percentage of the total riders are of usertype "Subscriber"?

In [77]:

df['usertype'].value_counts()

Out[77]:

Subscriber    218019
Customer        6717
Name: usertype, dtype: int64

In [78]:

sum(df['usertype'] == 'Subscriber')

Out[78]:

218019

In [79]:

subs = sum(df['usertype'] == 'Subscriber') / len(df)
print(subs)

Out[79]:

0.9701115976078599

In [85]:

subtype_df = df['usertype'].value_counts().rename('count').to_frame() #this takes a value_counts series, renames it and turns it into a dataframe

In [86]:

subtype_df

Out[86]:

In [87]:

subtype_df['ratio'] = subtype_df['count'] / len(df2) # divide the count column, not the whole frame

In [88]:

subtype_df['ratio'] = subtype_df['count'] / subtype_df['count'].sum() # same ratio, using the column total as the denominator

In [89]:

subtype_df

Out[89]:

A: 97.0112%
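
`value_counts(normalize=True)` returns the shares directly, and the mean of a boolean comparison gives the same fraction. A sketch on four made-up riders:

```python
import pandas as pd

# Made-up user types: three subscribers and one customer.
usertype = pd.Series(["Subscriber", "Subscriber", "Customer", "Subscriber"])

shares = usertype.value_counts(normalize=True)  # fractions that sum to 1
pct_subscriber = shares["Subscriber"] * 100
print(pct_subscriber)  # 75.0

# The mean of a boolean Series is the fraction of True values.
print((usertype == "Subscriber").mean() * 100)  # 75.0
```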

2.8 What is the average age (in 2014) of the riders in this dataset?

Note, this requires creating a new column and then taking the difference between 2014 and the rider's birth year, then taking the average!

In [137]:

2014 - df['birth year'][0]

Out[137]:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-137-1b2a498ce611> in <module>()
----> 1 2014 - df['birth year'][0]

TypeError: unsupported operand type(s) for -: 'int' and 'str'

In [146]:

df['birth year'].head()

Out[146]:

  1991
  1979
  1948
  1981
  1990
Name: birth year, dtype: object

In [144]:

df['birth year'].astype(int)

Out[144]:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-144-ff839553694e> in <module>()
----> 1 df['birth year'].astype(int)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\util\_decorators.py in wrapper(*args, **kwargs)
    116                 else:
    117                     kwargs[new_arg_name] = new_arg_value
--> 118             return func(*args, **kwargs)
    119         return wrapper
    120     return _deprecate_kwarg
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py in astype(self, dtype, copy, errors, **kwargs)
   4002         # else, only a single dtype is given
   4003         new_data = self._data.astype(dtype=dtype, copy=copy, errors=errors,
-> 4004                                      **kwargs)
   4005         return self._constructor(new_data).__finalize__(self)
   4006 
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals.py in astype(self, dtype, **kwargs)
   3460 
   3461     def astype(self, dtype, **kwargs):
-> 3462         return self.apply('astype', dtype=dtype, **kwargs)
   3463 
   3464     def convert(self, **kwargs):
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals.py in apply(self, f, axes, filter, do_integrity_check, consolidate, **kwargs)
   3327 
   3328             kwargs['mgr'] = self
-> 3329             applied = getattr(b, f)(**kwargs)
   3330             result_blocks = _extend_blocks(applied, result_blocks)
   3331 
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals.py in astype(self, dtype, copy, errors, values, **kwargs)
    542     def astype(self, dtype, copy=False, errors='raise', values=None, **kwargs):
    543         return self._astype(dtype, copy=copy, errors=errors, values=values,
--> 544                             **kwargs)
    545 
    546     def _astype(self, dtype, copy=False, errors='raise', values=None,
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals.py in _astype(self, dtype, copy, errors, values, klass, mgr, **kwargs)
    623 
    624                 # _astype_nansafe works fine with 1-d only
--> 625                 values = astype_nansafe(values.ravel(), dtype, copy=True)
    626                 values = values.reshape(self.shape)
    627 
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\dtypes\cast.py in astype_nansafe(arr, dtype, copy)
    690     elif arr.dtype == np.object_ and np.issubdtype(dtype.type, np.integer):
    691         # work around NumPy brokenness, #1987
--> 692         return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
    693 
    694     if dtype.name in ("datetime64", "timedelta64"):
pandas/_libs/lib.pyx in pandas._libs.lib.astype_intsafe()
pandas/_libs/src/util.pxd in util.set_value_at_unsafe()
ValueError: invalid literal for int() with base 10: '\\N'

In [126]:

df['birth year'].astype(str).astype(int)

Out[126]:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-126-009a67367ef8> in <module>()
----> 1 df['birth year'].astype(str).astype(int)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\util\_decorators.py in wrapper(*args, **kwargs)
    116                 else:
    117                     kwargs[new_arg_name] = new_arg_value
--> 118             return func(*args, **kwargs)
    119         return wrapper
    120     return _deprecate_kwarg
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py in astype(self, dtype, copy, errors, **kwargs)
   4002         # else, only a single dtype is given
   4003         new_data = self._data.astype(dtype=dtype, copy=copy, errors=errors,
-> 4004                                      **kwargs)
   4005         return self._constructor(new_data).__finalize__(self)
   4006 
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals.py in astype(self, dtype, **kwargs)
   3460 
   3461     def astype(self, dtype, **kwargs):
-> 3462         return self.apply('astype', dtype=dtype, **kwargs)
   3463 
   3464     def convert(self, **kwargs):
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals.py in apply(self, f, axes, filter, do_integrity_check, consolidate, **kwargs)
   3327 
   3328             kwargs['mgr'] = self
-> 3329             applied = getattr(b, f)(**kwargs)
   3330             result_blocks = _extend_blocks(applied, result_blocks)
   3331 
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals.py in astype(self, dtype, copy, errors, values, **kwargs)
    542     def astype(self, dtype, copy=False, errors='raise', values=None, **kwargs):
    543         return self._astype(dtype, copy=copy, errors=errors, values=values,
--> 544                             **kwargs)
    545 
    546     def _astype(self, dtype, copy=False, errors='raise', values=None,
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals.py in _astype(self, dtype, copy, errors, values, klass, mgr, **kwargs)
    623 
    624                 # _astype_nansafe works fine with 1-d only
--> 625                 values = astype_nansafe(values.ravel(), dtype, copy=True)
    626                 values = values.reshape(self.shape)
    627 
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\dtypes\cast.py in astype_nansafe(arr, dtype, copy)
    690     elif arr.dtype == np.object_ and np.issubdtype(dtype.type, np.integer):
    691         # work around NumPy brokenness, #1987
--> 692         return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
    693 
    694     if dtype.name in ("datetime64", "timedelta64"):
pandas/_libs/lib.pyx in pandas._libs.lib.astype_intsafe()
pandas/_libs/src/util.pxd in util.set_value_at_unsafe()
ValueError: invalid literal for int() with base 10: '\\N'

In [100]:

df['birth_year'] = pd.to_numeric(df['birth year'])

Out[100]:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
pandas/_libs/src/inference.pyx in pandas._libs.lib.maybe_convert_numeric()
ValueError: Unable to parse string "\N"

During handling of the above exception, another exception occurred:
ValueError                                Traceback (most recent call last)
<ipython-input-100-cd37727500f1> in <module>()
----> 1 df['birth_year'] = pd.to_numeric(df['birth year'])

/anaconda3/lib/python3.6/site-packages/pandas/core/tools/numeric.py in to_numeric(arg, errors, downcast)
    131             coerce_numeric = False if errors in ('ignore', 'raise') else True
    132             values = lib.maybe_convert_numeric(values, set(),
--> 133                                                coerce_numeric=coerce_numeric)
    134 
    135     except Exception:
pandas/_libs/src/inference.pyx in pandas._libs.lib.maybe_convert_numeric()
ValueError: Unable to parse string "\N" at position 31
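
`pd.to_numeric` raises here because its default is `errors='raise'`. Passing `errors='coerce'` turns any value that cannot be parsed into NaN instead. A sketch on a made-up column that contains the same `\N` placeholder:

```python
import pandas as pd

# Made-up birth years stored as strings, with one \N placeholder.
birth_year = pd.Series(["1991", "1979", "\\N", "1990"])

numeric = pd.to_numeric(birth_year, errors="coerce")  # unparseable -> NaN
age = 2014 - numeric

print(numeric.isna().sum())  # 1
print(age.mean())            # mean() skips the NaN row
```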

In [131]:

df['birth year'].loc[31]

Out[131]:

'\\N'
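
The output above shows two backslashes because the notebook displays the string's `repr`. The value itself is only two characters: a backslash and an `N`. A short sketch of the difference between that placeholder and a newline:

```python
literal = "\\N"  # escaped backslash followed by N
raw = r"\N"      # raw string with the same two characters
newline = "\n"   # a single newline character

print(len(literal), len(raw), len(newline))  # 2 2 1
print(literal == raw)                        # True
print(repr(literal))                         # '\\N'
print(literal)                               # \N
```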

In [142]:

df['birth year'].value_counts()

Out[142]:

  9305
  9139
  8779
  8208
  8109
  8048
  7968
  7771
  7661
  7472
  6876
  6843
\N      6717
  5957
  5848
  5736
  5697
  5579
  5455
  5028
  4974
  4923
  4884
  4650
  4476
  4249
  4229
  4005
  3644
  3641
        ... 
   725
   454
   451
   334
   311
   251
   214
   182
   164
   130
    84
    75
    68
    43
    32
    31
    24
    21
    14
    11
    10
     9
     8
     5
     5
     4
     3
     2
     1
     1
Name: birth year, Length: 78, dtype: int64

In [97]:

df[df['birth year'] == '\\N']

Out[97]:

In [99]:

x = 2014
df['diff_year'] = x - df['birth_year']
df['diff_year'].mean()

Out[99]:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/anaconda3/lib/python3.6/site-packages/pandas/core/ops.py in na_op(x, y)
    675         try:
--> 676             result = expressions.evaluate(op, str_rep, x, y, **eval_kwargs)
    677         except TypeError:
/anaconda3/lib/python3.6/site-packages/pandas/core/computation/expressions.py in evaluate(op, op_str, a, b, use_numexpr, **eval_kwargs)
    203     if use_numexpr:
--> 204         return _evaluate(op, op_str, a, b, **eval_kwargs)
    205     return _evaluate_standard(op, op_str, a, b)
/anaconda3/lib/python3.6/site-packages/pandas/core/computation/expressions.py in _evaluate_numexpr(op, op_str, a, b, truediv, reversed, **eval_kwargs)
    118     if result is None:
--> 119         result = _evaluate_standard(op, op_str, a, b)
    120 
/anaconda3/lib/python3.6/site-packages/pandas/core/computation/expressions.py in _evaluate_standard(op, op_str, a, b, **eval_kwargs)
     63     with np.errstate(all='ignore'):
---> 64         return op(a, b)
     65 
/anaconda3/lib/python3.6/site-packages/pandas/core/ops.py in <lambda>(x, y)
     98                           default_axis=default_axis, reversed=True),
---> 99         rsub=arith_method(lambda x, y: y - x, names('rsub'), op('-'),
    100                           default_axis=default_axis, reversed=True),
TypeError: unsupported operand type(s) for -: 'int' and 'str'

During handling of the above exception, another exception occurred:
TypeError                                 Traceback (most recent call last)
/anaconda3/lib/python3.6/site-packages/pandas/core/ops.py in safe_na_op(lvalues, rvalues)
    699             with np.errstate(all='ignore'):
--> 700                 return na_op(lvalues, rvalues)
    701         except Exception:
/anaconda3/lib/python3.6/site-packages/pandas/core/ops.py in na_op(x, y)
    685                 mask = notna(x)
--> 686                 result[mask] = op(x[mask], y)
    687             else:
/anaconda3/lib/python3.6/site-packages/pandas/core/ops.py in <lambda>(x, y)
     98                           default_axis=default_axis, reversed=True),
---> 99         rsub=arith_method(lambda x, y: y - x, names('rsub'), op('-'),
    100                           default_axis=default_axis, reversed=True),
TypeError: unsupported operand type(s) for -: 'int' and 'str'

During handling of the above exception, another exception occurred:
TypeError                                 Traceback (most recent call last)
<ipython-input-99-008ed3355f8f> in <module>()
      1 x = 2014
----> 2 df['diff_year'] = x - df['birth year']
      3 df['diff_year'].mean()
/anaconda3/lib/python3.6/site-packages/pandas/core/ops.py in wrapper(left, right, name, na_op)
    737                 lvalues = lvalues.values
    738 
--> 739         result = wrap_results(safe_na_op(lvalues, rvalues))
    740         return construct_result(
    741             left,
/anaconda3/lib/python3.6/site-packages/pandas/core/ops.py in safe_na_op(lvalues, rvalues)
    708                 if is_object_dtype(lvalues):
    709                     return libalgos.arrmap_object(lvalues,
--> 710                                                   lambda x: op(x, rvalues))
    711             raise
    712 
pandas/_libs/algos_common_helper.pxi in pandas._libs.algos.arrmap_object()
/anaconda3/lib/python3.6/site-packages/pandas/core/ops.py in <lambda>(x)
    708                 if is_object_dtype(lvalues):
    709                     return libalgos.arrmap_object(lvalues,
--> 710                                                   lambda x: op(x, rvalues))
    711             raise
    712 
/anaconda3/lib/python3.6/site-packages/pandas/core/ops.py in <lambda>(x, y)
     97         rmul=arith_method(operator.mul, names('rmul'), op('*'),
     98                           default_axis=default_axis, reversed=True),
---> 99         rsub=arith_method(lambda x, y: y - x, names('rsub'), op('-'),
    100                           default_axis=default_axis, reversed=True),
    101         rtruediv=arith_method(lambda x, y: operator.truediv(y, x),
TypeError: unsupported operand type(s) for -: 'int' and 'str'

In [151]:

df.iloc[31]

Out[151]:

tripduration                                   664
starttime                      2014-02-01 00:08:47
stoptime                       2014-02-01 00:19:51
start station id                               237
start station name                 E 11 St & 2 Ave
start station latitude                     40.7305
start station longitude                   -73.9867
end station id                                 349
end station name           Rivington St & Ridge St
end station latitude                       40.7185
end station longitude                     -73.9833
bikeid                                       17540
usertype                                  Customer
birth year                                      \N
gender                                           0
newstarttime                   2014-02-01 00:08:47
newstoptime                    2014-02-01 00:19:51
triptime                           0 days 00:11:04
triptime_seconds                               664
triptime_hours                            0.184444
Name: 31, dtype: object

In [158]:

df[df['birth year']=='\\N'] # '\\N' is a literal backslash followed by N; the backslash must be escaped because a bare \N in a string literal starts a named Unicode escape

Out[158]:

In [157]:

new_df = df[df['birth year']!='\\N']

In [163]:

#new_df['birth year'] = new_df['birth year'].astype(int)
new_df['birth year'] = new_df['birth year'].apply(lambda x: int(x))

Out[163]:

C:\Users\ystrano\AppData\Local\Continuum\anaconda3\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
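
One common way to avoid this warning is to take an explicit `.copy()` of the filtered rows before assigning to them, so pandas knows the new frame is independent of the original. A sketch on made-up data (the name `known` is illustrative):

```python
import pandas as pd

# Made-up column with one \N placeholder.
toy = pd.DataFrame({"birth year": ["1991", "\\N", "1979"]})

# Explicit copy: later assignments cannot be mistaken for writes into toy.
known = toy[toy["birth year"] != "\\N"].copy()
known["birth year"] = known["birth year"].astype(int)
known["age"] = 2014 - known["birth year"]

print(known["age"].tolist())  # [23, 35]
```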
  

In [167]:

df['int_birthyear'] = 2014 - new_df['birth year'] # arithmetic between a scalar and a Series is broadcast to every row and runs in compiled code rather than a Python loop; rows missing from new_df are aligned by index and become NaN

In [168]:

df['int_birthyear'].mean()

Out[168]:

38.50249290199478

In [103]:

df["birth_year"] = df['birth year'].replace('\\N', np.nan)

The extra backslash escapes the backslash in the \N placeholder. The placeholder is replaced with "not a number" (NaN) rather than 0 because a 0 would be included in the mean and distort the result, whereas NaN values are skipped by mean().

In [104]:

df.loc[~df['birth_year'].isnull(), 'birth_year'] = df.loc[~df['birth_year'].isnull(), 'birth_year'].astype(int)

In [105]:

df['age_approx'] = 2014 - df['birth_year']

In [107]:

df['age_approx'].mean()

Out[107]:

38.50249290199478
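
The placeholder can also be handled at load time: `pd.read_csv` accepts an `na_values` argument, so `\N` is read as NaN and the column arrives numeric. The sketch below reads a made-up three-row CSV from a string rather than the real file.

```python
import io

import pandas as pd

# Made-up CSV text with the same \N placeholder for a missing birth year.
csv_text = "usertype,birth year\nSubscriber,1991\nCustomer,\\N\nSubscriber,1979\n"

toy = pd.read_csv(io.StringIO(csv_text), na_values=["\\N"])  # \N -> NaN
toy["age_approx"] = 2014 - toy["birth year"]

print(toy["birth year"].isna().sum())  # 1
print(toy["age_approx"].mean())        # 29.0
```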
