Book a Demo!
CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutPoliciesSign UpSign In
YStrano
GitHub Repository: YStrano/DataScience_GA
Path: blob/master/projects/project_1/Project-1 - my notes w' different method process.ipynb
1904 views
Kernel: Python 3

Project 1

In this first project you will implement a few python functions and generally familiarize yourself with the pandas library.

Please refer to numpy-and-pandas.ipynb noteook in lesson2 and the for loop section of the python-controlflow.ipynb notebook in the python_foundations folder. and the pandas documentation here for assitance.

I have written the numerical answers you are looking for below - please show me the code you used to generate those answers.

Note! You will need to look within that documentation/ use other search results on the internet to complete this assignment!

Question 1: Multiples of Three and Five

If we list all of the natural numbers below 10 that are multiples of 3 or 5, we get 3, 5, 6, and 9. The sum of these multiples is 23. Find the sum of all the multiples of 3 and 5 below 1,000.

mylist = [] for x in (range(1, 1000)): #range is a function, so you can set start and end by passing different parameters and separating them with a comma. if x %3 == 0: mylist.append(x) elif x %5 == 0: mylist.append(x) print(mylist[:5])
[3, 5, 6, 9, 10]
sum(mylist)
233168

Answer: 233,168

Note: you may find yourself with the answer 266,333! Think carefully what is going on with this question and what may be driving the difference between your answer and the correct value! A hint can be found in the control flow notebook.

elif solves the issue of counting a number that is a multiple of both 3 and 5 twice.

Question 2: Pandas Intro

import pandas as pd import numpy as np import matplotlib.pyplot as plt

2.1 Load the Citibike-Feb-24 dataset into memory and assign it to the variable "df"

The data are in /data/citibike_feb2014.csv

Use pd.read_csv function. Please refer to the documentation here if you are having trouble.

df = pd.read_csv('data/citibike_feb2014.csv')

2.2 How many rows and how many columns are there in the dataset?

x = len(df) y = len(list(df.columns)) z = len(df.columns) print(x) print(y) print(z) print(df.shape)
224736 15 15 (224736, 15)

A: 224,736 rows, 15 columns

2.3 Please print out the first five rows of the dataset

df.head(5)

2.4 What is the average trip duration? (In seconds)

df['tripduration'].mean()
874.5198099102947
df['tripduration'].sum() / len(df)
874.5198099102947
df['starttime_dt'] = df['starttime'].apply(pd.to_datetime) df['stoptime_dt'] = df['stoptime'].apply(pd.to_datetime)
df['triptime'] = df['stoptime_dt'] - df['starttime_dt']
df['triptime'].head()
0 00:06:22 1 00:06:12 2 00:09:51 3 00:09:43 4 00:03:43 Name: triptime, dtype: timedelta64[ns]
df['triptime_seconds'] = df['triptime'].apply(lambda x: x.total_seconds())
df['triptime_seconds'].mean()
874.5198099102947

A: 874.5198 seconds

2.5 What is the total trip duration in this entire dataset in hours?

382 / (6 + 22/60)
60.00000000000001
total_secs = df['tripduration'].sum() hours = total_secs / 60 / 60 print(hours)
54593.35666666667
hours = df['triptime_seconds'].sum() / 3600 print(hours)
54593.35666666667
df['triptime_hours'] = df['triptime_seconds']/3600 df['triptime_hours'].sum()
54593.356666666674

A: 54593.3567 hours

Note, the pandas cookbook may come in handy for this (look at chapter 1 & 2): https://pandas.pydata.org/pandas-docs/stable/tutorials.html

list(df.columns)
['tripduration', 'starttime', 'stoptime', 'start station id', 'start station name', 'start station latitude', 'start station longitude', 'end station id', 'end station name', 'end station latitude', 'end station longitude', 'bikeid', 'usertype', 'birth year', 'gender', 'newstarttime', 'newstoptime', 'triptime', 'triptime_seconds', 'triptime_hours']
df['start station name'].value_counts()
Lafayette St & E 8 St 2920 Pershing Square N 2719 E 17 St & Broadway 2493 W 21 St & 6 Ave 2403 8 Ave & W 31 St 2171 8 Ave & W 33 St 1944 W 41 St & 8 Ave 1916 University Pl & E 14 St 1808 Cleveland Pl & Spring St 1796 Broadway & E 14 St 1795 Broadway & E 22 St 1694 E 43 St & Vanderbilt Ave 1688 W 33 St & 7 Ave 1570 Broadway & W 24 St 1562 W 31 St & 7 Ave 1538 Christopher St & Greenwich St 1500 Greenwich Ave & 8 Ave 1500 Great Jones St 1471 W 13 St & 6 Ave 1392 W 27 St & 7 Ave 1391 2 Ave & E 31 St 1371 Lexington Ave & E 24 St 1363 W 17 St & 8 Ave 1361 W 18 St & 6 Ave 1359 E 32 St & Park Ave 1352 W 38 St & 8 Ave 1337 West St & Chambers St 1324 1 Ave & E 15 St 1319 Lawrence St & Willoughby St 1298 6 Ave & W 33 St 1290 ... Grand St & Havemeyer St 136 Clinton Ave & Flushing Ave 134 Macon St & Nostrand Ave 134 Fulton St & Clermont Ave 130 S Portland Ave & Hanson Pl 128 Broadway & Berry St 126 Clermont Ave & Park Ave 121 Cadman Plaza E & Tillary St 110 Myrtle Ave & St Edwards St 100 Monroe St & Classon Ave 94 Carlton Ave & Park Ave 93 Lexington Ave & Classon Ave 92 Hancock St & Bedford Ave 91 Nassau St & Navy St 89 Fulton St & Rockwell Pl 89 3 Ave & Schermerhorn St 89 Avenue D & E 12 St 82 Flushing Ave & Carlton Ave 76 Gallatin Pl & Livingston St 76 7 Ave & Farragut St 75 Columbia Heights & Cranberry St 72 W 13 St & 5 Ave 69 Franklin Ave & Myrtle Ave 57 Park Ave & St Edwards St 57 Front St & Gold St 56 Hanover Pl & Livingston St 54 Concord St & Bridge St 45 Bedford Ave & S 9th St 41 Railroad Ave & Kay Ave 36 Church St & Leonard St 4 Name: start station name, Length: 329, dtype: int64
df['start station id'].value_counts()
293 2920 519 2719 497 2493 435 2403 521 2171 490 1944 477 1916 382 1808 151 1796 285 1795 402 1694 318 1688 492 1570 444 1562 379 1538 284 1500 358 1500 229 1471 345 1392 442 1391 528 1371 537 1363 116 1361 168 1359 472 1352 523 1337 426 1324 504 1319 323 1298 505 1290 ... 282 136 437 134 343 134 397 130 353 128 389 126 421 121 232 110 245 100 289 94 419 93 120 92 436 91 243 89 144 89 298 89 339 82 242 76 218 76 2001 75 216 72 253 69 372 57 119 57 418 56 431 54 278 45 443 41 2005 36 320 4 Name: start station id, Length: 329, dtype: int64
df['start station id'].value_counts().loc[293]
2920
import datetime
d_ex = datetime.datetime.strptime('2014-02-01 00:00:00', "%Y-%m-%d %H:%M:%S")
d_ex
datetime.datetime(2014, 2, 1, 0, 0)
d_ex.month
2
df['starttime_dt'][0].month
2
df['month_start'] = df['starttime_dt'].apply(lambda x: x.month)
df['month_start'].value_counts()
2 224736 Name: month_start, dtype: int64

all the months are feb

df2 = df[['start station id','start station name']]
time1 = time.time() dict2 = {} for i in range(len(df2)): #to NOT iterate through the df itself, i.e. to NOT iterate through the columns. row = df2.iloc[i] #grabs each row of the df k = row['start station id'] if k in dict2.keys(): #this turns the keys of a dict into a list, same can be done for values dict2[k] += 1 #if key exists adds by 1 else: dict2[k] = 1 #if key is new, adds key and sets value to 1 (no key can ever have no value) time2 = time.time() total_time = time2 - time1 print ("time in seconds: %s" % str(total_time))
time in seconds: 35.311187982559204
time3 = time.time() dict3 = {} for u_id in df2['start station id'].unique(): dict3[u_id] = len(df2[df2['start station id']==u_id]) time4 = time.time() tot_time = time4 - time3 print('Time in seconds: %s' % str(tot_time))
Time in seconds: 0.304030179977417
time5 = time.time() print(df2['start station id'].value_counts().index[0]) time6 = time.time() tot_time = time6 - time5 print('Time in seconds: %s' % str(tot_time))
293 Time in seconds: 0.0047607421875
import time
time.time()
1524245702.8950233
time5 = time.time() dict4 = df2['start station id'].value_counts().to_dict() time6 = time.time() tot_time = time6 - time5 print('Time in seconds: %s' % str(tot_time))
Time in seconds: 0.00567173957824707
import pandas as pd
pd.Series(dict2)
72 676 79 632 82 296 83 301 116 1361 119 57 120 92 127 1019 128 1163 137 508 143 184 144 89 146 530 147 935 150 549 151 1796 152 388 153 1051 157 316 160 633 161 1201 164 650 167 958 168 1359 173 963 174 795 195 713 212 1010 216 72 217 151 ... 527 1020 528 1371 529 1007 531 676 532 410 533 354 534 388 536 961 537 1363 538 569 539 352 540 1288 545 1037 546 960 2000 339 2001 75 2002 222 2003 1093 2004 580 2005 36 2006 429 2008 407 2009 439 2010 560 2012 706 2017 359 2021 720 2022 427 2023 330 3002 781 Length: 329, dtype: int64
my_ser = pd.Series(dict2)
my_ser.sort_values(ascending = False, inplace = True)
my_ser.iloc[0]
2920
my_ser.index
Int64Index([ 293, 519, 497, 435, 521, 490, 477, 382, 151, 285, ... 216, 253, 119, 372, 418, 431, 278, 443, 2005, 320], dtype='int64', length=329)
my_ser.index[0]
293
row = df2.iloc[0]
row
start station id 294 start station name Washington Square E Name: 0, dtype: object
row['start station id']
294
df['start station id'].to_dict() # a series always has the function `to_dict()`
{0: 294, 1: 285, 2: 247, 3: 357, 4: 401, 5: 152, 6: 325, 7: 354, 8: 375, 9: 285, 10: 518, 11: 501, 12: 388, 13: 518, 14: 257, 15: 477, 16: 317, 17: 527, 18: 504, 19: 316, 20: 490, 21: 518, 22: 450, 23: 300, 24: 474, 25: 490, 26: 540, 27: 347, 28: 499, 29: 285, 30: 403, 31: 237, 32: 146, 33: 497, 34: 470, 35: 355, 36: 540, 37: 494, 38: 496, 39: 237, 40: 143, 41: 345, 42: 368, 43: 444, 44: 237, 45: 497, 46: 493, 47: 435, 48: 229, 49: 435, 50: 237, 51: 526, 52: 372, 53: 477, 54: 349, 55: 488, 56: 2012, 57: 476, 58: 3002, 59: 2021, 60: 443, 61: 146, 62: 237, 63: 351, 64: 405, 65: 280, 66: 490, 67: 473, 68: 497, 69: 345, 70: 319, 71: 312, 72: 404, 73: 384, 74: 483, 75: 527, 76: 476, 77: 446, 78: 453, 79: 476, 80: 319, 81: 116, 82: 480, 83: 157, 84: 432, 85: 410, 86: 312, 87: 503, 88: 405, 89: 161, 90: 326, 91: 284, 92: 336, 93: 497, 94: 325, 95: 482, 96: 432, 97: 441, 98: 293, 99: 394, 100: 2021, 101: 279, 102: 401, 103: 417, 104: 380, 105: 438, 106: 336, 107: 161, 108: 254, 109: 336, 110: 342, 111: 505, 112: 523, 113: 252, 114: 368, 115: 405, 116: 522, 117: 257, 118: 463, 119: 521, 120: 2021, 121: 528, 122: 366, 123: 446, 124: 293, 125: 466, 126: 485, 127: 483, 128: 284, 129: 500, 130: 152, 131: 334, 132: 488, 133: 389, 134: 434, 135: 285, 136: 251, 137: 325, 138: 285, 139: 519, 140: 497, 141: 2002, 142: 147, 143: 494, 144: 504, 145: 523, 146: 449, 147: 402, 148: 470, 149: 394, 150: 461, 151: 308, 152: 168, 153: 293, 154: 476, 155: 257, 156: 418, 157: 412, 158: 242, 159: 528, 160: 460, 161: 146, 162: 380, 163: 460, 164: 364, 165: 127, 166: 236, 167: 380, 168: 540, 169: 319, 170: 508, 171: 433, 172: 434, 173: 127, 174: 325, 175: 405, 176: 263, 177: 528, 178: 537, 179: 285, 180: 312, 181: 352, 182: 293, 183: 462, 184: 531, 185: 147, 186: 466, 187: 432, 188: 508, 189: 280, 190: 428, 191: 291, 192: 293, 193: 350, 194: 317, 195: 312, 196: 495, 197: 492, 198: 334, 199: 528, 200: 528, 201: 432, 202: 375, 203: 293, 204: 319, 205: 327, 206: 285, 207: 433, 208: 294, 209: 545, 210: 486, 211: 307, 212: 312, 213: 432, 214: 368, 215: 499, 216: 531, 217: 212, 218: 223, 219: 546, 220: 263, 221: 404, 222: 469, 223: 284, 224: 284, 225: 236, 226: 403, 227: 521, 228: 411, 229: 236, 230: 317, 231: 503, 232: 312, 233: 401, 234: 394, 235: 515, 236: 423, 237: 387, 238: 263, 239: 297, 240: 466, 241: 383, 242: 478, 243: 477, 244: 326, 245: 415, 246: 396, 247: 268, 248: 83, 249: 284, 250: 284, 251: 508, 252: 415, 253: 428, 254: 307, 255: 268, 256: 312, 257: 432, 258: 268, 259: 390, 260: 449, 261: 477, 262: 297, 263: 404, 264: 324, 265: 466, 266: 294, 267: 310, 268: 251, 269: 489, 270: 358, 271: 347, 272: 2008, 273: 403, 274: 238, 275: 432, 276: 446, 277: 262, 278: 262, 279: 294, 280: 476, 281: 489, 282: 395, 283: 343, 284: 492, 285: 379, 286: 334, 287: 546, 288: 505, 289: 296, 290: 546, 291: 268, 292: 517, 293: 395, 294: 168, 295: 346, 296: 236, 297: 463, 298: 349, 299: 247, 300: 403, 301: 72, 302: 395, 303: 157, 304: 448, 305: 368, 306: 349, 307: 382, 308: 495, 309: 531, 310: 312, 311: 357, 312: 489, 313: 164, 314: 486, 315: 251, 316: 326, 317: 389, 318: 83, 319: 127, 320: 147, 321: 307, 322: 473, 323: 307, 324: 307, 325: 357, 326: 312, 327: 143, 328: 397, 329: 403, 330: 470, 331: 463, 332: 312, 333: 428, 334: 503, 335: 345, 336: 448, 337: 388, 338: 307, 339: 312, 340: 312, 341: 520, 342: 345, 343: 358, 344: 498, 345: 334, 346: 466, 347: 236, 348: 297, 349: 319, 350: 412, 351: 259, 352: 284, 353: 497, 354: 329, 355: 383, 356: 439, 357: 291, 358: 404, 359: 385, 360: 128, 361: 375, 362: 457, 363: 236, 364: 335, 365: 470, 366: 487, 367: 403, 368: 151, 369: 301, 370: 301, 371: 540, 372: 285, 373: 515, 374: 512, 375: 433, 376: 236, 377: 345, 378: 372, 379: 495, 380: 243, 381: 439, 382: 296, 383: 247, 384: 252, 385: 252, 386: 312, 387: 274, 388: 526, 389: 285, 390: 254, 391: 417, 392: 383, 393: 82, 394: 412, 395: 483, 396: 417, 397: 404, 398: 382, 399: 457, 400: 336, 401: 326, 402: 404, 403: 361, 404: 449, 405: 472, 406: 479, 407: 401, 408: 247, 409: 479, 410: 497, 411: 377, 412: 280, 413: 304, 414: 526, 415: 497, 416: 489, 417: 485, 418: 361, 419: 355, 420: 323, 421: 540, 422: 327, 423: 415, 424: 228, 425: 404, 426: 500, 427: 474, 428: 2002, 429: 2002, 430: 358, 431: 458, 432: 357, 433: 479, 434: 354, 435: 405, 436: 512, 437: 2022, 438: 532, 439: 296, 440: 463, 441: 334, 442: 526, 443: 447, 444: 504, 445: 505, 446: 432, 447: 271, 448: 323, 449: 312, 450: 312, 451: 147, 452: 284, 453: 317, 454: 428, 455: 477, 456: 368, 457: 405, 458: 404, 459: 404, 460: 474, 461: 435, 462: 297, 463: 380, 464: 521, 465: 358, 466: 463, 467: 2010, 468: 477, 469: 521, 470: 478, 471: 500, 472: 312, 473: 2012, 474: 512, 475: 508, 476: 296, 477: 470, 478: 212, 479: 477, 480: 448, 481: 489, 482: 504, 483: 379, 484: 312, 485: 2003, 486: 2003, 487: 492, 488: 463, 489: 401, 490: 508, 491: 480, 492: 312, 493: 504, 494: 523, 495: 412, 496: 412, 497: 382, 498: 458, 499: 349, 500: 483, 501: 401, 502: 348, 503: 238, 504: 2022, 505: 285, 506: 151, 507: 455, 508: 515, 509: 415, 510: 448, 511: 290, 512: 462, 513: 501, 514: 410, 515: 490, 516: 428, 517: 416, 518: 492, 519: 521, 520: 128, 521: 478, 522: 325, 523: 428, 524: 394, 525: 265, 526: 496, 527: 526, 528: 489, 529: 515, 530: 465, 531: 519, 532: 302, 533: 444, 534: 436, 535: 290, 536: 160, 537: 289, 538: 517, 539: 349, 540: 481, 541: 432, 542: 492, 543: 251, 544: 521, 545: 504, 546: 482, 547: 127, 548: 515, 549: 524, 550: 468, 551: 490, 552: 472, 553: 492, 554: 529, 555: 487, 556: 435, 557: 492, 558: 494, 559: 434, 560: 116, 561: 116, 562: 523, 563: 335, 564: 490, 565: 449, 566: 536, 567: 353, 568: 488, 569: 482, 570: 509, 571: 453, 572: 301, 573: 488, 574: 490, 575: 266, 576: 468, 577: 521, 578: 386, 579: 120, 580: 242, 581: 476, 582: 492, 583: 2000, 584: 482, 585: 258, 586: 476, 587: 523, 588: 478, 589: 248, 590: 281, 591: 291, 592: 406, 593: 317, 594: 521, 595: 237, 596: 487, 597: 406, 598: 462, 599: 270, 600: 477, 601: 82, 602: 477, 603: 391, 604: 311, 605: 293, 606: 422, 607: 400, 608: 291, 609: 254, 610: 279, 611: 511, 612: 466, 613: 369, 614: 395, 615: 168, 616: 369, 617: 410, 618: 433, 619: 174, 620: 369, 621: 502, 622: 534, 623: 519, 624: 300, 625: 480, 626: 447, 627: 398, 628: 512, 629: 500, 630: 495, 631: 483, 632: 394, 633: 502, 634: 536, 635: 391, 636: 357, 637: 236, 638: 500, 639: 472, 640: 167, 641: 212, 642: 525, 643: 341, 644: 454, 645: 290, 646: 537, 647: 379, 648: 379, 649: 147, 650: 508, 651: 446, 652: 128, 653: 477, 654: 498, 655: 476, 656: 434, 657: 503, 658: 236, 659: 2012, 660: 488, 661: 519, 662: 3002, 663: 303, 664: 521, 665: 520, 666: 490, 667: 151, 668: 224, 669: 521, 670: 457, 671: 531, 672: 2006, 673: 373, 674: 301, 675: 517, 676: 369, 677: 329, 678: 329, 679: 167, 680: 453, 681: 264, 682: 490, 683: 403, 684: 488, 685: 490, 686: 2012, 687: 479, 688: 524, 689: 251, 690: 428, 691: 72, 692: 494, 693: 310, 694: 445, 695: 3002, 696: 244, 697: 449, 698: 426, 699: 285, 700: 488, 701: 441, 702: 470, 703: 446, 704: 2022, 705: 310, 706: 479, 707: 127, 708: 445, 709: 476, 710: 127, 711: 432, 712: 251, 713: 302, 714: 2003, 715: 382, 716: 2003, 717: 517, 718: 174, 719: 403, 720: 229, 721: 352, 722: 306, 723: 482, 724: 369, 725: 309, 726: 168, 727: 504, 728: 468, 729: 494, 730: 428, 731: 373, 732: 490, 733: 475, 734: 515, 735: 495, 736: 345, 737: 152, 738: 502, 739: 476, 740: 241, 741: 488, 742: 446, 743: 435, 744: 280, 745: 330, 746: 527, 747: 516, 748: 502, 749: 490, 750: 439, 751: 345, 752: 161, 753: 439, 754: 303, 755: 417, 756: 470, 757: 284, 758: 545, 759: 461, 760: 509, 761: 536, 762: 127, 763: 545, 764: 507, 765: 167, 766: 450, 767: 361, 768: 279, 769: 490, 770: 400, 771: 258, 772: 489, 773: 434, 774: 285, 775: 528, 776: 223, 777: 361, 778: 426, 779: 254, 780: 529, 781: 494, 782: 252, 783: 505, 784: 463, 785: 350, 786: 385, 787: 128, 788: 281, 789: 301, 790: 418, 791: 476, 792: 483, 793: 334, 794: 319, 795: 537, 796: 545, 797: 435, 798: 503, 799: 518, 800: 435, 801: 363, 802: 479, 803: 340, 804: 355, 805: 529, 806: 275, 807: 453, 808: 521, 809: 263, 810: 284, 811: 488, 812: 2009, 813: 405, 814: 364, 815: 266, 816: 363, 817: 423, 818: 258, 819: 279, 820: 306, 821: 3002, 822: 540, 823: 296, 824: 473, 825: 410, 826: 482, 827: 237, 828: 318, 829: 486, 830: 505, 831: 518, 832: 546, 833: 537, 834: 505, 835: 527, 836: 297, 837: 293, 838: 470, 839: 518, 840: 79, 841: 2008, 842: 415, 843: 326, 844: 403, 845: 511, 846: 500, 847: 341, 848: 507, 849: 526, 850: 468, 851: 466, 852: 368, 853: 284, 854: 326, 855: 2012, 856: 518, 857: 334, 858: 444, 859: 247, 860: 157, 861: 423, 862: 462, 863: 345, 864: 345, 865: 428, 866: 306, 867: 507, 868: 258, 869: 428, 870: 3002, 871: 428, 872: 499, 873: 402, 874: 459, 875: 527, 876: 468, 877: 524, 878: 480, 879: 411, 880: 224, 881: 279, 882: 520, 883: 72, 884: 336, 885: 318, 886: 385, 887: 174, 888: 487, 889: 537, 890: 334, 891: 538, 892: 290, 893: 360, 894: 349, 895: 518, 896: 307, 897: 308, 898: 116, 899: 540, 900: 473, 901: 435, 902: 435, 903: 478, 904: 319, 905: 365, 906: 302, 907: 488, 908: 150, 909: 482, 910: 3002, 911: 363, 912: 504, 913: 293, 914: 282, 915: 387, 916: 519, 917: 72, 918: 317, 919: 478, 920: 527, 921: 346, 922: 513, 923: 473, 924: 476, 925: 291, 926: 223, 927: 334, 928: 262, 929: 368, 930: 195, 931: 385, 932: 345, 933: 345, 934: 501, 935: 315, 936: 545, 937: 477, 938: 434, 939: 72, 940: 433, 941: 511, 942: 507, 943: 494, 944: 150, 945: 326, 946: 470, 947: 254, 948: 545, 949: 329, 950: 502, 951: 435, 952: 147, 953: 403, 954: 146, 955: 237, 956: 301, 957: 238, 958: 297, 959: 352, 960: 517, 961: 168, 962: 334, 963: 224, 964: 513, 965: 484, 966: 405, 967: 228, 968: 382, 969: 237, 970: 537, 971: 326, 972: 271, 973: 528, 974: 448, 975: 432, 976: 502, 977: 462, 978: 346, 979: 72, 980: 291, 981: 254, 982: 236, 983: 518, 984: 428, 985: 251, 986: 537, 987: 515, 988: 82, 989: 498, 990: 285, 991: 391, 992: 317, 993: 526, 994: 2022, 995: 494, 996: 529, 997: 363, 998: 435, 999: 455, ...}

A: Station id: 293, Number of rides; 2920

2.7 What percentage of the total riders are of usertype "Subscriber"?

df['usertype'].value_counts()
Subscriber 218019 Customer 6717 Name: usertype, dtype: int64
sum(df['usertype'] == 'Subscriber')
218019
subs = sum(df['usertype'] == 'Subscriber') / len(df) print(subs)
0.9701115976078599
subtype_df = df['usertype'].value_counts().rename('count').to_frame() #this takes a value_counts series, renames it and turns it into a dataframe
subtype_df
subtype_df['ratio'] = subtype_df / len(df2)
subtype_df['ratio'] = subtype_df / subtype_df['count'].sum()
subtype_df

A: 97.0112

What is the average age (in 2014) of the riders in this dataset?

Note, this requires creating a new column and then taking the difference between 2014 and the rider's birth year, then taking the average!

2014 - df['birth year'][0]
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) <ipython-input-137-1b2a498ce611> in <module>() ----> 1 2014 - df['birth year'][0] TypeError: unsupported operand type(s) for -: 'int' and 'str'
df['birth year'].head()
0 1991 1 1979 2 1948 3 1981 4 1990 Name: birth year, dtype: object
df['birth year'].astype(int)
--------------------------------------------------------------------------- ValueError Traceback (most recent call last) <ipython-input-144-ff839553694e> in <module>() ----> 1 df['birth year'].astype(int) ~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\util\_decorators.py in wrapper(*args, **kwargs) 116 else: 117 kwargs[new_arg_name] = new_arg_value --> 118 return func(*args, **kwargs) 119 return wrapper 120 return _deprecate_kwarg ~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py in astype(self, dtype, copy, errors, **kwargs) 4002 # else, only a single dtype is given 4003 new_data = self._data.astype(dtype=dtype, copy=copy, errors=errors, -> 4004 **kwargs) 4005 return self._constructor(new_data).__finalize__(self) 4006 ~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals.py in astype(self, dtype, **kwargs) 3460 3461 def astype(self, dtype, **kwargs): -> 3462 return self.apply('astype', dtype=dtype, **kwargs) 3463 3464 def convert(self, **kwargs): ~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals.py in apply(self, f, axes, filter, do_integrity_check, consolidate, **kwargs) 3327 3328 kwargs['mgr'] = self -> 3329 applied = getattr(b, f)(**kwargs) 3330 result_blocks = _extend_blocks(applied, result_blocks) 3331 ~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals.py in astype(self, dtype, copy, errors, values, **kwargs) 542 def astype(self, dtype, copy=False, errors='raise', values=None, **kwargs): 543 return self._astype(dtype, copy=copy, errors=errors, values=values, --> 544 **kwargs) 545 546 def _astype(self, dtype, copy=False, errors='raise', values=None, ~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals.py in _astype(self, dtype, copy, errors, values, klass, mgr, **kwargs) 623 624 # _astype_nansafe works fine with 1-d only --> 625 values = astype_nansafe(values.ravel(), dtype, copy=True) 626 values = values.reshape(self.shape) 627 ~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\dtypes\cast.py in astype_nansafe(arr, dtype, copy) 690 elif arr.dtype == np.object_ and np.issubdtype(dtype.type, np.integer): 691 # work around NumPy brokenness, #1987 --> 692 return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape) 693 694 if dtype.name in ("datetime64", "timedelta64"): pandas/_libs/lib.pyx in pandas._libs.lib.astype_intsafe() pandas/_libs/src/util.pxd in util.set_value_at_unsafe() ValueError: invalid literal for int() with base 10: '\\N'
df['birth year'].astype(str).astype(int)
--------------------------------------------------------------------------- ValueError Traceback (most recent call last) <ipython-input-126-009a67367ef8> in <module>() ----> 1 df['birth year'].astype(str).astype(int) ~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\util\_decorators.py in wrapper(*args, **kwargs) 116 else: 117 kwargs[new_arg_name] = new_arg_value --> 118 return func(*args, **kwargs) 119 return wrapper 120 return _deprecate_kwarg ~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py in astype(self, dtype, copy, errors, **kwargs) 4002 # else, only a single dtype is given 4003 new_data = self._data.astype(dtype=dtype, copy=copy, errors=errors, -> 4004 **kwargs) 4005 return self._constructor(new_data).__finalize__(self) 4006 ~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals.py in astype(self, dtype, **kwargs) 3460 3461 def astype(self, dtype, **kwargs): -> 3462 return self.apply('astype', dtype=dtype, **kwargs) 3463 3464 def convert(self, **kwargs): ~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals.py in apply(self, f, axes, filter, do_integrity_check, consolidate, **kwargs) 3327 3328 kwargs['mgr'] = self -> 3329 applied = getattr(b, f)(**kwargs) 3330 result_blocks = _extend_blocks(applied, result_blocks) 3331 ~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals.py in astype(self, dtype, copy, errors, values, **kwargs) 542 def astype(self, dtype, copy=False, errors='raise', values=None, **kwargs): 543 return self._astype(dtype, copy=copy, errors=errors, values=values, --> 544 **kwargs) 545 546 def _astype(self, dtype, copy=False, errors='raise', values=None, ~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals.py in _astype(self, dtype, copy, errors, values, klass, mgr, **kwargs) 623 624 # _astype_nansafe works fine with 1-d only --> 625 values = astype_nansafe(values.ravel(), dtype, copy=True) 626 values = values.reshape(self.shape) 627 ~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\dtypes\cast.py in astype_nansafe(arr, dtype, copy) 690 elif arr.dtype == np.object_ and np.issubdtype(dtype.type, np.integer): 691 # work around NumPy brokenness, #1987 --> 692 return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape) 693 694 if dtype.name in ("datetime64", "timedelta64"): pandas/_libs/lib.pyx in pandas._libs.lib.astype_intsafe() pandas/_libs/src/util.pxd in util.set_value_at_unsafe() ValueError: invalid literal for int() with base 10: '\\N'
df['birth_year'] = pd.to_numeric(df['birth year'])
--------------------------------------------------------------------------- ValueError Traceback (most recent call last) pandas/_libs/src/inference.pyx in pandas._libs.lib.maybe_convert_numeric() ValueError: Unable to parse string "\N" During handling of the above exception, another exception occurred: ValueError Traceback (most recent call last) <ipython-input-100-cd37727500f1> in <module>() ----> 1 df['birth_year'] = pd.to_numeric(df['birth year']) /anaconda3/lib/python3.6/site-packages/pandas/core/tools/numeric.py in to_numeric(arg, errors, downcast) 131 coerce_numeric = False if errors in ('ignore', 'raise') else True 132 values = lib.maybe_convert_numeric(values, set(), --> 133 coerce_numeric=coerce_numeric) 134 135 except Exception: pandas/_libs/src/inference.pyx in pandas._libs.lib.maybe_convert_numeric() ValueError: Unable to parse string "\N" at position 31
df['birth year'].loc[31]
'\\N'
df['birth year'].value_counts()
1985 9305 1984 9139 1983 8779 1981 8208 1986 8109 1988 8048 1982 7968 1979 7771 1980 7661 1987 7472 1978 6876 1989 6843 \N 6717 1977 5957 1974 5848 1970 5736 1990 5697 1976 5579 1969 5455 1972 5028 1973 4974 1971 4923 1975 4884 1967 4650 1968 4476 1964 4249 1963 4229 1966 4005 1965 3644 1962 3641 ... 1949 725 1946 454 1947 451 1996 334 1944 311 1997 251 1945 214 1942 182 1941 164 1943 130 1940 84 1938 75 1900 68 1939 43 1922 32 1936 31 1937 24 1934 21 1935 14 1901 11 1933 10 1899 9 1932 8 1907 5 1926 5 1910 4 1917 3 1927 2 1921 1 1913 1 Name: birth year, Length: 78, dtype: int64
df[df['birth year'] == '\\N']
x = 2014 df['diff_year'] = x - df['birth_year'] df['diff_year'].mean()
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) /anaconda3/lib/python3.6/site-packages/pandas/core/ops.py in na_op(x, y) 675 try: --> 676 result = expressions.evaluate(op, str_rep, x, y, **eval_kwargs) 677 except TypeError: /anaconda3/lib/python3.6/site-packages/pandas/core/computation/expressions.py in evaluate(op, op_str, a, b, use_numexpr, **eval_kwargs) 203 if use_numexpr: --> 204 return _evaluate(op, op_str, a, b, **eval_kwargs) 205 return _evaluate_standard(op, op_str, a, b) /anaconda3/lib/python3.6/site-packages/pandas/core/computation/expressions.py in _evaluate_numexpr(op, op_str, a, b, truediv, reversed, **eval_kwargs) 118 if result is None: --> 119 result = _evaluate_standard(op, op_str, a, b) 120 /anaconda3/lib/python3.6/site-packages/pandas/core/computation/expressions.py in _evaluate_standard(op, op_str, a, b, **eval_kwargs) 63 with np.errstate(all='ignore'): ---> 64 return op(a, b) 65 /anaconda3/lib/python3.6/site-packages/pandas/core/ops.py in <lambda>(x, y) 98 default_axis=default_axis, reversed=True), ---> 99 rsub=arith_method(lambda x, y: y - x, names('rsub'), op('-'), 100 default_axis=default_axis, reversed=True), TypeError: unsupported operand type(s) for -: 'int' and 'str' During handling of the above exception, another exception occurred: TypeError Traceback (most recent call last) /anaconda3/lib/python3.6/site-packages/pandas/core/ops.py in safe_na_op(lvalues, rvalues) 699 with np.errstate(all='ignore'): --> 700 return na_op(lvalues, rvalues) 701 except Exception: /anaconda3/lib/python3.6/site-packages/pandas/core/ops.py in na_op(x, y) 685 mask = notna(x) --> 686 result[mask] = op(x[mask], y) 687 else: /anaconda3/lib/python3.6/site-packages/pandas/core/ops.py in <lambda>(x, y) 98 default_axis=default_axis, reversed=True), ---> 99 rsub=arith_method(lambda x, y: y - x, names('rsub'), op('-'), 100 default_axis=default_axis, reversed=True), TypeError: unsupported operand type(s) for -: 'int' and 'str' During handling of the above exception, another exception occurred: TypeError Traceback (most recent call last) <ipython-input-99-008ed3355f8f> in <module>() 1 x = 2014 ----> 2 df['diff_year'] = x - df['birth year'] 3 df['diff_year'].mean() /anaconda3/lib/python3.6/site-packages/pandas/core/ops.py in wrapper(left, right, name, na_op) 737 lvalues = lvalues.values 738 --> 739 result = wrap_results(safe_na_op(lvalues, rvalues)) 740 return construct_result( 741 left, /anaconda3/lib/python3.6/site-packages/pandas/core/ops.py in safe_na_op(lvalues, rvalues) 708 if is_object_dtype(lvalues): 709 return libalgos.arrmap_object(lvalues, --> 710 lambda x: op(x, rvalues)) 711 raise 712 pandas/_libs/algos_common_helper.pxi in pandas._libs.algos.arrmap_object() /anaconda3/lib/python3.6/site-packages/pandas/core/ops.py in <lambda>(x) 708 if is_object_dtype(lvalues): 709 return libalgos.arrmap_object(lvalues, --> 710 lambda x: op(x, rvalues)) 711 raise 712 /anaconda3/lib/python3.6/site-packages/pandas/core/ops.py in <lambda>(x, y) 97 rmul=arith_method(operator.mul, names('rmul'), op('*'), 98 default_axis=default_axis, reversed=True), ---> 99 rsub=arith_method(lambda x, y: y - x, names('rsub'), op('-'), 100 default_axis=default_axis, reversed=True), 101 rtruediv=arith_method(lambda x, y: operator.truediv(y, x), TypeError: unsupported operand type(s) for -: 'int' and 'str'
df.iloc[31]
tripduration 664 starttime 2014-02-01 00:08:47 stoptime 2014-02-01 00:19:51 start station id 237 start station name E 11 St & 2 Ave start station latitude 40.7305 start station longitude -73.9867 end station id 349 end station name Rivington St & Ridge St end station latitude 40.7185 end station longitude -73.9833 bikeid 17540 usertype Customer birth year \N gender 0 newstarttime 2014-02-01 00:08:47 newstoptime 2014-02-01 00:19:51 triptime 0 days 00:11:04 triptime_seconds 664 triptime_hours 0.184444 Name: 31, dtype: object
df[df['birth year']=='\\N'] #need to add an extra \ bc \n is a new line character so \\n = \n in a non new line character form
new_df = df[df['birth year']!='\\N']
#new_df['birth year'] = new_df['birth year'].astype(int) new_df['birth year'] = new_df['birth year'].apply(lambda x: int(x))
C:\Users\ystrano\AppData\Local\Continuum\anaconda3\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
df['int_birthyear'] = 2014 - new_df['birth year'] #in numpy you can broadcast a calc and it will apply across every row. this runs quickly bc it uses cython (c code) and multi threading
df['int_birthyear'].mean()
38.50249290199478
df["birth_year"] = df['birth year'].replace('\\N', np.nan)

adding a \ helps escape the special character issue this string has. the problem with this is that the 0 is going to deflate the average result - this analysis is correct, and that's why we set this to "not a number" and not 0

df.loc[~df['birth_year'].isnull(), 'birth_year'] = df.loc[~df['birth_year'].isnull(), 'birth_year'].astype(int)
df['age_approx'] = 2014 - df['birth_year']
df['age_approx'].mean()
38.50249290199478