CoCalc -- exam2-solution.ipynb

Project: CSCI 195

Path: Class Samples / datascience / exam2-solution.ipynb

Execute the cell below to import the Pandas and NumPy modules using their familiar aliases.

In [17]:

import numpy as np
import pandas as pd

Write a statement in the cell below that creates and displays a DataFrame named stats with the structure shown below.

            points  assists
Player
Schoonveld       5        2
Voskuil         11        3
Muller          18        5

Player is not a data row, but is the index's name.

In [16]:

stats = pd.DataFrame(data={"points": [5, 11, 18], "assists": [2, 3, 5]}, index=['Schoonveld', 'Voskuil', 'Muller'])
stats.index.name="Player"
stats
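
The index name can also be supplied at construction time rather than assigned afterwards. The following is a minimal sketch of that alternative; the name stats_alt is hypothetical and is used only so the stats DataFrame above is left untouched.

```python
import pandas as pd

# Same data as the cell above; the index name is set when the index is built.
# stats_alt is a hypothetical name chosen for this illustration.
stats_alt = pd.DataFrame(
    {"points": [5, 11, 18], "assists": [2, 3, 5]},
    index=pd.Index(["Schoonveld", "Voskuil", "Muller"], name="Player"),
)
print(stats_alt)
```

Either approach should display Player above the row labels instead of as a data row.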

Execute the cell below to load the taxis and vehicles data sets, which come from the seaborn package, from the exam2_data module.

In [3]:

from exam2_data import taxis, vehicles

Write a statement in the cell below that lists the names and types of the columns in the DataFrame named taxis.

In [5]:

taxis.info()

Write a statement in the cell below that displays the rows in taxis with index values 1000 through 1005.

In [9]:

taxis.loc[1000:1005]

Write a statement in the cell below that displays the pickup_zone, dropoff_zone, and fare for the 10th through 20th rows in taxis.

In [10]:

# Positions 9 through 19 are the 10th through 20th rows; iloc excludes the stop position.
taxis.iloc[9:20][['pickup_zone', 'dropoff_zone', 'fare']]
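
Label-based and position-based slices treat their end points differently, which matters when a question counts rows from 1. The sketch below uses a hypothetical toy DataFrame (not the taxis data) with a default integer index to show the difference.

```python
import pandas as pd

# Hypothetical toy frame with a default 0..29 index, standing in for taxis.
toy = pd.DataFrame({"fare": range(100, 130)})

by_label = toy.loc[10:20]            # label slice: both end points are included
by_position = toy.iloc[10:20]        # position slice: the stop position is excluded
tenth_to_twentieth = toy.iloc[9:20]  # rows 10 through 20 when counting from 1

print(len(by_label), len(by_position), len(tenth_to_twentieth))
```

With this toy data the three slices contain 11, 10, and 11 rows respectively.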

Write a statement that displays the possible values in the payment column in taxis.

In [15]:

taxis['payment'].unique()

What percentage of trips were taken by one passenger? Write one or more statements in the cell below that display the answer to this question in the format below:

Out of #,### trips, ##.##% were taken by a single passenger.

In [5]:

total_trips = len(taxis)
number_single_passenger_trips = np.sum(taxis['passengers'] == 1)
percent_single_passenger = number_single_passenger_trips / total_trips * 100
print(f"Out of {total_trips:,} trips, {percent_single_passenger:.2f}% were taken by a single passenger.")
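
Because True counts as 1 and False as 0, the mean of a boolean Series is the fraction of True values, which gives a slightly shorter route to the same percentage. This is a sketch on hypothetical passenger counts, not the actual taxis data.

```python
import pandas as pd

# Hypothetical passenger counts, standing in for taxis['passengers'].
passengers = pd.Series([1, 2, 1, 1, 3, 1, 2, 1])

is_single = passengers == 1              # boolean Series, True for single-passenger trips
percent_single = is_single.mean() * 100  # mean of booleans is the fraction of True values

print(f"Out of {len(passengers):,} trips, {percent_single:.2f}% were taken by a single passenger.")
```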

Write a single statement to display the 2 most frequently occurring values in the pickup_borough column along with the number of trips originating in those boroughs.

In [6]:

taxis['pickup_borough'].value_counts()[0:2]
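
Since value_counts() returns counts sorted from most to least frequent by default, taking the first two entries yields the two most common values. The sketch below shows the same idea with head(2) on a hypothetical Series of borough names.

```python
import pandas as pd

# Hypothetical borough values, standing in for taxis['pickup_borough'].
boroughs = pd.Series(
    ["Manhattan", "Queens", "Manhattan", "Brooklyn", "Manhattan", "Queens"]
)

# value_counts() sorts by count in descending order; head(2) keeps the top two.
top_two = boroughs.value_counts().head(2)
print(top_two)
```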

Determine if more trips were made whose total was $20 or less, or whose total was $50 or more.

Print either the string "$20 or less" or the string "$50 or more".

In [25]:

under_20 = np.sum(taxis['total'] <= 20)
over_50 = np.sum(taxis['total'] >= 50)

print("$20 or less" if under_20 > over_50 else "$50 or more")

Write a statement that loads the contents of a file named mpg.txt into a DataFrame named mpg. The | character is used to separate columns within the file. The columns should be named:

mpg
cylinders
displacement
horsepower
weight
acceleration
model_year
origin
name

The displacement and acceleration columns should not be imported into mpg. Display the first 10 rows in mpg to verify the import worked correctly.

In [6]:

columns = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'model_year', 'origin', 'name']
keep = ['mpg', 'cylinders', 'horsepower', 'weight', 'model_year', 'origin', 'name']
mpg = pd.read_csv('mpg.txt', names=columns, usecols=keep, delimiter="|")
mpg.head(n=10)
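
The same import can be tried without the file by reading from an in-memory string. This is a minimal sketch: the two sample rows are made up for illustration and are not taken from mpg.txt, and sample_mpg and skip are hypothetical names.

```python
import io
import pandas as pd

# Hypothetical two-row sample in the pipe-separated layout described above.
# The values are invented for illustration only.
sample = io.StringIO(
    "18.0|8|307.0|130|3504|12.0|70|usa|car a\n"
    "24.0|4|113.0|95|2372|15.0|70|japan|car b\n"
)

columns = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
           'acceleration', 'model_year', 'origin', 'name']
skip = {'displacement', 'acceleration'}

sample_mpg = pd.read_csv(
    sample,
    sep="|",                                         # pipe-separated fields
    names=columns,                                   # the sample has no header row
    usecols=[c for c in columns if c not in skip],   # derive the kept columns from the full list
)
print(sample_mpg.columns.tolist())
```

Deriving the kept columns from the full list avoids typing the names twice, at the cost of being a little less explicit than the keep list above.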

Write code to add a new column named weight_tons to the DataFrame named vehicles. The values in the weight_tons column should be the values in the weight column divided by 2000.

Display the first 5 values in the weight_tons column after adding the column.

In [10]:

# Vectorized division: pandas applies the operation to the whole column at once.
vehicles['weight_tons'] = vehicles['weight'] / 2000
vehicles['weight_tons'].head(5)

This question uses the DataFrame named taxis. Write a statement in the cell below that creates and displays a DataFrame named mean_by_borough containing the average values for the fare and distance traveled for each value of the pickup_borough column.

In [23]:

grouped_by_pickup = taxis.groupby('pickup_borough')
mean_by_borough = grouped_by_pickup[['fare', 'distance']].mean()
mean_by_borough
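
To see what the grouped mean produces, it can help to run the same pattern on a frame small enough to check by hand. The toy DataFrame below is hypothetical and only mirrors the three relevant column names.

```python
import pandas as pd

# Hypothetical toy data with the same column names as taxis.
toy = pd.DataFrame({
    "pickup_borough": ["Queens", "Queens", "Bronx", "Bronx"],
    "fare": [10.0, 20.0, 8.0, 12.0],
    "distance": [2.0, 4.0, 1.0, 3.0],
})

# One row per borough, one column per averaged value.
toy_means = toy.groupby("pickup_borough")[["fare", "distance"]].mean()
print(toy_means)
```

The result is indexed by pickup_borough, so a single value can be read with, for example, toy_means.loc["Queens", "fare"].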

Determine the most common combination of pickup_borough and dropoff_borough for which pickup_borough and dropoff_borough are not the same.

For that combination only, determine the number of fares and the amount of revenue generated. Revenue is defined as the sum of the values in the total column.

The output should be:

The most common trip was from Queens to Manhattan, with 224 trips generating $11,436.66 of revenue.

In [18]:

# Keep only trips that end in a different borough than they started in.
different_dropoff = taxis[taxis['pickup_borough'] != taxis['dropoff_borough']]
# Count trips per (pickup, dropoff) pair and take the pair with the highest count.
trips_per_pair = different_dropoff.groupby(['pickup_borough', 'dropoff_borough'])['fare'].count()
(pickup, dropoff) = trips_per_pair.idxmax()
# Select that pair's trips once and reuse the selection for both figures.
mask = (taxis['pickup_borough'] == pickup) & (taxis['dropoff_borough'] == dropoff)
trips = taxis[mask]
revenue = trips['total'].sum()
num_trips = trips.shape[0]
print(f"The most common trip was from {pickup} to {dropoff}, with {num_trips:,} trips generating ${revenue:,.2f} of revenue.")
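
The trip count and the revenue can also be computed together in one grouped aggregation, which avoids filtering a second time. This is a sketch on hypothetical toy data, so its output differs from the expected output above; the names toy, cross, and summary are illustrative.

```python
import pandas as pd

# Hypothetical toy trips with the same column names as taxis.
toy = pd.DataFrame({
    "pickup_borough":  ["Queens", "Queens", "Manhattan", "Queens", "Bronx", "Manhattan"],
    "dropoff_borough": ["Manhattan", "Manhattan", "Manhattan", "Queens", "Queens", "Queens"],
    "total": [30.0, 25.5, 9.0, 7.0, 40.0, 33.0],
})

# Keep only trips that cross borough lines.
cross = toy[toy["pickup_borough"] != toy["dropoff_borough"]]

# Named aggregation: count the non-null totals and sum them per pair.
summary = cross.groupby(["pickup_borough", "dropoff_borough"]).agg(
    trips=("total", "count"),
    revenue=("total", "sum"),
)

pickup, dropoff = summary["trips"].idxmax()  # pair with the most trips
row = summary.loc[(pickup, dropoff)]
print(f"The most common trip was from {pickup} to {dropoff}, "
      f"with {int(row['trips']):,} trips generating ${row['revenue']:,.2f} of revenue.")
```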