Real-time collaboration for Jupyter Notebooks, Linux Terminals, LaTeX, VS Code, R IDE, and more,
all in one place. Commercial Alternative to JupyterHub.
Path: blob/master/Analyze Your Runkeeper Fitness Data/notebook.ipynb
1. Obtain and review raw data
One day, my old running friend and I were chatting about our running styles, training habits, and achievements, when I suddenly realized that I could take an in-depth analytical look at my training. I have been using a popular GPS fitness tracker called Runkeeper for years and decided it was time to analyze my running data to see how I was doing.
Since 2012, I've been using the Runkeeper app, and it's great. One key feature: its excellent data export. Anyone who has a smartphone can download the app and analyze their data like we will in this notebook.
After logging your run, the first step is to export the data from Runkeeper (which I've done already). Then import the data and start exploring to find potential problems. After that, create data cleaning strategies to fix the issues. Finally, analyze and visualize the clean time-series data.
I exported seven years' worth of my training data, from 2012 through 2018. The data is a CSV file where each row is a single training activity. Let's load and inspect it.
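A minimal sketch of the loading step. The name of the exported file is not given here, so this reads a tiny in-memory stand-in instead; the rows are made up, and the `Distance (km)` and `Average Heart Rate (bpm)` column names are assumptions about the export format.

```python
import io
import pandas as pd

# Made-up stand-in for the exported CSV (the real file name is not given here).
sample_csv = io.StringIO(
    "Date,Type,Distance (km),Average Heart Rate (bpm)\n"
    "2018-11-11 14:05:12,Running,10.44,159\n"
    "2018-11-09 15:02:35,Running,12.84,\n"
    "2018-11-04 16:05:00,Cycling,13.01,\n"
)

# Parse the Date column as datetimes and use it as the index,
# which time-series methods such as resample() rely on.
df_activities = pd.read_csv(sample_csv, parse_dates=["Date"], index_col="Date")

# First look: a few rows, then column types and non-null counts.
print(df_activities.head())
df_activities.info()
```

With the real export, the `io.StringIO` object would be replaced by the path to the CSV file.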
2. Data preprocessing
Lucky for us, the column names Runkeeper provides are informative, and we don't need to rename any columns.
However, the `info()` method shows that some values are missing. Why are they missing? It depends. Some heart rate information is missing because I didn't always use a cardio sensor. The `Notes` column is an optional field that I sometimes left blank. Also, I only used the `Route Name` column once, and never used the `Friend's Tagged` column.
We'll fill in missing values in the heart rate column to avoid misleading results later, but right now, our first data preprocessing steps will be to:
- Remove columns not useful for our analysis.
- Replace the "Other" activity type with "Unicycling" because that was always the "Other" activity.
- Count missing values.
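The three steps above can be sketched as follows. The DataFrame is a made-up miniature of the export, and the list of columns to drop is an assumption based on the columns described as rarely used; the real notebook may drop a different set.

```python
import numpy as np
import pandas as pd

# Made-up miniature of the export; only the shape of the data matters here.
df = pd.DataFrame({
    "Type": ["Running", "Other", "Cycling", "Running"],
    "Distance (km)": [10.4, 2.1, 13.0, 12.8],
    "Average Heart Rate (bpm)": [159.0, np.nan, np.nan, 151.0],
    "Notes": [np.nan, "fun", np.nan, np.nan],
    "Route Name": [np.nan, np.nan, np.nan, np.nan],
    "Friend's Tagged": [np.nan, np.nan, np.nan, np.nan],
})

# 1. Remove columns not useful for the analysis (assumed list).
cols_to_drop = ["Notes", "Route Name", "Friend's Tagged"]
df = df.drop(columns=cols_to_drop)

# 2. Relabel the "Other" activity type as "Unicycling".
df["Type"] = df["Type"].replace("Other", "Unicycling")

# 3. Count missing values per column.
missing_counts = df.isnull().sum()

print(df["Type"].value_counts())
print(missing_counts)
```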
3. Dealing with missing values
As we can see from the last output, there are 214 missing entries for my average heart rate.
We can't go back in time to get those data, but we can fill in the missing values with an average value. This process is called mean imputation. When imputing the mean to fill in missing data, we need to consider that the average heart rate varies for different activities (e.g., walking vs. running). We'll filter the DataFrames by activity type (`Type`) and calculate each activity's mean heart rate, then fill in the missing values with those means.
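One possible way to do per-activity mean imputation is sketched below. Instead of filtering one DataFrame per activity type, it uses `groupby().transform("mean")`, which computes the same per-type means in one step. The data and the `Average Heart Rate (bpm)` column name are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Made-up sample: one missing heart rate per activity type.
df = pd.DataFrame({
    "Type": ["Running", "Running", "Running", "Cycling", "Cycling"],
    "Average Heart Rate (bpm)": [150.0, 160.0, np.nan, 120.0, np.nan],
})

hr = "Average Heart Rate (bpm)"

# Mean heart rate of each activity type, repeated on every row of that type.
# NaN values are ignored when the mean is computed.
type_means = df.groupby("Type")[hr].transform("mean")

# Fill only the missing entries, each with the mean of its own activity type.
df[hr] = df[hr].fillna(type_means)

print(df)
```

Here the missing running entry is filled with the running mean and the missing cycling entry with the cycling mean, so a slow activity does not drag down the imputed value for a fast one.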
4. Plot running data
Now we can create our first plot! As we found earlier, most of the activities in my data were running (459 of them, to be exact). There are only 29, 18, and 2 instances of cycling, walking, and unicycling, respectively. So for now, let's focus on plotting the different running metrics.
An excellent first visualization is a figure with four subplots, one for each running metric (each numerical column). Each subplot will have a different y-axis, which is explained in each legend. The x-axis, `Date`, is shared among all subplots.
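A rough sketch of such a figure using the pandas plotting wrapper around matplotlib. The four column names and all values are assumptions made for illustration; the real metrics come from the numerical columns of the export.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up running data indexed by date.
dates = pd.date_range("2018-01-07", periods=6, freq="W")
runs = pd.DataFrame(
    {
        "Distance (km)": [10.4, 12.8, 8.1, 15.0, 11.2, 9.7],
        "Average Speed (km/h)": [10.8, 11.1, 10.2, 10.9, 11.4, 10.6],
        "Climb (m)": [130, 168, 90, 210, 150, 120],
        "Average Heart Rate (bpm)": [159, 151, 148, 155, 152, 150],
    },
    index=dates,
)
runs.index.name = "Date"

# One subplot per column, all sharing the Date x-axis.
# Each subplot gets its own y-axis scale and a legend naming its column.
axes = runs.plot(
    subplots=True,
    sharex=True,
    figsize=(12, 16),
    linestyle="none",
    marker="o",
    markersize=3,
)
plt.tight_layout()
```

In a notebook the figure renders automatically; in a script, a call to `plt.show()` would display it.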
5. Running statistics
No doubt, running helps people stay mentally and physically healthy and productive at any age. And it is great fun! When runners talk to each other about their hobby, we not only discuss our results, but we also discuss different training strategies.
You'll know you're with a group of runners if you commonly hear questions like:
- What is your average distance?
- How fast do you run?
- Do you measure your heart rate?
- How often do you train?
Let's find the answers to these questions in my data. If you look back at the plots in Task 4, you can see the answer to "Do you measure your heart rate?": before 2015, no. To look at the averages, let's use only the data from 2015 through 2018.
In pandas, the `resample()` method is similar to the `groupby()` method: with `resample()` you group by a specific time span. We'll use `resample()` to group the time series data by a sampling period and apply several methods to each sampling period. In our case, we'll resample annually and weekly.
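A small sketch of annual and weekly resampling on made-up data. The `"YS"` (year start) alias is used here as an assumption for portability, because the preferred annual alias differs between pandas versions; the notebook itself may use another one.

```python
import pandas as pd

# Made-up runs: three in 2017 and two in 2018.
dates = pd.to_datetime(
    ["2017-03-05", "2017-03-07", "2017-03-15", "2018-06-02", "2018-06-03"]
)
runs = pd.DataFrame({"Distance (km)": [10.0, 12.0, 8.0, 14.0, 10.0]}, index=dates)

# Annual resampling: average distance per run in each calendar year.
annual_mean = runs["Distance (km)"].resample("YS").mean()

# Weekly resampling: number of runs in each week (weeks without runs count 0),
# then the average of those weekly counts as a "how often do I train" figure.
weekly_counts = runs["Distance (km)"].resample("W").count()
runs_per_week = weekly_counts.mean()

print(annual_mean)
print(runs_per_week)
```

Unlike `groupby()` on a column, `resample()` also produces the empty periods between observations, which is why the weekly count includes weeks with no training.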
6. Visualization with averages
Let's plot the long-term averages of my distance run and my heart rate with their raw data to visually compare the averages to each training session. Again, we'll use the data from 2015 through 2018.
In this task, we will use `matplotlib` functionality for plot creation and customization.
7. Did I reach my goals?
To motivate myself to run regularly, I set a target goal of running 1000 km per year. Let's visualize my annual running distance (km) from 2013 through 2018 to see if I reached my goal each year. Only stars in the green region indicate success.
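Before plotting, the goal check itself can be sketched as a comparison of annual totals against the 1000 km target. The distances below are made up purely to show one failing and one successful year; they are not my real totals.

```python
import pandas as pd

GOAL_KM = 1000  # annual target distance from the text

# Made-up, coarse sample: each row stands for a block of training.
dates = pd.to_datetime(["2013-05-01", "2013-09-01", "2014-04-01", "2014-10-01"])
runs = pd.DataFrame({"Distance (km)": [400.0, 450.0, 600.0, 500.0]}, index=dates)

# Total distance per calendar year ("YS" = year start, labelled by January 1).
annual_totals = runs["Distance (km)"].resample("YS").sum()

# True for the years in which the goal was reached.
goal_reached = annual_totals >= GOAL_KM

for year, total, ok in zip(annual_totals.index.year, annual_totals, goal_reached):
    print(year, total, "reached" if ok else "missed")
```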
8. Am I progressing?
Let's dive a little deeper into the data to answer a tricky question: am I progressing in terms of my running skills?
To answer this question, we'll decompose my weekly distance run and visually compare it to the raw data. A red trend line will represent the weekly distance run.
We are going to use the `statsmodels` library to decompose the weekly trend.
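To give a feel for what a trend component is, here is a simplified stand-in that uses only pandas: a centered 52-week rolling mean over a synthetic weekly series. This is not the `statsmodels` decomposition itself, only an approximation of the idea of smoothing away the yearly cycle; the series and the window length are assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic weekly distances: slow upward drift plus a yearly cycle (made up).
weeks = pd.date_range("2015-01-04", periods=156, freq="W")
t = np.arange(len(weeks))
weekly_km = pd.Series(
    20 + 0.05 * t + 5 * np.sin(2 * np.pi * t / 52), index=weeks
)

# Centered 52-week moving average: averaging over a full year cancels the
# yearly cycle and leaves the slow drift. The edges are NaN because a full
# window is not available there.
trend = weekly_km.rolling(window=52, center=True).mean()

print(trend.dropna().head())
```

A rising trend line over the raw weekly values would suggest increasing training volume, while a flat or falling one would suggest the opposite.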
9. Training intensity
Heart rate is a popular metric used to measure training intensity. Depending on age and fitness level, heart rates are grouped into different zones that people can target depending on training goals. A target heart rate during moderate-intensity activities is about 50-70% of maximum heart rate, while during vigorous physical activity it’s about 70-85% of maximum.
We'll create a distribution plot of my heart rate data by training intensity. It will show the number of activities in each predefined training zone.
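As a rough illustration of the zone idea, the sketch below bins heart rates into the two ranges quoted above (50-70% and 70-85% of maximum) and counts activities per zone. The maximum heart rate of 190 bpm and the sample values are hypothetical, and the notebook's own predefined zones may differ.

```python
import pandas as pd

MAX_HR = 190  # hypothetical maximum heart rate in bpm, not from the notebook

# Made-up average heart rates for six activities.
heart_rates = pd.Series([110, 125, 140, 150, 158, 165])

# Zone edges at 50%, 70%, and 85% of the assumed maximum.
bins = [0.50 * MAX_HR, 0.70 * MAX_HR, 0.85 * MAX_HR]
labels = ["moderate (50-70%)", "vigorous (70-85%)"]

# Assign each activity to a zone; values outside both zones become NaN.
zones = pd.cut(heart_rates, bins=bins, labels=labels)

# Number of activities per zone, in zone order.
counts = zones.value_counts(sort=False)
print(counts)
```

These counts are what a histogram with the same bin edges would display as bar heights.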
10. Detailed summary report
With all this data cleaning, analysis, and visualization, let's create detailed summary tables of my training.
To do this, we'll create two tables. The first table will be a summary of the distance (km) and climb (m) variables for each training activity. The second table will list the summary statistics for the average speed (km/hr), climb (m), and distance (km) variables for each training activity.
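The two tables could be built along these lines. The data and exact column names are assumptions, and "summary" in the first table is interpreted here as per-activity totals.

```python
import pandas as pd

# Made-up activities of two types.
df = pd.DataFrame({
    "Type": ["Running", "Running", "Cycling", "Cycling"],
    "Distance (km)": [10.0, 12.0, 30.0, 20.0],
    "Climb (m)": [100, 140, 300, 200],
    "Average Speed (km/h)": [10.5, 11.5, 22.0, 20.0],
})

# Table 1: total distance and climb for each activity type.
totals = df.groupby("Type")[["Distance (km)", "Climb (m)"]].sum()

# Table 2: summary statistics (count, mean, std, min, quartiles, max)
# of speed, climb, and distance for each activity type.
summary = df.groupby("Type")[
    ["Average Speed (km/h)", "Climb (m)", "Distance (km)"]
].describe()

print(totals)
print(summary)
```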
11. Fun facts
To wrap up, let’s pick some fun facts out of the summary tables and solve the last exercise.
These data (my running history) represent 6 years, 2 months and 21 days. I also remember how many pairs of running shoes I went through: 7.
FUN FACTS
- Average distance: 11.38 km
- Longest distance: 38.32 km
- Highest climb: 982 m
- Total climb: 57,278 m
- Total number of km run: 5,224 km
- Total runs: 459
- Number of running shoes gone through: 7 pairs
The story of Forrest Gump is well known: the man who, for no particular reason, decided to go for a "little run." His epic run lasted 3 years, 2 months and 14 days (1169 days). In the picture you can see Forrest’s route of 24,700 km.
FORREST RUN FACTS
- Average distance: 21.13 km
- Total number of km run: 24,700 km
- Total runs: 1169
- Number of running shoes gone through: ...
Assuming Forrest and I go through running shoes at the same rate, figure out how many pairs of shoes Forrest needed for his run.
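One way to work through the arithmetic, using only the figures from the fun facts above. Whether to round the result down or up is a judgment call that the exercise leaves open.

```python
# Figures taken from the fun facts above.
my_total_km = 5224
my_pairs = 7
forrest_total_km = 24700

# Average distance covered by one pair of shoes (about 746 km).
km_per_pair = my_total_km / my_pairs

# Pairs Forrest would need at the same wear rate (about 33.1).
forrest_pairs = forrest_total_km / km_per_pair

print(f"One pair lasts about {km_per_pair:.0f} km")
print(f"Forrest would need about {forrest_pairs:.1f} pairs")
```

So Forrest would have gone through roughly 33 pairs of shoes, or 34 if he replaced the last pair before it wore out completely.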