Exploratory Data Analysis
Introduction
After the data cleaning step where we put our data into a few standard formats, the next step is to take a look at the data and see if what we're looking at makes sense. Before applying any fancy algorithms, it's always important to explore the data first.
When working with numerical data, some of the exploratory data analysis (EDA) techniques we can use include finding the average of the data set, the distribution of the data, the most common values, etc. The idea is the same when working with text data. We are going to find some of the more obvious patterns with EDA before identifying the hidden patterns with machine learning (ML) techniques. We are going to look at the following for each comedian:
Most common words - find these and create word clouds
Size of vocabulary - look at the number of unique words and also how quickly someone speaks
Amount of profanity - most common terms
Most Common Words
Analysis
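As a rough illustration of this step, here is a minimal sketch that counts the most common words per comedian using only the Python standard library. The transcripts, the comedian names, and the helpers `tokenize` and `top_words` are all made up for this example; the actual analysis may well use a document-term matrix instead.

```python
from collections import Counter
import re

# Hypothetical toy transcripts; real transcripts would be much longer.
transcripts = {
    "comedian_a": "I like my dog and my dog likes me like a friend",
    "comedian_b": "you know I like to say you know a lot you know",
}

def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())

def top_words(text, n=3):
    """Return the n most frequent words as (word, count) pairs."""
    return Counter(tokenize(text)).most_common(n)

# Top words for every comedian, keyed by name.
top = {name: top_words(text) for name, text in transcripts.items()}

for name, words in top.items():
    print(name, words)
```

Even on toy data like this, filler words such as "like" and "know" float to the top, which is the kind of thing to watch for in the real output.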
NOTE: At this point, we could go on and create word clouds. However, by looking at these top words, you can see that some of them have very little meaning and could be added to a stop words list, so let's do just that.
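One way to pick those extra stop words, sketched here under some assumptions: count how many comedians have each word among their top words, and treat a word as uninformative if most of them share it. The top-word lists, the "more than half" threshold, and the tiny `base_stop_words` set are all hypothetical; in practice the base list would come from an NLP library.

```python
from collections import Counter

# Hypothetical top-word lists per comedian (counts omitted for brevity).
top_words_by_comedian = {
    "comedian_a": ["like", "know", "husband", "just"],
    "comedian_b": ["like", "know", "dad", "just"],
    "comedian_c": ["like", "guys", "know", "joke"],
}

# Count in how many comedians' top lists each word appears.
appearances = Counter(
    word for words in top_words_by_comedian.values() for word in set(words)
)

# Assumed threshold: a word shared by more than half of the comedians
# tells us little about any one of them.
threshold = len(top_words_by_comedian) / 2
extra_stop_words = sorted(w for w, c in appearances.items() if c > threshold)

# Tiny illustrative base list, extended with the words found above.
base_stop_words = {"the", "a", "and"}
stop_words = base_stop_words | set(extra_stop_words)

print(extra_stop_words)
```

The threshold is a judgment call. Set it too low and you start throwing away words that actually distinguish one comedian from another.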
Findings
Ali Wong says the s-word a lot and talks about her husband. I guess that's funny to me.
A lot of people use the F-word. Let's dig into that later.
Number of Words
Analysis
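Here is a minimal sketch of the two measures for this section: vocabulary size as the number of unique words, and talking speed as total words divided by run time. The transcripts and the run times in minutes are invented for illustration, and `tokenize` is a hypothetical helper.

```python
import re

# Hypothetical transcripts and run times in minutes (made-up values).
transcripts = {
    "comedian_a": "the cat sat on the mat and the cat slept",
    "comedian_b": "one two three four five six",
}
run_times = {"comedian_a": 2, "comedian_b": 3}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

stats = {}
for name, text in transcripts.items():
    words = tokenize(text)
    stats[name] = {
        "unique_words": len(set(words)),   # size of vocabulary
        "total_words": len(words),         # length of the routine in words
        "words_per_minute": len(words) / run_times[name],  # talking speed
    }

for name, row in stats.items():
    print(name, row)
```

Keep in mind that the unique word count depends on how the text was cleaned. If stop words were removed earlier, both the vocabulary size and the words-per-minute figure will come out lower than what was actually spoken.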
Findings
Vocabulary
Ricky Gervais (British comedy) and Bill Burr (podcast host) use a lot of words in their comedy
Louis C.K. (self-deprecating comedy) and Anthony Jeselnik (dark humor) have a smaller vocabulary
Talking Speed
Joe Rogan (blue comedy) and Bill Burr (podcast host) talk fast
Bo Burnham (musical comedy) and Anthony Jeselnik (dark humor) talk slow
Ali Wong is somewhere in the middle in both cases. Nothing too interesting here.
Amount of Profanity
Analysis
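A minimal sketch of counting specific terms per comedian and turning them into a per-minute rate. The words "darn" and "heck" are stand-ins for whatever terms you want to track, and the transcripts, run times, and the helper `term_counts` are hypothetical.

```python
from collections import Counter
import re

# Stand-in terms; swap in the actual words you want to track.
tracked_terms = ["darn", "heck"]

# Hypothetical transcripts and run times in minutes (made-up values).
transcripts = {
    "comedian_a": "darn it, I said darn and heck again",
    "comedian_b": "what a lovely day, heck yes",
}
run_times = {"comedian_a": 2, "comedian_b": 1}

def term_counts(text, terms):
    """Count exact, case-insensitive occurrences of each tracked term."""
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    return {term: counts[term] for term in terms}

# Raw counts per comedian, then the combined rate per minute.
profanity = {
    name: term_counts(text, tracked_terms) for name, text in transcripts.items()
}
per_minute = {
    name: sum(profanity[name].values()) / run_times[name] for name in transcripts
}

print(profanity)
print(per_minute)
```

This only matches exact words, so variants like "darned" would be missed unless you add them to the list or stem the text first.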
Findings
Averaging 2 F-Bombs Per Minute! - I don't like too much swearing, especially the f-word, which is probably why I've never heard of Bill Burr, Joe Rogan and Jim Jefferies.
Clean Humor - It looks like profanity might be a good predictor of the type of comedy I like. Besides Ali Wong, my two other favorite comedians in this group are John Mulaney and Mike Birbiglia.
Side Note
What was our goal for the EDA portion of our journey? To be able to take an initial look at our data and see if the results of some basic analysis made sense.
My conclusion - yes, it does, for a first pass. There are definitely some things that could be better cleaned up, such as adding more stop words or including bi-grams. But we can save that for another day. The results, especially the profanity findings, are interesting and make general sense, so we're going to move on.
As a reminder, the data science process is an iterative one. It's better to see some non-perfect but acceptable results to help you quickly decide whether your project is a dud or not, instead of having analysis paralysis and never delivering anything.
Alice's data science (and life) motto: Let go of perfectionism!
Additional Exercises
What other word counts do you think would be interesting to compare instead of the f-word and s-word? Create a scatter plot comparing them.