GitHub Repository: YStrano/DataScience_GA
Path: blob/master/lessons/lesson_15/2016_primary_speeches.ipynb
Kernel: Python [conda env:ga]
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

Explore the Capitol Words Dataset

This dataset comprises roughly 11,000 speeches made in Congress by Representatives and Senators who threw their hats into the ring in the 2016 primary. Note that the dataset goes back quite far (as early as 1996 for Bernie Sanders).

df = pd.read_csv('capitol_words.csv', encoding='utf-8', sep='|', parse_dates=['date'])
df.head()
df.shape
(11376, 8)
df.dtypes
Unnamed: 0                int64
chamber                  object
congress                  int64
date             datetime64[ns]
speaker_name             object
speaker_party            object
text                     object
title                    object
dtype: object

Who were the speakers and how many speeches did they make in the dataset?

Notice that one speech is incorrectly coded as 'Joe Biden' rather than 'Joseph Biden'.

df['speaker_name'].value_counts()
Bernie Sanders     2241
Joseph Biden       1854
Rick Santorum      1613
Mike Pence         1238
Lindsey Graham     1158
Hillary Clinton     830
Rand Paul           455
Barack Obama        411
Jim Webb            381
Ted Cruz            365
Marco Rubio         359
John Kasich         316
Lincoln Chafee      154
Joe Biden             1
Name: speaker_name, dtype: int64
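
The miscoded 'Joe Biden' row could be folded into 'Joseph Biden' before any per-speaker analysis. The following is a minimal sketch, not part of the original notebook: the helper name `normalize_speaker_names` and the small `toy` frame are hypothetical stand-ins, and the same call would be applied to `df` in practice.

```python
import pandas as pd

def normalize_speaker_names(frame, aliases):
    """Return a copy of `frame` with alias spellings in `speaker_name` replaced.

    `aliases` maps a variant spelling to the canonical one (hypothetical helper).
    """
    fixed = frame.copy()
    fixed['speaker_name'] = fixed['speaker_name'].replace(aliases)
    return fixed

# Tiny stand-in for df, used only to illustrate the call.
toy = pd.DataFrame({'speaker_name': ['Joseph Biden', 'Joe Biden', 'Bernie Sanders']})

cleaned = normalize_speaker_names(toy, {'Joe Biden': 'Joseph Biden'})
counts = cleaned['speaker_name'].value_counts()
print(counts)
```

On the real data this would be `df = normalize_speaker_names(df, {'Joe Biden': 'Joseph Biden'})`, after which `value_counts()` should no longer list 'Joe Biden' separately.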

Let's look at one speech: the first in the dataset

df.loc[0, 'text']
u'Mr. Speaker, 480,000 Federal employees are working without pay, a form of involuntary servitude; 280,000 Federal employees are not working, and they will be paid. Virtually all of these workers have mortgages to pay, children to feed, and financial obligations to meet.\r\nMr. Speaker, what is happening to these workers is immoral, is wrong, and must be rectified immediately. Newt Gingrich and the Republican leadership must not continue to hold the House and the American people hostage while they push their disastrous 7-year balanced budget plan. The gentleman from Georgia, Mr. Gingrich, and the Republican leadership must join Senator Dole and the entire Senate and pass a continuing resolution now, now to reopen Government.\r\nMr. Speaker, that is what the American people want, that is what they need, and that is what this body must do.'

Your Task

Choose some way to analyze this dataset using LDA.

Options include:

  • How topics (or key words in topics) change over time for one speaker,

  • How topics compare across speakers,

  • How topics compare for House vs. Senate, Republican vs. Democrat, etc.,

  • Whether the topics that arise from just the titles are interesting,

  • Or any other interesting idea you have.
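
As one possible starting point for the "compare across speakers" option, the sketch below fits LDA on word counts and averages the per-speech topic weights by speaker. It is an illustration under assumptions, not the notebook's solution: the helper names `fit_lda` and `top_words`, the toy corpus, and the choice of two topics are all hypothetical, and `n_components` assumes a reasonably recent scikit-learn (older releases called this parameter `n_topics`).

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def fit_lda(texts, n_topics=2, random_state=0):
    """Vectorize `texts` as word counts and fit LDA (hypothetical helper)."""
    vectorizer = CountVectorizer(stop_words='english')
    counts = vectorizer.fit_transform(texts)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=random_state)
    doc_topics = lda.fit_transform(counts)  # one row of topic weights per document
    return vectorizer, lda, doc_topics

def top_words(vectorizer, lda, n_words=5):
    """List the highest-weighted words for each topic (hypothetical helper)."""
    # The vocabulary accessor was renamed in newer scikit-learn releases.
    if hasattr(vectorizer, 'get_feature_names_out'):
        vocab = np.asarray(vectorizer.get_feature_names_out())
    else:
        vocab = np.asarray(vectorizer.get_feature_names())
    return [vocab[np.argsort(row)[::-1][:n_words]].tolist() for row in lda.components_]

# Toy stand-in for df[['speaker_name', 'text']]; the real speeches are far longer.
toy = pd.DataFrame({
    'speaker_name': ['A', 'A', 'B', 'B'],
    'text': [
        'budget deficit spending taxes budget',
        'taxes spending budget government shutdown',
        'health care insurance veterans hospitals',
        'veterans health care hospitals funding',
    ],
})

vectorizer, lda, doc_topics = fit_lda(toy['text'])

# Average topic weight per speaker.
topic_cols = ['topic_%d' % i for i in range(doc_topics.shape[1])]
topic_df = pd.DataFrame(doc_topics, columns=topic_cols)
topic_df['speaker_name'] = toy['speaker_name'].values
by_speaker = topic_df.groupby('speaker_name').mean()

print(top_words(vectorizer, lda, n_words=3))
print(by_speaker)
```

The same pattern could be pointed at `df['text']` (or `df['title']`) with a larger number of topics, and grouping by `chamber`, `speaker_party`, or the year of `date` instead of `speaker_name` would cover the other options in the list.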