CoCalc -- ExampleManipulatingDataFrames.ipynb

GitHub Repository: ycchen00/Introduction-to-Data-Science-in-Python
Path: blob/main/resources/week-2/ExampleManipulatingDataFrames.ipynb
³²²³ views

Kernel: Python 3

In this lecture I'm going to walk through a basic data cleaning process with you and introduce you to a few more pandas API functions.

In [1]:

# Let's start by bringing in pandas
import pandas as pd
# And load our dataset. We're going to be cleaning the list of presidents in the US from wikipedia
df=pd.read_csv("datasets/presidents.csv")
# And lets just take a look at some of the data
df.head()

Out[1]:

In [2]:

# Ok, we have some presidents, some dates, I see a bunch of footnotes in the "Born" column which might cause
# issues. Let's start with cleaning up that name into firstname and lastname. I'm going to tackle this with
# a regex. So I want to create two new columns and apply a regex to the projection of the "President" column.

# Here's one solution, we could make a copy of the President column
df["First"]=df['President']
# Then we can call replace() and just have a pattern that matches the last name and set it to an empty string
df["First"]=df["First"].replace("[ ].*", "", regex=True)
# Now let's take a look
df.head()

Out[2]:

In [3]:

# That works, but it's kind of gross. And it's slow, since we had to make a full copy of a column then go
# through and update strings. There are a few other ways we can deal with this. Let me show you the most 
# general one first, and that's called the apply() function. Let's drop the column we made first
del(df["First"])

# The apply() function on a dataframe will take some arbitrary function you have written and apply it to
# either a Series (a single column) or DataFrame across all rows or columns. Lets write a function which
# just splits a string into two pieces using a single row of data
def splitname(row):
    # The row is a single Series object which is a single row indexed by column values
    # Let's extract the firstname and create a new entry in the series
    row['First']=row['President'].split(" ")[0]
    # Let's do the same with the last word in the string
    row['Last']=row['President'].split(" ")[-1]
    # Now we just return the row and the pandas .apply() will take of merging them back into a DataFrame
    return row

# Now if we apply this to the dataframe indicating we want to apply it across columns
df=df.apply(splitname, axis='columns')
df.head()

Out[3]:

In [4]:

# Pretty questionable as to whether that is less gross, but it achieves the result and I find that I use the
# apply() function regularly in my work. The pandas series has a couple of other nice convenience functions
# though, and the next I would like to touch on is called .extract(). Lets drop our firstname and lastname.
del(df['First'])
del(df['Last'])

# Extract takes a regular expression as input and specifically requires you to set capture groups that
# correspond to the output columns you are interested in. And, this is a great place for you to pause the
# video and reflect - if you were going to write a regular expression that returned groups and just had the
# firstname and lastname in it, what would that look like?

# Here's my solution, where we match three groups but only return two, the first and the last name
pattern="(^[\w]*)(?:.* )([\w]*$)"

# Now the extract function is built into the str attribute of the Series object, so we can call it
# using Series.str.extract(pattern)
df["President"].str.extract(pattern).head()

Out[4]:

In [5]:

# So that looks pretty nice, other than the column names. But if we name the groups we get named columns out
pattern="(?P<First>^[\w]*)(?:.* )(?P<Last>[\w]*$)"

# Now call extract
names=df["President"].str.extract(pattern).head()
names

Out[5]:

In [6]:

# And we can just copy these into our main dataframe if we want to
df["First"]=names["First"]
df["Last"]=names["Last"]
df.head()

Out[6]:

In [7]:

# It's worth looking at the pandas str module for other functions which have been written specifically
# to clean up strings in DataFrames, and you can find that in the docs in the Working with Text
# section: https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html

In [8]:

# Now lets move on to clean up that Born column. First, let's get rid of anything that isn't in the
# pattern of Month Day and Year.
df["Born"]=df["Born"].str.extract("([\w]{3} [\w]{1,2}, [\w]{4})")
df["Born"].head()

Out[8]:

  Feb 22, 1732
  Oct 30, 1735
  Apr 13, 1743
  Mar 16, 1751
  Apr 28, 1758
Name: Born, dtype: object

In [9]:

# So, that cleans up the date format. But I'm going to foreshadow something else here - the type of this
# column is object, and we know that's what pandas uses when it is dealing with string. But pandas actually
# has really interesting date/time features - in fact, that's one of the reasons Wes McKinney put his efforts
# into the library, to deal with financial transactions. So if I were building this out, I would actually
# update this column to the write data type as well
df["Born"]=pd.to_datetime(df["Born"])
df["Born"].head()

Out[9]:

 1732-02-22
 1735-10-30
 1743-04-13
 1751-03-16
 1758-04-28
Name: Born, dtype: datetime64[ns]

In [10]:

# This would make subsequent processing on the dataframe around dates, such as getting every President who
# was born in a given time span, much easier.

Now, most of the other columns in this dataset I would clean in a similar fashion. And this would be a good practice activity for you, so I would recommend that you pause the video, open up the notebook for the lecture if you don't already have it opened, and then finish cleaning up this dataframe. In this lecture I introduced you to the str module which has a number of important functions for cleaning pandas dataframes. You don't have to use these - I actually use apply() quite a bit myself, especially if I don't need high performance data cleaning because my dataset is small. But the str functions are incredibly useful and build on your existing knowledge of regular expressions, and because they are vectorized they are efficient to use as well.

Product

Resources

Company