As with NumPy, there's a fairly standard import statement that gets used for Pandas. Our textbook author also mentions importing DataFrame and Series specifically, since these are used so frequently.
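That standard import, with the two extra names, looks like this:

```python
import pandas as pd
from pandas import DataFrame, Series

# pd is the conventional alias; DataFrame and Series are imported
# directly because they come up constantly
print(pd.__version__)
```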
Getting the FIPS data
In the NumPy version of this exercise, we manually downloaded and parsed a file that contained our FIPS code database. Pandas provides some pre-built parsing capabilities and is able to parse information from:

- CSV (comma-separated values) files
- tab-delimited files
- JSON
- HTML tables
- data stored in a SQL database
The USDA has a page that contains FIPS codes. First, let's try plain pd.read_html to see what it does.
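Here's a minimal sketch of the idea, using a small inline HTML document in place of the live USDA page (the real call would take the page's URL):

```python
import pandas as pd
from io import StringIO

# Two tables, standing in for a page full of them
html = """
<table><tr><th>Nav</th></tr><tr><td>layout table</td></tr></table>
<table>
  <tr><th>Name</th><th>Postal Code</th><th>FIPS</th></tr>
  <tr><td>Alabama</td><td>AL</td><td>1</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per <table> it can parse
tables = pd.read_html(StringIO(html))
print(len(tables))  # every table on the page comes back
```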
Yikes, that's a lot of tables! (For those of you who have some HTML experience, this is another reason why table-based page layout is not a very good idea.)
Let's refine our search to look for one that contains the text "FIPS".
That's a little better, but not great. Here's a version where you identify the table by an attribute it has in the HTML. This requires a bit of HTML knowledge, but it works best. It would be even better if our table had an id attribute, but it doesn't.
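Both refinements, sketched against the same kind of inline HTML (the class="data" attribute is made up for this example; the real page's attribute will differ):

```python
import pandas as pd
from io import StringIO

html = """
<table><tr><th>Nav</th></tr><tr><td>layout table</td></tr></table>
<table class="data">
  <tr><th>Name</th><th>Postal Code</th><th>FIPS</th></tr>
  <tr><td>Alabama</td><td>AL</td><td>1</td></tr>
</table>
"""

# match= keeps only tables whose text matches the given string/regex
by_text = pd.read_html(StringIO(html), match="FIPS")

# attrs= keeps only tables whose HTML attributes match
by_attr = pd.read_html(StringIO(html), attrs={"class": "data"})

print(len(by_text), len(by_attr))
```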
OK, now that we've got it down to one table, we can pick it out and see what it looks like.
| | Name | Postal Code | FIPS |
---|---|---|---|
0 | Alabama | AL | 1 |
1 | Alaska | AK | 2 |
2 | Arizona | AZ | 4 |
3 | Arkansas | AR | 5 |
4 | California | CA | 6 |
And look at the data at the end (if adjustments are going to be needed, they usually occur at the beginning or the end of the table).
| | Name | Postal Code | FIPS |
---|---|---|---|
45 | Virginia | VA | 51 |
46 | Washington | WA | 53 |
47 | West Virginia | WV | 54 |
48 | Wisconsin | WI | 55 |
49 | Wyoming | WY | 56 |
50 | American Samoa | AS | 60 |
51 | Guam | GU | 66 |
52 | Northern Mariana Islands | MP | 69 |
53 | Puerto Rico | PR | 72 |
54 | Virgin Islands | VI | 78 |
Let's get rid of the last 5 rows, since they correspond to territories and not states. We can apply the same type of array slicing syntax that worked in NumPy.
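A sketch with a small stand-in frame (two states and two territories, so we trim two rows instead of five):

```python
import pandas as pd

# Stand-in for the tail of the parsed table: two states, two territories
fips = pd.DataFrame({"Name": ["Wisconsin", "Wyoming", "Guam", "Puerto Rico"],
                     "Postal Code": ["WI", "WY", "GU", "PR"],
                     "FIPS": [55, 56, 66, 72]})

# NumPy-style slicing works on a DataFrame too; the notebook drops
# the last 5 rows, here we drop the 2 territory rows
states = fips[:-2]
print(len(states))  # 2
```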
| | Name | Postal Code | FIPS |
---|---|---|---|
45 | Virginia | VA | 51 |
46 | Washington | WA | 53 |
47 | West Virginia | WV | 54 |
48 | Wisconsin | WI | 55 |
49 | Wyoming | WY | 56 |
Now, since the Census site is going to use strings with leading zeros to identify a state, let's transform the FIPS column. First, pull out the desired column of the DataFrame, assign it to a variable, and print out its type.
Now use a list comprehension to generate a list containing the string versions of the FIPS codes
Pandas gives us a simpler way to convert data types, however, using the astype method on a Series object.
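All three steps, sketched on a small stand-in frame:

```python
import pandas as pd

fips = pd.DataFrame({"Name": ["Alabama", "Alaska", "Arizona"],
                     "Postal Code": ["AL", "AK", "AZ"],
                     "FIPS": [1, 2, 4]})

# Pulling out one column gives a Series
col = fips["FIPS"]
print(type(col))

# A list comprehension works, but yields a plain Python list...
as_list = [str(v) for v in col]

# ...while astype returns a new Series with the converted values
as_series = col.astype(str)
print(as_list, list(as_series))
```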
But, we want these to have a leading zero for values under 10. Let's write a function to do this for us.
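One way to write it, using printf-style formatting:

```python
# Pad single-digit FIPS codes to two characters with a leading zero
def convert(code):
    return "%02d" % code   # f"{code:02d}" works too

print(convert(1), convert(56))  # 01 56
```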
Try it with the first and last values in fips.
Note that the following call doesn't work. Why not?
The Series and DataFrame objects provide a separate method named iloc to work positionally, rather than with index values.
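A sketch of the difference: with an integer index, [] looks values up by index label, so fips[-1] raises a KeyError (there is no label -1), while iloc is purely positional:

```python
import pandas as pd

def convert(code):
    return "%02d" % code

# Stand-in for the real FIPS column
fips = pd.Series([1, 2, 4, 56], name="FIPS")

# fips[0] works because 0 happens to be an index label,
# but fips[-1] raises KeyError because -1 is not a label.
# iloc indexes by position, so negative indexing behaves like a list's
print(convert(fips.iloc[0]), convert(fips.iloc[-1]))  # 01 56
```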
Now we can use the map method along with convert to get new values for each of the values in the "FIPS" column. map is a method of a Series that returns a new Series based on calling a function on each value of the original Series.
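Putting map and convert together on a stand-in frame:

```python
import pandas as pd

def convert(code):
    return "%02d" % code

fips = pd.DataFrame({"Name": ["Alabama", "Arizona"],
                     "Postal Code": ["AL", "AZ"],
                     "FIPS": [1, 4]})

# map calls convert on every value and returns a new Series,
# which we assign back over the original column
fips["FIPS"] = fips["FIPS"].map(convert)
print(list(fips["FIPS"]))  # ['01', '04']
```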
| | Name | Postal Code | FIPS |
---|---|---|---|
0 | Alabama | AL | 01 |
1 | Alaska | AK | 02 |
2 | Arizona | AZ | 04 |
3 | Arkansas | AR | 05 |
4 | California | CA | 06 |
And now we can replace the index with the FIPS values and get rid of the FIPS column. First, let's remind ourselves of the structure of our original DataFrame.
| | Name | Postal Code | FIPS |
---|---|---|---|
0 | Alabama | AL | 01 |
1 | Alaska | AK | 02 |
Now create a new DataFrame named fips_new with the converted Series serving as the index.
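A sketch of that construction (which is about to disappoint us):

```python
import pandas as pd

# Stand-in for the parsed table, FIPS already converted to strings
fips = pd.DataFrame({"Name": ["Alabama", "Alaska"],
                     "Postal Code": ["AL", "AK"],
                     "FIPS": ["01", "02"]})

# Build a new frame with the FIPS strings as the index.
# Pandas aligns each Series by *label*: the old columns are labeled
# 0..n-1, not "01"/"02", so nothing matches and every cell is NaN
fips_new = pd.DataFrame({"Name": fips["Name"],
                         "Postal Code": fips["Postal Code"]},
                        index=pd.Index(fips["FIPS"], name="FIPS"))
print(fips_new)
```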
| | Name | Postal Code |
---|---|---|
FIPS | ||
01 | NaN | NaN |
02 | NaN | NaN |
04 | NaN | NaN |
05 | NaN | NaN |
06 | NaN | NaN |
Ugh, that doesn't seem to have worked very well! When combining Series objects like this, Pandas always wants to align things that have the same index. Since our new DataFrame uses a completely different index than the original Series objects, this approach doesn't work.
Here's one approach we can try: using the values property of a Series object to get a more "raw" form of the data. What might the type of fips["Name"].values be?
Let's try to create our new data frame using the values from the Series, so that we're not trying to match indexes.
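The values-based construction, sketched:

```python
import pandas as pd

fips = pd.DataFrame({"Name": ["Alabama", "Alaska"],
                     "Postal Code": ["AL", "AK"],
                     "FIPS": ["01", "02"]})

# .values is a plain NumPy ndarray with no index attached,
# so no label alignment can happen
fips_new = pd.DataFrame({"Name": fips["Name"].values,
                         "Postal Code": fips["Postal Code"].values},
                        index=pd.Index(fips["FIPS"].values, name="FIPS"))
print(fips_new.loc["01", "Name"])  # Alabama
```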
| | Name | Postal Code |
---|---|---|
FIPS | ||
01 | Alabama | AL |
02 | Alaska | AK |
04 | Arizona | AZ |
05 | Arkansas | AR |
06 | California | CA |
Use the info method to see some details about the fips_new DataFrame. Then take a look at the index associated with fips_new.
Now we can get information about Michigan like this:
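For example, with a two-row stand-in frame:

```python
import pandas as pd

fips_new = pd.DataFrame({"Name": ["Michigan", "Ohio"],
                         "Postal Code": ["MI", "OH"]},
                        index=pd.Index(["26", "39"], name="FIPS"))

# loc looks rows up by index label; Michigan's FIPS code is "26"
michigan = fips_new.loc["26"]
print(michigan["Name"])  # Michigan
```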
Other ways to create the new DataFrame
First, we can use the set_index method, which uses an existing column as the index for the DataFrame.
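A sketch; drop=False keeps the FIPS column as a regular column too (the default drops it):

```python
import pandas as pd

fips = pd.DataFrame({"Name": ["Alabama", "Alaska"],
                     "Postal Code": ["AL", "AK"],
                     "FIPS": ["01", "02"]})

# set_index promotes the FIPS column to be the index
fips_new = fips.set_index("FIPS", drop=False)
print(fips_new.index.name, list(fips_new.columns))
```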
| | Name | Postal Code | FIPS |
---|---|---|---|
0 | Alabama | AL | 01 |
1 | Alaska | AK | 02 |
| | Name | Postal Code | FIPS |
---|---|---|---|
FIPS | |||
01 | Alabama | AL | 01 |
02 | Alaska | AK | 02 |
04 | Arizona | AZ | 04 |
05 | Arkansas | AR | 05 |
06 | California | CA | 06 |
It kind of looks like there's an index value FIPS. Let's see if there is.
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/usr/local/lib/python3.8/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3620 try:
-> 3621 return self._engine.get_loc(casted_key)
3622 except KeyError as err:
/usr/local/lib/python3.8/dist-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
/usr/local/lib/python3.8/dist-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'FIPS'
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
/tmp/ipykernel_428/3242763737.py in <cell line: 1>()
----> 1 fips_new.loc['FIPS']
/usr/local/lib/python3.8/dist-packages/pandas/core/indexing.py in __getitem__(self, key)
965
966 maybe_callable = com.apply_if_callable(key, self.obj)
--> 967 return self._getitem_axis(maybe_callable, axis=axis)
968
969 def _is_scalar_access(self, key: tuple):
/usr/local/lib/python3.8/dist-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
1200 # fall thru to straight lookup
1201 self._validate_key(key, axis)
-> 1202 return self._get_label(key, axis=axis)
1203
1204 def _get_slice_axis(self, slice_obj: slice, axis: int):
/usr/local/lib/python3.8/dist-packages/pandas/core/indexing.py in _get_label(self, label, axis)
1151 def _get_label(self, label, axis: int):
1152 # GH#5667 this will fail if the label is not present in the axis.
-> 1153 return self.obj.xs(label, axis=axis)
1154
1155 def _handle_lowerdim_multi_index_axis0(self, tup: tuple):
/usr/local/lib/python3.8/dist-packages/pandas/core/generic.py in xs(self, key, axis, level, drop_level)
3874 new_index = index[loc]
3875 else:
-> 3876 loc = index.get_loc(key)
3877
3878 if isinstance(loc, np.ndarray):
/usr/local/lib/python3.8/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3621 return self._engine.get_loc(casted_key)
3622 except KeyError as err:
-> 3623 raise KeyError(key) from err
3624 except TypeError:
3625 # If we have a listlike key, _check_indexing_error will raise
KeyError: 'FIPS'
So the FIPS that appears in the table display is just Pandas telling us the name of the index, not an actual index value.
| | Name | Postal Code | FIPS |
---|---|---|---|
FIPS | |||
01 | Alabama | AL | 01 |
02 | Alaska | AK | 02 |
Now let's get rid of the FIPS column (which could have been done when we called set_index).
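One way to do it, on a stand-in frame:

```python
import pandas as pd

fips_new = pd.DataFrame({"Name": ["Alabama", "Alaska"],
                         "Postal Code": ["AL", "AK"],
                         "FIPS": ["01", "02"]}).set_index("FIPS", drop=False)

# del removes the column in place; fips_new.drop(columns="FIPS")
# is the non-mutating alternative
del fips_new["FIPS"]
print(list(fips_new.columns))  # ['Name', 'Postal Code']
```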
| | Name | Postal Code |
---|---|---|
FIPS | ||
01 | Alabama | AL |
02 | Alaska | AK |
Adding data to the DataFrame object
Now we'll set the region for each of the Midwest states. First, set the region to 0 for all the states.
| | Name | Postal Code | Region |
---|---|---|---|
FIPS | |||
55 | Wisconsin | WI | 0 |
56 | Wyoming | WY | 0 |
Now set the region to 1 for Michigan and Ohio, on the assumption that we already know the FIPS codes are '26' and '39'.
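Both assignments, sketched on a three-state stand-in:

```python
import pandas as pd

fips_new = pd.DataFrame({"Name": ["Michigan", "Ohio", "Wyoming"],
                         "Postal Code": ["MI", "OH", "WY"]},
                        index=pd.Index(["26", "39", "56"], name="FIPS"))

# Assigning a scalar to a new column broadcasts it to every row
fips_new["Region"] = 0

# loc with a list of index labels plus a column label assigns
# just those cells
fips_new.loc[["26", "39"], "Region"] = 1
print(list(fips_new["Region"]))  # [1, 1, 0]
```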
| | Name | Postal Code | Region |
---|---|---|---|
FIPS | |||
26 | Michigan | MI | 1 |
39 | Ohio | OH | 1 |
What if we don't know what the codes are? We can first get the index values for the rows using boolean indexing.
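Building the boolean mask, on a stand-in frame:

```python
import pandas as pd

fips_new = pd.DataFrame({"Name": ["Illinois", "Michigan", "Wyoming"],
                         "Postal Code": ["IL", "MI", "WY"],
                         "Region": [0, 0, 0]},
                        index=pd.Index(["17", "26", "56"], name="FIPS"))

# isin compares a column against a collection, yielding a boolean Series
midwest = fips_new["Postal Code"].isin(["IL", "MI"])
print(list(midwest))  # [True, True, False]
```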
Let's see the rows in fips_new matched by this boolean array.
| | Name | Postal Code | Region |
---|---|---|---|
FIPS | |||
17 | Illinois | IL | 0 |
18 | Indiana | IN | 0 |
And now we can set the Region column.
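For example:

```python
import pandas as pd

fips_new = pd.DataFrame({"Name": ["Illinois", "Wyoming"],
                         "Postal Code": ["IL", "WY"],
                         "Region": [0, 0]},
                        index=pd.Index(["17", "56"], name="FIPS"))

midwest = fips_new["Postal Code"].isin(["IL"])

# loc accepts a boolean mask for the rows and a label for the column
fips_new.loc[midwest, "Region"] = 1
print(list(fips_new["Region"]))  # [1, 0]
```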
Here's an alternative way.
First, get the index associated with those rows.
And now we are basically able to do what we did for Michigan and Ohio when we knew those index values in advance.
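The index-based alternative, sketched:

```python
import pandas as pd

fips_new = pd.DataFrame({"Name": ["Illinois", "Wyoming"],
                         "Postal Code": ["IL", "WY"],
                         "Region": [0, 0]},
                        index=pd.Index(["17", "56"], name="FIPS"))

# Boolean-index the frame, then read off .index: the labels of the
# matching rows
idx = fips_new[fips_new["Postal Code"].isin(["IL"])].index
print(list(idx))  # ['17']

# The same assignment we used when we knew the codes in advance
fips_new.loc[idx, "Region"] = 1
```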
| | Name | Postal Code | Region |
---|---|---|---|
FIPS | |||
17 | Illinois | IL | 1 |
18 | Indiana | IN | 1 |
26 | Michigan | MI | 1 |
39 | Ohio | OH | 1 |
Here's a way to do it all in one step. Not necessarily recommended, because it's kind of hard to figure out exactly what it's doing, but it works!
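The one-step version, sketched with a Midwest postal-code list:

```python
import pandas as pd

fips_new = pd.DataFrame({"Name": ["Illinois", "Michigan", "Wyoming"],
                         "Postal Code": ["IL", "MI", "WY"],
                         "Region": [0, 0, 0]},
                        index=pd.Index(["17", "26", "56"], name="FIPS"))

midwest = ["IL", "IN", "MI", "MN", "OH", "WI"]

# Build the mask, select rows and column, and assign -- all in one line
fips_new.loc[fips_new["Postal Code"].isin(midwest), "Region"] = 1
print(list(fips_new["Region"]))  # [1, 1, 0]
```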
| | Name | Postal Code | Region |
---|---|---|---|
FIPS | |||
17 | Illinois | IL | 1 |
18 | Indiana | IN | 1 |
26 | Michigan | MI | 1 |
27 | Minnesota | MN | 1 |
39 | Ohio | OH | 1 |
55 | Wisconsin | WI | 1 |
| | Name | Postal Code |
---|---|---|
FIPS | ||
26 | Michigan | MI |
39 | Ohio | OH |
Getting the population data
The cell below contains the function written previously to get the census population data from api.census.gov. Let's modify this function to return a Pandas DataFrame instead of a NumPy ndarray, using the FIPS code as the index.
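Since the original function isn't reproduced here, this is only a sketch: get_rows is a hypothetical stand-in for the api.census.gov request, and the "POP"/"state" column names are assumptions for this example. The API returns a header row followed by data rows.

```python
import pandas as pd

# Hypothetical stand-in for the network request to api.census.gov
def get_rows(year):
    return [["POP", "state"], ["4858979", "01"], ["9922576", "26"]]

def get_population(year):
    rows = get_rows(year)
    header, data = rows[0], rows[1:]
    df = pd.DataFrame(data, columns=header)
    df["POP"] = df["POP"].astype(int)   # counts arrive as strings
    return df.set_index("state")        # FIPS code as the index

print(get_population(2015))
```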
Here's one way for us to generate all of the years of data. Pandas recommends against doing it this way, though, because of the costs of memory reallocation on each call.
Here's a more efficient way: create individual data frames for each year, then call concat to combine them.
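A sketch of the pattern, with small stand-in frames in place of the real per-year census results:

```python
import pandas as pd

frames = []
for year in (2014, 2015):
    df = pd.DataFrame({"POP": [4858979, 9922576]},
                      index=pd.Index(["01", "26"], name="state"))
    df["YEAR"] = year
    frames.append(df)

# One concat at the end, instead of growing a frame row-by-row:
# each incremental append would copy all the existing data
combined = pd.concat(frames)
print(len(combined))  # 4
```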
Now let's get the total population in the Midwest for 2015.
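A sketch with made-up populations (the real numbers come from the census data above):

```python
import pandas as pd

# Hypothetical 2015 populations keyed by FIPS code, for illustration
pop2015 = pd.DataFrame({"POP": [12859995, 9922576, 584153]},
                       index=pd.Index(["17", "26", "56"], name="state"))

midwest_fips = ["17", "26"]  # just Illinois and Michigan, for brevity

# Select the Midwest rows by index label, then sum the column
total = pop2015.loc[midwest_fips, "POP"].sum()
print(total)
```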