As with NumPy, there's a fairly standard import statement that gets used for Pandas. Our textbook author also mentions importing DataFrame and Series specifically, since these are used so frequently.
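That standard import, with the two extra names, looks like this:

```python
import pandas as pd
from pandas import DataFrame, Series

# pd is the conventional alias; DataFrame and Series are imported
# directly because they come up constantly
print(pd.__version__)
```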
Getting the FIPS data
In the NumPy version of this exercise, we manually downloaded and parsed a file that contained our FIPS code database. Pandas provides some pre-built parsing capabilities and is able to parse information from:

- CSV (comma-separated values) files
- tab-delimited files
- JSON
- HTML tables
- data stored in a SQL database
The USDA has a page that contains FIPS codes. First, let's try plain pd.read_html to see what it does.
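Here's a minimal sketch of the idea, using a small inline HTML document in place of the live USDA page (the real call would take the page's URL):

```python
import pandas as pd
from io import StringIO

# Two tables, standing in for a page full of them
html = """
<table><tr><th>Nav</th></tr><tr><td>layout table</td></tr></table>
<table>
  <tr><th>Name</th><th>Postal Code</th><th>FIPS</th></tr>
  <tr><td>Alabama</td><td>AL</td><td>1</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per <table> it can parse
tables = pd.read_html(StringIO(html))
print(len(tables))  # every table on the page comes back
```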
Yikes, that's a lot of tables! (For those of you who have some HTML experience, this is another reason why table-based page layout is not a very good idea.)
Let's refine our search to look for one that contains the text "FIPS".
That's a little better, but not great. Here's a version where you identify the table by an attribute it has in the HTML. This requires a bit of HTML knowledge, but it works best. It would be even better if our table had an id attribute, but it doesn't.
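Both refinements, sketched against the same kind of inline HTML (the class="data" attribute is made up for this example; the real page's attribute will differ):

```python
import pandas as pd
from io import StringIO

html = """
<table><tr><th>Nav</th></tr><tr><td>layout table</td></tr></table>
<table class="data">
  <tr><th>Name</th><th>Postal Code</th><th>FIPS</th></tr>
  <tr><td>Alabama</td><td>AL</td><td>1</td></tr>
</table>
"""

# match= keeps only tables whose text matches the given string/regex
by_text = pd.read_html(StringIO(html), match="FIPS")

# attrs= keeps only tables whose HTML attributes match
by_attr = pd.read_html(StringIO(html), attrs={"class": "data"})

print(len(by_text), len(by_attr))
```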
OK, now that we've got it down to one table, we can pick it out and see what it looks like.
| | Name | Postal Code | FIPS |
---|---|---|---|
0 | Alabama | AL | 1 |
1 | Alaska | AK | 2 |
2 | Arizona | AZ | 4 |
3 | Arkansas | AR | 5 |
4 | California | CA | 6 |
And look at the data at the end (if adjustments are going to be needed, they usually occur at the beginning or the end of the table).
| | Name | Postal Code | FIPS |
---|---|---|---|
45 | Virginia | VA | 51 |
46 | Washington | WA | 53 |
47 | West Virginia | WV | 54 |
48 | Wisconsin | WI | 55 |
49 | Wyoming | WY | 56 |
50 | American Samoa | AS | 60 |
51 | Guam | GU | 66 |
52 | Northern Mariana Islands | MP | 69 |
53 | Puerto Rico | PR | 72 |
54 | Virgin Islands | VI | 78 |
Let's get rid of the last 5 rows, since they correspond to territories and not states. We can apply the same type of array slicing syntax that worked in NumPy.
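A sketch with a small stand-in frame (two states and two territories, so we trim two rows instead of five):

```python
import pandas as pd

# Stand-in for the tail of the parsed table: two states, two territories
fips = pd.DataFrame({"Name": ["Wisconsin", "Wyoming", "Guam", "Puerto Rico"],
                     "Postal Code": ["WI", "WY", "GU", "PR"],
                     "FIPS": [55, 56, 66, 72]})

# NumPy-style slicing works on a DataFrame too; the notebook drops
# the last 5 rows, here we drop the 2 territory rows
states = fips[:-2]
print(len(states))  # 2
```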
| | Name | Postal Code | FIPS |
---|---|---|---|
45 | Virginia | VA | 51 |
46 | Washington | WA | 53 |
47 | West Virginia | WV | 54 |
48 | Wisconsin | WI | 55 |
49 | Wyoming | WY | 56 |
Now, since the Census site is going to use strings with leading zeros to identify a state, let's transform the FIPS column. First, pull out the desired column of the DataFrame, assign it to a variable, and print out its type.
Now use a list comprehension to generate a list containing the string versions of the FIPS codes
Pandas gives us a simpler way to convert data types, however, using the astype method on a Series object.
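All three steps, sketched on a small stand-in frame:

```python
import pandas as pd

fips = pd.DataFrame({"Name": ["Alabama", "Alaska", "Arizona"],
                     "Postal Code": ["AL", "AK", "AZ"],
                     "FIPS": [1, 2, 4]})

# Pulling out one column gives a Series
col = fips["FIPS"]
print(type(col))

# A list comprehension works, but yields a plain Python list...
as_list = [str(v) for v in col]

# ...while astype returns a new Series with the converted values
as_series = col.astype(str)
print(as_list, list(as_series))
```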
But, we want these to have a leading zero for values under 10. Let's write a function to do this for us.
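One way to write it, using printf-style formatting:

```python
# Pad single-digit FIPS codes to two characters with a leading zero
def convert(code):
    return "%02d" % code   # f"{code:02d}" works too

print(convert(1), convert(56))  # 01 56
```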
Try it with the first and last values in fips.
Note that the following call doesn't work. Why not?
The Series and DataFrame objects provide a separate method named iloc to work positionally, rather than with index values.
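A sketch of the difference: with an integer index, [] looks values up by index label, so fips[-1] raises a KeyError (there is no label -1), while iloc is purely positional:

```python
import pandas as pd

def convert(code):
    return "%02d" % code

# Stand-in for the real FIPS column
fips = pd.Series([1, 2, 4, 56], name="FIPS")

# fips[0] works because 0 happens to be an index label,
# but fips[-1] raises KeyError because -1 is not a label.
# iloc indexes by position, so negative indexing behaves like a list's
print(convert(fips.iloc[0]), convert(fips.iloc[-1]))  # 01 56
```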
Now we can use the map method along with convert to get new values for each of the values in the "FIPS" column. map is a method of a Series that returns a new Series based on calling a function on each value of the original Series.
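Putting map and convert together on a stand-in frame:

```python
import pandas as pd

def convert(code):
    return "%02d" % code

fips = pd.DataFrame({"Name": ["Alabama", "Arizona"],
                     "Postal Code": ["AL", "AZ"],
                     "FIPS": [1, 4]})

# map calls convert on every value and returns a new Series,
# which we assign back over the original column
fips["FIPS"] = fips["FIPS"].map(convert)
print(list(fips["FIPS"]))  # ['01', '04']
```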
| | Name | Postal Code | FIPS |
---|---|---|---|
0 | Alabama | AL | 01 |
1 | Alaska | AK | 02 |
2 | Arizona | AZ | 04 |
3 | Arkansas | AR | 05 |
4 | California | CA | 06 |
And now we can replace the index with the FIPS values and get rid of the FIPS column. First, let's remind ourselves of the structure of our original DataFrame.
| | Name | Postal Code | FIPS |
---|---|---|---|
0 | Alabama | AL | 01 |
1 | Alaska | AK | 02 |
Now create a new DataFrame named fips_new with the converted Series serving as the index.
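A sketch of that construction (which is about to disappoint us):

```python
import pandas as pd

# Stand-in for the parsed table, FIPS already converted to strings
fips = pd.DataFrame({"Name": ["Alabama", "Alaska"],
                     "Postal Code": ["AL", "AK"],
                     "FIPS": ["01", "02"]})

# Build a new frame with the FIPS strings as the index.
# Pandas aligns each Series by *label*: the old columns are labeled
# 0..n-1, not "01"/"02", so nothing matches and every cell is NaN
fips_new = pd.DataFrame({"Name": fips["Name"],
                         "Postal Code": fips["Postal Code"]},
                        index=pd.Index(fips["FIPS"], name="FIPS"))
print(fips_new)
```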
| | Name | Postal Code |
---|---|---|
FIPS | ||
01 | NaN | NaN |
02 | NaN | NaN |
04 | NaN | NaN |
05 | NaN | NaN |
06 | NaN | NaN |
Ugh, that doesn't seem to have worked very well! When combining Series objects like this, Pandas always wants to align things that have the same index. Since our new DataFrame uses a completely different index than the original Series objects, this approach doesn't work.
Here's one approach we can try: using the values property of a Series object to get a more "raw" form of the data. What might the type of fips["Name"].values be?
Let's try to create our new data frame using the values from the Series, so that we're not trying to match indexes.
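The values-based construction, sketched:

```python
import pandas as pd

fips = pd.DataFrame({"Name": ["Alabama", "Alaska"],
                     "Postal Code": ["AL", "AK"],
                     "FIPS": ["01", "02"]})

# .values is a plain NumPy ndarray with no index attached,
# so no label alignment can happen
fips_new = pd.DataFrame({"Name": fips["Name"].values,
                         "Postal Code": fips["Postal Code"].values},
                        index=pd.Index(fips["FIPS"].values, name="FIPS"))
print(fips_new.loc["01", "Name"])  # Alabama
```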
| | Name | Postal Code |
---|---|---|
FIPS | ||
01 | Alabama | AL |
02 | Alaska | AK |
04 | Arizona | AZ |
05 | Arkansas | AR |
06 | California | CA |
Use the info method to see some details about the fips_new DataFrame. Then take a look at the index associated with fips_new.
Now we can get information about Michigan like this:
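For example, with a two-row stand-in frame:

```python
import pandas as pd

fips_new = pd.DataFrame({"Name": ["Michigan", "Ohio"],
                         "Postal Code": ["MI", "OH"]},
                        index=pd.Index(["26", "39"], name="FIPS"))

# loc looks rows up by index label; Michigan's FIPS code is "26"
michigan = fips_new.loc["26"]
print(michigan["Name"])  # Michigan
```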
Other ways to create the new DataFrame
First, we can use the set_index method, which uses an existing column as the index for the DataFrame.
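A sketch; drop=False keeps the FIPS column as a regular column too (the default drops it):

```python
import pandas as pd

fips = pd.DataFrame({"Name": ["Alabama", "Alaska"],
                     "Postal Code": ["AL", "AK"],
                     "FIPS": ["01", "02"]})

# set_index promotes the FIPS column to be the index
fips_new = fips.set_index("FIPS", drop=False)
print(fips_new.index.name, list(fips_new.columns))
```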
| | Name | Postal Code | FIPS |
---|---|---|---|
0 | Alabama | AL | 01 |
1 | Alaska | AK | 02 |
| | Name | Postal Code | FIPS |
---|---|---|---|
FIPS | |||
01 | Alabama | AL | 01 |
02 | Alaska | AK | 02 |
04 | Arizona | AZ | 04 |
05 | Arkansas | AR | 05 |
06 | California | CA | 06 |
It kind of looks like there's an index value FIPS. Let's see if there is.
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/usr/local/lib/python3.8/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3620 try:
-> 3621 return self._engine.get_loc(casted_key)
3622 except KeyError as err:
/usr/local/lib/python3.8/dist-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
/usr/local/lib/python3.8/dist-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'FIPS'
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
/tmp/ipykernel_428/3242763737.py in <cell line: 1>()
----> 1 fips_new.loc['FIPS']
/usr/local/lib/python3.8/dist-packages/pandas/core/indexing.py in __getitem__(self, key)
965
966 maybe_callable = com.apply_if_callable(key, self.obj)
--> 967 return self._getitem_axis(maybe_callable, axis=axis)
968
969 def _is_scalar_access(self, key: tuple):
/usr/local/lib/python3.8/dist-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
1200 # fall thru to straight lookup
1201 self._validate_key(key, axis)
-> 1202 return self._get_label(key, axis=axis)
1203
1204 def _get_slice_axis(self, slice_obj: slice, axis: int):
/usr/local/lib/python3.8/dist-packages/pandas/core/indexing.py in _get_label(self, label, axis)
1151 def _get_label(self, label, axis: int):
1152 # GH#5667 this will fail if the label is not present in the axis.
-> 1153 return self.obj.xs(label, axis=axis)
1154
1155 def _handle_lowerdim_multi_index_axis0(self, tup: tuple):
/usr/local/lib/python3.8/dist-packages/pandas/core/generic.py in xs(self, key, axis, level, drop_level)
3874 new_index = index[loc]
3875 else:
-> 3876 loc = index.get_loc(key)
3877
3878 if isinstance(loc, np.ndarray):
/usr/local/lib/python3.8/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3621 return self._engine.get_loc(casted_key)
3622 except KeyError as err:
-> 3623 raise KeyError(key) from err
3624 except TypeError:
3625 # If we have a listlike key, _check_indexing_error will raise
KeyError: 'FIPS'
So the FIPS that appears in the table display is just Pandas telling us the name of the index, not an actual index value.
| | Name | Postal Code | FIPS |
---|---|---|---|
FIPS | |||
01 | Alabama | AL | 01 |
02 | Alaska | AK | 02 |
Now let's get rid of the FIPS column (which could have been done when we called set_index).
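One way to do it, on a stand-in frame:

```python
import pandas as pd

fips_new = pd.DataFrame({"Name": ["Alabama", "Alaska"],
                         "Postal Code": ["AL", "AK"],
                         "FIPS": ["01", "02"]}).set_index("FIPS", drop=False)

# del removes the column in place; fips_new.drop(columns="FIPS")
# is the non-mutating alternative
del fips_new["FIPS"]
print(list(fips_new.columns))  # ['Name', 'Postal Code']
```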
| | Name | Postal Code |
---|---|---|
FIPS | ||
01 | Alabama | AL |
02 | Alaska | AK |
Adding data to the DataFrame object
Now we'll set the region for each of the Midwest states. First, set the region to 0 for all the states.
| | Name | Postal Code | Region |
---|---|---|---|
FIPS | |||
55 | Wisconsin | WI | 0 |
56 | Wyoming | WY | 0 |
Now set the region to 1 for Michigan and Ohio, on the assumption that we already know the FIPS codes are '26' and '39'.
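Both assignments, sketched on a three-state stand-in:

```python
import pandas as pd

fips_new = pd.DataFrame({"Name": ["Michigan", "Ohio", "Wyoming"],
                         "Postal Code": ["MI", "OH", "WY"]},
                        index=pd.Index(["26", "39", "56"], name="FIPS"))

# Assigning a scalar to a new column broadcasts it to every row
fips_new["Region"] = 0

# loc with a list of index labels plus a column label assigns
# just those cells
fips_new.loc[["26", "39"], "Region"] = 1
print(list(fips_new["Region"]))  # [1, 1, 0]
```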
| | Name | Postal Code | Region |
---|---|---|---|
FIPS | |||
26 | Michigan | MI | 1 |
39 | Ohio | OH | 1 |
What if we don't know what the codes are? We can first get the index values for the rows using boolean indexing.
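Building the boolean mask, on a stand-in frame:

```python
import pandas as pd

fips_new = pd.DataFrame({"Name": ["Illinois", "Michigan", "Wyoming"],
                         "Postal Code": ["IL", "MI", "WY"],
                         "Region": [0, 0, 0]},
                        index=pd.Index(["17", "26", "56"], name="FIPS"))

# isin compares a column against a collection, yielding a boolean Series
midwest = fips_new["Postal Code"].isin(["IL", "MI"])
print(list(midwest))  # [True, True, False]
```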
Let's see the rows in fips_new matched by this boolean array.
| | Name | Postal Code | Region |
---|---|---|---|
FIPS | |||
17 | Illinois | IL | 0 |
18 | Indiana | IN | 0 |
And now we can set the Region column.
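For example:

```python
import pandas as pd

fips_new = pd.DataFrame({"Name": ["Illinois", "Wyoming"],
                         "Postal Code": ["IL", "WY"],
                         "Region": [0, 0]},
                        index=pd.Index(["17", "56"], name="FIPS"))

midwest = fips_new["Postal Code"].isin(["IL"])

# loc accepts a boolean mask for the rows and a label for the column
fips_new.loc[midwest, "Region"] = 1
print(list(fips_new["Region"]))  # [1, 0]
```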
Here's an alternative way.
First, get the index associated with those rows.
And now we are basically able to do what we did for Michigan and Ohio when we knew those index values in advance.
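The index-based alternative, sketched:

```python
import pandas as pd

fips_new = pd.DataFrame({"Name": ["Illinois", "Wyoming"],
                         "Postal Code": ["IL", "WY"],
                         "Region": [0, 0]},
                        index=pd.Index(["17", "56"], name="FIPS"))

# Boolean-index the frame, then read off .index: the labels of the
# matching rows
idx = fips_new[fips_new["Postal Code"].isin(["IL"])].index
print(list(idx))  # ['17']

# The same assignment we used when we knew the codes in advance
fips_new.loc[idx, "Region"] = 1
```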
| | Name | Postal Code | Region |
---|---|---|---|
FIPS | |||
17 | Illinois | IL | 1 |
18 | Indiana | IN | 1 |
26 | Michigan | MI | 1 |
39 | Ohio | OH | 1 |
Here's a way to do it all in one step. Not necessarily recommended, because it's kind of hard to figure out exactly what it's doing, but it works!
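The one-step version, sketched with a Midwest postal-code list:

```python
import pandas as pd

fips_new = pd.DataFrame({"Name": ["Illinois", "Michigan", "Wyoming"],
                         "Postal Code": ["IL", "MI", "WY"],
                         "Region": [0, 0, 0]},
                        index=pd.Index(["17", "26", "56"], name="FIPS"))

midwest = ["IL", "IN", "MI", "MN", "OH", "WI"]

# Build the mask, select rows and column, and assign -- all in one line
fips_new.loc[fips_new["Postal Code"].isin(midwest), "Region"] = 1
print(list(fips_new["Region"]))  # [1, 1, 0]
```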
| | Name | Postal Code | Region |
---|---|---|---|
FIPS | |||
17 | Illinois | IL | 1 |
18 | Indiana | IN | 1 |
26 | Michigan | MI | 1 |
27 | Minnesota | MN | 1 |
39 | Ohio | OH | 1 |
55 | Wisconsin | WI | 1 |
| | Name | Postal Code |
---|---|---|
FIPS | ||
26 | Michigan | MI |
39 | Ohio | OH |
Getting the population data
The cell below contains the function written previously to get the census population data from api.census.gov. Let's modify this function to return a Pandas DataFrame instead of a NumPy ndarray, using the FIPS code as the index.
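Since the original function isn't reproduced here, this is only a sketch: get_rows is a hypothetical stand-in for the api.census.gov request, and the "POP"/"state" column names are assumptions for this example. The API returns a header row followed by data rows.

```python
import pandas as pd

# Hypothetical stand-in for the network request to api.census.gov
def get_rows(year):
    return [["POP", "state"], ["4858979", "01"], ["9922576", "26"]]

def get_population(year):
    rows = get_rows(year)
    header, data = rows[0], rows[1:]
    df = pd.DataFrame(data, columns=header)
    df["POP"] = df["POP"].astype(int)   # counts arrive as strings
    return df.set_index("state")        # FIPS code as the index

print(get_population(2015))
```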
Here's one way for us to generate all of the years of data. Pandas recommends against doing it this way, though, because of the costs of memory reallocation on each call.
Here's a more efficient way: create individual data frames for each year, then call concat to combine them.
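A sketch of the pattern, with small stand-in frames in place of the real per-year census results:

```python
import pandas as pd

frames = []
for year in (2014, 2015):
    df = pd.DataFrame({"POP": [4858979, 9922576]},
                      index=pd.Index(["01", "26"], name="state"))
    df["YEAR"] = year
    frames.append(df)

# One concat at the end, instead of growing a frame row-by-row:
# each incremental append would copy all the existing data
combined = pd.concat(frames)
print(len(combined))  # 4
```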
Now let's get the total population in the Midwest for 2015.
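A sketch with made-up populations (the real numbers come from the census data above):

```python
import pandas as pd

# Hypothetical 2015 populations keyed by FIPS code, for illustration
pop2015 = pd.DataFrame({"POP": [12859995, 9922576, 584153]},
                       index=pd.Index(["17", "26", "56"], name="state"))

midwest_fips = ["17", "26"]  # just Illinois and Michigan, for brevity

# Select the Midwest rows by index label, then sum the column
total = pop2015.loc[midwest_fips, "POP"].sum()
print(total)
```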