
As with NumPy, there's a fairly standard import statement that gets used for Pandas. Our textbook author also mentions importing DataFrame and Series specifically since these are used so frequently.

import pandas as pd
import numpy as np
from pandas import Series, DataFrame

Getting the FIPS data

In the NumPy version of this exercise, we manually downloaded and parsed a file that contained our FIPS code database. Pandas provides pre-built parsing capabilities and can read data from all of the following (the corresponding reader functions are sketched after this list):

  • CSV (Comma Separated Values)

  • tab-delimited files

  • JSON

  • HTML tables

  • Data stored in an SQL database
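As a quick illustration, here are the reader functions that correspond to each of these formats. This is a sketch only: the file names, URL, and database below are hypothetical placeholders, not files from this exercise.

import sqlite3
import pandas as pd

df_csv = pd.read_csv('data.csv')                    # CSV
df_tab = pd.read_csv('data.tsv', sep='\t')          # tab-delimited
df_json = pd.read_json('data.json')                 # JSON
tables = pd.read_html('https://example.com/page')   # list of DataFrames, one per HTML table
conn = sqlite3.connect('data.db')
df_sql = pd.read_sql('SELECT * FROM some_table', conn)  # SQL database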

The USDA has a page that contains FIPS codes. First let's try just plain pd.read_html to see what it does.

url = 'https://www.nrcs.usda.gov/wps/portal/nrcs/detail/?cid=nrcs143_013696'
tables = pd.read_html(url)
print(len(tables))
15

Yikes, that's a lot of tables! (For those of you who have some HTML experience, this is another reason why table-based page layout is not a very good idea.)

Let's refine our search to look for one that contains the text "FIPS".

tables = pd.read_html(url, match="FIPS")
print(len(tables))
10

That's a little better, but not great. Here's a version where you identify the table by an attribute it has in the HTML. This requires a bit of HTML knowledge, but it works best. It would be even better if our table had an id attribute, but it doesn't.

tables = pd.read_html(url, attrs={'class': 'data'})
print(len(tables))
1

OK, now that we've got it down to 1 table, we can pick it out and see what it looks like.

fips = tables[0]
fips.head()
Name Postal Code FIPS
0 Alabama AL 1
1 Alaska AK 2
2 Arizona AZ 4
3 Arkansas AR 5
4 California CA 6

Now look at the data at the end. (If adjustments need to be made, they usually occur at the beginning or the end of the table.)

fips.tail(n=10)
Name Postal Code FIPS
45 Virginia VA 51
46 Washington WA 53
47 West Virginia WV 54
48 Wisconsin WI 55
49 Wyoming WY 56
50 American Samoa AS 60
51 Guam GU 66
52 Northern Mariana Islands MP 69
53 Puerto Rico PR 72
54 Virgin Islands VI 78

Let's get rid of the last 5 rows, since they correspond to territories and not states. We can apply the same type of array slicing syntax that worked in NumPy.

fips = fips[0:-5]
fips.tail()
Name Postal Code FIPS
45 Virginia VA 51
46 Washington WA 53
47 West Virginia WV 54
48 Wisconsin WI 55
49 Wyoming WY 56

Since the Census site is going to use strings with leading zeros to identify a state, let's transform the FIPS column. First, pull out the desired column of the DataFrame, assign it to a variable, and print its type.

fips_codes = fips['FIPS']
print(type(fips_codes))
print(fips_codes.head())
<class 'pandas.core.series.Series'>
0    1
1    2
2    4
3    5
4    6
Name: FIPS, dtype: int64

Now use a list comprehension to generate a list containing the string versions of the FIPS codes.

fips_codes_as_strings = [str(code) for code in fips["FIPS"]]
fips_codes_as_strings
['1', '2', '4', '5', '6', '8', '9', '10', '12', '13', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '44', '45', '46', '47', '48', '49', '50', '51', '53', '54', '55', '56']

Pandas gives us a simpler way to convert data types, however, using the astype method on a Series object.

fips_codes_as_strings = fips_codes.astype('string')
fips_codes_as_strings
0      1
1      2
2      4
3      5
4      6
5      8
6      9
7     10
8     12
9     13
10    15
11    16
12    17
13    18
14    19
15    20
16    21
17    22
18    23
19    24
20    25
21    26
22    27
23    28
24    29
25    30
26    31
27    32
28    33
29    34
30    35
31    36
32    37
33    38
34    39
35    40
36    41
37    42
38    44
39    45
40    46
41    47
42    48
43    49
44    50
45    51
46    53
47    54
48    55
49    56
Name: FIPS, dtype: string

But we want these to have a leading zero for values under 10. Let's write a function to do this for us.

def convert(value):
    if value >= 10:
        return str(value)
    else:
        return '0' + str(value)

Try it with the first value, a two-digit value, and the last value in fips.

print(convert(fips["FIPS"][0]))
print(convert(fips["FIPS"][9]))
print(convert(fips["FIPS"][49]))
01
13
56

Note that the following call doesn't work. Why not?

fips_series = fips["FIPS"]
# fips_series[-1]  # raises KeyError: with an integer index, -1 is treated as a label, not a position

The Series and DataFrame objects provide a separate indexer named iloc to work positionally, rather than with index values.

fips["FIPS"].iloc[0:10]
0     1
1     2
2     4
3     5
4     6
5     8
6     9
7    10
8    12
9    13
Name: FIPS, dtype: int64

Now we can use the map method along with convert to get new values for each of the values in the "FIPS" column. map is a method of a Series that returns a new Series based on calling a function on each value of the original Series.

fips["FIPS"] = fips["FIPS"].map(convert) fips.head()
/tmp/ipykernel_428/1832877244.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  fips["FIPS"] = fips["FIPS"].map(convert)
Name Postal Code FIPS
0 Alabama AL 01
1 Alaska AK 02
2 Arizona AZ 04
3 Arkansas AR 05
4 California CA 06

And now we can replace the index with the FIPS values, and get rid of the FIPS column. First remind ourselves of the structure of our original DataFrame.

fips.head(n=2)
Name Postal Code FIPS
0 Alabama AL 01
1 Alaska AK 02

Now create a new DataFrame named fips_new with the converted Series serving as the index.

fips_new = DataFrame(data={"Name": fips["Name"], "Postal Code": fips["Postal Code"]}, index=fips["FIPS"])
fips_new.head(n=5)
Name Postal Code
FIPS
01 NaN NaN
02 NaN NaN
04 NaN NaN
05 NaN NaN
06 NaN NaN

Ughh, that doesn't seem to have worked very well! When combining Series objects together like this, Pandas always wants to align things that have the same index. Since our new DataFrame uses a completely different index than the original Series objects, this approach doesn't work.
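To see the alignment rule in isolation, here's a minimal sketch with made-up values: the DataFrame is given the index ['x', 'y'], which shares no labels with the Series index ['a', 'b'], so every cell comes out NaN.

s = Series([10, 20], index=['a', 'b'])
df = DataFrame(data={'col': s}, index=['x', 'y'])
print(df)   # both rows are NaN: 'x' and 'y' never appear in s's index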

print(fips.index)
print(fips_new.index)
RangeIndex(start=0, stop=50, step=1)
Index(['01', '02', '04', '05', '06', '08', '09', '10', '12', '13', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '44', '45', '46', '47', '48', '49', '50', '51', '53', '54', '55', '56'], dtype='object', name='FIPS')

Here's one approach we can try: using the values property of a Series object to get a more "raw" form of the data.

print(fips["Name"].values)
['Alabama' 'Alaska' 'Arizona' 'Arkansas' 'California' 'Colorado' 'Connecticut' 'Delaware' 'Florida' 'Georgia' 'Hawaii' 'Idaho' 'Illinois' 'Indiana' 'Iowa' 'Kansas' 'Kentucky' 'Louisiana' 'Maine' 'Maryland' 'Massachusetts' 'Michigan' 'Minnesota' 'Mississippi' 'Missouri' 'Montana' 'Nebraska' 'Nevada' 'New Hampshire' 'New Jersey' 'New Mexico' 'New York' 'North Carolina' 'North Dakota' 'Ohio' 'Oklahoma' 'Oregon' 'Pennsylvania' 'Rhode Island' 'South Carolina' 'South Dakota' 'Tennessee' 'Texas' 'Utah' 'Vermont' 'Virginia' 'Washington' 'West Virginia' 'Wisconsin' 'Wyoming']

What might the type of fips["Name"].values be?

print(type(fips["Name"].values))
<class 'numpy.ndarray'>

Let's try to create our new data frame using the values from the Series, so that we're not trying to match indexes.

fips_new = DataFrame(data={"Name": fips["Name"].values, "Post Code": fips["Postal Code"].values},
                     index=fips["FIPS"])
fips_new.head()
Name Post Code
FIPS
01 Alabama AL
02 Alaska AK
04 Arizona AZ
05 Arkansas AR
06 California CA

Use the info method to see some details about the fips_new DataFrame.

fips_new.info()
<class 'pandas.core.frame.DataFrame'>
Index: 50 entries, 01 to 56
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   Name       50 non-null     object
 1   Post Code  50 non-null     object
dtypes: object(2)
memory usage: 1.2+ KB

Take a look at the index associated with fips_new.

print(fips_new.index)
Index(['01', '02', '04', '05', '06', '08', '09', '10', '12', '13', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '44', '45', '46', '47', '48', '49', '50', '51', '53', '54', '55', '56'], dtype='object', name='FIPS')

Now we can get information about Michigan like this:

# Get row by index key value
print(fips_new.loc['26'])

# Get row by position
print(fips_new.iloc[0])
Name         Michigan
Post Code          MI
Name: 26, dtype: object
Name         Alabama
Post Code         AL
Name: 01, dtype: object

Other ways to create the new DataFrame

First, we can use the set_index method, which uses an existing column as the index for the DataFrame.

fips.head(n=2)
Name Postal Code FIPS
0 Alabama AL 01
1 Alaska AK 02
fips_new = fips.set_index(keys="FIPS", inplace=False, drop=False)
fips_new.head(n=5)
Name Postal Code FIPS
FIPS
01 Alabama AL 01
02 Alaska AK 02
04 Arizona AZ 04
05 Arkansas AR 05
06 California CA 06

It kind of looks like there's an index value FIPS. Let's see if there is.

# We can't just use the [ ] notation to try to get an entry out by index, since for a DataFrame
# this will look for a column with that name.
fips_new.loc['FIPS']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/tmp/ipykernel_428/3242763737.py in <cell line: 1>()
----> 1 fips_new.loc['FIPS']
    ... (internal pandas indexing frames elided) ...
KeyError: 'FIPS'

So the FIPS shown in the table display is just Pandas telling us the name of the index, not an actual index value.

fips_new.index
Index(['01', '02', '04', '05', '06', '08', '09', '10', '12', '13', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '44', '45', '46', '47', '48', '49', '50', '51', '53', '54', '55', '56'], dtype='object', name='FIPS')
fips_new.head(n=2)
Name Postal Code FIPS
FIPS
01 Alabama AL 01
02 Alaska AK 02

Now let's get rid of the FIPS column (which could have been done when we called set_index, by passing drop=True).

fips_new.drop("FIPS", axis=1, inplace=True) fips_new.head(2)
Name Postal Code
FIPS
01 Alabama AL
02 Alaska AK

Adding data to the DataFrame object

Now we'll set the region for each of the Midwest states. First, set the region to 0 for all the states.

fips_new['Region'] = 0
fips_new.tail(2)
Name Postal Code Region
FIPS
55 Wisconsin WI 0
56 Wyoming WY 0

Now set the region to 1 for Michigan and Ohio, on the assumption that we already know their FIPS codes are '26' and '39'.

fips_new.loc[['26', '39'], 'Region'] = 1
fips_new.loc[['26', '39']]
Name Postal Code Region
FIPS
26 Michigan MI 1
39 Ohio OH 1

What if we don't know what the codes are? We can first get the index values for the rows using boolean indexing.

indiana = fips_new['Postal Code'] == 'IN'
illinois = fips_new['Postal Code'] == 'IL'
print(type(illinois.values))
# either = np.logical_or(indiana, illinois)  # equivalent NumPy approach
either = indiana | illinois
print(either)
<class 'numpy.ndarray'>
FIPS
01    False
02    False
04    False
05    False
06    False
08    False
09    False
10    False
12    False
13    False
15    False
16    False
17     True
18     True
19    False
20    False
21    False
22    False
23    False
24    False
25    False
26    False
27    False
28    False
29    False
30    False
31    False
32    False
33    False
34    False
35    False
36    False
37    False
38    False
39    False
40    False
41    False
42    False
44    False
45    False
46    False
47    False
48    False
49    False
50    False
51    False
53    False
54    False
55    False
56    False
Name: Postal Code, dtype: bool

Let's see the rows in fips_new matched by this boolean array.

fips_new.loc[either]
Name Postal Code Region
FIPS
17 Illinois IL 0
18 Indiana IN 0

And now we can set the Region column.

fips_new.loc[either, 'Region'] = 1

Here's an alternative way.

First, get the index values associated with the rows matched by either.

in_ill_index = fips_new[either].index
in_ill_index

And now we are basically able to do what we did for Michigan and Ohio when we knew those index values in advance.

# fips_new.loc[in_ill_index, 'Region'] = 1
fips_new[fips_new['Region'] == 1]
Name Postal Code Region
FIPS
17 Illinois IL 1
18 Indiana IN 1
26 Michigan MI 1
39 Ohio OH 1

Here's a way to do it all in one step. Not necessarily recommended, because it's kind of hard to figure out exactly what it's doing, but it works!

fips_new.loc[(fips_new['Postal Code'] == 'WI') | (fips_new['Postal Code'] == 'MN'), 'Region'] = 1
FIPS
27    1
55    1
Name: Region, dtype: int64
fips_new.loc[fips_new['Region'] == 1]
Name Postal Code Region
FIPS
17 Illinois IL 1
18 Indiana IN 1
26 Michigan MI 1
27 Minnesota MN 1
39 Ohio OH 1
55 Wisconsin WI 1
fips_new.loc[['26', '39'], ['Name', 'Postal Code']]
Name Postal Code
FIPS
26 Michigan MI
39 Ohio OH

Getting the population data

The cell below contains the function written previously to get the census population data from api.census.gov.

import requests
import json
import numpy as np

def download_pop_data(year, states):
    states_list = ",".join(states)
    base_url = 'http://api.census.gov/data'
    application = "pep/charagegroups"
    query = f"get=POP&for=state:{states_list}"
    url = f"{base_url}/{year}/{application}?{query}"
    response = requests.get(url)
    if not (200 <= response.status_code <= 299):
        return (False, None)
    state_data = json.loads(response.content)
    state_data = np.array(state_data[1:])
    population = state_data[:, 0].astype('int')
    state = state_data[:, 1].astype('int')
    data = np.column_stack((state, population))
    years = np.full(data.shape[0], year, dtype='int')
    year_data = np.column_stack((years, data))
    return (True, year_data)
download_pop_data(2015, fips_new.index)

Let's modify this function to return a Pandas DataFrame instead of a NumPy ndarray, using the FIPS code as the index.

def download_pop_data_pd(year, states):
    states_list = ",".join(states)
    base_url = 'http://api.census.gov/data'
    application = "pep/charagegroups"
    query = f"get=POP&for=state:{states_list}"
    url = f"{base_url}/{year}/{application}?{query}"
    response = requests.get(url)
    if not (200 <= response.status_code <= 299):
        return (False, None)
    state_data = json.loads(response.content)
    state_data = np.array(state_data[1:])
    # As in the NumPy version, column 0 holds POP and column 1 holds the
    # (zero-padded) state FIPS code; use the FIPS code as the index.
    data = DataFrame(data={'Population': state_data[:, 0].astype('int')},
                     index=state_data[:, 1])
    data.index.name = 'FIPS'
    return (True, data)
result, df = download_pop_data_pd(2015, fips_new.index)
if result:
    print(df.head())
    print(df.dtypes)
else:
    print("Failed to download population data for 2015 using the Pandas method")

Here's one way for us to generate all of the years of data. Pandas recommends against doing it this way, though, because of the costs of memory reallocation on each call.
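That cell isn't included in this export, so here is a plausible sketch of the approach, assuming the years of interest are 2015 through 2019: the accumulated DataFrame is rebuilt by pd.concat on every iteration, which is what makes it costly.

# Hypothetical sketch: grow the DataFrame one year at a time.
# Each pd.concat call copies all of the rows accumulated so far.
all_years = DataFrame()
for year in range(2015, 2020):   # assumed range of years
    ok, year_df = download_pop_data_pd(year, fips_new.index)
    if ok:
        year_df['Year'] = year   # tag each row with its year
        all_years = pd.concat([all_years, year_df])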

Here's a more efficient way: create individual data frames for each year, then call concat to combine them.
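A sketch of that version, under the same assumptions as above: build a list of per-year DataFrames and concatenate once at the end.

frames = []
for year in range(2015, 2020):   # assumed range of years
    ok, year_df = download_pop_data_pd(year, fips_new.index)
    if ok:
        year_df['Year'] = year
        frames.append(year_df)
pop = pd.concat(frames)   # a single concatenation instead of one per year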

Now let's get the total population in the Midwest for 2015.
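This cell is also missing; here is one hedged way to compute it, assuming the pop DataFrame built above: select the 2015 rows, then use the Region column of fips_new (which shares the FIPS index) as a boolean filter and sum the populations.

pop_2015 = pop[pop['Year'] == 2015]
midwest_total = pop_2015.loc[fips_new['Region'] == 1, 'Population'].sum()
print(midwest_total)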