This notebook...
# Converted list of NPIs to array for ease of use (could also import from CSV)
given_npis_array = ["1003892316","1013028315","1093806754","1104985845","1134108103","1134326697","1174629950","1184787020","1205852985","1376542811","1417912270","1477612752","1508894361","1518054915","1548217680","1598700627","1639174816","1639277130","1689627051","1699777334","1720006992","1720199763","1770563116","1861695678","1871562900","1932111424","1932135266","1932188521","1952460248","1992999031","1114980703","1215918800","1255334769","1891765574","1679568901","1083790117","1851401822","1902093693","1790975886","1598775538","1730289463","1417911314","1407972490","1811951734","1275604480","1346347028","1184679664","1285723304","1134123474","1215918743","1912909037","1245334309","1275511032","1093812133","1396799185","1447204243","1932205721","1679516314","1124089685","1902069958","1992999031","1619933132","1932198256","1184624785","1477533511","1205931722","1053391185","1831110923","1225020886","1053534610","1821158775","1649488461","1528133949","1962405142","1750359071","1225104318","1497930903","1619933132","1447274972","1144439316","1730150830","1134214091","1902822760","1336105238","1649469628","1699737932","1669685111","1306902739","1811180144","1437112604","1144315839","1053417808","1114189354","1932147626","1699737932","1437178274","1831104694","1073740825","1326080060","1760481808"]
import pandas as pd
import numpy as np
given_npis = pd.DataFrame(given_npis_array)
given_npis.columns = ['npi']
given_npis_count = len(given_npis)
# set pandas option to show the full dataset
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', None)
print "We received {0} NPIs".format(given_npis_count)
given_npis.drop_duplicates(inplace=True)
given_npis_count_dupes_dropped = len(given_npis)
print "After removing {0} duplicates, we are working with {1} NPIs".format(given_npis_count-given_npis_count_dupes_dropped, given_npis_count_dupes_dropped)
# First order of business is to limit to only cardiologists
# to do this, we need to know the specialty of the cardiologists
# which we can get from the data.medicare.gov National Provider File
# the dataset is big-ish at 673mb so I downloaded it in full and stored it as a flat file
# alternatively, we could have grabbed data via API for just cardiologists
# but I want the possibility of doing other analysis later
# import urllib
# all_card_providers_url = "https://data.medicare.gov/resource/aeay-dfax.json?$where={0}&$limit={1}".format(urllib.quote("pri_spec LIKE '%CARD%'"),999999999)
# all_card_providers = pd.read_json(all_providers_url)
all_providers = pd.read_csv("data/data.medicare.gov_s63f-csi6_National-Downloadable_File.csv", low_memory=False)
all_providers.npi = all_providers.npi.astype(np.str)
print "{0} providers brought in from data.medicare.gov/d/s63f-csi6 file".format(len(all_providers))
# do a left inner join on the two dataframes
# so we are left with a dataframe of only the 97 given NPIs and their demographic info
given_npis_plus_ndf = pd.merge(given_npis, all_providers, how='left', on=['npi'])
given_npis_plus_ndf.drop_duplicates(subset='npi', inplace=True) # see assumption about duplicates
print "Do we have as many NPIs in this joined dataset as we were given? {0}".format(len(given_npis_plus_ndf) == given_npis_count_dupes_dropped)
# TODO: prefix
# OK, let's store the given_npis_plus dataset for later use so we dont have to
given_npis_plus_ndf.to_csv("data/given_npis_plus_national_downloadable_file.csv", index = False)
print "Saved given NPIs plus national downloadable file demographics to CSV for save keeping"
Basic demographic information from source #1 merged in with the list of unique NPIs provided; a sample of the data is below
# Read in given_npis_plus_ndf from file
given_npis_plus_ndf = pd.read_csv("data/given_npis_plus_national_downloadable_file.csv")
given_npis_plus_ndf[0:2]
If the ACO is looking for only cardiologists, we should probably filter non-cardiologists out.
First let's see what specialties are represented in our given list of providers -- perhaps multiple specialities fall within cardiology and/or cardiology shows up in secondary specialties
specialty_fields = ['Primary specialty', 'Secondary specialty 1', 'Secondary specialty 2', 'Secondary specialty 3', 'Secondary specialty 4']
for field in specialty_fields:
given_npis_plus_ndf[field].replace(np.nan, "NULL", inplace=True)
print given_npis_plus_ndf.groupby([field]).size().order()
print ""
CARD* shows up in primary and secondary specialty #1 -- let's see if the instance of secondary specialty is assigned to a provider with the primary specialty.
Note: there were 5 providers with no specialty which we will rule out per a documented assumption
==> there are 9 providers with a specialty related to the heart
print given_npis_plus_ndf.groupby(['Primary specialty','Secondary specialty 1']).size()[:5]
The provider with a secondary specialty 1 of cardiac surgery has a primary specialty of cardiology.
Per the previously documented assumption, we are going to discard the cardiac surgeon bringing us to 8 potential providers to potentially invite to the ACO
given_cardioligsts = given_npis_plus_ndf[given_npis_plus_ndf['Primary specialty'] == 'CARDIOVASCULAR DISEASE (CARDIOLOGY)']
print "We have narrowed the list of {0} unique NPIs to {1} with a primary specialty of cardiology".format(len(given_npis_plus_ndf), len(given_cardioligsts))
given_cardioligsts
excluded_entities = pd.read_csv("data/oig.hhs.gov_exclusion-list.csv", low_memory=False)
excluded_given_cardiologist_npis = []
for index, cardiologist in given_cardioligsts.iterrows():
exclusion_matches = excluded_entities[excluded_entities['NPI'] == cardiologist['npi']]
if len(exclusion_matches) > 0:
excluded_given_cardiologist_npis.append(cardiologist['npi'])
# print "/!\ DANGER /!\ {0} is in the exclusion list".format(cardiologist['npi'])
# else:
# print "It's all good, {0} is not in the exclusion list".format(cardiologist['npi'])
print "{0} cardiologists in our prospect list must be excluded".format(len(excluded_given_cardiologist_npis))
So.. now we have 8 cardiologists but how do we prioritize them? Even if we could invite all eight, we should focus our energy on top performers.
given_cardioligsts.to_csv("data/given_cardiologists_plus_national_downloadable_file.csv", index=False)