Book a Demo!
CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutPoliciesSign UpSign In
Download

Jupyter notebook 1-Filtering for only legal cardiologists.ipynb

64 views
Kernel: Python 2 (Ubuntu, plain)

This notebook...

  • deduplicates the given 100 NPIs (down to 97)

  • join in demographic information from data source #1 (data.medicare.gov physician compare)

  • filter out non-cardiologists

  • check to make sure none of the cardiologists are on the HHS OIG Exclusions list

# Converted list of NPIs to array for ease of use (could also import from CSV) given_npis_array = ["1003892316","1013028315","1093806754","1104985845","1134108103","1134326697","1174629950","1184787020","1205852985","1376542811","1417912270","1477612752","1508894361","1518054915","1548217680","1598700627","1639174816","1639277130","1689627051","1699777334","1720006992","1720199763","1770563116","1861695678","1871562900","1932111424","1932135266","1932188521","1952460248","1992999031","1114980703","1215918800","1255334769","1891765574","1679568901","1083790117","1851401822","1902093693","1790975886","1598775538","1730289463","1417911314","1407972490","1811951734","1275604480","1346347028","1184679664","1285723304","1134123474","1215918743","1912909037","1245334309","1275511032","1093812133","1396799185","1447204243","1932205721","1679516314","1124089685","1902069958","1992999031","1619933132","1932198256","1184624785","1477533511","1205931722","1053391185","1831110923","1225020886","1053534610","1821158775","1649488461","1528133949","1962405142","1750359071","1225104318","1497930903","1619933132","1447274972","1144439316","1730150830","1134214091","1902822760","1336105238","1649469628","1699737932","1669685111","1306902739","1811180144","1437112604","1144315839","1053417808","1114189354","1932147626","1699737932","1437178274","1831104694","1073740825","1326080060","1760481808"] import pandas as pd import numpy as np given_npis = pd.DataFrame(given_npis_array) given_npis.columns = ['npi'] given_npis_count = len(given_npis) # set pandas option to show the full dataset pd.set_option('display.max_rows', 100) pd.set_option('display.max_columns', None) print "We received {0} NPIs".format(given_npis_count) given_npis.drop_duplicates(inplace=True) given_npis_count_dupes_dropped = len(given_npis) print "After removing {0} duplicates, we are working with {1} NPIs".format(given_npis_count-given_npis_count_dupes_dropped, given_npis_count_dupes_dropped)
We received 100 NPIs After removing 3 duplicates, we are working with 97 NPIs
# First order of business is to limit to only cardiologists # to do this, we need to know the specialty of the cardiologists # which we can get from the data.medicare.gov National Provider File # the dataset is big-ish at 673mb so I downloaded it in full and stored it as a flat file # alternatively, we could have grabbed data via API for just cardiologists # but I want the possibility of doing other analysis later # import urllib # all_card_providers_url = "https://data.medicare.gov/resource/aeay-dfax.json?$where={0}&$limit={1}".format(urllib.quote("pri_spec LIKE '%CARD%'"),999999999) # all_card_providers = pd.read_json(all_providers_url) all_providers = pd.read_csv("data/data.medicare.gov_s63f-csi6_National-Downloadable_File.csv", low_memory=False) all_providers.npi = all_providers.npi.astype(np.str) print "{0} providers brought in from data.medicare.gov/d/s63f-csi6 file".format(len(all_providers))
2279287 providers brought in from data.medicare.gov/d/s63f-csi6 file
# do a left inner join on the two dataframes # so we are left with a dataframe of only the 97 given NPIs and their demographic info given_npis_plus_ndf = pd.merge(given_npis, all_providers, how='left', on=['npi']) given_npis_plus_ndf.drop_duplicates(subset='npi', inplace=True) # see assumption about duplicates print "Do we have as many NPIs in this joined dataset as we were given? {0}".format(len(given_npis_plus_ndf) == given_npis_count_dupes_dropped) # TODO: prefix
Do we have as many NPIs in this joined dataset as we were given? True
# OK, let's store the given_npis_plus dataset for later use so we dont have to given_npis_plus_ndf.to_csv("data/given_npis_plus_national_downloadable_file.csv", index = False) print "Saved given NPIs plus national downloadable file demographics to CSV for save keeping"
Saved given NPIs plus national downloadable file demographics to CSV for save keeping

Checkpoint (⚑) 1: Dataset of 97 NPIs with basic demographic info

Basic demographic information from source #1 merged in with the list of unique NPIs provided; a sample of the data is below

# Read in given_npis_plus_ndf from file given_npis_plus_ndf = pd.read_csv("data/given_npis_plus_national_downloadable_file.csv") given_npis_plus_ndf[0:2]

If the ACO is looking for only cardiologists, we should probably filter non-cardiologists out.

First let's see what specialties are represented in our given list of providers -- perhaps multiple specialities fall within cardiology and/or cardiology shows up in secondary specialties

specialty_fields = ['Primary specialty', 'Secondary specialty 1', 'Secondary specialty 2', 'Secondary specialty 3', 'Secondary specialty 4'] for field in specialty_fields: given_npis_plus_ndf[field].replace(np.nan, "NULL", inplace=True) print given_npis_plus_ndf.groupby([field]).size().order() print ""
Primary specialty CARDIAC SURGERY 1 NEUROSURGERY 1 SPORTS MEDICINE 1 VASCULAR SURGERY 1 NULL 5 HEMATOLOGY/ONCOLOGY 6 GENERAL SURGERY 7 NEPHROLOGY 7 PULMONARY DISEASE 7 CARDIOVASCULAR DISEASE (CARDIOLOGY) 8 ENDOCRINOLOGY 8 GASTROENTEROLOGY 9 INTERNAL MEDICINE 9 NEUROLOGY 9 ORTHOPEDIC SURGERY 9 RHEUMATOLOGY 9 dtype: int64 Secondary specialty 1 CARDIAC SURGERY 1 DIAGNOSTIC RADIOLOGY 1 EMERGENCY MEDICINE 1 HEMATOLOGY 1 INFECTIOUS DISEASE 1 SLEEP LABORATORY/MEDICINE 1 NEPHROLOGY 3 CRITICAL CARE (INTENSIVISTS) 9 INTERNAL MEDICINE 27 NULL 52 dtype: int64 Secondary specialty 2 MEDICAL ONCOLOGY 1 PERIPHERAL VASCULAR DISEASE 1 PULMONARY DISEASE 3 INTERNAL MEDICINE 5 NULL 87 dtype: int64 Secondary specialty 3 SLEEP LABORATORY/MEDICINE 1 NULL 96 dtype: int64 Secondary specialty 4 NULL 97 dtype: int64

CARD* shows up in primary and secondary specialty #1 -- let's see if the instance of secondary specialty is assigned to a provider with the primary specialty.

Note: there were 5 providers with no specialty which we will rule out per a documented assumption

==> there are 9 providers with a specialty related to the heart

  • 8x 'CARDIOVASCULAR DISEASE (CARDIOLOGY)'

  • 1x 'CARDIAC SURGERY'

print given_npis_plus_ndf.groupby(['Primary specialty','Secondary specialty 1']).size()[:5]
Primary specialty Secondary specialty 1 CARDIAC SURGERY INTERNAL MEDICINE 1 CARDIOVASCULAR DISEASE (CARDIOLOGY) CARDIAC SURGERY 1 CRITICAL CARE (INTENSIVISTS) 1 INTERNAL MEDICINE 2 NULL 4 dtype: int64

The provider with a secondary specialty 1 of cardiac surgery has a primary specialty of cardiology.

Per the previously documented assumption, we are going to discard the cardiac surgeon bringing us to 8 potential providers to potentially invite to the ACO

given_cardioligsts = given_npis_plus_ndf[given_npis_plus_ndf['Primary specialty'] == 'CARDIOVASCULAR DISEASE (CARDIOLOGY)'] print "We have narrowed the list of {0} unique NPIs to {1} with a primary specialty of cardiology".format(len(given_npis_plus_ndf), len(given_cardioligsts))
We have narrowed the list of 97 unique NPIs to 8 with a primary specialty of cardiology

⚑ 2: Narrowed given NPIs to 8 Cardiologists:

given_cardioligsts

⚑ 8 cardiologists in the running (none are in the HHS OIG exclusion list)

excluded_entities = pd.read_csv("data/oig.hhs.gov_exclusion-list.csv", low_memory=False) excluded_given_cardiologist_npis = [] for index, cardiologist in given_cardioligsts.iterrows(): exclusion_matches = excluded_entities[excluded_entities['NPI'] == cardiologist['npi']] if len(exclusion_matches) > 0: excluded_given_cardiologist_npis.append(cardiologist['npi']) # print "/!\ DANGER /!\ {0} is in the exclusion list".format(cardiologist['npi']) # else: # print "It's all good, {0} is not in the exclusion list".format(cardiologist['npi']) print "{0} cardiologists in our prospect list must be excluded".format(len(excluded_given_cardiologist_npis))
0 cardiologists in our prospect list must be excluded

So.. now we have 8 cardiologists but how do we prioritize them? Even if we could invite all eight, we should focus our energy on top performers.

given_cardioligsts.to_csv("data/given_cardiologists_plus_national_downloadable_file.csv", index=False)