CoCalc -- 1-Filtering for only legal cardiologists.ipynb

Download

Jupyter notebook 1-Filtering for only legal cardiologists.ipynb

ACO Analysis

1-Filtering for only legal cardiologists.ipynb

⁷⁸ views

Kernel: Python 2 (Ubuntu, plain)

Notebook #1: Filtering for only legal cardiologists

This notebook...

deduplicates the given 100 NPIs (down to 97)
join in demographic information from data source #1 (data.medicare.gov physician compare)
filter out non-cardiologists
check to make sure none of the cardiologists are on the HHS OIG Exclusions list

In [1]:

# Converted list of NPIs to array for ease of use (could also import from CSV)
given_npis_array = ["1003892316","1013028315","1093806754","1104985845","1134108103","1134326697","1174629950","1184787020","1205852985","1376542811","1417912270","1477612752","1508894361","1518054915","1548217680","1598700627","1639174816","1639277130","1689627051","1699777334","1720006992","1720199763","1770563116","1861695678","1871562900","1932111424","1932135266","1932188521","1952460248","1992999031","1114980703","1215918800","1255334769","1891765574","1679568901","1083790117","1851401822","1902093693","1790975886","1598775538","1730289463","1417911314","1407972490","1811951734","1275604480","1346347028","1184679664","1285723304","1134123474","1215918743","1912909037","1245334309","1275511032","1093812133","1396799185","1447204243","1932205721","1679516314","1124089685","1902069958","1992999031","1619933132","1932198256","1184624785","1477533511","1205931722","1053391185","1831110923","1225020886","1053534610","1821158775","1649488461","1528133949","1962405142","1750359071","1225104318","1497930903","1619933132","1447274972","1144439316","1730150830","1134214091","1902822760","1336105238","1649469628","1699737932","1669685111","1306902739","1811180144","1437112604","1144315839","1053417808","1114189354","1932147626","1699737932","1437178274","1831104694","1073740825","1326080060","1760481808"]

import pandas as pd
import numpy as np

given_npis = pd.DataFrame(given_npis_array)
given_npis.columns = ['npi']
given_npis_count = len(given_npis)

# set pandas option to show the full dataset
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', None)

print "We received {0} NPIs".format(given_npis_count)
given_npis.drop_duplicates(inplace=True)
given_npis_count_dupes_dropped = len(given_npis)

print "After removing {0} duplicates, we are working with {1} NPIs".format(given_npis_count-given_npis_count_dupes_dropped, given_npis_count_dupes_dropped)

Out[1]:

We received 100 NPIs
After removing 3 duplicates, we are working with 97 NPIs

In [3]:

# First order of business is to limit to only cardiologists
#   to do this, we need to know the specialty of the cardiologists 
#   which we can get from the data.medicare.gov National Provider File

# the dataset is big-ish at 673mb so I downloaded it in full and stored it as a flat file
#   alternatively, we could have grabbed data via API for just cardiologists
#   but I want the possibility of doing other analysis later
#      import urllib
#      all_card_providers_url = "https://data.medicare.gov/resource/aeay-dfax.json?$where={0}&$limit={1}".format(urllib.quote("pri_spec LIKE '%CARD%'"),999999999)
#      all_card_providers = pd.read_json(all_providers_url)
all_providers = pd.read_csv("data/data.medicare.gov_s63f-csi6_National-Downloadable_File.csv", low_memory=False)
all_providers.npi = all_providers.npi.astype(np.str)

print "{0} providers brought in from data.medicare.gov/d/s63f-csi6 file".format(len(all_providers))

Out[3]:

2279287 providers brought in from data.medicare.gov/d/s63f-csi6 file

In [14]:

# do a left inner join on the two dataframes
# so we are left with a dataframe of only the 97 given NPIs and their demographic info
given_npis_plus_ndf = pd.merge(given_npis, all_providers, how='left', on=['npi'])
given_npis_plus_ndf.drop_duplicates(subset='npi', inplace=True) # see assumption about duplicates
print "Do we have as many NPIs in this joined dataset as we were given? {0}".format(len(given_npis_plus_ndf) == given_npis_count_dupes_dropped)

# TODO: prefix

Out[14]:

Do we have as many NPIs in this joined dataset as we were given? True

In [9]:

# OK, let's store the given_npis_plus dataset for later use so we dont have to 
given_npis_plus_ndf.to_csv("data/given_npis_plus_national_downloadable_file.csv", index = False)
print "Saved given NPIs plus national downloadable file demographics to CSV for save keeping"

Out[9]:

Saved given NPIs plus national downloadable file demographics to CSV for save keeping

Checkpoint (⚑) 1: Dataset of 97 NPIs with basic demographic info

Basic demographic information from source #1 merged in with the list of unique NPIs provided; a sample of the data is below

In [17]:

# Read in given_npis_plus_ndf from file
given_npis_plus_ndf = pd.read_csv("data/given_npis_plus_national_downloadable_file.csv")
given_npis_plus_ndf[0:2]

Out[17]:

If the ACO is looking for only cardiologists, we should probably filter non-cardiologists out.

First let's see what specialties are represented in our given list of providers -- perhaps multiple specialities fall within cardiology and/or cardiology shows up in secondary specialties

In [48]:

specialty_fields = ['Primary specialty', 'Secondary specialty 1', 'Secondary specialty 2', 'Secondary specialty 3', 'Secondary specialty 4']

for field in specialty_fields:
    given_npis_plus_ndf[field].replace(np.nan, "NULL", inplace=True)
    print given_npis_plus_ndf.groupby([field]).size().order()
    print ""

Out[48]:

Primary specialty
CARDIAC SURGERY                        1
NEUROSURGERY                           1
SPORTS MEDICINE                        1
VASCULAR SURGERY                       1
NULL                                   5
HEMATOLOGY/ONCOLOGY                    6
GENERAL SURGERY                        7
NEPHROLOGY                             7
PULMONARY DISEASE                      7
CARDIOVASCULAR DISEASE (CARDIOLOGY)    8
ENDOCRINOLOGY                          8
GASTROENTEROLOGY                       9
INTERNAL MEDICINE                      9
NEUROLOGY                              9
ORTHOPEDIC SURGERY                     9
RHEUMATOLOGY                           9
dtype: int64

Secondary specialty 1
CARDIAC SURGERY                  1
DIAGNOSTIC RADIOLOGY             1
EMERGENCY MEDICINE               1
HEMATOLOGY                       1
INFECTIOUS DISEASE               1
SLEEP LABORATORY/MEDICINE        1
NEPHROLOGY                       3
CRITICAL CARE (INTENSIVISTS)     9
INTERNAL MEDICINE               27
NULL                            52
dtype: int64

Secondary specialty 2
MEDICAL ONCOLOGY                1
PERIPHERAL VASCULAR DISEASE     1
PULMONARY DISEASE               3
INTERNAL MEDICINE               5
NULL                           87
dtype: int64

Secondary specialty 3
SLEEP LABORATORY/MEDICINE     1
NULL                         96
dtype: int64

Secondary specialty 4
NULL                     97
dtype: int64

CARD* shows up in primary and secondary specialty #1 -- let's see if the instance of secondary specialty is assigned to a provider with the primary specialty.

Note: there were 5 providers with no specialty which we will rule out per a documented assumption

==> there are 9 providers with a specialty related to the heart

8x 'CARDIOVASCULAR DISEASE (CARDIOLOGY)'
1x 'CARDIAC SURGERY'

In [54]:

print given_npis_plus_ndf.groupby(['Primary specialty','Secondary specialty 1']).size()[:5]

Out[54]:

Primary specialty                    Secondary specialty 1       
CARDIAC SURGERY                      INTERNAL MEDICINE               1
CARDIOVASCULAR DISEASE (CARDIOLOGY)  CARDIAC SURGERY                 1
                                     CRITICAL CARE (INTENSIVISTS)    1
                                     INTERNAL MEDICINE               2
                                     NULL                            4
dtype: int64

The provider with a secondary specialty 1 of cardiac surgery has a primary specialty of cardiology.

Per the previously documented assumption, we are going to discard the cardiac surgeon bringing us to 8 potential providers to potentially invite to the ACO

In [62]:

given_cardioligsts = given_npis_plus_ndf[given_npis_plus_ndf['Primary specialty'] == 'CARDIOVASCULAR DISEASE (CARDIOLOGY)']
print "We have narrowed the list of {0} unique NPIs to {1} with a primary specialty of cardiology".format(len(given_npis_plus_ndf), len(given_cardioligsts))

Out[62]:

We have narrowed the list of 97 unique NPIs to 8 with a primary specialty of cardiology

⚑ 2: Narrowed given NPIs to 8 Cardiologists:

In [71]:

given_cardioligsts

Out[71]:

⚑ 8 cardiologists in the running (none are in the HHS OIG exclusion list)

In [88]:

excluded_entities = pd.read_csv("data/oig.hhs.gov_exclusion-list.csv", low_memory=False)
excluded_given_cardiologist_npis = []
for index, cardiologist in given_cardioligsts.iterrows():
    exclusion_matches = excluded_entities[excluded_entities['NPI'] == cardiologist['npi']]
    if len(exclusion_matches) > 0:
        excluded_given_cardiologist_npis.append(cardiologist['npi'])
#         print "/!\ DANGER /!\ {0} is in the exclusion list".format(cardiologist['npi'])
#     else:
#         print "It's all good, {0} is not in the exclusion list".format(cardiologist['npi'])

print "{0} cardiologists in our prospect list must be excluded".format(len(excluded_given_cardiologist_npis))

Out[88]:

0 cardiologists in our prospect list must be excluded

So.. now we have 8 cardiologists but how do we prioritize them? Even if we could invite all eight, we should focus our energy on top performers.

In [89]:

given_cardioligsts.to_csv("data/given_cardiologists_plus_national_downloadable_file.csv", index=False)

In [ ]:

Notebook #1: Filtering for only legal cardiologists

Checkpoint (⚑) 1: Dataset of 97 NPIs with basic demographic info

⚑ 2: Narrowed given NPIs to 8 Cardiologists:

⚑ 8 cardiologists in the running (none are in the HHS OIG exclusion list)

Product

Resources

Company