This notebook...

  • deduplicates the given 100 NPIs (down to 97)
  • joins in demographic information from data source #1 (data.medicare.gov Physician Compare)
  • filters out non-cardiologists
  • checks that none of the cardiologists are on the HHS OIG Exclusions list
In [1]:
# Converted list of NPIs to array for ease of use (could also import from CSV)
given_npis_array = ["1003892316","1013028315","1093806754","1104985845","1134108103","1134326697","1174629950","1184787020","1205852985","1376542811","1417912270","1477612752","1508894361","1518054915","1548217680","1598700627","1639174816","1639277130","1689627051","1699777334","1720006992","1720199763","1770563116","1861695678","1871562900","1932111424","1932135266","1932188521","1952460248","1992999031","1114980703","1215918800","1255334769","1891765574","1679568901","1083790117","1851401822","1902093693","1790975886","1598775538","1730289463","1417911314","1407972490","1811951734","1275604480","1346347028","1184679664","1285723304","1134123474","1215918743","1912909037","1245334309","1275511032","1093812133","1396799185","1447204243","1932205721","1679516314","1124089685","1902069958","1992999031","1619933132","1932198256","1184624785","1477533511","1205931722","1053391185","1831110923","1225020886","1053534610","1821158775","1649488461","1528133949","1962405142","1750359071","1225104318","1497930903","1619933132","1447274972","1144439316","1730150830","1134214091","1902822760","1336105238","1649469628","1699737932","1669685111","1306902739","1811180144","1437112604","1144315839","1053417808","1114189354","1932147626","1699737932","1437178274","1831104694","1073740825","1326080060","1760481808"]

import pandas as pd
import numpy as np

given_npis = pd.DataFrame(given_npis_array)
given_npis.columns = ['npi']
given_npis_count = len(given_npis)

# set pandas option to show the full dataset
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', None)

print("We received {0} NPIs".format(given_npis_count))
given_npis.drop_duplicates(inplace=True)
given_npis_count_dupes_dropped = len(given_npis)

print("After removing {0} duplicates, we are working with {1} NPIs".format(given_npis_count - given_npis_count_dupes_dropped, given_npis_count_dupes_dropped))
We received 100 NPIs
After removing 3 duplicates, we are working with 97 NPIs
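
The cell above reports how many duplicates were dropped but not which NPIs were repeated. A minimal sketch of how that could be checked, assuming a short illustrative list (`sample_npis` is a made-up stand-in, not the full `given_npis_array`):

```python
import pandas as pd

# Illustrative sample only; the notebook's real input is given_npis_array
sample_npis = ["1992999031", "1114980703", "1992999031", "1619933132"]
sample = pd.DataFrame({"npi": sample_npis})

# keep=False marks every occurrence of a repeated value, not just the later ones
repeated = sample.loc[sample["npi"].duplicated(keep=False), "npi"]
repeated_npis = sorted(repeated.unique())
print(repeated_npis)
```

Running the same two lines against `given_npis` before the `drop_duplicates` call would list the repeated NPIs for review.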
In [3]:
# First order of business is to limit to only cardiologists
#   to do this, we need to know the specialty of each provider,
#   which we can get from the data.medicare.gov National Provider File

# the dataset is big-ish at 673 MB, so I downloaded it in full and stored it as a flat file
#   alternatively, we could have grabbed data via API for just cardiologists
#   but I want the possibility of doing other analysis later
#      import urllib
#      all_card_providers_url = "https://data.medicare.gov/resource/aeay-dfax.json?$where={0}&$limit={1}".format(urllib.quote("pri_spec LIKE '%CARD%'"),999999999)
#      all_card_providers = pd.read_json(all_card_providers_url)
all_providers = pd.read_csv("data/data.medicare.gov_s63f-csi6_National-Downloadable_File.csv", low_memory=False)
# store NPIs as strings so they match the string NPIs in given_npis
all_providers.npi = all_providers.npi.astype(str)

print("{0} providers brought in from data.medicare.gov/d/s63f-csi6 file".format(len(all_providers)))
2279287 providers brought in from data.medicare.gov/d/s63f-csi6 file
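
If holding the full file in memory ever becomes a problem, one option is to read it in chunks and keep only the rows of interest. The sketch below is an assumption-laden illustration, not what this notebook does: `load_matching_providers`, the tiny in-memory CSV, and the NPI `9999999999` are all hypothetical stand-ins for the real 673 MB file.

```python
import io
import pandas as pd

# Hypothetical stand-in for the large provider file
csv_text = (
    "npi,Primary specialty\n"
    "1003892316,NEUROLOGY\n"
    "1184787020,CARDIOVASCULAR DISEASE (CARDIOLOGY)\n"
    "9999999999,NEPHROLOGY\n"
)
wanted = {"1003892316", "1184787020"}

def load_matching_providers(source, wanted_npis, chunksize=100000):
    """Read the CSV in chunks and keep only rows whose npi is in wanted_npis."""
    kept = []
    # dtype=str for npi keeps identifiers as text so they match string keys
    for chunk in pd.read_csv(source, dtype={"npi": str}, chunksize=chunksize):
        kept.append(chunk[chunk["npi"].isin(wanted_npis)])
    return pd.concat(kept, ignore_index=True)

matched = load_matching_providers(io.StringIO(csv_text), wanted, chunksize=2)
print(len(matched))
```

The trade-off is that only the matching rows survive, so any later analysis across all providers would need another pass over the file.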
In [14]:
# do a left join on the two dataframes
# so we are left with a dataframe of only the 97 given NPIs and their demographic info
given_npis_plus_ndf = pd.merge(given_npis, all_providers, how='left', on=['npi'])
given_npis_plus_ndf.drop_duplicates(subset='npi', inplace=True) # see assumption about duplicates
print("Do we have as many NPIs in this joined dataset as we were given? {0}".format(len(given_npis_plus_ndf) == given_npis_count_dupes_dropped))

# TODO: prefix 
Do we have as many NPIs in this joined dataset as we were given? True
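
Because a left join keeps every row of the left frame, the row-count check above passes even when an NPI has no match in the provider file. A minimal sketch of how unmatched NPIs could be surfaced with the `indicator` option of `pd.merge`; the frames and NPIs here are made up for illustration:

```python
import pandas as pd

# Hypothetical miniature versions of given_npis and all_providers
given = pd.DataFrame({"npi": ["1000000001", "1000000002", "1000000003"]})
providers = pd.DataFrame({
    "npi": ["1000000001", "1000000001", "1000000003"],
    "Primary specialty": ["NEUROLOGY", "NEUROLOGY", "NEPHROLOGY"],
})

# indicator=True adds a _merge column: "both" or "left_only" for a left join
merged = pd.merge(given, providers, how="left", on="npi", indicator=True)
merged = merged.drop_duplicates(subset="npi")

unmatched = merged.loc[merged["_merge"] == "left_only", "npi"].tolist()
print(unmatched)
```

Any NPI reported as `left_only` would carry only missing values in the demographic columns.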
In [9]:
# OK, let's store the given_npis_plus dataset for later use so we don't have to redo the join
given_npis_plus_ndf.to_csv("data/given_npis_plus_national_downloadable_file.csv", index=False)
print("Saved given NPIs plus national downloadable file demographics to CSV for safekeeping")
Saved given NPIs plus national downloadable file demographics to CSV for safekeeping

Checkpoint (⚑) 1: Dataset of 97 NPIs with basic demographic info

Basic demographic information from source #1 is now merged in with the list of unique NPIs provided; a sample of the data is below.

In [17]:
# Read in given_npis_plus_ndf from file
given_npis_plus_ndf = pd.read_csv("data/given_npis_plus_national_downloadable_file.csv")
given_npis_plus_ndf[0:2]
Out[17]:
npi PAC ID Professional Enrollment ID Last Name First Name Middle Name Suffix Gender Credential Medical school name Graduation year Primary specialty Secondary specialty 1 Secondary specialty 2 Secondary specialty 3 Secondary specialty 4 All secondary specialties Organization legal name Organization DBA name Group Practice PAC ID Number of Group Practice members Line 1 Street Address Line 2 Street Address Marker of address line 2 suppression City State Zip Code Claims based hospital affiliation CCN 1 Claims based hospital affiliation LBN 1 Claims based hospital affiliation CCN 2 Claims based hospital affiliation LBN 2 Claims based hospital affiliation CCN 3 Claims based hospital affiliation LBN 3 Claims based hospital affiliation CCN 4 Claims based hospital affiliation LBN 4 Claims based hospital affiliation CCN 5 Claims based hospital affiliation LBN 5 Professional accepts Medicare Assignment Participating in eRx Participating in PQRS Participating in EHR Received PQRS Maintenance of Certification Program Incentive Participated in Million Hearts
0 1003892316 7214992767 I20090817000324 JANUMPALLY LINGAIAH NaN NaN M NaN OTHER 1974 NEUROLOGY NaN NaN NaN NaN NaN ANTELOPE VALLEY NEUROSCIENCE NaN 345205944 6 913 W ALENE AVE B NaN N RIDGECREST CA 935552399 50056 ANTELOPE VALLEY HOSPITAL DISTRICT 50204 LANCASTER HOSPITAL CORPORATION 51333 RIDGECREST REGIONAL HOSPITAL NaN NaN NaN NaN Y NaN Y Y NaN NaN
1 1013028315 7517020688 I20090115000476 BARCELONA EDGARDO S NaN M NaN OTHER 1972 VASCULAR SURGERY NaN NaN NaN NaN NaN HIGH DESERT MEDICAL CORPORATION NaN 6103730569 49 43839 N 15TH ST W NaN N LANCASTER CA 935344756 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Y NaN NaN NaN NaN NaN

If the ACO is looking for only cardiologists, we should probably filter non-cardiologists out.

First, let's see which specialties are represented in our given list of providers -- perhaps multiple specialties fall within cardiology, and/or cardiology shows up in the secondary specialties.

In [48]:
specialty_fields = ['Primary specialty', 'Secondary specialty 1', 'Secondary specialty 2', 'Secondary specialty 3', 'Secondary specialty 4']

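for field in specialty_fields:
    # label missing specialties explicitly so they show up in the counts
    given_npis_plus_ndf[field] = given_npis_plus_ndf[field].fillna("NULL")
    # provider count per specialty, smallest first
    print(given_npis_plus_ndf.groupby([field]).size().sort_values())
    print("")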
Primary specialty
CARDIAC SURGERY                        1
NEUROSURGERY                           1
SPORTS MEDICINE                        1
VASCULAR SURGERY                       1
NULL                                   5
HEMATOLOGY/ONCOLOGY                    6
GENERAL SURGERY                        7
NEPHROLOGY                             7
PULMONARY DISEASE                      7
CARDIOVASCULAR DISEASE (CARDIOLOGY)    8
ENDOCRINOLOGY                          8
GASTROENTEROLOGY                       9
INTERNAL MEDICINE                      9
NEUROLOGY                              9
ORTHOPEDIC SURGERY                     9
RHEUMATOLOGY                           9
dtype: int64

Secondary specialty 1
CARDIAC SURGERY                  1
DIAGNOSTIC RADIOLOGY             1
EMERGENCY MEDICINE               1
HEMATOLOGY                       1
INFECTIOUS DISEASE               1
SLEEP LABORATORY/MEDICINE        1
NEPHROLOGY                       3
CRITICAL CARE (INTENSIVISTS)     9
INTERNAL MEDICINE               27
NULL                            52
dtype: int64

Secondary specialty 2
MEDICAL ONCOLOGY                1
PERIPHERAL VASCULAR DISEASE     1
PULMONARY DISEASE               3
INTERNAL MEDICINE               5
NULL                           87
dtype: int64

Secondary specialty 3
SLEEP LABORATORY/MEDICINE     1
NULL                         96
dtype: int64

Secondary specialty 4
NULL                     97
dtype: int64

CARD* shows up in the primary specialty and in secondary specialty 1 -- let's see whether the provider with the cardiac secondary specialty also has cardiology as a primary specialty.

Note: 5 providers have no specialty listed; we will rule them out per a documented assumption.

==> there are 9 providers with a specialty related to the heart

  • 8x 'CARDIOVASCULAR DISEASE (CARDIOLOGY)'
  • 1x 'CARDIAC SURGERY'
In [54]:
print(given_npis_plus_ndf.groupby(['Primary specialty','Secondary specialty 1']).size()[:5])
Primary specialty                    Secondary specialty 1       
CARDIAC SURGERY                      INTERNAL MEDICINE               1
CARDIOVASCULAR DISEASE (CARDIOLOGY)  CARDIAC SURGERY                 1
                                     CRITICAL CARE (INTENSIVISTS)    1
                                     INTERNAL MEDICINE               2
                                     NULL                            4
dtype: int64

The provider with a secondary specialty 1 of cardiac surgery has a primary specialty of cardiology.

Per the previously documented assumption, we are going to discard the cardiac surgeon, bringing us to 8 providers to potentially invite to the ACO.

In [62]:
given_cardioligsts = given_npis_plus_ndf[given_npis_plus_ndf['Primary specialty'] == 'CARDIOVASCULAR DISEASE (CARDIOLOGY)']
print("We have narrowed the list of {0} unique NPIs to {1} with a primary specialty of cardiology".format(len(given_npis_plus_ndf), len(given_cardioligsts)))
We have narrowed the list of 97 unique NPIs to 8 with a primary specialty of cardiology
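
The exact match on primary specialty drops anyone whose cardiac specialty appears elsewhere. As a cross-check, the sketch below flags rows where any specialty field contains "CARD" and lists those the exact match excludes. It is an illustration on made-up rows (the NPIs and the two-column `specialty_fields` list are hypothetical), not the notebook's data:

```python
import pandas as pd

specialty_fields = ["Primary specialty", "Secondary specialty 1"]
df = pd.DataFrame({
    "npi": ["1000000001", "1000000002", "1000000003"],
    "Primary specialty": [
        "CARDIOVASCULAR DISEASE (CARDIOLOGY)", "CARDIAC SURGERY", "NEUROLOGY",
    ],
    "Secondary specialty 1": ["CARDIAC SURGERY", "INTERNAL MEDICINE", None],
})

# True where any specialty field contains "CARD"; na=False treats missing values as no match
mentions_card = pd.concat(
    [df[field].str.contains("CARD", na=False) for field in specialty_fields], axis=1
).any(axis=1)

primary_cardiology = df["Primary specialty"] == "CARDIOVASCULAR DISEASE (CARDIOLOGY)"
# Providers that mention CARD somewhere but are dropped by the exact primary match
borderline = df.loc[mentions_card & ~primary_cardiology, "npi"].tolist()
print(borderline)
```

A list like `borderline` makes it easy to review each dropped provider against the documented assumption instead of relying on the grouped counts alone.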

⚑ 2: Narrowed given NPIs to 8 Cardiologists:

In [71]:
given_cardioligsts
Out[71]:
npi PAC ID Professional Enrollment ID Last Name First Name Middle Name Suffix Gender Credential Medical school name Graduation year Primary specialty Secondary specialty 1 Secondary specialty 2 Secondary specialty 3 Secondary specialty 4 All secondary specialties Organization legal name Organization DBA name Group Practice PAC ID Number of Group Practice members Line 1 Street Address Line 2 Street Address Marker of address line 2 suppression City State Zip Code Claims based hospital affiliation CCN 1 Claims based hospital affiliation LBN 1 Claims based hospital affiliation CCN 2 Claims based hospital affiliation LBN 2 Claims based hospital affiliation CCN 3 Claims based hospital affiliation LBN 3 Claims based hospital affiliation CCN 4 Claims based hospital affiliation LBN 4 Claims based hospital affiliation CCN 5 Claims based hospital affiliation LBN 5 Professional accepts Medicare Assignment Participating in eRx Participating in PQRS Participating in EHR Received PQRS Maintenance of Certification Program Incentive Participated in Million Hearts
7 1184787020 6002875069 I20041006000639 KUMAR ANIL NaN NaN M MD OTHER 1977 CARDIOVASCULAR DISEASE (CARDIOLOGY) NULL NULL NULL NULL NaN KUMAR MEDICAL CORPORATION NaN 1456389030 2 44215 15TH W ST SUITE 215 N LANCASTER CA 93534 50056 ANTELOPE VALLEY HOSPITAL DISTRICT 50204 LANCASTER HOSPITAL CORPORATION 51333 RIDGECREST REGIONAL HOSPITAL NaN NaN NaN NaN Y NaN Y Y NaN NaN
11 1477612752 5193784155 I20041006000364 GILL KANWALJIT S NaN M MD OTHER 1990 CARDIOVASCULAR DISEASE (CARDIOLOGY) INTERNAL MEDICINE PERIPHERAL VASCULAR DISEASE NULL NULL INTERNAL MEDICINE, PERIPHERAL VASCULAR DISEASE KANWALJIT S. GILL MD INC NaN 8729246806 1 38656 MEDICAL CTR DR A NaN N PALMDALE CA 935514483 50204 LANCASTER HOSPITAL CORPORATION 50056 ANTELOPE VALLEY HOSPITAL DISTRICT 51333 RIDGECREST REGIONAL HOSPITAL NaN NaN NaN NaN Y NaN Y Y NaN NaN
28 1952460248 648340653 I20080603000419 KHANAL SANJAYA NaN NaN M NaN OTHER 1991 CARDIOVASCULAR DISEASE (CARDIOLOGY) CARDIAC SURGERY NULL NULL NULL CARDIAC SURGERY ANTELOPE VALLEY CARDIOLOGY ASSOCIATES NaN 7416054614 4 43723 20TH ST W NaN N LANCASTER CA 935344763 50056 ANTELOPE VALLEY HOSPITAL DISTRICT 50204 LANCASTER HOSPITAL CORPORATION 51333 RIDGECREST REGIONAL HOSPITAL NaN NaN NaN NaN Y NaN Y Y NaN NaN
30 1114980703 3779470182 I20040301001181 ALTURJUMAN AHMAD MOUTAZ NaN M MD OTHER 1989 CARDIOVASCULAR DISEASE (CARDIOLOGY) CRITICAL CARE (INTENSIVISTS) INTERNAL MEDICINE NULL NULL CRITICAL CARE (INTENSIVISTS), INTERNAL MEDICINE BLACKSTONE MEDICAL CORP. NaN 5991725640 2 4500 BROCKTON AVE SUITE 305 N RIVERSIDE CA 925014027 50022 RIVERSIDE HEALTHCARE SYSTEM, L.P. 50770 LOMA LINDA UNIVERSITY MEDICAL CENTER MURRIETA 50775 TEMECULA VALLEY HOSPITAL, INC. 50102 PARKVIEW COMMUNITY HOSPITAL MEDICAL CENTER 50327 SEVENTH DAY ADVENTISTS LOMA LINDA UNIVERSITY M... Y NaN Y Y NaN NaN
31 1215918800 9234125527 I20070823000730 SHENKMAN HEATHER J NaN F MD ALBANY MEDICAL COLLEGE OF UNION UNIVERSITY 1999 CARDIOVASCULAR DISEASE (CARDIOLOGY) INTERNAL MEDICINE NULL NULL NULL INTERNAL MEDICINE LAKESIDE MEDICAL ASSOCIATES, A MEDICAL GROUP, ... NaN 1951202183 2 7345 MEDICAL CTR DR SUITE 500 N WEST HILLS CA 913071964 50235 PROVIDENCE HEALTH SYSTEM-SOUTHERN CALIFORNIA 50761 PROVIDENCE HEALTH SYSTEM-SOUTHERN CALIFORNIA 50278 PROVIDENCE HEALTH SYSTEM - SOUTHERN CALIFORNIA NaN NaN NaN NaN M NaN NaN NaN NaN NaN
32 1255334769 3678470341 I20080307000417 ON ROGER C NaN M NaN TUFTS UNIVERSITY SCHOOL OF MEDICINE 1978 CARDIOVASCULAR DISEASE (CARDIOLOGY) NULL NULL NULL NULL NaN LAKESIDE MEDICAL ORGANIZATION, A MEDICAL GROUP... NaN 7618005166 100 191 S BUENA VISTA SUITE 440 N BURBANK CA 915054554 50235 PROVIDENCE HEALTH SYSTEM-SOUTHERN CALIFORNIA 50278 PROVIDENCE HEALTH SYSTEM - SOUTHERN CALIFORNIA 50761 PROVIDENCE HEALTH SYSTEM-SOUTHERN CALIFORNIA NaN NaN NaN NaN Y NaN Y Y NaN NaN
33 1891765574 3274585765 I20050211000931 GUPTA VINOD K NaN M MD UNIVERSITY OF CALIFORNIA, SAN FRANCISCO SCHOOL... 1966 CARDIOVASCULAR DISEASE (CARDIOLOGY) NULL NULL NULL NULL NaN HEART CENTER OF SOUTHERN CALIFORNIA INC NaN 8820040314 1 2876 N SYCAMORE DR NaN N SIMI VALLEY CA 930651550 50236 SIMI VALLEY HOSPITAL & HEALTH CARE SERVICES NaN NaN NaN NaN NaN NaN NaN NaN M NaN Y Y NaN NaN
36 1851401822 6800077249 I20110218000225 SILVA AUGUSTO NaN NaN M NaN OTHER 1967 CARDIOVASCULAR DISEASE (CARDIOLOGY) NULL NULL NULL NULL NaN LAKESIDE MEDICAL ORGANIZATION, A MEDICAL GROUP... NaN 7618005166 100 191 S BUENA VISTA ST SUITE 400 N BURBANK CA 915054543 50235 PROVIDENCE HEALTH SYSTEM-SOUTHERN CALIFORNIA NaN NaN NaN NaN NaN NaN NaN NaN Y NaN NaN NaN NaN NaN

⚑ 8 cardiologists in the running (none are in the HHS OIG exclusion list)

In [88]:
excluded_entities = pd.read_csv("data/oig.hhs.gov_exclusion-list.csv", low_memory=False)
excluded_given_cardiologist_npis = []
for index, cardiologist in given_cardioligsts.iterrows():
    exclusion_matches = excluded_entities[excluded_entities['NPI'] == cardiologist['npi']]
    if len(exclusion_matches) > 0:
        excluded_given_cardiologist_npis.append(cardiologist['npi'])
#         print "/!\ DANGER /!\ {0} is in the exclusion list".format(cardiologist['npi'])
#     else:
#         print "It's all good, {0} is not in the exclusion list".format(cardiologist['npi'])

print("{0} cardiologists in our prospect list must be excluded".format(len(excluded_given_cardiologist_npis)))
0 cardiologists in our prospect list must be excluded
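
The row-by-row loop above can also be written as a single vectorized membership test. This is a sketch under assumptions: the frames and NPIs are made up, `excluded_npis` is a hypothetical helper, and both columns are assumed to hold whole-number NPIs (a float column would need converting first, since `1000000002.0` would not match as text).

```python
import pandas as pd

# Hypothetical miniature versions of the cardiologist list and the exclusion list
cardiologists = pd.DataFrame({"npi": [1000000001, 1000000002]})
exclusions = pd.DataFrame({"NPI": [1000000002, 1000000009]})

def excluded_npis(candidates, exclusion_list):
    """Return candidate NPIs that also appear in the exclusion list's NPI column."""
    # Compare as text so an int column and a string column still line up
    candidate_ids = candidates["npi"].astype(str)
    excluded_ids = set(exclusion_list["NPI"].astype(str))
    return candidates.loc[candidate_ids.isin(excluded_ids), "npi"].tolist()

flagged = excluded_npis(cardiologists, exclusions)
print(len(flagged))
```

Matching on NPI alone only catches exclusion records that carry an NPI; whether every record in the exclusion file does is not verified here.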

So... now we have 8 cardiologists, but how do we prioritize them? Even if we could invite all eight, we should focus our energy on top performers.

In [89]:
given_cardioligsts.to_csv("data/given_cardiologists_plus_national_downloadable_file.csv", index=False)