Notebook #1: Filtering for only legal cardiologists¶

This notebook...

deduplicates the given 100 NPIs (down to 97)
join in demographic information from data source #1 (data.medicare.gov physician compare)
filter out non-cardiologists
check to make sure none of the cardiologists are on the HHS OIG Exclusions list

# Converted list of NPIs to array for ease of use (could also import from CSV)
given_npis_array = ["1003892316","1013028315","1093806754","1104985845","1134108103","1134326697","1174629950","1184787020","1205852985","1376542811","1417912270","1477612752","1508894361","1518054915","1548217680","1598700627","1639174816","1639277130","1689627051","1699777334","1720006992","1720199763","1770563116","1861695678","1871562900","1932111424","1932135266","1932188521","1952460248","1992999031","1114980703","1215918800","1255334769","1891765574","1679568901","1083790117","1851401822","1902093693","1790975886","1598775538","1730289463","1417911314","1407972490","1811951734","1275604480","1346347028","1184679664","1285723304","1134123474","1215918743","1912909037","1245334309","1275511032","1093812133","1396799185","1447204243","1932205721","1679516314","1124089685","1902069958","1992999031","1619933132","1932198256","1184624785","1477533511","1205931722","1053391185","1831110923","1225020886","1053534610","1821158775","1649488461","1528133949","1962405142","1750359071","1225104318","1497930903","1619933132","1447274972","1144439316","1730150830","1134214091","1902822760","1336105238","1649469628","1699737932","1669685111","1306902739","1811180144","1437112604","1144315839","1053417808","1114189354","1932147626","1699737932","1437178274","1831104694","1073740825","1326080060","1760481808"]

import pandas as pd
import numpy as np

given_npis = pd.DataFrame(given_npis_array)
given_npis.columns = ['npi']
given_npis_count = len(given_npis)

# set pandas option to show the full dataset
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', None)

print "We received {0} NPIs".format(given_npis_count)
given_npis.drop_duplicates(inplace=True)
given_npis_count_dupes_dropped = len(given_npis)

print "After removing {0} duplicates, we are working with {1} NPIs".format(given_npis_count-given_npis_count_dupes_dropped, given_npis_count_dupes_dropped)

We received 100 NPIs
After removing 3 duplicates, we are working with 97 NPIs

# First order of business is to limit to only cardiologists
#   to do this, we need to know the specialty of the cardiologists 
#   which we can get from the data.medicare.gov National Provider File

# the dataset is big-ish at 673mb so I downloaded it in full and stored it as a flat file
#   alternatively, we could have grabbed data via API for just cardiologists
#   but I want the possibility of doing other analysis later
#      import urllib
#      all_card_providers_url = "https://data.medicare.gov/resource/aeay-dfax.json?$where={0}&$limit={1}".format(urllib.quote("pri_spec LIKE '%CARD%'"),999999999)
#      all_card_providers = pd.read_json(all_providers_url)
all_providers = pd.read_csv("data/data.medicare.gov_s63f-csi6_National-Downloadable_File.csv", low_memory=False)
all_providers.npi = all_providers.npi.astype(np.str)

print "{0} providers brought in from data.medicare.gov/d/s63f-csi6 file".format(len(all_providers))

2279287 providers brought in from data.medicare.gov/d/s63f-csi6 file

# do a left inner join on the two dataframes
# so we are left with a dataframe of only the 97 given NPIs and their demographic info
given_npis_plus_ndf = pd.merge(given_npis, all_providers, how='left', on=['npi'])
given_npis_plus_ndf.drop_duplicates(subset='npi', inplace=True) # see assumption about duplicates
print "Do we have as many NPIs in this joined dataset as we were given? {0}".format(len(given_npis_plus_ndf) == given_npis_count_dupes_dropped)

# TODO: prefix

Do we have as many NPIs in this joined dataset as we were given? True

# OK, let's store the given_npis_plus dataset for later use so we dont have to 
given_npis_plus_ndf.to_csv("data/given_npis_plus_national_downloadable_file.csv", index = False)
print "Saved given NPIs plus national downloadable file demographics to CSV for save keeping"

Saved given NPIs plus national downloadable file demographics to CSV for save keeping

Checkpoint (⚑) 1: Dataset of 97 NPIs with basic demographic info¶

Basic demographic information from source #1 merged in with the list of unique NPIs provided; a sample of the data is below

# Read in given_npis_plus_ndf from file
given_npis_plus_ndf = pd.read_csv("data/given_npis_plus_national_downloadable_file.csv")
given_npis_plus_ndf[0:2]

If the ACO is looking for only cardiologists, we should probably filter non-cardiologists out.

First let's see what specialties are represented in our given list of providers -- perhaps multiple specialities fall within cardiology and/or cardiology shows up in secondary specialties

specialty_fields = ['Primary specialty', 'Secondary specialty 1', 'Secondary specialty 2', 'Secondary specialty 3', 'Secondary specialty 4']

for field in specialty_fields:
    given_npis_plus_ndf[field].replace(np.nan, "NULL", inplace=True)
    print given_npis_plus_ndf.groupby([field]).size().order()
    print ""

Primary specialty
CARDIAC SURGERY                        1
NEUROSURGERY                           1
SPORTS MEDICINE                        1
VASCULAR SURGERY                       1
NULL                                   5
HEMATOLOGY/ONCOLOGY                    6
GENERAL SURGERY                        7
NEPHROLOGY                             7
PULMONARY DISEASE                      7
CARDIOVASCULAR DISEASE (CARDIOLOGY)    8
ENDOCRINOLOGY                          8
GASTROENTEROLOGY                       9
INTERNAL MEDICINE                      9
NEUROLOGY                              9
ORTHOPEDIC SURGERY                     9
RHEUMATOLOGY                           9
dtype: int64

Secondary specialty 1
CARDIAC SURGERY                  1
DIAGNOSTIC RADIOLOGY             1
EMERGENCY MEDICINE               1
HEMATOLOGY                       1
INFECTIOUS DISEASE               1
SLEEP LABORATORY/MEDICINE        1
NEPHROLOGY                       3
CRITICAL CARE (INTENSIVISTS)     9
INTERNAL MEDICINE               27
NULL                            52
dtype: int64

Secondary specialty 2
MEDICAL ONCOLOGY                1
PERIPHERAL VASCULAR DISEASE     1
PULMONARY DISEASE               3
INTERNAL MEDICINE               5
NULL                           87
dtype: int64

Secondary specialty 3
SLEEP LABORATORY/MEDICINE     1
NULL                         96
dtype: int64

Secondary specialty 4
NULL                     97
dtype: int64

CARD* shows up in primary and secondary specialty #1 -- let's see if the instance of secondary specialty is assigned to a provider with the primary specialty.

Note: there were 5 providers with no specialty which we will rule out per a documented assumption

==> there are 9 providers with a specialty related to the heart

8x 'CARDIOVASCULAR DISEASE (CARDIOLOGY)'
1x 'CARDIAC SURGERY'

print given_npis_plus_ndf.groupby(['Primary specialty','Secondary specialty 1']).size()[:5]

Primary specialty                    Secondary specialty 1       
CARDIAC SURGERY                      INTERNAL MEDICINE               1
CARDIOVASCULAR DISEASE (CARDIOLOGY)  CARDIAC SURGERY                 1
                                     CRITICAL CARE (INTENSIVISTS)    1
                                     INTERNAL MEDICINE               2
                                     NULL                            4
dtype: int64

The provider with a secondary specialty 1 of cardiac surgery has a primary specialty of cardiology.

Per the previously documented assumption, we are going to discard the cardiac surgeon bringing us to 8 potential providers to potentially invite to the ACO

given_cardioligsts = given_npis_plus_ndf[given_npis_plus_ndf['Primary specialty'] == 'CARDIOVASCULAR DISEASE (CARDIOLOGY)']
print "We have narrowed the list of {0} unique NPIs to {1} with a primary specialty of cardiology".format(len(given_npis_plus_ndf), len(given_cardioligsts))

We have narrowed the list of 97 unique NPIs to 8 with a primary specialty of cardiology

⚑ 2: Narrowed given NPIs to 8 Cardiologists:¶

given_cardioligsts

⚑ 8 cardiologists in the running (none are in the HHS OIG exclusion list)¶

excluded_entities = pd.read_csv("data/oig.hhs.gov_exclusion-list.csv", low_memory=False)
excluded_given_cardiologist_npis = []
for index, cardiologist in given_cardioligsts.iterrows():
    exclusion_matches = excluded_entities[excluded_entities['NPI'] == cardiologist['npi']]
    if len(exclusion_matches) > 0:
        excluded_given_cardiologist_npis.append(cardiologist['npi'])
#         print "/!\ DANGER /!\ {0} is in the exclusion list".format(cardiologist['npi'])
#     else:
#         print "It's all good, {0} is not in the exclusion list".format(cardiologist['npi'])

print "{0} cardiologists in our prospect list must be excluded".format(len(excluded_given_cardiologist_npis))

0 cardiologists in our prospect list must be excluded

So.. now we have 8 cardiologists but how do we prioritize them? Even if we could invite all eight, we should focus our energy on top performers.

given_cardioligsts.to_csv("data/given_cardiologists_plus_national_downloadable_file.csv", index=False)

	npi	PAC ID	Professional Enrollment ID	Last Name	First Name	Middle Name	Suffix	Gender	Credential	Medical school name	Graduation year	Primary specialty	Secondary specialty 1	Secondary specialty 2	Secondary specialty 3	Secondary specialty 4	All secondary specialties	Organization legal name	Organization DBA name	Group Practice PAC ID	Number of Group Practice members	Line 1 Street Address	Line 2 Street Address	Marker of address line 2 suppression	City	State	Zip Code	Claims based hospital affiliation CCN 1	Claims based hospital affiliation LBN 1	Claims based hospital affiliation CCN 2	Claims based hospital affiliation LBN 2	Claims based hospital affiliation CCN 3	Claims based hospital affiliation LBN 3	Claims based hospital affiliation CCN 4	Claims based hospital affiliation LBN 4	Claims based hospital affiliation CCN 5	Claims based hospital affiliation LBN 5	Professional accepts Medicare Assignment	Participating in eRx	Participating in PQRS	Participating in EHR	Received PQRS Maintenance of Certification Program Incentive	Participated in Million Hearts
0	1003892316	7214992767	I20090817000324	JANUMPALLY	LINGAIAH	NaN	NaN	M	NaN	OTHER	1974	NEUROLOGY	NaN	NaN	NaN	NaN	NaN	ANTELOPE VALLEY NEUROSCIENCE	NaN	345205944	6	913 W ALENE AVE B	NaN	N	RIDGECREST	CA	935552399	50056	ANTELOPE VALLEY HOSPITAL DISTRICT	50204	LANCASTER HOSPITAL CORPORATION	51333	RIDGECREST REGIONAL HOSPITAL	NaN	NaN	NaN	NaN	Y	NaN	Y	Y	NaN	NaN
1	1013028315	7517020688	I20090115000476	BARCELONA	EDGARDO	S	NaN	M	NaN	OTHER	1972	VASCULAR SURGERY	NaN	NaN	NaN	NaN	NaN	HIGH DESERT MEDICAL CORPORATION	NaN	6103730569	49	43839 N 15TH ST W	NaN	N	LANCASTER	CA	935344756	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Y	NaN	NaN	NaN	NaN	NaN

	npi	PAC ID	Professional Enrollment ID	Last Name	First Name	Middle Name	Suffix	Gender	Credential	Medical school name	Graduation year	Primary specialty	Secondary specialty 1	Secondary specialty 2	Secondary specialty 3	Secondary specialty 4	All secondary specialties	Organization legal name	Organization DBA name	Group Practice PAC ID	Number of Group Practice members	Line 1 Street Address	Line 2 Street Address	Marker of address line 2 suppression	City	State	Zip Code	Claims based hospital affiliation CCN 1	Claims based hospital affiliation LBN 1	Claims based hospital affiliation CCN 2	Claims based hospital affiliation LBN 2	Claims based hospital affiliation CCN 3	Claims based hospital affiliation LBN 3	Claims based hospital affiliation CCN 4	Claims based hospital affiliation LBN 4	Claims based hospital affiliation CCN 5	Claims based hospital affiliation LBN 5	Professional accepts Medicare Assignment	Participating in eRx	Participating in PQRS	Participating in EHR	Received PQRS Maintenance of Certification Program Incentive	Participated in Million Hearts
7	1184787020	6002875069	I20041006000639	KUMAR	ANIL	NaN	NaN	M	MD	OTHER	1977	CARDIOVASCULAR DISEASE (CARDIOLOGY)	NULL	NULL	NULL	NULL	NaN	KUMAR MEDICAL CORPORATION	NaN	1456389030	2	44215 15TH W ST	SUITE 215	N	LANCASTER	CA	93534	50056	ANTELOPE VALLEY HOSPITAL DISTRICT	50204	LANCASTER HOSPITAL CORPORATION	51333	RIDGECREST REGIONAL HOSPITAL	NaN	NaN	NaN	NaN	Y	NaN	Y	Y	NaN	NaN
11	1477612752	5193784155	I20041006000364	GILL	KANWALJIT	S	NaN	M	MD	OTHER	1990	CARDIOVASCULAR DISEASE (CARDIOLOGY)	INTERNAL MEDICINE	PERIPHERAL VASCULAR DISEASE	NULL	NULL	INTERNAL MEDICINE, PERIPHERAL VASCULAR DISEASE	KANWALJIT S. GILL MD INC	NaN	8729246806	1	38656 MEDICAL CTR DR A	NaN	N	PALMDALE	CA	935514483	50204	LANCASTER HOSPITAL CORPORATION	50056	ANTELOPE VALLEY HOSPITAL DISTRICT	51333	RIDGECREST REGIONAL HOSPITAL	NaN	NaN	NaN	NaN	Y	NaN	Y	Y	NaN	NaN
28	1952460248	648340653	I20080603000419	KHANAL	SANJAYA	NaN	NaN	M	NaN	OTHER	1991	CARDIOVASCULAR DISEASE (CARDIOLOGY)	CARDIAC SURGERY	NULL	NULL	NULL	CARDIAC SURGERY	ANTELOPE VALLEY CARDIOLOGY ASSOCIATES	NaN	7416054614	4	43723 20TH ST W	NaN	N	LANCASTER	CA	935344763	50056	ANTELOPE VALLEY HOSPITAL DISTRICT	50204	LANCASTER HOSPITAL CORPORATION	51333	RIDGECREST REGIONAL HOSPITAL	NaN	NaN	NaN	NaN	Y	NaN	Y	Y	NaN	NaN
30	1114980703	3779470182	I20040301001181	ALTURJUMAN	AHMAD	MOUTAZ	NaN	M	MD	OTHER	1989	CARDIOVASCULAR DISEASE (CARDIOLOGY)	CRITICAL CARE (INTENSIVISTS)	INTERNAL MEDICINE	NULL	NULL	CRITICAL CARE (INTENSIVISTS), INTERNAL MEDICINE	BLACKSTONE MEDICAL CORP.	NaN	5991725640	2	4500 BROCKTON AVE	SUITE 305	N	RIVERSIDE	CA	925014027	50022	RIVERSIDE HEALTHCARE SYSTEM, L.P.	50770	LOMA LINDA UNIVERSITY MEDICAL CENTER MURRIETA	50775	TEMECULA VALLEY HOSPITAL, INC.	50102	PARKVIEW COMMUNITY HOSPITAL MEDICAL CENTER	50327	SEVENTH DAY ADVENTISTS LOMA LINDA UNIVERSITY M...	Y	NaN	Y	Y	NaN	NaN
31	1215918800	9234125527	I20070823000730	SHENKMAN	HEATHER	J	NaN	F	MD	ALBANY MEDICAL COLLEGE OF UNION UNIVERSITY	1999	CARDIOVASCULAR DISEASE (CARDIOLOGY)	INTERNAL MEDICINE	NULL	NULL	NULL	INTERNAL MEDICINE	LAKESIDE MEDICAL ASSOCIATES, A MEDICAL GROUP, ...	NaN	1951202183	2	7345 MEDICAL CTR DR	SUITE 500	N	WEST HILLS	CA	913071964	50235	PROVIDENCE HEALTH SYSTEM-SOUTHERN CALIFORNIA	50761	PROVIDENCE HEALTH SYSTEM-SOUTHERN CALIFORNIA	50278	PROVIDENCE HEALTH SYSTEM - SOUTHERN CALIFORNIA	NaN	NaN	NaN	NaN	M	NaN	NaN	NaN	NaN	NaN
32	1255334769	3678470341	I20080307000417	ON	ROGER	C	NaN	M	NaN	TUFTS UNIVERSITY SCHOOL OF MEDICINE	1978	CARDIOVASCULAR DISEASE (CARDIOLOGY)	NULL	NULL	NULL	NULL	NaN	LAKESIDE MEDICAL ORGANIZATION, A MEDICAL GROUP...	NaN	7618005166	100	191 S BUENA VISTA	SUITE 440	N	BURBANK	CA	915054554	50235	PROVIDENCE HEALTH SYSTEM-SOUTHERN CALIFORNIA	50278	PROVIDENCE HEALTH SYSTEM - SOUTHERN CALIFORNIA	50761	PROVIDENCE HEALTH SYSTEM-SOUTHERN CALIFORNIA	NaN	NaN	NaN	NaN	Y	NaN	Y	Y	NaN	NaN
33	1891765574	3274585765	I20050211000931	GUPTA	VINOD	K	NaN	M	MD	UNIVERSITY OF CALIFORNIA, SAN FRANCISCO SCHOOL...	1966	CARDIOVASCULAR DISEASE (CARDIOLOGY)	NULL	NULL	NULL	NULL	NaN	HEART CENTER OF SOUTHERN CALIFORNIA INC	NaN	8820040314	1	2876 N SYCAMORE DR	NaN	N	SIMI VALLEY	CA	930651550	50236	SIMI VALLEY HOSPITAL & HEALTH CARE SERVICES	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	M	NaN	Y	Y	NaN	NaN
36	1851401822	6800077249	I20110218000225	SILVA	AUGUSTO	NaN	NaN	M	NaN	OTHER	1967	CARDIOVASCULAR DISEASE (CARDIOLOGY)	NULL	NULL	NULL	NULL	NaN	LAKESIDE MEDICAL ORGANIZATION, A MEDICAL GROUP...	NaN	7618005166	100	191 S BUENA VISTA ST	SUITE 400	N	BURBANK	CA	915054543	50235	PROVIDENCE HEALTH SYSTEM-SOUTHERN CALIFORNIA	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Y	NaN	NaN	NaN	NaN	NaN