library(dplyr)
library(ggplot2)
library(magrittr)
library(DT)

options(stringsAsFactors = FALSE)

write.delim <- function(x, file, sep='\t', quote = FALSE, row.names=FALSE, na = '', ...) {
  write.table(x = x, file = file, sep=sep, quote=quote, row.names=row.names, na=na, ...)
}

Here, we parse the MEDI resource (1), which can be downloaded here. The resource is described (unless noted, blockquotes refer to material from the MEDI publication):

We processed four public medication resources, RxNorm, Side Effect Resource (SIDER) 2, MedlinePlus, and Wikipedia, to create MEDI. We applied natural language processing and ontology relationships to extract indications for prescribable, single-ingredient medication concepts and all ingredient concepts as defined by RxNorm. Indications were coded as Unified Medical Language System (UMLS) concepts and International Classification of Diseases, 9th edition (ICD9) codes. A total of 689 extracted indications were randomly selected for manual review for accuracy using dual-physician review. We identified a subset of medication–indication pairs that optimizes recall while maintaining high precision.

High precision subset (HPS)

The MEDI high-precision subset (MEDI-HPS) includes indications found within either RxNorm or at least two of the three other resources. MEDI-HPS contains 13,304 unique indication pairs regarding 2,136 medications. The mean±SD number of indications for each medication in MEDI-HPS is 6.22±6.09. The estimated precision of MEDI-HPS is 92%.

# Read the HPS dataset
hps.df <- file.path('download', 'MEDI_01212013_HPS_0.csv') %>%
  read.csv(blank.lines.skip = TRUE, colClasses = 'character') %>%
  dplyr::transmute(
    rxnorm_id = RXCUI_IN,
    drug_name = DRUG_DESC,
    disease_icd9 = ICD9,
    disease_name = INDICATION_DESCRIPTION,
    on_label = as.integer(POSSIBLE_LABEL_USE)
    )

# Calculate HPS counts
hps_count.df <- hps.df %>%
  dplyr::distinct(rxnorm_id, disease_icd9) %>%
  dplyr::summarize(
    resource = 'hps',
    medications = n_distinct(rxnorm_id),
    diseases = n_distinct(disease_icd9),
    indications = n())

# Display the HPS
hps.df %>% DT::datatable()

All Indications

MEDI contains 3,112 medications and 63,343 medication–indication pairs. Wikipedia was the largest resource, with 2608 medications and 34,911 pairs. For each resource, estimated precision and recall, respectively, were 94% and 20% for RxNorm, 75% and 33% for MedlinePlus, 67% and 31% for SIDER 2, and 56% and 51% for Wikipedia.

# Read the general dataset
umls.df <- file.path('download', 'MEDI_01212013_UMLS.csv.gz') %>%
  read.csv(blank.lines.skip = TRUE, colClasses = 'character') %>%
  dplyr::transmute(
    rxnorm_id = RXCUI_IN,
    drug_name = DRUG_DESC,
    disease_cui = UMLS_CUI,
    disease_icd9 = ICD9,
    disease_name = INDICATION_DESCRIPTION,
    n_resources = MENTIONEDBYRESOURCES,
    hps = as.integer(HIGHPRECISIONSUBSET),
    on_label = as.integer(POSSIBLE_LABEL_USE)
    )

write.delim(umls.df, file.path('data', 'medi-umls.tsv'))

umls.df %>% dplyr::distinct(rxnorm_id, disease_icd9) %>% nrow()

## [1] 63343

umls.df %>% dplyr::distinct(rxnorm_id, disease_cui) %>% nrow()

## [1] 53106

# Display 1000 rows of the general dataset
umls.df %>% dplyr::sample_n(1000) %>% DT::datatable()

Prevalence

A seperate analysis identified indication prevalence for the MEDI indications (2). MEDI appeared to have good medication coverage:

MEDI covered 97.3% of medications recorded in medical records. The “high precision subset” of MEDI covered 93.8% of recorded medications.

# Read the prevalence information for each indication
prev.df <- file.path('download', 'MEDI_wPrevalence_Published.csv.gz') %>%
  read.csv(blank.lines.skip = TRUE, colClasses = 'character') %>%
  dplyr::transmute(
    rxnorm_id = RXCUI_IN,
    drug_name = DRUG_DESC,
    disease_icd9 = ICD9,
    disease_name = ICD9_STR,
    hps = as.integer(HSP),
    prevalance = as.numeric(PREVALENCE)
    )

# collapse by indications (disease-drug pairs) and display 1000 rows
prev.df %>%
  dplyr::filter(! is.na(prevalance))  %>%
  dplyr::group_by(disease_icd9, rxnorm_id)  %>%
  dplyr::summarize(prevalance = mean(prevalance)) %>% 
  dplyr::sample_n(1000) %>% DT::datatable()

References

1. Wei W-Q, Cronin RM, Xu H, Lasko TA, Bastarache L, Denny JC (2013) Development and evaluation of an ensemble resource linking medications to their indications. Journal of the American Medical Informatics Association. doi:10.1136/amiajnl-2012-001431

2. Wei W-Q, Mosley JD, Bastarache L, Denny JC 2013. Validation and enhancement of a Computable Medication Indication Resource (MEDI) using a large practice-based dataset. In AMIA Annual Symposium Proceedings. Available: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3900157/

To the extent possible under law, Daniel Himmelstein has waived all copyright and related or neighboring rights to Indications. This work is published from: United States.

Parsing MEDI

Daniel Himmelstein, Leo Brueggeman, Sergio Baranzini

April 7, 2015

High precision subset (HPS)

All Indications

Prevalence

References