Mapping DrugBank Compounds with UniChem

For more information, or to cite this work, please see:

Himmelstein DS, Brueggeman LA, Baranzini SE (2015) Repurposing drugs on a heterogeneous network. ThinkLab. doi:10.15363/thinklab.4

Specifically, this notebook corresponds to a discussion on unifying compound vocabularies using UniChem.

First, we used UniChem to match DrugBank compounds to external resources. We used the UniChem connectivity search, which allows fuzzy matching: compounds are matched on shared atom connectivity rather than on exact structure.
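
To give a rough feel for what matching on connectivity means, the sketch below compares only the first block of a Standard InChIKey, which hashes the molecular skeleton, while the second block covers features such as stereochemistry and isotopes. This is an illustration under that assumption and not UniChem's actual algorithm, which to our understanding works on the InChI connectivity layer and also handles mixture components. The three keys are hand-entered examples, not values read from this notebook's data.

```r
# Hand-entered example InChIKeys (assumed for illustration only).
keys <- c(
  ibuprofen    = 'HEFNNWSXXWATRW-UHFFFAOYSA-N',
  dexibuprofen = 'HEFNNWSXXWATRW-JTQLQIEISA-N',
  aspirin      = 'BSYNRYMUTXBXSQ-UHFFFAOYSA-N'
)

# The first 14 characters form the skeleton (connectivity) block.
skeleton <- substr(keys, 1, 14)

# Pairwise comparison: TRUE where two compounds share a skeleton block.
same.skeleton <- outer(skeleton, skeleton, '==')
print(same.skeleton)
```

Under this simplified view, the two ibuprofen forms pair up because they differ only in the second block, while aspirin matches nothing but itself.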

library(dplyr)
library(ggplot2)
library(DT)
library(reshape2)

Read all DrugBank approved small molecules with annotated structures.

drugbank.df <- file.path('data', 'drugbank.tsv') %>%
  read.delim(stringsAsFactors=FALSE, na.strings='') %>%
  dplyr::filter(type == 'small molecule') %>%
  dplyr::filter(grepl('approved', groups)) %>%
  dplyr::filter(! is.na(inchikey))

count.df <- file.path('data', 'mapping-counts.tsv') %>%
  read.delim(stringsAsFactors=FALSE, check.names=FALSE)

count.df <- drugbank.df %>%
  dplyr::rename(drugbank_name = name) %>%
  dplyr::left_join(count.df)

## Joining by: c("drugbank_id", "drugbank_name")

The external sources that we want to map to

sources <- c('chembl', 'drugbank', 'fdasrs', 'pubchem', 'lincs')

Percent of approved small molecules in DrugBank matched to each external resource

count.df %>%
  dplyr::select(one_of(sources)) %>%
  dplyr::summarise_each(funs(mean(. > 0) * 100)) %>%
  knitr::kable()

chembl	drugbank	fdasrs	pubchem	lincs
97.55155	99.67784	82.28093	99.09794	61.08247

Five DrugBank approved small molecules (0.32%) do not map back to DrugBank itself. This appears to occur because these compounds lack structural information in the DrugBank database.
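
The percentages above come from `mean(. > 0) * 100`: comparing a count column to zero gives a logical vector, and its mean is the fraction of compounds with at least one match. A minimal base R sketch on a made-up count table shows the calculation and one way the unmatched compounds could be listed. The `toy.df` IDs and values are hypothetical and are not taken from `mapping-counts.tsv`.

```r
# Hypothetical counts for four compounds; values are invented for illustration.
toy.df <- data.frame(
  drugbank_id = c('DB_A', 'DB_B', 'DB_C', 'DB_D'),
  drugbank = c(1, 2, 0, 1),
  lincs = c(0, 3, 0, 1),
  stringsAsFactors = FALSE
)

# Fraction of rows with at least one match, expressed as a percentage.
percent.matched <- colMeans(toy.df[, c('drugbank', 'lincs')] > 0) * 100
print(percent.matched)

# Compounds with no DrugBank match at all.
unmatched <- toy.df$drugbank_id[toy.df$drugbank == 0]
print(unmatched)
```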

The number of compounds matching each approved small molecule in DrugBank

count.df %>%
  dplyr::select(drugbank_id, drugbank_name, one_of(sources)) %>%
  DT::datatable()

The distribution of compounds mapped per approved small molecule in DrugBank

count.df %>%
  dplyr::select(one_of(sources)) %>%
  reshape2::melt(variable.name = 'source', value.name = 'count') %>%
  ggplot(aes(count)) + theme_bw() +
    geom_histogram(binwidth = 1, boundary = -0.5, alpha = 0.6, col='black') +
    facet_wrap(~ source, scales='free_x', nrow=1) +
    xlab('Matches per DrugBank compound') + ylab('Count')
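
The same kind of distribution can also be inspected as a table of frequencies rather than a plot. The sketch below uses base R on invented counts (the `toy.counts` vector is hypothetical) and fixes the factor levels so that match counts with no observations still appear as zero.

```r
# Hypothetical matches-per-compound values for a single source.
toy.counts <- c(0, 1, 1, 1, 2, 4)

# Tabulate every integer from 0 to the maximum, including empty bins.
freq <- table(factor(toy.counts, levels = 0:max(toy.counts)))
print(freq)
```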

Read the mapping file

mapping.df <- file.path('data', 'mapping.tsv.gz') %>%
  read.delim(stringsAsFactors=FALSE)

Approved small molecules in DrugBank that matched multiple DrugBank IDs

mapping.df %>%
  dplyr::filter(source_name == 'drugbank') %>%
  dplyr::inner_join(drugbank.df %>% dplyr::select(drugbank_id, type, groups)) %>%
  dplyr::group_by(drugbank_id, drugbank_name) %>%
  dplyr::summarise(
    n_matches = n(),
    matches = paste(src_compound_id, collapse = '|')
    ) %>%
  dplyr::filter(n_matches > 1) %>%
  DT::datatable()

## Joining by: "drugbank_id"
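
A compound with a structure would presumably match its own DrugBank entry, so `n_matches > 1` flags compounds that also matched at least one other DrugBank ID. The grouping step can be sketched in base R on a toy mapping table. The IDs below are invented for illustration, and only the column names `drugbank_id` and `src_compound_id` come from the notebook.

```r
# Hypothetical mapping rows: one row per (query compound, matched DrugBank ID).
toy.map <- data.frame(
  drugbank_id = c('DB_A', 'DB_A', 'DB_B', 'DB_C', 'DB_C', 'DB_C'),
  src_compound_id = c('DB_A', 'DB_X', 'DB_B', 'DB_C', 'DB_Y', 'DB_Z'),
  stringsAsFactors = FALSE
)

# Per query compound: collapse the matched IDs and count them.
matches <- tapply(toy.map$src_compound_id, toy.map$drugbank_id,
                  paste, collapse = '|')
n.matches <- tapply(toy.map$src_compound_id, toy.map$drugbank_id, length)

multi.df <- data.frame(
  drugbank_id = names(matches),
  n_matches = as.integer(n.matches),
  matches = as.character(matches),
  stringsAsFactors = FALSE
)

# Keep only compounds that matched more than one DrugBank ID.
multi.df <- multi.df[multi.df$n_matches > 1, ]
print(multi.df)
```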