For more info or as a citation, please see:

Specifically, this notebook corresponds to a discussion on unifying compound vocabularies using UniChem.

First, we used UniChem to match DrugBank compounds to external resources. We used the UniChem connectivity search, which allows fuzzy matching.

library(dplyr)
library(ggplot2)
library(DT)
library(reshape2)

Read all DrugBank approved small molecules with annotated structures.

drugbank.df <- file.path('data', 'drugbank.tsv') %>%
  read.delim(stringsAsFactors=FALSE, na.strings='') %>%
  dplyr::filter(type == 'small molecule') %>%
  dplyr::filter(grepl('approved', groups)) %>%
  dplyr::filter(! is.na(inchikey))

count.df <- file.path('data', 'mapping-counts.tsv') %>%
  read.delim(stringsAsFactors=FALSE, check.names=FALSE)

count.df <- drugbank.df %>%
  dplyr::rename(drugbank_name = name) %>%
  dplyr::left_join(count.df)
## Joining by: c("drugbank_id", "drugbank_name")

The external sources that we want to map to

sources <- c('chembl', 'drugbank', 'fdasrs', 'pubchem', 'lincs')

Percent of approved small molecules in DrugBank matched to each external resource

count.df %>%
  dplyr::select(one_of(sources)) %>%
  dplyr::summarise_each(funs(mean(. > 0) * 100)) %>%
  knitr::kable()
|   chembl| drugbank|   fdasrs|  pubchem|    lincs|
|--------:|--------:|--------:|--------:|--------:|
| 97.55155| 99.67784| 82.28093| 99.09794| 61.08247|

Five DrugBank approved small molecules do not map back to DrugBank itself. This appears to occur because these compounds do not contain structural information in the DrugBank database.
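
The percentages in the table are the share of compounds with at least one match in each source. A minimal sketch of that calculation in base R, using a made-up `toy.df` (the counts and the choice of two source columns are illustrative assumptions, not values from `mapping-counts.tsv`):

```r
# Hypothetical match counts for four compounds in two sources (illustrative only)
toy.df <- data.frame(
  chembl = c(1, 0, 3, 2),
  lincs  = c(0, 0, 1, 0)
)

# `toy.df > 0` is a logical matrix; its column means are the fraction of
# compounds with at least one match, which we scale to a percent
pct.matched <- colMeans(toy.df > 0) * 100
print(pct.matched)
```

This mirrors the `mean(. > 0) * 100` expression used in the notebook: a compound counts as matched whether it has one match or many.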

The number of compounds matching each approved small molecule in DrugBank

count.df %>%
  dplyr::select(drugbank_id, drugbank_name, one_of(sources)) %>%
  DT::datatable()

The distribution of compounds mapped per approved small molecule in DrugBank

count.df %>%
  dplyr::select(one_of(sources)) %>%
  reshape2::melt(variable.name = 'source', value.name = 'count') %>%
  ggplot(aes(count)) + theme_bw() +
    geom_histogram(binwidth = 1, boundary = -0.5, alpha = 0.6, col='black') +
    facet_wrap(~ source, scales='free_x', nrow=1) +
    xlab('Matches per DrugBank compound') + ylab('Count')
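
As a rough text equivalent of one histogram panel, the matches per compound can be tabulated directly. The sketch below uses a made-up vector `counts` (an assumption for illustration, not data from the notebook):

```r
# Hypothetical per-compound match counts for a single source (illustrative only)
counts <- c(1, 1, 0, 2, 1, 3, 0, 1)

# Number of compounds at each match count; these correspond to the bar heights
# of a histogram with unit-width bins centered on the integers
match.table <- table(counts)
print(match.table)
```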

Read the mapping file

mapping.df <- file.path('data', 'mapping.tsv.gz') %>%
  read.delim(stringsAsFactors=FALSE)

Small molecules in DrugBank that matched multiple DrugBank IDs

mapping.df %>%
  dplyr::filter(source_name == 'drugbank') %>%
  dplyr::inner_join(drugbank.df %>% dplyr::select(drugbank_id, type, groups)) %>%
  dplyr::group_by(drugbank_id, drugbank_name) %>%
  dplyr::summarise(
    n_matches = n(),
    matches = paste(src_compound_id, collapse = '|')
    ) %>%
  dplyr::filter(n_matches > 1) %>%
  DT::datatable()
## Joining by: "drugbank_id"
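
The group-and-collapse step above can be sketched on toy data. This version uses base R `tapply` rather than the dplyr pipeline in the notebook, and the data frame `toy.map` and its IDs are placeholders invented for illustration, not real DrugBank records:

```r
# Hypothetical mapping rows: one row per (query compound, matched DrugBank ID)
toy.map <- data.frame(
  drugbank_id     = c('DB_A', 'DB_A', 'DB_B', 'DB_C', 'DB_C', 'DB_C'),
  src_compound_id = c('DB_A', 'DB_X', 'DB_B', 'DB_C', 'DB_Y', 'DB_Z'),
  stringsAsFactors = FALSE
)

# Count matches per query compound and collapse the matched IDs into one string
n.matches <- tapply(toy.map$src_compound_id, toy.map$drugbank_id, length)
matches <- tapply(toy.map$src_compound_id, toy.map$drugbank_id,
                  paste, collapse = '|')

multi.df <- data.frame(
  drugbank_id = names(n.matches),
  n_matches = as.vector(n.matches),
  matches = as.vector(matches),
  stringsAsFactors = FALSE
)

# Keep only compounds that matched more than one DrugBank ID
multi.df <- multi.df[multi.df$n_matches > 1, ]
print(multi.df)
```

In this toy input, `DB_B` matches only itself and is dropped, while `DB_A` and `DB_C` are kept with their pipe-separated match lists.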