Specifically, this notebook corresponds to a discussion on unifying compound vocabularies using UniChem.

First, we used UniChem to match DrugBank compounds to external resources. We used the UniChem connectivity search, which allows fuzzy matching: compounds are matched on atom connectivity rather than on their full structure.
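To illustrate what matching on connectivity means, here is a minimal sketch in base R. It assumes, as a simplification, that two structures share connectivity when the first 14-character block of their InChIKeys agrees. The keys are made-up placeholders, and this is not UniChem's own implementation.

```r
# Hypothetical placeholder keys, not real compounds. An InChIKey has the
# form <14-char connectivity block>-<10-char block>-<1 char>.
keys <- c(
  query        = 'AAAAAAAAAAAAAA-BBBBBBBBSA-N',
  stereoisomer = 'AAAAAAAAAAAAAA-CCCCCCCCSA-N',
  unrelated    = 'ZZZZZZZZZZZZZZ-BBBBBBBBSA-N'
)

# Keep only the connectivity block (the first 14 characters).
connectivity <- substr(keys, 1, 14)

# Exact matching requires the full key to agree.
exact.match <- keys == keys['query']

# Fuzzy matching only requires the connectivity block to agree.
fuzzy.match <- connectivity == connectivity['query']

print(rbind(exact.match, fuzzy.match))
```

In this toy case the placeholder stereoisomer is missed by exact matching but recovered by the connectivity comparison, while the unrelated key is rejected by both.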


Read all DrugBank approved small molecules with annotated structures.

drugbank.df <- file.path('data', 'drugbank.tsv') %>%
  read.delim(stringsAsFactors=FALSE, na.strings='') %>%
  dplyr::filter(type == 'small molecule') %>%
  dplyr::filter(grepl('approved', groups)) %>%

count.df <- file.path('data', 'mapping-counts.tsv') %>%
  read.delim(stringsAsFactors=FALSE, check.names=FALSE)

count.df <- drugbank.df %>%
  dplyr::rename(drugbank_name = name) %>%

The external sources that we want to map to:

sources <- c('chembl', 'drugbank', 'fdasrs', 'pubchem', 'lincs')

Percentage of approved small molecules in DrugBank matched to each external resource

count.df %>%
  dplyr::select(one_of(sources)) %>%
  dplyr::summarise_each(funs(mean(. > 0) * 100))

  chembl drugbank   fdasrs  pubchem    lincs
97.55155 99.67784 82.28093 99.09794 61.08247
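
The summary above relies on the mean of a logical vector being the fraction of TRUE values. Here is a minimal sketch of the same calculation in base R on a made-up count table. The `toy.df` values and identifiers are hypothetical and are not taken from `mapping-counts.tsv`.

```r
# Hypothetical mapping counts: one row per compound, one column per resource.
toy.df <- data.frame(
  drugbank_id = c('X1', 'X2', 'X3', 'X4'),
  chembl      = c(1, 2, 0, 1),
  lincs       = c(0, 0, 3, 1),
  stringsAsFactors = FALSE
)
toy.sources <- c('chembl', 'lincs')

# A compound is matched when its count is above zero; the column mean of the
# resulting logical matrix is the matched fraction.
percent.matched <- colMeans(toy.df[, toy.sources] > 0) * 100
print(percent.matched)
```

A compound with several matches in one resource still counts once, because only the comparison `> 0` enters the mean.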

Five approved small molecules in DrugBank do not map back to DrugBank itself. This appears to occur because the DrugBank records for these compounds contain no structural information.
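
As a consistency check, the count of five can be related to the percentages reported above. The sketch below assumes that all five percentages share the same denominator. The total it derives is an inference from the rounded output, not a figure stated in this notebook.

```r
# Percentages copied from the output above.
percent.matched <- c(chembl = 97.55155, drugbank = 99.67784,
                     fdasrs = 82.28093, pubchem = 99.09794, lincs = 61.08247)

# Five compounds are unmatched in DrugBank, so solve 5 / n = unmatched fraction.
n.total <- round(5 / (1 - percent.matched[['drugbank']] / 100))

# Implied number of unmatched compounds per resource (assumes a shared total).
n.unmatched <- round((1 - percent.matched / 100) * n.total)

print(n.total)
print(n.unmatched)
```

Under that assumption the implied total is 1552 compounds, and every percentage corresponds to a whole number of unmatched compounds, which suggests the figures are mutually consistent.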

The number of compounds in each resource that match each approved small molecule in DrugBank

count.df %>%
  dplyr::select(drugbank_id, drugbank_name, one_of(sources)) %>%