For more info or as a citation, please see:
Specifically, this notebook corresponds to a discussion on unifying compound vocabularies using UniChem.
First, we used UniChem to match DrugBank compounds to external resources. We used the UniChem connectivity search, which allows fuzzy matching.
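To give a rough sense of what fuzzy matching on connectivity means, here is a minimal sketch in base R. It is not UniChem's implementation: it assumes that two compounds count as connectivity matches when the first 14-character block of their InChIKeys (the block that hashes the molecular skeleton) is identical. The InChIKeys, identifiers, and the `skeleton` helper below are all made up for illustration.

```r
# Hypothetical helper: extract the first InChIKey block (14 characters).
skeleton <- function(inchikey) substr(inchikey, 1, 14)

# Made-up query compound; the key has the right shape but is not a real InChIKey.
query <- data.frame(
  drugbank_id = 'DB_EXAMPLE',
  inchikey = 'AAAAAAAAAAAAAA-UHFFFAOYSA-N',
  stringsAsFactors = FALSE
)

# Made-up external records.
external <- data.frame(
  src_compound_id = c('X1', 'X2', 'X3'),
  inchikey = c(
    'AAAAAAAAAAAAAA-UHFFFAOYSA-N',  # identical key
    'AAAAAAAAAAAAAA-ZZZZZZZZSA-N',  # same first block, different second block
    'CCCCCCCCCCCCCC-UHFFFAOYSA-N'   # different first block
  ),
  stringsAsFactors = FALSE
)

query$skeleton <- skeleton(query$inchikey)
external$skeleton <- skeleton(external$inchikey)

# Exact matching requires the full key; fuzzy matching only the first block.
exact <- merge(query, external, by = 'inchikey')
fuzzy <- merge(query, external, by = 'skeleton')

exact$src_compound_id
fuzzy$src_compound_id
```

Under this assumption, the fuzzy join also returns records that differ from the query only in the second InChIKey block, which is why a single DrugBank compound can map to several compounds in one resource.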
library(dplyr)
library(ggplot2)
library(DT)
library(reshape2)
# Read DrugBank and keep approved small molecules that have an InChIKey
drugbank.df <- file.path('data', 'drugbank.tsv') %>%
  read.delim(stringsAsFactors=FALSE, na.strings='') %>%
  dplyr::filter(type == 'small molecule') %>%
  dplyr::filter(grepl('approved', groups)) %>%
  dplyr::filter(! is.na(inchikey))
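# Read the number of matches per DrugBank compound in each resource
count.df <- file.path('data', 'mapping-counts.tsv') %>%
  read.delim(stringsAsFactors=FALSE, check.names=FALSE)

# Restrict the counts to the filtered DrugBank compounds
count.df <- drugbank.df %>%
  dplyr::rename(drugbank_name = name) %>%
  dplyr::left_join(count.df)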
## Joining by: c("drugbank_id", "drugbank_name")
sources <- c('chembl', 'drugbank', 'fdasrs', 'pubchem', 'lincs')
# Percentage of compounds with at least one match in each resource
count.df %>%
  dplyr::select(one_of(sources)) %>%
  dplyr::summarise_each(funs(mean(. > 0) * 100)) %>%
  knitr::kable()
| chembl | drugbank | fdasrs | pubchem | lincs |
|---|---|---|---|---|
| 97.55155 | 99.67784 | 82.28093 | 99.09794 | 61.08247 |
Five approved DrugBank small molecules do not map to DrugBank itself. This appears to occur because the DrugBank database contains no structural information for these compounds.
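One way to see which compounds are affected is to keep the rows with zero DrugBank matches. The sketch below uses a made-up stand-in for `count.df` (the `toy.count.df` name, identifiers, and counts are hypothetical) and base R subsetting so that it runs without the notebook's data files; the same condition could be applied to the real `count.df`.

```r
# Toy stand-in for count.df; identifiers and counts are made up.
toy.count.df <- data.frame(
  drugbank_id = c('DB_A', 'DB_B', 'DB_C'),
  drugbank_name = c('compound A', 'compound B', 'compound C'),
  drugbank = c(1, 0, 2),
  stringsAsFactors = FALSE
)

# Keep only compounds with zero matches in the drugbank source.
unmapped.df <- toy.count.df[
  toy.count.df$drugbank == 0,
  c('drugbank_id', 'drugbank_name')
]
unmapped.df
```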
count.df %>%
  dplyr::select(drugbank_id, drugbank_name, one_of(sources)) %>%
  DT::datatable()
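# Distribution of the number of matches per compound, one panel per resource
count.df %>%
  dplyr::select(one_of(sources)) %>%
  reshape2::melt(variable.name = 'source', value.name = 'count') %>%
  ggplot(aes(count)) + theme_bw() +
  # boundary replaces the deprecated origin argument; bins stay centered on integers
  geom_histogram(binwidth = 1, boundary = -0.5, alpha = 0.6, col='black') +
  facet_wrap(~ source, scales='free_x', nrow=1) +
  xlab('Matches per DrugBank compound') + ylab('Count')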
mapping.df <- file.path('data', 'mapping.tsv.gz') %>%
  read.delim(stringsAsFactors=FALSE)

# DrugBank compounds that match more than one DrugBank entry
mapping.df %>%
  dplyr::filter(source_name == 'drugbank') %>%
  dplyr::inner_join(drugbank.df %>% dplyr::select(drugbank_id, type, groups)) %>%
  dplyr::group_by(drugbank_id, drugbank_name) %>%
  dplyr::summarise(
    n_matches = n(),
    matches = paste(src_compound_id, collapse = '|')
  ) %>%
  dplyr::filter(n_matches > 1) %>%
  DT::datatable()
## Joining by: "drugbank_id"