Here, we compute the pairwise similarities between compounds. We compute drug-drug similarities separately based on three different types of information:
The method for calculating side effect and indication similarity is based on the method from a previous analysis (1). Substructure similarity is calculated externally using PubChem.
We use dplyr and reshape2 for data manipulation; ggplot2 and ggdendro for plotting; and DT for displaying html tables using the Javascript DataTables library.
library(reshape2)
library(dplyr)
library(ggplot2)
library(ggdendro)
library(DT)
options(stringsAsFactors=FALSE)
doMC::registerDoMC(cores = 6)
write.delim <- function(x, file, sep='\t', quote = FALSE, row.names=FALSE, na = '', ...) {
write.table(x = x, file = file, sep=sep, quote=quote, row.names=row.names, na=na, ...)
}
We use PubChem’s score matrix service. We use the 2D substructure search:
2D Similarity: Substructure key-based 2D Tanimoto similarity. Scores are in the range [0 .. 100].
PubChem provides additional explanation of the keys used to describe substructure:
There are 881 substructure-keys (skeys) in each fingerprint. Each bit in the fingerprint represents the presence (or absence) of a particular chemical substructure (e.g., a carboxylic acid) or a particular count of the same. These skeys are similar in nature to the well-known MDL MACCS skeys fingerprints.
To perform the search, we selected Upload a file with IDs… under ID List
and choose sider_compounds_pubchem.txt
. We selected Id-Id Score
for output format and gzip
for compression.
structure.df <- file.path('..', 'data', 'pubchem_score_matrix_2d.tab.gz') %>%
read.table(col.names = c('compound_1', 'compound_2', 'substructure')) %>%
dplyr::mutate(substructure = substructure / 100) %>%
dplyr::filter(compound_1 != compound_2)
Previousely, we processed data from the from the SIDER 2 resource (2). Here, we read the data and convert from a pairwise format to a binary matrix.
sider.df <- file.path('..', 'data', 'sider2-processed.txt') %>% read.delim(na.strings='')
# Convert into matrixes
se.mat <- xtabs(formula = ~ pubchem_cid + concept_id ,
data = dplyr::filter(sider.df, type == 'side_effect')) %>% as.matrix()
ind.mat <- xtabs(formula = ~ pubchem_cid + concept_id ,
data = dplyr::filter(sider.df, type == 'indication')) %>% as.matrix()
We adopted a similarity metric that assesses the extent of concept (either side effect or indication) overlap between two compounds. The method weights concepts based on their rarity (relating to few compounds) and underrepresentation (dissimilarity compared to other concetps).
Rarity weighting was performed as previousely described (1):
Our weighting scheme consists of two parts, a rareness weight and a correlation weight. The recorded side effects vary greatly in abundance: some, like megaloblastic 4 anaemia, occur in only a few package inserts, while others, like dizziness, appear in most. To be able to account for this, we examined the relation between side-effect frequency and the probability of sharing a drug target within a reference set of 502 drugs with 4857 known human drug–target relations (Fig. S1). We observed an inverse correlation between side-effect frequency and the likelihood of two drugs sharing a protein target. Consequently, the rareness weight for a side effect, \(r_i\), is defined as the negative logarithm of the side-effect frequency (Fig. S1D). (This frequency refers to the fraction of package inserts that feature a certain side effect, not the relative occurrence of the side effect in patients.)
rarity.se.vec <- -log(colMeans(se.mat))
rarity.ind.vec <- -log(colMeans(ind.mat))
dplyr::bind_rows(
data.frame(rarity=rarity.se.vec, type='side_effect'),
data.frame(rarity=rarity.ind.vec, type='indication')) %>%
ggplot(aes(x = rarity)) + theme_bw() +
facet_grid(. ~ type) +
geom_histogram()
Underrepresentation weighting was also performed as previousely described (1), except that Ward’s method (3) was used for heirarchical clustering:
Not all side effects are independent of each other; for example, 90% of drugs that cause nausea also cause vomiting. We correct for this redundancy by weighting side effects in a manner analogous to the down-weighting of similar protein sequences within multiple alignments (4) (Fig. S1C). In order to determine the correlation weight, the correlation of side effects was determined by clustering all side effects according to their assigned drugs using a Tanimoto/Jacquard score to compute a distance matrix: The distance between two
drugs[side effects] was calculated by dividing the number of drugs that feature both side effects by the number of drugs that have either side effect associated. The Gerstein–Sonnhammer–Chothia algorithm (4) was used to compute weights based on a hierarch[ic]al clustering with the aforementioned distance matrix (5).
# Uniform wieghting until a better scheme is implemented
underrep.se.vec <- rep(1, ncol(se.mat))
underrep.ind.vec <- rep(1, ncol(ind.mat))
GitHubScript <- function(...) {
# Source a script from GitHub
library(RCurl)
github.url <- file.path('https://raw.githubusercontent.com', ...)
script <- RCurl::getURL(github.url)
eval(parse(text = script), envir = .GlobalEnv)
}
CalcUnderrep <- function(mat) {
# Returns underrepresentation weight for each column
col.dist <- stats::dist(t(mat), method = 'binary')
col.clust <- hclust(col.dist, method = 'ward.D2')
col.dendro <- as.dendrogram(col.clust)
GitHubScript('antoine-lizee', 'R-GSC', 'master', 'GSC.R')
gsc.weight <- GSC(col.dendro)
list(dist = col.dist, clust = col.clust, dendro = col.dendro, weight = gsc.weight)
}
underrep.se <- CalcUnderrep(se.mat)
underrep.ind <- CalcUnderrep(ind.mat)
ggdendro.se <- ggdendro::dendro_data(underrep.se$clust)
ggdendro.ind <- ggdendro::dendro_data(underrep.ind$clust)
dendro.df <- dplyr::bind_rows(
ggdendro.se %>% ggdendro::segment() %>% dplyr::mutate(type='Side effects'),
ggdendro.ind %>% ggdendro::segment() %>% dplyr::mutate(type='Indications'))
leaf.df <- dplyr::bind_rows(
ggdendro.se %>% ggdendro::label() %>% dplyr::left_join(
y = data.frame(label = names(underrep.se$weight), gsc_weight = underrep.se$weight)) %>%
dplyr::mutate(type='Side effects'),
ggdendro.ind %>% ggdendro::label() %>% dplyr::left_join(
y = data.frame(label = names(underrep.ind$weight), gsc_weight = underrep.ind$weight)) %>%
dplyr::mutate(type='Indications'))
# plot denogrograms
dendro.df %>% ggplot(aes(x=x, y=y)) + theme_bw() +
facet_grid( ~ type, scales = 'free_x', space = 'free_x') +
geom_segment(aes(xend=xend, yend=yend), alpha=0.25) +
geom_point(data = leaf.df, aes(color=gsc_weight), alpha=0.04, size=3) +
xlab(NULL) + ylab('Distance') +
scale_color_gradientn(colours=c('#268bd2', '#dc322f'), name='GSC Weight') +
scale_x_continuous(breaks=NULL, expand = c(0.02, 0)) + scale_y_continuous(breaks=NULL, expand = c(0.04, 0))
# plot underrepresentation distributions
dplyr::bind_rows(
data.frame(underrepresentation=underrep.se$weight, type='side_effect'),
data.frame(underrepresentation=underrep.ind$weight, type='indication')) %>%
ggplot(aes(x = underrepresentation)) + theme_bw() +
facet_grid(. ~ type) +
geom_histogram()
Next, we computed a weighted Jaccard index (6) for each compound pair. This number, between 0 and 1, measures the similarity between two side effect/indication profiles.
SiderSimilarity <- function(x1, x2, w) {
# Compute the similarity between two boolean vectors
# using a weighted jaccard score.
sum(w[x1 & x2]) / sum(w[x1 | x2])
}
GetSimilarityDF <- function(mat, w) {
mat %>%
proxy::simil(
method = SiderSimilarity, w = w,
by_rows = TRUE, upper = FALSE, diag = FALSE) %>%
as.matrix() %>%
reshape2::melt(na.rm = TRUE, value.name = 'similarity', varnames = c('compound_1', 'compound_2')) %>%
dplyr::filter(compound_1 != compound_2) %>% # remove self-similarity (for safety, since `diag = FALSE` in proxy::simil)
dplyr::distinct(compound_1, compound_2) # proxy::simil returning a non-NA upper half, so duplicate rows exist
}
similarity.df <- dplyr::full_join(
x = GetSimilarityDF(mat = se.mat, w = rarity.se.vec * underrep.se$weight) %>%
dplyr::rename(side_effect = similarity),
y = GetSimilarityDF(mat = ind.mat, w = rarity.ind.vec * underrep.ind$weight) %>%
dplyr::rename(indication = similarity)) %>%
dplyr::full_join(y = structure.df) %>%
dplyr::arrange(compound_1, compound_2)
gz <- file.path('..', 'data', 'similarities.txt.gz') %>% gzfile('w')
similarity.df %>% write.delim(gz); close(gz)
gz <- file.path('..', 'data', 'similarities-complete.txt.gz') %>% gzfile('w')
similarity.df %>% na.omit() %T>% write.delim(gz) %>% PubchemDataTable(max.rows=200); close(gz)
Next, we evaluate the distirubtion of compound-compound similarities. Taking the squareroot of similarities provides a zero-inflated but otherwise normal-looking distribution.
# histogram of similarities
similarity.df %>%
na.omit() %>%
reshape2::melt(value.name = 'similarity', variable.name = 'type',
id.vars = c('compound_1', 'compound_2')) %>%
ggplot(aes(similarity)) +
facet_grid(. ~ type) +
geom_histogram(binwidth=0.02) +
scale_x_sqrt(breaks = seq(0, 1, 0.2)) + scale_y_sqrt() + theme_bw() +
xlab('Compound-Compound Similarity')
Next, we evaluate whether compound similarity measured by indications is associated with compound similarity measured by side effects.
# binned frequency plots
similarity.df %>% na.omit() %>%
ggplot(aes(side_effect, indication)) + theme_bw() +
stat_bin2d(aes(alpha=..count.., fill=..count..), bins = 75) +
scale_x_sqrt() + scale_y_sqrt() + coord_equal() +
scale_fill_gradientn(colours=c('#2aa198', '#6c71c4', '#d33682'), trans='log10', name='Compound Pairs') +
scale_alpha_continuous(range = c(0.3, 1), trans='log10', guide=FALSE) +
xlab('Side Effect Similarity') + ylab('Indication Similarity')
Finally, we fit a linear regression to assess the strength of the association. The association is positive, but has a small effect size. The great significance occurs despite the small effect size, since there are close to one million observations.
similarity.df %$% lm(sqrt(indication) ~ sqrt(side_effect)) %>% summary()
##
## Call:
## lm(formula = sqrt(indication) ~ sqrt(side_effect))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.14470 -0.03074 -0.02032 -0.00928 1.00602
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.0060173 0.0001855 -32.43 <2e-16 ***
## sqrt(side_effect) 0.1562346 0.0008798 177.58 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07799 on 965304 degrees of freedom
## (228200 observations deleted due to missingness)
## Multiple R-squared: 0.03163, Adjusted R-squared: 0.03163
## F-statistic: 3.153e+04 on 1 and 965304 DF, p-value: < 2.2e-16
similarity.df %>% na.omit() %>%
reshape2::melt(value.name = 'similarity', measure.vars=c('indication', 'side_effect'), variable.name = 'type') %>%
ggplot(aes(substructure, similarity)) + theme_bw() +
facet_grid(. ~ type) +
stat_bin2d(aes(alpha=..count.., fill=..count..), bins = 75) +
scale_x_sqrt() + scale_y_sqrt() + coord_equal() +
scale_fill_gradientn(colours=c('#2aa198', '#6c71c4', '#d33682'), trans='log10', name='Compound Pairs') +
scale_alpha_continuous(range = c(0.3, 1), trans='log10', guide=FALSE) +
xlab('Substructure Similarity') + ylab('SIDER 2 Drug Label Similarity')
# Predict indication similarity using substructure similarity
similarity.df %$% lm(sqrt(indication) ~ sqrt(substructure)) %>% summary()
##
## Call:
## lm(formula = sqrt(indication) ~ sqrt(substructure))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.03227 -0.02520 -0.02299 -0.01995 0.98621
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0117415 0.0003867 30.36 <2e-16 ***
## sqrt(substructure) 0.0205305 0.0006735 30.48 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07854 on 488564 degrees of freedom
## (704940 observations deleted due to missingness)
## Multiple R-squared: 0.001898, Adjusted R-squared: 0.001896
## F-statistic: 929.2 on 1 and 488564 DF, p-value: < 2.2e-16
# Predict side effect similarity using substructure similarity
similarity.df %$% lm(sqrt(side_effect) ~ sqrt(substructure)) %>% summary()
##
## Call:
## lm(formula = sqrt(side_effect) ~ sqrt(substructure))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.23208 -0.06241 -0.00374 0.05827 0.78685
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1300772 0.0004303 302.3 <2e-16 ***
## sqrt(substructure) 0.1063407 0.0007483 142.1 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08911 on 517651 degrees of freedom
## (675853 observations deleted due to missingness)
## Multiple R-squared: 0.03755, Adjusted R-squared: 0.03755
## F-statistic: 2.02e+04 on 1 and 517651 DF, p-value: < 2.2e-16
# Predict indication similarity using substructure and side effect similarity
similarity.df %$% lm(sqrt(indication) ~ sqrt(substructure) + sqrt(side_effect)) %>% summary()
##
## Call:
## lm(formula = sqrt(indication) ~ sqrt(substructure) + sqrt(side_effect))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.14446 -0.03039 -0.01999 -0.00899 1.00788
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.0092872 0.0004436 -20.93 < 2e-16 ***
## sqrt(substructure) 0.0057413 0.0007186 7.99 1.36e-15 ***
## sqrt(side_effect) 0.1546505 0.0012779 121.02 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07757 on 466092 degrees of freedom
## (727411 observations deleted due to missingness)
## Multiple R-squared: 0.03225, Adjusted R-squared: 0.03225
## F-statistic: 7767 on 2 and 466092 DF, p-value: < 2.2e-16
Similarity networks were created using Cytoscape (7). For each of the three networks, modules of similar compounds were constructed using MCL Clustering from the clusterMaker2 app (8). Structures, rendered from each compound’s SMILES, were superimposed onto the nodes using the chemViz2 app, although these diagrams are difficult to see in the video.
The CyAnimator app illustrates the transition from substructure similarity modules to indication similarity modules. The compounds from the largest structural similarity module are colored red. When compounds organized into indication similarity modules, the red nodes cosegregated indicating concordance between indication and structural similarity.