Here, we compute the pairwise similarities between compounds. We compute drug-drug similarities separately based on three different types of information:

side effects
indications
substructure

The method for calculating side effect and indication similarity is based on the method from a previous analysis (1). Substructure similarity is calculated externally using PubChem.

Load packages

We use dplyr and reshape2 for data manipulation; ggplot2 and ggdendro for plotting; and DT for displaying html tables using the Javascript DataTables library.

library(reshape2)
library(dplyr)
library(ggplot2)
library(ggdendro)
library(DT)

options(stringsAsFactors=FALSE)
doMC::registerDoMC(cores = 6)

write.delim <- function(x, file, sep='\t', quote = FALSE, row.names=FALSE, na = '', ...) {
  write.table(x = x, file = file, sep=sep, quote=quote, row.names=row.names, na=na, ...)
}

Substructure Similarity

We use PubChem’s score matrix service. We use the 2D substructure search:

2D Similarity: Substructure key-based 2D Tanimoto similarity. Scores are in the range [0 .. 100].

PubChem provides additional explanation of the keys used to describe substructure:

There are 881 substructure-keys (skeys) in each fingerprint. Each bit in the fingerprint represents the presence (or absence) of a particular chemical substructure (e.g., a carboxylic acid) or a particular count of the same. These skeys are similar in nature to the well-known MDL MACCS skeys fingerprints.

To perform the search, we selected Upload a file with IDs… under ID List and choose sider_compounds_pubchem.txt. We selected Id-Id Score for output format and gzipfor compression.

structure.df <- file.path('..', 'data', 'pubchem_score_matrix_2d.tab.gz') %>%
  read.table(col.names = c('compound_1', 'compound_2', 'substructure')) %>%
  dplyr::mutate(substructure = substructure / 100) %>%
  dplyr::filter(compound_1 != compound_2)

Read Processed SIDER 2 Data

Previousely, we processed data from the from the SIDER 2 resource (2). Here, we read the data and convert from a pairwise format to a binary matrix.

sider.df <- file.path('..', 'data', 'sider2-processed.txt') %>% read.delim(na.strings='')

# Convert into matrixes
se.mat <- xtabs(formula = ~ pubchem_cid + concept_id ,
  data = dplyr::filter(sider.df, type == 'side_effect')) %>% as.matrix()
ind.mat <- xtabs(formula = ~ pubchem_cid + concept_id ,
  data = dplyr::filter(sider.df, type == 'indication')) %>% as.matrix()

Compute Side Effect and Indication Similarity

We adopted a similarity metric that assesses the extent of concept (either side effect or indication) overlap between two compounds. The method weights concepts based on their rarity (relating to few compounds) and underrepresentation (dissimilarity compared to other concetps).

Rarity Weighting

Rarity weighting was performed as previousely described (1):

Our weighting scheme consists of two parts, a rareness weight and a correlation weight. The recorded side effects vary greatly in abundance: some, like megaloblastic 4 anaemia, occur in only a few package inserts, while others, like dizziness, appear in most. To be able to account for this, we examined the relation between side-effect frequency and the probability of sharing a drug target within a reference set of 502 drugs with 4857 known human drug–target relations (Fig. S1). We observed an inverse correlation between side-effect frequency and the likelihood of two drugs sharing a protein target. Consequently, the rareness weight for a side effect, \(r_i\), is defined as the negative logarithm of the side-effect frequency (Fig. S1D). (This frequency refers to the fraction of package inserts that feature a certain side effect, not the relative occurrence of the side effect in patients.)

rarity.se.vec <- -log(colMeans(se.mat))
rarity.ind.vec <- -log(colMeans(ind.mat))

dplyr::bind_rows(
  data.frame(rarity=rarity.se.vec, type='side_effect'),
  data.frame(rarity=rarity.ind.vec, type='indication')) %>%
  ggplot(aes(x = rarity)) + theme_bw() +
  facet_grid(. ~ type) +
  geom_histogram()

Underrepresentation Weighting

Underrepresentation weighting was also performed as previousely described (1), except that Ward’s method (3) was used for heirarchical clustering:

Not all side effects are independent of each other; for example, 90% of drugs that cause nausea also cause vomiting. We correct for this redundancy by weighting side effects in a manner analogous to the down-weighting of similar protein sequences within multiple alignments (4) (Fig. S1C). In order to determine the correlation weight, the correlation of side effects was determined by clustering all side effects according to their assigned drugs using a Tanimoto/Jacquard score to compute a distance matrix: The distance between two ~~drugs~~ [side effects] was calculated by dividing the number of drugs that feature both side effects by the number of drugs that have either side effect associated. The Gerstein–Sonnhammer–Chothia algorithm (4) was used to compute weights based on a hierarch[ic]al clustering with the aforementioned distance matrix (5).

# Uniform wieghting until a better scheme is implemented
underrep.se.vec <- rep(1, ncol(se.mat))
underrep.ind.vec <- rep(1, ncol(ind.mat))

GitHubScript <- function(...) {
  # Source a script from GitHub
  library(RCurl)
  github.url <- file.path('https://raw.githubusercontent.com', ...)
  script <- RCurl::getURL(github.url)
  eval(parse(text = script), envir = .GlobalEnv)
}

CalcUnderrep <- function(mat) {
  # Returns underrepresentation weight for each column
  col.dist <- stats::dist(t(mat), method = 'binary')
  col.clust <- hclust(col.dist, method = 'ward.D2')
  col.dendro <- as.dendrogram(col.clust)
  GitHubScript('antoine-lizee', 'R-GSC', 'master', 'GSC.R')
  gsc.weight <- GSC(col.dendro)
  list(dist = col.dist, clust = col.clust, dendro = col.dendro, weight = gsc.weight)
}

underrep.se <- CalcUnderrep(se.mat)
underrep.ind <- CalcUnderrep(ind.mat)

ggdendro.se <- ggdendro::dendro_data(underrep.se$clust)
ggdendro.ind <- ggdendro::dendro_data(underrep.ind$clust)

dendro.df <- dplyr::bind_rows(
  ggdendro.se %>% ggdendro::segment() %>% dplyr::mutate(type='Side effects'),
  ggdendro.ind %>% ggdendro::segment() %>% dplyr::mutate(type='Indications'))

leaf.df <- dplyr::bind_rows(
  ggdendro.se %>% ggdendro::label() %>% dplyr::left_join(
    y = data.frame(label = names(underrep.se$weight), gsc_weight = underrep.se$weight)) %>% 
    dplyr::mutate(type='Side effects'),
  ggdendro.ind %>% ggdendro::label() %>% dplyr::left_join(
    y = data.frame(label = names(underrep.ind$weight), gsc_weight = underrep.ind$weight)) %>% 
    dplyr::mutate(type='Indications'))

# plot denogrograms
dendro.df %>% ggplot(aes(x=x, y=y)) + theme_bw() +
  facet_grid( ~ type, scales = 'free_x', space = 'free_x') +
  geom_segment(aes(xend=xend, yend=yend), alpha=0.25) +
  geom_point(data = leaf.df, aes(color=gsc_weight), alpha=0.04, size=3) +
  xlab(NULL) + ylab('Distance') +
  scale_color_gradientn(colours=c('#268bd2', '#dc322f'), name='GSC Weight') +
  scale_x_continuous(breaks=NULL, expand = c(0.02, 0)) + scale_y_continuous(breaks=NULL, expand = c(0.04, 0))

# plot underrepresentation distributions
dplyr::bind_rows(
  data.frame(underrepresentation=underrep.se$weight, type='side_effect'),
  data.frame(underrepresentation=underrep.ind$weight, type='indication')) %>%
  ggplot(aes(x = underrepresentation)) + theme_bw() +
  facet_grid(. ~ type) +
  geom_histogram()

Computing Compound-Compound Similarity

Next, we computed a weighted Jaccard index (6) for each compound pair. This number, between 0 and 1, measures the similarity between two side effect/indication profiles.

SiderSimilarity <- function(x1, x2, w) {
  # Compute the similarity between two boolean vectors
  # using a weighted jaccard score.
  sum(w[x1 & x2]) / sum(w[x1 | x2])
}

GetSimilarityDF <- function(mat, w) {
  mat %>%
    proxy::simil(
      method = SiderSimilarity, w = w,
      by_rows = TRUE, upper = FALSE, diag = FALSE) %>%
    as.matrix() %>%
    reshape2::melt(na.rm = TRUE, value.name = 'similarity', varnames = c('compound_1', 'compound_2')) %>%
    dplyr::filter(compound_1 != compound_2) %>% # remove self-similarity (for safety, since `diag = FALSE` in proxy::simil)
    dplyr::distinct(compound_1, compound_2) # proxy::simil returning a non-NA upper half, so duplicate rows exist
}

similarity.df <- dplyr::full_join(
  x = GetSimilarityDF(mat = se.mat, w = rarity.se.vec * underrep.se$weight) %>%
    dplyr::rename(side_effect = similarity),
  y = GetSimilarityDF(mat = ind.mat, w = rarity.ind.vec * underrep.ind$weight) %>%
    dplyr::rename(indication = similarity)) %>%
  dplyr::full_join(y = structure.df) %>%
  dplyr::arrange(compound_1, compound_2)

gz <- file.path('..', 'data', 'similarities.txt.gz') %>% gzfile('w')
similarity.df %>% write.delim(gz); close(gz)
gz <- file.path('..', 'data', 'similarities-complete.txt.gz') %>% gzfile('w')
similarity.df %>% na.omit() %T>% write.delim(gz) %>% PubchemDataTable(max.rows=200); close(gz)

Similarity Distributions

Next, we evaluate the distirubtion of compound-compound similarities. Taking the squareroot of similarities provides a zero-inflated but otherwise normal-looking distribution.

# histogram of similarities
similarity.df %>%
  na.omit() %>%
  reshape2::melt(value.name = 'similarity', variable.name = 'type',
                 id.vars = c('compound_1', 'compound_2')) %>%
  ggplot(aes(similarity)) +
  facet_grid(. ~ type) + 
  geom_histogram(binwidth=0.02)  +
  scale_x_sqrt(breaks = seq(0, 1, 0.2)) + scale_y_sqrt() + theme_bw() +
  xlab('Compound-Compound Similarity')

Does side effect similarity predict indication similarity?

Next, we evaluate whether compound similarity measured by indications is associated with compound similarity measured by side effects.

# binned frequency plots
similarity.df %>% na.omit() %>%
  ggplot(aes(side_effect, indication)) + theme_bw() +
    stat_bin2d(aes(alpha=..count.., fill=..count..), bins = 75) + 
    scale_x_sqrt() + scale_y_sqrt() + coord_equal() +
    scale_fill_gradientn(colours=c('#2aa198', '#6c71c4', '#d33682'), trans='log10', name='Compound Pairs') +
    scale_alpha_continuous(range = c(0.3, 1), trans='log10', guide=FALSE) +
    xlab('Side Effect Similarity') + ylab('Indication Similarity')

Finally, we fit a linear regression to assess the strength of the association. The association is positive, but has a small effect size. The great significance occurs despite the small effect size, since there are close to one million observations.

similarity.df %$% lm(sqrt(indication) ~ sqrt(side_effect)) %>% summary()

## 
## Call:
## lm(formula = sqrt(indication) ~ sqrt(side_effect))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.14470 -0.03074 -0.02032 -0.00928  1.00602 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -0.0060173  0.0001855  -32.43   <2e-16 ***
## sqrt(side_effect)  0.1562346  0.0008798  177.58   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07799 on 965304 degrees of freedom
##   (228200 observations deleted due to missingness)
## Multiple R-squared:  0.03163,    Adjusted R-squared:  0.03163 
## F-statistic: 3.153e+04 on 1 and 965304 DF,  p-value: < 2.2e-16

Does structural similarity predict indication or side effect similarity?

similarity.df %>% na.omit() %>%
  reshape2::melt(value.name = 'similarity', measure.vars=c('indication', 'side_effect'), variable.name = 'type') %>%
  ggplot(aes(substructure, similarity)) + theme_bw() +
  facet_grid(. ~ type) + 
    stat_bin2d(aes(alpha=..count.., fill=..count..), bins = 75) + 
    scale_x_sqrt() + scale_y_sqrt() + coord_equal() +
    scale_fill_gradientn(colours=c('#2aa198', '#6c71c4', '#d33682'), trans='log10', name='Compound Pairs') +
    scale_alpha_continuous(range = c(0.3, 1), trans='log10', guide=FALSE) +
    xlab('Substructure Similarity') + ylab('SIDER 2 Drug Label Similarity')

# Predict indication similarity using substructure similarity
similarity.df %$% lm(sqrt(indication) ~ sqrt(substructure)) %>% summary()

## 
## Call:
## lm(formula = sqrt(indication) ~ sqrt(substructure))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.03227 -0.02520 -0.02299 -0.01995  0.98621 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        0.0117415  0.0003867   30.36   <2e-16 ***
## sqrt(substructure) 0.0205305  0.0006735   30.48   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07854 on 488564 degrees of freedom
##   (704940 observations deleted due to missingness)
## Multiple R-squared:  0.001898,   Adjusted R-squared:  0.001896 
## F-statistic: 929.2 on 1 and 488564 DF,  p-value: < 2.2e-16

# Predict side effect similarity using substructure similarity
similarity.df %$% lm(sqrt(side_effect) ~ sqrt(substructure)) %>% summary()

## 
## Call:
## lm(formula = sqrt(side_effect) ~ sqrt(substructure))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.23208 -0.06241 -0.00374  0.05827  0.78685 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        0.1300772  0.0004303   302.3   <2e-16 ***
## sqrt(substructure) 0.1063407  0.0007483   142.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08911 on 517651 degrees of freedom
##   (675853 observations deleted due to missingness)
## Multiple R-squared:  0.03755,    Adjusted R-squared:  0.03755 
## F-statistic: 2.02e+04 on 1 and 517651 DF,  p-value: < 2.2e-16

# Predict indication similarity using substructure and side effect similarity
similarity.df %$% lm(sqrt(indication) ~ sqrt(substructure) + sqrt(side_effect)) %>% summary()

## 
## Call:
## lm(formula = sqrt(indication) ~ sqrt(substructure) + sqrt(side_effect))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.14446 -0.03039 -0.01999 -0.00899  1.00788 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -0.0092872  0.0004436  -20.93  < 2e-16 ***
## sqrt(substructure)  0.0057413  0.0007186    7.99 1.36e-15 ***
## sqrt(side_effect)   0.1546505  0.0012779  121.02  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07757 on 466092 degrees of freedom
##   (727411 observations deleted due to missingness)
## Multiple R-squared:  0.03225,    Adjusted R-squared:  0.03225 
## F-statistic:  7767 on 2 and 466092 DF,  p-value: < 2.2e-16

Network Animation

Similarity networks were created using Cytoscape (7). For each of the three networks, modules of similar compounds were constructed using MCL Clustering from the clusterMaker2 app (8). Structures, rendered from each compound’s SMILES, were superimposed onto the nodes using the chemViz2 app, although these diagrams are difficult to see in the video.

Substructure → Indication

The CyAnimator app illustrates the transition from substructure similarity modules to indication similarity modules. The compounds from the largest structural similarity module are colored red. When compounds organized into indication similarity modules, the red nodes cosegregated indicating concordance between indication and structural similarity.

Substructure → Side Effect

We repeat the animation showing the transition from substructure to side effect similarity.

References

1. Campillos M, Kuhn M, Gavin A-C, Jensen LJ, Bork P (2008) Drug Target Identification Using Side-Effect Similarity. Science. doi:10.1126/science.1158140

2. Kuhn M, Campillos M, Letunic I, Jensen LJ, Bork P (2010) A side effect resource to capture phenotypic effects of drugs. Mol Syst Biol. doi:10.1038/msb.2009.98

3. Ward JH (1963) Hierarchical Grouping to Optimize an Objective Function. Journal of the American Statistical Association. doi:10.1080/01621459.1963.10500845

4. Gerstein M, Sonnhammer EL, Chothia C (1994) Volume changes in protein evolution. Journal of Molecular Biology. doi:10.1016/0022-2836(94)90012-4

5. Eisen MB, Spellman PT, Brown PO, Botstein D (1998) Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences. doi:10.1073/pnas.95.25.14863

6. Ioffe S 2010. Improved Consistent Sampling, Weighted Minhash and L1 Sketching. In 2010 IEEE International Conference on Data Mining. doi:10.1109/icdm.2010.80

7. Shannon P (2003) Cytoscape: A Software Environment for Integrated Models of Biomolecular Interaction Networks. Genome Research. doi:10.1101/gr.1239303

8. Morris JH, Apeltsin L, Newman AM, Baumbach J, Wittkop T, Su G, et al. (2011) clusterMaker: a multi-algorithm clustering plugin for Cytoscape. BMC Bioinformatics. doi:10.1186/1471-2105-12-436

To the extent possible under law, Daniel Himmelstein has waived all copyright and related or neighboring rights to SIDER 2. This work is published from: United States.

SIDER 2 Pairwise Similarity

Daniel Himmelstein, Alexander Pico, Sergio Baranzini

February 6, 2015