Systematic integration of biomedical knowledge prioritizes drugs for repurposing

Daniel S. Himmelstein; Antoine Lizee; Christine Hessler; Leo Brueggeman; Sabrina L. Chen; Dexter Hadley; Ari Green; Pouya Khankhanian; Sergio E. Baranzini

This manuscript was published in eLife on September 22, 2017. The DOI-citable version is available at https://doi.org/10.7554/eLife.26726.

Abstract

The ability to computationally predict whether a compound treats a disease would improve the economy and success rate of drug approval. This study describes Project Rephetio to systematically model drug efficacy based on 755 existing treatments. First, we constructed Hetionet (neo4j.het.io), an integrative network encoding knowledge from millions of biomedical studies. Hetionet v1.0 consists of 47,031 nodes of 11 types and 2,250,197 relationships of 24 types. Data was integrated from 29 public resources to connect compounds, diseases, genes, anatomies, pathways, biological processes, molecular functions, cellular components, pharmacologic classes, side effects, and symptoms. Next, we identified network patterns that distinguish treatments from non-treatments. Then we predicted the probability of treatment for 209,168 compound–disease pairs (het.io/repurpose). Our predictions validated on two external sets of treatments and provided pharmacological insights on epilepsy, suggesting they will help prioritize drug repurposing candidates. This study was entirely open and received real-time feedback from 40 community members.

Introduction

The cost of developing a new therapeutic drug has been estimated at 1.4 billion dollars [1], the process typically takes 15 years from lead compound to market [2], and the likelihood of success is stunningly low [3]. Strikingly, the costs have been doubling every 9 years since 1970, a sort of inverse Moore’s law, which is far from an optimal strategy from both a business and public health perspective [4]. Drug repurposing — identifying novel uses for existing therapeutics — can drastically reduce the duration, failure rates, and costs of approval [5]. These benefits stem from the rich preexisting information on approved drugs, including extensive toxicology profiling performed during development, preclinical models, clinical trials, and postmarketing surveillance.

Drug repurposing is poised to become more efficient as mining of electronic health records (EHRs) to retrospectively assess the effect of drugs gains feasibility [6,7,8,9]. However, systematic approaches to repurpose drugs based on mining EHRs alone will likely lack power due to multiple testing. Similar to the approach followed to increase the power of genome-wide association studies (GWAS) [10,11], integration of biological knowledge to prioritize drug repurposing will help overcome limited EHR sample size and data quality.

In addition to repurposing, several other paradigm shifts in drug development have been proposed to improve efficiency. Since small molecules tend to bind to many targets, polypharmacology aims to find synergy in the multiple effects of a drug [12]. Network pharmacology assumes diseases consist of a multitude of molecular alterations resulting in a robust disease state. Network pharmacology seeks to uncover multiple points of intervention into a specific pathophysiological state that together rehabilitate an otherwise resilient disease process [13,14]. Although target-centric drug discovery has dominated the field for decades, phenotypic screens have more recently resulted in a comparatively higher number of first-in-class small molecules [15]. Recent technological advances have enabled a new paradigm in which mid- to high-throughput assessment of intermediate phenotypes, such as the molecular response to drugs, is replacing the classic target discovery approach [16,17,18]. Furthermore, integration of multiple channels of evidence, particularly diverse types of data, can overcome the limitations and weak performance inherent to data of a single domain [19]. Modern computational approaches offer a convenient platform to tie these developments together as the reduced cost and increased velocity of in silico experimentation massively lowers the barriers to entry and price of failure [20,21].

Hetnets (short for heterogeneous networks) are networks with multiple types of nodes and relationships. They offer an intuitive, versatile, and powerful structure for data integration by aggregating graphs for each relationship type onto common nodes. In this study, we developed a hetnet (Hetionet v1.0) by integrating knowledge and experimental findings from decades of biomedical research spanning millions of publications. We adapted an algorithm originally developed for social network analysis and applied it to Hetionet v1.0 to identify patterns of efficacy and predict new uses for drugs. The algorithm performs edge prediction through a machine learning framework that accommodates the breadth and depth of information contained in Hetionet v1.0 [22,23]. Our approach represents an in silico implementation of network pharmacology that natively incorporates polypharmacology and high-throughput phenotypic screening.

One fundamental characteristic of our method is that it learns and evaluates itself on existing medical indications (i.e. a “gold standard”). Next, we introduce previous approaches that also performed comprehensive evaluation on existing treatments. A 2011 study, named PREDICT, compiled 1,933 treatments between 593 drugs and 313 diseases [24]. Starting from the premise that similar drugs treat similar diseases, PREDICT trained a classifier that incorporates 5 types of drug-drug and 2 types of disease-disease similarity. A 2014 study compiled 890 treatments between 152 drugs and 145 diseases with transcriptional signatures [25]. The authors found that compounds triggering an opposing transcriptional response to the disease were more likely to be treatments, although this effect was weak and limited to cancers. A 2016 study compiled 402 treatments between 238 drugs and 78 diseases and used a single proximity score — the average shortest path distance between a drug’s targets and disease’s associated proteins on the interactome — as a classifier [26].

We build on these successes by creating a framework for incorporating the effects of any biological relationship into the prediction of whether a drug treats a disease. By doing this, we were able to capture a multitude of effects that have been suggested as influential for drug repurposing including drug-drug similarity [24,27], disease-disease similarity [24,28], transcriptional signatures [17,18,25,29,30], protein interactions [26], genetic association [31,32], drug side effects [33,34], disease symptoms [35], and molecular pathways [36]. Our ability to create such an integrative model of drug efficacy relies on the hetnet data structure to unite diverse information. On Hetionet v1.0, our algorithm learns which types of compound–disease paths discriminate treatments from non-treatments in order to predict the probability that a compound treats a disease.

We refer to this study as Project Rephetio (pronounced as rep-het-ee-oh). Both Rephetio and Hetionet are portmanteaus combining the words repurpose, heterogeneous, and network with the URL het.io.

Results

Hetionet v1.0

We obtained and integrated data from 29 publicly available resources to create Hetionet v1.0 (Figure 1). The hetnet contains 47,031 nodes of 11 types (Table 1) and 2,250,197 relationships of 24 types (Table 2). The nodes consist of 1,552 small molecule compounds and 137 complex diseases, as well as genes, anatomies, pathways, biological processes, molecular functions, cellular components, pharmacologic classes, drug side effects, and disease symptoms. The edges represent relationships between these nodes and encompass the collective knowledge produced by millions of studies over the last half century.

For example, Compound–binds–Gene edges represent when a compound binds to a protein encoded by a gene. This information has been extracted from the literature by human curators and compiled into databases such as DrugBank, ChEMBL, DrugCentral, and BindingDB. We combined these databases to create 11,571 binding edges between 1,389 compounds and 1,689 genes. These edges were compiled from 10,646 distinct publications, which Hetionet binding edges reference as an attribute. Binding edges represent a comprehensive catalog constructed from low throughput experimentation. However, we also integrated findings from high throughput technologies — many of which have only recently become available. For example, we generated consensus transcriptional signatures for compounds in LINCS L1000 and diseases in STARGEO.

While Hetionet v1.0 is ideally suited for drug repurposing, the network has broader biological applicability. For example, we have prototyped queries for a) identifying drugs that target a specific pathway, b) identifying biological processes involved in a specific disease, c) identifying the drug targets responsible for causing a specific side effect, and d) identifying anatomies with transcriptional relevance for a specific disease [37]. Each of these queries was simple to write and took less than a second to run on our publicly available Hetionet Browser. While it is possible that existing services provide much of the aforementioned functionality, they offer less versatility. Hetionet differentiates itself in its ability to flexibly query across multiple domains of information. As a proof of concept, we enhanced the biological process query (b), which identified processes that were enriched for disease-associated genes, using multiple sclerosis (MS) as an example disease. The verbose Cypher code for this query is shown below:

The query above identifies genes that interact with MS GWAS-genes. However, interacting genes are discarded unless they are upregulated in an MS-related anatomy (i.e. anatomical structure, e.g. organ or tissue). Then relevant biological processes are identified. Thus, this single query spans 4 node and 5 relationship types.
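The filtering logic described in this paragraph can be illustrated on a toy in-memory hetnet. The following is a minimal Python sketch, not the Cypher query itself: every node name (GeneA, AnatomyX, Process1, and so on) is a made-up placeholder, the edge-type keys simply reuse the verbs of the corresponding metaedges, and the helper names `neighbors` and `enriched_processes` are hypothetical.

```python
from collections import Counter

# Toy hetnet: all nodes and edges are invented placeholders, not Hetionet data.
edges = {
    "associates": {("MS", "GeneA"), ("MS", "GeneB")},            # Disease-Gene (GWAS)
    "interacts": {("GeneA", "GeneC"), ("GeneB", "GeneC"),
                  ("GeneB", "GeneD")},                           # Gene-Gene
    "localizes": {("MS", "AnatomyX")},                           # Disease-Anatomy
    "upregulates": {("AnatomyX", "GeneC")},                      # Anatomy-Gene
    "participates": {("GeneC", "Process1"), ("GeneC", "Process2"),
                     ("GeneD", "Process2")},                     # Gene-Biological Process
}


def neighbors(kind, node):
    """Return nodes joined to `node` by an edge of type `kind`, ignoring direction."""
    found = set()
    for left, right in edges[kind]:
        if left == node:
            found.add(right)
        elif right == node:
            found.add(left)
    return found


def enriched_processes(disease):
    """Rank biological processes by the number of supporting interacting genes."""
    gwas_genes = neighbors("associates", disease)
    # Genes upregulated in any anatomy the disease localizes to.
    upregulated = set()
    for anatomy in neighbors("localizes", disease):
        upregulated |= neighbors("upregulates", anatomy)
    # Genes that interact with a GWAS gene.
    interactors = set()
    for gene in gwas_genes:
        interactors |= neighbors("interacts", gene)
    # Discard interacting genes that are not upregulated in a relevant anatomy.
    kept = interactors & upregulated
    counts = Counter()
    for gene in kept:
        for process in neighbors("participates", gene):
            counts[process] += 1
    # Most-supported processes first; ties broken alphabetically.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


print(enriched_processes("MS"))
```

In this toy network GeneD interacts with a GWAS gene but is dropped because it is not upregulated in the disease-related anatomy, so only GeneC contributes to the process counts.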

The integrative potential of Hetionet v1.0 is reflected by its connectivity. Among the 11 metanodes, there are 66 possible source–target pairs. However, only 11 of them have at least one direct connection. In contrast, for paths of length 2, 50 pairs have connectivity (path types that start on the source node type and end on the target node type, see Figure 1C). At length 3, all 66 pairs are connected. At length 4, the source–target pair with the fewest types of connectivity (Side Effect to Symptom) has 13 metapaths, while the pair with the most connectivity types (Gene to Gene) has 3,542 metapaths. This high level of connectivity across a diversity of biomedical entities forms the foundation for automated translation of knowledge into biomedical insight.
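The count of 66 source–target pairs follows from choosing unordered pairs of the 11 node types, with same-type pairs allowed. The Python sketch below checks that arithmetic and shows one way to ask which type pairs become connected at a given path length. The four-metaedge metagraph is a deliberately small illustrative subset, not the full 24-metaedge metagraph, and the function names are hypothetical.

```python
from itertools import combinations_with_replacement

NODE_TYPES = [
    "Anatomy", "Biological Process", "Cellular Component", "Compound",
    "Disease", "Gene", "Molecular Function", "Pathway",
    "Pharmacologic Class", "Side Effect", "Symptom",
]


def source_target_pairs(types):
    """All unordered pairs of node types, including same-type pairs."""
    return list(combinations_with_replacement(types, 2))


def connected_pairs(metaedges, length):
    """Unordered type pairs joined by at least one walk of exactly `length` metaedges."""
    adjacency = {}
    for left, right in metaedges:
        adjacency.setdefault(left, set()).add(right)
        adjacency.setdefault(right, set()).add(left)
    # reached[t] holds the types reachable from t; start with walks of length 0.
    reached = {node_type: {node_type} for node_type in adjacency}
    for _ in range(length):
        reached = {
            start: {nxt for mid in ends for nxt in adjacency[mid]}
            for start, ends in reached.items()
        }
    return {tuple(sorted((s, t))) for s, ends in reached.items() for t in ends}


# Illustrative subset of the metagraph (an assumption, for demonstration only).
TOY_METAEDGES = [
    ("Compound", "Gene"),      # Compound-binds-Gene
    ("Disease", "Gene"),       # Disease-associates-Gene
    ("Compound", "Disease"),   # Compound-treats-Disease
    ("Disease", "Anatomy"),    # Disease-localizes-Anatomy
]

print(len(source_target_pairs(NODE_TYPES)))      # 11 * 12 / 2 pairs
print(len(connected_pairs(TOY_METAEDGES, 1)))    # directly connected pairs
print(len(connected_pairs(TOY_METAEDGES, 2)))    # pairs connected at length 2
```

Even in this four-type toy, the number of connected pairs grows from 4 of 10 at length 1 to 9 of 10 at length 2, mirroring the pattern reported for the full network.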

Hetionet v1.0 is accessible via a Neo4j Browser at https://neo4j.het.io. This public Neo4j instance provides users an installation-free method to query and visualize the network. The Browser contains a tutorial guide as well as guides with the details of each Project Rephetio prediction. Hetionet v1.0 is also available for download in JSON, Neo4j, and TSV formats [38]. The JSON and Neo4j database formats include node and edge properties — such as URLs, source and license information, and confidence scores — and are thus recommended.

Systematic mechanisms of efficacy

One aim of Project Rephetio was to systematically evaluate how drugs exert their therapeutic potential. To address this question, we compiled a gold standard of 755 disease-modifying indications, which form the Compound–treats–Disease edges in Hetionet v1.0. Next, we identified types of paths (metapaths) that occurred more frequently between treatments than non-treatments (any compound–disease pair that is not a treatment). The advantage of this approach is that metapaths naturally correspond to mechanisms of pharmacological efficacy. For example, the Compound–binds–Gene–associates–Disease (CbGaD) metapath identifies when a drug binds to a protein corresponding to a gene involved in the disease.

We evaluated all 1,206 metapaths that traverse from compound to disease and have length of 2–4 (Figure 2A). To control for the different degrees of nodes, we used the degree-weighted path count (DWPC, see Methods) — which downweights paths going through highly-connected nodes [22] — to assess path prevalence. In addition, we compared the performance of each metapath to a baseline computed from permuted networks. Hetnet permutation preserves node degree while eliminating edge specificity, allowing us to isolate the portion of unpermuted metapath performance resulting from actual network paths. We refer to the permutation-adjusted performance measure as Δ AUROC. A positive Δ AUROC indicates that paths of the given type tended to occur more frequently between treatments than non-treatments, after accounting for different levels of connectivity (node degrees) in the hetnet. In general terms, Δ AUROC assesses whether paths of a given type were informative of drug efficacy.

Overall, 709 of the 1,206 metapaths exhibited a statistically significant Δ AUROC at a false discovery rate cutoff of 5%. These 709 metapaths included all 24 metaedges, suggesting that each type of relationship we integrated provided at least some therapeutic utility. However, not all metaedges were equally present in significant metapaths: 259 significant metapaths included a Compound–binds–Gene metaedge, whereas only 4 included a Gene–participates–Cellular Component metaedge. Table 3 lists the predictiveness of several metapaths of interest. Refer to the Discussion for our interpretation of these findings.

Predictions of drug efficacy

We implemented a machine learning approach to translate the network connectivity between a compound and a disease into a probability of treatment [40,41]. The approach relies on the 755 known treatments as positives and 29,044 non-treatments as negatives to train a logistic regression model. Note that 179,369 non-treatments were omitted as negative training observations because they had a prior probability of treatment equal to zero (see Methods). The features consisted of a prior probability of treatment, node degrees for 14 metaedges, and DWPCs for 123 metapaths that were well suited for modeling. A cross-validated elastic net was used to minimize overfitting, yielding a model with 31 features (Figure 2B). The DWPC features with negative coefficients appear to be included as node-degree-capturing covariates, i.e. they reflect the general connectivity of the compound and disease rather than specific paths between them. However, the 11 DWPC features with non-negligible positive coefficients represent the most salient types of connectivity for systematically modeling drug efficacy. See the metapaths with positive coefficients in Table 3 for unabbreviated names. As an example, the CcSEcCtD feature assesses whether the compound causes the same side effects as compounds that treat the disease. Alternatively, the CbGeAlD feature assesses whether the compound binds to genes that are expressed in the anatomies affected by the disease.
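As a rough illustration of how a fitted logistic regression turns features into a probability, the Python sketch below applies the logistic function to a weighted sum of feature values. The coefficient values, the intercept, and the feature values are invented for demonstration and are not the fitted Project Rephetio model; any feature transformations applied in the actual pipeline are omitted. Features that the elastic net did not select simply carry a coefficient of zero.

```python
import math


def treatment_probability(features, coefficients, intercept):
    """Logistic-regression prediction: sigmoid of intercept plus weighted features."""
    score = intercept + sum(
        coefficients.get(name, 0.0) * value for name, value in features.items()
    )
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical coefficients and feature values, for illustration only.
example_coefficients = {"prior": 0.9, "CbGaD": 0.4, "CcSEcCtD": 0.3}
example_features = {"prior": -1.0, "CbGaD": 2.0, "CcSEcCtD": 1.5, "CuGdD": 0.7}

print(treatment_probability(example_features, example_coefficients, intercept=-4.0))
```

In this example the CuGdD value has no effect on the output because no coefficient is stored for it, which is how unselected features drop out of a sparse model.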

We applied this model to predict the probability of treatment between each of 1,538 connected compounds and each of 136 connected diseases, resulting in predictions for 209,168 compound–disease pairs [42], available at http://het.io/repurpose/. The 755 known disease-modifying indications were highly ranked (AUROC = 97.4%, Figure 3). The predictions also successfully prioritized two external validation sets: novel indications from DrugCentral (AUROC = 85.5%) and novel indications in clinical trial (AUROC = 70.0%). Together, these findings indicate that Project Rephetio has the ability to recognize efficacious compound–disease pairs.
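For readers less familiar with AUROC, it can be read as the probability that a randomly chosen treatment receives a higher prediction than a randomly chosen non-treatment. The Python sketch below computes it by direct pairwise comparison on invented scores, and also checks the pair counts quoted in the text. The function name `auroc` and the example scores are assumptions; this is a didactic version, not the evaluation code used in the study.

```python
def auroc(scores, labels):
    """Fraction of positive-negative pairs ranked correctly (ties count as half)."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Counts taken from the text.
n_pairs = 1_538 * 136                 # connected compounds x connected diseases
n_accounted = 755 + 29_044 + 179_369  # treatments + training negatives + omitted
print(n_pairs, n_accounted)

# Invented scores: 1 marks a treatment, 0 a non-treatment.
print(auroc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
```

The pairwise loop is quadratic in the number of observations, which is fine for a toy example; a rank-based formula gives the same value more efficiently on the full set of predictions.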

Predictions were scaled to the overall prevalence of treatments (0.36%). Hence a compound–disease pair that received a prediction of 1% represents a 2.8-fold enrichment over the null probability. Of the 3,980 predictions with a probability exceeding 1%, 586 corresponded to known disease-modifying indications, leaving 3,394 repurposing candidates. For a given compound or disease, we provide the percentile rank of each prediction. Therefore, users can assess whether a given prediction is a top prediction for the compound or disease. In addition, our table-based prediction browser links to a custom guide for each prediction, which displays in the Neo4j Hetionet Browser. Each guide includes a query to display the top paths supporting the prediction and lists clinical trials investigating the indication.
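The arithmetic behind these figures can be reproduced in a few lines of Python. The prevalence is the number of known treatments divided by the number of compound–disease pairs, and fold enrichment is a predicted probability divided by that prevalence. The `percentile_rank` helper shows one plausible definition of a percentile from a 1-based rank; it is an assumption for illustration and may differ from the definition used by the prediction browser.

```python
N_TREATMENTS = 755
N_PAIRS = 209_168

# Null probability: overall prevalence of treatments among all pairs.
prevalence = N_TREATMENTS / N_PAIRS


def fold_enrichment(probability, null=prevalence):
    """How many times larger a predicted probability is than the null."""
    return probability / null


def percentile_rank(rank, n_items):
    """Percentile for a 1-based rank among n_items, where rank 1 is the best."""
    return 100 * (1 - (rank - 1) / n_items)


print(f"prevalence = {100 * prevalence:.2f}%")
print(f"1% prediction = {fold_enrichment(0.01):.2f}-fold over null")
print(f"repurposing candidates above 1% = {3_980 - 586}")
```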

Nicotine dependence case study

There are currently two FDA-approved medications for smoking cessation (varenicline and bupropion) that are not nicotine replacement therapies. PharmacotherapyDB v1.0 lists varenicline as a disease-modifying indication and nicotine itself as a symptomatic indication for nicotine dependence, but is missing bupropion. Bupropion was first approved for depression in 1985. Owing to the serendipitous observation that it decreased smoking in depressed patients taking this drug, bupropion was approved for smoking cessation in 1997 [43]. We therefore examined whether Project Rephetio could have predicted this repurposing. Bupropion was the 9th best prediction for nicotine dependence (99.5th percentile) with a probability 2.50-fold greater than the null. Figure 4 shows the top paths supporting the repurposing of bupropion.

Atop the nicotine dependence predictions were nicotine (10.97-fold over null), cytisine (10.58-fold), and galantamine (9.50-fold). Cytisine is widely used in Eastern Europe for smoking cessation due to its availability at a fraction of the cost of other pharmaceutical options [48]. In the last half decade, large scale clinical trials have confirmed cytisine’s efficacy [49,50]. Galantamine, an approved Alzheimer’s treatment, is currently in Phase 2 trial for smoking cessation and is showing promising results [51]. In summary, nicotine dependence illustrates Project Rephetio’s ability to predict efficacious treatments and prioritize historic and contemporary repurposing opportunities.

Epilepsy case study

Several factors make epilepsy an interesting disease for evaluating repurposing predictions [52]. Antiepileptic drugs work by increasing the seizure threshold — the amount of electric stimulation that is required to induce seizure. The effect of a drug on the seizure threshold can be cheaply and reliably tested in rodent models. As a result, the viability of most approved drugs in treating epilepsy is known.

We focused our evaluation on the top 100 scoring compounds — referred to as the epilepsy predictions in this section — after discarding a single combination drug. We classified each compound as anti-ictogenic (seizure suppressing), unknown (no established effect on the seizure threshold), or ictogenic (seizure generating) according to medical literature [52]. Of the top 100 epilepsy predictions, 77 were anti-ictogenic, 8 were unknown, and 15 were ictogenic (Figure 5A). Notably, the predictions contained 23 of the 25 disease-modifying antiepileptics in PharmacotherapyDB v1.0.

Many of the 77 anti-ictogenic compounds were not first-line antiepileptic drugs. Instead, they were used as ancillary drugs in the treatment of status epilepticus. For example, we predicted four halogenated ethers, two of which (isoflurane and desflurane) are used clinically to treat life-threatening seizures that persist despite treatment [54]. As inhaled anesthetics, these compounds are not appropriate as daily epilepsy medications, but are feasible for refractory status epilepticus where patients are intubated.

Given this high precision (77%), the 8 compounds of unknown effect are promising repurposing candidates. For example, acamprosate — whose top prediction was epilepsy — is a taurine analog that promotes alcohol abstinence. Support for this repurposing arose from acamprosate’s inhibition of the glutamate receptor and positive modulation of the GABAᴀ receptor (Figure 5C). If effective against epilepsy, acamprosate could serve a dual benefit for recovering alcoholics who experience seizures from alcohol withdrawal.

While certain classes of compounds were highly represented in our epilepsy predictions, such as benzodiazepines and barbiturates, there was also considerable diversity [52]. The 100 predicted compounds encompassed 26 third-level ATC codes [55], such as antiarrhythmics (quinidine, classified as anti-ictogenic) and urologicals (phenazopyridine, classified as unknown). Furthermore, 25 of the compounds were chemically distinct, i.e. they did not resemble any of the other epilepsy predictions (Figure 5B).

Next, we investigated which components of Hetionet contributed to the epilepsy predictions [52]. In total, 392,956 paths of 12 types supported the predictions. Using several different methods for grouping paths, we were able to quantify the aggregate biological evidence. Our algorithm primarily drew on two aspects of epilepsy: its known treatments (76% of the total support) and its genetic associations (22% of support). In contrast, our algorithm drew heavily on several aspects of the predicted compounds: their targeted genes (44%), their chemically similar compounds (30%), their pharmacologic classes, their palliative indications (5%), and their side effects (4%).

Specifically, 266,192 supporting paths originated with a Compound–binds–Gene relationship. Aggregating support by these genes shows the extent to which 121 different drug targets contributed to the predictions [52]. In order of importance, the predictions targeted GABAᴀ receptors (15.3% of total support), cytochrome P450 enzymes (5.6%), the sodium channel (4.6%), glutamate receptors (3.8%), the calcium channel (2.7%), carbonic anhydrases (2.5%), cholinergic receptors (2.1%) and the potassium channel (1.4%). Besides cytochrome P450, which primarily influences pharmacokinetics [56], our method detected and leveraged bona fide anti-ictogenic mechanisms [57]. Figure 5C shows drug target contributions per compound and illustrates the considerable mechanistic diversity among the predictions.

Also notable are the 15 ictogenic compounds in our top 100 predictions. Nine of the ictogenic compounds share a tricyclic structure (Figure 5B), five of which are tricyclic antidepressants. While the ictogenic mechanisms of these antidepressants are still unclear [58], Figure 5C suggests their anticholinergic effects may be responsible [59], in accordance with previous theories [tag:dailey?].

We also ranked the contribution of the 1,137 side effects that supported the epilepsy predictions through 117,720 CcSEcCtD paths. The top five side effects — ataxia (0.069% of total support), nystagmus (0.049%), diplopia (0.045%), somnolence (0.044%), and vomiting (0.043%) — reflect established adverse effects of antiepileptic drugs [60,61,62,tag:hilton?,tag:placidi?]. In summary, our method simultaneously identified the hallmark side effects of antiepileptic drugs while incorporating this knowledge to prioritize 1,538 compounds for anti-ictogenic activity.

Discussion

We created Hetionet v1.0 by integrating 29 resources into a single data structure — the hetnet. Consisting of 11 types of nodes and 24 types of relationships, Hetionet v1.0 brings more types of information together than previous leading studies in biological data integration [63]. Moreover, we strove to create a reusable, extensible, and property-rich network. While all of the resources we include are publicly available, their integration was a time-intensive undertaking and required careful consideration of legal barriers to data reuse. Hetionet allows researchers to begin answering integrative questions without having to first spend months processing data.

Our public Neo4j instance allows users to immediately interact with Hetionet. Through the Cypher language, users can perform highly specialized graph queries with only a few lines of code. Queries can be executed in the web browser or programmatically from a language with a Neo4j driver. For users that are unfamiliar with Cypher, we include several example queries in a Browser guide. In contrast to traditional REST APIs, our public Neo4j instance provides users with maximal flexibility to construct custom queries by exposing the underlying database.

As data has grown more plentiful and diverse, so has the applicability of hetnets. Unfortunately, network science has been naturally fragmented by discipline resulting in relatively slow progress in integrating heterogeneous data. A 2014 analysis identified 78 studies using multilayer networks — a superset of hetnets (heterogeneous information networks) with the potential for additional dimensions, such as time. However, the studies relied on 26 different terms, 9 of which had multiple definitions [64,65]. Nonetheless, core infrastructure and algorithms for hetnets are emerging. Compared to the existing mathematical frameworks for multilayer networks that must deal with layers other than type (such as the aspect of time) [64], the primary obligation of hetnet algorithms is to be type aware. One goal of our project has been to unite hetnet research across disciplines. We approached this goal by making Project Rephetio entirely available online and inviting community feedback throughout the process [66].

Integrating every resource into a single interconnected data structure allowed us to assess systematic mechanisms of drug efficacy. Using the max performing metapath to assess the pharmacological utility of a metaedge (Figure 2A), we can divide our relationships into tiers of informativeness. The top tier consists of the types of information traditionally considered by pharmacology: Compound–treats–Disease, Pharmacologic Class–includes–Compound, Compound–resembles–Compound, Disease–resembles–Disease, and Compound–binds–Gene. The upper-middle tier consists of types of information that have been the focus of substantial medical study, but have only recently started to play a bigger role in drug development, namely the metaedges Disease–associates–Gene, Compound–causes–Side Effect, Disease–presents–Symptom, Disease–localizes–Anatomy, and Gene–interacts–Gene.

The lower-middle tier contains the transcriptomics metaedges such as Compound–downregulates–Gene, Anatomy–expresses–Gene, Gene→regulates→Gene, and Disease–downregulates–Gene. Much excitement surrounds these resources due to their high throughput and genome-wide scope, which offers a route to drug discovery that is less biased by existing knowledge. However, our findings suggest that these resources are only moderately informative of drug efficacy. Other lower-middle tier metaedges were the product of time-intensive biological experimentation, such as Gene–participates–Pathway, Gene–participates–Molecular Function, and Gene–participates–Biological Process. Unlike the top tier resources, this knowledge has historically been pursued for basic science rather than primarily medical applications. The weak yet appreciable performance of the Gene–covaries–Gene metaedge suggests synergy between the fields of evolutionary genomics and disease biology. The lower tier included the Gene–participates–Cellular Component metaedge, which may reflect that the relevance of cellular location to pharmacology is highly case dependent and not amenable to systematic profiling.

The performance of specific metapaths (Table 3) provides further insight. For example, significant emphasis has been put on the use of transcriptional data for drug repurposing [30]. One common approach has been to identify compounds with opposing transcriptional signatures to a disease [18,67]. However, several systematic studies report underwhelming performance of this approach [24,25,26] — a finding supported by the low performance of the CuGdD and CdGuD metapaths in Project Rephetio. Nonetheless, other transcription-based methods showed some promise. Compounds with similar transcriptional signatures were prone to treating the same disease (CuGuCtD and CdGdCtD metapaths), while compounds with opposing transcriptional signatures were slightly averse to treating the same disease (CuGdCtD and CdGuCtD metapaths). In contrast, diseases with similar transcriptional profiles were not prone to treatment by the same compound (CtDuGuD and CtDdGdD metapaths).

By comparably assessing the informativeness of different metaedges and metapaths, Project Rephetio aims to guide future research towards promising data types and analyses. One caveat is that omics-scale experimental data will likely play a larger role in developing the next generation of pharmacotherapies. Hence, were performance reevaluated on treatments discovered in the forthcoming decades, the predictive ability of these data types may rise. Encouragingly, most data types were at least weakly informative and hence suitable for further study. Ideally, different data types would provide orthogonal information. However, our model for whether a compound treats a disease focused on 11 metapaths — a small portion of the hundreds of metapaths available. While parsimony aids interpretation, our model did not draw on the weakly-predictive high-throughput data types — which are intriguing for their novelty, scalability, and cost-effectiveness — as much as we had hypothesized.

Instead, our model selected types of information traditionally considered in pharmacology. However, unlike a pharmacologist whose area of expertise may be limited to a few drug classes, our model was able to predict probabilities of treatment for all 209,168 compound–disease pairs. Furthermore, our model systematically learned the importance of each type of network connectivity. For any compound–disease pair, we now can immediately provide the top network paths supporting its therapeutic efficacy. A traditional pharmacologist may be able to produce a similar explanation, but likely not until spending substantial time researching the compound’s pharmacology, the disease’s pathophysiology, and the molecular relationships in between. Accordingly, we hope certain predictions will spur further research, such as trials to investigate the off-label use of acamprosate for epilepsy, which is supported by one animal model [68].

As demonstrated by the 15 ictogenic compounds in our top 100 epilepsy predictions, Project Rephetio’s predictions can include contraindications in addition to indications. Since many of Hetionet v1.0’s relationship types are general (e.g. the Compound–binds–Gene relationship type conflates antagonist with agonist effects), we expect some high scoring predictions to exacerbate rather than treat the disease. However, the predictions made by Hetionet v1.0 represent such substantial relative enrichment over the null that uncovering the correct directionality is a logical next step and worth undertaking. Going forward, advances in automated mining of the scientific literature could enable extraction of precise relationship types at omics scale [69,70].

Future research should focus on gleaning orthogonal information from data types that are so expansive that computational methods are the only option. Our CuGuCtD feature — measuring whether a compound upregulates the same genes as compounds which treat the disease — is a good example. This metapath was informative by itself (Δ AUROC = 4.4%) but was not selected by the model, despite its orthogonal origin (gene expression) to selected metapaths. Using a more extensive catalog of treatments as the gold standard would be one possible approach to increase the power of feature selection.

Integrating more types of information into Hetionet should also be a future priority. The “network effect” phenomenon suggests that the addition of each new piece of information will enhance the value of Hetionet’s existing information. We envision a future where all biological knowledge is encoded into a single hetnet. Hetionet v1.0 was an early attempt, and we hope the strong performance of Project Rephetio in repurposing drugs foreshadows the many applications that will thrive from encoding biology in hetnets.

Methods

Hetionet was built entirely from publicly available resources with the goal of integrating a broad diversity of information types of medical relevance, ranging in scale from molecular to organismal. Practical considerations such as data availability, licensing, reusability, documentation, throughput, and standardization informed our choice of resources. We abided by a simple litmus test for determining how to encode information in a hetnet: nodes represent nouns, relationships represent verbs [71,tag:chen?].

Our method for relationship prediction creates a strong incentive to avoid redundancy, which increases the computational burden without improving performance. In a previous study to predict disease–gene associations using a hetnet of pathophysiology [22], we found that different types of gene sets contributed highly redundant information. Therefore, in Hetionet v1.0 we reduced the number of gene set node types from 14 to 3 by omitting several gene set collections and aggregating all pathway nodes.

Nodes

Nodes encode entities. We extracted nodes from standard terminologies, which provide curated vocabularies to enable data integration and prevent concept duplication. The ease of mapping external vocabularies, adoption, and comprehensiveness were primary selection criteria. Hetionet v1.0 includes nodes from 5 ontologies (each providing a hierarchy of entities for a specific domain), selected for their conformity to current best practices [72].

We selected 137 terms from the Disease Ontology [73,74] (which we refer to as DO Slim [75,76]) as our disease set. Our goal was to identify complex diseases that are distinct and specific enough to be clinically relevant yet general enough to be well annotated. To this end, we included diseases that have been studied by GWAS and cancer types from TopNodes_DOcancerslim [77]. We ensured that no DO Slim disease was a subtype of another DO Slim disease. Symptoms were extracted from MeSH by taking the 438 descendants of Signs and Symptoms [78,79].

Approved small molecule compounds with documented chemical structures were extracted from DrugBank version 4.2 [80,81,82]. Unapproved compounds were excluded because our focus was repurposing. In addition, unapproved compounds tend to be less studied than approved compounds, making them less attractive for our approach, where robust network connectivity is critical. Finally, restricting to small molecules with documented structures enabled us to map between compound vocabularies (see Mappings).

Side effects were extracted from SIDER version 4.1 [83,84,85]. SIDER codes side effects using UMLS identifiers [86], which we also adopted. Pharmacologic Classes were extracted from the DrugCentral data repository [87,88]. Only pharmacologic classes corresponding to the “Chemical/Ingredient”, “Mechanism of Action”, and “Physiologic Effect” FDA class types were included to avoid pharmacologic classes that were synonymous with indications [88].

Protein-coding human genes were extracted from Entrez Gene [89,90,91]. Anatomical structures, which we refer to as anatomies, were extracted from Uberon [92]. We selected a subset of 402 Uberon terms by excluding terms known not to exist in humans and terms that were overly broad or arcane [93,94].

Pathways were extracted by combining human pathways from WikiPathways [95,96], Reactome [97], and the Pathway Interaction Database [98]. The latter two resources were retrieved from Pathway Commons (RRID:SCR_002103) [99], which compiles pathways from several providers. Duplicate pathways and pathways without multiple participating genes were removed [100,101]. Biological processes, cellular components, and molecular functions were extracted from the Gene Ontology [102]. Only terms with 2–1000 annotated genes were included.

Mappings

Before adding relationships, all identifiers needed to be converted into vocabularies matching those of our nodes. Oftentimes, our node vocabularies included external mappings. For example, the Disease Ontology includes mappings to MeSH, UMLS, and the ICD, several of which we submitted during the course of this study [103]. In a few cases, the only option was to map using gene symbols, a disfavored method given that it can lead to ambiguities.

When mapping external disease concepts onto DO Slim, we used transitive closure. For example, the UMLS concept for primary progressive multiple sclerosis (C0751964) was mapped to the DO Slim term for multiple sclerosis (DOID:2377).
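
As a minimal sketch of this transitive-closure mapping, the Python below walks up a toy is-a hierarchy until it reaches a DO Slim term. The `PARENTS` table, the placeholder child identifier, and the function name are illustrative assumptions rather than the code used in this study; only the UMLS concept and the DO Slim term come from the example above.

```python
# Toy is-a hierarchy: child term -> parent terms.
# "DOID:PPMS-placeholder" is a hypothetical stand-in for the subtype term.
PARENTS = {
    "DOID:PPMS-placeholder": ["DOID:2377"],  # primary progressive MS is-a multiple sclerosis
    "DOID:2377": [],
}
DO_SLIM = {"DOID:2377"}

# External concept -> directly mapped DO term (the target here is a placeholder).
UMLS_TO_DO = {"C0751964": "DOID:PPMS-placeholder"}


def map_to_slim(term, parents=PARENTS, slim=DO_SLIM):
    """Return the DO Slim terms that equal `term` or are ancestors of it."""
    found, stack, seen = set(), [term], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current in slim:
            found.add(current)
        stack.extend(parents.get(current, []))
    return found


slim_terms = map_to_slim(UMLS_TO_DO["C0751964"])
print(slim_terms)
```

Because no DO Slim disease is a subtype of another, each lineage in such a traversal can contribute at most one DO Slim term.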

Chemical vocabularies presented the greatest mapping challenge [81], since these are poorly standardized [104]. UniChem’s [105] Connectivity Search [106] was used to map compounds, which maps by atomic connectivity (based on First InChIKey Hash Blocks [107]) and ignores small molecular differences.

Edges

Anatomy–downregulates–Gene and Anatomy–upregulates–Gene edges [108,109,110] were extracted from Bgee [111], which computes differentially expressed genes by anatomy in post-juvenile adult humans. Anatomy–expresses–Gene edges were extracted from Bgee and TISSUES [112,113,114].

Compound–binds–Gene edges were aggregated from BindingDB [115,116], DrugBank [80,117], and DrugCentral [87]. Only binding relationships to single proteins with affinities of at least 1 μM (as determined by K_d, Kᵢ, or IC₅₀) were selected from the October 2015 release of BindingDB [118,119]. Target, carrier, transporter, and enzyme interactions with single proteins (i.e. excluding protein groups) were extracted from DrugBank 4.2 [82,120]. In addition, all mapping DrugCentral target relationships were included [88].

Compound–treats–Disease (disease-modifying indications) and Compound–palliates–Disease (symptomatic indications) edges are from PharmacotherapyDB as described in Intermediate resources. Compound–causes–Side Effect edges were obtained from SIDER 4.1 [83,84,85], which uses natural language processing to identify side effects in drug labels. Compound–resembles–Compound relationships [82,121,122] represent chemical similarity and correspond to a Dice coefficient ≥ 0.5 [123] between extended connectivity fingerprints [124,125]. Pharmacologic Class–includes–Compound edges were extracted from DrugCentral for three FDA class types [87,88]. Compound–downregulates–Gene and Compound–upregulates–Gene relationships were computed from LINCS L1000 as described in Intermediate resources.
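
The similarity threshold for Compound–resembles–Compound edges can be illustrated with a short sketch. It assumes each fingerprint is represented as a set of on-bit indices; the fingerprints shown are made up, and the function name is hypothetical rather than taken from our pipeline.

```python
def dice(bits_a, bits_b):
    """Dice coefficient between two fingerprints given as sets of on-bit indices."""
    if not bits_a and not bits_b:
        return 0.0
    return 2 * len(bits_a & bits_b) / (len(bits_a) + len(bits_b))


# Made-up fingerprints, for illustration only.
fp_x = {1, 4, 9, 16}
fp_y = {1, 4, 9, 25, 36, 49}

similarity = dice(fp_x, fp_y)   # 2 * 3 shared bits / (4 + 6 bits)
resembles = similarity >= 0.5   # threshold stated in the text
print(similarity, resembles)
```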

Disease–associates–Gene edges were extracted from the GWAS Catalog [126], DISEASES [127,128], DisGeNET [129,130], and DOAF [131,132]. The GWAS Catalog compiles disease–SNP associations from published GWAS [133]. We aggregated overlapping loci associated with each disease and identified the mode reported gene for each high confidence locus [134,135]. DISEASES integrates evidence of association from text mining, curated catalogs, and experimental data [136]. Associations from DISEASES with integrated scores ≥ 2 were included after removing the contribution of DistiLD. DisGeNET integrates evidence from over 10 sources and reports a single score for each association [137,138]. Associations with scores ≥ 0.06 were included. DOAF mines Entrez Gene GeneRIFs (textual annotations of gene function) for disease mentions [139]. Associations with 3 or more supporting GeneRIFs were included. Disease–downregulates–Gene and Disease–upregulates–Gene relationships [140,141] were computed using STARGEO as described in Intermediate resources.

Disease–localizes–Anatomy, Disease–presents–Symptom, and Disease–resembles–Disease edges were calculated from MEDLINE co-occurrence [78,142]. MEDLINE is a subset of 21 million PubMed articles for which designated human curators have assigned topics. When retrieving articles for a given topic (MeSH term), we activated two non-default search options as specified below: majr for selecting only articles where the topic is major and noexp for suppressing explosion (returning articles linked to MeSH subterms). We identified 4,161,769 articles with two or more disease topics; 696,252 articles with both a disease topic (majr) and an anatomy topic (noexp) [143]; and 363,928 articles with both a disease topic (majr) and a symptom topic (noexp). We used a Fisher’s exact test [144] to identify pairs of terms that occurred together more than would be expected by chance in their respective corpus. We included co-occurring terms with p < 0.005 in Hetionet v1.0.
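
The co-occurrence test can be sketched in a few lines of Python. This version assumes a one-sided test for enrichment, which matches "more than would be expected by chance" but is our reading rather than a documented detail. The function name is hypothetical, and the exact integer arithmetic is only practical for small toy counts, not for corpora of millions of articles.

```python
from math import comb


def fisher_exact_greater(both, only_a, only_b, neither):
    """One-sided Fisher's exact p-value that terms A and B co-occur more than expected.

    Arguments are the four cells of the 2x2 article-count table.
    """
    n_a = both + only_a                       # articles annotated with term A
    n_b = both + only_b                       # articles annotated with term B
    total = both + only_a + only_b + neither  # all articles in the corpus
    # Sum hypergeometric probabilities for tables at least as extreme as observed.
    numerator = sum(
        comb(n_b, k) * comb(total - n_b, n_a - k)
        for k in range(both, min(n_a, n_b) + 1)
    )
    return numerator / comb(total, n_a)


p_value = fisher_exact_greater(both=3, only_a=1, only_b=1, neither=3)
print(p_value, p_value < 0.005)
```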

Gene→regulates→Gene directed edges were generated from the LINCS L1000 genetic interference screens (see Intermediate resources) and indicate that knockdown or overexpression of the source gene significantly dysregulated the target gene [145,146]. Gene–covaries–Gene edges represent evolutionary rate covariation ≥ 0.75 [147,148,149]. Gene–interacts–Gene edges [150,151] represent when two genes produce physically-interacting proteins. We compiled these interactions from the Human Interactome Database [152,153,154,155], the Incomplete Interactome [156], and our previous study [22]. Gene–participates–Biological Process, Gene–participates–Cellular Component, and Gene–participates–Molecular Function edges are from Gene Ontology annotations [157]. As described in Intermediate resources, annotations were propagated [158,159]. Gene–participates–Pathway edges were included from the human pathway resources described in the Nodes section [100,101].

Directionality

Whether a certain type of relationship has directionality is defined at the metaedge level. Directed metaedges are only necessary when they connect a metanode to itself and correspond to an asymmetric relationship. In the case of Hetionet v1.0, the sole directed metaedge was Gene→regulates→Gene. To demonstrate the implications of directionality, Hetionet v1.0 contains two relationships between the genes HADH and STAT1: HADH–interacts–STAT1 and HADH→regulates→STAT1. Both edges can be represented in the inverse orientation: STAT1–interacts–HADH and STAT1←regulates←HADH. However, due to the directed nature of the regulates relationship, STAT1→regulates→HADH is a distinct edge, which does not exist in the network. Similarly, HADH–associates–obesity and obesity–associates–HADH are inverse orientations of the same underlying undirected relationship. Accordingly, the following path exists in the network: obesity–associates–HADH→regulates→STAT1, which can also be inverted to STAT1←regulates←HADH–associates–obesity.
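
One way to make these semantics concrete is the small sketch below, which stores undirected edges under an order-free key and directed edges under an ordered key. The data structure and function names are assumptions for illustration and do not reflect how hetio or Neo4j store edges internally.

```python
DIRECTED_METAEDGES = {"regulates"}  # the sole directed metaedge in Hetionet v1.0


def edge_key(source, kind, target):
    """Canonical key: ordered for directed metaedges, order-free otherwise."""
    if kind in DIRECTED_METAEDGES:
        return (kind, source, target)
    return (kind, frozenset((source, target)))


edges = {
    edge_key("HADH", "interacts", "STAT1"),
    edge_key("HADH", "regulates", "STAT1"),
    edge_key("HADH", "associates", "obesity"),
}


def has_edge(source, kind, target):
    """Check whether an edge exists when read from source to target."""
    return edge_key(source, kind, target) in edges


print(has_edge("STAT1", "interacts", "HADH"))  # inverse orientation of an undirected edge
print(has_edge("STAT1", "regulates", "HADH"))  # distinct directed edge, absent from the network
```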

Intermediate resources

In the process of creating Hetionet, we produced several datasets with broad applicability that extended beyond Project Rephetio. These resources are referred to as intermediate resources and described below.

Transcriptional signatures of disease using STARGEO

STARGEO is a nascent platform for annotating and meta-analyzing differential gene expression experiments [160]. The STAR acronym stands for Search-Tag-Analyze Resources, while GEO refers to the Gene Expression Omnibus [161,162]. STARGEO is a layer on top of GEO that crowdsources sample annotation and automates meta-analysis.

Using STARGEO, we computed differentially expressed genes between healthy and diseased samples for 49 diseases [140,141]. First, we and others created case/control tags for 66 diseases. After combing through GEO series and tagging samples, 49 diseases had sufficient data for case-control meta-analysis: multiple series with at least 3 cases and 3 controls. For each disease, we performed a random effects meta-analysis on each gene to combine log₂ fold-change across series. These analyses incorporated 27,019 unique samples from 460 series on 107 platforms.

Differentially expressed genes (false discovery rate ≤ 0.05) were identified for each disease. The median number of upregulated genes per disease was 351 and the median number of downregulated genes was 340. Endogenous depression was the only of the 49 diseases without any significantly dysregulated genes.

Transcriptional signatures of perturbation from LINCS L1000

LINCS L1000 profiled the transcriptional response to small molecule and genetic interference perturbations. To increase throughput, expression was only measured for 978 genes, which were selected for their ability to impute expression of the remaining genes. A single perturbation was often assayed under a variety of conditions including cell types, dosages, timepoints, and concentrations. Each condition generates a single signature of dysregulation z-scores. We further processed these signatures to fit into our approach [163,164].

We first computed consensus signatures (which meta-analyze multiple signatures to condense them into one) for DrugBank small molecules, Entrez genes, and all L1000 perturbations [145,146]. To begin, we discarded non-gold (non-replicating or indistinct) signatures. Then we meta-analyzed z-scores using Stouffer’s method. Each signature was weighted by its average Spearman’s correlation to other signatures, with a 0.05 minimum, to de-emphasize discordant signatures. Our signatures include the 978 measured genes and the 6,489 imputed genes from the “best inferred gene subset”. To identify significantly dysregulated genes, we selected genes using a Bonferroni cutoff of p = 0.05 and limited the number of imputed genes to 1,000.
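
The weighted meta-analysis step can be sketched as follows for a single gene. The sketch assumes the standard weighted form of Stouffer’s method and takes the per-signature weights (average Spearman correlations) as precomputed inputs; the function name and example numbers are hypothetical.

```python
from math import sqrt


def weighted_stouffer(z_scores, weights, min_weight=0.05):
    """Combine one gene's z-scores across signatures with weighted Stouffer's method.

    Weights below `min_weight` are raised to it, so that discordant signatures
    are de-emphasized rather than dropped or negatively weighted.
    """
    floored = [max(w, min_weight) for w in weights]
    numerator = sum(w * z for w, z in zip(floored, z_scores))
    return numerator / sqrt(sum(w * w for w in floored))


# Two concordant signatures and one discordant signature (made-up values).
consensus_z = weighted_stouffer([2.5, 3.0, -0.5], weights=[0.6, 0.7, -0.1])
print(consensus_z)
```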

The consensus signatures for genetic perturbations allowed us to assess various characteristics of the L1000 dataset. First, we looked at whether genetic interference dysregulated its target gene in the expected direction [165]. Looking at measured z-scores for target genes, we found that the knockdown perturbations were highly reliable, while the overexpression perturbations were only moderately reliable, with 36% of overexpression perturbations downregulating their target. However, imputed z-scores for target genes barely exceeded chance at responding in the expected direction to interference. Hence, we concluded that the imputation quality of LINCS L1000 is poor. Nonetheless, when restricting to significantly dysregulated targets, 22 out of 29 imputed genes responded in the expected direction. This provides some evidence that the directional fidelity of imputation is higher for significantly dysregulated genes. Finally, we found that the transcriptional signatures of knocking down and overexpressing the same gene were positively correlated 65% of the time, suggesting the presence of a general stress response [166].

Based on these findings, we performed additional filtering of significantly dysregulated genes when building Hetionet v1.0. Compound–down/up-regulates–Gene relationships were restricted to the 125 most significant per compound-direction-status combination (status refers to measured versus imputed). For genetic interference perturbations, we restricted to the 50 most significant genes per gene-direction-status combination and merged the remaining edges into a single Gene→regulates→Gene relationship type containing both knockdown and overexpression perturbations.

PharmacotherapyDB: physician curated indications

We created PharmacotherapyDB, an open catalog of drug therapies for disease [167,168,169]. Version 1.0 contains 755 disease-modifying therapies and 390 symptomatic therapies between 97 diseases and 601 compounds.

This resource was motivated by the need for a gold standard of medical indications to train and evaluate our approach. Initially, we identified four existing indication catalogs [170]: MEDI-HPS which mined indications from RxNorm, SIDER 2, MedlinePlus, and Wikipedia [171]; LabeledIn which extracted indications from drug labels via human curation [172,173,174]; EHRLink which identified medication–problem pairs that clinicians linked together in electronic health records [175,176]; and indications from PREDICT, which were compiled from UMLS relationships, drugs.com, and drug labels [24]. After mapping to DO Slim and DrugBank Slim, the four resources contained 1,388 distinct indications.

However, we noticed that many indications were palliative and hence problematic as a gold standard of pharmacotherapy for our in silico approach. Therefore, we recruited two practicing physicians to curate the 1,388 preliminary indications [177]. After a pilot on 50 indications, we defined three classifications: disease modifying meaning a drug that therapeutically changes the underlying or downstream biology of the disease; symptomatic meaning a drug that treats a significant symptom of the disease; and non-indication meaning a drug that neither therapeutically changes the underlying or downstream biology nor treats a significant symptom of the disease. Both curators independently classified all 1,388 indications.

The two curators disagreed on 444 calls (Cohen’s κ = 49.9%). We then recruited a third practicing physician, who reviewed all 1,388 calls and created a detailed explanation of his methodology [177]. We proceeded with the third curator’s calls as the consensus curation. The first two curators did have reservations with classifying steroids as disease modifying for autoimmune diseases. We ultimately considered that these indications met our definition of disease modifying, which is based on a pathophysiological rather than clinical standard. Accordingly, therapies we consider disease modifying may not be used to alter long-term disease course in the modern clinic due to a poor risk–benefit ratio.

User-friendly Gene Ontology annotations

We created a browser (http://git.dhimmel.com/gene-ontology/) to provide straightforward access to Gene Ontology annotations [158,159]. Our service provides annotations between Gene Ontology terms and Entrez Genes. The user chooses propagated/direct annotation and all/experimental evidence. Annotations are currently available for 37 species and downloadable as user-friendly TSV files.

Data copyright and licensing

We committed to openly releasing our data and analyses from the origin of the project [178]. Our goals were to contribute to the advancement of science [179,180], maximize our impact [181,182], and enable reproducibility [183,184,185]. These objectives required publicly distributing and openly licensing Hetionet and Project Rephetio data and analyses [186,187].

Since we integrated only public resources, which were overwhelmingly funded by academic grants, we had assumed that our project and open sharing of our network would not be an issue. However, upon releasing a preliminary version of Hetionet [188], a community reviewer informed us of legal barriers to integrating public data. In essence, both copyright (rights of exclusivity automatically granted to original works) and terms of use (rules that users must agree to in order to use a resource) place legally-binding restrictions on data reuse. In short, public data is not by default open data.

Hetionet v1.0 integrates 29 resources (Table 4), but two resources were removed prior to the v1.0 release. Of the total 31 resources [189], five were United States government works not subject to copyright, and twelve had licenses that met the Open Definition of knowledge version 2.1. Four resources allowed only non-commercial reuse. Most problematic were the remaining nine resources that had no license (which equates to all rights reserved by default and forbids reuse [190]) and one resource that explicitly forbade redistribution.

Additional difficulty resulted from license incompatibilities across resources, which were caused primarily by non-commercial and share-alike stipulations. Furthermore, it was often unclear who owned the data [194]. Therefore, we sought input from legal experts and chronicled our progress [189,191,192,193,195].

Ultimately, we did not find an ideal solution. We had to choose between absolute compliance and Hetionet: strictly adhering to copyright and licensing arrangements would have decimated the network. On the other hand, in the United States, mere facts are not subject to copyright, and fair use doctrine helps protect reuse that is transformative and educational. Hence, we chose a path forward that balanced legal, normative, ethical, and scientific considerations.

If a resource was in the public domain, we licensed any derivatives as CC0 1.0. For resources licensed to allow reuse, redistribution, and modification, we transmitted their licenses as properties on the specific nodes and relationships in Hetionet v1.0. For all other resources — for example, resources without licenses or with licenses that forbid redistribution — we sent permission requests to their creators. The median time till first response to our permission requests was 16 days, with only 2 resources affirmatively granting us permission. We did not receive any responses asking us to remove a resource. However, we did voluntarily remove MSigDB [196], since its license was highly problematic [195]. As a result of our experience, we recommend that publicly-funded data should be explicitly dedicated to the public domain whenever possible.

Permuted hetnets

From Hetionet, we derived five permuted hetnets [197]. The permutations preserve node degree but eliminate edge specificity by employing an algorithm called XSwap to randomly swap edges [198]. To extend XSwap to hetnets [22], we permuted each metaedge separately, so that edges were only swapped with other edges of the same type. We adopted a Markov chain approach, whereby the first permuted hetnet was generated from Hetionet v1.0, the second permuted hetnet was generated from the first, and so on. For each metaedge, we assessed the percent of edges unchanged as the algorithm progressed to ensure that a sufficient number of swaps had been performed to randomize the network [197]. Permuted hetnets are useful for computing the baseline performance of meaningless edges while preserving node degree [199]. Since our use of permutation focused on assessing Δ AUROC, a small number of permuted hetnets was sufficient, as the variability in a metapath’s AUROC across the permuted hetnets was low.
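
A degree-preserving swap for a single metaedge can be sketched as below. This is a simplified illustration for a metaedge connecting two different node types, not the implementation used to generate the permuted hetnets; the function name, toy edges, swap counts, and seeds are all assumptions.

```python
import random
from collections import Counter


def xswap(edges, n_swaps, seed=0):
    """Randomly swap edge endpoints while preserving every node's degree.

    `edges` is a list of unique (source, target) pairs for ONE metaedge.
    """
    rng = random.Random(seed)
    edge_list = list(edges)
    edge_set = set(edge_list)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edge_list)), 2)
        (a, b), (c, d) = edge_list[i], edge_list[j]
        new_1, new_2 = (a, d), (c, b)
        # Reject swaps that would duplicate an existing edge
        # (this also covers pairs sharing a source or a target).
        if new_1 in edge_set or new_2 in edge_set:
            continue
        edge_set -= {(a, b), (c, d)}
        edge_set |= {new_1, new_2}
        edge_list[i], edge_list[j] = new_1, new_2
    return edge_list


original = [("c1", "d1"), ("c1", "d2"), ("c2", "d2"), ("c3", "d3"), ("c3", "d1")]
first = xswap(original, n_swaps=50, seed=1)   # permuted from the original
second = xswap(first, n_swaps=50, seed=2)     # Markov chain: permuted from the first
unchanged = len(set(original) & set(second)) / len(original)
print(second, unchanged)
```

Tracking `unchanged` as the number of swaps grows mirrors the check described above for deciding when a metaedge has been sufficiently randomized.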

Graph databases & Neo4j

Traditional relational databases — such as SQLite, MySQL, and PostgreSQL — excel at storing highly structured data in tables. Connectivity between tables is accomplished using foreign-key references between columns. However, for many biomedical applications the connectivity between entities is of foremost importance. Furthermore, enforcing a rigid structure of what attributes an entity may possess is less important and often unnecessarily prohibitive. Graph databases focus instead on capturing connectivity (relationships) between entities (nodes). Accordingly, graph databases such as Neo4j offer greater ease when modeling biomedical relationships and superior performance when traversing many levels of connectivity [200,201]. Until recently, graph database adoption in bioinformatics was limited [202]. However lately, the demand to model and capture biological connectivity at scale has led to increasing adoption [203,204,205,206].

We used the Neo4j graph database for storing and operating on Hetionet and noticed major benefits from tapping into this large open source ecosystem [207]. Persistent storage with immediate access and the Cypher query language (a sort of SQL for hetnets) were two of the biggest benefits. To facilitate our migration to Neo4j, we updated hetio, our existing Python package for hetnets [208], to export networks into Neo4j and DWPC queries to Cypher. In addition, we created an interactive GraphGist for Project Rephetio, which introduces our approach and showcases its Cypher queries. Finally, we created a public Neo4j instance [209], which leverages several modern technologies such as Neo4j Browser guides, cloud hosting with HTTPS, and Docker deployment [210,211].

Machine learning approach

Project Rephetio relied on the previously-published DWPC metric to generate features for compound–disease pairs. The DWPC measures the prevalence of a given metapath between a given source and target node [22]. It is calculated by first extracting all paths from the source to target node that follow the specified metapath. Next, each path is weighted by taking the product of the node degrees along the path raised to a negative exponent. This damping exponent (the sole parameter) thereby determines the extent to which paths through high-degree nodes are downweighted: we chose w = 0.4 based on our past optimizations [22]. The DWPC equals the sum of the path weights (referred to as path-degree products). Traversing the hetnet to extract all paths between a source and target node, which we performed in Neo4j, is the most computationally intensive step in computing DWPCs [212]. For future work, we are exploring matrix multiplication approaches, which could improve runtime by several orders of magnitude.
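
The calculation can be illustrated on a toy hetnet with a pure-Python sketch. It assumes that the degrees entering each path-degree product are metaedge-specific and that paths revisiting a node are excluded; the adjacency dictionaries, node names, and function names are hypothetical, and the actual computation was performed with Cypher queries in Neo4j.

```python
from collections import defaultdict


def invert(adjacency):
    """Build the target -> sources view of a source -> targets adjacency dict."""
    inverse = defaultdict(set)
    for node, neighbors in adjacency.items():
        for neighbor in neighbors:
            inverse[neighbor].add(node)
    return dict(inverse)


def dwpc(source, target, metapath, w=0.4):
    """Degree-weighted path count along `metapath`.

    `metapath` is a list of (forward, backward) adjacency dicts, one per metaedge.
    """
    def walk(node, step, visited, pdp):
        if step == len(metapath):
            return pdp if node == target else 0.0
        forward, backward = metapath[step]
        total = 0.0
        for nxt in forward.get(node, ()):
            if nxt in visited:  # skip paths that revisit a node
                continue
            # Damp by the degrees of both endpoints for this metaedge.
            weight = (len(forward[node]) * len(backward[nxt])) ** -w
            total += walk(nxt, step + 1, visited | {nxt}, pdp * weight)
        return total

    return walk(source, 0, frozenset([source]), 1.0)


# Toy Compound-binds-Gene-associates-Disease (CbGaD) data.
binds = {"C1": {"G1", "G2"}, "C2": {"G1"}}
associates = {"G1": {"D1"}, "G2": {"D1", "D2"}}
CbGaD = [(binds, invert(binds)), (associates, invert(associates))]

score = dwpc("C1", "D1", CbGaD)
print(score)
```

In this toy network, two paths connect C1 to D1, and each has a degree product of 8, so the DWPC is 2 × 8^(−0.4).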

Project Rephetio made several refinements to metapath-based hetnet edge prediction compared to previous studies [22,23]. First, we transformed DWPCs by mean scaling and then taking the inverse hyperbolic sine [213] to make them more amenable to modeling [214]. Second, we bifurcated the workflow into an all-features stage and an all-observations stage [40]. The all-features stage assesses feature performance and does not require computing features for all negatives. Here we selected a random subset of 3,020 (4 × 755) negatives. Little error was introduced by this optimization, since the predominant limitation to performance assessment was the small number of positives (755) rather than negatives. Based on the all-features performance assessment [215], we selected 142 DWPCs to compute on all observations (all 209,168 compound–disease pairs). The feature selection was designed to remove uninformative features (according to permutation) and guard against edge-dropout contamination [216]. Third, we included 14 degree features, which assess the degree of a specific metaedge for either the source compound or target disease.
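
The DWPC transformation in the first refinement can be sketched as below. We assume here that mean scaling means dividing each DWPC of a metapath by that metapath’s mean DWPC; the function name and the handling of an all-zero feature are illustrative choices.

```python
from math import asinh


def transform_dwpcs(dwpcs):
    """Scale one metapath's DWPCs by their mean, then apply the inverse hyperbolic sine."""
    mean = sum(dwpcs) / len(dwpcs)
    if mean == 0:
        return [0.0 for _ in dwpcs]
    return [asinh(value / mean) for value in dwpcs]


transformed = transform_dwpcs([0.0, 1.0, 2.0])
print(transformed)
```

Unlike a logarithm, the inverse hyperbolic sine is defined at zero, which matters because many compound–disease pairs have no paths for a given metapath.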

Network support of predictions

To improve the interpretability of the predictions, we developed a method for decomposing a prediction into its network support [217]. This information is deployed to our Neo4j Browser guides, allowing users to assess the biomedical evidence contributing to a given prediction. First, we used logistic regression terms to quantify the contribution of metapaths that positively support a prediction. Second, we decomposed a metapath’s contribution, according to its DWPC, into specific path contributions. Finally, we aggregated paths based on their source (first) or target (last) edge to quantify the contribution of specific edges of the source compound or target disease [218].

Using the acamprosate–epilepsy prediction as an example, we first quantified metapath contributions: 40% of the prediction was supported by CbGbCtD paths, 36% by CbGaD paths, 11% by CcSEcCtD paths, 8% by CbGpPWpGaD paths, and 5% by CbGeAlD paths. Second, we calculated path contributions: Acamprosate–binds–GRM5–associates–epilepsy syndrome was the most supportive path, contributing 11% of the prediction. Finally, we aggregated path contributions to calculate that the source edge of Acamprosate–binds–GRM5 contributed 23% of the prediction, while the target edge of epilepsy syndrome–treats–Felbamate contributed 12%.
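
The second and third steps can be sketched as follows, assuming a metapath’s contribution is split across its paths in proportion to their path-degree products. The metapath contributions (40% and 36%) come from the example above, whereas the node names and path weights are made up and the function names are hypothetical.

```python
from collections import defaultdict


def path_contributions(metapath_contribution, path_weights):
    """Split a metapath's contribution across paths in proportion to path weight."""
    dwpc = sum(path_weights.values())
    return {
        path: metapath_contribution * weight / dwpc
        for path, weight in path_weights.items()
    }


def aggregate_by_source_edge(contributions):
    """Sum path contributions over paths sharing the same first edge."""
    totals = defaultdict(float)
    for path, value in contributions.items():
        totals[path[:2]] += value  # (source node, first neighbor)
    return dict(totals)


# Paths are tuples of node names; the weights are illustrative path-degree products.
cbgad = path_contributions(0.36, {
    ("compound", "gene_1", "disease"): 0.3,
    ("compound", "gene_2", "disease"): 0.1,
})
cbgbctd = path_contributions(0.40, {
    ("compound", "gene_1", "compound_2", "disease"): 0.2,
})
by_source_edge = aggregate_by_source_edge({**cbgad, **cbgbctd})
print(by_source_edge)
```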

Prior probability of treatment

The 755 treatments in Hetionet v1.0 are not evenly distributed between all compounds and diseases. For example, methotrexate treats 19 diseases and hypertension is treated by 68 compounds. We estimated a prior probability of treatment — based only on the treatment degree of the source compound and target disease — on 744,975 permutations of the bipartite treatment network [219]. Methotrexate received a 79.6% prior probability of treating hypertension, whereas a compound and disease that both had only one treatment received a prior of 0.12%.

Across the 209,168 compound–disease pairs, the prior predicted the known treatments with AUROC = 97.9%. The strength of this association threatened to dominate our predictions. However, not modeling the prior can lead to omitted-variable bias and confounded proxy variables. To address the issue, we included the logit-transformed prior, without any regularization, as a term in the model. This restricted model fitting to the 29,799 observations with a nonzero prior — corresponding to the 387 compounds and 77 diseases with at least one treatment. To enable predictions for all 209,168 observations, we set the prior for each compound–disease pair to the overall prevalence of positives (0.36%).
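
The default prior follows directly from the counts reported above, as the short calculation below shows. The variable and function names are assumptions; only the counts and the logit transformation come from the text.

```python
from math import log


def logit(p):
    """Log-odds transformation applied to the prior before it enters the model."""
    return log(p / (1 - p))


n_treatments = 755
n_pairs = 209_168

prevalence = n_treatments / n_pairs       # overall prevalence of positives
default_prior_term = logit(prevalence)    # model term used when predicting on all pairs
print(f"{prevalence:.2%}", default_prior_term)
```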

This method succeeded at accommodating the treatment degrees. The prior probabilities performed poorly on the validation sets with AUROC = 54.1% on DrugCentral indications and AUROC = 62.5% on clinical trials. This performance dropoff compared to training shows the danger of encoding treatment degree into predictions. The benefits of our solution are highlighted by the superior validation performance of our predictions compared to the prior (Figure 3).

Indication sets

Only the Clinical Trial and DrugCentral indication sets were used for external validation, since the Disease Modifying and Symptomatic indications were included in the hetnet. As an aside, several additional indication catalogs have recently been published, which future studies may want to also consider [170,222,223,224].

Realtime open science & Thinklab

We conducted our study using Thinklab — a platform for realtime open collaborative science — on which this study was the first project [66]. We began the study by publicly proposing the idea and inviting discussion [225]. We continued by chronicling our progress via discussions. We used Thinklab as the frontend to coordinate and report our analyses and GitHub as the backend to host our code, data, and notebooks. On top of our Thinklab team consisting of core contributors, we welcomed community contribution and review. In areas where our expertise was lacking or advice would be helpful, we sought input from domain experts and encouraged them to respond on Thinklab where their comments would be CC BY licensed and their contribution rated and rewarded.

In total, 40 non-team members commented across 86 discussions, which generated 622 comments and 191 notes (Figure 6). Thinklab content for this project totaled 145,771 words or 918,837 characters [226]. Using an estimated 7,000 words per academic publication as a benchmark, Project Rephetio generated written content comparable in volume to 20.8 publications prior to its completion. We noticed several other benefits from using Thinklab including forging a community of contributors [227]; receiving feedback during the early stages when feedback was most actionable [228]; disseminating our research without delay [229,230]; opening avenues for external input [231]; facilitating problem-oriented teaching [232,233]; and improving our documentation by maintaining a publication-grade digital lab notebook [234].

Thinklab began winding down operations in July 2017 and has switched to a static state. While users will no longer be able to add comments, the corpus of content remains browsable at https://think-lab.github.io and available in machine-readable formats at dhimmel/thinklytics.

The preprint for this study is available at doi.org/bs4f [235]. The manuscript was written in markdown, originally on Thinklab at doi.org/bszr [236]. In August 2017, we switched to using the Manubot system to generate the manuscript. With Manubot, a GitHub repository (dhimmel/rephetio-manuscript) tracks the manuscript’s source code, while continuous integration automatically rebuilds the manuscript upon changes. As a result, the latest version of the manuscript is always available at dhimmel.github.io/rephetio-manuscript. Additionally, readers can leave feedback or questions for the Project Rephetio team via GitHub Issues.

Software & data availability

All software and datasets from Project Rephetio are publicly available on GitHub, Zenodo, or Figshare [237]. Additional documentation for these materials is available in the corresponding Thinklab discussions. For reader convenience, software, datasets, and Thinklab discussions have been cited throughout the manuscript as relevant.

Acknowledgements

We are immensely grateful to our Thinklab contributors who joined us in our experiment of radically open science. The following non-team members provided contributions that received 5 or more Thinklab points: Lars Juhl Jensen, Frederic Bastian, Alexander Pico, Casey Greene, Benjamin Good, Craig Knox, Mike Gilson, Chris Mungall, Katie Fortney, Venkat Malladi, Tudor Oprea, MacKenzie Smith, Caty Chung, Allison McCoy, Alexey Strokach, Ritu Khare, Greg Way, Marina Sirota, Raghavendran Partha, Oleg Ursu, Jesse Spaulding, Gaya Nadarajan, Alex Ratner, Scooter Morris, Alessandro Didonna, Alex Pankov, Tong Shu Li, and Janet Piñero. Additionally, the founder of Thinklab, Jesse Spaulding, supported community contributions and developed the platform with Project Rephetio’s needs in mind. We also appreciate DigitalOcean’s sponsorship of the Hetionet Browser to cover its hosting costs. Finally, we would like to thank Neo Technology, whose staff provided excellent technical support.

This material is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant Number 1144247 to DSH. SEB is supported by NINDS/NIH Grant Number 5R01NS088155 and the Heidrich Family and Friends Foundation. DH is supported by the National Cancer Institute of the National Institutes of Health under Award Number UH2CA203792 and the National Library of Medicine under Award Number 1U01LM012675. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.

References

1.

Innovation in the pharmaceutical industry: New estimates of R&D costs

Joseph A DiMasi, Henry G Grabowski, Ronald W Hansen

Journal of Health Economics (2016-05) https://doi.org/f3mn5k

DOI: 10.1016/j.jhealeco.2016.01.012 · PMID: 26928437

2.

Trends in development and approval times for new therapeutics in the United States

Janice M Reichert

Nature Reviews Drug Discovery (2003-09) https://doi.org/fw24wk

DOI: 10.1038/nrd1178 · PMID: 12951576

3.

Clinical development success rates for investigational drugs

Michael Hay, David W Thomas, John L Craighead, Celia Economides, Jesse Rosenthal

Nature Biotechnology (2014-01) https://doi.org/f3mn5m

DOI: 10.1038/nbt.2786 · PMID: 24406927

4.

Diagnosing the decline in pharmaceutical R&D efficiency

Jack W Scannell, Alex Blanckley, Helen Boldon, Brian Warrington

Nature Reviews Drug Discovery (2012-03) https://doi.org/f3mn5n

DOI: 10.1038/nrd3681 · PMID: 22378269

5.

Drug repositioning: identifying and developing new uses for existing drugs

Ted T Ashburn, Karl B Thor

Nature Reviews Drug Discovery (2004-08) https://doi.org/cfmks9

DOI: 10.1038/nrd1468 · PMID: 15286734

6.

A method for systematic discovery of adverse drug events from clinical notes

Guan Wang, Kenneth Jung, Rainer Winnenburg, Nigam H Shah

Journal of the American Medical Informatics Association (2015-07-31) https://doi.org/f3mn4r

DOI: 10.1093/jamia/ocv102 · PMID: 26232442 · PMCID: PMC4921953

7.

Validating drug repurposing signals using electronic health records: a case study of metformin associated with reduced cancer mortality

H Xu, MC Aldrich, Q Chen, H Liu, NB Peterson, Q Dai, M Levy, A Shah, X Han, X Ruan, … JC Denny

Journal of the American Medical Informatics Association (2014-07-22) https://doi.org/f3mn5p

DOI: 10.1136/amiajnl-2014-002649 · PMID: 25053577 · PMCID: PMC4433365

8.

Mining Retrospective Data for Virtual Prospective Drug Repurposing: L-DOPA and Age-related Macular Degeneration

Murray H Brilliant, Kamyar Vaziri, Thomas B Connor Jr., Stephen G Schwartz, Joseph J Carroll, Catherine A McCarty, Steven J Schrodi, Scott J Hebbring, Krishna S Kishor, Harry W Flynn Jr., … Brian S McKay

The American Journal of Medicine (2016-03) https://doi.org/f3mn5q

DOI: 10.1016/j.amjmed.2015.10.015 · PMID: 26524704 · PMCID: PMC4841631

9.

Data-Driven Prediction of Drug Effects and Interactions

NP Tatonetti, PP Ye, R Daneshjou, RB Altman

Science Translational Medicine (2012-03-14) https://doi.org/f3mn5r

DOI: 10.1126/scitranslmed.3003377 · PMID: 22422992 · PMCID: PMC3382018

10.

Bayesian statistical methods for genetic association studies

Matthew Stephens, David J Balding

Nature Reviews Genetics (2009-10) https://doi.org/fwsqv7

DOI: 10.1038/nrg2615 · PMID: 19763151

11.

The complex genetics of multiple sclerosis: pitfalls and prospects

Stephen Sawcer

Brain (2008-05-18) https://doi.org/bwb6wq

DOI: 10.1093/brain/awn081 · PMID: 18490360 · PMCID: PMC2639203

12.

Magic shotguns versus magic bullets: selectively non-selective drugs for mood disorders and schizophrenia

Bryan L Roth, Douglas J Sheffler, Wesley K Kroeze

Nature Reviews Drug Discovery (2004-04) https://doi.org/ctv9tr

DOI: 10.1038/nrd1346 · PMID: 15060530

13.

Network pharmacology: the next paradigm in drug discovery

Andrew L Hopkins

Nature Chemical Biology (2008-10-20) https://doi.org/d9wjp3

DOI: 10.1038/nchembio.118 · PMID: 18936753

14.

Network pharmacology

Andrew L Hopkins

Nature Biotechnology (2007-10) https://doi.org/cfmxn4

DOI: 10.1038/nbt1007-1110 · PMID: 17921993

15.

How were new medicines discovered?

David C Swinney, Jason Anthony

Nature Reviews Drug Discovery (2011-06-24) https://doi.org/bbg5wh

DOI: 10.1038/nrd3480 · PMID: 21701501

16.

Drug discovery in the age of systems biology: the rise of computational approaches for data integration

Murat Iskar, Georg Zeller, Xing-Ming Zhao, Vera van Noort, Peer Bork

Current Opinion in Biotechnology (2012-08) https://doi.org/b6c3jw

DOI: 10.1016/j.copbio.2011.11.010 · PMID: 22153034

17.

The Connectivity Map: a new tool for biomedical research

Justin Lamb

Nature Reviews Cancer (2007-01) https://doi.org/cdvmsf

DOI: 10.1038/nrc2044 · PMID: 17186018

18.

Applications of Connectivity Map in drug discovery and development

Xiaoyan A Qu, Deepak K Rajpal

Drug Discovery Today (2012-12) https://doi.org/f3mn5s

DOI: 10.1016/j.drudis.2012.07.017 · PMID: 22889966

19.

In silico methods for drug repurposing and pharmacology

Rachel A Hodos, Brian A Kidd, Khader Shameer, Ben P Readhead, Joel T Dudley

Wiley Interdisciplinary Reviews: Systems Biology and Medicine (2016-04-15) https://doi.org/f9pbwc

DOI: 10.1002/wsbm.1337 · PMID: 27080087 · PMCID: PMC4845762

20.

Computational Drug Repositioning: From Data to Therapeutics

MR Hurle, L Yang, Q Xie, DK Rajpal, P Sanseau, P Agarwal

Clinical Pharmacology & Therapeutics (2013-01-15) https://doi.org/f3mn5t

DOI: 10.1038/clpt.2013.1 · PMID: 23443757

21.

In silico drug repositioning – what we need to know

Zhichao Liu, Hong Fang, Kelly Reagan, Xiaowei Xu, Donna L Mendrick, William Slikker Jr, Weida Tong

Drug Discovery Today (2013-02) https://doi.org/f3mn5v

DOI: 10.1016/j.drudis.2012.08.005 · PMID: 22935104

22.

Heterogeneous Network Edge Prediction: A Data Integration Approach to Prioritize Disease-Associated Genes

Daniel S Himmelstein, Sergio E Baranzini

PLOS Computational Biology (2015-07-09) https://doi.org/98q

DOI: 10.1371/journal.pcbi.1004259 · PMID: 26158728 · PMCID: PMC4497619

23.

Co-author Relationship Prediction in Heterogeneous Bibliographic Networks

Yizhou Sun, Rick Barber, Manish Gupta, Charu C Aggarwal, Jiawei Han

2011 International Conference on Advances in Social Networks Analysis and Mining (2011-07) https://doi.org/cm4chq

DOI: 10.1109/asonam.2011.112

24.

PREDICT: a method for inferring novel drug indications with application to personalized medicine

Assaf Gottlieb, Gideon Y Stein, Eytan Ruppin, Roded Sharan

Molecular Systems Biology (2011-01) https://doi.org/cjvp2w

DOI: 10.1038/msb.2011.26 · PMID: 21654673 · PMCID: PMC3159979

25.

Systematic evaluation of connectivity map for disease indications

Jie Cheng, Lun Yang, Vinod Kumar, Pankaj Agarwal

Genome Medicine (2014-12) https://doi.org/f3mn5w

DOI: 10.1186/s13073-014-0095-1 · PMID: 25606058 · PMCID: PMC4278345

26.

Network-based in silico drug efficacy screening

Emre Guney, Jörg Menche, Marc Vidal, Albert-László Barabási

Nature Communications (2016-02-01) https://doi.org/f3mn5x

DOI: 10.1038/ncomms10331 · PMID: 26831545 · PMCID: PMC4740350

27.

A new method for computational drug repositioning using drug pairwise similarity

Jiao Li, Zhiyong Lu

2012 IEEE International Conference on Bioinformatics and Biomedicine (2012-10) https://doi.org/f3mn48

DOI: 10.1109/bibm.2012.6392722 · PMID: 25264495 · PMCID: PMC4175719

28.

Systematic Evaluation of Drug–Disease Relationships to Identify Leads for Novel Drug Uses

AP Chiang, AJ Butte

Clinical Pharmacology & Therapeutics (2009-07-01) https://doi.org/bshjd8

DOI: 10.1038/clpt.2009.103 · PMID: 19571805 · PMCID: PMC2836384

29.

The Connectivity Map: Using Gene-Expression Signatures to Connect Small Molecules, Genes, and Disease

J Lamb

Science (2006-09-29) https://doi.org/c92ptt

DOI: 10.1126/science.1132939 · PMID: 17008526

30.

Transcriptional data: a new gateway to drug repositioning?

Francesco Iorio, Timothy Rittman, Hong Ge, Michael Menden, Julio Saez-Rodriguez

Drug Discovery Today (2013-04) https://doi.org/f3mn5z

DOI: 10.1016/j.drudis.2012.07.014 · PMID: 22897878 · PMCID: PMC3625109

31.

The support of human genetic evidence for approved drug indications

Matthew R Nelson, Hannah Tipney, Jeffery L Painter, Judong Shen, Paola Nicoletti, Yufeng Shen, Aris Floratos, Pak Chung Sham, Mulin Jun Li, Junwen Wang, … Philippe Sanseau

Nature Genetics (2015-06-29) https://doi.org/f3mn52

DOI: 10.1038/ng.3314 · PMID: 26121088

32.

Use of genome-wide association studies for drug repositioning

Philippe Sanseau, Pankaj Agarwal, Michael R Barnes, Tomi Pastinen, J Brent Richards, Lon R Cardon, Vincent Mooser

Nature Biotechnology (2012-04) https://doi.org/f3mn53

DOI: 10.1038/nbt.2151 · PMID: 22491277

33.

Drug Target Identification Using Side-Effect Similarity

M Campillos, M Kuhn, A-C Gavin, LJ Jensen, P Bork

Science (2008-07-11) https://doi.org/bb6

DOI: 10.1126/science.1158140 · PMID: 18621671

34.

Computational drug repositioning based on side-effects mined from social media

Timothy Nugent, Vassilis Plachouras, Jochen L Leidner

PeerJ Computer Science (2016-02-24) https://doi.org/f3mn54

DOI: 10.7717/peerj-cs.46

35.

Human symptoms–disease network

XueZhong Zhou, Jörg Menche, Albert-László Barabási, Amitabh Sharma

Nature Communications (2014-06-26) https://doi.org/f3mn55

DOI: 10.1038/ncomms5212 · PMID: 24967666

36.

Pathway-based Bayesian inference of drug–disease interactions

Naruemon Pratanwanich, Pietro Lió

Molecular BioSystems (2014) https://doi.org/f3mn56

DOI: 10.1039/c4mb00014e · PMID: 24695945

37.

Exploring the power of Hetionet: a Cypher query depot

Daniel Himmelstein

ThinkLab (2016-06-25) https://doi.org/brsd

DOI: 10.15363/thinklab.d220

38.

Dhimmel/Hetionet V1.0.0: Hetionet V1.0 In Json, Tsv, And Neo4J Formats

Daniel Himmelstein

Zenodo (2017-02-03) https://doi.org/gbr42n

DOI: 10.5281/zenodo.268568

39.

Computing standardized logistic regression coefficients

Daniel Himmelstein, Antoine Lizee

ThinkLab (2016-04-21) https://doi.org/bhz9

DOI: 10.15363/thinklab.d205

40.

Our hetnet edge prediction methodology: the modeling framework for Project Rephetio

Daniel Himmelstein

ThinkLab (2016-05-04) https://doi.org/f3qbmj

DOI: 10.15363/thinklab.d210

41.

Dhimmel/Learn V1.0: The Machine Learning Repository For Project Rephetio

Daniel Himmelstein

Zenodo (2017-02-04) https://doi.org/gbr42p

DOI: 10.5281/zenodo.268654

42.

Predictions of whether a compound treats a disease

Daniel Himmelstein, Chrissy Hessler, Pouya Khankhanian

ThinkLab (2016-05-17) https://doi.org/f3qbmh

DOI: 10.15363/thinklab.d203

43.

Development of Novel Pharmacotherapeutics for Tobacco Dependence: Progress and Future Directions

D Harmey, PR Griffin, PJ Kenny

Nicotine & Tobacco Research (2012-09-27) https://doi.org/f4crs2

DOI: 10.1093/ntr/nts201 · PMID: 23024249 · PMCID: PMC3611986

44.

Varenicline Is a Partial Agonist at α4β2 and a Full Agonist at α7 Neuronal Nicotinic Receptors

Karla B Mihalak, F Ivy Carroll, Charles W Luetje

Molecular Pharmacology (2006-06-09) https://doi.org/cbmrs6

DOI: 10.1124/mol.106.025130 · PMID: 16766716

45.

A variant associated with nicotine dependence, lung cancer and peripheral arterial disease

Thorgeir E Thorgeirsson, Frank Geller, Patrick Sulem, Thorunn Rafnar, Anna Wiste, Kristinn P Magnusson, Andrei Manolescu, Gudmar Thorleifsson, Hreinn Stefansson, Andres Ingason, … Kari Stefansson

Nature (2008-04) https://doi.org/bxg8qk

DOI: 10.1038/nature06846 · PMID: 18385739 · PMCID: PMC4539558

46.

Evaluation of the safety of bupropion (Zyban) for smoking cessation from experience gained in general practice use in England in 2000

Andrew Boshier, Lynda V Wilton, Saad AW Shakir

European Journal of Clinical Pharmacology (2003-12-01) https://doi.org/b77h82

DOI: 10.1007/s00228-003-0693-0 · PMID: 14615857

47.

Efficacy and Safety of Varenicline for Smoking Cessation

J Taylor Hays, Jon O Ebbert, Amit Sood

The American Journal of Medicine (2008-04) https://doi.org/bp33z7

DOI: 10.1016/j.amjmed.2008.01.017 · PMID: 18342165

48.

Nicotine receptor partial agonists for smoking cessation

Kate Cahill, Nicola Lindson-Hawley, Kyla H Thomas, Thomas R Fanshawe, Tim Lancaster

Cochrane Database of Systematic Reviews (2016-05-09) https://doi.org/f8rnz8

DOI: 10.1002/14651858.cd006103.pub7 · PMID: 27158893 · PMCID: PMC6464943

49.

Placebo-Controlled Trial of Cytisine for Smoking Cessation

Robert West, Witold Zatonski, Magdalena Cedzynska, Dorota Lewandowska, Joanna Pazik, Paul Aveyard, John Stapleton

New England Journal of Medicine (2011-09-29) https://doi.org/d69twc

DOI: 10.1056/nejmoa1102035 · PMID: 21991893

50.

Cytisine versus Nicotine for Smoking Cessation

Natalie Walker, Colin Howe, Marewa Glover, Hayden McRobbie, Joanne Barnes, Vili Nosa, Varsha Parag, Bruce Bassett, Christopher Bullen

New England Journal of Medicine (2014-12-18) https://doi.org/xtz

DOI: 10.1056/nejmoa1407764 · PMID: 25517706

51.

Repeated administration of an acetylcholinesterase inhibitor attenuates nicotine taking in rats and smoking behavior in human smokers

RL Ashare, BA Kimmey, LE Rupprecht, ME Bowers, MR Hayes, HD Schmidt

Translational Psychiatry (2016-01) https://doi.org/f78cns

DOI: 10.1038/tp.2015.209 · PMID: 26784967 · PMCID: PMC5068882

52.

Prediction in epilepsy

Pouya Khankhanian, Daniel Himmelstein

ThinkLab (2016-09-18) https://doi.org/gbr42f

DOI: 10.15363/thinklab.d224

53.

Visualizing the top epilepsy predictions in Cytoscape

Daniel Himmelstein, Pouya Khankhanian, Alexander Pico, Lars Juhl Jensen, Scooter Morris

ThinkLab (2017-01-24) https://doi.org/gbr42r

DOI: 10.15363/thinklab.d230

54.

Treatment of Refractory Status Epilepticus With Inhalational Anesthetic Agents Isoflurane and Desflurane

Seyed M Mirsattari, Michael D Sharpe, G Bryan Young

Archives of Neurology (2004-08-01) https://doi.org/b648br

DOI: 10.1001/archneur.61.8.1254 · PMID: 15313843

55.

Anatomical Therapeutic Chemical Classification System (WHO)

Karen Knaus

The SAGE Encyclopedia of Pharmacology and Society (2016-03-29) https://doi.org/gbr42m

DOI: 10.4135/9781483349985.n37 · ISBN: 9781483350004

56.

Antiepileptic Drug Interactions - Principles and Clinical Implications

Svein I Johannessen, Cecilie Johannessen Landmark

Current Neuropharmacology (2010-09-01) https://doi.org/cvc5z8

DOI: 10.2174/157015910792246254 · PMID: 21358975 · PMCID: PMC3001218

57.

The neurobiology of antiepileptic drugs

Michael A Rogawski, Wolfgang Löscher

Nature Reviews Neuroscience (2004-07) https://doi.org/fvwhnm

DOI: 10.1038/nrn1430 · PMID: 15208697

58.

Proconvulsant effects of antidepressants — What is the current evidence?

Cecilie Johannessen Landmark, Oliver Henning, Svein I Johannessen

Epilepsy & Behavior (2016-08) https://doi.org/f8zscw

DOI: 10.1016/j.yebeh.2016.01.029 · PMID: 26926001

59.

Why we predicted ictogenic tricyclic compounds treat epilepsy?

Daniel Himmelstein

ThinkLab (2017-03-10) https://doi.org/gbr42k

DOI: 10.15363/thinklab.d231

60.

Movement disorders in patients taking anticonvulsants

C Zadikoff, RP Munhoz, AN Asante, N Politzer, R Wennberg, P Carlen, A Lang

Journal of Neurology, Neurosurgery & Psychiatry (2007-02-01) https://doi.org/fpvxgq

DOI: 10.1136/jnnp.2006.100222 · PMID: 17012337 · PMCID: PMC2077655

61.

Anticonvulsant-induced downbeat nystagmus in epilepsy

Dongyan Wu, Roland D Thijs

Epilepsy & Behavior Case Reports (2015) https://doi.org/gbr4z9

DOI: 10.1016/j.ebcr.2015.07.003 · PMID: 26543808 · PMCID: PMC4556747

62.

Gastrointestinal adverse effects of antiepileptic drugs in intractable epileptic patients

Soodeh Razeghi Jahromi, Mansoureh Togha, Sohrab Hashemi Fesharaki, Masoumeh Najafi, Nahid Beladi Moghadam, Jalil Arab Kheradmand, Hadi Kazemi, Ali Gorji

Seizure (2011-05) https://doi.org/chk4dn

DOI: 10.1016/j.seizure.2010.12.011 · PMID: 21236703

63.

Methods for biological data integration: perspectives and challenges

Vladimir Gligorijević, Nataša Pržulj

Journal of The Royal Society Interface (2015-11-06) https://doi.org/bdzp

DOI: 10.1098/rsif.2015.0571 · PMID: 26490630 · PMCID: PMC4685837

64.

Multilayer networks

M Kivela, A Arenas, M Barthelemy, JP Gleeson, Y Moreno, MA Porter

Journal of Complex Networks (2014-07-14) https://doi.org/f3mn4x

DOI: 10.1093/comnet/cnu016

65.

Renaming ‘heterogeneous networks’ to a more concise and catchy term

Daniel Himmelstein, Casey Greene, Sergio Baranzini

ThinkLab (2015-08-16) https://doi.org/f3mn4v

DOI: 10.15363/thinklab.d104

66.

Rephetio: Repurposing drugs on a hetnet [project]

Daniel Himmelstein, Antoine Lizee, Chrissy Hessler, Leo Brueggeman, Sabrina Chen, Dexter Hadley, Ari Green, Pouya Khankhanian, Sergio Baranzini

ThinkLab (2015-01-12) https://doi.org/993

DOI: 10.15363/thinklab.4

67.

Discovery and Preclinical Validation of Drug Indications Using Compendia of Public Gene Expression Data

M Sirota, JT Dudley, J Kim, AP Chiang, AA Morgan, A Sweet-Cordero, J Sage, AJ Butte

Science Translational Medicine (2011-08-17) https://doi.org/c3fwxv

DOI: 10.1126/scitranslmed.3001318 · PMID: 21849665 · PMCID: PMC3502016

68.

Acamprosate attenuates the handling induced convulsions during alcohol withdrawal in Swiss Webster mice

Justin M Farook, Ali Krazem, Ben Lewis, Dennis J Morrell, John M Littleton, Susan Barron

Physiology & Behavior (2008-09) https://doi.org/cnfsgc

DOI: 10.1016/j.physbeh.2008.05.020 · PMID: 18577392 · PMCID: PMC2561203

69.

Data programming with DDLite

Henry R Ehrenberg, Jaeho Shin, Alexander J Ratner, Jason A Fries, Christopher Ré

Proceedings of the Workshop on Human-In-the-Loop Data Analytics - HILDA '16 (2016) https://doi.org/gbr42c

DOI: 10.1145/2939502.2939515

70.

Brainstorming future directions for Hetionet

Daniel Himmelstein, Benjamin Good, Pouya Khankhanian, Alex Ratner

ThinkLab (2016-11-19) https://doi.org/gbr42g

DOI: 10.15363/thinklab.d227

71.

Data nomenclature: naming and abbreviating our network types

Daniel Himmelstein, Lars Juhl Jensen, Pouya Khankhanian

ThinkLab (2016-02-17) https://doi.org/f3qbmc

DOI: 10.15363/thinklab.d162

72.

Ten Simple Rules for Selecting a Bio-ontology

James Malone, Robert Stevens, Simon Jupp, Tom Hancocks, Helen Parkinson, Cath Brooksbank

PLOS Computational Biology (2016-02-11) https://doi.org/f3mn59

DOI: 10.1371/journal.pcbi.1004743 · PMID: 26867217 · PMCID: PMC4750991

73.

Disease Ontology: a backbone for disease semantic integration

LM Schriml, C Arze, S Nadendla, Y-WW Chang, M Mazaitis, V Felix, G Feng, WA Kibbe

Nucleic Acids Research (2011-11-12) https://doi.org/fwvx82

DOI: 10.1093/nar/gkr972 · PMID: 22080554 · PMCID: PMC3245088

74.

Disease Ontology 2015 update: an expanded and updated database of human diseases for linking biomedical knowledge through disease data

Warren A Kibbe, Cesar Arze, Victor Felix, Elvira Mitraka, Evan Bolton, Gang Fu, Christopher J Mungall, Janos X Binder, James Malone, Drashtti Vasant, … Lynn M Schriml

Nucleic Acids Research (2014-10-27) https://doi.org/f3mn6b

DOI: 10.1093/nar/gku1011 · PMID: 25348409 · PMCID: PMC4383880

75.

Unifying disease vocabularies

Daniel Himmelstein, Tong Shu Li

ThinkLab (2015-03-30) https://doi.org/f3mqv5

DOI: 10.15363/thinklab.d44

76.

User-Friendly Extensions To The Disease Ontology V1.0

Daniel S Himmelstein

Zenodo (2016-02-04) https://doi.org/f3mqvc

DOI: 10.5281/zenodo.45584

77.

Generating a focused view of disease ontology cancer terms for pan-cancer data integration and analysis

T-J Wu, LM Schriml, Q-R Chen, M Colbert, DJ Crichton, R Finney, Y Hu, WA Kibbe, H Kincaid, D Meerzaman, … R Mazumder

Database (2015-04-04) https://doi.org/f3mn6c

DOI: 10.1093/database/bav032 · PMID: 25841438 · PMCID: PMC4385274

78.

Mining knowledge from MEDLINE articles and their indexed MeSH terms

Daniel Himmelstein, Alex Pankov

ThinkLab (2015-05-10) https://doi.org/f3mqwp

DOI: 10.15363/thinklab.d67

79.

User-Friendly Extensions To Mesh V1.0

Daniel S Himmelstein

Zenodo (2016-02-04) https://doi.org/f3mqvt

DOI: 10.5281/zenodo.45586

80.

DrugBank 4.0: shedding new light on drug metabolism

Vivian Law, Craig Knox, Yannick Djoumbou, Tim Jewison, An Chi Guo, Yifeng Liu, Adam Maciejewski, David Arndt, Michael Wilson, Vanessa Neveu, … David S Wishart

Nucleic Acids Research (2013-11-06) https://doi.org/f3mn6d

DOI: 10.1093/nar/gkt1068 · PMID: 24203711 · PMCID: PMC3965102

81.

Unifying drug vocabularies

Daniel Himmelstein

ThinkLab (2015-03-16) https://doi.org/f3mqtx

DOI: 10.15363/thinklab.d40

82.

User-Friendly Extensions Of The Drugbank Database V1.0

Daniel S Himmelstein

Zenodo (2016-02-04) https://doi.org/f3mqwk

DOI: 10.5281/zenodo.45579

83.

The SIDER database of drugs and side effects

Michael Kuhn, Ivica Letunic, Lars Juhl Jensen, Peer Bork

Nucleic Acids Research (2015-10-19) https://doi.org/f3mn6f

DOI: 10.1093/nar/gkv1075 · PMID: 26481350 · PMCID: PMC4702794

84.

Extracting side effects from SIDER 4

Daniel Himmelstein

ThinkLab (2015-08-08) https://doi.org/f3mqvd

DOI: 10.15363/thinklab.d97

85.

Extracting Tidy And User-Friendly Tsvs From Sider 4.1

Daniel S Himmelstein

Zenodo (2016-02-03) https://doi.org/f3mqwd

DOI: 10.5281/zenodo.45521

86.

The Unified Medical Language System (UMLS): integrating biomedical terminology

O Bodenreider

Nucleic Acids Research (2004-01-01) https://doi.org/bzs9ps

DOI: 10.1093/nar/gkh061 · PMID: 14681409 · PMCID: PMC308795

87.

DrugCentral: online drug compendium

Oleg Ursu, Jayme Holmes, Jeffrey Knockel, Cristian G Bologa, Jeremy J Yang, Stephen L Mathias, Stuart J Nelson, Tudor I Oprea

Nucleic Acids Research (2016-10-26) https://doi.org/f9vphz

DOI: 10.1093/nar/gkw993 · PMID: 27789690 · PMCID: PMC5210665

88.

Incorporating DrugCentral data in our network

Daniel Himmelstein, Oleg Ursu, Mike Gilson, Pouya Khankhanian, Tudor Oprea

ThinkLab (2016-03-20) https://doi.org/f3mqtn

DOI: 10.15363/thinklab.d186

89.

Entrez Gene: gene-centered information at NCBI

D Maglott, J Ostell, KD Pruitt, T Tatusova

Nucleic Acids Research (2010-11-28) https://doi.org/fsjcqz

DOI: 10.1093/nar/gkq1237 · PMID: 21115458 · PMCID: PMC3013746

90.

Using Entrez Gene as our gene vocabulary

Daniel Himmelstein, Casey Greene, Alexander Pico

ThinkLab (2015-02-27) https://doi.org/f3mqv4

DOI: 10.15363/thinklab.d34

91.

Processed Entrez Gene Datasets For Humans V1.0

Daniel S Himmelstein

Zenodo (2016-02-04) https://doi.org/f3mqvz

DOI: 10.5281/zenodo.45524

92.

Uberon, an integrative multi-species anatomy ontology

Christopher J Mungall, Carlo Torniai, Georgios V Gkoutos, Suzanna E Lewis, Melissa A Haendel

Genome Biology (2012) https://doi.org/fxx6qr

DOI: 10.1186/gb-2012-13-1-r5 · PMID: 22293552 · PMCID: PMC3334586

93.

Tissue Node

Venkat Malladi, Daniel Himmelstein, Chris Mungall

ThinkLab (2015-03-19) https://doi.org/f3mn6g

DOI: 10.15363/thinklab.d41

94.

User-Friendly Anatomical Structures Data From The Uberon Ontology V1.0

Daniel S Himmelstein

Zenodo (2016-02-04) https://doi.org/f3mqtt

DOI: 10.5281/zenodo.45527

95.

WikiPathways: capturing the full diversity of pathway knowledge

Martina Kutmon, Anders Riutta, Nuno Nunes, Kristina Hanspers, Egon L Willighagen, Anwesha Bohler, Jonathan Mélius, Andra Waagmeester, Sravanthi R Sinha, Ryan Miller, … Alexander R Pico

Nucleic Acids Research (2015-10-19) https://doi.org/f3mn6h

DOI: 10.1093/nar/gkv1024 · PMID: 26481357 · PMCID: PMC4702772

96.

WikiPathways: Pathway Editing for the People

Alexander R Pico, Thomas Kelder, Martijn P van Iersel, Kristina Hanspers, Bruce R Conklin, Chris Evelo

PLoS Biology (2008-07-22) https://doi.org/bhsdc2

DOI: 10.1371/journal.pbio.0060184 · PMID: 18651794 · PMCID: PMC2475545

97.

The Reactome pathway Knowledgebase

Antonio Fabregat, Konstantinos Sidiropoulos, Phani Garapati, Marc Gillespie, Kerstin Hausmann, Robin Haw, Bijay Jassal, Steven Jupe, Florian Korninger, Sheldon McKay, … Peter D'Eustachio

Nucleic Acids Research (2015-12-09) https://doi.org/f3mn6j

DOI: 10.1093/nar/gkv1351 · PMID: 26656494 · PMCID: PMC4702931

98.

PID: the Pathway Interaction Database

Carl F Schaefer, Kira Anthony, Shiva Krupa, Jeffrey Buchoff, Matthew Day, Timo Hannay, Kenneth H Buetow

Nucleic Acids Research (2008-10-02) https://doi.org/dv62wn

DOI: 10.1093/nar/gkn653 · PMID: 18832364 · PMCID: PMC2686461

99.

Pathway Commons, a web resource for biological pathway data

EG Cerami, BE Gross, E Demir, I Rodchenkov, O Babur, N Anwar, N Schultz, GD Bader, C Sander

Nucleic Acids Research (2010-11-10) https://doi.org/csjvsx

DOI: 10.1093/nar/gkq1039 · PMID: 21071392 · PMCID: PMC3013659

100.

Adding pathway resources to your network

Alexander Pico, Daniel Himmelstein

ThinkLab (2015-06-08) https://doi.org/f3mn6k

DOI: 10.15363/thinklab.d72

101.

Dhimmel/Pathways V2.0: Compiling Human Pathway Gene Sets

Daniel S Himmelstein, Alexander R Pico

Zenodo (2016-04-02) https://doi.org/bfq3

DOI: 10.5281/zenodo.48810

102.

Gene Ontology: tool for the unification of biology

Michael Ashburner, Catherine A Ball, Judith A Blake, David Botstein, Heather Butler, J Michael Cherry, Allan P Davis, Kara Dolinski, Selina S Dwight, Janan T Eppig, … Gavin Sherlock

Nature Genetics (2000-05) https://doi.org/b9gp96

DOI: 10.1038/75556 · PMID: 10802651 · PMCID: PMC3037419

103.

Disease Ontology feature requests

Daniel Himmelstein

ThinkLab (2015-05-11) https://doi.org/f3mqvf

DOI: 10.15363/thinklab.d68

104.

Chemical databases: curation or integration by user-defined equivalence?

Anne Hersey, Jon Chambers, Louisa Bellis, A Patrícia Bento, Anna Gaulton, John P Overington

Drug Discovery Today: Technologies (2015-07) https://doi.org/f3mn6m

DOI: 10.1016/j.ddtec.2015.01.005 · PMID: 26194583 · PMCID: PMC6294287

105.

UniChem: a unified chemical structure cross-referencing and identifier tracking system

Jon Chambers, Mark Davies, Anna Gaulton, Anne Hersey, Sameer Velankar, Robert Petryszak, Janna Hastings, Louisa Bellis, Shaun McGlinchey, John P Overington

Journal of Cheminformatics (2013-01-14) https://doi.org/f3mn6n

DOI: 10.1186/1758-2946-5-3 · PMID: 23317286 · PMCID: PMC3616875

106.

UniChem: extension of InChI-based compound mapping to salt, connectivity and stereochemistry layers

Jon Chambers, Mark Davies, Anna Gaulton, George Papadatos, Anne Hersey, John P Overington

Journal of Cheminformatics (2014-09-04) https://doi.org/vkq

DOI: 10.1186/s13321-014-0043-5 · PMID: 25221628 · PMCID: PMC4158273

107.

InChI - the worldwide chemical structure identifier standard

Stephen Heller, Alan McNaught, Stephen Stein, Dmitrii Tchekhovskoi, Igor Pletnev

Journal of Cheminformatics (2013-01-24) https://doi.org/6bg

DOI: 10.1186/1758-2946-5-7 · PMID: 23343401 · PMCID: PMC3599061

108.

Dhimmel/Bgee V1.0: Anatomy-Specific Gene Expression In Humans From Bgee

Daniel Himmelstein, Frederic Bastian, Sergio Baranzini

Zenodo (2016-03-08) https://doi.org/f3mqv2

DOI: 10.5281/zenodo.47157

109.

Processing Bgee for tissue-specific gene presence and over/under-expression

Daniel Himmelstein, Frederic Bastian

ThinkLab (2015-11-03) https://doi.org/f3mqwg

DOI: 10.15363/thinklab.d124

110.

Tissue-specific gene expression resources

Daniel Himmelstein, Frederic Bastian

ThinkLab (2015-06-17) https://doi.org/f3mqvb

DOI: 10.15363/thinklab.d81

111.

Bgee: Integrating and Comparing Heterogeneous Transcriptome Data Among Species

Frederic Bastian, Gilles Parmentier, Julien Roux, Sebastien Moretti, Vincent Laudet, Marc Robinson-Rechavi

Lecture Notes in Computer Science (2008-06) https://doi.org/b6wnrr

DOI: 10.1007/978-3-540-69828-9_12

112.

Comprehensive comparison of large-scale tissue expression datasets

Alberto Santos, Kalliopi Tsafou, Christian Stolte, Sune Pletscher-Frankild, Seán I O’Donoghue, Lars Juhl Jensen

PeerJ (2015-06-30) https://doi.org/f3mn6p

DOI: 10.7717/peerj.1054 · PMID: 26157623 · PMCID: PMC4493645

113.

Gene–Tissue Relationships From The Tissues Database

Daniel Himmelstein, Lars Juhl Jensen

Zenodo (2015-08-09) https://doi.org/f3mqv8

DOI: 10.5281/zenodo.27244

114.

The TISSUES resource for the tissue-specificity of genes

Daniel Himmelstein, Lars Juhl Jensen

ThinkLab (2015-07-10) https://doi.org/f3mqwf

DOI: 10.15363/thinklab.d91

115.

BindingDB: A Web-Accessible Molecular Recognition Database

Xi Chen, Ming Liu, Michael Gilson

Combinatorial Chemistry & High Throughput Screening (2001-12-01) https://doi.org/f3mn6q

DOI: 10.2174/1386207013330670 · PMID: 11812264

116.

BindingDB in 2015: A public database for medicinal chemistry, computational chemistry and systems pharmacology

Michael K Gilson, Tiqing Liu, Michael Baitaluk, George Nicola, Linda Hwang, Jenny Chong

Nucleic Acids Research (2015-10-19) https://doi.org/f3mn6r

DOI: 10.1093/nar/gkv1072 · PMID: 26481362 · PMCID: PMC4702793

117.

DrugBank: a comprehensive resource for in silico drug discovery and exploration

DS Wishart

Nucleic Acids Research (2006-01-01) https://doi.org/c7cp42

DOI: 10.1093/nar/gkj067 · PMID: 16381955 · PMCID: PMC1347430

118.

Integrating drug target information from BindingDB

Daniel Himmelstein, Mike Gilson

ThinkLab (2015-04-13) https://doi.org/f3mqvv

DOI: 10.15363/thinklab.d53

119.

Processing The October 2015 Bindingdb

Daniel Himmelstein, Michael Gilson, Sergio Baranzini

Zenodo (2015-11-19) https://doi.org/f3mqvp

DOI: 10.5281/zenodo.33987

120.

Protein (target, carrier, transporter, and enzyme) interactions in DrugBank

Daniel Himmelstein, Sabrina Chen

ThinkLab (2015-05-09) https://doi.org/f3mqvm

DOI: 10.15363/thinklab.d65

121.

Calculating molecular similarities between DrugBank compounds

Daniel Himmelstein, Sabrina Chen

ThinkLab (2015-05-18) https://doi.org/f3mqwr

DOI: 10.15363/thinklab.d70

122.

Pairwise molecular similarities between DrugBank compounds

Daniel Himmelstein, Leo Brueggeman, Sergio Baranzini

Figshare (2015) https://doi.org/f3mqt2

DOI: 10.6084/m9.figshare.1418386

123.

Measures of the Amount of Ecologic Association Between Species

Lee R Dice

Ecology (1945-07) https://doi.org/dsb8pd

DOI: 10.2307/1932409

124.

Extended-Connectivity Fingerprints

David Rogers, Mathew Hahn

Journal of Chemical Information and Modeling (2010-04-28) https://doi.org/fp3ctj

DOI: 10.1021/ci100050t · PMID: 20426451

125.

The Generation of a Unique Machine Description for Chemical Structures-A Technique Developed at Chemical Abstracts Service.

HL Morgan

Journal of Chemical Documentation (1965-05) https://doi.org/bpzn7w

DOI: 10.1021/c160017a018

126.

Dhimmel/Gwas-Catalog V1.0: Extracting Gene–Disease Associations From The Gwas Catalog

Daniel S Himmelstein, Sergio E Baranzini

Zenodo (2016-03-26) https://doi.org/f3mqws

DOI: 10.5281/zenodo.48428

127.

Processing the DISEASES resource for disease–gene relationships

Daniel Himmelstein, Lars Juhl Jensen

ThinkLab (2015-08-20) https://doi.org/f3mqt3

DOI: 10.15363/thinklab.d106

128.

Dhimmel/Diseases V1.0: Processing The Diseases Database Of Gene–Disease Associations

Daniel S Himmelstein, Lars Juhl Jensen

Zenodo (2016-03-26) https://doi.org/f3mqv7

DOI: 10.5281/zenodo.48425

129.

Processing DisGeNET for disease-gene relationships

Daniel Himmelstein, Janet Piñero

ThinkLab (2015-08-17) https://doi.org/f3mqv3

DOI: 10.15363/thinklab.d105

130.

Dhimmel/Disgenet V1.0: Processing The DisGeNET Database Of Gene–Disease Associations

Daniel S Himmelstein, Janet Piñero

Zenodo (2016-03-26) https://doi.org/f3mqwt

DOI: 10.5281/zenodo.48426

131.

Functional disease annotations for genes using DOAF

Daniel Himmelstein

ThinkLab (2015-07-14) https://doi.org/f3mqvw

DOI: 10.15363/thinklab.d94

132.

Dhimmel/Doaf V1.0: Processing The DOAF Database Of Gene–Disease Associations

Daniel S Himmelstein

Zenodo (2016-03-26) https://doi.org/f3mqwx

DOI: 10.5281/zenodo.48427

133.

The new NHGRI-EBI Catalog of published genome-wide association studies (GWAS Catalog)

Jacqueline MacArthur, Emily Bowler, Maria Cerezo, Laurent Gil, Peggy Hall, Emma Hastings, Heather Junkins, Aoife McMahon, Annalisa Milano, Joannella Morales, … Helen Parkinson

Nucleic Acids Research (2016-11-29) https://doi.org/f9v7cp

DOI: 10.1093/nar/gkw1133 · PMID: 27899670 · PMCID: PMC5210590

134.

Extracting disease-gene associations from the GWAS Catalog

Daniel Himmelstein

ThinkLab (2015-06-16) https://doi.org/f3mqv6

DOI: 10.15363/thinklab.d80

135.

Calculating genomic windows for GWAS lead SNPs

Daniel Himmelstein, Marina Sirota, Greg Way

ThinkLab (2015-06-08) https://doi.org/f3mqt8

DOI: 10.15363/thinklab.d71

136.

DISEASES: Text mining and data integration of disease–gene associations

Sune Pletscher-Frankild, Albert Pallejà, Kalliopi Tsafou, Janos X Binder, Lars Juhl Jensen

Methods (2015-03) https://doi.org/f3mn6s

DOI: 10.1016/j.ymeth.2014.11.020 · PMID: 25484339

137.

DisGeNET: a discovery platform for the dynamical exploration of human diseases and their genes

J Pinero, N Queralt-Rosinach, A Bravo, J Deu-Pons, A Bauer-Mehren, M Baron, F Sanz, LI Furlong

Database (2015-04-15) https://doi.org/f3mn6t

DOI: 10.1093/database/bav028 · PMID: 25877637 · PMCID: PMC4397996

138.

DisGeNET: a comprehensive platform integrating information on human disease-associated genes and variants

Janet Piñero, Àlex Bravo, Núria Queralt-Rosinach, Alba Gutiérrez-Sacristán, Jordi Deu-Pons, Emilio Centeno, Javier García-García, Ferran Sanz, Laura I Furlong

Nucleic Acids Research (2016-10-19) https://doi.org/f9v9wp

DOI: 10.1093/nar/gkw943 · PMID: 27924018 · PMCID: PMC5210640

139.

A Framework for Annotating Human Genome in Disease Context

Wei Xu, Huisong Wang, Wenqing Cheng, Dong Fu, Tian Xia, Warren A Kibbe, Simon M Lin

PLoS ONE (2012-12-10) https://doi.org/f3mn6v

DOI: 10.1371/journal.pone.0049686 · PMID: 23251346 · PMCID: PMC3519466

140.

STARGEO: expression signatures for disease using crowdsourced GEO annotation

Daniel Himmelstein, Frederic Bastian, Dexter Hadley, Casey Greene

ThinkLab (2015-07-28) https://doi.org/f3mqwh

DOI: 10.15363/thinklab.d96

141.

Dhimmel/Stargeo V1.0: Differentially Expressed Genes For 48 Diseases From STARGEO

Daniel Himmelstein, Dexter Hadley, Alexander Schepanovski

Zenodo (2016-03-03) https://doi.org/f3mqvg

DOI: 10.5281/zenodo.46866

142.

Dhimmel/Medline V1.0: Disease, Symptom, And Anatomy Cooccurrence In MEDLINE

Daniel S Himmelstein

Zenodo (2016-03-28) https://doi.org/f3mqts

DOI: 10.5281/zenodo.48445

143.

Disease similarity from MEDLINE topic cooccurrence

Daniel Himmelstein

ThinkLab (2015-07-14) https://doi.org/f3mqvx

DOI: 10.15363/thinklab.d93

144.

On the Interpretation of χ² from Contingency Tables, and the Calculation of P

RA Fisher

Journal of the Royal Statistical Society (1922-01) https://doi.org/frpswx

DOI: 10.2307/2340521

145.

Computing consensus transcriptional profiles for LINCS L1000 perturbations

Daniel Himmelstein, Caty Chung

ThinkLab (2015-03-26) https://doi.org/f3mqwc

DOI: 10.15363/thinklab.d43

146.

Consensus signatures for LINCS L1000 perturbations

Daniel Himmelstein, Leo Brueggeman, Sergio Baranzini

Figshare (2016) https://doi.org/f3mqvs

DOI: 10.6084/m9.figshare.3085426.v1

147.

Evolutionary Signatures amongst Disease Genes Permit Novel Methods for Gene Prioritization and Construction of Informative Gene-Based Networks

Nolan Priedigkeit, Nicholas Wolfe, Nathan L Clark

PLOS Genetics (2015-02-13) https://doi.org/f3mn6w

DOI: 10.1371/journal.pgen.1004967 · PMID: 25679399 · PMCID: PMC4334549

148.

Selecting informative ERC (evolutionary rate covariation) values between genes

Daniel Himmelstein, Raghavendran Partha

ThinkLab (2015-04-22) https://doi.org/f3mqv9

DOI: 10.15363/thinklab.d57

149.

Dhimmel/Erc V1.0: Processing Human Evolutionary Rate Covariation Data

Daniel S Himmelstein

Zenodo (2016-03-28) https://doi.org/f3mqwm

DOI: 10.5281/zenodo.48444

150.

Creating a catalog of protein interactions

Daniel Himmelstein, Dexter Hadley, Alexey Strokach

ThinkLab (2015-07-01) https://doi.org/f3mqtp

DOI: 10.15363/thinklab.d85

151.

Dhimmel/Ppi V1.0: Compiling A Human Protein Interaction Catalog

Daniel S Himmelstein, Sergio E Baranzini

Zenodo (2016-03-28) https://doi.org/f3mqtw

DOI: 10.5281/zenodo.48443

152.

Towards a proteome-scale map of the human protein–protein interaction network

Jean-François Rual, Kavitha Venkatesan, Tong Hao, Tomoko Hirozane-Kishikawa, Amélie Dricot, Ning Li, Gabriel F Berriz, Francis D Gibbons, Matija Dreze, Nono Ayivi-Guedehoussou, … Marc Vidal

Nature (2005-09-28) https://doi.org/dw6q23

DOI: 10.1038/nature04209 · PMID: 16189514

153.

An empirical framework for binary interactome mapping

Kavitha Venkatesan, Jean-François Rual, Alexei Vazquez, Ulrich Stelzl, Irma Lemmens, Tomoko Hirozane-Kishikawa, Tong Hao, Martina Zenkner, Xiaofeng Xin, Kwang-Il Goh, … Marc Vidal

Nature Methods (2008-12-07) https://doi.org/cn6p3m

DOI: 10.1038/nmeth.1280 · PMID: 19060904 · PMCID: PMC2872561

154.

Next-generation sequencing to generate interactome datasets

Haiyuan Yu, Leah Tardivo, Stanley Tam, Evan Weiner, Fana Gebreab, Changyu Fan, Nenad Svrzikapa, Tomoko Hirozane-Kishikawa, Edward Rietman, Xinping Yang, … Marc Vidal

Nature Methods (2011-04-24) https://doi.org/bzrsvs

DOI: 10.1038/nmeth.1597 · PMID: 21516116 · PMCID: PMC3188388

155.

A Proteome-Scale Map of the Human Interactome Network

Thomas Rolland, Murat Taşan, Benoit Charloteaux, Samuel J Pevzner, Quan Zhong, Nidhi Sahni, Song Yi, Irma Lemmens, Celia Fontanillo, Roberto Mosca, … Marc Vidal

Cell (2014-11) https://doi.org/f3mn6x

DOI: 10.1016/j.cell.2014.10.050 · PMID: 25416956 · PMCID: PMC4266588

156.

Uncovering disease-disease relationships through the incomplete interactome

J Menche, A Sharma, M Kitsak, SD Ghiassian, M Vidal, J Loscalzo, A-L Barabasi

Science (2015-02-19) https://doi.org/f3mn6z

DOI: 10.1126/science.1257601 · PMID: 25700523 · PMCID: PMC4435741

157.

The GOA database: Gene Ontology annotation updates for 2015

Rachael P Huntley, Tony Sawford, Prudence Mutowo-Meullenet, Aleksandra Shypitsyna, Carlos Bonilla, Maria J Martin, Claire O'Donovan

Nucleic Acids Research (2014-11-06) https://doi.org/35x

DOI: 10.1093/nar/gku1113 · PMID: 25378336 · PMCID: PMC4383930

158.

Compiling Gene Ontology annotations into an easy-to-use format

Daniel Himmelstein, Casey Greene, Venkat Malladi, Frederic Bastian

ThinkLab (2015-03-12) https://doi.org/f3mqt9

DOI: 10.15363/thinklab.d39

159.

Gene-Ontology: Initial Zenodo Release

Daniel Himmelstein, Casey Greene, Venkat Malladi, Frederic Bastian, Sergio Baranzini

Zenodo (2015-07-28) https://doi.org/f3mqvj

DOI: 10.5281/zenodo.21711

160.

Precision annotation of digital samples in NCBI’s gene expression omnibus

Dexter Hadley, James Pan, Osama El-Sayed, Jihad Aljabban, Imad Aljabban, Tej D Azad, Mohamad O Hadied, Shuaib Raza, Benjamin Abhishek Rayikanti, Bin Chen, … Atul J Butte

Scientific Data (2017-09-19) https://doi.org/gbv379

DOI: 10.1038/sdata.2017.125 · PMID: 28925997 · PMCID: PMC5604135

161.

Gene Expression Omnibus: NCBI gene expression and hybridization array data repository

R Edgar

Nucleic Acids Research (2002-01-01) https://doi.org/fttpkn

DOI: 10.1093/nar/30.1.207 · PMID: 11752295 · PMCID: PMC99122

162.

NCBI GEO: archive for functional genomics data sets—update

Tanya Barrett, Stephen E Wilhite, Pierre Ledoux, Carlos Evangelista, Irene F Kim, Maxim Tomashevsky, Kimberly A Marshall, Katherine H Phillippy, Patti M Sherman, Michelle Holko, … Alexandra Soboleva

Nucleic Acids Research (2012-11-26) https://doi.org/f3mn62

DOI: 10.1093/nar/gks1193 · PMID: 23193258 · PMCID: PMC3531084

163.

Dhimmel/Lincs V2.0: Refined Consensus Signatures From LINCS L1000

Daniel Himmelstein, Leo Brueggeman, Sergio Baranzini

Zenodo (2016-03-08) https://doi.org/f3mqvr

DOI: 10.5281/zenodo.47223

164.

l1000.db: SQLite database of LINCS L1000 metadata

Daniel Himmelstein, Leo Brueggeman, Sergio Baranzini

Figshare (2016) https://doi.org/f3mqtq

DOI: 10.6084/m9.figshare.3085837.v1

165.

Assessing the imputation quality of gene expression in LINCS L1000

Daniel Himmelstein

ThinkLab (2016-03-11) https://doi.org/f3mqtr

DOI: 10.15363/thinklab.d185

166.

Positive correlations between knockdown and overexpression profiles from LINCS L1000

Daniel Himmelstein, Casey Greene, Lars Juhl Jensen

ThinkLab (2016-02-26) https://doi.org/f3mqt7

DOI: 10.15363/thinklab.d171

167.

Announcing PharmacotherapyDB: the Open Catalog of Drug Therapies for Disease

Daniel Himmelstein

ThinkLab (2016-03-15) https://doi.org/f3mqtv

DOI: 10.15363/thinklab.d182

168.

PharmacotherapyDB 1.0: the open catalog of drug therapies for disease

Daniel Himmelstein, Pouya Khankhanian, Christine S Hessler, Ari J Green, Sergio Baranzini

Figshare (2016) https://doi.org/f3mqvq

DOI: 10.6084/m9.figshare.3103054

169.

Dhimmel/Indications V1.0. PharmacotherapyDB: The Open Catalog Of Drug Therapies For Disease

Daniel S Himmelstein, Pouya Khankhanian, Christine S Hessler, Ari J Green, Sergio E Baranzini

Zenodo (2016-03-15) https://doi.org/f3mqwb

DOI: 10.5281/zenodo.47664

170.

How should we construct a catalog of drug indications?

Daniel Himmelstein, Benjamin Good, Tudor Oprea, Allison McCoy, Antoine Lizee

ThinkLab (2015-01-13) https://doi.org/f3mqtz

DOI: 10.15363/thinklab.d21

171.

Development and evaluation of an ensemble resource linking medications to their indications

Wei-Qi Wei, Robert M Cronin, Hua Xu, Thomas A Lasko, Lisa Bastarache, Joshua C Denny

Journal of the American Medical Informatics Association (2013-09) https://doi.org/f3mn63

DOI: 10.1136/amiajnl-2012-001431 · PMID: 23576672 · PMCID: PMC3756263

172.

LabeledIn: Cataloging labeled indications for human drugs

Ritu Khare, Jiao Li, Zhiyong Lu

Journal of Biomedical Informatics (2014-12) https://doi.org/f3mn64

DOI: 10.1016/j.jbi.2014.08.004 · PMID: 25220766 · PMCID: PMC4260997

173.

Scaling drug indication curation through crowdsourcing

Ritu Khare, John D Burger, John S Aberdeen, David W Tresner-Kirsch, Theodore J Corrales, Lynette Hirschman, Zhiyong Lu

Database (2015-01-01) https://doi.org/f3mn65

DOI: 10.1093/database/bav016 · PMID: 25797061 · PMCID: PMC4369375

174.

Processing LabeledIn to extract indications

Daniel Himmelstein, Ritu Khare

ThinkLab (2015-04-02) https://doi.org/f3mqww

DOI: 10.15363/thinklab.d46

175.

Development and evaluation of a crowdsourcing methodology for knowledge base construction: identifying relationships between clinical problems and medications

Allison B McCoy, Adam Wright, Archana Laxmisan, Madelene J Ottosen, Jacob A McCoy, David Butten, Dean F Sittig

Journal of the American Medical Informatics Association (2012-09) https://doi.org/f3mn66

DOI: 10.1136/amiajnl-2012-000852 · PMID: 22582202 · PMCID: PMC3422843

176.

Extracting indications from the ehrlink resource

Daniel Himmelstein

ThinkLab (2015-05-01) https://doi.org/f3mqwv

DOI: 10.15363/thinklab.d62

177.

Expert curation of our indication catalog for disease-modifying treatments

Daniel Himmelstein, Pouya Khankhanian, Chrissy Hessler

ThinkLab (2015-07-14) https://doi.org/f3mqwn

DOI: 10.15363/thinklab.d95

178.

Enabling reproducibility and reuse

Jesse Spaulding, Daniel Himmelstein, Casey Greene, Benjamin Good

ThinkLab (2015-01-16) https://doi.org/f3mn67

DOI: 10.15363/thinklab.d23

179.

The need and drive for open data in biomedical publishing

Iain Hrynaszkiewicz

Serials: The Journal for the Serials Community (2011-03-01) https://doi.org/c7zvmd

DOI: 10.1629/2431

180.

The Open Knowledge Foundation: Open Data Means Better Science

Jennifer C Molloy

PLoS Biology (2011-12-06) https://doi.org/g3b

DOI: 10.1371/journal.pbio.1001195 · PMID: 22162946 · PMCID: PMC3232214

181.

How open science helps researchers succeed

Erin C McKiernan, Philip E Bourne, C Titus Brown, Stuart Buck, Amye Kenall, Jennifer Lin, Damon McDougall, Brian A Nosek, Karthik Ram, Courtney K Soderberg, … Tal Yarkoni

eLife (2016-07-07) https://doi.org/gbqsng

DOI: 10.7554/elife.16800 · PMID: 27387362 · PMCID: PMC4973366

182.

Data reuse and the open data citation advantage

Heather A Piwowar, Todd J Vision

PeerJ (2013-10-01) https://doi.org/f3mn68

DOI: 10.7717/peerj.175 · PMID: 24109559 · PMCID: PMC3792178

183.

Enhancing reproducibility for computational methods

V Stodden, M McNutt, DH Bailey, E Deelman, Y Gil, B Hanson, MA Heroux, JPA Ioannidis, M Taufer

Science (2016-12-08) https://doi.org/gbr42b

DOI: 10.1126/science.aah6168 · PMID: 27940837

184.

Best Practices for Computational Science: Software Infrastructure and Environments for Reproducible and Extensible Research

Victoria Stodden, Sheila Miguez

Journal of Open Research Software (2014-07-09) https://doi.org/f3mn69

DOI: 10.5334/jors.ay

185.

Disclose all data in publications

Keith Baggerly

Nature (2010-09) https://doi.org/fhc9z5

DOI: 10.1038/467401b · PMID: 20864982

186.

Open by default: a proposed copyright license and waiver agreement for open access research and data in peer-reviewed journals

Iain Hrynaszkiewicz, Matthew J Cockerill

BMC Research Notes (2012-09-07) https://doi.org/f3mn7c

DOI: 10.1186/1756-0500-5-494 · PMID: 22958225 · PMCID: PMC3465200

187.

Creative Commons licenses and the non-commercial condition: Implications for the re-use of biodiversity information

Gregor Hagedorn, Daniel Mietchen, Robert Morris, Donat Agosti, Lyubomir Penev, Walter Berendsohn, Donald Hobern

ZooKeys (2011-11-28) https://doi.org/g25

DOI: 10.3897/zookeys.150.2189 · PMID: 22207810 · PMCID: PMC3234435

188.

One network to rule them all

Daniel Himmelstein, Lars Juhl Jensen

ThinkLab (2015-08-14) https://doi.org/f3mqt4

DOI: 10.15363/thinklab.d102

189.

Integrating resources with disparate licensing into an open network

Daniel Himmelstein, Lars Juhl Jensen, MacKenzie Smith, Katie Fortney, Caty Chung

ThinkLab (2015-08-28) https://doi.org/bfmk

DOI: 10.15363/thinklab.d107

190.

Legal confusion threatens to slow data science

Simon Oxenham

Nature (2016-08) https://doi.org/bndt

DOI: 10.1038/536016a · PMID: 27488781

191.

LINCS L1000 licensing

Daniel Himmelstein

ThinkLab (2015-09-28) https://doi.org/bfmn

DOI: 10.15363/thinklab.d110

192.

Sounding the alarm on DrugBank’s new license and terms of use

Daniel Himmelstein, Katie Fortney, Craig Knox, Christopher Southan

ThinkLab (2016-05-08) https://doi.org/bgnh

DOI: 10.15363/thinklab.d213

193.

Incomplete Interactome licensing

Daniel Himmelstein

ThinkLab (2015-10-01) https://doi.org/bfmp

DOI: 10.15363/thinklab.d111

194.

Who owns scientific data? The impact of intellectual property rights on the scientific publication chain

Roger Elliott

Learned Publishing (2005-04) https://doi.org/cxfd27

DOI: 10.1087/0953151053584984

195.

MSigDB licensing

Daniel Himmelstein

ThinkLab (2015-09-28) https://doi.org/bfmm

DOI: 10.15363/thinklab.d108

196.

Molecular signatures database (MSigDB) 3.0

A Liberzon, A Subramanian, R Pinchback, H Thorvaldsdottir, P Tamayo, JP Mesirov

Bioinformatics (2011-05-05) https://doi.org/b8mx73

DOI: 10.1093/bioinformatics/btr260 · PMID: 21546393 · PMCID: PMC3106198

197.

Assessing the effectiveness of our hetnet permutations

Daniel Himmelstein

ThinkLab (2016-02-25) https://doi.org/f3mqt5

DOI: 10.15363/thinklab.d178

198.

Randomization Techniques for Graphs

Sami Hanhijärvi, Gemma C Garriga, Kai Puolamäki

Proceedings of the 2009 SIAM International Conference on Data Mining (2009-04-30) https://doi.org/f3mn58

DOI: 10.1137/1.9781611972795.67

199.

Permuting hetnets and implementing randomized edge swaps in cypher

Daniel Himmelstein

ThinkLab (2015-12-21) https://doi.org/f3mqt6

DOI: 10.15363/thinklab.d136

200.

Use of Graph Database for the Integration of Heterogeneous Biological Data

Byoung-Ha Yoon, Seon-Kyu Kim, Seon-Young Kim

Genomics & Informatics (2017) https://doi.org/f93xct

DOI: 10.5808/gi.2017.15.1.19 · PMID: 28416946 · PMCID: PMC5389944

201.

Comparative analysis of Relational and Graph databases

Garima Jaiswal

IOSR Journal of Engineering (2013-08) https://doi.org/gbr42z

DOI: 10.9790/3021-03822527

202.

Are graph databases ready for bioinformatics?

Christian Theil Have, Lars Juhl Jensen

Bioinformatics (2013-10-17) https://doi.org/f3mn4w

DOI: 10.1093/bioinformatics/btt549 · PMID: 24135261 · PMCID: PMC3842757

203.

Representing and querying disease networks using graph databases

Artem Lysenko, Irina A Roznovăţ, Mansoor Saqi, Alexander Mazein, Christopher J Rawlings, Charles Auffray

BioData Mining (2016-07-25) https://doi.org/gbr42v

DOI: 10.1186/s13040-016-0102-8 · PMID: 27462371 · PMCID: PMC4960687

204.

Recon2Neo4j: applying graph database technologies for managing comprehensive genome-scale networks

Irina Balaur, Alexander Mazein, Mansoor Saqi, Artem Lysenko, Christopher J Rawlings, Charles Auffray

Bioinformatics (2016-12-19) https://doi.org/f9kpsz

DOI: 10.1093/bioinformatics/btw731 · PMID: 27993779 · PMCID: PMC5408918

205.

The Network Library: a framework to rapidly integrate network biology resources

Georg Summer, Thomas Kelder, Marijana Radonjic, Marc van Bilsen, Suzan Wopereis, Stephane Heymans

Bioinformatics (2016-09-01) https://doi.org/f86d74

DOI: 10.1093/bioinformatics/btw436 · PMID: 27587664

206.

The Monarch Initiative: an integrative data and analytic platform connecting phenotypes to genotypes across species

Christopher J Mungall, Julie A McMurry, Sebastian Köhler, James P Balhoff, Charles Borromeo, Matthew Brush, Seth Carbon, Tom Conlin, Nathan Dunn, Mark Engelstad, … Melissa A Haendel

Nucleic Acids Research (2016-11-29) https://doi.org/f9v7bz

DOI: 10.1093/nar/gkw1128 · PMID: 27899636 · PMCID: PMC5210586

207.

Using the neo4j graph database for hetnets

Daniel Himmelstein

ThinkLab (2015-10-02) https://doi.org/f3mqvk

DOI: 10.15363/thinklab.d112

208.

Dhimmel/Hetio V0.2.0: Neo4j Export, Cypher Query Creation, Hetnet Stats, And Other Enhancements

Daniel Himmelstein

Zenodo (2016-09-05) https://doi.org/gbr42q

DOI: 10.5281/zenodo.61571

209.

Hosting Hetionet in the cloud: creating a public Neo4j instance

Daniel Himmelstein

ThinkLab (2016-06-23) https://doi.org/brsc

DOI: 10.15363/thinklab.d216

210.

Bioboxes: standardised containers for interchangeable bioinformatics software

Peter Belmann, Johannes Dröge, Andreas Bremges, Alice C McHardy, Alexander Sczyrba, Michael D Barton

GigaScience (2015-10-15) https://doi.org/gbr42d

DOI: 10.1186/s13742-015-0087-0 · PMID: 26473029 · PMCID: PMC4607242

211.

Reproducibility of computational workflows is automated using continuous analysis

Brett K Beaulieu-Jones, Casey S Greene

Nature Biotechnology (2017-03-13) https://doi.org/f9ttx6

DOI: 10.1038/nbt.3780 · PMID: 28288103 · PMCID: PMC6103790

212.

Estimating the complexity of hetnet traversal

Daniel Himmelstein, Antoine Lizee

ThinkLab (2016-03-22) https://doi.org/gbr42x

DOI: 10.15363/thinklab.d187

213.

Alternative Transformations to Handle Extreme Values of the Dependent Variable

John B Burbidge, Lonnie Magee, A Leslie Robb

Journal of the American Statistical Association (1988-03) https://doi.org/bggvmg

DOI: 10.2307/2288929

214.

Transforming DWPCs for hetnet edge prediction

Daniel Himmelstein, Pouya Khankhanian, Antoine Lizee

ThinkLab (2016-04-01) https://doi.org/f3qbmd

DOI: 10.15363/thinklab.d193

215.

Assessing the informativeness of features

Daniel Himmelstein

ThinkLab (2015-10-04) https://doi.org/f3qbmb

DOI: 10.15363/thinklab.d115

216.

Edge dropout contamination in hetnet edge prediction

Daniel Himmelstein

ThinkLab (2016-05-16) https://doi.org/f3qbmm

DOI: 10.15363/thinklab.d215

217.

Decomposing predictions into their network support

Daniel Himmelstein

ThinkLab (2016-12-21) https://doi.org/gbr42j

DOI: 10.15363/thinklab.d229

218.

Decomposing the DWPC to assess intermediate node or edge contributions

Daniel Himmelstein

ThinkLab (2016-12-15) https://doi.org/gbr42h

DOI: 10.15363/thinklab.d228

219.

Network Edge Prediction: Estimating the prior

Antoine Lizee, Daniel Himmelstein

ThinkLab (2016-04-14) https://doi.org/f3qbmg

DOI: 10.15363/thinklab.d201

220.

Network Edge Prediction: how to deal with self-testing

Antoine Lizee, Daniel Himmelstein

ThinkLab (2016-04-05) https://doi.org/f3qbmf

DOI: 10.15363/thinklab.d194

221.

Cataloging drug–disease therapies in the ClinicalTrials.gov database

Daniel Himmelstein

ThinkLab (2016-05-08) https://doi.org/f3qbmk

DOI: 10.15363/thinklab.d212

222.

A standard database for drug repositioning

Adam S Brown, Chirag J Patel

Scientific Data (2017-03-14) https://doi.org/gbr42s

DOI: 10.1038/sdata.2017.29 · PMID: 28291243 · PMCID: PMC5349249

223.

Systematic analyses of drugs and disease indications in RepurposeDB reveal pharmacological, biological and epidemiological factors influencing drug repositioning

Khader Shameer, Benjamin S Glicksberg, Rachel Hodos, Kipp W Johnson, Marcus A Badgeley, Ben Readhead, Max S Tomlinson, Timothy O’Connor, Riccardo Miotto, Brian A Kidd, … Joel T Dudley

Briefings in Bioinformatics (2017-02-15) https://doi.org/gbr42t

DOI: 10.1093/bib/bbw136 · PMID: 28200013 · PMCID: PMC6192146

224.

Toward a comprehensive drug ontology: extraction of drug-indication relations from diverse information sources

Mark E Sharp

Journal of Biomedical Semantics (2017-01-10) https://doi.org/gbr42w

DOI: 10.1186/s13326-016-0110-0 · PMID: 28069052 · PMCID: PMC5223332

225.

Rephetio: Repurposing drugs on a hetnet [proposal]

Daniel Himmelstein, Antoine Lizee, Chrissy Hessler, Leo Brueggeman, Sabrina Chen, Dexter Hadley, Ari Green, Pouya Khankhanian, Sergio Baranzini

ThinkLab (2015-01-12) https://doi.org/bd9g

DOI: 10.15363/thinklab.a5

226.

Measuring user contribution and content creation

Daniel Himmelstein, Antoine Lizee

ThinkLab (2016-04-11) https://doi.org/f3mqvn

DOI: 10.15363/thinklab.d200

227.

This revolution will be digitized: online tools for radical collaboration

C Patil, V Siegel

Disease Models & Mechanisms (2009-04-30) https://doi.org/fvjhcj

DOI: 10.1242/dmm.003285 · PMID: 19407323 · PMCID: PMC2675795

228.

Publishing the research process

Daniel Mietchen, Ross Mounce, Lyubomir Penev

Research Ideas and Outcomes (2015-12-17) https://doi.org/f3mn7d

DOI: 10.3897/rio.1.e7547

229.

Does it take too long to publish research?

Kendall Powell

Nature (2016-02) https://doi.org/f3mn4t

DOI: 10.1038/530148a · PMID: 26863966

230.

Accelerating scientific publication in biology

Ronald D Vale

Proceedings of the National Academy of Sciences (2015-10-27) https://doi.org/f3mn7f

DOI: 10.1073/pnas.1511912112 · PMID: 26508643 · PMCID: PMC4640799

231.

Reproducibility: A tragedy of errors

David B Allison, Andrew W Brown, Brandon J George, Kathryn A Kaiser

Nature (2016-02) https://doi.org/f3mn7g

DOI: 10.1038/530027a · PMID: 26842041 · PMCID: PMC4831566

232.

Workshop to analyze LINCS data for the Systems Pharmacology course at UCSF

Daniel Himmelstein, Kathleen Keough, Misha Vysotskiy, Jeffrey Kim, Beau Norgeot, Julia Cluceru, Marjorie Imperial, Emmalyn Chen, Jasleen Sodhi, Elizabeth Levy

ThinkLab (2016-03-08) https://doi.org/f3mn57

DOI: 10.15363/thinklab.d181

233.

Why we are teaching science wrong, and how to make it right

M Mitchell Waldrop

Nature (2015-07) https://doi.org/f3mn7h

DOI: 10.1038/523272a · PMID: 26178948

234.

Going paperless: The digital lab

Jim Giles

Nature (2012-01) https://doi.org/fznpgr

DOI: 10.1038/481430a · PMID: 22281576

235.

Systematic integration of biomedical knowledge prioritizes drugs for repurposing

Daniel S Himmelstein, Antoine Lizee, Christine Hessler, Leo Brueggeman, Sabrina L Chen, Dexter Hadley, Ari Green, Pouya Khankhanian, Sergio E Baranzini

Cold Spring Harbor Laboratory (2016-11-14) https://doi.org/bs4f

DOI: 10.1101/087619

236.

Rephetio: Repurposing drugs on a hetnet [report]

Daniel Himmelstein, Antoine Lizee, Chrissy Hessler, Leo Brueggeman, Sabrina Chen, Dexter Hadley, Ari Green, Pouya Khankhanian, Sergio Baranzini

ThinkLab (2016-11-13) https://doi.org/bszr

DOI: 10.15363/thinklab.a7

237.

Figshare depositions from Project Rephetio

Daniel Himmelstein, Leo Brueggeman, Sergio Baranzini

Figshare (2017) https://doi.org/ccq3

DOI: 10.6084/m9.figshare.c.2861359.v1

Metaedge	Abbr	Edges	Sources	Targets
Anatomy–downregulates–Gene	AdG	102,240	36	15,097
Anatomy–expresses–Gene	AeG	526,407	241	18,094
Anatomy–upregulates–Gene	AuG	97,848	36	15,929
Compound–binds–Gene	CbG	11,571	1,389	1,689
Compound–causes–Side Effect	CcSE	138,944	1,071	5,701
Compound–downregulates–Gene	CdG	21,102	734	2,880
Compound–palliates–Disease	CpD	390	221	50
Compound–resembles–Compound	CrC	6,486	1,042	1,054
Compound–treats–Disease	CtD	755	387	77
Compound–upregulates–Gene	CuG	18,756	703	3,247
Disease–associates–Gene	DaG	12,623	134	5,392
Disease–downregulates–Gene	DdG	7,623	44	5,745
Disease–localizes–Anatomy	DlA	3,602	133	398
Disease–presents–Symptom	DpS	3,357	133	415
Disease–resembles–Disease	DrD	543	112	106
Disease–upregulates–Gene	DuG	7,731	44	5,630
Gene–covaries–Gene	GcG	61,690	9,043	9,532
Gene–interacts–Gene	GiG	147,164	9,526	14,084
Gene–participates–Biological Process	GpBP	559,504	14,772	11,381
Gene–participates–Cellular Component	GpCC	73,566	10,580	1,391
Gene–participates–Molecular Function	GpMF	97,222	13,063	2,884
Gene–participates–Pathway	GpPW	84,372	8,979	1,822
Gene→regulates→Gene	Gr>G	265,672	4,634	7,048
Pharmacologic Class–includes–Compound	PCiC	1,029	345	724

Abbrev.	Len.	Δ AUROC	−log₁₀(p)	Coef.	Metapath
CbGaD	2	14.5%	6.2	0.20	Compound–binds–Gene–associates–Disease
CdGuD	2	1.7%	4.5		Compound–downregulates–Gene–upregulates–Disease
CrCtD	2	22.8%	6.9	0.15	Compound–resembles–Compound–treats–Disease
CtDrD	2	17.2%	5.8	0.13	Compound–treats–Disease–resembles–Disease
CuGdD	2	1.1%	2.6		Compound–upregulates–Gene–downregulates–Disease
CbGbCtD	3	21.7%	6.5	0.22	Compound–binds–Gene–binds–Compound–treats–Disease
CbGeAlD	3	8.4%	5.2	0.04	Compound–binds–Gene–expresses–Anatomy–localizes–Disease
CbGiGaD	3	9.0%	4.4	0.00	Compound–binds–Gene–interacts–Gene–associates–Disease
CcSEcCtD	3	14.0%	6.8	0.08	Compound–causes–Side Effect–causes–Compound–treats–Disease
CdGdCtD	3	3.8%	4.6	0.00	Compound–downregulates–Gene–downregulates–Compound–treats–Disease
CdGuCtD	3	-2.1%	2.4		Compound–downregulates–Gene–upregulates–Compound–treats–Disease
CiPCiCtD	3	23.3%	7.5	0.16	Compound–includes–Pharmacologic Class–includes–Compound–treats–Disease
CpDpCtD	3	4.3%	3.9	0.06	Compound–palliates–Disease–palliates–Compound–treats–Disease
CrCrCtD	3	17.0%	5.0	0.12	Compound–resembles–Compound–resembles–Compound–treats–Disease
CrCbGaD	3	8.2%	6.1	0.002	Compound–resembles–Compound–binds–Gene–associates–Disease
CtDdGdD	3	4.2%	3.9		Compound–treats–Disease–downregulates–Gene–downregulates–Disease
CtDdGuD	3	0.5%	1.0		Compound–treats–Disease–downregulates–Gene–upregulates–Disease
CtDlAlD	3	12.4%	6.0		Compound–treats–Disease–localizes–Anatomy–localizes–Disease
CtDpSpD	3	13.9%	6.1		Compound–treats–Disease–presents–Symptom–presents–Disease
CtDuGdD	3	0.7%	1.3		Compound–treats–Disease–upregulates–Gene–downregulates–Disease
CtDuGuD	3	1.1%	1.4		Compound–treats–Disease–upregulates–Gene–upregulates–Disease
CuGdCtD	3	-1.6%	2.9		Compound–upregulates–Gene–downregulates–Compound–treats–Disease
CuGuCtD	3	4.4%	3.5	0.00	Compound–upregulates–Gene–upregulates–Compound–treats–Disease
CbGiGiGaD	4	7.0%	5.1	0.00	Compound–binds–Gene–interacts–Gene–interacts–Gene–associates–Disease
CbGpBPpGaD	4	4.9%	3.8	0.00	Compound–binds–Gene–participates–Biological Process–participates–Gene–associates–Disease
CbGpPWpGaD	4	7.6%	7.9	0.05	Compound–binds–Gene–participates–Pathway–participates–Gene–associates–Disease

Resource	Components	License	Cat.	References
Entrez Gene	G	custom	1	RRID:SCR_002473 [89,90,91]
LabeledIn	CtD, CpD	custom	1	RRID:SCR_015667 [172,173,174]
MEDLINE	DlA, DpS, DrD	custom	1	RRID:SCR_002185 [78,142]
MeSH	S	custom	1	RRID:SCR_004750 [78,79]
Pathway Interaction Database	PW, GpPW		1	RRID:SCR_006866 [98,100,101]
Disease Ontology	D	CC BY 3.0	2ᴼᴰ	RRID:SCR_000476 [73,74,75,76]
DISEASES	DaG	CC BY 4.0	2ᴼᴰ	RRID:SCR_015664 [127,128,136]
DrugCentral	PC, CbG, PCiC	CC BY 4.0	2ᴼᴰ	RRID:SCR_015663 [87,88]
Gene Ontology	BP, CC, MF, GpBP, GpCC, GpMF	CC BY 4.0	2ᴼᴰ	RRID:SCR_002811 [102,157,158,159]
GWAS Catalog	DaG	custom	2ᴼᴰ	RRID:SCR_012745 [126,133,134,135]
Reactome	PW, GpPW	custom	2ᴼᴰ	RRID:SCR_003485 [97,99,100,101]
LINCS L1000	CdG, CuG, Gr>G	custom	2ᴼᴰ	[145,146,191]
TISSUES	AeG	CC BY 4.0	2ᴼᴰ	RRID:SCR_015665 [112,113,114]
Uberon	A	CC BY 3.0	2ᴼᴰ	RRID:SCR_010668 [92,93,94]
WikiPathways	PW, GpPW	CC BY 3.0 / custom	2ᴼᴰ	RRID:SCR_002134 [95,96,100,101]
BindingDB	CbG	mixed CC BY 3.0 & CC BY-SA 3.0	2ᴼᴰ	RRID:SCR_000390 [115,116,118,119]
DisGeNET	DaG	ODbL	2ᴼᴰ	RRID:SCR_006178 [129,130,137,138]
DrugBank	C, CbG, CrC	custom	2	RRID:SCR_002700 [80,81,82,192]
MEDI	CtD, CpD	CC BY-NC-SA 3.0	2	RRID:SCR_015668 [170,171]
PREDICT	CtD, CpD	CC BY-NC-SA 3.0	2	[24,170]
SIDER	SE, CcSE	CC BY-NC-SA 4.0	2	RRID:SCR_004321 [83,84,85]
Bgee	AeG, AdG, AuG		4	RRID:SCR_002028 [108,109,110,111]
DOAF	DaG		4	RRID:SCR_015666 [131,132,139]
ehrlink	CtD, CpD		4	[175,176]
Evolutionary Rate Covariation	GcG		4	RRID:SCR_015669 [147,148,149]
hetio-dag	GiG		4	[22,150,151]
Incomplete Interactome	GiG		4	[150,151,156,193]
Human Interactome Database	GiG		4	RRID:SCR_015670 [150,151,152,153,154,155]
STARGEO	DdG, DuG		4	[140,141,160]

Metanode	Abbr	Nodes	Disconnected	Metaedges
Anatomy	A	402	2	4
Biological Process	BP	11,381	0	1
Cellular Component	CC	1,391	0	1
Compound	C	1,552	14	8
Disease	D	137	1	8
Gene	G	20,945	1,800	16
Molecular Function	MF	2,884	0	1
Pathway	PW	1,822	0	1
Pharmacologic Class	PC	345	0	1
Side Effect	SE	5,734	33	1
Symptom	S	438	23	1
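
The per-type counts in the metanode table above and in the metaedge table earlier in this section can be cross-checked against the totals stated in the abstract (47,031 nodes of 11 types and 2,250,197 relationships of 24 types). The following is a minimal sketch in Python, with the counts transcribed by hand from the two tables. The variable names are illustrative only and are not part of the Hetionet codebase.

```python
# Node counts per metanode, transcribed from the metanode table.
metanode_counts = {
    "A": 402, "BP": 11_381, "CC": 1_391, "C": 1_552, "D": 137, "G": 20_945,
    "MF": 2_884, "PW": 1_822, "PC": 345, "SE": 5_734, "S": 438,
}

# Edge counts per metaedge, transcribed from the metaedge table.
metaedge_counts = {
    "AdG": 102_240, "AeG": 526_407, "AuG": 97_848, "CbG": 11_571,
    "CcSE": 138_944, "CdG": 21_102, "CpD": 390, "CrC": 6_486,
    "CtD": 755, "CuG": 18_756, "DaG": 12_623, "DdG": 7_623,
    "DlA": 3_602, "DpS": 3_357, "DrD": 543, "DuG": 7_731,
    "GcG": 61_690, "GiG": 147_164, "GpBP": 559_504, "GpCC": 73_566,
    "GpMF": 97_222, "GpPW": 84_372, "Gr>G": 265_672, "PCiC": 1_029,
}

# Sum each table and count its rows.
total_nodes = sum(metanode_counts.values())
total_edges = sum(metaedge_counts.values())

print(f"{len(metanode_counts)} metanodes, {total_nodes:,} nodes")
print(f"{len(metaedge_counts)} metaedges, {total_edges:,} edges")
```

Both sums agree with the figures quoted in the abstract, which suggests the tables were transcribed without loss.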