OPTAR: a computational tool for target discovery based on disease correlation inference from literature of interacting proteins

Xiao Yuan; Siyu Zhou; Jiayi Yu; Mengyuan Wang; Cheng Luo; Hao Zhang

doi:10.48130/targetome-0026-0017

The identification of novel therapeutic targets remains a major bottleneck in target-based drug discovery, particularly when mining large-scale omics data. Although transcriptomic and proteomic profiling generate extensive lists of disease-associated candidates, prioritizing truly novel and druggable targets, especially those lacking active compounds, requires advanced computational strategies. Here, we report the development of a new computational tool named OPTAR (Omics and Pocket Analysis-based Target Assessment and Ranking) for identifying promising new target proteins from omics data. These new target proteins are expected to have no active compounds and to have not previously been reported as correlated with the disease of interest. OPTAR applies a multi-layer filtering and ranking workflow, including automated literature-based exclusion of known disease-associated proteins, drug availability screening, and algorithm-driven disease correlation inference, enabling systematic de novo target discovery. From the hepatocellular carcinoma (HCC) omics data used by the previously reported tool OTTM (Omics and Text-driven Translational Medicine), OPTAR identified high-ranking candidate proteins at the intersection of the 'hepatocellular carcinoma' and 'cell cycle' lists. Functional verification indicated that silencing of UBE2J1, KDELR3, and VTI1A in HCC cells inhibited cell viability and reduced the cells' migration and invasion abilities. Furthermore, UBE2J1 was upregulated in HCC tissues, and its knockdown induced apoptosis-related changes and cell cycle disruption. Together, these findings establish OPTAR as a reliable and efficient computational tool for discovering promising, highly original therapeutic targets from omics data.


Introduction

Drug development commonly follows one of two paradigms: phenotype-based drug discovery (PDD) and target-based drug discovery (TDD). In the long history of human drug development, most drugs have been discovered through PDD, especially before 1980^[1,2]. PDD typically evaluates whether compounds exhibit satisfactory efficacy in specific pathological models. However, with advances in molecular biology and the success of targeted therapies, TDD has become a dominant paradigm in modern drug development^[3,4]. Selecting molecular targets based on disease understanding is a critical step in TDD. The drug development pipeline typically spans 12 to 15 years, and fewer than 10% of drug candidates entering Phase I trials ultimately reach approval^[5,6]. A major contributing factor is the insufficient identification and validation of effective therapeutic targets^[7,8]. Therefore, improving strategies for target discovery is essential for both understanding disease biology and enhancing drug development success rates. There is an ancient Chinese saying, 'A journey of a thousand miles begins with a single step'. For target-based drug development, the discovery and confirmation of new targets is the first and most crucial step in this long journey.

In the contemporary drug discovery process, large-scale omics data from clinical samples have become a major source for identifying potential therapeutic targets. High-throughput technologies, particularly transcriptomics and proteomics, can generate extensive lists of differentially expressed genes or proteins associated with disease states. However, translating these large candidate lists into a small number of truly promising targets remains a significant challenge^[9]. Typically, omics data contain hundreds to thousands of differentially expressed genes or proteins. Without effective prioritization strategies, it is difficult to make the optimal choice of potential new drug targets.

Computational tools based on sophisticated algorithms can integrate rich biological information, greatly enhancing the process of target and drug discovery^[10]. As an ancient Chinese saying goes, 'A workman must first sharpen his tools if he is to do his work well'. For target discovery based on omics data, efficient and reliable computational methods are like such sharp tools. However, existing computational approaches for target prioritization often rely on statistical associations, functional enrichment, or network-based proximity to known disease-related genes. While these methods are useful, they tend to prioritize well-studied proteins with existing literature evidence, potentially overlooking novel targets that lack prior disease annotations. This limitation is particularly important in the context of first-in-class drug discovery, where originality and unexplored targets are essential.

Previously, we developed an automated classification tool named OTTM (Omics and Text-driven Translational Medicine), mainly used for drug repositioning. OTTM can efficiently screen candidate proteins from omics data and prioritize those that have not been reported to be associated with the disease of interest but have corresponding approved or clinical drugs^[11]. Testing about 20–30 drugs selected by OTTM from omics data can efficiently reveal approved or clinical drugs that have never been reported to be effective for the disease of interest. This strategy enables rapid experimental validation through compound testing. However, by design, OTTM focuses on targets with known drugs and does not address the large proportion of candidate proteins that lack any active compounds.

To address this limitation, it is necessary to develop a complementary computational strategy that focuses on identifying original therapeutic targets without known drugs. Thus, here we report the development of a new computational tool named OPTAR (Omics and Pocket Analysis-based Target Assessment and Ranking), which is specifically designed for discovering novel target proteins derived from omics data. Compared with OTTM, OPTAR specifically focuses on proteins that lack approved or clinical drugs. In addition, both OTTM and OPTAR exclude candidate proteins that have already been reported to be associated with the disease of interest through systematic literature mining. Importantly, OPTAR further introduces a disease correlation inference strategy based on protein-protein interaction (PPI) networks, in which candidate proteins are ranked according to the extent to which their interacting partners are associated with the disease. Moreover, OPTAR incorporates a structure-based binding pocket assessment module using AlphaFold-predicted protein structures to evaluate the potential druggability of candidate targets.

Taken together, OTTM and OPTAR form a complementary framework for target discovery from omics data, addressing both drug repurposing and de novo target identification. In this study, starting from differentially expressed genes derived from hepatocellular carcinoma (HCC) omics data, we applied OPTAR using the keywords 'hepatocellular carcinoma' and 'cell cycle', where the latter was selected due to its central role in cancer progression. Candidate genes were prioritized based on their intersection and subsequent network-based evaluation. From the top-ranking candidates, we identified several proteins for experimental validation, demonstrating the feasibility of OPTAR as a computational framework for discovering novel and potentially druggable therapeutic targets.

Materials and methods

Development of literature mining, disease correlation inference, and pocket assessment modules of OPTAR

For literature assessment in OPTAR, all PubMed abstract information was extracted from 1,219 XML files (from pubmed24n0001 to pubmed24n1551). For drug availability assessment in OPTAR, the mapping table of proteins and corresponding drugs was extracted from the downloaded data of the TTD database^[12]. OPTAR takes as input a list of differentially expressed genes or proteins uploaded by the user, together with a keyword specifying the disease type of interest. For interacting protein (IP) assessment in OPTAR, human PPI data were extracted from the STRING database. On this basis, OPTAR uses the following formulas to calculate the disease correlation score of each candidate protein:

$ \mathrm{Score}_{IP}=\dfrac{\mathrm{Hit\_Num}}{\mathrm{PubMed\_Num}} $

$ \text{Total Score}=\sum\limits_{i=1}^{n}\mathrm{Score}_{IP_i} $

For each IP, the score is the number of abstracts containing the disease-related keywords (Hit_Num) divided by the total number of PubMed abstracts associated with that IP (PubMed_Num). The total score of a candidate protein is then obtained by summing the scores of all n of its IPs.
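The scoring rule above can be illustrated with a minimal Python sketch. It is not the OPTAR implementation (the first version is distributed as C++ executables); the function names, the dictionary layout, and the example symbols and counts are hypothetical, and returning 0 for an IP with no associated abstracts is an assumption made here to avoid division by zero.

```python
from typing import Dict, List, Tuple


def ip_score(hit_num: int, pubmed_num: int) -> float:
    """Score_IP = Hit_Num / PubMed_Num for one interacting protein (IP).

    Assumption: an IP with no PubMed abstracts contributes a score of 0.
    """
    if pubmed_num <= 0:
        return 0.0
    return hit_num / pubmed_num


def total_score(
    candidate: str,
    ppi: Dict[str, List[str]],
    counts: Dict[str, Tuple[int, int]],
) -> float:
    """Sum Score_IP over all IPs of a candidate protein.

    ppi maps a candidate to its interacting proteins; counts maps each IP
    to (Hit_Num, PubMed_Num). Both structures are hypothetical.
    """
    return sum(ip_score(*counts.get(ip, (0, 0))) for ip in ppi.get(candidate, []))


def rank_candidates(
    ppi: Dict[str, List[str]],
    counts: Dict[str, Tuple[int, int]],
) -> List[Tuple[str, float]]:
    """Rank candidates by total score, highest first."""
    scored = [(name, total_score(name, ppi, counts)) for name in ppi]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Made-up example data, for illustration only.
example_ppi = {"CAND_A": ["IP1", "IP2", "IP3"], "CAND_B": ["IP3"]}
example_counts = {"IP1": (5, 100), "IP2": (1, 4), "IP3": (0, 50)}
ranking = rank_candidates(example_ppi, example_counts)
```

Because each IP score is a fraction of that IP's own literature, a heavily studied IP does not dominate the sum simply by having more abstracts; a candidate with many IPs can still accumulate a higher total, since the scores are summed rather than averaged.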

On a typical personal computer, an OPTAR run on a single CPU core takes about 1 h, depending on the frequency of the keywords in the PubMed database and the performance of the CPU. The first version of OPTAR is written in C++ and is provided to users as executable programs. Subsequent versions will be written in Python and released as open-source code. In addition to the literature mining and IP evaluation modules, OPTAR also has an independent drug binding pocket assessment module that uses the AlphaFold structures of human proteins.

Web server of OPTAR

The executable programs of OPTAR are available from its web server (http://otter-simm.com/optar.html). In the input list of OPTAR, each line of symbols represents a candidate protein or gene. Examples of input protein lists are provided in the downloaded compressed files and on the Help page of the OPTAR web server.
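As a rough illustration of the one-symbol-per-line input format, the following Python sketch reads such a list. It is not part of OPTAR; the function name is hypothetical, and skipping blank lines and dropping repeated symbols are assumptions rather than documented behavior of the tool.

```python
from typing import List


def parse_symbol_list(text: str) -> List[str]:
    """Return the gene or protein symbols in a one-symbol-per-line list.

    Assumptions: surrounding whitespace is ignored, blank lines are skipped,
    and repeated symbols are kept only once, in order of first appearance.
    """
    symbols: List[str] = []
    seen = set()
    for line in text.splitlines():
        symbol = line.strip()
        if not symbol or symbol in seen:
            continue
        seen.add(symbol)
        symbols.append(symbol)
    return symbols


# Example input using the three symbols validated in this study.
example_text = "UBE2J1\nKDELR3\n\nVTI1A\nUBE2J1\n"
example_symbols = parse_symbol_list(example_text)
```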

Cell culture
Human HCC cell lines HepG2 and Huh7 were purchased from Shanghai Zhong Qiao Xin Zhou Biotechnology Co., Ltd. Cells were authenticated by STR profiling. Cells were maintained at 37 °C in a humidified incubator with 95% air and 5% CO₂. HepG2 cells were cultured in 87% MEM (with NEAA) (ZQ-300) supplemented with 10% fetal bovine serum (ZQ500-S), 1% L-alanyl-L-glutamine (CSP004), 1% sodium pyruvate (CSP003), and 1% penicillin-streptomycin (CSP006). Huh7 cells were cultured in DMEM high glucose (ZQ-100) supplemented with 10% fetal bovine serum (ZQ500-A), 1% penicillin-streptomycin (CSP006), 1% L-alanyl-L-glutamine (CSP004), and 1% sodium pyruvate (CSP003).

siRNA design and transfection
siRNAs targeting UBE2J1, KDELR3, and VTI1A were custom designed and synthesized by MCE (Supplementary Table S1). Transfection was performed using CALNP™ RNAi in vitro (D-Nano Therapeutic). Cells were seeded simultaneously into 12-well plates (for knockdown validation by RT-qPCR) and 96-well plates (for CCK-8 assay). After cell attachment, transfection complexes were prepared according to the CALNP™ protocol to achieve a final siRNA concentration of 300 nmol·L⁻¹, using the following mixture per transfection unit: 20 μmol·L⁻¹ siRNA (3 μL), Reagent A (21 μL), Reagent B (6 μL), combined with 170 μL serum-free medium. For 96-well plates, 10 μL transfection complex was added per well together with 90 μL serum-free medium. For 12-well plates, 100 μL transfection complex was added per well, together with 900 μL serum-free medium. A non-targeting siRNA (si-NC) served as the negative control.
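The volumes listed above can be checked with a short calculation. The sketch below assumes that the stated 300 nmol·L⁻¹ refers to the siRNA concentration in the 200 μL transfection complex (3 + 21 + 6 + 170 μL); the function names are hypothetical. It also shows that both plate formats dilute the complex by the same factor when it is added to the wells.

```python
def diluted_concentration_nmol_per_l(
    stock_umol_per_l: float, stock_volume_ul: float, total_volume_ul: float
) -> float:
    """Concentration (nmol/L) after diluting a stock into a total volume."""
    if total_volume_ul <= 0:
        raise ValueError("total volume must be positive")
    return stock_umol_per_l * 1000.0 * stock_volume_ul / total_volume_ul


def complex_fraction(complex_volume_ul: float, medium_volume_ul: float) -> float:
    """Fraction of the added liquid in a well that is transfection complex."""
    return complex_volume_ul / (complex_volume_ul + medium_volume_ul)


# Per transfection unit: siRNA + Reagent A + Reagent B + serum-free medium.
complex_total_ul = 3 + 21 + 6 + 170
complex_conc_nmol_per_l = diluted_concentration_nmol_per_l(20, 3, complex_total_ul)

# Complex-to-medium ratio in each plate format.
fraction_96_well = complex_fraction(10, 90)
fraction_12_well = complex_fraction(100, 900)
```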

RNA extraction and RT-qPCR
At 24 h post-transfection, total RNA was extracted using the FastPure® Cell/Tissue Total RNA Isolation Kit V2 (Vazyme, RC112). cDNA was synthesized using HiScript IV All-in-One Ultra RT SuperMix for qPCR (Vazyme, R433). qPCR was performed using ChamQ Universal SYBR qPCR Master Mix (Vazyme, Q711). GAPDH was used as the internal reference gene, and relative mRNA levels were calculated using the 2^−ΔΔCt method. Primer sequences are provided in Supplementary Table S2.
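For readers unfamiliar with the 2^−ΔΔCt calculation, a minimal Python sketch follows. The function and argument names are hypothetical and the Ct values are made up; the sketch assumes the standard form of the method, in which the target gene is first normalized to the reference gene (here GAPDH) and then to the control sample (here si-NC).

```python
def relative_expression(
    ct_target_sample: float,
    ct_ref_sample: float,
    ct_target_control: float,
    ct_ref_control: float,
) -> float:
    """Relative mRNA level by the 2^-ddCt method.

    dCt  = Ct(target) - Ct(reference gene), computed per sample
    ddCt = dCt(treated sample) - dCt(control sample)
    """
    delta_ct_sample = ct_target_sample - ct_ref_sample
    delta_ct_control = ct_target_control - ct_ref_control
    delta_delta_ct = delta_ct_sample - delta_ct_control
    return 2.0 ** (-delta_delta_ct)


# Made-up Ct values: the target amplifies two cycles later after knockdown.
knockdown_level = relative_expression(26.0, 18.0, 24.0, 18.0)
control_level = relative_expression(24.0, 18.0, 24.0, 18.0)
```

A ΔΔCt of 2 cycles corresponds to a four-fold reduction relative to the control, which is why the control sample itself always evaluates to 1.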

Western blotting
Cells were harvested for SDS-PAGE and Western blot analysis on NC membranes. After blocking with 5% nonfat milk in TBST for 1 h, membranes were incubated with the primary antibodies overnight at 4 °C, followed by secondary antibodies for 1 h at room temperature. ECL detection systems were applied to develop signals. The following antibodies were used: Caspase 3/P17/P19 polyclonal antibody (Proteintech, 19677-1-AP), Cleaved PARP (Asp214) antibody (Cell Signaling Technology, 9541), and α-tubulin polyclonal antibody (Proteintech, 11224-1-AP).

Cell viability assay (CCK-8)
HepG2 cells were seeded into 96-well plates at a density of 5 × 10³ cells per well. After 24 h of culture, the cells were treated with compounds at 25 or 50 μmol·L⁻¹. At 48 h after transfection or compound treatment, cell viability was assessed using Cell Counting Kit-8 (Meilunbio, MA0218): 10 μL of CCK-8 reagent together with 90 μL of serum-free medium (100 μL of medium containing 10% CCK-8 reagent) was added to each well, followed by incubation for 1 h. Absorbance was measured at 450 nm using a microplate reader.

Wound-healing assay
HepG2 and Huh7 cells were seeded in six-well plates and transfected with the indicated siRNAs. After the cells reached near-confluence, a linear wound was generated using a sterile pipette tip. Detached cells were removed by washing with PBS, and serum-free medium was added. Images were captured at 0 and 48 h under a microscope. Cell migration ability was quantified and statistically analyzed using the cell migration rate, defined as: cell migration rate = (initial wound width − wound width after 48 h)/initial wound width × 100%.
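The migration rate formula can be written as a small Python function. This is only a sketch of the stated formula; the function name is hypothetical, the widths in the example are made up, and any consistent length unit can be used because the units cancel.

```python
def migration_rate_percent(initial_width: float, width_after_48h: float) -> float:
    """Cell migration rate (%) = (W0 - W48) / W0 * 100."""
    if initial_width <= 0:
        raise ValueError("initial wound width must be positive")
    return (initial_width - width_after_48h) / initial_width * 100.0


# Made-up widths in micrometres: the wound narrows from 800 to 500.
example_rate = migration_rate_percent(800.0, 500.0)
```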

Transwell invasion assay
The Transwell invasion assay was conducted using cell culture inserts (Falcon, 353097) with an 8 μm pore-size membrane, coated with Matrigel (Corning) diluted 1:8. At 48 h after siRNA transfection, 5,000 cells were added to the upper chamber of each insert, while the lower chamber was filled with medium containing 10% FBS as a chemoattractant. After 24 h, the cells remaining on the upper surface of the membrane were removed, and the cells that had invaded to the lower surface were fixed. After staining with crystal violet, the cells were photographed under a microscope and counted.

Cell cycle analysis by flow cytometry
After transfection with si-NC or si-UBE2J1, HepG2 and Huh7 cells were harvested, fixed in 70% ethanol, and stained using a DNA content staining solution according to the manufacturer's instructions (Vazyme, AC101). Cell-cycle distribution was analyzed by flow cytometry, and the percentages of cells in G0/G1, S, and G2/M phases were quantified.

Public database analysis of UBE2J1 expression in HCC
UBE2J1 expression in HCC and normal liver tissues was analyzed using GEPIA2 (https://gepia2.cancer-pku.cn), an online platform based on transcriptomic data from The Cancer Genome Atlas (TCGA), and the Genotype-Tissue Expression (GTEx) project. The 'LIHC' dataset was used to compare UBE2J1 expression between tumor and normal liver tissues. Expression values were presented as log₂(TPM + 1).
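The log₂(TPM + 1) transform used for the expression values can be sketched in Python as follows. The function name is hypothetical; the pseudocount of 1 keeps genes with zero TPM at a value of 0 instead of an undefined logarithm.

```python
import math


def log2_tpm_plus_one(tpm: float) -> float:
    """Return log2(TPM + 1) for a non-negative TPM value."""
    if tpm < 0:
        raise ValueError("TPM must be non-negative")
    return math.log2(tpm + 1.0)


# Made-up TPM values, for illustration only.
transformed = [log2_tpm_plus_one(value) for value in (0.0, 1.0, 7.0)]
```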

Virtual screening
The Natural Product Library for HTS purchased from TargetMol (Cat. No. L6000) was used for virtual screening. Three-dimensional conformations were generated for 4,654 natural compounds from this library using the LigPrep module of the Schrödinger software package. The AlphaFold structure of UBE2J1 was downloaded from UniProt and prepared using the Protein Preparation Wizard module of Schrödinger. After all missing hydrogen atoms were added, binding sites were detected using the SiteMap module of Schrödinger. Using the binding site consistent with the position identified by OPTAR, grids for subsequent molecular docking were generated using the Receptor Grid Generation module of Schrödinger. Molecular docking-based virtual screening was then carried out using the Glide module of Schrödinger. Finally, 30 compounds with top-ranked docking scores and binding modes matching the pocket were selected for subsequent SPR-based binding detection.

Surface plasmon resonance (SPR) assays
SPR binding assays were performed on a Biacore T200 instrument (GE Healthcare) at 25 °C. Recombinant human UBE2J1 was purchased from MedChemExpress (HY-P703625). The protein was immobilized on a CM5 chip at 100 μg·mL⁻¹ in 10 mmol·L⁻¹ sodium acetate (pH 4.0). After immobilization, the system was equilibrated for 1 h. Each compound was serially diluted in PBS buffer with 0.05% Tween-20 and injected over the chip at a flow rate of 30 μL·min⁻¹. Each injection comprised a 120 s association phase followed by a 180 s dissociation phase. All data were processed using the Biacore T200 evaluation software (v1.0).

Statistical analysis
Data are presented as mean ± SD from three biological replicates per group. Statistical analysis and plotting were performed using GraphPad Prism 10.1.2; the data were analyzed by one-way ANOVA or an unpaired t-test, as appropriate. P < 0.05 was considered significant.
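As an illustration of the summary statistics, the sketch below computes mean ± SD and an unpaired t statistic for two groups of three replicates. It does not reproduce the GraphPad Prism analysis; the function names and data are hypothetical, the SD is assumed to be the sample SD (n − 1 denominator), and the t statistic assumes the equal-variance (pooled) form of the unpaired test, which the text does not specify. Converting t to a P value requires the t distribution and is left to a statistics package.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_sd(values: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of a group."""
    return statistics.mean(values), statistics.stdev(values)


def unpaired_t_statistic(group_a: Sequence[float], group_b: Sequence[float]) -> float:
    """Student's t statistic for two independent groups (pooled variance).

    Assumption: equal variances in both groups; degrees of freedom are
    len(group_a) + len(group_b) - 2.
    """
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)
    var_b = statistics.variance(group_b)
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    standard_error = math.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / standard_error


# Made-up triplicate measurements for two groups.
group_control = [4.0, 5.0, 6.0]
group_knockdown = [1.0, 2.0, 3.0]
control_mean, control_sd = mean_sd(group_control)
t_value = unpaired_t_statistic(group_control, group_knockdown)
```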

Supplementary Table S1 List of siRNA used for transfection.
Supplementary Table S2 Primers for RT-qPCR.
Supplementary Table S3 OPTAR output protein list.
Supplementary Table S4 List of 30 compounds for UBE2J1 selected for binding detection via virtual screening.
Supplementary Fig. S1 SPR binding curves and fitted KD values for 20 UBE2J1-binding compounds identified by virtual screening.

[1] Moffat JG, Vincent F, Lee JA, Eder J, Prunotto M. 2017. Opportunities and challenges in phenotypic drug discovery: an industry perspective. Nature Reviews Drug Discovery 16:531−543 doi: 10.1038/nrd.2017.111
[2] Tong X, Qu N, Kong X, Ni S, Zhou J, et al. 2024. Deep representation learning of chemical-induced transcriptional profile for phenotype-based drug discovery. Nature Communications 15:5378 doi: 10.1038/s41467-024-49620-3
[3] Jia ZC, Yang X, Wu YK, Li M, Das D, et al. 2024. The art of finding the right drug target: emerging methods and strategies. Pharmacological Reviews 76:896−914 doi: 10.1124/pharmrev.123.001028
[4] Du S, Hu X, Menéndez-Arias L, Zhan P, Liu X. 2024. Target-based drug design strategies to overcome resistance to antiviral agents: opportunities and challenges. Drug Resistance Updates 73:101053 doi: 10.1016/j.drup.2024.101053
[5] Southey MWY, Brunavs M. 2023. Introduction to small molecule drug discovery and preclinical development. Frontiers in Drug Discovery 3:1314077 doi: 10.3389/fddsv.2023.1314077
[6] Schuhmacher A, Hinder M, Brief E, Gassmann O, Hartl D. 2025. Benchmarking R&D success rates of leading pharmaceutical companies: an empirical analysis of FDA approvals (2006–2022). Drug Discovery Today 30:104291 doi: 10.1016/j.drudis.2025.104291
[7] Minikel EV, Painter JL, Dong CC, Nelson MR. 2024. Refining the impact of genetic evidence on clinical success. Nature 629:624−629 doi: 10.1038/s41586-024-07316-0
[8] Razuvayevskaya O, Lopez I, Dunham I, Ochoa D. 2024. Genetic factors associated with reasons for clinical trial stoppage. Nature Genetics 56:1862−1867 doi: 10.1038/s41588-024-01854-z
[9] Deng YT, You J, He Y, Zhang Y, Li HY, et al. 2025. Atlas of the plasma proteome in health and disease in 53,026 adults. Cell 188:253−271.e7 doi: 10.1016/j.cell.2024.10.045
[10] Xu H, Zhao H, Ding C, Jiang D, Zhao Z, et al. 2023. Celastrol suppresses colorectal cancer via covalent targeting peroxiredoxin 1. Signal Transduction and Targeted Therapy 8:51 doi: 10.1038/s41392-022-01231-4
[11] Yang X, Zhang B, Wang S, Lu Y, Chen K, et al. 2023. OTTM: an automated classification tool for translational drug discovery from omics data. Briefings in Bioinformatics 24:bbad301 doi: 10.1093/bib/bbad301
[12] Zhou Y, Zhang Y, Zhao D, Yu X, Shen X, et al. 2024. TTD: therapeutic target database describing target druggability information. Nucleic Acids Research 52:D1465−D1477 doi: 10.1093/nar/gkad751
[13] Offensperger F, Tin G, Duran-Frigola M, Hahn E, Dobner S, et al. 2024. Large-scale chemoproteomics expedites ligand discovery and predicts ligand behavior in cells. Science 384:eadk5864 doi: 10.1126/science.adk5864
[14] Sun Q, Wang H, Xie J, Wang L, Mu J, et al. 2025. Computer-aided drug discovery for undruggable targets. Chemical Reviews 125:6309−6365 doi: 10.1021/acs.chemrev.4c00969
[15] Sun D, Gao W, Hu H, Zhou S. 2022. Why 90% of clinical drug development fails and how to improve it? Acta Pharmaceutica Sinica B 12:3049−3062 doi: 10.1016/j.apsb.2022.02.002
[16] Su W, Hou X. 2024. Targeting active RAS with molecular glue. Pharmaceutical Science Advances 2:100047 doi: 10.1016/j.pscia.2024.100047
[17] Lessard S, Chao M, Reis K, Beauvais M, Rajpal DK, et al. 2024. Leveraging large-scale multi-omics evidences to identify therapeutic targets from genome-wide association studies. BMC Genomics 25:1111 doi: 10.1186/s12864-024-10971-2
[18] Jia P, Zhao Z. 2014. Network-assisted analysis to prioritize GWAS results: principles, methods and perspectives. Human Genetics 133:125−138 doi: 10.1007/s00439-013-1377-1
[19] Wang X, Gulbahce N, Yu H. 2011. Network-based methods for human disease gene prediction. Briefings in Functional Genomics 10:280−293 doi: 10.1093/bfgp/elr024
[20] Guney E, Oliva B. 2012. Exploiting protein-protein interaction networks for genome-wide disease-gene prioritization. PLoS One 7:e43557 doi: 10.1371/journal.pone.0043557
[21] Rossin EJ, Lage K, Raychaudhuri S, Xavier RJ, Tatar D, et al. 2011. Proteins encoded in genomic regions associated with immune-mediated disease physically interact and suggest underlying biology. PLoS Genetics 7:e1001273 doi: 10.1371/journal.pgen.1001273
