<?xml version="1.0" encoding="UTF-8" ?>
<rdf:RDF xmlns:admin="http://webns.net/mvcb/" xmlns="http://purl.org/rss/1.0/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:prism="http://purl.org/rss/1.0/modules/prism/" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:syn="http://purl.org/rss/1.0/modules/syndication/">
<channel rdf:about="https://biorxiv.org">
<admin:errorReportsTo rdf:resource="mailto:biorxiv@cshlpress.edu"/>
<title>bioRxiv Subject Collection: Bioinformatics</title>
<link>https://biorxiv.org</link>
<description>
This feed contains articles for bioRxiv Subject Collection "Bioinformatics"
</description>

<items>
<rdf:Seq>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.15.718599v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.15.718654v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.15.718648v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.15.718634v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.15.718133v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.14.718336v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.15.718756v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.15.718796v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.14.718564v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.14.718546v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.14.718559v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.14.718256v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.14.718510v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.14.718511v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.14.718168v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.13.717816v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.16.718906v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.13.715198v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.14.718485v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.14.718479v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.14.718488v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.15.718708v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.14.718434v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.14.718370v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.14.718378v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.14.718492v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.12.717909v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.14.718375v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.14.718403v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.14.718363v1?rss=1"/>
</rdf:Seq>
</items>
<prism:eIssn/>
<prism:publicationName>bioRxiv</prism:publicationName>
<prism:issn/>

<image rdf:resource=""/>
</channel>
<image rdf:about="">
<title>bioRxiv</title>
<url>https://www.biorxiv.org/sites/default/files/bioRxiv_article.jpg</url>
<link>https://www.biorxiv.org</link>
</image>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.15.718599v1?rss=1">
<title>
<![CDATA[
Calibration of in-frame indel variant effect predictors for clinical variant classification 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.15.718599v1?rss=1
</link>
<description><![CDATA[
Insertions and deletions (indels) represent a substantial source of genetic variation in humans and are associated with a diverse array of functional consequences. Despite their prevalence and clinical importance, indels, particularly short in-frame indels, remain critically understudied compared to single nucleotide variants and are challenging to interpret clinically. While many computational predictors for missense variants have been rigorously evaluated and calibrated for clinical use, the clinical utility of tools for in-frame indels remains uncertain. To address this gap, we have calibrated in-frame indel prediction tools for clinical variant classification. We constructed a high-confidence dataset of in-frame indel variants (≤ 50 bp) from clinical and population databases and estimated the prior probability of pathogenicity of a rare in-frame indel observed in a disease-associated gene, and of an insertion and deletion separately. Using a previously developed statistical framework based on local posterior probabilities, we then established score thresholds for eight computational tools, corresponding to distinct evidence levels for pathogenic and benign classification according to ACMG/AMP guidelines. All in-frame indel predictors evaluated here reached multiple evidence levels of pathogenicity and/or benignity, demonstrating measurable clinical value. However, these models consistently exhibited lower performance levels compared to missense predictors, highlighting the need for improved computational approaches for indel classification.
]]></description>
<dc:creator><![CDATA[ Abderrazzaq, H., Singh, M., Babb, L., Bergquist, T., Brenner, S. E., Pejaver, V., O'Donnell-Luria, A., Radivojac, P., ClinGen Computational Working Group, ClinGen Variant Classification Working Group ]]></dc:creator>
<dc:date>2026-04-18</dc:date>
<dc:identifier>doi:10.64898/2026.04.15.718599</dc:identifier>
<dc:title><![CDATA[Calibration of in-frame indel variant effect predictors for clinical variant classification]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-18</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.15.718654v1?rss=1">
<title>
<![CDATA[
LagCI Enables Inference of Temporal Causal Relationships from Dense Multi-Omic Time Series 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.15.718654v1?rss=1
</link>
<description><![CDATA[
Inferring causal relationships from time-series data is critical for uncovering the dynamics of biological regulation. However, in multi-omics studies, this task is often hampered by sparse temporal sampling and the limitations of existing methods. To address this, we developed Lagged-Correlation Based Causal Inference (lagCI), a computational framework designed to identify time-lagged associations by combining comprehensive lag-correlation profiling with a robust statistical filtering scheme. Rather than relying on simple cross-correlation, lagCI analyzes the entire correlation profile and applies a quality-scoring system to filter out spurious associations that often plague high-dimensional datasets. We first tested lagCI on wearable physiological data, where it successfully captured the well-known causal link between physical activity and heart rate, even accounting for variations in lag times between individuals. Moving to high-frequency human multi-omics, we used lagCI to build a directed network of 1,624 molecules connected by over 157,000 predicted interactions. This network didn't just mirror established biology (such as cytokine-hormone crosstalk); it also pointed to specific molecular hubs that seem to orchestrate the timing of metabolic and immune responses. Overall, lagCI provides a data-driven way to extract temporal insights from dense longitudinal omics. We've made the tool available as an R package with multiple interfaces to ensure it's accessible for both bioinformaticians and clinicians.
]]></description>
<dc:creator><![CDATA[ Ge, Y., Bai, S., Qiang, Z., Liu, Y., Wu, Y., Shen, X. ]]></dc:creator>
<dc:date>2026-04-18</dc:date>
<dc:identifier>doi:10.64898/2026.04.15.718654</dc:identifier>
<dc:title><![CDATA[LagCI Enables Inference of Temporal Causal Relationships from Dense Multi-Omic Time Series]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-18</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.15.718648v1?rss=1">
<title>
<![CDATA[
Unsupervised Machine Learning for Adaptive Immune Receptors with immuneML 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.15.718648v1?rss=1
</link>
<description><![CDATA[
Machine learning (ML) enables adaptive immune receptor repertoires (AIRRs) analyses for biomarker identification and therapeutic development. With the majority of AIRR data partially or imperfectly labeled, unsupervised ML is essential for motif discovery, biologically meaningful clustering, and generation of novel receptor sequences. However, no unified framework for unsupervised ML exists in the AIRR field, hindering the assessment of model robustness and generalizability. Here, we present an immuneML release advancing unsupervised ML in the AIRR field through unified clustering workflows, interpretable generative modeling, integration with protein language model embeddings, dimensionality reduction, and visualization. We demonstrate immuneML's utility in three use cases: (i) benchmarking generative models for epitope-specific sequence generation, assessing specificity and novelty, (ii) systematic evaluation of clustering approaches on experimental receptor sequences against biological properties, such as epitope specificity and MHC, and (iii) unsupervised analysis of an experimental AIRR dataset to examine potential confounding, a practice widespread in related fields but unexplored in AIRR analyses.
]]></description>
<dc:creator><![CDATA[ Pavlovic, M., Wurtzen, C., Kanduri, C., Mamica, M., Scheffer, L., Lund-Andersen, C., Gubatan, J. M., Ullmann, T., Greiff, V., Sandve, G. K. ]]></dc:creator>
<dc:date>2026-04-18</dc:date>
<dc:identifier>doi:10.64898/2026.04.15.718648</dc:identifier>
<dc:title><![CDATA[Unsupervised Machine Learning for Adaptive Immune Receptors with immuneML]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-18</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.15.718634v1?rss=1">
<title>
<![CDATA[
Pan-cancer survival modeling reveals structural limits of genomic feature integration in immunotherapy outcomes 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.15.718634v1?rss=1
</link>
<description><![CDATA[
Background: Immune checkpoint inhibitors (ICIs) have improved outcomes across multiple cancer types, yet reliable predictors of survival remain limited. While genomic features such as tumor mutational burden (TMB) are widely used, their contribution to predictive modeling in heterogeneous real-world cohorts remains unclear. We evaluated the relative contributions of clinical and whole-genome sequencing (WGS) features in pan-cancer survival modeling. Methods: We analyzed 658 patients treated with ICIs with matched WGS data from Genomics England. Using a leakage-controlled machine learning framework with strict train-test separation, we compared four models: TMB-only, clinical-only, clinical+TMB, and an integrated 11-feature clinico-genomic XGBoost survival model. Model performance was assessed using Harrell's concordance index (C-index) with bootstrap confidence intervals. Results: TMB alone demonstrated near-random discrimination (C-index 0.50; 95% CI 0.44-0.56). Clinical variables substantially improved predictive performance (0.59; 95% CI 0.53-0.64), with marginal gain from adding TMB (0.59). The integrated model achieved a C-index of 0.60 (95% CI 0.55-0.65). While improvement over TMB alone was significant, incremental gain beyond optimized clinical models was modest. Feature attribution analysis showed that model performance was dominated by clinical variables, with genomic features contributing limited additional signal. Conclusions: These findings suggest that, in heterogeneous pan-cancer cohorts, predictive performance is constrained by the underlying data structure, in which dominant clinical signals overshadow genome-scale features. This study highlights fundamental limitations in integrating genomic data into survival models across diverse cancer types and provides a benchmark for future computational approaches.
]]></description>
<dc:creator><![CDATA[ Hassan, W., Adeleke, S. ]]></dc:creator>
<dc:date>2026-04-18</dc:date>
<dc:identifier>doi:10.64898/2026.04.15.718634</dc:identifier>
<dc:title><![CDATA[Pan-cancer survival modeling reveals structural limits of genomic feature integration in immunotherapy outcomes]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-18</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.15.718133v1?rss=1">
<title>
<![CDATA[
GANGE: Achieving Sequencing Without Sequencing With Diffusion Guided Generative Genomic Transformer 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.15.718133v1?rss=1
</link>
<description><![CDATA[
The genome of a species is its book of life, but opening that book remains costly because of the limitations of existing sequencing technologies. Short-read sequencers struggle to capture long and complex genomes, although they have high fidelity. To counter this, long reads from third-generation sequencers are used, but these are full of indel errors. Reads from both approaches are therefore combined at very high coverage, making sequencing projects unreasonably expensive and out of reach for most. Here we present a first-of-its-kind generative deep-learning system, GANGE, which not only recovers the correct sequence with high accuracy from indel-prone ONT reads at many-fold lower coverage but also extends it by 4 kb, achieving sequencing without sequencing, horizontally as well as vertically, while consistently maintaining >92% accuracy. This makes it possible to drastically reduce sequencing project costs. GANGE was tested on the A. thaliana and O. sativa genomes and human chromosome 1, where it delivered outstanding assembly performance. It was also used to accurately generate the 2 kb upstream promoters of all genes from 12 different species, demonstrating that regulomics research can now be undertaken using RNA data alone when a genome sequence is not available. GANGE thus marks a democratizing turning point in genomics and sequencing research.
]]></description>
<dc:creator><![CDATA[ Gupta, S., Kumar, A., Bhati, U., Shankar, R. ]]></dc:creator>
<dc:date>2026-04-17</dc:date>
<dc:identifier>doi:10.64898/2026.04.15.718133</dc:identifier>
<dc:title><![CDATA[GANGE: Achieving Sequencing Without Sequencing With Diffusion Guided Generative Genomic Transformer]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-17</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.14.718336v1?rss=1">
<title>
<![CDATA[
cellNexus: Quality control, annotation, aggregation and analytical layers for the Human Cell Atlas data 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.14.718336v1?rss=1
</link>
<description><![CDATA[
Large-scale single-cell atlases such as the Human Cell Atlas have transformed our understanding of human biology. Yet, the lack of a robust framework that standardises quality control, expands cellular annotation, and adds normalisation and analytical layers, limits multi-study analyses and the usefulness of this resource. Here we present cellNexus, a comprehensive tool and resource that converts the Human Cell Atlas collection into analysis-ready data by linking quality control layers, metadata enrichment, expression normalisation, analysis and data aggregation. These enhancements enable robust statistical modelling across studies, exemplified by a multi-tissue map of immune cell communication during ageing, which reveals macrophage-muscle axes as among the most depleted regenerative interactions with age. All harmonised layers, including pseudobulk and cell-cell communication summaries, are accessible via a public web interface and with R and Python APIs. By providing continuous integration with CELLxGENE releases, cellNexus transforms large cell atlas corpora into an accessible, reproducible, interoperable foundation for large-scale biological discovery and the next generation of single-cell foundation models.
]]></description>
<dc:creator><![CDATA[ Shen, M., Gao, Y., Liu, N., Bhuva, D., Milton, M., Henao, J., Andrews, J., Yang, E., Zhan, C., Liu, N., Si, S., Hutchison, W. J., Shakeel, M. H., Morgan, M., Papenfuss, A. T., Iskander, J., Polo, J. M., Mangiola, S. ]]></dc:creator>
<dc:date>2026-04-17</dc:date>
<dc:identifier>doi:10.64898/2026.04.14.718336</dc:identifier>
<dc:title><![CDATA[cellNexus: Quality control, annotation, aggregation and analytical layers for the Human Cell Atlas data]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-17</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.15.718756v1?rss=1">
<title>
<![CDATA[
Benchmarking Tools for Identification of rRNA Modifications in Escherichia coli using Oxford Nanopore Direct RNA Sequencing 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.15.718756v1?rss=1
</link>
<description><![CDATA[
RNA modifications are important for RNA structure, stability, and ribosome function, but their identification and localisation remain challenging. Oxford Nanopore direct RNA sequencing (DRS) enables modification-agnostic detection in native RNA, but existing tool benchmarks have focused almost exclusively on m6A in eukaryotic mRNA, leaving multi-modification tool performance in bacterial systems largely untested. Here, we benchmark ten RNA modification detection tools spanning signal-comparison, error-rate, and hybrid approaches on Escherichia coli K-12 MG1655 16S and 23S rRNA, which harbour 11 and 25 known modified sites, respectively, across 17 modification types. Using native RNA and in vitro transcribed (IVT) unmodified RNA, we evaluate performance across 25 coverage levels (5x to 1000x). DiffErr and JACUSA2 showed the strongest discrimination performance (AUROC >0.9 on both 16S and 23S rRNA), with DiffErr achieving the highest F1 score on 16S and JACUSA2 showing the most consistent precision-recall balance across both rRNAs. Both tools achieved full transcript-wide scoring and, along with DRUMMER, exact positional localisation. Several other tools produced no output at many rRNA positions, and restricting evaluation to reported positions inflated apparent performance. Signal-based tools showed a systematic 1-4 nucleotide 5' offset from known modified positions, consistent with the ~5-mer nucleotide stretch present in the read head of the nanopore; applying tool-specific offset corrections substantially improved per-site recovery and reduced false positives, notably for tools such as EpiNano and nanoDoc. At single-site resolution, no known modified site was recovered by all tools, and several m5C, m5U, and m6A sites were missed by the majority of tools.
Tool combination analysis showed that pairing error-rate-based tools with offset-corrected signal-based tools improved site recovery beyond any individual tool, with the best three-tool combination recovering 30 of the 36 known sites while maintaining low false positive rates. These results establish that discrimination metrics (e.g. AUROC) alone are insufficient to evaluate modification detection tools: output completeness, positional precision, and per-modification-type sensitivity should be reported alongside standard benchmarking metrics.
]]></description>
<dc:creator><![CDATA[ Morampalli, B. R., Silander, O. K. ]]></dc:creator>
<dc:date>2026-04-17</dc:date>
<dc:identifier>doi:10.64898/2026.04.15.718756</dc:identifier>
<dc:title><![CDATA[Benchmarking Tools for Identification of rRNA Modifications in Escherichia coli using Oxford Nanopore Direct RNA Sequencing]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-17</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.15.718796v1?rss=1">
<title>
<![CDATA[
Using machine learning to overcome mosquito collections missing data for malaria modeling 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.15.718796v1?rss=1
</link>
<description><![CDATA[
Entomological surveillance plays a crucial role in areas where malaria remains endemic, yet gathering data on mosquito populations is often expensive and complicated, particularly in remote locations with challenging logistics and inconsistent sampling schedules. Access to extensive time series data on mosquito species at specific sites would greatly enhance insights into seasonal trends and the biting habits of vectors of malaria parasites. Gaps in mosquito count records pose a significant challenge for researchers and public health officials seeking to establish early warning systems and effective vector control programs. In this study, we apply quantitative machine learning techniques to address missing data in estimates of mosquito abundance collected from 2009 to 2016 in Bolivar State, Venezuela. We evaluated Linear Regression, Stochastic Linear Regression, K Nearest-Neighbor, and Gradient Boosting methods for imputing missing counts of Anopheles mosquitoes, employing a leave-one-out cross-validation strategy. Additionally, we developed a predictive malaria transmission model incorporating mosquito abundance and climate variables (El Nino 3.4 Index, rainfall, and mean air temperature) as covariates. Our generalized time series model forecasts malaria incidence of Plasmodium vivax and Plasmodium falciparum based on climate dynamics and imputed mosquito data. Model performance was assessed using root mean square error, mean absolute error, and mean absolute percentage error. The final results demonstrated that machine learning imputation significantly improved the accuracy and reliability of P. vivax malaria incidence predictions but failed to predict P. falciparum incidence. The study demonstrates that method choice significantly influences the reconstruction of seasonal abundance patterns and the performance of malaria incidence models. Nevertheless, the proposed models strengthen the foundation for targeted interventions and surveillance in endemic regions. 
Despite limitations in data continuity and coverage, the findings highlight the value of combining multiyear entomological data sets with robust imputation and sensitivity analyses to improve predictive modeling in resource-constrained, malaria-endemic settings.
]]></description>
<dc:creator><![CDATA[ Rubio-Palis, Y., Feng, L., Liang, K. S., Song, C., Wang, S., Duchnicki, T., Zhang, X., Bravo de Guenni, L. ]]></dc:creator>
<dc:date>2026-04-17</dc:date>
<dc:identifier>doi:10.64898/2026.04.15.718796</dc:identifier>
<dc:title><![CDATA[Using machine learning to overcome mosquito collections missing data for malaria modeling]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-17</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.14.718564v1?rss=1">
<title>
<![CDATA[
Hybrid Gated Fusion: A Multimodal Deep Learning Framework for Protein Function Annotation 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.14.718564v1?rss=1
</link>
<description><![CDATA[
Protein function annotation requires integrating diverse biological signals, yet existing multimodal methods often struggle with missing inputs and redundant information. We present Hybrid Gated Fusion, a multimodal architecture that combines intrinsic protein features, including sequence and structure, with extrinsic functional context from text and interaction networks. Rather than weighting all modalities equally, the model uses bilinear gating to assess both the informativeness of each modality and its agreement with the others, while auxiliary supervision reduces modality dominance and preserves useful signal in weaker modalities. On the CAFA3 benchmark, a single Hybrid Gated Fusion model achieves state-of-the-art performance in Biological Process (F_max = 0.601) and Cellular Component (F_max = 0.706), while remaining competitive in Molecular Function (F_max = 0.702). Analysis of the learned gates shows that interaction networks and text often provide complementary functional signals, whereas structural features are down-weighted when redundant but remain valuable under sparse-input settings. These results establish Hybrid Gated Fusion as a robust and scalable framework for genome-scale protein function annotation.
]]></description>
<dc:creator><![CDATA[ Zhou, Z., Buchan, D. W. ]]></dc:creator>
<dc:date>2026-04-17</dc:date>
<dc:identifier>doi:10.64898/2026.04.14.718564</dc:identifier>
<dc:title><![CDATA[Hybrid Gated Fusion: A Multimodal Deep Learning Framework for Protein Function Annotation]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-17</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.14.718546v1?rss=1">
<title>
<![CDATA[
Recursive Repeat Extender (RRE): A recursive approach to automatically extend repeat element models 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.14.718546v1?rss=1
</link>
<description><![CDATA[
Repetitive elements, including transposable elements (TEs), are integral structural components of eukaryotic genomes; consequently, their identification and classification are crucial to their study. Several approaches have been developed to perform de novo genome-wide repeat identification through pairwise sequence comparisons; however, they often generate truncated repeat models due to their sampling strategies and the substantial fragmentation of many of the older repeat copies in the genome. To improve repeat models generated de novo, several algorithms have been developed that increase model length via the BEEA (BLAST-Extend-Extract-Align) approach, in which genomic instances of each repeat are identified with BLAST, their coordinates are extended, and a refined model is generated by aligning the extended sequences. Nevertheless, these extension algorithms exhibit two key limitations that hinder the reconstruction of highly degenerate and fragmented repeats: the use of BLAST as a search algorithm - which limits their sensitivity in detecting highly diverged sequences - and the use of a single search step, which precludes the reconstruction of extensively fragmented repeat models. In this work, we present a novel approach to extend repeat models, called RRE (Recursive Repeat Extender), which uses profile hidden Markov models (HMMs) to search for repeat elements with high sensitivity and employs a recursive extension strategy that iteratively searches and extends the repeat model, using the extended model from each round as input for the next and continuing until no additional sequence can be incorporated. We apply RRE to repeat libraries generated de novo from five model organisms, and our results show that RRE-generated repeat libraries contain fewer but longer repeat models and can identify a larger proportion of the genomes as repetitive than RepeatModeler2-generated repeat libraries. 
Notably, RRE can reconstruct highly degenerate repeats such as CR1_Mam, producing a model that achieves similar coverage to the reference Dfam model while extending it by an additional 131 bp that were not captured in the reference model. Overall, RRE enables the automatic improvement of de novo repeat libraries and the reconstruction of highly degenerate and fragmented repeats.
]]></description>
<dc:creator><![CDATA[ Falcon, F., Tanaka, E. M., Rodriguez-Terrones, D. ]]></dc:creator>
<dc:date>2026-04-17</dc:date>
<dc:identifier>doi:10.64898/2026.04.14.718546</dc:identifier>
<dc:title><![CDATA[Recursive Repeat Extender (RRE): A recursive approach to automatically extend repeat element models]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-17</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.14.718559v1?rss=1">
<title>
<![CDATA[
Virtual multiplex staining of the pancreatic islets across type 1 diabetes progression using a Schroedinger bridge 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.14.718559v1?rss=1
</link>
<description><![CDATA[
Classical hematoxylin and eosin (H&E) staining enables review of tissue morphology but lacks information regarding the molecular state of cells. Immunohistochemical (IHC) techniques label specific proteins in tissue, allowing differentiation of relevant structures that may go undetected in H&E. However, the IHC process is complex, expensive, and time-consuming, especially for multiplex IHC (mIHC), limiting its use in large cohorts. Stain conversion of H&E to IHC using generative artificial intelligence models such as generative adversarial networks (GANs) represents one solution to this problem. However, GANs are unstable during out-of-distribution sampling and are prone to hallucinations or mode collapse, limiting their accuracy in challenging image conversion tasks. To address this, the field has recently turned to diffusion models. Here, we introduce Schroedinger-bridge for Multiplex ImmunoLabel Estimation (SMILE). Unlike conventional diffusion models that map from source to target through an intermediate Gaussian noise, Schroedinger-bridge diffusion models skip this step and have been shown to better preserve structures during image translation. To test the performance of SMILE, we generated a large cohort of high-fidelity H&E-mIHC image pairs from pancreatic organ donors, targeting insulin, glucagon, and CD3. Our dataset is well sampled across type-1 diabetes status, pancreas anatomical location, age, and sex. Using this cohort, we demonstrate the superiority of SMILE compared to GANs via a comprehensive evaluation framework incorporating texture, distribution, and antibody-specific metrics, as well as blinded pathologist reviews. We further confirmed the ability of SMILE to generate accurate mIHC images from H&Es generated at an external site, to perform whole slide image conversion, and to generate realistic three-dimensional maps of the pancreatic islets in non-diabetic, auto-antibody positive, and type-1 diabetic donor tissue.
Finally, we performed stain conversion of paired H&E to HER2 and Ki67 images in breast cancer, confirming the superiority of SMILE in diverse stain conversion applications. Collectively, this framework provides a scalable pipeline for high-throughput proteomic inference from archival H&Es, providing transformative potential for pancreatic research and digital pathology.
]]></description>
<dc:creator><![CDATA[ Shen, Y., Cho, W. J., Joshi, S., Wen, B., Naganathanhalli, S., Beery, M., Grubel, C. R., Sivasubramanian, A., Forjaz, A., Grahn, M. P., Dequiedt, L., Huang, Y., Han, K. S., Wu, F., Pedro, B. A., Wood, L. D., Chen, T., Hruban, R. H., Kusmartseva, I., Atkinson, M. A., Wirtz, D., Kiemen, A. L. ]]></dc:creator>
<dc:date>2026-04-17</dc:date>
<dc:identifier>doi:10.64898/2026.04.14.718559</dc:identifier>
<dc:title><![CDATA[Virtual multiplex staining of the pancreatic islets across type 1 diabetes progression using a Schroedinger bridge]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-17</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.14.718256v1?rss=1">
<title>
<![CDATA[
PathwaySeeker: Evidence-Grounded AI Reasoning over Organism-Specific Metabolic Networks 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.14.718256v1?rss=1
</link>
<description><![CDATA[
Metabolic activity is not an intrinsic property of an organism, but an emergent state shaped by environmental and experimental context. Despite recent advances in large language models (LLMs) and multi-omics profiling, current computational frameworks struggle to represent and reason over metabolism in a condition-specific manner. General-purpose AI systems operate on static, public biochemical knowledge, while multi-omics datasets capture dynamic measurements without a structured framework for mechanistic interpretation. As a result, metabolic network analysis remains disconnected from the experimental states that define biological function. Here, we introduce PathwaySeeker, an evidence-grounded AI system for organism-specific metabolic network reasoning. PathwaySeeker reconstructs sample-specific metabolic graphs from integrated proteomic and metabolomic data, fine-tunes an LLM on the resulting graph structure, and verifies each reasoning step against the experimental graph through iterative hypothesis search, an approach we term Oracle-in-the-Loop inference. Every output claim carries explicit evidence provenance, distinguishing experimentally confirmed relationships from biochemically plausible hypotheses requiring validation. We demonstrate the system using multi-omics data from the non-model white-rot fungus Trametes versicolor, where PathwaySeeker recovers branched phenylpropanoid pathways and transparently stratifies confirmed reactions from testable extensions. Post-hoc thermodynamic analysis and condition-specific metabolite dynamics support the biological feasibility of the reconstructed routes. By embedding experimental evidence provenance directly into language model-guided metabolic network reasoning, PathwaySeeker enables systematic differentiation between experimentally grounded knowledge and structured hypothesis, bridging frontier AI capabilities with organism-specific experimental evidence.
]]></description>
<dc:creator><![CDATA[ Oliveira Monteiro, L. M., Chowdhury, N. B., Oostrom, M., McDermott, J. E., Stratton, K. G., Choudhury, S., Bardhan, J. P. ]]></dc:creator>
<dc:date>2026-04-17</dc:date>
<dc:identifier>doi:10.64898/2026.04.14.718256</dc:identifier>
<dc:title><![CDATA[PathwaySeeker: Evidence-Grounded AI Reasoning over Organism-Specific Metabolic Networks]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-17</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.14.718510v1?rss=1">
<title>
<![CDATA[
Active Learning for Budget-Constrained TCR-pMHC Wet-Lab Validation
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.14.718510v1?rss=1
</link>
<description><![CDATA[
Wet-lab validation of TCR-pMHC binding hypotheses is the rate-limiting step in T-cell therapy discovery: a single binding assay round can cost thousands of dollars and weeks of turnaround time, yet computational models generate thousands of candidate pairs per run. We frame this as a pool-based active learning problem: given a fixed annotation budget B, which unlabeled pairs should be sent to the assay to maximally improve a predictive model that will guide the next screening round? We introduce UDAL (Uncertainty-Diversity Active Learning), a batch acquisition strategy that combines BALD-based uncertainty estimation via MC Dropout with greedy core-set diversity selection in the encoder feature space. Evaluated on a curated VDJdb-IEDB benchmark under epitope-held-out and distance-aware protocols, UDAL achieves AUPRC 0.487 with only 5,000 queried labels, matching the performance of a model trained on 3× more randomly sampled labels. At a budget of 2,000 labels, UDAL improves AUPRC by 16.7% over random acquisition, translating directly to fewer wasted assay slots. These results demonstrate that principled active query strategies can substantially reduce the wet-lab cost of building reliable TCR specificity models.
]]></description>
<dc:creator><![CDATA[ Mazur, K., Piotrowska, M., Kowalski, J. ]]></dc:creator>
<dc:date>2026-04-17</dc:date>
<dc:identifier>doi:10.64898/2026.04.14.718510</dc:identifier>
<dc:title><![CDATA[Active Learning for Budget-Constrained TCR-pMHC Wet-Lab Validation]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-17</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.14.718511v1?rss=1">
<title>
<![CDATA[
FairTCR: Equity-Aware TCR-pMHC Binding Prediction Across HLA Alleles and Cohort Strata
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.14.718511v1?rss=1
</link>
<description><![CDATA[
Public TCR-pMHC binding databases are heavily skewed toward a handful of well-studied HLA alleles (most prominently HLA-A*02:01, which covers ~45% of curated records) and toward patients from European-ancestry cohorts. Standard empirical risk minimization (ERM) trained on such data achieves strong pooled accuracy but routinely underperforms on rare alleles and underrepresented cohorts, creating systematic disparities that are invisible in single-metric benchmarks. We introduce FairTCR, a group distributionally robust optimization (GDRO) framework that minimizes worst-group loss across HLA supertypes and cohort strata via online exponentiated gradient updates. FairTCR reduces the average-to-worst-group AUPRC disparity from 0.190 (ERM) to 0.098 on a curated VDJdb-IEDB benchmark, achieving a 48.4% disparity reduction while maintaining competitive average AUPRC (0.432 vs. 0.431 for ERM). Per-HLA analysis shows that rare allele groups (B*08:01, B*44:02) gain up to 0.062 AUPRC points, directly improving the equity of computational pre-screening for underrepresented patient populations.
]]></description>
<dc:creator><![CDATA[ Nowak, P., Kowalski, J., Lewandowski, T. ]]></dc:creator>
<dc:date>2026-04-17</dc:date>
<dc:identifier>doi:10.64898/2026.04.14.718511</dc:identifier>
<dc:title><![CDATA[FairTCR: Equity-Aware TCR-pMHC Binding Prediction Across HLA Alleles and Cohort Strata]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-17</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.14.718168v1?rss=1">
<title>
<![CDATA[
Uncertainty-aware benchmarking reveals ambiguous transcripts in mRNA-lncRNA classification 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.14.718168v1?rss=1
</link>
<description><![CDATA[
Background. Long non-coding RNAs (lncRNAs) have gained significant attention in recent years, yet distinguishing them from protein-coding transcripts remains challenging. Indeed, many lncRNAs share mRNA-like processing and existing sequence-derived signals do not fully capture the coding/non-coding boundary. Recent GENCODE annotation efforts revealed tens of thousands of novel lncRNA sequences as well as the reclassification of some lncRNAs into the protein-coding class, highlighting the need to better characterize transcript features associated with classification uncertainty and errors. Results. We performed uncertainty-aware benchmarking by retraining and evaluating eight transcript classifiers under a controlled protocol on a label-stable GENCODE v46-v47 subset. Beyond conventional model evaluation metrics, we quantified inter-tool agreement and entropy-based uncertainty to stratify transcripts into consensus, discordant, and consensus-error groups. To expand standard sequence and ORF-derived signals, we incorporated repeat-derived features from mature transcripts and non-B DNA motif features across gene bodies. Although aggregate performance was high, ~45% of transcripts showed inter-tool discordance, particularly among lncRNAs. Feature analyses linked low-uncertainty predictions to strong coding-like signals, whereas high-uncertainty profiles exhibited mixed signatures. Alongside classical predictors in global importance analyses, repeat-derived features appear as main contributors. Conclusions. By combining controlled benchmarking with transcript-level agreement and uncertainty stratification, together with extended feature profiling, we identified patterns associated with classifier disagreement and misclassification. This novel framework provides practical guidance for interpreting predictions, motivating the development of more robust coding/non-coding classifiers, while also shedding light on the sequence properties that distinguish lncRNA sequences.
]]></description>
<dc:creator><![CDATA[ Garcia-Ruano, D., Georges, M., Mohanty, S. K., Baaziz, R., Makova, K. D., Nikolski, M., Chalopin, D. ]]></dc:creator>
<dc:date>2026-04-17</dc:date>
<dc:identifier>doi:10.64898/2026.04.14.718168</dc:identifier>
<dc:title><![CDATA[Uncertainty-aware benchmarking reveals ambiguous transcripts in mRNA-lncRNA classification]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-17</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.13.717816v1?rss=1">
<title>
<![CDATA[
Agent-Guided De Novo Design of Nanobody Binders Against a Novel Cancer Target 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.13.717816v1?rss=1
</link>
<description><![CDATA[
Therapeutic antibody discovery remains slow and resource-intensive, with traditional methods providing limited control over epitope selection. We present a workflow for de novo nanobody design applied to a novel Desmoplastic Small Round Cell Tumor target encompassing four stages: (1) epitope identification guided by our hotspot recommendation agent using physical chemistry-based structure and sequence analysis tools with two curated databases (IEDB, PFAM), (2) de novo nanobody generation using three independent methods (RFantibody, IgGM, mBER) across multiple predicted antigen structures and nanobody frameworks, (3) multi-metric scoring including structural metrics from folding models and in silico binding affinity from our sequence-based predictor, and (4) high-throughput yeast surface display (YSD) screening followed by surface plasmon resonance (SPR) characterization of the specific binders. We generated 288,000 nanobody designs spanning eight target epitope regions and three variable domains of heavy chain-only antibody (VHH) frameworks. Multi-objective Pareto filtering with our candidate selection agent yielded 100,000 candidates for YSD screening with fluorescence-activated cell sorting (FACS). Of 116 enriched candidates advanced to SPR characterization, 46/116 (39.7%) produced reliable kinetic fits with Rmax ≥ 30 RU, yielding KD values from 0.66 nM to 305 nM (median 31.7 nM). These results show that an agent-guided computational workflow can design nanomolar to sub-nanomolar nanobody binders against a novel target without experimental structure or prior antibody information.
]]></description>
<dc:creator><![CDATA[ Zhao, Y., Yilmaz, M., Lee, E., Teh, C., Guo, L., Sonmez, K., Giancardo, L., Trang, G., Xu, F., Espinosa-Cotton, M., Cheung, N.-K., Kim, J., Cheng, X. ]]></dc:creator>
<dc:date>2026-04-17</dc:date>
<dc:identifier>doi:10.64898/2026.04.13.717816</dc:identifier>
<dc:title><![CDATA[Agent-Guided De Novo Design of Nanobody Binders Against a Novel Cancer Target]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-17</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.16.718906v1?rss=1">
<title>
<![CDATA[
Integrating glycosylation in de novo protein design with ReGlyco Binder Design Filter 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.16.718906v1?rss=1
</link>
<description><![CDATA[
Artificial Intelligence (AI)-based methods for 3D protein structure prediction are revolutionising structural biology, providing novel templates for experimental data refinement and an on-demand 3D perspective on any molecular architecture and protein-protein interaction (PPI). Regardless of the inherent limitations of the various approaches available to date, the continuous improvement of the algorithms and the broad availability of open access (OA) web servers, software packages and databases are bound to accelerate the discovery and optimisation of novel biopharmaceuticals. Within this context, the development of computational pipelines for the de novo design of target-specific protein binders is especially exciting. As it stands, these processes are still rather inefficient and expensive, rapidly outputting thousands of designs that translate into meagre yields. Here we show how the explicit integration of glycosylation as a filter in the 3D de novo design pipeline can significantly improve efficiency and reduce laboratory costs with minimal additional computational resources. As a proof-of-concept, we used the GlycoShape database and ReGlyco tools to filter the results of a recent open competition launched by Adaptyv Bio for the design of binders as inhibitors against the heavily glycosylated Nipah virus glycoprotein (NiV-G). Screening of the 1,201 selected designs in block with ReGlyco, refined with the new ReGlyco Rotamer tool, flagged 11% of non-binders prior to experiment in approximately 3 hours on a dual-core CPU. We complement this analysis with a demo Colab notebook to illustrate our workflow. In this demo users can design mini-binders against human erythropoietin (hEPO) by integrating GlycoShape resources with the RFdiffusion3 (RFD3) pipeline from the Institute for Protein Design (IPD).
]]></description>
<dc:creator><![CDATA[ Singh, O., Fadda, E. ]]></dc:creator>
<dc:date>2026-04-17</dc:date>
<dc:identifier>doi:10.64898/2026.04.16.718906</dc:identifier>
<dc:title><![CDATA[Integrating glycosylation in de novo protein design with ReGlyco Binder Design Filter]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-17</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.13.715198v1?rss=1">
<title>
<![CDATA[
Canonical self-supervised pretraining paradigm constrains the capacity of genomic language models on regulatory decoding 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.13.715198v1?rss=1
</link>
<description><![CDATA[
Recent studies suggest that genomic language models (gLMs) could help decode the genomic regulatory code. Here, we systematically evaluated 11 representative gLMs across multiple regulatory genomics applications and found that current gLMs offer limited advantages over the random baseline. Further analysis revealed a systematic misalignment between the canonical sequence-only self-supervised pretraining paradigm and the context-specific dynamic nature of gene regulation, highlighting the need for function-oriented pretraining strategies that explicitly incorporate biochemical and regulatory priors.
]]></description>
<dc:creator><![CDATA[ Liang, Y.-X., Wang, Y., Pan, W.-Y., Chen, Z.-Y., Wei, J.-C., Gao, G. ]]></dc:creator>
<dc:date>2026-04-16</dc:date>
<dc:identifier>doi:10.64898/2026.04.13.715198</dc:identifier>
<dc:title><![CDATA[Canonical self-supervised pretraining paradigm constrains the capacity of genomic language models on regulatory decoding]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-16</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.14.718485v1?rss=1">
<title>
<![CDATA[
Inferring division-associated stochasticity from time-series single-cell transcriptomes 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.14.718485v1?rss=1
</link>
<description><![CDATA[
Cell division is fundamental to multicellular organisms and stochastic partitioning of cellular components can strongly affect genome-wide gene expression states. However, how cell division-associated partitioning noise shapes the dynamics of proliferating cells is poorly understood. Here, we propose scDIVIDE, a neural stochastic differential equation framework to infer continuous cellular dynamics and division rates while accounting for partitioning noise. We combined birth-death-mutation processes from population genetics with dynamical optimal transport and revealed that the birth rate is embedded in the diffusion coefficient, enabling its inference from time-series scRNA-seq data. scDIVIDE accurately inferred birth rates in synthetic data and the inferred birth rates recapitulated turnover-related programs in mouse hematopoiesis data. By exploiting the birth-diffusion coupling, scDIVIDE provides a biologically-informed constraint on growth rate estimation, outperforming existing methods in predicting future cell distributions. scDIVIDE provides a conceptual avenue for quantitatively dissecting how partitioning noise shapes fate decisions in multicellular systems.
]]></description>
<dc:creator><![CDATA[ Okochi, Y., Sawazaki, Y., Kondo, Y., Naoki, H. ]]></dc:creator>
<dc:date>2026-04-16</dc:date>
<dc:identifier>doi:10.64898/2026.04.14.718485</dc:identifier>
<dc:title><![CDATA[Inferring division-associated stochasticity from time-series single-cell transcriptomes]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-16</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.14.718479v1?rss=1">
<title>
<![CDATA[
ProteomeScan: A Toolkit For Target Validation By Proteome-Wide Docking And Analysis 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.14.718479v1?rss=1
</link>
<description><![CDATA[
The problem of identifying which protein target a potential drug-like molecule interacts with is crucial for both the study of existing drugs and the design of new therapeutic compounds. Despite the importance of target identification, existing computational approaches remain limited in terms of speed, accuracy, and protein target coverage. We introduce ProteomeScan, a large-scale, gene-driven computational toolkit for systematic proteome-wide scanning to uncover hidden or previously uncharacterized protein-ligand interactions. ProteomeScan leverages cloud-scale high performance computing to perform extensive molecular docking simulations across the human proteome to rank candidate targets based on binding affinities. After filtering promiscuous targets, we found that ProteomeScan ranks known targets significantly better than a random baseline for a set of control compounds. Furthermore, we performed physical analyses of predicted binding modes for both promiscuous and known protein-ligand binding pairs to validate that ProteomeScan identifies interactions with valid binding pockets. In addition, we conducted experiments using mutant variants of proteins to study how mutations affect binding behavior. We have open sourced the core ProteomeScan algorithm as part of the DeepChem ecosystem to enhance transparency and reproducibility.
]]></description>
<dc:creator><![CDATA[ Barsainyan, A. A., Panda, R., Siguenza, J., Merico, D., Ramsundar, B. ]]></dc:creator>
<dc:date>2026-04-16</dc:date>
<dc:identifier>doi:10.64898/2026.04.14.718479</dc:identifier>
<dc:title><![CDATA[ProteomeScan: A Toolkit For Target Validation By Proteome-Wide Docking And Analysis]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-16</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.14.718488v1?rss=1">
<title>
<![CDATA[
MICRON learns outcome-associated representations of spatial immune microenvironments 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.14.718488v1?rss=1
</link>
<description><![CDATA[
Spatial imaging proteomics modalities, such as imaging mass cytometry, enable comprehensive identification of immune microenvironments driving disease outcomes. Identifying outcome-associated immune microenvironments from these data has proven to be complex, as it requires segmenting cells with complex shapes and reconciling spatial signatures across many heterogeneous samples. We present MICRON, a segmentation-free, fully automated tool based on multiple-instance learning for identifying outcome-linked immune microenvironments. MICRON learns representations of samples profiled with spatial imaging proteomics modalities, enabling more accurate prognostic and diagnostic prediction than existing approaches. As a case study, we show that MICRON generates a comprehensive importance map that reveals key outcome-associated immune microenvironments in brain cancer, uncovering coordinated cell-cell communication between astrocytes, NK cells, and macrophages linked to survival outcomes. MICRON is provided as open-source software for broad use by clinicians and biologists at https://github.com/ChenCookie/micron.
]]></description>
<dc:creator><![CDATA[ Chen, C.-J., George, B., Dhawka, L., Evangelista, B., Stanley, N. ]]></dc:creator>
<dc:date>2026-04-16</dc:date>
<dc:identifier>doi:10.64898/2026.04.14.718488</dc:identifier>
<dc:title><![CDATA[MICRON learns outcome-associated representations of spatial immune microenvironments]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-16</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.15.718708v1?rss=1">
<title>
<![CDATA[
DIOPT: the DRSC Integrative Ortholog Prediction Tool, 2026 update 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.15.718708v1?rss=1
</link>
<description><![CDATA[
Mapping orthologous proteins is a critical step for cross-species literature mining, data integration, experimental design, and more, making the ability to quickly predict orthologs across species a key tool for functional genomic studies. The DRSC Integrative Ortholog Prediction Tool (DIOPT) was initially developed in 2011 to provide a centralized portal for identifying predicted orthologs among major model organisms. By integrating results from multiple ortholog prediction algorithms, DIOPT allows users to compare predictions across methods and prioritize high-confidence ortholog relationships. Over the years, we regularly updated the underlying genome annotations and refreshed predictions from each integrated algorithm. In addition, both the number of supported species and the number of ortholog prediction algorithms incorporated into the platform have grown. The web portal has also been enhanced with new features designed to improve usability, facilitate data exploration, and support a broader range of research applications. We also developed a sister version of DIOPT tailored specifically for arthropod species; this enables researchers working with a diverse set of insects and related organisms to perform ortholog mapping and comparative analyses more effectively. Together, these developments ensure that DIOPT remains a robust and broadly useful resource for functional genomics research.
]]></description>
<dc:creator><![CDATA[ Hu, Y., Comjean, A., Gao, C., Yamamoto, S., Mohr, S., Perrimon, N. ]]></dc:creator>
<dc:date>2026-04-16</dc:date>
<dc:identifier>doi:10.64898/2026.04.15.718708</dc:identifier>
<dc:title><![CDATA[DIOPT: the DRSC Integrative Ortholog Prediction Tool, 2026 update]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-16</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.14.718434v1?rss=1">
<title>
<![CDATA[
MISSTE: a multiscale integrative spatial simulator for understanding the mechanisms underlying tissue ecosystems 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.14.718434v1?rss=1
</link>
<description><![CDATA[
Multiscale tissue ecosystems are governed by coupled intracellular decision-making, cell-cell interactions, and spatially structured microenvironmental signals, yet these scales are often studied separately. Here we present MISSTE, a modular framework that integrates Boolean intracellular state logic, agent-based modeling, and partial differential equation fields within a unified spatial simulation architecture. As a proof of concept, we applied MISSTE to CAR-T therapy in a solid tumor microenvironment. The model recapitulated emergent features of CAR-T behavior, including limited tumor penetration, stromal suppression, localized cytokine remodeling, hypoxia-associated constraint, and progressive functional exhaustion. Comparison of baseline and optimized conditions showed that coordinated enhancement of interaction range, migration, and cytotoxic function improved immune persistence and partial tumor control. Systematic parameter scans further identified effective immune-tumor contact as a stronger determinant of outcome than killing strength alone, highlighting spatial access as the dominant bottleneck. Guided by these results, we designed sequential intervention strategies and found that time-ordered enhancement of infiltration, killing, and late functional protection outperformed a static optimized regime. Together, these results establish MISSTE as a generalizable multiscale methodology for dissecting tissue ecosystems and for generating mechanistically grounded strategies for engineered cellular therapy design.
]]></description>
<dc:creator><![CDATA[ Su, Z., Yin, S., Wu, Y. ]]></dc:creator>
<dc:date>2026-04-16</dc:date>
<dc:identifier>doi:10.64898/2026.04.14.718434</dc:identifier>
<dc:title><![CDATA[MISSTE: a multiscale integrative spatial simulator for understanding the mechanisms underlying tissue ecosystems]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-16</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.14.718370v1?rss=1">
<title>
<![CDATA[
vcfilt: A Zero-Allocation Streaming Filter for High-Throughput VCF Processing 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.14.718370v1?rss=1
</link>
<description><![CDATA[
Variant Call Format (VCF) files are the dominant interchange format for genomic variant data, but their size, routinely exceeding tens of gigabytes for population-scale studies, creates a significant computational bottleneck at the quality-filtering stage. Existing tools such as bcftools and vcftools provide broad functionality through general-purpose expression engines, but incur substantial per-record overhead from dynamic field lookup, type resolution, and heap allocation. We present vcfilt, a streaming, batch-parallel VCF filter implemented in Go that restricts its scope to three high-frequency filter criteria (INFO/DP, INFO/AF, and QUAL) and applies them via a zero-allocation byte-scan parser. Benchmarked on real 1000 Genomes Project data (chromosome 20, 1,811,146 variants), vcfilt achieves 147,000 variants/second on an 18 GB plain-text VCF file using a single thread, a 12.2x speedup over bcftools 1.18 under identical conditions. On gzip-compressed input, the speedup is 7.9x. Output is byte-for-byte identical to bcftools across all tested filter combinations. vcfilt is distributed as a self-contained static binary, a Docker image, and a Singularity-compatible container. The source code and all benchmark scripts are openly available under the MIT licence.
]]></description>
<dc:creator><![CDATA[ KP, M. M. ]]></dc:creator>
<dc:date>2026-04-16</dc:date>
<dc:identifier>doi:10.64898/2026.04.14.718370</dc:identifier>
<dc:title><![CDATA[vcfilt: A Zero-Allocation Streaming Filter for High-Throughput VCF Processing]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-16</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.14.718378v1?rss=1">
<title>
<![CDATA[
Sampling antibody conformational ensembles with ABodyBuilder4-STEROIDS
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.14.718378v1?rss=1
</link>
<description><![CDATA[
Conformational flexibility is fundamental to the function of many proteins and in the case of antibodies can impact key properties such as affinity and specificity. While it is possible to predict single, static protein structures with high accuracy, predicting conformational ensembles remains challenging. Molecular dynamics simulations suffer from high computational costs, while deep learning methods are yet to achieve the same level of accuracy. Here, we introduce ABB4-STEROIDS, a generative structure prediction model that samples conformational ensembles of antibodies. We trained our model on 4.2 million structural frames derived from ~136,000 coarse-grained and a set of 83 new all-atom antibody MD simulations. We benchmarked our model on reproducing MD ensembles and evaluated the diversity of sampled structures and the covered conformational space against experimental evidence. ABB4-STEROIDS achieves state-of-the-art accuracy, particularly within the experimental benchmarks. The model is openly available and provides a robust resource for large-scale investigations of antibody conformational ensembles.
]]></description>
<dc:creator><![CDATA[ Spoendlin, F. C., Cagiada, M., Ifashe, K., Vavourakis, O., Deane, C. M. ]]></dc:creator>
<dc:date>2026-04-16</dc:date>
<dc:identifier>doi:10.64898/2026.04.14.718378</dc:identifier>
<dc:title><![CDATA[Sampling antibody conformational ensembles with ABodyBuilder4-STEROIDS]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-16</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.14.718492v1?rss=1">
<title>
<![CDATA[
Multiscale transcriptomic organization of the human brain with DigitalBrain 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.14.718492v1?rss=1
</link>
<description><![CDATA[
The human brain varies across anatomical regions, cell types, development, aging and disease states, yet existing single-cell transcriptomic resources remain fragmented and difficult to integrate into a unified biological model. Here we present DigitalBrain, a human brain-specific atlas and foundation-model framework for organizing diverse and fragmented human brain transcriptomic data across scales. We first built DigitalBrain-Atlas, a harmonized whole-brain single-cell resource comprising 16.35 million transcriptomes from 2,143 donors across 165 brain regions, spanning the human lifespan and multiple neurological and clinical conditions. We then developed DigitalBrain-M1, a Transformer-based model that jointly encodes gene identity and expression magnitude to learn a shared embedding space for cells and genes. Across held-out datasets, DigitalBrain supported robust single-cell integration, clustering and cell-type annotation while preserving major biological structure and reducing technical fragmentation. Beyond these benchmarks, the learned embeddings revealed emergent large-scale hierarchical organization of the human brain, linking anatomically distinct regions into higher-order patterns consistent with known functional systems. Applied to human hippocampal aging, DigitalBrain identified cell-type-specific aging-sensitive gene sets, highlighted dentate gyrus granule cells as a particularly age-sensitive population, and discovered selective reorganization of gene programs related to synaptic transmission, postsynaptic structure, membrane excitability and axon guidance during aging. Cross-dataset convergence was strongest at the level of functional modules and recurrent aging-sensitive genes. Together, these results establish DigitalBrain as a brain-specific framework for mapping human brain organization across scales and as an early step towards a complete virtual organ for the human brain.
]]></description>
<dc:creator><![CDATA[ An, J., Hu, X., Jiang, Y., Jiang, M., Qiu, S., Liu, G., Wei, X., Wang, Y., Lin, J. Q., Wang, C., Lu, M. ]]></dc:creator>
<dc:date>2026-04-16</dc:date>
<dc:identifier>doi:10.64898/2026.04.14.718492</dc:identifier>
<dc:title><![CDATA[Multiscale transcriptomic organization of the human brain with DigitalBrain]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-16</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.12.717909v1?rss=1">
<title>
<![CDATA[
scDisent: disentangled representation learning with causal structure for multi-omic single-cell analysis 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.12.717909v1?rss=1
</link>
<description><![CDATA[
Single-cell multi-omic technologies measure complementary aspects of cellular identity and regulatory state, yet most integration models compress these signals into one entangled latent space. Such representations are useful for clustering but poorly suited for mechanistic interpretation or perturbation-oriented analysis. We present scDisent (https://github.com/xiguoren/scDisent), a generative framework for disentangled representation learning that separates expression-associated variables (zexpr) from regulation-associated variables (zreg) and links them through a sparse directed mapping. scDisent combines modality-specific encoding, variational disentanglement with total-correlation and orthogonality constraints, and a Gumbel-gated causal module protected by detach-based gradient isolation. Evaluated on benchmark datasets with matched modalities, scDisent achieved best-in-benchmark integration performance while exposing regulatory structure that competing integration methods do not model explicitly. The learned causal atlas remained sparse, perturbation analyses recovered biologically coherent lineage-associated programs, and cross-dataset discovery analyses highlighted interpretable immune, neural and developmental signatures. Quantitative branch-separation analyses further showed that benchmark-label information concentrated in zexpr rather than zreg. Together, these results position scDisent as a computational method that improves not only integration quality but also biological interpretability, making single-cell multi-omic representations better suited to biological question answering and in silico hypothesis generation.
]]></description>
<dc:creator><![CDATA[ Xi, G. ]]></dc:creator>
<dc:date>2026-04-16</dc:date>
<dc:identifier>doi:10.64898/2026.04.12.717909</dc:identifier>
<dc:title><![CDATA[scDisent: disentangled representation learning with causal structure for multi-omic single-cell analysis]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-16</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.14.718375v1?rss=1">
<title>
<![CDATA[
Three-dimensional Virtual Adult Cardiomyocyte Transcriptomics 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.14.718375v1?rss=1
</link>
<description><![CDATA[
Adult cardiomyocytes are large, rod-shaped, and often multinucleated, which makes them challenging for current single-cell or single-nucleus RNA-sequencing platforms. Current spatial transcriptomics (ST) relies on nuclear-based cell segmentation, which performs poorly when identifying adult cardiomyocytes. Moreover, single-section ST of adult myocardium is insufficient to capture the cellular transcriptomic information of intact cardiomyocytes. Thus, there is an urgent need for novel technology that accurately profiles the transcriptome of adult cardiomyocytes in situ at the single-cell level. Here, we report the first three-dimensional virtual cardiomyocyte (3D-VirtualCM) transcriptome atlas by reconstructing multi-layer ST spanning a 100 µm depth of the adult mouse heart. Using membrane-based cell segmentation and similarity-guided cross-sectional contour matching, 3D-VirtualCM delineates individual cardiomyocyte 3D contours and integrates the in situ transcriptome. 3D-VirtualCM identifies cardiomyocytes in the cell cycle using proliferative markers in the context of myocardial infarction (MI) and reveals the asymmetric intracellular RNA distribution along the longitudinal axis of cardiomyocytes. Using 3D RNA fluorescence in situ hybridization (FISH), we validated the longitudinal asymmetry of Glul and Gja1 mRNA in adult cardiomyocytes. In summary, 3D-VirtualCM provides a workflow that advances the study of cardiac pathophysiology at a bona fide single-cell level while preserving spatial context.
]]></description>
<dc:creator><![CDATA[ Luo, C., Lyu, Y., Guo, X., Cheng, L., Liang, Q., Wang, S., Wang, Y., Zhang, S., Wang, S., Liu, T., Luo, Y., Lu, F., Ran, B., Zhang, Y., Liu, X., Wang, Y., Qin, G., Wu, J., Lyu, Q. R. ]]></dc:creator>
<dc:date>2026-04-16</dc:date>
<dc:identifier>doi:10.64898/2026.04.14.718375</dc:identifier>
<dc:title><![CDATA[Three-dimensional Virtual Adult Cardiomyocyte Transcriptomics]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-16</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.14.718403v1?rss=1">
<title>
<![CDATA[
Thermoadaptation of EndoG proteins in the Xenopus frog genus 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.14.718403v1?rss=1
</link>
<description><![CDATA[
Xenopus is a genus of entirely aquatic frogs found in sub-Saharan Africa. To date, the genomes of two species within the genus, Xenopus laevis and Xenopus tropicalis, have been fully sequenced, annotated, and made publicly available. The two species inhabit markedly different environments: X. tropicalis lives in the hot, equatorial regions of Africa, whereas X. laevis resides in the cooler climates of southern Africa. In the present study, mutational profiling, comparative homology modeling, and computational bioinformatics were used to identify the features of adaptive evolution in Xenopus endonuclease G (EndoG) proteins. Multiple characteristics of EndoG isozymes were found to vary considerably between the two Xenopus species dwelling in different locations. Most notably, EndoG proteins from the psychrophilic X. laevis exhibit increased contents of charged and polar residues, elevated pI, and higher intramolecular interaction energies, B factors, molecular void volumes, and solvent accessibilities, but decreased contents of nonpolar and aromatic amino acids and lower hydrophobicity, buried surface area, and molecular packing density compared to those from the thermophilic X. tropicalis. The observed differences strongly suggest that temperature plays a dominant role in EndoG diversification. Evaluation of intramolecular interaction energies appears to be a particularly sensitive and discriminative framework for assessing protein divergence at the structural level. Overall, this study highlights the diversification of homologous proteins in ectothermic vertebrate eukaryotes and provides mechanistic insight into protein adaptation to contrasting environments.
]]></description>
<dc:creator><![CDATA[ Tokmakov, A. A. ]]></dc:creator>
<dc:date>2026-04-16</dc:date>
<dc:identifier>doi:10.64898/2026.04.14.718403</dc:identifier>
<dc:title><![CDATA[Thermoadaptation of EndoG proteins in the Xenopus frog genus]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-16</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.14.718363v1?rss=1">
<title>
<![CDATA[
Generative design of intrinsically disordered proteins based on conditioned protein language models: Data is the limit 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.14.718363v1?rss=1
</link>
<description><![CDATA[
Intrinsically disordered proteins and regions (IDRs) are central to a multitude of biological processes. Despite extensive studies of their structural and physicochemical properties, the rational design of IDRs with defined conformational behavior remains challenging due to their ensemble nature. Here we present a generative framework for designing disordered protein sequences conditioned on target conformational ensemble descriptors using protein language models (pLMs). We formulate IDR design as the task of generating amino acid sequences predicted to realize specified biophysical properties and implement a Transformer encoder-decoder architecture that maps numerical descriptors to protein sequences. By training models on datasets spanning two orders of magnitude in size, we show that accurate control of conformational and physicochemical properties is achieved only at large data scale. These results demonstrate the feasibility of conditioning generative models on ensemble-level descriptors for IDR design. More broadly, these results support a data-centric paradigm for protein engineering, in which data availability emerges as a key limiting factor for the accurate design of IDRs.
]]></description>
<dc:creator><![CDATA[ Carriere, L., Huyghe, A., Pajkos, M., Bernado, P., Cortes, J. ]]></dc:creator>
<dc:date>2026-04-16</dc:date>
<dc:identifier>doi:10.64898/2026.04.14.718363</dc:identifier>
<dc:title><![CDATA[Generative design of intrinsically disordered proteins based on conditioned protein language models: Data is the limit]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-16</prism:publicationDate>
<prism:section></prism:section>
</item>
</rdf:RDF>
