<?xml version="1.0" encoding="UTF-8" ?>
<rdf:RDF xmlns:admin="http://webns.net/mvcb/" xmlns="http://purl.org/rss/1.0/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:prism="http://purl.org/rss/1.0/modules/prism/" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:syn="http://purl.org/rss/1.0/modules/syndication/">
<channel rdf:about="https://biorxiv.org">
<admin:errorReportsTo rdf:resource="mailto:biorxiv@cshlpress.edu"/>
<title>bioRxiv Subject Collection: Genomics Bioinformatics</title>
<link>https://biorxiv.org</link>
<description>
This feed contains articles for bioRxiv Subject Collection "Genomics Bioinformatics"
</description>

<items>
<rdf:Seq>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.15.718599v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.15.718654v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.15.718648v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.15.718632v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.15.718634v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.15.718133v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.14.718336v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.15.718756v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.15.717773v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.15.718796v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.14.718564v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.14.718546v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.14.718559v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.14.718256v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.13.717821v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.14.718510v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.14.718511v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.14.718168v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.13.717816v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.15.718501v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.16.718906v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.16.718917v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.15.718677v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.13.715198v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.14.718485v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.14.718479v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.14.718488v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.14.718508v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.15.718708v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.14.718434v1?rss=1"/>
</rdf:Seq>
</items>
<prism:eIssn/>
<prism:publicationName>bioRxiv</prism:publicationName>
<prism:issn/>

<image rdf:resource=""/>
</channel>
<image rdf:about="">
<title>bioRxiv</title>
<url>https://www.biorxiv.org/sites/default/files/bioRxiv_article.jpg</url>
<link>https://www.biorxiv.org</link>
</image>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.15.718599v1?rss=1">
<title>
<![CDATA[
Calibration of in-frame indel variant effect predictors for clinical variant classification 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.15.718599v1?rss=1
</link>
<description><![CDATA[
Insertions and deletions (indels) represent a substantial source of genetic variation in humans and are associated with a diverse array of functional consequences. Despite their prevalence and clinical importance, indels, particularly short in-frame indels, remain critically understudied compared to single nucleotide variants and are challenging to interpret clinically. While many computational predictors for missense variants have been rigorously evaluated and calibrated for clinical use, the clinical utility of tools for in-frame indels remains uncertain. To address this gap, we have calibrated in-frame indel prediction tools for clinical variant classification. We constructed a high-confidence dataset of in-frame indel variants (≤50 bp) from clinical and population databases and estimated the prior probability of pathogenicity of a rare in-frame indel observed in a disease-associated gene, and of an insertion and deletion separately. Using a previously developed statistical framework based on local posterior probabilities, we then established score thresholds for eight computational tools, corresponding to distinct evidence levels for pathogenic and benign classification according to ACMG/AMP guidelines. All in-frame indel predictors evaluated here reached multiple evidence levels of pathogenicity and/or benignity, demonstrating measurable clinical value. However, these models consistently exhibited lower performance levels compared to missense predictors, highlighting the need for improved computational approaches for indel classification.
]]></description>
<dc:creator><![CDATA[ Abderrazzaq, H., Singh, M., Babb, L., Bergquist, T., Brenner, S. E., Pejaver, V., O'Donnell-Luria, A., Radivojac, P., ClinGen Computational Working Group, ClinGen Variant Classification Working Group ]]></dc:creator>
<dc:date>2026-04-18</dc:date>
<dc:identifier>doi:10.64898/2026.04.15.718599</dc:identifier>
<dc:title><![CDATA[Calibration of in-frame indel variant effect predictors for clinical variant classification]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-18</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.15.718654v1?rss=1">
<title>
<![CDATA[
LagCI Enables Inference of Temporal Causal Relationships from Dense Multi-Omic Time Series 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.15.718654v1?rss=1
</link>
<description><![CDATA[
Inferring causal relationships from time-series data is critical for uncovering the dynamics of biological regulation. However, in multi-omics studies, this task is often hampered by sparse temporal sampling and the limitations of existing methods. To address this, we developed Lagged-Correlation Based Causal Inference (lagCI), a computational framework designed to identify time-lagged associations by combining comprehensive lag-correlation profiling with a robust statistical filtering scheme. Rather than relying on simple cross-correlation, lagCI analyzes the entire correlation profile and applies a quality-scoring system to filter out spurious associations that often plague high-dimensional datasets. We first tested lagCI on wearable physiological data, where it successfully captured the well-known causal link between physical activity and heart rate, even accounting for variations in lag times between individuals. Moving to high-frequency human multi-omics, we used lagCI to build a directed network of 1,624 molecules connected by over 157,000 predicted interactions. This network didn't just mirror established biology (such as cytokine-hormone crosstalk); it also pointed to specific molecular hubs that seem to orchestrate the timing of metabolic and immune responses. Overall, lagCI provides a data-driven way to extract temporal insights from dense longitudinal omics. We've made the tool available as an R package with multiple interfaces to ensure it's accessible for both bioinformaticians and clinicians.
]]></description>
<dc:creator><![CDATA[ Ge, Y., Bai, S., Qiang, Z., Liu, Y., Wu, Y., Shen, X. ]]></dc:creator>
<dc:date>2026-04-18</dc:date>
<dc:identifier>doi:10.64898/2026.04.15.718654</dc:identifier>
<dc:title><![CDATA[LagCI Enables Inference of Temporal Causal Relationships from Dense Multi-Omic Time Series]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-18</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.15.718648v1?rss=1">
<title>
<![CDATA[
Unsupervised Machine Learning for Adaptive Immune Receptors with immuneML 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.15.718648v1?rss=1
</link>
<description><![CDATA[
Machine learning (ML) enables adaptive immune receptor repertoires (AIRRs) analyses for biomarker identification and therapeutic development. With the majority of AIRR data partially or imperfectly labeled, unsupervised ML is essential for motif discovery, biologically meaningful clustering, and generation of novel receptor sequences. However, no unified framework for unsupervised ML exists in the AIRR field, hindering the assessment of model robustness and generalizability. Here, we present an immuneML release advancing unsupervised ML in the AIRR field through unified clustering workflows, interpretable generative modeling, integration with protein language model embeddings, dimensionality reduction, and visualization. We demonstrate immuneML's utility in three use cases: (i) benchmarking generative models for epitope-specific sequence generation, assessing specificity and novelty, (ii) systematic evaluation of clustering approaches on experimental receptor sequences against biological properties, such as epitope specificity and MHC, and (iii) unsupervised analysis of an experimental AIRR dataset to examine potential confounding, a practice widespread in related fields but unexplored in AIRR analyses.
]]></description>
<dc:creator><![CDATA[ Pavlovic, M., Wurtzen, C., Kanduri, C., Mamica, M., Scheffer, L., Lund-Andersen, C., Gubatan, J. M., Ullmann, T., Greiff, V., Sandve, G. K. ]]></dc:creator>
<dc:date>2026-04-18</dc:date>
<dc:identifier>doi:10.64898/2026.04.15.718648</dc:identifier>
<dc:title><![CDATA[Unsupervised Machine Learning for Adaptive Immune Receptors with immuneML]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-18</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.15.718632v1?rss=1">
<title>
<![CDATA[
Improved deconvolution of circulating tumor DNA from ultra-low-pass whole-genome methylation sequencing using CelFiE-ISH 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.15.718632v1?rss=1
</link>
<description><![CDATA[
Liquid biopsy using ultra-low-pass whole-genome sequencing (ULP-WGS, ~0.25x coverage) is a promising tool to detect circulating tumor DNA (ctDNA) for cancer management, and the use of the native Oxford Nanopore (ONT) sequencing platform adds DNA methylation to the set of detectable features. Here, we test the performance of methylation-based cell-type deconvolution in ULP-WGS samples from diverse epithelial malignancies and investigate several new computational strategies using our CelFiE-ISH deconvolution framework. We find that incorporating larger numbers of markers restricted to the epithelial cell lineage can reduce the cancer fraction limit of detection down to 1.7-3.1%, matching or exceeding the 3% floor of established copy-number alteration (CNA) benchmarks. Our study provides a useful strategy for analysis of ULP-WGS ONT data and indicates that marker selection remains a key challenge for analyzing methylation-based cancer datasets.
]]></description>
<dc:creator><![CDATA[ Katsman, E., Isaac, S., Darwish, A., Maoz, M., Inbar, M., Marouani, M., Unterman, I., Gugenheim, A., Salaymeh, N., Abu Khdeir, S., Uziely, B., Peretz, T., Kaduri, L., Hubert, A., Cohen, J. E., Salah, A., Temper, M., Sela, T., Grinshpun, A., Zick, A., Berman, B. P., Eden, A. ]]></dc:creator>
<dc:date>2026-04-18</dc:date>
<dc:identifier>doi:10.64898/2026.04.15.718632</dc:identifier>
<dc:title><![CDATA[Improved deconvolution of circulating tumor DNA from ultra-low-pass whole-genome methylation sequencing using CelFiE-ISH]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-18</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.15.718634v1?rss=1">
<title>
<![CDATA[
Pan-cancer survival modeling reveals structural limits of genomic feature integration in immunotherapy outcomes 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.15.718634v1?rss=1
</link>
<description><![CDATA[
Background Immune checkpoint inhibitors (ICIs) have improved outcomes across multiple cancer types, yet reliable predictors of survival remain limited. While genomic features such as tumor mutational burden (TMB) are widely used, their contribution to predictive modeling in heterogeneous real-world cohorts remains unclear. We evaluated the relative contributions of clinical and whole-genome sequencing (WGS) features in pan-cancer survival modeling. Methods We analyzed 658 patients treated with ICIs with matched WGS data from Genomics England. Using a leakage-controlled machine learning framework with strict train-test separation, we compared four models: TMB-only, clinical-only, clinical+TMB, and an integrated 11-feature clinico-genomic XGBoost survival model. Model performance was assessed using Harrell's concordance index (C-index) with bootstrap confidence intervals. Results TMB alone demonstrated near-random discrimination (C-index 0.50; 95% CI 0.44-0.56). Clinical variables substantially improved predictive performance (0.59; 95% CI 0.53-0.64), with marginal gain from adding TMB (0.59). The integrated model achieved a C-index of 0.60 (95% CI 0.55-0.65). While improvement over TMB alone was significant, incremental gain beyond optimized clinical models was modest. Feature attribution analysis showed that model performance was dominated by clinical variables, with genomic features contributing limited additional signal. Conclusions These findings suggest that, in heterogeneous pan-cancer cohorts, predictive performance is constrained by the underlying data structure, in which dominant clinical signals overshadow genome-scale features. This study highlights fundamental limitations in integrating genomic data into survival models across diverse cancer types and provides a benchmark for future computational approaches.
]]></description>
<dc:creator><![CDATA[ Hassan, W., Adeleke, S. ]]></dc:creator>
<dc:date>2026-04-18</dc:date>
<dc:identifier>doi:10.64898/2026.04.15.718634</dc:identifier>
<dc:title><![CDATA[Pan-cancer survival modeling reveals structural limits of genomic feature integration in immunotherapy outcomes]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-18</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.15.718133v1?rss=1">
<title>
<![CDATA[
GANGE: Achieving Sequencing Without Sequencing With Diffusion Guided Generative Genomic Transformer 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.15.718133v1?rss=1
</link>
<description><![CDATA[
The genome of a species is its book of life, but opening that book remains costly because of the limitations of existing sequencing technologies. Short-read sequencers have high fidelity but struggle to capture long and complex genomes. To counter this, long reads from third-generation sequencers are used, but these are prone to indel errors. Reads from both approaches are therefore combined at very high coverage, making sequencing projects prohibitively expensive and out of reach for most groups. Here we present a first-of-its-kind generative deep-learning system, GANGE, which not only recovers the correct sequence with high accuracy from indel-prone ONT reads at many-fold lower coverage but also extends it by 4 kb, achieving sequencing without sequencing, horizontally as well as vertically, while consistently maintaining >92% accuracy. This makes it possible to drastically reduce sequencing project costs. GANGE was tested on the A. thaliana and O. sativa genomes and human chromosome 1, where it delivered outstanding assembly performance. It was also used to accurately generate 2 kb upstream promoters of all genes from 12 different species, demonstrating that regulomics research can now be undertaken using RNA data alone when a genome sequence is not available. With this, GANGE brings a democratic turning point to genomics and sequencing research.
]]></description>
<dc:creator><![CDATA[ Gupta, S., Kumar, A., Bhati, U., Shankar, R. ]]></dc:creator>
<dc:date>2026-04-17</dc:date>
<dc:identifier>doi:10.64898/2026.04.15.718133</dc:identifier>
<dc:title><![CDATA[GANGE: Achieving Sequencing Without Sequencing With Diffusion Guided Generative Genomic Transformer]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-17</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.14.718336v1?rss=1">
<title>
<![CDATA[
cellNexus: Quality control, annotation, aggregation and analytical layers for the Human Cell Atlas data 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.14.718336v1?rss=1
</link>
<description><![CDATA[
Large-scale single-cell atlases such as the Human Cell Atlas have transformed our understanding of human biology. Yet, the lack of a robust framework that standardises quality control, expands cellular annotation, and adds normalisation and analytical layers, limits multi-study analyses and the usefulness of this resource. Here we present cellNexus, a comprehensive tool and resource that converts the Human Cell Atlas collection into analysis-ready data by linking quality control layers, metadata enrichment, expression normalisation, analysis and data aggregation. These enhancements enable robust statistical modelling across studies, exemplified by a multi-tissue map of immune cell communication during ageing, which reveals macrophage-muscle axes as among the most depleted regenerative interactions with age. All harmonised layers, including pseudobulk and cell-cell communication summaries, are accessible via a public web interface and with R and Python APIs. By providing continuous integration with CELLxGENE releases, cellNexus transforms large cell atlas corpora into an accessible, reproducible, interoperable foundation for large-scale biological discovery and the next generation of single-cell foundation models.
]]></description>
<dc:creator><![CDATA[ Shen, M., Gao, Y., Liu, N., Bhuva, D., Milton, M., Henao, J., Andrews, J., Yang, E., Zhan, C., Liu, N., Si, S., Hutchison, W. J., Shakeel, M. H., Morgan, M., Papenfuss, A. T., Iskander, J., Polo, J. M., Mangiola, S. ]]></dc:creator>
<dc:date>2026-04-17</dc:date>
<dc:identifier>doi:10.64898/2026.04.14.718336</dc:identifier>
<dc:title><![CDATA[cellNexus: Quality control, annotation, aggregation and analytical layers for the Human Cell Atlas data]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-17</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.15.718756v1?rss=1">
<title>
<![CDATA[
Benchmarking Tools for Identification of rRNA Modifications in Escherichia coli using Oxford Nanopore Direct RNA Sequencing 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.15.718756v1?rss=1
</link>
<description><![CDATA[
RNA modifications are important for RNA structure, stability, and ribosome function, but their identification and localisation remain challenging. Oxford Nanopore direct RNA sequencing (DRS) enables modification-agnostic detection in native RNA, but existing tool benchmarks have focused almost exclusively on m6A in eukaryotic mRNA, leaving multi-modification tool performance in bacterial systems largely untested. Here, we benchmark ten RNA modification detection tools spanning signal-comparison, error-rate, and hybrid approaches on Escherichia coli K-12 MG1655 16S and 23S rRNA, which harbour 11 and 25 known modified sites, respectively, across 17 modification types. Using native RNA and in vitro transcribed (IVT) unmodified RNA, we evaluate performance across 25 coverage levels (5x to 1000x). DiffErr and JACUSA2 showed the strongest discrimination performance (AUROC >0.9 on both 16S and 23S rRNA), with DiffErr achieving the highest F1 score on 16S and JACUSA2 showing the most consistent precision-recall balance across both rRNAs. Both tools achieved full transcript-wide scoring and, along with DRUMMER, exact positional localisation. Several other tools produced no output at many rRNA positions, and restricting evaluation to reported positions inflated apparent performance. Signal-based tools showed a systematic 1-4 nucleotide 5' offset from known modified positions, consistent with the ~5-mer nucleotide stretch present in the read head of the nanopore; applying tool-specific offset corrections substantially improved per-site recovery and reduced false positives, substantially improving the performance of tools such as EpiNano and nanoDoc. At single-site resolution, no known modified site was recovered by all tools, and several m5C, m5U, and m6A sites were missed by the majority of tools.
Tool combination analysis showed that pairing error-rate-based tools with offset-corrected signal-based tools improved site recovery beyond any individual tool, with the best three-tool combination recovering 30 of the 36 known sites while maintaining low false positive rates. These results establish that discrimination metrics (e.g. AUROC) alone are insufficient to evaluate modification detection tools: output completeness, positional precision, and per-modification-type sensitivity should be reported alongside standard benchmarking metrics.
]]></description>
<dc:creator><![CDATA[ Morampalli, B. R., Silander, O. K. ]]></dc:creator>
<dc:date>2026-04-17</dc:date>
<dc:identifier>doi:10.64898/2026.04.15.718756</dc:identifier>
<dc:title><![CDATA[Benchmarking Tools for Identification of rRNA Modifications in Escherichia coli using Oxford Nanopore Direct RNA Sequencing]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-17</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.15.717773v1?rss=1">
<title>
<![CDATA[
Genome sequencing and multi-stage, blood-feeding, and tissue-specific transcriptome atlas of the Rocky Mountain wood tick provide a critical resource for this vector 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.15.717773v1?rss=1
</link>
<description><![CDATA[
Dermacentor andersoni, the Rocky Mountain wood tick, is an important vector for pathogens impacting human and animal health, including bovine anaplasmosis, Colorado tick fever, and Rocky Mountain spotted fever. A better understanding of the biology of this tick is needed for developing disease prevention and vector control strategies. A reference genome was assembled for D. andersoni using high-fidelity (HiFi) long-read PacBio sequences and HiC contact mapping, yielding a contiguous assembly in which most contigs matched one of 11 chromosomes. Genome annotation by the NCBI eukaryotic genome annotation pipeline revealed high gene content completeness, yielding a genome completeness score of 94.0% using the Arachnida ortholog dataset. Following genome sequencing, we identified specific genes involved in blood feeding across a range of tissue types and life stages for D. andersoni. To accomplish this, RNA-seq analysis was used to investigate differential gene expression across most organs in adult, nymphal, and larval D. andersoni before and after feeding. Based on this analysis, we identified several gene groups that are involved in blood feeding. Furthermore, we establish sex- and developmental-stage-specific transcriptional profiles. Collectively, this study advances knowledge of D. andersoni biology and enables the development of strategies to limit the spread of diseases transmitted by this tick.
]]></description>
<dc:creator><![CDATA[ Tompkin, J. E., Saelao, P., Kruczalak, J., Yeo, H., Olafson, P. U., Sim, S. B., Oyen, K., Kelley, M., Corpuz, R. L., Scheffler, B., Geib, S. M., Childers, A., Chen, X., Weirauch, M. T., Dergousoff, S. J., Soghigian, J., Noh, S. M., Benoit, J. ]]></dc:creator>
<dc:date>2026-04-17</dc:date>
<dc:identifier>doi:10.64898/2026.04.15.717773</dc:identifier>
<dc:title><![CDATA[Genome sequencing and multi-stage, blood-feeding, and tissue-specific transcriptome atlas of the Rocky Mountain wood tick provide a critical resource for this vector]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-17</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.15.718796v1?rss=1">
<title>
<![CDATA[
Using machine learning to overcome mosquito collections missing data for malaria modeling 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.15.718796v1?rss=1
</link>
<description><![CDATA[
Entomological surveillance plays a crucial role in areas where malaria remains endemic, yet gathering data on mosquito populations is often expensive and complicated, particularly in remote locations with challenging logistics and inconsistent sampling schedules. Access to extensive time series data on mosquito species at specific sites would greatly enhance insights into seasonal trends and the biting habits of vectors of malaria parasites. Gaps in mosquito count records pose a significant challenge for researchers and public health officials seeking to establish early warning systems and effective vector control programs. In this study, we apply quantitative machine learning techniques to address missing data in estimates of mosquito abundance collected from 2009 to 2016 in Bolivar State, Venezuela. We evaluated Linear Regression, Stochastic Linear Regression, K Nearest-Neighbor, and Gradient Boosting methods for imputing missing counts of Anopheles mosquitoes, employing a leave-one-out cross-validation strategy. Additionally, we developed a predictive malaria transmission model incorporating mosquito abundance and climate variables (El Nino 3.4 Index, rainfall, and mean air temperature) as covariates. Our generalized time series model forecasts malaria incidence of Plasmodium vivax and Plasmodium falciparum based on climate dynamics and imputed mosquito data. Model performance was assessed using root mean square error, mean absolute error, and mean absolute percentage error. The final results demonstrated that machine learning imputation significantly improved the accuracy and reliability of P. vivax malaria incidence predictions but failed to predict P. falciparum incidence. The study demonstrates that method choice significantly influences the reconstruction of seasonal abundance patterns and the performance of malaria incidence models. Nevertheless, the proposed models strengthen the foundation for targeted interventions and surveillance in endemic regions. 
Despite limitations in data continuity and coverage, the findings highlight the value of combining multiyear entomological data sets with robust imputation and sensitivity analyses to improve predictive modeling in resource-constrained, malaria-endemic settings.
]]></description>
<dc:creator><![CDATA[ Rubio-Palis, Y., Feng, L., Liang, K. S., Song, C., Wang, S., Duchnicki, T., Zhang, X., Bravo de Guenni, L. ]]></dc:creator>
<dc:date>2026-04-17</dc:date>
<dc:identifier>doi:10.64898/2026.04.15.718796</dc:identifier>
<dc:title><![CDATA[Using machine learning to overcome mosquito collections missing data for malaria modeling]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-17</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.14.718564v1?rss=1">
<title>
<![CDATA[
Hybrid Gated Fusion: A Multimodal Deep Learning Framework for Protein Function Annotation 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.14.718564v1?rss=1
</link>
<description><![CDATA[
Protein function annotation requires integrating diverse biological signals, yet existing multimodal methods often struggle with missing inputs and redundant information. We present Hybrid Gated Fusion, a multimodal architecture that combines intrinsic protein features, including sequence and structure, with extrinsic functional context from text and interaction networks. Rather than weighting all modalities equally, the model uses bilinear gating to assess both the informativeness of each modality and its agreement with the others, while auxiliary supervision reduces modality dominance and preserves useful signal in weaker modalities. On the CAFA3 benchmark, a single Hybrid Gated Fusion model achieves state-of-the-art performance in Biological Process (F_max = 0.601) and Cellular Component (F_max = 0.706), while remaining competitive in Molecular Function (F_max = 0.702). Analysis of the learned gates shows that interaction networks and text often provide complementary functional signals, whereas structural features are down-weighted when redundant but remain valuable under sparse-input settings. These results establish Hybrid Gated Fusion as a robust and scalable framework for genome-scale protein function annotation.
]]></description>
<dc:creator><![CDATA[ Zhou, Z., Buchan, D. W. ]]></dc:creator>
<dc:date>2026-04-17</dc:date>
<dc:identifier>doi:10.64898/2026.04.14.718564</dc:identifier>
<dc:title><![CDATA[Hybrid Gated Fusion: A Multimodal Deep Learning Framework for Protein Function Annotation]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-17</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.14.718546v1?rss=1">
<title>
<![CDATA[
Recursive Repeat Extender (RRE): A recursive approach to automatically extend repeat element models 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.14.718546v1?rss=1
</link>
<description><![CDATA[
Repetitive elements, including transposable elements (TEs), are integral structural components of eukaryotic genomes; consequently, their identification and classification are crucial to their study. Several approaches have been developed to perform de novo genome-wide repeat identification through pairwise sequence comparisons; however, they often generate truncated repeat models due to their sampling strategies and the substantial fragmentation of many of the older repeat copies in the genome. To improve repeat models generated de novo, several algorithms have been developed that increase model length via the BEEA (BLAST-Extend-Extract-Align) approach, in which genomic instances of each repeat are identified with BLAST, their coordinates are extended, and a refined model is generated by aligning the extended sequences. Nevertheless, these extension algorithms exhibit two key limitations that hinder the reconstruction of highly degenerate and fragmented repeats: the use of BLAST as a search algorithm - which limits their sensitivity in detecting highly diverged sequences - and the use of a single search step, which precludes the reconstruction of extensively fragmented repeat models. In this work, we present a novel approach to extend repeat models, called RRE (Recursive Repeat Extender), which uses profile hidden Markov models (HMMs) to search for repeat elements with high sensitivity and employs a recursive extension strategy that iteratively searches and extends the repeat model, using the extended model from each round as input for the next and continuing until no additional sequence can be incorporated. We apply RRE to repeat libraries generated de novo from five model organisms, and our results show that RRE-generated repeat libraries contain fewer but longer repeat models and can identify a larger proportion of the genomes as repetitive than RepeatModeler2-generated repeat libraries. 
Notably, RRE can reconstruct highly degenerate repeats such as CR1_Mam, producing a model that achieves similar coverage to the reference Dfam model while extending it by an additional 131 bp that were not captured in the reference model. Overall, RRE enables the automatic improvement of de novo repeat libraries and the reconstruction of highly degenerate and fragmented repeats.
]]></description>
<dc:creator><![CDATA[ Falcon, F., Tanaka, E. M., Rodriguez-Terrones, D. ]]></dc:creator>
<dc:date>2026-04-17</dc:date>
<dc:identifier>doi:10.64898/2026.04.14.718546</dc:identifier>
<dc:title><![CDATA[Recursive Repeat Extender (RRE): A recursive approach to automatically extend repeat element models]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-17</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.14.718559v1?rss=1">
<title>
<![CDATA[
Virtual multiplex staining of the pancreatic islets across type 1 diabetes progression using a Schroedinger bridge 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.14.718559v1?rss=1
</link>
<description><![CDATA[
Classical hematoxylin and eosin (H&E) staining enables review of tissue morphology but lacks information regarding the molecular state of cells. Immunohistochemical (IHC) techniques label specific proteins in tissue, allowing differentiation of relevant structures that may go undetected in H&E. However, the IHC process is complex, expensive, and time-consuming, especially for multiplex IHC (mIHC), limiting its use in large cohorts. Stain conversion of H&E to IHC using generative artificial intelligence models such as generative adversarial networks (GANs) represents one solution to this problem. However, GANs are unstable during out-of-distribution sampling and are prone to hallucinations or mode collapse, limiting their accuracy in challenging image conversion tasks. To address this, the field has recently turned to diffusion models. Here, we introduce Schroedinger-bridge for Multiplex ImmunoLabel Estimation (SMILE). Unlike conventional diffusion models that map from source to target through an intermediate Gaussian noise, Schroedinger-bridge diffusion models skip this step and have been shown to better preserve structures during image translation. To test the performance of SMILE, we generated a large cohort of high-fidelity H&E-mIHC image pairs from pancreatic organ donors, targeting insulin, glucagon, and CD3. Our dataset is well sampled across type-1 diabetes status, pancreas anatomical location, age, and sex. Using this cohort, we demonstrate the superiority of SMILE compared to GANs via a comprehensive evaluation framework incorporating texture, distribution, and antibody-specific metrics, as well as blinded pathologist reviews. We further confirmed the ability of SMILE to generate accurate mIHC images from H&Es generated at an external site, to perform whole slide image conversion, and to generate realistic three-dimensional maps of the pancreatic islets in non-diabetic, auto-antibody positive, and type-1 diabetic donor tissue. 
Finally, we performed stain conversion of paired H&E to HER2 and Ki67 images in breast cancer, confirming the superiority of SMILE in diverse stain conversion applications. Collectively, this framework provides a scalable pipeline for high-throughput proteomic inference from archival H&Es, providing transformative potential for pancreatic research and digital pathology.
]]></description>
<dc:creator><![CDATA[ Shen, Y., Cho, W. J., Joshi, S., Wen, B., Naganathanhalli, S., Beery, M., Grubel, C. R., Sivasubramanian, A., Forjaz, A., Grahn, M. P., Dequiedt, L., Huang, Y., Han, K. S., Wu, F., Pedro, B. A., Wood, L. D., Chen, T., Hruban, R. H., Kusmartseva, I., Atkinson, M. A., Wirtz, D., Kiemen, A. L. ]]></dc:creator>
<dc:date>2026-04-17</dc:date>
<dc:identifier>doi:10.64898/2026.04.14.718559</dc:identifier>
<dc:title><![CDATA[Virtual multiplex staining of the pancreatic islets across type 1 diabetes progression using a Schroedinger bridge]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-17</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.14.718256v1?rss=1">
<title>
<![CDATA[
PathwaySeeker: Evidence-Grounded AI Reasoning over Organism-Specific Metabolic Networks 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.14.718256v1?rss=1
</link>
<description><![CDATA[
Metabolic activity is not an intrinsic property of an organism, but an emergent state shaped by environmental and experimental context. Despite recent advances in large language models (LLMs) and multi-omics profiling, current computational frameworks struggle to represent and reason over metabolism in a condition-specific manner. General-purpose AI systems operate on static, public biochemical knowledge, while multi-omics datasets capture dynamic measurements without a structured framework for mechanistic interpretation. As a result, metabolic network analysis remains disconnected from the experimental states that define biological function. Here, we introduce PathwaySeeker, an evidence-grounded AI system for organism-specific metabolic network reasoning. PathwaySeeker reconstructs sample-specific metabolic graphs from integrated proteomic and metabolomic data, fine-tunes an LLM on the resulting graph structure, and verifies each reasoning step against the experimental graph through iterative hypothesis search, an approach we term Oracle-in-the-Loop inference. Every output claim carries explicit evidence provenance, distinguishing experimentally confirmed relationships from biochemically plausible hypotheses requiring validation. We demonstrate the system using multi-omics data from the non-model white-rot fungus Trametes versicolor, where PathwaySeeker recovers branched phenylpropanoid pathways and transparently stratifies confirmed reactions from testable extensions. Post-hoc thermodynamic analysis and condition-specific metabolite dynamics support the biological feasibility of the reconstructed routes. By embedding experimental evidence provenance directly into language model-guided metabolic network reasoning, PathwaySeeker enables systematic differentiation between experimentally grounded knowledge and structured hypothesis, bridging frontier AI capabilities with organism-specific experimental evidence.
]]></description>
<dc:creator><![CDATA[ Oliveira Monteiro, L. M., Chowdhury, N. B., Oostrom, M., McDermott, J. E., Stratton, K. G., Choudhury, S., Bardhan, J. P. ]]></dc:creator>
<dc:date>2026-04-17</dc:date>
<dc:identifier>doi:10.64898/2026.04.14.718256</dc:identifier>
<dc:title><![CDATA[PathwaySeeker: Evidence-Grounded AI Reasoning over Organism-Specific Metabolic Networks]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-17</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.13.717821v1?rss=1">
<title>
<![CDATA[
Detection of a sequence feature for recursive splicing 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.13.717821v1?rss=1
</link>
<description><![CDATA[
RNA splicing is directed by cis-acting sequence signals interacting with trans-acting factors to remove introns from newly transcribed pre-mRNA, joining exons to generate mature mRNA. Splicing happens far more often than observed exon-exon junctions in mRNA. As one contributing process, the spliceosome progressively removes large introns in small segments by 'recursive splicing', instead of splicing the whole intron at one time. However, the cis-acting sequences associated with recursive splicing have not been identified. Using probabilistic mixture models, we found that recursive splicing occurs more frequently in first introns, which are typically longer, and exhibit a distinct CG-rich sequence feature in the sequences flanking the upstream 5'SS, and depletion of CGs in the downstream polypyrimidine tract. Remarkably, recursive splicing is also more frequent in downstream introns of genes containing first introns with these properties. Mechanistically, these data suggest that early events in RNA synthesis and processing influence the prevalence of recursive splicing for the rest of the transcript. Finally, we developed a sequence-dependent classifier for recursive splicing, which we tested with a novel medium-throughput primer extension assay. In summary, the usage of recursive splicing sites is established at the beginning of RNA synthesis through newly-identified sequence motifs flanking both ends of the first intron.
]]></description>
<dc:creator><![CDATA[ Wang, B., Yang, K., Barash, Y., Choi, P., Mount, S. M., Larson, D. R. ]]></dc:creator>
<dc:date>2026-04-17</dc:date>
<dc:identifier>doi:10.64898/2026.04.13.717821</dc:identifier>
<dc:title><![CDATA[Detection of a sequence feature for recursive splicing]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-17</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.14.718510v1?rss=1">
<title>
<![CDATA[
Active Learning for Budget-Constrained TCR-pMHC Wet-Lab Validation 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.14.718510v1?rss=1
</link>
<description><![CDATA[
Wet-lab validation of TCR-pMHC binding hypotheses is the rate-limiting step in T-cell therapy discovery: a single binding assay round can cost thousands of dollars and weeks of turnaround time, yet computational models generate thousands of candidate pairs per run. We frame this as a pool-based active learning problem: given a fixed annotation budget B, which unlabeled pairs should be sent to the assay to maximally improve a predictive model that will guide the next screening round? We introduce UDAL (Uncertainty-Diversity Active Learning), a batch acquisition strategy that combines BALD-based uncertainty estimation via MC Dropout with greedy core-set diversity selection in the encoder feature space. Evaluated on a curated VDJdb-IEDB benchmark under epitope-held-out and distance-aware protocols, UDAL achieves AUPRC 0.487 with only 5,000 queried labels, matching the performance of a model trained on 3× more randomly sampled labels. At a budget of 2,000 labels, UDAL improves AUPRC by 16.7% over random acquisition, translating directly to fewer wasted assay slots. These results demonstrate that principled active query strategies can substantially reduce the wet-lab cost of building reliable TCR specificity models.
]]></description>
<dc:creator><![CDATA[ Mazur, K., Piotrowska, M., Kowalski, J. ]]></dc:creator>
<dc:date>2026-04-17</dc:date>
<dc:identifier>doi:10.64898/2026.04.14.718510</dc:identifier>
<dc:title><![CDATA[Active Learning for Budget-Constrained TCR-pMHC Wet-Lab Validation]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-17</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.14.718511v1?rss=1">
<title>
<![CDATA[
FairTCR: Equity-Aware TCR-pMHC Binding Prediction Across HLA Alleles and Cohort Strata 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.14.718511v1?rss=1
</link>
<description><![CDATA[
Public TCR-pMHC binding databases are heavily skewed toward a handful of well-studied HLA alleles (most prominently HLA-A*02:01, which covers ~45% of curated records) and toward patients from European-ancestry cohorts. Standard empirical risk minimization (ERM) trained on such data achieves strong pooled accuracy but routinely underperforms on rare alleles and underrepresented cohorts, creating systematic disparities that are invisible in single-metric benchmarks. We introduce FairTCR, a group distributionally robust optimization (GDRO) framework that minimizes worst-group loss across HLA supertypes and cohort strata via online exponentiated gradient updates. FairTCR reduces the average-to-worst-group AUPRC disparity from 0.190 (ERM) to 0.098 on a curated VDJdb-IEDB benchmark, achieving a 48.4% disparity reduction while maintaining competitive average AUPRC (0.432 vs. 0.431 for ERM). Per-HLA analysis shows that rare allele groups (B*08:01, B*44:02) gain up to 0.062 AUPRC points, directly improving the equity of computational pre-screening for underrepresented patient populations.
]]></description>
<dc:creator><![CDATA[ Nowak, P., Kowalski, J., Lewandowski, T. ]]></dc:creator>
<dc:date>2026-04-17</dc:date>
<dc:identifier>doi:10.64898/2026.04.14.718511</dc:identifier>
<dc:title><![CDATA[FairTCR: Equity-Aware TCR-pMHC Binding Prediction Across HLA Alleles and Cohort Strata]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-17</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.14.718168v1?rss=1">
<title>
<![CDATA[
Uncertainty-aware benchmarking reveals ambiguous transcripts in mRNA-lncRNA classification 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.14.718168v1?rss=1
</link>
<description><![CDATA[
Background. Long non-coding RNAs (lncRNAs) have gained significant attention in recent years, yet distinguishing them from protein-coding transcripts remains challenging. Indeed, many lncRNAs share mRNA-like processing and existing sequence-derived signals do not fully capture the coding/non-coding boundary. Recent GENCODE annotation efforts revealed tens of thousands of novel lncRNA sequences as well as the reclassification of some lncRNAs into the protein-coding class, highlighting the need to better characterize transcript features associated with classification uncertainty and errors. Results. We performed uncertainty-aware benchmarking by retraining and evaluating eight transcript classifiers under a controlled protocol on a label-stable GENCODE v46-v47 subset. Beyond conventional model evaluation metrics, we quantified inter-tool agreement and entropy-based uncertainty to stratify transcripts into consensus, discordant, and consensus-error groups. To expand standard sequence and ORF-derived signals, we incorporated repeat-derived features from mature transcripts and non-B DNA motif features across gene bodies. Although aggregate performance was high, ~45% of transcripts showed inter-tool discordance, particularly among lncRNAs. Feature analyses linked low-uncertainty predictions to strong coding-like signals, whereas high-uncertainty profiles exhibited mixed signatures. Alongside classical predictors in global importance analyses, repeat-derived features appear as main contributors. Conclusions. By combining controlled benchmarking with transcript-level agreement and uncertainty stratification, together with extended feature profiling, we identified patterns associated with classifier disagreement and misclassification. This novel framework provides practical guidance for interpreting predictions, motivating the development of more robust coding/non-coding classifiers, while also shedding light on the sequence properties that distinguish lncRNA sequences.
]]></description>
<dc:creator><![CDATA[ Garcia-Ruano, D., Georges, M., Mohanty, S. K., Baaziz, R., Makova, K. D., Nikolski, M., Chalopin, D. ]]></dc:creator>
<dc:date>2026-04-17</dc:date>
<dc:identifier>doi:10.64898/2026.04.14.718168</dc:identifier>
<dc:title><![CDATA[Uncertainty-aware benchmarking reveals ambiguous transcripts in mRNA-lncRNA classification]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-17</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.13.717816v1?rss=1">
<title>
<![CDATA[
Agent-Guided De Novo Design of Nanobody Binders Against a Novel Cancer Target 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.13.717816v1?rss=1
</link>
<description><![CDATA[
Therapeutic antibody discovery remains slow and resource-intensive, with traditional methods providing limited control over epitope selection. We present a workflow for de novo nanobody design applied to a novel Desmoplastic Small Round Cell Tumor target encompassing four stages: (1) epitope identification guided by our hotspot recommendation agent using physical chemistry-based structure and sequence analysis tools with two curated databases (IEDB, PFAM), (2) de novo nanobody generation using three independent methods (RFantibody, IgGM, mBER) across multiple predicted antigen structures and nanobody frameworks, (3) multi-metric scoring including structural metrics from folding models and in silico binding affinity from our sequence-based predictor, and (4) high-throughput yeast surface display (YSD) screening followed by surface plasmon resonance (SPR) characterization of the specific binders. We generated 288,000 nanobody designs spanning eight target epitope regions and three variable domains of heavy chain-only antibody (VHH) frameworks. Multi-objective Pareto filtering with our candidate selection agent yielded 100,000 candidates for YSD screening with fluorescence-activated cell sorting (FACS). Of 116 enriched candidates advanced to SPR characterization, 46 (39.7%) produced reliable kinetic fits with Rmax ≥ 30 RU, yielding KD values from 0.66 nM to 305 nM (median 31.7 nM). These results show that an agent-guided computational workflow can design nanomolar to sub-nanomolar nanobody binders against a novel target without experimental structure or prior antibody information.
]]></description>
<dc:creator><![CDATA[ Zhao, Y., Yilmaz, M., Lee, E., Teh, C., Guo, L., Sonmez, K., Giancardo, L., Trang, G., Xu, F., Espinosa-Cotton, M., Cheung, N.-K., Kim, J., Cheng, X. ]]></dc:creator>
<dc:date>2026-04-17</dc:date>
<dc:identifier>doi:10.64898/2026.04.13.717816</dc:identifier>
<dc:title><![CDATA[Agent-Guided De Novo Design of Nanobody Binders Against a Novel Cancer Target]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-17</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.15.718501v1?rss=1">
<title>
<![CDATA[
Whole-genome 3D architectural screen reveals modulators of brain DNA structure 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.15.718501v1?rss=1
</link>
<description><![CDATA[
Three-dimensional (3D) genome architecture is the foundation of gene regulation, and plays a critical role in normal physiology and disease. However, our understanding of its biochemical determinants has long been limited by technology: imaging-based screens only profile a small number of loci, while sequencing-based studies rarely exceed 100 samples or conditions. Here we present "in-plate chromosome conformation capture" (Plate-C), a high-throughput, cost-effective platform that profiles thousands of whole-genome architectures in a day. Plate-C enabled the first chemical screen for whole-genome structural changes, profiling 2,956 samples from 834 conditions across 5 neuronal and glial types, accompanied by 6,081 single cells using "easy diploid chromosome conformation capture" (Easy Dip-C) and 200,893 single-cell transcriptomes. We discovered that diverse, dose/time-dependent, and cell type/species-specific modes of DNA structural changes can be rapidly induced by manipulating epigenetic (HDAC, BET), metabolic (mTOR), proteostatic (UPR), developmental (GSK3/Wnt, Hedgehog), immune (cGAS/STING), and neurotransmission pathways. To validate our finding in vivo, we demonstrated in newborn mice that HDAC inhibition drives brain-wide genome rewiring within hours, highly correlated with changes in vitro and inducing a latent structural and transcriptional state orthogonal to normal differentiation. By enabling massively parallel profiling of whole-genome structures, Plate-C paves the way for systematic discovery of DNA folding principles to better understand and engineer the human genome in 3D.
]]></description>
<dc:creator><![CDATA[ Parasar, B., Raja Venkatesh, A., Perera, J., Sosnick, L., Moghadami, S., Seo, Y., Shi, J., Chan, L., Takenawa, S., Akiyama, T., Sianto, O., Uenaka, T., Hadjipanayis, A., Wernig, M., Gitler, A. D., Tan, L. ]]></dc:creator>
<dc:date>2026-04-17</dc:date>
<dc:identifier>doi:10.64898/2026.04.15.718501</dc:identifier>
<dc:title><![CDATA[Whole-genome 3D architectural screen reveals modulators of brain DNA structure]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-17</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.16.718906v1?rss=1">
<title>
<![CDATA[
Integrating glycosylation in de novo protein design with ReGlyco Binder Design Filter 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.16.718906v1?rss=1
</link>
<description><![CDATA[
Artificial Intelligence (AI)-based methods for 3D protein structure prediction are revolutionising structural biology, providing novel templates for experimental data refinement and an on-demand 3D perspective on any molecular architecture and protein-protein interaction (PPI). Regardless of the inherent limitations of the various approaches available to date, the continuous improvement of the algorithms and the broad availability of open access (OA) web servers, software packages and databases are bound to accelerate the discovery and optimisation of novel biopharmaceuticals. Within this context, the development of computational pipelines for the de novo design of target-specific protein binders is especially exciting. As it stands, these processes are still rather inefficient and expensive, rapidly outputting thousands of designs that translate into meagre yields. Here we show how the explicit integration of glycosylation as a filter in the 3D de novo design pipeline can significantly improve efficiency and reduce laboratory costs with minimal additional computational resources. As a proof-of-concept, we used the GlycoShape database and ReGlyco tools to filter the results of a recent open competition launched by Adaptyv Bio for the design of binders as inhibitors against the heavily glycosylated Nipah virus glycoprotein (NiV-G). Screening of the 1,201 selected designs en bloc with ReGlyco, refined with the new ReGlyco Rotamer tool, flagged 11% of non-binders prior to experiment in approximately 3 hours on a dual-core CPU. We complement this analysis with a demo Colab notebook to illustrate our workflow. In this demo, users can design mini-binders against human erythropoietin (hEPO) by integrating GlycoShape resources with the RFdiffusion3 (RFD3) pipeline from the Institute for Protein Design (IPD).
]]></description>
<dc:creator><![CDATA[ Singh, O., Fadda, E. ]]></dc:creator>
<dc:date>2026-04-17</dc:date>
<dc:identifier>doi:10.64898/2026.04.16.718906</dc:identifier>
<dc:title><![CDATA[Integrating glycosylation in de novo protein design with ReGlyco Binder Design Filter]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-17</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.16.718917v1?rss=1">
<title>
<![CDATA[
Ancestral Genome Reconstruction. 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.16.718917v1?rss=1
</link>
<description><![CDATA[
AGR (Ancestral Genome Reconstruction) is an automatic, publicly available, open-source pipeline that infers paleogenomes from comparisons of modern species genomes. It exploits hierarchical clustering of inter-species chromosomal synteny relationships and can be used to unveil how ancestral genomes, genes, sequences and functions have been shaped over the millions of years of evolution leading to present-day plants.
]]></description>
<dc:creator><![CDATA[ Siguret, C., Olivier, M., Huneau, C., SOW, M. D., Stenger, P.-L., Klopp, C., Martin, M.-L., Tamby, J.-P., Civan, P., Pont, C., Mathieu, O., SALSE, J. ]]></dc:creator>
<dc:date>2026-04-16</dc:date>
<dc:identifier>doi:10.64898/2026.04.16.718917</dc:identifier>
<dc:title><![CDATA[Ancestral Genome Reconstruction.]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-16</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.15.718677v1?rss=1">
<title>
<![CDATA[
Histone H1 Variants Regulate Neurodevelopmental Transcriptional Programs in Autism with 16p11.2 deletion 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.15.718677v1?rss=1
</link>
<description><![CDATA[
Background: Neurodevelopmental disorders, including autism spectrum disorder, involve widespread transcriptional dysregulation. Copy number variations at 16p11.2 are among the strongest genetic risk factors for autism spectrum disorder, yet the molecular mechanisms by which these copy number variations contribute to neurodevelopmental pathology remain unclear. Results: We identify significant genetic associations between autism spectrum disorder susceptibility and the HIST1 histone gene cluster through genome-wide analysis. Transcriptomic profiling across post-mortem brain tissue, patient-derived neural progenitor cells, neurons, and cerebral organoids reveals consistent upregulation of linker histone variants H1.2 and H1.5 in idiopathic autism spectrum disorder and 16p11.2 hemi-deletion carriers, but not in schizophrenia or bipolar disorder. Functional assays demonstrate that dysregulated H1 expression disrupts gene networks involved in synaptic signaling, chromatin remodeling, and neural differentiation. Mechanistically, we link H1 upregulation to MAZ, a transcription factor encoded within the 16p11.2 locus. MAZ binds the promoter regions of H1 genes and represses their transcription. Knockdown of MAZ leads to H1 overexpression. H1 upregulation alone is sufficient to alter the expression of autism spectrum disorder-associated genes. Conclusions: Our findings define a MAZ-dependent regulation of H1 dosage as a critical chromatin-mediated mechanism contributing to transcriptional pathology in 16p11.2-associated autism spectrum disorder. Keywords: Histone H1, Autism Spectrum Disorder, 16p11.2 Hemi-deletion, MAZ, chromatin remodeling, transcriptomics.
]]></description>
<dc:creator><![CDATA[ Brudno, R., Askayo, D., Khair, D., Shayevitch, R., Keydar, I., Zmudjak-Olevson, M., Lev-Maor, G., Zavolan, M., Elkon, R., Ast, G. ]]></dc:creator>
<dc:date>2026-04-16</dc:date>
<dc:identifier>doi:10.64898/2026.04.15.718677</dc:identifier>
<dc:title><![CDATA[Histone H1 Variants Regulate Neurodevelopmental Transcriptional Programs in Autism with 16p11.2 deletion]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-16</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.13.715198v1?rss=1">
<title>
<![CDATA[
Canonical self-supervised pretraining paradigm constrains the capacity of genomic language models on regulatory decoding 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.13.715198v1?rss=1
</link>
<description><![CDATA[
Recent studies suggest that genomic language models (gLMs) could help decode the genomic regulatory code. Here, we systematically evaluated 11 representative gLMs across multiple regulatory genomics applications and found that current gLMs offer limited advantages over the random baseline. Further analysis revealed a systematic misalignment between the canonical sequence-only self-supervised pretraining paradigm and the context-specific dynamic nature of gene regulation, highlighting the need for function-oriented pretraining strategies that explicitly incorporate biochemical and regulatory priors.
]]></description>
<dc:creator><![CDATA[ Liang, Y.-X., Wang, Y., Pan, W.-Y., Chen, Z.-Y., Wei, J.-C., Gao, G. ]]></dc:creator>
<dc:date>2026-04-16</dc:date>
<dc:identifier>doi:10.64898/2026.04.13.715198</dc:identifier>
<dc:title><![CDATA[Canonical self-supervised pretraining paradigm constrains the capacity of genomic language models on regulatory decoding]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-16</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.14.718485v1?rss=1">
<title>
<![CDATA[
Inferring division-associated stochasticity from time-series single-cell transcriptomes 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.14.718485v1?rss=1
</link>
<description><![CDATA[
Cell division is fundamental to multicellular organisms and stochastic partitioning of cellular components can strongly affect genome-wide gene expression states. However, how cell division-associated partitioning noise shapes the dynamics of proliferating cells is poorly understood. Here, we propose scDIVIDE, a neural stochastic differential equation framework to infer continuous cellular dynamics and division rates while accounting for partitioning noise. We combined birth-death-mutation processes from population genetics with dynamical optimal transport and revealed that the birth rate is embedded in the diffusion coefficient, enabling its inference from time-series scRNA-seq data. scDIVIDE accurately inferred birth rates in synthetic data and the inferred birth rates recapitulated turnover-related programs in mouse hematopoiesis data. By exploiting the birth-diffusion coupling, scDIVIDE provides a biologically-informed constraint on growth rate estimation, outperforming existing methods in predicting future cell distributions. scDIVIDE provides a conceptual avenue for quantitatively dissecting how partitioning noise shapes fate decisions in multicellular systems.
]]></description>
<dc:creator><![CDATA[ Okochi, Y., Sawazaki, Y., Kondo, Y., Naoki, H. ]]></dc:creator>
<dc:date>2026-04-16</dc:date>
<dc:identifier>doi:10.64898/2026.04.14.718485</dc:identifier>
<dc:title><![CDATA[Inferring division-associated stochasticity from time-series single-cell transcriptomes]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-16</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.14.718479v1?rss=1">
<title>
<![CDATA[
ProteomeScan: A Toolkit For Target Validation By Proteome-Wide Docking And Analysis 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.14.718479v1?rss=1
</link>
<description><![CDATA[
The problem of identifying which protein target a potential drug-like molecule interacts with is crucial for both the study of existing drugs and the design of new therapeutic compounds. Despite the importance of target identification, existing computational approaches remain limited in terms of speed, accuracy, and protein target coverage. We introduce ProteomeScan, a large-scale, gene-driven computational toolkit for systematic proteome-wide scanning to uncover hidden or previously uncharacterized protein-ligand interactions. ProteomeScan leverages cloud-scale high performance computing to perform extensive molecular docking simulations across the human proteome to rank candidate targets based on binding affinities. After filtering promiscuous targets, we found that ProteomeScan ranks known targets significantly better than a random baseline for a set of control compounds. Furthermore, we performed physical analyses of predicted binding modes for both promiscuous and known protein-ligand binding pairs to validate that ProteomeScan identifies interactions with valid binding pockets. In addition, we conducted experiments using mutant variants of proteins to study how mutations affect binding behavior. We have open sourced the core ProteomeScan algorithm as part of the DeepChem ecosystem to enhance transparency and reproducibility.
]]></description>
<dc:creator><![CDATA[ Barsainyan, A. A., Panda, R., Siguenza, J., Merico, D., Ramsundar, B. ]]></dc:creator>
<dc:date>2026-04-16</dc:date>
<dc:identifier>doi:10.64898/2026.04.14.718479</dc:identifier>
<dc:title><![CDATA[ProteomeScan: A Toolkit For Target Validation By Proteome-Wide Docking And Analysis]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-16</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.14.718488v1?rss=1">
<title>
<![CDATA[
MICRON learns outcome-associated representations of spatial immune microenvironments 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.14.718488v1?rss=1
</link>
<description><![CDATA[
Spatial imaging proteomics modalities, such as imaging mass cytometry, enable comprehensive identification of immune microenvironments driving disease outcomes. Identifying outcome-associated immune microenvironments from these data has proven to be complex, as it requires segmenting cells with complex shapes and reconciling spatial signatures across many heterogeneous samples. We present MICRON, a segmentation-free, fully automated tool based on multiple-instance learning for identifying outcome-linked immune microenvironments. MICRON learns representations of samples profiled with spatial imaging proteomics modalities, enabling more accurate prognostic and diagnostic prediction than existing approaches. As a case study, we show that MICRON generates a comprehensive importance map that reveals key outcome-associated immune microenvironments in brain cancer, uncovering coordinated cell-cell communication between astrocytes, NK cells, and macrophages linked to survival outcomes. MICRON is provided as open-source software for broad use by clinicians and biologists at https://github.com/ChenCookie/micron.
]]></description>
<dc:creator><![CDATA[ Chen, C.-J., George, B., Dhawka, L., Evangelista, B., Stanley, N. ]]></dc:creator>
<dc:date>2026-04-16</dc:date>
<dc:identifier>doi:10.64898/2026.04.14.718488</dc:identifier>
<dc:title><![CDATA[MICRON learns outcome-associated representations of spatial immune microenvironments]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-16</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.14.718508v1?rss=1">
<title>
<![CDATA[
Ancestral chromatin state constrains the functional landscape of bivalent domains in mammalian spermatogenesis 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.14.718508v1?rss=1
</link>
<description><![CDATA[
Mammalian germ cells are enriched for bivalent chromatin, an epigenetic state defined by the dual presence of the activating H3K4me3 and repressive H3K27me3 histone modifications. Bivalency is evolutionarily conserved at developmentally important genes in germ cells but diverges at hundreds of additional loci, and evolutionary gains in bivalency have been proposed to reflect divergent somatic functions of the associated genes. Here, we sought to discover if evolutionary gains in bivalency occur selectively at genes with specific functions, and to better elucidate the role of bivalent chromatin in germ cells. By comparing genome-wide profiles for four histone modifications in spermatogenic cells of six mammalian species, we define a comprehensive set of mammalian bivalent domains and classify them based on conservation or divergence of chromatin state. We find that evolutionarily conserved bivalent regions exhibit canonical features of bivalency and maintain bivalency in embryonic stem cells. In contrast, bivalent domains emerging from a purely active or repressed ancestral chromatin state have atypical sequence and regulatory features and are frequently germ cell specific. Genes associated with these recent bivalent domains exhibit distinct somatic expression patterns that reflect their ancestral chromatin state in germ cells. Specifically, bivalent genes emerging from ancestrally active chromatin are more highly expressed in somatic tissues and are enriched for immune-related functions, while those emerging from ancestrally H3K27me3-only domains are lowly expressed in the soma and enriched for neurogenesis functions. We propose that recent bivalent regions demarcate sites of regulatory sequence change that preferentially impacts specific somatic lineages.
]]></description>
<dc:creator><![CDATA[ Farris, D. B., Tai, J., Lesch, B. J. ]]></dc:creator>
<dc:date>2026-04-16</dc:date>
<dc:identifier>doi:10.64898/2026.04.14.718508</dc:identifier>
<dc:title><![CDATA[Ancestral chromatin state constrains the functional landscape of bivalent domains in mammalian spermatogenesis]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-16</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.15.718708v1?rss=1">
<title>
<![CDATA[
DIOPT: the DRSC Integrative Ortholog Prediction Tool, 2026 update 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.15.718708v1?rss=1
</link>
<description><![CDATA[
Mapping orthologous proteins is a critical step for cross-species literature mining, data integration, experimental design, and more, making the ability to quickly predict orthologs across species a key tool for functional genomic studies. The DRSC Integrative Ortholog Prediction Tool (DIOPT) was initially developed in 2011 to provide a centralized portal for identifying predicted orthologs among major model organisms. By integrating results from multiple ortholog prediction algorithms, DIOPT allows users to compare predictions across methods and prioritize high-confidence ortholog relationships. Over the years, we have regularly updated the underlying genome annotations and refreshed predictions from each integrated algorithm. In addition, both the number of supported species and the number of ortholog prediction algorithms incorporated into the platform have grown. The web portal has also been enhanced with new features designed to improve usability, facilitate data exploration, and support a broader range of research applications. We also developed a sister version of DIOPT tailored specifically for arthropod species; this enables researchers working with a diverse set of insects and related organisms to perform ortholog mapping and comparative analyses more effectively. Together, these developments ensure that DIOPT remains a robust and broadly useful resource for functional genomics research.
]]></description>
<dc:creator><![CDATA[ Hu, Y., Comjean, A., Gao, C., Yamamoto, S., Mohr, S., Perrimon, N. ]]></dc:creator>
<dc:date>2026-04-16</dc:date>
<dc:identifier>doi:10.64898/2026.04.15.718708</dc:identifier>
<dc:title><![CDATA[DIOPT: the DRSC Integrative Ortholog Prediction Tool, 2026 update]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-16</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.14.718434v1?rss=1">
<title>
<![CDATA[
MISSTE: a multiscale integrative spatial simulator for understanding the mechanisms underlying tissue ecosystems 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.14.718434v1?rss=1
</link>
<description><![CDATA[
Multiscale tissue ecosystems are governed by coupled intracellular decision-making, cell-cell interactions, and spatially structured microenvironmental signals, yet these scales are often studied separately. Here we present MISSTE, a modular framework that integrates Boolean intracellular state logic, agent-based modeling, and partial differential equation fields within a unified spatial simulation architecture. As a proof of concept, we applied MISSTE to CAR-T therapy in a solid tumor microenvironment. The model recapitulated emergent features of CAR-T behavior, including limited tumor penetration, stromal suppression, localized cytokine remodeling, hypoxia-associated constraint, and progressive functional exhaustion. Comparison of baseline and optimized conditions showed that coordinated enhancement of interaction range, migration, and cytotoxic function improved immune persistence and partial tumor control. Systematic parameter scans further identified effective immune-tumor contact as a stronger determinant of outcome than killing strength alone, highlighting spatial access as the dominant bottleneck. Guided by these results, we designed sequential intervention strategies and found that time-ordered enhancement of infiltration, killing, and late functional protection outperformed a static optimized regime. Together, these results establish MISSTE as a generalizable multiscale methodology for dissecting tissue ecosystems and for generating mechanistically grounded strategies for engineered cellular therapy design.
]]></description>
<dc:creator><![CDATA[ Su, Z., Yin, S., Wu, Y. ]]></dc:creator>
<dc:date>2026-04-16</dc:date>
<dc:identifier>doi:10.64898/2026.04.14.718434</dc:identifier>
<dc:title><![CDATA[MISSTE: a multiscale integrative spatial simulator for understanding the mechanisms underlying tissue ecosystems]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-16</prism:publicationDate>
<prism:section></prism:section>
</item>
</rdf:RDF>
