<?xml version="1.0" encoding="UTF-8" ?>
<rdf:RDF xmlns:admin="http://webns.net/mvcb/" xmlns="http://purl.org/rss/1.0/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:prism="http://purl.org/rss/1.0/modules/prism/" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:syn="http://purl.org/rss/1.0/modules/syndication/">
<channel rdf:about="https://biorxiv.org">
<admin:errorReportsTo rdf:resource="mailto:biorxiv@cshlpress.edu"/>
<title>bioRxiv Subject Collection: Genomics Bioinformatics</title>
<link>https://biorxiv.org</link>
<description>
This feed contains articles for bioRxiv Subject Collection "Genomics Bioinformatics"
</description>

<items>
<rdf:Seq>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.08.714730v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.08.717207v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.08.717212v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.08.717021v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.08.717199v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.08.702570v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.08.717168v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.07.716833v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.07.717125v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.07.716940v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.07.717010v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.07.717040v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.07.716815v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.07.717034v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.08.717130v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.07.714835v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.07.716920v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.08.717258v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.07.716976v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.07.716912v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.07.716565v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.07.716958v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.07.716967v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.07.716683v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.07.716715v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.07.706161v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.07.716863v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.06.716845v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.06.716854v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.04.06.716861v1?rss=1"/>
</rdf:Seq>
</items>
<prism:eIssn/>
<prism:publicationName>bioRxiv</prism:publicationName>
<prism:issn/>

<image rdf:resource=""/>
</channel>
<image rdf:about="">
<title>bioRxiv</title>
<url>https://www.biorxiv.org/sites/default/files/bioRxiv_article.jpg</url>
<link>https://www.biorxiv.org</link>
</image>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.08.714730v1?rss=1">
<title>
<![CDATA[
PERREO: An integrated pipeline for repetitive elements analysis enables the repeatome expression profiling in cancer 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.08.714730v1?rss=1
</link>
<description><![CDATA[
Transcriptome-wide profiling of repetitive element expression reveals transposable element-derived transcripts that are deregulated in diverse biological contexts, including cancer. However, most RNA-seq pipelines are optimized for annotated genes and substantially undercount repeat RNA molecules, limiting their discovery and characterization. Here we present PERREO, a comprehensive, user-friendly pipeline for analyzing repetitive RNA elements from short- and long-read sequencing data. PERREO performs quality control, repeat-aware alignment and quantification, differential expression analysis, co-expression network analysis, and de novo transcript assembly with minimal computational expertise required. We validate PERREO across cell lines, tumor tissues and liquid biopsies, demonstrating superior sensitivity to repetitive RNA signatures compared with standard RNA-seq approaches. PERREO integrates predictive modelling to identify biological associations and generates publication-ready visualizations. By removing the bioinformatic barrier to repetitive RNA discovery, this pipeline enables broader investigation of the repeatome's role in cellular biology and disease, yielding results that, for specific analytical objectives, outperform those of certain existing tools and pipelines.
]]></description>
<dc:creator><![CDATA[ Rodriguez-Martin, F., Masero-Leon, M., Gomez-Cabello, D. ]]></dc:creator>
<dc:date>2026-04-10</dc:date>
<dc:identifier>doi:10.64898/2026.04.08.714730</dc:identifier>
<dc:title><![CDATA[PERREO: An integrated pipeline for repetitive elements analysis enables the repeatome expression profiling in cancer]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-10</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.08.717207v1?rss=1">
<title>
<![CDATA[
BrightEyes-FFS: an open-source platform for comprehensive analysis of fluorescence fluctuation spectroscopy experiments with small detector arrays 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.08.717207v1?rss=1
</link>
<description><![CDATA[
Fluorescence fluctuation spectroscopy (FFS) is an ensemble of techniques for quantitative measurement of molecular dynamics and interactions. Recently, the introduction of small-format array detectors has opened up a new range of spatiotemporal information, allowing for more detailed analysis of system kinetics. However, there is currently no open-source software available for analyzing the high-dimensional FFS data sets. We present BrightEyes-FFS, an open-source Python-based environment for FFS analysis with array detectors. The environment includes a Python package for reading raw FFS data, computing auto- and cross-correlations using various algorithms, and fitting the correlations to several models. A graphical user interface (GUI), available as a standalone executable, makes the analysis fast and user-friendly. An automated Jupyter Notebook writing tool enables transition from the GUI to Jupyter Notebook for custom analysis. We believe that BrightEyes-FFS will enable a wider community to study diffusion, flow, and interaction dynamics.
]]></description>
<dc:creator><![CDATA[ Slenders, E., Perego, E., Zappone, S., Vicidomini, G. ]]></dc:creator>
<dc:date>2026-04-10</dc:date>
<dc:identifier>doi:10.64898/2026.04.08.717207</dc:identifier>
<dc:title><![CDATA[BrightEyes-FFS: an open-source platform for comprehensive analysis of fluorescence fluctuation spectroscopy experiments with small detector arrays]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-10</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.08.717212v1?rss=1">
<title>
<![CDATA[
Statistical Principles Define an Open-Source Differential Analysis Workflow for Mass Spectrometry Imaging Experiments with Complex Designs 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.08.717212v1?rss=1
</link>
<description><![CDATA[
Mass spectrometry imaging (MSI) characterizes the spatial heterogeneity of molecular abundances in biological samples. Experiments with complex designs, involving multiple conditions and multiple samples, provide particularly useful insight into differential abundance of analytes. However, analyses of these experiments require attention to details such as signal processing, selection of regions of interest, and statistical methodology. This manuscript contributes a statistical analysis workflow for detecting differentially abundant analytes in MSI experiments with complex designs. Using a case study of histologic samples of human tibial plateaus from knees of osteoarthritis patients and cadaveric controls, as well as simulated datasets, we illustrate the impact of the analysis decisions. We illustrate the importance of signal processing and feature aggregation for preserving biological relevance and alleviating the stringency of multiple testing. We further demonstrate the importance of selecting regions of interest in ways that are compatible with differential analysis. Finally, we contrast several common statistical models for differential analysis, showcase the appropriate use of replication, and demonstrate model-based calculation of sample size for followup investigations. The discussion is accompanied by detailed recommendations and an open-source R-based implementation that can be followed by other investigations.
]]></description>
<dc:creator><![CDATA[ Rogers, E. B. T., Lakkimsetty, S. S., Bemis, K. A., Schurman, C. A., Angel, P. A., Schilling, B., Vitek, O. ]]></dc:creator>
<dc:date>2026-04-10</dc:date>
<dc:identifier>doi:10.64898/2026.04.08.717212</dc:identifier>
<dc:title><![CDATA[Statistical Principles Define an Open-Source Differential Analysis Workflow for Mass Spectrometry Imaging Experiments with Complex Designs]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-10</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.08.717021v1?rss=1">
<title>
<![CDATA[
Deep learning enables direct HLA typing from immunopeptidomics data 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.08.717021v1?rss=1
</link>
<description><![CDATA[
The immune system eliminates malignant and infected cells through T-cell-mediated recognition of peptides presented by human leukocyte antigen molecules. Mass spectrometry-based immunopeptidomics enables unbiased identification of naturally presented HLA-restricted peptides and has become central to the development of T-cell-based immunotherapies. However, immunopeptidomics data reflects the combined peptide presentation of multiple HLA alleles, and determining which allotypes are represented in this multi-allelic complexity remains an unmet computational challenge. Here, we introduce immunotype, a deep learning-based ensemble predictor for HLA class I allotyping directly from immunopeptidomics data. Immunotype integrates peptide and HLA sequence information through transformer encoders and a graph neural network, complemented by a curated mono-allelic reference of known peptide-HLA binding preferences. Immunotype achieves an overall accuracy of 87.2% at protein-level resolution across diverse tissues and thereby enables rapid, cost-effective HLA typing of large-scale immunopeptidomics datasets.
]]></description>
<dc:creator><![CDATA[ Pilz, M., Scheid, J., Bauer, A., Lemke, S., Sachsenberg, T., Bauer, J., Nelde, A., Stadelmaier, J., Walter, A., Rammensee, H.-G., Nahnsen, S., Kohlbacher, O., Walz, J. S. ]]></dc:creator>
<dc:date>2026-04-10</dc:date>
<dc:identifier>doi:10.64898/2026.04.08.717021</dc:identifier>
<dc:title><![CDATA[Deep learning enables direct HLA typing from immunopeptidomics data]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-10</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.08.717199v1?rss=1">
<title>
<![CDATA[
A computational model for quantifying instability of tandem repeats across the genome 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.08.717199v1?rss=1
</link>
<description><![CDATA[
Tandem repeats (TRs) exhibit high levels of somatic mosaicism, which is increasingly recognized as an important modifier of repeat expansion disorders. Long-read sequencing can capture full-length repeat alleles, yet robust frameworks for quantifying instability across TRs genome-wide are still needed. Here, we introduce a general-purpose model for quantifying TR instability in a given long-read sequencing dataset, without explicitly distinguishing biological mosaicism from technical noise, and which is broadly applicable to both simple and structurally complex loci. This model accurately characterizes allelic instability at each TR locus by representing the distribution of read-to-consensus deviations for each allele. Using HiFi sequencing data from 256 HPRC cell line samples, we fitted models for 617,007 TR loci, including known pathogenic repeats. We observe that instability levels are generally low, but vary substantially across individual TRs, and are driven more strongly by repeat composition than overall repeat length. Furthermore, we applied our method to targeted PureTarget long-read data from samples with known repeat expansions and identified significant mosaicism in the majority of expanded alleles. Our model offers a practical way to quantify instability of tandem repeats across the genome and to detect unusually unstable repeat alleles.
]]></description>
<dc:creator><![CDATA[ Dolzhenko, E., English, A., Mokveld, T., de Sena Brandine, G., Kronenberg, Z., Wright, G., Drogemoller, B., Rowell, W. J., Wenger, A. M., Bennett, M. F., Weisburd, B., Erwin, G. S., Jin, P., Nelson, D. L., Dashnow, H., Sedlazeck, F., Eberle, M. A. ]]></dc:creator>
<dc:date>2026-04-10</dc:date>
<dc:identifier>doi:10.64898/2026.04.08.717199</dc:identifier>
<dc:title><![CDATA[A computational model for quantifying instability of tandem repeats across the genome]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-10</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.08.702570v1?rss=1">
<title>
<![CDATA[
Eco-physiological and transcriptomic plasticity of Dianthus inoxianus in response to drought 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.08.702570v1?rss=1
</link>
<description><![CDATA[
Phenotypic plasticity is a key mechanism by which plants adjust their traits to environmental changes. These phenotypic adjustments are driven by plastic changes in gene expression regulated by gene regulatory networks. Drought, a major selective force in Mediterranean ecosystems, provides a powerful context to examine how genomic plasticity translates into phenotypic responses. Here, we used Dianthus inoxianus, a drought-tolerant Mediterranean carnation, to characterize phenotypic and transcriptomic plasticity in response to drought stress, combining ecophysiological measurements with RNA-seq, gene co-expression and gene regulatory network analyses. Most phenotypic traits exhibited low plasticity in response to drought, except water and osmotic potential. At the transcriptome level, we identified 57 plastic genes, suggesting that drought tolerance in D. inoxianus relies predominantly on constitutive gene expression. These plastic genes were enriched in processes typically related to drought response, such as cell wall components and abscisic acid (ABA) signaling. Some plastic genes belonged to drought-responsive modules, while others were hubs in different modules acting as inter-modular connectors. Furthermore, the regulatory network revealed that these plastic genes were strongly regulated by multiple stress-responsive transcription factors, and that drought-associated modules were regulated through both ABA-dependent and ABA-independent pathways. In addition, we identified contrasting patterns of canalization and decanalization, with immune and post-transcriptional regulation remaining canalized under drought, whereas photosynthesis and amino acid metabolism became decanalized, potentially releasing cryptic genetic variation. Overall, our results emphasise that drought tolerance in D. inoxianus emerges from a strategy combining preadaptation with targeted plasticity in key molecular pathways.
]]></description>
<dc:creator><![CDATA[ Parra, A. R., Balao, F. ]]></dc:creator>
<dc:date>2026-04-10</dc:date>
<dc:identifier>doi:10.64898/2026.04.08.702570</dc:identifier>
<dc:title><![CDATA[Eco-physiological and transcriptomic plasticity of Dianthus inoxianus in response to drought]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-10</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.08.717168v1?rss=1">
<title>
<![CDATA[
Structure-aware geometric graph learning for modeling protease-substrate specificity at scale 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.08.717168v1?rss=1
</link>
<description><![CDATA[
Protease-substrate specificity is central to cellular regulation and disease pathogenesis, and accurately modeling its structural determinants remains challenging. Substrate recognition is governed by spatial constraints and higher-order relationships that extend beyond local sequence motifs. Most computational approaches rely predominantly on motif-centric or sequence-based representations, limiting their ability to capture the geometric and relational structure underlying enzymatic specificity. Here, we introduce OmniCleave, a structure-aware geometric graph learning framework for modeling protease-substrate specificity at scale. OmniCleave is trained on 57,278 structure-informed protease-substrate pairs derived from 9,651 substrates spanning over 100 proteases across six distinct families. The framework integrates multi-scale structural graphs with higher-order protease relational topology, explicitly encoding spatial context and inter-protease dependencies within a unified geometric representation. This formulation moves beyond local pattern recognition and enables transferable modelling across six protease families. Across large-scale benchmarks, the framework consistently outperforms existing approaches and reveals interpretable geometric determinants underlying substrate recognition. Experimental validation confirms three novel caspase-3 substrates and 21 cleavage sites predicted by OmniCleave, supporting the biological relevance of the learned representations. Together, OmniCleave provides a scalable geometric framework for modeling protease-substrate specificity, with practical utility for systematic analysis of protease biology.
]]></description>
<dc:creator><![CDATA[ Guo, X., Bi, Y., Ran, Z., Pan, T., Sun, H., Hao, Y., Jia, R., Wang, C., Zhang, Q., Kurgan, L., Song, J., Li, F. ]]></dc:creator>
<dc:date>2026-04-10</dc:date>
<dc:identifier>doi:10.64898/2026.04.08.717168</dc:identifier>
<dc:title><![CDATA[Structure-aware geometric graph learning for modeling protease-substrate specificity at scale]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-10</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.07.716833v1?rss=1">
<title>
<![CDATA[
MTB-KB: A Curated Knowledgebase of Mycobacterium tuberculosis Related Studies 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.07.716833v1?rss=1
</link>
<description><![CDATA[
Tuberculosis (TB), caused by Mycobacterium tuberculosis (MTB), has regained its position as the world's leading killer among infectious diseases. Despite extensive research progress across epidemiology, diagnosis, drug development, treatment regimens, vaccines, drug resistance, virulence factors, and immune mechanisms, MTB-related knowledge remains fragmented across thousands of publications, limiting its effective use. To address this gap, we present MTB-KB, a literature-curated knowledgebase that systematically integrates high-impact findings from eight major sections of TB research. The current release contains 75,170 associations from 1,246 publications, covering 18,439 entities standardized using authoritative databases and WHO-endorsed classifications. A central feature is the interactive knowledge graph, which links cross-section associations to reveal and infer MTB-host interactions, treatment strategies, and vaccine development opportunities. MTB-KB also provides a user-friendly interface with browsing, advanced search, and statistical visualization. Overall, by consolidating dispersed MTB knowledge into a structured and accessible platform, MTB-KB provides a valuable resource for researchers, clinicians, and policymakers, supporting both basic and clinical TB research, enabling evidence-based TB prevention, diagnosis, and treatment, and contributing to global elimination efforts. MTB-KB is accessible at https://ngdc.cncb.ac.cn/mtbkb/.
]]></description>
<dc:creator><![CDATA[ Li, P., Li, C., Zhu, R., Sun, W., Zhou, H., Fan, Z., Yue, L., Zhang, S., Jiang, X., Luo, Q., Han, J., Huang, H., Shen, A., Bahetibieke, T., Wang, J., Zhang, W., Wen, H., Niu, H., Bu, C., Zhang, Z., Xiao, J., Gao, R., Chen, F. ]]></dc:creator>
<dc:date>2026-04-10</dc:date>
<dc:identifier>doi:10.64898/2026.04.07.716833</dc:identifier>
<dc:title><![CDATA[MTB-KB: A Curated Knowledgebase of Mycobacterium tuberculosis Related Studies]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-10</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.07.717125v1?rss=1">
<title>
<![CDATA[
Genomic epidemiology of the 2017-2023 outbreak of Mycoplasma bovis sequence type ST21 in New Zealand 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.07.717125v1?rss=1
</link>
<description><![CDATA[
Mycoplasma bovis was first detected in cattle in New Zealand in 2017, prompting an eradication programme that incorporated extensive surveillance and a test-and-cull policy. Genome sequence data and phylodynamic models were used to inform decision making throughout the eradication programme. Isolates from 697 cattle on 126 farms were collected and sequenced between July 2017 and December 2023. Phylodynamic models were used to estimate the time of most recent common ancestor, the effective reproduction number (Reff) and effective population size, and long-range and local between-farm transmission dynamics. The analysis revealed the dramatic impact of movement restrictions and culling up to early 2020, with a sharp reduction in the Reff to less than 1 in 2018/9 and the extinction of two of three major lineages in 2020. This was followed by three years of residual infection in farms in the South Island, associated with persistent infection of a large feedlot farm and nearby farms. The comprehensive dataset of genomic and epidemiological data provided a unique opportunity to study the dynamics of a country-wide outbreak of a single-host pathogen from first detection to potential eradication, underlining the utility of integrated genomic surveillance during an outbreak response.
]]></description>
<dc:creator><![CDATA[ French, N. P., Burroughs, A., Binney, B., Bloomfield, S., Firestone, S. M., Foxwell, J., Gias, E., Sawford, K., van Andel, M., Welch, D., Biggs, P. J. ]]></dc:creator>
<dc:date>2026-04-10</dc:date>
<dc:identifier>doi:10.64898/2026.04.07.717125</dc:identifier>
<dc:title><![CDATA[Genomic epidemiology of the 2017-2023 outbreak of Mycoplasma bovis sequence type ST21 in New Zealand]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-10</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.07.716940v1?rss=1">
<title>
<![CDATA[
TCMCard: A High-Confidence Digital Infrastructure for Traditional Chinese Medicine Quantified by Multi-Dimensional Evidence Integration 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.07.716940v1?rss=1
</link>
<description><![CDATA[
Network pharmacology has become a widely used approach for deciphering multi-component, multi-target mechanisms of traditional Chinese medicine (TCM). Here we introduce TCMCard, a high-confidence digital infrastructure built on a Multi-Dimensional Evidence Integration (MDEI) framework. The framework integrates experimental activity data from authoritative chemical databases, literature-derived evidence, and structure-based similarity inference. Preprocessing steps include chemical structure normalization, species-specific filtering, and target quality scoring. Applied to conventional interaction datasets, this pipeline leads to the removal of over 60% of low-confidence noise. TCMCard supports network pharmacology exploration through an interactive visualization platform, and module analysis identifies functionally relevant communities that offer insights into the synergistic actions of TCM formulas. Overall, TCMCard may help move the field beyond simple data aggregation toward evidence-informed curation and quality-driven analysis. As an interactive and publicly accessible platform, it reveals an organized backbone within complex interaction networks, offering a more reliable basis for understanding multi-component synergy in TCM.
]]></description>
<dc:creator><![CDATA[ Wang, Y., Dong, W., Yao, J., Wang, K., Zhang, L., Wang, Y., Guo, S., Li, H., Cai, H., Wang, X., Li, Y. ]]></dc:creator>
<dc:date>2026-04-10</dc:date>
<dc:identifier>doi:10.64898/2026.04.07.716940</dc:identifier>
<dc:title><![CDATA[TCMCard: A High-Confidence Digital Infrastructure for Traditional Chinese Medicine Quantified by Multi-Dimensional Evidence Integration]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-10</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.07.717010v1?rss=1">
<title>
<![CDATA[
Generating, curating, and evaluating trnL reference sequence databases: Benchmarking OBITools3/ecoPCR, RESCRIPt, and MetaCurator 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.07.717010v1?rss=1
</link>
<description><![CDATA[
Plant DNA metabarcoding enables the identification of plant taxa in mixed samples, with the trnL (UAA) intron and its P6 loop mini-barcode region performing as well as or better than other commonly used markers. Reliable metabarcoding requires high-quality reference databases, yet a regularly maintained trnL resource is currently lacking. Consequently, most studies use uncurated sequences downloaded directly from public repositories without essential validation. We address these gaps by providing guidance through a systematic comparison of three database curation tools (OBITools3/ecoPCR, RESCRIPt, and MetaCurator), using each to generate a trnL reference sequence database and evaluating classification performance across commonly sequenced trnL regions (CD, CH, and GH). Reference trnL sequences and taxonomy files were retrieved from public sequence repositories and curated using standardized filtering steps to reduce taxonomic errors, sequence ambiguity, and redundancy. Four simulated query datasets (two base sets and their mutated counterparts) were constructed to assess classification performance of the databases using the Naive Bayesian Classifier implemented in DADA2. The evaluation showed that performance differed by trnL region: MetaCurator and RESCRIPt yielded higher and similar metrics for trnL CD; OBITools3/ecoPCR and RESCRIPt were comparable for trnL CH; and MetaCurator attained the highest performance for the trnL GH region. All reference databases, taxonomy, and evaluation files are available at Zenodo (https://doi.org/10.5281/zenodo.17969450). The complete computational workflow and scripts are available on GitHub (https://github.com/oskuddar/trnL_DB). Although evaluation was focused on plant taxa in the United States, the resulting databases are suitable for use as global trnL reference databases.
]]></description>
<dc:creator><![CDATA[ KUDDAR, O. S., Meiklejohn, K. A., Callahan, B. J. ]]></dc:creator>
<dc:date>2026-04-10</dc:date>
<dc:identifier>doi:10.64898/2026.04.07.717010</dc:identifier>
<dc:title><![CDATA[Generating, curating, and evaluating trnL reference sequence databases: Benchmarking OBITools3/ecoPCR, RESCRIPt, and MetaCurator]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-10</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.07.717040v1?rss=1">
<title>
<![CDATA[
Synolog: A Scalable Synteny-Based Framework for Genome Architecture Characterization 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.07.717040v1?rss=1
</link>
<description><![CDATA[
Detailing the genomic architecture across multiple organisms has been a task performed for decades. The continuing growth of genomic datasets not only serves as a resource for studying genome evolution but also warrants the availability of scalable and user-friendly software for processing these datasets. Here, we present Synolog, a bioinformatic toolkit that can automatically identify orthologs for both protein-coding and non-coding genes, synteny clusters across two or more genomes, as well as retrogenes and segmental duplications. Applying Synolog, we illustrate cases of local gene expansions in ecologically disparate turtle species, identify synteny clusters across hundreds of millions of years of metazoan evolution, and reconstruct chromosome-level assemblies in teleosts using the inferred synteny clusters, all using its integrated visual features. In parallel, we compare our orthogroup method to that of commonly used software and note the tradeoffs of making inferences solely based on sequence similarity versus a synteny-based approach.
]]></description>
<dc:creator><![CDATA[ Madrigal, G., Catchen, J. M. ]]></dc:creator>
<dc:date>2026-04-10</dc:date>
<dc:identifier>doi:10.64898/2026.04.07.717040</dc:identifier>
<dc:title><![CDATA[Synolog: A Scalable Synteny-Based Framework for Genome Architecture Characterization]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-10</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.07.716815v1?rss=1">
<title>
<![CDATA[
Impact of Regularization Methods and Outlier Removal on Unsupervised Sample Classification 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.07.716815v1?rss=1
</link>
<description><![CDATA[
Background: High-content assays have problems distinguishing biologically significant effects from the incidental effects of non-repeatable technical factors. Non-repeatable results are attributed to variations in the cell culture environment and the numerous, heterogeneous descriptors evaluated. The aim here was to determine whether preprocessing operations impacted the reproducibility of class assignments of experimental data. Methods: Batch effects that could affect reproducibility, i.e., signal/noise ratio, instrumental conditions, and segmentation, were controlled variables. The remaining batch effects, variations in materials, personnel, and culture environment could not be controlled. The values of descriptors were measured directly from images. Exploratory factor analysis was used to solve the identifiable and interpretable feature, factor 4. In each of five trials, one sample was treated with the same chemical mixture (EXP) and another with the solvent vehicle alone (CON). Results: Repeated CON and EXP samples showed significant differences among factor 4 means in data regularized within each trial. The mean of Trial 3 CON differed significantly from all other CON samples. These differences disappeared upon regularization to comprehensive databases. Among repeated EXPs, the Trial 2 mean differed from three other EXPs, but regularization to comprehensive databases had little effect. However, classification patterns were unchanged after regularization to any comprehensive database derived by the same protocol. After regularization to datasets derived by two different protocols, the classification pattern differed but only reflected elevation of differences that had been marginal to statistical significance. Outlier removal was deleterious. Even with the most sparing definition of outliers, over 3% of the contents of a single sample were removed from most trials. Elimination based on the overall within-trial distributions caused type I and type II errors. 
Conclusions: Non-repeatable factor 4 means in repeated trials had negligible influence on classification outcomes, so repeatability may not be a good indicator of assay quality. Irreducible batch effects, combined with small sample sizes and skewed distributions of the descriptor values, may account for non-repeatability. As the current results are based on real-world data, they suggest that non-repeatability is an uncorrectable feature of these assays. Classification patterns are not affected by several irreducible technical factors, namely materials, personnel, and non-repeatable environmental variables.
]]></description>
<dc:creator><![CDATA[ Heckman, C. A. ]]></dc:creator>
<dc:date>2026-04-10</dc:date>
<dc:identifier>doi:10.64898/2026.04.07.716815</dc:identifier>
<dc:title><![CDATA[Impact of Regularization Methods and Outlier Removal on Unsupervised Sample Classification]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-10</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.07.717034v1?rss=1">
<title>
<![CDATA[
MHCXGraph: A Graph-Based approach to detecting T cell receptor cross-reactivity 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.07.717034v1?rss=1
</link>
<description><![CDATA[
The T cell receptor (TCR) recognition of multiple peptides presented by the major histocompatibility complex (MHC) is a key natural phenomenon, enabling the T cell repertoire to respond to a broad array of antigens. Despite its importance to the immune response, T cell cross-reactivity poses a major challenge for the development of novel T cell-based therapies. In this study, we present MHCXGraph, a graph-based computational approach for identifying conserved and immunologically relevant regions across multiple structures of peptides bound to MHC molecules (pMHC). Our approach provides three operational modes with user-defined parameters, allowing flexible configuration according to specific scientific needs while delivering fully interpretable results through user-friendly interfaces. We evaluated MHCXGraph across three case studies, including peptides bound to classical MHC Class I, MHC Class II, and unbound HLA alleles, demonstrating its ability to capture conserved structural determinants beyond sequence similarity. By integrating structural information with efficient graph-based analysis, MHCXGraph addresses key limitations of sequence-based methods while maintaining computational scalability. Collectively, these results indicate that MHCXGraph can be readily integrated into computational pipelines for T cell cross-reactivity discovery, especially in the context of de novo pMHC engager design and T cell-based vaccine development.
]]></description>
<dc:creator><![CDATA[ Simoes, C. D. M. S., Maidana, R. L. B. R., De Assis, S. C., Guerra, J. V. d. S., Ribeiro-Filho, H. V. ]]></dc:creator>
<dc:date>2026-04-10</dc:date>
<dc:identifier>doi:10.64898/2026.04.07.717034</dc:identifier>
<dc:title><![CDATA[MHCXGraph: A Graph-Based approach to detecting T cell receptor cross-reactivity]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-10</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.08.717130v1?rss=1">
<title>
<![CDATA[
Benchmarking ambient RNA removal across droplet and well-plate platforms reveals artificial count generation as a critical failure mode of scAR and CellClear 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.08.717130v1?rss=1
</link>
<description><![CDATA[
Background: Ambient RNA contamination is a pervasive artifact of single-cell and single-nucleus RNA sequencing (sxRNA-seq), yet no consensus exists on which computational removal tool performs best across experimental platforms. Results: We present a systematic benchmark of six tools (CellBender, DecontX, SoupX, scCDC, scAR, and CellClear), evaluated across six human-mouse cell line mixing (hgmm) datasets (1k-20k cells) providing partial ground truth, two droplet-based complex tissue datasets (PBMC scRNA-seq; prefrontal cortex snRNA-seq), and a well-plate-based dataset (BD Rhapsody WBC). Using inter-species counts as partial ground truth, we quantify sensitivity, specificity, precision, and removal consistency per tool. We further apply a count-integrity criterion quantifying gene-cell positions where corrected values exceed raw counts. This reveals that scAR and CellClear do not merely denoise but fundamentally restructure count matrices: CellClear replaces >93% of counts with values derived from matrix factorization, while scAR generates spurious cell types absent from uncorrected data, including three spurious coarse cell types in the BD Rhapsody dataset and up to eight novel cell types in the prefrontal cortex. CellBender and SoupX exhibit reliable contamination removal with minimal count distortion. DecontX and scCDC are the only tools operable on non-droplet platforms without raw count matrix access. Runtime benchmarking at atlas scale (up to 172,000 nuclei) further demonstrates that CellClear fails to scale. Conclusions: Count matrix integrity, not removal sensitivity alone, must be a primary criterion when selecting ambient RNA correction tools. We provide platform-specific recommendations and a decision framework to guide tool selection across experimental contexts.
]]></description>
<dc:creator><![CDATA[ Schroeder, L., Gerber, S., Ruffini, N. ]]></dc:creator>
<dc:date>2026-04-10</dc:date>
<dc:identifier>doi:10.64898/2026.04.08.717130</dc:identifier>
<dc:title><![CDATA[Benchmarking ambient RNA removal across droplet and well-plate platforms reveals artificial count generation as a critical failure mode of scAR and CellClear]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-10</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.07.714835v1?rss=1">
<title>
<![CDATA[
SimpleFold-Turbo: Adaptive Inference Caching Yields 14-fold Acceleration of Flow-Matching Protein Structure Prediction 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.07.714835v1?rss=1
</link>
<description><![CDATA[
We apply TeaCache, an adaptive caching technique from video diffusion, to SimpleFold's flow-matching protein structure prediction and achieve 9- to 14-fold inference speedups with negligible quality loss. We determine that flow matching's near-linear generative trajectories make consecutive neural-network evaluations highly redundant. At a low redundancy threshold, SimpleFold-Turbo (SF-T) skips approximately 93% of forward passes while preserving near-baseline template modeling (TM)-scores across 300 structurally diverse CATH domains and all six SimpleFold model sizes (100 million to 3 billion parameters), at compute budgets where log-uniform step-skipping collapses. Speedup scales with model size because caching overhead is constant while per-step cost grows, and a general three-phase skip pattern emerges independent of protein size or fold. SF-T requires no retraining, no weight modification, and no MSA server dependencies. We release SF-T as fully open-source software enabling thousands of structure predictions per hour on commodity hardware.
]]></description>
<dc:creator><![CDATA[ Taghon, G. ]]></dc:creator>
<dc:date>2026-04-10</dc:date>
<dc:identifier>doi:10.64898/2026.04.07.714835</dc:identifier>
<dc:title><![CDATA[SimpleFold-Turbo: Adaptive Inference Caching Yields 14-fold Acceleration of Flow-Matching Protein Structure Prediction]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-10</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.07.716920v1?rss=1">
<title>
<![CDATA[
Structure-Based and Stability-Validated Prioritization of BACE1 Inhibitors Integrating Meta-Ensemble QSAR and Molecular Dynamics 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.07.716920v1?rss=1
</link>
<description><![CDATA[
Alzheimer's disease remains an unmet therapeutic challenge, and no β-secretase (BACE1) inhibitor has achieved clinical approval. A major limitation of prior discovery efforts is reliance on single-parameter optimization, often yielding computational hits with poor translational potential. Here, we present a stability-validated, biology-informed computational framework that integrates meta-ensemble QSAR (five tree-based classifiers with ECFP4 fingerprints), structure-based docking, Protein Language Model (ESM-1b)-guided hybrid residue interaction weighting, and comprehensive ADMET profiling within a normalized composite ranking scheme. Model robustness was confirmed through external validation and Y-randomization (n = 100; empirical p = 0.009). Heuristic weighting was quantitatively stress-tested using global ±10% perturbation analysis (mean Spearman's ρ = 0.998; mean Kendall's τ = 0.970), demonstrating exceptional ranking stability under controlled parameter uncertainty. Screening of 16,196 structurally diverse compounds, including CNS-active molecules, phytochemicals, approved drugs, and investigational agents, identified 153 predicted actives (accuracy 0.852; ROC-AUC 0.920), which were refined to 111 drug-like candidates and seven prioritized leads. Two-hundred-nanosecond molecular dynamics simulations confirmed stable binding within the BACE1 catalytic pocket and sustained interaction networks over time. Mol-2 exhibited the most favorable profile, characterized by low ligand RMSD (1.2-1.6 Å), persistent catalytic dyad interactions (ASP32 98%, ASP228 99%), predicted BBB permeability, acceptable efflux profile, and balanced ADMET characteristics consistent with CNS drug-like space.
Collectively, this integrative, interpretable, and robustness-validated framework provides a systematic strategy for multi-criteria lead prioritization and may serve as a transferable platform for structure-guided discovery of therapeutics targeting complex neurodegenerative pathways.
]]></description>
<dc:creator><![CDATA[ Chowdhury, T. D., Shafoyat, M. U., Hemel, N. H., Nizam, D., Sajib, J. H., Toha, T. I., Nyeem, T. A., Farzana, M., Haque, S. R., Hasan, M., Siddiquee, K. N. e. A., Mannoor, K. ]]></dc:creator>
<dc:date>2026-04-10</dc:date>
<dc:identifier>doi:10.64898/2026.04.07.716920</dc:identifier>
<dc:title><![CDATA[Structure-Based and Stability-Validated Prioritization of BACE1 Inhibitors Integrating Meta-Ensemble QSAR and Molecular Dynamics]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-10</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.08.717258v1?rss=1">
<title>
<![CDATA[
Conditional genome-wide associations reveal novel genes 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.08.717258v1?rss=1
</link>
<description><![CDATA[
We introduce two novel approaches for gene discovery based on conditional genome-wide associations. Experimental validation of gene targets identified by our top-performing approach uncovers three genes with a previously unknown role in controlling flowering time in Arabidopsis, one of the most well-studied traits in the most well-studied plant genome. This work demonstrates the power of knockoff-based frameworks to uniquely identify novel genes underlying complex traits, a core task across applications in agriculture and human health.
]]></description>
<dc:creator><![CDATA[ Bellis, E. S., Robertson, M., Booker, W. W., Rudin, C. D. S., Alvarez, M. F. ]]></dc:creator>
<dc:date>2026-04-09</dc:date>
<dc:identifier>doi:10.64898/2026.04.08.717258</dc:identifier>
<dc:title><![CDATA[Conditional genome-wide associations reveal novel genes]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-09</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.07.716976v1?rss=1">
<title>
<![CDATA[
LOCOM2: Robust Differential Abundance Analysis for Microbiome Data 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.07.716976v1?rss=1
</link>
<description><![CDATA[
Background: Numerous methods have been developed for differential abundance analysis of microbiome data; however, many fail to adequately control error rates, contributing to the reproducibility crisis in microbiome research. Moreover, new challenges have emerged, including large-scale studies, differential library size distributions, unbalanced case-control designs, and the increasing availability of only relative-abundance data rather than read counts. Methods: We propose LOCOM2 to address these challenges. The method refines the weighting scheme in LOCOM to eliminate confounding by library size while accommodating relative abundance data. It incorporates a series of adjustments to ensure stable and reliable estimation, even under extreme conditions such as very rare taxa and highly unbalanced case-control designs. In addition, LOCOM2 replaces the computationally intensive permutation procedure in LOCOM with a Wald-type test, substantially improving computational efficiency. To evaluate performance, we conducted extensive simulation studies using the MIDASim simulator and three data templates representing diverse body sites. We benchmarked LOCOM2 against state-of-the-art methods, including LOCOM, LinDA, ANCOM-BC2, MaAsLin2, and MaAsLin3. This benchmarking effort provides an essential foundation for the next stage of microbiome research. Results: LOCOM2 achieved accurate control of the false discovery rate across all simulation scenarios, whereas none of the other methods consistently did so. LOCOM2 also demonstrated the highest sensitivity for detecting true signals. Applications of these methods to three real microbiome datasets further corroborated these findings.
]]></description>
<dc:creator><![CDATA[ He, M., Satten, G. A., Hu, Y.-J. ]]></dc:creator>
<dc:date>2026-04-09</dc:date>
<dc:identifier>doi:10.64898/2026.04.07.716976</dc:identifier>
<dc:title><![CDATA[LOCOM2: Robust Differential Abundance Analysis for Microbiome Data]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-09</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.07.716912v1?rss=1">
<title>
<![CDATA[
ARACRA: Automated RNA-seq Analysis for Chemical Risk Assessment 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.07.716912v1?rss=1
</link>
<description><![CDATA[
Transcriptomic analysis is considered a powerful approach for biomarker discovery; however, exploring large-scale omics datasets to extract meaningful biological insights remains a challenge for biologists. To address this gap, we present ARACRA, a fully automated RNA-seq analysis pipeline covering the entire transcriptomics workflow, from raw FASTQ files to the transcriptomics Point of Departure (tPoD), with a human-in-the-loop review process. The analysis is performed in two phases. Phase 1 carries out the acquisition of raw reads, pre-alignment quality control, alignment to a reference genome, and quantification of gene expression. Phase 2 performs statistical analysis, including differential gene expression analysis and dose-response modelling. The two phases are separated by an extensive quality control step that allows the user to visually inspect the quality of the processed data and helps in filtering noise and outlier samples. ARACRA facilitates end-to-end analysis of RNA-seq data through an interactive web-based application built on Nextflow and Streamlit, minimizing computational complexity while ensuring correct downstream processing. Availability and implementation: ARACRA is freely available on GitHub under the MIT License and as a Streamlit-based web application. Researchers can use the demo data or upload their own data for analysis.
]]></description>
<dc:creator><![CDATA[ Sharma, S., Kumar, S., Brull, J. B., Deepika, D., Kumar, V. ]]></dc:creator>
<dc:date>2026-04-09</dc:date>
<dc:identifier>doi:10.64898/2026.04.07.716912</dc:identifier>
<dc:title><![CDATA[ARACRA: Automated RNA-seq Analysis for Chemical Risk Assessment]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-09</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.07.716565v1?rss=1">
<title>
<![CDATA[
ViralMap: Predicting Features in Viral Proteins from Primary Sequence 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.07.716565v1?rss=1
</link>
<description><![CDATA[
Modern viral vaccines are designed to elicit an immune response against viral proteins that mediate infection, making those proteins important targets for characterization and engineering. To improve vaccine efficacy, the proteins often require changes to specific residues or domains to improve immunogenicity and induce a protective response. These engineering strategies vary significantly across viruses, and comprehensive and accurate protein sequence annotation is crucial for guiding vaccine design. The growing risk of novel pathogen emergence and initiatives such as the CEPI 100 Days Mission to rapidly counter "Disease X" threats heighten the need for tools that can convert viral protein sequences from newly characterized genomes or emerging variants into the annotation profiles required for antigen engineering. To address this, we developed ViralMap, a multi-label annotation model tailored for eukaryotic viral proteins. By leveraging ESM-2 language model representations, ViralMap simultaneously predicts ten distinct annotation classes spanning domain topology and localization, post-translational modifications, and structural features directly from primary sequences. The model achieves a residue-level precision-recall area under the curve (PR-AUC) of 0.75 or greater for seven of the ten classes and realizes predictive performance competitive with established tools across the eight benchmarked classes. Evaluation on complex glycoproteins, including the SARS-CoV-2 spike and HIV-1 Env, supported cross-strain and novel-family generalization. By providing a unified, sequence-based framework for multi-label annotation, ViralMap offers a practical and scalable bridge from raw viral protein sequences to the annotation profiles required for antigen engineering.
]]></description>
<dc:creator><![CDATA[ Dwivedi, S., Kar, S., Horton, A. P., Gollihar, J. D. ]]></dc:creator>
<dc:date>2026-04-09</dc:date>
<dc:identifier>doi:10.64898/2026.04.07.716565</dc:identifier>
<dc:title><![CDATA[ViralMap: Predicting Features in Viral Proteins from Primary Sequence]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-09</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.07.716958v1?rss=1">
<title>
<![CDATA[
An introgressed galectin-like protein is a candidate driver of the human tropism in the intestinal parasite Cryptosporidium 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.07.716958v1?rss=1
</link>
<description><![CDATA[
Cryptosporidium spp. are protozoan parasites responsible for diarrheal diseases. In humans, cryptosporidiosis is predominantly caused by the human-specific Cryptosporidium hominis and by Cryptosporidium parvum. This second species has been classically reported as zoonotic, with a host preference for ruminants. However, the recently described subspecies C. parvum anthroponosum has been found to be restricted to humans. Here, we generated novel whole genome sequences from West African samples of C. p. anthroponosum, and analyzed them together with all those already available, originating from East Africa, Europe, North America and Asia. Phylogenomics showed that all C. p. anthroponosum isolates are strongly clustered together, forming the sister clade of the zoonotic C. parvum representatives. The phylogenetic variations within C. p. anthroponosum did not present a clear geographic structure, consistent with the pattern in C. hominis, which is primarily transmitted among humans. To elucidate the evolution of host species adaptation in C. p. anthroponosum, we then investigated genetic exchanges with C. hominis, detecting an ancestral introgression present in all C. p. anthroponosum isolates. This introgression involved a single gene, encoding an extracellular galectin-like protein, which we predicted with high confidence to form a protein complex with the human insulin-degrading enzyme, a key metabolic regulator. Considering the role of host insulin metabolism in the proliferation of parasites as well as its known intrinsic differences between humans and ruminants, this molecular interaction could represent a plausible mechanism for an important role of the galectin-like protein in host-parasite interactions and in the host specificity of C. p. anthroponosum.
]]></description>
<dc:creator><![CDATA[ Bellinzona, G., Tichkule, S., Jex, A., van Oosterhout, C., Bandi, C., Sassera, D., Castelli, M., Caccio, S. M. ]]></dc:creator>
<dc:date>2026-04-09</dc:date>
<dc:identifier>doi:10.64898/2026.04.07.716958</dc:identifier>
<dc:title><![CDATA[An introgressed galectin-like protein is a candidate driver of the human tropism in the intestinal parasite Cryptosporidium]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-09</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.07.716967v1?rss=1">
<title>
<![CDATA[
Benchmarking SNP-Calling Accuracy Against Known Citrus Pedigrees Reveals Pangenome Advantages Over Linear References 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.07.716967v1?rss=1
</link>
<description><![CDATA[
Background: Pangenomes are a promising new approach to genomics that can reduce reference bias in genotyping, but the reliability of such a data model remains unclear in tracking variation across species. To test the utility of graph-based pangenomes for interspecific breeding, we developed a Minigraph-Cactus super-pangenome representing four Citrus species derived from the founder lines of a citrus breeding program. To benchmark SNP calling accuracy using graph and linear-based approaches, we performed whole genome short read sequencing for two sets of pedigreed progeny: 30 F1 hybrids and 244 advanced hybrids from an F1 crossed with a parent not included in the pangenome. Results: The linear approach yielded more SNP calls than the graph-based approach; however, both methods exhibited similar Mendelian Inheritance Error Rates (MIER) in a tool-dependent manner. Reconstruction of parental haplotype blocks in the advanced hybrids revealed a striking improvement in performance in the pangenome graph-based calls, suggesting MIER is vulnerable to error when reference bias influences both parental and progeny genotype calls. Masking of regions diverged from the reference path improved MIER accuracy metrics and haplotype block reconstruction in both the linear and graph-based SNP calls. Conclusions: In non-model systems, inheritance patterns observed from pedigreed hybrids provide a framework for benchmarking variant-calling accuracy using pangenomes. SNP miscalls originating from diverged regions can falsely satisfy MIER filters; thus, we recommend benchmarking with haplotype block reconstruction. The inherent structure of the pangenome graph has promising applications for removing regions of unreliable mapping quality, which cannot otherwise be reliably removed using traditional filtering metrics.
]]></description>
<dc:creator><![CDATA[ Kuster, R. D., Sisler, P., Sandhu, K., Yin, L., Niece, S., Krueger, R., Dardick, C., Keremane, M., Ramadugu, C., Staton, M. E. ]]></dc:creator>
<dc:date>2026-04-09</dc:date>
<dc:identifier>doi:10.64898/2026.04.07.716967</dc:identifier>
<dc:title><![CDATA[Benchmarking SNP-Calling Accuracy Against Known Citrus Pedigrees Reveals Pangenome Advantages Over Linear References]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-09</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.07.716683v1?rss=1">
<title>
<![CDATA[
Over-representation of sperm-associated deleterious mutations across wild and ex situ cheetah (Acinonyx jubatus) populations 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.07.716683v1?rss=1
</link>
<description><![CDATA[
As purifying selection becomes less effective and inbreeding increases, small populations frequently develop an increased load of genome-wide deleterious mutations. Reflecting this pattern, deleterious mutations in genes associated with fertility and immunity have previously been identified in the cheetah (Acinonyx jubatus), which has had a low effective population size for at least the last 10,000 years. However, the distribution of deleterious mutations across cheetah populations is currently unknown. Here, we analysed novel whole genome resequencing data from 30 ex situ and 9 wild cheetahs. We investigated variation in genetic diversity, genomic measures of inbreeding, and the distribution of deleterious mutations across cheetah populations. South Sudanese and Tanzanian cheetahs showed higher inbreeding and realized load, while Namibian cheetahs had a higher proportion of population-specific deleterious mutations. Genes containing high- or moderate-impact deleterious mutations were significantly enriched for sperm-related functions, highlighting putative causative loci associated with poor sperm quality in cheetahs. Similar levels of genetic diversity and inbreeding were observed in ex situ cheetahs compared to their wild counterparts, providing empirical evidence of the efficacy of captive breeding programmes in maintaining genetic variation in ex situ populations.
]]></description>
<dc:creator><![CDATA[ Peers, J. A., Sibley, H. R., Armstrong, E. E., Crosier, A. E., Nash, W. J., Koepfli, K.-P., Haerty, W. ]]></dc:creator>
<dc:date>2026-04-09</dc:date>
<dc:identifier>doi:10.64898/2026.04.07.716683</dc:identifier>
<dc:title><![CDATA[Over-representation of sperm-associated deleterious mutations across wild and ex situ cheetah (Acinonyx jubatus) populations]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-09</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.07.716715v1?rss=1">
<title>
<![CDATA[
Systemic mutagen exposures reported by normal kidney cell genomes 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.07.716715v1?rss=1
</link>
<description><![CDATA[
Lifestyle, environmental and other exposures to exogenous mutagens generate somatic mutations in normal human cells in vivo and increase cancer risk. However, the global repertoire of exogenous mutagen exposures is uncertain. The mutational signatures of mutagens in normal tissues offer opportunities to detect such exposures and survey them at population level. Using single-molecule duplex sequencing of normal kidney (n=319) and blood (n=272) samples from 10 countries, we show that normal kidney cell genomes report an extensive repertoire of somatic mutational signatures. Microdissection of kidney structures revealed that proximal tubules exhibit higher mutation rates than other components of the nephron and most normal cell types despite low cell division rates. This is explained by marked enrichment of mutational signatures due to known exogenous carcinogenic mutagens including the plant-derived aristolochic acids, as well as several signatures of unknown causes including an unknown agent prevalent in Japan (SBS12), and signatures of uncertain origins (SBS40b and SBS40c). The results suggest the existence of multiple, common, systemically circulating mutagens affecting human populations and indicate that the genomes of kidney proximal tubule cells report such exposures with high sensitivity.
]]></description>
<dc:creator><![CDATA[ Wang, Y., Knight, W., Ferreiro-Iglesias, A., Abedi-Ardekani, B., Pham, M. H., Moody, S., Hooks, Y., Abascal, F., Nunn, C., Fitzgerald, S., Cattiaux, T., Gaborieau, V., Fukagawa, A., Jinga, V., Rascu, S., Sima, C., Zaridze, D. G., Mukeria, A. F., Holcatova, I., Hornakova, A., Vasudev, N. S., Banks, R. E., Ognjanovic, S., Savic, S., Curado, M. P., Zequi, S. d. C., Reis, R. M., Magnabosco, W. J., Vianna, F., Silva Neto, B., Jarmalaite, S., Zalimas, A., Foretova, L., Navratilova, M., Phouthavongsy, L., Shire, C., Attawettayanon, W., Sangkhathat, S., Ding, C., Lawson, A. R. J., Latimer, C., Humphre ]]></dc:creator>
<dc:date>2026-04-09</dc:date>
<dc:identifier>doi:10.64898/2026.04.07.716715</dc:identifier>
<dc:title><![CDATA[Systemic mutagen exposures reported by normal kidney cell genomes]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-09</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.07.706161v1?rss=1">
<title>
<![CDATA[
A Grid-Search Framework for Dataset-Specific Calibration of Actigraphy Sleep Detection Algorithms 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.07.706161v1?rss=1
</link>
<description><![CDATA[
Actigraphy is widely used for long-term sleep monitoring, but established sleep-wake scoring algorithms often require parameter tuning, which is commonly performed manually and can reduce reproducibility. In this study, a grid-search-based calibration framework for established actigraphy algorithms is presented and evaluated to determine whether it can serve as a practical alternative to manual tuning. The method was evaluated using two datasets: a multi-subject polysomnography-validated actigraphy dataset and a self-collected dual-device dataset. In the polysomnography-validated dataset, grid-search optimization produced performance patterns similar to manual parameter selection, while slightly improving detection of sleep onset and sleep offset and yielding modest gains in wake-sensitive metrics. In the dual-device dataset, consensus and majority voting were useful for reducing the influence of brief wake episodes occurring within the main sleep period, including micro-awakenings that can fragment sleep predictions across individual algorithms. Overall, these findings show that grid search can replace manual parameter tuning with a more explicit and reproducible procedure while providing small improvements in sleep timing estimation and benefiting ensemble-based handling of within-sleep wakefulness.
]]></description>
<dc:creator><![CDATA[ Rahjouei, A. ]]></dc:creator>
<dc:date>2026-04-09</dc:date>
<dc:identifier>doi:10.64898/2026.04.07.706161</dc:identifier>
<dc:title><![CDATA[A Grid-Search Framework for Dataset-Specific Calibration of Actigraphy Sleep Detection Algorithms]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-09</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.07.716863v1?rss=1">
<title>
<![CDATA[
gbdraw: a genome diagram generator for microbes and organelles 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.07.716863v1?rss=1
</link>
<description><![CDATA[
Motivation: Generating graphical diagrams of microbial and organellar genomes is a common and essential task in bioinformatics. Existing tools often present a trade-off: powerful programming libraries require coding skills, while graphical applications require server processing or local installation with complex dependencies. This highlights the need for a tool that offers both programmatic control for batch processing and graphical accessibility for ease of use. Results: To fill this gap, I developed gbdraw, a web application that generates circular and linear genome diagrams from self-contained GenBank or DDBJ files or combinations of GFF3 annotation and FASTA sequence files. Its core functions include visualizing annotated features, plotting GC content/skew tracks, and optionally generating pairwise sequence comparisons for comparative genomics. It is available as both a GUI web application and a command-line utility. Unlike existing web-based tools that require data upload to a remote server, gbdraw operates entirely within the user's web browser. This serverless architecture ensures that sensitive sequence data never leaves the local machine, providing a secure environment for visualizing unpublished genomic data. Availability and Implementation: gbdraw is implemented in Python 3 (version 3.10+) and is freely available under the MIT license. The web app is available at https://gbdraw.app/. Source code and documentation are available at https://github.com/satoshikawato/gbdraw. The local version can be installed from the Bioconda channel using a conda-compatible package manager.
]]></description>
<dc:creator><![CDATA[ Kawato, S. ]]></dc:creator>
<dc:date>2026-04-09</dc:date>
<dc:identifier>doi:10.64898/2026.04.07.716863</dc:identifier>
<dc:title><![CDATA[gbdraw: a genome diagram generator for microbes and organelles]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-09</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.06.716845v1?rss=1">
<title>
<![CDATA[
GMIP-PLSR: A Nextflow Pipeline for GWAS and Multi-Omics Integration in Gene Prioritization Using PLSR 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.06.716845v1?rss=1
</link>
<description><![CDATA[
Genome-wide association studies (GWAS) have significantly advanced our understanding of complex traits and diseases, but their interpretive power remains limited due to challenges in identifying causal genes and pathways. Integrating GWAS with multi-omics data, such as gene expression, protein-protein interactions, and gene-pathway networks, has the potential to enhance biological insights and improve gene prioritization. To realize this potential, we developed the GWAS & Multi-omics Integration Pipeline (GMIP), a flexible and scalable framework that incorporates widely used tools such as PoPS, MAGMA, and benchmarker to enrich GWAS findings. However, PoPS suffers from multicollinearity in its features, which can impact performance. To overcome this, we introduce GMIP-PLSR, an extension of GMIP that uses Partial Least Squares Regression (PLSR) to manage multicollinearity effectively. We applied GMIP-PLSR across multiple GWAS datasets, demonstrating superior performance over PoPS in most cases. In a case study on NAFLD, GMIP-PLSR, using features derived from both disease-specific scRNA-seq and general PoPS features, identified gene sets with higher heritability and stronger enrichment in known NAFLD pathways, confirming its ability to enhance GWAS findings. Built on Nextflow, GMIP is computationally efficient, adaptable to diverse research environments, and provides a robust solution for gene reprioritization in post-GWAS analyses. GMIP-PLSR is available at https://github.com/mohammedmsk/GMIP.
]]></description>
<dc:creator><![CDATA[ Kanchwala, M. S., Xing, C., Xuan, Z. ]]></dc:creator>
<dc:date>2026-04-09</dc:date>
<dc:identifier>doi:10.64898/2026.04.06.716845</dc:identifier>
<dc:title><![CDATA[GMIP-PLSR: A Nextflow Pipeline for GWAS and Multi-Omics Integration in Gene Prioritization Using PLSR]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-09</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.06.716854v1?rss=1">
<title>
<![CDATA[
Spectral Graph Features for Reference-free RNA 3D Quality Assessment 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.06.716854v1?rss=1
</link>
<description><![CDATA[
Motivation: Existing RNA 3D structure quality assessment (QA) methods rely on local geometric descriptors or statistical potentials that evaluate atomic-level contacts but are blind to global topological coherence. This creates a critical failure mode, structures that are "locally correct but globally wrong", where well-formed local helices mask misplaced domains and incorrect overall packing. Results: We introduce SpecRNA-QA, a lightweight method that scores RNA 3D models using multi-scale spectral features derived from the graph Laplacian of inter-nucleotide contact networks. By computing eigenvalue distributions, heat-kernel traces, and spectral entropy across four distance scales with binary and Gaussian kernels, SpecRNA-QA captures global structural coherence inaccessible to conventional descriptors. In leave-one-out cross-validation on CASP16 (42 targets, 7368 models), spectral features achieve median per-target Spearman rho = 0.69 [95% CI: 0.64-0.73], significantly outperforming an internal geometry baseline (rho = 0.47, Delta_rho = +0.22, Wilcoxon p = 1.2 x 10^-10). Among established unsupervised statistical potentials, which require no labeled data (unlike the supervised spectral model), rsRNASP performs better on small-to-medium RNAs (rho = 0.67 vs. 0.57, ≤200 nt). However, rsRNASP times out on most large RNAs (>200 nt), where SpecRNA-QA provides the strongest available quality signal (rho = 0.72 vs. DFIRE 0.52), revealing clear complementarity between global-topological and local-energy scoring. A training-free heuristic using only three spectral statistics enables quality estimation without any labeled data.
]]></description>
<dc:creator><![CDATA[ Zhu, Y., Zhang, H., Calhoun, V. D., Bi, Y. ]]></dc:creator>
<dc:date>2026-04-09</dc:date>
<dc:identifier>doi:10.64898/2026.04.06.716854</dc:identifier>
<dc:title><![CDATA[Spectral Graph Features for Reference-free RNA 3D Quality Assessment]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-09</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.04.06.716861v1?rss=1">
<title>
<![CDATA[
Quantifying Scientific Consensus in Biomedical Hypotheses via LLM-Assisted Literature Screening 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.04.06.716861v1?rss=1
</link>
<description><![CDATA[
Systematic literature reviews are labor-intensive tasks in biomedical research. While Large Language Models (LLMs) using Retrieval-Augmented Generation (RAG) techniques have enhanced information accessibility, the inherent complexity of biological systems, characterized by high context dependency and conflicting data, remains a primary driver of LLM hallucinations. This imposes a structural constraint that limits the precision of evidence synthesis. To address these limitations, we propose an automated framework designed for the exhaustive identification of supporting and contradictory evidence within a target literature set. Rather than relying on a model's pre-trained knowledge, our system requires the LLM to review each paper individually to determine its alignment with a specific research hypothesis. By evaluating semantic context, the framework captures subtle contradictions that are often overgeneralized by conventional methods. The framework's performance was validated using the BioNLI task, where it demonstrated high classification accuracy in distinguishing whether evidence supports or contradicts a given hypothesis. Notably, the implementation of an ensemble approach provided superior stability and slightly higher precision compared to individual models. Furthermore, the framework exhibited robust performance across several well-established biological hypotheses, confirming its practical utility and reliability in real-world research. This approach provides a rigorous basis for biomedical discovery by enabling the precise, systematic analysis of biological literature and the robust collection of evidence.
]]></description>
<dc:creator><![CDATA[ Kim, U., Kwon, O., Lee, D. ]]></dc:creator>
<dc:date>2026-04-09</dc:date>
<dc:identifier>doi:10.64898/2026.04.06.716861</dc:identifier>
<dc:title><![CDATA[Quantifying Scientific Consensus in Biomedical Hypotheses via LLM-Assisted Literature Screening]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-04-09</prism:publicationDate>
<prism:section></prism:section>
</item>
</rdf:RDF>
