<?xml version="1.0" encoding="UTF-8" ?>
<rdf:RDF xmlns:admin="http://webns.net/mvcb/" xmlns="http://purl.org/rss/1.0/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:prism="http://purl.org/rss/1.0/modules/prism/" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:syn="http://purl.org/rss/1.0/modules/syndication/">
<channel rdf:about="https://biorxiv.org">
<admin:errorReportsTo rdf:resource="mailto:biorxiv@cshlpress.edu"/>
<title>bioRxiv Subject Collection: Bioinformatics</title>
<link>https://biorxiv.org</link>
<description>
This feed contains articles for bioRxiv Subject Collection "Bioinformatics"
</description>

<items>
<rdf:Seq>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.05.01.721633v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.05.06.723290v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.05.06.722973v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.05.06.723091v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.05.06.722876v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.05.06.723123v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.05.05.723059v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.05.05.723100v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.05.05.723092v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.05.06.721805v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.05.06.722404v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.05.06.723370v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.05.05.723027v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.05.05.723040v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.05.05.722940v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.05.05.722888v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.05.05.722901v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.05.05.722871v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.05.05.722992v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.05.04.722792v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.05.04.722810v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.05.06.723235v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.05.04.722812v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.05.05.722981v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.05.04.722765v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.05.04.722524v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.05.04.722193v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.05.04.721987v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.05.04.722647v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.05.04.722712v1?rss=1"/>
</rdf:Seq>
</items>
<prism:eIssn/>
<prism:publicationName>bioRxiv</prism:publicationName>
<prism:issn/>

<image rdf:resource=""/>
</channel>
<image rdf:about="">
<title>bioRxiv</title>
<url>https://www.biorxiv.org/sites/default/files/bioRxiv_article.jpg</url>
<link>https://www.biorxiv.org</link>
</image>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.05.01.721633v1?rss=1">
<title>
<![CDATA[
LIVIA: a browser-based tool for assessing and visualizing predicted protein interactions 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.05.01.721633v1?rss=1
</link>
<description><![CDATA[
As protein structure prediction tools become widely adopted across biology, there is a growing need for accessible methods to assess and visualize predicted protein-protein interactions (PPIs). Here we present LIVIA (Local Interaction Visualization and Analysis), a browser-based tool that computes local PPI confidence metrics across multiple prediction platforms, identifies predicted interface residues, embeds an interactive Mol-star 3D viewer, and generates visualization scripts for ChimeraX and PyMOL. The tool automatically detects prediction formats; all parsing and computation occur locally on the user's machine. LIVIA is freely available at https://flyark.github.io/LIVIA.
]]></description>
<dc:creator><![CDATA[ Kim, A.-R., Perrimon, N. ]]></dc:creator>
<dc:date>2026-05-10</dc:date>
<dc:identifier>doi:10.64898/2026.05.01.721633</dc:identifier>
<dc:title><![CDATA[LIVIA: a browser-based tool for assessing and visualizing predicted protein interactions]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-10</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.05.06.723290v1?rss=1">
<title>
<![CDATA[
Deciphering the Molecular Structure of the Type III Secretion System in Chlamydia trachomatis for Structure-Based Therapeutic Targeting 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.05.06.723290v1?rss=1
</link>
<description><![CDATA[
Chlamydia trachomatis is an obligate intracellular Gram-negative pathogen responsible for sexually transmitted infections and trachoma in humans. Although antibiotics are generally effective against acute infections, persistent chlamydial forms often exhibit reduced susceptibility during chronic infection. Chlamydia relies on its type III secretion system (T3SS) to inject effector proteins into host cells, making T3SS proteins attractive targets for antivirulence therapeutics. In this study, we employed an integrated computational pipeline to model and assemble the C. trachomatis T3SS constituent proteins. Template-based modeling using crystallographic structures of homologs from other Gram-negative bacteria revealed a highly conserved structural architecture despite low sequence identity (18-46%). Stereochemical validation confirmed high model quality, with most T3SS proteins exhibiting favorable protein-protein interactions (PPIs). Since the activity of the T3SS complex relies on extensive PPIs, we targeted these PPIs as a promising approach to attenuate bacterial virulence. CdsN, which functions as an ATPase of the T3SS, is a hexamer; we targeted its dimerization interface. Structure-based virtual screening of compounds from the e-Drug3D and IMPPAT libraries against predicted hotspot residues and the identified druggable pocket at the CdsN dimeric interface, followed by ADMET screening, yielded three promising candidates: M Roflumilast (Drug ID: 1537), Elacestrant (Drug ID: 2081), and Tecovirimat (Drug ID: 1889). All three ligands formed thermodynamically stable complexes with the CdsN dimer, with Elacestrant demonstrating the most favourable binding free energy. This was also confirmed by a 100 ns molecular dynamics simulation. This study provides new insights into the molecular architecture of the C. trachomatis T3SS and identifies M Roflumilast, Elacestrant, and Tecovirimat as potential drug candidates against chlamydial infection.
]]></description>
<dc:creator><![CDATA[ Panda, A., Kapoor, J., Rajagopal, R., Kumar, S., Bandyopadhyay, A. ]]></dc:creator>
<dc:date>2026-05-09</dc:date>
<dc:identifier>doi:10.64898/2026.05.06.723290</dc:identifier>
<dc:title><![CDATA[Deciphering the Molecular Structure of the Type III Secretion System in Chlamydia trachomatis for Structure-Based Therapeutic Targeting]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-09</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.05.06.722973v1?rss=1">
<title>
<![CDATA[
Know Your Alphabet: Conformational Noise, Latent-Space Encodings, and the Future of Structural Phylogenetics 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.05.06.722973v1?rss=1
</link>
<description><![CDATA[
Structural alphabets have transformed protein phylogenetics by enabling sequence-style alignment and maximum-likelihood inference to be applied directly to structural data. However, a coordinate-explicit alphabet, in which character states are derived from three-dimensional atomic positions, encodes not only evolutionary signal but also the conformational variability inherent to protein structure. This source of noise has not previously been quantified in a phylogenetic context, and no framework exists for comparing alphabets with respect to their conformational sensitivity. Here, we introduce the Normalised Noise Index (NNI), a Shannon entropy-based metric for quantifying conformational sensitivity in structural alphabet encodings, and apply it alongside ensemble-wide Robinson-Foulds (RF) variance as a framework for characterising the impact of conformational noise on phylogenetic inference. Across 3,749 single-chain NMR ensembles from the Protein Data Bank, we show that 3Di character variability is a pervasive feature of experimentally observed conformational spread, with NNI negatively correlated with within-ensemble structural stability. A 100 ns molecular dynamics simulation of myoglobin confirmed that thermal fluctuations alone are sufficient to generate comparable 3Di character variation and, in 2.9% of cases, to redirect maximum-likelihood tree search away from the expected topology in a 4-taxon globin benchmark with independently established relationships. Exhaustive enumeration of 4,800 conformational replicates across three NMR ensembles revealed that topological variance under 3Di encoding is approximately 1.7-fold greater than under structural distance, based on 11,517,600 pairwise RF comparisons, a source of uncertainty invisible to standard bootstrap analysis.
By contrast, TEA, a sequence-derived structure-aware alphabet inferred from ESM-2 embeddings rather than directly from atomic coordinates, is insulated from conformational sampling by construction and yields zero topological variance across all conformational replicates, serving here as a noise-insulated reference rather than a proposed replacement for 3Di. Together, these results demonstrate that alphabet choice is a methodological variable in structural phylogenetics, and that the NNI metric and RF variance framework introduced here provide a practical basis for principled noise characterisation as new structural alphabets continue to emerge.
]]></description>
<dc:creator><![CDATA[ Schmid, M., Liu, Y., Malik, A. J., Ascher, D. ]]></dc:creator>
<dc:date>2026-05-09</dc:date>
<dc:identifier>doi:10.64898/2026.05.06.722973</dc:identifier>
<dc:title><![CDATA[Know Your Alphabet: Conformational Noise, Latent-Space Encodings, and the Future of Structural Phylogenetics]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-09</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.05.06.723091v1?rss=1">
<title>
<![CDATA[
A structural grammar of truncation across the human homodimer landscape 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.05.06.723091v1?rss=1
</link>
<description><![CDATA[
Alternative splicing and proteolytic truncation generate tens of thousands of protein isoforms in the human proteome, but the structural consequences for quaternary state, the level at which most signaling, enzymatic and regulatory function operates, have largely been examined one molecule at a time. Leveraging the recent expansion of the AlphaFold Database to predicted human homodimers, we systematically compared 5,168 canonical-versus-truncated homodimer pairs across the human proteome. In high-confidence canonical homodimers, truncation is associated with predicted structural conservation in 56.4% of pairs (mean 85 residues lost), complete interface ablation in 26.1% (mean 178 residues lost), and partial destabilization in 17.5% (mean 134 residues lost); a distinct fourth class (4.0% of the dataset, n = 208) shows truncation-associated emergence of a predicted high-confidence interface from a sub-threshold canonical baseline. Two reproducible rules govern these transitions: a topological asymmetry in which N-terminal losses are preferentially enriched ~1.6-fold in interface preservation while C-terminal losses are rare overall (~6% of pairs) and modestly under-represented in the conservation class, and a biophysical rule in which emergence-class proteins show substantially elevated intrinsic disorder content relative to ablation-class proteins, as measured by both AlphaFold pLDDT-defined disorder of the canonical structure (Cohen's d ≈ 1.39) and AIUPred peak binding propensity of the truncated isoform (Cohen's d ≈ 0.65). Formal pathway enrichment recovered only a small nucleotide-metabolism signal, indicating that these rules operate across diverse gene-functional categories. Truncation-associated remodeling of homodimer architecture thus constitutes a structural grammar of the human proteome rather than a specialty of any single regulatory family.
]]></description>
<dc:creator><![CDATA[ Karagöl, T., Karagöl, A. ]]></dc:creator>
<dc:date>2026-05-09</dc:date>
<dc:identifier>doi:10.64898/2026.05.06.723091</dc:identifier>
<dc:title><![CDATA[A structural grammar of truncation across the human homodimer landscape]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-09</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.05.06.722876v1?rss=1">
<title>
<![CDATA[
Building an open ecosystem for molecular neuroimaging: standards and tools from the OpenNeuroPET initiative 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.05.06.722876v1?rss=1
</link>
<description><![CDATA[
Molecular neuroimaging with positron emission tomography (PET) and single-photon emission computed tomography (SPECT) enables quantification of specific molecular targets in the living brain. Despite its scientific impact, molecular neuroimaging research has historically faced challenges due to high costs, small sample sizes, laboratory-specific analysis pipelines, and limited large-scale data sharing. These factors have hindered reproducibility and the broader reuse of valuable PET datasets. The OpenNeuroPET initiative was established to address these barriers by developing standards, infrastructure, and open-source tools for organizing, sharing, and analyzing molecular neuroimaging data. Through collaborations across Europe and North America, OpenNeuroPET has supported the PET extension of the Brain Imaging Data Structure (PET-BIDS), providing a standardized framework for PET datasets and metadata. Building on PET-BIDS, tools such as PET2BIDS, ezBIDS, and BIDSCoin facilitate data conversion and curation. In parallel, OpenNeuro now hosts PET-BIDS datasets for open sharing, while complementary platforms such as PublicnEUro enable GDPR-compliant controlled access. Emerging open-source workflows and BIDS applications further support automated, reproducible PET preprocessing and quantitative analysis, promoting harmonized processing across centers. Together, these developments mark an important step toward an open molecular neuroimaging ecosystem in which datasets, software, and workflows can be transparently shared, reused, and scaled for collaborative research.
]]></description>
<dc:creator><![CDATA[ Ganz, M., Norgaard, M., Pernet, C., Matheson, G. J., Galassi, A., Ceballos, E. G., Wighton, P., Bilgel, M., Eierud, C., Gonzalez-Escamilla, G., Buckholtz, J., Blair, R., Markiewicz, C. J., Hardcastle, N., Greve, D. N., Thomas, A. G., Poldrack, R. A., Calhoun, V. D., Innis, R. B., Knudsen, G. M. ]]></dc:creator>
<dc:date>2026-05-09</dc:date>
<dc:identifier>doi:10.64898/2026.05.06.722876</dc:identifier>
<dc:title><![CDATA[Building an open ecosystem for molecular neuroimaging: standards and tools from the OpenNeuroPET initiative]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-09</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.05.06.723123v1?rss=1">
<title>
<![CDATA[
A Fractal-Dimension Framework for Quantifying Self-Similarity in Chromatin Folding 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.05.06.723123v1?rss=1
</link>
<description><![CDATA[
The three-dimensional folding of DNA is essential for genome function, but its organization remains difficult to summarize quantitatively across genomic scales. Here, we study DNA folding from Hi-C contact data using a network-based notion of fractal dimension. In this representation, genomic loci are treated as nodes, and observed Hi-C contacts define weighted edges, so that frequently interacting loci are closer in the resulting network. We then estimate fractal dimension using two complementary graph-based methods: the correlation dimension and the sandbox dimension. Validation on synthetic networks shows that the proposed estimators detect clear scaling behavior in hierarchical fractal-like networks, while distinguishing them from networks with local clustering but no stable multiscale self-similarity. Applied to intrachromosomal Hi-C data from the IMR90 human cell line, the method reveals approximate linear scaling regimes on log-log plots, suggesting fractal-like organization in chromatin contact networks. At the chromosome level, estimated fractal dimension tends to increase with chromosome size: larger chromosomes often have dimensions closer to 3, consistent with more compact and space-filling organization, whereas shorter chromosomes tend to have lower dimensions, closer to 1, consistent with simpler and more open folding patterns. A sliding-window analysis at 5 kb resolution further shows that fractal organization varies substantially along chromosomes rather than remaining uniform across genomic position. These results suggest that graph-based fractal dimension provides an interpretable summary of DNA folding complexity at both global and local scales. More broadly, the proposed framework offers a quantitative way to study multiscale genome organization from Hi-C data using tools from network geometry.
]]></description>
<dc:creator><![CDATA[ El-Yaagoubi, A., Balubaid, A. O., Chung, M. K., Tegner, J., Ombao, H. ]]></dc:creator>
<dc:date>2026-05-09</dc:date>
<dc:identifier>doi:10.64898/2026.05.06.723123</dc:identifier>
<dc:title><![CDATA[A Fractal-Dimension Framework for Quantifying Self-Similarity in Chromatin Folding]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-09</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.05.05.723059v1?rss=1">
<title>
<![CDATA[
Machine learning cross-platform proteomic imputation enables protein quality scoring and replication of epidemiological associations 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.05.05.723059v1?rss=1
</link>
<description><![CDATA[
High-throughput affinity-based proteomics has advanced biomedical research, yet fundamental, persistent discordance between mainstream platforms (SomaScan and Olink) routinely undermines the replication of findings. This platform-driven non-replication complicates downstream biological validation and biomarker prioritization. Here, we develop a machine learning-based framework for cross-platform protein value imputation to resolve this translational bottleneck. Using paired proteomic data measured by both SomaScan and Olink from 5,325 participants of the Multi-Ethnic Study of Atherosclerosis, we developed models to impute cross-platform measurements and applied them to two independent and demographically distinct cohorts (Cardiovascular Health Study [N=3,171] and UK Biobank [UKB; N=41,405]) for external validation. Our bi-directional model 1) established an imputation performance-based protein fidelity index, validated against gold-standard measurements from the Atherosclerosis Risk in Communities study (N=101) and the Nurses' Health Study (N=54), 2) enabled imputation of platform-exclusive protein measurements, and 3) facilitated calibration of overlapping proteins. We demonstrate the utility of this framework through three applications: 1) fidelity-informed analyses that enhanced the replication of biomarker discovery, 2) recovery of SomaScan signals that were previously inaccessible in UKB's original Olink measurements, and 3) improved replication performance for overlapping proteins. Our study offers a translational roadmap that allows researchers to achieve reliable epidemiological replication, target specific assays for future optimization, and prioritize biological signal over platform noise.
]]></description>
<dc:creator><![CDATA[ Li, L., Alaa, A., Tan, Y., Demirel, I., Friedman, S., Zha, Q., Trac, R. P., Taylor, K. D., Yu, B., Ballantyne, C. M., Deo, R., Dubin, R., Tsai, M. Y., Peloso, G. M., Brody, J., Austin, T., Psaty, B. M., Nicholas, J., Raffield, L. M., Tahir, U., Coresh, J., Hornsby, W., Chan, A., Rich, S. S., Rotter, J. I., Ganz, P., Gerszten, R., Philippakis, A., Natarajan, P., Yu, Z. ]]></dc:creator>
<dc:date>2026-05-09</dc:date>
<dc:identifier>doi:10.64898/2026.05.05.723059</dc:identifier>
<dc:title><![CDATA[Machine learning cross-platform proteomic imputation enables protein quality scoring and replication of epidemiological associations]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-09</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.05.05.723100v1?rss=1">
<title>
<![CDATA[
Cross Dataset Transcriptomic Analysis Identifies Oxidative Stress Inflammation Gene Networks Modulated by Nutrigenomic Interventions in Parkinson Disease 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.05.05.723100v1?rss=1
</link>
<description><![CDATA[
Inflammation and oxidative stress (OS) are key contributors to Parkinson's disease (PD). We performed a cross-dataset integrative transcriptomic analysis to identify OS- and inflammation-related hub genes persistently dysregulated in PD and to evaluate their response to nutrigenomic interventions using publicly available datasets. Four GEO datasets (GSE7621, GSE20141, GSE20146, GSE49036) were analysed to identify differentially expressed genes (DEGs), which were intersected with GeneCards OS-inflammation gene sets. Functional enrichment analyses, including gene ontology (GO), pathway over-representation analysis (ORA), and protein-protein interaction (PPI) analysis, were used to identify key pathways and hub genes. Gene-food bioactive compound (FBC) associations were explored by integrating PD signatures with nutrigenomic profiles from NutriGenomeDB. We identified 183 DEGs in PD, enriched in synaptic, dopaminergic, OS, and inflammatory pathways. Intersection analysis yielded 26 OS-inflammation-related genes and 10 central regulators, including TH, DDC, SNCA, LRRK2, HSPB1, and HSPA1B. The gene-FBC analysis revealed opposing transcriptional patterns, with several FBCs suppressing stress-related genes and upregulating dopaminergic markers such as TH, GCH1, and DDC. Overall, this integrative analysis highlights OS-inflammation gene networks in PD and identifies candidate diet-gene interactions that warrant further experimental validation.
]]></description>
<dc:creator><![CDATA[ Rafiee, M., Abaj, F., Mahdevar, M., Rashidian, A., Ghaedi, K., Ghiasvand, R. ]]></dc:creator>
<dc:date>2026-05-09</dc:date>
<dc:identifier>doi:10.64898/2026.05.05.723100</dc:identifier>
<dc:title><![CDATA[Cross Dataset Transcriptomic Analysis Identifies Oxidative Stress Inflammation Gene Networks Modulated by Nutrigenomic Interventions in Parkinson Disease]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-09</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.05.05.723092v1?rss=1">
<title>
<![CDATA[
PromptBio-Bench: Benchmarking LLM-based Bioinformatics Agents for End-to-End Data Analysis 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.05.05.723092v1?rss=1
</link>
<description><![CDATA[
Large language model (LLM)-based agents hold transformative potential for automating bioinformatics workflows; however, systematic evaluations of their capabilities remain limited, hindering a clear assessment of their readiness for real-world application. We introduce PromptBio-Bench, a comprehensive evaluation suite of 194 expert-curated tasks spanning bioinformatics and data science at varied difficulty levels, and an evaluation framework for structured file comparison and scoring against expert reference answers. Benchmarking three state-of-the-art agents revealed that Biomni and ToolsGenie achieved comparable performance, and accuracy declined markedly at higher difficulty levels across all agents. As foundation models and agent frameworks continue to evolve, PromptBio-Bench provides a valuable benchmark infrastructure for the community to systematically track the progress of agentic bioinformatics.
]]></description>
<dc:creator><![CDATA[ Guo, W., Zhang, M., Han, B., Ma, Y., Leng, Y., Hebbar, S., Zhou, X., Gu, W., Yang, X., Dhar, S. ]]></dc:creator>
<dc:date>2026-05-08</dc:date>
<dc:identifier>doi:10.64898/2026.05.05.723092</dc:identifier>
<dc:title><![CDATA[PromptBio-Bench: Benchmarking LLM-based Bioinformatics Agents for End-to-End Data Analysis]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-08</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.05.06.721805v1?rss=1">
<title>
<![CDATA[
Structural bias in machine learning-guided peptide design 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.05.06.721805v1?rss=1
</link>
<description><![CDATA[
Machine learning continues to accelerate peptide and protein design through the rapid prediction and generation of sequences with desired characteristics. Many applications focus on predicting properties, functions, and structures, as well as generating point mutations and de novo designs. Nevertheless, many models prove less generalizable than initially claimed. Most predictors and generators are trained on sequential datasets, where imbalances can be addressed during preprocessing. In contrast, structural bias, a subtype of algorithmic bias arising from uneven representation of structural classes in training datasets, and the limitations of early protein structure predictors have frequently remained undetected and uncorrected. The recent surge in powerful protein structure prediction tools, such as the AlphaFold and RosettaFold series and their variants, now presents opportunities to mitigate this issue. We hypothesize that such structural sampling biases influence the downstream performance of ML models. Using antimicrobial peptides as a case study, we audited the structural biases in 16 state-of-the-art predictors for antimicrobial activity and tested whether structural information constrains their predictions. Our analysis revealed that models explicitly trained on sequential data still produce predictions biased by uneven fold representations and data leakage. These findings highlight the importance of integrating balanced structural data or implementing bias-mitigating strategies to develop agnostic models that maximize bioactive protein discovery and multi-objective optimization.
]]></description>
<dc:creator><![CDATA[ Aldas-Bulos, V. D., Plisson, F. ]]></dc:creator>
<dc:date>2026-05-08</dc:date>
<dc:identifier>doi:10.64898/2026.05.06.721805</dc:identifier>
<dc:title><![CDATA[Structural bias in machine learning-guided peptide design]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-08</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.05.06.722404v1?rss=1">
<title>
<![CDATA[
Open-Rosalind: Tool-First Biomedical LLM Agents with Process-Aware Benchmarking 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.05.06.722404v1?rss=1
</link>
<description><![CDATA[
Large language models are increasingly used as scientific agents, yet the flexibility that benefits general-purpose agents can conflict with the accountability required in biomedical research. We study whether biomedical agents can be organized around auditable constraints rather than unconstrained autonomy. We present Open-Rosalind, a tool-first bio-agent system designed around four operational principles: evidence-grounded outputs, trace completeness, workflow-constrained execution, and explicit tool mediation for factual claims. To evaluate these principles, we introduce Open-Rosalind BioBench, a process-aware benchmark that measures not only task accuracy but also tool correctness, citation presence, trace completeness, and failure rate. On a strict in-house benchmark, the reference pipeline achieves 81.4% accuracy with complete execution traces. In multi-model ablations and paired replications, removing tools reduces accuracy by 19.3 to 26.4 percentage points, indicating that tool-first execution is the strongest and most stable contributor to performance. Constrained workflows also reduce lower-tail failures for models that are weak at free-form tool use. However, an author-independent 30-task hold-out initially revealed severe external-validity collapse on the deployment model. After diagnosing five routing and normalization failures and applying targeted fixes, hold-out accuracy improved from 17.8% to 53.3%, and the most concerning negative comparison against a no-tool baseline disappeared. Taken together, these results frame Open-Rosalind as an empirical study of auditable biomedical agents, rather than as a claim that protocol constraints alone guarantee superior performance.
]]></description>
<dc:creator><![CDATA[ Wang, L. ]]></dc:creator>
<dc:date>2026-05-08</dc:date>
<dc:identifier>doi:10.64898/2026.05.06.722404</dc:identifier>
<dc:title><![CDATA[Open-Rosalind: Tool-First Biomedical LLM Agents with Process-Aware Benchmarking]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-08</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.05.06.723370v1?rss=1">
<title>
<![CDATA[
vartracker: an end-to-end tool for pathogen longitudinal variant analysis and visualisation 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.05.06.723370v1?rss=1
</link>
<description><![CDATA[
Longitudinal sequencing can reveal fine-grained pathogen evolution during acute and chronic infections and inform public health responses. However, integrating ordered pathogen genomic data into a coherent evolutionary and clinical framework can be tedious and error-prone. We present vartracker, an open-source tool for longitudinal pathogen variant analysis and visualisation. Given an ordered sample manifest, vartracker supports three entry points: raw sequence reads, reference-aligned BAM files, or user-supplied VCF and coverage inputs. Raw-read and BAM inputs are processed through an integrated Snakemake workflow, whereas VCF mode starts from precomputed files. Variants are normalised and annotated relative to a reference genome, tracked across timepoints, and classified as original or newly emerging and as transient or persistent. Inferred amino acid changes are reported, and for SARS-CoV-2 analyses, relevant published literature for key mutations can be automatically linked through a functional database. vartracker outputs a schema-documented results table, provenance metadata for reproducibility, publication-quality static figures, and an interactive heatmap for data exploration. Although packaged with SARS-CoV-2 reference assets and initially developed for SARS-CoV-2 datasets, vartracker is pathogen-agnostic when appropriate reference data are supplied. We demonstrate its utility using SARS-CoV-2 and respiratory syncytial virus A (RSV-A) datasets. vartracker is freely available through GitHub, PyPI and Bioconda.
]]></description>
<dc:creator><![CDATA[ Foster, C. S. P., Rawlinson, W. D. ]]></dc:creator>
<dc:date>2026-05-08</dc:date>
<dc:identifier>doi:10.64898/2026.05.06.723370</dc:identifier>
<dc:title><![CDATA[vartracker: an end-to-end tool for pathogen longitudinal variant analysis and visualisation]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-08</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.05.05.723027v1?rss=1">
<title>
<![CDATA[
BART-spatial unravels biologically significant transcriptional regulators from spatial omics data 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.05.05.723027v1?rss=1
</link>
<description><![CDATA[
Transcriptional regulators (TRs) govern cell fate decisions by activating or repressing lineage-specific genes and integrating environmental signals with intrinsic networks. Identifying functional TRs is essential for understanding development, tissue organization, and disease. Emerging spatial transcriptomics and epigenomics technologies now provide near-single-cell resolution mapping of genomic features while preserving information about each cell's physical location and microenvironment, which influence TR activity. Despite these advances, identifying active TRs in spatial data remains challenging due to low TR expression and the fact that TR activity often does not correlate directly with mRNA levels. Moreover, existing tools, mainly designed for non-spatial single-cell data, overlook spatial heterogeneity. To bridge this gap, we developed BART-spatial (Binding Analysis for Regulation of Prediction for spatial omics), an innovative computational method to infer functional TRs from spatial omics data. BART-spatial integrates spatial variability and pseudo-temporal information with publicly available TR binding profiles. Applied to multiple spatial datasets from diverse platforms, including 10X Visium, Visium HD, Atera, and spatial RNA-ATAC-seq, BART-spatial consistently outperforms existing methods, identifying stage-specific TRs and revealing regulators undetectable by expression alone. Its compatibility with spatial epigenomics data further strengthens its utility and enables cross-validation. Overall, BART-spatial provides a powerful and robust tool for decoding spatially resolved gene regulatory programs.
]]></description>
<dc:creator><![CDATA[ Wang, J., Zhang, H., Wang, Z., Zang, C. ]]></dc:creator>
<dc:date>2026-05-08</dc:date>
<dc:identifier>doi:10.64898/2026.05.05.723027</dc:identifier>
<dc:title><![CDATA[BART-spatial unravels biologically significant transcriptional regulators from spatial omics data]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-08</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.05.05.723040v1?rss=1">
<title>
<![CDATA[
RAPID: an interactive R/Shiny platform for end-to-end 16S rRNA and ITS amplicon sequence analysis using DADA2 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.05.05.723040v1?rss=1
</link>
<description><![CDATA[
Motivation: Amplicon sequencing of 16S rRNA and internal transcribed spacer (ITS) gene regions is the most widely used approach for characterizing bacterial and fungal communities, respectively. The DADA2 pipeline has become a standard for inferring amplicon sequence variants (ASVs), offering single-nucleotide resolution over traditional OTU clustering. However, executing the full DADA2 workflow requires proficiency in R programming and manual coordination of multiple sequential steps, presenting a substantial barrier for researchers in clinical, environmental, and agricultural sciences who lack computational training. Results: We present RAPID (R-based Amplicon Pipeline for Interactive DADA2), a pair of R/Shiny applications providing complete graphical user interfaces for 16S rRNA and ITS amplicon sequence analysis. The 16S application implements a 10-step guided workflow from raw paired-end FASTQ files through quality filtering, error learning, dereplication, paired-read merging, chimera removal, taxonomy assignment (SILVA), phyloseq construction with data transformation (rarefaction, relative abundance, or CLR), interactive visualization (rarefaction curves, alpha diversity, NMDS, PCoA, taxonomic abundance), PERMANOVA, and ANCOM-BC2 differential abundance analysis. The ITS application extends this to an 11-step workflow, adding an automated primer removal step using cutadapt with support for multiple primers and length-variable amplicons, and uses the UNITE database for fungal taxonomy. Both applications feature asynchronous background processing, session persistence, real-time progress monitoring, publication-ready figure export, and comprehensive result downloads. Availability: RAPID is freely available at https://github.com/beantkapoor786/RAPID. Both applications can be installed locally on any system with R (version 4.0 or higher) and run as local web applications accessible through a standard browser.
Keywords: 16S rRNA, ITS, amplicon sequencing, DADA2, microbiome, mycobiome, graphical user interface, Shiny, phyloseq, ASV, PERMANOVA, ANCOM-BC2
]]></description>
<dc:creator><![CDATA[ Kapoor, B., Cregger, M. A., Ranjan, P. ]]></dc:creator>
<dc:date>2026-05-08</dc:date>
<dc:identifier>doi:10.64898/2026.05.05.723040</dc:identifier>
<dc:title><![CDATA[RAPID: an interactive R/Shiny platform for end-to-end 16S rRNA and ITS amplicon sequence analysis using DADA2]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-08</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.05.05.722940v1?rss=1">
<title>
<![CDATA[
TopoFuseNet: Hierarchical Graph Representation Learning with Multi-Scale Topological Features for Accurate Drug Synergy Prediction 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.05.05.722940v1?rss=1
</link>
<description><![CDATA[
Accurate prediction of drug synergy is paramount for developing effective combination therapies and advancing personalized medicine. Although methods based on graph neural networks (GNNs) have become a prevalent approach, they often treat molecules as flat graphs of connected atoms, thus overlooking their inherent hierarchical structure (i.e., atoms forming functional groups) and the critical topological information that governs molecular interactions. To address this limitation, we introduce TopoFuseNet, a novel hierarchical graph representation learning framework that integrates multi-scale topological features. The core innovations of TopoFuseNet include: 1) The first-ever application of "Group Centrality" from network science to cheminformatics, enabling the identification and quantification of functional groups crucial to drug activity; 2) A systematic, multi-path strategy to seamlessly integrate node-level (atom) and group-level (functional group) topological features into a Graph Attention Network (GAT) via feature augmentation, attention biasing, and hierarchical pooling; 3) A Differential Transformer module to deeply fuse multi-modal features learned from sequences, fingerprints, and our proposed hierarchical graph representations.
]]></description>
<dc:creator><![CDATA[ Wang, Q., Shi, X. ]]></dc:creator>
<dc:date>2026-05-08</dc:date>
<dc:identifier>doi:10.64898/2026.05.05.722940</dc:identifier>
<dc:title><![CDATA[TopoFuseNet: Hierarchical Graph Representation Learning with Multi-Scale Topological Features for Accurate Drug Synergy Prediction]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-08</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.05.05.722888v1?rss=1">
<title>
<![CDATA[
A Differentiable dFBA Simulator for Scalable Bayesian Inference over Microbial Metabolic Models 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.05.05.722888v1?rss=1
</link>
<description><![CDATA[
Medium optimisation for bioprocess design remains challenging and costly: fermentation recipes typically contain ten or more components, the design space expands combinatorially as ingredients are added, and each batch experiment requires over 24 hours. High-throughput 96-well plate screening can reduce experimental cost, but extracting actionable predictions from growth curves requires a mechanistic model that links medium composition to cellular metabolism. In this paper, we present a differentiable simulator for dynamic flux balance analysis (dFBA) that enables scalable Bayesian inference over microbial metabolic models. A distinguishing feature is that inference is driven entirely by OD600 measurements, a simple optical proxy for biomass, without substrate or product assays; internal fluxes, substrate consumption, and secreted metabolite profiles are recovered as latent variables constrained by the metabolic network stoichiometry. We resolve the core differentiability barrier of classical dFBA by reformulating the per-step linear or quadratic programme (LP/QP) as a smooth continuous ODE (the Relaxed Interior-Point ODE, R-iODE), establishing the mathematical framework for end-to-end gradient propagation through long fermentation trajectories in JAX; full gradient validation is ongoing. The result is a framework for principled inference over thousands of batch fermentations, providing a path toward model-guided medium design, cross-strain parameter transfer, and scale-up prediction from plate data.
]]></description>
<dc:creator><![CDATA[ Diederen, T., Merzbacher, C., Patz, M. ]]></dc:creator>
<dc:date>2026-05-08</dc:date>
<dc:identifier>doi:10.64898/2026.05.05.722888</dc:identifier>
<dc:title><![CDATA[A Differentiable dFBA Simulator for Scalable Bayesian Inference over Microbial Metabolic Models]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-08</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.05.05.722901v1?rss=1">
<title>
<![CDATA[
SaVanache: indexing and visualizing pangenome variation graphs 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.05.05.722901v1?rss=1
</link>
<description><![CDATA[
With the rapid increase in genome sequencing and the growing availability of genomic resources, genomics is shifting toward pangenome representations that capture intra- and inter-specific diversity by integrating multiple genomes into a single entity. These pangenomes are increasingly modeled as graphs, encoding complex genomic variations in structures such as de Bruijn or variation graphs. However, while genome browsers provide standard and effective solutions for visualizing single or limited numbers of genomes, equivalent interactive tools for graph-based pangenomes remain limited, particularly for variation graph models. We developed SaVanache, a multi-resolution visualization interface designed to explore pangenome variation graphs at various depths. SaVanache enables the exploration of both global diversity and structural variations (SVs) across genomes relative to a user-defined linear pivot genome. Unlike synteny viewers, SaVanache emphasizes variations by representing SV types through a dedicated set of glyphs, facilitating intuitive one-to-many comparisons. To support smooth exploration, SaVanache preprocesses a Graphical Fragment Assembly (GFA) pangenome file into optimized index and data structures, enabling fast, real-time queries on large pangenome graphs. By combining advanced visualization techniques with efficient data handling, SaVanache provides a robust tool for scientists to analyze and visualize genetic variation within genomes and pangenomes, facilitating the identification of genetic determinants associated with phenotypes of interest and fully exploiting current genomic resources.
]]></description>
<dc:creator><![CDATA[ Mohamed, M., Durant, E., Rouard, M., Muller, C., Monat, C., Conte, M., Sabot, F. ]]></dc:creator>
<dc:date>2026-05-08</dc:date>
<dc:identifier>doi:10.64898/2026.05.05.722901</dc:identifier>
<dc:title><![CDATA[SaVanache: indexing and visualizing pangenome variation graphs]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-08</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.05.05.722871v1?rss=1">
<title>
<![CDATA[
Efficient Stochastic Trace Generation for Transcription 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.05.05.722871v1?rss=1
</link>
<description><![CDATA[
Bursty transcription in single cells typically produces over-dispersed, skewed, and sometimes heavy-tailed expression distributions that are explained by two-state Markov models of the promoters. While the gold standard for simulation is exact stochastic sampling with Gillespie's algorithm, obtaining thousands of timed traces is computationally costly. Surrogate models based on stochastic differential equations (SDEs) are widely used to speed up this simulation process. An example is the Chemical Langevin Equation based on Gaussian noise, which, however, does not capture heavy-tailed noise. In this work, we present a unified SDE framework that combines deterministic drift, Gaussian fluctuations, and additive sporadic jumps of arbitrary distributions, and provide an open-source Python implementation, bcrnnoise. The framework subsumes standard surrogate models and allows for vectorized generation of batches of transcription traces. We assess the computational speed and accuracy of common surrogate models along with new models, showing that high accuracy can be obtained while reducing computational cost by up to two orders of magnitude.
]]></description>
<dc:creator><![CDATA[ Ferdowsi, A., Fuegger, M., Nowak, T. ]]></dc:creator>
<dc:date>2026-05-08</dc:date>
<dc:identifier>doi:10.64898/2026.05.05.722871</dc:identifier>
<dc:title><![CDATA[Efficient Stochastic Trace Generation for Transcription]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-08</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.05.05.722992v1?rss=1">
<title>
<![CDATA[
LongAllele: a joint inference framework for allele-specific analysis on long-read bulk and single-cell RNA sequencing 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.05.05.722992v1?rss=1
</link>
<description><![CDATA[
Allele-specific analysis from RNA-seq is a powerful approach to characterize cis-regulatory effects. However, existing methods remain limited in both haplotype inference and allelic testing. Their haplotype-inference workflows separate variant calling, haplotype phasing, and read-haplotype assignment into sequential steps, failing to fully exploit within-read SNV linkage information and propagating errors into downstream allelic analysis. At the testing stage, they ignore non-phasable reads lacking heterozygous SNVs, biasing calls and inflating false positives, and remain incomplete across gene-, isoform-, and local-event-level variant effects. Here, we present LongAllele, a statistical framework that employs an expectation-maximization algorithm to jointly infer heterozygous variants, haplotype structure, and read-haplotype assignments from long-read bulk and single-cell RNA sequencing. LongAllele further introduces phasability-aware testing that explicitly accounts for non-phasable reads, avoiding inflated false-positive calls when haplotype information is incomplete. It also enables comprehensive allelic testing across gene-level ASE, isoform-level allele-specific transcript usage (ASTU), and local-event-level haplotype-associated exon and junction usage (HAEU and HAJU), providing a multi-scale view of cis-regulation. We applied LongAllele to long-read RNA-seq datasets spanning GTEx (multi-tissue bulk), peripheral blood mononuclear cells (single-cell), and human hippocampus (single-nucleus). LongAllele consistently revealed greater tissue and cell-type variability in expression-level than isoform-level allelic regulation, pinpointed high-impact regulatory variants including rare splice-site mutations missed by standalone variant callers, and showed that purifying selection constrains allelic imbalance at both gene and isoform levels. LongAllele offers a unified framework for haplotype-resolved cis-regulatory analysis across diverse cellular contexts.
]]></description>
<dc:creator><![CDATA[ Xu, Z., Wang, K. ]]></dc:creator>
<dc:date>2026-05-08</dc:date>
<dc:identifier>doi:10.64898/2026.05.05.722992</dc:identifier>
<dc:title><![CDATA[LongAllele: a joint inference framework for allele-specific analysis on long-read bulk and single-cell RNA sequencing]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-08</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.05.04.722792v1?rss=1">
<title>
<![CDATA[
Allosteric Protein Chemical Shift Perturbations are Ubiquitous 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.05.04.722792v1?rss=1
</link>
<description><![CDATA[
While allosteric protein function has been appreciated for decades, the ubiquity of conformational shifts, particularly those distant from the interaction interface, has not been broadly characterized. For example, ligand binding frequently triggers allosteric effects far from the interaction interface, yet the prevalence of these conformational shifts underpinning protein function remains poorly documented. We systematically assessed the generality of allosteric effects as monitored by NMR Chemical Shift Perturbations (CSPs) distant from the interaction interface. In a set of 139 protein-protein complexes, a striking 74% of all significant CSPs are non-local to the binding site. Notably, more than 35% of significant CSPs outside the binding site occur in residues for which the shortest receptor-ligand interatomic distance is more than 10 Å. Every protein analyzed exhibits a significant fraction (> 8%) of CSPs distant from the binding site. This analysis across a large number of protein structures demonstrates and documents that structural plasticity is a ubiquitous and fundamental property of proteins.
]]></description>
<dc:creator><![CDATA[ Benavides, T. L., Ramelot, T. A., Montelione, G. T. ]]></dc:creator>
<dc:date>2026-05-08</dc:date>
<dc:identifier>doi:10.64898/2026.05.04.722792</dc:identifier>
<dc:title><![CDATA[Allosteric Protein Chemical Shift Perturbations are Ubiquitous]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-08</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.05.04.722810v1?rss=1">
<title>
<![CDATA[
Denoised MDS-UPDRS Part-III Scores Yield New Patterns of Progression Heterogeneity in Early Stage Parkinson's Disease 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.05.04.722810v1?rss=1
</link>
<description><![CDATA[
Parkinson's Disease (PD) Motor Scores (MDS-UPDRS Part III) are quite noisy. This paper proposes a new methodology for processing these scores by first denoising the scores to enhance the underlying progression signal, and then conducting a high-dimensional analysis which does not sum the scores into a total movement score. The analysis gives novel insights into PD progression heterogeneity: it reveals that the heterogeneity is continuously variable rather than clustered into "subtypes" and that the variability is along two easily understood axes. This analysis also resolves some of the discrepancies in previously reported progression subtypes. Finally, the analysis reveals that patient-specific progression cannot be predicted from baseline using only MDS-UPDRS Part III scores.
]]></description>
<dc:creator><![CDATA[ Koss, J., Tinaz, S., Tagare, H. ]]></dc:creator>
<dc:date>2026-05-08</dc:date>
<dc:identifier>doi:10.64898/2026.05.04.722810</dc:identifier>
<dc:title><![CDATA[Denoised MDS-UPDRS Part-III Scores Yield New Patterns of Progression Heterogeneity in Early Stage Parkinson's Disease]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-08</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.05.06.723235v1?rss=1">
<title>
<![CDATA[
QuadStack: Specialized convolutional blocks enable in vivo BG4-binding motif prediction and highlight discrepancies with in vitro G-quadruplexes. 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.05.06.723235v1?rss=1
</link>
<description><![CDATA[
G-quadruplex (G4) prediction has been largely guided by in vitro biophysical rules, yet these models show limited agreement with in vivo measurements. Here, we present QuadStack, a deep learning model trained on a multi-study BG4-ChIP-seq compendium. QuadStack introduces two biologically grounded convolutional modules: G4Stack Convolution, which captures G/C stacking patterns, and Reverse Complement Convolution, which enforces strand-invariant representations consistent with ChIP-seq signals. QuadStack achieves strong predictive performance (AUC up to 0.94) and substantially outperforms widely used in vitro-based predictors on genomic test data. Beyond performance, our analyses reveal that BG4-associated sequence grammar is not solely governed by canonical isolated G-rich tracts, but also by patterns where G and C nucleotides are mixed. This suggests that cytosines are not simply disruptive in vivo, and raises the possibility that cytosines may play a context-dependent role or that guanines on the opposite strand contribute to the structure, which could explain the difference between in vivo and in vitro observations. Together, these findings demonstrate a fundamental discrepancy between in vitro folding propensity and in vivo G4 biology, and establish QuadStack as both a predictive model and a framework for interpreting G4 formation in its native genomic context.
]]></description>
<dc:creator><![CDATA[ Ulas, P. N., Doluca, O. ]]></dc:creator>
<dc:date>2026-05-08</dc:date>
<dc:identifier>doi:10.64898/2026.05.06.723235</dc:identifier>
<dc:title><![CDATA[QuadStack: Specialized convolutional blocks enable in vivo BG4-binding motif prediction and highlight discrepancies with in vitro G-quadruplexes.]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-08</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.05.04.722812v1?rss=1">
<title>
<![CDATA[
MUSE enables cross-species multi-omics integration that incorporates transcriptional regulatory modules 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.05.04.722812v1?rss=1
</link>
<description><![CDATA[
Recent advances in evolutionary biology and biomedical research have promoted comparative analyses of cellular states and developmental processes across species, leading to the development of numerous cross-species alignment methods based on scRNA-seq data. However, alignments relying solely on RNA expression are strongly driven by lineage signals and cell type-specific transcriptional programs. As a result, they are limited in their ability to identify conserved regulatory modules across species and to compare regulatory logic beyond developmental lineages, thereby constraining biological interpretability. Meanwhile, recent technological developments have enabled the acquisition of multi-omics data, including chromatin accessibility, making it increasingly feasible to analyze and interpret conservation at the level of regulatory modules across species. Nevertheless, computational methods that can integratively handle such heterogeneous omics data and enable cross-species comparative analysis in a unified framework remain insufficiently established. To address this challenge, we propose Multi-omics Unified embedding across Species (MUSE), a novel framework for integrating multi-omics data across species. MUSE constructs a graph that captures relationships among features both within and across species, and learns a shared latent space based on this graph structure. By leveraging this integrated graph-based representation, MUSE enables cross-species alignment that preserves species-specific characteristics while reflecting similarities not only at the level of gene expression and chromatin states but also at the level of regulatory modules.
]]></description>
<dc:creator><![CDATA[ Fuka, N., Shintaro, Y., Zhenan, L., Chikara, M., Shuto, H., Teppei, S., Hiroshi, Y. ]]></dc:creator>
<dc:date>2026-05-07</dc:date>
<dc:identifier>doi:10.64898/2026.05.04.722812</dc:identifier>
<dc:title><![CDATA[MUSE enables cross-species multi-omics integration that incorporates transcriptional regulatory modules]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-07</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.05.05.722981v1?rss=1">
<title>
<![CDATA[
PHYFUM: Phylogenetic Reconstruction of Normal and Pre-malignant Tissue Evolution Using Fluctuating Methylation 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.05.05.722981v1?rss=1
</link>
<description><![CDATA[
We present PHYFUM, a novel Bayesian phylogenetic method for methylation data that reconstructs the evolutionary history of stem cells and the glandular structures they reside within in normal tissue. Using simulations, we validated this phylogenetic method and confirmed its accuracy. A re-analysis of 22 patients unveiled early gland divergence in the human gut, in contrast to a much later common ancestor in the endometrium, and yielded strong evidence against gland division by segregation of individual stem cells.
]]></description>
<dc:creator><![CDATA[ Bousquets-Munoz, P., Grant, H. E., Shibata, D., Graham, T. A., Maley, C. C., Gabbutt, C., Mallo, D. ]]></dc:creator>
<dc:date>2026-05-07</dc:date>
<dc:identifier>doi:10.64898/2026.05.05.722981</dc:identifier>
<dc:title><![CDATA[PHYFUM: Phylogenetic Reconstruction of Normal and Pre-malignant Tissue Evolution Using Fluctuating Methylation]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-07</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.05.04.722765v1?rss=1">
<title>
<![CDATA[
Beyond Pathway Boundaries: A Degree-Aware Network Clustering Test for Gene Sets 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.05.04.722765v1?rss=1
</link>
<description><![CDATA[
Over-representation analysis (ORA) is the most commonly used interpretation tool for gene lists despite well-documented limitations: pathway boundaries are fixed, genes are assumed independent, and results depend on the background set. Network-based methods address these using interaction-network modularity, but introduce hub bias: highly connected genes appear clustered under naive nulls because curated networks overrepresent well-studied genes. Existing corrections are imperfect: edge permutation destroys the topology the test should condition on, and propagation methods hide the confound in parameter tuning. We introduce MANGO (Moran's Autocorrelation for Network Gene Over-representation), which asks one conditional question: does a gene set's spatial autocorrelation on a fixed biological network exceed what its degree composition alone would predict? MANGO computes Global Moran's I under a null that conditions on both the network and the binned degree distribution of the gene set, then decomposes significant signals at the component and gene level. In benchmarks, uniform nulls produce a false positive rate of 1.0 on hub-enriched gene sets with no real clustering; ten-bin degree-stratified nulls bring that to 0.0 with no power loss (AUC ≥ 0.98; on degree-typical signals, |ΔAUC| ≤ 0.004). Pathway-spiking simulations confirm detection of real biological clustering across diverse pathway sizes and degree profiles. Applied to the FIGI colorectal cancer GWAS (204 SNPs), the set is degree-typical (KS p = 0.83), yet Moran's I is highly significant (p < 0.001). Component-level jackknife localizes the entire signal to a single 24-gene module spanning TGF-β, Wnt/cadherin, and related pathways, with four bottlenecks (SMAD3, MYC, CTNNB1, PTPN1) matching established CRC driver biology.
]]></description>
<dc:creator><![CDATA[ Queme, B., Marjoram, P., Mi, H. ]]></dc:creator>
<dc:date>2026-05-07</dc:date>
<dc:identifier>doi:10.64898/2026.05.04.722765</dc:identifier>
<dc:title><![CDATA[Beyond Pathway Boundaries: A Degree-Aware Network Clustering Test for Gene Sets]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-07</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.05.04.722524v1?rss=1">
<title>
<![CDATA[
Image-Conditioned Diffusion for Privacy-Preserving Synthetic Medical Images 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.05.04.722524v1?rss=1
</link>
<description><![CDATA[
Medical imaging models depend on large, shareable datasets, yet privacy constraints limit data dissemination. Current text-conditioned diffusion models fail to preserve subtle, distributed clinical signals, such as continuous physiological biomarkers, rendering synthetic data insufficient for robust downstream physiological modeling. Here, we evaluate image-to-image (I2I) diffusion as a tunable, privacy-preserving transformation that produces a synthetic counterpart of real images while preserving downstream-relevant information. We fine-tune Stable Diffusion with low-rank adapters on retinal fundus photographs and chest radiographs, assessing fidelity, clinical signal preservation, cross-site transfer, and empirical re-identification risk. I2I consistently outperforms text-to-image generation in image fidelity and in preserving biomarker information. In cross-cohort transfer to an external retinal dataset from the UK Biobank, pretraining on I2I synthetic data performs comparably to real-image pretraining and surpasses it in the smallest fine-tuning sets. Varying I2I strength reveals that the privacy-utility tradeoff is highly modality-dependent: while retinal images achieve practical de-identification, chest X-rays exhibit structural combinatorics that leave them substantially re-identifiable even at high noise strengths, exposing critical boundaries for diffusion-based anonymization. These results position image-conditioned diffusion as a practical approach for generating shareable medical images with tunable de-identification.
]]></description>
<dc:creator><![CDATA[ Yaya-Stupp, D., Lutsker, G., Spiegel-Yerushalmi, O., Segal, E. ]]></dc:creator>
<dc:date>2026-05-07</dc:date>
<dc:identifier>doi:10.64898/2026.05.04.722524</dc:identifier>
<dc:title><![CDATA[Image-Conditioned Diffusion for Privacy-Preserving Synthetic Medical Images]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-07</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.05.04.722193v1?rss=1">
<title>
<![CDATA[
ORBIT: Orthogonal Rotation for Biological Inter-species Transfer 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.05.04.722193v1?rss=1
</link>
<description><![CDATA[
Motivation. Cross-species gene embeddings are central to transferring functional annotations between species. A recent method demonstrated that species-specific STRING (PPI) network embeddings can be aligned across 1322 eukaryotes with autoencoders (FedCoder), but this approach is computationally expensive, depends on careful hyperparameter selection, leaves substantial room for improvement in cross-species retrieval quality, and has not been demonstrated on coexpression networks. Results. We introduce an alignment pipeline for cross-species coexpression network embeddings based on orthogonal Procrustes rotation. Species-specific Node2Vec embeddings of coexpression networks are aligned to a shared space using ortholog anchors from OrthoFinder, solved in closed form via Singular Value Decomposition (SVD). Applied to 153 plant species and 5.7 million genes, Procrustes alignment achieves four-fold higher cross-species Spearman correlation and consistently higher retrieval metrics than the SPACE autoencoder, while leaving within-species coexpression structure invariant (preservation ratio 1.000 against the unaligned baseline). The full alignment completes in under three minutes on a single CPU, and on downstream tasks, Procrustes embeddings improve within-species GO term prediction and outperform SPACE for cross-species GO transfer. Procrustes and sequence embeddings remain complementary for biological-process prediction, consistent with observations from SPACE. Availability. Code for producing the embeddings is made available at https://github.com/pwissenberg/orbit
]]></description>
<dc:creator><![CDATA[ Wissenberg, P., Lee, J. M., Mutwil, M. ]]></dc:creator>
<dc:date>2026-05-07</dc:date>
<dc:identifier>doi:10.64898/2026.05.04.722193</dc:identifier>
<dc:title><![CDATA[ORBIT: Orthogonal Rotation for Biological Inter-species Transfer]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-07</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.05.04.721987v1?rss=1">
<title>
<![CDATA[
Bridging genomes and peptidomes: hybrid sequencing reveals conserved bioactive peptides in crustaceans 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.05.04.721987v1?rss=1
</link>
<description><![CDATA[
Endogenous peptides are critical regulators of signaling and immunity but remain difficult to characterize in organisms with incomplete genomic annotation. We developed a hybrid discovery platform that integrates transformer-based de novo sequencing (Casanovo), neuropeptide-focused database searching (EndoGenius), and empirical false discovery rate estimation via NovoBoard. This pipeline enables confident identification of endogenous peptides while expanding coverage beyond conventional database-only or de novo-only approaches. Applied to neuroendocrine tissues from Callinectes sapidus and Cancer borealis, the workflow revealed numerous high-abundance novel peptides and provided structural and genomic support for their biological relevance. Notably, we report the first histone-2A-derived antimicrobial peptide in C. sapidus and characterize naturally occurring sequence variants. We also identified unexpected peptide homologies between crustaceans and Rattus norvegicus, enabling annotation of conserved housekeeping proteins in sparsely annotated genomes. This hybrid platform establishes a scalable, open-source strategy for advancing neuropeptidomics and endogenous peptide discovery in emerging model organisms.
]]></description>
<dc:creator><![CDATA[ Fields, L., Qin, J., Ibarra, A. E., Selby, K. G., Gao, T., Dang, T. C., Lu, H., Li, L. ]]></dc:creator>
<dc:date>2026-05-07</dc:date>
<dc:identifier>doi:10.64898/2026.05.04.721987</dc:identifier>
<dc:title><![CDATA[Bridging genomes and peptidomes: hybrid sequencing reveals conserved bioactive peptides in crustaceans]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-07</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.05.04.722647v1?rss=1">
<title>
<![CDATA[
A lightweight codon-based DNA Transformer for Regulatory Region Identification in the Genome 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.05.04.722647v1?rss=1
</link>
<description><![CDATA[
We developed a lightweight codon-based DNA Transformer equipped with multi-head self-attention and an adaptive classifier head, which achieves high accuracy in exon-intron classification and moderate accuracy in CDS classification and splice site recognition. We named this model ExIT (Exon-Intron Transformer). The model uses codon tokenization. It was validated on the human genome, with external validation on the chimpanzee genome. Further benchmarking suggests that ExIT outperforms existing models on these tasks.
]]></description>
<dc:creator><![CDATA[ Karthik, A. S. P., Das, A. B. ]]></dc:creator>
<dc:date>2026-05-07</dc:date>
<dc:identifier>doi:10.64898/2026.05.04.722647</dc:identifier>
<dc:title><![CDATA[A lightweight codon-based DNA Transformer for Regulatory Region Identification in the Genome]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-07</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.05.04.722712v1?rss=1">
<title>
<![CDATA[
scLASER: a robust framework for simulating and detecting time-dependent single-cell dynamics in longitudinal studies 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.05.04.722712v1?rss=1
</link>
<description><![CDATA[
Longitudinal single-cell clinical studies enable tracking within-individual cellular dynamics, but methods for modeling temporal phenotypic changes and estimating power remain limited. We present scLASER, a framework for detecting time-dependent cellular neighborhood dynamics and for simulating longitudinal single-cell datasets for power estimation. Across benchmark experiments, scLASER shows consistently higher sensitivity than traditional cluster-based approaches, with particularly pronounced gains in rare cell types and non-linear temporal patterns. Applications to inflammatory bowel disease (95,813 cells, 38 patients) reveal treatment-responsive NOTCH3+ stromal trajectories with high cell type discrimination (AUC > 0.92), while analysis of COVID-19 data (188,181 cells, 84 patients) identifies three distinct axes of T cell activity (cytotoxic effector, NK immunoreceptor signaling, and interferon-stimulated gene programs) over disease progression. scLASER enables robust longitudinal single-cell analysis and optimization of study design.
]]></description>
<dc:creator><![CDATA[ Vanderlinden, L. A., Vargas, J., Inamo, J., Young, J., Wang, C., Zhang, F. ]]></dc:creator>
<dc:date>2026-05-07</dc:date>
<dc:identifier>doi:10.64898/2026.05.04.722712</dc:identifier>
<dc:title><![CDATA[scLASER: a robust framework for simulating and detecting time-dependent single-cell dynamics in longitudinal studies]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-07</prism:publicationDate>
<prism:section></prism:section>
</item>
</rdf:RDF>
