<?xml version="1.0" encoding="UTF-8" ?>
<rdf:RDF xmlns:admin="http://webns.net/mvcb/" xmlns="http://purl.org/rss/1.0/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:prism="http://purl.org/rss/1.0/modules/prism/" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:syn="http://purl.org/rss/1.0/modules/syndication/">
<channel rdf:about="https://biorxiv.org">
<admin:errorReportsTo rdf:resource="mailto:biorxiv@cshlpress.edu"/>
<title>bioRxiv Subject Collection: Bioinformatics</title>
<link>https://biorxiv.org</link>
<description>
This feed contains articles for bioRxiv Subject Collection "Bioinformatics"
</description>

<items>
<rdf:Seq>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.20.733503v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.20.733495v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.21.733596v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.21.733614v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.21.732899v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.21.728965v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.21.733619v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.21.733574v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.24.733669v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.21.733646v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.22.733853v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.21.733121v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.23.733838v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.23.733339v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.21.733655v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.23.734130v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.23.734068v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.22.733900v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.19.733466v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.19.733337v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.19.733445v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.18.733286v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.19.733293v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.19.732250v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.19.732660v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.19.732679v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.18.733287v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.18.732083v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.18.733198v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.19.733349v1?rss=1"/>
</rdf:Seq>
</items>
<prism:eIssn/>
<prism:publicationName>bioRxiv</prism:publicationName>
<prism:issn/>

<image rdf:resource=""/>
</channel>
<image rdf:about="">
<title>bioRxiv</title>
<url>https://www.biorxiv.org/sites/default/files/bioRxiv_article.jpg</url>
<link>https://www.biorxiv.org</link>
</image>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.20.733503v1?rss=1">
<title>
<![CDATA[
GenoSim: A Forward-Time Genotype Simulator for Clinical and Population Genetics with Population Stratification 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.20.733503v1?rss=1
</link>
<description><![CDATA[
Motivation: Next-generation sequencing studies in clinical genetics are often limited by the scarcity of human genotype data, which stems from ethical, regulatory, and economic barriers. The shortfall is sharpest in consanguineous populations, which are common in South Asia and the Middle East, where family-based designs need large pedigrees that are rarely sequenced in full. Existing simulators do not combine pedigree-aware propagation, realistic population stratification, and clinical export formats in one tool. Results: We present GenoSim, an R package for forward-time simulation of diploid SNP genotypes. It runs in two modes: a population mode implementing inbreeding-adjusted Hardy-Weinberg sampling, Wright-Fisher drift, directional selection, recurrent mutation, and Haldane recombination across multiple generations; and a pedigree-constrained mode that ingests real family VCFs and a pedigree, reconstructs phase where the pedigree makes it identifiable, propagates genotypes through the observed family structure, and appends synthetic generations. Version 1.1.1 adds population stratification through the Balding-Nichols model parameterised by gnomAD v3.1 fixation indices (F_ST) for eight ancestry groups (AFR, AMR, EAS, EUR, FIN, MID, SAS, ASJ), empirical allele-frequency loading from external reference panels, and admixed-cohort simulation. Analysis functions cover Hardy-Weinberg testing, linkage disequilibrium, runs of homozygosity, principal component analysis, founder-referenced and between-generation F-statistics, and Nei gene diversity. Availability and implementation: GenoSim is available as an R package at https://github.com/malikbak/GenoSim under the MIT licence. It requires R ≥ 4.0.0 and depends only on base R packages (stats, utils, graphics, grDevices, tools).
]]></description>
<dc:creator><![CDATA[ Bakar, A., Gul, R., Haq, W. u., Afghani, T. ]]></dc:creator>
<dc:date>2026-06-25</dc:date>
<dc:identifier>doi:10.64898/2026.06.20.733503</dc:identifier>
<dc:title><![CDATA[GenoSim: A Forward-Time Genotype Simulator for Clinical and Population Genetics with Population Stratification]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-25</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.20.733495v1?rss=1">
<title>
<![CDATA[
Evidence for post-allopolyploidy genetic exchanges between duplicated regions in three ancient polyploidies 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.20.733495v1?rss=1
</link>
<description><![CDATA[
Many successful lineages, including flowering plants and vertebrates, owe some of their evolutionary prosperity to whole genome duplications (WGD). However, in the immediate aftermath of a WGD, the new polyploid species that is formed often experiences multivalent pairings during meiosis, which can produce inviable gametes. To mitigate the potential harm caused by such pairings, most lineages eventually undergo "diploidization" to restore typical bivalent pairing. A key component of this process is the loss of duplicated genes. While diploidization was once thought to be rapid, recent analyses of polyploidies suggest the process may be more drawn out, with multivalent pairing persisting long after the initial WGD event. Here, we assess evidence for "late" diploidization after three different polyploidies: the teleost genome duplication (TGD), nested polyploidies in Paramecium lineages, and the ancient WGD in baker's yeast. Using our tool POInT (the Polyploidy Orthology Inference Tool), we model the resolution of these events. By analyzing discordance between expected species trees and observed gene trees, we argue that late diploidization was a likely feature in the resolution of all three polyploidies.
]]></description>
<dc:creator><![CDATA[ Dhillon, A. K., Pasagadugula, H., Pitts, I., Rohilla, M., Conant, G. C. ]]></dc:creator>
<dc:date>2026-06-25</dc:date>
<dc:identifier>doi:10.64898/2026.06.20.733495</dc:identifier>
<dc:title><![CDATA[Evidence for post-allopolyploidy genetic exchanges between duplicated regions in three ancient polyploidies]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-25</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.21.733596v1?rss=1">
<title>
<![CDATA[
Ambiguity-Aware Multi-Stage Cell-Type Annotation for Spatial Transcriptomics 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.21.733596v1?rss=1
</link>
<description><![CDATA[
Spatial transcriptomics enables characterization of cellular organization in intact tissue, but robust cell type annotation remains challenging due to heterogeneous expression profiles, mixed populations, and transitional states. Existing methods often enforce a single label per cluster, obscuring biologically meaningful ambiguity and producing overconfident assignments. We propose an ambiguity-aware, multi-stage framework for spatial cell-type annotation. The method combines hybrid spatial feature clustering with constrained language-model inference over curated label sets, and assigns confidence scores based on marker coverage, candidate separation, and entropy. Low-confidence clusters are selectively refined via local reclustering of ambiguous regions, while unresolved clusters are preserved as mixed rather than forcibly labeled. Applied to 10x Genomics Xenium spatial transcriptomics data from cholangiocarcinoma, the proposed refinement reduces cluster-level ambiguity from 16.1% to 2.27% and cell-level ambiguity from 18.4% to 0.86%, while improving confidence calibration. Spatial ablation confirms that topological integration resolves structural ambiguity over feature-only baselines, while constrained inference via a lightweight language model ensures scalable and biologically coherent annotations. These results highlight the importance of explicit ambiguity handling for reliable spatial annotation in heterogeneous tumors.
]]></description>
<dc:creator><![CDATA[ Mahmud, M. I., Kochat, V., Anzum, H., Satpati, S., Dwarampudi, J. M. R., Rai, K., Banerjee, T. ]]></dc:creator>
<dc:date>2026-06-25</dc:date>
<dc:identifier>doi:10.64898/2026.06.21.733596</dc:identifier>
<dc:title><![CDATA[Ambiguity-Aware Multi-Stage Cell-Type Annotation for Spatial Transcriptomics]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-25</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.21.733614v1?rss=1">
<title>
<![CDATA[
A Visually Interpretable Histopathology-Based Immune Model Predicts T-effector Biology and Response to Immune checkpoint inhibition in Clear Cell Renal Cell Carcinoma Clinical Trial and Contemporary Real-World Datasets 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.21.733614v1?rss=1
</link>
<description><![CDATA[
Immune checkpoint inhibitors (ICI) are central to the treatment of metastatic clear cell renal cell carcinoma (ccRCC), yet only a subset of patients derive durable benefit, and clinically deployable predictive biomarkers remain an unmet need. RNA-based T-effector signatures capture cytotoxic immune biology and have been associated with ICI response in clinical trial cohorts; however, their clinical implementation is limited by the marked spatial heterogeneity of ccRCC, as well as cost, long turnaround time, sample quality requirements, and limited accessibility. Here, we developed a visually interpretable deep learning (DL) model that predicts a T-cell-enriched immune score directly from hematoxylin and eosin (H&E)-stained whole-slide images. To overcome the inability of H&E morphology alone to distinguish lymphocyte subsets, we trained the model using multimodal spatial supervision from CD8, PAX8, and ERG IHC, which respectively identified cytotoxic T-cell-rich regions, tumor cells, and endothelial cells, thereby constraining immune predictions to relevant tumor microenvironmental niches. The resulting H&E DL Immune score was validated by pathologist review, comparison with held-out CD8 IHC annotations, and independent datasets. The H&E DL Immune score correlated with T-effector RNA scores across independent institutional and IMmotion150 clinical trial cohorts (Spearman correlations of 0.726; p=5.90x10^-15 and 0.706; p=4.04x10^-19). As a proof of principle, the score was used to characterize associations with key biological features across large cohorts, including sarcomatoid differentiation, BAP1 and PBRM1 mutation status, and additional transcriptomic signatures. In the IMmotion150 clinical trial cohort, a median-dichotomized H&E DL Immune score, similar to the RNA-based T-effector score, was significantly associated with clinical benefit from atezolizumab therapy. In contemporary institutional cohorts of patients treated with frontline ipilimumab plus nivolumab or in the initial 3 lines of nivolumab monotherapy, patients in the top quartile of H&E DL Immune score had significantly longer progression-free survival. Collectively, these findings support a scalable and interpretable H&E-based biomarker that captures T-effector biology and can help identify patients with ccRCC more likely to benefit from ICIs.
]]></description>
<dc:creator><![CDATA[ Perny, A., Jarmale, V., Jasti, J., Zhong, H., Christie, A. L., Miyata, J., Nielsen, A., Kontoyiannis, P., Rakheja, D., Modrusan, Z., Huseni, M., Kadel, W., Brugarolas, J., Kapur, P., Rajaram, S. ]]></dc:creator>
<dc:date>2026-06-25</dc:date>
<dc:identifier>doi:10.64898/2026.06.21.733614</dc:identifier>
<dc:title><![CDATA[A Visually Interpretable Histopathology-Based Immune Model Predicts T-effector Biology and Response to Immune checkpoint inhibition in Clear Cell Renal Cell Carcinoma Clinical Trial and Contemporary Real-World Datasets]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-25</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.21.732899v1?rss=1">
<title>
<![CDATA[
CREPAS: a reproducible nascent chromatin sequencing analysis pipeline for epigenome replication studies 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.21.732899v1?rss=1
</link>
<description><![CDATA[
Chromatin-based genomics data are essential for understanding genome regulation and the mechanisms underlying epigenetic memory. Recent methods such as ChOR-seq and SCAR-seq assess histone modifications and chromatin-associated proteins during and after replication, capturing chromatin states that contribute to memory across cell divisions. Current tools for chromatin data analysis lack scalability and reproducibility across computing infrastructures, offer limited parameters, and are applicable only to a few sequencing techniques, ignoring the information from nascent chromatin assays. To address these challenges, we developed CREPAS, a Nextflow pipeline for analyzing nascent and parental chromatin sequencing data, including ChIP-seq, ChOR-seq, SCAR-seq, OK-seq, ATAC-seq, CUT&RUN, and CUT&Tag, and derivative protocols. CREPAS provides an end-to-end solution, from quality control to advanced analyses, including downsampling, peak calling, annotation, and visualization. By harnessing quantitative assays such as qChIP-seq and qChOR-seq, the normalization methods in CREPAS allow comparison of the restoration kinetics of individual marks or proteins across replication timepoints. Moreover, the pipeline includes calculations such as fork directionality and partitioning using OK-seq and SCAR-seq data, linking replication dynamics to epigenetic inheritance. CREPAS is a valuable resource that enhances the efficiency and reproducibility of nascent chromatin sequencing data analyses, enabling the study of chromatin replication and propagation of epigenetic states.
]]></description>
<dc:creator><![CDATA[ Ruiz-Perez, S., Du, Q., Biran, A., Groth, A., Alcaraz, N. ]]></dc:creator>
<dc:date>2026-06-25</dc:date>
<dc:identifier>doi:10.64898/2026.06.21.732899</dc:identifier>
<dc:title><![CDATA[CREPAS: a reproducible nascent chromatin sequencing analysis pipeline for epigenome replication studies]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-25</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.21.728965v1?rss=1">
<title>
<![CDATA[
NanoCellAnnotator: Formalizing Expert Cell Type Annotation with Large Language Models 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.21.728965v1?rss=1
</link>
<description><![CDATA[
Motivation: Cell-type annotation in spatial transcriptomics is challenging due to sparse gene panels, spatial heterogeneity, and limited availability of tissue-matched reference atlases. Recent approaches have explored large language models (LLMs) for integrating biological knowledge during annotation, but unconstrained inference can produce biologically unsupported predictions and hallucinated cell types. In addition, many LLM-based pipelines rely on large cloud-hosted models that limit reproducibility and deployment in privacy-sensitive environments. Results: We introduce NanoCellAnnotator, a biologically constrained and confidence-aware framework for automated cell-type annotation in spatial transcriptomics. The framework decouples spatial structure discovery, deterministic biological evidence construction, and language model-based semantic inference. Spatial clusters are identified using hybrid spatially regularized non-negative matrix factorization (hSNMF), after which cluster-level marker genes are abstracted into ontology-derived functional programs using Gene Ontology enrichment and GO-slim projection. A lightweight locally executable language model performs constrained label selection within a curated admissible label space derived from PanglaoDB and CellMarker. Annotation confidence is estimated independently using marker support strength and lineage separation, enabling ambiguous or heterogeneous clusters to be explicitly flagged. We evaluate NanoCellAnnotator on Xenium spatial transcriptomics data from intrahepatic cholangiocarcinoma and an independent breast cancer spatial transcriptomics dataset. The framework recovers canonical cell populations with high confidence while identifying heterogeneous or transitional spatial domains as ambiguous. Agreement with manual annotations was evaluated using accuracy and adjusted Rand index. Availability: Code available at https://github.com/ishtyaqmahmud/NanoCellAnnotator.
]]></description>
<dc:creator><![CDATA[ Mahmud, M. I., Kochat, V., Anzum, H., Satpati, S., Dwarampudi, J. M. R., Rai, K., Banerjee, T. ]]></dc:creator>
<dc:date>2026-06-25</dc:date>
<dc:identifier>doi:10.64898/2026.06.21.728965</dc:identifier>
<dc:title><![CDATA[NanoCellAnnotator: Formalizing Expert Cell Type Annotation with Large Language Models]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-25</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.21.733619v1?rss=1">
<title>
<![CDATA[
Tissue-aware elastic net decomposition reveals shared and lineage-specific drug response biomarkers 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.21.733619v1?rss=1
</link>
<description><![CDATA[
Motivation: Computational models that predict cancer drug response from genomic features are central to biomarker discovery, yet a recent audit found data leakage in 72% of 32 published methods, and complex models offer little interpretability while only modestly exceeding simple baselines under honest evaluation. Tissue lineage is a largely untapped source of legitimate inductive bias, but existing tissue-aware methods neither separate pan-cancer from lineage-specific signal nor report leakage-free performance. Results: We introduce the Data Shared Elastic Net (DSEN), a tissue-aware regression that decomposes each drug's model into a shared coefficient block common to all lineages and tissue-specific deviation blocks. Under leakage-free cross-validation across 265 drugs, 1,462 cell lines and 31 tissue lineages, DSEN improved mean squared error over a standard elastic net for 92.5% of drugs (mean 4.95%) while selecting 58% fewer stable shared features. Shared coefficients generalized to held-out tissues (59% tissue-level win rate) and recurrently recovered transferable pathway modules (p53, MAPK), whereas tissue blocks captured lineage markers such as the skin MITF/S100B program. The closest tissue-aware comparator, TG-LASSO, performed worse than the tissue-agnostic baseline (-13.8% mean MSE). Ablation shows tissue-aware modeling helps most when features are scarce, with no single modality dominating.
]]></description>
<dc:creator><![CDATA[ Strauch, J., Azinfar, L., Pua, H. H., Long, J. P., Coombes, K. R., Asiaee, A. ]]></dc:creator>
<dc:date>2026-06-25</dc:date>
<dc:identifier>doi:10.64898/2026.06.21.733619</dc:identifier>
<dc:title><![CDATA[Tissue-aware elastic net decomposition reveals shared and lineage-specific drug response biomarkers]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-25</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.21.733574v1?rss=1">
<title>
<![CDATA[
BGC-QDR: A Quantum-Assisted Pipeline for Biosynthetic Gene Cluster Discovery and Ranking from Environmental DNA 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.21.733574v1?rss=1
</link>
<description><![CDATA[
Biosynthetic gene clusters (BGCs) encode enzymatic pathways for natural products with pharmaceutical potential, yet prioritizing candidates from fragmented environmental DNA (eDNA) assemblies remains computationally challenging. We present BGC-QDR (Biosynthetic Gene Cluster Quantum Discovery and Ranking), an open-source pipeline that integrates input quality control, Prodigal ORF prediction, Pfam HMM domain annotation, rule-based BGC classification, MiBIG 4.0 novelty assessment, and variational quantum classifier (VQC) ranking via PennyLane. BGC-QDR is designed as a quantum-assisted ranking framework for biologically informed BGC prioritization, not as a claim of quantum computational advantage over classical machine learning. We evaluate the pipeline on MiBIG 4.0 (2,636 annotated BGCs) using a 20-dimensional biosynthetic feature vector and stratified 10-fold cross-validation. The integrated VQC (6 qubits x 3 layers, 54 parameters) achieves an accuracy of 0.789 +/- 0.076 and ROC-AUC of 0.835 +/- 0.057. Random Forest achieves the highest ROC-AUC (0.898 +/- 0.032), followed by Logistic Regression (0.874 +/- 0.020) and MLP (0.872 +/- 0.024). Wilcoxon signed-rank tests on per-fold AUC scores show that VQC ROC-AUC is significantly lower than Random Forest (p = 0.0098) and Logistic Regression (p = 0.037) at alpha = 0.05, with no significant difference versus MLP (p = 0.064). Architecture ablation identifies 4 qubits x 3 layers as the best VQC configuration on hold-out validation (AUC = 0.737). Feature importance analysis highlights peptidyl carrier protein domains, cluster length, and module count as dominant predictors. BGC-QDR provides a reproducible, end-to-end workflow for eDNA-derived BGC discovery with integrated novelty scoring and quantum-assisted candidate ranking.
]]></description>
<dc:creator><![CDATA[ Mishra, A., Rai, A. ]]></dc:creator>
<dc:date>2026-06-25</dc:date>
<dc:identifier>doi:10.64898/2026.06.21.733574</dc:identifier>
<dc:title><![CDATA[BGC-QDR: A Quantum-Assisted Pipeline for Biosynthetic Gene Cluster Discovery and Ranking from Environmental DNA]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-25</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.24.733669v1?rss=1">
<title>
<![CDATA[
Multi-Omics Study of Ancestry in Adults with Intracranial Cancers Glioma (MOSAIC) 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.24.733669v1?rss=1
</link>
<description><![CDATA[
Background: Most genomic studies of adult-type diffuse gliomas have focused on predominantly European ancestry populations, limiting the generalizability of molecular classifications and precision medicine approaches. We assembled a multi-institutional glioma cohort of diverse patients to investigate how germline ancestry, molecular subtypes, and mutational processes shape tumor biology and clinical outcomes. Methods: We analyzed 1,102 adults with WHO 2021 classified diffuse gliomas (IDH-mutant, 1p/19q-codeleted oligodendroglioma; IDH-mutant astrocytoma; IDH-wildtype glioma) from seven U.S. institutions. Whole-exome sequencing (WES) of FFPE tumors identified somatic alterations and COSMIC SBS v3.2 mutational signatures. Genetic ancestry was estimated from WES using 1000 Genomes reference populations. Overall survival was assessed using Kaplan-Meier and multivariable models. Results: The cohort included 66.9% European (EUR), 21.1% Admixed American/Hispanic (AMR), 10.3% Admixed African (AFR), and 1.6% Asian (AS) ancestry. Survival followed the expected molecular hierarchy (median overall survival: oligodendroglioma 15.7 years, astrocytoma 10.6 years, IDH-wildtype glioma 1.9 years). Within oligodendroglioma, AMR patients showed improved survival versus EUR (HR 0.67, 95% CI 0.48 to 0.94; p=0.011), with similar trends across subtypes. Somatic profiling confirmed canonical subtype-defining alterations and revealed higher ATRX alterations in AFR and AMR IDH-wildtype tumors compared with EUR. ATRX alterations were associated with improved survival only in AFR (p=0.003). Mutational signature analysis identified subtype-specific signatures, including therapy-associated signatures. Chemotherapy-related signatures were more frequent in EUR and AMR than in AFR. Conclusions: This ancestrally diverse glioma cohort confirms established molecular classifications and identifies ancestry-associated differences in survival, somatic alterations, and mutational processes, indicating the critical need for broad representation to inform precision neuro-oncology.
]]></description>
<dc:creator><![CDATA[ Bondy, M. L., Noor, H., Tsavachidis, S., Fukumura, K., Ostrom, Q. T., Walsh, K. M., Peng, B., Muzny, D. M., Korchina, V., Nabors, B., Norberg, L., Desjardins, A., Ritchie, J., Horbinski, C., Perez, A., Tadimeti, V., Mandel, J., Wrensch, M., Bale, T. A., Orlow, I., Hu, J., Doddapaneni, H., Liu, X., Momin, Z., Motewar, P., Armstrong, G., Woods, M., Bernstein, J. L., Amos, C. I., Huse, J. T. ]]></dc:creator>
<dc:date>2026-06-25</dc:date>
<dc:identifier>doi:10.64898/2026.06.24.733669</dc:identifier>
<dc:title><![CDATA[Multi-Omics Study of Ancestry in Adults with Intracranial Cancers Glioma (MOSAIC)]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-25</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.21.733646v1?rss=1">
<title>
<![CDATA[
CNSigs: An R Package for the Identification of Copy Number Mutational Signatures 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.21.733646v1?rss=1
</link>
<description><![CDATA[
Background: Copy number aberrations (CNAs) are gains and losses of large genomic segments present across most cancer types and are a hallmark of cancer genomic alterations. However, the processes underlying CNAs and characteristic patterns of CNAs are poorly understood. Bioinformatic advances have identified underlying single nucleotide variant (SNV) mutational signatures resulting from distinct mutational processes, yet development of algorithms able to uncover similar signatures for CNAs remains less advanced. Methods: Using segmented data files from DNA sequencing, six copy number features are extracted for signature determination: segment size, breakpoints per 10 megabases, copy number oscillation events, average changepoint size, average copy number, and breakpoints per chromosome arm, along with ploidy. Mixed model approaches and non-negative matrix factorization (NMF) are utilized to derive CNA signatures across cancer types. The full methodology was packaged in a robust R package, termed 'CNSigs', that is publicly available. Results: To verify the reproducibility of the signatures, we derived five signatures from two independent breast cancer datasets (total n>3000), demonstrating high accuracy (average cosine similarity = 0.89). Pan-cancer application of CNSigs in the TCGA dataset resulted in derivation of 13 pan-cancer signatures which were significantly associated with disease-specific survival. Benchmarking CNSigs against two other CNA signature approaches within TCGA demonstrated non-overlapping signatures and favorable compute speed for CNSigs. We evaluated n=24 pairs of tumor and circulating tumor DNA (ctDNA) acquired at the same time and demonstrated that CNSigs are detectable and reproducible via ctDNA, with significant association of CNSig11 with metastatic triple-negative breast cancer progression-free survival for taxane but not platinum or capecitabine chemotherapy. CNSigs association with immunophenotype was evaluated in low-grade glioma (LGG) and CNSig3 was found to be highly prognostic for LGG yet complementary to immune features. Conclusions: The CNSigs R package allows researchers to easily analyze their own samples to derive copy number signatures and evaluate clinical associations. We demonstrate potential application in ctDNA and association with treatment response. The development of this package allows further investigation of underlying processes that may be responsible for these CNA fingerprints.
]]></description>
<dc:creator><![CDATA[ Tallman, D., Striker, S., Byappanahalli, A. M., Stockard, S., Jenison, J., Collier, K. A., Blige, E., Vater, M., Stover, D. G. ]]></dc:creator>
<dc:date>2026-06-25</dc:date>
<dc:identifier>doi:10.64898/2026.06.21.733646</dc:identifier>
<dc:title><![CDATA[CNSigs: An R Package for the Identification of Copy Number Mutational Signatures]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-25</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.22.733853v1?rss=1">
<title>
<![CDATA[
The Human Pancreas Cell Atlas defines a healthy reference framework for disease contextualization and translational benchmarking 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.22.733853v1?rss=1
</link>
<description><![CDATA[
A central challenge in single-cell biology is distinguishing disease-associated remodeling from normal cellular heterogeneity. Addressing this challenge requires healthy reference frameworks that capture cellular diversity across individuals, technologies, and biological contexts. Here we present the Human Pancreas Cell Atlas (HPCA), a reference atlas of the healthy human pancreas integrating 815,126 single-cell and single-nucleus transcriptomes from 109 donors across 12 studies, diverse technologies, and demographics. Using benchmarked integration and community-driven annotations, HPCA defines 94 cell types and transcriptional states spanning endocrine, exocrine, immune, and stromal compartments. The atlas identifies rare endocrine populations, including a putative, spatially supported polyhormonal alpha-beta-delta state, and provides a unified framework for interpreting pancreatic cellular variation across diverse biological and demographic covariates. Projection of disease and model-system datasets onto HPCA contextualized endocrine and epithelial remodeling relative to healthy pancreatic states. Diabetes-associated endocrine cells remained embedded within the healthy endocrine state space while exhibiting disease-specific changes, as supported by spatial and eQTL concordance analyses. Integration with a pancreatic ductal adenocarcinoma atlas resolved injury-associated and malignant epithelial ecosystem regions across donors. Finally, the HPCA enables quantitative benchmarking of murine diabetes models and stem-cell-derived islets against human pancreatic reference states. Together, the HPCA establishes a healthy transcriptional coordinate system for interpreting disease-associated pathophysiology, experimental perturbation, and regenerative fidelity, illustrating how reference atlases can function as analytical frameworks rather than static cell catalogs.
]]></description>
<dc:creator><![CDATA[ Parikh, S., Strobl, D. C., Jimenez, S., Beckmann, J. L., Arnoldt, L., Roellin, E., Vandenbempt, V., Sterr, M., Aije, M., Vu, H. T. H., Melton, R., Liu, J., Feng, F., Cartailler, J., Gaulton, K. J., Parker, S. C. J., Ruland, J., Conrad, C., Brissova, M., Carlotti, F., Lickert, H., Eils, R., Balboa, D., Luecken, M. D., Theis, F. J. ]]></dc:creator>
<dc:date>2026-06-25</dc:date>
<dc:identifier>doi:10.64898/2026.06.22.733853</dc:identifier>
<dc:title><![CDATA[The Human Pancreas Cell Atlas defines a healthy reference framework for disease contextualization and translational benchmarking]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-25</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.21.733121v1?rss=1">
<title>
<![CDATA[
Bamsnap-LRS: an automated batch visualization tool for long-read sequencing alignments 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.21.733121v1?rss=1
</link>
<description><![CDATA[
Summary: Long-read sequencing (LRS) has become essential for genome assembly, structural variant (SV) detection, haplotype phasing and transcript isoform characterization. However, these applications often require manual inspection of read alignments for validation. Existing visualization tools are either interactive genome browsers that are difficult to scale to large datasets or batch-oriented tools that are not optimized for the unique alignment patterns of long-read data. We developed Bamsnap-LRS, an automated command-line tool for high-throughput LRS alignment visualization. It supports long-read-specific features, phased SNP inspection, and publication-ready batch figure generation within a unified framework for genomic, transcriptomic, and haplotype-aware analyses. Availability and Implementation: All code and examples are freely available at https://github.com/comery/Bamsnap-LRS.
]]></description>
<dc:creator><![CDATA[ Chen, W., Yang, C., Qiu, L., Hu, J., Zhou, Y. ]]></dc:creator>
<dc:date>2026-06-25</dc:date>
<dc:identifier>doi:10.64898/2026.06.21.733121</dc:identifier>
<dc:title><![CDATA[Bamsnap-LRS: an automated batch visualization tool for long-read sequencing alignments]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-25</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.23.733838v1?rss=1">
<title>
<![CDATA[
Agnostic material classification using differential de Bruijn graphs of DNA imprints 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.23.733838v1?rss=1
</link>
<description><![CDATA[
The wide variety of physical and chemical properties in materials makes the study of unknown substances challenging. We have previously proposed a theoretical framework for agnostic material characterization based on using nucleic acid imprints of the materials and then analyzing material-specific patterns of derived sequences. Here we demonstrate an experimental and computational pipeline that can agnostically identify and distinguish varied materials based on DNA k-mer imprints and validate the ability of these imprints to distinguish closely related materials. This work lays the foundation for expansion of purely agnostic sensing technologies for the unbiased characterization and categorization of a much wider variety of biotic and abiotic materials.
]]></description>
<dc:creator><![CDATA[ Cox, R. M., Ansari, Z. T., Johnson, C. D., Marcotte, E. M., Ellington, A., Bhadra, S. ]]></dc:creator>
<dc:date>2026-06-25</dc:date>
<dc:identifier>doi:10.64898/2026.06.23.733838</dc:identifier>
<dc:title><![CDATA[Agnostic material classification using differential de Bruijn graphs of DNA imprints]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-25</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.23.733339v1?rss=1">
<title>
<![CDATA[
DextraDemixer enables accurate identification of antigen-specific T cells from pMHC multimer experiments 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.23.733339v1?rss=1
</link>
<description><![CDATA[
Antigen specificity of T cells defines the adaptive immune response, yet the vast majority of known T cell receptors (TCRs) lack annotated antigen targets. Single-cell peptide-MHC (pMHC) multimer assays offer a scalable approach to map TCR-antigen interactions. Still, their utility is limited by pervasive non-specific binding and severe overlap between signal and noise, which confound the accurate identification of antigen-specific cells. To address these limitations, we present DextraDemixer, a Bayesian hierarchical mixture model that disentangles antigen-specific T cells from background noise in pMHC multimer data. The model integrates information from negative controls and clonotype structure while providing calibrated uncertainty estimates for classification. We further introduce a dynamic thresholding scheme that enables credible interval-bounded control of the false discovery rate. Extensive benchmarking on simulated datasets and antigen-specific spike-in experiments demonstrated the model's robustness and improved accuracy over established methods. In a longitudinal SARS-CoV-2 vaccine study, DextraDemixer identified antigen-specific TCRs characterized by high sequence similarity, elevated antigen-specificity prediction scores, and strong clonal purity. Annotations showed high concordance with external validation data and supported the identification of antigen-specific motifs. Overall, DextraDemixer provides a principled probabilistic framework for reliable identification of antigen-specific TCRs from single-cell pMHC-multimer assays.
]]></description>
<dc:creator><![CDATA[ An, Y., Drost, F., Bonafonte-Pardas, I., Grotz, M., Schober, K., Schubert, B. ]]></dc:creator>
<dc:date>2026-06-25</dc:date>
<dc:identifier>doi:10.64898/2026.06.23.733339</dc:identifier>
<dc:title><![CDATA[DextraDemixer enables accurate identification of antigen-specific T cells from pMHC multimer experiments]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-25</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.21.733655v1?rss=1">
<title>
<![CDATA[
ComplexDesign: sequence-hallucination design of protein binders bridging multiple proteins 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.21.733655v1?rss=1
</link>
<description><![CDATA[
Motivation: Designing multichain protein complexes requires coordinating the folding of component proteins with the formation of their interfaces. The existing methods, however, remain limited in their ability to satisfy these requirements simultaneously, especially for trimeric and tetrameric complexes. As an important practical scenario, designing a binder that bridges two target proteins into a ternary complex requires flexibility in the relative arrangement of the two targets, adding an additional challenge to existing design methods. Results: We present ComplexDesign, a hallucination-based approach for multichain protein design. ComplexDesign performs structure-prediction-guided sequence optimization to simultaneously fold each protein chain and form inter-chain interactions that bind them together. To provide the flexibility required to appropriately arrange these target proteins, ComplexDesign introduces a specialized masking mechanism that enables exploration of possible relative arrangements rather than being limited to the predefined ones. Across a comprehensive set of benchmarks with various chain lengths, ComplexDesign outperformed existing methods in the unconditional design of dimers, trimers, and tetramers, achieving a high design success rate exceeding 50%, supporting its capability for multichain complex design. Furthermore, in the case of multi-target binder design, ComplexDesign produced high-confidence, self-consistent ternary complexes for 8 out of 10 target pairs. These results establish ComplexDesign as an effective tool for multichain protein design, with particular utility for designing binders that bridge two target proteins. Availability and implementation: The source code of ComplexDesign will be made publicly available upon publication.
]]></description>
<dc:creator><![CDATA[ Xu, J., Ren, M., Qi, N., Zhang, X., He, Z., Yu, C., Bu, D. ]]></dc:creator>
<dc:date>2026-06-24</dc:date>
<dc:identifier>doi:10.64898/2026.06.21.733655</dc:identifier>
<dc:title><![CDATA[ComplexDesign: sequence-hallucination design of protein binders bridging multiple proteins]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-24</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.23.734130v1?rss=1">
<title>
<![CDATA[
V3Cell: A Vision-Guided Virtual 3D Cell Framework for Phenotypic Modeling and Perturbation Prediction 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.23.734130v1?rss=1
</link>
<description><![CDATA[
Predicting how organoids respond to chemical perturbations is central to disease modeling and drug discovery. Existing virtual cell models operate at the single-cell level, producing static endpoint predictions from destructive assays. This leaves a critical gap at the organoid scale, where biological identity is defined by tissue-level architecture and continuous developmental dynamics rather than single-cell features. Here we introduce V3Cell, a vision-guided framework that constructs in silico surrogates of organoids directly from non-invasive brightfield microscopy. A foreground-aware model constructs static virtual 3D cells across colon, stomach, and lung organoid lineages. These virtual 3D cells closely match real samples across distributional metrics, micro-texture, and lineage-specific morphometrics, with small effect sizes for most descriptors. A temporal module further predicts developmental fate from as few as six early-frame observations and models fate-conditioned spatiotemporal trajectories that closely recapitulate real perturbation responses. V3Cell requires no omics profiling or fluorescent labeling, establishing a non-invasive brightfield-based paradigm for organoid-scale perturbation prediction. Our code and data are publicly available at https://github.com/Laineyoulu/V3Cell.
]]></description>
<dc:creator><![CDATA[ Lu, Y., Xun, D., chenke, X., Xiaobo, Z., Zhigang, Z., Pengyu, C., Xiwen, Y., Zhengzheng, Y., Jiahua, R., Huili, H., Jianying, H., Pengwei, H. ]]></dc:creator>
<dc:date>2026-06-24</dc:date>
<dc:identifier>doi:10.64898/2026.06.23.734130</dc:identifier>
<dc:title><![CDATA[V3Cell: A Vision-Guided Virtual 3D Cell Framework for Phenotypic Modeling and Perturbation Prediction]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-24</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.23.734068v1?rss=1">
<title>
<![CDATA[
fastQpick: scalable bootstrap and subsampling of FASTQ reads 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.23.734068v1?rss=1
</link>
<description><![CDATA[
fastQpick is a command-line tool and Python library for sampling FASTQ reads with replacement. Sampling with replacement turns a single FASTQ file into an arbitrary number of bootstrap replicates, which enables uncertainty quantification and statistical analysis at the level of raw reads. This process answers questions such as how much an abundance estimate would change if the library were resequenced, or whether a low-abundance call is robust to the particular reads that were sequenced. fastQpick works efficiently on large libraries by streaming files in two passes by default: first to count reads and create a hash-based counter, and then to write the sample. It generates a full-size bootstrap replicate of a 500-million-read library in under 30 minutes with 9.4 GB of peak memory, with a low-memory mode that reduces the peak to 1.4 GB. A single-pass mode draws samples in a single read through the file, using O(1) working memory and producing an output size that is exact in expectation but not fixed. In a real yeast RNA-seq experiment, bootstrap replicates generated by fastQpick recover the sampling uncertainty of transcript abundance estimates, matching the analytic multinomial standard errors to within a few percent. fastQpick is open source and freely available under the MIT license on GitHub at https://github.com/pachterlab/fastQpick and on PyPI (pip install fastQpick).
]]></description>
<dc:creator><![CDATA[ Rich, J., Pachter, L. ]]></dc:creator>
<dc:date>2026-06-24</dc:date>
<dc:identifier>doi:10.64898/2026.06.23.734068</dc:identifier>
<dc:title><![CDATA[fastQpick: scalable bootstrap and subsampling of FASTQ reads]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-24</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.22.733900v1?rss=1">
<title>
<![CDATA[
RNabel-A Standalone Software Tool for Annotating Tandem Mass Spectra of Modified Ribonucleic Acids 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.22.733900v1?rss=1
</link>
<description><![CDATA[
Ribonucleic acid (RNA) modifications, with over 170 identified types, play diverse roles in cellular processes. The past decade has witnessed surging demand for accurate identification and localization of RNA modifications in both endogenous and synthetic therapeutic RNAs. With accurate spectral annotation for RNA, tandem mass spectrometry (MS/MS) can meet this demand. Here we present RNabel, a user-friendly software tool for in-depth annotation of MS/MS spectra of RNA oligonucleotides. RNabel considers a full set of backbone-cleavage ions (a, b, c, d, a-B, w, x, y, z) in which the ribonucleotide unit could be A, U, C, G, Y (pseudouridine), or I (Inosine). Additionally, RNabel considers 196 modifications on the base, the phosphoribose linkage, the 5' or the 3' terminus, or detachment of a sub-nucleotide fragment as a neutral or charged group. Users can create new components if needed, including ribonucleotides, modifications, neutral or charged groups that could detach from a ribonucleotide. RNabel efficiently processes large datasets in four acceptable formats including .mgf, .raw, .txt from msConvert, and RNabel batch files. Multiple statistical metrics are provided for quality assessment of spectral annotation. To accelerate RNA modification analysis, RNabel is made freely available for Mac and Windows users at https://github.com/songge1111/RNabel/releases.
]]></description>
<dc:creator><![CDATA[ Song, G., Du, Y.-J. N., Sun, R., Dong, M.-Q. ]]></dc:creator>
<dc:date>2026-06-24</dc:date>
<dc:identifier>doi:10.64898/2026.06.22.733900</dc:identifier>
<dc:title><![CDATA[RNabel-A Standalone Software Tool for Annotating Tandem Mass Spectra of Modified Ribonucleic Acids]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-24</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.19.733466v1?rss=1">
<title>
<![CDATA[
Development of Deep-Learning Models that Predict Quantitative Protein-Ligand Interactions in Glycobiology as a part of a Capstone Course 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.19.733466v1?rss=1
</link>
<description><![CDATA[
Glycans coat the surface of all cells, and every glycan is recognised by specific glycan-binding proteins (GBPs). There are no general tools that can accurately estimate the binding strength between glycan and GBP from the amino acid sequence of the GBP and the molecular structure of the glycan, represented as a SMILES string. We describe models for predicting such binding strengths developed as a part of a Capstone Course at the University of Alberta. The models are trained on a dataset that combines BindingDB, a published database of small-molecule protein interactions, and data from glycan arrays measured by the Consortium for Functional Glycomics (CFG). In this hybrid dataset of protein-ligand interactions, the ligands are both glycans from CFG and small molecules from BindingDB; similarly, the proteins include GBPs and proteins from BindingDB. Three models are presented: (i) ProMax, which fuses ESM-2, MolFormer, and MolCLR features; (ii) APEX, which constrains learning to a predetermined form, a physical model of binding; and (iii) UltraMax, which adds inter-atomic distances for the ligands. To address the dataset's severe long-tail distribution, the models employ tail-aware losses for rare high-binding instances. Trained and evaluated on approximately one million protein-ligand pairs using hold-out splits for unseen molecules, the three models provide a unified framework for quantitative glycan-protein binding prediction. We observed that learning glycan-protein binding is harder than the similar task of learning small-molecule-protein interactions. Simple mirror-inversion tests led us to postulate that insufficient use of chiral features is an important source of difficulty in learning these interactions.
]]></description>
<dc:creator><![CDATA[ Yin, H., Liu, W., Zhou, W., Chang, Z., Carpenter, E. J., Satyajith, A., Haregu, S., Greiner, R., Derda, R. ]]></dc:creator>
<dc:date>2026-06-24</dc:date>
<dc:identifier>doi:10.64898/2026.06.19.733466</dc:identifier>
<dc:title><![CDATA[Development of Deep-Learning Models that Predict Quantitative Protein-Ligand Interactions in Glycobiology as a part of a Capstone Course]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-24</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.19.733337v1?rss=1">
<title>
<![CDATA[
A comprehensive analysis of calreticulin mutants reveals distinct biophysicochemical properties with a potential for refined targeted therapies 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.19.733337v1?rss=1
</link>
<description><![CDATA[
Calreticulin mutations in myeloproliferative neoplasms result in the replacement of the C-terminal acidic sequence with a positively charged tail that causes pathological activation of the thrombopoietin receptor. The two canonical variants are Type-1 and Type-2. The remaining variants are mainly classified as Type-1-like or Type-2-like based on the wild-type sequence retained. Here, we performed in silico biophysicochemical analyses of 76 CALR exon 9 frameshift variants based on their sequence and predicted biophysical properties, complemented by structural modeling of the mutant homodimers. Beyond confirming the Type-1 versus Type-2 distinction, we found that the Type-1-like variants form a continuum of charge architecture along which two reproducible subgroups can be identified, rather than sharply separated classes. This work refines the conventional mechanism-based classification into a charge-resolved framework, provides testable hypotheses linking novel-tail chemistry to receptor activation in CALR-mutant neoplasms, and paves the way for improved targeted therapies based on individual mutant characteristics.
]]></description>
<dc:creator><![CDATA[ Kurt, O. N., Civelek, E., Ozturk, B., Chachoua, I. ]]></dc:creator>
<dc:date>2026-06-24</dc:date>
<dc:identifier>doi:10.64898/2026.06.19.733337</dc:identifier>
<dc:title><![CDATA[A comprehensive analysis of calreticulin mutants reveals distinct biophysicochemical properties with a potential for refined targeted therapies]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-24</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.19.733445v1?rss=1">
<title>
<![CDATA[
InVitroGap: an open-source tool for automated quantification of wound closure in the in vitro scratch assay 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.19.733445v1?rss=1
</link>
<description><![CDATA[
Background and Objective: Scratch assays are widely used to study wound closure in vitro, but quantitative image analysis remains constrained by manual variability, proprietary workflows, and tools requiring programming expertise. We developed InVitroGap, a Python-based application with a browser-accessible interface for automated quantification of scratch assay closure from sequential microscopy images. Methods: RCC-ER and Renca cells were seeded in 96-well ImageLock plates and scratched using a WoundMaker device for uniform linear wounds or a 200 µL pipette tip for crisscross wounds. Phase-contrast time-lapse images acquired at 0, 24, and 48 h with an IncuCyte SX5 system were independently analyzed using IncuCyte 2023A Rev2 and InVitroGap. The InVitroGap pipeline combines Gaussian smoothing, gradient-based texture mapping, adaptive percentile thresholding, and morphological post-processing to quantify wound confluence and relative wound density (RWD). Agreement was evaluated using paired comparisons, Pearson and Spearman correlations, Bland-Altman analysis, and mean absolute error (MAE). Results: InVitroGap measurements closely tracked IncuCyte outputs across both cell lines, with no significant between-method differences (p > 0.05), strong pooled correlations (R² = 0.964 for RWD; R² = 0.983 for wound confluence), and small mean biases (absolute bias ≤ 1.64%). The tool successfully processed crisscross wounds from brightfield image series, and a complete four-timepoint series was analyzed in approximately 10 seconds, with robust performance across distinct cell morphologies and wound geometries. Conclusions: InVitroGap provides a transparent, computationally efficient, and platform-independent alternative for scratch assay analysis, delivering performance comparable to commercial systems while remaining freely accessible at https://invitrogap.vercel.app/.
]]></description>
<dc:creator><![CDATA[ ARYA, R. K., Sindhani, M., Dewala, S. R., Weight, C. J., Bukavina, L. ]]></dc:creator>
<dc:date>2026-06-24</dc:date>
<dc:identifier>doi:10.64898/2026.06.19.733445</dc:identifier>
<dc:title><![CDATA[InVitroGap: an open-source tool for automated quantification of wound closure in the in vitro scratch assay]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-24</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.18.733286v1?rss=1">
<title>
<![CDATA[
Generative Modeling of Mouse Embryogenesis for Fate and Disease Prediction 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.18.733286v1?rss=1
</link>
<description><![CDATA[
Embryonic development is orchestrated by complex gene regulatory networks, and learning regulatory dynamics from developmental data could allow us to understand, predict, and ultimately engineer cell fates. Here we introduce Navigo (https://github.com/aristoteleo/Navigo-release), a biologically grounded generative modeling framework that learns a developmental vector field by integrating flow matching at the population level with RNA kinetics modeling at the molecular level. Navigo accurately maps developmental trajectories across lineages on a mouse embryogenesis scRNA-seq atlas spanning 43 time points and comprising 12.4 million cells. Applied to cardiac development, Navigo enables disease modeling by mechanistically resolving regulatory networks that distinguish congenital heart disease subtypes. Navigo also predicts perturbation effects in a zero-shot manner, as validated on independent in vivo data from six knockout genotypes without perturbation-specific training, uncovering lineage-specific gene-compensation mechanisms. Moreover, Navigo guides rational cell-fate engineering, exemplified by fibroblast reprogramming analyses, including identifying pro-fibrotic barriers to cardiac fates and evaluating hundreds of pairwise transcription factor combinations for neuronal fate, each consisting of one bHLH factor and one POU factor. Overall, Navigo provides a generalizable AI platform for perturbation-effect prediction, disease modeling, and rational cell-fate engineering, advancing toward AI-based virtual embryos for developmental biology and regenerative medicine.
]]></description>
<dc:creator><![CDATA[ Fan, Y., Liu, X., Wang, Y., Zeng, Z., Li, L., Qiu, X., Li, Y. ]]></dc:creator>
<dc:date>2026-06-24</dc:date>
<dc:identifier>doi:10.64898/2026.06.18.733286</dc:identifier>
<dc:title><![CDATA[Generative Modeling of Mouse Embryogenesis for Fate and Disease Prediction]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-24</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.19.733293v1?rss=1">
<title>
<![CDATA[
Systematic benchmarking of multi-modal approaches for tumor-naive ctDNA detection and quantification 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.19.733293v1?rss=1
</link>
<description><![CDATA[
Longitudinal monitoring of circulating tumor DNA (ctDNA) has emerged as a promising framework for characterizing treatment response dynamics in cancer. Scalable tumor-naive approaches for quantifying ctDNA often involve whole-genome sequencing (WGS) or DNA methylation profiling, but their comparative performance and capacity for complementary integration remain poorly understood. Here we systematically benchmarked tumor-naive WGS- and methylation-based ctDNA quantification methods using plasma from 150 patients with colorectal, lung and breast cancer. Using paired high-depth WGS and EM-seq data, we generated 40,000 in silico samples and evaluated detection accuracy, limits of detection (LoD) and quantification (LoQ) across cancer types and sequencing depths (0.1x-30x). We further assessed single- and multimodal method combinations, identifying conditions under which integrated approaches enhance analytical performance for detection and quantification relative to single modalities. This benchmark delineates key performance trade-offs and provides a practical framework to support method development and guide future research applications in ctDNA-based biomarker studies.
]]></description>
<dc:creator><![CDATA[ Qi, T., Odinokov, D., Lakshmanan, L. N., Grachet, N. G., Lou, M., Saelee, S., Garcia-Montoya, G., Mun, W. P., Rahman, R. C., Asgharian, H., Yi, A. T. X., Pyone, N. H. Y., Wang, L. Y., Tan, G. T., Carrie, H., Lim, A., Ting, L. Y., Hsia, A. G. H., Yean, P. P. S., Ngo, S., Snyder, J., Kaur, H., Tan, A., Yap, Y. S., Tan, D. S., Tan, I. B. H., Penkler, J.-A., Utiramerur, S., Kumar, D., Skanderup, A. J. ]]></dc:creator>
<dc:date>2026-06-24</dc:date>
<dc:identifier>doi:10.64898/2026.06.19.733293</dc:identifier>
<dc:title><![CDATA[Systematic benchmarking of multi-modal approaches for tumor-naive ctDNA detection and quantification]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-24</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.19.732250v1?rss=1">
<title>
<![CDATA[
Statistical tests for bivariate spatial association across multi-omics data with disjoint coordinates 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.19.732250v1?rss=1
</link>
<description><![CDATA[
Spatial biology has entered a new era of multimodal profiling, with multiple, high-dimensional spatial omics types being measured on consecutive tissue slices, or co-assayed on the same slice. Interest then lies in statistical testing for spatial association between the features of the different modalities, to gain insight into biological processes. One major challenge is the multitude of bivariate combinations, leading to high computational demands. Another difficulty is the difference in spatial resolution between technologies, implying no one-to-one matching between the measurement spots of the two modalities, even after alignment. As a result, common statistical measures such as joint distributions and correlations are not defined, and tests need to rely on spatial vicinity only. Moreover, we argue that many existing bivariate association tests address an inappropriate null hypothesis, or make inappropriate assumptions, both implying absence of spatial autocorrelation in any of the features and leading to misleading conclusions. As a remedy, we modify tests for the detection of spatially variable genes (Moran's I, Gaussian processes and generalized additive models (splines)) to derive bivariate tests across modalities with non-overlapping coordinate sets and provide variance estimators that do account for spatial autocorrelation. We develop inference methods for single sections as well as for replicated experiments with multiple sections, and compare their performance in nonparametric and parametric simulations. Finally, we apply the newly developed methods to two co-assayed spatial transcriptomics and metabolomics datasets from mouse and human. The full suite of tests is available from github.com/sthawinke/sbivar as the R-package sbivar.
]]></description>
<dc:creator><![CDATA[ Hawinkel, S., Hu, W., Velten, B., Maere, S. ]]></dc:creator>
<dc:date>2026-06-24</dc:date>
<dc:identifier>doi:10.64898/2026.06.19.732250</dc:identifier>
<dc:title><![CDATA[Statistical tests for bivariate spatial association across multi-omics data with disjoint coordinates]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-24</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.19.732660v1?rss=1">
<title>
<![CDATA[
trAIt: Species-by-Trait Data Retrieval using Large Language Models 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.19.732660v1?rss=1
</link>
<description><![CDATA[
Biological research often requires information about species' traits. Manual literature collation can be time-consuming and miss parts of the literature. To address this gap, we developed trAIt, a publicly available software tool for the retrieval of characteristics of species from scientific literature catalogued in the Europe PubMed Central (PubMed) database. trAIt provides a graphical user interface in which users specify species and characteristics of interest. Leveraging a large language model (LLM), trAIt retrieves relevant papers, combines their content through a consensus-based summarization model, and outputs a species-by-characteristic table. For a case study involving frog species, trAIt recovered 47.1% of trait-species combinations in 2.75 hours, while an expert curator independently recovered 62.4% over months. The consensus-based summarization substantially improves accuracy compared to single-source extraction. Across three case studies of vertebrate taxa, an expert confirmed the accuracy of 70.9% of trait-species entries recovered by trAIt. We observed considerable variation across taxa in trAIt's accuracy, which is possibly due to heterogeneity in open-access literature availability and inconsistencies in species and trait terminology. In sum, our analysis suggests that LLM-based tools can accelerate biological data synthesis but should be used to support domain experts' research, rather than replace their judgment.
]]></description>
<dc:creator><![CDATA[ Balaji, S., Martinson, K. A., Schellenberger, J. S., Koley, J., Inman, C. M., Hofmann, H. A., Young, R. L., Harpak, A. ]]></dc:creator>
<dc:date>2026-06-24</dc:date>
<dc:identifier>doi:10.64898/2026.06.19.732660</dc:identifier>
<dc:title><![CDATA[trAIt: Species-by-Trait Data Retrieval using Large Language Models]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-24</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.19.732679v1?rss=1">
<title>
<![CDATA[
Beyond statistical significance: ranking transcription factor binding motifs by effect size 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.19.732679v1?rss=1
</link>
<description><![CDATA[
Chromatin immunoprecipitation-sequencing (ChIP-seq) is widely used to identify transcription factor binding sites. DNA sequence motifs specific to a targeted transcription factor occur more frequently near ChIP-seq peak centres. The most common approach to quantifying relative motif enrichment ranks motifs by p-value. Because sample sizes can vary substantially across examined motifs, p-value magnitudes may reflect this heterogeneity rather than the biological effect of interest. As alternatives, we considered four ranking methods based on effect sizes: (a) a modified Cliff's delta, (b) the lower bound of a frequentist asymptotic confidence interval, (c) the lower bound of a frequentist finite-sample confidence interval, and (d) the lower bound of a Bayesian credible region. In extensive simulations, the four alternatives better recovered the simulated central-enrichment ordering under heterogeneous sample sizes. Using published ChIP-seq data for GATA3, the effect size methods ranked the known targeted motif highest, even compared to highly similar motifs for other GATA family members, while p-value ranking did not. In a separate SRF application, all four alternative methods also consistently ranked the known motif highest. We recommend the asymptotic confidence interval lower bound for its simplicity, ease of implementation, and intuitive interpretation. The software is freely available (https://github.com/ScottMastro/motif-ranking).
]]></description>
<dc:creator><![CDATA[ Viner, C., Mastromatteo, S., Denisko, D., Negrea, J., Tang, Y., Zhang, L., Hoffman, M. M., Sun, L. ]]></dc:creator>
<dc:date>2026-06-24</dc:date>
<dc:identifier>doi:10.64898/2026.06.19.732679</dc:identifier>
<dc:title><![CDATA[Beyond statistical significance: ranking transcription factor binding motifs by effect size]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-24</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.18.733287v1?rss=1">
<title>
<![CDATA[
SEMFA: A General Framework for Inferring Statistical Significance of Mahalanobis Similarity between Multi-Omics Profiled Samples Built on Multiple Factor Analysis 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.18.733287v1?rss=1
</link>
<description><![CDATA[
Motivation: With rapid advances in sequencing technologies, many heterogeneous omics datasets have been generated, as seen in the Encyclopedia of DNA Elements (ENCODE) and many single-cell multi-omics sequencing projects, bringing substantial challenges to existing integrative methods. In this article, we report SEMFA, a novel multi-omics fusion and analysis software package that performs general parametric tests for the Mahalanobis Similarity of samples based on the factor scores generated by an Extended version of conventional Multiple Factor Analysis. Results: Our method is effective and robust under both Gaussian and non-Gaussian assumptions. In simulation studies based on ENCODE count data, the mean F1 scores are over 0.8 when the column similarity level is 0.9 and the noise level ranges between 0.1 and 0.2. The method also handled large-scale single-cell multi-omics data efficiently and effectively, as demonstrated in colon cancer cases, where it unveiled signature network organization patterns of cells for stages III and IV.
]]></description>
<dc:creator><![CDATA[ Han, J., Luo, W., Baldwin, E., Zhang, H. H., An, L., Liu, J., Li, H. ]]></dc:creator>
<dc:date>2026-06-24</dc:date>
<dc:identifier>doi:10.64898/2026.06.18.733287</dc:identifier>
<dc:title><![CDATA[SEMFA: A General Framework for Inferring Statistical Significance of Mahalanobis Similarity between Multi-Omics Profiled Samples Built on Multiple Factor Analysis]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-24</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.18.732083v1?rss=1">
<title>
<![CDATA[
Pharmacological Stratification of Public Bioactivity Databases: A Reusable, OECD-Anchored Curation and Benchmarking Framework Demonstrated for Opioid Receptors 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.18.732083v1?rss=1
</link>
<description><![CDATA[
Public bioactivity databases are heterogeneous not only in measurement type, where binding affinities and functional potencies are reported on different scales, but in pharmacology: the same compound and target can carry agonist, antagonist, or inhibitor records measured through binding displacement, cAMP, β-arrestin, or [35S]GTPγS readouts that quantify different biological events. Pooling these records produces models whose output is detached from any coherent pharmacological claim. Prior work has standardized bioactivity at scale and quantified the noise from mixing measurement types, but pharmacological mechanism and assay-readout class have not been treated as a primary axis of large-scale curation. This study presents an auditable, OECD-anchored framework that stratifies public records by action type and assay readout before modeling, converting heterogeneous data into externally validated, interpretable QSAR tasks that compose with existing standardization resources rather than replacing them. The framework is demonstrated on the four opioid receptors (MOR, DOR, KOR, and nociceptin/orphanin FQ, NOP). Four public sources were reconciled into 72,148 merged records and 50,977 curated measurements spanning 19,585 compounds, each carrying auditable attributes for source agreement, endpoint meaning, pharmacology class, assay readout, and trust tier. Receptor-level binding tasks formed a compact benchmark with strong locked external performance, including KOR pK (R2 = 0.79, n = 798) and DOR pK (R2 = 0.77, n = 736). Pharmacology- and readout-resolved functional endpoints yielded externally validated strata that pooled labels would obscure, including a MOR antagonist functional-inhibition endpoint (R2 = 0.86, n = 110) and agonist potency endpoints for DOR, KOR, and MOR (R2 up to 0.81).
Comparison against a fully pooled baseline shows that pooled models either match stratified models on coherent endpoints or reach a deceptively high R2 on functional-IC endpoints by training predominantly on binding-displacement records, so the pooled number predicts affinity rather than functional activity. SHAP attribution indicates that binding and functional potency encode partially distinct structure-activity signals. The dataset contract, not model performance alone, defines the validity and scope of a QSAR claim, and stratification is a precondition for a functional model to support a defensible claim. Curation logic, derived tables, frozen data, and reproducibility artifacts are released.
]]></description>
<dc:creator><![CDATA[ Nael, M., Alakonda, L., Ghosh, A., Ward, S. J., Liu-Chen, L.-Y., Rajadhyaksha, A. M., Abou-Gharbia, M., Elokely, K. M. ]]></dc:creator>
<dc:date>2026-06-24</dc:date>
<dc:identifier>doi:10.64898/2026.06.18.732083</dc:identifier>
<dc:title><![CDATA[Pharmacological Stratification of Public Bioactivity Databases: A Reusable, OECD-Anchored Curation and Benchmarking Framework Demonstrated for Opioid Receptors]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-24</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.18.733198v1?rss=1">
<title>
<![CDATA[
An atlas-scale generative model for unified representation learning of bulk RNA-seq data 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.18.733198v1?rss=1
</link>
<description><![CDATA[
Public bulk RNA-seq repositories contain hundreds of thousands of samples, creating opportunities for large-scale representation learning, but integration across studies remains challenging because of heterogeneous annotations, experimental protocols, and technical variation. While pre-trained foundation models are now widely available for single-cell RNA-seq, comparable resources for bulk RNA-seq remain scarce, motivating a model that learns a unified, tissue-aware representation directly from bulk data. We trained a supervised variational autoencoder (VAE) on a compendium of 118,263 bulk RNA-seq samples that we assembled from TCGA, GTEx, and ARCHS4 and mapped to 42 tissue categories. The model classifies tissue of origin at 94.9% balanced accuracy (weighted F1 96.2%) and compresses 16,115 genes into a 121-dimensional latent space. Tissue identity is the primary organizing axis of the latent space, while source effects remain secondary. To assess the impact of data volume, we constructed training sets at three different scales (38K, 75K, and 118K samples). Reconstruction fidelity improved with each expansion of the dataset, but with diminishing returns. We validated the model on an independent cohort of 734 paediatric tumour samples from TARGET, achieving 84.6% agreement with the expected tissue of origin. The trained model and code are available on GitHub (https://github.com/BIMSBbioinfo/flexynesis_tissue_vae_manuscript) with an interactive web application.
]]></description>
<dc:creator><![CDATA[ Pande, A., Uyar, B., Akalin, A. ]]></dc:creator>
<dc:date>2026-06-24</dc:date>
<dc:identifier>doi:10.64898/2026.06.18.733198</dc:identifier>
<dc:title><![CDATA[An atlas-scale generative model for unified representation learning of bulk RNA-seq data]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-24</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.19.733349v1?rss=1">
<title>
<![CDATA[
BATTLE-AMP: Benchmarking Antimicrobial Peptide Predictors 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.19.733349v1?rss=1
</link>
<description><![CDATA[
As antimicrobial resistance outpaces antibiotic development, antimicrobial peptides (AMPs) have emerged as a promising class of alternative antibacterials, and computational predictors are increasingly used to prioritize AMP candidates. Such predictors are typically evaluated on binary AMP/non-AMP classification, which does not test whether they can identify peptides with clinically relevant potency against specific pathogens. We present BATTLE-AMP, a benchmarking framework that evaluates AMP predictors against experimentally measured minimum inhibitory concentrations (MICs) across clinically relevant bacterial species and strains. We surveyed 48 published methods, finding fewer than 25% reproducible, and benchmarked 10 model families (21 variants) using experimental MIC data, synthetic sequence perturbations, activity cliff analyses, and all-atom molecular dynamics (MD) simulations. Four findings emerge: (i) models trained on MIC data outperform binary classifiers regardless of architecture; (ii) the best model depends on the target pathogen, so model selection must be guided by the biological question; (iii) most models cannot distinguish active peptides from inactive sequences with identical amino acid composition; and (iv) activity cliffs remain unresolved by both machine learning and MD, marking a limit of current computational methods. BATTLE-AMP is released as an open Snakemake framework at https://github.com/szczurek-lab/battleamp-snakemake for benchmarking new models and scoring novel candidate libraries.
]]></description>
<dc:creator><![CDATA[ Szymczak, P., Bukała, A., Zarzecki, W., Sala, M., Borisek, J., Fadavi, S., Olayo-Alarcon, R., Sroka, J., Colome-Tatche, M., Gambin, A., Müller, C. L., Setny, P., Szczurek, E. ]]></dc:creator>
<dc:date>2026-06-24</dc:date>
<dc:identifier>doi:10.64898/2026.06.19.733349</dc:identifier>
<dc:title><![CDATA[BATTLE-AMP: Benchmarking Antimicrobial Peptide Predictors]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-24</prism:publicationDate>
<prism:section></prism:section>
</item>
</rdf:RDF>
