<?xml version="1.0" encoding="UTF-8" ?>
<rdf:RDF xmlns:admin="http://webns.net/mvcb/" xmlns="http://purl.org/rss/1.0/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:prism="http://purl.org/rss/1.0/modules/prism/" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:syn="http://purl.org/rss/1.0/modules/syndication/">
<channel rdf:about="https://biorxiv.org">
<admin:errorReportsTo rdf:resource="mailto:biorxiv@cshlpress.edu"/>
<title>bioRxiv Subject Collection: Bioinformatics</title>
<link>https://biorxiv.org</link>
<description>
This feed contains articles for bioRxiv Subject Collection "Bioinformatics"
</description>

<items>
<rdf:Seq>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.23.733339v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.21.733655v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.23.734130v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.23.734068v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.22.733900v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.19.733466v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.19.733337v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.19.733445v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.18.733286v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.19.733293v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.19.732250v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.19.732660v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.19.732679v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.18.733287v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.18.732083v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.18.733198v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.19.733349v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.19.733398v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.22.733672v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.09.730151v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.18.733068v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.17.733050v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.18.733146v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.18.733075v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.18.733061v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.18.733122v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.17.732493v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.18.733163v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.18.733285v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.10.731380v1?rss=1"/>
</rdf:Seq>
</items>
<prism:eIssn/>
<prism:publicationName>bioRxiv</prism:publicationName>
<prism:issn/>

<image rdf:resource=""/>
</channel>
<image rdf:about="">
<title>bioRxiv</title>
<url>https://www.biorxiv.org/sites/default/files/bioRxiv_article.jpg</url>
<link>https://www.biorxiv.org</link>
</image>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.23.733339v1?rss=1">
<title>
<![CDATA[
DextraDemixer enables accurate identification of antigen-specific T cells from pMHC multimer experiments 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.23.733339v1?rss=1
</link>
<description><![CDATA[
Antigen specificity of T cells defines the adaptive immune response, yet the vast majority of known T cell receptors (TCRs) lack annotated antigen targets. Single-cell peptide-MHC (pMHC) multimer assays offer a scalable approach to map TCR-antigen interactions. Still, their utility is limited by pervasive non-specific binding and severe overlap between signal and noise, which confound the accurate identification of antigen-specific cells. To address these limitations, we present DextraDemixer, a Bayesian hierarchical mixture model that disentangles antigen-specific T cells from background noise in pMHC multimer data. The model integrates information from negative controls and clonotype structure while providing calibrated uncertainty estimates for classification. We further introduce a dynamic thresholding scheme that enables credible interval-bounded control of the false discovery rate. Extensive benchmarking on simulated datasets and antigen-specific spike-in experiments demonstrated the model's robustness and improved accuracy over established methods. In a longitudinal SARS-CoV-2 vaccine study, DextraDemixer identified antigen-specific TCRs characterized by high sequence similarity, elevated antigen-specificity prediction scores, and strong clonal purity. Annotations showed high concordance with external validation data and supported the identification of antigen-specific motifs. Overall, DextraDemixer provides a principled probabilistic framework for reliable identification of antigen-specific TCRs from single-cell pMHC-multimer assays.
]]></description>
<dc:creator><![CDATA[ An, Y., Drost, F., Bonafonte-Pardas, I., Grotz, M., Schober, K., Schubert, B. ]]></dc:creator>
<dc:date>2026-06-25</dc:date>
<dc:identifier>doi:10.64898/2026.06.23.733339</dc:identifier>
<dc:title><![CDATA[DextraDemixer enables accurate identification of antigen-specific T cells from pMHC multimer experiments]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-25</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.21.733655v1?rss=1">
<title>
<![CDATA[
ComplexDesign: sequence-hallucination design of protein binders bridging multiple proteins 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.21.733655v1?rss=1
</link>
<description><![CDATA[
Motivation: Designing multichain protein complexes requires coordinating the folding of component proteins with the formation of their interfaces. Existing methods, however, remain limited in their ability to satisfy these requirements simultaneously, especially for trimeric and tetrameric complexes. As an important practical scenario, designing a binder that bridges two target proteins into a ternary complex requires flexibility in the relative arrangement of the two targets, posing an additional challenge for existing design methods. Results: We present ComplexDesign, a hallucination-based approach for multichain protein design. ComplexDesign performs structure-prediction-guided sequence optimization to simultaneously fold each protein chain and form inter-chain interactions that bind them together. To provide the flexibility required to appropriately arrange these target proteins, ComplexDesign introduces a specialized masking mechanism that enables exploration of possible relative arrangements rather than being limited to the predefined ones. Across a comprehensive set of benchmarks with various chain lengths, ComplexDesign outperformed existing methods in the unconditional design of dimers, trimers, and tetramers, achieving a design success rate exceeding 50%, supporting its capability for multichain complex design. Furthermore, in the case of multi-target binder design, ComplexDesign produced high-confidence, self-consistent ternary complexes for 8 out of 10 target pairs. These results establish ComplexDesign as an effective tool for multichain protein design, with particular utility for designing binders that bridge two target proteins. Availability and implementation: The source code of ComplexDesign will be made publicly available upon publication.
]]></description>
<dc:creator><![CDATA[ Xu, J., Ren, M., Qi, N., Zhang, X., He, Z., Yu, C., Bu, D. ]]></dc:creator>
<dc:date>2026-06-24</dc:date>
<dc:identifier>doi:10.64898/2026.06.21.733655</dc:identifier>
<dc:title><![CDATA[ComplexDesign: sequence-hallucination design of protein binders bridging multiple proteins]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-24</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.23.734130v1?rss=1">
<title>
<![CDATA[
V3Cell: A Vision-Guided Virtual 3D Cell Framework for Phenotypic Modeling and Perturbation Prediction 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.23.734130v1?rss=1
</link>
<description><![CDATA[
Predicting how organoids respond to chemical perturbations is central to disease modeling and drug discovery. Existing virtual cell models operate at the single-cell level, producing static endpoint predictions from destructive assays. This leaves a critical gap at the organoid scale, where biological identity is defined by tissue-level architecture and continuous developmental dynamics rather than single-cell features. Here we introduce V3Cell, a vision-guided framework that constructs in silico surrogates of organoids directly from non-invasive brightfield microscopy. A foreground-aware model constructs static virtual 3D cells across colon, stomach, and lung organoid lineages. These virtual 3D cells closely match real samples across distributional metrics, micro-texture, and lineage-specific morphometrics, with small effect sizes for most descriptors. A temporal module further predicts developmental fate from as few as six early-frame observations and models fate-conditioned spatiotemporal trajectories that closely recapitulate real perturbation responses. V3Cell requires no omics profiling or fluorescent labeling, establishing a non-invasive brightfield-based paradigm for organoid-scale perturbation prediction. Our code and data are publicly available at https://github.com/Laineyoulu/V3Cell.
]]></description>
<dc:creator><![CDATA[ Lu, Y., Xun, D., chenke, X., Xiaobo, Z., Zhigang, Z., Pengyu, C., Xiwen, Y., Zhengzheng, Y., Jiahua, R., Huili, H., Jianying, H., Pengwei, H. ]]></dc:creator>
<dc:date>2026-06-24</dc:date>
<dc:identifier>doi:10.64898/2026.06.23.734130</dc:identifier>
<dc:title><![CDATA[V3Cell: A Vision-Guided Virtual 3D Cell Framework for Phenotypic Modeling and Perturbation Prediction]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-24</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.23.734068v1?rss=1">
<title>
<![CDATA[
fastQpick: scalable bootstrap and subsampling of FASTQ reads 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.23.734068v1?rss=1
</link>
<description><![CDATA[
fastQpick is a command-line tool and Python library for sampling FASTQ reads with replacement. Sampling with replacement turns a single FASTQ file into an arbitrary number of bootstrap replicates, which enables uncertainty quantification and statistical analysis at the level of raw reads. This process answers questions such as how much an abundance estimate would change if the library were resequenced, or whether a low-abundance call is robust to the particular reads that were sequenced. fastQpick works efficiently on large libraries by streaming files in two passes by default: first to count reads and create a hash-based counter, and then to write the sample. It generates a full-size bootstrap replicate of a 500-million-read library in under 30 minutes with 9.4 GB of peak memory, with a low-memory mode that reduces the peak to 1.4 GB. A single-pass mode draws samples in a single read through the file, using O(1) working memory and producing an output size that is exact in expectation but not fixed. In a real yeast RNA-seq experiment, bootstrap replicates generated by fastQpick recover the sampling uncertainty of transcript abundance estimates, matching the analytic multinomial standard errors to within a few percent. fastQpick is open source and freely available under the MIT license on GitHub at https://github.com/pachterlab/fastQpick and on PyPI (pip install fastQpick).
]]></description>
<dc:creator><![CDATA[ Rich, J., Pachter, L. ]]></dc:creator>
<dc:date>2026-06-24</dc:date>
<dc:identifier>doi:10.64898/2026.06.23.734068</dc:identifier>
<dc:title><![CDATA[fastQpick: scalable bootstrap and subsampling of FASTQ reads]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-24</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.22.733900v1?rss=1">
<title>
<![CDATA[
RNabel-A Standalone Software Tool for Annotating Tandem Mass Spectra of Modified Ribonucleic Acids 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.22.733900v1?rss=1
</link>
<description><![CDATA[
Ribonucleic acid (RNA) modifications, with over 170 identified types, play diverse roles in cellular processes. The past decade has witnessed surging demand for accurate identification and localization of RNA modifications in both endogenous and synthetic therapeutic RNAs. With accurate spectral annotation for RNA, tandem mass spectrometry (MS/MS) can meet this demand. Here we present RNabel, a user-friendly software tool for in-depth annotation of MS/MS spectra of RNA oligonucleotides. RNabel considers a full set of backbone-cleavage ions (a, b, c, d, a-B, w, x, y, z) in which the ribonucleotide unit could be A, U, C, G, Y (pseudouridine), or I (inosine). Additionally, RNabel considers 196 modifications on the base, the phosphoribose linkage, the 5' or the 3' terminus, or detachment of a sub-nucleotide fragment as a neutral or charged group. Users can create new components if needed, including ribonucleotides, modifications, and neutral or charged groups that could detach from a ribonucleotide. RNabel efficiently processes large datasets in four accepted formats: .mgf, .raw, .txt from msConvert, and RNabel batch files. Multiple statistical metrics are provided for quality assessment of spectral annotation. To accelerate RNA modification analysis, RNabel is made freely available for Mac and Windows users at https://github.com/songge1111/RNabel/releases.
]]></description>
<dc:creator><![CDATA[ Song, G., Du, Y.-J. N., Sun, R., Dong, M.-Q. ]]></dc:creator>
<dc:date>2026-06-24</dc:date>
<dc:identifier>doi:10.64898/2026.06.22.733900</dc:identifier>
<dc:title><![CDATA[RNabel-A Standalone Software Tool for Annotating Tandem Mass Spectra of Modified Ribonucleic Acids]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-24</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.19.733466v1?rss=1">
<title>
<![CDATA[
Development of Deep-Learning Models that Predict Quantitative Protein-Ligand Interactions in Glycobiology as a part of a Capstone Course 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.19.733466v1?rss=1
</link>
<description><![CDATA[
Glycans coat the surface of all cells, and every glycan is recognised by specific glycan-binding proteins (GBPs). There are no general tools that can accurately estimate the binding strength between a glycan and a GBP from the amino acid sequence of the GBP and the molecular structure of the glycan, represented as a SMILES string. We describe models for predicting such binding strengths developed as a part of a Capstone Course at the University of Alberta. The models are trained on a dataset that combines BindingDB, a published database of small-molecule protein interactions, and data from glycan arrays measured by the Consortium of Functional Glycomics (CFG). In this hybrid dataset of protein-ligand interactions, the ligands are both glycans from CFG and small molecules from BindingDB; similarly, the proteins include GBPs and proteins from BindingDB. Three models are presented: (i) ProMax, which fuses ESM-2, MolFormer, and MolCLR features; (ii) APEX, which constrains learning to a predetermined form, a physical model of binding; and (iii) UltraMax, which adds inter-atomic distances for the ligands. To address the dataset's severe long-tail distribution, the models employ tail-aware losses for rare high-binding instances. Trained and evaluated on approximately one million protein-ligand pairs using hold-out splits for unseen molecules, the three models provide a unified framework for quantitative glycan-protein binding prediction. We observed that learning glycan-protein binding is harder than the similar task of learning small-molecule-protein interactions. Simple mirror-inversion tests led us to postulate that insufficient use of chiral features is an important source of difficulty in learning these interactions.
]]></description>
<dc:creator><![CDATA[ Yin, H., Liu, W., Zhou, W., Chang, Z., Carpenter, E. J., Satyajith, A., Haregu, S., Greiner, R., Derda, R. ]]></dc:creator>
<dc:date>2026-06-24</dc:date>
<dc:identifier>doi:10.64898/2026.06.19.733466</dc:identifier>
<dc:title><![CDATA[Development of Deep-Learning Models that Predict Quantitative Protein-Ligand Interactions in Glycobiology as a part of a Capstone Course]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-24</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.19.733337v1?rss=1">
<title>
<![CDATA[
A comprehensive analysis of calreticulin mutants reveals distinct biophysicochemical properties with a potential for refined targeted therapies 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.19.733337v1?rss=1
</link>
<description><![CDATA[
Calreticulin mutations in myeloproliferative neoplasms result in the replacement of the C-terminus acidic sequence with a positively charged tail that causes pathological activation of the thrombopoietin receptor. The two canonical variants are Type-1 and Type-2. The remaining variants are mainly classified as Type-1-like or Type-2-like based on the wild-type sequence retained. Here, we performed in silico biophysicochemical analyses of 76 CALR exon 9 frameshift variants, based on their sequence and predicted biophysical properties, complemented by structural modeling of the mutant homodimers. Beyond confirming the Type-1 versus Type-2 distinction, we found that the Type-1-like variants form a continuum of charge architecture along which two reproducible subgroups can be identified, rather than sharply separated classes. This work refines the conventional mechanism-based classification into a charge-resolved framework, provides testable hypotheses linking novel-tail chemistry to receptor activation in CALR-mutant neoplasms, and paves the way for improved targeted therapies based on individual mutants' characteristics.
]]></description>
<dc:creator><![CDATA[ Kurt, O. N., Civelek, E., Ozturk, B., Chachoua, I. ]]></dc:creator>
<dc:date>2026-06-24</dc:date>
<dc:identifier>doi:10.64898/2026.06.19.733337</dc:identifier>
<dc:title><![CDATA[A comprehensive analysis of calreticulin mutants reveals distinct biophysicochemical properties with a potential for refined targeted therapies]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-24</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.19.733445v1?rss=1">
<title>
<![CDATA[
InVitroGap: an open-source tool for automated quantification of wound closure in the in vitro scratch assay 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.19.733445v1?rss=1
</link>
<description><![CDATA[
Background and Objective: Scratch assays are widely used to study wound closure in vitro, but quantitative image analysis remains constrained by manual variability, proprietary workflows, and tools requiring programming expertise. We developed InVitroGap, a Python-based application with a browser-accessible interface for automated quantification of scratch assay closure from sequential microscopy images. Methods: RCC-ER and Renca cells were seeded in 96-well ImageLock plates and scratched using a WoundMaker device for uniform linear wounds or a 200 µL pipette tip for crisscross wounds. Phase-contrast time-lapse images acquired at 0, 24, and 48 h with an IncuCyte SX5 system were independently analyzed using IncuCyte 2023A Rev2 and InVitroGap. The InVitroGap pipeline combines Gaussian smoothing, gradient-based texture mapping, adaptive percentile thresholding, and morphological post-processing to quantify wound confluence and relative wound density (RWD). Agreement was evaluated using paired comparisons, Pearson and Spearman correlations, Bland-Altman analysis, and mean absolute error (MAE). Results: InVitroGap measurements closely tracked IncuCyte outputs across both cell lines, with no significant between-method differences (p > 0.05), strong pooled correlations (R² = 0.964 for RWD; R² = 0.983 for wound confluence), and small mean biases (absolute bias ≤ 1.64%). The tool successfully processed crisscross wounds from brightfield image series, and a complete four-timepoint series was analyzed in approximately 10 seconds, with robust performance across distinct cell morphologies and wound geometries. Conclusions: InVitroGap provides a transparent, computationally efficient, and platform-independent alternative for scratch assay analysis, delivering performance comparable to commercial systems while remaining freely accessible at https://invitrogap.vercel.app/.
]]></description>
<dc:creator><![CDATA[ Arya, R. K., Sindhani, M., Dewala, S. R., Weight, C. J., Bukavina, L. ]]></dc:creator>
<dc:date>2026-06-24</dc:date>
<dc:identifier>doi:10.64898/2026.06.19.733445</dc:identifier>
<dc:title><![CDATA[InVitroGap: an open-source tool for automated quantification of wound closure in the in vitro scratch assay]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-24</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.18.733286v1?rss=1">
<title>
<![CDATA[
Generative Modeling of Mouse Embryogenesis for Fate and Disease Prediction 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.18.733286v1?rss=1
</link>
<description><![CDATA[
Embryonic development is orchestrated by complex gene regulatory networks, and learning regulatory dynamics from developmental data could allow us to understand, predict, and ultimately engineer cell fates. Here we introduce Navigo (https://github.com/aristoteleo/Navigo-release), a biologically grounded generative modeling framework that learns a developmental vector field by integrating flow matching at the population level with RNA kinetics modeling at the molecular level. Navigo accurately maps developmental trajectories across lineages on a mouse embryogenesis scRNA-seq atlas spanning 43 time points and comprising 12.4 million cells. Applied to cardiac development, Navigo enables disease modeling by mechanistically resolving regulatory networks that distinguish congenital heart disease subtypes. Navigo also predicts perturbation effects in a zero-shot manner, as validated on independent in vivo data from six knockout genotypes without perturbation-specific training, uncovering lineage-specific gene-compensation mechanisms. Moreover, Navigo guides rational cell-fate engineering, exemplified by fibroblast reprogramming analyses, including identifying pro-fibrotic barriers to cardiac fates and evaluating hundreds of pairwise transcription factor combinations for neuronal fate, each consisting of one bHLH factor and one POU factor. Overall, Navigo provides a generalizable AI platform for perturbation-effect prediction, disease modeling, and rational cell-fate engineering, advancing toward AI-based virtual embryos for developmental biology and regenerative medicine.
]]></description>
<dc:creator><![CDATA[ Fan, Y., Liu, X., Wang, Y., Zeng, Z., Li, L., Qiu, X., Li, Y. ]]></dc:creator>
<dc:date>2026-06-24</dc:date>
<dc:identifier>doi:10.64898/2026.06.18.733286</dc:identifier>
<dc:title><![CDATA[Generative Modeling of Mouse Embryogenesis for Fate and Disease Prediction]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-24</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.19.733293v1?rss=1">
<title>
<![CDATA[
Systematic benchmarking of multi-modal approaches for tumor-naive ctDNA detection and quantification 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.19.733293v1?rss=1
</link>
<description><![CDATA[
Longitudinal monitoring of circulating tumor DNA (ctDNA) has emerged as a promising framework for characterizing treatment response dynamics in cancer. Scalable tumor-naive approaches for quantifying ctDNA often involve whole-genome sequencing (WGS) or DNA methylation profiling, but their comparative performance and capacity for complementary integration remain poorly understood. Here we systematically benchmarked tumor-naive WGS- and methylation-based ctDNA quantification methods using plasma from 150 patients with colorectal, lung and breast cancer. Using paired high-depth WGS and EM-seq data, we generated 40,000 in silico samples and evaluated detection accuracy, limits of detection (LoD) and quantification (LoQ) across cancer types and sequencing depths (0.1x-30x). We further assessed single- and multimodal method combinations, identifying conditions under which integrated approaches enhance analytical performance for detection and quantification relative to single modalities. This benchmark delineates key performance trade-offs and provides a practical framework to support method development and guide future research applications in ctDNA-based biomarker studies.
]]></description>
<dc:creator><![CDATA[ Qi, T., Odinokov, D., Lakshmanan, L. N., Grachet, N. G., Lou, M., Saelee, S., Garcia-Montoya, G., Mun, W. P., Rahman, R. C., Asgharian, H., Yi, A. T. X., Pyone, N. H. Y., Wang, L. Y., Tan, G. T., Carrie, H., Lim, A., Ting, L. Y., Hsia, A. G. H., Yean, P. P. S., Ngo, S., Snyder, J., Kaur, H., Tan, A., Yap, Y. S., Tan, D. S., Tan, I. B. H., Penkler, J.-A., Utiramerur, S., Kumar, D., Skanderup, A. J. ]]></dc:creator>
<dc:date>2026-06-24</dc:date>
<dc:identifier>doi:10.64898/2026.06.19.733293</dc:identifier>
<dc:title><![CDATA[Systematic benchmarking of multi-modal approaches for tumor-naive ctDNA detection and quantification]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-24</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.19.732250v1?rss=1">
<title>
<![CDATA[
Statistical tests for bivariate spatial association across multi-omics data with disjoint coordinates 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.19.732250v1?rss=1
</link>
<description><![CDATA[
Spatial biology has entered a new era of multimodal profiling, with multiple, high-dimensional spatial omics types being measured on consecutive tissue slices, or co-assayed on the same slice. Interest then lies in statistical testing for spatial association between the features of the different modalities, to gain insight into biological processes. One major challenge is the multitude of bivariate combinations, leading to high computational demands. Another difficulty is the difference in spatial resolution between technologies, implying no one-to-one matching between the measurement spots of the two modalities, even after alignment. As a result, common statistical measures such as joint distributions and correlations are not defined, and tests need to rely on spatial vicinity only. Moreover, we argue that many existing bivariate association tests address an inappropriate null hypothesis, or make inappropriate assumptions, both implying absence of spatial autocorrelation in any of the features and leading to misleading conclusions. As a remedy, we modify tests for the detection of spatially variable genes (Moran's I, Gaussian processes and generalized additive models (splines)) to derive bivariate tests across modalities with non-overlapping coordinate sets and provide variance estimators that account for spatial autocorrelation. We develop inference methods for single sections as well as for replicated experiments with multiple sections, and compare their performance in nonparametric and parametric simulations. Finally, we apply the newly developed methods to two co-assayed spatial transcriptomics and metabolomics datasets from mouse and human. The full suite of tests is available from github.com/sthawinke/sbivar as the R-package sbivar.
]]></description>
<dc:creator><![CDATA[ Hawinkel, S., Hu, W., Velten, B., Maere, S. ]]></dc:creator>
<dc:date>2026-06-24</dc:date>
<dc:identifier>doi:10.64898/2026.06.19.732250</dc:identifier>
<dc:title><![CDATA[Statistical tests for bivariate spatial association across multi-omics data with disjoint coordinates]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-24</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.19.732660v1?rss=1">
<title>
<![CDATA[
trAIt: Species-by-Trait Data Retrieval using Large Language Models 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.19.732660v1?rss=1
</link>
<description><![CDATA[
Biological research often requires information about species' traits. Manual literature collation can be time-consuming and miss parts of the literature. To address this gap, we developed trAIt, a publicly available software for the retrieval of characteristics of species from scientific literature catalogued in the Europe PubMed Central (PubMed) database. trAIt provides a graphical user interface in which users specify species and characteristics of interest. Leveraging a large language model (LLM), trAIt retrieves relevant papers, combines their content through a consensus-based summarization model, and outputs a species-by-characteristic table. For a case study involving frog species, trAIt recovered 47.1% of trait-species combinations in 2.75 hours, while an expert curator independently recovered 62.4% over months. The consensus-based summarization substantially aids accuracy compared to single-source extraction. Across three case studies of vertebrate taxa, an expert confirmed the accuracy of 70.9% of trait-species entries recovered by trAIt. We observed considerable variation across taxa in trAIt's accuracy, which is possibly due to heterogeneity in open-access literature availability and inconsistencies in species and trait terminology. In sum, our analysis suggests that LLM-based tools can accelerate biological data synthesis but should be used to support domain experts' research, rather than replace their judgment.
]]></description>
<dc:creator><![CDATA[ Balaji, S., Martinson, K. A., Schellenberger, J. S., Koley, J., Inman, C. M., Hofmann, H. A., Young, R. L., Harpak, A. ]]></dc:creator>
<dc:date>2026-06-24</dc:date>
<dc:identifier>doi:10.64898/2026.06.19.732660</dc:identifier>
<dc:title><![CDATA[trAIt: Species-by-Trait Data Retrieval using Large Language Models]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-24</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.19.732679v1?rss=1">
<title>
<![CDATA[
Beyond statistical significance: ranking transcription factor binding motifs by effect size 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.19.732679v1?rss=1
</link>
<description><![CDATA[
Chromatin immunoprecipitation-sequencing (ChIP-seq) has wide use in identifying transcription factor binding sites. DNA sequence motifs specific to a targeted transcription factor occur more frequently near ChIP-seq peak centres. The most common approach to quantifying relative motif enrichment ranks motifs by p-value. Because sample sizes can vary substantially across examined motifs, p-value magnitudes may reflect this heterogeneity rather than the biological effect of interest. As alternatives, we considered four ranking methods based on effect sizes: (a) a modified Cliff's delta, (b) the lower bound of a frequentist asymptotic confidence interval, (c) the lower bound of a frequentist finite-sample confidence interval, and (d) the lower bound of a Bayesian credible region. In extensive simulations, the four alternatives better recovered the simulated central-enrichment ordering under heterogeneous sample sizes. Using published ChIP-seq data for GATA3, the effect size methods ranked the known targeted motif highest, even compared to highly similar motifs for other GATA family members, while p-value ranking did not. In a separate SRF application, all four alternative methods also consistently ranked the known motif highest. We recommend the asymptotic confidence interval lower bound for its simplicity, ease of implementation, and intuitive interpretation. The software is freely available (https://github.com/ScottMastro/motif-ranking).
]]></description>
<dc:creator><![CDATA[ Viner, C., Mastromatteo, S., Denisko, D., Negrea, J., Tang, Y., Zhang, L., Hoffman, M. M., Sun, L. ]]></dc:creator>
<dc:date>2026-06-24</dc:date>
<dc:identifier>doi:10.64898/2026.06.19.732679</dc:identifier>
<dc:title><![CDATA[Beyond statistical significance: ranking transcription factor binding motifs by effect size]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-24</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.18.733287v1?rss=1">
<title>
<![CDATA[
SEMFA: A General Framework for Inferring Statistical Significance of Mahalanobis Similarity between Multi-Omics Profiled Samples Built on Multiple Factor Analysis 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.18.733287v1?rss=1
</link>
<description><![CDATA[
Motivation: With rapid advances in sequencing technologies, many heterogeneous omics datasets have been generated, as seen in the Encyclopedia of DNA Elements (ENCODE) and many single-cell multi-omics sequencing projects, bringing substantial challenges to existing integrative methods. In this article, we report a novel multi-omics fusion and analysis software, SEMFA, which performs general parametric tests for the Mahalanobis Similarity of samples based on the factor scores generated by an Extended version of conventional Multiple Factor Analysis. Results: Our method is effective and robust under both Gaussian and non-Gaussian assumptions. In simulation studies based on ENCODE count data, the mean F1 scores are over 0.8 when the column similarity level is 0.9 and the noise level ranges between 0.1 and 0.2. It was also efficient and effective at handling large-scale single-cell multi-omics data, as demonstrated in colon cancer cases, where it unveiled signature network organization patterns of cells for stages III and IV.
]]></description>
<dc:creator><![CDATA[ Han, J., Luo, W., Baldwin, E., Zhang, H. H., An, L., Liu, J., Li, H. ]]></dc:creator>
<dc:date>2026-06-24</dc:date>
<dc:identifier>doi:10.64898/2026.06.18.733287</dc:identifier>
<dc:title><![CDATA[SEMFA: A General Framework for Inferring Statistical Significance of Mahalanobis Similarity between Multi-Omics Profiled Samples Built on Multiple Factor Analysis]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-24</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.18.732083v1?rss=1">
<title>
<![CDATA[
Pharmacological Stratification of Public Bioactivity Databases: A Reusable, OECD-Anchored Curation and Benchmarking Framework Demonstrated for Opioid Receptors 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.18.732083v1?rss=1
</link>
<description><![CDATA[
Public bioactivity databases are heterogeneous not only in measurement type, where binding affinities and functional potencies are reported on different scales, but in pharmacology: the same compound and target can carry agonist, antagonist, or inhibitor records measured through binding displacement, cAMP, β-arrestin, or [35S]GTPγS readouts that quantify different biological events. Pooling these records produces models whose output is detached from any coherent pharmacological claim. Prior work has standardized bioactivity at scale and quantified the noise from mixing measurement types, but pharmacological mechanism and assay-readout class have not been treated as a primary axis of large-scale curation. This study presents an auditable, OECD-anchored framework that stratifies public records by action type and assay readout before modeling, converting heterogeneous data into externally validated, interpretable QSAR tasks that compose with existing standardization resources rather than replacing them. The framework is demonstrated on the four opioid receptors (MOR, DOR, KOR, and nociceptin/orphanin FQ, NOP). Four public sources were reconciled into 72,148 merged records and 50,977 curated measurements spanning 19,585 compounds, each carrying auditable attributes for source agreement, endpoint meaning, pharmacology class, assay readout, and trust tier. Receptor-level binding tasks formed a compact benchmark with strong locked external performance, including KOR pK (R² = 0.79, n = 798) and DOR pK (R² = 0.77, n = 736). Pharmacology- and readout-resolved functional endpoints yielded externally validated strata that pooled labels would obscure, including a MOR antagonist functional-inhibition endpoint (R² = 0.86, n = 110) and agonist potency endpoints for DOR, KOR, and MOR (R² up to 0.81). 
Comparison against a fully pooled baseline shows that pooled models either match stratified models on coherent endpoints or reach a deceptively high R2 on functional-IC endpoints by training predominantly on binding-displacement records, so the pooled number predicts affinity rather than functional activity. SHAP attribution indicates that binding and functional potency encode partially distinct structure-activity signals. The dataset contract, not model performance alone, defines the validity and scope of a QSAR claim, and stratification is a precondition for a functional model to support a defensible claim. Curation logic, derived tables, frozen data, and reproducibility artifacts are released.
]]></description>
<dc:creator><![CDATA[ Nael, M., Alakonda, L., Ghosh, A., Ward, S. J., Liu-Chen, L.-Y., Rajadhyaksha, A. M., Abou-Gharbia, M., Elokely, K. M. ]]></dc:creator>
<dc:date>2026-06-24</dc:date>
<dc:identifier>doi:10.64898/2026.06.18.732083</dc:identifier>
<dc:title><![CDATA[Pharmacological Stratification of Public Bioactivity Databases: A Reusable, OECD-Anchored Curation and Benchmarking Framework Demonstrated for Opioid Receptors]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-24</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.18.733198v1?rss=1">
<title>
<![CDATA[
An atlas-scale generative model for unified representation learning of bulk RNA-seq data 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.18.733198v1?rss=1
</link>
<description><![CDATA[
Public bulk RNA-seq repositories contain hundreds of thousands of samples, creating opportunities for large-scale representation learning, but integration across studies remains challenging because of heterogeneous annotations, experimental protocols, and technical variation. While pre-trained foundation models are now widely available for single-cell RNA-seq, comparable resources for bulk RNA-seq remain scarce, motivating a model that learns a unified, tissue-aware representation directly from bulk data. We trained a supervised variational autoencoder (VAE) on a compendium of 118,263 bulk RNA-seq samples that we assembled from TCGA, GTEx, and ARCHS4 and mapped to 42 tissue categories. The model classifies tissue of origin at 94.9% balanced accuracy (weighted F1 96.2%) and compresses 16,115 genes into a 121-dimensional latent space. Tissue identity is the primary organizing axis of the latent space, while source effects remain secondary. To assess the impact of data volume, we constructed training sets at three different scales (38K, 75K, and 118K samples). Our results demonstrated that reconstruction fidelity improved incrementally with each expansion of the dataset, but with diminishing returns. We validated the model on an independent cohort of 734 paediatric tumour samples from TARGET, achieving 84.6% agreement with the expected tissue of origin. The trained model and code are available at GitHub (https://github.com/BIMSBbioinfo/flexynesis_tissue_vae_manuscript) with an interactive web application.
]]></description>
<dc:creator><![CDATA[ Pande, A., Uyar, B., Akalin, A. ]]></dc:creator>
<dc:date>2026-06-24</dc:date>
<dc:identifier>doi:10.64898/2026.06.18.733198</dc:identifier>
<dc:title><![CDATA[An atlas-scale generative model for unified representation learning of bulk RNA-seq data]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-24</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.19.733349v1?rss=1">
<title>
<![CDATA[
BATTLE-AMP: Benchmarking Antimicrobial Peptide Predictors 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.19.733349v1?rss=1
</link>
<description><![CDATA[
As antimicrobial resistance outpaces antibiotic development, antimicrobial peptides (AMPs) have emerged as a promising class of alternative antibacterials, and computational predictors are increasingly used to prioritize AMP candidates. Such predictors are typically evaluated on binary AMP/non-AMP classification, which does not test whether they can identify peptides with clinically relevant potency against specific pathogens. We present BATTLE-AMP, a benchmarking framework that evaluates AMP predictors against experimentally measured minimum inhibitory concentrations (MICs) across clinically relevant bacterial species and strains. We surveyed 48 published methods, finding fewer than 25% reproducible, and benchmarked 10 model families (21 variants) using experimental MIC data, synthetic sequence perturbations, activity cliff analyses, and all-atom molecular dynamics (MD) simulations. Four findings emerge: (i) models trained on MIC data outperform binary classifiers regardless of architecture; (ii) the best model depends on the target pathogen, so model selection must be guided by the biological question; (iii) most models cannot distinguish active peptides from inactive sequences with identical amino acid composition; and (iv) activity cliffs remain unresolved by both machine learning and MD, marking a limit of current computational methods. BATTLE-AMP is released as an open Snakemake framework at https://github.com/szczurek-lab/battleamp-snakemake for benchmarking new models and scoring novel candidate libraries.
]]></description>
<dc:creator><![CDATA[ Szymczak, P., Bukała, A., Zarzecki, W., Sala, M., Borisek, J., Fadavi, S., Olayo-Alarcon, R., Sroka, J., Colome-Tatche, M., Gambin, A., L. Müller, C., Setny, P., Szczurek, E. ]]></dc:creator>
<dc:date>2026-06-24</dc:date>
<dc:identifier>doi:10.64898/2026.06.19.733349</dc:identifier>
<dc:title><![CDATA[BATTLE-AMP: Benchmarking Antimicrobial Peptide Predictors]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-24</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.19.733398v1?rss=1">
<title>
<![CDATA[
EnrichViz: An Interactive R Shiny Application for Visualization of Pathway Enrichment Results from Omics Data 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.19.733398v1?rss=1
</link>
<description><![CDATA[
Pathway and functional enrichment analysis is a cornerstone of omics data interpretation, enabling researchers to map differentially expressed proteins or genes onto curated biological processes, signaling cascades, and molecular functions. While tools such as Ingenuity Pathway Analysis (IPA), g:Profiler, and Enrichr are widely used to generate ranked enrichment results, translating these tabular outputs into clear, publication-ready figures remains a time-consuming step that typically requires custom scripting and familiarity with visualization libraries, a significant barrier for researchers without a computational background. Here we present EnrichViz, a self-contained, browser-based R Shiny application that enables interactive, code-free visualization of pathway and functional enrichment results from quantitative proteomics, transcriptomics, and metabolomics experiments. EnrichViz accepts three standard CSV files as input, a normalized abundance matrix, a sample annotation or metadata file, and enrichment results from any platform that exports tabular output, and produces six complementary, publication-ready visualizations: bar and bubble plots for ranking enriched terms by significance, chord diagrams for exploring pathway-molecule connectivity, clustered heatmaps for displaying Z-score normalized expression patterns across experimental groups, and boxplots or violin plots for examining the abundance distribution of individual proteins, genes, or metabolites. The application supports both raw p-values and pre-transformed -log10(p) values through automatic detection, and all plot parameters are adjustable in real time through a graphical sidebar. Every figure can be exported as a high-resolution PNG file at 300 dpi. EnrichViz is implemented in R using the Shiny, ggplot2, pheatmap, and circlize packages, and is freely available at https://rgmilian.shinyapps.io/EnrichViz/
]]></description>
<dc:creator><![CDATA[ Garcia-Milian, R. ]]></dc:creator>
<dc:date>2026-06-23</dc:date>
<dc:identifier>doi:10.64898/2026.06.19.733398</dc:identifier>
<dc:title><![CDATA[EnrichViz: An Interactive R Shiny Application for Visualization of Pathway Enrichment Results from Omics Data]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-23</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.22.733672v1?rss=1">
<title>
<![CDATA[
FateLimit quantifies the prediction horizon of cell fate 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.22.733672v1?rss=1
</link>
<description><![CDATA[
Single-cell technologies have enabled increasingly detailed reconstruction of developmental trajectories, yet a fundamental question remains unresolved: when does future cellular identity become predictable from a cell's current molecular state? Existing approaches infer lineage relationships, transition probabilities or future transcriptional dynamics, but do not directly quantify the emergence of fate predictability during cellular state transitions. Here we present FateLimit, an information-theoretic framework for measuring the temporal dynamics of cell-fate predictability from single-cell omics data. FateLimit combines probabilistic fate assignment, fate entropy and mutual information to quantify how information about future cellular outcomes is encoded in present molecular states. We introduce two quantitative descriptors: the Fate Information Half-Life (FIHL), which measures the characteristic timescale of fate-information dynamics, and the Prediction Horizon (PH), defined as the earliest developmental stage at which observed fate predictability exceeds the 95th percentile of a permutation-derived null distribution. We applied FateLimit across developmental, lineage-tracing and reprogramming systems, including pancreatic endocrinogenesis, CellTag reprogramming, human hematopoiesis and zebrafish embryogenesis. Across all datasets, FateLimit identified significant fate information and reproducible prediction horizons that were robust to cell-state representation, lineage structure and biological context. Comparative analysis revealed that prediction horizons differ substantially among cellular lineages, indicating that distinct developmental programs acquire predictive information at different rates. FateLimit establishes a general framework for quantifying the predictability of future cellular identity from present molecular states. 
By transforming developmental trajectories into predictability landscapes, FateLimit enables systematic comparison of commitment dynamics across biological systems and establishes prediction horizons as a quantitative measure of cell-fate determination.
]]></description>
<dc:creator><![CDATA[ Sung, J.-Y., Cheong, J.-H. ]]></dc:creator>
<dc:date>2026-06-23</dc:date>
<dc:identifier>doi:10.64898/2026.06.22.733672</dc:identifier>
<dc:title><![CDATA[FateLimit quantifies the prediction horizon of cell fate]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-23</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.09.730151v1?rss=1">
<title>
<![CDATA[
Multi-Scale Machine Learning for Antibody-Antigen Binding Affinity Prediction Using Deep Mutational Scanning and Structural Features 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.09.730151v1?rss=1
</link>
<description><![CDATA[
Predicting how mutations alter antibody-antigen binding affinity is essential for antibody engineering and vaccine design, yet current methods generalize poorly to unseen complexes. We present a multi-scale machine learning framework integrating 93 descriptors across four modalities: physicochemical, structural, ESM-2 protein language model, and solvent-accessible surface area (SASA)/ΔΔG_fold features. Under leave-one-complex-out deep mutational scanning (LOCO-DMS) cross-validation on AbAgym (36,541 mutations, 68 experiments, 13 pathogens), gradient boosting achieved MCC = 0.206; a confidence-stratified ensemble reached MCC = 0.374 (83.5% accuracy, 25.5% coverage). No single modality exceeds the majority baseline alone; only multi-scale fusion succeeds. Boltzmann ceiling analysis shows 45.9% of mutations are near-neutral (|ΔΔG| < k_BT), bounding theoretical maximum MCC at 0.473; our method achieves 79.1% of this limit. Five deep learning architectures benchmarked under LOCO-DMS showed self-attention matching gradient boosting (MCC = 0.200). Cross-pathogen transfer failed systematically (mean 46.7%), confirming universal binding predictors remain an open challenge.
]]></description>
<dc:creator><![CDATA[ Sivasubramani, S. ]]></dc:creator>
<dc:date>2026-06-23</dc:date>
<dc:identifier>doi:10.64898/2026.06.09.730151</dc:identifier>
<dc:title><![CDATA[Multi-Scale Machine Learning for Antibody-Antigen Binding Affinity Prediction Using Deep Mutational Scanning and Structural Features]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-23</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.18.733068v1?rss=1">
<title>
<![CDATA[
Comorbidity structure as an inductive bias: Comparing output-head designs for multi-label prediction of diabetes and myocardial infarction complications 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.18.733068v1?rss=1
</link>
<description><![CDATA[
Background: Clinical complications are often predicted with separate sigmoid outputs, even when the target labels arise from related pathophysiological processes. This paper asks whether output-layer choice should reflect both predictive convenience and the biological structure assumed among complications. The central premise is that label-dependence mechanisms are explicit hypotheses about comorbidity, not generic modelling additions.

Methods: Output-head assumptions were compared across two clinically distinct multi-label prediction tasks. In Type 2 diabetes (T2D), six heads were evaluated for nephropathy, neuropathy, and retinopathy: independent baseline, linear additive, multiplicative, symmetric conditional random field (CRF), residual multilayer perceptron (MLP), and combined additive-multiplicative. In myocardial infarction (MI), four heads were evaluated for ventricular tachycardia, ventricular fibrillation, and atrioventricular block: independent baseline, linear additive, multiplicative, and symmetric CRF. All experiments used five training data fractions and seven independent seeds, with the same shared-backbone protocol within each disease setting.

Results: In T2D, the symmetric CRF gave the most consistent improvement pattern, ranking highest at full data and at the two lowest data fractions while adding only three interaction parameters. At 20% training data, it was the only interaction head whose aggregate mean exceeded the independent baseline. The residual MLP, despite 123 interaction parameters, remained below the baseline across all T2D fractions. In MI, rankings changed across fractions: the multiplicative head led at 80% and 60%, the CRF led at 100% and 20%, and the baseline led at 40%. The combined additive-multiplicative head did not improve robustness in T2D and showed the largest negative baseline-relative deviations at lower fractions.

Conclusion: The findings support a biology-guided view of output-layer design. A small constrained mechanism was most useful when its symmetry matched the shared microvascular structure of T2D, whereas the heterogeneous electrophysiology of MI produced no stable winner. Output-layer choice should therefore be reported and defended as an assumption about disease structure instead of a routine hyperparameter decision.

Author summary: Many clinical prediction models treat complications as separate outcomes, even when clinicians know they often arise together. We studied whether the last layer of a model should reflect that biological knowledge. We compared several output heads across two disease settings: Type 2 diabetes, where nephropathy, neuropathy, and retinopathy share a common microvascular origin, and myocardial infarction, where electrical complications arise from a mixture of shared and location-specific mechanisms. We found that a small symmetric CRF head was most useful in the diabetes task, especially when training data were limited, while no single interaction head dominated in myocardial infarction. This suggests that modelling comorbidity is not only a technical choice; it is a statement about how disease processes relate to one another. Our results encourage researchers to report and justify output-layer design as part of the clinical modelling argument, rather than treating it as a routine hyperparameter.
]]></description>
<dc:creator><![CDATA[ Asumboya, W. A., Agbenorhevi, P. K., Adams, C. F., Ayariga, D. A., Adjadeh, T., Adams Ziblim, S., Kwofie, S. K. ]]></dc:creator>
<dc:date>2026-06-23</dc:date>
<dc:identifier>doi:10.64898/2026.06.18.733068</dc:identifier>
<dc:title><![CDATA[Comorbidity structure as an inductive bias: Comparing output-head designs for multi-label prediction of diabetes and myocardial infarction complications]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-23</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.17.733050v1?rss=1">
<title>
<![CDATA[
Learning interpretable structural similarity from tandem mass spectra for small molecule analog discovery 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.17.733050v1?rss=1
</link>
<description><![CDATA[
Analog discovery remains a central bottleneck in mass spectrometry-based untargeted metabolomics, as conventional spectral similarity scores poorly reflect molecular structure. We introduce SIMBA, a transformer-based model that infers two interpretable graph-based distances, maximum common edge subgraph and substructure edit distance, directly from tandem mass spectra. SIMBA consistently retrieves structurally closer analogs than existing methods, enabling structure-aware small molecule identification beyond exact spectral matching.
]]></description>
<dc:creator><![CDATA[ Piedrahita Giraldo, J. S., Da Silva, K. M., Zare Shahneh, M. R., Wang, M., Laukens, K., De Vijlder, T., Bittremieux, W. ]]></dc:creator>
<dc:date>2026-06-23</dc:date>
<dc:identifier>doi:10.64898/2026.06.17.733050</dc:identifier>
<dc:title><![CDATA[Learning interpretable structural similarity from tandem mass spectra for small molecule analog discovery]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-23</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.18.733146v1?rss=1">
<title>
<![CDATA[
VCBench: A Multi-Dimensional Benchmark for Single-Cell Foundation Models 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.18.733146v1?rss=1
</link>
<description><![CDATA[
Single-cell foundation models are increasingly positioned as virtual cells, yet their capabilities are assessed by fragmented, largely single-task benchmarks that obscure where these models improve on simple baselines. VCBench addresses this by synthesizing four independent virtual-cell frameworks into seven capability dimensions: perturbation response prediction, cross-species universality, gene regulatory network (GRN) inference, modality integration, temporal dynamics, multi-scale integration, and in silico experimentation. Each dimension is assessed for operational testability under current architectures and datasets: five admit direct or proxy evaluation, while multi-scale integration and in silico experimentation are structurally untestable as end-to-end tasks. We evaluate five foundation models (Geneformer, scGPT, UCE, TranscriptFormer, Arc State) against pre-registered linear and nearest-neighbor baselines across the five testable dimensions, and report three findings. First, the baselines match or exceed every foundation model on four of the five scored dimensions, replicating the reported competitiveness of linear baselines on perturbation prediction and extending it to cross-species transfer, GRN inference, and temporal ordering. Second, TranscriptFormer alone exceeds the strongest baseline on cross-modal RNA-to-protein prediction (53% Pearson improvement, with a documented contamination caveat) and is the only model to reach Level 2 in the pre-registered Virtual Cell (VC) Level rubric; the architectural choice behind this advantage simultaneously causes a spectral collapse that destroys its temporal-ordering performance, a tradeoff invisible to single-task benchmarks. Third, no foundation model publishes a complete cell-level training manifest, leaving data contamination undetectable to users. 
Alongside the benchmark, VCBench releases a Contamination Reporting Schema and contributes two further methodological tools: a common-label-set protocol that controls for class-count confounds in cross-species transfer, and a spread-error correlation probe for epistemic calibration.
]]></description>
<dc:creator><![CDATA[ Weidener, L. S., Brkic, M., Jovanovic, M., Ulgac, E., Meduri, A. ]]></dc:creator>
<dc:date>2026-06-23</dc:date>
<dc:identifier>doi:10.64898/2026.06.18.733146</dc:identifier>
<dc:title><![CDATA[VCBench: A Multi-Dimensional Benchmark for Single-Cell Foundation Models]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-23</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.18.733075v1?rss=1">
<title>
<![CDATA[
Measuring peptide-MHC generalization to unseen alleles across both HLA classes 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.18.733075v1?rss=1
</link>
<description><![CDATA[
Reported peptide-MHC (pMHC) AUROCs of 0.85-0.95 overstate generalization to unseen alleles: because immunopeptidome data are dense on a few well-studied alleles and sparse on the rest, training and test sets come to share near-identical alleles, so the numbers partly reflect interpolation rather than extrapolation to new MHC grooves. This is a property of the data, not of any one method. We assembled an open, harmonized corpus of 5.8 million experimental measurements across both HLA classes and use it to control the leakage explicitly: alleles held out at the sequence and cluster level, peptide-disjoint splits, and provenance-matched negatives. On strictly novel alleles, generalization is in the high 0.7s rather than the 0.9s a conventional split returns. Against this benchmark we trained a predictor that spans both classes in one model and factors presentation into a peptide-only ligand-likeness term and an allele-specific term; it exceeds eight published predictors by per-allele ΔAUROC = +0.22 to +0.37 (p < 10⁻⁹), most on the least-studied genes. Corpus, benchmark, and model are released.

Author summary: Our immune cells display protein fragments on the cell surface, held by molecules (the human leukocyte antigens, or HLAs) that vary from person to person. Predicting which fragments a given HLA displays matters for cancer vaccines, transplant matching, and the safety of engineered therapies, and many computational tools now do it well. Most available data come from a few common HLAs, so test cases tend to resemble training cases, and the published accuracy looks better than it really is for the rare HLAs that matter most in the clinic. We assembled a large, openly shared collection of experimental measurements across both major HLA classes and used it to test prediction more directly, holding out HLAs that are sequence-distant from those in training. Accuracy on these is measurable but lower than the usual figures suggest. We also built a predictor that handles both HLA classes in one model and gains most relative to existing tools on the rare HLAs where they are weakest. The data, benchmark, and model are available for the same test.
]]></description>
<dc:creator><![CDATA[ Mysore, V. ]]></dc:creator>
<dc:date>2026-06-23</dc:date>
<dc:identifier>doi:10.64898/2026.06.18.733075</dc:identifier>
<dc:title><![CDATA[Measuring peptide-MHC generalization to unseen alleles across both HLA classes]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-23</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.18.733061v1?rss=1">
<title>
<![CDATA[
Automated Segmentation of Prostatic Gold Fiducial Markers for MR-Only Radiotherapy Planning Using Multi-Modal Consensus Deep Learning 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.18.733061v1?rss=1
</link>
<description><![CDATA[
Purpose: To develop and evaluate a multi-model consensus deep learning approach for automated gold fiducial marker (FM) segmentation in T1-weighted prostate MRI. Materials and Methods: In this retrospective study, T1-weighted MRI and CT-derived reference standard segmentations were collected from 127 prostate cancer patients (all male; mean age, 70 years +/- 7 [standard deviation]; age range, 50-88 years; collected between October 2020 and January 2026) who each had three implanted gold FMs. A 3D U-Net was trained on 93 subjects using four random seeds to produce an ensemble. At inference, marker-class probability maps were averaged across models and the top three connected components selected. Performance was evaluated on 34 temporally held-out subjects (9 tuning, 25 test) using marker-level sensitivity and precision with exact (Clopper-Pearson) 95% confidence intervals (CIs). A model count ablation study was performed. The pipeline was deployed for on-scanner processing on Siemens MRI systems via the OpenRecon framework and as a browser-based application using WebAssembly, executing entirely client-side. Results: The four-model consensus achieved 96% (70 of 73) sensitivity and 95% (70 of 74) precision on 25 test subjects, with 29 of 34 (85%) subjects achieving perfect marker detection. Single models had a mean sensitivity of 84% (SD, 9%), improving to 96% with four-model consensus (SD, <1%). Conclusion: Multi-model consensus deep learning substantially improved FM segmentation reliability over individual models, achieving high sensitivity and precision using only routinely acquired T1-weighted MRI.
]]></description>
<dc:creator><![CDATA[ Stewart, A. W., Goodwin, J., Richardson, M., Robinson, S. D., O'Brien, K., Jin, J., Barth, M. ]]></dc:creator>
<dc:date>2026-06-23</dc:date>
<dc:identifier>doi:10.64898/2026.06.18.733061</dc:identifier>
<dc:title><![CDATA[Automated Segmentation of Prostatic Gold Fiducial Markers for MR-Only Radiotherapy Planning Using Multi-Modal Consensus Deep Learning]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-23</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.18.733122v1?rss=1">
<title>
<![CDATA[
Model-based inference of gene expression noise from single-cell RNA-sequencing data 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.18.733122v1?rss=1
</link>
<description><![CDATA[
The heterogeneity of expression levels among genetically identical cells, termed gene expression noise, is a property of the gene expression process whose importance in the biology of organisms and their evolution is increasingly recognized. Measuring gene expression noise requires single-cell expression data, as obtained from single-cell RNA sequencing (scRNASeq). Its estimation, however, is challenging owing to (i) the presence of technical noise in addition to biological noise, and (ii) the heterogeneity of cell types in the sampled population. We propose a maximum-likelihood framework to infer biological noise from scRNASeq data, while accounting for technical noise, dropout probabilities, and distinct cell sequencing depths. Using simulations, we demonstrate parameter identifiability and show that the resulting noise estimates are uncorrelated with mean gene expression and therefore need no extra correction in downstream analyses, easing intra- and inter-genome comparisons. Using two technical replicates of scRNASeq data from the wild yeast *Saccharomyces paradoxus*, we show that expression noise can be inferred in a reproducible manner.
]]></description>
<dc:creator><![CDATA[ Giersdorf, F., Rogers, D. W., Christensen, S., Dutheil, J. Y. ]]></dc:creator>
<dc:date>2026-06-23</dc:date>
<dc:identifier>doi:10.64898/2026.06.18.733122</dc:identifier>
<dc:title><![CDATA[Model-based inference of gene expression noise from single-cell RNA-sequencing data]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-23</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.17.732493v1?rss=1">
<title>
<![CDATA[
Early Tracheal and Salivary miRNAs in Extremely Preterm Infants Predict BPD-related Pulmonary Hypertension 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.17.732493v1?rss=1
</link>
<description><![CDATA[
Pulmonary hypertension associated with bronchopulmonary dysplasia (BPD-PH) in preterm infants carries high morbidity and mortality within the first two years of life. In a previous unbiased study, we identified a panel of miRNAs in tracheal aspirates (TA) that were differentially expressed in extremely low gestational age newborns (ELGANs) with BPD-PH compared to those with bronchopulmonary dysplasia (BPD) but no PH. To explore the predictive potential of these miRNAs, we studied TA exosomes from 7-day-old ELGANs, analysed a curated panel of 16 miRNAs through logistic regression, and calculated the predictive AUROC for diagnosing BPD-PH at 36 weeks postmenstrual age (PMA). The AUROC of TA miRNAs was 0.76, with sensitivity and specificity of 53% and 93%, respectively. Adding sex and gestational age to the variables improved the AUROC to 0.78, with sensitivity and specificity of 61% and 87%, respectively. Because TA is difficult to obtain from non-invasively ventilated infants, we collected saliva samples from ELGANs at 7 days of age, compared the log expression of these 16 miRNAs in both biofluids, and found a significant correlation in their expression (Pearson r = 0.92, p < 0.001). We then calculated the predictive AUROC of the same miRNAs for diagnosing BPD-PH at 36 weeks PMA. The AUROC of these miRNAs in saliva was 0.85, with sensitivity and specificity of 82% and 72%, respectively; adding biological sex and gestational age improved the AUROC to 0.86, with sensitivity and specificity of 79% and 76%, respectively. Leave-one-sample-out sensitivity analysis demonstrated stable training performance with reduced performance in testing samples, supporting the need for validation in larger independent cohorts. In conclusion, early salivary miRNAs have great potential for stratifying ELGANs by risk of developing BPD-PH, while also providing the opportunity to identify target molecules and mechanisms that modulate molecular function.
]]></description>
<dc:creator><![CDATA[ Li, T., Zhang, S., Aluquin, V., Donnelly, A., Stephens, H., Sharma, S., Hicks, S. D., Liu, D., Austin, E., Siddaiah, R. ]]></dc:creator>
<dc:date>2026-06-23</dc:date>
<dc:identifier>doi:10.64898/2026.06.17.732493</dc:identifier>
<dc:title><![CDATA[Early Tracheal and Salivary miRNAs in Extremely Preterm Infants Predict BPD-related Pulmonary Hypertension]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-23</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.18.733163v1?rss=1">
<title>
<![CDATA[
CellOS: Learning a World Model of Cellular State through Joint Embedding Prediction 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.18.733163v1?rss=1
</link>
<description><![CDATA[
Foundation models learned from single-cell transcriptomes are central to the prospect of an AI virtual cell that can represent, query and predict cellular state. However, most current single-cell foundation models learn from a single view of gene expression and are optimized primarily through reconstruction or next-token prediction. As a result, they capture expression abundance but cannot explicitly reconcile complementary views of cellular state. Here we present CellOS, a multi-view foundation model that learns cellular representations from paired expression and perception views. CellOS integrates complementary views through a scalable three-stage training strategy that combines causal cell-sentence language modelling, function-preserving dense-to-mixture-of-experts expansion and latent-space alignment via an LLM-JEPA objective. Using this framework, we trained a 12-billion-parameter model on 390.5 million single-cell transcriptomes. Across diverse benchmarks spanning cell-state annotation, batch integration and perturbation-response prediction, CellOS consistently outperformed state-of-the-art single-cell foundation models in cell-state annotation and perturbation-response prediction while preserving robust batch integration. Together, these results suggest that predictive alignment between complementary cellular views provides a scalable path toward representation-centric cellular world models and transferable AI virtual cells.
]]></description>
<dc:creator><![CDATA[ Zhou, Q., Le, Y., Qi, X., Chang, S., Lu, H., Wu, Y., Wang, H., Ran, R., Li, X. ]]></dc:creator>
<dc:date>2026-06-23</dc:date>
<dc:identifier>doi:10.64898/2026.06.18.733163</dc:identifier>
<dc:title><![CDATA[CellOS: Learning a World Model of Cellular State through Joint Embedding Prediction]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-23</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.18.733285v1?rss=1">
<title>
<![CDATA[
Systematic benchmarking of zero-shot utility and robustness in single-cell transcriptomic foundation models 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.18.733285v1?rss=1
</link>
<description><![CDATA[
Single-cell foundation models (scFMs) have been proposed as reusable representations for transcriptomic analysis, yet their practical utility and robustness when applied without task-specific fine-tuning remain incompletely characterized. Here, we systematically evaluated single-cell transcriptomic representations in zero-shot settings across 20 methods, 6 downstream tasks and 1,607 datasets comprising nearly 21.8 million cells. We characterized model behavior along three complementary dimensions: baseline utility, structural robustness, and dataset-level drivers of performance variability. Our large-scale analysis reveals a decoupling between utility and robustness: methods ranking highly on standard benchmarks often show marked instability under shifts in dataset structure. Furthermore, no single model performs uniformly well across tasks. In several tasks, classical statistical representations based on highly variable genes remain competitive under zero-shot conditions. Together, these results define the practical boundaries of zero-shot use in scFMs and provide a large-scale benchmark and decision framework for representation selection in single-cell genomics.
]]></description>
<dc:creator><![CDATA[ Liu, T., Feng, T., Pan, X., Chen, Y., Ren, L., Ye, X., Sakurai, T., Lin, H., Zhang, Y. ]]></dc:creator>
<dc:date>2026-06-23</dc:date>
<dc:identifier>doi:10.64898/2026.06.18.733285</dc:identifier>
<dc:title><![CDATA[Systematic benchmarking of zero-shot utility and robustness in single-cell transcriptomic foundation models]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-23</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.10.731380v1?rss=1">
<title>
<![CDATA[
biomeStat: Using Agentic AI for Scalable Genomic Epidemiology Demonstrated Through End-to-End Analysis of 1,000 Asian Dengue Virus Genomes 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.10.731380v1?rss=1
</link>
<description><![CDATA[
Genomic epidemiology workflows typically require expert curation of multiple specialized tools, extensive manual parameter tuning, and access to heterogeneous compute infrastructure. Standard generative AI models often hallucinate in complex biological domains; we introduce biomeStat, an autonomous AI agent that instead functions as a strict deterministic orchestrator. By automatically writing code to execute established bioinformatics tools in sandboxed environments, biomeStat dynamically provisions compute resources (CPU and GPU) and guarantees reproducibility, making it immediately useful for scientists without requiring command-line expertise.

To demonstrate the platform, we performed a fully autonomous genomic epidemiology and structural analysis of 1,000 Dengue virus (DENV) genomes sampled from 16 Asian countries between 2000 and 2025. The agent seamlessly orchestrated phylogenetic reconstruction (IQ-TREE, TreeTime), Bayesian phylodynamics (BEAST2 via NVIDIA H200 GPU), selection pressure analysis (HyPhy), and structural mapping (PyMOL). The analysis was completed in under 24 hours of wall-clock time, revealing endemic stability (R_e ≈ 1.0) and identifying 1,869 candidate immune escape sites structurally colocalized with B-cell and T-cell epitopes. Furthermore, the agent validated 176 highly conserved drug target residues across the viral replication complex, confirming that resistance-associated positions for emerging antivirals JNJ-1802 and NITD-688 remain absolutely conserved across all four serotypes. By bridging the gap between natural language intent and deterministic computational execution, biomeStat reduces weeks of expert effort to a single-session analysis with full methodological transparency.
]]></description>
<dc:creator><![CDATA[ Ariyaratne, D., Somaratna, N., Malavige, G. N. ]]></dc:creator>
<dc:date>2026-06-23</dc:date>
<dc:identifier>doi:10.64898/2026.06.10.731380</dc:identifier>
<dc:title><![CDATA[biomeStat: Using Agentic AI for Scalable Genomic Epidemiology Demonstrated Through End-to-End Analysis of 1,000 Asian Dengue Virus Genomes]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-23</prism:publicationDate>
<prism:section></prism:section>
</item>
</rdf:RDF>
