<?xml version="1.0" encoding="UTF-8" ?>
<rdf:RDF xmlns:admin="http://webns.net/mvcb/" xmlns="http://purl.org/rss/1.0/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:prism="http://purl.org/rss/1.0/modules/prism/" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:syn="http://purl.org/rss/1.0/modules/syndication/">
<channel rdf:about="https://biorxiv.org">
<admin:errorReportsTo rdf:resource="mailto:biorxiv@cshlpress.edu"/>
<title>bioRxiv Subject Collection: Genomics Bioinformatics</title>
<link>https://biorxiv.org</link>
<description>
This feed contains articles for bioRxiv Subject Collection "Genomics Bioinformatics"
</description>

<items>
<rdf:Seq>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.07.730716v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.07.730684v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.10.731457v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.07.730473v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.07.730754v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.08.730896v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.09.728211v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.08.730685v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.09.731232v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.08.731013v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.10.731298v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.06.730646v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.07.729267v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.06.729466v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.07.730662v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.06.730554v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.06.730612v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.05.730273v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.06.730578v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.06.730607v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.06.730569v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.06.730616v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.04.730260v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.08.730821v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.05.730453v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.05.730479v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.05.730086v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.04.730246v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.03.729980v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.06.02.729719v1?rss=1"/>
</rdf:Seq>
</items>
<prism:eIssn/>
<prism:publicationName>bioRxiv</prism:publicationName>
<prism:issn/>

<image rdf:resource=""/>
</channel>
<image rdf:about="">
<title>bioRxiv</title>
<url>https://www.biorxiv.org/sites/default/files/bioRxiv_article.jpg</url>
<link>https://www.biorxiv.org</link>
</image>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.07.730716v1?rss=1">
<title>
<![CDATA[
Combinatorial docking and molecular generation to navigate over 100-billion molecules for prospective ligand discovery 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.07.730716v1?rss=1
</link>
<description><![CDATA[
Commercially available make-on-demand libraries now exceed 100 billion compounds, requiring over 50 years to screen on 2,000 CPU cores using conventional docking. We present two complementary approaches to address this challenge. CombiDOCK, a combinatorial docking framework, enables exhaustive screening at the 100-billion scale within 40 days. MINT-Dock, a generative framework, accelerates navigation of this space by integrating CombiDOCK with Monte Carlo Tree Search. Benchmarked on 46 diverse targets, CombiDOCK matched full-molecule docking accuracy, and MINT-Dock achieved a 4,800-fold enrichment over random selection. Compared with prior billion-scale brute-force campaigns against σ2, VMAT2, and VAChT, prospective CombiDOCK screens of the 100-billion-molecule library yielded higher hit rates and more potent ligands, while MINT-Dock achieved comparable outcomes across single- and multi-target objectives with >20-fold computational cost reductions. Docking-predicted poses of the best VAChT-binding compounds were confirmed by cryo-EM structures. These methods provide exhaustive and generative paths for navigating the trillion-molecule frontier of drug discovery.
]]></description>
<dc:creator><![CDATA[ Zhang, J., Yang, C., Zhang, Y., Chen, X., Lam, B., Bryant, C., Pidathala, S., Wang, Y., Moroz, Y., Radchenko, D., Alon, A., Lee, C.-H., Zhang, Z., Lyu, J. ]]></dc:creator>
<dc:date>2026-06-11</dc:date>
<dc:identifier>doi:10.64898/2026.06.07.730716</dc:identifier>
<dc:title><![CDATA[Combinatorial docking and molecular generation to navigate over 100-billion molecules for prospective ligand discovery]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-11</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.07.730684v1?rss=1">
<title>
<![CDATA[
HoloCell: A Generative Foundation Model for Holistic Cellular Modeling 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.07.730684v1?rss=1
</link>
<description><![CDATA[
Single-cell multi-omics technologies have recently advanced to enable the profiling of epigenomic, transcriptomic, and proteomic layers within individual cells, offering new opportunities to characterize cellular states as integrated biological systems. However, developing a unified framework that can seamlessly integrate diverse omics modalities and remain robust to heterogeneous modality missingness remains challenging. Here we present HoloCell, to our knowledge the first generative foundation model for joint representation learning and generative modeling across all three major single-cell omics modalities, i.e., epigenomics, transcriptomics, and proteomics. HoloCell contains over 860 million parameters and is pretrained on the Human-Multi-Omics-Corpus, which comprises approximately 468 million single-cell profiles across these three omics layers, corresponding to over 425 billion tokens. HoloCell introduces a simple yet biologically grounded hierarchical tokenization strategy that encodes cis-regulatory elements, genes, and proteins as structured tokens within a shared modeling framework. We evaluated HoloCell across single-omics representation learning, paired multi-omics integration, unpaired multi-omics alignment, and cross-modal generation via iterative diffusion and remasking, demonstrating its superior performance and flexibility across diverse omics tasks. From a representation perspective, HoloCell provides a unified digital mapping of cellular states across multiple omics layers, capturing cell heterogeneity as an integrated system. From a generation perspective, its iterative diffusion and remasking framework accounts for the inherently unordered nature of biological features, enabling in silico simulation of multi-omics information flow. 
Together, these capabilities position HoloCell as a versatile foundation model toward the emerging concept of a virtual cell, offering both systematic characterization and generative simulation of cellular systems within a unified framework.
]]></description>
<dc:creator><![CDATA[ Jiang, Q., Li, Z., Hu, B., Bie, Y., Li, K., Li, Q., Jin, P., He, Y., Deng, P., Wang, Z., Chen, X., Qin, T., Liu, H., Jiang, R., Yin, Q. ]]></dc:creator>
<dc:date>2026-06-11</dc:date>
<dc:identifier>doi:10.64898/2026.06.07.730684</dc:identifier>
<dc:title><![CDATA[HoloCell: A Generative Foundation Model for Holistic Cellular Modeling]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-11</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.10.731457v1?rss=1">
<title>
<![CDATA[
Liquidity of gene co-expression trajectories across the lifespan highlights delayed maturation and the perinatal GABA switch in schizophrenia risk 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.10.731457v1?rss=1
</link>
<description><![CDATA[
Genetic and environmental risk factors for schizophrenia play out largely in early life, biasing development toward a pathogenic trajectory that becomes clinically apparent in early adulthood, when the disorder typically begins. Here, we convert snapshots of gene expression in postmortem brain at a moment in time into a dynamic lifetime series employing the "liquidity" metric, a novel tool to track the evolution of networks across time from a multi-systemic perspective. The landscape of normal prefrontal cortical development becomes increasingly "liquid" during the first two decades of post-natal life, with sharp discontinuities in known critical periods such as birth and adolescence. Neurotypical individuals free of apparent neuropathology with relatively elevated polygenic risk scores for schizophrenia exhibit a generalized delay in the dynamics of liquidity across these trajectories compared to below-average-risk individuals. Impacted biological processes strongly converge on delayed GABA-A receptor functional maturation, involved in establishing excitatory/inhibitory balance in the brain during early development. Similar to patients with schizophrenia, neurotypical high-risk individuals show an increased expression ratio between the genes SLC12A2 (protein NKCC1) and SLC12A5 (protein KCC2) relative to low-risk individuals; these genes are involved in the control of the equilibrium potential of chloride ions that regulates GABA-A function. These results provide evidence that genetic risk for schizophrenia is associated with a delayed maturational profile and delayed maturation of GABAergic signaling without detectable neuropathology and well before the age of clinical onset. Interestingly, the same effect is not observed in the hippocampus and is not observed with genetic risk for other neuropsychiatric and immune conditions. 
The dynamics of maturation of GABA-A signaling in the dorsolateral prefrontal cortex emerge as an early contributor to translating genetic risk into an altered developmental trajectory associated with schizophrenia.
]]></description>
<dc:creator><![CDATA[ Bellantuono, L., Di Camillo, F., Borcuk, C., Kikidis, G. C., Kleinman, J. E., Parihar, M., Hyde, T. M., Weinberger, D. R., Pergola, G. ]]></dc:creator>
<dc:date>2026-06-11</dc:date>
<dc:identifier>doi:10.64898/2026.06.10.731457</dc:identifier>
<dc:title><![CDATA[Liquidity of gene co-expression trajectories across the lifespan highlights delayed maturation and the perinatal GABA switch in schizophrenia risk]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-11</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.07.730473v1?rss=1">
<title>
<![CDATA[
Sequence-Based Therapeutic Peptide Classification with Augmented Negative Sampling 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.07.730473v1?rss=1
</link>
<description><![CDATA[
Therapeutic peptides offer high target specificity, low toxicity, and the ability to modulate protein-protein interactions, yet experimental functional characterization remains costly and slow. Computational prediction of therapeutic function directly from sequence could accelerate peptide screening and enable generative design pipelines, but requires reliable discrimination between therapeutic and non-therapeutic peptides. Existing multi-label predictors cover few functions, rely on limited datasets, and exhibit high false positive rates (FPRs), limiting their practical utility. We present a lightweight CNN classifier trained on the most comprehensive therapeutic peptide database to date (54,655 peptides, 48 functional categories). A key contribution is a statistically motivated negative sampling strategy using Markov models to generate diverse synthetic decoys at multiple difficulty levels. When evaluated on this controlled decoy benchmark, the FPR is reduced from over 60% for previous models to 2.1% for our approach. Our fine-tuned five-model ensemble achieves 78.9% Micro F1 and 54.6% Macro F1 while requiring only amino acid sequences as inputs. Analysis using a sparse L1-constrained variant of our model shows that convolutional filters capture conserved functional motifs and statistically improbable non-therapeutic patterns, with downstream layers combining these signals, providing mechanistic evidence that the network learns biologically meaningful structure. In a generalization task on the TPpred-LE benchmark, our model achieves 55.3% Micro F1 and 38.6% Macro F1, comparable to TPpred-LE trained on its native dataset (57.9%/38.1%) while predicting four times more therapeutic functions with four times fewer parameters. Code and models will be made available at https://github.com/terra-quantum-public/tq-therapep-ai.
]]></description>
<dc:creator><![CDATA[ Ellerbrock, R., Valentini, A., Paul, A. C., Mukhopadhyay, S., Perelshtein, M. R. ]]></dc:creator>
<dc:date>2026-06-11</dc:date>
<dc:identifier>doi:10.64898/2026.06.07.730473</dc:identifier>
<dc:title><![CDATA[Sequence-Based Therapeutic Peptide Classification with Augmented Negative Sampling]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-11</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.07.730754v1?rss=1">
<title>
<![CDATA[
Robust semi-supervised scRNA-seq integration from virtual adversarial learning 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.07.730754v1?rss=1
</link>
<description><![CDATA[
Single-cell RNA sequencing integration methods that rely solely on transcriptomic data often struggle to preserve fine-grained distinctions between closely related cell subtypes. As a result, cell populations that are separable in the raw data may become over-mixed after integration, reducing biological resolution and interpretability. Incorporating marker gene information can potentially address these issues; however, the variability and complexity of available marker sets limit their effective application. To address this, we introduce scCRAFT+, a semi-supervised integration model that innovatively incorporates marker gene information through Virtual Adversarial Training (VAT). By jointly optimizing marker-derived supervision and transcriptome-wide representations, VAT enforces local prediction smoothness among transcriptionally similar cells, improving robustness to noisy marker annotations while enhancing both integration quality and cell type auto-annotation. This targeted approach significantly enhances annotation accuracy and robustness, particularly when faced with incomplete or incorrect marker gene sets. Benchmarking shows that scCRAFT+ achieves consistently stronger performance than current unsupervised and supervised integration approaches, resulting in improved integration quality and biologically meaningful sub-cell type auto-annotations.
]]></description>
<dc:creator><![CDATA[ He, C., Filippidis, P., Xing, J., Kleinstein, S., Guan, L. ]]></dc:creator>
<dc:date>2026-06-11</dc:date>
<dc:identifier>doi:10.64898/2026.06.07.730754</dc:identifier>
<dc:title><![CDATA[Robust semi-supervised scRNA-seq integration from virtual adversarial learning]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-11</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.08.730896v1?rss=1">
<title>
<![CDATA[
HOMED enables hierarchical and multimodal optimization of DNA methylation deconvolution across tissues 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.08.730896v1?rss=1
</link>
<description><![CDATA[
Cellular heterogeneity is a major confounder in bulk DNA methylation data for epigenome-wide association studies. Existing reference-based DNAm deconvolution methods often ignore hierarchies among related cell types and may generalize poorly across datasets due to limited variability in reference profiles. We developed HOMED (Hierarchically Optimized Methylation Deconvolution), a framework that integrates cell-lineage hierarchies, single-cell RNA sequencing-guided deconvolution, and paired bulk RNA-seq/DNAm data for CpG signature optimization. Across simulated and real peripheral blood mononuclear cell, lung, and placental datasets, HOMED consistently yielded the highest PCCs and lowest RMSEs, outperforming existing scRNA-seq-guided DNAm deconvolution methods, improving accuracy, resolution, and cross-tissue generalizability.
]]></description>
<dc:creator><![CDATA[ Liu, Y., Chen, Y., Du, Y., Garmire, L. ]]></dc:creator>
<dc:date>2026-06-10</dc:date>
<dc:identifier>doi:10.64898/2026.06.08.730896</dc:identifier>
<dc:title><![CDATA[HOMED enables hierarchical and multimodal optimization of DNA methylation deconvolution across tissues]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-10</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.09.728211v1?rss=1">
<title>
<![CDATA[
Folding the unfoldable 2: using AlphaFold and ESMFold to explore spurious proteins 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.09.728211v1?rss=1
</link>
<description><![CDATA[
Motivation: Spurious protein sequences, resulting from gene prediction errors, theoretically should not yield folded structures. AlphaFold2 was previously shown to predict short spurious sequences with high pLDDT scores and was therefore unlikely to distinguish between real proteins and spurious proteins, which are usually short. We evaluate whether newer structure prediction methods (ESMFold and AlphaFold3) similarly predict short sequences with high pLDDT or whether they better discriminate between spurious and real proteins. Results: All three structure prediction methods (ESMFold, AlphaFold2, and AlphaFold3) predict short spurious sequences from AntiFam with unexpectedly high pLDDT scores; however, the discrimination between spurious and real proteins improves beyond 100 amino acids. By analysing sequences with disparate pTM and pLDDT scores, we identified two likely spurious shadow ORFs in Swiss-Prot and one potentially non-spurious AntiFam entry. Using the structure prediction scores, we developed a Gaussian Process Model and evaluated its performance on AlphaFold DB, identifying potential spurious proteins at scale. While limited on its own, this model can increase confidence in spurious protein identification when combined with other methods.
]]></description>
<dc:creator><![CDATA[ Orr, A. K., Bateman, A. ]]></dc:creator>
<dc:date>2026-06-10</dc:date>
<dc:identifier>doi:10.64898/2026.06.09.728211</dc:identifier>
<dc:title><![CDATA[Folding the unfoldable 2: using AlphaFold and ESMFold to explore spurious proteins]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-10</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.08.730685v1?rss=1">
<title>
<![CDATA[
Pseudoperplexity Probes Memorization in Protein Language Models 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.08.730685v1?rss=1
</link>
<description><![CDATA[
Protein Language Models (pLMs) have significantly advanced computational biology. Yet their scale and reliance on redundant training data raise a fundamental question: do pLMs generalize the statistical grammar of proteins, or do they simply memorize their training data? To investigate this, we used pseudoperplexity as a probe for sequence-level memorization, comparing ProtT5's pseudoperplexity on a pre-training proxy dataset against a post-training holdout of genuinely novel sequences. To ensure a valid comparison, we matched the datasets by sequence length, cluster size, and taxonomic family. As a statistical baseline, we trained n-gram language models; analysis of higher-order n-gram composition and a statistically significant divergence in perplexity confirmed that the post-training sequences were genuinely novel at the local sequence level. ProtT5 showed a statistically significant difference in pseudoperplexity between seen and unseen sequences, though further analysis revealed this memorization signal to be modest. These findings suggest that ProtT5 exhibits detectable but limited memorization of its training data as measured by a pseudoperplexity-based probe.
]]></description>
<dc:creator><![CDATA[ Plaikner, A., Ploner, M., Sewald, Z., Senoner, T., Franz, S., Brenner, M., Heinzinger, M., Rost, B. ]]></dc:creator>
<dc:date>2026-06-10</dc:date>
<dc:identifier>doi:10.64898/2026.06.08.730685</dc:identifier>
<dc:title><![CDATA[Pseudoperplexity Probes Memorization in Protein Language Models]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-10</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.09.731232v1?rss=1">
<title>
<![CDATA[
Interspecies variation of 45S ribosomal DNA in vertebrates 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.09.731232v1?rss=1
</link>
<description><![CDATA[
Ribosomal DNA (rDNA) remains one of the least resolved components of complex genomes despite its central role in cellular function and evolutionary inference. Using chromosome-scale assemblies from the Vertebrate Genomes Project, we present a comparative analysis of the 45S rDNA locus across vertebrates. rDNA organization varies widely among clades, with lineage-specific differences in chromosomal distribution, copy number, and positional bias. Mammals and birds typically restrict rDNA to few, often terminal loci, whereas amphibians and several fish lineages exhibit highly dispersed architectures, including extreme cases spanning dozens of chromosomes. Copy number varies by more than an order of magnitude and does not scale simply with repeat-unit size or predicted functional demand, consistent with rDNA functioning as a flexible genomic reservoir rather than a direct proxy for ribosome production. At the sequence level, the 45S unit resolves into distinct evolutionary regimes. Genes encoding rRNAs (18S, 5.8S, 28S) remain highly conserved and retain strong phylogenetic signal, whereas internal transcribed spacers show rapid divergence, lineage-specific expansion, and pronounced asymmetry between intergenic spacers ITS1 and ITS2. Sequence-structure analyses reveal a gradient of constraint across the locus, from tightly coupled evolution in rRNA-encoding regions to increasingly permissive regimes in spacers. Structure-informed phylogenetic inference yields modest but consistent improvements under high divergence, highlighting its value when sequence signal degrades. Together, these results establish the 45S rDNA locus as a multi-layered evolutionary system integrating deep conservation with structural plasticity across vertebrates.
]]></description>
<dc:creator><![CDATA[ Kang, L., Formenti, G., O'Connor, T., Michalak, P. ]]></dc:creator>
<dc:date>2026-06-10</dc:date>
<dc:identifier>doi:10.64898/2026.06.09.731232</dc:identifier>
<dc:title><![CDATA[Interspecies variation of 45S ribosomal DNA in vertebrates]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-10</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.08.731013v1?rss=1">
<title>
<![CDATA[
A scalable MNase-seq framework for reproducible nucleosome profiling across pluripotent stem cell and cardiomyocyte models 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.08.731013v1?rss=1
</link>
<description><![CDATA[
Micrococcal nuclease (MNase) digestion is widely used to profile chromatin accessibility and nucleosome footprinting. However, its application is often limited by sensitivity to reaction conditions, high cell input requirements, and the lack of standardized protocols across cell types. Here we developed a robust MNase workflow encompassing buffer composition, DNA purification chemistry, fixation and decrosslinking parameters, cell input scalability, and an in-house yeast spike-in for quantitative normalization. We validated this unified framework across human induced pluripotent stem cells (hiPSCs), hiPSC-derived cardiomyocytes at multiple differentiation stages, primary murine embryonic cardiac cells, and adult mouse cardiomyocytes, and demonstrated comparable digestion efficiencies and kinetics despite marked differences in cellular architecture and chromatin organization. Genome-wide MNase-seq in hiPSCs, combined with the nucMACC bioinformatic pipeline, resolved concentration-dependent nucleosomal occupancy and precise nucleosome positioning at pluripotency-related regulatory elements. This modular, end-to-end, and scalable workflow provides a standardized platform for reproducible MNase-based chromatin profiling across diverse in vitro and in vivo models.
]]></description>
<dc:creator><![CDATA[ Thekkedam, C., Humphreys, D. T., Naval-Sanchez, M., Nicks, A. M., Harvey, R. P., Contreras, O. ]]></dc:creator>
<dc:date>2026-06-10</dc:date>
<dc:identifier>doi:10.64898/2026.06.08.731013</dc:identifier>
<dc:title><![CDATA[A scalable MNase-seq framework for reproducible nucleosome profiling across pluripotent stem cell and cardiomyocyte models]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-10</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.10.731298v1?rss=1">
<title>
<![CDATA[
Metagenomic prediction of methane emissions in sheep using single- and multi-matrix BLUP models with taxonomic and functional microbial features 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.10.731298v1?rss=1
</link>
<description><![CDATA[
Background: Enteric methane emissions from ruminant livestock represent a major greenhouse gas contributor, yet identification of high- and low-emitting ruminants remains expensive and logistically challenging for agricultural methane mitigation strategies. Ruminal microbial profiles derived from long-read sequencing technology provide a potential proxy to predict methane production. The optimal bioinformatic pipelines for processing long-read metagenomic data to perform methane predictions have yet to be determined. Here we evaluated how different metagenomic analysis pipelines affect methane predictive model accuracy in grazing sheep. Results: We applied three bioinformatic pipelines to characterize the taxonomic and functional features of rumen microbiomes from 396 sheep. Functional abundance features were annotated from Clusters of Orthologous Genes (COG) or Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways. The single-matrix model using COG features achieved the highest microbiability (m^2 = 0.942: proportion of variance component explained by microbial features) and predictive accuracy (5-fold cross-validation r = 0.609: Pearson's correlation between predicted and observed values). Both functional features outperformed all taxonomic features across all three pipelines in predictive accuracy. The multi-matrix models combining functional and taxonomic features slightly improved methane predictive accuracy across both 5-fold cross-validation and leave-one-day-out validation compared to the models using functional features alone. Conclusions: These findings demonstrate the potential advantages of using long-read metagenomic data to predict enteric methane emissions in ruminants. COG-based functional features achieved the highest predictive accuracy among all feature types, suggesting that functional annotation of existing long-read sequences is sufficient for accurate methane prediction without requiring complementary taxonomic data.
]]></description>
<dc:creator><![CDATA[ Li, Y., Ong, C. T., Yadav, S., Aldridge, M., Fitzgerald, P., van der Werf, J., Nguyen, L. T., Ross, E. M. ]]></dc:creator>
<dc:date>2026-06-10</dc:date>
<dc:identifier>doi:10.64898/2026.06.10.731298</dc:identifier>
<dc:title><![CDATA[Metagenomic prediction of methane emissions in sheep using single- and multi-matrix BLUP models with taxonomic and functional microbial features]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-10</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.06.730646v1?rss=1">
<title>
<![CDATA[
GEOAgent: An AI-driven Autonomous Framework for Intelligent GEO Data Retrieval and Standardized Preprocessing 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.06.730646v1?rss=1
</link>
<description><![CDATA[
Datasets in the Gene Expression Omnibus (GEO) remain difficult to reuse at scale because sample annotations are heterogeneous and raw sequencing data require assay-specific preprocessing. We present GEOAgent, an AI-driven autonomous framework designed for intelligent dataset retrieval and standardized preprocessing by coupling autonomous semantic governance with an automated Nextflow pipeline named bioStream. Metadata from 181,760 sequencing series and 84,756 associated PubMed records were organized in a relational database and semantic index to support natural-language dataset retrieval. The framework automatically determines assay modalities, resolves experimental design pairings, and standardizes sample naming to minimize manual curation overhead. Based on these parsed attributes, the framework generates deployment-ready manifests to automatically execute containerized workflows across bulk and single-cell omics modalities. In expert-curated benchmarks, the workflow achieved 96% retrieval precision alongside 100% accuracy in assay classification and sample relationship resolution. The web platform is publicly accessible, while the source code and associated databases are openly available via GitHub and Zenodo.
]]></description>
<dc:creator><![CDATA[ Zhao, Y., Cai, Q., Chen, D., Chen, J. ]]></dc:creator>
<dc:date>2026-06-10</dc:date>
<dc:identifier>doi:10.64898/2026.06.06.730646</dc:identifier>
<dc:title><![CDATA[GEOAgent: An AI-driven Autonomous Framework for Intelligent GEO Data Retrieval and Standardized Preprocessing]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-10</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.07.729267v1?rss=1">
<title>
<![CDATA[
Promera: a unified model for biomolecular structure prediction, filtering, and design 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.07.729267v1?rss=1
</link>
<description><![CDATA[
Generative models have become staple tools for modeling and designing biomolecular structures. However, although these tools have improved in structural prediction accuracy, their ability to filter designed binders, an essential use case, remains insufficient, and design methods have focused more on unconstrained binder generation than on capabilities enabled by controllable design. We introduce Promera, a unified generative model that combines all-atom structure prediction with improved filtering and controllable design. We find that Promera's confidence metrics are more accurate for filtering binders from non-binders for both miniproteins and nanobodies, while its co-folding performance surpasses popular open-source models (OpenFold3-p2, Boltz-2) on therapeutically relevant categories. As a design model, Promera generates binders by predicting masked protein sequences with optional epitope, paratope, and template constraints. Remarkably, our nanobody designs match the in silico success rates from backprop-based techniques (mBER) when evaluated under co-folding confidence filters. We further provide two in silico demonstrations of the versatile capabilities of our design method: epitope targeting of the Andes hantavirus glycoprotein with VHHs and active state stabilization of the beta-2 adrenergic GPCR. We conclude by proposing a scaling law for co-folding models, suggesting a path for further performance improvement.
]]></description>
<dc:creator><![CDATA[ Jing, B., Bafna, M., Diaz, D. J., Klivans, A. R., Berger, B. ]]></dc:creator>
<dc:date>2026-06-10</dc:date>
<dc:identifier>doi:10.64898/2026.06.07.729267</dc:identifier>
<dc:title><![CDATA[Promera: a unified model for biomolecular structure prediction, filtering, and design]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-10</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.06.729466v1?rss=1">
<title>
<![CDATA[
When batch correction corrupts gene expression: uncovering distortions in correlation structures 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.06.729466v1?rss=1
</link>
<description><![CDATA[
Batch correction is essential for integrating datasets and enabling population-level insights into health and disease. Embedding-based approaches are among the most widely used solutions, but here we highlight a critical, overlooked limitation: these methods can distort feature-to-feature (e.g., gene-gene) relationships, potentially undermining downstream analyses. We investigate this issue and introduce a novel metric to quantify it.
]]></description>
<dc:creator><![CDATA[ Nourisa, J., Passemiers, A., Moreau, Y., Raimondi, D. ]]></dc:creator>
<dc:date>2026-06-10</dc:date>
<dc:identifier>doi:10.64898/2026.06.06.729466</dc:identifier>
<dc:title><![CDATA[When batch correction corrupts gene expression: uncovering distortions in correlation structures]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-10</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.07.730662v1?rss=1">
<title>
<![CDATA[
A Unified Spatial AI Framework for Cross-Domain Tissue-State Analysis in Trauma, Oral, and Cardiovascular Pathology 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.07.730662v1?rss=1
</link>
<description><![CDATA[
Objective: To develop a cross-domain spatial AI framework for identifying conserved tissue-state organisation across trauma, oral disease, and cardiovascular tissue using spatial transcriptomic data. Methods: Four public spatial transcriptomic datasets spanning wound healing, periodontitis, oral squamous cell carcinoma, and cardiac tissue were integrated using recurrence modelling, graph-based spatial learning, fuzzy tissue-state analysis, and tensor decomposition. Cross-domain coupling, spatial fragmentation, recurrence structure, and permutation-based topological validation were evaluated. Results: Six conserved fuzzy tissue states were identified, dominated by extracellular matrix remodelling, fibroblast/stromal activation, endothelial signalling, and inflammatory pathways. Latent embedding analysis demonstrated strong overlap between trauma and oral domains, while cardiovascular tissue exhibited more compact spatial organisation. Oral inflammatory tissue showed the highest fragmentation, whereas cardiovascular tissue demonstrated greater recurrence coherence. Tensor decomposition identified conserved stromal-remodelling programmes across domains. Permutation testing confirmed significantly elevated graph modularity and reduced spatial entropy relative to null distributions. Conclusion: The proposed framework identified conserved spatial tissue-state architecture linking wound healing, oral pathology, and cardiovascular tissue despite differences in tissue origin, pathology, and acquisition technology. Significance: These findings demonstrate the potential of spatial AI for investigating conserved stromal and inflammatory microenvironmental organisation across clinically related disease systems and may support spatial biology research in trauma-oral-systemic health.
]]></description>
<dc:creator><![CDATA[ Pham, T. D. ]]></dc:creator>
<dc:date>2026-06-10</dc:date>
<dc:identifier>doi:10.64898/2026.06.07.730662</dc:identifier>
<dc:title><![CDATA[A Unified Spatial AI Framework for Cross-Domain Tissue-State Analysis in Trauma, Oral, and Cardiovascular Pathology]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-10</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.06.730554v1?rss=1">
<title>
<![CDATA[
APOSM: Pairwise preference learning improves generative small-molecule design 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.06.730554v1?rss=1
</link>
<description><![CDATA[
Small-molecule lead refinement is constrained by the cost of synthesizing and assaying candidates, making the surrogate models that prioritize compounds for experimental testing central to the design process. The reliability of such surrogates is limited by the noise and sparsity of screening measurements. We show that training the surrogate on pairwise comparisons between candidate molecules, rather than on absolute predicted scores, yields a substantially more reliable signal for active candidate selection in this regime. We develop APOSM, an active-learning algorithm that combines a fragment-based generator, a pairwise message-passing graph neural network surrogate, and probabilistic ranking inside a batched acquisition loop. On the Practical Molecular Optimization benchmark and a GPCR ligand rediscovery task, APOSM improves target attainment and sampling efficiency over unguided fragment-based optimization, the Graph-GA genetic algorithm, and a pointwise-regression ablation, with the largest gains on tasks where absolute scores are hardest to calibrate.
]]></description>
<dc:creator><![CDATA[ Dreisler, M. W., Michael, R., Hatzakis, N. S., Boomsma, W. ]]></dc:creator>
<dc:date>2026-06-10</dc:date>
<dc:identifier>doi:10.64898/2026.06.06.730554</dc:identifier>
<dc:title><![CDATA[APOSM: Pairwise preference learning improves generative small-molecule design]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-10</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.06.730612v1?rss=1">
<title>
<![CDATA[
RERconverge Update: Runtime Reduction and Analysis Function Overhaul 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.06.730612v1?rss=1
</link>
<description><![CDATA[
Motivation: Convergent evolution, or the independent acquisition of similar phenotypes in distinct lineages, provides a powerful framework for investigating genomic changes associated with a phenotype. This paper details an update to RERconverge, an R package that tests for associations between gene relative evolutionary rates (RERs) and convergent phenotypes to infer genomic regions associated with traits or selective pressures. We introduce new customizable analysis choices and scalable, efficient algorithms that can process larger genomic datasets, a critical improvement as genomic data become available for more species. Results: Modifications to core functions in the RERconverge pipeline yielded a speedup of up to 28.6-fold. The function that tests for associations between phenotypes and RERs has been expanded to include two new analytical methods for outlier control; we also provide a summary of the statistical tests users can perform, along with their use cases. Availability and implementation: The code and walkthrough vignettes for the package are available at https://github.com/nclark-lab/RERconverge.
]]></description>
<dc:creator><![CDATA[ Hoffmann, G. L., Kopania, E. E. K., Tene, M., Kowalczyk, A., Redlich, R., Pfenning, A. R., Meyer, W. K., Chikina, M., Clark, N. L. ]]></dc:creator>
<dc:date>2026-06-10</dc:date>
<dc:identifier>doi:10.64898/2026.06.06.730612</dc:identifier>
<dc:title><![CDATA[RERconverge Update: Runtime Reduction and Analysis Function Overhaul]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-10</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.05.730273v1?rss=1">
<title>
<![CDATA[
FIND: a software tool for identifying population-enriched pathogenic variants in gnomAD 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.05.730273v1?rss=1
</link>
<description><![CDATA[
Founder mutations are variants that arose in a single ancestor and became enriched in a descendant population through a bottleneck and endogamy. Identification of pathogenic founder mutations has facilitated efficient targeted screening. More broadly, even without confirmed founder status, identifying pathogenic variants that are enriched within specific populations reveals population-specific disease burden. However, many such variants remain hidden in plain sight within existing datasets. To address this gap, we developed FIND (Founder candidates hidden IN Data), a web tool that identifies pathogenic, likely pathogenic, and predicted loss-of-function variants in gnomAD with frequencies >0.00008 in one ancestry group and at least tenfold higher than in all others (after zeroing populations with four or fewer observed alleles). Testing FIND on the genes FLNC, TMEM127, MYH7, and BRCA2 confirmed its utility and functionality by identifying nine well-known founder mutations and seven candidate founders. Candidates enriched in African American and admixed American populations were validated with the All of Us database, highlighting the utility of this approach for populations historically underrepresented in genetic studies. Source code is freely available at https://github.com/aacoder105/FIND under an MIT license, with a web interface at https://ethnic-variant-mutation-finder.onrender.com/.
]]></description>
<dc:creator><![CDATA[ Horowitz, A. L., Liebman, A. Z., Liebman, S. W. ]]></dc:creator>
<dc:date>2026-06-10</dc:date>
<dc:identifier>doi:10.64898/2026.06.05.730273</dc:identifier>
<dc:title><![CDATA[FIND: a software tool for identifying population-enriched pathogenic variants in gnomAD]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-10</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.06.730578v1?rss=1">
<title>
<![CDATA[
Long-read cross-platform validation reveals novel repeat features in myotonic dystrophy type 2 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.06.730578v1?rss=1
</link>
<description><![CDATA[
The broader application of long-read sequencing (LRS) for repeat expansion characterization in myotonic dystrophy type 2 (DM2) and other repeat expansion disorders (REDs) remains limited by the lack of systematic validation and benchmarking of sequencing results and bioinformatic workflows. Here, we performed an orthogonal cross-platform validation of previously generated Oxford Nanopore Technologies (ONT) data by sequencing the same DNA samples with Pacific Biosciences (PacBio) HiFi following amplification-free targeted enrichment in a cohort of 8 DM2 patients. Despite substantial differences in sequencing chemistry and coverage, the two platforms showed high concordance in repeat size estimation, somatic mosaicism, and repeat architecture. This validation confirmed the presence of the (TCTG)n motif and enabled the identification of a previously unreported (CCCG)n motif at the 3' end of expanded alleles, further highlighting the structural complexity of the CNBP expansion. Through this analysis, we also established a bioinformatic workflow that improved ONT-based repeat characterization, addressing limitations in motif resolution and enabling more accurate analysis of CNBP expansions. Overall, this study provides a validated framework for LRS-based CNBP repeat analysis, supporting the integration of these technologies into routine molecular investigation for DM2 and other REDs.
]]></description>
<dc:creator><![CDATA[ Carlomagno, M., Suarez Lopez, F. J., Maestri, S., Esposito, A., Obadovic, V., Visconti, V. V., Ciabini, D., Marcolungo, L., Rossi, N., Casagrande, M., Angheben, L., Spadoni, L., D'Apice, M. R., Novelli, G., Delledonne, M., Botta, A., Rossato, M. ]]></dc:creator>
<dc:date>2026-06-10</dc:date>
<dc:identifier>doi:10.64898/2026.06.06.730578</dc:identifier>
<dc:title><![CDATA[Long-read cross-platform validation reveals novel repeat features in myotonic dystrophy type 2]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-10</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.06.730607v1?rss=1">
<title>
<![CDATA[
Is level-1 blob reconstruction under the network multispecies coalescent easy? 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.06.730607v1?rss=1
</link>
<description><![CDATA[
Hybridization is an important evolutionary process, commonly modeled by the network multispecies coalescent. Reconstructing evolutionary histories under this model is notoriously costly, even for level-1 networks where hybridization events are isolated from each other. The widely used methods that combine speed with statistical guarantees rely on quartet concordance factors computed for all subsets of four species, resulting in an O(n^4 k) bottleneck that severely limits scalability to large numbers of species (n) and genes (k). Among quartet-based methods, NANUQ+ is notable because it decomposes the problem into two steps: first reconstructing a tree of blobs, which compresses each non-treelike part of the network, called a blob, into a single vertex, and second reconstructing the internal structure of each level-1 blob, specifically its circular order and hybrid vertex. Here, we investigate whether level-1 blob reconstruction is difficult once the tree of blobs is known. We present a fast and statistically consistent algorithm, called NetCS, based on two simple primitives: majority voting and merge sort, circumventing the bottleneck of computing all quartet concordance factors. In simulations, NetCS achieved comparable accuracy to NANUQ+ and was dramatically faster, enabling analyses of 200 taxa and 1000 genes in only a few minutes. Both methods attained near-perfect accuracy when given the true tree of blobs; however, their performance degraded in end-to-end pipelines due to errors in tree of blobs reconstruction. Strikingly, even methods that reconstruct level-1 networks directly struggled to accurately predict hybrid ancestry. Our results suggest that reconstructing level-1 blobs is unexpectedly easy once the tree of blobs is known, and that a major challenge for phylogenetic network inference lies in accurate tree of blobs reconstruction.
]]></description>
<dc:creator><![CDATA[ Dai, J., Molloy, E. ]]></dc:creator>
<dc:date>2026-06-10</dc:date>
<dc:identifier>doi:10.64898/2026.06.06.730607</dc:identifier>
<dc:title><![CDATA[Is level-1 blob reconstruction under the network multispecies coalescent easy?]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-10</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.06.730569v1?rss=1">
<title>
<![CDATA[
SPARQ-MI leverages end-to-end spatial single-cell analysis of the tumor microenvironment 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.06.730569v1?rss=1
</link>
<description><![CDATA[
Detailed spatial analysis of the tumor microenvironment (TME) through multiplexed fluorescence imaging requires quantitative image-processing and data-analysis methods. While data preprocessing down to segmentation of individual cells is covered by available methods, statistical analysis of single-cell features is compromised by the uneven noise distribution, especially in complex tissues such as the TME, as well as by labor-intensive manual cell-type annotation and region segmentation. Here, we present SPARQ-MI (Spatial Phenotyping, Architecture Reconstruction and Quantification from Multiplexed Imaging) for streamlined spatial single-cell analysis, along with a tissue microarray PhenoCycler dataset with 37 fluorescent channels from melanoma patients under immunotherapy. We demonstrate that SPARQ-MI enables robust reconstruction of the cellular and spatial composition in this and other tissue types. Our analysis reveals associations of the cell state and spatial location of CD8 T cells with response to immunotherapy. Overall, SPARQ-MI allows for quantitative analysis of complex fluorescence histology samples with minimal user input while accounting for spatially uneven coverage of antibody signals, setting the stage for quantitative analysis of clinical samples.
]]></description>
<dc:creator><![CDATA[ Kiwitz, L., Turiello, R., Effern, M., Toma, M., Landsberg, J., Hoelzel, M., Thurley, K. ]]></dc:creator>
<dc:date>2026-06-10</dc:date>
<dc:identifier>doi:10.64898/2026.06.06.730569</dc:identifier>
<dc:title><![CDATA[SPARQ-MI leverages end-to-end spatial single-cell analysis of the tumor microenvironment]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-10</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.06.730616v1?rss=1">
<title>
<![CDATA[
Gene loss under constant cold reveals "natural knockout" loci in Antarctic notothenioid fishes 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.06.730616v1?rss=1
</link>
<description><![CDATA[
Background: Gene loss is a major but often underappreciated mode of genome evolution. Gene losses can reflect relaxed selective constraint, neutral decay, compensation, or adaptive change; once fixed, however, they alter the inherited gene complement and can shape the evolutionary trajectories available to descendant lineages. Shared losses near the base of a radiation may therefore become part of the genomic background on which later specialization occurs. The Antarctic notothenioid (cryonotothenioid) radiation provides a powerful system for studying gene loss in an extreme polar environment: it diversified in the thermally stable Southern Ocean, and outgroups outside the clade allow ancestral gene presence to be inferred. Results: Using whole-genome alignments and orthology-aware comparative analyses across 11 cryonotothenioids and four non-Antarctic notothenioid outgroups, we systematically identified gene-inactivating mutations and reconstructed their phylogenetic distribution. After applying stringent filters to reduce reference bias, mitigate assembly and annotation artifacts, and exclude loci with detectable transcriptomic support, we identified 30 high-confidence single-copy orthologs with loss of coding potential across sampled cryonotothenioids but intact coding sequences in outgroups. These clade-wide losses affect genes associated with lipid and amino acid metabolism, water transport, renal glucose reabsorption, skeletal mineralization, circadian regulation, and tRNA modification pathways. We also identified 12 additional single-copy orthologs lost across examined icefishes but retained in red-blooded notothenioids and outgroups, including genes associated with oxygen transport, iron handling, erythroid biology, and vesicular trafficking. Conclusions: Our study provides a genome-wide catalog of coding-gene losses shared across sampled cryonotothenioids, together with additional lineage-specific losses in icefishes. Many of these losses occur in conserved genes associated with disease-relevant phenotypes in humans or model vertebrates, suggesting that Antarctic notothenioids offer a natural comparative system for studying how loss-of-function variants can persist in viable wild populations. These candidate natural knockouts provide testable hypotheses for how environmental buffering, paralog compensation, and pathway rewiring may shape vertebrate physiology under chronic cold.
]]></description>
<dc:creator><![CDATA[ Lamba, V., Alverson, A. J., Daane, J. M., Zhuang, X. ]]></dc:creator>
<dc:date>2026-06-10</dc:date>
<dc:identifier>doi:10.64898/2026.06.06.730616</dc:identifier>
<dc:title><![CDATA[Gene loss under constant cold reveals "natural knockout" loci in Antarctic notothenioid fishes]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-10</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.04.730260v1?rss=1">
<title>
<![CDATA[
Bias-mitigated microbiome inference refines coronary artery disease signature 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.04.730260v1?rss=1
</link>
<description><![CDATA[
Roughly half the cells in the human body are microbial, and changes in these communities are increasingly implicated in cardiovascular, metabolic, and oncological diseases. Yet identifying which taxa truly differ in abundance, a task known as differential abundance (DA) analysis, is distorted by four major sources of bias: loss of total microbial load, taxa measurement efficiencies, arbitrary pseudocounts required to handle pervasive zeros, and contamination, which has recently driven retractions. No existing DA method accounts for all four. Here we introduce BootDA, a non-parametric bootstrap-based method that explicitly models each bias source without data transformations, pseudocounts, parametric assumptions, or assuming that most taxa are non-DA. In semi-parametric simulations preserving the sparsity (>70% zeros) and correlation structure of real 16S amplicon data, BootDA achieved the highest sensitivity among tested methods, including ANCOM-BC2, LinDA, MaAsLin 3, and Wilcoxon tests, while controlling the false discovery rate. Performance was retained in low-biomass settings when contamination contributed ~50% of counts, and without negative controls, indicating de novo decontamination capability. Applied to a coronary artery disease cohort, BootDA refined the original signature to two co-enriched genera, Klebsiella and Gemmiger, and excluded likely contaminants. BootDA is available as an R package and could generalise to other sparse, high-dimensional biological data.
]]></description>
<dc:creator><![CDATA[ Honeybrook, L. ]]></dc:creator>
<dc:date>2026-06-10</dc:date>
<dc:identifier>doi:10.64898/2026.06.04.730260</dc:identifier>
<dc:title><![CDATA[Bias-mitigated microbiome inference refines coronary artery disease signature]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-10</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.08.730821v1?rss=1">
<title>
<![CDATA[
ECMME: an atlas of selection pressures on the mammalian extracellular matrix reveals contrasting evolutionary dynamics 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.08.730821v1?rss=1
</link>
<description><![CDATA[
The extracellular matrix (ECM) is a fundamental metazoan innovation that provides structural support and regulatory cues essential for multicellular life. While core matrisome components are subject to strong functional constraints, their evolutionary dynamics at the molecular level remain incompletely characterized. Here, we present a comprehensive per-residue analysis of selection pressures across 272 human core matrisome proteins using high-quality orthologous sequences from up to 228 placental mammal species. We developed an automated pipeline integrating ortholog identification, codon-aware alignments, and site-specific selection analyses with the MEME and FUBAR methods from the HyPhy suite. Results reveal pervasive strong purifying selection across the matrisome, consistent with its structural and functional indispensability. This is accompanied by episodic positive selection and rarer pervasive positive selection, with collagens exhibiting significantly elevated episodic positive selection compared to glycoproteins and proteoglycans. To facilitate community access, we developed ECMME (ECM Molecular Evolution) browser, an intuitive open-access web resource that visualizes selection metrics plotted directly onto protein topologies. ECMME allows researchers to seamlessly browse and investigate the data, providing a powerful framework for interpreting functional sites. It is available online and requires no local installation or set-up (https://izzilab-ecmme.share.connect.posit.cloud/).
]]></description>
<dc:creator><![CDATA[ Petrov, P. B., Oshinjo, A., Roning, J., Izzi, V. ]]></dc:creator>
<dc:date>2026-06-10</dc:date>
<dc:identifier>doi:10.64898/2026.06.08.730821</dc:identifier>
<dc:title><![CDATA[ECMME: an atlas of selection pressures on the mammalian extracellular matrix reveals contrasting evolutionary dynamics]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-10</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.05.730453v1?rss=1">
<title>
<![CDATA[
Whole genome sequencing and variant discovery in 344 global grasspea (Lathyrus sativus L.) lines 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.05.730453v1?rss=1
</link>
<description><![CDATA[
The rapid expansion of genomic data resources for major crops is opening new options for crop improvement, while resources for most underutilised crops lag behind, risking a widening gap in crop improvement. One of these underutilised crops is grasspea (Lathyrus sativus), an ancient crop with modern cultivation centred on South Asia and Ethiopia. We conducted whole genome shotgun sequencing on a global collection of 344 grasspea lines, producing over 152 billion reads. Following variant discovery and filtering we created a single nucleotide polymorphism (SNP) marker set of over 1.5 million SNPs. This is a resource of major significance for this crop which can help unlock its breeding potential through marker development and the identification of genes controlling agronomically important traits.
]]></description>
<dc:creator><![CDATA[ Schreiber, M., Staples, J., Emmrich, P. M. F., Edwards, A., Martin, C., Bayer, M., Raubach, S., Kilian, B., Shaw, P. D. ]]></dc:creator>
<dc:date>2026-06-09</dc:date>
<dc:identifier>doi:10.64898/2026.06.05.730453</dc:identifier>
<dc:title><![CDATA[Whole genome sequencing and variant discovery in 344 global grasspea (Lathyrus sativus L.) lines]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-09</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.05.730479v1?rss=1">
<title>
<![CDATA[
Deciphering the limitations of immortalized hepatocyte cell lines for the study of liver cis-regulatory elements 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.05.730479v1?rss=1
</link>
<description><![CDATA[
Immortalized cell lines are widely used in biological research despite their known differences from their tissues and cell types of origin. Such cell lines are especially popular for testing hypotheses regarding the activity of cis-regulatory elements (CREs) that regulate gene expression. Previous investigations of blood and skin cell lines revealed many differences between the transcriptional regulatory networks of the cell lines and the associated primary cells. Similar comparisons for other tissues have been limited. Here, we used ATAC-seq to profile CREs in four immortalized liver cell lines and found many CRE differences between each cell line and primary liver tissue, including differences in the transcription factors that are likely to bind the CREs and differences in the genes that they are likely to regulate. Modifying cell culture conditions based on recommendations in the literature did not improve the similarity with primary liver tissue. Our results suggest that differences between the transcriptional regulatory networks in cell lines and primary tissue should be considered when designing and interpreting cell line experiments.
]]></description>
<dc:creator><![CDATA[ Bellesis, A., Li, X., Moore-Frederick, D., Xu, D., Delbridge, K., Ma, J., Vaccaro, G., Edward, B. A. A., Kellogg, M., Creeger, Y., Okamoto, A. S., Kaplow, I. M. ]]></dc:creator>
<dc:date>2026-06-09</dc:date>
<dc:identifier>doi:10.64898/2026.06.05.730479</dc:identifier>
<dc:title><![CDATA[Deciphering the limitations of immortalized hepatocyte cell lines for the study of liver cis-regulatory elements]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-09</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.05.730086v1?rss=1">
<title>
<![CDATA[
Enhancer-gene regulatory interactions in colorectal cancer revealed through genome-wide CRISPRi perturbations 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.05.730086v1?rss=1
</link>
<description><![CDATA[
Background: Establishing functional relationships between distal enhancers and their target genes is a central challenge in genome biology and cancer genetics. CRISPR/dCas9-mediated perturbations followed by single-cell RNA sequencing (Perturb-seq) have emerged as an efficient means of establishing enhancer-target gene relationships. Results: We used genome-wide CRISPRi Perturb-seq with 35,139 guide RNAs targeting 12,117 enhancers to create a functional enhancer-gene map of colorectal cancer (CRC), identifying 238 significant regulatory associations (FDR < 0.1). Integration of chromatin accessibility (ATAC-seq), histone modification profiling (ChIP-seq), and ultra-high resolution chromatin conformation capture (Micro-C) data revealed that enhancer regulation in CRC is constrained by topological domains and often targets the nearest gene, findings corroborated by Activity-by-Contact modelling. Conclusions: We provide a comprehensive functional enhancer-gene interaction map for CRC. This resource should provide a foundation for studying gene regulation in colorectal tumorigenesis and for prioritising candidate non-coding drivers in cancer sequencing studies.
]]></description>
<dc:creator><![CDATA[ Law, P. J., Vijayakrishnan, J., Smith, J., Barry, T., Mandelia, M., Katsevich, E. ]]></dc:creator>
<dc:date>2026-06-09</dc:date>
<dc:identifier>doi:10.64898/2026.06.05.730086</dc:identifier>
<dc:title><![CDATA[Enhancer-gene regulatory interactions in colorectal cancer revealed through genome-wide CRISPRi perturbations]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-09</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.04.730246v1?rss=1">
<title>
<![CDATA[
Quantifying annotation-stratified pleiotropy and co-polygenicity between complex traits 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.04.730246v1?rss=1
</link>
<description><![CDATA[
Understanding shared genetic architecture is essential to interpreting disease comorbidities and trait correlations. We introduce SBayesAPP, a Bayesian model that integrates GWAS summary statistics with functional annotations to jointly estimate annotation-stratified SNP effect-size correlation and pleiotropic variant proportion (co-polygenicity) between traits, dissecting genetic correlation and coheritability enrichment across annotations. Simulations and real data analyses show improved accuracy and interpretability over existing methods. In type 2 diabetes analyses with 15 traits, SBayesAPP reveals clear tissue- and cell-type-specific enrichment and distinguishes mechanisms driven by few large-effect variants versus many modest-effect variants. The analysis of smoking and lung cancer prioritizes lung and immune cells, and identifies cell-type-specific genetic correlations driven by either pleiotropic or lung-cancer-specific variants, consistent with a causal relationship model. For schizophrenia and educational attainment, despite near-zero genome-wide genetic correlation, cell-type-specific correlations range from -0.20 to 0.21, with strong (co)heritability enrichment and high co-polygenicity found in dopaminergic neurons and oligodendrocytes. These results highlight the ability of SBayesAPP to resolve annotation-specific genetic sharing and uncover biological mechanisms across complex traits.
]]></description>
<dc:creator><![CDATA[ Qu, J., Zhao, T., Lin, T., Li, A., Liu, S., Chauquet, S., Visscher, P. M., Wray, N. R., Yengo, L., Zeng, J., Cheng, H. ]]></dc:creator>
<dc:date>2026-06-09</dc:date>
<dc:identifier>doi:10.64898/2026.06.04.730246</dc:identifier>
<dc:title><![CDATA[Quantifying annotation-stratified pleiotropy and co-polygenicity between complex traits]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-09</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.03.729980v1?rss=1">
<title>
<![CDATA[
Imputed graph-genotyped structural variants identify regulatory haplotypes associated with gene expression in Atlantic salmon 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.03.729980v1?rss=1
</link>
<description><![CDATA[
Structural variants (SVs) can affect gene regulation, but they are difficult to include in expression genetic studies when large RNA-seq cohorts lack whole-genome sequencing. This is common in non-human and non-model systems, where whole-genome sequencing at population scale remains costly. As a result, expression quantitative trait locus (eQTL) studies often rely on single nucleotide polymorphism (SNP) markers. These analyses can identify expression-associated regions, but often provide limited biological interpretation of the underlying regulatory mechanisms. Here, we used Atlantic salmon as a study system to test whether graph-genotyped SVs can be imputed into a SNP-array-genotyped RNA-seq cohort and used to interpret regulatory haplotypes. SVs were discovered from two long-read-sequenced individuals, supplemented with short-read SV and SNP calls from a 112-individual whole-genome-sequenced reference panel, graph-genotyped, jointly phased with SNPs, and imputed into 906 offspring with gill RNA-seq and SNP-array genotypes. After size filtering, the imputed SV catalogue contained 100,269 variants and showed nonuniform genomic distributions associated with sex-specific recombination landscapes. Association testing identified 51 SV-eQTL candidates, including 35 cis and 16 trans associations. These candidates were enriched for short-read-derived variants, indicating that short-read supplementation can recover regulatory variants missed by small-scale long-read discovery. SV-eQTL candidates were more strongly tagged by nearby SNPs than non-associated variants generally, but individual SNP lead markers often failed to capture the same eQTL signals in conditional regression. Retained candidates after the conditional analysis included target-gene-overlapping deletions, nearby local variants without target-gene overlap, trans associations, and short insertions with opposite effects on gene expression. 
These results show that imputed graph-genotyped SVs can add biological interpretation to possible regulatory haplotypes.
]]></description>
<dc:creator><![CDATA[ Chapis, M., Manousi, D., Diblasi, C., Brekke, C., Kwak, J., Ponce De Leon, A. V., Arnyasi, M., Fenstad, R., Boison, S., Saitou, M. ]]></dc:creator>
<dc:date>2026-06-09</dc:date>
<dc:identifier>doi:10.64898/2026.06.03.729980</dc:identifier>
<dc:title><![CDATA[Imputed graph-genotyped structural variants identify regulatory haplotypes associated with gene expression in Atlantic salmon]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-09</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.06.02.729719v1?rss=1">
<title>
<![CDATA[
PanKbase Integrated Single-Cell Map: A Comprehensive Atlas of Human Pancreatic Islets 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.06.02.729719v1?rss=1
</link>
<description><![CDATA[
Aims/hypothesis: Single-cell RNA sequencing (scRNA-seq) of pancreatic islet tissue is a powerful tool for investigating type 1 diabetes (T1D). However, individual datasets are limited in size and fragmented across donors, laboratories, and experimental conditions, highlighting the need for a unified single-cell atlas. This study aimed to construct a comprehensive, integrated scRNA-seq map of human isolated pancreatic islets by collating data from diverse sources. Methods: Publicly available scRNA-seq datasets derived from isolated pancreatic islets, generated and/or provided by the Human Pancreas Analysis Program (HPAP), Prodo Labs, and the Integrated Islet Distribution Program (IIDP), were collected. Systematic quality controls were implemented to select high-quality samples, reads and cells. Data integration was conducted, accounting for important variables such as age, sex, body mass index (BMI), origin study, treatments, islet data/distribution resources, and sequencing chemistry. Results: We generated a comprehensive single-cell atlas of human pancreatic islets comprising 191 high-quality assays from 140 donors (59 female, 81 male) across five phenotypic groups: controls without diabetes (69 donors), autoantibody-positive donors without diabetes (12), pre-diabetic donors (11), donors with type 1 diabetes (12), and donors with type 2 diabetes (36). The atlas also includes experimentally perturbed samples, including those exposed to SARS-CoV-2 infection and pro-inflammatory cytokines. In total, the atlas contains 448,935 cells, capturing major endocrine islet populations, such as alpha cells (43.3%) and beta cells (26.8%), as well as non-endocrine populations such as endothelial cells (0.75%) and immune cells (0.6%). Conclusions/interpretation: By uniformly harmonizing and integrating data from multiple sources, we have developed a comprehensive single-cell atlas of isolated human pancreatic islets, which is publicly available at www.pankbase.org.
The atlas provides a platform for hypothesis-driven investigation of diabetes pathophysiology and, given rigorous quality control at the read, barcode, and sample levels alongside careful metadata curation, is well suited for downstream machine-learning applications.
]]></description>
<dc:creator><![CDATA[ Vu, H. T. H., Sun, H., Kudtarkar, P., Sharp, S. A., Brusman, L., Wang, Y., Huang, Y., Mao, R., Feng, F., Corban, S., Huber, A. K., Shilin, A., Sun, Y., Narayanaswamy, S., Jang, D., Jurgens, J., Robertson, C. C., Shrestha, S., Bate, T., Nguyen, T., Smadbeck, P., Zhang, L., Brandes, M., The PanKbase Consortium, Flannick, J., Burtt, N., Chen, S., Liu, J., Cartailler, J.-P., Voight, B. F., Stitzel, M. L., Brissova, M., Gloyn, A. L., Gaulton, K. J., Parker, S. C. J. ]]></dc:creator>
<dc:date>2026-06-09</dc:date>
<dc:identifier>doi:10.64898/2026.06.02.729719</dc:identifier>
<dc:title><![CDATA[PanKbase Integrated Single-Cell Map: A Comprehensive Atlas of Human Pancreatic Islets]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-06-09</prism:publicationDate>
<prism:section></prism:section>
</item>
</rdf:RDF>
