<?xml version="1.0" encoding="UTF-8" ?>
<rdf:RDF xmlns:admin="http://webns.net/mvcb/" xmlns="http://purl.org/rss/1.0/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:prism="http://purl.org/rss/1.0/modules/prism/" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:syn="http://purl.org/rss/1.0/modules/syndication/">
<channel rdf:about="https://biorxiv.org">
<admin:errorReportsTo rdf:resource="mailto:biorxiv@cshlpress.edu"/>
<title>bioRxiv Subject Collection: Genomics Bioinformatics</title>
<link>https://biorxiv.org</link>
<description>
This feed contains articles for bioRxiv Subject Collection "Genomics Bioinformatics"
</description>

<items>
<rdf:Seq>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.05.21.726863v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.05.20.725981v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.05.20.726535v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.05.20.726471v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.05.20.726443v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.05.20.726468v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.05.20.726414v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.05.19.726405v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.05.19.726197v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.05.19.725261v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.05.19.726254v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.05.19.726271v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.05.19.726178v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.05.19.726290v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.05.19.726314v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.05.19.726275v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.05.19.726040v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.05.19.713680v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.05.19.726322v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.05.19.726264v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.05.19.726393v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.05.19.726397v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.05.19.726105v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.05.19.726067v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.05.19.726151v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.05.19.726162v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.05.19.726174v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.05.19.726217v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.05.19.726053v1?rss=1"/>
<rdf:li rdf:resource="https://www.biorxiv.org/content/10.64898/2026.05.18.725946v1?rss=1"/>
</rdf:Seq>
</items>
<prism:eIssn/>
<prism:publicationName>bioRxiv</prism:publicationName>
<prism:issn/>

<image rdf:resource=""/>
</channel>
<image rdf:about="">
<title>bioRxiv</title>
<url>https://www.biorxiv.org/sites/default/files/bioRxiv_article.jpg</url>
<link>https://www.biorxiv.org</link>
</image>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.05.21.726863v1?rss=1">
<title>
<![CDATA[
A community machine learning challenge to predict the effects of gene perturbations on T cell differentiation for cancer immunotherapy 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.05.21.726863v1?rss=1
</link>
<description><![CDATA[
Perturbations of genes with functional importance in T cells could be used to change the distribution of CD8 T cell states to enhance anti-tumor functions for cancer immunotherapies. We launched a worldwide computational challenge to predict the effects of gene perturbations and to devise objective functions for prioritizing gene perturbations that lead to desired T cell state distributions. We supported the challenge by generating a single-cell Perturb-seq dataset profiling the effect of knocking out 73 individual expert-defined genes in T cells transferred into a mouse melanoma model. We compared the top algorithms developed by participants and found that performance was primarily determined by the prior data used for gene feature representation, with features derived from perturbational data proving most effective. Experimental validation of the top 61 genes nominated by the algorithms revealed that perturbation of Ndufv2 and Dimt1 reached the defined objective and biased T cell differentiation toward desired states.
]]></description>
<dc:creator><![CDATA[ Zhang, J., Schwartz, M. A., Mutaher, M., Olajide, O., Pritykin, Y., Ashenberg, O., Hacohen, N., Uhler, C. ]]></dc:creator>
<dc:date>2026-05-22</dc:date>
<dc:identifier>doi:10.64898/2026.05.21.726863</dc:identifier>
<dc:title><![CDATA[A community machine learning challenge to predict the effects of gene perturbations on T cell differentiation for cancer immunotherapy]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-22</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.05.20.725981v1?rss=1">
<title>
<![CDATA[
BPabZIP, a new bZIP protein motif that promotes binding near, and displacement of, nucleosomes 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.05.20.725981v1?rss=1
</link>
<description><![CDATA[
Many transcription factors (TFs) bind only a subset of their canonical binding sites in mammalian cells. To identify differences between bound and unbound sites, we examined Zta(N182S), a mutant of the Epstein-Barr virus (EBV)-encoded Zta bZIP protein that binds distinct DNA sequences that are not strongly bound by any known human or viral TF, reducing the effects of selective pressure on endogenous genomic binding sites. We stably expressed Zta(N182S) in human HEK293 cells and monitored protein binding (ChIP-seq) and effects on chromatin accessibility (ATAC-seq). Zta(N182S) binds ~10% of the 14,979 genomic occurrences of the canonical 9-mer ATCACTCAT, creating stronger overall ATAC-seq signal compared to control cells, suggesting nucleosome displacement. Nucleosome occupancy, either predicted or experimentally determined (MNase), indicates that canonical Zta and Zta(N182S) sites are more strongly bound when they are ~60 bp from a positioned nucleosome dyad. These data suggest that Zta and Zta(N182S) binding results in nucleosome remodeling, consistent with pioneer-like activity. Examination of amino acids across Zta and human bZIPs identifies four conserved basic amino acids, a proline, and acidic amino acids immediately N-terminal of the basic amino acids of the bZIP domain (PARRTRKPQQPESLEECDSELEIKRYKN). We term this new protein motif "BPabZIP" (Basic-Proline-acidic bZIP). Molecular structure predictions for both Zta and human Fos/Jun reveal the basic amino acids interacting with the acidic patch on the nucleosome. The acidic amino acids act as an α-helical extension of the basic region that mimics DNA by interacting with histones H2A and H2B. Taken together, our analyses of this synthetic TF reveal a pioneer-like mechanism that is present in both human and viral bZIP proteins.
]]></description>
<dc:creator><![CDATA[ Tillo, D., Zhurkin, V. B., Porollo, A., Durell, S., Hesse, H. K., Hass, M., Dexheimer, P. J., Kottyan, L., Weirauch, M. T., Vinson, C. ]]></dc:creator>
<dc:date>2026-05-22</dc:date>
<dc:identifier>doi:10.64898/2026.05.20.725981</dc:identifier>
<dc:title><![CDATA[BPabZIP, a new bZIP protein motif that promotes binding near, and displacement of, nucleosomes]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-22</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.05.20.726535v1?rss=1">
<title>
<![CDATA[
Min-frame transformation enables more sensitive viral genome alignment 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.05.20.726535v1?rss=1
</link>
<description><![CDATA[
Motivation: Maximal unique matches (MUMs) are a fundamental primitive in genome comparison, where they serve as high-confidence anchors for downstream multiple genome alignment. However, because MUMs rely on exact string matching, their effectiveness degrades with increased genome divergence and larger sets of genomes, inhibiting their ability to recover long homologous regions and reducing the number of base pairs covered by the multiple genome alignment. Additionally, existing approaches that improve robustness to mutation, such as spaced seeds or translated alignment methods, introduce trade-offs in specificity, scalability, or computational complexity. Methods: To address this gap, we introduce the Min-Frame Transformation (MFT), a deterministic encoding of nucleotide sequences to sequences over a transformed alphabet that preserves the coordinate structure of the original sequence. At each position, the MFT selects a k-mer from a local window according to a fixed global ordering and assigns it a character in the transformed alphabet via a predefined mapping. This process captures local sequence context and can mask the impact of mutations, increasing the likelihood that homologous regions remain detectable as exact matches. The resulting transformed sequences can be indexed using standard string data structures, such as suffix arrays and suffix trees, enabling efficient extraction of MUMs without modifying existing algorithms. Impact: The MFT is a novel computational approach for improving the robustness of MUM-based seeding for genome alignment by producing longer and more contiguous matches that span a greater fraction of the genome, leading to improved alignment coverage and SNP recall. Altogether, these gains have the potential to benefit downstream viral genome analysis applications such as phylogenetic inference and transmission analysis.
]]></description>
<dc:creator><![CDATA[ Doughty, R. D., Banerjee, A., Kille, B., Warnow, T., Treangen, T. J. ]]></dc:creator>
<dc:date>2026-05-22</dc:date>
<dc:identifier>doi:10.64898/2026.05.20.726535</dc:identifier>
<dc:title><![CDATA[Min-frame transformation enables more sensitive viral genome alignment]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-22</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.05.20.726471v1?rss=1">
<title>
<![CDATA[
RANKOR: Direct Drug Prioritization from Bulk and Single-Cell Transcriptomic Signatures 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.05.20.726471v1?rss=1
</link>
<description><![CDATA[
Background: Prioritizing therapeutics from transcriptomic data remains a key challenge in precision medicine. Signature reversal approaches, most commonly implemented through Gene Set Enrichment Analysis (GSEA), have been widely used to match disease signatures to candidate drugs. However, enrichment-based methods can be sensitive to noise and are restricted to previously profiled compounds. Methods: We developed RANKOR, a machine-learning framework designed to rank candidate drugs directly from transcriptomic signatures. Rather than predicting full expression profiles, RANKOR learns structured latent representations of transcriptional responses alongside chemical structure, enabling prioritization from standardized signatures derived from disease states or treatment perturbations. The framework is applicable to both bulk and single-cell transcriptomic data. Results: Across large-scale perturbational datasets, RANKOR achieved consistently lower median ranks than similarity- and distance-based approaches, while showing performance comparable to, and in some settings improved over, GSEA. The model generalized across unseen cell types and retained performance in single-cell settings, where it provided more consistent prioritization than existing approaches, such as ASGARD. RANKOR further enabled prioritization of transcriptionally unseen compounds through chemical-space embedding and achieved substantially reduced computation times. Robustness analyses demonstrated stable performance under moderate noise and degradation under extreme perturbation or gene shuffling. Gene attribution analyses indicated that prioritization decisions are driven by coherent and mechanism-relevant transcriptional programs. Conclusions: RANKOR provides a scalable framework for transcriptomics-guided drug prioritization that can complement and extend existing approaches, such as GSEA. It can also support therapeutic hypothesis generation from bulk and single-cell data while leveraging the generalisability and computational efficiency of machine learning models.
]]></description>
<dc:creator><![CDATA[ Katsaouni, N., Schulz, M. H. ]]></dc:creator>
<dc:date>2026-05-21</dc:date>
<dc:identifier>doi:10.64898/2026.05.20.726471</dc:identifier>
<dc:title><![CDATA[RANKOR: Direct Drug Prioritization from Bulk and Single-Cell Transcriptomic Signatures]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-21</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.05.20.726443v1?rss=1">
<title>
<![CDATA[
Benchmarking full-length ITS metabarcoding across Illumina 2x500, PacBio, and Oxford Nanopore sequencing using mock and soil communities 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.05.20.726443v1?rss=1
</link>
<description><![CDATA[
Metabarcoding is a powerful tool for biodiversity comparisons, where standard-size DNA barcodes (>500 bases) offer better taxonomic resolution than shorter ones. Still, the choice of sequencing platforms and bioinformatics pipelines may strongly affect inferred diversity due to various technical biases. We assessed the relative performance of Illumina MiSeq i100 (2x500 paired-end), PacBio Revio and Oxford Nanopore MinION sequencing and bioinformatics pipelines, using full-length ITS amplicon sequencing datasets from a 103-species mock community and 45 composite soil samples. Despite numerous low-quality reads, PacBio yielded the lowest overall error rate and highest number of taxa. Illumina revealed the highest proportion of chimeric and index-switched reads, along with a strong bias towards shorter amplicons. MinION data analysed using PRONAME and Minovar - a bioinformatics pipeline presented here - had the largest proportion of low-quality data, and rare taxa were lost during data filtering and read polishing steps. Although Minovar enabled amplicon sequence variant (ASV) level precision for common taxa, we recommend clustering ASVs into OTUs. For PacBio, standard filtering approaches outperformed the ASV approach because they retained rare taxa. For Illumina, a stringent ASV approach or removal of rare OTUs would limit artefacts. Across all platforms, excess PCR cycles promoted chimeric and low-quality reads and reduced the quantitative accuracy of biodiversity assessments. With moderate differences in effect sizes, all analytical approaches supported the conclusion that sampling design determines how we see soil biodiversity responses to land use. For biodiversity surveys based on full-length ITS metabarcoding, we recommend using PacBio sequencing with standard, non-ASV pipelines.
]]></description>
<dc:creator><![CDATA[ Tedersoo, L., Prous, M., Chen, M., Anslan, S., Saar, I., Dubois, B., Mikryukov, V. ]]></dc:creator>
<dc:date>2026-05-21</dc:date>
<dc:identifier>doi:10.64898/2026.05.20.726443</dc:identifier>
<dc:title><![CDATA[Benchmarking full-length ITS metabarcoding across Illumina 2x500, PacBio, and Oxford Nanopore sequencing using mock and soil communities]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-21</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.05.20.726468v1?rss=1">
<title>
<![CDATA[
MolCodon: A Codon-Based Molecular Language for Interpretable Structural Representation and Similarity Search
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.05.20.726468v1?rss=1
</link>
<description><![CDATA[
Molecular representation determines which aspects of chemical structure can be learned, compared, and interpreted in computational drug discovery. Existing encodings typically emphasize either compact string description, as in SMILES and SELFIES, or efficient similarity search, as in circular fingerprints, but they may not simultaneously provide deterministic sequence structure, graph-level interpretability, pharmacophore annotation, and high-fidelity molecular reconstruction. Here, we introduce MolCodon, a codon-based molecular language that represents small molecules as deterministic sequences of fixed-width three-character tokens over a five-symbol alphabet, C, N, O, S, and X. Inspired by the triplet organization of the genetic code, MolCodon assigns chemically defined codon families to atoms, bonds, ring and branch topology, fused-ring references, pharmacophore features, bond mobility, charge, and stereochemistry. A deterministic graph traversal with ring-contiguity preservation produces sequences in which chemically meaningful substructures remain locally organized and traceable to the underlying molecular graph. Across around 2.9 million molecules from six commercial screening libraries, MolCodon achieved 98.93% InChIKey-level round-trip fidelity, supporting its use as a high-fidelity sequence representation for drug-like chemistry. MolCodon-derived sparse sequence and trace features further outperformed SELFIES and Group SELFIES across ten QSAR tasks and exceeded classical fingerprint baselines in six out of ten tasks. As an application of the representation, the MolCodon BLAST similarity engine decomposes molecular similarity into ring topology, branch context, attachment architecture, and pharmacophore correspondence, enabling interpretable scaffold-hopping searches. In a PARP1 virtual screening study, MolCodon retrieved scaffold-diverse candidates related to the known PARP1 inhibitor olaparib. Together, these results establish MolCodon as a new molecular representation paradigm that transforms chemical graphs into high-fidelity, interpretable, and alignment-compatible codon sequences, opening a direct path for bioinformatics-inspired analysis of small-molecule chemical space. The MolCodon encoder, decoder, and BLAST similarity engine are freely available as open-source software at https://github.com/DurdagiLab/MolCodon
]]></description>
<dc:creator><![CDATA[ Sayyah, E., Kurul, E., Tunc, H., DURDAGI, S. ]]></dc:creator>
<dc:date>2026-05-21</dc:date>
<dc:identifier>doi:10.64898/2026.05.20.726468</dc:identifier>
<dc:title><![CDATA[MolCodon: A Codon-Based Molecular Language for Interpretable Structural Representation and Similarity Search]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-21</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.05.20.726414v1?rss=1">
<title>
<![CDATA[
A cross-tissue POSTN+ fibroblast atlas links periodontal, tumor, and fibrotic stromal niches 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.05.20.726414v1?rss=1
</link>
<description><![CDATA[
Cross-tissue single-cell atlases have reframed fibroblasts as a continuum of activation states, with universal Pi16+ progenitors giving rise to tissue-restricted activated populations[1] and shared pathological states recurring across inflammatory diseases[2]. Periostin (POSTN), a matricellular protein of injured, fibrotic, and tumor stroma, has been independently linked to activated fibroblasts in liver fibrosis[3], colorectal cancer[4], head-and-neck cancer[5], and dental contexts[6, 7], but cross-tissue conservation of a single POSTN+ program is untested. Here we built a Harmony-integrated atlas of 56,713 human and mouse fibroblasts from eight single-cell datasets spanning six organ contexts (periodontal ligament, periodontitis, oral squamous-cell carcinoma, colorectal cancer, temporomandibular-joint osteoarthritis, and bile-duct-ligation liver fibrosis). A conservative cluster-consensus definition (Wilcoxon padj < 0.05 and log2FC > 0.5 within an atlas-integrated Leiden cluster, combined with per-cell POSTN > 0) identified 11,451 POSTN+ cells (20.2% of the atlas) recurring across all six contexts at frequencies from 6.2% (periodontal ligament) to 55.1% (liver fibrosis). Within-fibroblast differential expression yielded a 102-gene shared core program - collagen biosynthesis, ECM crosslinkers, and matricellular markers including POSTN, SPARC, BGN, FN1, MMP2, and CTHRC1 - interpreted as POSTN-specific transcriptional amplification of an activated ECM-remodelling module. KLF4, hypothesized a priori as a POSTN+ co-marker, was upregulated in only one of six contexts, consistent with its role as a quiescence brake released during activation[3]. Three pre-registered sensitivity analyses (Harmony parameter, three definitions, dataset exclusion) and an independent Puram-2017 OSCC cohort (1,422 fibroblasts; 101/102 core genes recovered; primary vs lymph-node-met Mann-Whitney p = 0.005) support robustness across integration parameters, definitions, dataset inclusion, and platform.
]]></description>
<dc:creator><![CDATA[ Wang, C., Yang, H., Lin, M., Wang, Y., Guoli, Y. ]]></dc:creator>
<dc:date>2026-05-21</dc:date>
<dc:identifier>doi:10.64898/2026.05.20.726414</dc:identifier>
<dc:title><![CDATA[A cross-tissue POSTN+ fibroblast atlas links periodontal, tumor, and fibrotic stromal niches]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-21</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.05.19.726405v1?rss=1">
<title>
<![CDATA[
Joint enzyme-reaction retrieval and catalytic optima prediction via multimodal fusion 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.05.19.726405v1?rss=1
</link>
<description><![CDATA[
Motivation: Enzyme-reaction retrieval is increasingly used to prioritize candidate biocatalysts for experimental follow-up, where useful recommendations should indicate not only whether an enzyme can catalyze a target reaction but also under which pH and temperature conditions it should be tested. Existing retrieval models optimize catalytic matching scores, whereas catalytic optima predictors are typically developed as enzyme-level regressors because public pH and temperature annotations are sparse and often available only at the enzyme or EC-associated record level. This separation leaves a practical gap: high-ranking enzyme-reaction pairs are not evaluated for condition suitability, and enzyme-level optima predictions do not use the reaction context being retrieved. Results: We present GERO, a multimodal fusion framework that uses feature-gated cross-modal fusion to integrate global enzyme sequence semantics, sequence-derived pocket geometry, and molecular reaction representations for condition-aware enzyme-reaction retrieval and catalytic optima estimation with reaction context. To evaluate this setting, we define the tolerance-restricted hit rate (Hit@k-TR), which requires both top-k retrieval of the correct candidate and condition prediction within predefined tolerances. Across enzyme- and reaction-similarity splits, GERO improves Hit@k-TR over two-stage retrieval-then-prediction baselines. Representative benchmark examples and an iodinin biosynthesis case study further illustrate GERO's ability to provide candidate rankings together with plausible assay-condition estimates for downstream experimental prioritization.
]]></description>
<dc:creator><![CDATA[ Cai, Y., Yang, F., Liu, J. ]]></dc:creator>
<dc:date>2026-05-21</dc:date>
<dc:identifier>doi:10.64898/2026.05.19.726405</dc:identifier>
<dc:title><![CDATA[Joint enzyme-reaction retrieval and catalytic optima prediction via multimodal fusion]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-21</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.05.19.726197v1?rss=1">
<title>
<![CDATA[
MirMachine 2: a scalable, evolutionarily informed pipeline for microRNA annotation and comparative genomics across thousands of animal genomes 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.05.19.726197v1?rss=1
</link>
<description><![CDATA[
Summary: Genome sequencing is rapidly outpacing the annotation of conserved regulatory elements, limiting the evolutionary and comparative insights that can be extracted from expanding genome collections. MicroRNAs are among the most conserved and phylogenetically informative genes, yet automated annotation has remained difficult to scale while preserving evolutionary interpretability. Here we present MirMachine 2, an evolutionarily informed framework that combines curated reference models, lineage-aware scoring, and adaptive filtering to enable robust genome-wide microRNA annotation at scale. Applying this to thousands of animal genomes reveals that many apparent absences of conserved microRNAs reflect methodological bias rather than biological loss, particularly in underrepresented lineages. By enabling consistent and interpretable comparison of microRNA complements across large datasets, MirMachine 2 establishes scalable microRNA annotation as a practical foundation for genome-scale evolutionary and comparative genomics.
]]></description>
<dc:creator><![CDATA[ Paynter, V. M., Umu, S. U., Tierney, J. A. S., Tricomi, F. F., Haggerty, L., Fromm, B. ]]></dc:creator>
<dc:date>2026-05-21</dc:date>
<dc:identifier>doi:10.64898/2026.05.19.726197</dc:identifier>
<dc:title><![CDATA[MirMachine 2: a scalable, evolutionarily informed pipeline for microRNA annotation and comparative genomics across thousands of animal genomes]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-21</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.05.19.725261v1?rss=1">
<title>
<![CDATA[
Transcriptomics of cold stress and recovery reveal strongly tissue-specific responses 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.05.19.725261v1?rss=1
</link>
<description><![CDATA[
Cellular stress responses are often characterized as conserved, cell-autonomous processes. However, it remains unclear whether stress responses are coordinated uniformly across tissues within complex organisms, particularly during ecologically relevant conditions. We investigated tissue- and stage-specific transcriptional responses to cold stress in Drosophila melanogaster. Adults and larvae were independently exposed to a gradual cooling and recovery time series, and three adult tissues (gut, ovary, brain) and one larval tissue (gut) were sampled at baseline, at two time points that spanned the critical thermal minimum (before and during chill coma), and after recovery to rearing temperature. Transcriptomic analyses revealed strongly tissue- and stage-specific responses to cold stress, with limited overlap in differentially expressed genes or functional enrichment across tissues. These results indicate that the organismal response to thermal stress at the transcriptional level is not coordinated by a unified transcriptional program, but rather by largely distinct, tissue-specific regulatory processes.
]]></description>
<dc:creator><![CDATA[ Heilig, M., Gadey, L., Tomkinson, J., deMayo, J. A., Ragland, G. J. ]]></dc:creator>
<dc:date>2026-05-21</dc:date>
<dc:identifier>doi:10.64898/2026.05.19.725261</dc:identifier>
<dc:title><![CDATA[Transcriptomics of cold stress and recovery reveal strongly tissue-specific responses]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-21</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.05.19.726254v1?rss=1">
<title>
<![CDATA[
NanoCortex: A Unified Agentic System for Nanopore Sequencing Analysis 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.05.19.726254v1?rss=1
</link>
<description><![CDATA[
Nanopore sequencing has opened access to multiple layers of information about DNA and RNA sequences, isoforms, and chemical modifications. Yet the archipelago of disjoint nanopore analysis tools makes navigating among them a significant challenge for the nanopore user. We present NanoCortex, a unified autonomous agentic framework designed to address this shortcoming by providing end-to-end data processing that ranges from raw signal basecalling to biological interpretation. Built upon Gemini API services that incur usage-based API costs and orchestrated through the Gemini Agent Development Kit (ADK), the system utilizes a multi-agent architecture to autonomously perform task parsing, code generation, iterative code-level self-correction, and scientific interpretation. Following code generation, the code can be used offline. Benchmarking reveals that NanoCortex achieves significantly higher usability across complex analytical tasks compared to general-purpose large language models. The framework integrates experimental data with meta-analysis of publicly available biological databases to facilitate the extraction of biologically meaningful insights from sequencing data without cumbersome computational steps.
]]></description>
<dc:creator><![CDATA[ Xia, Q., Wang, Z., Shokoufandeh, M., Rouhanifard, S. H., Wanunu, M. ]]></dc:creator>
<dc:date>2026-05-21</dc:date>
<dc:identifier>doi:10.64898/2026.05.19.726254</dc:identifier>
<dc:title><![CDATA[NanoCortex: A Unified Agentic System for Nanopore Sequencing Analysis]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-21</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.05.19.726271v1?rss=1">
<title>
<![CDATA[
A Plasmodium falciparum Pangenome Resource to Drive Structural Variant Discovery and to assist Malaria Control 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.05.19.726271v1?rss=1
</link>
<description><![CDATA[
Plasmodium falciparum, the deadliest causative agent of malaria, harbours extensive structural variation that underlies key biological processes including drug resistance and diagnostic evasion. Here, we present PfPan, a P. falciparum pangenome constructed from 13 geographically diverse high-quality reference genomes using Minigraph-Cactus, adding 4.7 Mb of sequence beyond the 3D7 linear reference. We identify over 5,000 structural variants across the reference genomes and demonstrate improved genotyping using the vg toolkit compared to linear reference-based approaches, with comparable performance for small variant discovery. Benchmarking against assembly-derived truth sets confirms pangenome superiority for structural variant detection, particularly at complex and hypervariable loci. Applying PfPan to 878 globally sampled P. falciparum whole-genome sequences, we characterise the population-level frequency of clinically relevant structural variants, including a high-frequency 10.4 kb insertion at the drug resistance-linked gch1 locus, and explore deletions upstream of the gene encoding the diagnostic target HRP2. PfPan provides a foundational resource for reducing reference bias in P. falciparum genomic surveillance and offers a framework for improved detection of variants relevant to drug resistance and malaria control.
]]></description>
<dc:creator><![CDATA[ Billows, N., Thorpe, J., Campino, S., Clark, T. G. ]]></dc:creator>
<dc:date>2026-05-21</dc:date>
<dc:identifier>doi:10.64898/2026.05.19.726271</dc:identifier>
<dc:title><![CDATA[A Plasmodium falciparum Pangenome Resource to Drive Structural Variant Discovery and to assist Malaria Control]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-21</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.05.19.726178v1?rss=1">
<title>
<![CDATA[
A unified framework for batch correction and missing data handling in large-scale and single-cell mass spectrometry proteomics 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.05.19.726178v1?rss=1
</link>
<description><![CDATA[
Large-scale mass spectrometry (MS)-based proteomics, including single-cell proteomics, is routinely affected by technical variation arising from discrete batch effects, inter-laboratory differences and continuous signal drift during data acquisition. Current correction strategies typically address these sources of unwanted variation independently and often require either removal of proteins with missing values or imputation before correction, both of which may lead to information loss and potential amplification of technical bias. Here we present NMFBatch, a unified statistical framework that simultaneously models discrete and continuous unwanted variation in bulk and single-cell proteomics data. NMFBatch integrates non-negative matrix factorization with generalized additive modelling and directly accommodates missing values, thereby enabling both on-the-fly imputation during correction and optional post-correction imputation. Benchmarking against six batch-correction methods using multi-laboratory reference datasets and a large plasma proteomics cohort shows that NMFBatch consistently reduces batch-associated variation while preserving biological structure under both balanced and confounded experimental designs. Application to single-cell proteomics data further showed effective reduction of TMT- and acquisition-associated variation while retaining biologically meaningful clustering. Together, these results establish NMFBatch as a flexible framework for modelling unwanted variation in proteomics experiments, with potential applications in cross-cohort harmonization and integrative proteomics analysis.
]]></description>
<dc:creator><![CDATA[ Anwar, A. M., Bayoumi, S., Lahti, L., Coffey, E. ]]></dc:creator>
<dc:date>2026-05-21</dc:date>
<dc:identifier>doi:10.64898/2026.05.19.726178</dc:identifier>
<dc:title><![CDATA[A unified framework for batch correction and missing data handling in large-scale and single-cell mass spectrometry proteomics]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-21</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.05.19.726290v1?rss=1">
<title>
<![CDATA[
Antimicrobial peptide databases and prediction tools: Toward a standard evaluation framework 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.05.19.726290v1?rss=1
</link>
<description><![CDATA[
Antimicrobial resistance (AMR) has a profound impact on animal and human health and is associated with substantial morbidity, mortality and public health costs. There is a clear need to develop novel, effective antibiotic agents, which can overcome the current AMR crisis. Antimicrobial peptides (AMPs) may offer such a solution and have attracted growing attention for their potential to combat AMR. In parallel, the growing availability of peptide sequences in public databases has stimulated the development of numerous machine learning and deep learning tools to predict antimicrobial activity computationally. However, it remains unclear how reliably these tools can be compared, as existing studies often rely on heterogeneous datasets and inconsistent evaluation protocols that may lead to data leakage and inflated performance estimates. This raises a central question: what evaluation criteria and benchmark resources are needed to enable fair, reproducible, and biologically meaningful assessment of AMP prediction tools? We address this question by focusing specifically on antibacterial peptides (ABPs). We first provide an overview of AMP databases relevant to antibacterial activity and compare their content, redundancy, and experimental metadata. We then critically assess existing computational tools for ABP prediction, highlighting key limitations related to dataset construction, affinity to certain sequences, data leakage, and inconsistent performance reporting. Based on these limitations, we propose a reference evaluation framework designed to improve comparability, reproducibility, and practical utility in ABP prediction. Finally, we provide targeted recommendations for AMP databases and future tool development to support more robust progress in the computational discovery of ABPs.
]]></description>
<dc:creator><![CDATA[ Cisterna Garcia, A., Gonzalez Lopez, A. M., Vozi, A., Esteban, M. A., Egli, A., Jutzeler, C., Palma, J., Sanchez-Ferrer, A., Botia, J. A. ]]></dc:creator>
<dc:date>2026-05-21</dc:date>
<dc:identifier>doi:10.64898/2026.05.19.726290</dc:identifier>
<dc:title><![CDATA[Antimicrobial peptide databases and prediction tools: Toward a standard evaluation framework]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-21</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.05.19.726314v1?rss=1">
<title>
<![CDATA[
zFISHer: Automated 3D Registration, Detection, and Colocalization with Interactive Curation for Sequential Multiplexed FISH 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.05.19.726314v1?rss=1
</link>
<description><![CDATA[
Sequential multiplexed fluorescence in situ hybridization (FISH) enables spatially resolved molecular profiling in cell monolayers, but analyzing puncta colocalization across three-dimensional (3D) datasets remains a labor-intensive bottleneck. zFISHer is an open-source application built on the napari viewer that provides complete automation of sequential FISH image processing in conjunction with interactive user-curation tools. zFISHer provides end-to-end analysis of paired FISH datasets, encompassing nuclear segmentation, automated puncta detection on unaligned z-stacks, multi-round image registration via translation-constrained RANSAC with optional B-spline deformable warping, precise transformation of puncta coordinates into aligned space, consensus nuclei generation, interactive editing with real-time collision detection, and pairwise and tri-channel colocalization analysis with statistics. This includes a Fishing Hook raycasting algorithm that enables users to locate puncta at their true 3D centroids by identifying intensity maxima along the camera ray, eliminating manual z-slice navigation, complemented by a sub-voxel volume optimization. The included batch processing mode enables high-throughput unattended analysis of multiple experimental datasets.
]]></description>
<dc:creator><![CDATA[ Staller, S. A., Valentine, V., Burden, S. ]]></dc:creator>
<dc:date>2026-05-21</dc:date>
<dc:identifier>doi:10.64898/2026.05.19.726314</dc:identifier>
<dc:title><![CDATA[zFISHer: Automated 3D Registration, Detection, and Colocalization with Interactive Curation for Sequential Multiplexed FISH]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-21</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.05.19.726275v1?rss=1">
<title>
<![CDATA[
ParaDISM: Precise mapping of short reads to genes with highly homologous regions 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.05.19.726275v1?rss=1
</link>
<description><![CDATA[
Background: Genes with highly similar genomic copies (paralogs, tandem duplications and pseudogenes) pose a major challenge for Short-Read High Throughput Sequencing (srHTS). High sequence similarity makes it difficult to unambiguously identify the sequences of origin of short reads. This results in misalignment artifacts which can propagate through bioinformatic pipelines and increase error rates in variant calling. Results: We present ParaDISM, a pipeline that refines standard alignments to improve read placement and reduce misalignment-driven false variant calls in highly homologous sequences. ParaDISM assigns a read/read pair to a sequence only when supported by unambiguous sequence-specific evidence by using a multiple sequence alignment of reference sequences to identify disambiguating positions. An optional iterative refinement procedure calls variants from confidently assigned reads, updates the reference sequences, and processes remaining non-assigned reads. We evaluated the performance of ParaDISM both in terms of read alignment and the resulting short variant calls using extensive computational simulation experiments and the Genome in a Bottle HG002 benchmark. We applied ParaDISM to reanalyze two case studies: five public tumour exomes at the GNAQ/GNAQP1 locus, and 18 short-read sequencing datasets of patients diagnosed with Autosomal Dominant Polycystic Kidney Disease (16 exomes and 2 panel sequencing datasets). Compared to the standard aligners (bowtie2, bwa-mem and minimap2), ParaDISM reduced the number of misalignment artifacts and false variant calls, resulting in an increased specificity and precision of the results. Conclusions: ParaDISM improves the precision of read placement and single-nucleotide variant calling in highly homologous reference sequences. By reducing the number of false variant calls caused by misalignment artifacts, ParaDISM provides a stronger level of evidence for the called variants compared to currently available approaches.
The pipeline is open source and available under the MIT license at github.com/BioGeMT/ParaDISM.
]]></description>
<dc:creator><![CDATA[ Tzimotoudis, D., Farrugia, R., Zammit, J., Masini, M. C., Balestrucci, A., Carbott, F. B., Wettinger, S. B., Alexiou, P., Ciach, M. A. ]]></dc:creator>
<dc:date>2026-05-21</dc:date>
<dc:identifier>doi:10.64898/2026.05.19.726275</dc:identifier>
<dc:title><![CDATA[ParaDISM: Precise mapping of short reads to genes with highly homologous regions]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-21</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.05.19.726040v1?rss=1">
<title>
<![CDATA[
Comparative somatic genomics reveals divergent development of cell lineages across scleractinian corals 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.05.19.726040v1?rss=1
</link>
<description><![CDATA[
Somatic mutations may drive adaptation and aging across diverse life forms, yet their role remains poorly understood in many early-branching animals. Here, we compare somatic mutation accumulation in the robust coral Orbicella faveolata with previous findings in the complex coral Acropora palmata. Whole-genome sequencing revealed high fixation of somatic genetic variants in O. faveolata, particularly in older, interior regions of colonies, contrasting with A. palmata. These patterns suggest distinct cell population dynamics between clades, indicating a segregated, mammal-like germline in O. faveolata, whereas such a germline remains undetected in A. palmata. This underscores the diversity of somatic evolutionary mechanisms across scleractinian corals.
]]></description>
<dc:creator><![CDATA[ Conn, T., Reusch, T. B. H., Werner, B., Baums, I. ]]></dc:creator>
<dc:date>2026-05-21</dc:date>
<dc:identifier>doi:10.64898/2026.05.19.726040</dc:identifier>
<dc:title><![CDATA[Comparative somatic genomics reveals divergent development of cell lineages across scleractinian corals]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-21</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.05.19.713680v1?rss=1">
<title>
<![CDATA[
Efficient and Robust Genomic DNA Isolation and Next-Generation Sequencing Library Preparation from Recalcitrant Wild Grape Species 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.05.19.713680v1?rss=1
</link>
<description><![CDATA[
This protocol details the extraction of high-molecular-weight genomic DNA from grapevine tissues (wild and cultivated Vitis spp., including pathogen-infected samples) and the subsequent preparation of Illumina(R) whole-genome sequencing libraries using bead-bound Tn5 transposase. It is designed to overcome challenges from polyphenolic compounds and secondary metabolites in wild plants, providing a cost-effective workflow for large-scale population genomics. It includes recipes for buffers, incubation times, critical notes, and troubleshooting tips to maximize yield and library quality. Although designed for grapevine DNA, this protocol is potentially applicable to other similar wild plant species.
]]></description>
<dc:creator><![CDATA[ Bhattarai, A., Smith, J., Abdelgaffar, H., Carpenter, R., Mishra, S., Fuentes, J. L. J., Shirsekar, G. ]]></dc:creator>
<dc:date>2026-05-21</dc:date>
<dc:identifier>doi:10.64898/2026.05.19.713680</dc:identifier>
<dc:title><![CDATA[Efficient and Robust Genomic DNA Isolation and Next-Generation Sequencing Library Preparation from Recalcitrant Wild Grape Species]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-21</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.05.19.726322v1?rss=1">
<title>
<![CDATA[
Combinatorial pioneer transcription factor binding reinforces bivalent epigenetic states to preserve lineage fidelity 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.05.19.726322v1?rss=1
</link>
<description><![CDATA[
Combinatorial binding of transcription factors (TFs) is central to eukaryotic gene regulation, providing regulatory specificity and robustness to cell fate control. However, its impact on epigenetic regulation remains poorly understood. Here, we show that the pioneer TFs GATA6, EOMES, and SOX17 cooperate with the zinc finger TF PRDM1 to recruit Polycomb Repressive Complexes (PRCs) and establish enhancers marked by H3K4me1 and PRC-associated histone modifications during endoderm development. Increasing the number and diversity of pioneer TFs bound at enhancers drives synergistic nucleosome remodeling and promotes the formation of hyper-bivalent enhancers that reinforce repression of alternative-lineage programs. Together, our findings demonstrate that combinatorial pioneer TF binding creates locally accessible regions that facilitate recruitment of not only active but also PRC-associated epigenetic regulators to preserve lineage fidelity during development.
]]></description>
<dc:creator><![CDATA[ Mirizio, G., Buckley, M., Ludwig, K., Matsui, S., Sampson, S., Lim, H.-W., Iwafuchi, M. ]]></dc:creator>
<dc:date>2026-05-21</dc:date>
<dc:identifier>doi:10.64898/2026.05.19.726322</dc:identifier>
<dc:title><![CDATA[Combinatorial pioneer transcription factor binding reinforces bivalent epigenetic states to preserve lineage fidelity]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-21</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.05.19.726264v1?rss=1">
<title>
<![CDATA[
Differential Gene Expression in the Tropical House Cricket and Its Iridovirus in Healthy versus Diseased Specimens 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.05.19.726264v1?rss=1
</link>
<description><![CDATA[
The tropical house cricket, Gryllodes sigillatus, is a mass-produced insect that is used as a protein source for pets and livestock. However, intensive mass-rearing conditions, coupled with high genetic relatedness, create an ideal environment for the spread of pathogenic microbes that severely impact production. Cricket iridovirus (CrIV) is a pathogen that impedes cricket growth and causes significant losses for cricket farmers. Interestingly, recent studies have shown that CrIV is often present asymptomatically, yet the molecular basis of the emergence of disease symptoms remains unknown. To address this, we sampled healthy and diseased crickets and examined differences in cricket and CrIV gene expression via RNAseq. Using differential gene expression analysis and functional enrichment analysis, we found significant differences in host and viral gene expression between healthy and diseased crickets, including genes involved in immunity. Interestingly, while we observed high CrIV gene expression across the entire CrIV genome in sick populations, healthy asymptomatic populations showed elevated expression at a single viral locus. Our results not only shed light on the cricket immune response to CrIV infection but also identify a viral gene that is highly expressed during covert infections, suggesting its potential role in suppressing the host's immune response. These findings enhance our understanding of how CrIV interacts with its cricket host, providing essential insights for developing targeted strategies to manage CrIV outbreaks in cricket mass-rearing facilities.
]]></description>
<dc:creator><![CDATA[ Hinton, J. A., Walt, H. K., Duffield, K. R., Ramirez, J. L., Meyer, F., Hoffmann, F. G. ]]></dc:creator>
<dc:date>2026-05-21</dc:date>
<dc:identifier>doi:10.64898/2026.05.19.726264</dc:identifier>
<dc:title><![CDATA[Differential Gene Expression in the Tropical House Cricket and Its Iridovirus in Healthy versus Diseased Specimens]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-21</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.05.19.726393v1?rss=1">
<title>
<![CDATA[
Structural Pockets and Interacting RNA-Associated Ligands (SPIRAL): A DSSR-enabled Meta-Analysis of RNA-Small Molecule Recognition 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.05.19.726393v1?rss=1
</link>
<description><![CDATA[
Small molecules that target structured RNA hold therapeutic promise across a wide range of diseases, yet the structural principles governing RNA-ligand recognition remain poorly defined. Here we present SPIRAL (Structural Pockets and Interacting RNA-Associated Ligands), a curated database of 1,098 RNA-small molecule structures from the Protein Data Bank covering 1,137 ligand-binding events across six functional RNA categories: riboswitches, ribozymes, synthetic aptamers, G-quadruplexes, ribosomal RNA, and regulatory RNA motifs. A customized pipeline built on DSSR (Dissecting the Spatial Structure of RNA) extracts structural interaction parameters from each complex, capturing stacking geometry, hydrogen-bond topology by RNA moiety, backbone contacts, groove engagement, and tertiary motif context. Unsupervised clustering of these fingerprints resolves six mechanistically distinct binding modes whose distribution is strongly governed by RNA functional class, demonstrating that different RNA categories engage small molecules through fundamentally different chemical strategies. To enable category-independent comparison of interaction quality across these mechanistically diverse modes, we introduce the Composite Binding Quality Score (CBQS), a seven-metric framework that ranks riboswitches highest and regulatory RNA motifs lowest among the six categories, while ribozymes, synthetic aptamers, and G-quadruplexes achieve statistically equivalent intermediate scores through three distinct recognition strategies. Analysis of 275 non-redundant affinity-characterized entries identifies C2'-endo sugar pucker count and total buried contact surface area as the dominant independent predictors of binding affinity. 
Both predictors are enriched at junction loops, pseudoknots, and base multiplet networks, the same tertiary structural sites most under-engaged by current regulatory RNA motif binders, suggesting that ligands designed to contact these sites would improve both potency and selectivity simultaneously.
]]></description>
<dc:creator><![CDATA[ Lu, X.-J., Wang, Y. ]]></dc:creator>
<dc:date>2026-05-21</dc:date>
<dc:identifier>doi:10.64898/2026.05.19.726393</dc:identifier>
<dc:title><![CDATA[Structural Pockets and Interacting RNA-Associated Ligands (SPIRAL): A DSSR-enabled Meta-Analysis of RNA-Small Molecule Recognition]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-21</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.05.19.726397v1?rss=1">
<title>
<![CDATA[
Multi-layer transcriptomic characterization of age-related immune dynamics 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.05.19.726397v1?rss=1
</link>
<description><![CDATA[
Despite the pivotal role of mRNA isoform diversity in governing immune cell function, current investigations into peripheral immune aging have predominantly focused on gene-level expression, obscuring deeper regulatory layers of transcriptome complexity. Here, we leveraged a 5' scRNA-seq atlas comprising approximately 2.5 million PBMCs from 378 healthy donors. We demonstrate that immune aging is characterized by profound, non-linear transcriptional reprogramming that extends beyond gene-level shifts to include fine-tuned regulation of alternative transcription initiation and splice site selection. By quantifying the transcriptional activity of cis-regulatory elements, we resolved their contributions to age-related expression dynamics. Notably, we identified a subset of endogenous retroviruses that are reactivated in older individuals, some of which served as alternative promoters driving the production of chimeric transcripts. Furthermore, our analysis revealed EDA as a top-ranked gene consistently upregulated with age across multiple independent cohorts. Increasing EDA expression in in vitro-stimulated naive CD4+ T cells from young individuals recapitulated aged phenotypes. This comprehensive resource elucidates the multi-layered transcriptomic landscape of the aging immune system and facilitates the identification of novel drivers of immune aging.
]]></description>
<dc:creator><![CDATA[ Zhao, Z., Zhao, S., Jin, J., Ni, T. ]]></dc:creator>
<dc:date>2026-05-21</dc:date>
<dc:identifier>doi:10.64898/2026.05.19.726397</dc:identifier>
<dc:title><![CDATA[Multi-layer transcriptomic characterization of age-related immune dynamics]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-21</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.05.19.726105v1?rss=1">
<title>
<![CDATA[
S-IGTD: supervised tabular-to-image topology learning via between-group correlation for multiclass classification of biological data 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.05.19.726105v1?rss=1
</link>
<description><![CDATA[
Motivation: Tabular-to-image methods allow convolutional neural network (CNN)-based classifiers to analyse high-dimensional biological tables by mapping features onto a two-dimensional grid. Existing layouts are usually driven by unsupervised global correlation, which can place class-discriminative features far apart when nuisance or housekeeping covariation dominates the total covariance structure. Results: We present the Supervised Image Generator for Tabular Data (S-IGTD), a supervised extension of IGTD that optimizes tabular-to-image topology by replacing total-correlation distance with one minus the absolute between-group correlation, computed from class-wise feature means, under the Within-And-Between-Analysis (WABA) decomposition. We prove entrywise consistency of the supervised distance matrix under standard moment conditions and identify balanced-class settings in which S-IGTD improves a Signal Dispersion Score (SDS)-related topology objective. In controlled simulations targeting between-group signal, S-IGTD outperformed Euclidean- and correlation-distance IGTD variants in SDS, accuracy and macro-F1 score. Across five biological benchmarks ranging from 4- to 91-class classification, S-IGTD produced compact class-supervised layouts, with 24/35 Holm-adjusted significant SDS wins against seven non-reference layout controls. As a secondary downstream diagnostic, a CNN with batch normalization showed higher mean accuracy than random layouts and correlation-distance IGTD on all real datasets, and higher mean accuracy than Euclidean-distance IGTD on four of five datasets, with the clearest gains on large multiclass cancer and methylation benchmarks. Availability and implementation: Source code, datasets, configuration files and reproducibility scripts are freely available at https://github.com/hanmingwu1103/S-IGTD.
]]></description>
<dc:creator><![CDATA[ WU, H.-M. ]]></dc:creator>
<dc:date>2026-05-21</dc:date>
<dc:identifier>doi:10.64898/2026.05.19.726105</dc:identifier>
<dc:title><![CDATA[S-IGTD: supervised tabular-to-image topology learning via between-group correlation for multiclass classification of biological data]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-21</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.05.19.726067v1?rss=1">
<title>
<![CDATA[
A framework for peptide identification on commercial nanopore sequencing platforms 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.05.19.726067v1?rss=1
</link>
<description><![CDATA[
Direct single-molecule peptide analysis could in principle enable rapid and sensitive identification of pathogen-derived or disease-associated biomarkers without reliance on mass spectrometry. However, existing nanopore peptide sensing methods are typically constrained by limited throughput and lack of accessibility beyond specialized setups. Here, we present an integrated experimental-computational framework for DNA-linked peptide translocation on a commercially available, high-throughput nanopore sequencing platform, the MinION. Synthetic peptides were covalently bound to oligonucleotides at both termini. The resulting peptide-DNA constructs were then translocated through the CsgG-CsgF pores using a DNA motor protein. Current traces were segmented using the known DNA sequences to extract peptide-associated signal regions. From these segments, we extracted signal features and trained feature-based and deep-learning classifiers to distinguish peptides, balancing interpretability and classification performance. We establish a framework for peptide identification using standard nanopore sequencing hardware. Across a diverse panel of synthetic peptides, our approach resolves single-amino-acid substitutions, maintains performance across independent sequencing runs, and correctly identifies peptides in blind mixtures. Interpretable model analyses connect classifier decisions and common errors to specific signal motifs. By combining commercially available instrumentation with a reproducible experimental and computational workflow, this framework lowers the barrier to nanopore-based proteomics and enables broader adoption across laboratories. It provides a foundation for future developments in amino acid modification detection and sequence analysis.
]]></description>
<dc:creator><![CDATA[ Beslic, D., Kucklick, M., Graap, E., Sedaghatjoo, S., Renard, B. Y., Fuchs, S., Engelmann, S., Koerber, N. ]]></dc:creator>
<dc:date>2026-05-21</dc:date>
<dc:identifier>doi:10.64898/2026.05.19.726067</dc:identifier>
<dc:title><![CDATA[A framework for peptide identification on commercial nanopore sequencing platforms]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-21</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.05.19.726151v1?rss=1">
<title>
<![CDATA[
Spectral Prompting: Unsupervised Recovery of Human Hair Follicle Cell-Type and Multiscale Systems Architecture from Bulk and Single-Cell RNA-Seq Datasets via Single-Gene Seeded Spectral Unfolding 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.05.19.726151v1?rss=1
</link>
<description><![CDATA[
Bulk RNA sequencing datasets are assumed to carry minimal resolvable programmatic and cell type biological information; as such, in the absence of single-cell resolution, researchers prioritise data analysis approaches based on differential expression, or rely on deconvolution and co-expression methods that require external reference panels, large multi-sample cohorts, or prior single-cell data to resolve cell-type structure. Here I describe the recovery of specialised cell-type and systems gene expression architecture resolved from a static gene expression dataset of untreated cultured human hair follicles (pooled from N=12 patients) isolated from scalp skin. To achieve this, I used graph theoretic methods to mathematically transform gene expression data into a latent space of relational structure, which was spectrally organised into coarse- and fine-grained modes and partitioned using a purpose-built computational algorithm. This permitted the synthesis of a computational Spectral Prompting system, whereby a single gene can be seeded to unfold to reveal associated partners across manifold projections in gene expression space. Individual projections across the manifold can reveal rich individual gene expression programmes, which can then be aggregated to identify core-associated genes for a given spectral gene prompt, both within the manifold analysed and across more than one manifold construction. With this, I recover hitherto unresolved gene expression programmes from bulk data, including, but not limited to, epithelial hair follicle stem cell (eHFSC), hair shaft, dermal papilla and endothelial gene expression signatures. Focusing on querying KRT15, a human anagen bulge eHFSC and progenitor marker, raw output from individual spectral prompts during testing recovered known eHFSC-associated genes including LGR5, LHX2 and CXCL14, and discovered new candidate human eHFSC and progenitor cell-associated markers, such as RGMA and MUCL1, which were validated in situ.
Finally, I briefly demonstrate that the technique can be similarly applied to single-cell data (GSE129611), whereby a KRT15 gene prompt from a combined expression matrix was mapped to a KRT15+/CXCL14+/LHX2+/DIO2+/SFRP1+ cell population (31/6000 cells) independent of standard clustering tools. Building on this foundation, the method will be developed to study how latent gene expression space shifts following perturbation or pathology.
]]></description>
<dc:creator><![CDATA[ Purba, T. ]]></dc:creator>
<dc:date>2026-05-21</dc:date>
<dc:identifier>doi:10.64898/2026.05.19.726151</dc:identifier>
<dc:title><![CDATA[Spectral Prompting: Unsupervised Recovery of Human Hair Follicle Cell-Type and Multiscale Systems Architecture from Bulk and Single-Cell RNA-Seq Datasets via Single-Gene Seeded Spectral Unfolding]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-21</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.05.19.726162v1?rss=1">
<title>
<![CDATA[
Heterogeneity-driven adaptive scale graph learning for subcellular spatial transcriptomics 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.05.19.726162v1?rss=1
</link>
<description><![CDATA[
Spatial transcriptomics enables gene expression profiling within intact tissue sections, providing an important basis for analyzing tissue organization, cellular heterogeneity, and microenvironmental interactions. However, existing spatial structure identification methods often integrate spatial information using fixed neighborhoods or predefined smoothing scales, which limits their ability to adapt to region-specific structural heterogeneity. In homogeneous regions, broader spatial smoothing can help preserve continuous tissue structures, whereas in regions with complex boundaries or mixed cell populations, excessive smoothing may obscure local expression differences and fine-scale structural changes. Therefore, it is necessary to develop an adaptive graph learning framework that can adjust the range of spatial information integration according to tissue structural heterogeneity. In this study, we propose HAST, a heterogeneity-driven adaptive-scale graph learning framework for spatial transcriptomics. HAST adaptively determines graph filtering scales according to spatial structural heterogeneity, enabling flexible information aggregation across different tissue regions. It further decomposes gene expression signals into low-frequency structural components and high-frequency residual components, thereby jointly modeling global spatial continuity and local expression variations. Experiments on high-resolution spatial transcriptomics datasets show that HAST improves spatial structure identification and cross-section generalization. Tumor-enriched cluster identification and neighborhood enrichment analysis further demonstrate its ability to characterize tumor-associated spatial regions and microenvironmental organization.
]]></description>
<dc:creator><![CDATA[ Shi, W., Shen, C., Liu, Y., Xiao, Q., Luo, J. ]]></dc:creator>
<dc:date>2026-05-21</dc:date>
<dc:identifier>doi:10.64898/2026.05.19.726162</dc:identifier>
<dc:title><![CDATA[Heterogeneity-driven adaptive scale graph learning for subcellular spatial transcriptomics]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-21</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.05.19.726174v1?rss=1">
<title>
<![CDATA[
BioRAG-DRAG: A Multimodal Biological Retrieval Layer for Local-First Biomedical Agents 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.05.19.726174v1?rss=1
</link>
<description><![CDATA[
Biomedical agents need reliable access to heterogeneous evidence: literature text, gene and pathway records, protein sequences, DNA/cDNA sequences, and structured biological relations. Classical sequence tools such as BLAST remain the right choice for alignment-grounded verification, but they are not a unified context interface for large language model agents. We present BioRAG-DRAG, a local-first multimodal retrieval layer that combines pluggable neural sequence-text retrieval, BLAST verification, and graph-based evidence packaging. Specialized encoders such as ESM-2 can serve protein partitions, while OmniGene CPT provides a unified biological-language backbone for mixed sequence/text and agent-facing use; BLAST reranks or verifies sequence candidates; and DRAG graphs expose typed, traceable paths for downstream agents. We introduce BioRAG-Standard v0, a partitioned corpus/library with 257,886 retrievable records and an initial annotation layer for engineering evaluation built from Open-Rosalind Standard biomedical records and sequence-window extensions. On an in-index sequence-window stress test, BLAST nearly saturates biological matching, while vector retrieval recovers substantial but lower biological match rates. On held-out parent-fragment controls, public protein encoders outperform the current OmniGene protein-window embedding, while DNA/cDNA dense retrieval remains weak even with off-the-shelf Nucleotide Transformer pooling; this supports a model-agnostic BioRAG design rather than a claim that one unified generator backbone is the best sequence-search encoder. Indexed Chroma lookup over Standard text and 100k sequence-window collections adds only small lookup overhead after query embedding; this does not measure end-to-end instant latency. 
Finally, exploratory sequence DRAG traces show inspectable biological neighborhoods, including immunoglobulin-family and gene-symbol modules, with initial graph controls indicating non-random but partly sequence-similarity-driven structure. These results support a bounded architecture: vector retrieval supplies unified candidate context, while BLAST and DRAG provide biological verification and evidence attribution.
]]></description>
<dc:creator><![CDATA[ Wang, L. ]]></dc:creator>
<dc:date>2026-05-21</dc:date>
<dc:identifier>doi:10.64898/2026.05.19.726174</dc:identifier>
<dc:title><![CDATA[BioRAG-DRAG: A Multimodal Biological Retrieval Layer for Local-First Biomedical Agents]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-21</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.05.19.726217v1?rss=1">
<title>
<![CDATA[
A phylogeny-guided framework for decoding mechanisms of human endogenous retrovirus regulation in health and disease 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.05.19.726217v1?rss=1
</link>
<description><![CDATA[
Human endogenous retroviruses (HERVs) are remnants of ancient infections that make up ~8% of the human genome. Their activity influences development, immunity, and cancer, but studying them has been limited by a key technical challenge: short-read sequencing cannot uniquely assign reads to these highly repetitive elements. Here, we present ERVmancer, a phylogeny-informed method that resolves the read-mapping ambiguity and quantifies HERV expression across scales, from individual loci to entire retroviral clades, depending on mapping confidence. Benchmarking with sample-matched long- and short-read data generated in this study demonstrates that ERVmancer outperforms existing approaches in both sensitivity and specificity. Application of ERVmancer recapitulates known HERV expression patterns in multiple sclerosis and uncovers new biology in breast cancer, including suppression of HERVH-LTR7 by p53. By enabling accurate and scalable quantification of integrated retroviral elements, ERVmancer provides a broadly applicable resource for investigating retroviral mechanisms in health and disease.
]]></description>
<dc:creator><![CDATA[ Patterson, A., Duong, B., Yoon, L., Foster, M., MacMullen, L., Wickramasinghe, J., Lucas, A., Srivastava, A., Jacobson, S., Murphy, M. E., Soldan, S., Lieberman, P. M., Auslander, N. ]]></dc:creator>
<dc:date>2026-05-21</dc:date>
<dc:identifier>doi:10.64898/2026.05.19.726217</dc:identifier>
<dc:title><![CDATA[A phylogeny-guided framework for decoding mechanisms of human endogenous retrovirus regulation in health and disease]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-21</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.05.19.726053v1?rss=1">
<title>
<![CDATA[
Lifestyles of Gypsy-family transposons shape their regulatory mechanisms 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.05.19.726053v1?rss=1
</link>
<description><![CDATA[
Transposable elements are a highly diverse group of selfish genomic elements, prevalent across the tree of life, whose uncontrolled propagation poses a threat to genome stability. Recent studies have explored the evolution of Drosophila melanogaster transposable elements, their co-evolution with the host genome, and mechanisms that regulate their activity. However, little is known about their cross-species evolutionary patterns. Long terminal repeat (LTR) retrotransposons are the most active group of transposable elements in Drosophila. They are broadly separated into retroelements, which are active in the germline, and insect endogenous retroviruses that are active in the soma. Somatic elements are hypothesised to infect the germline through their acquisition of virus-derived proteins such as Envelope and sORF2, thus multiplying through successive generations. In this study, we curated the sequences of LTR retrotransposons in 249 drosophilid genomes, allowing us to study their evolution across these species and highlight their varying degrees of conservation. Furthermore, we reveal multiple instances of Envelope protein loss or inactivation that suggest shifts in the expression pattern of these transposons, likely accompanied by adopting different transcriptional control mechanisms. We contrast this with the evolutionary history of sORF2, which we found to be much more stable. Lastly, we examined variations in transposon LTR regions responsible for transcriptional regulation and used predictive modelling to suggest six transcription factors likely involved in their tissue-specific expression. Altogether, we reveal complex, interspecies evolutionary patterns of Gypsy-family LTR retrotransposons and highlight examples of their co-evolution with their host genome.
]]></description>
<dc:creator><![CDATA[ Papameletiou, A.-M., Czech Nicholson, B., Bornelöv, S., Hannon, G. J. ]]></dc:creator>
<dc:date>2026-05-21</dc:date>
<dc:identifier>doi:10.64898/2026.05.19.726053</dc:identifier>
<dc:title><![CDATA[Lifestyles of Gypsy-family transposons shape their regulatory mechanisms]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-21</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://www.biorxiv.org/content/10.64898/2026.05.18.725946v1?rss=1">
<title>
<![CDATA[
geneML: Gene annotation across diverse fungal species using deep learning 
]]>
</title>
<link>
https://www.biorxiv.org/content/10.64898/2026.05.18.725946v1?rss=1
</link>
<description><![CDATA[
Accurate gene prediction remains a major bottleneck in fungal genomics, where lineage diversity and alternative splicing challenge existing ab initio methods. Here, we present geneML, a deep learning-based gene prediction tool tailored to fungal genomes. Across nine reference genomes spanning diverse fungal taxa, geneML improved gene-level F1 score from 64.9 to 67.1 compared to BRAKER3 with protein-based hints, driven by substantially higher recall (69.0 vs. 64.1) at equivalent precision. geneML also remains fast, averaging around 6 minutes per genome on a standard 8-core CPU. A key feature of geneML is its ability to predict alternative transcripts. Compared to Fusarium graminearum Iso-Seq control data, it achieves 41.1% transcript recall and 71.1% precision, outperforming AUGUSTUS (33.8% recall, 48.9% precision), one of the few tools that support isoform prediction. The predicted transcript diversity is consistent with experimentally observed fungal alternative splicing patterns. Reannotation of the curated training dataset further suggests improved biological completeness, with geneML recovering 15.3% more genes containing complete PFAM domains than the reference annotation. These results demonstrate that geneML enables faster, more sensitive, and more biologically informative fungal genome annotation. geneML is available as an open-source command-line tool at https://github.com/hexagonbio/geneML.
]]></description>
<dc:creator><![CDATA[ Vader, L., Harvey, C. J., Weber, T., Hon, L. S. ]]></dc:creator>
<dc:date>2026-05-21</dc:date>
<dc:identifier>doi:10.64898/2026.05.18.725946</dc:identifier>
<dc:title><![CDATA[geneML: Gene annotation across diverse fungal species using deep learning]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory</dc:publisher>
<prism:publicationDate>2026-05-21</prism:publicationDate>
<prism:section></prism:section>
</item>
</rdf:RDF>
