	<rdf:RDF xmlns:admin="http://webns.net/mvcb/" xmlns="http://purl.org/rss/1.0/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:prism="http://purl.org/rss/1.0/modules/prism/" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:syn="http://purl.org/rss/1.0/modules/syndication/">
	<channel rdf:about="https://biorxiv.org">
	<admin:errorReportsTo rdf:resource="mailto:biorxiv@cshlpress.edu"/>
	<title>bioRxiv Channel: NHGRI MorPhiC</title>
	<link>https://biorxiv.org</link>
	<description>
	This feed contains articles for bioRxiv Channel "NHGRI MorPhiC"
	</description>

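	<items>
	<rdf:Seq>
	<rdf:li rdf:resource="https://biorxiv.org/cgi/content/short/2024.09.29.615716v1?rss=1"/>
	<rdf:li rdf:resource="https://biorxiv.org/cgi/content/short/2023.05.03.539283v1?rss=1"/>
	<rdf:li rdf:resource="https://biorxiv.org/cgi/content/short/2023.10.30.564796v1?rss=1"/>
	<rdf:li rdf:resource="https://biorxiv.org/cgi/content/short/2024.07.15.603527v1?rss=1"/>
	<rdf:li rdf:resource="https://biorxiv.org/cgi/content/short/2025.04.26.650787v1?rss=1"/>
	<rdf:li rdf:resource="https://biorxiv.org/cgi/content/short/2025.05.23.655879v1?rss=1"/>
	<rdf:li rdf:resource="https://biorxiv.org/cgi/content/short/2025.12.23.696311v1?rss=1"/>
	<rdf:li rdf:resource="https://biorxiv.org/cgi/content/short/2025.09.01.673517v1?rss=1"/>
	<rdf:li rdf:resource="https://biorxiv.org/cgi/content/short/2026.03.09.710580v1?rss=1"/>
	<rdf:li rdf:resource="https://biorxiv.org/cgi/content/short/2025.10.15.682667v1?rss=1"/>
	<rdf:li rdf:resource="https://biorxiv.org/cgi/content/short/2024.02.13.580048v1?rss=1"/>
	</rdf:Seq>
	</items>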
	<prism:eIssn/>
	<prism:publicationName>bioRxiv</prism:publicationName>
	<prism:issn/>

	<image rdf:resource=""/>
	</channel>
	<image rdf:about="">
	<title>bioRxiv</title>
	<url/>
	<link>https://biorxiv.org</link>
	</image>
	<item rdf:about="https://biorxiv.org/cgi/content/short/2024.09.29.615716v1?rss=1">
<title>
<![CDATA[
Leveraging CRISPR activation for rapid assessment of gene editing products in human pluripotent stem cells 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2024.09.29.615716v1?rss=1
</link>
<description><![CDATA[
Verification of genome editing in human pluripotent stem cells (hPSCs), particularly at silent loci, is desirable but challenging because it often requires complex and time-intensive lineage-specific or tissue-specific differentiation to induce expression of the edited genes. Here, we establish a rapid and effective workflow for the verification of hPSC lines with genome editing in unexpressed genes using CRISPR-mediated transcriptional activation (CRISPRa). We systematically compared the efficiency of various CRISPRa systems in hPSCs, identifying the SAM system as the most potent for activating silent genes in hPSCs. Furthermore, we demonstrated enhanced gene activation by combining the SAM system with TET1, a demethylation module. By inducing targeted gene activation in undifferentiated hPSCs using CRISPRa, we successfully verified single and dual reporter hPSC lines and conducted functional tests of dTAG knock-ins and silent gene knockouts within 48 hours. This approach eliminates the need for cell differentiation to access genes only expressed by differentiated cells, offering a handy assay for verifying gene editing in hPSCs.
]]></description>
<dc:creator>Wu, Y.</dc:creator>
<dc:creator>Zhong, A.</dc:creator>
<dc:creator>Evangelisti, A.</dc:creator>
<dc:creator>Sidharta, M.</dc:creator>
<dc:creator>Studer, L.</dc:creator>
<dc:creator>Zhou, T.</dc:creator>
<dc:date>2024-09-30</dc:date>
<dc:identifier>doi:10.1101/2024.09.29.615716</dc:identifier>
<dc:title><![CDATA[Leveraging CRISPR activation for rapid assessment of gene editing products in human pluripotent stem cells]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2024-09-30</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2023.05.03.539283v1?rss=1">
<title>
<![CDATA[
Parallel genome-scale CRISPR screens distinguish pluripotency and self-renewal 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2023.05.03.539283v1?rss=1
</link>
<description><![CDATA[
Pluripotent stem cells are defined by their self-renewal capacity, which is the ability of the stem cells to proliferate indefinitely while maintaining the pluripotent identity essential for their ability to differentiate into any somatic cell lineage. However, understanding the mechanisms that control stem cell fitness versus the pluripotent cell identity is challenging. To investigate the interplay between these two aspects of pluripotency, we performed four parallel genome-scale CRISPR-Cas9 loss-of-function screens interrogating stem cell fitness in hPSC self-renewal conditions, and the dissolution of the primed pluripotency identity during early differentiation. Comparative analyses led to the discovery of genes with distinct roles in pluripotency regulation, including mitochondrial and metabolism regulators crucial for stem cell fitness, and chromatin regulators that control pluripotent identity during early differentiation. We further discovered a core set of factors that control both stem cell fitness and pluripotent identity, including a network of chromatin factors that safeguard pluripotency. Our unbiased and systematic screening and comparative analyses disentangle two interconnected aspects of pluripotency, provide rich datasets for exploring pluripotent cell identity versus cell fitness, and offer a valuable model for categorizing gene function in broad biological contexts.
]]></description>
<dc:creator>Rosen, B. P.</dc:creator>
<dc:creator>Li, Q. V.</dc:creator>
<dc:creator>Cho, H.</dc:creator>
<dc:creator>Liu, D.</dc:creator>
<dc:creator>Yang, D.</dc:creator>
<dc:creator>Graff, S.</dc:creator>
<dc:creator>Yan, J.</dc:creator>
<dc:creator>Luo, R.</dc:creator>
<dc:creator>Verma, N.</dc:creator>
<dc:creator>Damodaran, J. R.</dc:creator>
<dc:creator>Beer, M. A.</dc:creator>
<dc:creator>Sidoli, S.</dc:creator>
<dc:creator>Huangfu, D.</dc:creator>
<dc:date>2023-05-03</dc:date>
<dc:identifier>doi:10.1101/2023.05.03.539283</dc:identifier>
<dc:title><![CDATA[Parallel genome-scale CRISPR screens distinguish pluripotency and self-renewal]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2023-05-03</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2023.10.30.564796v1?rss=1">
<title>
<![CDATA[
Decoding Heterogenous Single-cell Perturbation Responses 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2023.10.30.564796v1?rss=1
</link>
<description><![CDATA[
Understanding diverse responses of individual cells to the same perturbation is central to many biological and biomedical problems. Current methods, however, do not precisely quantify the strength of perturbation responses and, more importantly, reveal new biological insights from heterogeneity in responses. Here we introduce the perturbation-response score (PS), based on constrained quadratic optimization, to quantify diverse perturbation responses at a single-cell level. Applied to single-cell transcriptomes of large-scale genetic perturbation datasets (e.g., Perturb-seq), PS outperforms existing methods for quantifying partial gene perturbation responses. In addition, PS presents two major advances. First, PS enables large-scale, single-cell-resolution dosage analysis of perturbation, without the need to titrate perturbation strength. By analyzing the dose-response patterns of over 2,000 essential genes in Perturb-seq, we identify two distinct patterns, depending on whether a moderate reduction in their expression induces strong downstream expression alterations. Second, PS identifies intrinsic and extrinsic biological determinants of perturbation responses. We demonstrate the application of PS in contexts such as T cell stimulation, latent HIV-1 expression, and pancreatic cell differentiation. Notably, PS unveiled a previously unrecognized, cell-type-specific role of coiled-coil domain containing 6 (CCDC6) in guiding liver and pancreatic lineage decisions, where CCDC6 knockouts drive the endoderm cell differentiation towards liver lineage, rather than pancreatic lineage. The PS approach provides an innovative method for dose-to-function analysis and will enable new biological discoveries from single-cell perturbation datasets.

One sentence summary: We present a method to quantify diverse perturbation responses and discover novel biological insights in single-cell perturbation datasets.
]]></description>
<dc:creator>Song, B.</dc:creator>
<dc:creator>Liu, D.</dc:creator>
<dc:creator>Dai, W.</dc:creator>
<dc:creator>McMyn, N.</dc:creator>
<dc:creator>Wang, Q.</dc:creator>
<dc:creator>Yang, D.</dc:creator>
<dc:creator>Krejci, A.</dc:creator>
<dc:creator>Vasilyev, A.</dc:creator>
<dc:creator>Untermoser, N.</dc:creator>
<dc:creator>Loregger, A.</dc:creator>
<dc:creator>Song, D.</dc:creator>
<dc:creator>Williams, B.</dc:creator>
<dc:creator>Rosen, B.</dc:creator>
<dc:creator>Cheng, X.</dc:creator>
<dc:creator>Chao, L.</dc:creator>
<dc:creator>Kale, H.</dc:creator>
<dc:creator>Zhang, H.</dc:creator>
<dc:creator>Diao, Y.</dc:creator>
<dc:creator>Bürckstümmer, T.</dc:creator>
<dc:creator>Siliciano, J. M.</dc:creator>
<dc:creator>Li, J. J.</dc:creator>
<dc:creator>Siliciano, R.</dc:creator>
<dc:creator>Huangfu, D.</dc:creator>
<dc:creator>Li, W.</dc:creator>
<dc:date>2023-11-02</dc:date>
<dc:identifier>doi:10.1101/2023.10.30.564796</dc:identifier>
<dc:title><![CDATA[Decoding Heterogenous Single-cell Perturbation Responses]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2023-11-02</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2024.07.15.603527v1?rss=1">
<title>
<![CDATA[
Supervised Deep Learning with Gene Annotation for Cell Classification 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2024.07.15.603527v1?rss=1
</link>
<description><![CDATA[
Gene-by-gene differential expression analysis is a widely used supervised approach for interpreting single-cell RNA-sequencing (scRNA-seq) data. However, modern scRNA-seq datasets often contain large numbers of cells, which can produce numerous differentially expressed genes with exceedingly small p-values but minimal effect sizes, thus making biological interpretation difficult. To overcome this challenge, we developed Supervised Deep learning with gene ANnotation (SDAN), a method that integrates gene-annotation information with gene-expression profiles using a graph neural network. SDAN identifies functionally coherent gene sets that best classify cells, and the resulting cell-level classification scores can be aggregated to make individual-level predictions. We evaluated SDAN and two representative existing methods in three real-data applications to identify gene sets associated with severe COVID-19, dementia, and immunotherapy response in cancer. SDAN consistently outperformed alternative approaches by achieving two key objectives simultaneously: accurate classification of outcomes and unambiguous assignment of genes to gene sets of functionally related genes.
]]></description>
<dc:creator>Lin, Z.</dc:creator>
<dc:creator>Sun, W.</dc:creator>
<dc:date>2024-07-15</dc:date>
<dc:identifier>doi:10.1101/2024.07.15.603527</dc:identifier>
<dc:title><![CDATA[Supervised Deep Learning with Gene Annotation for Cell Classification]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2024-07-15</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2025.04.26.650787v1?rss=1">
<title>
<![CDATA[
Singe cell RNA sequencing data processing using cloud-based serverless computing 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2025.04.26.650787v1?rss=1
</link>
<description><![CDATA[
Single-cell RNA sequencing (scRNA-seq) has become a routine method for measuring cell activities. Processing large scRNA-seq datasets requires high-performance computing resources. The emergence of cloud computing allows us to leverage its on-demand capabilities without major investment in infrastructure. Serverless computing provides cost efficiency by allowing users to pay only for actual resource usage, eliminating the necessity for pre-allocated server capacities. Additionally, there is no requirement to set up servers in advance. We present a novel and generalizable methodology using serverless cloud computing to accelerate computationally intensive workflows. We create an on-demand "supercomputer" using rapidly deployable cloud serverless functions as automatically provisioned computation units. We tested our methodology of optimizing a scRNA-seq workflow by leveraging serverless functions on the cloud using two publicly available peripheral blood mononuclear cell (PBMC) datasets. In addition, we demonstrate our approach using data generated by the NIH MorPhiC program, where we process a 450 GB human scRNA-seq dataset across 86 cell lines designed to study the temporal impact of perturbations on pancreatic differentiation. We compared the total execution time of the scRNA-seq serverless workflow with the traditional workflow without using serverless functions, and demonstrate a major speedup for large scRNA-seq datasets.
]]></description>
<dc:creator>Hung, L.-H.</dc:creator>
<dc:creator>Nasam, N.</dc:creator>
<dc:creator>Lloyd, W.</dc:creator>
<dc:creator>Yeung, K. Y.</dc:creator>
<dc:date>2025-04-30</dc:date>
<dc:identifier>doi:10.1101/2025.04.26.650787</dc:identifier>
<dc:title><![CDATA[Singe cell RNA sequencing data processing using cloud-based serverless computing]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2025-04-30</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2025.05.23.655879v1?rss=1">
<title>
<![CDATA[
Graphical and Interactive Spatial Proteomics Image Analysis Workflow 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2025.05.23.655879v1?rss=1
</link>
<description><![CDATA[
Spatial proteomics provides a spatially resolved view of protein expression and localization within cells and tissues by mapping the location and abundance of proteins. There is a need for fully-integrated end-to-end imaging workflows for spatial proteomic analysis that are flexible, high-throughput, and support graphical and interactive visualizations. We present a modular and interactive spatial proteomic image analysis workflow with individual steps containerized that empowers biomedical researchers to reproducibly execute and customize complex analyses.

Our workflow consists of cell segmentation, unsupervised clustering, validation of clusters on the image, and cell type clustering results visualization. Users can utilize a form-based graphical interface to execute and customize multi-step workflows with a single click or interactively adjust image processing steps within the workflow, apply workflows to various datasets, and modify input parameters as needed. We illustrated the functionality of our workflow using a cancer imaging dataset consisting of a tissue microarray (TMA) stained by high-plex immunohistochemistry. This TMA contained a variety of cancer and tissue cell types to assess the broad applicability of this workflow to different biopsy and tissue types.
]]></description>
<dc:creator>Singh, P.</dc:creator>
<dc:creator>Wright, J. H.</dc:creator>
<dc:creator>Smythe, K. S.</dc:creator>
<dc:creator>Fukuda, B. N.</dc:creator>
<dc:creator>Hung, L.-H.</dc:creator>
<dc:creator>Yeung, C. C.</dc:creator>
<dc:creator>Yeung, K. Y.</dc:creator>
<dc:date>2025-05-27</dc:date>
<dc:identifier>doi:10.1101/2025.05.23.655879</dc:identifier>
<dc:title><![CDATA[Graphical and Interactive Spatial Proteomics Image Analysis Workflow]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2025-05-27</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2025.12.23.696311v1?rss=1">
<title>
<![CDATA[
A stem cell knockout village reveals lineage rewiring and a non-canonical islet cell fate in monogenic diabetes 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2025.12.23.696311v1?rss=1
</link>
<description><![CDATA[
Genetics studies have identified a core set of regulators essential for pancreatic β cell development, many of which are mutated in monogenic diabetes. However, how these mutations alter developmental trajectories to produce pathological cell states remains elusive. Here we introduce a knockout village framework that enables longitudinal scRNA-seq profiling of 79 human pluripotent stem cell mutant lines targeting 30 developmental regulators, including 15 diabetes genes, across five islet differentiation stages. We show that loss of lineage regulators impairs β cell formation in a stage-specific manner and rewires developmental trajectories towards competing lineages. Notably, several monogenic diabetes gene mutations drive a shift from β cells to enterochromaffin (EC)-like cells, a recently recognized non-canonical islet cell fate. These EC-like cells exhibit incomplete activation of hormone regulation programs, along with elevated neuron signatures. Leveraging the diversity of cell fate outcomes across mutants, we predicted and experimentally validated ISL1 as a key downstream effector of PDX1 and PAX6 that safeguards β cell identity against an EC-like fate. Together, our findings reveal cell fate rewiring as a widespread, previously underappreciated pathological mechanism in monogenic diabetes and establish a scalable platform for uncovering developmental vulnerabilities in human genetic disorders.
]]></description>
<dc:creator>Liu, D.</dc:creator>
<dc:creator>Song, B.</dc:creator>
<dc:creator>Li, Z.</dc:creator>
<dc:creator>Zhang, S.</dc:creator>
<dc:creator>Fabiha, T.</dc:creator>
<dc:creator>Zhao, J.</dc:creator>
<dc:creator>Inoki, A.</dc:creator>
<dc:creator>Piccand, J.</dc:creator>
<dc:creator>Soh, C.-L.</dc:creator>
<dc:creator>Dixon, G.</dc:creator>
<dc:creator>Zhong, A.</dc:creator>
<dc:creator>Hu, N.</dc:creator>
<dc:creator>Luo, R.</dc:creator>
<dc:creator>Ozlusen, B.</dc:creator>
<dc:creator>Menon, V.</dc:creator>
<dc:creator>Zhou, T.</dc:creator>
<dc:creator>Qiu, X.</dc:creator>
<dc:creator>Gradwohl, G.</dc:creator>
<dc:creator>Yang, D.</dc:creator>
<dc:creator>Dey, K.</dc:creator>
<dc:creator>Sun, W.</dc:creator>
<dc:creator>Li, W.</dc:creator>
<dc:creator>Huangfu, D.</dc:creator>
<dc:date>2025-12-25</dc:date>
<dc:identifier>doi:10.64898/2025.12.23.696311</dc:identifier>
<dc:title><![CDATA[A stem cell knockout village reveals lineage rewiring and a non-canonical islet cell fate in monogenic diabetes]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2025-12-25</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2025.09.01.673517v1?rss=1">
<title>
<![CDATA[
User-friendly scheduler Using a hybrid architecture and supercomputing for big data processing 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2025.09.01.673517v1?rss=1
</link>
<description><![CDATA[
The exponential growth of omics data requires novel strategies for storage, transfer, and processing of said data. We present a scheduler based on the Temporal.io workflow framework which enables two key optimizations of bioinformatics workflows. Firstly, we enable users to transparently map workflow steps to diverse execution environments, including high-performance computing (HPC) resources managed by the SLURM resource manager through an easy-to-use graphical user interface. Secondly, we enable asynchronous execution of workflows, a feature which guarantees that workflows will achieve reasonable resource utilization even when the scheduler cannot make use of a system's full RAM and CPU resources. Thirdly, we propose a universal, platform-agnostic JSON representation of workflows that allows platform-specific execution details to be abstracted away from the core scientific logic. Our work includes a custom executor plugin that supports translation of workflows from an external language, such as Nextflow, to our universal JSON format. Finally, we develop a graphical user interface to make our scheduler easy to use for non-technical users. When benchmarked on a bulk RNA sequencing workflow, these features reduced the cost and time requirements. We illustrated the merits of our cross-platform method using credit allocations from federally funded supercomputers.
]]></description>
<dc:creator>McKeever, P.</dc:creator>
<dc:creator>Mittal, V.</dc:creator>
<dc:creator>Fukuda, B.</dc:creator>
<dc:creator>Yeung, K. Y.</dc:creator>
<dc:creator>Hung, L.-H.</dc:creator>
<dc:date>2025-09-03</dc:date>
<dc:identifier>doi:10.1101/2025.09.01.673517</dc:identifier>
<dc:title><![CDATA[User-friendly scheduler Using a hybrid architecture and supercomputing for big data processing]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2025-09-03</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2026.03.09.710580v1?rss=1">
<title>
<![CDATA[
STAR Suite: Integrating transcriptomics through AI software engineering in the NIH MorPhiC consortium 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2026.03.09.710580v1?rss=1
</link>
<description><![CDATA[
To accommodate rapid methodological turnover, bioinformatics pipelines typically consist of discrete binaries linked via scripts. While flexible, this architecture relies on intermediate files, sacrificing performance, and treating complex codebases as static silos. For example, the STAR aligner [1]--the standard engine for transcriptomics--uses an external script for adapter trimming, necessitating the decompression and re-compression of large files. These limitations presented scalability problems for uniform processing of data in the NIH MorPhiC consortium. We present our solution, STAR Suite, a human-engineered and AI-implemented modernization that integrates functionality directly into the C++ source. In just four months, a single developer added over 92,000 lines to the original 28,000-line codebase to produce four unified modules: STAR-core, STAR-Flex, STAR-Perturb, and STAR-SLAM that can be installed as a pre-compiled binary without introducing any new dependencies. This work demonstrates a new paradigm for the rapid evolution of high-performance bioinformatics software.
]]></description>
<dc:creator>Hung, L.-H.</dc:creator>
<dc:creator>Yeung, K. Y.</dc:creator>
<dc:date>2026-03-10</dc:date>
<dc:identifier>doi:10.64898/2026.03.09.710580</dc:identifier>
<dc:title><![CDATA[STAR Suite: Integrating transcriptomics through AI software engineering in the NIH MorPhiC consortium]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2026-03-10</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2025.10.15.682667v1?rss=1">
<title>
<![CDATA[
Human Dicer1 hotspot mutation induces both loss and gain of miRNA function 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2025.10.15.682667v1?rss=1
</link>
<description><![CDATA[
The core miRNA biogenesis enzyme Dicer1 sustains recurrent mutations in cancer that compromise its RNase IIIb domain, which cleaves 5p arms of pre-miRNA hairpins. However, the lack of knockin models has limited fuller understanding. Here, we generated Dicer1-KO and Dicer1-S1344L (homozygous and hemizygous) human ESCs; the latter is a non-catalytic mutation in RNase IIIa that impairs RNase IIIb activity. Dicer1 knockouts lack canonical miRNAs, while S1344L induces two trends: ablation of miRNA-5p strands, and selective changes in miRNA-3p strands. Curiously, we recognized directional upregulation of miRNA-3p passenger strands, indicating a broad strand switch. We used multiple in vitro assays to show 3p arm-nicked pre-miRNAs preferentially load miRNA-3p species into Argonaute, compared to corresponding duplexes. Moreover, activity assays, RNA-seq data, and Argonaute-mRNA profiling, confirm that these confer increased repression capacity. These data expand the molecular consequences of Dicer1 hotspot mutations in cancer.
]]></description>
<dc:creator>Jee, D.</dc:creator>
<dc:creator>Lee, S.</dc:creator>
<dc:creator>Yang, D.</dc:creator>
<dc:creator>Rickert, R.</dc:creator>
<dc:creator>Shang, R.</dc:creator>
<dc:creator>Huangfu, D.</dc:creator>
<dc:creator>Lai, E. C.</dc:creator>
<dc:date>2025-10-15</dc:date>
<dc:identifier>doi:10.1101/2025.10.15.682667</dc:identifier>
<dc:title><![CDATA[Human Dicer1 hotspot mutation induces both loss and gain of miRNA function]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2025-10-15</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2024.02.13.580048v1?rss=1">
<title>
<![CDATA[
Common data models to streamline metabolomics processing and annotation, and implementation in a Python pipeline 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2024.02.13.580048v1?rss=1
</link>
<description><![CDATA[
To standardize metabolomics data analysis and facilitate future computational developments, it is essential to have a set of well-defined templates for common data structures. Here we describe a collection of data structures involved in metabolomics data processing and illustrate how they are utilized in a full-featured Python-centric pipeline. We demonstrate the performance of the pipeline, and the details in annotation and quality control using large-scale LC-MS metabolomics and lipidomics data and LC-MS/MS data. Multiple previously published datasets are also reanalyzed to showcase its utility in biological data analysis. This pipeline allows users to streamline data processing, quality control, annotation, and standardization in an efficient and transparent manner. This work fills a major gap in the Python ecosystem for computational metabolomics.

Author Summary: All life processes involve the consumption, creation, and interconversion of metabolites. Metabolomics is the comprehensive study of these small molecules, often using mass spectrometry, to provide critical information on health and disease. Automated processing of such metabolomics data is desired, especially for the bioinformatics community with familiar tools and infrastructures. Despite Python's popularity in bioinformatics and machine learning, the Python ecosystem in computational metabolomics still lacks a complete data pipeline. We have developed an end-to-end computational metabolomics data processing pipeline, based on the raw data preprocessor Asari [1]. Our pipeline takes experimental data in .mzML or .raw format and outputs annotated feature tables for subsequent biological interpretation. We demonstrate the application of this pipeline to multiple metabolomics and lipidomics datasets. Accompanying the pipeline, we have designed a set of reusable data structures, released as the MetDataModel package, which shall promote more consistent terminology and software interoperability in this area.
]]></description>
<dc:creator>Mitchell, J.</dc:creator>
<dc:creator>Chi, Y.</dc:creator>
<dc:creator>Thapa, M.</dc:creator>
<dc:creator>Pang, Z.</dc:creator>
<dc:creator>Xia, J.</dc:creator>
<dc:creator>Li, S.</dc:creator>
<dc:date>2024-02-14</dc:date>
<dc:identifier>doi:10.1101/2024.02.13.580048</dc:identifier>
<dc:title><![CDATA[Common data models to streamline metabolomics processing and annotation, and implementation in a Python pipeline]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2024-02-14</prism:publicationDate>
<prism:section></prism:section>
</item>
</rdf:RDF>
