	<rdf:RDF xmlns:admin="http://webns.net/mvcb/" xmlns="http://purl.org/rss/1.0/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:prism="http://purl.org/rss/1.0/modules/prism/" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:syn="http://purl.org/rss/1.0/modules/syndication/">
	<channel rdf:about="https://biorxiv.org">
	<admin:errorReportsTo rdf:resource="mailto:biorxiv@cshlpress.edu"/>
	<title>bioRxiv Channel: Human Pangenome Reference Consortium (HPRC)</title>
	<link>https://biorxiv.org</link>
	<description>
	This feed contains articles for bioRxiv Channel "Human Pangenome Reference Consortium (HPRC)"
	</description>

		<items>
	<rdf:Seq>
		</rdf:Seq>
	</items>
	<prism:eIssn/>
	<prism:publicationName>bioRxiv</prism:publicationName>
	<prism:issn/>

	<image rdf:resource=""/>
	</channel>
	<image rdf:about="">
	<title>bioRxiv</title>
	<url/>
	<link>https://biorxiv.org</link>
	</image>
	<item rdf:about="https://biorxiv.org/cgi/content/short/2022.07.09.499321v1?rss=1">
<title>
<![CDATA[
A Draft Human Pangenome Reference 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2022.07.09.499321v1?rss=1
</link>
<description><![CDATA[
The Human Pangenome Reference Consortium (HPRC) presents a first draft human pangenome reference. The pangenome contains 47 phased, diploid assemblies from a cohort of genetically diverse individuals. These assemblies cover more than 99% of the expected sequence and are more than 99% accurate at the structural and base-pair levels. Based on alignments of the assemblies, we generated a draft pangenome that captures known variants and haplotypes, reveals novel alleles at structurally complex loci, and adds 119 million base pairs of euchromatic polymorphic sequence and 1,529 gene duplications relative to the existing reference, GRCh38. Roughly 90 million of the additional base pairs derive from structural variation. Using our draft pangenome to analyze short-read data reduces errors when discovering small variants by 34% and boosts the detected structural variants per haplotype by 104% compared to GRCh38-based workflows, and by 34% compared to using previous diversity sets of genome assemblies.
]]></description>
<dc:creator>Liao, W.-W.</dc:creator>
<dc:creator>Asri, M.</dc:creator>
<dc:creator>Ebler, J.</dc:creator>
<dc:creator>Doerr, D.</dc:creator>
<dc:creator>Haukness, M.</dc:creator>
<dc:creator>Hickey, G.</dc:creator>
<dc:creator>Lu, S.</dc:creator>
<dc:creator>Lucas, J. K.</dc:creator>
<dc:creator>Monlong, J.</dc:creator>
<dc:creator>Abel, H. J.</dc:creator>
<dc:creator>Buonaiuto, S.</dc:creator>
<dc:creator>Chang, X. H.</dc:creator>
<dc:creator>Cheng, H.</dc:creator>
<dc:creator>Chu, J.</dc:creator>
<dc:creator>Colonna, V.</dc:creator>
<dc:creator>Eizenga, J. M.</dc:creator>
<dc:creator>Feng, X.</dc:creator>
<dc:creator>Fischer, C.</dc:creator>
<dc:creator>Fulton, R. S.</dc:creator>
<dc:creator>Garg, S.</dc:creator>
<dc:creator>Groza, C.</dc:creator>
<dc:creator>Guarracino, A.</dc:creator>
<dc:creator>Harvey, W. T.</dc:creator>
<dc:creator>Heumos, S.</dc:creator>
<dc:creator>Howe, K.</dc:creator>
<dc:creator>Jain, M.</dc:creator>
<dc:creator>Lu, T.-Y.</dc:creator>
<dc:creator>Markello, C.</dc:creator>
<dc:creator>Martin, F. J.</dc:creator>
<dc:creator>Mitchell, M. W.</dc:creator>
<dc:creator>Munson, K. M.</dc:creator>
<dc:creator>Mwaniki, M. N.</dc:creator>
<dc:creator>Novak, A. M.</dc:creator>
<dc:creator>Olsen, H. E.</dc:creator>
<dc:creator>Pesout, T.</dc:creator>
<dc:creator>Porubsky, D.</dc:creator>
<dc:creator>Prins, P.</dc:creator>
<dc:creator>Sibbesen, J. A.</dc:creator>
<dc:creator>Tomlinson, C.</dc:creator>
<dc:creator>Villani, F.</dc:creator>
<dc:creator>Vollger, M. R.</dc:creator>
<dc:creator>Human Pangenome Reference Consortium</dc:creator>
<dc:creator>Bourque, G.</dc:creator>
<dc:creator>Chaisson, M.</dc:creator>
<dc:date>2022-07-09</dc:date>
<dc:identifier>doi:10.1101/2022.07.09.499321</dc:identifier>
<dc:title><![CDATA[A Draft Human Pangenome Reference]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2022-07-09</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2021.08.19.457003v1?rss=1">
<title>
<![CDATA[
StainedGlass: Interactive visualization of massive tandem repeat structures with identity heatmaps 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2021.08.19.457003v1?rss=1
</link>
<description><![CDATA[
Summary: Visualization and analysis of genomic repeats are typically accomplished through the use of dot plots; however, the emergence of telomere-to-telomere assemblies with multi-megabase repeats requires new visualization strategies. Here, we introduce StainedGlass, which can generate publication-quality figures and interactive visualizations that depict the identity and orientation of multi-megabase repeat structures at a genome-wide scale. The tool can rapidly reveal higher-order structures and improve the inference of evolutionary history for some of the most complex regions of genomes.

Availability and implementation: StainedGlass is implemented using Snakemake and is available open source under the MIT license at https://mrvollger.github.io/StainedGlass/.

Contact: mvollger@uw.edu
]]></description>
<dc:creator>Vollger, M. R.</dc:creator>
<dc:creator>Kerpedjiev, P.</dc:creator>
<dc:creator>Phillippy, A. M.</dc:creator>
<dc:creator>Eichler, E. E.</dc:creator>
<dc:date>2021-08-21</dc:date>
<dc:identifier>doi:10.1101/2021.08.19.457003</dc:identifier>
<dc:title><![CDATA[StainedGlass: Interactive visualization of massive tandem repeat structures with identity heatmaps]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2021-08-21</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2021.07.12.452052v1?rss=1">
<title>
<![CDATA[
Complete genomic and epigenetic maps of human centromeres 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2021.07.12.452052v1?rss=1
</link>
<description><![CDATA[
Existing human genome assemblies have almost entirely excluded highly repetitive sequences within and near centromeres, limiting our understanding of their sequence, evolution, and essential role in chromosome segregation. Here, we present an extensive study of newly assembled peri/centromeric sequences representing 6.2% (189.9 Mb) of the first complete, telomere-to-telomere human genome assembly (T2T-CHM13). We discovered novel patterns of peri/centromeric repeat organization, variation, and evolution at both large and small length scales. We also found that inner kinetochore proteins tend to overlap the most recently duplicated subregions within centromeres. Finally, we compared chromosome X centromeres across a diverse panel of individuals and uncovered structural, epigenetic, and sequence variation at single-base resolution across these regions. In total, this work provides an unprecedented atlas of human centromeres to guide future studies of their complex and critical functions as well as their unique evolutionary dynamics.

One-sentence summary: Deep characterization of fully assembled human centromeres reveals their architecture and fine-scale organization, variation, and evolution.
]]></description>
<dc:creator>Altemose, N.</dc:creator>
<dc:creator>Logsdon, G.</dc:creator>
<dc:creator>Bzikadze, A. V.</dc:creator>
<dc:creator>Sidhwani, P.</dc:creator>
<dc:creator>Langley, S. A.</dc:creator>
<dc:creator>Caldas, G. V.</dc:creator>
<dc:creator>Hoyt, S. J.</dc:creator>
<dc:creator>Uralsky, L.</dc:creator>
<dc:creator>Ryabov, F. D.</dc:creator>
<dc:creator>Shew, C.</dc:creator>
<dc:creator>Sauria, M. E. G.</dc:creator>
<dc:creator>Borchers, M.</dc:creator>
<dc:creator>Gershman, A.</dc:creator>
<dc:creator>Mikheenko, A.</dc:creator>
<dc:creator>Shepelev, V. A.</dc:creator>
<dc:creator>Dvorkina, T.</dc:creator>
<dc:creator>Kunyavskaya, O.</dc:creator>
<dc:creator>Vollger, M. R.</dc:creator>
<dc:creator>Rhie, A.</dc:creator>
<dc:creator>McCartney, A. M.</dc:creator>
<dc:creator>Asri, M.</dc:creator>
<dc:creator>Lorig-Roach, R.</dc:creator>
<dc:creator>Shafin, K.</dc:creator>
<dc:creator>Aganezov, S.</dc:creator>
<dc:creator>Olson, D.</dc:creator>
<dc:creator>Gomes de Lima, L.</dc:creator>
<dc:creator>Potapova, T.</dc:creator>
<dc:creator>Hartley, G. A.</dc:creator>
<dc:creator>Haukness, M.</dc:creator>
<dc:creator>Kerpedjiev, P.</dc:creator>
<dc:creator>Gusev, F.</dc:creator>
<dc:creator>Tigyi, K.</dc:creator>
<dc:creator>Brooks, S. Y.</dc:creator>
<dc:creator>Young, A.</dc:creator>
<dc:creator>Nurk, S.</dc:creator>
<dc:creator>Koren, S.</dc:creator>
<dc:creator>Salama, S.</dc:creator>
<dc:creator>Paten, B.</dc:creator>
<dc:creator>Rogaev, E. I.</dc:creator>
<dc:creator>Streets, A. M.</dc:creator>
<dc:creator>Karpen, G. H.</dc:creator>
<dc:creator>Dernburg, A.</dc:creator>
<dc:creator>Sullivan, B.</dc:creator>
<dc:date>2021-07-13</dc:date>
<dc:identifier>doi:10.1101/2021.07.12.452052</dc:identifier>
<dc:title><![CDATA[Complete genomic and epigenetic maps of human centromeres]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2021-07-13</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2021.07.02.450803v1?rss=1">
<title>
<![CDATA[
Chasing perfection: validation and polishing strategies for telomere-to-telomere genome assemblies 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2021.07.02.450803v1?rss=1
</link>
<description><![CDATA[
Advances in long-read sequencing technologies and genome assembly methods have enabled the recent completion of the first Telomere-to-Telomere (T2T) human genome assembly, which resolves complex segmental duplications and large tandem repeats, including centromeric satellite arrays in a complete hydatidiform mole (CHM13). Though derived from highly accurate sequencing, evaluation revealed that the initial T2T draft assembly had evidence of small errors and structural misassemblies. To correct these errors, we designed a novel repeat-aware polishing strategy that made accurate assembly corrections in large repeats without overcorrection, ultimately fixing 51% of the existing errors and improving the assembly QV to 73.9. By comparing our results to standard automated polishing tools, we outline common polishing errors and offer practical suggestions for genome projects with limited resources. We also show how sequencing biases in both PacBio HiFi and Oxford Nanopore Technologies reads cause signature assembly errors that can be corrected with a diverse panel of sequencing technologies.
]]></description>
<dc:creator>Mc Cartney, A. M.</dc:creator>
<dc:creator>Shafin, K.</dc:creator>
<dc:creator>Alonge, M.</dc:creator>
<dc:creator>Bzikadze, A. V.</dc:creator>
<dc:creator>Formenti, G.</dc:creator>
<dc:creator>Fungtammasan, A.</dc:creator>
<dc:creator>Howe, K.</dc:creator>
<dc:creator>Jain, C.</dc:creator>
<dc:creator>Koren, S.</dc:creator>
<dc:creator>Logsdon, G. A.</dc:creator>
<dc:creator>Miga, K. H.</dc:creator>
<dc:creator>Mikheenko, A.</dc:creator>
<dc:creator>Paten, B.</dc:creator>
<dc:creator>Shumate, A.</dc:creator>
<dc:creator>Soto, D. C.</dc:creator>
<dc:creator>Sovic, I.</dc:creator>
<dc:creator>Wood, J. M.</dc:creator>
<dc:creator>Zook, J. M.</dc:creator>
<dc:creator>Phillippy, A. M.</dc:creator>
<dc:creator>Rhie, A.</dc:creator>
<dc:date>2021-07-02</dc:date>
<dc:identifier>doi:10.1101/2021.07.02.450803</dc:identifier>
<dc:title><![CDATA[Chasing perfection: validation and polishing strategies for telomere-to-telomere genome assemblies]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2021-07-02</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2022.07.12.499787v1?rss=1">
<title>
<![CDATA[
GBZ File Format for Pangenome Graphs 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2022.07.12.499787v1?rss=1
</link>
<description><![CDATA[
Motivation: Pangenome graphs representing aligned genome assemblies are being shared in the text-based Graphical Fragment Assembly format. As the number of assemblies grows, there is a need for a file format that can store the highly repetitive data space-efficiently.

Results: We propose the GBZ file format based on data structures used in the Giraffe short read aligner. The format provides good compression, and the files can be efficiently loaded into in-memory data structures. We provide compression and decompression tools and libraries for using GBZ graphs, and we show that they can be efficiently used on a variety of systems.

Availability: C++ and Rust implementations are available at https://github.com/jltsiren/gbwtgraph and https://github.com/jltsiren/gbwt-rs, respectively.

Contact: jouni.siren@iki.fi

Supplementary information: Supplementary data are available online.
]]></description>
<dc:creator>Siren, J.</dc:creator>
<dc:creator>Paten, B.</dc:creator>
<dc:date>2022-07-14</dc:date>
<dc:identifier>doi:10.1101/2022.07.12.499787</dc:identifier>
<dc:title><![CDATA[GBZ File Format for Pangenome Graphs]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2022-07-14</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2021.07.12.452063v1?rss=1">
<title>
<![CDATA[
A complete reference genome improves analysis of human genetic variation 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2021.07.12.452063v1?rss=1
</link>
<description><![CDATA[
Compared to its predecessors, the Telomere-to-Telomere CHM13 genome adds nearly 200 Mbp of sequence, corrects thousands of structural errors, and unlocks the most complex regions of the human genome to clinical and functional study. Here we demonstrate how the new reference universally improves read mapping and variant calling for 3,202 and 17 globally diverse samples sequenced with short and long reads, respectively. We identify hundreds of thousands of novel variants per sample--a new frontier for evolutionary and biomedical discovery. Simultaneously, the new reference eliminates tens of thousands of spurious variants per sample, including up to 12-fold reduction of false positives in 269 medically relevant genes. The vast improvement in variant discovery coupled with population and functional genomic resources position T2T-CHM13 to replace GRCh38 as the prevailing reference for human genetics.

One Sentence Summary: The T2T-CHM13 reference genome universally improves the analysis of human genetic variation.
]]></description>
<dc:creator>Aganezov, S.</dc:creator>
<dc:creator>Yan, S. M.</dc:creator>
<dc:creator>Soto, D. C.</dc:creator>
<dc:creator>Kirsche, M.</dc:creator>
<dc:creator>Zarate, S.</dc:creator>
<dc:creator>Avdeyev, P.</dc:creator>
<dc:creator>Taylor, D. J.</dc:creator>
<dc:creator>Shafin, K.</dc:creator>
<dc:creator>Shumate, A.</dc:creator>
<dc:creator>Xiao, C.</dc:creator>
<dc:creator>Wagner, J.</dc:creator>
<dc:creator>McDaniel, J.</dc:creator>
<dc:creator>Olson, N. D.</dc:creator>
<dc:creator>Sauria, M. E. G.</dc:creator>
<dc:creator>Vollger, M. R.</dc:creator>
<dc:creator>Rhie, A.</dc:creator>
<dc:creator>Meredith, M.</dc:creator>
<dc:creator>Martin, S.</dc:creator>
<dc:creator>Lee, J.</dc:creator>
<dc:creator>Koren, S.</dc:creator>
<dc:creator>Rosenfeld, J.</dc:creator>
<dc:creator>Paten, B.</dc:creator>
<dc:creator>Layer, R.</dc:creator>
<dc:creator>Chin, C.-S.</dc:creator>
<dc:creator>Sedlazeck, F. J.</dc:creator>
<dc:creator>Hansen, N. F.</dc:creator>
<dc:creator>Miller, D. E.</dc:creator>
<dc:creator>Phillippy, A. M.</dc:creator>
<dc:creator>Miga, K. H.</dc:creator>
<dc:creator>McCoy, R. C.</dc:creator>
<dc:creator>Dennis, M. Y.</dc:creator>
<dc:creator>Zook, J. M.</dc:creator>
<dc:creator>Schatz, M. C.</dc:creator>
<dc:date>2021-07-13</dc:date>
<dc:identifier>doi:10.1101/2021.07.12.452063</dc:identifier>
<dc:title><![CDATA[A complete reference genome improves analysis of human genetic variation]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2021-07-13</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2022.07.06.498021v1?rss=1">
<title>
<![CDATA[
Increased mutation rate and interlocus gene conversion within human segmental duplications. 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2022.07.06.498021v1?rss=1
</link>
<description><![CDATA[
Single-nucleotide variants (SNVs) within segmental duplications (SDs) have not been systematically assessed because of the difficulty in mapping short-read sequence data to virtually identical repetitive sequences. Using 102 phased human haplotypes, we constructed 1:1 unambiguous alignments spanning high-identity SDs and compared the pattern of SNVs between unique and SD regions. We find that human SNVs are elevated 60% in SDs compared to unique regions. We estimate that at least 23% of this increase is due to interlocus gene conversion (IGC) with >7 Mbp of SD sequence converted on average per human haplotype. We develop a genome-wide map of IGC donors and acceptors, including 498 acceptor and 454 donor hotspots affecting the exons of ~800 protein-coding genes. The latter includes 171 genes that have "relocated" on average 1.61 Mbp in a subset of human haplotypes. Using a coalescent framework, we show that SD regions are evolutionarily older when compared to unique sequences with most of this signal originating from putative IGC loci. SNVs within SDs, however, also exhibit a distinct mutational spectrum where there is a 27.1% increase in transversions that convert cytosine to guanine or the reverse across all triplet contexts. In addition, we observe a 7.6% reduction in the frequency of CpG associated mutations when compared to unique DNA. We hypothesize that these distinct mutational properties help to maintain an overall higher GC content of SD DNA when compared to unique DNA, and we show that these GC-favoring mutational events are likely driven by GC-biased conversion between paralogous sequences.
]]></description>
<dc:creator>Vollger, M. R.</dc:creator>
<dc:creator>DeWitt, W. S.</dc:creator>
<dc:creator>Dishuck, P. C.</dc:creator>
<dc:creator>Harvey, W. T.</dc:creator>
<dc:creator>Guitart, X.</dc:creator>
<dc:creator>Goldberg, M. E.</dc:creator>
<dc:creator>Rozanski, A.</dc:creator>
<dc:creator>Lucas, J.</dc:creator>
<dc:creator>Asri, M.</dc:creator>
<dc:creator>The Human Pangenome Reference Consortium</dc:creator>
<dc:creator>Munson, K. M.</dc:creator>
<dc:creator>Lewis, A. P.</dc:creator>
<dc:creator>Hoekzema, K.</dc:creator>
<dc:creator>Logsdon, G. A.</dc:creator>
<dc:creator>Porubsky, D.</dc:creator>
<dc:creator>Paten, B.</dc:creator>
<dc:creator>Harris, K.</dc:creator>
<dc:creator>Hsieh, P.</dc:creator>
<dc:creator>Eichler, E. E.</dc:creator>
<dc:date>2022-07-07</dc:date>
<dc:identifier>doi:10.1101/2022.07.06.498021</dc:identifier>
<dc:title><![CDATA[Increased mutation rate and interlocus gene conversion within human segmental duplications.]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2022-07-07</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2021.05.26.445798v1?rss=1">
<title>
<![CDATA[
The complete sequence of a human genome 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2021.05.26.445798v1?rss=1
</link>
<description><![CDATA[
In 2001, Celera Genomics and the International Human Genome Sequencing Consortium published their initial drafts of the human genome, which revolutionized the field of genomics. While these drafts and the updates that followed effectively covered the euchromatic fraction of the genome, the heterochromatin and many other complex regions were left unfinished or erroneous. Addressing this remaining 8% of the genome, the Telomere-to-Telomere (T2T) Consortium has finished the first truly complete 3.055 billion base pair (bp) sequence of a human genome, representing the largest improvement to the human reference genome since its initial release. The new T2T-CHM13 reference includes gapless assemblies for all 22 autosomes plus Chromosome X, corrects numerous errors, and introduces nearly 200 million bp of novel sequence containing 2,226 paralogous gene copies, 115 of which are predicted to be protein coding. The newly completed regions include all centromeric satellite arrays and the short arms of all five acrocentric chromosomes, unlocking these complex regions of the genome to variational and functional studies for the first time.
]]></description>
<dc:creator>Nurk, S.</dc:creator>
<dc:creator>Koren, S.</dc:creator>
<dc:creator>Rhie, A.</dc:creator>
<dc:creator>Rautiainen, M.</dc:creator>
<dc:creator>Bzikadze, A. V.</dc:creator>
<dc:creator>Mikheenko, A.</dc:creator>
<dc:creator>Vollger, M. R.</dc:creator>
<dc:creator>Altemose, N.</dc:creator>
<dc:creator>Uralsky, L.</dc:creator>
<dc:creator>Gershman, A.</dc:creator>
<dc:creator>Aganezov, S.</dc:creator>
<dc:creator>Hoyt, S. J.</dc:creator>
<dc:creator>Diekhans, M.</dc:creator>
<dc:creator>Logsdon, G. A.</dc:creator>
<dc:creator>Alonge, M.</dc:creator>
<dc:creator>Antonarakis, S. E.</dc:creator>
<dc:creator>Borchers, M.</dc:creator>
<dc:creator>Bouffard, G. G.</dc:creator>
<dc:creator>Brooks, S. Y.</dc:creator>
<dc:creator>Caldas, G. V.</dc:creator>
<dc:creator>Cheng, H.</dc:creator>
<dc:creator>Chin, C.-S.</dc:creator>
<dc:creator>Chow, W.</dc:creator>
<dc:creator>de Lima, L. G.</dc:creator>
<dc:creator>Dishuck, P. C.</dc:creator>
<dc:creator>Durbin, R.</dc:creator>
<dc:creator>Dvorkina, T.</dc:creator>
<dc:creator>Fiddes, I. T.</dc:creator>
<dc:creator>Formenti, G.</dc:creator>
<dc:creator>Fulton, R. S.</dc:creator>
<dc:creator>Fungtammasan, A.</dc:creator>
<dc:creator>Garrison, E.</dc:creator>
<dc:creator>Grady, P. G. S.</dc:creator>
<dc:creator>Graves-Lindsay, T. A.</dc:creator>
<dc:creator>Hall, I. M.</dc:creator>
<dc:creator>Hansen, N. F.</dc:creator>
<dc:creator>Hartley, G. A.</dc:creator>
<dc:creator>Haukness, M.</dc:creator>
<dc:creator>Howe, K.</dc:creator>
<dc:creator>Hunkapiller, M. W.</dc:creator>
<dc:creator>Jain, C.</dc:creator>
<dc:creator>Jain, M.</dc:creator>
<dc:date>2021-05-27</dc:date>
<dc:identifier>doi:10.1101/2021.05.26.445798</dc:identifier>
<dc:title><![CDATA[The complete sequence of a human genome]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2021-05-27</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2022.01.11.475254v1?rss=1">
<title>
<![CDATA[
Identifying and correcting repeat-calling errors in nanopore sequencing of telomeres 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2022.01.11.475254v1?rss=1
</link>
<description><![CDATA[
Nanopore long-read genome sequencing is emerging as a potential approach for the study of genomes including long repetitive elements like telomeres. Here, we report extensive basecalling induced errors at telomere repeats across nanopore datasets, sequencing platforms, basecallers, and basecalling models. We found that telomeres which are represented by (TTAGGG)n and (CCCTAA)n repeats in many organisms were frequently miscalled (~40-50% of reads) as (TTAAAA)n, or as (CTTCTT)n and (CCCTGG)n repeats respectively in a strand-specific manner during nanopore sequencing. We showed that this miscalling is likely caused by the high similarity of current profiles between telomeric repeats and these repeat artefacts, leading to mis-assignment of electrical current profiles during basecalling. We further demonstrated that tuning of nanopore basecalling models, and selective application of the tuned models to telomeric reads led to improved recovery and analysis of telomeric regions, with little detected negative impact on basecalling of other genomic regions. Our study thus highlights the importance of verifying nanopore basecalls in long, repetitive, and poorly defined regions of the genome, and showcases how such artefacts in regions like telomeres can potentially be resolved by improvements in nanopore basecalling models.
]]></description>
<dc:creator>Tan, K.-T.</dc:creator>
<dc:creator>Slevin, M.</dc:creator>
<dc:creator>Meyerson, M.</dc:creator>
<dc:creator>Li, H.</dc:creator>
<dc:date>2022-01-12</dc:date>
<dc:identifier>doi:10.1101/2022.01.11.475254</dc:identifier>
<dc:title><![CDATA[Identifying and correcting repeat-calling errors in nanopore sequencing of telomeres]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2022-01-12</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2022.07.06.498874v1?rss=1">
<title>
<![CDATA[
Gaps and complex structurally variant loci in phased genome assemblies 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2022.07.06.498874v1?rss=1
</link>
<description><![CDATA[
There has been tremendous progress in the production of phased genome assemblies by combining long-read data with parental information or linking read data. Nevertheless, a typical phased genome assembly generated by trio-hifiasm still generates more than ~140 gaps. We perform a detailed analysis of gaps, assembly breaks, and misorientations from 77 phased and assembled human genomes (154 unique haplotypes). We find that trio-based approaches using HiFi are the current gold standard although chromosome-wide phasing accuracy is comparable when using Strand-seq instead of parental data. We find two-thirds of defined contig ends cluster near the largest and most identical repeats [including segmental duplications (35.4%) or satellite DNA (22.3%) or to regions enriched in GA/AT rich DNA (27.4%)]. As a result, 1513 protein-coding genes overlap assembly gaps in at least one haplotype and 231 are recurrently disrupted or missing from five or more haplotypes. In addition, we estimate that 6-7 Mbp of DNA are incorrectly orientated per haplotype irrespective of whether trio-free or trio-based approaches are employed. 81% of such misorientations correspond to bona fide large inversion polymorphisms in the human species, most of which are flanked by large identical segmental duplications. In addition, we also identify large-scale alignment discontinuities consistent with an 11.9 Mbp deletion and 161.4 Mbp of insertion per human haploid genome. While 99% of this variation corresponds to satellite DNA, we identify 230 regions of the euchromatic DNA with frequent expansions and contractions, nearly half of which overlap with 197 protein-coding genes. Although not completely resolved, these regions include copy number polymorphic and biomedically relevant genic regions where complete resolution and a pangenome representation will be most useful, yet most challenging, to realize.
]]></description>
<dc:creator>Porubsky, D.</dc:creator>
<dc:creator>Vollger, M. R.</dc:creator>
<dc:creator>Harvey, W. T.</dc:creator>
<dc:creator>Rozanski, A. N.</dc:creator>
<dc:creator>Ebert, P.</dc:creator>
<dc:creator>Hickey, G.</dc:creator>
<dc:creator>Hasenfeld, P.</dc:creator>
<dc:creator>Sanders, A. D.</dc:creator>
<dc:creator>Stober, C.</dc:creator>
<dc:creator>Human Pangenome Reference Consortium (HPRC)</dc:creator>
<dc:creator>Korbel, J. O.</dc:creator>
<dc:creator>Paten, B.</dc:creator>
<dc:creator>Marschall, T.</dc:creator>
<dc:creator>Eichler, E. E.</dc:creator>
<dc:date>2022-07-06</dc:date>
<dc:identifier>doi:10.1101/2022.07.06.498874</dc:identifier>
<dc:title><![CDATA[Gaps and complex structurally variant loci in phased genome assemblies]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2022-07-06</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2021.07.17.452767v1?rss=1">
<title>
<![CDATA[
CoLoRd: Compressing long reads 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2021.07.17.452767v1?rss=1
</link>
<description><![CDATA[
The costs of maintaining the exabytes of data produced by sequencing experiments every year have become a major issue in today's genomics. In spite of the increasing popularity of third-generation sequencing, the existing algorithms for compressing long reads exhibit only a minor advantage over general-purpose gzip. We present CoLoRd, an algorithm able to reduce third-generation sequencing data by an order of magnitude without affecting the accuracy of downstream analyses.
]]></description>
<dc:creator>Kokot, M.</dc:creator>
<dc:creator>Gudys, A.</dc:creator>
<dc:creator>Li, H.</dc:creator>
<dc:creator>Deorowicz, S.</dc:creator>
<dc:date>2021-07-19</dc:date>
<dc:identifier>doi:10.1101/2021.07.17.452767</dc:identifier>
<dc:title><![CDATA[CoLoRd: Compressing long reads]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2021-07-19</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2021.11.24.469912v1?rss=1">
<title>
<![CDATA[
A Complete Pedigree-Based Graph Workflow for Rare Candidate Variant Analysis 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2021.11.24.469912v1?rss=1
</link>
<description><![CDATA[
Methods that use a linear genome reference for genome sequencing data analysis are reference biased. In the field of clinical genetics for rare diseases, a resulting reduction in genotyping accuracy in some regions has likely prevented the resolution of some cases. Pangenome graphs embed population variation into a reference structure. While pangenome graphs have helped to reduce reference mapping bias, further performance improvements are possible. We introduce VG-Pedigree, a pedigree-aware workflow based on the pangenome-mapping tool of Giraffe (Siren et al. 2021) and the variant-calling tool DeepTrio (Kolesnikov et al. 2021) using a specially-trained model for Giraffe-based alignments. We demonstrate mapping and variant calling improvements in both single-nucleotide variants (SNVs) and insertion and deletion (INDEL) variants over those produced by alignments created using BWA-MEM to a linear-reference and Giraffe mapping to a pangenome graph containing data from the 1000 Genomes Project. We have also adapted and upgraded the deleterious-variant (DV) detecting methods and programs of Gu et al. into a streamlined workflow (Gu et al. 2019). We used these workflows in combination to detect small lists of candidate DVs among 15 family quartets and quintets of the Undiagnosed Diseases Program (UDP). All candidate DVs that were previously diagnosed using the mendelian models covered by the previously published Gu et al. methods were recapitulated by these workflows. The results of these experiments indicate a slightly greater absolute count of DVs are detected in the proband population than in their matched unaffected siblings.
]]></description>
<dc:creator>Markello, C.</dc:creator>
<dc:creator>Huang, C.</dc:creator>
<dc:creator>Rodriguez, A.</dc:creator>
<dc:creator>Carroll, A.</dc:creator>
<dc:creator>Chang, P.-C.</dc:creator>
<dc:creator>Eizenga, J.</dc:creator>
<dc:creator>Markello, T.</dc:creator>
<dc:creator>Haussler, D.</dc:creator>
<dc:creator>Paten, B.</dc:creator>
<dc:date>2021-11-25</dc:date>
<dc:identifier>doi:10.1101/2021.11.24.469912</dc:identifier>
<dc:title><![CDATA[A Complete Pedigree-Based Graph Workflow for Rare Candidate Variant Analysis]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2021-11-25</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2020.04.23.056317v1?rss=1">
<title>
<![CDATA[
Succinct dynamic variation graphs 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2020.04.23.056317v1?rss=1
</link>
<description><![CDATA[
Motivation: Pangenomics is a growing field within computational genomics. Many pangenomic analyses use bidirected sequence graphs as their core data model. However, implementing and correctly using this data model can be difficult, and the scale of pangenomic data sets can be challenging to work with. These challenges have impeded progress in this field.

Results: Here we present a stack of two C++ libraries, libbdsg and libhandlegraph, which use a simple, field-proven interface designed to expose elementary features of these graphs while preventing common graph manipulation mistakes. The libraries also provide a Python binding. Using a diverse collection of pangenome graphs, we demonstrate that these tools allow for efficient construction and manipulation of large genome graphs with dense variation. For instance, the speed and memory usage are up to an order of magnitude better than the prior graph implementation in the vg toolkit, which has now transitioned to using libbdsg's implementations.

Availability: libhandlegraph and libbdsg are available under an MIT License from https://github.com/vgteam/libhandlegraph and https://github.com/vgteam/libbdsg.

Contact: erik.garrison@ucsc.edu
]]></description>
<dc:creator>Eizenga, J. M.</dc:creator>
<dc:creator>Novak, A. M.</dc:creator>
<dc:creator>Kobayashi, E.</dc:creator>
<dc:creator>Villani, F.</dc:creator>
<dc:creator>Cisar, C.</dc:creator>
<dc:creator>Heumos, S.</dc:creator>
<dc:creator>Hickey, G.</dc:creator>
<dc:creator>Colonna, V.</dc:creator>
<dc:creator>Paten, B.</dc:creator>
<dc:creator>Garrison, E.</dc:creator>
<dc:date>2020-04-25</dc:date>
<dc:identifier>doi:10.1101/2020.04.23.056317</dc:identifier>
<dc:title><![CDATA[Succinct dynamic variation graphs]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2020-04-25</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2021.12.16.472988v1?rss=1">
<title>
<![CDATA[
Concerted modification of nucleotides at functional centers of the ribosome revealed by single-molecule RNA modification profiling 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2021.12.16.472988v1?rss=1
</link>
<description><![CDATA[
Nucleotides in RNA and DNA are chemically modified by numerous enzymes that alter their function. Eukaryotic ribosomal RNA (rRNA) is modified at more than 100 locations, particularly at highly conserved and functionally important nucleotides. During ribosome biogenesis, modifications are added at various stages of assembly. The existence of differently modified classes of ribosomes in normal cells is unknown because no method exists to simultaneously evaluate the modification status at all sites within a single rRNA molecule. Using a combination of yeast genetics and nanopore direct RNA sequencing, we developed a reliable method to track the modification status of single rRNA molecules at 37 sites in 18S rRNA and 73 sites in 25S rRNA. We use our method to characterize patterns of modification heterogeneity and identify concerted modification of nucleotides found near functional centers of the ribosome. Distinct, undermodified subpopulations of rRNAs accumulate upon loss of Dbp3 or Prp43 RNA helicases, suggesting overlapping roles in ribosome biogenesis. Modification profiles are surprisingly resistant to change in response to many genetic and acute environmental conditions that affect translation, ribosome biogenesis, and pre-mRNA splicing. The ability to capture single molecule RNA modification profiles provides new insights into the roles of nucleotide modifications in RNA function.

Highlights:
- A method enabling single-molecule profiling of RNA modifications is developed and reveals heterogeneous classes of modified ribosomes.
- rRNA 2'-O methylation and pseudouridylation modifications are independent of each other.
- Nucleotides in functional centers of the ribosome are modified in a concerted fashion.
- Loss of function for RNA helicases Dbp3 and Prp43 produces discrete overlapping subpopulations of incompletely modified ribosomes.
- Modification profiles are resilient to rapidly changing nutrient conditions and perturbation of translation.
]]></description>
<dc:creator>Bailey, A. D.</dc:creator>
<dc:creator>Talkish, J.</dc:creator>
<dc:creator>Ding, H.</dc:creator>
<dc:creator>Igel, H.</dc:creator>
<dc:creator>Duran, A.</dc:creator>
<dc:creator>Mantripragada, S.</dc:creator>
<dc:creator>Paten, B.</dc:creator>
<dc:creator>Ares, M.</dc:creator>
<dc:date>2021-12-17</dc:date>
<dc:identifier>doi:10.1101/2021.12.16.472988</dc:identifier>
<dc:title><![CDATA[Concerted modification of nucleotides at functional centers of the ribosome revealed by single-molecule RNA modification profiling]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2021-12-17</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2021.07.16.452324v1?rss=1">
<title>
<![CDATA[
Merfin: improved variant filtering and polishing via k-mer validation 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2021.07.16.452324v1?rss=1
</link>
<description><![CDATA[
Read mapping and variant calling approaches have been widely used for accurate genotyping and for improving the consensus quality of assemblies built from noisy long reads. Variant calling accuracy relies heavily on the read quality, the precision of the read mapping algorithm and variant caller, and the criteria adopted to filter the calls. However, it is impossible to define a single set of optimal parameters, as they vary depending on the quality of the read set, the variant caller of choice, and the quality of the unpolished assembly. To overcome this issue, we have devised a new tool called Merfin (k-mer based finishing tool), a k-mer based variant filtering algorithm for improved genotyping and polishing. Merfin evaluates the accuracy of a call based on expected k-mer multiplicity in the reads, independently of the quality of the read alignment and the variant caller's internal score. Moreover, we introduce novel assembly quality and completeness metrics that account for the expected genomic copy numbers. Merfin significantly increased the precision of variant calls and reduced frameshift errors when applied to PacBio HiFi, PacBio CLR, or Nanopore long read based assemblies. We demonstrate its utility while polishing the first complete human genome, a fully phased human genome, and non-human high-quality genomes.
]]></description>
<dc:creator>Formenti, G.</dc:creator>
<dc:creator>Rhie, A.</dc:creator>
<dc:creator>Walenz, B. P.</dc:creator>
<dc:creator>Thibaud-Nissen, F.</dc:creator>
<dc:creator>Shafin, K.</dc:creator>
<dc:creator>Koren, S.</dc:creator>
<dc:creator>Myers, E. W.</dc:creator>
<dc:creator>Jarvis, E. D.</dc:creator>
<dc:creator>Phillippy, A. M.</dc:creator>
<dc:date>2021-07-18</dc:date>
<dc:identifier>doi:10.1101/2021.07.16.452324</dc:identifier>
<dc:title><![CDATA[Merfin: improved variant filtering and polishing via k-mer validation]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2021-07-18</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2020.06.07.139212v1?rss=1">
<title>
<![CDATA[
Higher rates of processed pseudogene acquisition in humans and three great apes revealed by long read assemblies 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2020.06.07.139212v1?rss=1
</link>
<description><![CDATA[
LINE-1 mediated retrotransposition of protein-coding mRNAs is an active process in modern humans for both germline and somatic genomes. Prior works that surveyed human data or human cohorts mostly relied on detecting discordant mappings of paired-end short reads, or assumed L1 hallmarks such as polyA tails and target site duplications. Moreover, there have been few genome-wide comparisons between gene retrocopies in great apes and humans. In this study, we introduce a more sensitive and accurate approach to the discovery of processed pseudogenes. Our method utilizes long read assemblies and, more importantly, is able to provide full retrocopy sequences as well as the neighboring sequences, which are missed by short-read based methods. We provide an overview of novel gene retrocopies of 40 events (38 parent genes) in 20 human assemblies, a significantly higher discovery rate than previous reports (39 events of 36 parent genes out of 939 individuals). We also performed a comprehensive analysis of lineage-specific retrocopies in chimpanzee, gorilla and orangutan genomes.
]]></description>
<dc:creator>Feng, X.</dc:creator>
<dc:creator>Li, H.</dc:creator>
<dc:date>2020-06-08</dc:date>
<dc:identifier>doi:10.1101/2020.06.07.139212</dc:identifier>
<dc:title><![CDATA[Higher rates of processed pseudogene acquisition in humans and three great apes revealed by long read assemblies]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2020-06-08</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2020.11.30.404947v1?rss=1">
<title>
<![CDATA[
Towards Inferring Nanopore Sequencing Ionic Currents from Nucleotide Chemical Structures 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2020.11.30.404947v1?rss=1
</link>
<description><![CDATA[
The characteristic ionic currents of nucleotide kmers are commonly used in analyzing nanopore sequencing readouts. We present a graph convolutional network-based deep learning framework for predicting kmer characteristic ionic currents from corresponding chemical structures. We show such a framework can generalize the chemical information of the 5-methyl group from thymine to cytosine by correctly predicting 5-methylcytosine-containing DNA 6mers, thus shedding light on the de novo detection of nucleotide modifications.
]]></description>
<dc:creator>Ding, H.</dc:creator>
<dc:creator>Anastopoulos, I.</dc:creator>
<dc:creator>Bailey, A. D.</dc:creator>
<dc:creator>Paten, B.</dc:creator>
<dc:creator>Stuart, J.</dc:creator>
<dc:date>2020-12-02</dc:date>
<dc:identifier>doi:10.1101/2020.11.30.404947</dc:identifier>
<dc:title><![CDATA[Towards Inferring Nanopore Sequencing Ionic Currents from Nucleotide Chemical Structures]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2020-12-02</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2021.06.07.444885v1?rss=1">
<title>
<![CDATA[
Towards a Comprehensive Variation Benchmark for Challenging Medically-Relevant Autosomal Genes 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2021.06.07.444885v1?rss=1
</link>
<description><![CDATA[
The repetitive nature and complexity of multiple medically important genes make them intractable to accurate analysis, despite the maturity of short-read sequencing, resulting in a gap in clinical applications of genome sequencing. The Genome in a Bottle Consortium has provided benchmark variant sets, but these excluded some medically relevant genes due to their repetitiveness or polymorphic complexity. In this study, we characterize 273 of these 395 challenging autosomal genes that have multiple implications for medical sequencing. This extended, curated benchmark reports over 17,000 SNVs, 3,600 INDELs, and 200 SVs each for GRCh37 and GRCh38 across HG002. We show that false duplications in either GRCh37 or GRCh38 result in reference-specific, missed variants for short- and long-read technologies in medically important genes including CBS, CRYAA, and KCNE1. Our proposed solution improves variant recall in these genes from 8% to 100%. This benchmark will significantly improve the comprehensive characterization of these medically relevant genes and guide new method development.
]]></description>
<dc:creator>Wagner, J.</dc:creator>
<dc:creator>Olson, N. D.</dc:creator>
<dc:creator>Harris, L.</dc:creator>
<dc:creator>McDaniel, J.</dc:creator>
<dc:creator>Cheng, H.</dc:creator>
<dc:creator>Fungtammasan, A.</dc:creator>
<dc:creator>Hwang, Y.-C.</dc:creator>
<dc:creator>Gupta, R.</dc:creator>
<dc:creator>Wenger, A. M.</dc:creator>
<dc:creator>Rowell, W. J.</dc:creator>
<dc:creator>Khan, Z. M.</dc:creator>
<dc:creator>Farek, J.</dc:creator>
<dc:creator>Zhu, Y.</dc:creator>
<dc:creator>Pisupati, A.</dc:creator>
<dc:creator>Mahmoud, M.</dc:creator>
<dc:creator>Xiao, C.</dc:creator>
<dc:creator>Yoo, B.</dc:creator>
<dc:creator>Sahraeian, S. M. E.</dc:creator>
<dc:creator>Miller, D. E.</dc:creator>
<dc:creator>Jaspez, D.</dc:creator>
<dc:creator>Lorenzo-Salazar, J. M.</dc:creator>
<dc:creator>Munoz-Barrera, A.</dc:creator>
<dc:creator>Rubio-Rodriguez, L. A.</dc:creator>
<dc:creator>Flores, C.</dc:creator>
<dc:creator>Narzisi, G.</dc:creator>
<dc:creator>Evani, U. S.</dc:creator>
<dc:creator>Clarke, W. E.</dc:creator>
<dc:creator>Lee, J.</dc:creator>
<dc:creator>Mason, C. E.</dc:creator>
<dc:creator>Lincoln, S. E.</dc:creator>
<dc:creator>Miga, K. H.</dc:creator>
<dc:creator>Ebbert, M. T.</dc:creator>
<dc:creator>Shumate, A.</dc:creator>
<dc:creator>Li, H.</dc:creator>
<dc:creator>Chin, C.-S.</dc:creator>
<dc:creator>Zook, J. M.</dc:creator>
<dc:creator>Sedlazeck, F. J.</dc:creator>
<dc:date>2021-06-07</dc:date>
<dc:identifier>doi:10.1101/2021.06.07.444885</dc:identifier>
<dc:title><![CDATA[Towards a Comprehensive Variation Benchmark for Challenging Medically-Relevant Autosomal Genes]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2021-06-07</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2021.03.26.437240v1?rss=1">
<title>
<![CDATA[
Haplotype-aware pantranscriptome analyses using spliced pangenome graphs 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2021.03.26.437240v1?rss=1
</link>
<description><![CDATA[
Pangenomics is emerging as a powerful computational paradigm in bioinformatics. This field uses population-level genome reference structures, typically consisting of a sequence graph, to mitigate reference bias and facilitate analyses that were challenging with previous reference-based methods. In this work, we extend these methods into transcriptomics to analyze sequencing data using the pantranscriptome: a population-level transcriptomic reference. Our novel toolchain can construct spliced pangenome graphs, map RNA-seq data to these graphs, and perform haplotype-aware expression quantification of transcripts in a pantranscriptome. This workflow improves accuracy over state-of-the-art RNA-seq mapping methods, and it can efficiently quantify haplotype-specific transcript expression without needing to characterize a sample's haplotypes beforehand.
]]></description>
<dc:creator>Sibbesen, J. A.</dc:creator>
<dc:creator>Eizenga, J. M.</dc:creator>
<dc:creator>Novak, A. M.</dc:creator>
<dc:creator>Siren, J.</dc:creator>
<dc:creator>Chang, X.</dc:creator>
<dc:creator>Garrison, E.</dc:creator>
<dc:creator>Paten, B.</dc:creator>
<dc:date>2021-03-28</dc:date>
<dc:identifier>doi:10.1101/2021.03.26.437240</dc:identifier>
<dc:title><![CDATA[Haplotype-aware pantranscriptome analyses using spliced pangenome graphs]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2021-03-28</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2021.05.26.445678v1?rss=1">
<title>
<![CDATA[
Segmental duplications and their variation in a complete human genome 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2021.05.26.445678v1?rss=1
</link>
<description><![CDATA[
Despite their importance in disease and evolution, highly identical segmental duplications (SDs) have been among the last regions of the human reference genome (GRCh38) to be finished. Based on a complete telomere-to-telomere human genome (T2T-CHM13), we present the first comprehensive view of human SD organization. SDs account for nearly one-third of the additional sequence increasing the genome-wide estimate from 5.4% to 7.0% (218 Mbp). An analysis of 266 human genomes shows that 91% of the new T2T-CHM13 SD sequence (68.3 Mbp) better represents human copy number. We find that SDs show increased single-nucleotide variation diversity when compared to unique regions; we characterize methylation signatures that correlate with duplicate gene transcription and predict 182 novel protein-coding gene candidates. We find that 63% (35.11/55.7 Mbp) of acrocentric chromosomes consist of SDs distinct from rDNA and satellite sequences. Acrocentric SDs are 1.75-fold longer (p=0.00034) than other SDs, are frequently shared with autosomal pericentromeric regions, and are heteromorphic among human chromosomes. Comparing long-read assemblies from other human (n=12) and nonhuman primate (n=5) genomes, we use the T2T-CHM13 genome to systematically reconstruct the evolution and structural haplotype diversity of biomedically relevant (LPA, SMN) and duplicated genes (TBC1D3, SRGAP2C, ARHGAP11B) important in the expansion of the human frontal cortex. The analysis reveals unprecedented patterns of structural heterozygosity and massive evolutionary differences in SD organization between humans and their closest living relatives.
]]></description>
<dc:creator>Vollger, M. R.</dc:creator>
<dc:creator>Guitart, X.</dc:creator>
<dc:creator>Dishuck, P. C.</dc:creator>
<dc:creator>Mercuri, L.</dc:creator>
<dc:creator>Harvey, W. T.</dc:creator>
<dc:creator>Gershman, A.</dc:creator>
<dc:creator>Diekhans, M.</dc:creator>
<dc:creator>Sulovari, A.</dc:creator>
<dc:creator>Munson, K. M.</dc:creator>
<dc:creator>Lewis, A. M.</dc:creator>
<dc:creator>Hoekzema, K.</dc:creator>
<dc:creator>Porubsky, D.</dc:creator>
<dc:creator>Li, R.</dc:creator>
<dc:creator>Nurk, S.</dc:creator>
<dc:creator>Koren, S.</dc:creator>
<dc:creator>Miga, K. H.</dc:creator>
<dc:creator>Phillippy, A. M.</dc:creator>
<dc:creator>Timp, W.</dc:creator>
<dc:creator>Ventura, M.</dc:creator>
<dc:creator>Eichler, E. E.</dc:creator>
<dc:date>2021-05-26</dc:date>
<dc:identifier>doi:10.1101/2021.05.26.445678</dc:identifier>
<dc:title><![CDATA[Segmental duplications and their variation in a complete human genome]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2021-05-26</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2022.05.26.493621v1?rss=1">
<title>
<![CDATA[
The dynseq genome browser track enables visualization of context-specific, dynamic DNA sequence features at single nucleotide resolution 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2022.05.26.493621v1?rss=1
</link>
<description><![CDATA[
We introduce the dynseq genome browser track, which displays DNA nucleotide characters scaled by user-specified, base-resolution scores provided in the BigWig file format. The dynseq track enables visualization of context-specific, informative genomic sequence features. We demonstrate its utility in three popular genome browsers for interpreting cis-regulatory sequence syntax and regulatory variant interpretation by visualizing nucleotide importance scores derived from machine learning models of regulatory DNA trained on protein-DNA binding and chromatin accessibility experiments.
]]></description>
<dc:creator>Nair, S.</dc:creator>
<dc:creator>Barrett, A.</dc:creator>
<dc:creator>Li, D.</dc:creator>
<dc:creator>Raney, B. J.</dc:creator>
<dc:creator>Lee, B. T.</dc:creator>
<dc:creator>Kerpedjiev, P.</dc:creator>
<dc:creator>Ramalingam, V.</dc:creator>
<dc:creator>Pampari, A.</dc:creator>
<dc:creator>Lekschas, F.</dc:creator>
<dc:creator>Wang, T.</dc:creator>
<dc:creator>Haeussler, M.</dc:creator>
<dc:creator>Kundaje, A.</dc:creator>
<dc:date>2022-05-28</dc:date>
<dc:identifier>doi:10.1101/2022.05.26.493621</dc:identifier>
<dc:title><![CDATA[The dynseq genome browser track enables visualization of context-specific, dynamic DNA sequence features at single nucleotide resolution]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2022-05-28</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2021.03.04.433952v1?rss=1">
<title>
<![CDATA[
Haplotype-aware variant calling enables high accuracy in nanopore long-reads using deep neural networks 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2021.03.04.433952v1?rss=1
</link>
<description><![CDATA[
Long-read sequencing has the potential to transform variant detection by reaching currently difficult-to-map regions and routinely linking together adjacent variations to enable read-based phasing. Third-generation nanopore sequence data has demonstrated a long read length, but current interpretation methods for its novel pore-based signal have unique error profiles, making accurate analysis challenging. Here, we introduce a haplotype-aware variant calling pipeline, PEPPER-Margin-DeepVariant, that produces state-of-the-art variant calling results with nanopore data. We show that our nanopore-based method outperforms the short-read-based single nucleotide variant identification method at the whole-genome scale and produces high-quality single nucleotide variants in segmental duplications and low-mappability regions where short-read based genotyping fails. We show that our pipeline can provide highly contiguous phase blocks across the genome with nanopore reads, contiguously spanning between 85% and 92% of annotated genes across six samples. We also extend PEPPER-Margin-DeepVariant to PacBio HiFi data, providing an efficient solution with performance superior to the current WhatsHap-DeepVariant standard. Finally, we demonstrate de novo assembly polishing methods that use nanopore and PacBio HiFi reads to produce diploid assemblies with high accuracy (Q35+ nanopore-polished and Q40+ PacBio-HiFi-polished).
]]></description>
<dc:creator>Shafin, K.</dc:creator>
<dc:creator>Pesout, T.</dc:creator>
<dc:creator>Chang, P.-C.</dc:creator>
<dc:creator>Nattestad, M.</dc:creator>
<dc:creator>Kolesnikov, A.</dc:creator>
<dc:creator>Goel, S.</dc:creator>
<dc:creator>Baid, G.</dc:creator>
<dc:creator>Eizenga, J. M.</dc:creator>
<dc:creator>Miga, K. H.</dc:creator>
<dc:creator>Carnevali, P.</dc:creator>
<dc:creator>Jain, M.</dc:creator>
<dc:creator>Carroll, A.</dc:creator>
<dc:creator>Paten, B.</dc:creator>
<dc:date>2021-03-05</dc:date>
<dc:identifier>doi:10.1101/2021.03.04.433952</dc:identifier>
<dc:title><![CDATA[Haplotype-aware variant calling enables high accuracy in nanopore long-reads using deep neural networks]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2021-03-05</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2023.01.09.523347v1?rss=1">
<title>
<![CDATA[
Spatial transcriptomics reveals a conserved segment polarity program that governs muscle patterning in Nematostella vectensis 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2023.01.09.523347v1?rss=1
</link>
<description><![CDATA[
During early animal evolution, the emergence of axially-polarized segments was central to the diversification of complex bilaterian body plans. Nevertheless, precisely how and when segment polarity pathways arose remains obscure. Here we demonstrate the molecular basis for segment polarization in developing larvae of the pre-bilaterian sea anemone Nematostella vectensis. Utilizing spatial transcriptomics, we first constructed a 3-D gene expression atlas of developing larval segments. Capitalizing on accurate in silico predictions, we identified Lbx and Uncx, conserved homeodomain-containing genes that occupy opposing subsegmental domains under the control of both BMP signaling and the Hox-Gbx cascade. Functionally, Lbx mutagenesis eliminated all molecular evidence of segment polarization at larval stage and caused an aberrant mirror-symmetric pattern of retractor muscles in primary polyps. These results demonstrate the molecular basis for segment polarity in a pre-bilaterian animal, suggesting that polarized metameric structures were present in the Cnidaria-Bilateria common ancestor over 600 million years ago.

Highlights:
- Nematostella endomesodermal tissue forms metameric segments and displays a transcriptomic profile similar to that observed in bilaterian mesoderm.
- Construction of a comprehensive 3-D gene expression atlas enables systematic dissection of segmental identity in endomesoderm.
- Lbx and Uncx, two conserved homeobox-containing genes, establish segment polarity in Nematostella.
- The Cnidarian-Bilaterian common ancestor likely possessed the genetic toolkit to generate polarized metameric structures.
]]></description>
<dc:creator>He, S.</dc:creator>
<dc:creator>Shao, W.</dc:creator>
<dc:creator>Chen, S.</dc:creator>
<dc:creator>Wang, T.</dc:creator>
<dc:creator>Gibson, M.</dc:creator>
<dc:date>2023-01-10</dc:date>
<dc:identifier>doi:10.1101/2023.01.09.523347</dc:identifier>
<dc:title><![CDATA[Spatial transcriptomics reveals a conserved segment polarity program that governs muscle patterning in Nematostella vectensis]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2023-01-10</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2022.10.26.513955v1?rss=1">
<title>
<![CDATA[
T1K: efficient and accurate KIR and HLA genotyping with next-generation sequencing data 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2022.10.26.513955v1?rss=1
</link>
<description><![CDATA[
Killer immunoglobulin-like receptor (KIR) genes and human leukocyte antigen (HLA) genes are highly polymorphic in a population and play important roles in innate and adaptive immunity. We have developed a novel computational method T1K that can efficiently and accurately infer the KIR or HLA alleles from next-generation sequencing data. T1K is flexible and is compatible with various sequencing platforms including RNA-seq and genomic sequencing data. We applied T1K on CD8+ T cell single-cell RNA-seq data, and identified that KIR2DL4 allele expression levels were enriched in tumor-specific CD8+ T cells.
]]></description>
<dc:creator>Song, L.</dc:creator>
<dc:creator>Bai, G.</dc:creator>
<dc:creator>Liu, X. S.</dc:creator>
<dc:creator>Li, B.</dc:creator>
<dc:creator>Li, H.</dc:creator>
<dc:date>2022-10-27</dc:date>
<dc:identifier>doi:10.1101/2022.10.26.513955</dc:identifier>
<dc:title><![CDATA[T1K: efficient and accurate KIR and HLA genotyping with next-generation sequencing data]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2022-10-27</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2022.04.07.487441v1?rss=1">
<title>
<![CDATA[
AGC: Compact representation of assembled genomes 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2022.04.07.487441v1?rss=1
</link>
<description><![CDATA[
High-quality sequence assembly is the ultimate representation of complete genetic information of an individual. Several ongoing pangenome projects are producing collections of high-quality assemblies of various species. Here, we show how to represent the sequenced genomes in 2-3 orders of magnitude smaller space, allowing easy and fast extraction of any contig or its part.
]]></description>
<dc:creator>Deorowicz, S.</dc:creator>
<dc:creator>Danek, A.</dc:creator>
<dc:creator>Li, H.</dc:creator>
<dc:date>2022-04-07</dc:date>
<dc:identifier>doi:10.1101/2022.04.07.487441</dc:identifier>
<dc:title><![CDATA[AGC: Compact representation of assembled genomes]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2022-04-07</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2020.12.04.412486v1?rss=1">
<title>
<![CDATA[
Genotyping common, large structural variations in 5,202 genomes using pangenomes, the Giraffe mapper, and the vg toolkit 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2020.12.04.412486v1?rss=1
</link>
<description><![CDATA[
We introduce Giraffe, a pangenome short read mapper that can efficiently map to a collection of haplotypes threaded through a sequence graph. Giraffe, part of the variation graph toolkit (vg)1, maps reads to thousands of human genomes at around the same speed that BWA-MEM2 maps reads to a single reference genome, while maintaining comparable accuracy to VG-MAP, vg's original mapper. We have developed efficient genotyping pipelines using Giraffe. We demonstrate improvements in genotyping for single-nucleotide variants (SNVs), small insertions and deletions (indels) and structural variations (SVs) genome-wide. We use Giraffe to genotype about 167 thousand structural variants ascertained from long read studies in 5,202 human genomes sequenced with short reads, including the complete 1000 Genomes Project dataset, at an average cost of $1.50 per sample. We determine the frequency of these variations in diverse human populations, characterize their complex allelic variations and identify thousands of expression quantitative trait loci (eQTLs) driven by these variations.
]]></description>
<dc:creator>Siren, J.</dc:creator>
<dc:creator>Monlong, J.</dc:creator>
<dc:creator>Chang, X.</dc:creator>
<dc:creator>Novak, A. M.</dc:creator>
<dc:creator>Eizenga, J. M.</dc:creator>
<dc:creator>Markello, C.</dc:creator>
<dc:creator>Sibbesen, J. A.</dc:creator>
<dc:creator>Hickey, G.</dc:creator>
<dc:creator>Chang, P.-C.</dc:creator>
<dc:creator>Carroll, A.</dc:creator>
<dc:creator>Haussler, D.</dc:creator>
<dc:creator>Garrison, E.</dc:creator>
<dc:creator>Paten, B.</dc:creator>
<dc:date>2020-12-06</dc:date>
<dc:identifier>doi:10.1101/2020.12.04.412486</dc:identifier>
<dc:title><![CDATA[Genotyping common, large structural variations in 5,202 genomes using pangenomes, the Giraffe mapper, and the vg toolkit]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2020-12-06</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2022.05.27.493721v1?rss=1">
<title>
<![CDATA[
Constructing founder sets under allelic and non-allelic homologous recombination 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2022.05.27.493721v1?rss=1
</link>
<description><![CDATA[
Homologous recombination between the maternal and paternal copies of a chromosome is a key mechanism for human inheritance and shapes population genetic properties of our species. However, a similar mechanism can also act between different copies of the same sequence, in which case it is called non-allelic homologous recombination (NAHR). This process can result in genomic rearrangements--including deletion, duplication, and inversion--and underlies many genomic disorders. Despite its importance for genome evolution and disease, there is a lack of computational models to study genomic loci prone to NAHR.

In this work, we propose such a computational model, providing a unified framework for both (allelic) homologous recombination and NAHR. Our model represents a set of genomes as a graph, where human haplotypes correspond to walks through this graph. We formulate two founder set problems under our recombination model, provide flow-based algorithms for their solution, and demonstrate scalability to problem instances arising in practice.
]]></description>
<dc:creator>Bonnet, K.</dc:creator>
<dc:creator>Marschall, T.</dc:creator>
<dc:creator>Doerr, D.</dc:creator>
<dc:date>2022-05-29</dc:date>
<dc:identifier>doi:10.1101/2022.05.27.493721</dc:identifier>
<dc:title><![CDATA[Constructing founder sets under allelic and non-allelic homologous recombination]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2022-05-29</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2022.10.06.511148v1?rss=1">
<title>
<![CDATA[
Inversion polymorphism in a complete human genome assembly 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2022.10.06.511148v1?rss=1
</link>
<description><![CDATA[
The completion of the human genome significantly improved our ability to discover and interpret genome copy number variation. In order to understand its impact on the characterization of inversion polymorphisms, we remapped data from 41 human genomes and 10 new samples against the telomere-to-telomere (T2T) reference genome as compared to the standard GRCh38 reference. Our analysis shows a ~21% increase in sensitivity identifying and improving mapping of 63 inversions. We further identify 26 misorientations within GRCh38, and show that the T2T reference is three times more likely to represent the correct orientation of the major human allele. As a result, we report a significant bias for inversions accumulating within the pericentromeric regions of specific chromosomes and show that functional annotations around inverted regions, such as topological-associated domains, can be better interpreted.
]]></description>
<dc:creator>Porubsky, D.</dc:creator>
<dc:creator>Harvey, W. T.</dc:creator>
<dc:creator>Rozanski, A. N.</dc:creator>
<dc:creator>Ebler, J.</dc:creator>
<dc:creator>Hoeps, W.</dc:creator>
<dc:creator>Ashraf, H.</dc:creator>
<dc:creator>Hasenfeld, P.</dc:creator>
<dc:creator>Paten, B.</dc:creator>
<dc:creator>Sanders, A. D.</dc:creator>
<dc:creator>Marschall, T.</dc:creator>
<dc:creator>Korbel, J. O.</dc:creator>
<dc:creator>Eichler, E. E.</dc:creator>
<dc:date>2022-10-06</dc:date>
<dc:identifier>doi:10.1101/2022.10.06.511148</dc:identifier>
<dc:title><![CDATA[Inversion polymorphism in a complete human genome assembly]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2022-10-06</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2023.01.12.523790v1?rss=1">
<title>
<![CDATA[
Scalable Nanopore sequencing of human genomes provides a comprehensive view of haplotype-resolved variation and methylation 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2023.01.12.523790v1?rss=1
</link>
<description><![CDATA[
Long-read sequencing technologies substantially overcome the limitations of short-reads but to date have not been considered as a feasible replacement at scale due to a combination of being too expensive, not scalable enough, or too error-prone. Here, we develop an efficient and scalable wet lab and computational protocol for Oxford Nanopore Technologies (ONT) long-read sequencing that seeks to provide a genuine alternative to short-reads for large-scale genomics projects. We applied our protocol to cell lines and brain tissue samples as part of a pilot project for the NIH Center for Alzheimer's and Related Dementias (CARD). Using a single PromethION flow cell, we can detect SNPs with F1-score better than Illumina short-read sequencing. Small indel calling remains difficult within homopolymers and tandem repeats, but is comparable to Illumina calls elsewhere. Further, we can discover structural variants with F1-score comparable to state-of-the-art methods involving Pacific Biosciences HiFi sequencing and trio information (but at a lower cost and greater throughput). Using ONT-based phasing, we can then combine and phase small and structural variants at megabase scales. Our protocol also produces highly accurate, haplotype-specific methylation calls. Overall, this makes large-scale long-read sequencing projects feasible; the protocol is currently being used to sequence thousands of brain-based genomes as a part of the NIH CARD initiative. We provide the protocol and software as open-source integrated pipelines for generating phased variant calls and assemblies.
]]></description>
<dc:creator>Kolmogorov, M.</dc:creator>
<dc:creator>Billingsley, K. J.</dc:creator>
<dc:creator>Mastoras, M.</dc:creator>
<dc:creator>Meredith, M.</dc:creator>
<dc:creator>Monlong, J.</dc:creator>
<dc:creator>Lorig-Roach, R.</dc:creator>
<dc:creator>Asri, M.</dc:creator>
<dc:creator>Alvarez Jerez, P.</dc:creator>
<dc:creator>Malik, L.</dc:creator>
<dc:creator>Dewan, R.</dc:creator>
<dc:creator>Reed, X.</dc:creator>
<dc:creator>Genner, R. M.</dc:creator>
<dc:creator>Daida, K.</dc:creator>
<dc:creator>Behera, S.</dc:creator>
<dc:creator>Shafin, K.</dc:creator>
<dc:creator>Pesout, T.</dc:creator>
<dc:creator>Prabakaran, J.</dc:creator>
<dc:creator>Carnevali, P.</dc:creator>
<dc:creator>North American Brain Expression Consortium (NABEC)</dc:creator>
<dc:creator>Yang, J.</dc:creator>
<dc:creator>Rhie, A.</dc:creator>
<dc:creator>Scholz, S. W.</dc:creator>
<dc:creator>Traynor, B. J.</dc:creator>
<dc:creator>Miga, K. H.</dc:creator>
<dc:creator>Jain, M.</dc:creator>
<dc:creator>Timp, W.</dc:creator>
<dc:creator>Phillippy, A. M.</dc:creator>
<dc:creator>Chaisson, M.</dc:creator>
<dc:creator>Sedlazeck, F. J.</dc:creator>
<dc:creator>Blauwendraat, C.</dc:creator>
<dc:creator>Paten, B.</dc:creator>
<dc:date>2023-01-15</dc:date>
<dc:identifier>doi:10.1101/2023.01.12.523790</dc:identifier>
<dc:title><![CDATA[Scalable Nanopore sequencing of human genomes provides a comprehensive view of haplotype-resolved variation and methylation]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2023-01-15</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2022.04.14.488380v1?rss=1">
<title>
<![CDATA[
Optimal gap-affine alignment in O(s) space 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2022.04.14.488380v1?rss=1
</link>
<description><![CDATA[
Motivation: Pairwise sequence alignment remains a fundamental problem in computational biology and bioinformatics. Recent advances in genomics and sequencing technologies demand faster and scalable algorithms that can cope with the ever-increasing sequence lengths. Classical pairwise alignment algorithms based on dynamic programming are strongly limited by quadratic requirements in time and memory. The recently proposed wavefront alignment algorithm (WFA) introduced an efficient algorithm to perform exact gap-affine alignment in O(ns) time, where s is the optimal score and n is the sequence length. Notwithstanding these bounds, WFA's O(s^2) memory requirements become computationally impractical for genome-scale alignments, leading to a need for further improvement.

Results: In this paper, we present the bidirectional WFA algorithm (BiWFA), the first gap-affine algorithm capable of computing optimal alignments in O(s) memory while retaining WFA's time complexity of O(ns). As a result, this work improves the lowest known memory bound O(n) to compute gap-affine alignments. In practice, our implementation never requires more than a few hundred MB when aligning noisy Oxford Nanopore Technologies reads up to 1 Mbp long while maintaining competitive execution times.

Availability: All code is publicly available at https://github.com/smarco/BiWFA-paper

Contact: santiagomsola@gmail.com
]]></description>
<dc:creator>Marco-Sola, S.</dc:creator>
<dc:creator>Eizenga, J. M.</dc:creator>
<dc:creator>Guarracino, A.</dc:creator>
<dc:creator>Paten, B.</dc:creator>
<dc:creator>Garrison, E.</dc:creator>
<dc:creator>Moreto, M.</dc:creator>
<dc:date>2022-04-15</dc:date>
<dc:identifier>doi:10.1101/2022.04.14.488380</dc:identifier>
<dc:title><![CDATA[Optimal gap-affine alignment in O(s) space]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2022-04-15</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2020.10.30.361162v1?rss=1">
<title>
<![CDATA[
UCSC Cell Browser: Visualize Your Single-Cell Data 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2020.10.30.361162v1?rss=1
</link>
<description><![CDATA[
Summary: As the use of single-cell technologies has grown, so has the need for tools to explore these large, complicated datasets. The UCSC Cell Browser is a tool that allows scientists to visualize gene expression and metadata annotation distribution throughout a single-cell dataset or multiple datasets.

Availability and implementation: We provide the UCSC Cell Browser as a free website where users can explore a growing collection of single-cell datasets and a freely available Python package for scientists to create stable, self-contained visualizations for their own single-cell datasets. Learn more at https://cells.ucsc.edu.

Contact: cells@ucsc.edu
]]></description>
<dc:creator>Speir, M. L.</dc:creator>
<dc:creator>Bhaduri, A.</dc:creator>
<dc:creator>Markov, N. S.</dc:creator>
<dc:creator>Moreno, P.</dc:creator>
<dc:creator>Nowakowski, T. J.</dc:creator>
<dc:creator>Papatheodorou, I.</dc:creator>
<dc:creator>Pollen, A. A.</dc:creator>
<dc:creator>Seninge, L.</dc:creator>
<dc:creator>Kent, W. J.</dc:creator>
<dc:creator>Haeussler, M.</dc:creator>
<dc:date>2020-10-31</dc:date>
<dc:identifier>doi:10.1101/2020.10.30.361162</dc:identifier>
<dc:title><![CDATA[UCSC Cell Browser: Visualize Your Single-Cell Data]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2020-10-31</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2022.01.12.476087v1?rss=1">
<title>
<![CDATA[
Improving the time and space complexity of the WFA algorithm and generalizing its scoring 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2022.01.12.476087v1?rss=1
</link>
<description><![CDATA[
Motivation: Modern genomic sequencing data is trending toward longer sequences with higher accuracy. Many analyses using these data will center on alignments, but classical exact alignment algorithms are infeasible for long sequences. The recently proposed WFA algorithm demonstrated how to perform exact alignment for long, similar sequences in O(sN) time and O(s^2) memory, where s is a score that is low for similar sequences (Marco-Sola et al., 2021). However, this algorithm still has infeasible memory requirements for longer sequences. Also, it uses an alternate scoring system that is unfamiliar to many bioinformaticians.

Results: We describe variants of WFA that improve its asymptotic memory use from O(s^2) to O(s^(3/2)) and its asymptotic run time from O(sN) to O(s^2 + N). We expect the reduction in memory use to be particularly impactful, as it makes it practical to perform highly multithreaded megabase-scale exact alignments in common compute environments. In addition, we show how to fold WFA's alternate scoring into the broader literature on alignment scores.

Availability: All code is publicly available for use and modification at https://github.com/jeizenga/wfalm.

Contact: jeizenga@ucsc.edu

Supplementary information: Supplementary data are available online.
]]></description>
<dc:creator>Eizenga, J. M.</dc:creator>
<dc:creator>Paten, B.</dc:creator>
<dc:date>2022-01-15</dc:date>
<dc:identifier>doi:10.1101/2022.01.12.476087</dc:identifier>
<dc:title><![CDATA[Improving the time and space complexity of the WFA algorithm and generalizing its scoring]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2022-01-15</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2022.01.18.476849v1?rss=1">
<title>
<![CDATA[
Exploring genomic data coupled with 3D chromatin structures using the WashU Epigenome Browser 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2022.01.18.476849v1?rss=1
</link>
<description><![CDATA[
Biological functions are not only encoded by the genome's sequence but also regulated by its three-dimensional (3D) structure. More and more studies have revealed the importance of 3D chromatin structures in development and diseases; therefore, visualizing the connections between genome sequence, epigenomic dynamics (1D) and the 3D genome becomes a pressing need. The WashU Epigenome Browser introduces a new 3D visualization module to integrate visualization of 1D (such as sequence features, epigenomic data) and 2D data (such as chromosome conformation capture data) with 3D genome structure. Genomic coordinates are encoded in 3D models of the chromosomes; thus, all genomic information displayed on a 1D genome browser can be visualized on a 3D model, supported by genome browser utilities and facilitating interpretation of genomic data. Biological information that is difficult to illustrate in 1D becomes more intuitive when displayed in 3D, providing novel and powerful tools for investigators to hypothesize and understand the connections between biological functions and 3D genome structures.
]]></description>
<dc:creator>Li, D.</dc:creator>
<dc:creator>Purushotham, D.</dc:creator>
<dc:creator>Harrison, J. K.</dc:creator>
<dc:creator>Wang, T.</dc:creator>
<dc:date>2022-01-21</dc:date>
<dc:identifier>doi:10.1101/2022.01.18.476849</dc:identifier>
<dc:title><![CDATA[Exploring genomic data coupled with 3D chromatin structures using the WashU Epigenome Browser]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2022-01-21</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2022.12.01.518658v1?rss=1">
<title>
<![CDATA[
Assembly of 43 diverse human Y chromosomes reveals extensive complexity and variation 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2022.12.01.518658v1?rss=1
</link>
<description><![CDATA[
The prevalence of highly repetitive sequences within the human Y chromosome has led to its incomplete assembly and systematic omission from genomic analyses. Here, we present long-read de novo assemblies of 43 diverse Y chromosomes spanning 180,000 years of human evolution, including two from deep-rooted African Y lineages, and report remarkable complexity and diversity in chromosome size and structure, in contrast with its low level of base substitution variation. The Y chromosome assemblies vary extensively in size, from 45.2 to 84.9 Mbp, and include, on average, 81 kbp of novel sequence per Y chromosome. Half of the male-specific euchromatic region is subject to large inversions with a >2-fold higher recurrence rate compared to inversions in the rest of the human genome. Ampliconic sequences associated with these inversions further show differing mutation rates that are sequence context-dependent and some ampliconic genes show evidence for concerted evolution with the acquisition and purging of lineage-specific pseudogenes. The largest heterochromatic region in the human genome, the Yq12, is composed of alternating arrays of DYZ1 and DYZ2 repeat units that show extensive variation in the number, size and distribution of these arrays, but retain a 1:1 copy number ratio of the monomer repeats, consistent with the notion that functional or evolutionary forces are acting on this chromosomal region. Finally, our data suggest that the boundary between the recombining pseudoautosomal region 1 and the non-recombining portions of the X and Y chromosomes lies 500 kbp distal to the currently established boundary. The availability of sequence-resolved Y chromosomes from multiple individuals provides a unique opportunity for identifying new associations of specific traits with Y-chromosomal variants and garnering novel insights into the evolution and function of complex regions of the human genome.
]]></description>
<dc:creator>Hallast, P.</dc:creator>
<dc:creator>Ebert, P.</dc:creator>
<dc:creator>Loftus, M.</dc:creator>
<dc:creator>Yilmaz, F.</dc:creator>
<dc:creator>Audano, P. A.</dc:creator>
<dc:creator>Logsdon, G. A.</dc:creator>
<dc:creator>Bonder, M. J.</dc:creator>
<dc:creator>Zhou, W.</dc:creator>
<dc:creator>Hoeps, W.</dc:creator>
<dc:creator>Kim, K.</dc:creator>
<dc:creator>Li, C.</dc:creator>
<dc:creator>Dishuck, P. C.</dc:creator>
<dc:creator>Porubsky, D.</dc:creator>
<dc:creator>Tsetsos, F.</dc:creator>
<dc:creator>Kwon, J. Y.</dc:creator>
<dc:creator>Zhu, Q.</dc:creator>
<dc:creator>Munson, K. M.</dc:creator>
<dc:creator>Hasenfeld, P.</dc:creator>
<dc:creator>Harvey, W. T.</dc:creator>
<dc:creator>Lewis, A. P.</dc:creator>
<dc:creator>Kordosky, J.</dc:creator>
<dc:creator>Hoekzema, K.</dc:creator>
<dc:creator>The Human Genome Structural Variation Consortium (HGSVC)</dc:creator>
<dc:creator>Korbel, J. O.</dc:creator>
<dc:creator>Tyler-Smith, C.</dc:creator>
<dc:creator>Eichler, E. E.</dc:creator>
<dc:creator>Shi, X.</dc:creator>
<dc:creator>Beck, C. R.</dc:creator>
<dc:creator>Marschall, T.</dc:creator>
<dc:creator>Konkel, M. K.</dc:creator>
<dc:creator>Lee, C.</dc:creator>
<dc:date>2022-12-01</dc:date>
<dc:identifier>doi:10.1101/2022.12.01.518658</dc:identifier>
<dc:title><![CDATA[Assembly of 43 diverse human Y chromosomes reveals extensive complexity and variation]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2022-12-01</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2022.03.16.483999v1?rss=1">
<title>
<![CDATA[
Evolution of transposable element-derived enhancer activity 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2022.03.16.483999v1?rss=1
</link>
<description><![CDATA[
Many transposable elements (TEs) contain transcription factor binding sites and are implicated as potential regulatory elements. However, TEs are rarely functionally tested for regulatory activity, which in turn limits our understanding of how TE regulatory activity has evolved. We systematically tested the human LTR18A subfamily for regulatory activity using massively parallel reporter assay (MPRA) and found AP-1 and C/EBP-related binding motifs as drivers of enhancer activity. Functional analysis of evolutionarily reconstructed ancestral sequences revealed that LTR18A elements have generally lost regulatory activity over time through sequence changes, with the largest effects occurring due to mutations in the AP-1 and C/EBP motifs. We observed that the two motifs are conserved at higher rates than expected based on neutral evolution. Finally, we identified LTR18A elements as potential enhancers in the human genome, primarily in epithelial cells. Together, our results provide a model for the origin, evolution, and co-option of TE-derived regulatory elements.
]]></description>
<dc:creator>Du, A. Y.</dc:creator>
<dc:creator>Zhuo, X.</dc:creator>
<dc:creator>Sundaram, V.</dc:creator>
<dc:creator>Jensen, N. O.</dc:creator>
<dc:creator>Chaudhari, H. G.</dc:creator>
<dc:creator>Saccone, N. L.</dc:creator>
<dc:creator>Cohen, B. A.</dc:creator>
<dc:creator>Wang, T.</dc:creator>
<dc:date>2022-03-17</dc:date>
<dc:identifier>doi:10.1101/2022.03.16.483999</dc:identifier>
<dc:title><![CDATA[Evolution of transposable element-derived enhancer activity]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2022-03-17</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/810341v1?rss=1">
<title>
<![CDATA[
Efficient chromosome-scale haplotype-resolved assembly of human genomes 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/810341v1?rss=1
</link>
<description><![CDATA[
Haplotype-resolved or phased sequence assembly provides a complete picture of genomes and complex genetic variations. However, current phased assembly algorithms either fail to generate chromosome-scale phasing or require pedigree information, which limits their application. We present a method that leverages long accurate reads and long-range conformation data for single individuals to generate chromosome-scale phased assembly within a day. Applied to three public human genomes, PGP1, HG002 and NA12878, our method produced haplotype-resolved assemblies with contig NG50 up to 25 Mb and phased ~99.5% of heterozygous sites to 98-99% accuracy, outperforming other approaches in terms of both contiguity and phasing completeness. We demonstrate the importance of chromosome-scale phased assemblies to discover structural variants (SVs), including thousands of new transposon insertions, and of highly polymorphic and medically important regions such as HLA and KIR. Our improved method will enable high-quality precision medicine and facilitate new studies of individual haplotype variation and population diversity.
]]></description>
<dc:creator>Garg, S.</dc:creator>
<dc:creator>Arkarachai Fungtammasan, A.</dc:creator>
<dc:creator>Carroll, A.</dc:creator>
<dc:creator>Chou, M.</dc:creator>
<dc:creator>Schmitt, A.</dc:creator>
<dc:creator>Zhou, X.</dc:creator>
<dc:creator>Mac, S.</dc:creator>
<dc:creator>Peluso, P.</dc:creator>
<dc:creator>Hatas, E.</dc:creator>
<dc:creator>Ghurye, J.</dc:creator>
<dc:creator>Maguire, J.</dc:creator>
<dc:creator>Mahmoud, M.</dc:creator>
<dc:creator>Cheng, H.</dc:creator>
<dc:creator>Heller, D.</dc:creator>
<dc:creator>Zook, J. M.</dc:creator>
<dc:creator>Moemke, T.</dc:creator>
<dc:creator>Marschall, T.</dc:creator>
<dc:creator>Sedlazeck, F. J.</dc:creator>
<dc:creator>Aach, J.</dc:creator>
<dc:creator>Chin, C.-S.</dc:creator>
<dc:creator>Church, G. M.</dc:creator>
<dc:creator>Li, H. M.</dc:creator>
<dc:date>2019-10-18</dc:date>
<dc:identifier>doi:10.1101/810341</dc:identifier>
<dc:title><![CDATA[Efficient chromosome-scale haplotype-resolved assembly of human genomes]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2019-10-18</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2021.07.12.451456v1?rss=1">
<title>
<![CDATA[
From telomere to telomere: the transcriptional and epigenetic state of human repeat elements 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2021.07.12.451456v1?rss=1
</link>
<description><![CDATA[
Mobile elements and highly repetitive genomic regions are potent sources of lineage-specific genomic innovation and fingerprint individual genomes. Comprehensive analyses of large, composite or arrayed repeat elements and those found in more complex regions of the genome require a complete, linear genome assembly. Here we present the first de novo repeat discovery and annotation of a complete human reference genome, T2T-CHM13v1.0. We identified novel satellite arrays, expanded the catalog of variants and families for known repeats and mobile elements, characterized new classes of complex, composite repeats, and provided comprehensive annotations of retroelement transduction events. Utilizing PRO-seq to detect nascent transcription and nanopore sequencing to delineate CpG methylation profiles, we defined the structure of transcriptionally active retroelements in humans, including for the first time those found in centromeres. Together, these data provide expanded insight into the diversity, distribution and evolution of repetitive regions that have shaped the human genome.
]]></description>
<dc:creator>Hoyt, S. J.</dc:creator>
<dc:creator>Storer, J. M.</dc:creator>
<dc:creator>Hartley, G. A.</dc:creator>
<dc:creator>Grady, P. G. S.</dc:creator>
<dc:creator>Gershman, A.</dc:creator>
<dc:creator>de Lima, L. G.</dc:creator>
<dc:creator>Limouse, C.</dc:creator>
<dc:creator>Halabian, R.</dc:creator>
<dc:creator>Wojenski, L.</dc:creator>
<dc:creator>Rodriguez, M.</dc:creator>
<dc:creator>Altemose, N.</dc:creator>
<dc:creator>Core, L.</dc:creator>
<dc:creator>Gerton, J. L.</dc:creator>
<dc:creator>Makalowski, W.</dc:creator>
<dc:creator>Olson, D.</dc:creator>
<dc:creator>Rosen, J.</dc:creator>
<dc:creator>Smit, A. F. A.</dc:creator>
<dc:creator>Straight, A. F.</dc:creator>
<dc:creator>Vollger, M. R.</dc:creator>
<dc:creator>Wheeler, T.</dc:creator>
<dc:creator>Schatz, M.</dc:creator>
<dc:creator>Eichler, E.</dc:creator>
<dc:creator>Phillippy, A. M.</dc:creator>
<dc:creator>Timp, W.</dc:creator>
<dc:creator>Miga, K. H.</dc:creator>
<dc:creator>O'Neill, R. J.</dc:creator>
<dc:date>2021-07-12</dc:date>
<dc:identifier>doi:10.1101/2021.07.12.451456</dc:identifier>
<dc:title><![CDATA[From telomere to telomere: the transcriptional and epigenetic state of human repeat elements]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2021-07-12</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2022.03.06.483034v1?rss=1">
<title>
<![CDATA[
Automated assembly of high-quality diploid human reference genomes 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2022.03.06.483034v1?rss=1
</link>
<description><![CDATA[
The current human reference genome, GRCh38, represents over 20 years of effort to generate a high-quality assembly, which has greatly benefited society [1, 2]. However, it still has many gaps and errors, and does not represent a biological human genome since it is a blend of multiple individuals [3, 4]. Recently, a high-quality telomere-to-telomere reference genome, CHM13, was generated with the latest long-read technologies, but it was derived from a hydatidiform mole cell line with a duplicate genome, and is thus nearly homozygous [5]. To address these limitations, the Human Pangenome Reference Consortium (HPRC) recently formed with the goal of creating a collection of high-quality, cost-effective, diploid genome assemblies for a pangenome reference that represents human genetic diversity [6]. Here, in our first scientific report, we determined which combination of current genome sequencing and automated assembly approaches yields the most complete, accurate, and cost-effective diploid genome assemblies with minimal manual curation. Approaches that used highly accurate long reads and parent-child data to sort haplotypes during assembly outperformed those that did not. Developing a combination of all the top performing methods, we generated our first high-quality diploid reference assembly, containing only ~4 gaps (range 0-12) per chromosome, most within ±1% of CHM13's length. Nearly 1/4th of protein coding genes have synonymous amino acid changes between haplotypes, and centromeric regions showed the highest density of variation. Our findings serve as a foundation for assembling near-complete diploid human genomes at the scale required for constructing a human pangenome reference that captures all genetic variation from single nucleotides to large structural rearrangements.
]]></description>
<dc:creator>Jarvis, E. D.</dc:creator>
<dc:creator>Formenti, G.</dc:creator>
<dc:creator>Rhie, A.</dc:creator>
<dc:creator>Guarracino, A.</dc:creator>
<dc:creator>Yang, C.</dc:creator>
<dc:creator>Wood, J.</dc:creator>
<dc:creator>Tracey, A.</dc:creator>
<dc:creator>Thibaud-Nissen, F.</dc:creator>
<dc:creator>Vollger, M. R.</dc:creator>
<dc:creator>Porubsky, D.</dc:creator>
<dc:creator>Cheng, H.</dc:creator>
<dc:creator>Asri, M.</dc:creator>
<dc:creator>Logsdon, G. A.</dc:creator>
<dc:creator>Carnevali, P.</dc:creator>
<dc:creator>Chaisson, M.</dc:creator>
<dc:creator>Chin, C.-S.</dc:creator>
<dc:creator>Cody, S.</dc:creator>
<dc:creator>Collins, J.</dc:creator>
<dc:creator>Ebert, P.</dc:creator>
<dc:creator>Escalona, M.</dc:creator>
<dc:creator>Fedrigo, O.</dc:creator>
<dc:creator>Fulton, R. S.</dc:creator>
<dc:creator>Fulton, L. L.</dc:creator>
<dc:creator>Garg, S.</dc:creator>
<dc:creator>Ghurye, J.</dc:creator>
<dc:creator>Green, E.</dc:creator>
<dc:creator>Hall, I. M.</dc:creator>
<dc:creator>Harvey, W. H.</dc:creator>
<dc:creator>Hasenfeld, P.</dc:creator>
<dc:creator>Hastie, A.</dc:creator>
<dc:creator>Haukness, M.</dc:creator>
<dc:creator>Jain, M.</dc:creator>
<dc:creator>Kirsche, M.</dc:creator>
<dc:creator>Kolmogorov, M.</dc:creator>
<dc:creator>Korbel, J. O.</dc:creator>
<dc:creator>Koren, S.</dc:creator>
<dc:creator>Korlach, J.</dc:creator>
<dc:creator>Lee, J.</dc:creator>
<dc:creator>Li, D.</dc:creator>
<dc:creator>Lindsay, T.</dc:creator>
<dc:creator>Lucas, J.</dc:creator>
<dc:creator>Luo, F.</dc:creator>
<dc:creator>Marschall, T.</dc:creator>
<dc:creator>McDaniel, J.</dc:creator>
<dc:creator>Nie, F.</dc:creator>
<dc:creator>Olsen, H. E.</dc:creator>
<dc:creator>Olson, N.</dc:creator>
<dc:date>2022-03-06</dc:date>
<dc:identifier>doi:10.1101/2022.03.06.483034</dc:identifier>
<dc:title><![CDATA[Automated assembly of high-quality diploid human reference genomes]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2022-03-06</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2020.02.25.964445v1?rss=1">
<title>
<![CDATA[
SDip: A novel graph-based approach to haplotype-aware assembly based structural variant calling in targeted segmental duplications sequencing 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2020.02.25.964445v1?rss=1
</link>
<description><![CDATA[
Segmental duplications are important for understanding human diseases and evolution. The challenge of distinguishing allelic and duplication sequences has hindered their phased assembly as well as characterization of structural variant calls. Here we have developed a novel graph-based approach that leverages single nucleotide differences in overlapping reads to distinguish allelic and duplication sequence information from accurate long-read PacBio HiFi sequencing. These differences make it possible to generate allelic and duplication-specific overlaps in the graph and to spell out the phased assembly used for structural variant calling. We have applied our method to three public genomes: CHM13, NA12878 and HG002. Our method resolved 86% of duplicated regions fully with contig N50 up to 79 kb and produced <800 structural variant phased calls, outperforming the state-of-the-art SDA method in terms of all metrics. Furthermore, we demonstrate the importance of phased assemblies and variant calls to biologically relevant duplicated genes such as SMN1, SRGAP2C, NPY4R and FAM72A. Our phased assemblies and accurate variant calling specifically in duplicated regions will enable the study of the evolution and adaptation of various species.
]]></description>
<dc:creator>Heller, D.</dc:creator>
<dc:creator>Vingron, M.</dc:creator>
<dc:creator>Church, G.</dc:creator>
<dc:creator>Li, H.</dc:creator>
<dc:creator>Garg, S.</dc:creator>
<dc:date>2020-02-26</dc:date>
<dc:identifier>doi:10.1101/2020.02.25.964445</dc:identifier>
<dc:title><![CDATA[SDip: A novel graph-based approach to haplotype-aware assembly based structural variant calling in targeted segmental duplications sequencing]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2020-02-26</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2020.02.07.939124v1?rss=1">
<title>
<![CDATA[
Exploring the coronavirus epidemic using the new WashU Virus Genome Browser 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2020.02.07.939124v1?rss=1
</link>
<description><![CDATA[
Since its debut in mid-December, 2019, the novel coronavirus (2019-nCoV) has rapidly spread from its origin in Wuhan, China, to several countries across the globe, leading to a global health crisis. As of February 7, 2020, 44 strains of the virus have been sequenced and uploaded to NCBI's GenBank [1], providing insight into the virus's evolutionary history and pathogenesis. Here, we present the WashU Virus Genome Browser, a web-based portal for viewing virus genomic data. The browser is home to 16 complete 2019-nCoV genome sequences, together with hundreds of related viral sequences including severe acute respiratory syndrome coronavirus (SARS-CoV), Middle East respiratory syndrome coronavirus (MERS-CoV), and Ebola virus. In addition, the browser features unique customizability, supporting user-provided upload of novel viral sequences in various formats. Sequences can be viewed in both a track-based representation as well as a phylogenetic tree-based view, allowing the user to easily compare sequence features across multiple strains. The WashU Virus Genome Browser inherited many features and track types from the WashU Epigenome Browser, and additionally incorporated a new type of SNV track to address the specific needs of viral research. Our Virus Browser portal can be accessed at https://virusgateway.wustl.edu, and documentation is available at https://virusgateway.readthedocs.io/.
]]></description>
<dc:creator>Flynn, J.</dc:creator>
<dc:creator>Purushotham, D.</dc:creator>
<dc:creator>Choudhary, M. N.</dc:creator>
<dc:creator>Zhuo, X.</dc:creator>
<dc:creator>Fan, C.</dc:creator>
<dc:creator>Matt, G.</dc:creator>
<dc:creator>Li, D.</dc:creator>
<dc:creator>Wang, T.</dc:creator>
<dc:date>2020-02-11</dc:date>
<dc:identifier>doi:10.1101/2020.02.07.939124</dc:identifier>
<dc:title><![CDATA[Exploring the coronavirus epidemic using the new WashU Virus Genome Browser]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2020-02-11</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2021.05.26.443420v1?rss=1">
<title>
<![CDATA[
Epigenetic Patterns in a Complete Human Genome 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2021.05.26.443420v1?rss=1
</link>
<description><![CDATA[
The completion of the first telomere-to-telomere human genome, T2T-CHM13, enables exploration of the full epigenome, removing limitations previously imposed by the missing reference sequence. Existing epigenetic studies omit unassembled and unmappable genomic regions (e.g. centromeres, pericentromeres, acrocentric chromosome arms, subtelomeres, segmental duplications, tandem repeats). Leveraging the new assembly, we were able to measure enrichment of epigenetic marks with short reads using k-mer assisted mapping methods. This granted array-level enrichment information to characterize the epigenetic regulation of these satellite repeats. Using nanopore sequencing data, we generated base-level maps of the most complete human methylome ever produced. We examined methylation patterns in satellite DNA and revealed organized patterns of methylation along individual molecules. When exploring the centromeric epigenome, we discovered a distinctive dip in centromere methylation consistent with active sites of kinetochore assembly. Through long-read chromatin accessibility measurements (nanoNOMe) paired to CUT&RUN data, we found the hypomethylated region was extremely inaccessible and paired to CENP-A/B binding. With long reads we interrogated allele-specific, long-range epigenetic patterns in complex macro-satellite arrays such as those involved in X chromosome inactivation. Using the single-molecule measurements we can cluster reads based on methylation status alone, distinguishing epigenetically heterogeneous and homogeneous areas. The analysis provides a framework to investigate the most elusive regions of the human genome, applying both long- and short-read technology to grant new insights into epigenetic regulation.
]]></description>
<dc:creator>Gershman, A.</dc:creator>
<dc:creator>Sauria, M. E. G.</dc:creator>
<dc:creator>Hook, P. W.</dc:creator>
<dc:creator>Hoyt, S.</dc:creator>
<dc:creator>Razaghi, R.</dc:creator>
<dc:creator>Koren, S.</dc:creator>
<dc:creator>Altemose, N.</dc:creator>
<dc:creator>Caldas, G. V.</dc:creator>
<dc:creator>Vollger, M. R.</dc:creator>
<dc:creator>Logsdon, G.</dc:creator>
<dc:creator>Rhie, A.</dc:creator>
<dc:creator>Eichler, E.</dc:creator>
<dc:creator>Schatz, M.</dc:creator>
<dc:creator>O'Neill, R. J.</dc:creator>
<dc:creator>Phillippy, A. M.</dc:creator>
<dc:creator>Miga, K. H.</dc:creator>
<dc:creator>Timp, W.</dc:creator>
<dc:date>2021-05-27</dc:date>
<dc:identifier>doi:10.1101/2021.05.26.443420</dc:identifier>
<dc:title><![CDATA[Epigenetic Patterns in a Complete Human Genome]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2021-05-27</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2021.12.20.472354v1?rss=1">
<title>
<![CDATA[
Haplotype-resolved inversion landscape reveals hotspots of mutational recurrence associated with genomic disorders 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2021.12.20.472354v1?rss=1
</link>
<description><![CDATA[
Unlike copy number variants (CNVs), inversions remain an underexplored genetic variation class. By integrating multiple genomic technologies, we discover 729 inversions in 41 human genomes. Approximately 85% of inversions <2 kbp form by twin-priming during L1-retrotransposition; 80% of the larger inversions are balanced and affect twice as many base pairs as CNVs. Balanced inversions show an excess of common variants, and 72% are flanked by segmental duplications (SDs) or mobile elements. Since this suggests recurrence due to non-allelic homologous recombination, we developed complementary approaches to identify recurrent inversion formation. We describe 40 recurrent inversions encompassing 0.6% of the genome, showing inversion rates up to 2.7 × 10⁻⁴ per locus and generation. Recurrent inversions exhibit a sex-chromosomal bias, and significantly co-localize to the critical regions of genomic disorders. We propose that inversion recurrence results in an elevated number of heterozygous carriers and structural SD diversity, which increases mutability in the population and predisposes to disease-causing CNVs.
]]></description>
<dc:creator>Porubsky, D.</dc:creator>
<dc:creator>Höps, W.</dc:creator>
<dc:creator>Ashraf, H.</dc:creator>
<dc:creator>Hsieh, P.</dc:creator>
<dc:creator>Rodriguez-Martin, B.</dc:creator>
<dc:creator>Yilmaz, F.</dc:creator>
<dc:creator>Ebler, J.</dc:creator>
<dc:creator>Hallast, P.</dc:creator>
<dc:creator>Maggiolini, F. A. M.</dc:creator>
<dc:creator>Harvey, W. T.</dc:creator>
<dc:creator>Henning, B.</dc:creator>
<dc:creator>Audano, P. A.</dc:creator>
<dc:creator>Gordon, D. S.</dc:creator>
<dc:creator>Ebert, P.</dc:creator>
<dc:creator>Hasenfeld, P.</dc:creator>
<dc:creator>Benito, E.</dc:creator>
<dc:creator>Zhu, Q.</dc:creator>
<dc:creator>Human Genome Structural Variation Consortium</dc:creator>
<dc:creator>Lee, C.</dc:creator>
<dc:creator>Antonacci, F.</dc:creator>
<dc:creator>Steinrücken, M.</dc:creator>
<dc:creator>Beck, C. R.</dc:creator>
<dc:creator>Sanders, A. D.</dc:creator>
<dc:creator>Marschall, T.</dc:creator>
<dc:creator>Eichler, E. E.</dc:creator>
<dc:creator>Korbel, J. O.</dc:creator>
<dc:date>2021-12-20</dc:date>
<dc:identifier>doi:10.1101/2021.12.20.472354</dc:identifier>
<dc:title><![CDATA[Haplotype-resolved inversion landscape reveals hotspots of mutational recurrence associated with genomic disorders]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2021-12-20</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2020.07.31.231027v1?rss=1">
<title>
<![CDATA[
Transcript assembly improves expression quantification of transposable elements in single cell RNA-seq data 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2020.07.31.231027v1?rss=1
</link>
<description><![CDATA[
Transposable elements (TEs) are an integral part of the host transcriptome. TE-containing noncoding RNAs (ncRNAs) exhibit considerable tissue specificity and play crucial roles during development, including stem cell maintenance and cell differentiation. Recent advances in single cell RNA-seq (scRNA-seq) revolutionized cell-type specific gene expression analysis. However, scRNA-seq quantification tools tailored for TEs are lacking, limiting our ability to dissect TE expression dynamics at single cell resolution. To address this issue, we established a TE expression quantification pipeline that is compatible with scRNA-seq data generated across multiple technology platforms. We constructed TE containing ncRNA references using bulk RNA-seq data and demonstrated that quantifying TE expression at the transcript level effectively reduces noise. As proof of principle, we applied this strategy to mouse embryonic stem cells and successfully captured the expression profile of endogenous retroviruses in single cells. We further expanded our analysis to scRNA-seq data from early stages of mouse embryogenesis. Our results illustrated the dynamic TE expression at pre-implantation stages and revealed 137 TE-containing ncRNA transcripts with substantial tissue specificity during gastrulation and early organogenesis.
]]></description>
<dc:creator>Shao, W.</dc:creator>
<dc:creator>Wang, T.</dc:creator>
<dc:date>2020-07-31</dc:date>
<dc:identifier>doi:10.1101/2020.07.31.231027</dc:identifier>
<dc:title><![CDATA[Transcript assembly improves expression quantification of transposable elements in single cell RNA-seq data]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2020-07-31</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2021.03.24.436683v1?rss=1">
<title>
<![CDATA[
A species-specific retrotransposon drives a conserved Cdk2ap1 isoform essential for preimplantation development 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2021.03.24.436683v1?rss=1
</link>
<description><![CDATA[
Retrotransposons mediate gene regulation in multiple developmental and pathological processes. Here, we characterized the transient retrotransposon induction in preimplantation development of eight mammalian species. While species-specific in sequences, induced retrotransposons exhibit a similar preimplantation profile, conferring gene regulatory activities particularly through LTR retrotransposon promoters. We investigated a mouse-specific MT2B2 retrotransposon promoter, which generates an N-terminally truncated, preimplantation-specific Cdk2ap1ΔN isoform to promote cell proliferation. Cdk2ap1ΔN functionally contrasts to the canonical Cdk2ap1, which represses cell proliferation and peaks in mid-gestation stage. The mouse-specific MT2B2 element is developmentally essential, as its deletion abolishes Cdk2ap1ΔN, reduces cell proliferation and impairs embryo implantation. Intriguingly, Cdk2ap1ΔN is evolutionarily conserved across mammals, driven by species-specific promoters. The distinct preimplantation Cdk2ap1ΔN expression across different mammalian species correlates with their different duration in preimplantation development. Hence, species-specific transposon promoters can yield evolutionarily conserved, alternative protein isoforms, bestowing them with new functions and species-specific expression to govern essential biological divergence.

One Sentence Summary: In mammalian preimplantation embryos, retrotransposon promoters generate conserved gene isoforms, confer species-specific expression, and perform essential developmental functions.
]]></description>
<dc:creator>Modzelewski, A. J.</dc:creator>
<dc:creator>Shao, W.</dc:creator>
<dc:creator>Chen, J.</dc:creator>
<dc:creator>Lee, A.</dc:creator>
<dc:creator>Qi, X.</dc:creator>
<dc:creator>Noon, M.</dc:creator>
<dc:creator>Tjokro, K.</dc:creator>
<dc:creator>Sales, G.</dc:creator>
<dc:creator>Biton, A.</dc:creator>
<dc:creator>Speed, T.</dc:creator>
<dc:creator>Xuan, Z.</dc:creator>
<dc:creator>Wang, T.</dc:creator>
<dc:creator>Risso, D.</dc:creator>
<dc:creator>He, L.</dc:creator>
<dc:date>2021-03-25</dc:date>
<dc:identifier>doi:10.1101/2021.03.24.436683</dc:identifier>
<dc:title><![CDATA[A species-specific retrotransposon drives a conserved Cdk2ap1 isoform essential for preimplantation development]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2021-03-25</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2020.12.16.423102v1?rss=1">
<title>
<![CDATA[
De novo assembly of 64 haplotype-resolved human genomes of diverse ancestry and integrated analysis of structural variation 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2020.12.16.423102v1?rss=1
</link>
<description><![CDATA[
Long-read and strand-specific sequencing technologies together facilitate the de novo assembly of high-quality haplotype-resolved human genomes without parent-child trio data. We present 64 assembled haplotypes from 32 diverse human genomes. These highly contiguous haplotype assemblies (average contig N50: 26 Mbp) integrate all forms of genetic variation across even complex loci such as the major histocompatibility complex. We focus on 107,590 structural variants (SVs), of which 68% are inaccessible by short-read sequencing. We identify new SV hotspots (spanning megabases of gene-rich sequence), characterize 130 of the most active mobile element source elements, and find that 63% of all SVs arise by homology-mediated mechanisms--a twofold increase from previous studies. Our resource now enables reliable graph-based genotyping from short reads of up to 50,340 SVs, resulting in the identification of 1,525 expression quantitative trait loci (SV-eQTLs) as well as SV candidates for adaptive selection within the human population.
]]></description>
<dc:creator>Ebert, P.</dc:creator>
<dc:creator>Audano, P. A.</dc:creator>
<dc:creator>Zhu, Q.</dc:creator>
<dc:creator>Rodriguez-Martin, B.</dc:creator>
<dc:creator>Porubsky, D.</dc:creator>
<dc:creator>Bonder, M. J.</dc:creator>
<dc:creator>Sulovari, A.</dc:creator>
<dc:creator>Ebler, J.</dc:creator>
<dc:creator>Zhou, W.</dc:creator>
<dc:creator>Serra Mari, R.</dc:creator>
<dc:creator>Yilmaz, F.</dc:creator>
<dc:creator>Zhao, X.</dc:creator>
<dc:creator>Hsieh, P.</dc:creator>
<dc:creator>Lee, J.</dc:creator>
<dc:creator>Kumar, S.</dc:creator>
<dc:creator>Lin, J.</dc:creator>
<dc:creator>Rausch, T.</dc:creator>
<dc:creator>Chen, Y.</dc:creator>
<dc:creator>Ren, J.</dc:creator>
<dc:creator>Santamarina, M.</dc:creator>
<dc:creator>Hoeps, W.</dc:creator>
<dc:creator>Ashraf, H.</dc:creator>
<dc:creator>Chuang, N. T.</dc:creator>
<dc:creator>Yang, X.</dc:creator>
<dc:creator>Munson, K. M.</dc:creator>
<dc:creator>Lewis, A. P.</dc:creator>
<dc:creator>Fairley, S.</dc:creator>
<dc:creator>Tallon, L. J.</dc:creator>
<dc:creator>Clarke, W. E.</dc:creator>
<dc:creator>Basile, A. O.</dc:creator>
<dc:creator>Byrska-Bishop, M.</dc:creator>
<dc:creator>Corvelo, A.</dc:creator>
<dc:creator>Chaisson, M. J. P.</dc:creator>
<dc:creator>Chen, J.</dc:creator>
<dc:creator>Li, C.</dc:creator>
<dc:creator>Brand, H.</dc:creator>
<dc:creator>Wenger, A. M.</dc:creator>
<dc:creator>Ghareghani, M.</dc:creator>
<dc:creator>Harvey, W.</dc:creator>
<dc:creator>Raeder, B.</dc:creator>
<dc:creator>Hasenfeld, P.</dc:creator>
<dc:creator>Regier, A.</dc:creator>
<dc:creator>Abel, H.</dc:creator>
<dc:creator>Hall, I.</dc:creator>
<dc:creator>Flicek, P.</dc:creator>
<dc:creator>Stegle, O.</dc:creator>
<dc:creator>Gerstein, M.</dc:creator>
<dc:date>2020-12-16</dc:date>
<dc:identifier>doi:10.1101/2020.12.16.423102</dc:identifier>
<dc:title><![CDATA[De novo assembly of 64 haplotype-resolved human genomes of diverse ancestry and integrated analysis of structural variation]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2020-12-16</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2020.09.08.285395v1?rss=1">
<title>
<![CDATA[
The structure, function, and evolution of a complete human chromosome 8 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2020.09.08.285395v1?rss=1
</link>
<description><![CDATA[
The complete assembly of each human chromosome is essential for understanding human biology and evolution. Using complementary long-read sequencing technologies, we complete the first linear assembly of a human autosome, chromosome 8. Our assembly resolves the sequence of five previously long-standing gaps, including a 2.08 Mbp centromeric α-satellite array, a 644 kbp defensin copy number polymorphism important for disease risk, and an 863 kbp variable number tandem repeat at chromosome 8q21.2 that can function as a neocentromere. We show that the centromeric α-satellite array is generally methylated except for a 73 kbp hypomethylated region of diverse higher-order α-satellite enriched with CENP-A nucleosomes, consistent with the location of the kinetochore. Using a dual long-read sequencing approach, we complete the assembly of the orthologous chromosome 8 centromeric regions in chimpanzee, orangutan, and macaque for the first time to reconstruct its evolutionary history. Comparative and phylogenetic analyses show that the higher-order α-satellite structure evolved specifically in the great ape ancestor, and the centromeric region evolved with a layered symmetry, with more ancient higher-order repeats located at the periphery adjacent to monomeric α-satellites. We estimate that the mutation rate of centromeric satellite DNA is accelerated at least 2.2-fold, and this acceleration extends beyond the higher-order α-satellite into the flanking sequence.
]]></description>
<dc:creator>Logsdon, G. A.</dc:creator>
<dc:creator>Vollger, M. R.</dc:creator>
<dc:creator>Hsieh, P.</dc:creator>
<dc:creator>Mao, Y.</dc:creator>
<dc:creator>Liskovykh, M. A.</dc:creator>
<dc:creator>Koren, S.</dc:creator>
<dc:creator>Nurk, S.</dc:creator>
<dc:creator>Mercuri, L.</dc:creator>
<dc:creator>Dishuck, P. C.</dc:creator>
<dc:creator>Rhie, A.</dc:creator>
<dc:creator>de Lima, L. G.</dc:creator>
<dc:creator>Porubsky, D.</dc:creator>
<dc:creator>Bzikadze, A. V.</dc:creator>
<dc:creator>Kremitzki, M.</dc:creator>
<dc:creator>Graves-Lindsay, T. A.</dc:creator>
<dc:creator>Jain, C.</dc:creator>
<dc:creator>Hoekzema, K.</dc:creator>
<dc:creator>Murali, S. C.</dc:creator>
<dc:creator>Munson, K. M.</dc:creator>
<dc:creator>Baker, C.</dc:creator>
<dc:creator>Sorenson, M.</dc:creator>
<dc:creator>Lewis, A. M.</dc:creator>
<dc:creator>Surti, U.</dc:creator>
<dc:creator>Gerton, J. L.</dc:creator>
<dc:creator>Larionov, V.</dc:creator>
<dc:creator>Ventura, M.</dc:creator>
<dc:creator>Miga, K. H.</dc:creator>
<dc:creator>Phillippy, A. M.</dc:creator>
<dc:creator>Eichler, E. E.</dc:creator>
<dc:date>2020-09-08</dc:date>
<dc:identifier>doi:10.1101/2020.09.08.285395</dc:identifier>
<dc:title><![CDATA[The structure, function, and evolution of a complete human chromosome 8]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2020-09-08</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/485342v1?rss=1">
<title>
<![CDATA[
Co-opted transposons help perpetuate conserved higher-order chromosomal structures 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/485342v1?rss=1
</link>
<description><![CDATA[
Transposable elements (TEs) make up half of mammalian genomes and shape genome regulation by harboring binding sites for regulatory factors. These include architectural proteins--such as CTCF, RAD21 and SMC3--that are involved in tethering chromatin loops and marking domain boundaries. The 3D organization of the mammalian genome is intimately linked to its function and is remarkably conserved. However, the mechanisms by which these structural intricacies emerge and evolve have not been thoroughly probed. Here we show that TEs contribute extensively to both the formation of species-specific loops in humans and mice via deposition of novel anchoring motifs, as well as to the maintenance of conserved loops across both species via CTCF binding site turnover. The latter function demonstrates the ability of TEs to contribute to genome plasticity and reinforce conserved genome architecture as redundant loop anchors. Deleting such candidate TEs in human cells leads to a collapse of such conserved loop and domain structures. These TEs are also marked by reduced DNA methylation and bear mutational signatures of hypomethylation through evolutionary time. TEs have long been considered a source of genetic innovation; by examining their contribution to genome topology, we show that TEs can contribute to regulatory plasticity by inducing redundancy and potentiating genetic drift locally while conserving genome architecture globally, revealing a paradigm for defining regulatory conservation in the noncoding genome beyond classic sequence-level conservation.

One-sentence summary: Co-option of transposable elements maintains conserved 3D genome structures via CTCF binding site turnover in human and mouse.
]]></description>
<dc:creator>Choudhary, M. N.</dc:creator>
<dc:creator>Friedman, R. Z.</dc:creator>
<dc:creator>Wang, J. T.</dc:creator>
<dc:creator>Jang, H. S.</dc:creator>
<dc:creator>Zhuo, X.</dc:creator>
<dc:creator>Wang, T.</dc:creator>
<dc:date>2018-12-05</dc:date>
<dc:identifier>doi:10.1101/485342</dc:identifier>
<dc:title><![CDATA[Co-opted transposons help perpetuate conserved higher-order chromosomal structures]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2018-12-05</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2022.12.01.518724v1?rss=1">
<title>
<![CDATA[
The complete sequence of a human Y chromosome 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2022.12.01.518724v1?rss=1
</link>
<description><![CDATA[
The human Y chromosome has been notoriously difficult to sequence and assemble because of its complex repeat structure including long palindromes, tandem repeats, and segmental duplications [1-3]. As a result, more than half of the Y chromosome is missing from the GRCh38 reference sequence and it remains the last human chromosome to be finished [4, 5]. Here, the Telomere-to-Telomere (T2T) consortium presents the complete 62,460,029 base pair sequence of a human Y chromosome from the HG002 genome (T2T-Y) that corrects multiple errors in GRCh38-Y and adds over 30 million base pairs of sequence to the reference, revealing the complete ampliconic structures of TSPY, DAZ, and RBMY gene families; 41 additional protein-coding genes, mostly from the TSPY family; and an alternating pattern of human satellite 1 and 3 blocks in the heterochromatic Yq12 region. We have combined T2T-Y with a prior assembly of the CHM13 genome [4] and mapped available population variation, clinical variants, and functional genomics data to produce a complete and comprehensive reference sequence for all 24 human chromosomes.
]]></description>
<dc:creator>Rhie, A.</dc:creator>
<dc:creator>Nurk, S.</dc:creator>
<dc:creator>Cechova, M.</dc:creator>
<dc:creator>Hoyt, S. J.</dc:creator>
<dc:creator>Taylor, D. J.</dc:creator>
<dc:creator>Altemose, N.</dc:creator>
<dc:creator>Hook, P. W.</dc:creator>
<dc:creator>Koren, S.</dc:creator>
<dc:creator>Rautiainen, M.</dc:creator>
<dc:creator>Alexandrov, I. A.</dc:creator>
<dc:creator>Allen, J.</dc:creator>
<dc:creator>Asri, M.</dc:creator>
<dc:creator>Bzikadze, A. V.</dc:creator>
<dc:creator>Chen, N.-C.</dc:creator>
<dc:creator>Chin, C.-S.</dc:creator>
<dc:creator>Diekhans, M.</dc:creator>
<dc:creator>Flicek, P.</dc:creator>
<dc:creator>Formenti, G.</dc:creator>
<dc:creator>Fungtammasan, A.</dc:creator>
<dc:creator>Garcia Giron, C.</dc:creator>
<dc:creator>Garrison, E.</dc:creator>
<dc:creator>Gershman, A.</dc:creator>
<dc:creator>Gerton, J.</dc:creator>
<dc:creator>Grady, P. G.</dc:creator>
<dc:creator>Guarracino, A.</dc:creator>
<dc:creator>Haggerty, L.</dc:creator>
<dc:creator>Halabian, R.</dc:creator>
<dc:creator>Hansen, N. F.</dc:creator>
<dc:creator>Harris, R.</dc:creator>
<dc:creator>Hartley, G. A.</dc:creator>
<dc:creator>Harvey, W. T.</dc:creator>
<dc:creator>Haukness, M.</dc:creator>
<dc:creator>Heinz, J.</dc:creator>
<dc:creator>Hourlier, T.</dc:creator>
<dc:creator>Hubley, R. M.</dc:creator>
<dc:creator>Hunt, S. E.</dc:creator>
<dc:creator>Hwang, S.</dc:creator>
<dc:creator>Jain, M.</dc:creator>
<dc:creator>Kesharwani, R. K.</dc:creator>
<dc:creator>Lewis, A. P.</dc:creator>
<dc:creator>Li, H.</dc:creator>
<dc:creator>Logsdon, G. A.</dc:creator>
<dc:creator>Lucas, J. K.</dc:creator>
<dc:creator>Makalowski,</dc:creator>
<dc:date>2022-12-01</dc:date>
<dc:identifier>doi:10.1101/2022.12.01.518724</dc:identifier>
<dc:title><![CDATA[The complete sequence of a human Y chromosome]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2022-12-01</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2022.11.29.518374v1?rss=1">
<title>
<![CDATA[
Comparing Genomic and Epigenomic Features across Species Using the WashU Comparative Epigenome Browser 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2022.11.29.518374v1?rss=1
</link>
<description><![CDATA[
Genome browsers have become an intuitive and critical tool to visualize and analyze genomic features and data. Conventional genome browsers display data/annotations on a single reference genome/assembly; there are also genomic alignment viewer/browsers that help users visualize alignment, mismatch, and rearrangement between syntenic regions. However, there is a growing need for a comparative epigenome browser that can display genomic and epigenomic datasets across different species and enable users to compare them between syntenic regions. Here, we present the WashU Comparative Epigenome Browser (http://comparativegateway.wustl.edu). It allows users to load functional genomic datasets/annotations mapped to different genomes and display them over syntenic regions simultaneously. The browser also displays genetic differences between the genomes from single nucleotide variants (SNVs) to structural variants (SVs) to visualize the association between epigenomic differences and genetic differences. Instead of anchoring all datasets to the reference genome coordinates, it creates independent coordinates of different genome assemblies to faithfully present features and data mapped to different genomes. It uses a simple, intuitive genome-align track to illustrate the syntenic relationship between different species. It extends the widely used WashU Epigenome Browser infrastructure and can be expanded to support multiple species. This new browser function will greatly facilitate comparative genomic/epigenomic research, as well as support the recent growing needs to directly compare and benchmark the T2T CHM13 assembly and other human genome assemblies.
]]></description>
<dc:creator>Zhuo, X.</dc:creator>
<dc:creator>Hsu, S.</dc:creator>
<dc:creator>Purushotham, D.</dc:creator>
<dc:creator>Chen, S.</dc:creator>
<dc:creator>Li, D.</dc:creator>
<dc:creator>Wang, T.</dc:creator>
<dc:date>2022-12-02</dc:date>
<dc:identifier>doi:10.1101/2022.11.29.518374</dc:identifier>
<dc:title><![CDATA[Comparing Genomic and Epigenomic Features across Species Using the WashU Comparative Epigenome Browser]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2022-12-02</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2024.09.17.613505v1?rss=1">
<title>
<![CDATA[
Highly accurate assembly polishing with DeepPolisher 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2024.09.17.613505v1?rss=1
</link>
<description><![CDATA[
Accurate genome assemblies are essential for biological research, but even the highest quality assemblies retain errors caused by the technologies used to construct them. Base-level errors are typically fixed with an additional polishing step that uses reads aligned to the draft assembly to identify necessary edits. However, current methods struggle to find a balance between over- and under-polishing. Here, we present an encoder-only transformer model for assembly polishing called DeepPolisher, which predicts corrections to the underlying sequence using PacBio HiFi read alignments to a diploid assembly. Our pipeline introduces a method, PHARAOH (Phasing Reads in Areas Of Homozygosity), which uses ultra-long ONT data to ensure alignments are accurately phased and to correctly introduce heterozygous edits in falsely homozygous regions. We demonstrate that the DeepPolisher pipeline can reduce assembly errors by half, with a greater than 70% reduction in indel errors. We have applied our DeepPolisher-based pipeline to 180 assemblies from the next Human Pangenome Reference Consortium (HPRC) data release, producing an average predicted Quality Value (QV) improvement of 3.4 (54% error reduction) for the majority of the genome.
]]></description>
<dc:creator>Mastoras, M.</dc:creator>
<dc:creator>Asri, M.</dc:creator>
<dc:creator>Brambrink, L.</dc:creator>
<dc:creator>Hebbar, P.</dc:creator>
<dc:creator>Kolesnikov, A.</dc:creator>
<dc:creator>Cook, D. E.</dc:creator>
<dc:creator>Nattestad, M.</dc:creator>
<dc:creator>Lucas, J.</dc:creator>
<dc:creator>Won, T. S.</dc:creator>
<dc:creator>Chang, P.-C.</dc:creator>
<dc:creator>Carroll, A.</dc:creator>
<dc:creator>Paten, B.</dc:creator>
<dc:creator>Shafin, K.</dc:creator>
<dc:date>2024-09-19</dc:date>
<dc:identifier>doi:10.1101/2024.09.17.613505</dc:identifier>
<dc:title><![CDATA[Highly accurate assembly polishing with DeepPolisher]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2024-09-19</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2024.08.12.607489v1?rss=1">
<title>
<![CDATA[
SAFARI: Pangenome Alignment of Ancient DNA Using Purine/Pyrimidine Encodings 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2024.08.12.607489v1?rss=1
</link>
<description><![CDATA[
Aligning DNA sequences retrieved from fossils or other paleontological artifacts, referred to as ancient DNA, is particularly challenging. The sequences are short, chemical damage creates a specific pattern of substitution (C→T and G→A), and the heightened divergence between the sample and the reference genome exacerbates reference bias. This bias can be mitigated by aligning to pangenome graphs to incorporate documented organismic variation, but this approach still suffers from substitution patterns due to chemical damage. We introduce the RYmer index, a variant of the commonly used minimizer index which represents purines (A, G) and pyrimidines (C, T) as R and Y respectively. This creates an indexing scheme robust to the aforementioned chemical damage. We implemented SAFARI, an ancient DNA damage-aware version of the pangenome aligner vg giraffe which uses RYmers to rescue alignments containing deaminated seeds. We show that our approach produces more correct alignments from ancient DNA sequences than current approaches while maintaining a tolerable rate of spurious alignments. In addition, we demonstrate that our algorithm improves the estimate of the rate of ancient DNA damage, especially for highly damaged samples. Crucially, we show that this improved alignment can directly translate into better insights gained from the data by showcasing its integration with a number of extant pangenome tools.
]]></description>
<dc:creator>Rubin, J. D.</dc:creator>
<dc:creator>van Waaij, J.</dc:creator>
<dc:creator>Kraft, L. M.</dc:creator>
<dc:creator>Siren, J.</dc:creator>
<dc:creator>Sackett, P. W.</dc:creator>
<dc:creator>Renaud, G.</dc:creator>
<dc:date>2024-08-12</dc:date>
<dc:identifier>doi:10.1101/2024.08.12.607489</dc:identifier>
<dc:title><![CDATA[SAFARI: Pangenome Alignment of Ancient DNA Using Purine/Pyrimidine Encodings]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2024-08-12</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2024.08.11.607269v1?rss=1">
<title>
<![CDATA[
High-resolution global diversity copy number variation maps and association with ctyper 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2024.08.11.607269v1?rss=1
</link>
<description><![CDATA[
Copy number variant (CNV) genes are important in evolution and disease, yet sequence variation in CNV genes remains a blind spot in large-scale studies. We present ctyper, a method that leverages pangenomes to produce allele-specific copy numbers with locally phased variants from next-generation sequencing (NGS) reads. Benchmarking on 3,351 CNV genes, including HLA, SMN, and CYP2D6, and 212 challenging medically relevant (CMR) genes that are poorly mapped by NGS, ctyper captures 96.5% of phased variants with ≥99.1% correctness of copy number on CNV genes and 94.8% of phased variants on CMR genes. Applying alignment-free algorithms, ctyper requires 1.5 hours per genome on a single CPU. The results largely improve predictions of gene expression compared to known expression quantitative trait loci (eQTL) variants. Allele-specific expression quantified divergent expression on 7.94% of paralogs and tissue-specific biases on 4.68% of paralogs. We found reduced expression of SMN2 due to SMN1 conversion, potentially affecting spinal muscular atrophy, and increased expression of translocated duplications of AMY2B. Overall, ctyper enables biobank-scale genotyping of CNV and CMR genes.
]]></description>
<dc:creator>Chaisson, M.</dc:creator>
<dc:creator>Ma, W.</dc:creator>
<dc:date>2024-08-11</dc:date>
<dc:identifier>doi:10.1101/2024.08.11.607269</dc:identifier>
<dc:title><![CDATA[High-resolution global diversity copy number variation maps and association with ctyper]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2024-08-11</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2024.08.16.608331v1?rss=1">
<title>
<![CDATA[
DeepSomatic: Accurate somatic small variant discovery for multiple sequencing technologies 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2024.08.16.608331v1?rss=1
</link>
<description><![CDATA[
Somatic variant detection is an integral part of cancer genomics analysis. While most methods have focused on short-read sequencing, long-read technologies now offer potential advantages in terms of repeat mapping and variant phasing. We present DeepSomatic, a deep learning method for detecting somatic SNVs and insertions and deletions (indels) from both short-read and long-read data, with modes for whole-genome and exome sequencing and support for tumor-normal, tumor-only, and FFPE-prepared samples. To help address the dearth of publicly available training and benchmarking data for somatic variant detection, we generated and make openly available a dataset of five matched tumor-normal cell line pairs sequenced with Illumina, PacBio HiFi, and Oxford Nanopore Technologies, along with benchmark variant sets. Across samples and technologies (short-read and long-read), DeepSomatic consistently outperforms existing callers, particularly for indels.
]]></description>
<dc:creator>Park, J.</dc:creator>
<dc:creator>Cook, D. E.</dc:creator>
<dc:creator>Chang, P.-C.</dc:creator>
<dc:creator>Kolesnikov, A.</dc:creator>
<dc:creator>Brambrink, L.</dc:creator>
<dc:creator>Mier, J. C.</dc:creator>
<dc:creator>Gardner, J.</dc:creator>
<dc:creator>McNulty, B.</dc:creator>
<dc:creator>Sacco, S.</dc:creator>
<dc:creator>Keskus, A.</dc:creator>
<dc:creator>Bryant, A.</dc:creator>
<dc:creator>Ahmad, T.</dc:creator>
<dc:creator>Shetty, J.</dc:creator>
<dc:creator>Zhao, Y.</dc:creator>
<dc:creator>Tran, B.</dc:creator>
<dc:creator>Narzisi, G.</dc:creator>
<dc:creator>Helland, A.</dc:creator>
<dc:creator>Yoo, B.</dc:creator>
<dc:creator>Pushel, I.</dc:creator>
<dc:creator>Lansdon, L. A.</dc:creator>
<dc:creator>Bi, C.</dc:creator>
<dc:creator>Walter, A.</dc:creator>
<dc:creator>Gibson, M.</dc:creator>
<dc:creator>Pastinen, T.</dc:creator>
<dc:creator>Farooqi, M. S.</dc:creator>
<dc:creator>Robine, N.</dc:creator>
<dc:creator>Miga, K. H.</dc:creator>
<dc:creator>Carroll, A.</dc:creator>
<dc:creator>Kolmogorov, M.</dc:creator>
<dc:creator>Paten, B.</dc:creator>
<dc:creator>Shafin, K.</dc:creator>
<dc:date>2024-08-19</dc:date>
<dc:identifier>doi:10.1101/2024.08.16.608331</dc:identifier>
<dc:title><![CDATA[DeepSomatic: Accurate somatic small variant discovery for multiple sequencing technologies]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2024-08-19</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2024.06.11.598418v1?rss=1">
<title>
<![CDATA[
Panacus: fast and exact pangenome growth and core size estimation 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2024.06.11.598418v1?rss=1
</link>
<description><![CDATA[
Motivation: Using a single linear reference genome poses a limitation to exploring the full genomic diversity of a species. The release of a draft human pangenome underscores the increasing relevance of pangenomics to overcome these limitations. Pangenomes are commonly represented as graphs, which can represent billions of base pairs of sequence. Presently, there is a lack of scalable software able to perform key tasks on pangenomes, such as quantifying universally shared sequence across genomes (the core genome) and measuring the extent of genomic variability as a function of sample size (pangenome growth).

Results: We introduce Panacus (pangenome-abacus), a tool designed to rapidly perform these tasks and visualize the results in interactive plots. Panacus can process GFA files, the accepted standard for pangenome graphs, and is able to analyze a human pangenome graph with 110 million nodes in less than one hour.

Availability: Panacus is implemented in Rust and is published as Open Source software under the MIT license. The source code and documentation are available at https://github.com/marschall-lab/panacus. Panacus can be installed via Bioconda at https://bioconda.github.io/recipes/panacus/README.html.

Contact: Luca Parmigiani (luca.parmigiani@uni-bielefeld.de), Daniel Doerr (daniel.doerr@hhu.de).
]]></description>
<dc:creator>Parmigiani, L.</dc:creator>
<dc:creator>Garrison, E.</dc:creator>
<dc:creator>Stoye, J.</dc:creator>
<dc:creator>Marschall, T.</dc:creator>
<dc:creator>Doerr, D.</dc:creator>
<dc:date>2024-06-12</dc:date>
<dc:identifier>doi:10.1101/2024.06.11.598418</dc:identifier>
<dc:title><![CDATA[Panacus: fast and exact pangenome growth and core size estimation]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2024-06-12</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2024.05.18.594796v1?rss=1">
<title>
<![CDATA[
Telomere-to-telomere phased genome assembly using error-corrected Simplex nanopore reads 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2024.05.18.594796v1?rss=1
</link>
<description><![CDATA[
Telomere-to-telomere phased assemblies have become the norm in genomics. To achieve these for diploid and even polyploid genomes, the contemporary approach involves a combination of two long-read sequencing technologies: high-accuracy long reads, e.g. Pacific Biosciences (PacBio) HiFi or Oxford Nanopore (ONT) Duplex reads, and ultra-long ONT Simplex reads. Using two different technologies increases the cost and the required amount of genomic DNA. Here, we show that comparable results are possible using error correction of ultra-long ONT Simplex reads and then assembling them using state-of-the-art de novo assembly methods. To achieve this, we have developed the deep learning-based HERRO framework, which corrects ONT Simplex reads while carefully preserving differences in related genomic sequences. Taking into account informative positions that differentiate the haplotypes or genomic repeat copies, HERRO achieves an increase of read accuracy of up to 100-fold for diploid human genomes. By combining HERRO with the Verkko assembler, we achieve high contiguity on several human genomes by reconstructing many chromosomes telomere-to-telomere, including chromosomes X and Y. HERRO supports both R9.4.1 and R10.4.1 ONT Simplex reads and generalizes well to other species. These results provide an opportunity to reduce the cost of genome sequencing and use corrected ONT reads to analyze more complex genomes with different levels of ploidy or even aneuploidy.
]]></description>
<dc:creator>Stanojevic, D.</dc:creator>
<dc:creator>Lin, D.</dc:creator>
<dc:creator>Florez De Sessions, P.</dc:creator>
<dc:creator>Sikic, M.</dc:creator>
<dc:date>2024-05-21</dc:date>
<dc:identifier>doi:10.1101/2024.05.18.594796</dc:identifier>
<dc:title><![CDATA[Telomere-to-telomere phased genome assembly using error-corrected Simplex nanopore reads]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2024-05-21</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2024.03.15.585294v1?rss=1">
<title>
<![CDATA[
Gapless assembly of complete human and plant chromosomes using only nanopore sequencing 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2024.03.15.585294v1?rss=1
</link>
<description><![CDATA[
The combination of ultra-long Oxford Nanopore (ONT) sequencing reads with long, accurate PacBio HiFi reads has enabled the completion of a human genome and spurred similar efforts to complete the genomes of many other species. However, this approach for complete, "telomere-to-telomere" genome assembly relies on multiple sequencing platforms, limiting its accessibility.

ONT "Duplex" sequencing reads, where both strands of the DNA are read to improve quality, promise high per-base accuracy. To evaluate this new data type, we generated ONT Duplex data for three widely-studied genomes: human HG002, Solanum lycopersicum Heinz 1706 (tomato), and Zea mays B73 (maize). For the diploid, heterozygous HG002 genome, we also used "Pore-C" chromatin contact mapping to completely phase the haplotypes.

We found the accuracy of Duplex data to be similar to HiFi sequencing, but with read lengths tens of kilobases longer, and the Pore-C data to be compatible with existing diploid assembly algorithms. This combination of read length and accuracy enables the construction of a high-quality initial assembly, which can then be further resolved using the ultra-long reads, and finally phased into chromosome-scale haplotypes with Pore-C. The resulting assemblies have a base accuracy exceeding 99.999% (Q50) and near-perfect continuity, with most chromosomes assembled as single contigs. We conclude that ONT sequencing is a viable alternative to HiFi sequencing for de novo genome assembly, and has the potential to provide a single-instrument solution for the reconstruction of complete genomes.
]]></description>
<dc:creator>Koren, S.</dc:creator>
<dc:creator>Bao, Z.</dc:creator>
<dc:creator>Guarracino, A.</dc:creator>
<dc:creator>Ou, S.</dc:creator>
<dc:creator>Goodwin, S.</dc:creator>
<dc:creator>Jenike, K. M.</dc:creator>
<dc:creator>Lucas, J.</dc:creator>
<dc:creator>McNulty, B.</dc:creator>
<dc:creator>Park, J.</dc:creator>
<dc:creator>Rautiainen, M.</dc:creator>
<dc:creator>Rhie, A.</dc:creator>
<dc:creator>Roelofs, D.</dc:creator>
<dc:creator>Schneiders, H.</dc:creator>
<dc:creator>Vrijenhoek, I.</dc:creator>
<dc:creator>Nijbroek, K.</dc:creator>
<dc:creator>Ware, D.</dc:creator>
<dc:creator>Schatz, M. C.</dc:creator>
<dc:creator>Garrison, E.</dc:creator>
<dc:creator>Huang, S.</dc:creator>
<dc:creator>McCombie, W. R.</dc:creator>
<dc:creator>Miga, K. H.</dc:creator>
<dc:creator>Wittenberg, A. H. J.</dc:creator>
<dc:creator>Phillippy, A. M.</dc:creator>
<dc:date>2024-03-17</dc:date>
<dc:identifier>doi:10.1101/2024.03.15.585294</dc:identifier>
<dc:title><![CDATA[Gapless assembly of complete human and plant chromosomes using only nanopore sequencing]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2024-03-17</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2024.01.20.576452v1?rss=1">
<title>
<![CDATA[
Full resolution HLA and KIR genes annotation for human genome assemblies 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2024.01.20.576452v1?rss=1
</link>
<description><![CDATA[
The HLA (Human Leukocyte Antigen) genes and the KIR (Killer cell Immunoglobulin-like Receptor) genes are critical to immune responses and are associated with many immune-related diseases. Located in highly polymorphic regions, they are hard to study with traditional short-read alignment-based methods. Although modern long-read assemblers can often assemble these genes, using existing tools to annotate HLA and KIR genes in these assemblies remains a non-trivial task. Here, we describe Immuannot, a new computational tool to annotate the gene structures of HLA and KIR genes and to type the allele of each gene. Applying Immuannot to 56 regional and 212 whole-genome assemblies from previous studies, we annotated 9,931 HLA and KIR genes and found that almost half of these genes, 4,068, had novel sequences compared to the current Immuno Polymorphism Database (IPD). These novel gene sequences were represented by 2,664 distinct alleles, some of which contained non-synonymous variations resulting in 92 novel protein sequences. We demonstrated the complex haplotype structures at the two loci and reported the linkage between HLA/KIR haplotypes and gene alleles. We anticipate that Immuannot will speed up the discovery of new HLA/KIR alleles and enable the association of HLA/KIR haplotype structures with clinical outcomes in the future.
]]></description>
<dc:creator>Zhou, Y.</dc:creator>
<dc:creator>Song, L.</dc:creator>
<dc:creator>Li, H.</dc:creator>
<dc:date>2024-01-23</dc:date>
<dc:identifier>doi:10.1101/2024.01.20.576452</dc:identifier>
<dc:title><![CDATA[Full resolution HLA and KIR genes annotation for human genome assemblies]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2024-01-23</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2023.11.01.565049v1?rss=1">
<title>
<![CDATA[
The complete human diploid reference genome of RPE-1 identifies the phased epigenetic landscapes from multi-omics data 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2023.11.01.565049v1?rss=1
</link>
<description><![CDATA[
Comparative analysis of recent human genome assemblies highlights profound sequence divergence that peaks within polymorphic loci such as centromeres. This raises questions about the adequacy of relying on human reference genomes to accurately analyze sequencing data derived from experimental cell lines. Here, we generated the complete diploid genome assembly for the human retinal epithelial cells (RPE-1), a widely used non-cancer laboratory cell line with a stable karyotype, to use as a matched reference for multi-omics sequencing data analysis. Our RPE1v1.0 assembly presents completely phased haplotypes and chromosome-level scaffolds that span centromeres with ultra-high base accuracy (>QV60). We mapped the haplotype-specific genomic variation of this cell line, including t(Xq;10q), a stable 73.18 Mb duplication of chromosome 10 translocated onto the microdeleted chromosome X telomere. Polymorphisms between haplotypes of the same genome reveal genetic and epigenetic variation for all chromosomes, especially at centromeres. Using the RPE-1 assembly as a matched reference genome improves the mapping quality of multi-omics reads originating from RPE-1 cells, with a drastic reduction in alignment mismatches compared to using the most complete human reference to date (CHM13). Leveraging the accuracy achieved using a matched reference, we were able to identify the kinetochore sites at base pair resolution and show unprecedented variation between haplotypes. This work showcases the use of matched reference genomes for multi-omics analyses and serves as the foundation for a call to comprehensively assemble experimentally relevant cell lines for widespread application.

Highlights:
- We generated the complete phased genome assembly of one of the most widely used non-cancer cell lines (RPE-1) with a stable diploid karyotype
- We used this genome as a matched reference to analyze sequencing data from RPE-1
- Mapping to the RPE1v1.0 genome improves alignment quality, faithful assignment of reads to each haplotype, and epigenome peak calling accuracy uncovering inter-haplotype variation
- Use of the matched reference genome enables epigenetic precision in identifying for the first time the kinetochore site at base pair resolution for each haplotype
- The RPE-1 genome represents a new telomere-to-telomere (T2T) human diploid reference for the scientific community that will advance genetic and epigenetic research across fields using this cell line
]]></description>
<dc:creator>Volpe, E.</dc:creator>
<dc:creator>Corda, L.</dc:creator>
<dc:creator>Di Tommaso, E.</dc:creator>
<dc:creator>Pelliccia, F.</dc:creator>
<dc:creator>Ottalevi, R.</dc:creator>
<dc:creator>Licastro, D.</dc:creator>
<dc:creator>Formenti, G.</dc:creator>
<dc:creator>Capulli, M.</dc:creator>
<dc:creator>Guarracino, A.</dc:creator>
<dc:creator>Tassone, E.</dc:creator>
<dc:creator>Giunta, S.</dc:creator>
<dc:date>2023-11-03</dc:date>
<dc:identifier>doi:10.1101/2023.11.01.565049</dc:identifier>
<dc:title><![CDATA[The complete human diploid reference genome of RPE-1 identifies the phased epigenetic landscapes from multi-omics data]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2023-11-03</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2023.12.13.571553v1?rss=1">
<title>
<![CDATA[
Personalized Pangenome References 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2023.12.13.571553v1?rss=1
</link>
<description><![CDATA[
Pangenomes, by including genetic diversity, should reduce reference bias by better representing the new samples that are compared to them. Yet when comparing a new sample to a pangenome, variants in the pangenome that are not part of the sample can be misleading, for example, causing false read mappings. These irrelevant variants are generally rarer in terms of allele frequency, and have previously been dealt with using allele frequency filters. However, this is a blunt heuristic that both fails to remove some irrelevant variants and removes many relevant variants. We propose a new approach, inspired by local ancestry inference methods, that imputes a personalized pangenome subgraph based on sampling local haplotypes according to k-mer counts in the reads. Our approach is tailored for the Giraffe short read aligner, as the indexes it needs for read mapping can be built quickly. We compare the accuracy of our approach to state-of-the-art methods using graphs from the Human Pangenome Reference Consortium. The resulting personalized pangenome pipelines provide faster pangenome read mapping than comparable pipelines that use a linear reference, reduce small variant genotyping errors by 4x relative to the Genome Analysis Toolkit (GATK) best-practice pipeline, and for the first time make short-read structural variant genotyping competitive with long-read discovery methods.
]]></description>
<dc:creator>Siren, J.</dc:creator>
<dc:creator>Eskandar, P.</dc:creator>
<dc:creator>Ungaro, M. T.</dc:creator>
<dc:creator>Hickey, G.</dc:creator>
<dc:creator>Eizenga, J. M.</dc:creator>
<dc:creator>Novak, A. M.</dc:creator>
<dc:creator>Chang, X.</dc:creator>
<dc:creator>Chang, P.-C.</dc:creator>
<dc:creator>Kolmogorov, M.</dc:creator>
<dc:creator>Carroll, A.</dc:creator>
<dc:creator>Monlong, J.</dc:creator>
<dc:creator>Paten, B.</dc:creator>
<dc:date>2023-12-14</dc:date>
<dc:identifier>doi:10.1101/2023.12.13.571553</dc:identifier>
<dc:title><![CDATA[Personalized Pangenome References]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2023-12-14</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2023.11.30.569101v1?rss=1">
<title>
<![CDATA[
Neotelomeres and Telomere-Spanning Chromosomal Arm Fusions in Cancer Genomes Revealed by Long-Read Sequencing 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2023.11.30.569101v1?rss=1
</link>
<description><![CDATA[
Alterations in the structure and location of telomeres are key events in cancer genome evolution. However, previous genomic approaches, unable to span long telomeric repeat arrays, could not characterize the nature of these alterations. Here, we applied both long-read and short-read genome sequencing to assess telomere repeat-containing structures in cancers and cancer cell lines. Using long-read genome sequences that span telomeric repeat arrays, we defined four types of telomere repeat variations in cancer cells: neotelomeres where telomere addition heals chromosome breaks, chromosomal arm fusions spanning telomere repeats, fusions of neotelomeres, and peri-centromeric fusions with adjoined telomere and centromere repeats. Analysis of lung adenocarcinoma genome sequences identified somatic neotelomere and telomere-spanning fusion alterations. These results provide a framework for systematic study of telomeric repeat arrays in cancer genomes, that could serve as a model for understanding the somatic evolution of other repetitive genomic elements.
]]></description>
<dc:creator>Tan, K.-T.</dc:creator>
<dc:creator>Slevin, M. K.</dc:creator>
<dc:creator>Leibowitz, M. L.</dc:creator>
<dc:creator>Garrity-Janger, M.</dc:creator>
<dc:creator>Li, H.</dc:creator>
<dc:creator>Meyerson, M.</dc:creator>
<dc:date>2023-12-01</dc:date>
<dc:identifier>doi:10.1101/2023.11.30.569101</dc:identifier>
<dc:title><![CDATA[Neotelomeres and Telomere-Spanning Chromosomal Arm Fusions in Cancer Genomes Revealed by Long-Read Sequencing]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2023-12-01</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2023.09.07.556731v1?rss=1">
<title>
<![CDATA[
Local read haplotagging enables accurate long-read small variant calling 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2023.09.07.556731v1?rss=1
</link>
<description><![CDATA[
Long-read sequencing technology has enabled variant detection in difficult-to-map regions of the genome and enabled rapid genetic diagnosis in clinical settings. Rapidly evolving third-generation sequencing platforms like Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) are introducing newer platforms and data types. It has been demonstrated that variant calling methods based on deep neural networks can use local haplotyping information with long reads to improve genotyping accuracy. However, using local haplotype information creates an overhead, as variant calling needs to be performed multiple times, which ultimately makes it difficult to extend to new data types and platforms as they are introduced. In this work, we have developed a local haplotype approximation method that enables state-of-the-art variant calling performance with multiple sequencing platforms, including the PacBio Revio system and ONT R10.4 simplex and duplex data. This addition of local haplotype approximation makes DeepVariant a universal variant calling solution for long-read sequencing platforms.
]]></description>
<dc:creator>Kolesnikov, A.</dc:creator>
<dc:creator>Cook, D. E.</dc:creator>
<dc:creator>Nattestad, M.</dc:creator>
<dc:creator>Ashley, E. A.</dc:creator>
<dc:creator>Gorzynski, J.</dc:creator>
<dc:creator>Goenka, S. D.</dc:creator>
<dc:creator>Jain, M.</dc:creator>
<dc:creator>McNulty, B.</dc:creator>
<dc:creator>Miga, K. H.</dc:creator>
<dc:creator>Paten, B.</dc:creator>
<dc:creator>Chang, P.-C.</dc:creator>
<dc:creator>Carroll, A.</dc:creator>
<dc:creator>Shafin, K.</dc:creator>
<dc:date>2023-09-12</dc:date>
<dc:identifier>doi:10.1101/2023.09.07.556731</dc:identifier>
<dc:title><![CDATA[Local read haplotagging enables accurate long-read small variant calling]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2023-09-12</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2023.06.05.543788v1?rss=1">
<title>
<![CDATA[
Evaluation of haplotype-aware long-read error correction with hifieval 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2023.06.05.543788v1?rss=1
</link>
<description><![CDATA[
Summary: The PacBio High-Fidelity (HiFi) sequencing technology produces long reads with >99% accuracy. It has enabled the development of a new generation of de novo sequence assemblers, which all have sequencing error correction as the first step. As HiFi is a new data type, this critical step has not been evaluated before. Here, we introduce hifieval, a new command-line tool for measuring over- and under-corrections produced by error correction algorithms. We assessed the accuracy of the error correction components of existing HiFi assemblers on the CHM13 and the HG002 datasets and further investigated the performance of error correction methods in challenging regions such as homopolymer regions, centromeric regions, and segmental duplications. Hifieval will help HiFi assemblers to improve error correction and assembly quality in the long run.

Availability and implementation: The source code is available at https://github.com/magspho/hifieval

Contact: hli@ds.dfci.harvard.edu

Supplementary information: Supplementary data are available at Bioinformatics online.
]]></description>
<dc:creator>Guo, Y.</dc:creator>
<dc:creator>Feng, X.</dc:creator>
<dc:creator>Li, H.</dc:creator>
<dc:date>2023-06-07</dc:date>
<dc:identifier>doi:10.1101/2023.06.05.543788</dc:identifier>
<dc:title><![CDATA[Evaluation of haplotype-aware long-read error correction with hifieval]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2023-06-07</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2023.05.15.540856v1?rss=1">
<title>
<![CDATA[
A comprehensive catalog of 3D genome organization in diverse human genomes facilitates understanding of the impact of structural variation on chromatin structure 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2023.05.15.540856v1?rss=1
</link>
<description><![CDATA[
The human genome is packaged into the three-dimensional (3D) nucleus and organized into functional units known as topologically associating domains (TADs) and chromatin loops. Recent studies show that the 3D genome can be modified by genome structural variants (SVs) through disrupting higher-order chromatin organizations such as TADs, which play an essential role in insulating genes from aberrant regulation by regulatory elements outside TADs. Here, we have developed an integrative Hi-C analysis pipeline to generate a comprehensive catalog of TADs, TAD boundaries, and loops in human genomes to fill the gap of limited resources. We identified 2,293 TADs and 6,810 sub-TADs missing in the previously released TADs of GM12878. We then quantified the impact of SVs overlapping with TAD boundaries and observed that two SVs could significantly alter chromatin architecture leading to abnormal expression and splicing of genes associated with human diseases.
]]></description>
<dc:creator>Li, C.</dc:creator>
<dc:creator>Bonder, M. J.</dc:creator>
<dc:creator>Syed, S.</dc:creator>
<dc:creator>Human Genome Structural Variation Consortium (HGSVC)</dc:creator>
<dc:creator>HGSVC Functional Analysis Working Group</dc:creator>
<dc:creator>Zody, M. C.</dc:creator>
<dc:creator>Chaisson, M. J. P.</dc:creator>
<dc:creator>Talkowski, M. E.</dc:creator>
<dc:creator>Marschall, T.</dc:creator>
<dc:creator>Korbel, J. O.</dc:creator>
<dc:creator>Eichler, E. E.</dc:creator>
<dc:creator>Lee, C.</dc:creator>
<dc:creator>Shi, X.</dc:creator>
<dc:date>2023-05-15</dc:date>
<dc:identifier>doi:10.1101/2023.05.15.540856</dc:identifier>
<dc:title><![CDATA[A comprehensive catalog of 3D genome organization in diverse human genomes facilitates understanding of the impact of structural variation on chromatin structure]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2023-05-15</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2023.05.04.539448v1?rss=1">
<title>
<![CDATA[
Whole-genome long-read sequencing downsampling and its effect on variant calling precision and recall 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2023.05.04.539448v1?rss=1
</link>
<description><![CDATA[
Advances in long-read sequencing (LRS) technology continue to make whole-genome sequencing more complete, affordable, and accurate. LRS provides significant advantages over short-read sequencing approaches, including phased de novo genome assembly, access to previously excluded genomic regions, and discovery of more complex structural variants (SVs) associated with disease. Limitations remain with respect to cost, scalability, and platform-dependent read accuracy, and the tradeoffs between sequence coverage and sensitivity of variant discovery are important experimental considerations for the application of LRS. We compare the genetic variant calling precision and recall of Oxford Nanopore Technologies (ONT) and PacBio HiFi platforms over a range of sequence coverages. For read-based applications, LRS sensitivity begins to plateau around 12-fold coverage with a majority of variants called with reasonable accuracy (F1 score above 0.5), and both platforms perform well for SV detection. Genome assembly increases variant calling precision and recall of SVs and indels in HiFi datasets, with HiFi outperforming ONT in quality as measured by the F1 score of assembly-based variant callsets. While both technologies continue to evolve, our work offers guidance to design cost-effective experimental strategies that do not compromise on discovering novel biology.
]]></description>
<dc:creator>Harvey, W. T.</dc:creator>
<dc:creator>Ebert, P.</dc:creator>
<dc:creator>Ebler, J.</dc:creator>
<dc:creator>Audano, P. A.</dc:creator>
<dc:creator>Munson, K. M.</dc:creator>
<dc:creator>Hoekzema, K.</dc:creator>
<dc:creator>Porubsky, D. E.</dc:creator>
<dc:creator>Beck, C. R.</dc:creator>
<dc:creator>Marschall, T. R.</dc:creator>
<dc:creator>Garimella, K. V.</dc:creator>
<dc:creator>Eichler, E. E.</dc:creator>
<dc:date>2023-05-04</dc:date>
<dc:identifier>doi:10.1101/2023.05.04.539448</dc:identifier>
<dc:title><![CDATA[Whole-genome long-read sequencing downsampling and its effect on variant calling precision and recall]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2023-05-04</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/580159v1?rss=1">
<title>
<![CDATA[
A haplotype-aware de novo assembly of related individuals using pedigree graph 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/580159v1?rss=1
</link>
<description><![CDATA[
Motivation: Reconstructing high-quality haplotype-resolved assemblies for related individuals of various species has important applications in understanding Mendelian diseases along with evolutionary and comparative genomics. Through major genomics sequencing efforts such as the Personal Genome Project, the Vertebrate Genome Project (VGP), the Earth Biogenome Project (EBP) and the Genome in a Bottle project (GIAB), a variety of sequencing datasets from mother-father-child trios of various diploid species are becoming available.

Current trio assembly approaches are not designed to incorporate long-read sequencing data from parents in a trio, and therefore require relatively high coverages of costly long-read data to produce high-quality assemblies. Thus, building a trio-aware assembler capable of producing accurate and chromosomal-scale diploid genomes in a pedigree, while being cost-effective in terms of sequencing costs, is a pressing need of the genomics community.

Results: We present a novel pedigree-graph-based approach to diploid assembly using accurate Illumina data and long-read Pacific Biosciences (PacBio) data from all related individuals, thereby generalizing our previous work on single individuals. We demonstrate the effectiveness of our pedigree approach on a simulated trio of pseudo-diploid yeast genomes with different heterozygosity rates, and real data from Arabidopsis thaliana. We show that we require as little as 30x coverage Illumina data and 15x PacBio data from each individual in a trio to generate chromosomal-scale phased assemblies. Additionally, we show that we can detect and phase variants from generated phased assemblies.

Availability: https://github.com/shilpagarg/WHdenovo

Contact: shilpa_garg@hms.harvard.edu, gchurch@genetics.med.harvard.edu
]]></description>
<dc:creator>Garg, S.</dc:creator>
<dc:creator>Aach, J.</dc:creator>
<dc:creator>Li, H.</dc:creator>
<dc:creator>Durbin, R.</dc:creator>
<dc:creator>Church, G.</dc:creator>
<dc:date>2019-03-17</dc:date>
<dc:identifier>doi:10.1101/580159</dc:identifier>
<dc:title><![CDATA[A haplotype-aware de novo assembly of related individuals using pedigree graph]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2019-03-17</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2022.12.17.520860v1?rss=1">
<title>
<![CDATA[
A refined characterization of large-scale genomic differences in the first complete human genome 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2022.12.17.520860v1?rss=1
</link>
<description><![CDATA[
The first telomere-to-telomere (T2T) human genome assembly (T2T-CHM13) release was a milestone in human genomics. The T2T-CHM13 genome assembly extends our understanding of telomeres, centromeres, segmental duplication, and other complex regions. The current human genome reference (GRCh38) has been widely used in various human genomic studies. However, the large-scale genomic differences between these two important genome assemblies have not yet been characterized in detail. Here, we identify 590 discrepant regions (~226 Mbp) in total. In addition to the previously reported non-syntenic regions, we identify 67 additional large-scale discrepant regions and precisely categorize them into four structural types with a newly developed website tool (SynPlotter). The discrepant regions (~20.4 Mbp) excluding telomeric and centromeric regions are highly structurally polymorphic in humans, where copy number variations are likely associated with various human diseases and disease susceptibility, such as immune and neurodevelopmental disorders. Analysis of a newly identified discrepant region, the KLRC gene cluster, shows that the depletion of KLRC2 by a single deletion event is associated with natural killer cell differentiation in ~20% of humans. Meanwhile, the rapid amino acid replacements within KLRC3 are consistent with the action of natural selection during primate evolution. Our study furthers our understanding of the large-scale structural variation differences between these two crucial human reference genomes and supports future interpretation of studies of human genetic variation.
]]></description>
<dc:creator>Yang, X.</dc:creator>
<dc:creator>Wang, X.</dc:creator>
<dc:creator>Zou, Y.</dc:creator>
<dc:creator>Zhang, S.</dc:creator>
<dc:creator>Xia, M.</dc:creator>
<dc:creator>Vollger, M. R.</dc:creator>
<dc:creator>Chen, N.-C.</dc:creator>
<dc:creator>Taylor, D. J.</dc:creator>
<dc:creator>Harvey, W. T.</dc:creator>
<dc:creator>Logsdon, G. A.</dc:creator>
<dc:creator>Meng, D.</dc:creator>
<dc:creator>Shi, J.</dc:creator>
<dc:creator>McCoy, R. C.</dc:creator>
<dc:creator>Schatz, M. C.</dc:creator>
<dc:creator>Li, W.</dc:creator>
<dc:creator>Eichler, E. E.</dc:creator>
<dc:creator>Lu, Q.</dc:creator>
<dc:creator>Mao, Y.</dc:creator>
<dc:date>2022-12-19</dc:date>
<dc:identifier>doi:10.1101/2022.12.17.520860</dc:identifier>
<dc:title><![CDATA[A refined characterization of large-scale genomic differences in the first complete human genome]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2022-12-19</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2023.09.22.559047v1?rss=1">
<title>
<![CDATA[
BAllC and BAllCools: Efficient Formatting and Operating for Single-Cell DNA Methylation Data 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2023.09.22.559047v1?rss=1
</link>
<description><![CDATA[
Motivation: With single-cell DNA methylation studies yielding vast datasets, existing data formats struggle with the unique challenges of storage and efficient operations, highlighting a need for improved solutions.

Results: BAllC (Binary All Cytosines) emerges as a tailored binary format for methylation data, addressing these challenges. BAllCools, its complementary software toolkit, enhances parsing, indexing, and querying capabilities, promising superior operational speeds and reduced storage needs.

Availability: https://github.com/jksr/ballcools

Contact: ecker@salk.edu

Supplementary information: Supplementary data are available at Bioinformatics online.
]]></description>
<dc:creator>Tian, W.</dc:creator>
<dc:creator>Ding, W.</dc:creator>
<dc:creator>Ecker, J. R.</dc:creator>
<dc:date>2023-09-25</dc:date>
<dc:identifier>doi:10.1101/2023.09.22.559047</dc:identifier>
<dc:title><![CDATA[BAllC and BAllCools: Efficient Formatting and Operating for Single-Cell DNA Methylation Data]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2023-09-25</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2023.04.05.535718v1?rss=1">
<title>
<![CDATA[
Building pangenome graphs 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2023.04.05.535718v1?rss=1
</link>
<description><![CDATA[
Pangenome graphs can represent all variation between multiple reference genomes, but current approaches to build them exclude complex sequences or are based upon a single reference. In response, we developed the PanGenome Graph Builder (PGGB), a pipeline for constructing pangenome graphs without bias or exclusion. PGGB uses all-to-all alignments to build a variation graph in which we can identify variation, measure conservation, detect recombination events, and infer phylogenetic relationships.
]]></description>
<dc:creator>Garrison, E.</dc:creator>
<dc:creator>Guarracino, A.</dc:creator>
<dc:creator>Heumos, S.</dc:creator>
<dc:creator>Villani, F.</dc:creator>
<dc:creator>Bao, Z.</dc:creator>
<dc:creator>Tattini, L.</dc:creator>
<dc:creator>Hagmann, J.</dc:creator>
<dc:creator>Vorbrugg, S.</dc:creator>
<dc:creator>Marco-Sola, S.</dc:creator>
<dc:creator>Kubica, C.</dc:creator>
<dc:creator>Ashbrook, D. G.</dc:creator>
<dc:creator>Thorell, K.</dc:creator>
<dc:creator>Rusholme-Pilcher, R. L.</dc:creator>
<dc:creator>Liti, G.</dc:creator>
<dc:creator>Rudbeck, E.</dc:creator>
<dc:creator>Nahnsen, S.</dc:creator>
<dc:creator>Yang, Z.</dc:creator>
<dc:creator>Moses, M. N.</dc:creator>
<dc:creator>Nobrega, F. L.</dc:creator>
<dc:creator>Wu, Y.</dc:creator>
<dc:creator>Chen, H.</dc:creator>
<dc:creator>de Ligt, J.</dc:creator>
<dc:creator>Sudmant, P. H.</dc:creator>
<dc:creator>Soranzo, N.</dc:creator>
<dc:creator>Colonna, V.</dc:creator>
<dc:creator>Williams, R. W.</dc:creator>
<dc:creator>Prins, P.</dc:creator>
<dc:date>2023-04-06</dc:date>
<dc:identifier>doi:10.1101/2023.04.05.535718</dc:identifier>
<dc:title><![CDATA[Building pangenome graphs]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2023-04-06</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2024.11.22.623804v1?rss=1">
<title>
<![CDATA[
Characterizing cytosine methylation of polymorphic human transposable element insertions using human pangenome resources 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2024.11.22.623804v1?rss=1
</link>
<description><![CDATA[
Cytosine methylation, a crucial epigenetic modification, plays a vital role in genomic regulation. Leveraging the advancements in third-generation sequencing, we investigated the methylation patterns of non-reference insertions in human lymphoblastoid cell lines (LCLs), particularly polymorphic transposable elements (TEs). We validated the high concordance between long-read methylation calls and the conventional whole genome bisulfite sequencing (WGBS) method. By characterizing thousands of polymorphic TE insertions genome-wide using long reads from the draft Human Pangenome Reference, we aimed to establish general rules of TE methylation by addressing three key questions: 1) what is the methylation profile of each insertion? 2) do newly inserted TEs adopt the methylation pattern of their genomic context? and 3) do new TE insertions affect the methylation of their flanking regions? While most non-TE insertions exhibit DNA methylation patterns consistent with their genomic context, TE insertions are generally highly methylated, exhibiting distinct, class-specific patterns with profound variation within TE bodies. A small percentage of Alu insertions are hypomethylated, particularly those inserted within hypomethylated CpG islands. By comparing DNA methylation of the flanking regions of TE insertions between individuals with and without the insertions, we found that the majority of TEs had minimal impact on nearby regions, although numerous exceptions exist where the methylation status of both L1 and Alu insertions "leaks" into nearby regions, leading to either methylation spreading or hypomethylation sloping shores. In conclusion, we demonstrated the methylation calling capability of third-generation sequencing and its unique advantage in characterizing epigenomic features within non-reference positions.
While TE insertions primarily exhibit methylation patterns restricted within their boundaries, some TEs are able to engage in complex, context-dependent interactions with their genomic neighborhood.
]]></description>
<dc:creator>Zhuo, X.</dc:creator>
<dc:creator>Tomlinson, C.</dc:creator>
<dc:creator>Belter, E. A.</dc:creator>
<dc:creator>Kuntala, P. K.</dc:creator>
<dc:creator>Saintilnord, W. N.</dc:creator>
<dc:creator>Lindsay, T.</dc:creator>
<dc:creator>Macias, J.</dc:creator>
<dc:creator>Fulton, R. S.</dc:creator>
<dc:creator>Wang, T.</dc:creator>
<dc:date>2024-11-23</dc:date>
<dc:identifier>doi:10.1101/2024.11.22.623804</dc:identifier>
<dc:title><![CDATA[Characterizing cytosine methylation of polymorphic human transposable element insertions using human pangenome resources]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2024-11-23</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2020.10.23.349621v1?rss=1">
<title>
<![CDATA[
Complete and haplotype-specific sequence assembly of segmental duplication-mediated genome rearrangements using targeted CRISPR-targeted ultra-long read sequencing (CTLR-Seq) 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2020.10.23.349621v1?rss=1
</link>
<description><![CDATA[
We have developed a generally applicable method based on CRISPR/Cas9-targeted ultra-long read sequencing (CTLR-Seq) to completely and haplotype-specifically resolve, at base-pair resolution, large, complex, and highly repetitive genomic regions that were previously impenetrable to next-generation sequencing analysis, such as large segmental duplication (SegDup) regions and their associated genome rearrangements that stretch hundreds of kilobases. Our method combines in vitro Cas9-mediated cutting of the genome and pulsed-field gel electrophoresis to haplotype-specifically isolate intact large (200-550 kb) target regions that encompass previously unresolvable genomic sequences. These target fragments are then sequenced (amplification-free) to produce ultra-long reads at up to 40x on-target coverage using Oxford Nanopore technology, allowing for the complete assembly of the complex genomic regions of interest at single base-pair resolution. We applied CTLR-Seq to resolve the exact sequence of SegDup rearrangements that constitute the boundary regions of the 22q11.2 deletion CNV and of the 16p11.2 deletion and duplication CNVs. These CNVs are among the strongest known risk factors for schizophrenia and autism. We then performed de novo assembly to resolve, for the first time, at single base-pair resolution, the sequence rearrangements of the 22q11.2 and 16p11.2 CNVs, mapping out exactly the genes and non-coding regions that are affected by the CNV for different carriers.
]]></description>
<dc:creator>Zhou, B.</dc:creator>
<dc:creator>Shin, G.</dc:creator>
<dc:creator>Greer, S. U.</dc:creator>
<dc:creator>Vervoort, L.</dc:creator>
<dc:creator>Huang, Y.</dc:creator>
<dc:creator>Pattni, R.</dc:creator>
<dc:creator>Ho, M.</dc:creator>
<dc:creator>Wong, W. H.</dc:creator>
<dc:creator>Vermeesch, J. R.</dc:creator>
<dc:creator>Ji, H.</dc:creator>
<dc:creator>Urban, A. E.</dc:creator>
<dc:date>2020-10-23</dc:date>
<dc:identifier>doi:10.1101/2020.10.23.349621</dc:identifier>
<dc:title><![CDATA[Complete and haplotype-specific sequence assembly of segmental duplication-mediated genome rearrangements using targeted CRISPR-targeted ultra-long read sequencing (CTLR-Seq)]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2020-10-23</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2023.03.09.531574v1?rss=1">
<title>
<![CDATA[
Evolutionary constraint and innovation across hundreds of placental mammals 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2023.03.09.531574v1?rss=1
</link>
<description><![CDATA[
Evolutionary constraint and acceleration are powerful, cell-type agnostic measures of functional importance. Previous studies in mammals were limited by species number and reliance on human-referenced alignments. We explore the evolution of placental mammals, including humans, through reference-free whole-genome alignment of 240 species and protein-coding alignments for 428 species. We estimate 10.7% of the human genome is evolutionarily constrained. We resolve constraint to single nucleotides, pinpointing functional positions, and refine and expand by over seven-fold the catalog of ultraconserved elements. Overall, 48.5% of constrained bases are as yet unannotated, suggesting yet-to-be-discovered functional importance. Using species-level phenotypes and an updated phylogeny, we associate coding and regulatory variation with olfaction and hibernation. Focusing on biodiversity conservation, we identify genomic metrics that predict species at risk of extinction.
]]></description>
<dc:creator>Christmas, M. J.</dc:creator>
<dc:creator>Kaplow, I. M.</dc:creator>
<dc:creator>Genereux, D. P.</dc:creator>
<dc:creator>Dong, M. X.</dc:creator>
<dc:creator>Hughes, G. M.</dc:creator>
<dc:creator>Li, X.</dc:creator>
<dc:creator>Sullivan, P. F.</dc:creator>
<dc:creator>Hindle, A. G.</dc:creator>
<dc:creator>Andrews, G.</dc:creator>
<dc:creator>Armstrong, J. C.</dc:creator>
<dc:creator>Bianchi, M.</dc:creator>
<dc:creator>Breit, A. M.</dc:creator>
<dc:creator>Diekhans, M.</dc:creator>
<dc:creator>Fanter, C.</dc:creator>
<dc:creator>Foley, N. M.</dc:creator>
<dc:creator>Goodman, D. B.</dc:creator>
<dc:creator>Goodman, L.</dc:creator>
<dc:creator>Keough, K. C.</dc:creator>
<dc:creator>Kirilenko, B.</dc:creator>
<dc:creator>Kowalczyk, A.</dc:creator>
<dc:creator>Lawless, C.</dc:creator>
<dc:creator>Lind, A. L.</dc:creator>
<dc:creator>Meadows, J. R. S.</dc:creator>
<dc:creator>Moreira, L. R.</dc:creator>
<dc:creator>Redlich, R. W.</dc:creator>
<dc:creator>Ryan, L.</dc:creator>
<dc:creator>Swofford, R.</dc:creator>
<dc:creator>Valenzuela, A.</dc:creator>
<dc:creator>Wagner, F.</dc:creator>
<dc:creator>Wallerman, O.</dc:creator>
<dc:creator>Brown, A. R.</dc:creator>
<dc:creator>Damas, J.</dc:creator>
<dc:creator>Fan, K.</dc:creator>
<dc:creator>Gatesy, J.</dc:creator>
<dc:creator>Grimshaw, J.</dc:creator>
<dc:creator>Johnson, J.</dc:creator>
<dc:creator>Kozyrev, S. V.</dc:creator>
<dc:creator>Lawler, A. J.</dc:creator>
<dc:creator>Marinescu, V. D.</dc:creator>
<dc:creator>Morrill, K. M.</dc:creator>
<dc:creator>Osmanski, A.</dc:creator>
<dc:creator>Paulat, N. S.</dc:creator>
<dc:date>2023-03-09</dc:date>
<dc:identifier>doi:10.1101/2023.03.09.531574</dc:identifier>
<dc:title><![CDATA[Evolutionary constraint and innovation across hundreds of placental mammals]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2023-03-09</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2024.09.26.615256v1?rss=1">
<title>
<![CDATA[
Gene expansions contributing to human brain evolution 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2024.09.26.615256v1?rss=1
</link>
<description><![CDATA[
Duplicated genes expanded in the human lineage likely contributed to brain evolution, yet challenges exist in their discovery due to sequence-assembly errors. We used a complete telomere-to-telomere genome sequence to identify 213 human-specific gene families. From these, 362 paralogs were found in all modern human genomes tested and brain transcriptomes, making them top candidates contributing to human-universal brain features. Choosing a subset of paralogs, long-read DNA sequencing of hundreds of modern humans revealed previously hidden signatures of selection, including for T-cell marker CD8B. To understand roles in brain development, we generated zebrafish CRISPR "knockout" models of nine orthologs and introduced mRNA-encoding paralogs, effectively "humanizing" larvae. Our findings implicate two genes in possibly contributing to hallmark features of the human brain: GPR89B in dosage-mediated brain expansion and FRMPD2B in altered synapse signaling. Our holistic approach provides insights and a comprehensive resource for studying gene expansion drivers of human brain evolution.
]]></description>
<dc:creator>Soto, D. C.</dc:creator>
<dc:creator>Uribe-Salazar, J. M.</dc:creator>
<dc:creator>Kaya, G.</dc:creator>
<dc:creator>Valdarrago, R.</dc:creator>
<dc:creator>Sekar, A.</dc:creator>
<dc:creator>Haghani, N. K.</dc:creator>
<dc:creator>Hino, K.</dc:creator>
<dc:creator>La, G. N.</dc:creator>
<dc:creator>Mariano, N. A. F.</dc:creator>
<dc:creator>Ingamells, C.</dc:creator>
<dc:creator>Baraban, A. E.</dc:creator>
<dc:creator>Turner, T. N.</dc:creator>
<dc:creator>Green, E. D.</dc:creator>
<dc:creator>Simo, S.</dc:creator>
<dc:creator>Quon, G.</dc:creator>
<dc:creator>Andres, A.</dc:creator>
<dc:creator>Dennis, M. Y.</dc:creator>
<dc:date>2024-09-26</dc:date>
<dc:identifier>doi:10.1101/2024.09.26.615256</dc:identifier>
<dc:title><![CDATA[Gene expansions contributing to human brain evolution]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2024-09-26</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2020.03.14.992248v1?rss=1">
<title>
<![CDATA[
HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2020.03.14.992248v1?rss=1
</link>
<description><![CDATA[
Complete and accurate genome assemblies form the basis of most downstream genomic analyses and are of critical importance. Recent genome assembly projects have relied on a combination of noisy long-read sequencing and accurate short-read sequencing, with the former offering greater assembly continuity and the latter providing higher consensus accuracy. The recently introduced PacBio HiFi sequencing technology bridges this divide by delivering long reads (>10 kbp) with high per-base accuracy (>99.9%). Here we present HiCanu, a significant modification of the Canu assembler designed to leverage the full potential of HiFi reads via homopolymer compression, overlap-based error correction, and aggressive false overlap filtering. We benchmark HiCanu with a focus on the recovery of haplotype diversity, major histocompatibility complex (MHC) variants, satellite DNAs, and segmental duplications. For diploid human genomes sequenced to 30x HiFi coverage, HiCanu achieved superior accuracy and allele recovery compared to the current state of the art. On the effectively haploid CHM13 human cell line, HiCanu achieved an NG50 contig size of 77 Mbp with a per-base consensus accuracy of 99.999% (QV50), surpassing recent assemblies of high-coverage, ultra-long Oxford Nanopore reads in terms of both accuracy and continuity. This HiCanu assembly correctly resolves 337 out of 341 validation BACs sampled from known segmental duplications and provides the first preliminary assemblies of 9 complete human centromeric regions. Although gaps and errors still remain within the most challenging regions of the genome, these results represent a significant advance towards the complete assembly of human genomes.

Availability: HiCanu is implemented within the Canu assembly framework and is available from https://github.com/marbl/canu.
]]></description>
<dc:creator>Nurk, S.</dc:creator>
<dc:creator>Walenz, B. P.</dc:creator>
<dc:creator>Rhie, A.</dc:creator>
<dc:creator>Vollger, M. R.</dc:creator>
<dc:creator>Logsdon, G. A.</dc:creator>
<dc:creator>Grothe, R.</dc:creator>
<dc:creator>Miga, K. H.</dc:creator>
<dc:creator>Eichler, E. E.</dc:creator>
<dc:creator>Phillippy, A. M.</dc:creator>
<dc:creator>Koren, S.</dc:creator>
<dc:date>2020-03-17</dc:date>
<dc:identifier>doi:10.1101/2020.03.14.992248</dc:identifier>
<dc:title><![CDATA[HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2020-03-17</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2021.05.17.444256v1?rss=1">
<title>
<![CDATA[
KmerKeys: a web resource for searching indexed genome assemblies and variants 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2021.05.17.444256v1?rss=1
</link>
<description><![CDATA[
K-mers are short DNA sequences that are used for genome sequence analysis. Applications that use k-mers include genome assembly and alignment. Despite these current applications, the wider bioinformatic use of k-mers faces challenges related to the massive scale of genomic sequence data. A single human genome assembly has billions of these short sequences. The sheer amount of computation for effective use of k-mer information is enormous, particularly when involving multiple genome assemblies. To address these issues, we developed a new k-mer indexing data structure based on a hash table tuned for the lookup of k-mer keys. This web application, referred to as KmerKeys (https://kmerkeys.dgi-stanford.org/), provides rapid query speeds for cloud computation on genome assemblies. We enable fuzzy as well as exact k-mer-based searches of assemblies. To enable robust and speedy performance, the website implements cache-friendly hash tables, memory mapping and massive parallel processing. Our method employs a scalable and efficient data structure that can be used to jointly index and search a large collection of human genome assembly information. One can include variant databases and their associated metadata, such as the gnomAD population variant catalog. This feature enables the incorporation of future genomic information into sequencing analysis.
]]></description>
<dc:creator>Pavlichin, D. S.</dc:creator>
<dc:creator>Lee, H.</dc:creator>
<dc:creator>Greer, S. U.</dc:creator>
<dc:creator>Grimes, S. M.</dc:creator>
<dc:creator>Weissman, T.</dc:creator>
<dc:creator>Ji, H. P.</dc:creator>
<dc:date>2021-05-18</dc:date>
<dc:identifier>doi:10.1101/2021.05.17.444256</dc:identifier>
<dc:title><![CDATA[KmerKeys: a web resource for searching indexed genome assemblies and variants]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2021-05-18</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2023.10.11.561941v1?rss=1">
<title>
<![CDATA[
KmerSV: a visualization and annotation tool for structural variants using Human Pangenome derived k-mers 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2023.10.11.561941v1?rss=1
</link>
<description><![CDATA[
Summary: KmerSV is a visualization and annotation tool for structural variants (SVs). It can be applied to assembly contigs or long-read sequences. Using k-mers, it rapidly generates images and provides genome features of SVs. As an important feature, it utilizes the new Human Pangenome reference, which provides haploid-specific assemblies, addresses limitations in prior references, and improves the discovery of SVs.

Availability and implementation: KmerSV is implemented in Python and available at github.com/sgtc-stanford/kmerSV
]]></description>
<dc:creator>Meng, Q.</dc:creator>
<dc:creator>Ji, H. P.</dc:creator>
<dc:creator>Lee, H.</dc:creator>
<dc:date>2023-10-15</dc:date>
<dc:identifier>doi:10.1101/2023.10.11.561941</dc:identifier>
<dc:title><![CDATA[KmerSV: a visualization and annotation tool for structural variants using Human Pangenome derived k-mers]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2023-10-15</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2020.11.15.383273v1?rss=1">
<title>
<![CDATA[
lra: the Long Read Aligner for Sequences and Contigs 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2020.11.15.383273v1?rss=1
</link>
<description><![CDATA[
Motivation: It is computationally challenging to detect variation by aligning long reads from single-molecule sequencing (SMS) instruments, or megabase-scale contigs from SMS assemblies. One approach to efficiently align long sequences is sparse dynamic programming (SDP), where exact matches are found between the sequence and the genome, and optimal chains of matches are found representing a rough alignment. Sequence variation is more accurately modeled when alignments are scored with a gap penalty that is a convex function of the gap length. Because previous implementations of SDP used a linear-cost gap function that does not accurately model variation, and implementations of alignment that have a convex gap penalty are either inefficient or use heuristics, we developed a method, lra, that uses SDP with a convex-cost gap penalty. We use lra to align long-read sequences from PacBio and Oxford Nanopore (ONT) instruments as well as de novo assembly contigs.

Results: Across all data types, the runtime of lra is between 52-168% of the state-of-the-art aligner minimap2 when generating SAM alignments, and 9-15% of an alternative method, ngmlr. This alignment approach may be used to provide additional evidence of SV calls in PacBio datasets, and an increase in sensitivity and specificity on ONT data with current SV detection algorithms. The number of calls discovered using pbsv with lra alignments is within 98.3-98.6% of calls made from minimap2 alignments on the same data, and gives a nominal 0.2-0.4% increase in F1 score by Truvari analysis. On ONT data with SVs called using Sniffles, the number of calls made from lra alignments is 3% greater than minimap2-based calls, and 30% greater than ngmlr-based calls, with a 4.6-5.5% increase in Truvari F1 score. When applied to calling variation from de novo assembly contigs, there is a 5.8% increase in SV calls compared to minimap2+paftools, with a 4.3% increase in Truvari F1 score.

Availability and implementation: Available in bioconda: https://anaconda.org/bioconda/lra and github: https://github.com/ChaissonLab/LRA

Contact: mchaisso@usc.edu, jingwenr@usc.edu
]]></description>
<dc:creator>Ren, J.</dc:creator>
<dc:creator>Chaisson, M.</dc:creator>
<dc:date>2020-11-17</dc:date>
<dc:identifier>doi:10.1101/2020.11.15.383273</dc:identifier>
<dc:title><![CDATA[lra: the Long Read Aligner for Sequences and Contigs]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2020-11-17</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2021.12.05.471312v1?rss=1">
<title>
<![CDATA[
LungMAP Portal Ecosystem: Systems-Level Exploration of the Lung 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2021.12.05.471312v1?rss=1
</link>
<description><![CDATA[
An improved understanding of the human lung necessitates advanced systems models informed by an ever-increasing repertoire of molecular omics, cellular, imaging and pathological datasets. To centralize and standardize information across broad lung research efforts, we expanded the LungMAP.net website into a gateway portal. This portal connects a broad spectrum of research networks, bulk and single-cell multi-omics data and a diverse collection of image data that span mammalian lung development and disease. The data are standardized across species and technologies using harmonized data and metadata models that leverage recent advances including those from the Human Cell Atlas, diverse ontologies, and the LungMAP CellCards initiative. To cultivate future discoveries, we have aggregated a diverse collection of single-cell atlases for multiple species (human, rhesus, mouse) to enable consistent queries across technologies, cohorts, age, disease and drug treatment. These atlases are provided as independent and integrated queryable datasets, with an emphasis on dynamic visualization, figure generation and reference-based classification of user-provided datasets (Azimuth). As this resource grows, we intend to increase the breadth of available interactive interfaces, data portals and datasets from LungMAP and external research efforts.
]]></description>
<dc:creator>Gaddis, N.</dc:creator>
<dc:creator>Fortriede, J.</dc:creator>
<dc:creator>Guo, M.</dc:creator>
<dc:creator>Bardes, E. E.</dc:creator>
<dc:creator>Kouril, M.</dc:creator>
<dc:creator>Tabar, S.</dc:creator>
<dc:creator>Burns, K.</dc:creator>
<dc:creator>Ardini-Poleske, M. E.</dc:creator>
<dc:creator>Loos, S.</dc:creator>
<dc:creator>Schnell, D.</dc:creator>
<dc:creator>Jin, K.</dc:creator>
<dc:creator>Iyer, B.</dc:creator>
<dc:creator>Du, Y.</dc:creator>
<dc:creator>Korte, J.</dc:creator>
<dc:creator>Munshi, R.</dc:creator>
<dc:creator>Smith, V.</dc:creator>
<dc:creator>Herbst, A.</dc:creator>
<dc:creator>Kitzmiller, J. A.</dc:creator>
<dc:creator>Clair, G. C.</dc:creator>
<dc:creator>Carson, J.</dc:creator>
<dc:creator>Adkins, J.</dc:creator>
<dc:creator>Morrisey, E. E.</dc:creator>
<dc:creator>Pryhuber, G. S.</dc:creator>
<dc:creator>Misra, R.</dc:creator>
<dc:creator>Whitsett, J. A.</dc:creator>
<dc:creator>Sun, X.</dc:creator>
<dc:creator>Heathorn, T.</dc:creator>
<dc:creator>Paten, B.</dc:creator>
<dc:creator>Prasath, V. B. S.</dc:creator>
<dc:creator>Xu, Y.</dc:creator>
<dc:creator>Tickle, T.</dc:creator>
<dc:creator>Aronow, B. J.</dc:creator>
<dc:creator>Salomonis, N.</dc:creator>
<dc:date>2021-12-06</dc:date>
<dc:identifier>doi:10.1101/2021.12.05.471312</dc:identifier>
<dc:title><![CDATA[LungMAP Portal Ecosystem: Systems-Level Exploration of the Lung]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2021-12-06</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2023.11.01.565041v1?rss=1">
<title>
<![CDATA[
ntsm: an alignment-free, ultra low coverage, sequencing technology agnostic, intraspecies sample comparison tool for sample swap detection 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2023.11.01.565041v1?rss=1
</link>
<description><![CDATA[
Background: Due to human error, sample swapping in large cohort studies with heterogeneous data types (e.g. a mix of Oxford Nanopore, Pacific Biosciences, Illumina data, etc.) remains a common issue plaguing large-scale studies. At present, all sample swapping detection methods require costly and unnecessary (e.g. if data is only used for genome assembly) alignment, positional sorting, and indexing of the data in order to compare similarity. As studies include more samples and new sequencing data types, robust quality control tools will become increasingly important.

Findings: The similarity between samples can be determined using indexed k-mer sequence variants. To increase statistical power, we use coverage information on variant sites, calculating similarity using a likelihood ratio-based test. Per-sample error rate and coverage bias (i.e. missing sites) can also be estimated with this information, which can be used to determine if a spatially indexed PCA-based pre-screening method can be used, which can greatly speed up analysis by preventing exhaustive all-to-all comparisons.

Conclusions: Because this tool processes raw data, is faster than alignment, and can be used on very low coverage data, it can save an immense degree of computational resources in standard QC pipelines. It is robust enough to be used on different sequencing data types, which is important in studies that leverage the strengths of different sequencing technologies. In addition to its primary use case of sample-swap detection, this method provides other information useful in QC, such as error rate and coverage bias, as well as population-level PCA ancestry analysis visualization.
]]></description>
<dc:creator>Chu, J.</dc:creator>
<dc:creator>Rong, J.</dc:creator>
<dc:creator>Feng, X.</dc:creator>
<dc:creator>Li, H.</dc:creator>
<dc:date>2023-11-03</dc:date>
<dc:identifier>doi:10.1101/2023.11.01.565041</dc:identifier>
<dc:title><![CDATA[ntsm: an alignment-free, ultra low coverage, sequencing technology agnostic, intraspecies sample comparison tool for sample swap detection]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2023-11-03</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2022.10.06.511217v1?rss=1">
<title>
<![CDATA[
Pangenome Graph Construction from Genome Alignment with Minigraph-Cactus 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2022.10.06.511217v1?rss=1
</link>
<description><![CDATA[
Reference genomes provide mapping targets and coordinate systems but introduce biases when samples under study diverge sufficiently from them. Pangenome references seek to address this by storing a representative set of diverse haplotypes and their alignment, usually as a graph. Alternate alleles determined by variant callers can be used to construct pangenome graphs, but thanks to advances in long-read sequencing, high-quality phased assemblies are becoming widely available. Constructing a pangenome graph directly from assemblies, as opposed to variant calls, leverages the graph's ability to consistently represent variation at different scales and reduces biases introduced by reference-based variant calls. Pangenome construction in this way is equivalent to multiple genome alignment. Here we present the Minigraph-Cactus pangenome pipeline, a method to create pangenomes directly from whole-genome alignments, and demonstrate its ability to scale to 90 human haplotypes from the Human Pangenome Reference Consortium (HPRC). This tool was designed to build graphs containing all forms of genetic variation while still being practical for use with current mapping and genotyping tools. We show that this graph is useful both for studying variation within the input haplotypes and as a basis for achieving state-of-the-art performance in short- and long-read mapping, small variant calling, and structural variant genotyping. We further measure the effect of the quality and completeness of reference genomes used for analysis within the pangenomes, and show that using the CHM13 reference from the Telomere-to-Telomere Consortium improves the accuracy of our methods, even after projecting back to GRCh38. We also demonstrate that our method can apply to nonhuman data by showing improved mapping and variant detection sensitivity with a Drosophila melanogaster pangenome.
]]></description>
<dc:creator>Hickey, G.</dc:creator>
<dc:creator>Monlong, J.</dc:creator>
<dc:creator>Novak, A.</dc:creator>
<dc:creator>Eizenga, J. M.</dc:creator>
<dc:creator>Human Pangenome Reference Consortium</dc:creator>
<dc:creator>Li, H.</dc:creator>
<dc:creator>Paten, B.</dc:creator>
<dc:date>2022-10-07</dc:date>
<dc:identifier>doi:10.1101/2022.10.06.511217</dc:identifier>
<dc:title><![CDATA[Pangenome Graph Construction from Genome Alignment with Minigraph-Cactus]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2022-10-07</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2023.02.21.529152v1?rss=1">
<title>
<![CDATA[
Phased nanopore assembly with Shasta and modular graph phasing with GFAse 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2023.02.21.529152v1?rss=1
</link>
<description><![CDATA[
As a step towards simplifying and reducing the cost of haplotype-resolved de novo assembly, we describe new methods for accurately phasing nanopore data with the Shasta genome assembler and a modular tool, called GFAse, for extending phasing to the chromosome scale. We test using new variants of Oxford Nanopore Technologies (ONT) PromethION sequencing, including those using proximity ligation, and show that newer, higher-accuracy ONT reads substantially improve assembly quality.
]]></description>
<dc:creator>Lorig-Roach, R.</dc:creator>
<dc:creator>Meredith, M.</dc:creator>
<dc:creator>Monlong, J.</dc:creator>
<dc:creator>Jain, M.</dc:creator>
<dc:creator>Olsen, H.</dc:creator>
<dc:creator>McNulty, B.</dc:creator>
<dc:creator>Porubsky, D.</dc:creator>
<dc:creator>Montague, T. G.</dc:creator>
<dc:creator>Lucas, J.</dc:creator>
<dc:creator>Condon, C.</dc:creator>
<dc:creator>Eizenga, J.</dc:creator>
<dc:creator>Juul, S.</dc:creator>
<dc:creator>McKenzie, S.</dc:creator>
<dc:creator>Simmonds, S.</dc:creator>
<dc:creator>Park, J.</dc:creator>
<dc:creator>Asri, M.</dc:creator>
<dc:creator>Koren, S.</dc:creator>
<dc:creator>Eichler, E.</dc:creator>
<dc:creator>Axel, R.</dc:creator>
<dc:creator>Martin, B.</dc:creator>
<dc:creator>Carnevali, P.</dc:creator>
<dc:creator>Miga, K.</dc:creator>
<dc:creator>Paten, B.</dc:creator>
<dc:date>2023-02-22</dc:date>
<dc:identifier>doi:10.1101/2023.02.21.529152</dc:identifier>
<dc:title><![CDATA[Phased nanopore assembly with Shasta and modular graph phasing with GFAse]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2023-02-22</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2020.08.13.249839v1?rss=1">
<title>
<![CDATA[
Profiling variable-number tandem repeat variation across populations using repeat-pangenome graphs 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2020.08.13.249839v1?rss=1
</link>
<description><![CDATA[
Variable number tandem repeat sequences (VNTRs) are composed of consecutive repeats of short segments of DNA with hypervariable repeat count and composition. They include protein-coding sequences and are associated with clinical disorders. It has been difficult to incorporate VNTR analysis in disease studies that use short-read sequencing because the traditional approach of mapping to the human reference is less effective for repetitive and divergent sequences. We solve VNTR mapping for short reads with a repeat-pangenome graph (RPGG), a data structure that encodes both the population diversity and repeat structure of VNTR loci from multiple haplotype-resolved assemblies. We developed software to build an RPGG and to use the RPGG to estimate VNTR composition with short reads. We used this to discover VNTRs with length stratified by continental population, and novel expression quantitative trait loci, indicating that RPGG analysis of VNTRs will be critical for future studies of diversity and disease.
]]></description>
<dc:creator>Lu, T.-Y. T.</dc:creator>
<dc:creator>The Human Genome Structural Variation Consortium</dc:creator>
<dc:creator>Chaisson, M. J.</dc:creator>
<dc:date>2020-08-14</dc:date>
<dc:identifier>doi:10.1101/2020.08.13.249839</dc:identifier>
<dc:title><![CDATA[Profiling variable-number tandem repeat variation across populations using repeat-pangenome graphs]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2020-08-14</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2022.08.15.504037v1?rss=1">
<title>
<![CDATA[
Recombination between heterologous human acrocentric chromosomes 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2022.08.15.504037v1?rss=1
</link>
<description><![CDATA[
The short arms of the human acrocentric chromosomes 13, 14, 15, 21, and 22 share large homologous regions, including the ribosomal DNA repeats and extended segmental duplications (Floutsakou et al. 2013; van Sluis et al. 2019). While the complete assembly of these regions in the Telomere-to-Telomere consortium's CHM13 provided a model of their homology (Nurk et al. 2022), it remained unclear if these patterns were ancestral or maintained by ongoing recombination exchange. Here, we show that acrocentric chromosomes contain pseudo-homologous regions (PHRs) indicative of recombination between non-homologs. Considering an all-to-all comparison of the high-quality human pangenome from the Human Pangenome Reference Consortium (HPRC) (Liao et al. 2022), we find that contigs from all of the acrocentric short arms form a community similar to those formed by single chromosomes or the sex chromosome pair. A variation graph (Garrison et al. 2018) constructed from centromere-spanning acrocentric contigs indicates the presence of regions where most contigs appear nearly identical between heterologous CHM13 acrocentrics. Except on chromosome 15, we observe faster decay of linkage disequilibrium in the PHRs than in the corresponding short and long arms, indicating higher rates of recombination (N. Li and Stephens 2003; Huttley et al. 1999). The PHRs include sequences previously shown to lie at the breakpoint of Robertsonian translocations (Jarmuz-Szymczak et al. 2014), and we show that their arrangement is compatible with crossover in inverted duplications on chromosomes 13, 14, and 21. The ubiquity of signals of recombination between heterologous chromosomes seen in the HPRC draft pangenome's acrocentric assemblies suggests that these shared sequences form the basis for recurrent Robertsonian translocations, providing sequence and population-based confirmation of hypotheses first developed cytogenetically fifty years ago (Hamerton et al. 1975).
]]></description>
<dc:creator>Guarracino, A.</dc:creator>
<dc:creator>Buonaiuto, S.</dc:creator>
<dc:creator>Potapova, T.</dc:creator>
<dc:creator>Rhie, A.</dc:creator>
<dc:creator>Koren, S.</dc:creator>
<dc:creator>Rubinstein, B.</dc:creator>
<dc:creator>Fischer, C.</dc:creator>
<dc:creator>Human Pangenome Reference Consortium</dc:creator>
<dc:creator>Gerton, J. L.</dc:creator>
<dc:creator>Phillippy, A. M.</dc:creator>
<dc:creator>Colonna, V.</dc:creator>
<dc:creator>Garrison, E.</dc:creator>
<dc:date>2022-08-15</dc:date>
<dc:identifier>doi:10.1101/2022.08.15.504037</dc:identifier>
<dc:title><![CDATA[Recombination between heterologous human acrocentric chromosomes]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2022-08-15</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2023.09.05.556380v1?rss=1">
<title>
<![CDATA[
Regulatory Transposable Elements in the Encyclopedia of DNA Elements 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2023.09.05.556380v1?rss=1
</link>
<description><![CDATA[
Transposable elements (TEs) make up about half of the human genome and many have the biochemical hallmarks of tissue- or cell type-specific cis-regulatory elements. While some TEs have been rigorously documented to contribute directly to host gene regulation, we still have a very partial view of their regulatory landscape. Leveraging Phase 4 ENCODE data, we carried out the most comprehensive study to date of TE contributions to the regulatory genome. Here we investigated the sequence origins of candidate cis-regulatory elements (cCREs), showing that ~25% of human cCREs, comprising 236,181 elements, are derived from TEs. Human-mouse comparisons indicate that over 90% of TE-derived cCREs are lineage-specific, accounting for 8-36% of lineage-specific cCREs across cCRE types. Next, we found that cCRE-associated transcription factor (TF) binding motifs in TEs originated from TE ancestral sequences significantly more than expected in all TE classes except for SINEs. Using both cCRE and TF binding data, we discovered that TEs providing cCREs and TF binding sites are closer in genomic distance to non-TE sites compared to other TEs, suggesting that TE integration site influences their later co-option as regulatory elements. We show that TEs have promoted TF binding site turnover events since human-mouse divergence, accounting for 3-56% of turnover events across 30 TFs examined. Finally, we demonstrate that TE-derived cCREs share similar features with non-TE cCREs, including massively parallel reporter assay activity and GWAS variant enrichment. Overall, our results substantiate the notion that TEs have played an important role in shaping the human regulatory genome.
]]></description>
<dc:creator>Du, A. Y.</dc:creator>
<dc:creator>Chobirko, J. D.</dc:creator>
<dc:creator>Zhuo, X.</dc:creator>
<dc:creator>Feschotte, C.</dc:creator>
<dc:creator>Wang, T.</dc:creator>
<dc:date>2023-09-06</dc:date>
<dc:identifier>doi:10.1101/2023.09.05.556380</dc:identifier>
<dc:title><![CDATA[Regulatory Transposable Elements in the Encyclopedia of DNA Elements]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2023-09-06</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2021.07.01.450805v1?rss=1">
<title>
<![CDATA[
Single cell characterization of CRISPR-modified transcript isoforms with nanopore sequencing 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2021.07.01.450805v1?rss=1
</link>
<description><![CDATA[
Transcript isoforms are mRNAs that arise from alternative splicing events. During RNA processing, different combinations of a gene's exons lead to a diverse set of isoforms. Polymorphisms or mutations at splice junctions can generate alternative splicing events. Various splicing factors also impact the representation of a gene's transcript isoforms. To assess how these two features contribute to alternative splicing, we developed a single cell approach to introduce CRISPR edits that modify mRNA transcript structure. Our method combines (1) long-read sequencing to characterize the expressed transcripts and identify the edit at single cell resolution; (2) short-read sequencing to match the single cell gene expression profiles of the cells with the altered isoform. First, we modify target exon-intron segments with CRISPR-Cas9. Second, using cDNAs with cell barcodes, we use long-read sequencing to directly identify the changes in transcript isoforms from the targeted CRISPR edits. As a variation on this approach, we also determined how modifying specific splicing factors influences isoform expression and structure. Overall, we demonstrate how the integration of single cell long-read analysis and CRISPR engineering can be used to directly confirm transcript isoform and target genomic edits at single cell resolution. This approach will improve our understanding of the role of alternative splicing in transcriptional regulation.
]]></description>
<dc:creator>Kim, H. S.</dc:creator>
<dc:creator>Grimes, S. M.</dc:creator>
<dc:creator>Hooker, A. C.</dc:creator>
<dc:creator>Lau, B. T.</dc:creator>
<dc:creator>Ji, H. P.</dc:creator>
<dc:date>2021-07-02</dc:date>
<dc:identifier>doi:10.1101/2021.07.01.450805</dc:identifier>
<dc:title><![CDATA[Single cell characterization of CRISPR-modified transcript isoforms with nanopore sequencing]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2021-07-02</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2024.06.04.597452v1?rss=1">
<title>
<![CDATA[
Structural polymorphism and diversity of human segmental duplications 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2024.06.04.597452v1?rss=1
</link>
<description><![CDATA[
Segmental duplications (SDs) contribute significantly to human disease, evolution, and diversity yet have been difficult to resolve at the sequence level. We present a population genetics survey of SDs by analyzing 170 human genome assemblies where the majority of SDs are fully resolved using long-read sequence assembly. Excluding the acrocentric short arms, we identify 173.2 Mbp of duplicated sequence (47.4 Mbp not present in the telomere-to-telomere reference) distinguishing fixed from structurally polymorphic events. We find that intrachromosomal SDs are among the most variable with rare events mapping near their progenitor sequences. African genomes harbor significantly more intrachromosomal SDs and are more likely to have recently duplicated gene families with higher copy number when compared to non-African samples. A comparison to a resource of 563 million full-length Iso-Seq reads identifies 201 novel, potentially protein-coding genes corresponding to these copy number polymorphic SDs.
]]></description>
<dc:creator>Jeong, H.</dc:creator>
<dc:creator>Dishuck, P. C.</dc:creator>
<dc:creator>Yoo, D.</dc:creator>
<dc:creator>Harvey, W. T.</dc:creator>
<dc:creator>Munson, K. M.</dc:creator>
<dc:creator>Lewis, A. P.</dc:creator>
<dc:creator>Kordosky, J.</dc:creator>
<dc:creator>Garcia, G. H.</dc:creator>
<dc:creator>Human Genome Structural Variation Consortium (HGSVC)</dc:creator>
<dc:creator>Yilmaz, F.</dc:creator>
<dc:creator>Hallast, P.</dc:creator>
<dc:creator>Lee, C.</dc:creator>
<dc:creator>Pastinen, T.</dc:creator>
<dc:creator>Eichler, E. E.</dc:creator>
<dc:date>2024-06-06</dc:date>
<dc:identifier>doi:10.1101/2024.06.04.597452</dc:identifier>
<dc:title><![CDATA[Structural polymorphism and diversity of human segmental duplications]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2024-06-06</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2023.03.07.531415v1?rss=1">
<title>
<![CDATA[
Structurally divergent and recurrently mutated regions of primate genomes 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2023.03.07.531415v1?rss=1
</link>
<description><![CDATA[
To better understand the pattern of primate genome structural variation, we sequenced and assembled using multiple long-read sequencing technologies the genomes of eight nonhuman primate species, including New World monkeys (owl monkey and marmoset), Old World monkey (macaque), Asian apes (orangutan and gibbon), and African ape lineages (gorilla, bonobo, and chimpanzee). Compared to the human genome, we identified 1,338,997 lineage-specific fixed structural variants (SVs) disrupting 1,561 protein-coding genes and 136,932 regulatory elements, including the most complete set of human-specific fixed differences. Across 50 million years of primate evolution, we estimate that 819.47 Mbp or ~27% of the genome has been affected by SVs based on analysis of these primate lineages. We identify 1,607 structurally divergent regions (SDRs) wherein recurrent structural variation contributes to creating SV hotspots where genes are recurrently lost (CARDs, ABCD7, OLAH) and new lineage-specific genes are generated (e.g., CKAP2, NEK5) and have become targets of rapid chromosomal diversification and positive selection (e.g., RGPDs). High-fidelity long-read sequencing has made these dynamic regions of the genome accessible for sequence-level analyses within and between primate species for the first time.
]]></description>
<dc:creator>Mao, Y.</dc:creator>
<dc:creator>Harvey, W. T.</dc:creator>
<dc:creator>Porubsky, D.</dc:creator>
<dc:creator>Munson, K. M.</dc:creator>
<dc:creator>Hoekzema, K.</dc:creator>
<dc:creator>Lewis, A. P.</dc:creator>
<dc:creator>Audano, P. A.</dc:creator>
<dc:creator>Rozanski, A.</dc:creator>
<dc:creator>Yang, X.</dc:creator>
<dc:creator>Zhang, S.</dc:creator>
<dc:creator>Gordon, D. S.</dc:creator>
<dc:creator>Wei, X.</dc:creator>
<dc:creator>Logsdon, G. A.</dc:creator>
<dc:creator>Haukness, M.</dc:creator>
<dc:creator>Dishuck, P. C.</dc:creator>
<dc:creator>Jeong, H.</dc:creator>
<dc:creator>del Rosario, R.</dc:creator>
<dc:creator>Bauer, V. L.</dc:creator>
<dc:creator>Fattor, W. T.</dc:creator>
<dc:creator>Wilkerson, G. K.</dc:creator>
<dc:creator>Lu, Q.</dc:creator>
<dc:creator>Paten, B.</dc:creator>
<dc:creator>Feng, G.</dc:creator>
<dc:creator>Sawyer, S. L.</dc:creator>
<dc:creator>Warren, W. C.</dc:creator>
<dc:creator>Carbone, L.</dc:creator>
<dc:creator>Eichler, E. E.</dc:creator>
<dc:date>2023-03-07</dc:date>
<dc:identifier>doi:10.1101/2023.03.07.531415</dc:identifier>
<dc:title><![CDATA[Structurally divergent and recurrently mutated regions of primate genomes]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2023-03-07</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/735928v1?rss=1">
<title>
<![CDATA[
Telomere-to-telomere assembly of a complete human X chromosome 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/735928v1?rss=1
</link>
<description><![CDATA[
After nearly two decades of improvements, the current human reference genome (GRCh38) is the most accurate and complete vertebrate genome ever produced. However, no one chromosome has been finished end to end, and hundreds of unresolved gaps persist [1,2]. The remaining gaps include ribosomal rDNA arrays, large near-identical segmental duplications, and satellite DNA arrays. These regions harbor largely unexplored variation of unknown consequence, and their absence from the current reference genome can lead to experimental artifacts and hide true variants when re-sequencing additional human genomes. Here we present a de novo human genome assembly that surpasses the continuity of GRCh38 [2], along with the first gapless, telomere-to-telomere assembly of a human chromosome. This was enabled by high-coverage, ultra-long-read nanopore sequencing of the complete hydatidiform mole CHM13 genome, combined with complementary technologies for quality improvement and validation. Focusing our efforts on the human X chromosome [3], we reconstructed the ~2.8 megabase centromeric satellite DNA array and closed all 29 remaining gaps in the current reference, including new sequence from the human pseudoautosomal regions and cancer-testis ampliconic gene families (CT-X and GAGE). This complete chromosome X, combined with the ultra-long nanopore data, also allowed us to map methylation patterns across complex tandem repeats and satellite arrays for the first time. These results demonstrate that finishing the human genome is now within reach and will enable ongoing efforts to complete the remaining human chromosomes.
]]></description>
<dc:creator>Miga, K. H.</dc:creator>
<dc:creator>Koren, S.</dc:creator>
<dc:creator>Rhie, A.</dc:creator>
<dc:creator>Vollger, M. R.</dc:creator>
<dc:creator>Gershman, A.</dc:creator>
<dc:creator>Bzikadze, A.</dc:creator>
<dc:creator>Brooks, S.</dc:creator>
<dc:creator>Howe, E.</dc:creator>
<dc:creator>Porubsky, D.</dc:creator>
<dc:creator>Logsdon, G. A.</dc:creator>
<dc:creator>Schneider, V. A.</dc:creator>
<dc:creator>Potapova, T.</dc:creator>
<dc:creator>Wood, J.</dc:creator>
<dc:creator>Chow, W.</dc:creator>
<dc:creator>Armstrong, J.</dc:creator>
<dc:creator>Fredrickson, J.</dc:creator>
<dc:creator>Pak, E.</dc:creator>
<dc:creator>Tigyi, K.</dc:creator>
<dc:creator>Kremitzki, M.</dc:creator>
<dc:creator>Markovic, C.</dc:creator>
<dc:creator>Maduro, V.</dc:creator>
<dc:creator>Dutra, A.</dc:creator>
<dc:creator>Bouffard, G. G.</dc:creator>
<dc:creator>Chang, A. M.</dc:creator>
<dc:creator>Hansen, N. F.</dc:creator>
<dc:creator>Thibaud-Nissen, F.</dc:creator>
<dc:creator>Schmitt, A. D.</dc:creator>
<dc:creator>Belton, J.-M.</dc:creator>
<dc:creator>Selvaraj, S.</dc:creator>
<dc:creator>Dennis, M. Y.</dc:creator>
<dc:creator>Soto, D. C.</dc:creator>
<dc:creator>Sahasrabudhe, R.</dc:creator>
<dc:creator>Kaya, G.</dc:creator>
<dc:creator>Quick, J.</dc:creator>
<dc:creator>Loman, N. J.</dc:creator>
<dc:creator>Holmes, N.</dc:creator>
<dc:creator>Loose, M.</dc:creator>
<dc:creator>Surti, U.</dc:creator>
<dc:creator>Risques, R. A.</dc:creator>
<dc:creator>Lindsay, T. A. G.</dc:creator>
<dc:creator>Fulton, R.</dc:creator>
<dc:creator>Hall, I.</dc:creator>
<dc:creator>Paten, B.</dc:creator>
<dc:creator>Howe, K.</dc:creator>
<dc:creator>Timp, W.</dc:creator>
<dc:date>2019-08-16</dc:date>
<dc:identifier>doi:10.1101/735928</dc:identifier>
<dc:title><![CDATA[Telomere-to-telomere assembly of a complete human X chromosome]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2019-08-16</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2022.03.17.484784v1?rss=1">
<title>
<![CDATA[
The motif composition of variable-number tandem repeats impacts gene expression 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2022.03.17.484784v1?rss=1
</link>
<description><![CDATA[
Understanding the impact of DNA variation on human traits is a fundamental question in human genetics. Variable number tandem repeats (VNTRs) make up roughly 3% of the human genome but are often excluded from association analysis due to poor read mappability or divergent repeat content. While methods exist to estimate VNTR length from short-read data, it is known that VNTRs vary in both length and repeat (motif) composition. Here, we use a repeat-pangenome graph (RPGG) constructed on 35 haplotype-resolved assemblies to detect variation in both VNTR length and repeat composition. We align population-scale data from the Genotype-Tissue Expression (GTEx) Consortium to examine how variations in sequence composition may be linked to expression, including cases independent of overall VNTR length. We find that 9,422 out of 39,125 VNTRs are associated with nearby gene expression through motif variations, and only 23.4% of these associations are accessible from length. Fine-mapping identifies 174 genes whose expression is likely driven by variation in certain VNTR motifs and not overall length. We highlight two genes, CACNA1C and RNF213, that have expression associated with motif variation, demonstrating the utility of RPGG analysis as a new approach for trait association in multiallelic and highly variable loci.
]]></description>
<dc:creator>Lu, T.-Y. T.</dc:creator>
<dc:creator>Chaisson, M.</dc:creator>
<dc:date>2022-03-19</dc:date>
<dc:identifier>doi:10.1101/2022.03.17.484784</dc:identifier>
<dc:title><![CDATA[The motif composition of variable-number tandem repeats impacts gene expression]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2022-03-19</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2020.04.27.060061v1?rss=1">
<title>
<![CDATA[
The qBED track: a novel genome browser visualization for point processes 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2020.04.27.060061v1?rss=1
</link>
<description><![CDATA[
Summary: Transposon calling cards is a genomic assay for identifying transcription factor binding sites in both bulk and single cell experiments. Here we describe the qBED format, an open, text-based standard for encoding and analyzing calling card data. In parallel, we introduce the qBED track on the WashU Epigenome Browser, a novel visualization that enables researchers to inspect calling card data in their genomic context. Finally, through examples, we demonstrate that qBED files can be used to visualize non-calling card datasets, such as CADD scores and GWAS/eQTL hits, and may have broad utility to the genomics community.

Availability and Implementation: The qBED track is available on the WashU Epigenome Browser (http://epigenomegateway.wustl.edu/browser), beginning with version 46. Source code for the WashU Epigenome Browser with qBED support is available on GitHub (http://github.com/arnavm/eg-react and http://github.com/lidaof/eg-react). We have also released a tutorial on how to upload qBED data to the browser (dx.doi.org/10.17504/protocols.io.bca8ishw).
]]></description>
<dc:creator>Moudgil, A.</dc:creator>
<dc:creator>Li, D.</dc:creator>
<dc:creator>Hsu, S.</dc:creator>
<dc:creator>Purushotham, D.</dc:creator>
<dc:creator>Wang, T.</dc:creator>
<dc:creator>Mitra, R. D.</dc:creator>
<dc:date>2020-04-29</dc:date>
<dc:identifier>doi:10.1101/2020.04.27.060061</dc:identifier>
<dc:title><![CDATA[The qBED track: a novel genome browser visualization for point processes]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2020-04-29</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2022.02.14.480364v1?rss=1">
<title>
<![CDATA[
The Transcription Factor Bach2 Negatively Regulates Natural Killer Cell Maturation and Function 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2022.02.14.480364v1?rss=1
</link>
<description><![CDATA[
BTB domain And CNC Homolog 2 (Bach2) is a transcription repressor that actively participates in T and B lymphocyte development, but it is unknown if Bach2 is also involved in the development of innate immune cells, such as natural killer (NK) cells. Here, we followed the expression of Bach2 during NK cell development, finding that it peaked in CD27+CD11b+ cells and decreased upon further maturation. Bach2 expression positively correlated with that of the transcription factor TCF1 and negatively correlated with genes encoding NK effector molecules as well as genes involved in the cell cycle. Bach2-deficient mice showed increased numbers of terminally differentiated NK cells with increased production of granzymes and cytokines. NK cell-mediated control of tumor metastasis was also augmented in the absence of Bach2. Therefore, Bach2 is a key checkpoint protein regulating NK terminal maturation.
]]></description>
<dc:creator>Li, S.</dc:creator>
<dc:creator>Bern, M.</dc:creator>
<dc:creator>Miao, B.</dc:creator>
<dc:creator>Inoue, T.</dc:creator>
<dc:creator>Piersma, S. J.</dc:creator>
<dc:creator>Colonna, M.</dc:creator>
<dc:creator>Kurosaki, T.</dc:creator>
<dc:creator>Yokoyama, W. M.</dc:creator>
<dc:date>2022-02-14</dc:date>
<dc:identifier>doi:10.1101/2022.02.14.480364</dc:identifier>
<dc:title><![CDATA[The Transcription Factor Bach2 Negatively Regulates Natural Killer Cell Maturation and Function]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2022-02-14</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2022.02.14.480413v1?rss=1">
<title>
<![CDATA[
Unbiased pangenome graphs 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2022.02.14.480413v1?rss=1
</link>
<description><![CDATA[
Motivation: Pangenome variation graphs model the mutual alignment of collections of DNA sequences. A set of pairwise alignments implies a variation graph, but there are no scalable methods to generate such a graph from these alignments. Existing related approaches depend on a single reference, a specific ordering of genomes, or a de Bruijn model based on a fixed k-mer length. A scalable, self-contained method to build pangenome graphs without such limitations would be a key step in pangenome construction and manipulation pipelines.

Results: We design the seqwish algorithm, which builds a variation graph from a set of sequences and alignments between them. We first transform the alignment set into an implicit interval tree. To build up the variation graph, we query this tree-based representation of the alignments to reduce transitive matches into single DNA segments in a sequence graph. By recording the mapping from input sequence to output graph, we can trace the original paths through this graph, yielding a pangenome variation graph. We present an implementation that operates in external memory, using disk-backed data structures and lock-free parallel methods to drive the core graph induction step. We demonstrate that our method scales to very large graph induction problems by applying it to build pangenome graphs for several species.

Availability: seqwish is published as free software under the MIT open source license. Source code and documentation are available at https://github.com/ekg/seqwish. seqwish can be installed via Bioconda https://bioconda.github.io/recipes/seqwish/README.html or GNU Guix https://github.com/ekg/guix-genomics/blob/master/seqwish.scm.

Contact: egarris5@uthsc.edu
]]></description>
<dc:creator>Garrison, E.</dc:creator>
<dc:creator>Guarracino, A.</dc:creator>
<dc:date>2022-02-16</dc:date>
<dc:identifier>doi:10.1101/2022.02.14.480413</dc:identifier>
<dc:title><![CDATA[Unbiased pangenome graphs]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2022-02-16</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2020.06.20.163113v1?rss=1">
<title>
<![CDATA[
Unique K-mer sequences for validating cancer-related substitution, insertion and deletion mutations 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2020.06.20.163113v1?rss=1
</link>
<description><![CDATA[
Cancer genome sequencing has led to important discoveries, such as the identification of cancer genes. However, challenges remain in the analysis of cancer genome sequencing data. One significant issue is that mutations identified by multiple variant callers are frequently discordant even when using the same genome sequencing data. For insertion and deletion mutations, there is often no agreement among different callers. Identifying somatic mutations involves read mapping and variant calling, a complicated process that uses many parameters and model tuning. To validate the identification of true mutations, we developed a method using k-mer sequences. First, we characterized the landscape of unique versus non-unique k-mers in the human genome. Second, we developed a software package, KmerVC, to validate the given somatic mutations from sequencing data. Our program validates the occurrence of a mutation based on a statistically significant difference in the frequency of k-mers with and without the mutation from matched normal and tumor sequences. Third, we tested our method on both simulated and cancer genome sequencing data. Counting k-mers that involve mutations effectively validated true positive mutations, including insertions and deletions, across different individual samples in a reproducible manner. Thus, we demonstrated a straightforward approach for rapidly validating mutations from cancer genome sequencing data.
]]></description>
<dc:creator>Lee, H.</dc:creator>
<dc:creator>Shuaibi, A. A.</dc:creator>
<dc:creator>Bell, J.</dc:creator>
<dc:creator>Pavlichin, D. S.</dc:creator>
<dc:creator>Ji, H.</dc:creator>
<dc:date>2020-06-20</dc:date>
<dc:identifier>doi:10.1101/2020.06.20.163113</dc:identifier>
<dc:title><![CDATA[Unique K-mer sequences for validating cancer-related substitution, insertion and deletion mutations]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2020-06-20</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2024.12.16.628723v1?rss=1">
<title>
<![CDATA[
Long-read sequencing of hundreds of diverse brains provides insight into the impact of structural variation on gene expression and DNA methylation 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2024.12.16.628723v1?rss=1
</link>
<description><![CDATA[
Structural variants (SVs) drive gene expression in the human brain and are causative of many neurological conditions. However, most existing genetic studies have been based on short-read sequencing methods, which capture fewer than half of the SVs present in any one individual. Long-read sequencing (LRS) enhances our ability to detect disease-associated and functionally relevant SVs; however, its application in large-scale genomic studies has been limited by challenges in sample preparation and high costs. Here, we leverage a new scalable wet-lab protocol and computational pipeline for whole-genome Oxford Nanopore Technologies sequencing and apply it to neurologically normal control samples from the North American Brain Expression Consortium (NABEC) (European ancestry) and Human Brain Collection Core (HBCC) (African or African admixed ancestry) cohorts. Through this work, we present a publicly available long-read resource from 351 human brain samples (median N50: 27 Kbp and at an average depth of ~40x genome coverage). We discover approximately 234,905 SVs and produce locally phased assemblies that cover 95% of all protein-coding genes in GRCh38. Utilizing matched expression datasets for these samples, we apply quantitative trait locus (QTL) analyses and identify SVs that impact gene expression in post-mortem frontal cortex brain tissue. Further, we determine haplotype-specific methylation signatures at millions of CpGs and, with this data, identify cis-acting SVs. In summary, these results highlight that large-scale LRS can identify complex regulatory mechanisms in the brain that were inaccessible using previous approaches. We believe this new resource provides a critical step toward understanding the biological effects of genetic variation in the human brain.
]]></description>
<dc:creator>Billingsley, K. J.</dc:creator>
<dc:creator>Meredith, M.</dc:creator>
<dc:creator>Daida, K.</dc:creator>
<dc:creator>Alvarez Jerez, P.</dc:creator>
<dc:creator>Negi, S.</dc:creator>
<dc:creator>Malik, L.</dc:creator>
<dc:creator>Genner, R. M.</dc:creator>
<dc:creator>Moller, A.</dc:creator>
<dc:creator>Zheng, X.</dc:creator>
<dc:creator>Gibson, S. B.</dc:creator>
<dc:creator>Mastoras, M.</dc:creator>
<dc:creator>Baker, B.</dc:creator>
<dc:creator>Kouam, C.</dc:creator>
<dc:creator>Paquette, K.</dc:creator>
<dc:creator>Jarreau, P.</dc:creator>
<dc:creator>Makarious, M. B.</dc:creator>
<dc:creator>Moore, A.</dc:creator>
<dc:creator>Hong, S.</dc:creator>
<dc:creator>Vitale, D.</dc:creator>
<dc:creator>Shah, S.</dc:creator>
<dc:creator>Monlong, J.</dc:creator>
<dc:creator>Pantazis, C. B.</dc:creator>
<dc:creator>Asri, M.</dc:creator>
<dc:creator>Shafin, K.</dc:creator>
<dc:creator>Carnevali, P.</dc:creator>
<dc:creator>Marenco, S.</dc:creator>
<dc:creator>Auluck, P.</dc:creator>
<dc:creator>Mandal, A.</dc:creator>
<dc:creator>Miga, K. H.</dc:creator>
<dc:creator>Rhie, A.</dc:creator>
<dc:creator>Reed, X.</dc:creator>
<dc:creator>Ding, J.</dc:creator>
<dc:creator>Cookson, M. R.</dc:creator>
<dc:creator>Nalls, M.</dc:creator>
<dc:creator>Singleton, A.</dc:creator>
<dc:creator>Miller, D. E.</dc:creator>
<dc:creator>Chaisson, M.</dc:creator>
<dc:creator>Timp, W.</dc:creator>
<dc:creator>Gibbs, J. R.</dc:creator>
<dc:creator>Phillippy, A. M.</dc:creator>
<dc:creator>Kolmogorov, M.</dc:creator>
<dc:creator>Jain, M.</dc:creator>
<dc:creator>Sedlazeck, F. J.</dc:creator>
<dc:creator>Paten, B.</dc:creator>
<dc:creator>Blauwendraat, C.</dc:creator>
<dc:date>2024-12-18</dc:date>
<dc:identifier>doi:10.1101/2024.12.16.628723</dc:identifier>
<dc:title><![CDATA[Long-read sequencing of hundreds of diverse brains provides insight into the impact of structural variation on gene expression and DNA methylation]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2024-12-18</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2025.07.04.662981v1?rss=1">
<title>
<![CDATA[
Population differences of chromosome 22q11.2 duplication structure predisposes differentially to microdeletion and inversion. 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2025.07.04.662981v1?rss=1
</link>
<description><![CDATA[
The most common genomic disorder, chromosome 22q11.2 microdeletion syndrome (22q11.2DS), is mediated by highly identical and polymorphic segmental duplications (SDs) known as low copy repeats (LCRs; regions A-D) that have been challenging to sequence and characterize. Here, we report the sequence-resolved genomic architecture of 135 chromosome 22q11.2 haplotypes from diverse 1000 Genomes Project samples. We find that more than 90% of the copy number variation is polarized to the most proximal LCR region A (LCRA) where 50 distinct structural configurations are observed (~189 kbp to ~2.15 Mbp or 11-fold length variation). A higher-order SD cassette structure of 105 kbp in length, flanked by 25 kbp long inverted repeats, drives this variation and emerged in the human-chimpanzee ancestral lineage later expanding in humans ~1.0 [0.8-1.2] million years ago. African LCRA haplotypes are significantly longer (p=0.0047) when compared to non-Africans yet are predicted to be more protected against recurrent microdeletions (p=0.00053) due to a preponderance of flanking SDs in an inverted orientation. Conversely, we identified nine distinct inversion polymorphisms, including five recurrent ~2.28 Mbp inversions extending across the critical region (LCRA-D) and four smaller inversions (two LCRA-B, one LCRC-D, and one LCRB-D); 7/9 of these events were identified in haplotypes of African and admixed American ancestry. Finally, we sequence and assemble four families and show that LCRA-D deletion breakpoints map to the 105 kbp repeat unit while inversion breakpoints associate with the 25 kbp repeats adjacent to palindromic AT-rich regions. In one family, we observe evidence of more complex unequal crossover events associated with gene conversion and multiple breakpoints.
Our findings suggest that specific haplotype configurations are protective and susceptible to chromosome 22q11.2DS while recurrent large-scale inversions help to explain why this syndrome is less prevalent among individuals of African descent.
]]></description>
<dc:creator>Porubsky, D.</dc:creator>
<dc:creator>Yoo, D.</dc:creator>
<dc:creator>Dishuck, P. C.</dc:creator>
<dc:creator>Koundinya, N.</dc:creator>
<dc:creator>Souche, E.</dc:creator>
<dc:creator>Harvey, W. T.</dc:creator>
<dc:creator>Munson, K. M.</dc:creator>
<dc:creator>Hoekzema, K.</dc:creator>
<dc:creator>Chan, D. D.</dc:creator>
<dc:creator>Leung, T. Y.</dc:creator>
<dc:creator>Santos, M. S.</dc:creator>
<dc:creator>Meynants, S.</dc:creator>
<dc:creator>Swillen, A.</dc:creator>
<dc:creator>Breckpot, J.</dc:creator>
<dc:creator>Tsapalou, V.</dc:creator>
<dc:creator>Hasenfeld, P.</dc:creator>
<dc:creator>Korbel, J. O.</dc:creator>
<dc:creator>Lansdorp, P. M.</dc:creator>
<dc:creator>Vermeesch, J. R.</dc:creator>
<dc:creator>Eichler, E. E.</dc:creator>
<dc:date>2025-07-05</dc:date>
<dc:identifier>doi:10.1101/2025.07.04.662981</dc:identifier>
<dc:title><![CDATA[Population differences of chromosome 22q11.2 duplication structure predisposes differentially to microdeletion and inversion.]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2025-07-05</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2025.06.05.657102v1?rss=1">
<title>
<![CDATA[
Pangenome-aware DeepVariant 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2025.06.05.657102v1?rss=1
</link>
<description><![CDATA[
Population-scale genomics information provides valuable prior knowledge for various genomic analyses, especially variant calling. A notable example of such application is the human pangenome reference released by the Human Pangenome Reference Consortium, which has been shown to improve read mapping and structural variant genotyping. In this work, we introduce pangenome-aware DeepVariant, a variant caller that uses a pangenome reference alongside sample-specific read alignments. It generates pileup images of both reads and pangenome haplotypes near potential variants and uses a Convolutional Neural Network to infer genotypes. This approach allows directly using a pangenome for distinguishing true variant signals from sequencing or alignment noise. We assessed its performance on various short-read sequencing platforms and read mappers. Across all settings, pangenome-aware DeepVariant outperformed the linear-reference-based DeepVariant, reducing errors by up to 25.5%. We also show that Element reads with pangenome-aware DeepVariant can achieve 23.6% more accurate variant calling performance compared to existing methods.
]]></description>
<dc:creator>Asri, M.</dc:creator>
<dc:creator>Chang, P.-C.</dc:creator>
<dc:creator>Mier, J. C.</dc:creator>
<dc:creator>Siren, J.</dc:creator>
<dc:creator>Eskandar, P.</dc:creator>
<dc:creator>Kolesnikov, A.</dc:creator>
<dc:creator>Cook, D. E.</dc:creator>
<dc:creator>Brambrink, L.</dc:creator>
<dc:creator>Hickey, G.</dc:creator>
<dc:creator>Novak, A. M.</dc:creator>
<dc:creator>Dorfman, L.</dc:creator>
<dc:creator>Webster, D. R.</dc:creator>
<dc:creator>Carroll, A.</dc:creator>
<dc:creator>Paten, B.</dc:creator>
<dc:creator>Shafin, K.</dc:creator>
<dc:date>2025-06-06</dc:date>
<dc:identifier>doi:10.1101/2025.06.05.657102</dc:identifier>
<dc:title><![CDATA[Pangenome-aware DeepVariant]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2025-06-06</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2025.05.12.653561v1?rss=1">
<title>
<![CDATA[
Lossless Pangenome Indexing Using Tag Arrays 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2025.05.12.653561v1?rss=1
</link>
<description><![CDATA[
Pangenome graphs represent the genomic variation by encoding multiple haplotypes within a unified graph structure. However, efficient and lossless indexing of such structures remains challenging due to the scale and complexity of pangenomic data. We present a practical and scalable indexing framework based on tag arrays, which annotate positions in the Burrows-Wheeler transform (BWT) with graph coordinates. Our method extends the FM-index with a run-length compressed tag structure that enables efficient retrieval of all unique graph locations where a query pattern appears. We introduce a novel construction algorithm that combines unique k-mers, graph-based extensions, and haplotype traversal to compute the tag array in a memory-efficient manner. To support large genomes, we process each chromosome independently and then merge the results into a unified index using properties of the multi-string BWT and r-index. Our evaluation on the HPRC graphs demonstrates that the tag array structure compresses effectively, scales well with added haplotypes, and preserves accurate mapping information across diverse regions of the genome. This indexing method enables lossless and haplotype-aware querying in complex pangenomes and offers a practical indexing layer to develop scalable aligners and downstream graph-based analysis tools. The index additionally supports efficient one-to-all coordinate translation, enabling any interval on a haplotype to be mapped to its corresponding intervals across all other haplotypes in the graph.
]]></description>
<dc:creator>Eskandar, P.</dc:creator>
<dc:creator>Paten, B.</dc:creator>
<dc:creator>Siren, J.</dc:creator>
<dc:date>2025-05-15</dc:date>
<dc:identifier>doi:10.1101/2025.05.12.653561</dc:identifier>
<dc:title><![CDATA[Lossless Pangenome Indexing Using Tag Arrays]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2025-05-15</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2024.06.14.599122v1?rss=1">
<title>
<![CDATA[
A haplotype-resolved view of human gene regulation 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2024.06.14.599122v1?rss=1
</link>
<description><![CDATA[
Diploid human cells contain two non-identical genomes, and differences in their regulation underlie human development and disease. We present Fiber-seq Inferred Regulatory Elements (FIRE) and show that FIRE provides a more comprehensive and quantitative snapshot of the accessible chromatin landscape across the 6 Gbp diploid human genome, overcoming previously known and unknown biases afflicting our existing regulatory element catalog. FIRE provides a comprehensive genome-wide map of haplotype-selective chromatin accessibility (HSCA), exposing novel imprinted elements that lack underlying parent-of-origin CpG methylation differences, common and rare genetic variants that disrupt gene regulatory patterns, gene regulatory modules that enable genes to escape X chromosome inactivation, and autosomal mitotically stable somatic epimutations. We find that the human leukocyte antigen (HLA) locus harbors the most HSCA in immune cells, and we resolve the specific transcription factor (TF) binding events disrupted by disease-associated variants within the HLA locus. Finally, we demonstrate that the regulatory landscape of a cell is littered with autosomal somatic epimutations that are propagated by clonal expansions to create mitotically stable and non-genetically deterministic chromatin alterations.
]]></description>
<dc:creator>Vollger, M. R.</dc:creator>
<dc:creator>Swanson, E. G.</dc:creator>
<dc:creator>Neph, S. J.</dc:creator>
<dc:creator>Ranchalis, J.</dc:creator>
<dc:creator>Munson, K. M.</dc:creator>
<dc:creator>Ho, C.-H.</dc:creator>
<dc:creator>Sedeno-Cortes, A. E.</dc:creator>
<dc:creator>Fondrie, W. E.</dc:creator>
<dc:creator>Bohaczuk, S. C.</dc:creator>
<dc:creator>Mao, Y.</dc:creator>
<dc:creator>Parmalee, N. L.</dc:creator>
<dc:creator>Mallory, B. J.</dc:creator>
<dc:creator>Harvey, W. T.</dc:creator>
<dc:creator>Kwon, Y.</dc:creator>
<dc:creator>Garcia, G. H.</dc:creator>
<dc:creator>Hoekzema, K.</dc:creator>
<dc:creator>Meyer, J. G.</dc:creator>
<dc:creator>Cicek, M.</dc:creator>
<dc:creator>Eichler, E. E.</dc:creator>
<dc:creator>Noble, W. S.</dc:creator>
<dc:creator>Witten, D. M.</dc:creator>
<dc:creator>Bennett, J. T.</dc:creator>
<dc:creator>Ray, J. P.</dc:creator>
<dc:creator>Stergachis, A. B.</dc:creator>
<dc:date>2024-06-16</dc:date>
<dc:identifier>doi:10.1101/2024.06.14.599122</dc:identifier>
<dc:title><![CDATA[A haplotype-resolved view of human gene regulation]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2024-06-16</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2025.05.14.653340v1?rss=1">
<title>
<![CDATA[
Assembling unmapped reads reveals hidden variation in South Asian genomes 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2025.05.14.653340v1?rss=1
</link>
<description><![CDATA[
Conventional genome mapping-based approaches systematically overlook genetic variation, particularly in regions that substantially differ from the reference. To explore this hidden variation, we examined unmapped and poorly mapped reads from the genomes of 640 human individuals from South Asian (SAS) populations in the 1000 Genomes Project and the Simons Genome Diversity Project. We assembled tens of megabases of non-redundant sequence in tens of thousands of large contigs, a significant portion of which is present in both SAS and non-SAS populations. We demonstrated that much of this sequence is not discovered by traditional variant discovery approaches even when using complete genomes and pangenomes. Across 20,000 placed contigs, we found 8,215 intersections with 106 protein coding genes and >15,000 placements within 1 kbp of a known GWAS hit. We used long read data from a subset of samples to validate the majority of their assembled sequences, aligned RNA-seq data to identify hundreds of unplaced contigs with transcriptional potential, and queried existing nucleotide databases to infer the origins of the remaining unplaced sequences. Our results highlight the limitations of even the most complete reference genomes and provide a model for understanding the distribution of hidden variation in any human population.
]]></description>
<dc:creator>Das, A.</dc:creator>
<dc:creator>Biddanda, A.</dc:creator>
<dc:creator>McCoy, R. C.</dc:creator>
<dc:creator>Schatz, M.</dc:creator>
<dc:date>2025-05-14</dc:date>
<dc:identifier>doi:10.1101/2025.05.14.653340</dc:identifier>
<dc:title><![CDATA[Assembling unmapped reads reveals hidden variation in South Asian genomes]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2025-05-14</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2025.04.14.648685v1?rss=1">
<title>
<![CDATA[
Efficient near telomere-to-telomere assembly of Nanopore Simplex reads 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2025.04.14.648685v1?rss=1
</link>
<description><![CDATA[
Telomere-to-telomere (T2T) assembly is the ultimate goal for de novo genome assembly. Existing algorithms capable of near T2T assembly all require Oxford Nanopore Technologies (ONT) ultra-long reads which are costly and experimentally challenging to obtain and are thus often unavailable for samples without established cell lines. Here, we introduce hifiasm (ONT), the first algorithm that can produce near T2T assemblies from standard ONT Simplex reads, eliminating the need for ultra-long sequencing. Compared to existing methods, hifiasm (ONT) reduces the computational demands by an order of magnitude and reconstructs more chromosomes from telomere to telomere on the same datasets. This advancement substantially broadens the feasibility of T2T assembly for applications previously limited by the high cost and experimental requirement of ultra-long reads.
]]></description>
<dc:creator>Cheng, H.</dc:creator>
<dc:creator>Qu, H.</dc:creator>
<dc:creator>McKenzie, S.</dc:creator>
<dc:creator>Lawrence, K. R.</dc:creator>
<dc:creator>Windsor, R.</dc:creator>
<dc:creator>Vella, M.</dc:creator>
<dc:creator>Park, P. J.</dc:creator>
<dc:creator>Li, H.</dc:creator>
<dc:date>2025-04-17</dc:date>
<dc:identifier>doi:10.1101/2025.04.14.648685</dc:identifier>
<dc:title><![CDATA[Efficient near telomere-to-telomere assembly of Nanopore Simplex reads]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2025-04-17</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2025.05.26.656191v1?rss=1">
<title>
<![CDATA[
SNP calling, haplotype phasing and allele-specific analysis with long RNA-seq reads 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2025.05.26.656191v1?rss=1
</link>
<description><![CDATA[
Long-read RNA sequencing (lrRNA-seq) is a powerful technology to link transcript structures to genetic variants but such analysis is not often performed due to the lack of end-user tools. Here, we introduce longcallR for joint SNP calling, haplotype phasing, and allele-specific analysis, which achieves high accuracy on benchmark datasets. Applied to 202 human samples, longcallR identified 88 significant allele-specific splicing events per sample on average. 46% of them involved unannotated junctions.
]]></description>
<dc:creator>Huang, N.</dc:creator>
<dc:creator>Human Pangenome Reference Consortium</dc:creator>
<dc:creator>Li, H.</dc:creator>
<dc:date>2025-05-29</dc:date>
<dc:identifier>doi:10.1101/2025.05.26.656191</dc:identifier>
<dc:title><![CDATA[SNP calling, haplotype phasing and allele-specific analysis with long RNA-seq reads]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2025-05-29</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2026.01.29.702431v1?rss=1">
<title>
<![CDATA[
Spatial mapping of RNA turnover kinetics and regulatory landscapes of mRNA stability in the mammalian brain 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2026.01.29.702431v1?rss=1
</link>
<description><![CDATA[
Spatial and activity-dependent gene regulation in the mammalian brain requires coordinated control of RNA synthesis and degradation 1,2, yet spatially resolved measurement of RNA turnover kinetics in complex tissues remains technically challenging 3. Here, we present spatial NT-seq, an approach that integrates transgenesis-free metabolic RNA labeling with in situ chemical recoding to spatially co-map RNA abundance and turnover kinetics in the mouse brain. By distinguishing newly synthesized from pre-existing RNAs, this method reveals spatially resolved transcriptional and post-transcriptional responses to electroconvulsive stimulation (ECS), a treatment for refractory depression. We uncover pronounced spatial heterogeneity in RNA turnover, with the dentate gyrus (DG) exhibiting elevated basal RNA turnover and robust ECS-induced responses. These findings reveal a "kinetics scaling" mechanism of coordinated regulation of RNA synthesis and decay, by which DG cells can rapidly remodel their transcript pools in response to external stimuli or differentiation signals 4,5. Machine learning applied to in vivo RNA kinetics landscapes further identifies sequence features and post-transcriptional regulators underlying region- and cell-type-specific control of mRNA stability. Together, this integrated experimental and computational framework, in vivo Timescope, enables transcriptome-wide mapping of RNA turnover kinetics and the regulatory architecture of RNA stability across spatial and cellular contexts, providing new insights into the spatiotemporal regulation of RNA dynamics in brain function and disease.
]]></description>
<dc:creator>Qiu, Q.</dc:creator>
<dc:creator>Zhang, H.</dc:creator>
<dc:creator>Xia, Z.</dc:creator>
<dc:creator>Gao, W.</dc:creator>
<dc:creator>Leu, J.</dc:creator>
<dc:creator>Liang, D.</dc:creator>
<dc:creator>Li, Y.</dc:creator>
<dc:creator>Su, Y.</dc:creator>
<dc:creator>Feierman, E.</dc:creator>
<dc:creator>Horn, E. V.</dc:creator>
<dc:creator>Ming, G.-l.</dc:creator>
<dc:creator>Korb, E.</dc:creator>
<dc:creator>Song, H.</dc:creator>
<dc:creator>Zhou, Z.</dc:creator>
<dc:creator>Wu, H.</dc:creator>
<dc:date>2026-01-29</dc:date>
<dc:identifier>doi:10.64898/2026.01.29.702431</dc:identifier>
<dc:title><![CDATA[Spatial mapping of RNA turnover kinetics and regulatory landscapes of mRNA stability in the mammalian brain]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2026-01-29</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2025.06.03.657745v1?rss=1">
<title>
<![CDATA[
Highly sensitive and scalable time-resolved RNA sequencing in single cells with scNT-seq2 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2025.06.03.657745v1?rss=1
</link>
<description><![CDATA[
Understanding gene expression dynamics requires resolving newly synthesized RNAs from pre-existing pools at single-cell resolution. Here, we present scNT-seq2, a highly sensitive and scalable method for time-resolved single-cell RNA sequencing. By systematically optimizing the second-strand cDNA synthesis (2nd SS) step, we substantially improved read alignment rates, reduced background mutations, and enhanced library complexity compared to the original scNT-seq 1. Benchmarking in 4sU-labeled K562 cells demonstrated that scNT-seq2 accurately quantifies newly synthesized transcripts and preserves gene-level RNA turnover. The enhanced sensitivity enables robust detection of dynamic, cell-cycle-state-specific genes, such as S-phase regulators. Together, scNT-seq2 provides an efficient and versatile tool for dissecting transcriptional dynamics across diverse biological systems at single-cell resolution.
]]></description>
<dc:creator>Qiu, Q.</dc:creator>
<dc:creator>Zhang, H.</dc:creator>
<dc:creator>Gao, W.</dc:creator>
<dc:creator>Li, F.</dc:creator>
<dc:creator>Liang, D.</dc:creator>
<dc:creator>Wu, H.</dc:creator>
<dc:date>2025-06-03</dc:date>
<dc:identifier>doi:10.1101/2025.06.03.657745</dc:identifier>
<dc:title><![CDATA[Highly sensitive and scalable time-resolved RNA sequencing in single cells with scNT-seq2]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2025-06-03</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2025.04.01.646460v1?rss=1">
<title>
<![CDATA[
Integrated single-cell and spatial multiomic analysis reveals widespread reactivation of developmental programs in diseased human hearts 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2025.04.01.646460v1?rss=1
</link>
<description><![CDATA[
The first organ to develop in utero, the human heart undergoes significant changes during development and must sustain its function over a lifetime. To better characterize molecular changes in human cardiac cell types across sex, aging, development, and disease, we analyzed single nucleus RNA sequencing (snRNA-seq) datasets from 299 donors, identifying many more differentially expressed genes (DEGs) across developmental and disease states than by sex and age. In cardiomyocytes and most non-cardiomyocyte cell types, developmental and disease DEGs showed significant overlap. Cardiac development and disease were associated with convergent changes in non-cardiomyocyte intercellular communication, including TGFβ signaling, but differences in cell-type proportions. By integrating snRNA-seq with 106 snATAC-seq datasets, we reveal potential transcription factors driving fetal reactivation in disease. Finally, using spatial transcriptomics data, we identify that fetal reactivation is highly localized in niches. This work offers the largest multimodal, cell-type resolved interrogation of the human heart, providing insights into convergence in development and disease.
]]></description>
<dc:creator>Gao, W.</dc:creator>
<dc:creator>Hu, P.</dc:creator>
<dc:creator>Qiu, Q.</dc:creator>
<dc:creator>Kang, X.</dc:creator>
<dc:creator>Bedi, K.</dc:creator>
<dc:creator>Sasaki, K.</dc:creator>
<dc:creator>Margulies, K.</dc:creator>
<dc:creator>Wu, H.</dc:creator>
<dc:date>2025-04-07</dc:date>
<dc:identifier>doi:10.1101/2025.04.01.646460</dc:identifier>
<dc:title><![CDATA[Integrated single-cell and spatial multiomic analysis reveals widespread reactivation of developmental programs in diseased human hearts]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2025-04-07</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2023.09.26.559662v1?rss=1">
<title>
<![CDATA[
5-hydroxymethylcytosines regulate gene expression as a passive DNA demethylation resisting epigenetic mark in proliferative somatic cells 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2023.09.26.559662v1?rss=1
</link>
<description><![CDATA[
Enzymatic erasure of DNA methylation in mammals involves iterative 5-methylcytosine (5mC) oxidation by the ten-eleven translocation (TET) family of DNA dioxygenase proteins. As the most abundant form of oxidized 5mC, the prevailing model considers 5-hydroxymethylcytosine (5hmC) as a key nexus in active DNA demethylation that can either indirectly facilitate replication-dependent depletion of 5mC by inhibiting maintenance DNA methylation machinery (UHRF1/DNMT1), or directly be iteratively oxidized to 5-formylcytosine (5fC) and 5-carboxycytosine (5caC) and restored to cytosine (C) through thymine DNA glycosylase (TDG)-mediated 5fC/5caC excision repair. In proliferative somatic cells, to what extent TET-dependent removal of 5mC entails indirect DNA demethylation via 5hmC-induced replication-dependent dilution or direct iterative conversion of 5hmC to 5fC/5caC is unclear. Here we leverage a catalytic processivity stalling variant of human TET1 (TET1.var: T1662E) to decouple the stepwise generation of 5hmC from subsequent 5fC/5caC generation, excision and repair. By using a CRISPR/dCas9-based epigenome-editing platform, we demonstrate that 5fC/5caC excision repair (by wild-type TET1, TET1.wt), but not 5hmC generation alone (by TET1.var), is requisite for robust restoration of unmodified cytosines and reversal of somatic silencing of the methylation-sensitive, germline-specific RHOXF2B gene promoter. Furthermore, integrated whole-genome multi-modal epigenetic sequencing reveals that hemi-hydroxymethylated CpG dyads predominantly resist replication-dependent depletion of 5mC on the opposing strand in TET1.var-expressing cells. Notably, TET1.var-mediated 5hmC generation is sufficient to induce similar levels of differential gene expression (compared to TET1.wt) without inducing major changes in unmodified cytosine profiles across the genome. 
Our study suggests 5hmC alone plays a limited role in driving replication-dependent DNA demethylation in the presence of functional DNMT1/UHRF1 mechanisms, but can regulate gene expression as a bona fide epigenetic mark in proliferative somatic cells.
]]></description>
<dc:creator>Wei, A.</dc:creator>
<dc:creator>Zhang, H.</dc:creator>
<dc:creator>Qiu, Q.</dc:creator>
<dc:creator>Fabyanic, E. B.</dc:creator>
<dc:creator>Hu, P.</dc:creator>
<dc:creator>Wu, H.</dc:creator>
<dc:date>2023-09-27</dc:date>
<dc:identifier>doi:10.1101/2023.09.26.559662</dc:identifier>
<dc:title><![CDATA[5-hydroxymethylcytosines regulate gene expression as a passive DNA demethylation resisting epigenetic mark in proliferative somatic cells]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2023-09-27</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2023.04.15.537037v1?rss=1">
<title>
<![CDATA[
A transient dermal niche and dual epidermal programs underlie sweat gland development 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2023.04.15.537037v1?rss=1
</link>
<description><![CDATA[
Eccrine glands are mammalian skin appendages indispensable for human thermoregulation. Like all skin-derived appendages, eccrine glands form from multipotent progenitors in the basal skin epidermis. It remains unclear how epidermal progenitors progressively specialize to specifically form eccrine glands, precluding efforts to regenerate these vital organs. Herein, we applied single nucleus transcriptomics to compare the expression content of wildtype, eccrine-forming mouse skin to that of mice harboring a skin-specific disruption of Engrailed 1 (En1), a transcription factor that promotes the formation of eccrine glands in both humans and mice. We identify two concurrent epidermal transcriptomes in the earliest eccrine anlagen: a predominant transcriptome that is shared with hair follicles, and a vastly underrepresented transcriptome that is En1-dependent and eccrine-specific. We demonstrate that differentiation of the eccrine anlage requires the induction of a transient and transcriptionally unique dermal niche that forms around each developing gland in humans and mice. Our study defines the transcriptional determinants underlying eccrine identity in the epidermis and uncovers the dermal niche required for eccrine developmental progression. By identifying these defining components of the eccrine developmental program, our findings set the stage for directed efforts to regenerate eccrine glands for comprehensive skin repair.
]]></description>
<dc:creator>Dingwall, H. L.</dc:creator>
<dc:creator>Tomizawa, R. R.</dc:creator>
<dc:creator>Aharoni, A.</dc:creator>
<dc:creator>Hu, P.</dc:creator>
<dc:creator>Qiu, Q.</dc:creator>
<dc:creator>Kokalari, B.</dc:creator>
<dc:creator>Martinez, S. M.</dc:creator>
<dc:creator>Donahue, J. C.</dc:creator>
<dc:creator>Aldea, D.</dc:creator>
<dc:creator>Mendoza, M.</dc:creator>
<dc:creator>Glass, I. A.</dc:creator>
<dc:creator>Birth Defects Research Laboratory</dc:creator>
<dc:creator>Wu, H.</dc:creator>
<dc:creator>Kamberov, Y. G.</dc:creator>
<dc:date>2023-04-17</dc:date>
<dc:identifier>doi:10.1101/2023.04.15.537037</dc:identifier>
<dc:title><![CDATA[A transient dermal niche and dual epidermal programs underlie sweat gland development]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2023-04-17</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2026.03.12.711379v1?rss=1">
<title>
<![CDATA[
pertTF: context-aware AI modeling for genome-scale and cross-system perturbation prediction 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2026.03.12.711379v1?rss=1
</link>
<description><![CDATA[
Predicting genetic perturbation responses at a single-cell level is central to building models for cell state and disease. However, existing approaches are limited in predicting phenotypic outcomes beyond expression changes and in generalizing predictions across genome-scale perturbations in biologically relevant contexts. Here we introduce pertTF, a transformer-based single-cell genetic perturbation model. pertTF was trained on a unique dataset capturing single-cell expression profiles of 30 full gene knockouts across 14 relevant cell types during human pancreatic development and beta-cell differentiation. pertTF outperforms current methods in predicting expression changes from perturbing unseen genes in unseen cellular contexts. In addition, pertTF infers perturbation-induced shifts in cell identity and population composition, an important phenotypic outcome of perturbation in many physiological and disease settings. Through transfer learning, pertTF operates in physiologically relevant systems, including primary human islets, where large-scale perturbation experiments are challenging. The generalizability of pertTF is further demonstrated by in silico pooled and single-cell CRISPR screens, capturing critical regulators of stem cells and early pancreatic cell development. These results establish pertTF as a framework for integrating large-scale single-cell perturbation data with AI models to predict genetic perturbation effects across cellular systems and disease contexts.
]]></description>
<dc:creator>Su, Y.</dc:creator>
<dc:creator>Liu, D.</dc:creator>
<dc:creator>Menon, V.</dc:creator>
<dc:creator>Song, B.</dc:creator>
<dc:creator>Boccara, S.</dc:creator>
<dc:creator>Zhang, N.</dc:creator>
<dc:creator>Zhao, H.</dc:creator>
<dc:creator>Zhao, J. H.</dc:creator>
<dc:creator>Wang, L.</dc:creator>
<dc:creator>Hu, N.</dc:creator>
<dc:creator>Nzima, M.</dc:creator>
<dc:creator>Katz, A.</dc:creator>
<dc:creator>Swargam, B. K.</dc:creator>
<dc:creator>Ament, S. A.</dc:creator>
<dc:creator>Diao, Y.</dc:creator>
<dc:creator>Zhang, H.</dc:creator>
<dc:creator>Chao, L.</dc:creator>
<dc:creator>Hon, G.</dc:creator>
<dc:creator>Huangfu, D.</dc:creator>
<dc:creator>Li, W.</dc:creator>
<dc:date>2026-03-16</dc:date>
<dc:identifier>doi:10.64898/2026.03.12.711379</dc:identifier>
<dc:title><![CDATA[pertTF: context-aware AI modeling for genome-scale and cross-system perturbation prediction]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2026-03-16</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2026.03.05.709811v1?rss=1">
<title>
<![CDATA[
FourC: identifying significant and differential contacts in 1D chromatin conformation data 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2026.03.05.709811v1?rss=1
</link>
<description><![CDATA[
4C-seq is a cost-effective 3C-based assay that measures the interactions between a single genomic element and all other genomic elements. However, 4C-seq data remain semi-quantitative because they cannot be deduplicated without UMIs. To address this, we developed an open-source method, FourC, based on a Bayesian Bernoulli regression model, that overcomes the duplication problem and models spatial patterns with Gaussian processes to identify significantly enriched and differential contacts. We demonstrate the utility of FourC on 4C-seq data that profile the local chromatin structure at key genes necessary for pancreatic differentiation and under CRISPR perturbation of enhancers.
]]></description>
<dc:creator>Wong, W.</dc:creator>
<dc:creator>Kaplan, S. J.</dc:creator>
<dc:creator>Luo, R.</dc:creator>
<dc:creator>Pulecio Rojas, J. A.</dc:creator>
<dc:creator>Yan, J.</dc:creator>
<dc:creator>Huangfu, D.</dc:creator>
<dc:creator>Leslie, C. S.</dc:creator>
<dc:date>2026-03-07</dc:date>
<dc:identifier>doi:10.64898/2026.03.05.709811</dc:identifier>
<dc:title><![CDATA[FourC: identifying significant and differential contacts in 1D chromatin conformation data]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2026-03-07</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2026.03.07.710282v1?rss=1">
<title>
<![CDATA[
Reprogramming of neuronal genome function and phenotype by astrocytes 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2026.03.07.710282v1?rss=1
</link>
<description><![CDATA[
Heterotypic cell-cell interactions are critical to governing cellular physiology, disease progression, and responses to the environment and pharmacologic interventions. For example, neurons and astrocytes engage in intricate interactions that are essential for brain development and function [1-3]. However, the transformation of these extracellular signals into epigenomic regulation that governs cell function is poorly understood. Here, we report that weeks of co-culture between human induced pluripotent stem cell (hiPSC)-derived neurons and mouse cortical astrocytes extensively reprograms gene expression and the chromatin accessibility landscape in neurons, affecting thousands of genes and putative gene regulatory elements (REs), including many transcription factors (TFs). These genes are enriched for functions implicated in neuronal differentiation and maturation, and tend to be impacted in schizophrenia and autosomal dominant Alzheimer's disease. Through complementary CRISPR interference and activation screens, we recapitulated hundreds of astrocyte-induced transcriptional and chromatin remodeling events in mono-cultured neurons at both promoters and distal REs of TF genes. We discovered functional REs for ~50 astrocyte-responsive TF genes, providing a map of gene regulatory network control. Astrocyte-responsive TF genes fall into groups that exert independent or counter-balancing transcriptional effects, highlighting the complex coordination of the neuronal response to astrocytes. Functional effects of specific TFs, including POU3F2 and TFAP2E, on neurite morphology and neuronal electrophysiology are consistent with transcriptional effects, demonstrating the capacity of direct epigenetic control to mimic heterotypic cellular signals.
This work illuminates the regulation of neurodevelopment- and disease-relevant gene modules by neuron-astrocyte interactions, and provides a blueprint for applying modern functional genomics to uncover the links between cell microenvironment and epigenomic programming.

Highlights:
- Neuronal gene expression and chromatin accessibility landscape are profoundly remodeled by astrocytes over weeks of co-culture
- Astrocyte-responsive neuronal gene modules and neuron-responsive astrocytic gene modules are enriched for genes associated with schizophrenia and familial Alzheimer's disease
- Single-cell CRISPR interference and activation screens of astrocyte-responsive gene regulatory elements identified dozens of functional regulatory elements of TF genes in neurons
- Single-cell CRISPR interference and activation screens of >200 astrocyte-responsive TF genes uncovered discrete functional clusters that promote neuronal maturity or stemness
- Astrocyte-responsive TF genes reprogram neuronal electrophysiology and neurite morphology
]]></description>
<dc:creator>Li, B.</dc:creator>
<dc:creator>Hagy, K.</dc:creator>
<dc:creator>Safi, A.</dc:creator>
<dc:creator>Beer, M. A.</dc:creator>
<dc:creator>Barrera, A.</dc:creator>
<dc:creator>Geraghty, S.</dc:creator>
<dc:creator>Rai, R.</dc:creator>
<dc:creator>Pederson, A. N.</dc:creator>
<dc:creator>Reisman, S. J.</dc:creator>
<dc:creator>Love, M. I.</dc:creator>
<dc:creator>Sullivan, P. F.</dc:creator>
<dc:creator>Eroglu, C.</dc:creator>
<dc:creator>Crawford, G. E.</dc:creator>
<dc:creator>Gersbach, C. A.</dc:creator>
<dc:date>2026-03-07</dc:date>
<dc:identifier>doi:10.64898/2026.03.07.710282</dc:identifier>
<dc:title><![CDATA[Reprogramming of neuronal genome function and phenotype by astrocytes]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2026-03-07</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2025.08.07.669196v1?rss=1">
<title>
<![CDATA[
Perturb-seq reveals distinct responses to pluripotency regulator dosages underlying the control of self-renewal and differentiation 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2025.08.07.669196v1?rss=1
</link>
<description><![CDATA[
Precise regulation of transcription factor (TF) expression is critical for maintaining cell identity, but studies on how graded expression levels affect cellular phenotypes are limited. To address this gap, we employed human embryonic stem cells (hESCs) as a dynamic model to study gene dosage effects and systematically titrated expression of the key TFs NANOG and OCT4 using CRISPR interference (CRISPRi). We then profiled transcriptomic changes in hESCs under self-renewal and differentiation conditions using single-cell RNA-seq (scRNA-seq). Quantitative modeling of these Perturb-seq datasets uncovers distinct response patterns for different types of genes, including a striking non-monotonic response of lineage-specific genes during differentiation, indicating that mild perturbations of hESC TFs promote differentiation while strong perturbations compromise it. These discoveries suggest that fine-tuning the dosage of stem cell TFs can enhance differentiation efficiency and underscore the importance of characterizing TF function across a gradient of expression levels.
]]></description>
<dc:creator>Yan, J.</dc:creator>
<dc:creator>Cho, H. S.</dc:creator>
<dc:creator>Luo, R.</dc:creator>
<dc:creator>Beer, M. A.</dc:creator>
<dc:creator>Li, W.</dc:creator>
<dc:creator>Huangfu, D.</dc:creator>
<dc:date>2025-08-08</dc:date>
<dc:identifier>doi:10.1101/2025.08.07.669196</dc:identifier>
<dc:title><![CDATA[Perturb-seq reveals distinct responses to pluripotency regulator dosages underlying the control of self-renewal and differentiation]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2025-08-08</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2024.12.21.628413v1?rss=1">
<title>
<![CDATA[
Discovery of NANOG enhancers and their essential roles in self-renewal and differentiation in human embryonic stem cells 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2024.12.21.628413v1?rss=1
</link>
<description><![CDATA[
Human embryonic stem cells (hESCs) are notable for their ability to self-renew and to differentiate into all tissue types in the body. NANOG is a core regulator of hESC identity, and dynamic control of its expression is crucial to maintain the balance between self-renewal and differentiation. Transcriptional regulation depends on enhancers, but NANOG enhancers in hESCs are not well characterized. Here we report two NANOG enhancers discovered from a CRISPR interference screen in hESCs. Deletion of a single copy of either enhancer significantly reduced NANOG expression, compromising self-renewal and increasing differentiation propensity. Interestingly, these two NANOG enhancers are involved in a tandem duplication event found in certain primates including humans but not in mice. However, the duplicated counterparts do not regulate NANOG expression. This work expands our knowledge of functional enhancers in hESCs, and highlights the sensitivity of the hESC state to the dosage of core regulators and their enhancers.
]]></description>
<dc:creator>Yan, J.</dc:creator>
<dc:creator>Luo, R.</dc:creator>
<dc:creator>Rosen, B. P.</dc:creator>
<dc:creator>Liu, D.</dc:creator>
<dc:creator>Wong, W.</dc:creator>
<dc:creator>Leslie, C. S.</dc:creator>
<dc:creator>Huangfu, D.</dc:creator>
<dc:date>2024-12-21</dc:date>
<dc:identifier>doi:10.1101/2024.12.21.628413</dc:identifier>
<dc:title><![CDATA[Discovery of NANOG enhancers and their essential roles in self-renewal and differentiation in human embryonic stem cells]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2024-12-21</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2023.05.03.539283v1?rss=1">
<title>
<![CDATA[
Parallel genome-scale CRISPR screens distinguish pluripotency and self-renewal 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2023.05.03.539283v1?rss=1
</link>
<description><![CDATA[
Pluripotent stem cells are defined by their self-renewal capacity, which is the ability of the stem cells to proliferate indefinitely while maintaining the pluripotent identity essential for their ability to differentiate into any somatic cell lineage. However, understanding the mechanisms that control stem cell fitness versus the pluripotent cell identity is challenging. To investigate the interplay between these two aspects of pluripotency, we performed four parallel genome-scale CRISPR-Cas9 loss-of-function screens interrogating stem cell fitness in hPSC self-renewal conditions, and the dissolution of the primed pluripotency identity during early differentiation. Comparative analyses led to the discovery of genes with distinct roles in pluripotency regulation, including mitochondrial and metabolism regulators crucial for stem cell fitness, and chromatin regulators that control pluripotent identity during early differentiation. We further discovered a core set of factors that control both stem cell fitness and pluripotent identity, including a network of chromatin factors that safeguard pluripotency. Our unbiased and systematic screening and comparative analyses disentangle two interconnected aspects of pluripotency, provide rich datasets for exploring pluripotent cell identity versus cell fitness, and offer a valuable model for categorizing gene function in broad biological contexts.
]]></description>
<dc:creator>Rosen, B. P.</dc:creator>
<dc:creator>Li, Q. V.</dc:creator>
<dc:creator>Cho, H.</dc:creator>
<dc:creator>Liu, D.</dc:creator>
<dc:creator>Yang, D.</dc:creator>
<dc:creator>Graff, S.</dc:creator>
<dc:creator>Yan, J.</dc:creator>
<dc:creator>Luo, R.</dc:creator>
<dc:creator>Verma, N.</dc:creator>
<dc:creator>Damodaran, J. R.</dc:creator>
<dc:creator>Beer, M. A.</dc:creator>
<dc:creator>Sidoli, S.</dc:creator>
<dc:creator>Huangfu, D.</dc:creator>
<dc:date>2023-05-03</dc:date>
<dc:identifier>doi:10.1101/2023.05.03.539283</dc:identifier>
<dc:title><![CDATA[Parallel genome-scale CRISPR screens distinguish pluripotency and self-renewal]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2023-05-03</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2024.04.26.591412v1?rss=1">
<title>
<![CDATA[
CRISPR Screening Uncovers a Long-Range Enhancer for ONECUT1 in Pancreatic Differentiation and Links a Diabetes Risk Variant 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2024.04.26.591412v1?rss=1
</link>
<description><![CDATA[
Functional enhancer annotation is a valuable first step for understanding tissue-specific transcriptional regulation and prioritizing disease-associated non-coding variants for investigation. However, unbiased enhancer discovery in physiologically relevant contexts remains a major challenge. To discover regulatory elements pertinent to diabetes, we conducted a CRISPR interference screen in the human pluripotent stem cell (hPSC) pancreatic differentiation system. Among the enhancers uncovered, we focused on a long-range enhancer ~664 kb from the ONECUT1 promoter, since coding mutations in ONECUT1 cause pancreatic hypoplasia and neonatal diabetes. Homozygous enhancer deletion in hPSCs was associated with a near-complete loss of ONECUT1 gene expression and compromised pancreatic differentiation. This enhancer contains a confidently fine-mapped type 2 diabetes-associated variant (rs528350911) that disrupts a GATA motif. Introduction of the risk variant into hPSCs revealed substantially reduced binding of key pancreatic transcription factors (GATA4, GATA6 and FOXA2) on the edited allele, accompanied by a slight reduction of ONECUT1 transcription, supporting a causal role for this risk variant in metabolic disease. This work expands our knowledge about transcriptional regulation in pancreatic development through the characterization of a long-range enhancer and highlights the utility of enhancer discovery in disease-relevant settings for understanding monogenic and complex disease.
]]></description>
<dc:creator>Kaplan, S. J.</dc:creator>
<dc:creator>Wong, W.</dc:creator>
<dc:creator>Yan, J.</dc:creator>
<dc:creator>Pulecio, J.</dc:creator>
<dc:creator>Cho, H.</dc:creator>
<dc:creator>Leslie-Iyer, J.</dc:creator>
<dc:creator>Kazakov, J.</dc:creator>
<dc:creator>Zhao, J.</dc:creator>
<dc:creator>Li, Q.</dc:creator>
<dc:creator>Murphy, D.</dc:creator>
<dc:creator>Luo, R.</dc:creator>
<dc:creator>Dey, K. K.</dc:creator>
<dc:creator>Apostolou, E.</dc:creator>
<dc:creator>Leslie, C. S.</dc:creator>
<dc:creator>Huangfu, D.</dc:creator>
<dc:date>2024-04-29</dc:date>
<dc:identifier>doi:10.1101/2024.04.26.591412</dc:identifier>
<dc:title><![CDATA[CRISPR Screening Uncovers a Long-Range Enhancer for ONECUT1 in Pancreatic Differentiation and Links a Diabetes Risk Variant]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2024-04-29</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2023.10.30.564796v1?rss=1">
<title>
<![CDATA[
Decoding Heterogenous Single-cell Perturbation Responses 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2023.10.30.564796v1?rss=1
</link>
<description><![CDATA[
Understanding diverse responses of individual cells to the same perturbation is central to many biological and biomedical problems. Current methods, however, do not precisely quantify the strength of perturbation responses and, more importantly, do not reveal new biological insights from heterogeneity in responses. Here we introduce the perturbation-response score (PS), based on constrained quadratic optimization, to quantify diverse perturbation responses at a single-cell level. Applied to single-cell transcriptomes of large-scale genetic perturbation datasets (e.g., Perturb-seq), PS outperforms existing methods for quantifying partial gene perturbation responses. In addition, PS presents two major advances. First, PS enables large-scale, single-cell-resolution dosage analysis of perturbation, without the need to titrate perturbation strength. By analyzing the dose-response patterns of over 2,000 essential genes in Perturb-seq, we identify two distinct patterns, depending on whether a moderate reduction in their expression induces strong downstream expression alterations. Second, PS identifies intrinsic and extrinsic biological determinants of perturbation responses. We demonstrate the application of PS in contexts such as T cell stimulation, latent HIV-1 expression, and pancreatic cell differentiation. Notably, PS unveiled a previously unrecognized, cell-type-specific role of coiled-coil domain containing 6 (CCDC6) in guiding liver and pancreatic lineage decisions, where CCDC6 knockouts drive endoderm cell differentiation towards the liver lineage rather than the pancreatic lineage. The PS approach provides an innovative method for dose-to-function analysis and will enable new biological discoveries from single-cell perturbation datasets.

One sentence summary: We present a method to quantify diverse perturbation responses and discover novel biological insights in single-cell perturbation datasets.
]]></description>
<dc:creator>Song, B.</dc:creator>
<dc:creator>Liu, D.</dc:creator>
<dc:creator>Dai, W.</dc:creator>
<dc:creator>McMyn, N.</dc:creator>
<dc:creator>Wang, Q.</dc:creator>
<dc:creator>Yang, D.</dc:creator>
<dc:creator>Krejci, A.</dc:creator>
<dc:creator>Vasilyev, A.</dc:creator>
<dc:creator>Untermoser, N.</dc:creator>
<dc:creator>Loregger, A.</dc:creator>
<dc:creator>Song, D.</dc:creator>
<dc:creator>Williams, B.</dc:creator>
<dc:creator>Rosen, B.</dc:creator>
<dc:creator>Cheng, X.</dc:creator>
<dc:creator>Chao, L.</dc:creator>
<dc:creator>Kale, H.</dc:creator>
<dc:creator>Zhang, H.</dc:creator>
<dc:creator>Diao, Y.</dc:creator>
<dc:creator>Bürckstümmer, T.</dc:creator>
<dc:creator>Siliciano, J. M.</dc:creator>
<dc:creator>Li, J. J.</dc:creator>
<dc:creator>Siliciano, R.</dc:creator>
<dc:creator>Huangfu, D.</dc:creator>
<dc:creator>Li, W.</dc:creator>
<dc:date>2023-11-02</dc:date>
<dc:identifier>doi:10.1101/2023.10.30.564796</dc:identifier>
<dc:title><![CDATA[Decoding Heterogenous Single-cell Perturbation Responses]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2023-11-02</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2023.06.14.544990v1?rss=1">
<title>
<![CDATA[
Discovery of Competent Chromatin Regions in Human Embryonic Stem Cells 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2023.06.14.544990v1?rss=1
</link>
<description><![CDATA[
The mechanisms underlying the ability of embryonic stem cells (ESCs) to rapidly activate lineage-specific genes during differentiation remain largely unknown. Through multiple CRISPR-activation screens, we discovered human ESCs have pre-established transcriptionally competent chromatin regions (CCRs) that support lineage-specific gene expression at levels comparable to differentiated cells. CCRs reside in the same topological domains as their target genes. They lack typical enhancer-associated histone modifications but show enriched occupancy of pluripotent transcription factors, DNA demethylation factors, and histone deacetylases. TET1 and QSER1 protect CCRs from excessive DNA methylation, while HDAC1 family members prevent premature activation. This "push and pull" feature resembles bivalent domains at developmental gene promoters but involves distinct molecular mechanisms. Our study provides new insights into pluripotency regulation and cellular plasticity in development and disease.

One sentence summary: We report a class of distal regulatory regions distinct from enhancers that confer human embryonic stem cells with the competence to rapidly activate the expression of lineage-specific genes.
]]></description>
<dc:creator>Pulecio, J.</dc:creator>
<dc:creator>Tayyebi, Z.</dc:creator>
<dc:creator>Liu, D.</dc:creator>
<dc:creator>Wong, W.</dc:creator>
<dc:creator>Luo, R.</dc:creator>
<dc:creator>Damodaran, J. R.</dc:creator>
<dc:creator>Kaplan, S.</dc:creator>
<dc:creator>Cho, H.</dc:creator>
<dc:creator>Yan, J.</dc:creator>
<dc:creator>Murphy, D. J.</dc:creator>
<dc:creator>Rickert, R.</dc:creator>
<dc:creator>Shukla, A.</dc:creator>
<dc:creator>Zhong, A.</dc:creator>
<dc:creator>Gonzalez, F.</dc:creator>
<dc:creator>Yang, D.</dc:creator>
<dc:creator>Li, W.</dc:creator>
<dc:creator>Zhou, T.</dc:creator>
<dc:creator>Apostolou, E.</dc:creator>
<dc:creator>Leslie, C.</dc:creator>
<dc:creator>Huangfu, D.</dc:creator>
<dc:date>2023-06-14</dc:date>
<dc:identifier>doi:10.1101/2023.06.14.544990</dc:identifier>
<dc:title><![CDATA[Discovery of Competent Chromatin Regions in Human Embryonic Stem Cells]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2023-06-14</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2023.03.07.531569v1?rss=1">
<title>
<![CDATA[
Dynamic network-guided CRISPRi screen reveals CTCF loop-constrained nonlinear enhancer-gene regulatory activity in cell state transitions 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2023.03.07.531569v1?rss=1
</link>
<description><![CDATA[
Comprehensive enhancer discovery is challenging because most enhancers, especially those affected in complex diseases, have weak effects on gene expression. Our network modeling revealed that nonlinear enhancer-gene regulation during cell state transitions can be leveraged to improve the sensitivity of enhancer discovery. Utilizing hESC definitive endoderm differentiation as a dynamic transition system, we conducted a mid-transition CRISPRi-based enhancer screen. The screen discovered a comprehensive set of enhancers (4 to 9 per locus) for each of the core endoderm lineage-specifying transcription factors, and many enhancers had strong effects mid-transition but weak effects post-transition. Through integrating enhancer activity measurements and three-dimensional enhancer-promoter interaction information, we were able to develop a CTCF loop-constrained Interaction Activity (CIA) model that can better predict functional enhancers compared to models that rely on Hi-C-based enhancer-promoter contact frequency. Our study provides generalizable strategies for sensitive and more comprehensive enhancer discovery in both normal and pathological cell state transitions.
]]></description>
<dc:creator>Luo, R.</dc:creator>
<dc:creator>Yan, J.</dc:creator>
<dc:creator>Oh, J. W.</dc:creator>
<dc:creator>Xi, W.</dc:creator>
<dc:creator>Shigaki, D.</dc:creator>
<dc:creator>Wong, W.</dc:creator>
<dc:creator>Cho, H.</dc:creator>
<dc:creator>Murphy, D.</dc:creator>
<dc:creator>Cutler, R.</dc:creator>
<dc:creator>Rosen, B. P.</dc:creator>
<dc:creator>Pulecio, J.</dc:creator>
<dc:creator>Yang, D.</dc:creator>
<dc:creator>Glenn, R.</dc:creator>
<dc:creator>Chen, T.</dc:creator>
<dc:creator>Li, Q. V.</dc:creator>
<dc:creator>Vierbuchen, T.</dc:creator>
<dc:creator>Sidoli, S.</dc:creator>
<dc:creator>Apostolou, E.</dc:creator>
<dc:creator>Huangfu, D.</dc:creator>
<dc:creator>Beer, M. A.</dc:creator>
<dc:date>2023-03-09</dc:date>
<dc:identifier>doi:10.1101/2023.03.07.531569</dc:identifier>
<dc:title><![CDATA[Dynamic network-guided CRISPRi screen reveals CTCF loop-constrained nonlinear enhancer-gene regulatory activity in cell state transitions]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2023-03-09</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2026.03.30.715154v1?rss=1">
<title>
<![CDATA[
Functional genomics reveals mediators of beta cell survival in ER stress and type 2 diabetes risk 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2026.03.30.715154v1?rss=1
</link>
<description><![CDATA[
Endoplasmic reticulum (ER) stress in pancreatic beta cells contributes to impaired function and type 2 diabetes (T2D). In this study we performed genome-wide perturbation screens and genomic profiling in beta cells to identify novel mediators of ER stress responses and diabetes risk. We defined gene regulatory networks in beta cells and identified specific beta cell networks enriched for T2D risk variants with altered expression in ER stress. We performed a loss-of-function CRISPR screen for survival under ER stress in EndoC-βH1 cells, which identified 167 pro-survival and 47 pro-death genes involved in processes related to insulin secretion, mitochondrial transport and protein ubiquitination. Beta cell survival genes collectively had limited genomic change in stress yet showed significant, independent enrichment for T2D risk variants, including the novel T2D candidate gene DTNB, which we validated as protective against beta cell death during stress. Overall, our results revealed mediators of ER stress responses in beta cells and identified new therapeutic targets to preserve beta cells in diabetes pathogenesis.
]]></description>
<dc:creator>Okino, M.-L.</dc:creator>
<dc:creator>Zhu, H.</dc:creator>
<dc:creator>Corban, S.</dc:creator>
<dc:creator>Benaglio, P.</dc:creator>
<dc:creator>Djulamsah, J.</dc:creator>
<dc:creator>O'Mahony, B.</dc:creator>
<dc:creator>Vanderstel, K.</dc:creator>
<dc:creator>Elgamal, R.</dc:creator>
<dc:creator>Miller, M.</dc:creator>
<dc:creator>Wang, A.</dc:creator>
<dc:creator>Sander, M.</dc:creator>
<dc:creator>Gaulton, K. J.</dc:creator>
<dc:date>2026-04-02</dc:date>
<dc:identifier>doi:10.64898/2026.03.30.715154</dc:identifier>
<dc:title><![CDATA[Functional genomics reveals mediators of beta cell survival in ER stress and type 2 diabetes risk]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2026-04-02</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2025.06.05.658079v1?rss=1">
<title>
<![CDATA[
Comprehensive molecular impact mapping of common and rare variants at GWAS loci 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2025.06.05.658079v1?rss=1
</link>
<description><![CDATA[
Deep learning sequence to function models can predict the molecular effects of genetic variants, but their predictions are limited to the cell types and assays they are trained on. Here we describe DNACipher, a deep learning model that predicts the effects of genetic variants across diverse biological contexts--including those not directly measured. DNACipher takes 196 kb of genome sequences as input and imputes variant effects across 38,582 cell type-assay combinations. DNACipher generates predictions for >7 times as many contexts as Enformer, which allows for better detection of variant effects at expression quantitative trait loci (eQTLs). We also introduce DNACipher Deep Variant Impact Mapping (DVIM), a method to identify variants with molecular effects at genome-wide association study (GWAS) loci. Application of DVIM to type 1 diabetes (T1D) reduced the mean fine-mapping credible set size from 24 to 1.4 variants per signal. DVIM variants had significantly higher fine-mapping posterior probabilities, and their predicted effects were supported by single-nucleus ATAC-seq and luciferase assays. DVIM also detected 6547 rare variants with molecular effects at 96% of GWAS T1D loci, and these were enriched for associations with immune traits. In summary, DNACipher DVIM prioritises common and rare variants at GWAS loci by predicting molecular effects across a broad range of contexts.
]]></description>
<dc:creator>Balderson, B.</dc:creator>
<dc:creator>Tule, S.</dc:creator>
<dc:creator>Okino, M.-L.</dc:creator>
<dc:creator>Rieger, W. J.</dc:creator>
<dc:creator>Corban, S.</dc:creator>
<dc:creator>Jaureguy, J.</dc:creator>
<dc:creator>Palpant, N.</dc:creator>
<dc:creator>Gaulton, K. J.</dc:creator>
<dc:creator>Boden, M.</dc:creator>
<dc:creator>McVicker, G.</dc:creator>
<dc:date>2025-06-06</dc:date>
<dc:identifier>doi:10.1101/2025.06.05.658079</dc:identifier>
<dc:title><![CDATA[Comprehensive molecular impact mapping of common and rare variants at GWAS loci]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2025-06-06</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2025.02.20.639325v1?rss=1">
<title>
<![CDATA[
Linking molecular pathways and islet cell dysfunction in human type 1 diabetes 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2025.02.20.639325v1?rss=1
</link>
<description><![CDATA[
Type 1 diabetes (T1D) is characterized by the autoimmune destruction of most insulin-producing β-cells, along with dysregulated glucagon secretion from pancreatic α-cells. We conducted an integrated analysis that combines electrophysiological and transcriptomic profiling, along with machine learning, of islet cells from T1D donors to investigate the mechanisms underlying their dysfunction. Surviving β-cells exhibit altered electrophysiological properties and transcriptomic signatures indicative of increased antigen presentation, metabolic reprogramming, and impaired protein translation. In α-cells, we observed hyper-responsiveness and increased exocytosis, which are associated with upregulated immune signaling, disrupted transcription factor localization and lysosome homeostasis, as well as dysregulation of mTORC1 complex signaling. Notably, key genetic risk signals for T1D were enriched in transcripts related to α-cell dysfunction, including MHC class I, which were closely linked with α-cell dysfunction. Our data provide novel insights into the molecular underpinnings of islet cell dysfunction in T1D, highlighting pathways that may be leveraged to preserve residual β-cell function and modulate α-cell activity. These findings underscore the complex interplay between immune signaling, metabolic stress, and cellular identity in shaping islet cell phenotypes in T1D.

Highlights:
- Surviving β-cells in T1D show disrupted electrical function linked to metabolic reprogramming and immune stress.
- Transcripts associated with α-cell dysfunction are enriched in genetic risk alleles for T1D.
- Upregulated MHC class I and impaired nuclear localization of key transcription factors associate with α-cell dysfunction in T1D.
- T1D α-cells exhibit increased hyper-activity, lysosomal imbalance and impaired mTORC1 signaling, which promotes dysregulated glucagon secretion.
]]></description>
<dc:creator>dos Santos, T.</dc:creator>
<dc:creator>Dai, X. Q.</dc:creator>
<dc:creator>Jones, R. C.</dc:creator>
<dc:creator>Spigelman, A. F.</dc:creator>
<dc:creator>Mummey, H. M.</dc:creator>
<dc:creator>Ewald, J. D.</dc:creator>
<dc:creator>Ellis, C. E.</dc:creator>
<dc:creator>Lyon, J. G.</dc:creator>
<dc:creator>Smith, N.</dc:creator>
<dc:creator>Bautista, A.</dc:creator>
<dc:creator>Manning Fox, J. E.</dc:creator>
<dc:creator>Neff, N. F.</dc:creator>
<dc:creator>Detweiler, A.</dc:creator>
<dc:creator>Tan, M.</dc:creator>
<dc:creator>Arrojo e Drigo, R.</dc:creator>
<dc:creator>Xia, J.</dc:creator>
<dc:creator>Camunas-Soler, J.</dc:creator>
<dc:creator>Gaulton, K. J.</dc:creator>
<dc:creator>Quake, S. R.</dc:creator>
<dc:creator>MacDonald, P. E.</dc:creator>
<dc:date>2025-02-21</dc:date>
<dc:identifier>doi:10.1101/2025.02.20.639325</dc:identifier>
<dc:title><![CDATA[Linking molecular pathways and islet cell dysfunction in human type 1 diabetes]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2025-02-21</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2025.01.15.633240v1?rss=1">
<title>
<![CDATA[
GenVarLoader: An accelerated dataloader for applying deep learning to personalized genomics 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2025.01.15.633240v1?rss=1
</link>
<description><![CDATA[
Deep learning sequence models trained on personalized genomics can improve variant effect prediction; however, applications of these models are limited by computational requirements for storing and reading large datasets. We address this with GenVarLoader, which stores personalized genomic data in new memory-mapped formats with optimal data locality to achieve ~1,000x faster throughput and ~2,000x better compression compared to existing alternatives.
]]></description>
<dc:creator>Laub, D.</dc:creator>
<dc:creator>Ho, A.</dc:creator>
<dc:creator>Jaureguy, J.</dc:creator>
<dc:creator>Klie, A.</dc:creator>
<dc:creator>Salem, R. M.</dc:creator>
<dc:creator>McVicker, G.</dc:creator>
<dc:creator>Carter, H.</dc:creator>
<dc:date>2025-01-17</dc:date>
<dc:identifier>doi:10.1101/2025.01.15.633240</dc:identifier>
<dc:title><![CDATA[GenVarLoader: An accelerated dataloader for applying deep learning to personalized genomics]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2025-01-17</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2024.08.03.606460v1?rss=1">
<title>
<![CDATA[
Single cell multiome profiling of pancreatic islets reveals physiological changes in cell type-specific regulation associated with diabetes risk 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2024.08.03.606460v1?rss=1
</link>
<description><![CDATA[
Physiological variability in pancreatic cell type gene regulation and the impact on diabetes risk is poorly understood. In this study we mapped gene regulation in pancreatic cell types using single cell multiomic (joint RNA-seq and ATAC-seq) profiling in 28 non-diabetic donors in combination with single cell data from 35 non-diabetic donors in the Human Pancreas Analysis Program. We identified widespread associations with age, sex, BMI, and HbA1c, where gene regulatory responses were highly cell type- and phenotype-specific. In beta cells, donor age associated with hypoxia, apoptosis, unfolded protein response, and external signal-dependent transcriptional regulators, while HbA1c associated with inflammatory responses and gender with chromatin organization. We identified 10.8K loci where genetic variants were QTLs for cis regulatory element (cRE) accessibility, including 20% with lineage- or cell type-specific effects which disrupted distinct transcription factor motifs. Type 2 diabetes and glycemic trait associated variants were enriched in both phenotype- and QTL-associated beta cell cREs, whereas type 1 diabetes showed limited enrichment. Variants at 226 diabetes and glycemic trait loci were QTLs in beta and other cell types, including 40 that were statistically colocalized, and annotating target genes of colocalized QTLs revealed genes with putatively novel roles in disease. Our findings reveal diverse responses of pancreatic cell types to phenotype and genotype in physiology, and identify pathways, networks, and genes through which physiology impacts diabetes risk.
]]></description>
<dc:creator>Mummey, H.</dc:creator>
<dc:creator>Elison, W.</dc:creator>
<dc:creator>Korgaonkar, K.</dc:creator>
<dc:creator>Elgamel, R.</dc:creator>
<dc:creator>Kudtarkar, P.</dc:creator>
<dc:creator>Griffin, E.</dc:creator>
<dc:creator>Benaglio, P.</dc:creator>
<dc:creator>Miller, M.</dc:creator>
<dc:creator>Jha, A.</dc:creator>
<dc:creator>Manning Fox, J. E.</dc:creator>
<dc:creator>Mccarthy, M.</dc:creator>
<dc:creator>Preissl, S.</dc:creator>
<dc:creator>Gloyn, A. L.</dc:creator>
<dc:creator>Macdonald, P. E.</dc:creator>
<dc:creator>Gaulton, K. J.</dc:creator>
<dc:date>2024-08-06</dc:date>
<dc:identifier>doi:10.1101/2024.08.03.606460</dc:identifier>
<dc:title><![CDATA[Single cell multiome profiling of pancreatic islets reveals physiological changes in cell type-specific regulation associated with diabetes risk]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2024-08-06</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2024.04.11.589096v1?rss=1">
<title>
<![CDATA[
Single cell regulatory architecture of human pancreatic islets suggests sex differences in β cell function and the pathogenesis of type 2 diabetes. 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2024.04.11.589096v1?rss=1
</link>
<description><![CDATA[
Biological sex affects the pathogenesis of type 2 and type 1 diabetes (T2D, T1D), including the development of β cell failure observed more often in males. The mechanisms that drive sex differences in β cell failure are unknown. Studying sex differences in islet regulation and function represents a unique avenue to understand the sex-specific heterogeneity in β cell failure in diabetes. Here, we examined sex and race differences in human pancreatic islets from up to 52 donors with and without T2D (including 37 donors from the Human Pancreas Analysis Program [HPAP] dataset) using an orthogonal series of experiments including single cell RNA-seq (scRNA-seq), single nucleus assay for transposase-accessible chromatin sequencing (snATAC-seq), dynamic hormone secretion, and bioenergetics. In cultured islets from nondiabetic (ND) donors, in the absence of the in vivo hormonal environment, sex differences in islet cell type gene accessibility and expression predominantly involved sex chromosomes. Of particular interest were sex differences in the X-linked KDM6A and Y-linked KDM5D chromatin remodelers in female and male islet cells, respectively. Islets from T2D donors exhibited similar sex differences in differentially expressed genes (DEGs) from sex chromosomes. However, in contrast to islets from ND donors, islets from T2D donors exhibited major sex differences in DEGs from autosomes. Comparing β cells from T2D and ND donors revealed that female β cells had more DEGs from autosomes than male β cells. Gene set enrichment analysis of female β cell DEGs showed a suppression of oxidative phosphorylation and electron transport chain pathways, while male β cells had suppressed insulin secretion pathways.
Thus, although sex-specific differences in gene accessibility and expression of cultured ND human islets predominantly affect sex chromosome genes, major differences in autosomal gene expression between sexes appear during the transition to T2D and highlight mitochondrial failure in female β cells.
]]></description>
<dc:creator>Qadir, M. M. F.</dc:creator>
<dc:creator>Elgamal, R. M.</dc:creator>
<dc:creator>Song, K.</dc:creator>
<dc:creator>Kudtarkar, P.</dc:creator>
<dc:creator>Sakamuri, S. S. V. P.</dc:creator>
<dc:creator>Katakam, P. V.</dc:creator>
<dc:creator>El-Dahr, S.</dc:creator>
<dc:creator>Kolls, J. K.</dc:creator>
<dc:creator>Gaulton, K. J.</dc:creator>
<dc:creator>Mauvais-Jarvis, F.</dc:creator>
<dc:date>2024-04-14</dc:date>
<dc:identifier>doi:10.1101/2024.04.11.589096</dc:identifier>
<dc:title><![CDATA[Single cell regulatory architecture of human pancreatic islets suggests sex differences in β cell function and the pathogenesis of type 2 diabetes.]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2024-04-14</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2023.08.03.551876v1?rss=1">
<title>
<![CDATA[
Interface-guided phenotyping of coding variants in the transcription factor RUNX1 with SEUSS 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2023.08.03.551876v1?rss=1
</link>
<description><![CDATA[
Understanding the consequences of single amino acid substitutions in cancer driver genes remains an unmet need. Perturb-seq provides a tool to investigate the effects of individual mutations on cellular programs. Here we deploy SEUSS, a Perturb-seq like approach, to generate and assay mutations at physical interfaces of the RUNX1 Runt domain. We measured the impact of 115 mutations on RNA profiles in single myelogenous leukemia cells and used the profiles to categorize mutations into three functionally distinct groups: wild-type (WT)-like, loss-of-function (LOF)-like and hypomorphic. Notably, the largest concentration of functional mutations (non-WT-like) clustered at the DNA binding site and contained many of the more frequently observed mutations in human cancers. Hypomorphic variants shared characteristics with loss of function variants but had gene expression profiles indicative of response to neural growth factor and cytokine recruitment of neutrophils. Additionally, DNA accessibility changes upon perturbations were enriched for RUNX1 binding motifs, particularly near differentially expressed genes. Overall, our work demonstrates the potential of targeting protein interaction interfaces to better define the landscape of prospective phenotypes reachable by amino acid substitutions.
]]></description>
<dc:creator>Ozturk, K.</dc:creator>
<dc:creator>Panwala, R.</dc:creator>
<dc:creator>Sheen, J.</dc:creator>
<dc:creator>Ford, K.</dc:creator>
<dc:creator>Payne, N.</dc:creator>
<dc:creator>Zhang, D.-E.</dc:creator>
<dc:creator>Hutter, S.</dc:creator>
<dc:creator>Haferlach, T.</dc:creator>
<dc:creator>Ideker, T.</dc:creator>
<dc:creator>Mali, P.</dc:creator>
<dc:creator>Carter, H.</dc:creator>
<dc:date>2023-08-04</dc:date>
<dc:identifier>doi:10.1101/2023.08.03.551876</dc:identifier>
<dc:title><![CDATA[Interface-guided phenotyping of coding variants in the transcription factor RUNX1 with SEUSS]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2023-08-04</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2022.10.24.513593v1?rss=1">
<title>
<![CDATA[
EUGENe: A Python toolkit for predictive analyses of regulatory sequences 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2022.10.24.513593v1?rss=1
</link>
<description><![CDATA[
Deep learning (DL) has become a popular tool to study cis-regulatory element function. Yet efforts to design software for DL analyses in genomics that are Findable, Accessible, Interoperable and Reusable (FAIR) have fallen short of fully meeting these criteria. Here we present EUGENe (Elucidating the Utility of Genomic Elements with Neural Nets), a FAIR toolkit for the analysis of labeled sets of nucleotide sequences with DL. EUGENe consists of a set of modules that empower users to execute the key functionality of a DL workflow: 1) extracting, transforming and loading sequence data from many common file formats, 2) instantiating, initializing and training diverse model architectures, and 3) evaluating and interpreting model behavior. We designed EUGENe to be simple; users can develop workflows on new or existing datasets with two customizable Python objects, annotated sequence data (SeqData) and PyTorch models (BaseModel). The modularity and simplicity of EUGENe also make it highly extensible and we illustrate these principles through application of the toolkit to three predictive modeling tasks. First, we train and compare a set of built-in models along with a custom architecture for the accurate prediction of activities of plant promoters from STARR-seq data. Next, we apply EUGENe to an RNA binding prediction task and showcase how seminal model architectures can be retrained in EUGENe or imported from Kipoi. Finally, we train models to classify transcription factor binding by wrapping functionality from Janggu, which can efficiently extract sequences in BED file format from the human genome. We emphasize that the code used in each use case is simple, readable, and well documented (https://eugene-tools.readthedocs.io/en/latest/index.html). We believe that EUGENe represents a springboard toward a collaborative ecosystem for DL applications in genomics research.
EUGENe is available for download on GitHub (https://github.com/cartercompbio/EUGENe) along with several introductory tutorials and for installation on PyPI (https://pypi.org/project/eugene-tools/).
]]></description>
<dc:creator>Klie, A.</dc:creator>
<dc:creator>Stites, H.</dc:creator>
<dc:creator>Jores, T.</dc:creator>
<dc:creator>Carter, H.</dc:creator>
<dc:date>2022-10-26</dc:date>
<dc:identifier>doi:10.1101/2022.10.24.513593</dc:identifier>
<dc:title><![CDATA[EUGENe: A Python toolkit for predictive analyses of regulatory sequences]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2022-10-26</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2026.01.31.703062v1?rss=1">
<title>
<![CDATA[
MissenseHMM: state-based annotations for missense variants through joint modeling of pathogenicity scores 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2026.01.31.703062v1?rss=1
</link>
<description><![CDATA[
Many computational predictors of missense variant pathogenicity are available. To capture information across various predictors, we propose MissenseHMM, which learns states corresponding to combinatorial patterns of variant prioritizations. We applied MissenseHMM to 43 predictors, annotating over 70 million missense variants with 20 states that showed distinct predictor score patterns, amino acid substitutions and other genomic annotation enrichments. MissenseHMM state annotations enhanced individual predictors' associations with clinical pathogenic variants and deep mutational scanning data, and also provided insight into the performance of various protein language models. Overall, MissenseHMM complements pathogenicity predictors and is an annotation resource for missense variant interpretation.
]]></description>
<dc:creator>Li, R.</dc:creator>
<dc:creator>Ernst, J.</dc:creator>
<dc:date>2026-02-03</dc:date>
<dc:identifier>doi:10.64898/2026.01.31.703062</dc:identifier>
<dc:title><![CDATA[MissenseHMM: state-based annotations for missense variants through joint modeling of pathogenicity scores]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2026-02-03</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2021.06.17.448889v1?rss=1">
<title>
<![CDATA[
Fast and powerful statistical method for context-specific QTL mapping in multi-context genomic studies 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2021.06.17.448889v1?rss=1
</link>
<description><![CDATA[
Context-specific eQTLs mediate genetic risk for complex diseases. However, limitations in current methods for identifying these eQTLs have hindered their comprehensive characterization and downstream interpretation of disease-associated variants. Here, we introduce FastGxC, a method to efficiently and powerfully map context-specific eQTLs by leveraging the correlation structure in genomic studies with repeated sampling, e.g., single-cell RNA-seq studies. Using simulations, we demonstrate that FastGxC is up to nine times more powerful and 10^6 times faster than existing approaches, reducing computation time from years to minutes. We applied FastGxC to bulk multi-tissue (N=698) and single-cell PBMC (N=1,218) RNA-seq datasets, generating comprehensive tissue- and cell-type-specific eQTL maps. These eQTLs exhibited up to four-fold enrichment in open chromatin regions from matched contexts and were twice as enriched as standard context-specific eQTLs, highlighting their biological relevance. Furthermore, we examined the relationship between context-specific eQTLs and complex human traits and diseases. FastGxC improved precision in identifying relevant contexts for each trait by three-fold and expanded candidate causal genes by 25% in cell types and 6% in tissues compared to standard eQTLs. In summary, FastGxC provides a powerful framework for mapping context-specific eQTLs, advancing our understanding of gene regulatory mechanisms underlying complex human traits and diseases.
]]></description>
<dc:creator>Lu, A.</dc:creator>
<dc:creator>Thompson, M.</dc:creator>
<dc:creator>Gordon, M. G.</dc:creator>
<dc:creator>Dahl, A.</dc:creator>
<dc:creator>Ye, C. J.</dc:creator>
<dc:creator>Zaitlen, N.</dc:creator>
<dc:creator>Balliu, B.</dc:creator>
<dc:date>2021-06-18</dc:date>
<dc:identifier>doi:10.1101/2021.06.17.448889</dc:identifier>
<dc:title><![CDATA[Fast and powerful statistical method for context-specific QTL mapping in multi-context genomic studies]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2021-06-18</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2025.10.10.681728v1?rss=1">
<title>
<![CDATA[
map3C: a computational tool for processing multiomic single-cell Hi-C data 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2025.10.10.681728v1?rss=1
</link>
<description><![CDATA[
Summary: The emergence of multiomic single-cell Hi-C methods, which simultaneously profile chromatin conformation and other modalities such as gene expression or DNA methylation, creates tremendous opportunities for studying the genome's structure-function relationships. Existing tools for processing multiomic single-cell Hi-C datasets have certain limitations for downstream bioinformatics analysis. We present map3C, a software tool designed to address these limitations. We demonstrate that map3C improves the quality of multiomic single-cell Hi-C data for analysis and its utility for identifying structural variant locations in the genome.
]]></description>
<dc:creator>Galasso, J.</dc:creator>
<dc:creator>Wang, Y.</dc:creator>
<dc:creator>Alber, F.</dc:creator>
<dc:creator>Ernst, J.</dc:creator>
<dc:creator>Luo, C.</dc:creator>
<dc:date>2025-10-15</dc:date>
<dc:identifier>doi:10.1101/2025.10.10.681728</dc:identifier>
<dc:title><![CDATA[map3C: a computational tool for processing multiomic single-cell Hi-C data]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2025-10-15</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2025.10.02.679828v1?rss=1">
<title>
<![CDATA[
GECSI: Large-scale chromatin state imputation from gene expression 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2025.10.02.679828v1?rss=1
</link>
<description><![CDATA[
Compendiums of chromatin state annotations based on integrating maps of multiple epigenetic marks such as from ChromHMM have become a powerful resource. While these compendiums have coverage of many biological samples, there are many additional biological samples that have gene expression data but lack epigenetic mark data and chromatin state annotations. The EpiAtlas resource of the International Human Epigenome Consortium (IHEC) contains a large compendium of chromatin state annotations for which many samples have matched gene expression data, which provides the opportunity to use it to train models to predict chromatin state annotations in additional biological samples with only gene expression data available. To address this, we develop Gene Expression-based Chromatin State Imputation (GECSI), which uses a multi-class logistic regression model trained using a large compendium of gene expression and chromatin state annotations, and apply it to IHEC data. Using cross-validation, we find that GECSI accurately predicts chromatin state assignments and generates probability estimates that are predictive of observed chromatin states, overall outperforming multiple other alternative and baseline methods. GECSI-predicted chromatin states reflect relationships among biological samples and show similar transcription factor and gene annotation enrichments as observed chromatin states. Using available IHEC gene expression data, we apply GECSI to predict chromatin state annotations for 449 additional epigenomes. We expect these predicted annotations and the GECSI software will be a useful resource for chromatin state analyses in many additional biological samples.
]]></description>
<dc:creator>Fu, J.</dc:creator>
<dc:creator>Ernst, J.</dc:creator>
<dc:date>2025-10-04</dc:date>
<dc:identifier>doi:10.1101/2025.10.02.679828</dc:identifier>
<dc:title><![CDATA[GECSI: Large-scale chromatin state imputation from gene expression]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2025-10-04</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2025.08.21.671671v1?rss=1">
<title>
<![CDATA[
Integrated ambient modeling and genetic demultiplexing of single-cell RNA+ATAC multiome experiments with Ambimux 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2025.08.21.671671v1?rss=1
</link>
<description><![CDATA[
Single cell technologies have advanced at a rapid pace, providing assays for various molecular phenotypes. Droplet-based single cell technologies, particularly those based on nuclei isolation, such as simultaneous RNA+ATAC single-cell multiome, are susceptible to exogenous ambient molecule contamination, which can increase noise in cell type-level associations. We reasoned that genotype-based sample multiplexing can provide an opportunity to infer this ambient contamination by leveraging DNA variation in sequenced reads. Thus, we developed ambimux, a likelihood-based method to estimate ambient fractions and demultiplex single-cell multiome experiments using genotype-level data. Ambimux models the ambient or nuclear probability at the read level and thus can classify empty droplets and estimate droplet-specific ambient molecule fractions in each modality. We first evaluated our method using simulated data sets across a range of parameters. We found that ambimux closely estimated the ground truth droplet contamination fractions in the RNA (MAE=0.048) and ATAC (MAE=0.042) modalities. As a result, ambimux maintained high specificity (>95%) and was able to correctly assign singlets at considerably high ambient fractions (up to 60%) for both RNA and ATAC modalities. In comparison, models that do not consider ambient contamination maintained similar sensitivity levels only at considerably lower ambient fractions (up to 25%). We then generated a real data set of seven visceral adipose tissue biopsies run on a single 10x Multiome channel. We ran ambimux and detected 4,986 singlets, capturing similar numbers as other methods.

Then, we sought to evaluate the fidelity of the ambient fraction estimates from ambimux. We split singlets into ambient-enriched (>5% contamination in both modalities) or nuclear-enriched (<5% in both) droplets and performed gene-peak linkage analysis. Low ambient droplets resulted in more significant hits with gene-peak links enriched at the transcription start site relative to high ambient droplets, suggesting that the ambient droplets identified by ambimux hamper the identification of biologically meaningful signals. In summary, we developed a joint single-cell multiome demultiplexing method, ambimux, that accurately models and estimates ambient molecule contamination in each modality.
]]></description>
<dc:creator>Alvarez, M.</dc:creator>
<dc:creator>Li, T.</dc:creator>
<dc:creator>Lee, S. H. T.</dc:creator>
<dc:creator>Arasu, U. T.</dc:creator>
<dc:creator>Selvarajan, I.</dc:creator>
<dc:creator>Örd, T.</dc:creator>
<dc:creator>Rahmani, E.</dc:creator>
<dc:creator>Chen, Z. J.</dc:creator>
<dc:creator>Avram, O.</dc:creator>
<dc:creator>Kar, A.</dc:creator>
<dc:creator>Kaminska, D.</dc:creator>
<dc:creator>Männistö, V.</dc:creator>
<dc:creator>Halperin, E.</dc:creator>
<dc:creator>Pihlajamäki, J.</dc:creator>
<dc:creator>Luo, C.</dc:creator>
<dc:creator>Kaikkonen, M. U.</dc:creator>
<dc:creator>Zaitlen, N.</dc:creator>
<dc:creator>Pajukanta, P.</dc:creator>
<dc:date>2025-08-26</dc:date>
<dc:identifier>doi:10.1101/2025.08.21.671671</dc:identifier>
<dc:title><![CDATA[Integrated ambient modeling and genetic demultiplexing of single-cell RNA+ATAC multiome experiments with Ambimux]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2025-08-26</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2025.02.06.636969v1?rss=1">
<title>
<![CDATA[
The impact of ambient contamination on demultiplexing methods for single-nucleus multiome experiments 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2025.02.06.636969v1?rss=1
</link>
<description><![CDATA[
Sample multiplexing has become an increasingly common design choice in droplet-based single-nucleus multi-omic sequencing experiments to reduce costs and remove technical variation. Genotype-based demultiplexing is one popular class of methods that was originally developed for single-cell RNA-seq, but has not been rigorously benchmarked in other assays, such as snATAC-seq and joint snRNA/snATAC assays, especially in the context of variable ambient RNA/DNA contamination. To address this, we develop ambisim, a genotype-aware read-level simulator that can flexibly control ambient molecule proportions and generate realistic joint snRNA/snATAC data. We use ambisim to evaluate demultiplexing methods across several important parameters: doublet rate, number of multiplexed donors, and coverage levels. Our simulations reveal that methods are variably impacted by ambient contamination in both modalities. We then applied the demultiplexing methods to two joint snRNA/snATAC datasets and found highly variable concordance between methods in both modalities. Finally, we develop a new metric, variant consistency, which we show is correlated with cell-level ambient molecule fractions in singlets. Applying our metric to two multiplexed joint snRNA/snATAC datasets reveals variable ambient contamination across experiments and modalities. We conclude that improved modelling of ambient material in demultiplexing algorithms will increase both sensitivity and specificity.
]]></description>
<dc:creator>Li, T.</dc:creator>
<dc:creator>Alvarez, M.</dc:creator>
<dc:creator>Liu, C.</dc:creator>
<dc:creator>Abuhanna, K.</dc:creator>
<dc:creator>Sun, Y.</dc:creator>
<dc:creator>Ernst, J.</dc:creator>
<dc:creator>Plath, K.</dc:creator>
<dc:creator>Balliu, B.</dc:creator>
<dc:creator>Luo, C.</dc:creator>
<dc:creator>Zaitlen, N.</dc:creator>
<dc:date>2025-02-08</dc:date>
<dc:identifier>doi:10.1101/2025.02.06.636969</dc:identifier>
<dc:title><![CDATA[The impact of ambient contamination on demultiplexing methods for single-nucleus multiome experiments]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2025-02-08</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2024.12.19.629547v1?rss=1">
<title>
<![CDATA[
Learning a Pairwise Epigenomic and Transcription Factor Binding Association Score Across the Human Genome 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2024.12.19.629547v1?rss=1
</link>
<description><![CDATA[
Identifying pairwise associations between genomic loci is an important challenge for which large and diverse collections of epigenomic and transcription factor (TF) binding data can potentially be informative. We therefore developed Learning Evidence of Pairwise Association from Epigenomic and TF binding data (LEPAE). LEPAE uses neural networks to quantify evidence of association for pairs of genomic windows from large-scale epigenomic and TF binding data along with distance information. We applied LEPAE using thousands of human datasets. We present evidence using additional data that LEPAE captures biologically meaningful pairwise relationships between genomic loci and expect LEPAE scores to be a resource.
]]></description>
<dc:creator>Kwon, S. B.</dc:creator>
<dc:creator>Ernst, J.</dc:creator>
<dc:date>2024-12-22</dc:date>
<dc:identifier>doi:10.1101/2024.12.19.629547</dc:identifier>
<dc:title><![CDATA[Learning a Pairwise Epigenomic and Transcription Factor Binding Association Score Across the Human Genome]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2024-12-22</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2024.03.20.585624v1?rss=1">
<title>
<![CDATA[
Identifying associations of de novo noncoding variants with autism through integration of gene expression, sequence and sex information 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2024.03.20.585624v1?rss=1
</link>
<description><![CDATA[
Whole-genome sequencing (WGS) data is facilitating genome-wide identification of rare noncoding variants, while elucidating their roles in disease remains challenging. Towards this end, we first revisit a reported significant brain-related association signal of autism spectrum disorder (ASD) detected from de novo noncoding variants attributed to deep-learning and show that local GC content can capture similar association signals. We further show that the association signal appears driven by variants from male proband-female sibling pairs that are upstream of assigned genes. We then develop Expression Neighborhood Sequence Association Study (ENSAS), which utilizes gene expression correlations and sequence information, to more systematically identify phenotype-associated variant sets. Applying ENSAS to the same set of de novo variants, we identify gene expression-based neighborhoods showing significant ASD association signal, enriched for synapse-related gene ontology terms. For these top neighborhoods, we also identify chromatin state annotations of variants that are predictive of the proband-sibling local GC content differences. Our work provides new insights into associations of noncoding de novo mutations in ASD and presents an analytical framework applicable to other phenotypes.
]]></description>
<dc:creator>Li, R.</dc:creator>
<dc:creator>Ernst, J.</dc:creator>
<dc:date>2024-03-21</dc:date>
<dc:identifier>doi:10.1101/2024.03.20.585624</dc:identifier>
<dc:title><![CDATA[Identifying associations of de novo noncoding variants with autism through integration of gene expression, sequence and sex information]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2024-03-21</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2023.07.14.549056v1?rss=1">
<title>
<![CDATA[
Integrative epigenomic and functional characterization assay based annotation of regulatory activity across diverse human cell types 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2023.07.14.549056v1?rss=1
</link>
<description><![CDATA[
We introduce ChromActivity, a computational framework for predicting and annotating regulatory activity across the genome through integration of multiple epigenomic maps and various functional characterization datasets. ChromActivity generates genomewide predictions of regulatory activity associated with each functional characterization dataset across many cell types based on available epigenomic data. It then for each cell type produces (1) ChromScoreHMM genome annotations based on the combinatorial and spatial patterns within these predictions and (2) ChromScore tracks of overall predicted regulatory activity. ChromActivity provides a resource for analyzing and interpreting the human regulatory genome across diverse cell types.
]]></description>
<dc:creator>Dincer, T. U.</dc:creator>
<dc:creator>Ernst, J.</dc:creator>
<dc:date>2023-07-15</dc:date>
<dc:identifier>doi:10.1101/2023.07.14.549056</dc:identifier>
<dc:title><![CDATA[Integrative epigenomic and functional characterization assay based annotation of regulatory activity across diverse human cell types]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2023-07-15</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2022.12.19.521116v1?rss=1">
<title>
<![CDATA[
Universal chromatin state annotation of the mouse genome 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2022.12.19.521116v1?rss=1
</link>
<description><![CDATA[
Genome-wide chromatin states learned from integrating genome-wide maps of multiple epigenetic marks within the same cell type have been widely used to generate genome annotations of individual cell types. An alternative strategy based on stacked modeling can provide a single universal chromatin state annotation based jointly on data from many cell types. In human, such an approach was recently demonstrated and the resulting chromatin state annotation, denoted full-stack, was shown to have complementary advantages to per-cell-type annotations. However, an analogous annotation has not been previously available in mouse. Here, we produce a chromatin state annotation for mouse based on 901 datasets assaying 14 chromatin marks in 26 different cell or tissue types. To characterize each chromatin state, we relate the states to other external annotations and compare them to analogously defined states in human. We expect the full-stack chromatin state annotation for mouse will be a useful resource for studying the genome of this key mammalian model organism.
]]></description>
<dc:creator>Vu, H. T.</dc:creator>
<dc:creator>Ernst, J.</dc:creator>
<dc:date>2022-12-20</dc:date>
<dc:identifier>doi:10.1101/2022.12.19.521116</dc:identifier>
<dc:title><![CDATA[Universal chromatin state annotation of the mouse genome]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2022-12-20</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2022.08.02.502571v1?rss=1">
<title>
<![CDATA[
Chromatin state modeling across individuals reveals global patterns of histone modifications 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2022.08.02.502571v1?rss=1
</link>
<description><![CDATA[
Epigenetic mapping studies across individuals have identified many positions of epigenetic variation in various human tissues and conditions. However, the relationships between these positions, and in particular global patterns that recur in many regions of the genome, remain understudied. In this study, we use a stacked chromatin state model to systematically learn global patterns of epigenetic variation across individuals and annotate the human genome based on them. We applied this framework to histone modification data across individuals in lymphoblastoid cell lines and across autism spectrum disorder cases and controls in prefrontal cortex tissue. We find that global patterns are correlated across multiple histone modifications and with gene expression. We used the global patterns as a framework to predict transregulators, identify trans-QTL, and study complex disease. The frameworks for identifying and analyzing global patterns of epigenetic variation are general, and we expect they will be useful in other systems.
]]></description>
<dc:creator>Zou, J.</dc:creator>
<dc:creator>Ernst, J.</dc:creator>
<dc:date>2022-08-03</dc:date>
<dc:identifier>doi:10.1101/2022.08.02.502571</dc:identifier>
<dc:title><![CDATA[Chromatin state modeling across individuals reveals global patterns of histone modifications]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2022-08-03</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2022.05.24.493345v1?rss=1">
<title>
<![CDATA[
ChromGene: Gene-Based Modeling of Epigenomic Data 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2022.05.24.493345v1?rss=1
</link>
<description><![CDATA[
Background: Various computational approaches have been developed to annotate epigenomes on a per-position basis by modeling combinatorial and spatial patterns within epigenomic data. However, such annotations are less suitable for gene-based analyses, in which a single annotation for each gene is desired.

Results: To address this, we developed ChromGene, which annotates genes based on the combinatorial and spatial patterns of multiple epigenomic marks across the gene body and flanking regions. Specifically, ChromGene models the epigenomic maps using a mixture of hidden Markov models learned de novo. Using ChromGene, we generated annotations for the human protein-coding genes for over 100 cell and tissue types. We characterize the different mixture components and their associated gene sets in terms of gene expression, constraint, and other gene annotations. We also characterize variation in ChromGene gene annotations across cell and tissue types.

Conclusions: We expect that the ChromGene method and provided annotations will be a useful resource for gene-based epigenomic analyses.
]]></description>
<dc:creator>Jaroszewicz, A.</dc:creator>
<dc:creator>Ernst, J.</dc:creator>
<dc:date>2022-05-25</dc:date>
<dc:identifier>doi:10.1101/2022.05.24.493345</dc:identifier>
<dc:title><![CDATA[ChromGene: Gene-Based Modeling of Epigenomic Data]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2022-05-25</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2022.05.08.491094v1?rss=1">
<title>
<![CDATA[
A framework for summarizing chromatin state annotations within and identifying differential annotations across groups of samples 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2022.05.08.491094v1?rss=1
</link>
<description><![CDATA[
Motivation: Genome-wide maps of epigenetic modifications are powerful resources for non-coding genome annotation. Maps of multiple epigenetic marks have been integrated into cell or tissue type-specific chromatin state annotations for many cell or tissue types. With the increasing availability of multiple chromatin state maps for biologically similar samples, there is a need for methods that can effectively summarize the information about chromatin state annotations within groups of samples and identify differences across groups of samples at a high resolution.

Results: We developed CSREP, which takes as input chromatin state annotations for a group of samples and then probabilistically estimates the state at each genomic position and derives a representative chromatin state map for the group. CSREP uses an ensemble of multi-class logistic regression classifiers to predict the chromatin state assignment of each sample given the state maps from all other samples. The difference of CSREP's probability assignments for two groups can be used to identify genomic locations with differential chromatin state patterns.

Using groups of chromatin state maps of a diverse set of cell and tissue types, we demonstrate the advantages of using CSREP to summarize chromatin state maps and identify biologically relevant differences between groups at a high resolution.

Availability and implementation: The CSREP source code is openly available under http://github.com/ernstlab/csrep.

Contact: jason.ernst@ucla.edu
]]></description>
<dc:creator>Vu, H. T.</dc:creator>
<dc:creator>Koch, Z.</dc:creator>
<dc:creator>Fiziev, P.</dc:creator>
<dc:creator>Ernst, J.</dc:creator>
<dc:date>2022-05-08</dc:date>
<dc:identifier>doi:10.1101/2022.05.08.491094</dc:identifier>
<dc:title><![CDATA[A framework for summarizing chromatin state annotations within and identifying differential annotations across groups of samples]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2022-05-08</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2025.04.24.650458v1?rss=1">
<title>
<![CDATA[
Deep dynamical models of single-cell multiomic velocities predict loss-of-function and rescue perturbations in B cells 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2025.04.24.650458v1?rss=1
</link>
<description><![CDATA[
We present DynaVelo, a generative neural ordinary differential equation (ODE) model that learns the joint dynamics of gene expression and transcription factor (TF) motif activities in evolving cell systems using single-cell multiome data. DynaVelo leverages partial RNA velocity information together with single-cell TF motif accessibility data to improve the modeling of cell state dynamics and identification of TF drivers. We show that DynaVelo recovers the complex and bifurcating in vivo dynamics of wildtype murine germinal center (GC) B cells and reveals how these cell dynamics change under loss-of-function mutations in epigenetic regulators. DynaVelo resolves how TF motif activities evolve along latent time trajectories using analysis of training cells or through generated trajectories from the model. In silico perturbation analysis further enables DynaVelo to infer dynamic and cell-state-specific gene regulatory networks (GRNs), recovering many known TF-to-gene edges in the wildtype GC GRN and predicting those that are disrupted in mutants. Finally, in silico gene and TF perturbations allow both the prediction of cell dynamics under loss-of-function genetic mutations and the identification of TF perturbations to rescue loss-of-function dynamic and immunological phenotypes. DynaVelo therefore provides a powerful new deep learning framework for modeling and perturbing dynamic cell systems by harnessing single-cell multiome data sets.
]]></description>
<dc:creator>Karbalayghareh, A.</dc:creator>
<dc:creator>Barisic, D.</dc:creator>
<dc:creator>Chin, C. R.</dc:creator>
<dc:creator>Melnick, A.</dc:creator>
<dc:creator>Leslie, C. S.</dc:creator>
<dc:date>2025-04-26</dc:date>
<dc:identifier>doi:10.1101/2025.04.24.650458</dc:identifier>
<dc:title><![CDATA[Deep dynamical models of single-cell multiomic velocities predict loss-of-function and rescue perturbations in B cells]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2025-04-26</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2025.01.08.631928v1?rss=1">
<title>
<![CDATA[
Deep topic modeling of spatial transcriptomics in the rheumatoid arthritis synovium identifies distinct classes of ectopic lymphoid structures 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2025.01.08.631928v1?rss=1
</link>
<description><![CDATA[
Single-cell RNA sequencing studies have revealed the heterogeneity of cell states present in the rheumatoid arthritis (RA) synovium. However, it remains unclear how these cell types interact with one another in situ and how synovial microenvironments shape observed cell states. Here, we use spatial transcriptomics (ST) to define stable microenvironments across eight synovial tissue samples from six RA patients and characterize the cellular composition of ectopic lymphoid structures (ELS). To identify disease-relevant cellular communities, we developed DeepTopics, a scalable reference-free deconvolution method based on a Dirichlet variational autoencoder architecture. DeepTopics identified 22 topics across tissue samples that were defined by specific cell types, activation states, and/or biological processes. Some topics were defined by multiple colocalizing cell types, such as CD34+ fibroblasts and LYVE1+ macrophages, suggesting functional interactions. Within ELS, we discovered two divergent cellular patterns that were stable across ELS in each patient and typified by the presence or absence of a "germinal-center-like" topic. DeepTopics is a versatile and computationally efficient method for identifying disease-relevant microenvironments from ST data, and our results highlight divergent cellular architectures in histologically similar RA synovial samples that have implications for disease pathogenesis.
]]></description>
<dc:creator>Periyakoil, P. K.</dc:creator>
<dc:creator>Smith, M. H.</dc:creator>
<dc:creator>Kshirsagar, M.</dc:creator>
<dc:creator>Ramirez, D.</dc:creator>
<dc:creator>DiCarlo, E. F.</dc:creator>
<dc:creator>Goodman, S. M.</dc:creator>
<dc:creator>Rudensky, A.</dc:creator>
<dc:creator>Donlin, L.</dc:creator>
<dc:creator>Leslie, C. S.</dc:creator>
<dc:date>2025-01-10</dc:date>
<dc:identifier>doi:10.1101/2025.01.08.631928</dc:identifier>
<dc:title><![CDATA[Deep topic modeling of spatial transcriptomics in the rheumatoid arthritis synovium identifies distinct classes of ectopic lymphoid structures]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2025-01-10</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2024.07.31.606073v1?rss=1">
<title>
<![CDATA[
Comprehensive transcription factor perturbations recapitulate fibroblast transcriptional states 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2024.07.31.606073v1?rss=1
</link>
<description><![CDATA[
Cell atlas projects have nominated recurrent transcriptional states as drivers of biological processes and disease, but their origins, regulation, and properties remain unclear. To enable complementary functional studies, we developed a scalable approach for recapitulating cell states in vitro using CRISPR activation (CRISPRa) Perturb-seq. Aided by a novel multiplexing method, we activated 1,836 transcription factors in two cell types. Measuring 21,958 perturbations showed that CRISPRa activated targets within physiological ranges, that epigenetic features predicted activatable genes, and that the protospacer seed region drove an off-target effect. Perturbations recapitulated in vivo fibroblast states, including universal and inflammatory states, and identified KLF4 and KLF5 as key regulators of the universal state. Inducing the universal state suppressed disease-associated states, highlighting its therapeutic potential. Our findings cement CRISPRa as a tool for perturbing differentiated cells and indicate that in vivo states can be elicited via perturbation, enabling studies of clinically relevant states ex vivo.
]]></description>
<dc:creator>Southard, K. M.</dc:creator>
<dc:creator>Ardy, R. C.</dc:creator>
<dc:creator>Tang, A.</dc:creator>
<dc:creator>O'Sullivan, D. D.</dc:creator>
<dc:creator>Metzner, E.</dc:creator>
<dc:creator>Guruvayurappan, K.</dc:creator>
<dc:creator>Norman, T. M.</dc:creator>
<dc:date>2024-08-03</dc:date>
<dc:identifier>doi:10.1101/2024.07.31.606073</dc:identifier>
<dc:title><![CDATA[Comprehensive transcription factor perturbations recapitulate fibroblast transcriptional states]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2024-08-03</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2023.11.09.563812v1?rss=1">
<title>
<![CDATA[
An encyclopedia of enhancer-gene regulatory interactions in the human genome 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2023.11.09.563812v1?rss=1
</link>
<description><![CDATA[
Identifying transcriptional enhancers and their target genes is essential for understanding gene regulation and the impact of human genetic variation on disease [1-6]. Here we create and evaluate a resource of >13 million enhancer-gene regulatory interactions across 352 cell types and tissues, by integrating predictive models, measurements of chromatin state and 3D contacts, and large-scale genetic perturbations generated by the ENCODE Consortium [7]. We first create a systematic benchmarking pipeline to compare predictive models, assembling a dataset of 10,411 element-gene pairs measured in CRISPR perturbation experiments, >30,000 fine-mapped eQTLs, and 569 fine-mapped GWAS variants linked to a likely causal gene. Using this framework, we develop a new predictive model, ENCODE-rE2G, that achieves state-of-the-art performance across multiple prediction tasks, demonstrating a strategy involving iterative perturbations and supervised machine learning to build increasingly accurate predictive models of enhancer regulation. Using the ENCODE-rE2G model, we build an encyclopedia of enhancer-gene regulatory interactions in the human genome, which reveals global properties of enhancer networks, identifies differences in the functions of genes that have more or less complex regulatory landscapes, and improves analyses to link noncoding variants to target genes and cell types for common, complex diseases. By interpreting the model, we find evidence that, beyond enhancer activity and 3D enhancer-promoter contacts, additional features guide enhancer-promoter communication including promoter class and enhancer-enhancer synergy. Altogether, these genome-wide maps of enhancer-gene regulatory interactions, benchmarking software, predictive models, and insights about enhancer function provide a valuable resource for future studies of gene regulation and human genetics.
]]></description>
<dc:creator>Gschwind, A. R.</dc:creator>
<dc:creator>Mualim, K. S.</dc:creator>
<dc:creator>Karbalayghareh, A.</dc:creator>
<dc:creator>Sheth, M. U.</dc:creator>
<dc:creator>Dey, K. K.</dc:creator>
<dc:creator>Jagoda, E.</dc:creator>
<dc:creator>Nurtdinov, R. N.</dc:creator>
<dc:creator>Xi, W.</dc:creator>
<dc:creator>Tan, A. S.</dc:creator>
<dc:creator>Jones, H.</dc:creator>
<dc:creator>Ma, X. R.</dc:creator>
<dc:creator>Yao, D.</dc:creator>
<dc:creator>Nasser, J.</dc:creator>
<dc:creator>Avsec, Z.</dc:creator>
<dc:creator>James, B. T.</dc:creator>
<dc:creator>Shamim, M. S.</dc:creator>
<dc:creator>Durand, N. C.</dc:creator>
<dc:creator>Rao, S. S. P.</dc:creator>
<dc:creator>Mahajan, R.</dc:creator>
<dc:creator>Doughty, B. R.</dc:creator>
<dc:creator>Andreeva, K.</dc:creator>
<dc:creator>Ulirsch, J. C.</dc:creator>
<dc:creator>Fan, K.</dc:creator>
<dc:creator>Perez, E. M.</dc:creator>
<dc:creator>Nguyen, T. C.</dc:creator>
<dc:creator>Kelley, D. R.</dc:creator>
<dc:creator>Finucane, H. K.</dc:creator>
<dc:creator>Moore, J. E.</dc:creator>
<dc:creator>Weng, Z.</dc:creator>
<dc:creator>Kellis, M.</dc:creator>
<dc:creator>Bassik, M. C.</dc:creator>
<dc:creator>Price, A. L.</dc:creator>
<dc:creator>Beer, M. A.</dc:creator>
<dc:creator>Guigo, R.</dc:creator>
<dc:creator>Stamatoyannopoulos, J. A.</dc:creator>
<dc:creator>Aiden, E. L.</dc:creator>
<dc:creator>Greenleaf, W. J.</dc:creator>
<dc:creator>Leslie, C. S.</dc:creator>
<dc:creator>Steinmetz, L. M.</dc:creator>
<dc:creator>Kundaje, A.</dc:creator>
<dc:creator>Engreitz, J. M.</dc:creator>
<dc:date>2023-11-13</dc:date>
<dc:identifier>doi:10.1101/2023.11.09.563812</dc:identifier>
<dc:title><![CDATA[An encyclopedia of enhancer-gene regulatory interactions in the human genome]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2023-11-13</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2023.07.27.550836v1?rss=1">
<title>
<![CDATA[
ChromaFold predicts the 3D contact map from single-cell chromatin accessibility 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2023.07.27.550836v1?rss=1
</link>
<description><![CDATA[
The identification of cell-type-specific 3D chromatin interactions between regulatory elements can help to decipher gene regulation and to interpret the function of disease-associated non-coding variants. However, current chromosome conformation capture (3C) technologies are unable to resolve interactions at this resolution when only small numbers of cells are available as input. We therefore present ChromaFold, a deep learning model that predicts 3D contact maps and regulatory interactions from single-cell ATAC sequencing (scATAC-seq) data alone. ChromaFold uses pseudobulk chromatin accessibility, co-accessibility profiles across metacells, and predicted CTCF motif tracks as input features and employs a lightweight architecture to enable training on standard GPUs. Once trained on paired scATAC-seq and Hi-C data in human cell lines and tissues, ChromaFold can accurately predict both the 3D contact map and peak-level interactions across diverse human and mouse test cell types. In benchmarking against a recent deep learning method that uses bulk ATAC-seq, DNA sequence, and CTCF ChIP-seq to make cell-type-specific predictions, ChromaFold yields superior prediction performance when including CTCF ChIP-seq data as an input and comparable performance without. Finally, fine-tuning ChromaFold on paired scATAC-seq and Hi-C in a complex tissue enables deconvolution of chromatin interactions across cell subpopulations. ChromaFold thus achieves state-of-the-art prediction of 3D contact maps and regulatory interactions using scATAC-seq alone as input data, enabling accurate inference of cell-type-specific interactions in settings where 3C-based assays are infeasible.
]]></description>
<dc:creator>Gao, V. R.</dc:creator>
<dc:creator>Yang, R.</dc:creator>
<dc:creator>Das, A.</dc:creator>
<dc:creator>Luo, R.</dc:creator>
<dc:creator>Luo, H.</dc:creator>
<dc:creator>McNally, D. R.</dc:creator>
<dc:creator>Karagiannidis, I.</dc:creator>
<dc:creator>Rivas, M. A.</dc:creator>
<dc:creator>Wang, Z.-m.</dc:creator>
<dc:creator>Barisic, D.</dc:creator>
<dc:creator>Karbalayghareh, A.</dc:creator>
<dc:creator>Wong, W.</dc:creator>
<dc:creator>Zhan, Y.</dc:creator>
<dc:creator>Chin, C. R.</dc:creator>
<dc:creator>Noble, W. S.</dc:creator>
<dc:creator>Bilmes, J. A.</dc:creator>
<dc:creator>Apostolou, E.</dc:creator>
<dc:creator>Kharas, M.</dc:creator>
<dc:creator>Beguelin, W.</dc:creator>
<dc:creator>Viny, A. D.</dc:creator>
<dc:creator>Huangfu, D.</dc:creator>
<dc:creator>Rudensky, A.</dc:creator>
<dc:creator>Melnick, A.</dc:creator>
<dc:creator>Leslie, C. S.</dc:creator>
<dc:date>2023-07-28</dc:date>
<dc:identifier>doi:10.1101/2023.07.27.550836</dc:identifier>
<dc:title><![CDATA[ChromaFold predicts the 3D contact map from single-cell chromatin accessibility]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2023-07-28</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2022.05.02.490368v1?rss=1">
<title>
<![CDATA[
Genome-wide CRISPR guide RNA design and specificity analysis with GuideScan2 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2022.05.02.490368v1?rss=1
</link>
<description><![CDATA[
We present GuideScan2 for memory-efficient, parallelizable construction of high-specificity CRISPR guide RNA (gRNA) databases and user-friendly gRNA/library design in custom genomes. GuideScan2 analysis identified widespread confounding effects of low-specificity gRNAs in published CRISPR knockout, interference and activation screens and enabled construction of a ready-to-use gRNA library that reduced off-target effects in a novel gene essentiality screen. GuideScan2 also enabled the design and experimental validation of allele-specific gRNAs in a hybrid mouse genome.
]]></description>
<dc:creator>Schmidt, H.</dc:creator>
<dc:creator>Zhang, M.</dc:creator>
<dc:creator>Mourelatos, H.</dc:creator>
<dc:creator>Sanchez-Rivera, F. J.</dc:creator>
<dc:creator>Lowe, S. W.</dc:creator>
<dc:creator>Ventura, A.</dc:creator>
<dc:creator>Leslie, C. S.</dc:creator>
<dc:creator>Pritykin, Y.</dc:creator>
<dc:date>2022-05-03</dc:date>
<dc:identifier>doi:10.1101/2022.05.02.490368</dc:identifier>
<dc:title><![CDATA[Genome-wide CRISPR guide RNA design and specificity analysis with GuideScan2]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2022-05-03</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2022.02.28.482131v1?rss=1">
<title>
<![CDATA[
Heterogeneity of Inflammation-associated Synovial Fibroblasts in Rheumatoid Arthritis and Its Drivers 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2022.02.28.482131v1?rss=1
</link>
<description><![CDATA[
Inflammation of non-barrier immunologically quiescent tissues is associated with a massive influx of blood-borne innate and adaptive immune cells. Cues from the latter are likely to alter and expand the spectrum of states observed in cells that are constitutively resident. However, local communications between immigrant and resident cell types in human inflammatory disease remain poorly understood. Here, we explored heterogeneity of synovial fibroblasts (FLS) in inflamed joints of rheumatoid arthritis (RA) patients using paired single cell RNA and ATAC sequencing (scRNA/ATAC-seq), multiplexed imaging, and spatial transcriptomics along with in vitro modeling of cell extrinsic factor signaling. These analyses suggest that local exposures to myeloid and T cell derived cytokines, TNF, IFNγ, IL-1β, or lack thereof, drive six distinct FLS states, some of which closely resemble fibroblast states in other disease-affected tissues including skin and colon. Our results highlight a role for concurrent, spatially distributed cytokine signaling within the inflamed synovium.
]]></description>
<dc:creator>Smith, M. H.</dc:creator>
<dc:creator>Gao, V. R.</dc:creator>
<dc:creator>Schizas, M.</dc:creator>
<dc:creator>Kochen, A.</dc:creator>
<dc:creator>DiCarlo, E.</dc:creator>
<dc:creator>Goodman, S.</dc:creator>
<dc:creator>Norman, T. M.</dc:creator>
<dc:creator>Donlin, L.</dc:creator>
<dc:creator>Leslie, C. S.</dc:creator>
<dc:creator>Rudensky, A. Y.</dc:creator>
<dc:date>2022-03-02</dc:date>
<dc:identifier>doi:10.1101/2022.02.28.482131</dc:identifier>
<dc:title><![CDATA[Heterogeneity of Inflammation-associated Synovial Fibroblasts in Rheumatoid Arthritis and Its Drivers]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2022-03-02</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2026.02.14.705848v1?rss=1">
<title>
<![CDATA[
A scalable approach to resolving variants of uncertain significance 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2026.02.14.705848v1?rss=1
</link>
<description><![CDATA[
Over 90% of missense variants across ~4,000 disease-associated genes are variants of uncertain significance (VUS). Experimental variant effect measurements provide critical evidence about pathogenicity and inform disease biology, but most variants lack data and clinical translation has been limited. The Impact of Genomic Variation on Function Consortium generated experimental data for 62,215 variants across ten genes using multiplexed assays and 1,407 variants across 163 genes using arrayed assays, curated 193,139 additional community-generated variant effect measurements across 30 additional genes, and developed automated calibration methods for translating experimental data and variant effect predictions into clinical evidence. To reduce current VUS, we developed a scalable workflow using only experimental and predictive evidence, enabling reclassification of 75% of the 16,115 VUS in these genes as pathogenic or benign with <1% error. To minimize future VUS, we analyzed >90,000 unobserved variants; 62% had enough evidence to be "preclassified" as pathogenic or benign. We validated our data, evidence and classifications using All of Us and created interactive resources to enable clinical use of the calibrated data. Thus, for 40 genes, representing 1% of the clinical genome, we resolve most existing VUS and future variants, illustrating how systematic use of scalable evidence can empower genomic medicine.
]]></description>
<dc:creator>Tejura, M.</dc:creator>
<dc:creator>Chen, Y.</dc:creator>
<dc:creator>McEwen, A. E.</dc:creator>
<dc:creator>Stewart, R.</dc:creator>
<dc:creator>Sverchkov, Y.</dc:creator>
<dc:creator>Laval, F.</dc:creator>
<dc:creator>Woo, I.</dc:creator>
<dc:creator>Zeiberg, D.</dc:creator>
<dc:creator>Shen, R.</dc:creator>
<dc:creator>Fayer, S.</dc:creator>
<dc:creator>Stone, J.</dc:creator>
<dc:creator>Smith, N.</dc:creator>
<dc:creator>Casadei, S.</dc:creator>
<dc:creator>Wang, Z. R.</dc:creator>
<dc:creator>Snyder, M.</dc:creator>
<dc:creator>Capodanno, B. J.</dc:creator>
<dc:creator>Gupta, P.</dc:creator>
<dc:creator>Benazouz, M.</dc:creator>
<dc:creator>Jain, S.</dc:creator>
<dc:creator>Heidl, S.</dc:creator>
<dc:creator>Muffley, L.</dc:creator>
<dc:creator>Dong, S.</dc:creator>
<dc:creator>Lin, K.</dc:creator>
<dc:creator>Hitz, B. C.</dc:creator>
<dc:creator>Gabdank, I.</dc:creator>
<dc:creator>Da, E. Y.</dc:creator>
<dc:creator>Best, S.</dc:creator>
<dc:creator>Grindstaff, S.</dc:creator>
<dc:creator>Reinhart, D.</dc:creator>
<dc:creator>Rodriguez-Salas, L.</dc:creator>
<dc:creator>Seid, O.</dc:creator>
<dc:creator>Vandi, A. J.</dc:creator>
<dc:creator>Wenman, C.</dc:creator>
<dc:creator>Wheelock, M. K.</dc:creator>
<dc:creator>Pendyala, S.</dc:creator>
<dc:creator>Holmes, D.</dc:creator>
<dc:creator>Xu, A.</dc:creator>
<dc:creator>Hosokai, A.</dc:creator>
<dc:creator>Tixhon, M.</dc:creator>
<dc:creator>Reno, C.</dc:creator>
<dc:creator>Ewald, J. D.</dc:creator>
<dc:creator>Spirohn-Fitzgerald, K.</dc:creator>
<dc:creator>Teelucksingh, T.</dc:creator>
<dc:creator>Hao, T.</dc:creator>
<dc:creator>Chen, Z. S.</dc:creator>
<dc:creator>Haghighi, M.</dc:creator>
<dc:creator>Hamid, A. K.</dc:creator>
<dc:date>2026-02-15</dc:date>
<dc:identifier>doi:10.64898/2026.02.14.705848</dc:identifier>
<dc:title><![CDATA[A scalable approach to resolving variants of uncertain significance]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2026-02-15</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2026.02.25.708096v1?rss=1">
<title>
<![CDATA[
Uncertainty-aware synthetic lethality prediction with pretrained foundation models 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2026.02.25.708096v1?rss=1
</link>
<description><![CDATA[
Synthetic lethality (SL) offers a promising paradigm for targeted cancer therapy, yet experimental identification of SL gene pairs remains costly, context-dependent, and biased toward well-studied genes. Existing computational approaches often rely on curated protein-protein interaction (PPI) networks and Gene Ontology (GO) annotations, which limit their ability to generalize to novel genes. Here we introduce CILANTRO-SL, a two-stage, graph-free framework that leverages pretrained biological foundation models to predict SL pairs with calibrated uncertainty. In Stage 1, we apply a pretrained single-cell foundation model to bulk RNA-seq profiles of cancer cell lines to obtain context-aware embeddings and perform in silico gene knockouts to generate delta embeddings. These perturbation signals are further conditioned on a data-driven gene prior and supervised with CRISPR viability readouts to learn knockout-aware viability embeddings. In Stage 2, we derive pairwise features from these embeddings and train a lightweight classifier to distinguish SL from non-SL pairs. To enable reliable experimental prioritization, CILANTRO-SL incorporates conformal prediction, producing calibrated and interpretable prediction sets that highlight high-confidence SL candidates. Across two evaluation settings, including zero-shot generalization to unseen gene pairs and to unseen genes, ablation analyses show that viability pretraining and the gene prior substantially improve performance while avoiding reliance on PPI and GO features. CILANTRO-SL therefore transforms pretrained biological representations into practical, uncertainty-aware hypotheses that support robust and scalable discovery of therapeutic targets.
]]></description>
<dc:creator>Hua, K.</dc:creator>
<dc:creator>Haber, E.</dc:creator>
<dc:creator>Ma, J.</dc:creator>
<dc:date>2026-02-27</dc:date>
<dc:identifier>doi:10.64898/2026.02.25.708096</dc:identifier>
<dc:title><![CDATA[Uncertainty-aware synthetic lethality prediction with pretrained foundation models]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2026-02-27</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2026.02.14.705860v1?rss=1">
<title>
<![CDATA[
Fully T2T pedigree assemblies reveal genetic stability and epigenetic plasticity of human centromeres across inheritance and cell-fate transitions 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2026.02.14.705860v1?rss=1
</link>
<description><![CDATA[
Centromeres are essential chromosome components yet remain poorly understood due to their highly repetitive sequence architecture. Using fully-phased telomere-to-telomere diploid assemblies from a three-generation pedigree integrated with long-read epigenomes from matched peripheral blood mononuclear cells, induced pluripotent stem cells, and neural progenitor cells, we generate allele-resolved single basepair resolution maps of centromere genetic and epigenetic dynamics across inheritance, reprogramming, and differentiation. We show that centromeric dip regions (CDRs), which define the functional core of centromeres, are positionally stable across generations and cell-fate transitions. In contrast, CDR epigenetic architecture is highly dynamic. Reprogramming markedly attenuates CDR hypomethylation, which is partially restored during differentiation in parallel with global hypomethylation of active alpha-satellite arrays and coordinated changes in nucleosome organization and protein occupancy. Centromeric remodeling is insulated from X-chromosome status, including Xa, Xi, and erosion. Finally, de novo mutations arising during reprogramming are enriched in centromeric regions but depleted within functional centromeric cores.
]]></description>
<dc:creator>Dong, S.</dc:creator>
<dc:creator>Xing, X.</dc:creator>
<dc:creator>Cechova, M.</dc:creator>
<dc:creator>Loucks, H.</dc:creator>
<dc:creator>Vijayalingam, S.</dc:creator>
<dc:creator>Neilson, A.</dc:creator>
<dc:creator>Sentmanat, M.</dc:creator>
<dc:creator>Macias-Velasco, J. F.</dc:creator>
<dc:creator>Liu, T.</dc:creator>
<dc:creator>Dong, Z.</dc:creator>
<dc:creator>Miao, B.</dc:creator>
<dc:creator>Zhang, W.</dc:creator>
<dc:creator>Tomlinson, C.</dc:creator>
<dc:creator>Schmidt, H.</dc:creator>
<dc:creator>Belter, E. A.</dc:creator>
<dc:creator>Hu, M.</dc:creator>
<dc:creator>Cui, X.</dc:creator>
<dc:creator>Stitziel, N. O.</dc:creator>
<dc:creator>Miga, K. H.</dc:creator>
<dc:creator>Wang, T.</dc:creator>
<dc:date>2026-02-17</dc:date>
<dc:identifier>doi:10.64898/2026.02.14.705860</dc:identifier>
<dc:title><![CDATA[Fully T2T pedigree assemblies reveal genetic stability and epigenetic plasticity of human centromeres across inheritance and cell-fate transitions]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2026-02-17</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2025.07.28.667228v1?rss=1">
<title>
<![CDATA[
ToxiTaRGET: a multi-omics resource for toxicant-responsive molecular targets 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2025.07.28.667228v1?rss=1
</link>
<description><![CDATA[
Environmental toxicant exposures can induce widespread alterations in both the transcriptome and epigenome of mammals, and directly contribute to the increased risk of various diseases, including cardiovascular disorders, cancer, and neurological disorders. To evaluate how early-life toxicants produce long-term impacts on the transcriptome and epigenome in mice, the Toxicant Exposures and Responses by Genomic and Epigenomic Regulators of Transcription II (TaRGET II) Consortium generated a landmark resource comprising 3,607 multi-omics from longitudinal studies in mice. The molecular changes in response to distinct environmental toxicants, including arsenic (As), lead (Pb), bisphenol A (BPA), tributyltin (TBT), di-2-ethylhexyl phthalate (DEHP), dioxin (TCDD), and fine particulate matter (PM2.5), were systematically identified and visualized on an integrative platform, ToxiTaRGET, so that researchers can quickly search and browse them. ToxiTaRGET houses a rich repository of molecular signatures, including gene expression, chromatin accessibility, and DNA methylation profiles, in response to early-life toxicant exposures. These molecular signatures span multiple biologically important tissues in both male and female mice at three distinct life stages, offering a valuable resource for the environmental health and toxicogenomic research communities.
]]></description>
<dc:creator>Kumar, R.</dc:creator>
<dc:creator>Fu, T.</dc:creator>
<dc:creator>Kuntala, P. K.</dc:creator>
<dc:creator>Fu, S.</dc:creator>
<dc:creator>Li, D.</dc:creator>
<dc:creator>Bartolomei, M. S.</dc:creator>
<dc:creator>Walker, C. L.</dc:creator>
<dc:creator>Wang, T.</dc:creator>
<dc:creator>Zhang, B. A.</dc:creator>
<dc:date>2025-08-02</dc:date>
<dc:identifier>doi:10.1101/2025.07.28.667228</dc:identifier>
<dc:title><![CDATA[ToxiTaRGET: a multi-omics resource for toxicant-responsive molecular targets]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2025-08-02</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2025.05.19.654884v1?rss=1">
<title>
<![CDATA[
EYKTHYR reveals transcriptional regulators of spatial gene programs 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2025.05.19.654884v1?rss=1
</link>
<description><![CDATA[
Understanding how transcription factors (TFs) orchestrate gene regulatory networks that define complex tissue structures is central to uncovering tissue organization and disease mechanisms. Although spatial multiome technologies now enable in situ measurement of both transcriptional activity and chromatin accessibility, existing computational methods either overlook spatial tissue context or are hindered by the high dropout rates characteristic of such data. Here, we introduce EYKTHYR, a computational framework that integrates gene expression and chromatin accessibility within a spatially aware model to identify TFs driving spatial gene programs. EYKTHYR mitigates dropout effects by leveraging interpretable, low-dimensional embeddings of gene expression and chromatin accessibility - both linear with respect to their input - enabling robust identification and scalable inference of spatial transcriptional regulators. Applied across diverse spatial multiome datasets, EYKTHYR consistently outperforms existing approaches, accurately identifying TFs that coordinate spatial gene programs in mouse brain development and regulate T-cell states within tumor microenvironments. EYKTHYR establishes a foundation for decoding how TFs interpret local intercellular signaling to shape tissue structure, offering insights into the regulatory logic underlying spatial organization in health and disease.
]]></description>
<dc:creator>Krieger, S.</dc:creator>
<dc:creator>Haber, E.</dc:creator>
<dc:creator>Ma, J.</dc:creator>
<dc:date>2025-05-23</dc:date>
<dc:identifier>doi:10.1101/2025.05.19.654884</dc:identifier>
<dc:title><![CDATA[EYKTHYR reveals transcriptional regulators of spatial gene programs]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2025-05-23</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2025.05.08.652741v1?rss=1">
<title>
<![CDATA[
POPARI: Modeling multisample variation in spatial transcriptomics 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2025.05.08.652741v1?rss=1
</link>
<description><![CDATA[
Integrating spatially-resolved transcriptomics (SRT) across biological samples is essential for understanding dynamic changes in tissue architecture and cell-cell interactions in situ. While tools exist for multisample single-cell RNA-seq, methods tailored to multisample SRT remain limited. Here, we introduce POPARI, a probabilistic graphical model for factor-based decomposition of multisample SRT that captures condition-specific changes in spatial organization. POPARI jointly learns spatial metagenes - linear gene expression programs - and their spatial affinities across samples. Its key innovations include a differential prior to regularize spatial accordance and spatial downsampling to enable multiresolution, hierarchical analysis. Simulations show POPARI outperforms existing methods on multisample and multi-resolution spatial metrics. Applications to real datasets uncover spatial metagene dynamics, spatial accordance, and cell identities. In mouse brain (STARmap PLUS), POPARI identifies spatial metagenes linked to AD; in thymus (Slide-TCR-seq), it captures increasing colocalization of V(D)J recombination and T cell proliferation; and in ovarian cancer (CosMx), it reveals sample-specific malignant-immune interactions. Overall, POPARI provides a general, interpretable framework for analyzing variation in multisample SRT.
]]></description>
<dc:creator>Alam, S.</dc:creator>
<dc:creator>Zhou, T.</dc:creator>
<dc:creator>Haber, E.</dc:creator>
<dc:creator>Chidester, B.</dc:creator>
<dc:creator>Liu, S.</dc:creator>
<dc:creator>Chen, F.</dc:creator>
<dc:creator>Ma, J.</dc:creator>
<dc:date>2025-05-13</dc:date>
<dc:identifier>doi:10.1101/2025.05.08.652741</dc:identifier>
<dc:title><![CDATA[POPARI: Modeling multisample variation in spatial transcriptomics]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2025-05-13</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2025.04.06.647437v1?rss=1">
<title>
<![CDATA[
STEAMBOAT: Attention-based multiscale delineation of cellular interactions in tissues 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2025.04.06.647437v1?rss=1
</link>
<description><![CDATA[
Spatial-omics technologies profile cells in their native spatial context within tissues, enabling more complete understanding of cellular properties. However, a key computational challenge remains: identifying cellular interactions that underlie cell types and states - interactions that are essential for spatial organization and provide a biologically grounded framework for understanding cell identities and spatial patterns. These interactions span different distances and thus require multiscale modeling, which remains a major gap in existing methods. Here, we introduce STEAMBOAT, an interpretable machine learning framework that leverages a self-supervised, multi-head attention model to uniquely decompose gene expression of a cell into multiple key factors: intrinsic cell programs, neighboring cell communication, and long-range interactions. By applying STEAMBOAT to diverse tissues in health and disease across various spatial-omics technologies, we demonstrate its ability to uncover critical multiscale cellular interactions, capturing classical contact signaling and revealing previously unrecognized patterns of cellular communication. STEAMBOAT provides a powerful approach for spatial-omics analysis, offering new insights into the multiscale spatial organization of cells and their communication across a wide range of biological contexts.
]]></description>
<dc:creator>Liang, S.</dc:creator>
<dc:creator>Tang, J.</dc:creator>
<dc:creator>Wang, G.</dc:creator>
<dc:creator>Ma, J.</dc:creator>
<dc:date>2025-04-10</dc:date>
<dc:identifier>doi:10.1101/2025.04.06.647437</dc:identifier>
<dc:title><![CDATA[STEAMBOAT: Attention-based multiscale delineation of cellular interactions in tissues]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2025-04-10</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2025.03.31.646238v1?rss=1">
<title>
<![CDATA[
Unified integration of spatial transcriptomics across platforms 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2025.03.31.646238v1?rss=1
</link>
<description><![CDATA[
Spatial transcriptomics (ST) has transformed our understanding of tissue architecture and cellular interactions, but integrating ST data across platforms remains challenging due to differences in gene panels, data sparsity, and technical variability. Here, we introduce LLOKI, a novel framework for integrating imaging-based ST data from diverse platforms without requiring shared gene panels. LLOKI addresses ST integration through two key alignment tasks: feature alignment across technologies and batch alignment across datasets. Optimal transport-guided feature propagation adjusts data sparsity to match scRNA-seq references through graph-based imputation, enabling single-cell foundation models such as scGPT to generate unified features. Batch alignment then refines scGPT-transformed embeddings, mitigating batch effects while preserving biological variability. Evaluations on mouse brain samples from five different technologies demonstrate that LLOKI outperforms existing methods and is effective for cross-technology spatial gene program identification and tissue slice alignment. Applying LLOKI to five ovarian cancer datasets, we identify an integrated gene program indicative of tumor-infiltrating T cells across gene panels. Together, LLOKI provides a robust foundation for cross-platform ST studies, with the potential to scale to large atlas datasets, enabling deeper insights into cellular organization and tissue environments.
]]></description>
<dc:creator>Haber, E.</dc:creator>
<dc:creator>Deshpande, A.</dc:creator>
<dc:creator>Ma, J.</dc:creator>
<dc:creator>Krieger, S.</dc:creator>
<dc:date>2025-04-05</dc:date>
<dc:identifier>doi:10.1101/2025.03.31.646238</dc:identifier>
<dc:title><![CDATA[Unified integration of spatial transcriptomics across platforms]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2025-04-05</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2025.03.29.646092v1?rss=1">
<title>
<![CDATA[
Expression spectrum of TE-derived transcripts in human adult tissues 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2025.03.29.646092v1?rss=1
</link>
<description><![CDATA[
Transposable elements (TEs) are vital components of eukaryotic genomes and have played a critical role in genome evolution. Although most TEs are silenced in the mammalian genome, increasing evidence suggests that certain TEs are actively involved in gene regulation during early developmental stages. However, the extent to which human TEs drive gene transcription in adult tissues remains largely unexplored. In this study, we systematically analyzed 17,329 human transcriptomes to investigate how TEs influence gene transcription across 47 adult tissues. Our findings reveal that TE-derived transcripts are broadly expressed in human tissues, contributing to both housekeeping functions and tissue-specific gene regulation. We identified sex-specific expression of TE-derived transcripts regulated by sex hormones in breast tissue between females and males. Our results demonstrated that TE-derived alternative transcription initiation significantly enhances the variety of translated protein products, e.g., changes in the N-terminal peptide length of WNT2B caused by TE-derived transcription result in isoform-specific subcellular localization. Additionally, we identified 68 human-specific TE-derived transcripts associated with metabolic processes and environmental adaptation. Together, these findings highlight the pivotal evolutionary role of TEs in shaping the human transcriptome, demonstrating how conserved and human-specific TEs contribute to transcriptional and translational innovation in human genome evolution.
]]></description>
<dc:creator>Miao, B.</dc:creator>
<dc:creator>Zhang, B.</dc:creator>
<dc:creator>Wu, T. P.</dc:creator>
<dc:creator>Luo, X.</dc:creator>
<dc:creator>Ademovic, A.</dc:creator>
<dc:creator>Yang, Y.</dc:creator>
<dc:date>2025-04-03</dc:date>
<dc:identifier>doi:10.1101/2025.03.29.646092</dc:identifier>
<dc:title><![CDATA[Expression spectrum of TE-derived transcripts in human adult tissues]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2025-04-03</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2025.03.29.646090v1?rss=1">
<title>
<![CDATA[
Machine Learning-Based Identification of Survival-Associated CpG Biomarkers in Pancreatic Ductal Adenocarcinoma 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2025.03.29.646090v1?rss=1
</link>
<description><![CDATA[
Pancreatic ductal adenocarcinoma (PDAC) is an exceptionally aggressive cancer with a 5-year survival rate of less than 10%, driven by late-stage diagnosis, limited treatment options, and a lack of reliable biomarkers for early detection and prognosis. In this study, we integrated DNA methylation data from TCGA and ICGC cohorts, categorizing samples based on survival time, and identified 684 differentially methylated CpG sites, along with 224 CpG biomarkers significantly associated with patient survival through statistical and machine learning-based analyses. We developed a random forest model to predict patient survival, achieving 85.2% accuracy for short-survival patients and 70.0% for long-survival patients in the validation set. External dataset validation further confirmed the model's robustness and accuracy. De novo motif analysis of genomic regions surrounding the 224 CpG biomarkers identified TWIST1 and FOXA2 as key transcriptional regulators enriched in survival-associated CpG sites, linking their activity to patient survival outcomes. Collectively, our findings highlight valuable epigenetic biomarkers and provide a predictive model to assess PDAC risk levels post-surgery, offering the potential for improved patient stratification and personalized therapeutic strategies.
]]></description>
<dc:creator>Zhang, B.</dc:creator>
<dc:creator>Zhang, Y.</dc:creator>
<dc:creator>Zhao, Y.</dc:creator>
<dc:date>2025-04-01</dc:date>
<dc:identifier>doi:10.1101/2025.03.29.646090</dc:identifier>
<dc:title><![CDATA[Machine Learning-Based Identification of Survival-Associated CpG Biomarkers in Pancreatic Ductal Adenocarcinoma]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2025-04-01</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2025.01.06.631595v1?rss=1">
<title>
<![CDATA[
DNALONGBENCH: A Benchmark Suite for Long-Range DNA Prediction Tasks 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2025.01.06.631595v1?rss=1
</link>
<description><![CDATA[
Modeling long-range DNA dependencies is crucial for understanding genome structure and function across a wide range of biological contexts. However, effectively capturing these extensive dependencies, which may span millions of base pairs in tasks such as three-dimensional (3D) chromatin folding prediction, remains a significant challenge. Furthermore, a comprehensive benchmark suite for evaluating tasks that rely on long-range dependencies is notably absent. To address this gap, we introduce DNALONGBENCH, a benchmark dataset encompassing five important genomics tasks that consider long-range dependencies up to 1 million base pairs: enhancer-target gene interaction, expression quantitative trait loci, 3D genome organization, regulatory sequence activity, and transcription initiation signals. To comprehensively assess DNALONGBENCH, we evaluate the performance of five methods: a task-specific expert model, a convolutional neural network (CNN)-based model, and three fine-tuned DNA foundation models - HyenaDNA, Caduceus-Ph, and Caduceus-PS. We envision DNALONGBENCH as a standardized resource with the potential to facilitate comprehensive comparisons and rigorous evaluations of emerging DNA sequence-based deep learning models that account for long-range dependencies.
]]></description>
<dc:creator>Cheng, W.</dc:creator>
<dc:creator>Song, Z.</dc:creator>
<dc:creator>Zhang, Y.</dc:creator>
<dc:creator>Wang, S.</dc:creator>
<dc:creator>Wang, D.</dc:creator>
<dc:creator>Yang, M.</dc:creator>
<dc:creator>Li, L.</dc:creator>
<dc:creator>Ma, J.</dc:creator>
<dc:date>2025-01-08</dc:date>
<dc:identifier>doi:10.1101/2025.01.06.631595</dc:identifier>
<dc:title><![CDATA[DNALONGBENCH: A Benchmark Suite for Long-Range DNA Prediction Tasks]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2025-01-08</prism:publicationDate>
<prism:section></prism:section>
</item>
<item rdf:about="https://biorxiv.org/cgi/content/short/2024.12.09.627422v1?rss=1">
<title>
<![CDATA[
L2G: Repurposing Language Models for Genomics Tasks 
]]>
</title>
<link>
https://biorxiv.org/cgi/content/short/2024.12.09.627422v1?rss=1
</link>
<description><![CDATA[
Pre-trained language models have transformed the field of natural language processing (NLP), and their success has inspired efforts in genomics to develop domain-specific foundation models (FMs). However, creating high-quality genomic FMs from scratch is resource-intensive, requiring significant computational power and high-quality pre-training data. The success of large language models (LLMs) in NLP has largely been driven by industrial-scale efforts leveraging vast, diverse corpora and massive computing infrastructure. In this work, we aim to bypass the data and computational bottlenecks of creating genomic FMs from scratch and instead propose repurposing existing LLMs for genomics tasks. Inspired by the recently observed cross-modal transfer phenomenon - where transformers pre-trained on natural language can generalize to other modalities - we introduce L2G, which adapts a pre-trained LLM architecture for genomics using neural architecture search (NAS) and a novel three-stage training procedure. Remarkably, without requiring extensive pre-training on DNA sequence data, L2G achieves superior performance to fine-tuned genomic FMs and task-specific models on more than half of tasks across multiple genomics benchmarks. In an enhancer activity prediction task, L2G further demonstrates its capacity to identify significant transcription factor motifs. Our work not only highlights the generalizability and efficacy of language models in out-of-domain tasks such as genomics, but also opens new avenues for more efficient and less resource-intensive methodologies in genomic research.
]]></description>
<dc:creator>Cheng, W.</dc:creator>
<dc:creator>Shen, J.</dc:creator>
<dc:creator>Khodak, M.</dc:creator>
<dc:creator>Ma, J.</dc:creator>
<dc:creator>Talwalkar, A.</dc:creator>
<dc:date>2024-12-11</dc:date>
<dc:identifier>doi:10.1101/2024.12.09.627422</dc:identifier>
<dc:title><![CDATA[L2G: Repurposing Language Models for Genomics Tasks]]></dc:title>
<dc:publisher>Cold Spring Harbor Laboratory Press</dc:publisher>
<prism:publicationDate>2024-12-11</prism:publicationDate>
<prism:section></prism:section>
</item>
</rdf:RDF>
