An accurate and robust imputation method scImpute for single-cell RNA-seq data (Nature Communications)
The emerging single-cell RNA sequencing (scRNA-seq) technologies enable the investigation of transcriptomic landscapes at single-cell resolution. ScRNA-seq data analysis is complicated by excess zero counts, the so-called dropouts, which are due to the low amounts of mRNA sequenced within individual cells. We introduce scImpute, a statistical method to accurately and robustly impute the dropouts in scRNA-seq data. scImpute automatically identifies likely dropouts and performs imputation only on these values, without introducing new biases to the rest of the data. scImpute also detects outlier cells and excludes them from imputation. Evaluation based on both simulated and real human and mouse scRNA-seq data suggests that scImpute is an effective tool to recover transcriptome dynamics masked by dropouts. scImpute is shown to identify likely dropouts, enhance the clustering of cell subpopulations, improve the accuracy of differential expression analysis, and aid the study of gene expression dynamics.
Bias, robustness and scalability in single-cell differential expression analysis (Nature Methods)
Many methods have been used to determine differential gene expression from single-cell RNA (scRNA)-seq data. We evaluated 36 approaches using experimental and synthetic data and found considerable differences in the number and characteristics of the genes that are called differentially expressed. Prefiltering of lowly expressed genes has important effects, particularly for some of the methods developed for bulk RNA-seq data analysis. However, we found that bulk RNA-seq analysis methods do not generally perform worse than those developed specifically for scRNA-seq. We also present conquer, a repository of consistently processed, analysis-ready public scRNA-seq data sets that is aimed at simplifying method evaluation and reanalysis of published results. Each data set provides abundance estimates for both genes and transcripts, as well as quality control and exploratory analysis reports.
Quartz-Seq2: a high-throughput single-cell RNA-sequencing method that effectively uses limited sequence reads (Genome Biology)
High-throughput single-cell RNA-seq methods assign limited unique molecular identifier (UMI) counts as gene expression values to single cells from shallow sequence reads and detect limited gene counts. We thus developed a high-throughput single-cell RNA-seq method, Quartz-Seq2, to overcome these issues. Our improvements in the reaction steps make it possible to effectively convert initial reads to UMI counts, at a rate of 30–50%, and detect more genes. To demonstrate the power of Quartz-Seq2, we analyzed approximately 10,000 transcriptomes from in vitro embryonic stem cells and an in vivo stromal vascular fraction with a limited number of reads.
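The read-to-UMI conversion rate quoted above can be made concrete with a small arithmetic sketch. The function names and the example figure of 100,000 reads per cell are hypothetical and chosen only for illustration; the 30–50% range is the one stated in the abstract.

```python
def umi_conversion_rate(umi_count, read_count):
    """Fraction of initial sequence reads that end up as unique UMI counts."""
    if read_count <= 0:
        raise ValueError("read_count must be positive")
    return umi_count / read_count


def expected_umis(read_count, rate):
    """UMI counts expected from read_count reads at a given conversion rate."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must lie between 0 and 1")
    return read_count * rate


# Hypothetical cell sequenced with 100,000 initial reads (illustrative value).
reads_per_cell = 100_000
low = expected_umis(reads_per_cell, 0.30)   # lower end of the reported range
high = expected_umis(reads_per_cell, 0.50)  # upper end of the reported range
print(f"Expected UMI counts: about {low:,.0f} to {high:,.0f}")
```

Under these assumptions, a higher conversion rate means that the same shallow sequencing budget yields more UMI counts per cell, which is the sense in which the method "effectively uses limited sequence reads".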
Cell-type specific sequencing of microRNAs from complex animal tissues (Nature Methods)
MicroRNAs (miRNAs) play an essential role in the post-transcriptional regulation of animal development and physiology. However, in vivo studies aimed at linking miRNA function to the biology of distinct cell types within complex tissues remain challenging, partly because in vivo miRNA-profiling methods lack cellular resolution. We report microRNome by methylation-dependent sequencing (mime-seq), an in vivo enzymatic small-RNA-tagging approach that enables high-throughput sequencing of tissue- and cell-type-specific miRNAs in animals. The method combines cell-type-specific 3′-terminal 2′-O-methylation of animal miRNAs by a genetically encoded, plant-specific methyltransferase (HEN1), with chemoselective small-RNA cloning and high-throughput sequencing. We show that mime-seq uncovers the miRNomes of specific cells within Caenorhabditis elegans and Drosophila at unprecedented specificity and sensitivity, enabling miRNA profiling with single-cell resolution in whole animals. Mime-seq overcomes current challenges in cell-type-specific small-RNA profiling and provides novel entry points for understanding the function of miRNAs in spatially restricted physiological settings.
SQANTI: extensive characterization of long-read transcript sequences for quality control in full-length transcriptome identification and quantification (Genome Research)
High-throughput sequencing of full-length transcripts using long reads has paved the way for the discovery of thousands of novel transcripts, even in well-annotated mammalian species. The advances in sequencing technology have created a need for studies and tools that can characterize these novel variants. Here, we present SQANTI, an automated pipeline for the classification of long-read transcripts that can assess the quality of data and the preprocessing pipeline using 47 unique descriptors. We apply SQANTI to a neuronal mouse transcriptome using Pacific Biosciences (PacBio) long reads and illustrate how the tool is effective in characterizing and describing the composition of the full-length transcriptome. We perform extensive evaluation of ToFU PacBio transcripts by PCR to reveal that an important number of the novel transcripts are technical artifacts of the sequencing approach and that SQANTI quality descriptors can be used to engineer a filtering strategy to remove them. Most novel transcripts in this curated transcriptome are novel combinations of existing splice sites, resulting more frequently in novel ORFs than novel UTRs, and are enriched in both general metabolic and neural-specific functions. We show that these new transcripts have a major impact in the correct quantification of transcript levels by state-of-the-art short-read-based quantification algorithms. By comparing our iso-transcriptome with public proteomics databases, we find that alternative isoforms are elusive to proteogenomics detection. SQANTI allows the user to maximize the analytical outcome of long-read technologies by providing the tools to deliver quality-evaluated and curated full-length transcriptomes.
Pan-cancer genome and transcriptome analyses of 1,699 paediatric leukaemias and solid tumours (Nature)
Analysis of molecular aberrations across multiple cancer types, known as pan-cancer analysis, identifies commonalities and differences in key biological processes that are dysregulated in cancer cells from diverse lineages. Pan-cancer analyses have been performed for adult [1,2,3,4] but not paediatric cancers, which commonly occur in developing mesodermic rather than adult epithelial tissues [5]. Here we present a pan-cancer study of somatic alterations, including single nucleotide variants, small insertions or deletions, structural variations, copy number alterations, gene fusions and internal tandem duplications in 1,699 paediatric leukaemias and solid tumours across six histotypes, with whole-genome, whole-exome and transcriptome sequencing data processed under a uniform analytical framework. We report 142 driver genes in paediatric cancers, of which only 45% match those found in adult pan-cancer studies; copy number alterations and structural variants constituted the majority (62%) of events. Eleven genome-wide mutational signatures were identified, including one attributed to ultraviolet-light exposure in eight aneuploid leukaemias. Transcription of the mutant allele was detectable for 34% of protein-coding mutations, and 20% exhibited allele-specific expression. These data provide a comprehensive genomic architecture for paediatric cancers and emphasize the need for paediatric cancer-specific development of precision therapies.
Sheep genome functional annotation reveals proximal regulatory elements contributed to the evolution of modern breeds (Nature Communications)
Domestication fundamentally reshaped animal morphology, physiology and behaviour, offering the opportunity to investigate the molecular processes driving evolutionary change. Here we assess sheep domestication and artificial selection by comparing genome sequence from 43 modern breeds (Ovis aries) and their Asian mouflon ancestor (O. orientalis) to identify selection sweeps. Next, we provide a comparative functional annotation of the sheep genome, validated using experimental ChIP-Seq of sheep tissue. Using these annotations, we evaluate the impact of selection and domestication on regulatory sequences and find that sweeps are significantly enriched for protein coding genes, proximal regulatory elements of genes and genome features associated with active transcription. Finally, we find individual sites displaying strong allele frequency divergence are enriched for the same regulatory features. Our data demonstrate that remodelling of gene expression is likely to have been one of the evolutionary forces that drove phenotypic diversification of this common livestock species.
Genome-wide analysis yields new loci associating with aortic valve stenosis (Nature Communications)
Aortic valve stenosis (AS) is the most common valvular heart disease, and valve replacement is the only definitive treatment. Here we report a large genome-wide association (GWA) study of 2,457 Icelandic AS cases and 349,342 controls with a follow-up in up to 4,850 cases and 451,731 controls of European ancestry. We identify two new AS loci, on chromosome 1p21 near PALMD (rs7543130; odds ratio (OR) = 1.20, P = 1.2 × 10^-22) and on chromosome 2q22 in TEX41 (rs1830321; OR = 1.15, P = 1.8 × 10^-13). Rs7543130 also associates with bicuspid aortic valve (BAV) (OR = 1.28, P = 6.6 × 10^-10) and aortic root diameter (P = 1.30 × 10^-8), and rs1830321 associates with BAV (OR = 1.12, P = 5.3 × 10^-3) and coronary artery disease (OR = 1.05, P = 9.3 × 10^-5). The results implicate both cardiac developmental abnormalities and atherosclerosis-like processes in the pathogenesis of AS. We show that several pathways are shared by CAD and AS. Causal analysis suggests that the shared risk factors of Lp(a) and non-high-density lipoprotein cholesterol contribute substantially to the frequent co-occurrence of these diseases.
Assembly of 913 microbial genomes from metagenomic sequencing of the cow rumen (Nature Communications)
The cow rumen is adapted for the breakdown of plant material into energy and nutrients, a task largely performed by enzymes encoded by the rumen microbiome. Here we present 913 draft bacterial and archaeal genomes assembled from over 800 Gb of rumen metagenomic sequence data derived from 43 Scottish cattle, using both metagenomic binning and Hi-C-based proximity-guided assembly. Most of these genomes represent previously unsequenced strains and species. The draft genomes contain over 69,000 proteins predicted to be involved in carbohydrate metabolism, over 90% of which do not have a good match in public databases. Inclusion of the 913 genomes presented here improves metagenomic read classification by sevenfold against our own data, and by fivefold against other publicly available rumen datasets. Thus, our dataset substantially improves the coverage of rumen microbial genomes in the public databases and represents a valuable resource for biomass-degrading enzyme discovery and studies of the rumen microbiome.
The human noncoding genome defined by genetic diversity (Nature Genetics)
Understanding the significance of genetic variants in the noncoding genome is emerging as the next challenge in human genomics. We used the power of 11,257 whole-genome sequences and 16,384 heptamers (7-nt motifs) to build a map of sequence constraint for the human species. This build differed substantially from traditional maps of interspecies conservation and identified regulatory elements among the most constrained regions of the genome. Using new Hi-C experimental data, we describe a strong pattern of coordination over 2 Mb where the most constrained regulatory elements associate with the most essential genes. Constrained regions of the noncoding genome are up to 52-fold enriched for known pathogenic variants as compared to unconstrained regions (21-fold when compared to the genome average). This map of sequence constraint across thousands of individuals is an asset to help interpret noncoding elements in the human genome, prioritize variants and reconsider gene units at a larger scale.
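The figure of 16,384 heptamers follows from counting every possible 7-nucleotide sequence over the four DNA bases (4^7 = 16,384). The sketch below enumerates them and shows one way to slide a 7-nt window along a sequence; the function names and the choice to index each window by its central position are illustrative assumptions, not the authors' pipeline.

```python
from itertools import product


def all_kmers(k, alphabet="ACGT"):
    """Return every possible k-mer over the alphabet, in lexicographic order."""
    return ["".join(p) for p in product(alphabet, repeat=k)]


def kmer_windows(seq, k=7):
    """Yield (center_index, kmer) for each full-length window of size k in seq."""
    for start in range(len(seq) - k + 1):
        yield start + k // 2, seq[start:start + k]


heptamers = all_kmers(7)
print(len(heptamers))  # 4 ** 7 = 16384

# Toy sequence (made up for illustration): three overlapping heptamer windows.
for center, kmer in kmer_windows("ACGTACGTA"):
    print(center, kmer)
```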
High-resolution 3D models of Caulobacter crescentus chromosome reveal genome structural variability and organization (Nucleic Acids Research)
High-resolution three-dimensional models of Caulobacter crescentus nucleoid structures were generated via a multi-scale modeling protocol. Models were built as a plectonemically supercoiled circular DNA and by incorporating chromosome conformation capture based data to generate an ensemble of base pair resolution models consistent with the experimental data. Significant structural variability was found with different degrees of bending and twisting but with overall similar topologies and shapes that are consistent with C. crescentus cell dimensions. The models allowed a direct mapping of the genomic sequence onto the three-dimensional nucleoid structures. Distinct spatial distributions were found for several genomic elements such as AT-rich sequence elements where nucleoid associated proteins (NAPs) are likely to bind, promoter sites, and some genes with common cellular functions. These findings shed light on the correlation between the spatial organization of the genome and biological functions.
FIND: difFerential chromatin INteractions Detection using a spatial Poisson process (Genome Research)
Polymer-based simulations and experimental studies indicate the existence of a spatial dependency between the adjacent DNA fibers involved in the formation of chromatin loops. However, the existing strategies for detecting differential chromatin interactions assume that the interacting segments are spatially independent from the other segments nearby. To resolve this issue, we developed a new computational method, FIND, which considers the local spatial dependency between interacting loci. FIND uses a spatial Poisson process to detect differential chromatin interactions that show a significant difference in their interaction frequency and the interaction frequency of their neighbors. Simulation and biological data analysis show that FIND outperforms the widely used count-based methods and has a better signal-to-noise ratio.
Metaviz: interactive statistical and visual analysis of metagenomic data (Nucleic Acids Research)
Large studies profiling microbial communities and their association with healthy or disease phenotypes are now commonplace. Processed data from many of these studies are publicly available but significant effort is required for users to effectively organize, explore and integrate it, limiting the utility of these rich data resources. Effective integrative and interactive visual and statistical tools to analyze many metagenomic samples can greatly increase the value of these data for researchers. We present Metaviz, a tool for interactive exploratory data analysis of annotated microbiome taxonomic community profiles derived from marker gene or whole metagenome shotgun sequencing. Metaviz is uniquely designed to address the challenge of browsing the hierarchical structure of metagenomic data features while rendering visualizations of data values that are dynamically updated in response to user navigation. We use Metaviz to provide the UMD Metagenome Browser web service, allowing users to browse and explore data for more than 7000 microbiomes from published studies. Users can also deploy Metaviz as a web service, or use it to analyze data through the metavizr package to interoperate with state-of-the-art analysis tools available through Bioconductor. Metaviz is free and open source with the code, documentation and tutorials publicly accessible.