In [1]:
# This step sets my R_LIBS_USER
source("~/R.rc")

A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor

This notebook is fully based on the tutorial available on https://www.bioconductor.org/help/workflows/simpleSingleCell/#normalizing-based-on-spike-in-coverage.

Authors:

Aaron T. L. Lun1, Davis J. McCarthy2,3 and John C. Marioni1,2,4 1.Cancer Research UK Cambridge Institute, Li Ka Shing Centre, Robinson Way, Cambridge CB2 0RE, United Kingdom

2.EMBL European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom

3.St Vincent's Institute of Medical Research, 41 Victoria Parade, Fitzroy, Victoria 3065, Australia

4.Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, United Kingdom

1. Introduction

Single-cell RNA sequencing (scRNA-seq) is widely used to measure the genome-wide expression profile of individual cells. From each cell, mRNA is isolated and reverse transcribed to cDNA for high-throughput sequencing (Stegle, Teichmann, and Marioni 2015). This can be done using microfluidics platforms like the Fluidigm C1 (Pollen et al. 2014), protocols based on microtiter plates like Smart-seq2 (Picelli et al. 2014), or droplet-based technologies like inDrop (Klein et al. 2015; Macosko et al. 2015). The number of reads mapped to each gene is then used to quantify its expression in each cell. Alternatively, unique molecular identifiers (UMIs) can be used to directly measure the number of transcript molecules for each gene (Islam et al. 2014). Count data are analyzed to detect highly variable genes (HVGs) that drive heterogeneity across cells in a population, to find correlations between genes and cellular phenotypes, or to identify new subpopulations via dimensionality reduction and clustering. This provides biological insights at a single-cell resolution that cannot be achieved with conventional bulk RNA sequencing of cell populations.

Strategies for scRNA-seq data analysis differ markedly from those for bulk RNA-seq. One technical reason is that scRNA-seq data are much noisier than bulk data (Brennecke et al. 2013; Marinov et al. 2014). Reliable capture (i.e., conversion) of transcripts into cDNA for sequencing is difficult with the low quantity of RNA in a single cell. This increases the frequency of drop-out events where none of the transcripts for a gene are captured. Dedicated steps are required to deal with this noise during analysis, especially during quality control. In addition, scRNA-seq data can be used to study cell-to-cell heterogeneity, e.g., to identify new cell subtypes, to characterize differentiation processes, to assign cells into their cell cycle phases, or to identify HVGs driving variability across the population (Vallejos, Marioni, and Richardson 2015; J. Fan et al. 2016; Trapnell et al. 2014). This is simply not possible with bulk data, meaning that custom methods are required to perform these analyses.

This article describes a computational workflow for basic analysis of scRNA-seq data, using software packages from the open-source Bioconductor project (release 3.5) (Huber et al. 2015). Starting from a count matrix, this workflow contains the steps required for quality control to remove problematic cells; normalization of cell-specific biases, with and without spike-ins; cell cycle phase classification from gene expression data; data exploration to identify putative subpopulations; and finally, HVG and marker gene identification to prioritize interesting genes. The application of different steps in the workflow will be demonstrated on several public scRNA-seq datasets involving haematopoietic stem cells, brain-derived cells, T-helper cells and mouse embryonic stem cells, generated with a range of experimental protocols and platforms (N. K. Wilson et al. 2015; Zeisel et al. 2015; Buettner et al. 2015; Kołodziejczyk et al. 2015). The aim is to provide a variety of modular usage examples that can be applied to construct custom analysis pipelines.

Note: to cite this article, please refer to http://f1000research.com/articles/5-2122/v1 for instructions.

If you wish to start from raw data and generate the counts table your self follow these stepts (otherwise jump to the "Download" section):

Download the list of accession files from https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRP047290&go=go using "Download" > "Accession list". Then use the sratoolkit to download the files into compressed (gz) fastq files:

while read line; do fastq-dump --gzip --split-files ${line} ; done < Accession_list.txt

Map the reasd to the reference genome with:

hisat2 -p 18 --rna-strandness R --dta-cufflinks --met-file hisat_output/${file%_1.fastq.gz}.stats \
-x ${hisat_index} -S hisat_output/${file%_1.fastq.gz}.sam -1 ${file} -2 ${file%_1.fastq.gz}_2.fastq.gz

Having this we need to count the number of reads mapped to each gene (exons only).

In [2]:
library( "GenomicFeatures" )
library("GenomicAlignments")

files<-c('SRR1578712.bam', 'SRR1578672.bam', 'SRR1578704.bam', 'SRR1578663.bam', 'SRR1578693.bam', 'SRR1578681.bam', 'SRR1578696.bam', 'SRR1578703.bam', 'SRR1578699.bam', 'SRR1578653.bam', 'SRR1578707.bam', 'SRR1578652.bam', 'SRR1578708.bam', 'SRR1578715.bam', 'SRR1578730.bam', 'SRR1578735.bam', 'SRR1578676.bam', 'SRR1578650.bam', 'SRR1578675.bam', 'SRR1578716.bam', 'SRR1578664.bam', 'SRR1578705.bam', 'SRR1578718.bam', 'SRR1578706.bam', 'SRR1578679.bam', 'SRR1578689.bam', 'SRR1578647.bam', 'SRR1578723.bam', 'SRR1578656.bam', 'SRR1578666.bam', 'SRR1578686.bam', 'SRR1578643.bam', 'SRR1578654.bam', 'SRR1578680.bam', 'SRR1578659.bam', 'SRR1578727.bam', 'SRR1578734.bam', 'SRR1578737.bam', 'SRR1578665.bam', 'SRR1578733.bam', 'SRR1578736.bam', 'SRR1578688.bam', 'SRR1578690.bam', 'SRR1578726.bam', 'SRR1578698.bam', 'SRR1578661.bam', 'SRR1578645.bam', 'SRR1578720.bam', 'SRR1578649.bam', 'SRR1578658.bam', 'SRR1578695.bam', 'SRR1578710.bam', 'SRR1578683.bam', 'SRR1578732.bam', 'SRR1578692.bam', 'SRR1578697.bam', 'SRR1578728.bam', 'SRR1578721.bam', 'SRR1578694.bam', 'SRR1578668.bam', 'SRR1578678.bam', 'SRR1578731.bam', 'SRR1578644.bam', 'SRR1578677.bam', 'SRR1578657.bam', 'SRR1578691.bam', 'SRR1578682.bam', 'SRR1578724.bam', 'SRR1578667.bam', 'SRR1578738.bam', 'SRR1578655.bam', 'SRR1578673.bam', 'SRR1578662.bam', 'SRR1578670.bam', 'SRR1578714.bam', 'SRR1578717.bam', 'SRR1578660.bam', 'SRR1578709.bam', 'SRR1578719.bam', 'SRR1578669.bam', 'SRR1578702.bam', 'SRR1578687.bam', 'SRR1578713.bam', 'SRR1578648.bam', 'SRR1578700.bam', 'SRR1578725.bam', 'SRR1578651.bam', 'SRR1578701.bam', 'SRR1578685.bam', 'SRR1578646.bam', 'SRR1578711.bam', 'SRR1578684.bam', 'SRR1578674.bam', 'SRR1578729.bam', 'SRR1578671.bam', 'SRR1578722.bam')
names<-c('HSC1_1', 'HSC1_2', 'HSC1_3', 'HSC1_4', 'HSC1_5', 'HSC1_6', 'HSC1_7', 'HSC1_8', 'HSC1_9', 'HSC1_10', 'HSC1_11', 'HSC1_12', 'HSC1_13', 'HSC1_14', 'HSC1_15', 'HSC1_16', 'HSC1_17', 'HSC1_18', 'HSC1_19', 'HSC1_20', 'HSC1_21', 'HSC1_22', 'HSC1_23', 'HSC1_24', 'HSC1_25', 'HSC1_26', 'HSC1_27', 'HSC1_28', 'HSC1_29', 'HSC1_30', 'HSC1_31', 'HSC1_32', 'HSC1_33', 'HSC1_34', 'HSC1_35', 'HSC1_36', 'HSC1_37', 'HSC1_38', 'HSC1_39', 'HSC1_40', 'HSC1_41', 'HSC1_42', 'HSC1_43', 'HSC1_44', 'HSC1_45', 'HSC1_46', 'HSC1_47', 'HSC1_48', 'HSC1_49', 'HSC1_50', 'HSC1_51', 'HSC1_52', 'HSC1_53', 'HSC1_54', 'HSC1_55', 'HSC1_56', 'HSC1_57', 'HSC1_58', 'HSC1_59', 'HSC1_60', 'HSC1_61', 'HSC1_62', 'HSC1_63', 'HSC1_64', 'HSC1_65', 'HSC1_66', 'HSC1_67', 'HSC1_68', 'HSC1_69', 'HSC1_70', 'HSC1_71', 'HSC1_72', 'HSC1_73', 'HSC1_74', 'HSC1_75', 'HSC1_76', 'HSC1_77', 'HSC1_78', 'HSC1_79', 'HSC1_80', 'HSC1_81', 'HSC1_82', 'HSC1_83', 'HSC1_84', 'HSC1_85', 'HSC1_86', 'HSC1_87', 'HSC1_88', 'HSC1_89', 'HSC1_90', 'HSC1_91', 'HSC1_92', 'HSC1_93', 'HSC1_94', 'HSC1_95', 'HSC1_96')
sampleTable <- data.frame(sampleName=names, fileName=files)
extDataDir<-"/beegfs/group_bit/data/projects/departments/Bioinformatics/bit_single_cell_tutorial/hisat_output/"
bamFiles <- file.path( extDataDir, sampleTable$fileName )

GTF<-"/beegfs/common/genomes/mus_musculus/91/original.gtf"
mmus <- makeTxDbFromGFF(GTF, format="gtf")
exonsByGene <- exonsBy( mmus, by="gene" )

se <- summarizeOverlaps( exonsByGene, BamFileList( bamFiles ), mode="Union", singleEnd=FALSE, ignore.strand=FALSE)

colnames(se) <- sampleTable$sampleName 
counts <- assay(se)

Download data

In [3]:
%%bash
cd /beegfs/group_bit/data/projects/departments/Bioinformatics/bit_single_cell_tutorial/raw_data/
wget https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE61533&format=file&file=GSE61533%5FHTSEQ%5Fcount%5Fresults%2Exls%2Egz
unpigz GSE61533_HTSEQ_count_results.xls.gz
wget https://github.com/MarioniLab/scran/raw/master/inst/exdata/mouse_cycle_markers.rds
wget https://storage.googleapis.com/linnarsson-lab-www-blobs/blobs/cortex/expression_mRNA_17-Aug-2014.txt
wget https://storage.googleapis.com/linnarsson-lab-www-blobs/blobs/cortex/expression_mito_17-Aug-2014.txt
wget https://storage.googleapis.com/linnarsson-lab-www-blobs/blobs/cortex/expression_spikes_17-Aug-2014.txt
wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE29nnn/GSE29087/suppl/GSE29087_L139_expression_tab.txt.gz
wget https://media.nature.com/original/nature-assets/nbt/journal/v33/n2/extref/nbt.3102-S7.xlsx

Load required packages

In [4]:
library(readxl)
library(SingleCellExperiment)
library(org.Mm.eg.db)
library(scater)
library(scran)
library(pheatmap)
library(limma)
library(edgeR)
library(dynamicTreeCut)
library(cluster)
Loading required package: SummarizedExperiment
Loading required package: GenomicRanges
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, cbind, colMeans, colnames,
    colSums, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
    grepl, intersect, is.unsorted, lapply, lengths, Map, mapply, match,
    mget, order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
    rbind, Reduce, rowMeans, rownames, rowSums, sapply, setdiff, sort,
    table, tapply, union, unique, unsplit, which, which.max, which.min

Loading required package: S4Vectors

Attaching package: ‘S4Vectors’

The following object is masked from ‘package:base’:

    expand.grid

Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

Loading required package: DelayedArray
Loading required package: matrixStats

Attaching package: ‘matrixStats’

The following objects are masked from ‘package:Biobase’:

    anyMissing, rowMedians


Attaching package: ‘DelayedArray’

The following objects are masked from ‘package:matrixStats’:

    colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges

The following object is masked from ‘package:base’:

    apply

Loading required package: AnnotationDbi

Loading required package: ggplot2

Attaching package: ‘scater’

The following object is masked from ‘package:S4Vectors’:

    rename

The following object is masked from ‘package:stats’:

    filter

Loading required package: BiocParallel

Attaching package: ‘limma’

The following object is masked from ‘package:scater’:

    plotMDS

The following object is masked from ‘package:BiocGenerics’:

    plotMA


Attaching package: ‘edgeR’

The following object is masked from ‘package:SingleCellExperiment’:

    cpm

2 Analysis of haematopoietic stem cells


2.1 Overview

To introduce most of the concepts of scRNA-seq data analysis, we use a relatively simple dataset from a study of haematopoietic stem cells (HSCs) (N. K. Wilson et al. 2015). Single mouse HSCs were isolated into microtiter plates and libraries were prepared for 96 cells using the Smart-seq2 protocol. A constant amount of spike-in RNA from the External RNA Controls Consortium (ERCC) was also added to each cell’s lysate prior to library preparation. High-throughput sequencing was performed and the expression of each gene was quantified by counting the total number of reads mapped to its exonic regions. Similarly, the quantity of each spike-in transcript was measured by counting the number of reads mapped to the spike-in reference sequences. Counts for all genes/transcripts in each cell were obtained from the NCBI Gene Expression Omnibus (GEO) as a supplementary file under the accession number GSE61533 (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE61533).

For simplicity, we forego a description of the read processing steps required to generate the count matrix, i.e., read alignment and counting into features. These steps have been described in some detail elsewhere (Love et al. 2015; Y. Chen, Lun, and Smyth 2016), and are largely the same for bulk and single-cell data. The only additional consideration is that the spike-in information must be included in the pipeline. Typically, spike-in sequences can be included as additional FASTA files during genome index building prior to alignment, while genomic intervals for both spike-in transcripts and endogenous genes can be concatenated into a single GTF file prior to counting. For users favouring an R-based approach to read alignment and counting, we suggest using the methods in the Rsubread package (Liao, Smyth, and Shi 2013; Liao, Smyth, and Shi 2014). Alternatively, rapid quantification of expression with alignment-free methods such as kallisto (Bray et al. 2016) or Salmon (Patro, Duggal, and Kingsford 2015) can be performed using the functions runKallisto and runSalmon in the scater package.

2.2 Count loading

The first task is to load the count matrix into memory. In this case, some work is required to retrieve the data from the Gzip-compressed Excel format. Each row of the matrix represents an endogenous gene or a spike-in transcript, and each column represents a single HSC.

In [5]:
countsfile="/beegfs/group_bit/data/projects/departments/Bioinformatics/bit_single_cell_tutorial/raw_data/GSE61533_HTSEQ_count_results.xls"
all.counts <- as.data.frame(read_excel(countsfile, sheet=1))
rownames(all.counts) <- all.counts$ID
all.counts <- as.matrix(all.counts[,-1])

For convenience, the counts for spike-in transcripts and endogenous genes are stored in a SingleCellExperiment object from the SingleCellExperiment package.

In [6]:
sce <- SingleCellExperiment(list(counts=all.counts))
dim(sce)
  1. 38498
  2. 96

We identify the rows corresponding to ERCC spike-in transcripts from the row names. We store this information in the SingleCellExperiment object for future use. This is necessary as spike-ins require special treatment in some downstream steps such as normalization.

In [7]:
is.spike <- grepl("^ERCC", rownames(sce))
isSpike(sce, "ERCC") <- is.spike
summary(is.spike)
   Mode   FALSE    TRUE 
logical   38406      92 

We also identify the rows corresponding to mitochondrial genes, which is useful for quality control. In this case, the information can be obtained from the row names, More generally, though, identifying mitochondrial genes from standard identifiers like Ensembl requires extra annotation (this will be discussed later in more detail).

In [8]:
is.mito <- grepl("^mt-", rownames(sce))
summary(is.mito)
   Mode   FALSE    TRUE 
logical   38461      37 

2.3 Quality control on the cells

2.3.1 Defining the quality control metrics

Low-quality cells need to be removed to ensure that technical effects do not distort downstream analysis results. We use several quality control (QC) metrics:

The library size is defined as the total sum of counts across all features, i.e., genes and spike-in transcripts. Cells with small library sizes are of low quality as the RNA has not been efficiently captured (i.e., converted into cDNA and amplified) during library preparation.

The number of expressed features in each cell is defined as the number of features with non-zero counts for that cell. Any cell with very few expressed genes is likely to be of poor quality as the diverse transcript population has not been successfully captured.

The proportion of reads mapped to spike-in transcripts is calculated relative to the library size for each cell. High proportions are indicative of poor-quality cells, where endogenous RNA has been lost during processing (e.g., due to cell lysis or RNA degradation). The same amount of spike-in RNA to each cell, so an enrichment in spike-in counts is symptomatic of loss of endogenous RNA.

In the absence of spike-in transcripts, the proportion of reads mapped to genes in the mitochondrial genome can also be used. High proportions are indicative of poor-quality cells (Islam et al. 2014; Ilicic et al. 2016), possibly because of loss of cytoplasmic RNA from perforated cells. The reasoning is that mitochondria are larger than individual transcript molecules and less likely to escape through tears in the cell membrane.

For each cell, we calculate these quality control metrics using the calculateQCMetrics function from the scater package (McCarthy et al. 2016). These are stored in the row- and column-wise metadata of the SingleCellExperiment for future reference.

In [9]:
sce <- calculateQCMetrics(sce, feature_controls=list(ERCC=is.spike, Mt=is.mito))
head(colnames(colData(sce)))
  1. 'total_features'
  2. 'log10_total_features'
  3. 'total_counts'
  4. 'log10_total_counts'
  5. 'pct_counts_top_50_features'
  6. 'pct_counts_top_100_features'

The distributions of these metrics are shown in Figure 1. The aim is to remove putative low-quality cells that have low library sizes, low numbers of expressed features, and high spike-in (or mitochondrial) proportions.

In [10]:
par(mfrow=c(2,2), mar=c(5.1, 4.1, 0.1, 0.1))
hist(sce$total_counts/1e6, xlab="Library sizes (millions)", main="", 
    breaks=20, col="grey80", ylab="Number of cells")
hist(sce$total_features, xlab="Number of expressed genes", main="", 
    breaks=20, col="grey80", ylab="Number of cells")
hist(sce$pct_counts_ERCC, xlab="ERCC proportion (%)", 
    ylab="Number of cells", breaks=20, main="", col="grey80")
hist(sce$pct_counts_Mt, xlab="Mitochondrial proportion (%)", 
    ylab="Number of cells", breaks=20, main="", col="grey80")

Figure 1: Histograms of various QC metrics for all cells in the HSC data set. This includes the library sizes, number of expressed genes, and proportion of reads mapped to spike-in transcripts or mitochondrial genes.

2.3.2 Removing low-quality cells based on outliers

Picking a threshold for these metrics is not straightforward as their absolute values depend on the experimental protocol. For example, sequencing to greater depth will lead to more reads and more expressed features, regardless of the quality of the cells. Similarly, using more spike-in RNA in the protocol will result in higher spike-in proportions. To obtain an adaptive threshold, we assume that most of the dataset consists of high-quality cells, and identify cells that are outliers for the various QC metrics.

Outliers are defined based on the median absolute deviation (MADs) from the median value of each metric across all cells. We remove cells with log-library sizes that are more than 3 MADs below the median log-library size. A log-transformation improves resolution at small values, especially when the MAD of the raw values is comparable to or greater than the median. We also remove cells where the log-transformed number of expressed genes is 3 MADs below the median value.

In [11]:
libsize.drop <- isOutlier(sce$total_counts, nmads=3, type="lower", log=TRUE)
feature.drop <- isOutlier(sce$total_features, nmads=3, type="lower", log=TRUE)

We identify outliers for the proportion-based metrics in a similar manner. Here, no transformation is required as we are identifying large outliers, for which the distinction should be fairly clear on the raw scale. We do not use the mitochondrial proportions as we already have the spike-in proportions for this data set.

In [12]:
spike.drop <- isOutlier(sce$pct_counts_ERCC, nmads=3, type="higher")

Subsetting by column will retain only the high-quality cells that pass each filter described above. We examine the number of cells removed by each filter as well as the total number of retained cells. Removal of a substantial proportion of cells (> 10%) may be indicative of an overall issue with data quality.

In [13]:
sce <- sce[,!(libsize.drop | feature.drop | spike.drop)]
data.frame(ByLibSize=sum(libsize.drop), ByFeature=sum(feature.drop),
    BySpike=sum(spike.drop), Remaining=ncol(sce))
ByLibSizeByFeatureBySpikeRemaining
2 2 3 92

2.3.3 Assumptions of outlier identification

We have already mentioned the assumption that most cells are of high quality. This is usually reasonable, and can be experimentally supported in some situations by visually checking that the cells are intact (e.g., on the microwell plate or in the microfluidics system). Another assumption is that the QC metrics are independent on the biological state of each cell. This ensures that the outlier values for these metrics are driven by technical factors rather than biological processes. Thus, removing cells based on the metrics will not misrepresent the biology in downstream analyses.

The second assumption is most likely to be violated in highly heterogeneous cell populations. For example, some cell types may naturally have less RNA or express fewer genes than other cell types. Such cell types are more likely to be considered outliers and removed, even if they are of high quality. The use of the MAD mitigates this problem by accounting for biological variability in the QC metrics. A heterogeneous population should have higher variability in the metrics amongst high-quality cells, increasing the MAD and reducing the chance of incorrectly removing particular cell types. Nonetheless, our cell-filtering procedure may not be appropriate in extreme cases where one cell type is very different from the others.

Systematic differences in the QC metrics can be handled to some extent using the batch argument in the isOutlier function. This is obviously useful for batch effects caused by known differences in experimental processing, e.g., sequencing at different depth or had different amounts of spike-in added. It may also be useful if an a priori cell type has systematically fewer expressed genes or lower RNA content. Analyzing all cell types together would unnecessarily inflate the MAD and compromise the removal of low-quality cells, at best; or lead to the entire loss of one cell type, at worst.

2.3.4 Alternative approaches to quality control

An alternative approach to quality control is to set pre-defined thresholds on each QC metric. For example, we might remove all cells with library sizes below 100000 and numbers of expressed genes below 4000. This generally requires a great deal of experience to determine appropriate thresholds for each experimental protocol and biological system. Indeed, even with the same protocol and system, the appropriate threshold can vary from run to run due to the vagaries of RNA capture and sequencing.

Another strategy is to perform a principal components analysis (PCA) based on the quality metrics for each cell, e.g., the total number of reads, the total number of features and the proportion of mitochondrial or spike-in reads. Outliers on a PCA plot may be indicative of low-quality cells that have aberrant technical properties compared to the (presumed) majority of high-quality cells. In Figure 2, no obvious outliers are present, which is consistent with the removal of suspect cells in the preceding quality control steps.

In [14]:
fontsize <- theme(axis.text=element_text(size=12), axis.title=element_text(size=16))
plotPCA(sce, pca_data_input="pdata") + fontsize
The following selected_variables were not found in colData(object): pct_counts_feature_controlsThe following selected_variables were not found in colData(object): total_features_feature_controlsThe following selected_variables were not found in colData(object): log10_total_counts_feature_controls
Other variables from colData(object) can be used by specifying a vector of variable names as the selected_variables argument.
PCA is being conducted using the following variables:pct_counts_top_100_featurestotal_featureslog10_total_counts_endogenous

Figure 2: PCA plot for cells in the HSC dataset, constructed using quality metrics. The first and second components are shown on each axis, along with the percentage of total variance explained by each component. Bars represent the coordinates of the cells on each axis.

Methods like PCA-based outlier detection and support vector machines can provide more power to distinguish low-quality cells from high-quality counterparts (Ilicic et al. 2016). This is because they are able to detect subtle patterns across many quality metrics simultaneously. However, this comes at some cost to interpretability, as the reason for removing a given cell may not always be obvious. Thus, for this workflow, we will use the simple approach whereby each quality metric is considered separately. Users interested in the more sophisticated approaches are referred to the scater and cellity packages.

For completeness, we note that outliers can also be identified based on the gene expression profiles, rather than QC metrics. However, we consider this to be a risky strategy as it can remove high-quality cells in rare populations.

2.4 Classification of cell cycle phase

We use the prediction method described by Scialdone et al. (2015) to classify cells into cell cycle phases based on the gene expression data. Using a training dataset, the sign of the difference in expression between two genes was computed for each pair of genes. Pairs with changes in the sign across cell cycle phases were chosen as markers. Cells in a test dataset can then be classified into the appropriate phase, based on whether the observed sign for each marker pair is consistent with one phase or another. This approach is implemented in the cyclone function using a pre-trained set of marker pairs for mouse data. (Some additional work is necessary to match the gene symbols in the data to the Ensembl annotation in the pre-trained marker set.)

In [15]:
set.seed(100)
ff="/beegfs/group_bit/data/projects/departments/Bioinformatics/bit_single_cell_tutorial/raw_data/mouse_cycle_markers.rds"
mm.pairs <- readRDS(ff)
ensembl <- mapIds(org.Mm.eg.db, keys=rownames(sce), keytype="SYMBOL", column="ENSEMBL")
assignments <- cyclone(sce, mm.pairs, gene.names=ensembl)
'select()' returned 1:many mapping between keys and columns

The cyclone result for each cell in the HSC dataset is shown in Figure 3. Each cell is assigned a score for each phase, with a higher score corresponding to a higher probability that the cell is in that phase. We focus on the G1 and G2/M scores as these are the most informative for classification.

In [16]:
plot(assignments$score$G1, assignments$score$G2M, 
     xlab="G1 score", ylab="G2/M score", pch=16)

Figure 3: Cell cycle phase scores from applying the pair-based classifier on the HSC dataset, where each point represents a cell.

Cells are classified as being in G1 phase if the G1 score is above 0.5 and greater than the G2/M score; in G2/M phase if the G2/M score is above 0.5 and greater than the G1 score; and in S phase if neither score is above 0.5. Here, the vast majority of cells are classified as being in G1 phase. We save these assignments into the SingleCellExperiment object for later use.

In [17]:
sce$phases <- assignments$phases
table(sce$phases)
 G1 G2M   S 
 89   2   1 

Pre-trained classifiers are available in scran for human and mouse data. While the mouse classifier used here was trained on data from embryonic stem cells, it is still accurate for other cell types (Scialdone et al. 2015). This may be due to the conservation of the transcriptional program associated with the cell cycle (Bertoli, Skotheim, and Bruin 2013; Conboy et al. 2007). The pair-based method is also a non-parametric procedure that is robust to most technical differences between datasets.

Comments from Aaron:

  • To remove confounding effects due to cell cycle phase, we can filter the cells to only retain those in a particular phase (usually G1) for downstream analysis. Alternatively, if a non-negligible number of cells are in other phases, we can use the assigned phase as a blocking factor. This protects against cell cycle effects without discarding information, and will be discussed later in more detail.
  • The classifier may not be accurate for data that are substantially different from those used in the training set, e.g., due to the use of a different protocol. In such cases, users can construct a custom classifier from their own training data using the sandbag function. This will also be necessary for other model organisms where pre-trained classifiers are not available.
  • Do not filter out low-abundance genes before applying cyclone. Even if a gene is not expressed in any cell, it may still be useful for classification if it is phase-specific. Its lack of expression relative to other genes will still yield informative pairs, and filtering them out would reduce power.

2.5 Examining gene-level expression metrics

2.5.1 Inspecting the most highly expressed genes

We also look at the identities of the most highly expressed genes (Figure 4). This should generally be dominated by constitutively expressed transcripts, such as those for ribosomal or mitochondrial proteins. The presence of other classes of features may be cause for concern if they are not consistent with expected biology. For example, a top set containing many spike-in transcripts suggests that too much spike-in RNA was added during library preparation, while the absence of ribosomal proteins and/or the presence of their pseudogenes are indicative of suboptimal alignment.

In [18]:
plotQC(sce, type = "highest-expression", n=50) + fontsize

Figure 4: Percentage of total counts assigned to the top 50 most highly-abundant features in the HSC dataset. For each feature, each bar represents the percentage assigned to that feature for a single cell, while the circle represents the average across all cells. Bars are coloured by the total number of expressed features in each cell, while circles are coloured according to whether the feature is labelled as a control feature.

2.5.2 Filtering out low-abundance genes

Low-abundance genes are problematic as zero or near-zero counts do not contain much information for reliable statistical inference (Bourgon, Gentleman, and Huber 2010). These genes typically do not provide enough evidence to reject the null hypothesis during testing, yet they still increase the severity of the multiple testing correction. In addition, the discreteness of the counts may interfere with statistical procedures, e.g., by compromising the accuracy of continuous approximations. Thus, low-abundance genes are often removed in many RNA-seq analysis pipelines before the application of downstream methods.

The “optimal” choice of filtering strategy depends on the downstream application. A more aggressive filter is usually required to remove discreteness (e.g., for normalization) compared to that required for removing underpowered tests. For hypothesis testing, the filter statistic should also be independent of the test statistic under the null hypothesis. Thus, we (or the relevant function) will filter at each step as needed, rather than applying a single filter for the entire analysis.

Several metrics can be used to define low-abundance genes. The most obvious is the average count for each gene, computed across all cells in the data set. We calculate this using the calcAverage function, which also performs some adjustment for library size differences between cells We typically observe a peak of moderately expressed genes following a plateau of lowly expressed genes (Figure 5).

In [19]:
ave.counts <- calcAverage(sce)
hist(log10(ave.counts), breaks=100, main="", col="grey80", 
    xlab=expression(Log[10]~"average count"))

Figure 5: Histogram of log-average counts for all genes in the HSC dataset.

A minimum threshold can be applied to this value to filter out genes that are lowly expressed. The example below demonstrates how we could remove genes with average counts less than 1. The number of TRUE values in demo.keep corresponds to the number of retained rows/genes after filtering.

In [20]:
demo.keep <- ave.counts >= 1
filtered.sce <- sce[demo.keep,]
summary(demo.keep)
   Mode   FALSE    TRUE 
logical   24439   14059 

We also examine the number of cells that express each gene. This is closely related to the average count for most genes, as expression in many cells will result in a higher average (Figure 6). Genes expressed in very few cells are often uninteresting as they are driven by amplification artifacts (though they may also also arise from rare populations). We could then remove genes that are expressed in fewer than n cells.

In [21]:
num.cells <- nexprs(sce, byrow=TRUE)
smoothScatter(log10(ave.counts), num.cells, ylab="Number of cells", 
    xlab=expression(Log[10]~"average count"))

Figure 6: The number of cells expressing each gene in the HSC data set, plotted against the log-average count. Intensity of colour corresponds to the number of genes at any given location.

As mentioned above, these filters will be applied at each step (automatically, in most cases, within the relevant function). This ensures that the most appropriate filter is used in each application. Nonetheless, we remove genes that are not expressed in any cell to reduce computational work in downstream steps. Such genes provide no information and would be removed by any filtering strategy.

In [22]:
to.keep <- num.cells > 0
sce <- sce[to.keep,]
summary(to.keep)
   Mode   FALSE    TRUE 
logical   17229   21269 

2.6 Normalization of cell-specific biases

2.6.1 Using the deconvolution method to deal with zero counts

Read counts are subject to differences in capture efficiency and sequencing depth between cells (Stegle, Teichmann, and Marioni 2015). Normalization is required to eliminate these cell-specific biases prior to downstream quantitative analyses. This is often done by assuming that most genes are not differentially expressed (DE) between cells. Any systematic difference in count size across the non-DE majority of genes between two cells is assumed to represent bias and is removed by scaling. More specifically, “size factors” are calculated that represent the extent to which counts should be scaled in each library.

Size factors can be computed with several different approaches, e.g., using the estimateSizeFactorsFromMatrix function in the DESeq2 package (Anders and Huber 2010; Love, Huber, and Anders 2014), or with the calcNormFactors function (Robinson and Oshlack 2010) in the edgeR package. However, single-cell data can be problematic for these bulk data-based methods due to the dominance of low and zero counts. To overcome this, we pool counts from many cells to increase the count size for accurate size factor estimation (Lun, Bach, and Marioni 2016). Pool-based size factors are then “deconvolved” into cell-based factors for cell-specific normalization.

In [23]:
sce <- computeSumFactors(sce)
summary(sizeFactors(sce))
Warning message in .computeSumFactors(assay(x, i = assay.type), subset.row = subset.row, :
“not enough cells in at least one cluster for some 'sizes'”
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.4076  0.8101  0.9546  1.0000  1.1585  2.0264 

In this case, the size factors are tightly correlated with the library sizes for all cells (Figure 7). This suggests that the systematic differences between cells are primarily driven by differences in capture efficiency or sequencing depth. Any DE between cells would yield a non-linear trend between the total count and size factor, and/or increased scatter around the trend. This does not occur here as strong DE is unlikely to exist within a homogeneous population of cells.

In [24]:
plot(sizeFactors(sce), sce$total_counts/1e6, log="xy",
    ylab="Library size (millions)", xlab="Size factor")

Figure 7: Size factors from deconvolution, plotted against library sizes for all cells in the HSC dataset. Axes are shown on a log-scale.

Comments from Aaron:

  • While the deconvolution approach is robust to the high frequency of zeroes in scRNA-seq data, it will eventually fail if too many counts are zero. This manifests as negative size factors, which are obviously nonsensical. To avoid this, the computeSumFactors function will automatically remove low-abundance genes prior to the calculation of size factors. Genes with an average count below a specified threshold (min.mean) are ignored. For read count data, the default value of 1 is usually satisfactory. For UMI data, counts are lower so a threshold of 0.1 is recommended.
  • Cell-based QC should always be performed prior to normalization, to remove cells with very low numbers of expressed genes. If this is not done, the computeSumFactors function may yield negative size factors for low-quality cells.
  • The sizes argument can be used to specify the number of pool sizes to use to compute the size factors. More sizes yields more precise estimates at the cost of some computational time and memory. In general, sizes should not be below 20 cells, to ensure that there are sufficient non-zero expression values in each pool. We also recommend that the total number of cells should be at least 100 for effective pooling.
  • For highly heterogeneous data sets, it is advisable to perform a rough clustering of the cells. This can be done with the quickCluster function and the results passed to computeSumFactors via the cluster argument. Cells in each cluster are normalized separately, and the size factors are rescaled to be comparable across clusters. This avoids the need to assume that most genes are non-DE across the entire population - only a non-DE majority is required between pairs of clusters. We demonstrate this approach later with a larger data set.

2.6.2 Computing separate size factors for spike-in transcripts

Size factors computed from the counts for endogenous genes are usually not appropriate for normalizing the counts for spike-in transcripts. Consider an experiment without library quantification, i.e., the amount of cDNA from each library is not equalized prior to pooling and multiplexed sequencing. Here, cells containing more RNA have greater counts for endogenous genes and thus larger size factors to scale down those counts. However, the same amount of spike-in RNA is added to each cell during library preparation. This means that the counts for spike-in transcripts are not subject to the effects of RNA content. Attempting to normalize the spike-in counts with the gene-based size factors will lead to over-normalization and incorrect quantification of expression. Similar reasoning applies in cases where library quantification is performed. For a constant total amount of cDNA, any increases in endogenous RNA content will suppress the coverage of spike-in transcripts. As a result, the bias in the spike-in counts will be opposite to that captured by the gene-based size factor.

To ensure normalization is performed correctly, we compute a separate set of size factors for the spike-in set. For each cell, the spike-in-specific size factor is defined as the total count across all transcripts in the spike-in set. This assumes that none of the spike-in transcripts are differentially expressed, which is reasonable given that the same amount and composition of spike-in RNA should have been added to each cell. (See below for a more detailed discussion on spike-in normalization.) These size factors are stored in a separate field of the SingleCellExperiment object by setting general.use=FALSE in computeSpikeFactors. This ensures that they will only be used with the spike-in transcripts but not the endogenous genes.

In [25]:
sce <- computeSpikeFactors(sce, type="ERCC", general.use=FALSE)

2.6.3 Applying the size factors to normalize gene expression

The count data are used to compute normalized log-expression values for use in downstream analyses. Each value is defined as the log2-ratio of each count to the size factor for the corresponding cell, after adding a prior count of 1 to avoid undefined values at zero counts. Division by the size factor ensures that any cell-specific biases are removed. If spike-in-specific size factors are present in sce, they will be automatically applied to normalize the spike-in transcripts separately from the endogenous genes.

In [26]:
sce <- normalize(sce)

The log-transformation is useful as it means that any differences in the values directly represent log2-fold changes in expression between cells. This is usually more relevant than the absolute differences in coverage, which need to be interpreted in the context of the overall abundance. The log-transformation also provides some measure of variance stabilization (Law et al. 2014), so that high-abundance genes with large variances do not dominate downstream analyses. The computed values are stored as an "logcounts" matrix in addition to the other assay elements.

2.7 Checking for confounding technical factors

We check whether there are technical factors that contribute substantially to the heterogeneity of gene expression. If so, the factor may need to be regressed out to ensure that it does not inflate the variances or introduce spurious correlations. For this dataset, the simple experimental design means that there are no plate or batch effects to examine. Instead, we use the (log-transformed) total count for the spike-in transcripts as a proxy for the technical bias in each sample. This is based on the fact that the same amount of spike-in RNA should have been added to each cell. Thus, any association of gene expression with this factor is not biologically interesting and should be removed.

For each gene, we calculate the percentage of the variance of the expression values that is explained by the spike-in totals (Figure 8). The percentages are generally small (1-3%), indicating that the expression profiles of most genes are not strongly associated with this factor. This result is consistent with successful removal of cell-specific biases by scaling normalization. Thus, the spike-in total does not need to be explicitly modelled in our downstream analyses.

In [27]:
plotExplanatoryVariables(sce, variables=c("total_counts_ERCC", 
    "log10_total_counts_ERCC")) + fontsize

Figure 8: Density plot of the percentage of variance explained by the (log-transformed) total spike-in counts across all genes in the HSC dataset. For each gene, the percentage of the variance of the normalized log-expression values across cells that is explained by each factor is calculated. Each curve corresponds to one factor and represents the distribution of percentages across all genes.

Note that the use of the spike-in total as an accurate proxy for the relative technical bias assumes that no library quantification was performed. Otherwise, the coverage of the spike-in transcripts would be dependent on the total amount of endogenous RNA in each cell. (Specifically, if the same amount of cDNA is used for sequencing per cell, any increase in the amount of endogenous RNA will suppress the coverage of the spike-in transcripts.) This means that the spike-in totals could be confounded with genuine biological effects associated with changes in RNA content.

2.8 Modelling the technical noise in gene expression

2.8.1 Fitting a trend to the spike-in variances

Variability in the observed expression values across genes can be driven by genuine biological heterogeneity or uninteresting technical noise. To distinguish between these two possibiltiies, we need to model the technical component of the variance of the expression values for each gene. We use the trendVar function to fit a mean-dependent trend to the variances of the log-expression values for the spike-in transcripts, Recall that the same set of spike-ins was added in the same quantity to each cell. This means that the spike-in transcripts should exhibit no biological variability, i.e., any variance in their counts should be technical in origin.

In [28]:
var.fit <- trendVar(sce, parametric=TRUE, span=0.2)

Given the mean abundance of a gene, the fitted value of the trend can be used as an estimate of the technical component for that gene. The biological component of the variance can then be calculated by subtracting the technical component from the total variance of each gene with the decomposeVar function.

In [29]:
var.out <- decomposeVar(sce, var.fit)
head(var.out)
meantotalbiotechp.valueFDR
0610005C13Rik0.41698965 1.9224938 -0.43042666 2.352920 0.8976314 1
0610007P14Rik6.07800839 8.3082437 -1.24081999 9.549064 0.8072030 1
0610008F07Rik0.06343649 0.3702253 0.01370032 0.356525 0.3800895 1
0610009B22Rik3.93889838 11.2012213 -0.5810259311.782247 0.6135645 1
0610009D07Rik8.03594711 1.3638512 -0.63517861 1.999030 0.9911969 1
0610009E02Rik0.27789890 1.0950584 -0.48206851 1.577127 0.9883633 1

We visually inspect the trend to confirm that it corresponds to the spike-in variances (Figure 9)). The wave-like shape is typical of the mean-variance trend for log-expression values. A linear increase in the variance is observed as the mean increases from zero, as larger variances are possible when the counts increase. At very high abundances, the effect of sampling noise decreases due to the law of large numbers, resulting in a decrease in the variance.

In [30]:
plot(var.out$mean, var.out$total, pch=16, cex=0.6, xlab="Mean log-expression", 
    ylab="Variance of log-expression")
curve(var.fit$trend(x), col="dodgerblue", lwd=2, add=TRUE)
cur.spike <- isSpike(sce)
points(var.out$mean[cur.spike], var.out$total[cur.spike], col="red", pch=16)

Figure 9: Variance of normalized log-expression values for each gene in the HSC dataset, plotted against the mean log-expression. The blue line represents the mean-dependent trend fitted to the variances of the spike-in transcripts (red).

We check the distribution of expression values for the genes with the largest biological components. This ensures that the variance estimate is not driven by one or two outlier cells (Figure 10).

In [31]:
chosen.genes <- order(var.out$bio, decreasing=TRUE)[1:10]
plotExpression(sce, features=rownames(var.out)[chosen.genes]) + fontsize

Figure 10: Violin plots of normalized log-expression values for the top 10 genes with the largest biological components in the HSC dataset. Each point represents the log-expression value in a single cell.

Comments from Aaron:

  • In practice, trend fitting is complicated by the small number of spike-in transcripts and the uneven distribution of their abundances. For low numbers of cells, these issues are exacerbated by the low precision of the variance estimates. Some tuning of trend parameters such as span may be required to achieve a suitable fit - see ?trendVar for more details. Setting parametric=TRUE is especially useful for modelling the expected wave-like shape of the mean-variance relationship. (This is not the default setting as it is not robust for arbitrary trend shapes.)
  • The trendVar function will automatically filter out low-abundance genes prior to trend fitting. This ensures that low-abundance genes do not interfere with the fit - either due to discreteness, which biases the estimate of variability of the variances around the trend; or due to the frequency of low-abundance genes, which reduces the sensitivity of span-based smoothing algorithms at higher abundances.
    • Filtering uses the average of log-expression values rather than the (library size-adjusted) average count. The mean log-expression is independent of the variance estimate in a linear modelling framework (Bourgon, Gentleman, and Huber 2010), which ensures that the filter does not introduce spurious trends in the variances at the filter boundary.
    • The filter threshold is specified with the min.mean argument in trendVar. We use the default threshold of 0.1 (min.mean) based on the appearance of discrete patterns in the variance estimates for simulated Poisson-distributed counts. Lower thresholds of 0.001-0.01 may be more suitable for very sparse data, e.g., from droplet-based protocols.
    • The filter used in trendVar is not applied in decomposeVar by default. Retention of all genes ensures that weak biological signal from rare subpopulations is not discarded. To apply the filter in decomposeVar, users should set subset.row=rowMeans(logcounts(sce)) > 0.1 in the function call.
  • Negative biological components are often obtained from decomposeVar. These are intuitively meaningless as it is impossible for a gene to have total variance below technical noise. Nonetheless, such values occur due to imprecise estimation of the total variance, especially for low numbers of cells.
  • decomposeVar also yields p-values that can be used to define HVGs at a specific threshold for the false discovery rate (FDR). We will discuss this in more detail later, as formal detection of HVGs is not necessary for feature selection during data exploration.

2.8.2 Trend fitting when spike-ins are unavailable

If spike-in RNA has not been added in appropriate quantities, an alternative approach is to fit the trend to the variance estimates of the endogenous genes. This is done using the use.spikes=FALSE setting in trendVar, as shown below.

In [32]:
var.fit.nospike <- trendVar(sce, parametric=TRUE, use.spikes=FALSE, span=0.2)
var.out.nospike <- decomposeVar(sce, var.fit.nospike)

This assumes that the majority of genes are not variably expressed, such that the technical component dominates the total variance for most genes. In Figure 11, the trend passes through or close to most of the spike-in variances, indicating that our assumption is valid.

In [33]:
plot(var.out.nospike$mean, var.out.nospike$total, pch=16, cex=0.6, 
    xlab="Mean log-expression", ylab="Variance of log-expression")
curve(var.fit.nospike$trend(x), col="dodgerblue", lwd=2, add=TRUE)
points(var.out.nospike$mean[cur.spike], var.out.nospike$total[cur.spike], col="red", pch=16)

Figure 11: Variance of normalized log-expression values for each gene in the HSC dataset, plotted against the mean log-expression. The blue line represents the mean-dependent trend fitted to the variances of the endogenous genes (black), with spike-in transcripts shown in red.

If our assumption does not hold, the output of decomposeVar is more difficult to interpret. The function quantifies changes in variances for each gene over the majority of genes with the same abundance. One could assume that the variabilities of most genes are driven by constitutive “house-keeping” processes, which are generally uninteresting. Any gene with an increase in its variance is relatively highly variable and can be prioritized for further study.

2.9 Denoising expression values using PCA

Once the technical noise is modelled, we can use principal components analysis (PCA) to remove random technical noise. Consider that each cell represents a point in the high-dimensional expression space, where the spread of points represents the total variance. PCA identifies axes in this space that capture as much of this variance as possible. Each axis is a principal component (PC), where any early PC will explain more of the variance than a later PC. We assume that biological processes involving co-regulated groups of genes will account for the most variance in the data. If this is the case, this process should be represented by one or more of the earlier PCs. In contrast, random technical noise affects each gene independently and will be represented by later PCs. The denoisePCA function removes later PCs until the total discarded variance is equal to the sum of technical components for all genes used in the PCA.

In [34]:
sce <- denoisePCA(sce, technical=var.fit$trend) 
dim(reducedDim(sce, "PCA"))
  1. 92
  2. 10

The function returns a SingleCellExperiment object containing the PC scores for each cell in the reducedDims slot. The aim is to eliminate technical noise and enrich for biological signal in the retained PCs. This improves resolution of the underlying biology during downstream procedures such as clustering.

Comments from Aaron:

  • denoisePCA will internally filter to only genes that have positive biological components in denoisePCA. This guarantees that the total technical variance to be discarded will not be greater than the total variance in the data.
  • No filtering is performed on abundance here, which ensures that PCs corresponding to rare subpopulations can still be detected. Discreteness is less of an issue as low-abundance genes also have lower variance, thus reducing their contribution to the PCA.
  • It is also possible to obtain a low-rank approximation of the original expression matrix, capturing the variance equivalent to the retained PCs. This is useful for denoising prior to downstream procedures that require gene-wise expression values.
In [35]:
sce2 <- denoisePCA(sce, technical=var.fit$trend, value="lowrank") 
assayNames(sce2)
  1. 'counts'
  2. 'logcounts'
  3. 'lowrank'

2.10 Data exploration with dimensionality reduction

We visualize the relationships between cells by constructing pairwise PCA plots for the first three components (Figure 12). Cells with similar expression profiles should be located close together in the plot, while dissimilar cells should be far apart. In this case, no clear separation of cells into distinct subpopulations is observed. This is consistent with the presence of a highly homogeneous population of HSCs (N. K. Wilson et al. 2015).

In [36]:
plotReducedDim(sce, use_dimred="PCA", ncomponents=3, colour_by="total_features") + fontsize

Figure 12: Pairwise PCA plots of the first three PCs in the HSC data set, constructed from normalized log-expression values of genes with positive biological components. Each point represents a cell, coloured according to its total number of expressed features. Bars represent the coordinates of the cells on each axis.

Another widely used approach is the t-stochastic neighbour embedding (t-SNE) method (Van der Maaten and Hinton 2008). t-SNE tends to work better than PCA for separating cells in more diverse populations. This is because the former can directly capture non-linear relationships in high-dimensional space, whereas the latter must represent them on linear axes. However, this improvement comes at the cost of more computational effort and requires the user to consider parameters such as the random seed and perplexity (see comments). We demonstrate the generation of t-SNE plots in Figure 13, using the low-rank approximation of the data to take advantage of the denoising step. As with the PCA plots, no consistent substructure is observed.

In [37]:
out5 <- plotTSNE(sce, use_dimred="PCA", perplexity=5, colour_by="total_features", 
    rand_seed=100) + fontsize + ggtitle("Perplexity = 5")
out10 <- plotTSNE(sce, use_dimred="PCA", perplexity=10, colour_by="total_features",
    rand_seed=100) + fontsize + ggtitle("Perplexity = 10")
out20 <- plotTSNE(sce, use_dimred="PCA", perplexity=20, colour_by="total_features",
    rand_seed=100) + fontsize + ggtitle("Perplexity = 20")
out5
out10
out20
#multiplot(out5, out10, out20, cols=3)

Figure 13: t-SNE plots constructed from the denoised PCs in the HSC data set, using a range of perplexity values. Each point represents a cell, coloured according to its total number of expressed features. Bars represent the coordinates of the cells on each axis.

There are many other dimensionality reduction techniques that we do not consider here but could also be used, e.g., multidimensional scaling, diffusion maps. These have their own advantages and disadvantages – for example, diffusion maps (see plotDiffusionMap) place cells along a continuous trajectory and are suited for visualizing graduated processes like differentiation (Angerer et al. 2016).

Comments from Aaron:

  • For each visualization method, additional cell-specific information can be incorporated into the colour, size or shape of each point. Here, cells are coloured by the total number of expressed features, which is strongly correlated with the first PC. We will discuss this in more detail in the next section.
  • For PCA, more components can be shown but these are usually less informative (and more difficult to interpret) as they explain less of the variance.
  • t-SNE is a stochastic method, so users should run the algorithm several times to ensure that the results are representative. Scripts should set a seed (via the rand_seed argument) to ensure that the chosen results are reproducible. It is also advisable to test different settings of the “perplexity” parameter as this will affect the distribution of points in the low-dimensional space. A good guide on how to interpret t-SNE plots can be found at http://distill.pub/2016/misread-tsne/.
  • The selectorPlot function from scran can also be used to interactively select groups of cells in two-dimensional space. This facilitates data exploration as visually identified subpopulations can be directly selected for further examination. The exploreData function can also be used to interactively visualize gene expression, dimensionality reduction results and other covariates simultaneously.

2.11 Interpreting heterogeneity across a continuum

Putative subpopulations of cells can be computationally defined by unsupervised classification, i.e., clustering. We do not attempt this here due to the absence of distinct subpopulations in the plots above. However, there is a clear visual correlation between the number of expressed features and the first PC in Figure 12. We can study this in more detail by detecting genes that are differentially expressed along the first PC. This is done by applying methods from the limma package to the log-expression values (Ritchie et al. 2015).

In [38]:
pc1 <- reducedDim(sce, "PCA")[,1]
design <- model.matrix(~pc1)
library(limma)
fit <- lmFit(logcounts(sce), design)
fit <- eBayes(fit, trend=TRUE, robust=TRUE)
topTable(fit)
Removing intercept from test coefficients
logFCAveExprtP.Valueadj.P.ValB
Itm2b-0.00870041711.8221808 -8.815075 5.838858e-141.241867e-0920.67640
Pdzk1ip1-0.016530223 9.5875906 -7.938460 4.173793e-124.438621e-0816.42923
Sapcd2 0.020087254 0.3475062 7.533954 2.989518e-112.119468e-0714.47635
Ran 0.012404428 9.5856439 6.971714 4.203583e-102.235150e-0611.85585
Rcc1 0.045340434 2.3498833 6.816527 8.835801e-103.512452e-0611.12254
Eif5a 0.00904188910.8472265 6.769698 1.080516e-093.512452e-0610.92226
Cks1b 0.047745432 3.8117201 6.705714 1.478134e-093.512452e-0610.61398
Asf1b 0.025461656 0.5682945 6.687560 1.607737e-093.512452e-0610.53094
Gtse1 0.020142152 0.4539465 6.660821 1.819362e-093.512452e-0610.40879
Rad54l 0.028353685 0.9622899 6.654762 1.871024e-093.512452e-0610.38113

Figure @ref{fig:heatmaphsc} indicates that the majority of genes increase in expression along the first PC. Some of these genes are involved in DNA replication and the cell cycle, e.g., Chek1. However, there are also a few genes that decrease in expression, e.g., Lst1. This suggests that the first PC corresponds to genuine biology (namely the cell cycle), not just a change in total RNA content that non-specifically increases the expression of all transcripts.

In [39]:
de.genes <- rownames(topTable(fit, coef=2, n=50))
heat.vals <- logcounts(sce)[de.genes,]
heat.vals <- heat.vals - rowMeans(heat.vals)
heat.vals[heat.vals > 2] <- 2
heat.vals[heat.vals < -2] <- -2
pheatmap(heat.vals[,order(pc1)], cluster_cols=FALSE)

Figure 14: Heatmap of the top 50 DE genes along the first PC in the HSC data set. The colour for each cell (column) represents the log-fold change from the average log-expression for each gene (row), bounded to [-2, 2] for visualization purposes. Cells are ordered by their location on the first PC.

Comments from Aaron:

  • Users are discouraged from treating the adjusted p-values from this analysis with any particular reverence. This is because the DE testing is performed on the same data used to define the first PC. PCA empirically captures the maximum variance in the data, so there will always be some genes with low p-values, even if the variance is purely driven by random noise.
  • The procedure above can be used to test for changes along any covariate, e.g., pseudotime. Note that specifying ~pc1 only models linear changes across the covariate. Other trends can be more flexibly modelled using splines, i.e., ~splines::ns(pc1, df=5), though the spline coefficients are not easy to interpret.
  • Technically, we could achieve more power by filtering out low-abundance genes to reduce the severity of the multiple testing correction. This would be performed on the mean of the log-expression values (i.e., rowMeans(logcounts(sce)), which is an independent filter statistic for linear models. However, this is probably unnecessary given that the interpretation of the adjusted p-values is already compromised.
  • It is not uncommon for cell cycle to be the major driver of - variability in cell populations that are very homogeneous. If necessary, the cyclone results can be used to filter out non-G1 cells prior to further analysis.

2.12 Additional comments

Once the basic analysis is completed, it is often useful to save the SingleCellExperiment object to file with the saveRDS function. The object can then be easily restored into new R sessions using the readRDS function. This allows further work to be conducted without having to repeat all of the processing steps described above.

In [40]:
saveRDS(file="/beegfs/group_bit/data/projects/departments/Bioinformatics/bit_single_cell_tutorial/hsc_data.rds", sce)

A variety of methods are available to perform more complex analyses on the processed expression data. For example, cells can be ordered in pseudotime (e.g., for progress along a differentiation pathway) with monocle (Trapnell et al. 2014) or TSCAN (Z. Ji and Ji 2016); cell-state hierarchies can be characterized with the sincell package (Julia, Telenti, and Rausell 2015); and oscillatory behaviour can be identified using Oscope (Leng et al. 2015). HVGs can be used in gene set enrichment analyses to identify biological pathways and processes with heterogeneous activity, using packages designed for bulk data like topGO or with dedicated single-cell methods like scde (J. Fan et al. 2016). Full descriptions of these analyses are outside the scope of this workflow, so interested users are advised to consult the relevant documentation.

3 Analysis of cell types in the brain


3.1 Overview

We proceed to a more heterogeneous dataset from a study of cell types in the mouse brain (Zeisel et al. 2015). This contains approximately 3000 cells of varying types such as oligodendrocytes, microglia and neurons. Individual cells were isolated using the Fluidigm C1 microfluidics system and library preparation was performed on each cell using a UMI-based protocol. After sequencing, expression was quantified by counting the number of UMIs mapped to each gene. Count data for all endogenous genes, mitochondrial genes and spike-in transcripts were obtained from http://linnarssonlab.org/cortex.

3.2 Count loading

The count data are distributed across several files, so some work is necessary to consolidate them into a single matrix. We define a simple utility function for loading data in from each file. (We stress that this function is only relevant to the current dataset, and should not be used for other datasets. This kind of effort is generally not required if all of the counts are in a single file and separated from the metadata.)

In [41]:
readFormat <- function(infile) { 
    # First column is empty.
    metadata <- read.delim(infile, stringsAsFactors=FALSE, header=FALSE, nrow=10)[,-1] 
    rownames(metadata) <- metadata[,1]
    metadata <- metadata[,-1]
    metadata <- as.data.frame(t(metadata))

    # First column after row names is some useless filler.
    counts <- read.delim(infile, stringsAsFactors=FALSE, 
        header=FALSE, row.names=1, skip=11)[,-1] 
    counts <- as.matrix(counts)
    return(list(metadata=metadata, counts=counts))
}

Using this function, we read in the counts for the endogenous genes, ERCC spike-ins and mitochondrial genes.

In [42]:
endo.data <- readFormat("/beegfs/group_bit/data/projects/departments/Bioinformatics/bit_single_cell_tutorial/raw_data/expression_mRNA_17-Aug-2014.txt")
spike.data <- readFormat("/beegfs/group_bit/data/projects/departments/Bioinformatics/bit_single_cell_tutorial/raw_data/expression_spikes_17-Aug-2014.txt")
mito.data <- readFormat("/beegfs/group_bit/data/projects/departments/Bioinformatics/bit_single_cell_tutorial/raw_data/expression_mito_17-Aug-2014.txt")

We also need to rearrange the columns for the mitochondrial data, as the order is not consistent with the other files.

In [43]:
m <- match(endo.data$metadata$cell_id, mito.data$metadata$cell_id)
mito.data$metadata <- mito.data$metadata[m,]
mito.data$counts <- mito.data$counts[,m]

In this particular data set, some genes are represented by multiple rows corresponding to alternative genomic locations. We sum the counts for all rows corresponding to a single gene for ease of interpretation.

In [44]:
raw.names <- sub("_loc[0-9]+$", "", rownames(endo.data$counts))
new.counts <- rowsum(endo.data$counts, group=raw.names, reorder=FALSE)
endo.data$counts <- new.counts

The counts are then combined into a single matrix for constructing a SingleCellExperiment object. For convenience, metadata for all cells are stored in the same object for later access.

In [45]:
all.counts <- rbind(endo.data$counts, mito.data$counts, spike.data$counts)
sce <- SingleCellExperiment(list(counts=all.counts), colData=endo.data$metadata)
dim(sce)
  1. 19896
  2. 3005

We also add annotation identifying rows that correspond to each class of features.

In [46]:
nrows <- c(nrow(endo.data$counts), nrow(mito.data$counts), nrow(spike.data$counts))
is.spike <- rep(c(FALSE, FALSE, TRUE), nrows)
is.mito <- rep(c(FALSE, TRUE, FALSE), nrows)
isSpike(sce, "Spike") <- is.spike
sce
class: SingleCellExperiment 
dim: 19896 3005 
metadata(0):
assays(1): counts
rownames(19896): Tspan12 Tshz1 ... ERCC-00170 ERCC-00171
rowData names(0):
colnames(3005): V3 V4 ... V3006 V3007
colData names(10): tissue group # ... level1class level2class
reducedDimNames(0):
spikeNames(1): Spike

3.3 Quality control on the cells

The original authors of the study have already removed low-quality cells prior to data publication. Nonetheless, we compute some quality control metrics to check whether the remaining cells are satisfactory.

In [47]:
sce <- calculateQCMetrics(sce, feature_controls=list(Spike=is.spike, Mt=is.mito))

We examine the distribution of library sizes and numbers of expressed genes across cells (Figure 15). In particular, the spike-in proportions here are more variable than in the HSC dataset. This may reflect a greater variability in the total amount of endogenous RNA per cell when many cell types are present.

In [48]:
par(mfrow=c(2,2), mar=c(5.1, 4.1, 0.1, 0.1))
hist(sce$total_counts/1e3, xlab="Library sizes (thousands)", main="", 
    breaks=20, col="grey80", ylab="Number of cells")
hist(sce$total_features, xlab="Number of expressed genes", main="", 
    breaks=20, col="grey80", ylab="Number of cells")
hist(sce$pct_counts_Spike, xlab="ERCC proportion (%)",
    ylab="Number of cells", breaks=20, main="", col="grey80")
hist(sce$pct_counts_Mt, xlab="Mitochondrial proportion (%)", 
    ylab="Number of cells", breaks=20, main="", col="grey80")

Figure 15: Histograms of QC metrics including the library sizes, number of expressed genes and proportion of UMIs assigned to spike-in transcripts or mitochondrial genes for all cells in the brain dataset.

We remove small outliers for the library size and the number of expressed features, and large outliers for the spike-in proportions. Again, the presence of spike-in transcripts means that we do not have to use the mitochondrial proportions.

In [49]:
libsize.drop <- isOutlier(sce$total_counts, nmads=3, type="lower", log=TRUE)
feature.drop <- isOutlier(sce$total_features, nmads=3, type="lower", log=TRUE)
spike.drop <- isOutlier(sce$pct_counts_Spike, nmads=3, type="higher")

Removal of low-quality cells is then performed by combining the filters for all of the metrics. The vast majority of cells are retained, which suggests that the original quality control procedures were generally adequate.

In [50]:
sce <- sce[,!(libsize.drop | feature.drop | spike.drop)]
data.frame(ByLibSize=sum(libsize.drop), ByFeature=sum(feature.drop), 
    BySpike=sum(spike.drop), Remaining=ncol(sce))
ByLibSizeByFeatureBySpikeRemaining
8 3 8 2989

3.4 Cell cycle classification

Application of cyclone to the brain dataset suggests that most of the cells are in G1 phase (Figure 16). However, the intepretation of this result requires some caution due to the differences between the test and training datasets. The classifier was trained on C1 SMARTer data (Scialdone et al. 2015) and accounts for the biases in that protocol. The brain dataset uses UMI counts, which has an entirely different set of biases, e.g., 3’-end coverage only, no length bias, no amplification noise. These new biases (and the absence of expected biases) may interfere with accurate classification of some cells.

In [51]:
ensembl <- mapIds(org.Mm.eg.db, keys=rownames(sce), keytype="SYMBOL", column="ENSEMBL")
assignments <- cyclone(sce, mm.pairs, gene.names=ensembl)
plot(assignments$score$G1, assignments$score$G2M, xlab="G1 score", ylab="G2/M score", pch=16)
'select()' returned 1:many mapping between keys and columns

Figure 16: Cell cycle phase scores from applying the pair-based classifier on the brain dataset, where each point represents a cell.

An additional complication is that many neuronal cell types are expected to lie in the G0 resting phase, which is distinct from the other phases of the cell cycle (Coller, Sang, and Roberts 2006). Application of cyclone to these cells may be suboptimal if each cell must be assigned into one of the G1, S or G2/M phases. To avoid problems from misclassification, we will not perform any processing of this dataset by cell cycle phase. This is unlikely to be problematic here, as the cell cycle effect will be relatively subtle compared to the obvious differences between cell types in a diverse population. Thus, the former is unlikely to distort the conclusions regarding the latter.

3.5 Examining gene-level metrics

Figure 17 shows the most highly expressed genes across the cell population in the brain data set. This is mostly occupied by spike-in transcripts, suggesting that too much spike-in RNA may be have been used. There are also a number of constitutively expressed genes, as expected.

In [52]:
plotQC(sce, type = "highest-expression", n=50) + fontsize

Figure 17: Percentage of total counts assigned to the top 50 most highly-abundant features in the brain dataset. For each feature, each bar represents the percentage assigned to that feature for a single cell, while the circle represents the average across all cells. Bars are coloured by the total number of expressed features in each cell, while circles are coloured according to whether the feature is labelled as a control feature.

Gene abundance is quantified by computing the average count across all cells (Figure 18). As previously mentioned, the UMI count is generally lower than the read count. This is because each transcript can only produce one UMI count but can yield many reads after fragmentation. Some power will be lost due to the decrease in the size of the counts, but this is mitigated by a concomitant reduction in their variability. Specifically, the use of UMIs eliminates technical noise due to amplification biases (Islam et al. 2014).

In [53]:
ave.counts <- calcAverage(sce)
hist(log10(ave.counts), breaks=100, main="", col="grey",
    xlab=expression(Log[10]~"average count"))
abline(v=log10(0.1), col="blue", lwd=2, lty=2)

Figure 18: Histogram of log-average counts for all genes in the brain dataset. The filter threshold is represented by the blue line.

We save the average counts into the SingleCellExperiment object for later use. We also remove genes that have average counts of zero, as this means that they are not expressed in any cell.

In [54]:
rowData(sce)$ave.count <- ave.counts
to.keep <- ave.counts > 0
sce <- sce[to.keep,]
summary(to.keep)
   Mode   FALSE    TRUE 
logical       2   19894 

3.6 Normalization of cell-specific biases

For endogenous genes, normalization is performed using the deconvolution method in the computeSumFactors function. Here, we cluster similar cells together and normalize the cells in each cluster using the deconvolution method. This improves normalization accuracy by reducing the number of DE genes between cells in the same cluster. Scaling is then performed to ensure that size factors of cells in different clusters are comparable.

In [55]:
high.ave <- rowData(sce)$ave.count >= 0.1
clusters <- quickCluster(sce, subset.row=high.ave, method="igraph")
sce <- computeSumFactors(sce, cluster=clusters, 
    subset.row=high.ave, min.mean=NULL)
summary(sizeFactors(sce))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.1257  0.4550  0.8149  1.0000  1.3435  4.4814 

We set subset.row to use only the high-abundance genes for normalization (and clustering, for consistency). We use a threshold of 0.1 to define high.ave, which is lower than the threshold in the HSC analysis to reflect the fact that UMI counts are generally smaller. Setting min.mean=NULL simply avoids recomputing the average count within computeSumFactors, given that filtering has already been performed with subset.row. (Here, subset.row=high.ave and min.mean=0.1 are redundant as they do exactly the same thing.)

Compared to the HSC analysis, more scatter is observed around the trend between the total count and size factor for each cell (Figure 19). This is consistent with an increased amount of DE between cells of different types, which compromises the accuracy of library size normalization (Robinson and Oshlack 2010). In contrast, the size factors are estimated based on median ratios and are more robust to the presence of DE between cells.

In [56]:
plot(sizeFactors(sce), sce$total_counts/1e3, log="xy",
    ylab="Library size (thousands)", xlab="Size factor")

Figure 19: Size factors from deconvolution, plotted against library sizes for all cells in the brain dataset. Axes are shown on a log-scale.

We also compute size factors specific to the spike-in set, as previously described.

In [57]:
sce <- computeSpikeFactors(sce, type="Spike", general.use=FALSE)

Finally, normalized log-expression values are computed for each endogenous gene or spike-in transcript using the appropriate size factors.

In [58]:
sce <- normalize(sce)

Comments from Aaron:

  • Only a rough clustering is required to avoid pooling together very different cell types in computeSumFactors. The function is robust to a moderate level of differential expression between cells in the same cluster.
  • For large data sets, using method="igraph" in quickCluster will speed up clustering. This uses a graph-based clustering algorithm - see ?buildSNNGraph for more details.

3.7 Checking for important technical factors

Larger experiments contain more technical factors that need to be investigated. In this dataset, factors include the sex of the animal from which the cells were extracted, the age of the animal, the tissue of origin for each cell, and the total spike-in count in each cell. Figure 20 shows that the tissue of origin explains a substantial proportion of the variance for a subset of genes. This is probably because each tissue contains a different composition of cell types, leading to systematic differences in gene expression between tissues. The other factors explain only a small proportion of the variance for most genes and do not need to be incorporated into our downstream analyses.

In [59]:
plotExplanatoryVariables(sce, variables=c("log10_total_counts_Spike", 
    "log10_total_counts_Spike", "sex", "tissue", "age")) + fontsize

Figure 20: Density plot of the percentage of variance explained by each factor across all genes in the brain dataset. For each gene, the percentage of the variance of the normalized log-expression values that is explained by the (log-transformed) total spike-in counts, the sex or age of the mouse, or the tissue of origin is calculated. Each curve corresponds to one factor and represents the distribution of percentages across all genes.

Nonetheless, we demonstrate how to account for uninteresting technical factors by using sex as an example. We set up a design matrix with the sex of the animal as the explanatory factor for each cell. This ensures that any sex-specific changes in expression will be modelled in our downstream analyses. We do not block on the tissue of origin, despite the fact that it explains more of the variance than sex in Figure 20. This is because the tissue factor is likely to be associated with genuine differences between cell types, so including it in the model might regress out interesting biological effects.

In [60]:
design <- model.matrix(~sce$sex)

Other relevant factors include the chip or plate on which the cells were processed and the batch in which the libraries were sequenced. Blocking on these factors may be necessary to account for batch effects that are often observed in scRNA-seq data (Hicks, Teng, and Irizarry 2015; Tung et al. 2016).

3.8 Modelling and removing technical noise

We model the technical noise by fitting a mean-variance trend to the spike-in transcripts, as previously described. To account for uninteresting factors, we supply design to trendVar to regress out any technical differences due to sex.

In [61]:
var.fit <- trendVar(sce, parametric=TRUE, span=0.4, design=design)
var.out <- decomposeVar(sce, var.fit)

Figure 21 indicates that the trend is fitted accurately to the technical variances. The technical and total variances are also much smaller than those in the HSC dataset. This is due to the use of UMIs which reduces the noise caused by variable PCR amplification. Furthermore, the spike-in trend is consistently lower than the variances of the endogenous genes. This reflects the heterogeneity in gene expression across cells of different types. It also provides an example where most genes are highly variable, such that fitting a trend to their variances would not recover the technical component.

In [62]:
plot(var.out$mean, var.out$total, pch=16, cex=0.6, xlab="Mean log-expression", 
    ylab="Variance of log-expression")
points(var.out$mean[isSpike(sce)], var.out$total[isSpike(sce)], col="red", pch=16)
curve(var.fit$trend(x), col="dodgerblue", add=TRUE, lwd=2)

Figure 21: Variance of normalized log-expression values against the mean for each gene, calculated across all cells in the brain data set after blocking on the sex effect. The blue line represents the mean-dependent trend in the technical variance of the spike-in transcripts (also highlighted as red points).

We check the distribution of expression values for the genes with the largest biological components to ensure that they are not driven by outliers (Figure @(fig:hvgvioplotbrain)). Some tweaking of the plotExpression parameters is necessary to visualize a large number of cells.

In [63]:
chosen.genes <- order(var.out$bio, decreasing=TRUE)[1:10]
plotExpression(sce, rownames(var.out)[chosen.genes], 
    alpha=0.05, jitter="jitter") + fontsize

Figure 22: Violin plots of normalized log-expression values for the top 10 HVGs in the brain dataset. For each gene, each point represents the log-expression value for an individual cell.

Finally, we denoise the expression values using our PCA-based approach. We supply design to regress out uninteresting factors, and we use only the genes with positive biological components, This yields a set of coordinates for each cell where the technical noise has been removed.

In [64]:
sce <- denoisePCA(sce, technical=var.fit$trend, design=design, approximate=TRUE)
ncol(reducedDim(sce, "PCA"))
100

Comments from Aaron:

  • For data sets containing multiple batches, an alternative strategy is to perform trend fitting and variance decomposition separately for each batch. This accommodates differences in the mean-variance trends between batches, especially if a different amount of spike-in RNA was added to the cells in each batch. We demonstrate the second approach below by treating each sex as a different “batch”. Statistics are combined across multiple batches using the combineVar function.
In [65]:
collected <- list()
for (block in levels(sce$sex)) {
    cur.sce <- sce[,sce$sex==block]
    cur.sce <- normalize(cur.sce) 
    var.fit <- trendVar(cur.sce, parametric=TRUE, span=0.4)
    collected[[block]] <- decomposeVar(cur.sce, var.fit)
}
var.out <- do.call(combineVar, collected)
  • Some downstream procedures must be performed on all batches at once, e.g., clustering or dimensionality reduction of all cells across multiple batches. However, many of these procedures are not model-based and thus do not accept a design matrix to account for the batch effect. To remove uninteresting factors of variation beforehand, we use the removeBatchEffect function from the limma package (Ritchie et al. 2015). This computes new expression values where the batch effect is regressed out, ensuring that it does not drive separation between clusters or in low-dimensional space. This is demonstrated below for the sex effect in the brain data. Note that this step is automatically performed inside denoisePCA when design is supplied, and does not need to be repeated.
In [66]:
adj.exprs <- logcounts(sce)
adj.exprs <- removeBatchEffect(adj.exprs, batch=sce$sex)
norm_exprs(sce) <- adj.exprs
  • That being said, if an analysis method can accept a design matrix, blocking on nuisance factors in the design matrix is preferable to using removeBatchEffect. This is because the latter does not account for the loss of residual degrees of freedom, nor the uncertainty of estimation of the blocking factor terms.
  • Setting approximate=TRUE in denoisePCA will perform an approximate singular value decomposition, using methods from the irlba package. This is much faster than the exact algorithm on large data sets, without much loss of accuracy.

3.9 Data exploration with dimensionality reduction

We perform dimensionality reduction on the denoised PCs to check if there is any substructure. Cells separate into clear clusters in the t-SNE plot (Figure 23), corresponding to distinct subpopulations. This is consistent with the presence of multiple cell types in the diverse brain population.

In [67]:
tsne1 <- plotTSNE(sce, use_dimred="PCA", colour_by="Neurod6",
    perplexity=10, rand_seed=100) + fontsize
tsne2 <- plotTSNE(sce, use_dimred="PCA", colour_by="Mog",
    perplexity=10, rand_seed=100) + fontsize
multiplot(tsne1, tsne2, cols=2)

Figure 23: t-SNE plots constructed from the denoised PCs of the brain dataset. Each point represents a cell and is coloured according to its expression of Neurod6 (left) or Mog (right).

The PCA plot is less effective at separating cells into many different clusters (Figure 24). This is because the first two PCs are driven by strong differences between specific subpopulations, which reduces the resolution of more subtle differences between some of the other subpopulations. Nonetheless, some substructure is still visible.

In [68]:
pca1 <- plotReducedDim(sce, use_dimred="PCA", colour_by="Neurod6") + fontsize
pca2 <- plotReducedDim(sce, use_dimred="PCA", colour_by="Mog") + fontsize
multiplot(pca1, pca2, cols=2)

Figure 24: PCA plots constructed from the denoised PCs of the brain dataset. Each point represents a cell and is coloured according to its expression of the Neurod6 (left) or Mog (right).

For both methods, we colour each cell based on the expression of a particular gene. This is a useful strategy for visualizing changes in expression across the lower-dimensional space. It can also be used to characterise each cluster if the selected genes are known markers for particular cell types. For example, Mog can be used to identify clusters corresponding to oligodendrocytes.

3.10 Clustering cells into putative subpopulations

The denoised log-expression values are used to cluster cells into putative subpopulations. Specifically, we perform hierarchical clustering on the Euclidean distances between cells, using Ward’s criterion to minimize the total variance within each cluster. This yields a dendrogram that groups together cells with similar expression patterns across the chosen genes. An alternative approach is to cluster on a matrix of distances derived from correlations (e.g., as in quickCluster). This is more robust to noise and normalization errors, but is also less sensitive to subtle changes in the expression profiles.

In [69]:
pcs <- reducedDim(sce, "PCA")
my.dist <- dist(pcs)
my.tree <- hclust(my.dist, method="ward.D2")

Clusters are explicitly defined by applying a dynamic tree cut (Langfelder, Zhang, and Horvath 2008) to the dendrogram. This exploits the shape of the branches in the dendrogram to refine the cluster definitions, and is more appropriate than cutree for complex dendrograms. Greater control of the empirical clusters can be obtained by manually specifying cutHeight in cutreeDynamic.

In [70]:
my.clusters <- unname(cutreeDynamic(my.tree, distM=as.matrix(my.dist), verbose=0))

We visualize the cluster assignments for all cells on the t-SNE plot in Figure @(fig:tsneclusterbrain). Adjacent cells are generally assigned to the same cluster, indicating that the clustering procedure was applied correctly.

In [71]:
sce$cluster <- factor(my.clusters)
plotTSNE(sce, use_dimred="PCA", colour_by="cluster",
    perplexity=10, rand_seed=100) + fontsize

Figure 25: t-SNE plot of the denoised PCs of the brain data set. Each point represents a cell and is coloured according to the cluster identity to which it was assigned.

We check the separatedness of the clusters using the silhouette width (Figure ((silhouettebrain))). Cells with large positive silhouette widths are closer to other cells in the same cluster than to cells in different clusters. Conversely, cells with negative widths are closer to other clusters than to other cells in the cluster to which it was assigned. Each cluster would ideally contain many cells with large positive widths, indicating that it is well-separated from other clusters. This can be used to gauge the optimal parameter values (e.g., cut height, number of clusters) that maximize the separation between clusters. For example, we could vary the cut height in cutreeDynamic to maximize the average silhouette width across all cells.

In [72]:
clust.col <- scater:::.get_palette("tableau10medium") # hidden scater colours
sil <- silhouette(my.clusters, dist = my.dist)
sil.cols <- clust.col[ifelse(sil[,3] > 0, sil[,1], sil[,2])]
sil.cols <- sil.cols[order(-sil[,1], sil[,3])]
plot(sil, main = paste(length(unique(my.clusters)), "clusters"), 
    border=sil.cols, col=sil.cols, do.col.sort=FALSE)

Figure 26: Barplot of silhouette widths for cells in each cluster. Each cluster is assigned a colour and cells with positive widths are coloured according to the colour of its assigned cluster. Any cell with a negative width is coloured according to the colour of the cluster that it is closest to. The average width for all cells in each cluster is shown, along with the average width for all cells in the data set.

Comments from Aaron:

  • Very heterogeneous datasets may yield a few large clusters on the first round of clustering. It can be useful to repeat the variance modelling, denoising and clustering using only the cells within each of the initial clusters. This can be achieved by subsetting sce according to a particular level of my.clusters, and re-applying the relevant functions on the subset. Doing so may focus on a different set of genes that define heterogeneity within an initial cluster, as opposed to those that define differences between the initial clusters. This would allow fine-scale structure within each cluster to be explored at greater resolution. For simplicity, though, we will only use the broad clusters corresponding to clear subpopulations in this workflow.
  • For larger data sets, consider using buildSNNGraph and methods from the igraph package to perform clustering. This builds a shared-nearest-neighbour graph (Xu and Su 2015) in which cells are the nodes and edges are formed between cells that share nearest neighbours. Clusters are then defined as highly connected communities of cells within this graph. This is more efficient than forming a pairwise distance matrix for large numbers of cells. Clustering parameters can be optimized by maximizing the modularity score for the formed clusters.
In [73]:
snn.gr <- buildSNNGraph(sce, use.dimred="PCA")
gr.clusters <- igraph::cluster_fast_greedy(snn.gr)
table(gr.clusters$membership)
  1   2   3   4   5 
 24 654 483 943 885 

3.11 Detecting marker genes between subpopulations

Once putative subpopulations are identified by clustering, we can identify marker genes for each cluster using the findMarkers function. This fits a linear model to the log-expression values for each gene using limma (Ritchie et al. 2015). The aim is to test for DE in each cluster compared to the others while blocking on uninteresting factors in design. The top DE genes are likely to be good candidate markers as they can effectively distinguish between cells in different clusters.

In [74]:
markers <- findMarkers(sce, my.clusters, design=design)

For each cluster, the DE results of the relevant comparisons are consolidated into a single output table. This allows a set of marker genes to be easily defined by taking the top DE genes from each pairwise comparison between clusters. For example, to construct a marker set for cluster 1 from the top 10 genes of each comparison, one would filter marker.set to retain rows with Top less than or equal to 10. Other statistics are also reported for each gene, including the adjusted p-values (see below) and the log-fold changes relative to every other cluster.

In [75]:
marker.set <- markers[["1"]]
head(marker.set, 10)
TopGeneFDRlogFC.2logFC.3logFC.4logFC.5logFC.6logFC.7logFC.8logFC.9
1 Htr3a 0.000000e+00-0.02544582 -0.05029491 0.003954437 -3.083390 -0.080803010 -0.10641149 -0.1682300 -0.01381434
1 Prkar1b 3.616221e-96-2.31307866 -2.62374087 -2.117351873 -2.842120 -0.302273439 -0.15813814 -1.6838074 -0.14167017
1 Mllt11 1.922874e-80-2.09893817 -2.46454736 -2.723502487 -2.799965 0.089108874 0.29702192 -1.5253662 0.15808280
1 Clu 1.648528e-09-0.31856852 -1.41685915 -0.326919028 -1.603226 -0.284155665 -5.50183112 -0.7477490 0.17243936
1 Snap25 0.000000e+00-3.33463649 -4.58549993 -3.936689273 -4.269328 -0.614520507 -0.33752588 -3.5312786 -0.51804491
1 Syt11 4.920698e-175 1.03778587 1.44486085 1.896619347 1.413011 3.716447788 1.97750647 0.7890491 3.52560559
1 Mbp 0.000000e+00 5.55480175 5.89264938 6.036065931 6.197872 5.529471087 5.88570347 1.4499635 5.13646103
2 Kcnip1 0.000000e+00-0.01035611 -0.02764130 0.004894382 -1.367154 0.001269168 -0.03371936 -0.2007867 -0.01202454
2 Syt1 1.281077e-144-3.71635417 -3.62952192 -3.160845090 -4.144459 -0.981675691 -0.61319608 -2.8316384 -0.78607438
2 Ndrg4 7.600224e-201-3.07419227 -3.35083957 -3.093946311 -3.889991 -0.261166595 -0.35735308 -2.3267593 -0.24098869

We save the list of candidate marker genes for further examination. The overlapExprs function may also be useful here, to prioritize candidates where there is clear separation between the distributions of expression values of different clusters.

In [76]:
write.table(marker.set, file="/beegfs/group_bit/data/projects/departments/Bioinformatics/bit_single_cell_tutorial/brain_marker_1.tsv", sep="\t", quote=FALSE, col.names=NA)

We visualize the expression profiles of the top candidates to verify that the DE signature is robust. Figure 27 indicates that most of the top markers have strong and consistent up- or downregulation in cells of cluster 1 compared to some or all of the other clusters. Thus, cells from the subpopulation of interest can be identified as those that express the upregulated markers and do not express the downregulated markers.

In [77]:
top.markers <- marker.set$Gene[marker.set$Top <= 10]
top.exprs <- norm_exprs(sce)[top.markers,,drop=FALSE]
heat.vals <- top.exprs - rowMeans(top.exprs)
pheatmap(heat.vals, cluster_cols=my.tree,
    annotation_col=data.frame(Cluster=factor(my.clusters), row.names=colnames(sce)),
    annotation_colors=list(Cluster=setNames(clust.col, seq_along(unique(my.clusters)))))

Figure 27: Heatmap of mean-centred normalized and corrected log-expression values for the top set of markers for cluster 1 in the brain dataset. Column colours represent the cluster to which each cell is assigned, as indicated by the legend.

Many of the markers in Figure 27 are not uniquely up- or downregulated in the chosen cluster. Testing for unique DE tends to be too stringent as it overlooks important genes that are expressed in two or more clusters. For example, in a mixed population of CD4+-only, CD8+-only, double-positive and double-negative T cells, neither Cd4 or Cd8 would be detected as subpopulation-specific markers because each gene is expressed in two subpopulations. With our approach, both of these genes will be picked up as candidate markers as they will be DE between at least one pair of subpopulations. A combination of markers can then be chosen to characterize a subpopulation, which is more flexible than trying to find uniquely DE genes.

Comments from Aaron:

  • To avoid problems with discreteness when modelling the mean-variance relationship, findMarkers will automatically partition the data into low- and high-abundance genes. Empirical Bayes shrinkage is performed in each partition separately, prior to calculation of p-values using the shrunk variance estimates. This ensures that discreteness does not affect the inferences for high-abundance genes, without needing to entirely discard the low-abundance genes.
  • findMarkers can also be directed to find genes that are DE between the chosen cluster and all other clusters. This should be done by setting pval.type="all", which defines the p-value for each gene as the maximum value across all pairwise comparisons involving the chosen cluster. Combined with direction="up", this can be used to identify unique markers for each cluster. However, this is sensitive to overclustering, as unique marker genes will no longer exist if a cluster is split into two smaller subclusters.
  • It must be stressed that the (adjusted) p-values computed here cannot be properly interpreted as measures of significance. This is because the clusters have been empirically identified from the data. limma does not account for the uncertainty of clustering, which means that the p-values are much lower than they should be. This is not a concern in other analyses where the groups are pre-defined.
  • The SingleCellExperiment object can also be easily transformed for use in other DE analysis methods. For example, the convertTo function can be used to construct a DGEList for input into the edgeR pipeline (Robinson, McCarthy, and Smyth 2010). This allows users to construct their own marker detection pipeline, though we find that direct use of findMarkers is usually sufficient.
In [78]:
y <- convertTo(sce, type="edgeR")

3.12 Additional comments

Having completed the basic analysis, we save the SingleCellExperiment object with its associated data to file. This is especially important here as the brain dataset is quite large. If further analyses are to be performed, it would be inconvenient to have to repeat all of the pre-processing steps described above.

In [79]:
saveRDS(file="/beegfs/group_bit/data/projects/departments/Bioinformatics/bit_single_cell_tutorial/brain_data.rds", sce)

4 Alternative parameter settings and strategies


4.1 Normalizing based on spike-in coverage

Scaling normalization strategies for scRNA-seq data can be broadly divided into two classes. The first class assumes that there exists a subset of genes that are not DE between samples, as previously described. The second class uses the fact that the same amount of spike-in RNA was added to each cell. Differences in the coverage of the spike-in transcripts can only be due to cell-specific biases, e.g., in capture efficiency or sequencing depth. Scaling normalization is then applied to equalize spike-in coverage across cells.

The choice between these two normalization strategies depends on the biology of the cells and the features of interest. If the majority of genes are expected to be DE and there is no reliable house-keeping set, spike-in normalization may be the only option for removing cell-specific biases. Spike-in normalization should also be used if differences in the total RNA content of individual cells are of interest. In any particular cell, an increase in the amount of endogenous RNA will not increase spike-in coverage (with or without library quantification). Thus, the former will not be represented as part of the bias in the latter, which means that the effects of total RNA content on expression will not be removed upon scaling. With non-DE normalization, an increase in RNA content will systematically increase the expression of all genes in the non-DE subset, such that it will be treated as bias and removed.

We demonstrate the use of spike-in normalization on a dataset involving different cell types – namely, mouse embryonic stem cells (mESCs) and mouse embryonic fibroblasts (MEFs) (Islam et al. 2011). The count table was obtained from NCBI GEO as a supplementary file under the accession GSE29087 (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE29087). We load the counts into R and specify the rows corresponding to spike-in transcripts. The negative control wells do not contain any cells and are useful for quality control but need to be removed prior to downstream analysis.

In [80]:
counts <- read.table("/beegfs/group_bit/data/projects/departments/Bioinformatics/bit_single_cell_tutorial/raw_data/GSE29087_L139_expression_tab.txt.gz", colClasses=c(list("character", 
    NULL, NULL, NULL, NULL, NULL, NULL), rep("integer", 96)), skip=6, sep='\t', row.names=1)
is.spike <- grep("SPIKE", rownames(counts)) 
sce <- SingleCellExperiment(list(counts=as.matrix(counts)))
isSpike(sce, "spike") <- is.spike
sce$grouping <- rep(c("mESC", "MEF", "Neg"), c(48, 44, 4))
sce <- sce[,sce$grouping!="Neg"] # Removing negative control wells.
sce <- calculateQCMetrics(sce, feature_controls=list(spike=is.spike))
sce
class: SingleCellExperiment 
dim: 22936 92 
metadata(0):
assays(1): counts
rownames(22936): RNA_SPIKE_1 RNA_SPIKE_2 ... r_U14 r_(CGTAG)n
rowData names(9): is_feature_control is_feature_control_spike ...
  total_counts log10_total_counts
colnames(92): V8 V9 ... V98 V99
colData names(29): grouping total_features ... pct_counts_spike
  is_cell_control
reducedDimNames(0):
spikeNames(1): spike

We then apply the computeSpikeFactors method to estimate size factors for all cells. This method computes the total count over all spike-in transcripts in each cell, and calculates size factors to equalize the total spike-in count across cells. Here, we set general.use=TRUE as we intend to apply the spike-in factors to all counts.

In [81]:
sce <- computeSpikeFactors(sce, general.use=TRUE)

Applying normalize will use the spike-in-based size factors to compute normalized log-expression values. Unlike in the previous analyses, we do not have to set separate size factors for the spike-in transcripts. This is because the relevant factors are already being used for all genes and spike-in transcripts when general.use=TRUE. (The exception is if the experiment uses multiple spike-in sets that behave differently and need to be normalized separately.)

In [82]:
sce <- normalize(sce)

For comparison, we also compute the deconvolution size factors and plot them against the spike-in factors. We observe a negative correlation between the two sets of values (Figure 28). This is because MEFs contain more endogenous RNA, which reduces the relative spike-in coverage in each library (thereby decreasing the spike-in size factors) but increases the coverage of endogenous genes (thus increasing the deconvolution size factors). If the spike-in size factors were applied to the counts, the expression values in MEFs would be scaled up while expression in mESCs would be scaled down. However, the opposite would occur if deconvolution size factors were used.

In [83]:
colours <- c(mESC="red", MEF="grey")
deconv.sf <- computeSumFactors(sce, sf.out=TRUE, cluster=sce$grouping)
plot(sizeFactors(sce), deconv.sf, col=colours[sce$grouping], pch=16, log="xy", 
    xlab="Size factor (spike-in)", ylab="Size factor (deconvolution)")
legend("bottomleft", col=colours, legend=names(colours), pch=16)
Warning message in .computeSumFactors(assay(x, i = assay.type), subset.row = subset.row, :
“not enough cells in at least one cluster for some 'sizes'”Warning message in .computeSumFactors(assay(x, i = assay.type), subset.row = subset.row, :
“not enough cells in at least one cluster for some 'sizes'”

Figure 28: Size factors from spike-in normalization, plotted against the size factors from deconvolution for all cells in the mESC/MEF dataset. Axes are shown on a log-scale, and cells are coloured according to their identity. Deconvolution size factors were computed with small pool sizes owing to the low number of cells of each type.

Whether or not total RNA content is relevant – and thus, the choice of normalization strategy – depends on the biological hypothesis. In the HSC and brain analyses, variability in total RNA across the population was treated as noise and removed by non-DE normalization. This may not always be appropriate if total RNA is associated with a biological difference of interest. For example, Islam et al. (2011) observe a 5-fold difference in total RNA between mESCs and MEFs. Similarly, the total RNA in a cell changes across phases of the cell cycle (Buettner et al. 2015). Spike-in normalization will preserve these differences in total RNA content such that the corresponding biological groups can be easily resolved in downstream analyses.

Comments from Aaron:

  • We only use genes with average counts greater than 1 (as specified in subset.row) to compute the deconvolution size factors. This avoids problems with discreteness as mentioned in our previous uses of computeSumFactors.
  • Setting sf.out=TRUE will directly return the size factors, rather than a SingleCellExperiment object containing those factors. This is more convenient when only the size factors are required for further analysis.

4.2 Characterising heterogeneity with hypothesis tests

4.2.1 Detecting highly variable genes

HVGs are defined as genes with biological components that are significantly greater than zero. These genes are interesting as they drive differences in the expression profiles between cells, and should be prioritized for further investigation. Formal detection of HVGs avoids focusing on genes that are highly variable due to technical factors such as sampling noise during RNA capture and library preparation. We can identify HVGs based on the statistics reported by decomposeVar, as shown below for the HSC data set.

In [84]:
sce <- readRDS("/beegfs/group_bit/data/projects/departments/Bioinformatics/bit_single_cell_tutorial/hsc_data.rds")
var.fit <- trendVar(sce, parametric=TRUE, span=0.2)
var.out <- decomposeVar(sce, var.fit)
hvg.out <- var.out[which(var.out$FDR <= 0.05),]
nrow(hvg.out)
558

We rank the results by the biological component to focus on genes with larger biological components.

In [85]:
hvg.out <- hvg.out[order(hvg.out$bio, decreasing=TRUE),] 
write.table(file="/beegfs/group_bit/data/projects/departments/Bioinformatics/bit_single_cell_tutorial/hsc_hvg.tsv", hvg.out, sep="\t", quote=FALSE, col.names=NA)
head(hvg.out)
meantotalbiotechp.valueFDR
Fos6.461286 19.564764 11.449379 8.115385 1.320376e-121.600371e-10
Dusp16.821227 15.627372 9.431590 6.195782 6.020593e-148.292390e-12
Ly6a8.404743 10.074571 8.612042 1.462529 5.963419e-813.162252e-77
Rgs15.311914 20.307722 8.510507 11.797216 2.304492e-051.250143e-03
Vbp17.436976 11.145646 7.741000 3.404646 8.205728e-242.762725e-21
Ctla2a8.742807 8.773436 7.655442 1.117995 2.164447e-971.530336e-93

There are many other strategies for defining HVGs, based on a variety of metrics:

  • the coefficient of variation (Brennecke et al. 2013; Kołodziejczyk et al. 2015; Kim et al. 2015)
  • the dispersion parameter in the negative binomial distribution (McCarthy, Chen, and Smyth 2012) a proportion of total variability (Vallejos, Marioni, and Richardson 2015)
  • Some of these methods are available in scran – for example, see DM or technicalCV2 for calculations based on the coefficient of variation. Here, we use the variance of the log-expression values because the log-transformation protects against genes with strong expression in only one or two cells. This ensures that the set of top HVGs is not dominated by genes with (mostly uninteresting) outlier expression patterns.

4.3 Identifying correlated gene pairs with Spearman’s rho

Another use of scRNA-seq data is to identify correlations between pairs of genes. This is quantified by computing Spearman’s rho, which accommodates non-linear relationships in the expression values. Here, we use the correlatePairs function to identify significant correlations between the various histocompatability antigens in the HSC data set.

In [86]:
set.seed(100)
var.cor <- correlatePairs(sce, subset.row=grep("^H2-", rownames(sce)))
head(var.cor)
gene1gene2rhop.valueFDRlimited
H2-Ab1 H2-Eb1 0.4968944 1.999998e-060.0004349996 TRUE
H2-Aa H2-Ab1 0.4886488 1.999998e-060.0004349996 TRUE
H2-D1 H2-K1 0.4150548 2.999997e-050.0043499957FALSE
H2-Aa H2-Eb1 0.4093985 4.199996e-050.0045674954FALSE
H2-Ab1 H2-DMb1 0.3581986 4.859995e-040.0422819577FALSE
H2-Q6 H2-Q7 0.3432024 8.659991e-040.0627849372FALSE

The significance of each correlation is determined using a permutation test. For each pair of genes, the null hypothesis is that the expression profiles of two genes are independent. Shuffling the profiles and recalculating the correlation yields a null distribution that is used to obtain a p-value for each observed correlation value (Phipson and Smyth 2010). Correction for multiple testing across many gene pairs is performed by controlling the FDR at 5%.

In [87]:
sig.cor <- var.cor$FDR <= 0.05
summary(sig.cor)
   Mode   FALSE    TRUE 
logical     430       5 

We can also compute correlations between specific pairs of genes, or between all pairs between two distinct sets of genes. The example below computes the correlation between Fos and Jun, which dimerize to form the AP-1 transcription factor.

In [88]:
correlatePairs(sce, subset.row=cbind("Fos", "Jun"))
gene1gene2rhop.valueFDRlimited
Fos Jun 0.4634958 5.999994e-065.999994e-06FALSE

Examination of the expression profiles in Figure 29 confirms the presence of a modest correlation between these two genes.

In [89]:
plotExpression(sce, features="Fos", x="Jun")

Figure 29: Expression of Fos plotted against the expression of Jun for all cells in the HSC data set.

Comments from Aaron:

  • It is recommended to compute correlations between a subset of genes of interest, known either a priori or empirically defined, e.g., as HVGs. Computing correlations across all genes will take too long; unnecessarily increase the severity of the multiple testing correction; and may prioritize strong but uninteresting correlations, e.g., between tightly co-regulated house-keeping genes.
  • The correlatePairs function can also return gene-centric output by setting per.gene=TRUE. This calculates a combined p-value (Simes 1986) for each gene that indicates whether it is significantly correlated to any other gene. From a statistical perspective, this is a more natural approach to correcting for multiple testing when genes, rather than pairs of genes, are of interest.

4.4 Blocking on the cell cycle phase

Cell cycle phase is usually uninteresting in studies focusing on other aspects of biology. However, the effects of cell cycle on the expression profile can mask other effects and interfere with the interpretation of the results. This cannot be avoided by simply removing cell cycle marker genes, as the cell cycle can affect a substantial number of other transcripts (Buettner et al. 2015). Rather, more sophisticated strategies are required, one of which is demonstrated below using data from a study of T Helper 2 (TH2) cells (Mahata et al. 2014). Buettner et al. (2015) have already applied quality control and normalized the data, so we can use them directly as log-expression values (accessible as Supplementary Data 1 of https://dx.doi.org/10.1038/nbt.3102).

In [90]:
incoming <- as.data.frame(read_excel("/beegfs/group_bit/data/projects/departments/Bioinformatics/bit_single_cell_tutorial/raw_data/nbt.3102-S7.xlsx", sheet=1))
rownames(incoming) <- incoming[,1]
incoming <- incoming[,-1]
incoming <- incoming[,!duplicated(colnames(incoming))] # Remove duplicated genes.
sce <- SingleCellExperiment(list(logcounts=t(incoming)))

We empirically identify the cell cycle phase using the pair-based classifier in cyclone. The majority of cells in Figure 30 seem to lie in G1 phase, with small numbers of cells in the other phases.

In [91]:
ensembl <- mapIds(org.Mm.eg.db, keys=rownames(sce), keytype="SYMBOL", column="ENSEMBL")
set.seed(100)
assignments <- cyclone(sce, mm.pairs, gene.names=ensembl, assay.type="logcounts")
plot(assignments$score$G1, assignments$score$G2M, xlab="G1 score", ylab="G2/M score", pch=16)
'select()' returned 1:many mapping between keys and columns

Figure 30: Cell cycle phase scores from applying the pair-based classifier on the TH2 dataset, where each point represents a cell.

We can block directly on the phase scores in downstream analyses. This is more graduated than using a strict assignment of each cell to a specific phase, as the magnitude of the score considers the uncertainty of the assignment. The phase covariates in the design matrix will absorb any phase-related effects on expression such that they will not affect estimation of the effects of other experimental factors. Users should also ensure that the phase score is not confounded with other factors of interest. For example, model fitting is not possible if all cells in one experimental condition are in one phase, and all cells in another condition are in a different phase.

In [92]:
design <- model.matrix(~ G1 + G2M, assignments$score)
fit.block <- trendVar(sce, design=design, parametric=TRUE, use.spikes=NA)
sce.block <- denoisePCA(sce, technical=fit.block$trend, design=design)

The result of blocking on design is visualized with some PCA plots in Figure 31. Before removal, the distribution of cells along the first two principal components is strongly associated with their G1 and G2/M scores. This is no longer the case after removal, which suggests that the cell cycle effect has been mitigated.

In [93]:
sce$G1score <- sce.block$G1score <- assignments$score$G1
sce$G2Mscore <- sce.block$G2Mscore <- assignments$score$G2M

# Without blocking on phase score.
fit <- trendVar(sce, parametric=TRUE, use.spikes=NA) 
sce <- denoisePCA(sce, technical=fit$trend)
out <- plotReducedDim(sce, use_dimred="PCA", ncomponents=2, colour_by="G1score", 
    size_by="G2Mscore") + fontsize + ggtitle("Before removal")

# After blocking on the phase score.
out2 <- plotReducedDim(sce.block, use_dimred="PCA", ncomponents=2, colour_by="G1score", 
    size_by="G2Mscore") + fontsize + ggtitle("After removal")
multiplot(out, out2, cols=2)

Figure 31: PCA plots before (left) and after (right) removal of the cell cycle effect in the TH2 dataset. Each cell is represented by a point with colour and size determined by the G1 and G2/M scores, respectively.

As an aside, this dataset contains cells at various stages of differentiation (Mahata et al. 2014). This is an ideal use case for diffusion maps which perform dimensionality reduction along a continuous process. In Figure 32, cells are arranged along a trajectory in the low-dimensional space. The first diffusion component is likely to correspond to TH2 differentiation, given that a key regulator Gata3 (Zhu et al. 2006) changes in expression from left to right.

In [94]:
plotDiffusionMap(sce.block, use_dimred="PCA", sigma=25,
    colour_by="Gata3") + fontsize

Figure 32: A diffusion map for the TH2 dataset, where each cell is coloured by its expression of Gata3. A larger sigma is used compared to the default value to obtain a smoother plot.

4.5 Extracting annotation from Ensembl identifiers

Feature-counting tools typically report genes in terms of standard identifiers from Ensembl or Entrez. These identifiers are used as they are unambiguous and highly stable. However, they are difficult to interpret compared to the gene symbols which are more commonly used in the literature. We can easily convert from one to the other using annotation packages like org.Mm.eg.db. This is demonstrated below for Ensembl identifiers in a mESC dataset (Kołodziejczyk et al. 2015) obtained from http://www.ebi.ac.uk/teichmann-srv/espresso. The mapIds call extracts the specified data from the annotation object, returning the first gene symbol if multiple symbols correspond to a single Ensembl identifier.

In [95]:
# incoming <- read.table("counttable_es.csv", header=TRUE, row.names=1)
# my.ids <- rownames(incoming)
# symb <- mapIds(org.Mm.eg.db, keys=my.ids, keytype="ENSEMBL", column="SYMBOL")
# head(symb)

To identify which rows correspond to mitochondrial genes, we need to use extra annotation describing the genomic location of each gene. For Ensembl, this involves using the TxDb.Mmusculus.UCSC.mm10.ensGene package.

In [96]:
# library(TxDb.Mmusculus.UCSC.mm10.ensGene)
# location <- mapIds(TxDb.Mmusculus.UCSC.mm10.ensGene, keys=my.ids, 
#     column="CDSCHROM", keytype="GENEID")
# is.mito <- location == "chrM" & !is.na(location)
# sum(is.mito)

Identification of rows that correspond to spike-in transcripts is much easier, given that the ERCC spike-ins were used.

In [97]:
# is.spike <- grepl("^ERCC", my.ids)
# sum(is.spike)

All of this information can be consolidated into a SingleCellExperiment object for further manipulation. Alternatively, annotation from BioMart resources can be directly added to the object using the getBMFeatureAnnos function from scater.

In [98]:
# sce <- SingleCellExperiment(list(counts=as.matrix(incoming)), 
#     rowData=DataFrame(Symbol=symb, Chr=location))
# isSpike(sce, "ERCC") <- is.spike
# sce <- calculateQCMetrics(sce, feature_controls=list(ERCC=is.spike, Mito=is.mito))
# sce

We filter out rows that do not correspond to endogenous genes or spike-in transcripts. This will remove rows containing mapping statistics such as the number of unaligned or unassigned reads, which would be misleading if treated as gene expression values. The object is then ready for downstream analyses as previously described.

In [99]:
# sce <- sce[grepl("ENSMUS", rownames(sce)) | isSpike(sce),]
# dim(sce)

5 Conclusions


This workflow provides a step-by-step guide for performing basic analyses of single-cell RNA-seq data in R. It provides instructions for a number of low-level steps such as quality control, normalization, cell cycle phase assignment, data exploration, HVG and marker gene detection, and clustering. This is done with a number of different datasets to provide a range of usage examples. The workflow is modular so individual steps can be substituted with alternative methods according to user preferences. In addition, the processed data can be easily used for higher-level analyses with other Bioconductor packages. We anticipate that this workflow will assist readers in assembling analyses of their own scRNA-seq data.

In [100]:
sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /beegfs/common/software/2017/modules/software/openblas/0.2.19/lib/libopenblas_sandybridgep-r0.2.19.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] cluster_2.0.6              dynamicTreeCut_1.63-1     
 [3] edgeR_3.20.7               limma_3.34.6              
 [5] pheatmap_1.0.8             scran_1.6.6               
 [7] BiocParallel_1.12.0        scater_1.6.2              
 [9] ggplot2_2.2.1              org.Mm.eg.db_3.5.0        
[11] AnnotationDbi_1.40.0       SingleCellExperiment_1.0.0
[13] SummarizedExperiment_1.8.1 DelayedArray_0.4.1        
[15] matrixStats_0.53.0         Biobase_2.38.0            
[17] GenomicRanges_1.30.1       GenomeInfoDb_1.14.0       
[19] IRanges_2.12.0             S4Vectors_0.16.0          
[21] BiocGenerics_0.24.0        readxl_1.0.0              

loaded via a namespace (and not attached):
  [1] uuid_0.1-2             backports_1.1.2        Hmisc_4.1-1           
  [4] RcppEigen_0.3.3.3.1    plyr_1.8.4             igraph_1.1.2          
  [7] repr_0.12.0            lazyeval_0.2.1         sp_1.2-7              
 [10] shinydashboard_0.6.1   splines_3.4.3          digest_0.6.14         
 [13] htmltools_0.3.6        viridis_0.4.1          magrittr_1.5          
 [16] checkmate_1.8.5        memoise_1.1.0          xts_0.10-1            
 [19] prettyunits_1.0.2      colorspace_1.3-2       blob_1.1.0            
 [22] dplyr_0.7.4            crayon_1.3.4           RCurl_1.95-4.10       
 [25] jsonlite_1.5           tximport_1.6.0         lme4_1.1-15           
 [28] bindr_0.1              survival_2.41-3        zoo_1.8-1             
 [31] glue_1.2.0             gtable_0.2.0           zlibbioc_1.24.0       
 [34] XVector_0.18.0         MatrixModels_0.4-1     car_2.1-6             
 [37] DEoptimR_1.0-8         SparseM_1.77           VIM_4.7.0             
 [40] scales_0.5.0           DBI_0.7                Rcpp_0.12.15          
 [43] viridisLite_0.2.0      xtable_1.8-2           progress_1.1.2        
 [46] laeken_0.4.6           htmlTable_1.11.2       proxy_0.4-21          
 [49] foreign_0.8-69         bit_1.1-12             Formula_1.2-2         
 [52] DT_0.3                 vcd_1.4-4              htmlwidgets_1.0       
 [55] httr_1.3.1             FNN_1.1                RColorBrewer_1.1-2    
 [58] acepack_1.4.1          pkgconfig_2.0.1        XML_3.98-1.9          
 [61] nnet_7.3-12            locfit_1.5-9.1         labeling_0.3          
 [64] rlang_0.1.6            reshape2_1.4.3         munsell_0.4.3         
 [67] cellranger_1.1.0       tools_3.4.3            RSQLite_2.0           
 [70] evaluate_0.10.1        stringr_1.2.0          knitr_1.18            
 [73] bit64_0.9-7            robustbase_0.92-8      bindrcpp_0.2          
 [76] nlme_3.1-131           mime_0.5               quantreg_5.34         
 [79] biomaRt_2.34.2         compiler_3.4.3         pbkrtest_0.4-7        
 [82] rstudioapi_0.7         curl_3.1               beeswarm_0.2.3        
 [85] e1071_1.6-8            smoother_1.1           tibble_1.4.2          
 [88] statmod_1.4.30         stringi_1.1.6          lattice_0.20-35       
 [91] IRdisplay_0.4.4        Matrix_1.2-12          nloptr_1.0.4          
 [94] pillar_1.1.0           lmtest_0.9-35          data.table_1.10.4-3   
 [97] bitops_1.0-6           irlba_2.3.2            httpuv_1.3.5          
[100] R6_2.2.2               latticeExtra_0.6-28    KernSmooth_2.23-15    
[103] gridExtra_2.3          vipor_0.4.5            boot_1.3-20           
[106] MASS_7.3-47            assertthat_0.2.0       destiny_2.6.1         
[109] rhdf5_2.22.0           rjson_0.2.15           GenomeInfoDbData_1.0.0
[112] mgcv_1.8-22            grid_3.4.3             rpart_4.1-11          
[115] IRkernel_0.8.12.9000   class_7.3-14           minqa_1.2.4           
[118] Cairo_1.5-9            Rtsne_0.13             TTR_0.23-3            
[121] pbdZMQ_0.3-1           shiny_1.0.5            base64enc_0.1-3       
[124] ggbeeswarm_0.6.0      

References

Anders, S., and W. Huber. 2010. “Differential expression analysis for sequence count data.” Genome Biol. 11 (10): R106.

Angerer, P., L. Haghverdi, M. Buttner, F. J. Theis, C. Marr, and F. Buettner. 2016. “destiny: diffusion maps for large-scale single-cell data in R.” Bioinformatics 32 (8): 1241–3.

Bertoli, C., J. M. Skotheim, and R. A. de Bruin. 2013. “Control of cell cycle transcription during G1 and S phases.” Nat. Rev. Mol. Cell Biol. 14 (8): 518–28.

Bourgon, R., R. Gentleman, and W. Huber. 2010. “Independent filtering increases detection power for high-throughput experiments.” Proc. Natl. Acad. Sci. U.S.A. 107 (21): 9546–51.

Bray, N. L., H. Pimentel, P. Melsted, and L. Pachter. 2016. “Near-optimal probabilistic RNA-seq quantification.” Nat. Biotechnol. 34 (5): 525–27.

Brennecke, P., S. Anders, J. K. Kim, A. A. Kołodziejczyk, X. Zhang, V. Proserpio, B. Baying, et al. 2013. “Accounting for technical noise in single-cell RNA-seq experiments.” Nat. Methods 10 (11): 1093–5.

Buettner, F., K. N. Natarajan, F. P. Casale, V. Proserpio, A. Scialdone, F. J. Theis, S. A. Teichmann, J. C. Marioni, and O. Stegle. 2015. “Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells.” Nat. Biotechnol. 33 (2): 155–60.

Chen, Y., A. T. Lun, and G. K. Smyth. 2016. “From reads to genes to pathways: differential expression analysis of RNA-Seq experiments using Rsubread and the edgeR quasi-likelihood pipeline.” F1000Res 5: 1438.

Coller, H. A., L. Sang, and J. M. Roberts. 2006. “A new description of cellular quiescence.” PLoS Biol. 4 (3): e83.

Conboy, C. M., C. Spyrou, N. P. Thorne, E. J. Wade, N. L. Barbosa-Morais, M. D. Wilson, A. Bhattacharjee, et al. 2007. “Cell cycle genes are the evolutionarily conserved targets of the E2F4 transcription factor.” PLoS ONE 2 (10): e1061.

Fan, J., N. Salathia, R. Liu, G. E. Kaeser, Y. C. Yung, J. L. Herman, F. Kaper, et al. 2016. “Characterizing transcriptional heterogeneity through pathway and gene set overdispersion analysis.” Nat. Methods 13 (3): 241–44.

Hicks, S. C., M. Teng, and R. A. Irizarry. 2015. “On the widespread and critical impact of systematic bias and batch effects in single-cell RNA-Seq data.” bioRxiv. Cold Spring Harbor Labs Journals. doi:10.1101/025528.

Huber, W., V. J. Carey, R. Gentleman, S. Anders, M. Carlson, B. S. Carvalho, H. C. Bravo, et al. 2015. “Orchestrating high-throughput genomic analysis with Bioconductor.” Nat. Methods 12 (2): 115–21.

Ilicic, T., J. K. Kim, A. A. Kołodziejczyk, F. O. Bagger, D. J. McCarthy, J. C. Marioni, and S. A. Teichmann. 2016. “Classification of low quality cells from single-cell RNA-seq data.” Genome Biol. 17 (1): 29.

Islam, S., U. Kjallquist, A. Moliner, P. Zajac, J. B. Fan, P. Lonnerberg, and S. Linnarsson. 2011. “Characterization of the single-cell transcriptional landscape by highly multiplex RNA-seq.” Genome Res. 21 (7): 1160–7.

Islam, S., A. Zeisel, S. Joost, G. La Manno, P. Zajac, M. Kasper, P. Lonnerberg, and S. Linnarsson. 2014. “Quantitative single-cell RNA-seq with unique molecular identifiers.” Nat. Methods 11 (2): 163–66.

Ji, Z., and H. Ji. 2016. “TSCAN: Pseudo-time reconstruction and evaluation in single-cell RNA-seq analysis.” Nucleic Acids Res. 44 (13): e117.

Julia, M., A. Telenti, and A. Rausell. 2015. “Sincell: an R/Bioconductor package for statistical assessment of cell-state hierarchies from single-cell RNA-seq.” Bioinformatics 31 (20): 3380–2.

Kim, J. K., A. A. Kołodziejczyk, T. Illicic, S. A. Teichmann, and J. C. Marioni. 2015. “Characterizing noise structure in single-cell RNA-seq distinguishes genuine from technical stochastic allelic expression.” Nat. Commun. 6: 8687.

Klein, A. M., L. Mazutis, I. Akartuna, N. Tallapragada, A. Veres, V. Li, L. Peshkin, D. A. Weitz, and M. W. Kirschner. 2015. “Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells.” Cell 161 (5): 1187–1201.

Kołodziejczyk, A. A., J. K. Kim, J. C. Tsang, T. Ilicic, J. Henriksson, K. N. Natarajan, A. C. Tuck, et al. 2015. “Single cell RNA-sequencing of pluripotent states unlocks modular transcriptional variation.” Cell Stem Cell 17 (4): 471–85.

Langfelder, P., B. Zhang, and S. Horvath. 2008. “Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R.” Bioinformatics 24 (5): 719–20.

Law, C. W., Y. Chen, W. Shi, and G. K. Smyth. 2014. “voom: Precision weights unlock linear model analysis tools for RNA-seq read counts.” Genome Biol. 15 (2): R29.

Leng, N., L. F. Chu, C. Barry, Y. Li, J. Choi, X. Li, P. Jiang, R. M. Stewart, J. A. Thomson, and C. Kendziorski. 2015. “Oscope identifies oscillatory genes in unsynchronized single-cell RNA-seq experiments.” Nat. Methods 12 (10): 947–50.

Liao, Y., G. K. Smyth, and W. Shi. 2013. “The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote.” Nucleic Acids Res. 41 (10): e108.

———. 2014. “featureCounts: an efficient general purpose program for assigning sequence reads to genomic features.” Bioinformatics 30 (7): 923–30.

Love, M. I., S. Anders, V. Kim, and W. Huber. 2015. “RNA-Seq workflow: gene-level exploratory analysis and differential expression.” F1000Res 4: 1070.

Love, M. I., W. Huber, and S. Anders. 2014. “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.” Genome Biol. 15 (12): 550.

Lun, A. T., K. Bach, and J. C. Marioni. 2016. “Pooling Across Cells to Normalize Single-Cell Rna Sequencing Data with Many Zero Counts.” Genome Biol. 17: 75.

Macosko, E. Z., A. Basu, R. Satija, J. Nemesh, K. Shekhar, M. Goldman, I. Tirosh, et al. 2015. “Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets.” Cell 161 (5): 1202–14.

Mahata, B., X. Zhang, A. A. Kołodziejczyk, V. Proserpio, L. Haim-Vilmovsky, A. E. Taylor, D. Hebenstreit, et al. 2014. “Single-cell RNA sequencing reveals T helper cells synthesizing steroids de novo to contribute to immune homeostasis.” Cell Rep. 7 (4): 1130–42.

Marinov, G. K., B. A. Williams, K. McCue, G. P. Schroth, J. Gertz, R. M. Myers, and B. J. Wold. 2014. “From single-cell to cell-pool transcriptomes: stochasticity in gene expression and RNA splicing.” Genome Res. 24 (3): 496–510.

McCarthy, D. J., K. R. Campbell, A. T. L. Lun, and Q. F. Wills. 2016. “Scater: Pre-Processing, Quality Control, Normalisation and Visualisation of Single-Cell Rna-Seq Data in R.” bioRxiv. Cold Spring Harbor Labs Journals. doi:10.1101/069633.

McCarthy, D. J., Y. Chen, and G. K. Smyth. 2012. “Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation.” Nucleic Acids Res. 40 (10): 4288–97.

Patro, Rob, Geet Duggal, and Carl Kingsford. 2015. “Accurate, Fast, and Model-Aware Transcript Expression Quantification with Salmon.” bioRxiv. Cold Spring Harbor Labs Journals. doi:10.1101/021592.

Phipson, B., and G. K. Smyth. 2010. “Permutation P-values should never be zero: calculating exact P-values when permutations are randomly drawn.” Stat. Appl. Genet. Mol. Biol. 9: Article 39.

Picelli, S., O. R. Faridani, A. K. Bjorklund, G. Winberg, S. Sagasser, and R. Sandberg. 2014. “Full-length RNA-seq from single cells using Smart-seq2.” Nat Protoc 9 (1): 171–81.

Pollen, A. A., T. J. Nowakowski, J. Shuga, X. Wang, A. A. Leyrat, J. H. Lui, N. Li, et al. 2014. “Low-coverage single-cell mRNA sequencing reveals cellular heterogeneity and activated signaling pathways in developing cerebral cortex.” Nat. Biotechnol. 32 (10): 1053–8.

Ritchie, M. E., B. Phipson, D. Wu, Y. Hu, C. W. Law, W. Shi, and G. K. Smyth. 2015. “limma powers differential expression analyses for RNA-sequencing and microarray studies.” Nucleic Acids Res. 43 (7): e47.

Robinson, M. D., and A. Oshlack. 2010. “A scaling normalization method for differential expression analysis of RNA-seq data.” Genome Biol. 11 (3): R25.

Robinson, M. D., D. J. McCarthy, and G. K. Smyth. 2010. “edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.” Bioinformatics 26 (1): 139–40.

Scialdone, A., K. N. Natarajan, L. R. Saraiva, V. Proserpio, S. A. Teichmann, O. Stegle, J. C. Marioni, and F. Buettner. 2015. “Computational assignment of cell-cycle stage from single-cell transcriptome data.” Methods 85 (September): 54–61.

Simes, R. J. 1986. “An Improved Bonferroni Procedure for Multiple Tests of Significance.” Biometrika 73 (3): 751–54.

Stegle, O., S. A. Teichmann, and J. C. Marioni. 2015. “Computational and analytical challenges in single-cell transcriptomics.” Nat. Rev. Genet. 16 (3): 133–45.

Trapnell, C., D. Cacchiarelli, J. Grimsby, P. Pokharel, S. Li, M. Morse, N. J. Lennon, K. J. Livak, T. S. Mikkelsen, and J. L. Rinn. 2014. “The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells.” Nat. Biotechnol. 32 (4): 381–86.

Tung, P., J. D. Blischak, C. Hsiao, D. A. Knowles, J. E. Burnett, J. K. Pritchard, and Y. Gilad. 2016. “Batch Effects and the Effective Design of Single-Cell Gene Expression Studies.” bioRxiv. Cold Spring Harbor Labs Journals. doi:10.1101/062919.

Vallejos, C. A., J. C. Marioni, and S. Richardson. 2015. “BASiCS: Bayesian analysis of single-cell sequencing data.” PLoS Comput. Biol. 11 (6): e1004333.

Van der Maaten, L., and G. Hinton. 2008. “Visualizing Data Using T-SNE.” J. Mach. Learn. Res. 9 (2579-2605): 85.

Wilson, N. K., D. G. Kent, F. Buettner, M. Shehata, I. C. Macaulay, F. J. Calero-Nieto, M. Sanchez Castillo, et al. 2015. “Combined single-cell functional and gene expression analysis resolves heterogeneity within stem cell populations.” Cell Stem Cell 16 (6): 712–24.

Xu, C., and Z. Su. 2015. “Identification of cell types from single-cell transcriptomes using a novel clustering method.” Bioinformatics 31 (12): 1974–80.

Zeisel, A., A. B. Munoz-Manchado, S. Codeluppi, P. Lonnerberg, G. La Manno, A. Jureus, S. Marques, et al. 2015. “Brain structure. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq.” Science 347 (6226): 1138–42.

Zhu, J., H. Yamane, J. Cote-Sierra, L. Guo, and W. E. Paul. 2006. “GATA-3 promotes Th2 responses through three different mechanisms: induction of Th2 cytokine production, selective growth of Th2 cells and inhibition of Th1 cell-specific factors.” Cell Res. 16 (1): 3–10.

In [ ]: