In [1]:
# This step sets my R_LIBS_USER

A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor

This notebook is based on the following tutorial:


Aaron T. L. Lun (1), Davis J. McCarthy (2,3) and John C. Marioni (1,2,4)

1.Cancer Research UK Cambridge Institute, Li Ka Shing Centre, Robinson Way, Cambridge CB2 0RE, United Kingdom

2.EMBL European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom

3.St Vincent's Institute of Medical Research, 41 Victoria Parade, Fitzroy, Victoria 3065, Australia

4.Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, United Kingdom

1. Introduction

Single-cell RNA sequencing (scRNA-seq) is widely used to measure the genome-wide expression profile of individual cells. From each cell, mRNA is isolated and reverse transcribed to cDNA for high-throughput sequencing (Stegle, Teichmann, and Marioni 2015). This can be done using microfluidics platforms like the Fluidigm C1 (Pollen et al. 2014), protocols based on microtiter plates like Smart-seq2 (Picelli et al. 2014), or droplet-based technologies like inDrop (Klein et al. 2015; Macosko et al. 2015). The number of reads mapped to each gene is then used to quantify its expression in each cell. Alternatively, unique molecular identifiers (UMIs) can be used to directly measure the number of transcript molecules for each gene (Islam et al. 2014). Count data are analyzed to detect highly variable genes (HVGs) that drive heterogeneity across cells in a population, to find correlations between genes and cellular phenotypes, or to identify new subpopulations via dimensionality reduction and clustering. This provides biological insights at a single-cell resolution that cannot be achieved with conventional bulk RNA sequencing of cell populations.

Strategies for scRNA-seq data analysis differ markedly from those for bulk RNA-seq. One technical reason is that scRNA-seq data are much noisier than bulk data (Brennecke et al. 2013; Marinov et al. 2014). Reliable capture (i.e., conversion) of transcripts into cDNA for sequencing is difficult with the low quantity of RNA in a single cell. This increases the frequency of drop-out events where none of the transcripts for a gene are captured. Dedicated steps are required to deal with this noise during analysis, especially during quality control. In addition, scRNA-seq data can be used to study cell-to-cell heterogeneity, e.g., to identify new cell subtypes, to characterize differentiation processes, to assign cells into their cell cycle phases, or to identify HVGs driving variability across the population (Vallejos, Marioni, and Richardson 2015; J. Fan et al. 2016; Trapnell et al. 2014). This is simply not possible with bulk data, meaning that custom methods are required to perform these analyses.

This article describes a computational workflow for basic analysis of scRNA-seq data, using software packages from the open-source Bioconductor project (release 3.5) (Huber et al. 2015). Starting from a count matrix, this workflow contains the steps required for quality control to remove problematic cells; normalization of cell-specific biases, with and without spike-ins; cell cycle phase classification from gene expression data; data exploration to identify putative subpopulations; and finally, HVG and marker gene identification to prioritize interesting genes. The application of different steps in the workflow will be demonstrated on several public scRNA-seq datasets involving haematopoietic stem cells, brain-derived cells, T-helper cells and mouse embryonic stem cells, generated with a range of experimental protocols and platforms (N. K. Wilson et al. 2015; Zeisel et al. 2015; Buettner et al. 2015; Kołodziejczyk et al. 2015). The aim is to provide a variety of modular usage examples that can be applied to construct custom analysis pipelines.

Note: to cite this article, please refer to the original publication for instructions.

If you wish to start from the raw data and generate the count table yourself, follow these steps (otherwise, jump to the "Download data" section):

Download the list of accession numbers via "Download" > "Accession list". Then use the SRA Toolkit to download the runs as compressed (gz) FASTQ files:

while read line; do fastq-dump --gzip --split-files ${line} ; done < Accession_list.txt

Map the reads to the reference genome with:

hisat2 -p 18 --rna-strandness R --dta-cufflinks --met-file hisat_output/${file%_1.fastq.gz}.stats \
-x ${hisat_index} -S hisat_output/${file%_1.fastq.gz}.sam -1 ${file} -2 ${file%_1.fastq.gz}_2.fastq.gz

With the alignments in hand, we count the number of reads mapped to each gene (exons only).

In [2]:
library("GenomicFeatures")
library("GenomicAlignments")   # provides summarizeOverlaps()
library("Rsamtools")           # provides BamFileList()

files<-c('SRR1578712.bam', 'SRR1578672.bam', 'SRR1578704.bam', 'SRR1578663.bam', 'SRR1578693.bam', 'SRR1578681.bam', 'SRR1578696.bam', 'SRR1578703.bam', 'SRR1578699.bam', 'SRR1578653.bam', 'SRR1578707.bam', 'SRR1578652.bam', 'SRR1578708.bam', 'SRR1578715.bam', 'SRR1578730.bam', 'SRR1578735.bam', 'SRR1578676.bam', 'SRR1578650.bam', 'SRR1578675.bam', 'SRR1578716.bam', 'SRR1578664.bam', 'SRR1578705.bam', 'SRR1578718.bam', 'SRR1578706.bam', 'SRR1578679.bam', 'SRR1578689.bam', 'SRR1578647.bam', 'SRR1578723.bam', 'SRR1578656.bam', 'SRR1578666.bam', 'SRR1578686.bam', 'SRR1578643.bam', 'SRR1578654.bam', 'SRR1578680.bam', 'SRR1578659.bam', 'SRR1578727.bam', 'SRR1578734.bam', 'SRR1578737.bam', 'SRR1578665.bam', 'SRR1578733.bam', 'SRR1578736.bam', 'SRR1578688.bam', 'SRR1578690.bam', 'SRR1578726.bam', 'SRR1578698.bam', 'SRR1578661.bam', 'SRR1578645.bam', 'SRR1578720.bam', 'SRR1578649.bam', 'SRR1578658.bam', 'SRR1578695.bam', 'SRR1578710.bam', 'SRR1578683.bam', 'SRR1578732.bam', 'SRR1578692.bam', 'SRR1578697.bam', 'SRR1578728.bam', 'SRR1578721.bam', 'SRR1578694.bam', 'SRR1578668.bam', 'SRR1578678.bam', 'SRR1578731.bam', 'SRR1578644.bam', 'SRR1578677.bam', 'SRR1578657.bam', 'SRR1578691.bam', 'SRR1578682.bam', 'SRR1578724.bam', 'SRR1578667.bam', 'SRR1578738.bam', 'SRR1578655.bam', 'SRR1578673.bam', 'SRR1578662.bam', 'SRR1578670.bam', 'SRR1578714.bam', 'SRR1578717.bam', 'SRR1578660.bam', 'SRR1578709.bam', 'SRR1578719.bam', 'SRR1578669.bam', 'SRR1578702.bam', 'SRR1578687.bam', 'SRR1578713.bam', 'SRR1578648.bam', 'SRR1578700.bam', 'SRR1578725.bam', 'SRR1578651.bam', 'SRR1578701.bam', 'SRR1578685.bam', 'SRR1578646.bam', 'SRR1578711.bam', 'SRR1578684.bam', 'SRR1578674.bam', 'SRR1578729.bam', 'SRR1578671.bam', 'SRR1578722.bam')
names <- paste0("HSC1_", 1:96)   # HSC1_1 ... HSC1_96, in the same order as 'files'
sampleTable <- data.frame(sampleName=names, fileName=files)
bamFiles <- file.path( extDataDir, sampleTable$fileName )   # extDataDir: directory holding the BAM files

mmus <- makeTxDbFromGFF(GTF, format="gtf")   # GTF: path to the mouse annotation file
exonsByGene <- exonsBy( mmus, by="gene" )

se <- summarizeOverlaps( exonsByGene, BamFileList( bamFiles ), mode="Union", singleEnd=FALSE, ignore.strand=FALSE)

colnames(se) <- sampleTable$sampleName 
counts <- assay(se)

Download data

In [3]:
cd /beegfs/group_bit/data/projects/departments/Bioinformatics/bit_single_cell_tutorial/raw_data/
unpigz GSE61533_HTSEQ_count_results.xls.gz

Load required packages

In [4]:
library(SingleCellExperiment)
library(scater)
library(scran)
library(limma)
library(edgeR)
library(org.Mm.eg.db)   # mouse gene annotation (assumed from the later symbol-to-Ensembl mapping)

2 Analysis of haematopoietic stem cells

2.1 Overview

To introduce most of the concepts of scRNA-seq data analysis, we use a relatively simple dataset from a study of haematopoietic stem cells (HSCs) (N. K. Wilson et al. 2015). Single mouse HSCs were isolated into microtiter plates and libraries were prepared for 96 cells using the Smart-seq2 protocol. A constant amount of spike-in RNA from the External RNA Controls Consortium (ERCC) was also added to each cell’s lysate prior to library preparation. High-throughput sequencing was performed and the expression of each gene was quantified by counting the total number of reads mapped to its exonic regions. Similarly, the quantity of each spike-in transcript was measured by counting the number of reads mapped to the spike-in reference sequences. Counts for all genes/transcripts in each cell were obtained from the NCBI Gene Expression Omnibus (GEO) as a supplementary file under the accession number GSE61533.

For simplicity, we forego a description of the read processing steps required to generate the count matrix, i.e., read alignment and counting into features. These steps have been described in some detail elsewhere (Love et al. 2015; Y. Chen, Lun, and Smyth 2016), and are largely the same for bulk and single-cell data. The only additional consideration is that the spike-in information must be included in the pipeline. Typically, spike-in sequences can be included as additional FASTA files during genome index building prior to alignment, while genomic intervals for both spike-in transcripts and endogenous genes can be concatenated into a single GTF file prior to counting. For users favouring an R-based approach to read alignment and counting, we suggest using the methods in the Rsubread package (Liao, Smyth, and Shi 2013; Liao, Smyth, and Shi 2014). Alternatively, rapid quantification of expression with alignment-free methods such as kallisto (Bray et al. 2016) or Salmon (Patro, Duggal, and Kingsford 2015) can be performed using the functions runKallisto and runSalmon in the scater package.

2.2 Count loading

The first task is to load the count matrix into memory. In this case, some work is required to retrieve the data from the Gzip-compressed Excel format. Each row of the matrix represents an endogenous gene or a spike-in transcript, and each column represents a single HSC.

In [5]:
all.counts <-"GSE61533_HTSEQ_count_results.xls", sheet=1))   # read.xls() from the gdata package (assumed); file name from the download step above
rownames(all.counts) <- all.counts$ID
all.counts <- as.matrix(all.counts[,-1])

For convenience, the counts for spike-in transcripts and endogenous genes are stored in a SingleCellExperiment object from the SingleCellExperiment package.

In [6]:
sce <- SingleCellExperiment(list(counts=all.counts))
dim(sce)
38498 96

We identify the rows corresponding to ERCC spike-in transcripts from the row names. We store this information in the SingleCellExperiment object for future use. This is necessary as spike-ins require special treatment in some downstream steps such as normalization.

In [7]:
is.spike <- grepl("^ERCC", rownames(sce))
isSpike(sce, "ERCC") <- is.spike
summary(is.spike)
   Mode   FALSE    TRUE 
logical   38406      92 

We also identify the rows corresponding to mitochondrial genes, which is useful for quality control. In this case, the information can be obtained from the row names. More generally, though, identifying mitochondrial genes from standard identifiers like Ensembl requires extra annotation (this will be discussed later in more detail).

In [8]:
is.mito <- grepl("^mt-", rownames(sce))
summary(is.mito)
   Mode   FALSE    TRUE 
logical   38461      37 

2.3 Quality control on the cells

2.3.1 Defining the quality control metrics

Low-quality cells need to be removed to ensure that technical effects do not distort downstream analysis results. We use several quality control (QC) metrics:

The library size is defined as the total sum of counts across all features, i.e., genes and spike-in transcripts. Cells with small library sizes are of low quality as the RNA has not been efficiently captured (i.e., converted into cDNA and amplified) during library preparation.

The number of expressed features in each cell is defined as the number of features with non-zero counts for that cell. Any cell with very few expressed genes is likely to be of poor quality as the diverse transcript population has not been successfully captured.

The proportion of reads mapped to spike-in transcripts is calculated relative to the library size for each cell. High proportions are indicative of poor-quality cells, where endogenous RNA has been lost during processing (e.g., due to cell lysis or RNA degradation). The same amount of spike-in RNA was added to each cell, so an enrichment in spike-in counts is symptomatic of loss of endogenous RNA.

In the absence of spike-in transcripts, the proportion of reads mapped to genes in the mitochondrial genome can also be used. High proportions are indicative of poor-quality cells (Islam et al. 2014; Ilicic et al. 2016), possibly because of loss of cytoplasmic RNA from perforated cells. The reasoning is that mitochondria are larger than individual transcript molecules and less likely to escape through tears in the cell membrane.
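The metrics above can be computed directly from any count matrix. A minimal NumPy sketch on toy data (an illustration of the definitions only, not the scater implementation):

```python
import numpy as np

# Toy counts: 4 features x 3 cells; the last feature is a spike-in transcript.
counts = np.array([
    [10, 0, 5],
    [ 0, 0, 2],
    [ 4, 1, 0],
    [ 6, 1, 3],   # spike-in row
])
is_spike = np.array([False, False, False, True])

lib_size = counts.sum(axis=0)                  # library size per cell
n_expressed = (counts > 0).sum(axis=0)         # features with non-zero counts
spike_pct = 100 * counts[is_spike].sum(axis=0) / lib_size

print(lib_size)      # [20  2 10]
print(n_expressed)   # [3 2 3]
print(spike_pct)     # [30. 50. 30.]
```

The middle cell illustrates the poor-quality pattern: a tiny library and an elevated spike-in proportion.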

For each cell, we calculate these quality control metrics using the calculateQCMetrics function from the scater package (McCarthy et al. 2016). These are stored in the row- and column-wise metadata of the SingleCellExperiment for future reference.

In [9]:
sce <- calculateQCMetrics(sce, feature_controls=list(ERCC=is.spike, Mt=is.mito))
head(colnames(colData(sce)))
  1. 'total_features'
  2. 'log10_total_features'
  3. 'total_counts'
  4. 'log10_total_counts'
  5. 'pct_counts_top_50_features'
  6. 'pct_counts_top_100_features'

The distributions of these metrics are shown in Figure 1. The aim is to remove putative low-quality cells that have low library sizes, low numbers of expressed features, and high spike-in (or mitochondrial) proportions.

In [10]:
par(mfrow=c(2,2), mar=c(5.1, 4.1, 0.1, 0.1))
hist(sce$total_counts/1e6, xlab="Library sizes (millions)", main="", 
    breaks=20, col="grey80", ylab="Number of cells")
hist(sce$total_features, xlab="Number of expressed genes", main="", 
    breaks=20, col="grey80", ylab="Number of cells")
hist(sce$pct_counts_ERCC, xlab="ERCC proportion (%)", 
    ylab="Number of cells", breaks=20, main="", col="grey80")
hist(sce$pct_counts_Mt, xlab="Mitochondrial proportion (%)", 
    ylab="Number of cells", breaks=20, main="", col="grey80")

Figure 1: Histograms of various QC metrics for all cells in the HSC data set. This includes the library sizes, number of expressed genes, and proportion of reads mapped to spike-in transcripts or mitochondrial genes.

2.3.2 Removing low-quality cells based on outliers

Picking a threshold for these metrics is not straightforward as their absolute values depend on the experimental protocol. For example, sequencing to greater depth will lead to more reads and more expressed features, regardless of the quality of the cells. Similarly, using more spike-in RNA in the protocol will result in higher spike-in proportions. To obtain an adaptive threshold, we assume that most of the dataset consists of high-quality cells, and identify cells that are outliers for the various QC metrics.

Outliers are defined based on the median absolute deviation (MAD) from the median value of each metric across all cells. We remove cells with log-library sizes that are more than 3 MADs below the median log-library size. A log-transformation improves resolution at small values, especially when the MAD of the raw values is comparable to or greater than the median. We also remove cells where the log-transformed number of expressed genes is 3 MADs below the median value.
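The MAD-based rule can be sketched as follows. This mimics the logic described above (and uses the same 1.4826 scaling constant as R's mad()), but it is an illustration on toy library sizes, not scran's implementation:

```python
import numpy as np

def is_outlier_lower(x, nmads=3.0, log=True):
    """Flag values more than `nmads` MADs below the median."""
    vals = np.log10(x) if log else np.asarray(x, dtype=float)
    med = np.median(vals)
    mad = np.median(np.abs(vals - med)) * 1.4826   # scaled MAD, as in R's mad()
    return vals < med - nmads * mad

# Five ordinary cells and one with a drastically smaller library:
lib_sizes = np.array([1e6, 1.2e6, 0.9e6, 1.1e6, 1.05e6, 2e3])
print(is_outlier_lower(lib_sizes))   # only the last cell is flagged
```

Without the log-transformation, the single extreme value would also inflate the MAD itself; working on the log scale keeps the threshold sensible.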

In [11]:
libsize.drop <- isOutlier(sce$total_counts, nmads=3, type="lower", log=TRUE)
feature.drop <- isOutlier(sce$total_features, nmads=3, type="lower", log=TRUE)

We identify outliers for the proportion-based metrics in a similar manner. Here, no transformation is required as we are identifying large outliers, for which the distinction should be fairly clear on the raw scale. We do not use the mitochondrial proportions as we already have the spike-in proportions for this data set.

In [12]:
spike.drop <- isOutlier(sce$pct_counts_ERCC, nmads=3, type="higher")

Subsetting by column will retain only the high-quality cells that pass each filter described above. We examine the number of cells removed by each filter as well as the total number of retained cells. Removal of a substantial proportion of cells (> 10%) may be indicative of an overall issue with data quality.

In [13]:
sce <- sce[,!(libsize.drop | feature.drop | spike.drop)]
data.frame(ByLibSize=sum(libsize.drop), ByFeature=sum(feature.drop),
    BySpike=sum(spike.drop), Remaining=ncol(sce))
  ByLibSize ByFeature BySpike Remaining
1         2         2       3        92

2.3.3 Assumptions of outlier identification

We have already mentioned the assumption that most cells are of high quality. This is usually reasonable, and can be experimentally supported in some situations by visually checking that the cells are intact (e.g., on the microwell plate or in the microfluidics system). Another assumption is that the QC metrics are independent of the biological state of each cell. This ensures that the outlier values for these metrics are driven by technical factors rather than biological processes. Thus, removing cells based on the metrics will not misrepresent the biology in downstream analyses.

The second assumption is most likely to be violated in highly heterogeneous cell populations. For example, some cell types may naturally have less RNA or express fewer genes than other cell types. Such cell types are more likely to be considered outliers and removed, even if they are of high quality. The use of the MAD mitigates this problem by accounting for biological variability in the QC metrics. A heterogeneous population should have higher variability in the metrics amongst high-quality cells, increasing the MAD and reducing the chance of incorrectly removing particular cell types. Nonetheless, our cell-filtering procedure may not be appropriate in extreme cases where one cell type is very different from the others.

Systematic differences in the QC metrics can be handled to some extent using the batch argument in the isOutlier function. This is obviously useful for batch effects caused by known differences in experimental processing, e.g., batches that were sequenced at different depths or had different amounts of spike-in RNA added. It may also be useful if one cell type is known a priori to have systematically fewer expressed genes or lower RNA content. Analyzing all cell types together would, at best, unnecessarily inflate the MAD and compromise the removal of low-quality cells; at worst, it could lead to the loss of an entire cell type.

2.3.4 Alternative approaches to quality control

An alternative approach to quality control is to set pre-defined thresholds on each QC metric. For example, we might remove all cells with library sizes below 100000 and numbers of expressed genes below 4000. This generally requires a great deal of experience to determine appropriate thresholds for each experimental protocol and biological system. Indeed, even with the same protocol and system, the appropriate threshold can vary from run to run due to the vagaries of RNA capture and sequencing.

Another strategy is to perform a principal components analysis (PCA) based on the quality metrics for each cell, e.g., the total number of reads, the total number of features and the proportion of mitochondrial or spike-in reads. Outliers on a PCA plot may be indicative of low-quality cells that have aberrant technical properties compared to the (presumed) majority of high-quality cells. In Figure 2, no obvious outliers are present, which is consistent with the removal of suspect cells in the preceding quality control steps.
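The idea can be sketched with a standardized PCA on a toy metric matrix (a NumPy illustration of the approach, not the scater implementation):

```python
import numpy as np

# Toy QC metric matrix: cells x metrics (library size, n genes, spike %).
qc = np.array([
    [1.00e6, 8000,  5.0],
    [1.10e6, 8200,  4.5],
    [0.95e6, 7900,  5.5],
    [1.05e6, 8100,  5.0],
    [0.10e6,  900, 60.0],   # a suspect cell: small library, few genes, high spike %
])

# Standardize each metric, then obtain principal component coordinates via SVD.
z = (qc - qc.mean(axis=0)) / qc.std(axis=0)
u, s, vt = np.linalg.svd(z, full_matrices=False)
pcs = u * s   # cell coordinates on each PC

# The suspect cell sits far from the others along PC1.
dist = np.abs(pcs[:, 0] - np.median(pcs[:, 0]))
print(dist.argmax())
```

Because all of its metrics deviate in a correlated way, the aberrant cell dominates the first component and is easy to spot on the PCA plot.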

In [14]:
fontsize <- theme(axis.text=element_text(size=12), axis.title=element_text(size=16))
plotPCA(sce, pca_data_input="pdata") + fontsize
The following selected_variables were not found in colData(object):
    pct_counts_feature_controls, total_features_feature_controls,
    log10_total_counts_feature_controls
Other variables from colData(object) can be used by specifying a vector of variable names as the selected_variables argument.
PCA is being conducted using the following variables:
    pct_counts_top_100_features, total_features, log10_total_counts_endogenous

Figure 2: PCA plot for cells in the HSC dataset, constructed using quality metrics. The first and second components are shown on each axis, along with the percentage of total variance explained by each component. Bars represent the coordinates of the cells on each axis.

Methods like PCA-based outlier detection and support vector machines can provide more power to distinguish low-quality cells from high-quality counterparts (Ilicic et al. 2016). This is because they are able to detect subtle patterns across many quality metrics simultaneously. However, this comes at some cost to interpretability, as the reason for removing a given cell may not always be obvious. Thus, for this workflow, we will use the simple approach whereby each quality metric is considered separately. Users interested in the more sophisticated approaches are referred to the scater and cellity packages.

For completeness, we note that outliers can also be identified based on the gene expression profiles, rather than QC metrics. However, we consider this to be a risky strategy as it can remove high-quality cells in rare populations.

2.4 Classification of cell cycle phase

We use the prediction method described by Scialdone et al. (2015) to classify cells into cell cycle phases based on the gene expression data. Using a training dataset, the sign of the difference in expression between two genes was computed for each pair of genes. Pairs with changes in the sign across cell cycle phases were chosen as markers. Cells in a test dataset can then be classified into the appropriate phase, based on whether the observed sign for each marker pair is consistent with one phase or another. This approach is implemented in the cyclone function using a pre-trained set of marker pairs for mouse data. (Some additional work is necessary to match the gene symbols in the data to the Ensembl annotation in the pre-trained marker set.)
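The scoring idea behind the pair-based classifier can be sketched as follows, using hypothetical marker pairs (not scran's trained set): a cell's score for a phase is the fraction of that phase's marker pairs (a, b) where gene a is expressed more highly than gene b.

```python
def phase_score(expr, pairs):
    """expr: dict of gene -> expression in one cell; pairs: list of (a, b) marker pairs.
    Returns the fraction of informative pairs with expr[a] > expr[b]."""
    informative = [(a, b) for a, b in pairs if expr.get(a, 0) != expr.get(b, 0)]
    if not informative:
        return 0.0
    hits = sum(expr.get(a, 0) > expr.get(b, 0) for a, b in informative)
    return hits / len(informative)

# Hypothetical G2/M marker pairs and one cell's expression values:
expr = {"Ccnb1": 9.0, "Cdk1": 8.0, "Ccne1": 1.0, "Gapdh": 5.0}
g2m_pairs = [("Ccnb1", "Ccne1"), ("Cdk1", "Gapdh")]
print(phase_score(expr, g2m_pairs))   # 1.0: every pair is consistent with G2/M
```

Ties are skipped as uninformative, which is also why lowly expressed genes can still contribute: a zero count paired against an expressed gene yields a clear sign.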

In [15]:
mm.pairs <- readRDS(system.file("exdata", "mouse_cycle_markers.rds", package="scran"))   # pre-trained mouse marker pairs shipped with scran
ensembl <- mapIds(, keys=rownames(sce), keytype="SYMBOL", column="ENSEMBL")   # annotation object (assumed)
assignments <- cyclone(sce, mm.pairs, gene.names=ensembl)
'select()' returned 1:many mapping between keys and columns

The cyclone result for each cell in the HSC dataset is shown in Figure 3. Each cell is assigned a score for each phase, with a higher score corresponding to a higher probability that the cell is in that phase. We focus on the G1 and G2/M scores as these are the most informative for classification.

In [16]:
plot(assignments$score$G1, assignments$score$G2M, 
     xlab="G1 score", ylab="G2/M score", pch=16)

Figure 3: Cell cycle phase scores from applying the pair-based classifier on the HSC dataset, where each point represents a cell.

Cells are classified as being in G1 phase if the G1 score is above 0.5 and greater than the G2/M score; in G2/M phase if the G2/M score is above 0.5 and greater than the G1 score; and in S phase if neither score is above 0.5. Here, the vast majority of cells are classified as being in G1 phase. We save these assignments into the SingleCellExperiment object for later use.
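This decision rule is simple enough to state as code (a direct transcription of the rule above, not scran's implementation):

```python
def assign_phase(g1, g2m):
    """Classify a cell from its G1 and G2/M scores."""
    if g1 > 0.5 and g1 > g2m:
        return "G1"
    if g2m > 0.5 and g2m > g1:
        return "G2M"
    return "S"

print(assign_phase(0.9, 0.1))   # G1
print(assign_phase(0.2, 0.8))   # G2M
print(assign_phase(0.3, 0.4))   # S: neither score exceeds 0.5
```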

In [17]:
sce$phases <- assignments$phases
table(sce$phases)
 G1 G2M   S 
 89   2   1 

Pre-trained classifiers are available in scran for human and mouse data. While the mouse classifier used here was trained on data from embryonic stem cells, it is still accurate for other cell types (Scialdone et al. 2015). This may be due to the conservation of the transcriptional program associated with the cell cycle (Bertoli, Skotheim, and Bruin 2013; Conboy et al. 2007). The pair-based method is also a non-parametric procedure that is robust to most technical differences between datasets.

Comments from Aaron:

  • To remove confounding effects due to cell cycle phase, we can filter the cells to only retain those in a particular phase (usually G1) for downstream analysis. Alternatively, if a non-negligible number of cells are in other phases, we can use the assigned phase as a blocking factor. This protects against cell cycle effects without discarding information, and will be discussed later in more detail.
  • The classifier may not be accurate for data that are substantially different from those used in the training set, e.g., due to the use of a different protocol. In such cases, users can construct a custom classifier from their own training data using the sandbag function. This will also be necessary for other model organisms where pre-trained classifiers are not available.
  • Do not filter out low-abundance genes before applying cyclone. Even if a gene is not expressed in any cell, it may still be useful for classification if it is phase-specific. Its lack of expression relative to other genes will still yield informative pairs, and filtering them out would reduce power.

2.5 Examining gene-level expression metrics

2.5.1 Inspecting the most highly expressed genes

We also look at the identities of the most highly expressed genes (Figure 4). This should generally be dominated by constitutively expressed transcripts, such as those for ribosomal or mitochondrial proteins. The presence of other classes of features may be cause for concern if they are not consistent with expected biology. For example, a top set containing many spike-in transcripts suggests that too much spike-in RNA was added during library preparation, while the absence of ribosomal proteins and/or the presence of their pseudogenes are indicative of suboptimal alignment.

In [18]:
plotQC(sce, type = "highest-expression", n=50) + fontsize

Figure 4: Percentage of total counts assigned to the top 50 most highly-abundant features in the HSC dataset. For each feature, each bar represents the percentage assigned to that feature for a single cell, while the circle represents the average across all cells. Bars are coloured by the total number of expressed features in each cell, while circles are coloured according to whether the feature is labelled as a control feature.

2.5.2 Filtering out low-abundance genes

Low-abundance genes are problematic as zero or near-zero counts do not contain much information for reliable statistical inference (Bourgon, Gentleman, and Huber 2010). These genes typically do not provide enough evidence to reject the null hypothesis during testing, yet they still increase the severity of the multiple testing correction. In addition, the discreteness of the counts may interfere with statistical procedures, e.g., by compromising the accuracy of continuous approximations. Thus, low-abundance genes are often removed in many RNA-seq analysis pipelines before the application of downstream methods.

The “optimal” choice of filtering strategy depends on the downstream application. A more aggressive filter is usually required to remove discreteness (e.g., for normalization) compared to that required for removing underpowered tests. For hypothesis testing, the filter statistic should also be independent of the test statistic under the null hypothesis. Thus, we (or the relevant function) will filter at each step as needed, rather than applying a single filter for the entire analysis.

Several metrics can be used to define low-abundance genes. The most obvious is the average count for each gene, computed across all cells in the data set. We calculate this using the calcAverage function, which also performs some adjustment for library size differences between cells. We typically observe a peak of moderately expressed genes following a plateau of lowly expressed genes (Figure 5).
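The library-size adjustment can be sketched as follows: each cell's counts are rescaled by a simple library-size factor before averaging, so that deeply sequenced cells do not dominate the per-gene average. This is an illustration of the idea only; calcAverage's exact behaviour may differ.

```python
import numpy as np

# Toy counts: 3 genes x 3 cells; the last cell was sequenced far less deeply.
counts = np.array([
    [10, 20, 0],
    [ 0,  2, 1],
    [90, 78, 9],
])

lib = counts.sum(axis=0)            # library sizes: [100, 100, 10]
sf = lib / lib.mean()               # simple library-size factors
ave = (counts / sf).mean(axis=1)    # adjusted average count per gene
print(ave)
```

Without the adjustment, gene 2 in the shallow cell would contribute a raw count of 9 to the average even though, relative to its library, it is expressed as highly as in the other cells.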

In [19]:
ave.counts <- calcAverage(sce)
hist(log10(ave.counts), breaks=100, main="", col="grey80", 
    xlab=expression(Log[10]~"average count"))

Figure 5: Histogram of log-average counts for all genes in the HSC dataset.

A minimum threshold can be applied to this value to filter out genes that are lowly expressed. The example below demonstrates how we could remove genes with average counts less than 1. The number of TRUE values in demo.keep corresponds to the number of retained rows/genes after filtering.

In [20]:
demo.keep <- ave.counts >= 1
filtered.sce <- sce[demo.keep,]
summary(demo.keep)
   Mode   FALSE    TRUE 
logical   24439   14059 

We also examine the number of cells that express each gene. This is closely related to the average count for most genes, as expression in many cells will result in a higher average (Figure 6). Genes expressed in very few cells are often uninteresting as they are driven by amplification artifacts (though they may also arise from rare populations). We could then remove genes that are expressed in fewer than n cells.

In [21]:
num.cells <- nexprs(sce, byrow=TRUE)
smoothScatter(log10(ave.counts), num.cells, ylab="Number of cells", 
    xlab=expression(Log[10]~"average count"))

Figure 6: The number of cells expressing each gene in the HSC data set, plotted against the log-average count. Intensity of colour corresponds to the number of genes at any given location.

As mentioned above, these filters will be applied at each step (automatically, in most cases, within the relevant function). This ensures that the most appropriate filter is used in each application. Nonetheless, we remove genes that are not expressed in any cell to reduce computational work in downstream steps. Such genes provide no information and would be removed by any filtering strategy.

In [22]:
to.keep <- num.cells > 0
sce <- sce[to.keep,]
summary(to.keep)
   Mode   FALSE    TRUE 
logical   17229   21269 

2.6 Normalization of cell-specific biases

2.6.1 Using the deconvolution method to deal with zero counts

Read counts are subject to differences in capture efficiency and sequencing depth between cells (Stegle, Teichmann, and Marioni 2015). Normalization is required to eliminate these cell-specific biases prior to downstream quantitative analyses. This is often done by assuming that most genes are not differentially expressed (DE) between cells. Any systematic difference in count size across the non-DE majority of genes between two cells is assumed to represent bias and is removed by scaling. More specifically, “size factors” are calculated that represent the extent to which counts should be scaled in each library.
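
The logic of scaling normalization can be illustrated with a toy example (hypothetical counts, not from this data set): if one cell's counts are systematically about twice another's across the non-DE majority of genes, the ratio of their size factors should be about two, and dividing each cell's counts by its size factor removes the difference.

```r
# Hypothetical counts for four non-DE genes in two cells:
cellA <- c(10, 4, 2, 6)
cellB <- c(20, 8, 5, 12)  # roughly twice the depth of cellA

# The systematic fold difference across the non-DE majority:
median(cellB / cellA)

# Scaling by (relative) size factors removes the bias:
sf <- c(A=1, B=2)
cellA / sf["A"]
cellB / sf["B"]
```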

Size factors can be computed with several different approaches, e.g., using the estimateSizeFactorsFromMatrix function in the DESeq2 package (Anders and Huber 2010; Love, Huber, and Anders 2014), or with the calcNormFactors function (Robinson and Oshlack 2010) in the edgeR package. However, single-cell data can be problematic for these bulk data-based methods due to the dominance of low and zero counts. To overcome this, we pool counts from many cells to increase the count size for accurate size factor estimation (Lun, Bach, and Marioni 2016). Pool-based size factors are then “deconvolved” into cell-based factors for cell-specific normalization.

In [23]:
sce <- computeSumFactors(sce)
summary(sizeFactors(sce))
Warning message in .computeSumFactors(assay(x, i = assay.type), subset.row = subset.row, :
“not enough cells in at least one cluster for some 'sizes'”
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.4076  0.8101  0.9546  1.0000  1.1585  2.0264 

In this case, the size factors are tightly correlated with the library sizes for all cells (Figure 7). This suggests that the systematic differences between cells are primarily driven by differences in capture efficiency or sequencing depth. Any DE between cells would yield a non-linear trend between the total count and size factor, and/or increased scatter around the trend. This does not occur here as strong DE is unlikely to exist within a homogeneous population of cells.

In [24]:
plot(sizeFactors(sce), sce$total_counts/1e6, log="xy",
    ylab="Library size (millions)", xlab="Size factor")

Figure 7: Size factors from deconvolution, plotted against library sizes for all cells in the HSC dataset. Axes are shown on a log-scale.

Comments from Aaron:

  • While the deconvolution approach is robust to the high frequency of zeroes in scRNA-seq data, it will eventually fail if too many counts are zero. This manifests as negative size factors, which are obviously nonsensical. To avoid this, the computeSumFactors function will automatically remove low-abundance genes prior to the calculation of size factors. Genes with an average count below a specified threshold (min.mean) are ignored. For read count data, the default value of 1 is usually satisfactory. For UMI data, counts are lower so a threshold of 0.1 is recommended.
  • Cell-based QC should always be performed prior to normalization, to remove cells with very low numbers of expressed genes. If this is not done, the computeSumFactors function may yield negative size factors for low-quality cells.
  • The sizes argument can be used to specify the number of pool sizes to use to compute the size factors. More sizes yields more precise estimates at the cost of some computational time and memory. In general, sizes should not be below 20 cells, to ensure that there are sufficient non-zero expression values in each pool. We also recommend that the total number of cells should be at least 100 for effective pooling.
  • For highly heterogeneous data sets, it is advisable to perform a rough clustering of the cells. This can be done with the quickCluster function and the results passed to computeSumFactors via the cluster argument. Cells in each cluster are normalized separately, and the size factors are rescaled to be comparable across clusters. This avoids the need to assume that most genes are non-DE across the entire population - only a non-DE majority is required between pairs of clusters. We demonstrate this approach later with a larger data set.
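
The clustering approach described in the last point can be sketched as follows, using the quickCluster function and the cluster argument mentioned above (argument names may differ across scran versions):

```r
# Rough pre-clustering, so that only a non-DE majority between
# pairs of clusters is assumed rather than across all cells:
clusters <- quickCluster(sce)
sce <- computeSumFactors(sce, cluster=clusters)
```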

2.6.2 Computing separate size factors for spike-in transcripts

Size factors computed from the counts for endogenous genes are usually not appropriate for normalizing the counts for spike-in transcripts. Consider an experiment without library quantification, i.e., the amount of cDNA from each library is not equalized prior to pooling and multiplexed sequencing. Here, cells containing more RNA have greater counts for endogenous genes and thus larger size factors to scale down those counts. However, the same amount of spike-in RNA is added to each cell during library preparation. This means that the counts for spike-in transcripts are not subject to the effects of RNA content. Attempting to normalize the spike-in counts with the gene-based size factors will lead to over-normalization and incorrect quantification of expression. Similar reasoning applies in cases where library quantification is performed. For a constant total amount of cDNA, any increases in endogenous RNA content will suppress the coverage of spike-in transcripts. As a result, the bias in the spike-in counts will be opposite to that captured by the gene-based size factor.

To ensure normalization is performed correctly, we compute a separate set of size factors for the spike-in set. For each cell, the spike-in-specific size factor is defined as the total count across all transcripts in the spike-in set. This assumes that none of the spike-in transcripts are differentially expressed, which is reasonable given that the same amount and composition of spike-in RNA should have been added to each cell. (See below for a more detailed discussion on spike-in normalization.) These size factors are stored in a separate field of the SingleCellExperiment object by setting general.use=FALSE in computeSpikeFactors. This ensures that they will only be used with the spike-in transcripts but not the endogenous genes.
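
The definition above can be written out manually. This is a conceptual sketch of the calculation, not necessarily the exact computeSpikeFactors implementation:

```r
# Spike-in size factors are proportional to the total spike-in
# count in each cell, rescaled to unit mean across cells (sketch):
is.spike <- isSpike(sce, "ERCC")
spike.totals <- colSums(counts(sce)[is.spike, ])
spike.sf <- spike.totals / mean(spike.totals)
```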

In [25]:
sce <- computeSpikeFactors(sce, type="ERCC", general.use=FALSE)

2.6.3 Applying the size factors to normalize gene expression

The count data are used to compute normalized log-expression values for use in downstream analyses. Each value is defined as the log2-ratio of each count to the size factor for the corresponding cell, after adding a prior count of 1 to avoid undefined values at zero counts. Division by the size factor ensures that any cell-specific biases are removed. If spike-in-specific size factors are present in sce, they will be automatically applied to normalize the spike-in transcripts separately from the endogenous genes.
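
The definition of each normalized value can be written out directly. A manual equivalent for the endogenous genes (a sketch, assuming the size factors have already been centred at unity as normalize does) would be:

```r
# logcounts[g, c] = log2( counts[g, c] / sizeFactor[c] + 1 )
sf <- sizeFactors(sce)
manual.lc <- log2(t(t(counts(sce)) / sf) + 1)
```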

In [26]:
sce <- normalize(sce)

The log-transformation is useful as it means that any differences in the values directly represent log2-fold changes in expression between cells. This is usually more relevant than the absolute differences in coverage, which need to be interpreted in the context of the overall abundance. The log-transformation also provides some measure of variance stabilization (Law et al. 2014), so that high-abundance genes with large variances do not dominate downstream analyses. The computed values are stored as a "logcounts" matrix in addition to the other assay elements.

2.7 Checking for confounding technical factors

We check whether there are technical factors that contribute substantially to the heterogeneity of gene expression. If so, the factor may need to be regressed out to ensure that it does not inflate the variances or introduce spurious correlations. For this dataset, the simple experimental design means that there are no plate or batch effects to examine. Instead, we use the (log-transformed) total count for the spike-in transcripts as a proxy for the technical bias in each sample. This is based on the fact that the same amount of spike-in RNA should have been added to each cell. Thus, any association of gene expression with this factor is not biologically interesting and should be removed.

For each gene, we calculate the percentage of the variance of the expression values that is explained by the spike-in totals (Figure 8). The percentages are generally small (1-3%), indicating that the expression profiles of most genes are not strongly associated with this factor. This result is consistent with successful removal of cell-specific biases by scaling normalization. Thus, the spike-in total does not need to be explicitly modelled in our downstream analyses.
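
For intuition, the percentage for a single gene is equivalent to the R-squared from regressing its normalized log-expression values on the spike-in totals. This is a sketch of that per-gene calculation, not the internals of plotExplanatoryVariables:

```r
# Percentage of variance explained for one gene (sketch):
x <- sce$log10_total_counts_ERCC
y <- logcounts(sce)[1, ]
100 * summary(lm(y ~ x))$r.squared
```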

In [27]:
plotExplanatoryVariables(sce, variables=c("total_counts_ERCC", 
    "log10_total_counts_ERCC")) + fontsize

Figure 8: Density plot of the percentage of variance explained by the (log-transformed) total spike-in counts across all genes in the HSC dataset. For each gene, the percentage of the variance of the normalized log-expression values across cells that is explained by each factor is calculated. Each curve corresponds to one factor and represents the distribution of percentages across all genes.

Note that the use of the spike-in total as an accurate proxy for the relative technical bias assumes that no library quantification was performed. Otherwise, the coverage of the spike-in transcripts would be dependent on the total amount of endogenous RNA in each cell. (Specifically, if the same amount of cDNA is used for sequencing per cell, any increase in the amount of endogenous RNA will suppress the coverage of the spike-in transcripts.) This means that the spike-in totals could be confounded with genuine biological effects associated with changes in RNA content.

2.8 Modelling the technical noise in gene expression

2.8.1 Fitting a trend to the spike-in variances

Variability in the observed expression values across genes can be driven by genuine biological heterogeneity or uninteresting technical noise. To distinguish between these two possibilities, we need to model the technical component of the variance of the expression values for each gene. We use the trendVar function to fit a mean-dependent trend to the variances of the log-expression values for the spike-in transcripts. Recall that the same set of spike-ins was added in the same quantity to each cell. This means that the spike-in transcripts should exhibit no biological variability, i.e., any variance in their counts should be technical in origin.

In [28]:
var.fit <- trendVar(sce, parametric=TRUE, span=0.2)

Given the mean abundance of a gene, the fitted value of the trend can be used as an estimate of the technical component for that gene. The biological component of the variance can then be calculated by subtracting the technical component from the total variance of each gene with the decomposeVar function.

In [29]:
var.out <- decomposeVar(sce, var.fit)
head(var.out)
                    mean      total         bio      tech   p.value FDR
0610005C13Rik 0.41698965  1.9224938 -0.43042666  2.352920 0.8976314   1
0610007P14Rik 6.07800839  8.3082437 -1.24081999  9.549064 0.8072030   1
0610008F07Rik 0.06343649  0.3702253  0.01370032  0.356525 0.3800895   1
0610009B22Rik 3.93889838 11.2012213 -0.58102593 11.782247 0.6135645   1
0610009D07Rik 8.03594711  1.3638512 -0.63517861  1.999030 0.9911969   1
0610009E02Rik 0.27789890  1.0950584 -0.48206851  1.577127 0.9883633   1
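
As a quick sanity check on the decomposition, the biological and technical components should add up to the total variance for each gene:

```r
# bio + tech should equal total (up to numerical precision):
all.equal(var.out$total, var.out$bio + var.out$tech)
```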

We visually inspect the trend to confirm that it corresponds to the spike-in variances (Figure 9). The wave-like shape is typical of the mean-variance trend for log-expression values. A linear increase in the variance is observed as the mean increases from zero, as larger variances are possible when the counts increase. At very high abundances, the effect of sampling noise decreases due to the law of large numbers, resulting in a decrease in the variance.

In [30]:
plot(var.out$mean, var.out$total, pch=16, cex=0.6, xlab="Mean log-expression", 
    ylab="Variance of log-expression")
curve(var.fit$trend(x), col="dodgerblue", lwd=2, add=TRUE)
cur.spike <- isSpike(sce)
points(var.out$mean[cur.spike], var.out$total[cur.spike], col="red", pch=16)

Figure 9: Variance of normalized log-expression values for each gene in the HSC dataset, plotted against the mean log-expression. The blue line represents the mean-dependent trend fitted to the variances of the spike-in transcripts (red).

We check the distribution of expression values for the genes with the largest biological components. This ensures that the variance estimate is not driven by one or two outlier cells (Figure 10).

In [31]:
chosen.genes <- order(var.out$bio, decreasing=TRUE)[1:10]
plotExpression(sce, features=rownames(var.out)[chosen.genes]) + fontsize