HiC data analysis with HiC-PRO and HiTC¶

For more information about HiTC R package: https://bioconductor.org/packages/release/bioc/vignettes/HiTC/inst/doc/HiTC.pdf

Get the data for this tutorial¶

reads¶

wget https://zerkalo.curie.fr/partage/HiC-Pro/HiCPro_testdata.tar.gz && tar  -zxvf HiCPro_testdata.tar.gz

It will generate a folder test_data with 2 subfolders:

dixon_2M
dixon_2M_2

In addition, on file in the current folder: config_test_latest.txt

The Hic-pro config file config_test_latest.txt must be edited.

#¶

reference genome¶

wget ftp://ftp.ccb.jhu.edu/pub/data/bowtie2_indexes/hg19.zip

mkdir hg19

unzip hg19.zip  -d hg19

It will download the bowtie2 index files for human genome hg19.

HiC-pro config file¶

The final file will look like this:

################################

#Please change the variable settings below if necessary

#########################################################################
##Paths and Settings  - Do not edit !
#########################################################################

TMP_DIR = tmp

LOGS_DIR = logs

BOWTIE2_OUTPUT_DIR = bowtie_results

MAPC_OUTPUT = hic_results

RAW_DIR = rawdata

#######################################################################
##SYSTEM - PBS - Start Editing Here !!

### #commenting the PBS lines and keeping N_CPU and LOGFILE
#######################################################################

**N_CPU = 8**

LOGFILE = hicpro.log

#JOB_NAME = IMR90_split

#JOB_MEM = 10gb

#JOB_WALLTIME = 6:00:00

#JOB_QUEUE = batch

#JOB_MAIL = nservant@curie.fr

#########################################################################
##Data
#########################################################################

PAIR1_EXT = _R1

PAIR2_EXT = _R2

#######################################################################
##Alignment options
#######################################################################

FORMAT = phred33

MIN_MAPQ = 0


### #here is the path where we dowloaded the hg19 index

**BOWTIE2_IDX_PATH = hg19/**

BOWTIE2_GLOBAL_OPTIONS = --very-sensitive -L 30 --score-min L,-0.6,-0.2 --end-to-end --reorder

BOWTIE2_LOCAL_OPTIONS =  --very-sensitive -L 20 --score-min L,-0.6,-0.2 --end-to-end --reorder

#######################################################################
##Annotation files
#######################################################################

### #Name of index files. Here for this tutorial we will keep like this, HiC-pro is provided with chrom_hg19.sizes

**REFERENCE_GENOME = hg19**

GENOME_SIZE = chrom_hg19.sizes

#######################################################################
##Allele specific
#######################################################################

ALLELE_SPECIFIC_SNP = 

#######################################################################
##Digestion Hi-C
#######################################################################

### #this file is also provided for hg19 used in this tutorial

GENOME_FRAGMENT = HindIII_resfrag_hg19.bed

LIGATION_SITE = AAGCTAGCTT

MIN_FRAG_SIZE = 100

MAX_FRAG_SIZE = 100000

MIN_INSERT_SIZE = 100

MAX_INSERT_SIZE = 600

#######################################################################
##Hi-C processing
#######################################################################


MIN_CIS_DIST =

GET_ALL_INTERACTION_CLASSES = 1

GET_PROCESS_SAM = 1

RM_SINGLETON = 1

RM_MULTI = 1

RM_DUP = 1


#######################################################################
##Contact Maps
#######################################################################

### #bin size for define the resolution

BIN_SIZE = 500000 1000000

MATRIX_FORMAT = upper


#######################################################################
##ICE Normalization
#######################################################################

MAX_ITER = 100

FILTER_LOW_COUNT_PERC = 0.02

FILTER_HIGH_COUNT_PERC = 0

EPS = 0.1

Now just run the HiC-pro:¶

HiC-Pro -i test_data -o out_put_test_data -c config_test_latest.txt

This will generate the results on out_put_test_data/ including out_put_test_data/hic_results/¶

ls out_put_test_data/hic_results/

data matrix pic

ls out_put_test_data/hic_results/matrix

dixon_2M dixon_2M_2

ls out_put_test_data/hic_results/matrix/dixon_2M

iced raw

Inside iced, we find the normalized matrix for 2 resolutions seted in the config file: 1000000 and 500000¶

for example: dixon_2M_500000_iced.matrix¶

Now we can use HiTC package to import the iced (normalized) matrix¶

# Install HiTC package
source("https://bioconductor.org/biocLite.R")
biocLite("HiTC")

library(HiTC)
showClass("HTCexp")
showClass("HTClist")

Loading required package: IRanges
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from ‘package:stats’:

    IQR, mad, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, cbind, colnames, do.call,
    duplicated, eval, evalq, Filter, Find, get, grep, grepl, intersect,
    is.unsorted, lapply, lengths, Map, mapply, match, mget, order,
    paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind,
    Reduce, rownames, sapply, setdiff, sort, table, tapply, union,
    unique, unsplit, which, which.max, which.min

Loading required package: S4Vectors
Loading required package: stats4

Attaching package: ‘S4Vectors’

The following objects are masked from ‘package:base’:

    colMeans, colSums, expand.grid, rowMeans, rowSums

Loading required package: GenomicRanges
Loading required package: GenomeInfoDb

Class "HTCexp" [package "HiTC"]

Slots:
                              
Name:  intdata     xgi     ygi
Class:  Matrix GRanges GRanges
Class "HTClist" [package "HiTC"]

Slots:
            
Name:  .Data
Class:  list

Extends: 
Class "list", from data part
Class "vector", by class "list", distance 2
Class "AssayData", by class "list", distance 2
Class "vectorORfactor", by class "list", distance 3

Import the iced matrix for sample dixon_2M¶

hic<-importC("/beegfs/group_bit/data/projects/departments/Bioinformatics/bit_HiC_tutorial/out_test_data/hic_results/matrix/dixon_2M/iced/500000/dixon_2M_500000_iced.matrix","../out_test_data/hic_results/matrix/dixon_2M/raw/500000/dixon_2M_500000_ord.bed")

Loading Genomic intervals ...
Reading file ...
Convert 'C' file in HTCexp object(s)

hic

HTClist object of length 325 
25 intra / 300 inter-chromosomal maps

## Descriptive statistics
head(summary(hic))

Quality control¶

CQC(hic, winsize = 1e+06, dev.new=FALSE, hist.dist=FALSE)

Get data ...
Generate quality control plots ...

Define the chr and binsize¶

## Go back to a smaller dataset (chr5, 6, 7) at lower resolution
sset <- reduce(hic, chr=c("chr5","chr6","chr7"))
hic90_500 <- HTClist(mclapply(sset, binningC,binsize=500000, bin.adjust=FALSE, method="sum", step=1))
mapC(hic90_500)

Plotting chr5chr5...
minrange= 0.449  - maxrange= 3.997
Plotting chr5chr6...
minrange= 0.45  - maxrange= 1.625
Plotting chr5chr7...
minrange= 0.438  - maxrange= 2.325
Plotting chr6chr6...
minrange= 0.469  - maxrange= 3.929
Plotting chr6chr7...
minrange= 0.443  - maxrange= 2.601
Plotting chr7chr7...
minrange= 0.489  - maxrange= 4.842

Plot contact maps for ChrX - ChrX¶

hic_x.binned <- binningC(hic$chrXchrX, binsize=500000, method="median", step=3)

Bin size 'xgi' =500871 [3x166957]
Bin size 'ygi' =500871 [3x166957]

## Look at exptected counts chrX 
hicexp <- getExpectedCounts(hic_x.binned, method="loess", stdev=TRUE, plot=TRUE)

Warning message in getExpectedCounts(hic_x.binned, method = "loess", stdev = TRUE, :
“Contact map looks big. Use mean method instead or be sure that the loess fit gives good results.”Lowess fit ...
Standard deviation calculation ...
Delta=772176.12
Calculating stdev ...

mapC(hic_x.binned,title="chrX")

minrange= 0.507  - maxrange= 3.315

# Annotate graphics with gene features:

# Get the bed file from http://genome.ucsc.edu/cgi-bin/hgTables?hgsid=661039579_qIn3z1lzkaAypE9M4fryJAGV12wN&clade=mammal&org=&db=hg19&hgta_group=genes&hgta_track=knownGene&hgta_table=knownGene&hgta_regionType=genome&position=&hgta_outputType=primaryTable&hgta_outFileName=

require(rtracklayer)
gene <- import(file.path("../hg19.bed"),format="bed")
mapC(hic_x.binned,tracks=list(RefSeqGene=gene),maxrange=10,ti="chrX contacts")

minrange= 0.281  - maxrange= 8.729

#The following code shows how to focus on TADs 

TAD <- extractRegion(hic$chr6chr6, chr="chr6", from=50e6, to=58e6)
plot(TAD, maxrange=50, col.pos=c("white", "orange", "red", "black"))

minrange= 0.293  - maxrange= 4.981

## Data Normalization by Expected number of Counts
hiC14norm <- normPerExpected(hiC14, method="loess")
mapC(HTClist(hiC14norm), log.data=TRUE)

Lowess fit ...
Plotting chr14chr14...
minrange= 0.15  - maxrange= 18.621

## Correlation Map of Chromosome 14
intdata(hiC14norm) <- HiTC:::sparseCor(intdata(hiC14norm))
mapC(HTClist(hiC14norm), maxrange=1, minrange=-1,col.pos=c("black", "red"), col.neg=c("blue","black"))

Plotting chr14chr14...
Warning message in max(if (length(x@x) < prod(d)) x@x else x@x[indTri(d[1], upper = x@uplo == :
“no non-missing arguments to max; returning -Inf”Warning message in max(if (length(x@x) < prod(d)) x@x else x@x[indTri(d[1], upper = x@uplo == :
“no non-missing arguments to max; returning -Inf”Warning message in max(x@x, ..., na.rm = na.rm):
“no non-missing arguments to max; returning -Inf”Warning message in min(x@x, ..., na.rm = na.rm):
“no non-missing arguments to min; returning Inf”

# Principal Component Analysis
# empty due to low number of reads
pc <- pca.hic(hiC14, normPerExpected=TRUE, method="loess", npc=1)

Lowess fit ...
Discard 1 'x' intervals
Discard 1 'y' intervals
Discard 4 'x' intervals
Discard 4 'y' intervals
Warning message in pca.hic(hiC14, normPerExpected = TRUE, method = "loess", npc = 1):
“Empty correlation matrix for chr14”

	seq1	seq2	nbreads	nbinteraction	averagefreq	medfreq	sparsity
chr1chr1	chr1	chr1	4309.549061	3006	1.4336	1.0605	0.9879
chr1chr2	chr1	chr2	323.662133	338	0.9576	0.8743	0.9986
chr1chr3	chr1	chr3	312.695056	329	0.9504	0.8555	0.9983
chr1chr4	chr1	chr4	337.958491	351	0.9628	0.8278	0.9982
chr1chr5	chr1	chr5	258.84582	264	0.9805	0.8421	0.9985
chr1chr6	chr1	chr6	244.091414	249	0.9803	0.8697	0.9985