HiC data analysis with HiC-PRO and HiTC

Get the data for this tutorial

reads

wget https://zerkalo.curie.fr/partage/HiC-Pro/HiCPro_testdata.tar.gz && tar  -zxvf HiCPro_testdata.tar.gz

It will generate a folder test_data with 2 subfolders:

dixon_2M
dixon_2M_2

In addition, on file in the current folder: config_test_latest.txt

The Hic-pro config file config_test_latest.txt must be edited.

#

reference genome

wget ftp://ftp.ccb.jhu.edu/pub/data/bowtie2_indexes/hg19.zip

mkdir hg19

unzip hg19.zip  -d hg19

It will download the bowtie2 index files for human genome hg19.

HiC-pro config file

The final file will look like this:

################################

#Please change the variable settings below if necessary

#########################################################################
##Paths and Settings  - Do not edit !
#########################################################################

TMP_DIR = tmp

LOGS_DIR = logs

BOWTIE2_OUTPUT_DIR = bowtie_results

MAPC_OUTPUT = hic_results

RAW_DIR = rawdata

#######################################################################
##SYSTEM - PBS - Start Editing Here !!

### #commenting the PBS lines and keeping N_CPU and LOGFILE
#######################################################################

**N_CPU = 8**

LOGFILE = hicpro.log

#JOB_NAME = IMR90_split

#JOB_MEM = 10gb

#JOB_WALLTIME = 6:00:00

#JOB_QUEUE = batch

#JOB_MAIL = nservant@curie.fr

#########################################################################
##Data
#########################################################################

PAIR1_EXT = _R1

PAIR2_EXT = _R2

#######################################################################
##Alignment options
#######################################################################

FORMAT = phred33

MIN_MAPQ = 0


### #here is the path where we dowloaded the hg19 index

**BOWTIE2_IDX_PATH = hg19/**

BOWTIE2_GLOBAL_OPTIONS = --very-sensitive -L 30 --score-min L,-0.6,-0.2 --end-to-end --reorder

BOWTIE2_LOCAL_OPTIONS =  --very-sensitive -L 20 --score-min L,-0.6,-0.2 --end-to-end --reorder

#######################################################################
##Annotation files
#######################################################################

### #Name of index files. Here for this tutorial we will keep like this, HiC-pro is provided with chrom_hg19.sizes

**REFERENCE_GENOME = hg19**

GENOME_SIZE = chrom_hg19.sizes

#######################################################################
##Allele specific
#######################################################################

ALLELE_SPECIFIC_SNP = 

#######################################################################
##Digestion Hi-C
#######################################################################

### #this file is also provided for hg19 used in this tutorial

GENOME_FRAGMENT = HindIII_resfrag_hg19.bed

LIGATION_SITE = AAGCTAGCTT

MIN_FRAG_SIZE = 100

MAX_FRAG_SIZE = 100000

MIN_INSERT_SIZE = 100

MAX_INSERT_SIZE = 600

#######################################################################
##Hi-C processing
#######################################################################


MIN_CIS_DIST =

GET_ALL_INTERACTION_CLASSES = 1

GET_PROCESS_SAM = 1

RM_SINGLETON = 1

RM_MULTI = 1

RM_DUP = 1


#######################################################################
##Contact Maps
#######################################################################

### #bin size for define the resolution

BIN_SIZE = 500000 1000000

MATRIX_FORMAT = upper


#######################################################################
##ICE Normalization
#######################################################################

MAX_ITER = 100

FILTER_LOW_COUNT_PERC = 0.02

FILTER_HIGH_COUNT_PERC = 0

EPS = 0.1

Now just run the HiC-pro:

HiC-Pro -i test_data -o out_put_test_data -c config_test_latest.txt

This will generate the results on out_put_test_data/ including out_put_test_data/hic_results/

ls out_put_test_data/hic_results/

data matrix pic

ls out_put_test_data/hic_results/matrix

dixon_2M dixon_2M_2

ls out_put_test_data/hic_results/matrix/dixon_2M

iced raw

Inside iced, we find the normalized matrix for 2 resolutions seted in the config file: 1000000 and 500000

for example: dixon_2M_500000_iced.matrix

Now we can use HiTC package to import the iced (normalized) matrix

In [1]:
# Install HiTC package
source("https://bioconductor.org/biocLite.R")
biocLite("HiTC")
In [2]:
library(HiTC)
showClass("HTCexp")
showClass("HTClist")
Loading required package: IRanges
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from ‘package:stats’:

    IQR, mad, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, cbind, colnames, do.call,
    duplicated, eval, evalq, Filter, Find, get, grep, grepl, intersect,
    is.unsorted, lapply, lengths, Map, mapply, match, mget, order,
    paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind,
    Reduce, rownames, sapply, setdiff, sort, table, tapply, union,
    unique, unsplit, which, which.max, which.min

Loading required package: S4Vectors
Loading required package: stats4

Attaching package: ‘S4Vectors’

The following objects are masked from ‘package:base’:

    colMeans, colSums, expand.grid, rowMeans, rowSums

Loading required package: GenomicRanges
Loading required package: GenomeInfoDb
Class "HTCexp" [package "HiTC"]

Slots:
                              
Name:  intdata     xgi     ygi
Class:  Matrix GRanges GRanges
Class "HTClist" [package "HiTC"]

Slots:
            
Name:  .Data
Class:  list

Extends: 
Class "list", from data part
Class "vector", by class "list", distance 2
Class "AssayData", by class "list", distance 2
Class "vectorORfactor", by class "list", distance 3

Import the iced matrix for sample dixon_2M

In [3]:
hic<-importC("/beegfs/group_bit/data/projects/departments/Bioinformatics/bit_HiC_tutorial/out_test_data/hic_results/matrix/dixon_2M/iced/500000/dixon_2M_500000_iced.matrix","../out_test_data/hic_results/matrix/dixon_2M/raw/500000/dixon_2M_500000_ord.bed")
Loading Genomic intervals ...
Reading file ...
Convert 'C' file in HTCexp object(s)
In [4]:
hic
HTClist object of length 325 
25 intra / 300 inter-chromosomal maps
In [83]:
## Descriptive statistics
head(summary(hic))
seq1seq2nbreadsnbinteractionaveragefreqmedfreqsparsity
chr1chr1chr1 chr1 4309.5490613006 1.4336 1.0605 0.9879
chr1chr2chr1 chr2 323.662133 338 0.9576 0.8743 0.9986
chr1chr3chr1 chr3 312.695056 329 0.9504 0.8555 0.9983
chr1chr4chr1 chr4 337.958491 351 0.9628 0.8278 0.9982
chr1chr5chr1 chr5 258.84582 264 0.9805 0.8421 0.9985
chr1chr6chr1 chr6 244.091414 249 0.9803 0.8697 0.9985

Quality control

In [5]:
CQC(hic, winsize = 1e+06, dev.new=FALSE, hist.dist=FALSE)
Get data ...
Generate quality control plots ...

Define the chr and binsize

In [84]:
## Go back to a smaller dataset (chr5, 6, 7) at lower resolution
sset <- reduce(hic, chr=c("chr5","chr6","chr7"))
hic90_500 <- HTClist(mclapply(sset, binningC,binsize=500000, bin.adjust=FALSE, method="sum", step=1))
mapC(hic90_500)
Plotting chr5chr5...
minrange= 0.449  - maxrange= 3.997
Plotting chr5chr6...
minrange= 0.45  - maxrange= 1.625
Plotting chr5chr7...
minrange= 0.438  - maxrange= 2.325
Plotting chr6chr6...
minrange= 0.469  - maxrange= 3.929
Plotting chr6chr7...
minrange= 0.443  - maxrange= 2.601
Plotting chr7chr7...
minrange= 0.489  - maxrange= 4.842

Plot contact maps for ChrX - ChrX

In [10]:
hic_x.binned <- binningC(hic$chrXchrX, binsize=500000, method="median", step=3)
Bin size 'xgi' =500871 [3x166957]
Bin size 'ygi' =500871 [3x166957]
In [12]:
## Look at exptected counts chrX 
hicexp <- getExpectedCounts(hic_x.binned, method="loess", stdev=TRUE, plot=TRUE)
Warning message in getExpectedCounts(hic_x.binned, method = "loess", stdev = TRUE, :
“Contact map looks big. Use mean method instead or be sure that the loess fit gives good results.”Lowess fit ...
Standard deviation calculation ...
Delta=772176.12
Calculating stdev ... 
In [24]:
mapC(hic_x.binned,title="chrX")
minrange= 0.507  - maxrange= 3.315
In [ ]:
# Annotate graphics with gene features:

# Get the bed file from http://genome.ucsc.edu/cgi-bin/hgTables?hgsid=661039579_qIn3z1lzkaAypE9M4fryJAGV12wN&clade=mammal&org=&db=hg19&hgta_group=genes&hgta_track=knownGene&hgta_table=knownGene&hgta_regionType=genome&position=&hgta_outputType=primaryTable&hgta_outFileName=
In [25]:
require(rtracklayer)
gene <- import(file.path("../hg19.bed"),format="bed")
mapC(hic_x.binned,tracks=list(RefSeqGene=gene),maxrange=10,ti="chrX contacts")
minrange= 0.281  - maxrange= 8.729
In [86]:
#The following code shows how to focus on TADs 

TAD <- extractRegion(hic$chr6chr6, chr="chr6", from=50e6, to=58e6)
plot(TAD, maxrange=50, col.pos=c("white", "orange", "red", "black"))
minrange= 0.293  - maxrange= 4.981
In [89]:
## Data Normalization by Expected number of Counts
hiC14norm <- normPerExpected(hiC14, method="loess")
mapC(HTClist(hiC14norm), log.data=TRUE)
Lowess fit ...
Plotting chr14chr14...
minrange= 0.15  - maxrange= 18.621
In [71]:
## Correlation Map of Chromosome 14
intdata(hiC14norm) <- HiTC:::sparseCor(intdata(hiC14norm))
mapC(HTClist(hiC14norm), maxrange=1, minrange=-1,col.pos=c("black", "red"), col.neg=c("blue","black"))
Plotting chr14chr14...
Warning message in max(if (length(x@x) < prod(d)) x@x else x@x[indTri(d[1], upper = x@uplo == :
“no non-missing arguments to max; returning -Inf”Warning message in max(if (length(x@x) < prod(d)) x@x else x@x[indTri(d[1], upper = x@uplo == :
“no non-missing arguments to max; returning -Inf”Warning message in max(x@x, ..., na.rm = na.rm):
“no non-missing arguments to max; returning -Inf”Warning message in min(x@x, ..., na.rm = na.rm):
“no non-missing arguments to min; returning Inf”
In [73]:
# Principal Component Analysis
# empty due to low number of reads
pc <- pca.hic(hiC14, normPerExpected=TRUE, method="loess", npc=1)
Lowess fit ...
Discard 1 'x' intervals
Discard 1 'y' intervals
Discard 4 'x' intervals
Discard 4 'y' intervals
Warning message in pca.hic(hiC14, normPerExpected = TRUE, method = "loess", npc = 1):
“Empty correlation matrix for chr14”
In [ ]: