Helper tools in fanc

FAN-C provides a few small helper tools that make working with Hi-C and associated data easier. They are not strictly necessary for matrix generation and analysis, but they can often speed up your analysis or simply make it more convenient.

fanc from-txt: import Hic from text file

You can import Hi-C matrices from a compatible text file format, such as that produced by HiC-Pro, with fanc from-txt.

usage: fanc from-txt [-h] [-tmp] contacts regions output

Positional Arguments

contacts Contacts file in sparse matrix format, i.e. each row should contain <bin1><tab><bin2><tab><weight>.
regions Path to file with genomic regions, for example in BED format: <chromosome><tab><start><tab><end>. The BED can optionally contain the bin index, as corresponding to the index used in the contacts file.
output Output Hic file.

Named Arguments

-tmp, --work-in-tmp
 Work in temporary directory

The command requires two input files:

  1. A sparse matrix with the tab-separated format <bin1><tab><bin2><tab><weight>:
1  1       40.385642
1  828     5.272852
1  1264    5.205258
...
  2. A regions file in BED format <chromosome><tab><start><tab><end>[<tab><bin ID>]:
chr1       0       1000000 1
chr1       1000000 2000000 2
chr1       2000000 3000000 3
chr1       3000000 4000000 4
...

The <bin ID> field is optional, but if provided it must correspond to the bins used in the matrix file. If not provided, bin indices will be 0-based!
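To make the two input formats concrete, here is a minimal Python sketch that writes a toy sparse matrix and a matching regions file with an explicit bin ID column. The data, the file names, and the helper write_from_txt_inputs are made up for illustration and are not part of FAN-C.

```python
import os
import tempfile

# Hypothetical toy data: three 1 Mb bins on chr1 with 1-based bin IDs.
regions = [("chr1", 0, 1000000, 1),
           ("chr1", 1000000, 2000000, 2),
           ("chr1", 2000000, 3000000, 3)]
# Sparse contacts as (bin1, bin2, weight); bins refer to the IDs above.
contacts = [(1, 1, 40.385642), (1, 2, 5.272852), (2, 3, 5.205258)]


def write_from_txt_inputs(contacts, regions, matrix_path, bed_path):
    """Write a sparse matrix file and a BED file with a bin ID column."""
    known_ids = {bin_id for _, _, _, bin_id in regions}
    for bin1, bin2, _ in contacts:
        # Every bin used in the matrix must have a matching region entry.
        if bin1 not in known_ids or bin2 not in known_ids:
            raise ValueError(f"bin pair ({bin1}, {bin2}) has no region")
    with open(matrix_path, "w") as f:
        for bin1, bin2, weight in contacts:
            f.write(f"{bin1}\t{bin2}\t{weight}\n")
    with open(bed_path, "w") as f:
        for chrom, start, end, bin_id in regions:
            f.write(f"{chrom}\t{start}\t{end}\t{bin_id}\n")


tmp_dir = tempfile.mkdtemp()
matrix_path = os.path.join(tmp_dir, "toy.matrix")
bed_path = os.path.join(tmp_dir, "toy_abs.bed")
write_from_txt_inputs(contacts, regions, matrix_path, bed_path)
```

The two files written this way follow the layout described above and could then be passed as the contacts and regions arguments of fanc from-txt.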

The FAN-C example data contains some HiC-Pro example files that you can try this out on:

fanc from-txt hicpro/dixon_2M_1000000_iced.matrix hicpro/dixon_2M_1000000_abs.bed hicpro/dixon_2M_1000000_iced.hic

fanc dump: export Hic objects to text file

You can export FAN-C Hic objects to a text file using fanc dump.

usage: fanc dump [-h] [-s SUBSET] [-S] [--only-intra] [-e] [-l] [-u] [-tmp]
                 hic [matrix] [regions]

Positional Arguments

hic Hic file
matrix Output file for matrix entries. If not provided, will write to stdout.
regions Output file for Hic regions. If not provided, will write regions into matrix file.

Named Arguments

-s, --subset Only output this matrix subset. Format: <chr>[:<start>-<end>][--<chr>[:<start>-<end>]], e.g.: “chr1--chr1” to extract only the chromosome 1 submatrix; “chr2:3400000-4200000” to extract contacts of this region on chromosome 2 to all other regions in the genome.
-S, --no-sparse
 Store full, square matrix instead of sparse format.
--only-intra Only dump intra-chromosomal data. Dumps everything by default.
-e, --observed-expected
 O/E transform matrix values.
-l, --log2 Log2-transform matrix values. Useful for O/E matrices (-e option)
-u, --uncorrected
 Output uncorrected (not normalised) matrix values.
-tmp, --work-in-tmp
 Work in temporary directory

If you only pass the Hic object to fanc dump, it will write all Hi-C contacts to standard output in a tab-delimited format with the columns: chromosome1, start1, end1, chromosome2, start2, end2, weight (number of contacts). If you add a file path as second argument, the data will be written to that file. If you instead pass the Hic file and two output files, the first output file will contain the matrix entries in sparse notation, and the second file will contain the Hic regions/bins. You can use -S to export a full matrix instead of a sparse one, but be warned that full matrices can be extremely large.
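As an illustration of the seven-column layout, the following Python sketch reads such output back into tuples. It assumes that the file has no header line and exactly seven tab-separated columns per row. The example rows and the read_dump helper are made up for this sketch.

```python
import csv
import io

# Made-up example rows in the seven-column layout described above.
dump_text = (
    "chr1\t0\t1000000\tchr1\t0\t1000000\t40.385642\n"
    "chr1\t0\t1000000\tchr1\t1000000\t2000000\t5.272852\n"
)


def read_dump(handle):
    """Yield ((chrom1, start1, end1), (chrom2, start2, end2), weight)."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        chrom1, start1, end1, chrom2, start2, end2, weight = row
        yield ((chrom1, int(start1), int(end1)),
               (chrom2, int(start2), int(end2)),
               float(weight))


dumped_contacts = list(read_dump(io.StringIO(dump_text)))
```

In practice you would pass an open file handle for the dumped matrix file instead of the io.StringIO object.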

If you are only interested in a specific sub-matrix, use the -s or --subset argument. A value of the form <chr>[:<start>-<end>] exports all contacts made by this region across the whole genome. Use <chr>[:<start>-<end>]--<chr>[:<start>-<end>] to export all contacts made between two regions. For example, use chr1--chr1 to export the chromosome 1 sub-matrix.
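The structure of a subset string can be sketched with a small parser. This is an illustration only, not FAN-C's own parsing code. It assumes that the two regions are separated by a double hyphen (--) and that chromosome names contain no colon. The names REGION_RE, parse_region and parse_subset are hypothetical.

```python
import re

# One region: a chromosome name, optionally followed by :<start>-<end>.
REGION_RE = re.compile(r"^(?P<chrom>[^:]+)(?::(?P<start>\d+)-(?P<end>\d+))?$")


def parse_region(text):
    """Return (chromosome, start, end); start and end are None if omitted."""
    match = REGION_RE.match(text)
    if match is None:
        raise ValueError(f"cannot parse region: {text!r}")
    start, end = match.group("start"), match.group("end")
    return (match.group("chrom"),
            int(start) if start is not None else None,
            int(end) if end is not None else None)


def parse_subset(text):
    """Return (first_region, second_region); the second is None if absent."""
    first, separator, second = text.partition("--")
    first_region = parse_region(first)
    second_region = parse_region(second) if separator else None
    return first_region, second_region


whole_chromosome = parse_subset("chr1--chr1")
one_region = parse_subset("chr2:3400000-4200000")
```

A missing second region corresponds to the case where contacts of the first region with the whole genome are exported.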

fanc subset: create Hic objects by subsetting

It is sometimes useful to work with smaller Hi-C objects, for example for speed reasons or to focus the analysis on a particular genomic region of interest. The fanc subset command creates a new Hic object from an existing one, containing only the user-specified genomic regions and the contacts between them.

usage: fanc subset [-h] input output regions [regions ...]

Positional Arguments

input Input Hic file.
output Output Hic file.
regions List of regions that will be used in the output Hic object. All contacts between these regions will be in the output object. For example, “chr1 chr3” will result in a Hic object with all regions in chromosomes 1 and 3, plus all contacts within chromosome 1, all contacts within chromosome 3, and all contacts between chromosomes 1 and 3. “chr1” will only contain regions and contacts within chromosome 1.
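The selection rule described for the regions argument can be sketched in a few lines of Python. This toy version works on whole chromosomes only and uses made-up contact tuples. It illustrates which contacts are kept and is not FAN-C's implementation.

```python
def subset_contacts(contacts, chromosomes):
    """Keep contacts whose two ends both lie on the selected chromosomes."""
    selected = set(chromosomes)
    return [c for c in contacts if c[0] in selected and c[1] in selected]


# Made-up contacts as (chromosome1, chromosome2, weight).
toy_contacts = [
    ("chr1", "chr1", 12.0),
    ("chr1", "chr2", 3.0),
    ("chr1", "chr3", 2.0),
    ("chr2", "chr2", 7.0),
    ("chr3", "chr3", 9.0),
]
kept = subset_contacts(toy_contacts, ["chr1", "chr3"])
```

With “chr1 chr3” selected, the contacts within chr1, within chr3, and between chr1 and chr3 remain, while everything involving chr2 is dropped.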

fanc downsample: downsample Hic objects

Often Hi-C matrices have differing numbers of valid pairs, which can be a confounding factor in many analyses. Differences can stem from varying sequencing depths, different library qualities, or other experimental and computational factors. fanc downsample is a utility that downsamples Hic objects to a specific number of valid pairs.

usage: fanc downsample [-h] [-tmp] hic n output

Positional Arguments

hic Hic object to be downsampled.
n Sample size or reference Hi-C object. If the sample size is < 1, it will be interpreted as a fraction of valid pairs.
output Downsampled Hic output.

Named Arguments

-tmp, --work-in-tmp
 Work in temporary directory

By default, the sampling is done without replacement. This requires a fairly large amount of system memory. If you are having trouble with memory usage, use sampling with replacement (--with-replacement).

Note

Sampling is done on uncorrected matrix values, so you may want to apply matrix balancing using fanc hic -k afterwards.
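The difference between the two sampling modes can be sketched on a toy table of integer pair counts. This is a conceptual illustration using Python's random module, not the code FAN-C runs. The function names, the seed parameter and the toy counts are assumptions made for the example.

```python
import random
from collections import Counter


def downsample_without_replacement(counts, n, seed=0):
    """counts maps (bin1, bin2) to an integer pair count; draw n pairs."""
    # Expanding every valid pair into its own list entry is what makes
    # this approach memory-hungry for large matrices.
    pairs = [key for key, count in counts.items() for _ in range(count)]
    if n < 1:
        n = int(n * len(pairs))  # interpret n as a fraction of valid pairs
    rng = random.Random(seed)
    return Counter(rng.sample(pairs, n))


def downsample_with_replacement(counts, n, seed=0):
    """Draw n pairs with probability proportional to the original counts."""
    keys = list(counts)
    if n < 1:
        n = int(n * sum(counts.values()))
    rng = random.Random(seed)
    weights = [counts[key] for key in keys]
    return Counter(rng.choices(keys, weights=weights, k=n))


toy_counts = {(0, 0): 6, (0, 1): 3, (1, 1): 1}
without = downsample_without_replacement(toy_counts, 5)
with_repl = downsample_with_replacement(toy_counts, 0.5)
```

Without replacement, no bin pair can end up with more contacts than it had originally. With replacement, only the list of bin pairs and their weights is needed, at the cost of losing that guarantee.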

fanc fragments: in silico genome digestion

The fanc pairs and fanc auto commands accept FASTA files as --genome argument, and fanc conveniently calculates the restriction fragments for you using the restriction enzyme name specified with --restriction-enzyme. However, the in silico digestion can be time-consuming, and if you are processing multiple similar Hi-C libraries, you can use the fanc fragments utility to generate restriction fragments up front, and use the resulting BED file as input for the --genome argument.
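To give an idea of what in silico digestion computes, here is a simplified Python sketch that cuts a sequence at every occurrence of a recognition site. It uses the HindIII site AAGCTT, cut after the first A, on a made-up sequence. The digest function is hypothetical and ignores ambiguity codes and enzyme details that a full implementation handles.

```python
def digest(sequence, site, cut_offset):
    """Return (start, end) fragments cut at every occurrence of site.

    Coordinates are 0-based and half-open; cut_offset is the position of
    the cut relative to the start of the recognition site.
    """
    sequence = sequence.upper()
    cuts = []
    position = sequence.find(site)
    while position != -1:
        cuts.append(position + cut_offset)
        position = sequence.find(site, position + 1)
    boundaries = [0] + cuts + [len(sequence)]
    # Drop empty fragments, e.g. when a cut falls on a sequence boundary.
    return [(s, e) for s, e in zip(boundaries, boundaries[1:]) if e > s]


toy_sequence = "GGAAGCTTCCCCAAGCTTGG"
fragments = digest(toy_sequence, "AAGCTT", 1)
```

For a real genome, this scan has to cover every chromosome, which is why computing the fragments once and reusing the result can save time.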

If you supply an integer as the second positional argument instead of a restriction enzyme name, fanc fragments will perform binning rather than in silico digestion and return a BED file with equally sized regions.
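Binning into equally sized regions can be sketched as follows. The sketch assumes 0-based, half-open BED coordinates and a made-up chromosome size. The coordinates written by fanc fragments may follow a different convention, and bin_genome is a hypothetical helper.

```python
def bin_genome(chromosome_sizes, bin_size):
    """Yield (chromosome, start, end) for consecutive fixed-size bins."""
    for chrom, size in chromosome_sizes.items():
        for start in range(0, size, bin_size):
            # The last bin of a chromosome is truncated at its end.
            yield chrom, start, min(start + bin_size, size)


bins = list(bin_genome({"chrA": 2500000}, 1000000))
bed_lines = [f"{chrom}\t{start}\t{end}" for chrom, start, end in bins]
```

Note that the final bin of each chromosome is usually shorter than the requested bin size.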

usage: fanc fragments [-h] [-c CHROMOSOMES] input re_or_bin_size output

Positional Arguments

input Path to genome file (FASTA, folder with FASTA, hdf5 file), which will be used in conjunction with the type of restriction enzyme to calculate fragments directly.
re_or_bin_size Restriction enzyme name or bin size to divide genome into fragments. Restriction names can be any supported by Biopython, which obtains data from REBASE (http://rebase.neb.com/rebase/rebase.html). Use commas to separate multiple restriction enzymes, e.g. ‘HindIII,MboI’
output Output file with restriction fragments in BED format.

Named Arguments

-c, --chromosomes
 Comma-separated list of chromosomes to include in fragments BED file. Other chromosomes will be excluded. The order of chromosomes will be as stated in the list.

fanc sort-sam: sort SAM files by name

The fanc pairs command expects SAM/BAM files that have been sorted by read name as input (fanc auto sorts files automatically). You can use samtools sort -n to sort files, but fanc sort-sam will also do the sorting for you. It automatically chooses the fastest sorting implementation available and also provides the option to work in a temporary folder, which can speed up sorting if you are working on a network volume.

usage: fanc sort-sam [-h] [-t THREADS] [-S] [-tmp] sam [output]

Positional Arguments

sam Input SAM/BAM
output Output SAM/BAM. If not provided, will replace input file with sorted version after sorting.

Named Arguments

-t, --threads Number of sorting threads (only when sambamba is available). Default: 1
-S, --no-sambamba
 Do not use sambamba, even when it is available. Use pysam instead.
-tmp, --work-in-tmp
 Work in temporary directory