Example Page 1
Usage
Supported Formats
- 
Pre-processed Matrices: If the data is already processed into matrices for intra-chromosomal contacts, the chromosome from the same cell must be stored in the same folder with chromosome names as file names (e.g., scHiC/cell_1/chr1.txt). You only need to provide the folder name for a cell (e.g., scHiC/cell_1).
- npy: numpy.array / numpy.matrix
 - npz: scipy.sparse.coo_matrix
 - matrix: matrix stored as pure text
 - matrix_txt: matrix stored as .txt file
 - HiCRep: the format required by HiCRep package
 
 - 
Edge List
For all formats below:
str - strand (forward / reverse)
chr - chromosome
pos - position
score - contact reads
frag - fragments (will be ignored)
mapq - map quality- Shortest
 
<chr1> <pos1> <chr2> <pos2>- Shortest_Score
 
<chr1> <pos1> <chr2> <pos2> <score>- Short
 
<str1> <chr1> <pos1> <frag1> <str2> <chr2> <pos2> <frag2>- Short_Score
 
<str1> <chr1> <pos1> <frag1> <str2> <chr2> <pos2> <frag2> <score>- Medium
 
<readname> <str1> <chr1> <pos1> <frag1> <str2> <chr2> <pos2> <frag2> <mapq1> <mapq2>- Long
 
<str1> <chr1> <pos1> <frag1> <str2> <chr2> <pos2> <frag2> <mapq1> <cigar1> <sequence1> <mapq2> <cigar2> <sequence2> <readname1> <readname2>- 4DN
 
## pairs format v1.0 #columns: readID chr1 position1 chr2 position2 strand1 strand2 - 
.hic format: we adapted “straw” from JuiceTools.
 - 
.mcool format: we adapted “dump” from cool.
 - 
Other formats: simply give the indices (start from 1) in the order of
“chromosome1 - position1 - chromosome2 - position2 - score” or
“chromosome1 - position1 - chromosome2 - position2” or
“chromosome1 - position1 - chromosome2 - position2 - mapq1 - mapq2”.
For example, you can provide “2356” or [2, 3, 5, 6] if the file takes this format: 
<name> <chromosome1> <position1> <frag1> <chromosome2> <position2> <frag2> <strand1> <strand2>
contact_1 chr1 3000000 1 chr1 3001000 1 + -
Import Package
>>>import scHiCTools
Load scHiC data
The scHiC data is stored in a series of files, with each of the files corresponding to one cell. You need to specify the list of scHiC file paths.
Only intra-chromosomal interactions are counted.
>>>from scHiCTools import scHiCs
>>>files = ['./cell_1', './cell_2', './cell_3']
>>>loaded_data = scHiCs(
... files, reference_genome='mm9',
... resolution=500000, keep_n_strata=10,
... format='customized', adjust_resolution=True,
... customized_format=12345, header=0, chromosomes='except Y',
... operations=['OE_norm', 'convolution']
... )
- reference genome (dict or str): now supporting ‘mm9’, ‘mm10’, ‘hg19’, ‘hg38’. If your reference genome is not in [‘mm9’, ‘mm10’, ‘hg19’, ‘hg38’], you need to provide the lengths of chromosomes you are going to use with a Python dict. e.g. {‘chr1’: 150000000, ‘chr2’: 130000000, ‘chr3’: 200000000}
 - resolution (int): the resolution to separate genome into bins. If using .hic file format, the given resolution must match with the resolutions in .hic file.
 - keep_n_strata (None or int): only store contacts within n strata near the diagonal. Default: 10. If ‘None’, it will not store strata
 - store_full_map (bool): whether store full contact maps in numpy matrices or scipy sparse matrices,If False, it will save memory.
 - sparse (bool): whether to use sparse matrices
 - format (str): file format, supported formats: ‘shortest’, ‘shortest_score’, ‘short’, ‘short_score’ , ‘medium’, ‘long’, ‘4DN’, ‘.hic’, ‘.mcool’, ‘npy’, ‘npz’, ‘matrix’, ‘HiCRep’, ‘matrix_txt’ and ‘customized’. Default: ‘customized’.
 - customized_format (int or str or list): the column indices in the order of “chromosome 1 - position 1 - chromosome 2 - position 2 - contact reads” or “chromosome 1 - position 1 - chromosome 2 - position 2” or “chromosome 1 - position 1 - chromosome 2 - position 2 - map quality 1 - map quality 2”. e.g. if the line is “chr1 5000000 chr2 3500000 2”, the format should be ‘12345’ or [1, 2, 3, 4, 5]; if there is no column indicating number of reads, you can just provide 4 numbers like ‘1234’, and contact read will be set as 1. Default: ‘12345’.
 - adjust_resolution: whether to adjust resolution for the input file. Sometimes the input file is already in the proper resolution (e.g. position 3000000 has already been changed to 6 with 500kb resolution). For this situation you can set adjust_resolution=False. Default: True.
 - map_filter (float): keep all contacts with mapq higher than this threshold. Default: 0.0
 - header (int): how many header lines does the file have. Default: 0.
 - chromosomes (list or str): chromosomes to use, eg. [‘chr1’, ‘chr2’], or just ‘except Y’, ‘except XY’, ‘all’. Default: ‘all’, which means chr 1-19 + XY for mouse and chr 1-22 + XY for human.
 - operations (list or None): the operations use for pre-processing or smoothing the maps given in a list. The operations will happen in the given order. Supported operations: ‘logarithm’, ‘power’, ‘convolution’, ‘random_walk’, ‘network_enhancing’, ‘OE_norm’, ‘VC_norm’, ‘VC_SQRT_norm’, ‘KR_norm’。 Default: None.
 - For preprocessing and smoothing operations, sometimes you need additional arguments (introduced in next sub-section).
 
You can also skip pre-processing and smoothing in loading step (operations=None), and do them in next lines.
Plot number of contacts and select cells
You can plot the number of contacts of your cells.
>>>loaded_data.plot_contacts(hist=True, percent=True)
If hist is True, plot Histogram of the number of contacts. If percent is True, plot the scatter plot of cells with  of short-range contacts (< 2 Mb) versus contacts at the mitotic band (2-12 Mb).
You can select cells based on number of contacts:
>>>loaded_data.select_cells(min_n_contacts=10000,max_short_range_contact=0.99)
The above command select cells have number of contacts bigger than 10000 and percent of short range contacts small than .99.
Pre-processing and Smoothing Operations Stand alone pre-processing and smoothing:
>>>loaded_data.processing(['random_walk', 'network_enhancing'])
If you didn’t store full map (i.e. store_full_map=False), processing is not doable in a separate step.
- 
logarithm: new_W_ij = log_(base) (W_ij + epsilon). Additional arguments:
- log_base: default: e
 - epsilon: default: 1
 
 - 
power: new_W_ij = (W_ij)^pow. Additional argument:
- pow: default: 0.5 (i.e., sqrt(W_ij))
 
 - 
VC_norm: VC normalization - each value divided by the sum of corresponding row then divided by the sum of corresponding column
 - 
VC_SQRT_norm: VC_SQRT normalization - each value divided by the sqrt of the sum of corresponding row then divided by the sqrt of the sum of corresponding column
 - 
KR_norm: KR normalization - iterating until the sum of each row / column is one Argument:
- maximum_error_rate (float): iteration ends when max error is smaller than (maximum_error_rate). Default: 1e-4
 
 - 
OE_norm: OE normalization - each value divided by the average of its corresponding strata (diagonal line)
 - 
convolution: smoothing with a N by N convolution kernel, with each value equal to 1/N^2. Argument:
- kernel_shape: an integer. e.g. kernel_shape=3 means a 3*3 matrix with each value = 1/9. Default: 3.
 
 - 
Random walk: multiply by a transition matrix (also calculated from contact map itself). Argument:
- random_walk_ratio: a value between 0 and 1, e.g. if ratio=0.9, the result will be 0.9 * matrix_after_random_walk + 0.1 * original_matrix. Default: 1.0.
 
 - 
Network enhancing: transition matrix only comes from k-nearest neighbors of each line. Arguments:
- kNN: value ‘k’ in kNN. Default: 20.
 - iterations: number of iterations for network enhancing. Default: 1
 - alpha: similar with random_walk_ratio. Default: 0.9
 
 
Learn Embeddings
>>>embs = loaded_data.learn_embedding(
... dim=2, similarity_method='inner_product', embedding_method='MDS',
... n_strata=None, aggregation='median', return_distance=False
... )
This function will return the embeddings in the format of a numpy array with shape ( # of cells, # of dimensions).
- dim (int): the dimension for the embedding
 - similarity_method (str): reproducibility measure, ‘InnerProduct’, ‘HiCRep’ or ‘Selfish’. Default: ‘InnerProduct’
 - embedding_method (str): ‘MDS’, ‘tSNE’ or ‘UMAP’
 - n_strata (int): only consider contacts within this genomic distance. Default: None. If it is None, it will use the all strata kept (the argument keep_n_strata from previous loading process). Thus n_strata and keep_n_strata (loading step) cannot be None at the same time.
 - aggregation (str): method to aggregate different chromosomes, ‘mean’ or ‘median’. Default: ‘median’.
 - return_distance (bool): if True, return (embeddings, distance_matrix); if False, only return embeddings. Default: False.
 - Some additional argument for Selfish:
- n_windows (int): split contact map into n windows, default: 10
 - sigma (float): sigma in the Gaussian-like kernel: default: 1.6
 
 
Clustering
There are two functions to cluster cells.
>>>label=loaded_data.clustering(
... n_clusters=4, clustering_method='kmeans', similarity_method='innerproduct',
... aggregation='median', n_strata=None
... )
clustering function returns a numpy array of cell labels clustered.
- 
n_clusters (int): Number of clusters.
 - 
clustering_method (str): Clustering method in ‘kmeans’, ‘spectral_clustering’ or ‘HAC’(hierarchical agglomerative clustering).
 - 
similarity_method (str): Reproducibility measure. Value in ‘InnerProduct’, ‘HiCRep’ or ‘Selfish’.
 - 
aggregation (str): Method to aggregate different chromosomes. Value is either ‘mean’ or ‘median’. Default: ‘median’.
 - 
n_strata (int or None): Only consider contacts within this genomic distance. If it is None, it will use the all strata kept (the argument keep_n_strata) from previous loading process. Default: None.
 - 
print_time (bool): Whether to print the processing time. Default: False.
 
>>>hicluster=loaded_data.scHiCluster(dim=2,cutoff=0.8,n_PCs=10,k=4)
scHiCluster function returns two componments.
First componment is a numpy array of embedding of cells using HiCluster.
Second componment is a numpy of cell labels clustered by HiCluster.
- 
dim (int): Number of dimension of embedding. Default: 2.
 - 
cutoff (float): The cutoff proportion to convert the real contact matrix into binary matrix. Default: 0.8.
 - 
n_PCs (int): Number of principal components. Default: 10.
 - 
k (int): Number of clusters. Default: 4.
 
Visualization
>>>scatter(data, dimension="2D", point_size=3, sty='default',
... label=None, title=None, alpha=None, aes_label=None
... )
>>>plt.show()
This function is to plot scatter plot of embedding points of single cell data. Scatter plot of either two-dimensions or three-dimensions will be generated.
- 
data (numpy.array): A numpy array which has 2 or 3 columns, every row represent a point.
 - 
dimension (str): Specifiy the dimension of the plot, either “2D” or “3D”. Default: “2D”.
 - 
point_size (float): Set the size of the points in scatter plot. Default: 3.
 - 
sty (str): Styles of Matplotlib. Default: ‘default’.
 - 
label (list or None): specifiy the label of each point. Default: None.
 - 
title (str): Title of the plot. Default: None.
 - 
alpha (float): The alpha blending value. Default: None.
 - 
aes_label (list): Set the label of every axis. Default: None.
 
“scHiCTools” also support interactive scatter plot which require the module ‘plotly’
>>>interactive_scatter(loaded_data, data, out_file, dimension='2D', point_size=3,
... label=None, title=None, alpha=1, aes_label=None)
This function is to generate an interactive scatter plot of embedded single cell data. The plot will be stored in a file.
- schic (scHiCs): A 
scHiCsobject. - data (numpy.array): A numpy array which has 2 or 3 columns, every row represent a point.
 - out_file (str): Output file path.
 - dimension (str): Specifiy the dimension of the plot, either “2D” or “3D”. The default is “2D”.
 - point_size (float): Set the size of the points in scatter plot. The default is 3.
 - label (list or None): Specifiy the label of each point. The default is None.
 - title (str): Title of the plot. The default is None.
 - alpha (float): The alpha blending value. The default is 1.
 - aes_label (list): Set the label of every axis. The default is None.