Example Page 1

Usage

Supported Formats

  • Pre-processed Matrices: If the data is already processed into matrices for intra-chromosomal contacts, the chromosome from the same cell must be stored in the same folder with chromosome names as file names (e.g., scHiC/cell_1/chr1.txt). You only need to provide the folder name for a cell (e.g., scHiC/cell_1).

    • npy: numpy.array / numpy.matrix
    • npz: scipy.sparse.coo_matrix
    • matrix: matrix stored as pure text
    • matrix_txt: matrix stored as .txt file
    • HiCRep: the format required by HiCRep package
  • Edge List
    For all formats below:
      str - strand (forward / reverse)
      chr - chromosome
      pos - position
      score - contact reads
      frag - fragments (will be ignored)
      mapq - map quality

    • Shortest
    <chr1> <pos1> <chr2> <pos2>
    
    • Shortest_Score
    <chr1> <pos1> <chr2> <pos2> <score>
    
    • Short
    <str1> <chr1> <pos1> <frag1> <str2> <chr2> <pos2> <frag2>
    
    • Short_Score
    <str1> <chr1> <pos1> <frag1> <str2> <chr2> <pos2> <frag2> <score>
    
    • Medium
    <readname> <str1> <chr1> <pos1> <frag1> <str2> <chr2> <pos2> <frag2> <mapq1> <mapq2>
    
    • Long
    <str1> <chr1> <pos1> <frag1> <str2> <chr2> <pos2> <frag2> <mapq1> <cigar1> <sequence1> <mapq2> <cigar2> <sequence2> <readname1> <readname2>
    
    • 4DN
    ## pairs format v1.0
    #columns: readID chr1 position1 chr2 position2 strand1 strand2
    
  • .hic format: we adapted “straw” from JuiceTools.

  • .mcool format: we adapted “dump” from cool.

  • Other formats: simply give the indices (start from 1) in the order of
    “chromosome1 - position1 - chromosome2 - position2 - score” or
    “chromosome1 - position1 - chromosome2 - position2” or
    “chromosome1 - position1 - chromosome2 - position2 - mapq1 - mapq2”.
    For example, you can provide “2356” or [2, 3, 5, 6] if the file takes this format:

<name> <chromosome1> <position1> <frag1> <chromosome2> <position2> <frag2> <strand1> <strand2>
contact_1 chr1 3000000 1 chr1 3001000 1 + -

Import Package

>>>import scHiCTools

Load scHiC data

The scHiC data is stored in a series of files, with each of the files corresponding to one cell. You need to specify the list of scHiC file paths.

Only intra-chromosomal interactions are counted.

>>>from scHiCTools import scHiCs
>>>files = ['./cell_1', './cell_2', './cell_3']
>>>loaded_data = scHiCs(
... files, reference_genome='mm9',
... resolution=500000, keep_n_strata=10,
... format='customized', adjust_resolution=True,
... customized_format=12345, header=0, chromosomes='except Y',
... operations=['OE_norm', 'convolution']
... )
  • reference genome (dict or str): now supporting ‘mm9’, ‘mm10’, ‘hg19’, ‘hg38’. If your reference genome is not in [‘mm9’, ‘mm10’, ‘hg19’, ‘hg38’], you need to provide the lengths of chromosomes you are going to use with a Python dict. e.g. {‘chr1’: 150000000, ‘chr2’: 130000000, ‘chr3’: 200000000}
  • resolution (int): the resolution to separate genome into bins. If using .hic file format, the given resolution must match with the resolutions in .hic file.
  • keep_n_strata (None or int): only store contacts within n strata near the diagonal. Default: 10. If ‘None’, it will not store strata
  • store_full_map (bool): whether store full contact maps in numpy matrices or scipy sparse matrices,If False, it will save memory.
  • sparse (bool): whether to use sparse matrices
  • format (str): file format, supported formats: ‘shortest’, ‘shortest_score’, ‘short’, ‘short_score’ , ‘medium’, ‘long’, ‘4DN’, ‘.hic’, ‘.mcool’, ‘npy’, ‘npz’, ‘matrix’, ‘HiCRep’, ‘matrix_txt’ and ‘customized’. Default: ‘customized’.
  • customized_format (int or str or list): the column indices in the order of “chromosome 1 - position 1 - chromosome 2 - position 2 - contact reads” or “chromosome 1 - position 1 - chromosome 2 - position 2” or “chromosome 1 - position 1 - chromosome 2 - position 2 - map quality 1 - map quality 2”. e.g. if the line is “chr1 5000000 chr2 3500000 2”, the format should be ‘12345’ or [1, 2, 3, 4, 5]; if there is no column indicating number of reads, you can just provide 4 numbers like ‘1234’, and contact read will be set as 1. Default: ‘12345’.
  • adjust_resolution: whether to adjust resolution for the input file. Sometimes the input file is already in the proper resolution (e.g. position 3000000 has already been changed to 6 with 500kb resolution). For this situation you can set adjust_resolution=False. Default: True.
  • map_filter (float): keep all contacts with mapq higher than this threshold. Default: 0.0
  • header (int): how many header lines does the file have. Default: 0.
  • chromosomes (list or str): chromosomes to use, eg. [‘chr1’, ‘chr2’], or just ‘except Y’, ‘except XY’, ‘all’. Default: ‘all’, which means chr 1-19 + XY for mouse and chr 1-22 + XY for human.
  • operations (list or None): the operations use for pre-processing or smoothing the maps given in a list. The operations will happen in the given order. Supported operations: ‘logarithm’, ‘power’, ‘convolution’, ‘random_walk’, ‘network_enhancing’, ‘OE_norm’, ‘VC_norm’, ‘VC_SQRT_norm’, ‘KR_norm’。 Default: None.
  • For preprocessing and smoothing operations, sometimes you need additional arguments (introduced in next sub-section).

You can also skip pre-processing and smoothing in loading step (operations=None), and do them in next lines.

Plot number of contacts and select cells

You can plot the number of contacts of your cells.

>>>loaded_data.plot_contacts(hist=True, percent=True)

If hist is True, plot Histogram of the number of contacts. If percent is True, plot the scatter plot of cells with of short-range contacts (< 2 Mb) versus contacts at the mitotic band (2-12 Mb).

You can select cells based on number of contacts:

>>>loaded_data.select_cells(min_n_contacts=10000,max_short_range_contact=0.99)

The above command select cells have number of contacts bigger than 10000 and percent of short range contacts small than .99.

Pre-processing and Smoothing Operations Stand alone pre-processing and smoothing:

>>>loaded_data.processing(['random_walk', 'network_enhancing'])

If you didn’t store full map (i.e. store_full_map=False), processing is not doable in a separate step.

  • logarithm: new_W_ij = log_(base) (W_ij + epsilon). Additional arguments:

    • log_base: default: e
    • epsilon: default: 1
  • power: new_W_ij = (W_ij)^pow. Additional argument:

    • pow: default: 0.5 (i.e., sqrt(W_ij))
  • VC_norm: VC normalization - each value divided by the sum of corresponding row then divided by the sum of corresponding column

  • VC_SQRT_norm: VC_SQRT normalization - each value divided by the sqrt of the sum of corresponding row then divided by the sqrt of the sum of corresponding column

  • KR_norm: KR normalization - iterating until the sum of each row / column is one Argument:

    • maximum_error_rate (float): iteration ends when max error is smaller than (maximum_error_rate). Default: 1e-4
  • OE_norm: OE normalization - each value divided by the average of its corresponding strata (diagonal line)

  • convolution: smoothing with a N by N convolution kernel, with each value equal to 1/N^2. Argument:

    • kernel_shape: an integer. e.g. kernel_shape=3 means a 3*3 matrix with each value = 1/9. Default: 3.
  • Random walk: multiply by a transition matrix (also calculated from contact map itself). Argument:

    • random_walk_ratio: a value between 0 and 1, e.g. if ratio=0.9, the result will be 0.9 * matrix_after_random_walk + 0.1 * original_matrix. Default: 1.0.
  • Network enhancing: transition matrix only comes from k-nearest neighbors of each line. Arguments:

    • kNN: value ‘k’ in kNN. Default: 20.
    • iterations: number of iterations for network enhancing. Default: 1
    • alpha: similar with random_walk_ratio. Default: 0.9

Learn Embeddings

>>>embs = loaded_data.learn_embedding(
... dim=2, similarity_method='inner_product', embedding_method='MDS',
... n_strata=None, aggregation='median', return_distance=False
... )

This function will return the embeddings in the format of a numpy array with shape ( # of cells, # of dimensions).

  • dim (int): the dimension for the embedding
  • similarity_method (str): reproducibility measure, ‘InnerProduct’, ‘HiCRep’ or ‘Selfish’. Default: ‘InnerProduct’
  • embedding_method (str): ‘MDS’, ‘tSNE’ or ‘UMAP’
  • n_strata (int): only consider contacts within this genomic distance. Default: None. If it is None, it will use the all strata kept (the argument keep_n_strata from previous loading process). Thus n_strata and keep_n_strata (loading step) cannot be None at the same time.
  • aggregation (str): method to aggregate different chromosomes, ‘mean’ or ‘median’. Default: ‘median’.
  • return_distance (bool): if True, return (embeddings, distance_matrix); if False, only return embeddings. Default: False.
  • Some additional argument for Selfish:
    • n_windows (int): split contact map into n windows, default: 10
    • sigma (float): sigma in the Gaussian-like kernel: default: 1.6

Clustering

There are two functions to cluster cells.

>>>label=loaded_data.clustering(
... n_clusters=4, clustering_method='kmeans', similarity_method='innerproduct',
... aggregation='median', n_strata=None
... )

clustering function returns a numpy array of cell labels clustered.

  • n_clusters (int): Number of clusters.

  • clustering_method (str): Clustering method in ‘kmeans’, ‘spectral_clustering’ or ‘HAC’(hierarchical agglomerative clustering).

  • similarity_method (str): Reproducibility measure. Value in ‘InnerProduct’, ‘HiCRep’ or ‘Selfish’.

  • aggregation (str): Method to aggregate different chromosomes. Value is either ‘mean’ or ‘median’. Default: ‘median’.

  • n_strata (int or None): Only consider contacts within this genomic distance. If it is None, it will use the all strata kept (the argument keep_n_strata) from previous loading process. Default: None.

  • print_time (bool): Whether to print the processing time. Default: False.

>>>hicluster=loaded_data.scHiCluster(dim=2,cutoff=0.8,n_PCs=10,k=4)

scHiCluster function returns two componments. First componment is a numpy array of embedding of cells using HiCluster. Second componment is a numpy of cell labels clustered by HiCluster.

  • dim (int): Number of dimension of embedding. Default: 2.

  • cutoff (float): The cutoff proportion to convert the real contact matrix into binary matrix. Default: 0.8.

  • n_PCs (int): Number of principal components. Default: 10.

  • k (int): Number of clusters. Default: 4.

Visualization

>>>scatter(data, dimension="2D", point_size=3, sty='default',
... label=None, title=None, alpha=None, aes_label=None
... )
>>>plt.show()

This function is to plot scatter plot of embedding points of single cell data. Scatter plot of either two-dimensions or three-dimensions will be generated.

  • data (numpy.array): A numpy array which has 2 or 3 columns, every row represent a point.

  • dimension (str): Specifiy the dimension of the plot, either “2D” or “3D”. Default: “2D”.

  • point_size (float): Set the size of the points in scatter plot. Default: 3.

  • sty (str): Styles of Matplotlib. Default: ‘default’.

  • label (list or None): specifiy the label of each point. Default: None.

  • title (str): Title of the plot. Default: None.

  • alpha (float): The alpha blending value. Default: None.

  • aes_label (list): Set the label of every axis. Default: None.

“scHiCTools” also support interactive scatter plot which require the module ‘plotly’

>>>interactive_scatter(loaded_data, data, out_file, dimension='2D', point_size=3,
... label=None, title=None, alpha=1, aes_label=None)

This function is to generate an interactive scatter plot of embedded single cell data. The plot will be stored in a file.

  • schic (scHiCs): A scHiCs object.
  • data (numpy.array): A numpy array which has 2 or 3 columns, every row represent a point.
  • out_file (str): Output file path.
  • dimension (str): Specifiy the dimension of the plot, either “2D” or “3D”. The default is “2D”.
  • point_size (float): Set the size of the points in scatter plot. The default is 3.
  • label (list or None): Specifiy the label of each point. The default is None.
  • title (str): Title of the plot. The default is None.
  • alpha (float): The alpha blending value. The default is 1.
  • aes_label (list): Set the label of every axis. The default is None.
Next