This page shows notes for each module. Note that the parameters can be set up in the configure_user.txt file.

1 Processing

1.1 demplx_fastq

Add the barcode information to the name of each read in the fastq files.

  • Input: one of

    • Paths to the fastq files for each of the paired-end reads and for the index information, separated by commas, like PE1_fastq,PE2_fastq,index1_fastq(,index2_fastq,index3_fastq…)
    • Path to the folder of 10x fastq files
  • Output: Demultiplexed PE1 and PE2 fastq files with index information embedded in the read name as: @index3_index2_index1:original_read_name, saved in output/demplxed_fastq/

  • Note: there could be multiple index files
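
The read-name scheme described above can be illustrated with a short Python sketch. It is not the module's actual implementation; the helper name embed_barcode and the example read name and index sequences are assumptions made for illustration only.

```python
from typing import Sequence


def embed_barcode(read_name: str, index_seqs: Sequence[str]) -> str:
    """Build a read name of the form @index3_index2_index1:original_read_name.

    index_seqs holds the index sequences for one read in file order
    (index1 first). Hypothetical helper, for illustration only.
    """
    # Drop the leading '@' of the fastq header before prefixing the barcode.
    original = read_name[1:] if read_name.startswith("@") else read_name
    # The last index comes first in the combined barcode.
    barcode = "_".join(reversed(index_seqs))
    return f"@{barcode}:{original}"


print(embed_barcode("@read1", ["AAA", "CCC", "GGG"]))  # @GGG_CCC_AAA:read1
```

With a single index file the barcode part is just that one index sequence.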

1.2 trimming

Perform sequence adapter trimming.

  • Input: demultiplexed PE1 and PE2 fastq files, separated by a comma

  • Output: trimmed and demultiplexed PE1 and PE2 fastq files, saved in output/trimmed_fastq/

  • Parameters:

    • TRIM_METHOD, default value ‘trim_galore’, other options: ‘none’ or ‘Trimmomatic’
    • ADAPTER_SEQ, default NA, set it to the path of the adapter .fa file if TRIM_METHOD is set to Trimmomatic, otherwise ignore it
  • Note: you can specify TRIM_METHOD=none to skip the trimming step, which is usually fine.

1.3 mapping

Sequence alignment.

  • Input: the demultiplexed (and trimmed) paired-end fastq files, separated by comma: PE1.fastq,PE2.fastq

  • Output: Position sorted bam file, and position sorted MAPQ30 bam file, saved in output/mapping_result/ and plain text files of mapping QC metrics and fragments.txt file saved in output/summary/

  • Parameters:

    • MAPPING_METHOD, default ‘bwa’, Read alignment method, three options: bwa, bowtie or bowtie2
    • BWA_OPTS: default ‘-t 16’, additional options for bwa, ignore it if MAPPING_METHOD is not set to bwa
    • BOWTIE_OPTS, BOWTIE2_OPTS: additional options for running bowtie and bowtie2
    • BWA_INDEX, Index file for bwa of the used genome (the path of the .fa file of the genome)
    • MAPQ, default 30, MAPQ score cutoff used to filter out low-quality reads
    • CELL_MAPQ_QC, default TRUE, Report mapping qc for cell barcodes (need to run module get_bam4Cells) or not
  • Note: you only need to set one of BWA_OPTS, BOWTIE_OPTS and BOWTIE2_OPTS, and only one of BWA_INDEX, BOWTIE_INDEX and BOWTIE2_INDEX, corresponding to the specified mapping method (MAPPING_METHOD)
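
As a rough illustration of what the MAPQ cutoff does: in a SAM record the mapping quality is the fifth tab-separated field. The Python sketch below keeps records whose MAPQ is at or above the cutoff, which is how samtools view -q behaves; that the module applies the cutoff in exactly this way is an assumption, and the function name and example records are hypothetical.

```python
def passes_mapq(sam_line: str, cutoff: int = 30) -> bool:
    """Return True if a SAM record has MAPQ >= cutoff (header lines always pass)."""
    if sam_line.startswith("@"):
        return True  # header line, not an alignment record
    mapq = int(sam_line.split("\t")[4])  # field 5 of a SAM record is MAPQ
    return mapq >= cutoff


# Two made-up alignment records that differ only in MAPQ (42 vs 12).
records = [
    "GGG_CCC_AAA:read1\t99\tchr1\t100\t42\t4M\t=\t180\t84\tACGT\tIIII",
    "GGG_CCC_AAA:read2\t99\tchr1\t500\t12\t4M\t=\t580\t84\tACGT\tIIII",
]
kept = [r for r in records if passes_mapq(r, cutoff=30)]
print(len(kept))
```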

1.4 call_peak

Peak calling using aggregated bam file.

  • Input: the MAPQ30 bam file outputted from the mapping step.

  • Output: peak files, saved as output/peaks/PEAK_CALLER/OUTPUT_PREFIX_features_Blacklist_Removed.bed

  • Parameters:

    • PEAK_CALLER, peak calling method, default ‘MACS2’, three options: MACS2, BIN, COMBINED
    • MACS2_OPTS, extra options for macs2, default ‘-q 0.05 -g hs --nomodel --extsize 200 --shift -100’ (no need to specify -t, -n, -f here)

1.5 aggr_signal

Aggregate and normalize signal into .bw or .bedgraph file (can be uploaded to UCSC genome browser).

  • Input: the MAPQ30 bam file outputted from the mapping step.

  • Output: Aggregated data in .bw and .bedgraph file, saved in output/signal/

1.6 qc_per_barcode

Generate quality control metrics for each barcode.

  • Input: fragments.tsv.gz file (outputted from module mapping) and peak file (outputted from module call_peak), separated by comma

  • Output: qc_per_barcode.txt file, saved in output/summary/

  • Note: these qc metrics for each cell will be loaded into the seurat object as metadata when the clustering module is executed

1.7 get_mtx

Build raw peak-by-cell matrix

  • input: fragments.tsv.gz file, outputted from the mapping module, and features/peak file, outputted from the call_peak module, separated by a comma

  • output: sparse peak-by-cell count matrix in Matrix Market format, barcodes and feature files in plain text format, saved in output/raw_matrix/PEAK_CALLER/
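
To make the matrix construction concrete, here is a minimal Python sketch that counts fragments overlapping each peak per barcode and writes the result in Matrix Market coordinate format. It is not the module's implementation: counting whole-fragment overlaps (rather than, say, cut sites), the in-memory tuples used instead of real fragment and peak files, and all function names are assumptions for illustration.

```python
from collections import Counter


def count_fragments(peaks, fragments):
    """Count fragments overlapping each peak, keyed by (peak index, barcode).

    peaks: list of (chrom, start, end); fragments: list of (chrom, start, end, barcode).
    """
    counts = Counter()
    for chrom, start, end, barcode in fragments:
        for i, (p_chrom, p_start, p_end) in enumerate(peaks):
            # Half-open interval overlap test.
            if chrom == p_chrom and start < p_end and end > p_start:
                counts[(i, barcode)] += 1
    return counts


def to_matrix_market(counts, n_peaks):
    """Return (Matrix Market text, barcode list) for a peak-by-barcode count matrix."""
    barcodes = sorted({bc for _, bc in counts})
    col = {bc: j for j, bc in enumerate(barcodes, start=1)}  # 1-based columns
    lines = [
        "%%MatrixMarket matrix coordinate integer general",
        f"{n_peaks} {len(barcodes)} {len(counts)}",  # rows, columns, non-zero entries
    ]
    for (i, bc), value in sorted(counts.items(), key=lambda kv: (col[kv[0][1]], kv[0][0])):
        lines.append(f"{i + 1} {col[bc]} {value}")  # 1-based row index
    return "\n".join(lines), barcodes


peaks = [("chr1", 100, 600), ("chr1", 1000, 1200)]
fragments = [
    ("chr1", 150, 300, "AAA"),
    ("chr1", 550, 700, "AAA"),
    ("chr1", 1100, 1150, "CCC"),
    ("chr1", 5000, 5100, "AAA"),  # overlaps no peak, so it is not counted
]
mtx_text, barcodes = to_matrix_market(count_fragments(peaks, fragments), len(peaks))
print(mtx_text)
```

The nested loop is quadratic and only meant to show the logic; real data would need an interval index or a sorted sweep.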

1.8 call_cell

Perform cell calling

  • input: raw peak-by-barcode matrix file, outputted from the get_mtx module

  • output: filtered peak-by-cell matrix in Matrix Market format and .rds format, barcodes and features, saved in output/filtered_matrix/PEAK_CALLER/CELL_CALLER/

  • Parameters:

    • CELL_CALLER, cell calling method, default ‘filtered’, three options: filtered, EmptyDrop, cellranger
    • EmptyDrop_FDR: FDR for EmptyDrop, default 0.001, (NO NEED TO SPECIFY if EmptyDrop method is not selected)
    • FILTER_BC_CUTOFF, filtering rules used when ‘filtered’ is selected as the CELL_CALLER, default ‘--min_uniq_frags 3000 --max_uniq_frags 50000 --min_frac_peak 0.5 --min_frac_tss 0.0 --min_frac_promoter 0 --min_frac_enhancer 0.0 --max_frac_mito 0.1 --min_tss_escore 3’ (ignored if CELL_CALLER is set to anything other than filtered)
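
How such a set of filtering rules might be applied to per-barcode QC metrics is sketched below in Python. The default option string is taken from the description above, written with double hyphens; treating every bound as inclusive, the metric key names derived from the option names, and the example QC values are assumptions, not the module's documented behavior.

```python
import shlex

DEFAULT_CUTOFF = (
    "--min_uniq_frags 3000 --max_uniq_frags 50000 --min_frac_peak 0.5 "
    "--min_frac_tss 0.0 --min_frac_promoter 0 --min_frac_enhancer 0.0 "
    "--max_frac_mito 0.1 --min_tss_escore 3"
)


def parse_cutoffs(opts: str):
    """Turn '--min_x 1 --max_y 2' into [('min', 'x', 1.0), ('max', 'y', 2.0)]."""
    tokens = shlex.split(opts)
    cutoffs = []
    for flag, value in zip(tokens[0::2], tokens[1::2]):
        bound, metric = flag.lstrip("-").split("_", 1)
        cutoffs.append((bound, metric, float(value)))
    return cutoffs


def passes_filter(qc: dict, cutoffs) -> bool:
    """Return True if every metric lies within its bound (bounds assumed inclusive)."""
    for bound, metric, value in cutoffs:
        if bound == "min" and qc[metric] < value:
            return False
        if bound == "max" and qc[metric] > value:
            return False
    return True


cutoffs = parse_cutoffs(DEFAULT_CUTOFF)
# Hypothetical QC metrics for one barcode.
good_cell = {
    "uniq_frags": 8000, "frac_peak": 0.6, "frac_tss": 0.3, "frac_promoter": 0.2,
    "frac_enhancer": 0.1, "frac_mito": 0.02, "tss_escore": 5.0,
}
low_depth = dict(good_cell, uniq_frags=1200)  # below --min_uniq_frags
print(passes_filter(good_cell, cutoffs), passes_filter(low_depth, cutoffs))
```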

1.9 get_bam4Cells

Extract bam file for cell barcodes and calculate mapping stats correspondingly

  • input: A bam file for aggregated data outputted from mapping module and a barcodes.txt file outputted from module call_cell, separated by comma

  • output: A bam file for cell barcodes, saved in output/mapping_result, and (optionally) mapping stats for cell barcodes, saved in output/summary

1.10 Perform all process steps in one command

Some of the processing modules can be run together by a single command:

1.10.1 process

Process the data, including demplx_fastq, mapping, call_peak, get_mtx, aggr_signal, qc_per_barcode, call_cell and get_bam4Cells

  • input: either the fastq files for both reads and the index files, separated by commas, like fastq1,fastq2,index_fastq1,index_fastq2,index_fastq3…, or the path to a folder of 10x fastq files (PATH_TO_10xfastqs_folder)

  • output: peak-by-cell matrix and all intermediate results

1.10.2 process_no_dex

Run all processing modules except the demultiplexing step

  • input: demultiplexed fastq files for both reads, separated by a comma, like: fastq1,fastq2

  • output: peak-by-cell matrix and all intermediate results

1.10.3 process_with_bam

Run all processing modules that follow the mapping step

  • input: bam file for aggregated data, outputted from the mapping module

  • output: filtered peak-by-cell matrix and all intermediate results

2 Downstream Analysis

2.1 rmDoublets

Remove potential doublets

  • input: a peak-by-cell matrix file or a seurat object file in .rds format, and the expected fraction of doublets, separated by a comma

  • output: doublet-removed matrix.rds and barcodes.txt files, and seurat objects with and without doublets, saved in the input directory (plus a umap plot colored by singlet/doublet)

2.2 clustering

Cell clustering

  • input: filtered peak-by-cell matrix file, outputted from the call_cell module (or a seurat.rds file)

  • output: seurat object with cluster labels in the metadata (.rds file), barcodes with cluster labels (cell_cluster_table.tsv file), and a umap plot colored by cluster label

  • Parameters to specify (in configure_user.txt file):

    • norm_by, normalization method, default tf-idf, other options: log (just log transformation) or NA (no normalization)
    • Top_Variable_Features, number/fraction of variable features used for seurat, default 5000, other options: a real number within (0, 1)
    • REDUCTION, dimension reduction method, default pca, other option: lda; note that UMAP and TSNE will be calculated automatically from the chosen reduction
    • nREDUCTION, number of reduced dimensions, default 30
    • CLUSTERING_METHOD, clustering method, default seurat (the same as Louvain), options: seurat/Louvain/cisTopic/kmeans/LSI/SCRAT/chromVAR/scABC
    • K_CLUSTERS, either the number of cluster (an integer) or the resolution parameter (a float number) for louvain algorithm (implemented by seurat), default 0.2
    • prepCello, whether to generate inputs for VisCello (for visualization), default TRUE
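
Two of the parameters above accept either a count or a real number. The Python sketch below shows one plausible way to interpret them; the parsing rules (an integer string means a number of clusters, any other number means a resolution; a value inside (0, 1) means a fraction of all features) follow the descriptions above, but the function names and the rounding are assumptions.

```python
def interpret_k_clusters(value: str):
    """Read K_CLUSTERS as a cluster count (integer) or a Louvain resolution (float)."""
    try:
        return ("n_clusters", int(value))
    except ValueError:
        return ("resolution", float(value))


def n_variable_features(setting: float, n_features: int) -> int:
    """Read Top_Variable_Features as a fraction in (0, 1) or an absolute number."""
    if 0 < setting < 1:
        return round(setting * n_features)
    return int(setting)


print(interpret_k_clusters("0.2"))        # default: a resolution
print(interpret_k_clusters("8"))          # an integer: number of clusters
print(n_variable_features(5000, 120000))  # default: absolute number
print(n_variable_features(0.1, 120000))   # fraction of all features
```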

2.3 motif_analysis

Motif analysis based on chromVAR

  • input: filtered peak-by-cell matrix file, outputted from the call_cell module

  • output: a chromVAR object with TF-by-cell deviation scores/z-scores, and a table and heatmap indicating TF enrichment for each cell cluster, saved in output/downstream_analysis/PEAK_CALLER/CELL_CALLER/

2.4 runDA

Perform differential accessibility analysis for peaks

  • input: the path to a seurat object plus the two groups of clusters to compare, for example: seurat_obj.rds,0:1,2 (compares cells in cluster 0 or cluster 1 with cells in cluster 2), seurat_obj.rds,0,rest (compares cells in cluster 0 with the rest of the cells) or seurat_obj.rds,one,rest (compares the cells in each cluster, in turn, with the rest of the cells)

    • Note: the groups specified here will overwrite the group1 and group2 parameters in the configure_user.txt file
  • output: differentially accessible peaks in a tsv file, saved in the same folder as the input seurat object

  • Parameters:

    • group1, group one, default 0:1, could be either one or multiple cluster names, separated by colon, or ‘one’
    • group2, group two, default 2, cluster name(s) in the same format as group1, or ‘rest’
    • test_use, statistical testing method, default wilcox, other options: negbinom, LR, t, DESeq2
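
The structure of the runDA input string can be made explicit with a small Python sketch. It only illustrates the format described above (path, group one, group two, separated by commas, with colons between cluster names); the function names are hypothetical and this is not the module's own parser.

```python
def parse_group(text: str, keyword: str):
    """Return the keyword itself ('one' or 'rest') or a list of cluster names."""
    return keyword if text == keyword else text.split(":")


def parse_da_input(spec: str):
    """Split 'seurat_obj.rds,0:1,2' into (path, group1, group2)."""
    path, group1, group2 = spec.split(",")
    return path, parse_group(group1, "one"), parse_group(group2, "rest")


print(parse_da_input("seurat_obj.rds,0:1,2"))    # clusters 0 and 1 vs cluster 2
print(parse_da_input("seurat_obj.rds,0,rest"))   # cluster 0 vs all other cells
print(parse_da_input("seurat_obj.rds,one,rest")) # every cluster vs the rest, in turn
```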

2.5 runGO

Perform GO term enrichment analysis for genes close to cluster-specific peaks

  • input: differentially accessible features file, outputted from the runDA module (.tsv file)

  • output: enriched GO terms in .xlsx format saved in the same directory as the input file

2.6 runCicero

Run cicero for calculating gene activity score and predicting cis chromatin interactions

  • input: seurat_obj.rds file outputted from the clustering module

  • output: cicero gene activity in .rds format and predicted interactions in .txt format, saved in output/downstream_analysis/PEAK_CALLER/CELL_CALLER/

2.7 split_bam

Split bam file to generate bam file for each cluster

  • input: barcodes with cluster label (cell_cluster_table.tsv file, outputted from clustering module)

  • note: users can specify any two column text files, for barcodes and the corresponding cluster/subpopulation label.

  • output: .bam file (saved in output/downstream/PEAK_CALLER/CELL_CALLER/data_by_cluster), .bw, .bedgraph (saved in output/signal/) file for each cluster/subpopulation
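
The two-column table that drives the split can be read as sketched below in Python. The sketch assumes a tab-separated file with the barcode in the first column, the cluster label in the second and no header row; these layout details and the function name are assumptions, so adjust them to the actual file.

```python
import csv
import io
from collections import defaultdict


def barcodes_by_cluster(handle):
    """Group barcodes by cluster label from a two-column, tab-separated table."""
    groups = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        barcode, cluster = row[0], row[1]
        groups[cluster].append(barcode)
    return dict(groups)


# A made-up table standing in for cell_cluster_table.tsv.
table = io.StringIO("AAA\t0\nCCC\t1\nGGG\t0\n")
groups = barcodes_by_cluster(table)
print(groups)
```

Each list of barcodes then identifies the reads that belong in the bam file for that cluster.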

2.8 footprint

Perform TF footprinting analysis, supports comparison between two sets of cell clusters and one cluster vs the rest of cell clusters (one-vs-rest)

  • input: two groups of cells, separated by a comma, where each group is given as a combination of cluster labels; default 0:1,2, which compares clusters 0 and 1 with cluster 2

  • Note: you can also specify ‘one,rest’ to run every one-vs-rest comparison (each cluster against all remaining clusters).

  • output: footprinting summary statistics in tables and heatmap, saved in output/downstream/PEAK_CALLER/CELL_CALLER/

2.9 Perform all downstream analysis by one command

Perform all downstream analyses, including clustering, motif_analysis, split_bam (optional) and footprinting analysis (optional). The corresponding parameters should be set up in the configure_user.txt file.

  • input: filtered peak-by-cell matrix file, outputted from call_cell module

  • output: all outputs from each module

3 Data Integration

3.1 mergePeaks

Merge peaks (called from different data sets) if the distance is less than a given size in basepairs (200 if not specified)

  • input: peak files and a distance parameter separated by comma: peakFile1,peakFile2,peakFile3,200

  • output: merged peaks saved in file output/peaks/merged.bed
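
The merging rule (join peaks whose distance is less than the given size) can be sketched in Python as follows. This is only an illustration of the rule, not the module's implementation, which may rely on an external tool; the function name and example coordinates are assumptions.

```python
def merge_peaks(peaks, max_dist=200):
    """Merge peaks on the same chromosome whose gap is less than max_dist bp.

    peaks: iterable of (chrom, start, end) tuples, possibly pooled from several files.
    """
    merged = []
    for chrom, start, end in sorted(peaks):
        last = merged[-1] if merged else None
        if last and last[0] == chrom and start - last[2] < max_dist:
            # Close enough to the previous peak: extend it.
            merged[-1] = (chrom, last[1], max(last[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


pooled = [
    ("chr1", 450, 600),    # 150 bp after the first chr1 peak, so merged with it
    ("chr1", 100, 300),
    ("chr1", 1000, 1200),  # 400 bp after the merged peak, so kept separate
    ("chr2", 100, 200),
]
print(merge_peaks(pooled))
```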

3.2 reconstMtx

Reconstruct the peak-by-cell matrix given a peak file, a fragments.tsv.gz file, a barcodes.txt file and an optional path for the reconstructed matrix

  • input: different files separated by comma: peakFilePath,fragmentFilePath,barcodesPath,reconstructMatrixPath

  • output: reconstructed peak-by-cell matrix saved in reconstructMatrixPath, if reconstructMatrixPath is not specified, a sub-folder reConstruct_matrix will be created under the same path as the input barcodes.txt file

3.3 integrate_mtx

Perform integration of two or more data matrices that have the same rownames (set of peaks)

  • input: paths to the matrices, separated by commas, like mtx1_path,mtx2_path

  • output: integrated seurat obj and umap plot, saved in output/integrated/

3.4 integrate

Perform integration of two or more data sets, given the corresponding peaks for each data set.

  • input: peak/feature files and an optional distance parameter, separated by commas: peak_file1,peak_file2,200

  • output: merged peaks, reconstructed matrix, integrated seurat obj and umap plot, saved in output/integrated/

  • Note: this module searches for the corresponding fragments.tsv.gz and barcodes.txt files for each data set, merges all peaks within the given distance (200bp if not specified), reconstructs the matrix with the merged peaks, and performs matrix integration. In other words, it is a combination of the modules mergePeaks, reconstMtx and integrate_mtx.

3.5 labelTransfer

Label transfer (cell annotation) from scRNA-seq data

  • input: paths for a seurat object for scATAC-seq, a seurat object for scRNA-seq data in .rds format, and an optional .gtf file for gene annotation, separated by a comma.

  • output: an updated seurat object for scATAC-seq with Predicted_Cell_Type as a metadata variable, and a umap plot colored by Predicted_Cell_Type, saved in the same directory as the input scATAC-seq seurat object.

  • Note: the cell annotation should be given as a metadata variable (named Cell_Type) in the seurat object of the scRNA-seq data. Both seurat objects should already have pca and umap dimension reduction done.

4 Visualization

4.1 report

Generate summary report in html file

  • input: directory to QC files, output/summary as default

  • output: summary report in html format, saved in output/summary and .eps figures for each panel saved in output/summary/Figures and tables in output/summary/Tables

4.2 visualize

Interactively visualize the data through VisCello

  • input: VisCello_obj directory, outputted from the clustering module

  • output: launches VisCello in a web browser for interactive visualization

5 Related to 10x bam files

5.1 convert10xbam

Convert bam file in 10x genomics format to bam file in scATAC-pro format

  • input: bam file (position sorted) in 10x format

  • output: position sorted bam file in scATAC-pro format saved in output/mapping_result, mapping qc stat and fragment.txt files saved in output/summary/

5.2 addCB2bam

Add cell barcode tag to bam file

  • input: a bam file generated by scATAC-pro

  • output: the bam file with column ‘CB:Z:cellbarcode’ added (saved in the same directory as the input bam file)
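
What adding the tag amounts to for a single alignment record can be sketched in Python on a SAM text line. The sketch assumes the cell barcode is the part of the read name before the first colon, in line with the read naming used by demplx_fastq; that assumption, the function name and the example record are not taken from the module itself.

```python
def add_cb_tag(sam_line: str) -> str:
    """Append a CB:Z:<barcode> optional field to one SAM alignment record."""
    fields = sam_line.rstrip("\n").split("\t")
    # Assumed naming: <barcode>:<original_read_name> in the QNAME field.
    barcode = fields[0].split(":", 1)[0]
    fields.append(f"CB:Z:{barcode}")
    return "\t".join(fields)


record = "GGG_CCC_AAA:read1\t99\tchr1\t100\t42\t4M\t=\t180\t84\tACGT\tIIII"
tagged = add_cb_tag(record)
print(tagged.split("\t")[-1])  # CB:Z:GGG_CCC_AAA
```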