1 Processing

1.1 demplx_fastq
1.2 trimming
1.3 mapping
1.4 call_peak
1.5 aggr_signal
1.6 qc_per_barcode
1.7 get_mtx
1.8 call_cell
1.9 get_bam4Cells
1.10 Perform all process steps in one command

1.10.1 process
1.10.2 process_no_dex
1.10.3 process_with_bam

2 Downstream Analysis

2.1 rmDoublets
2.2 clustering
2.3 motif_analysis
2.4 runDA
2.5 runGO
2.6 runCicero
2.7 split_bam
2.8 footprint
2.9 Perform all downstream analysis by one command

3 Data Integration

3.1 mergePeaks
3.2 reconstMtx
3.3 integrate_mtx
3.4 integrate
3.5 labelTransfer

4 Visualization

4.1 report
4.2 visualize

5 Relate to 10x bam file

5.1 convert10xbam
5.2 addCB2bam

This page shows notes for each module. Note that the parameters can be set up in the configure_user.txt file.

1 Processing

1.1 demplx_fastq

Add the barcode information to the name of each read in the fastq files.

Input: one of
- Paths to the fastq files for each of the paired-end reads and for the index information, separated by commas, like PE1_fastq,PE2_fastq,index1_fastq(,index2_fastq,index3_fastq…)
- Path to the folder of 10x fastq files
Output: Demultiplexed PE1 and PE2 fastq files with index information embedded in the read name as: @index3_index2_index1:original_read_name, saved in output/demplxed_fastq/
Note: there could be multiple index files
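
The read-name convention above can be illustrated with a short Python sketch. This is not the scATAC-pro implementation: the function name tag_read_name and the sample sequences are hypothetical, and the sketch only shows how the index sequences would be joined (last index first) and prepended to the original read name.

```python
def tag_read_name(read_name, index_seqs):
    """Return a fastq header of the form @index3_index2_index1:original_read_name.

    read_name  -- original header line, with or without the leading '@'
    index_seqs -- index sequences in file order: [index1, index2, index3, ...]
    """
    original = read_name[1:] if read_name.startswith("@") else read_name
    # The documented layout lists the last index first, so reverse the order.
    barcode = "_".join(reversed(index_seqs))
    return "@" + barcode + ":" + original


if __name__ == "__main__":
    # Hypothetical read with three index reads.
    print(tag_read_name("@read42/1", ["AAAA", "CCCC", "GGGG"]))
```

With a single index file the barcode is just that one sequence followed by a colon and the original name.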

1.2 trimming

Perform sequence adapter trimming.

Input: demultiplexed PE1 and PE2 fastq files, separated by a comma
Output: trimmed and demultiplexed PE1 and PE2 fastq files, saved in output/trimmed_fastq/
Parameters:
- TRIM_METHOD, default value ‘trim_galore’, other options: ‘none’ or ‘Trimmomatic’
- ADAPTER_SEQ, default NA, set it to the path of the adapter .fa file if TRIM_METHOD is set to Trimmomatic, otherwise ignore it
Note: you can specify TRIM_METHOD=none to skip the trimming step, which is usually OK.

1.3 mapping

Sequence alignment.

Input: the demultiplexed (and trimmed) paired-end fastq files, separated by comma: PE1.fastq,PE2.fastq
Output: position-sorted bam file and position-sorted MAPQ30 bam file, saved in output/mapping_result/; plain text files of mapping QC metrics and the fragments.txt file, saved in output/summary/
Parameters:
- MAPPING_METHOD, default ‘bwa’, Read alignment method, three options: bwa, bowtie or bowtie2
- BWA_OPTS: default ‘-t 16’, additional options for bwa, ignore it if MAPPING_METHOD is not set to bwa
- BOWTIE_OPTS, BOWTIE2_OPTS: additional options for running bowtie and bowtie2
- BWA_INDEX, Index file for bwa of the used genome (the path of the .fa file of the genome)
- MAPQ, default 30, MAPQ score cutoff used to filter out low-quality reads
- CELL_MAPQ_QC, default TRUE, Report mapping qc for cell barcodes (need to run module get_bam4Cells) or not
Note: you only need to set one of BWA_OPTS, BOWTIE_OPTS and BOWTIE2_OPTS, and only one of BWA_INDEX, BOWTIE_INDEX and BOWTIE2_INDEX, corresponding to the specified mapping method (MAPPING_METHOD)

1.4 call_peak

Peak calling using aggregated bam file.

Input: the MAPQ30 bam file outputted from the mapping step.
Output: peak files, saved as output/peaks/PEAK_CALLER/OUTPUT_PREFIX_features_Blacklist_Removed.bed
Parameters:
- PEAK_CALLER, peak calling method, default ‘MACS2’, three options: MACS2, BIN, COMBINED
- MACS2_OPTS: extra options passed to macs2, default ‘-q 0.05 -g hs --nomodel --extsize 200 --shift -100’ (NO NEED TO SPECIFY -t, -n, -f here)

1.5 aggr_signal

Aggregate and normalize signal into .bw or .bedgraph file (can be uploaded to UCSC genome browser).

Input: the MAPQ30 bam file outputted from the mapping step.
Output: Aggregated data in .bw and .bedgraph file, saved in output/signal/

1.6 qc_per_barcode

Generate quality control metrics for each barcode.

Input: fragments.tsv.gz file (outputted from module mapping) and peak file (outputted from module call_peak), separated by comma
Output: qc_per_barcode.txt file, saved in output/summary/
Note: these qc metrics for each cell are loaded into the seurat object as metadata when the clustering module is executed

1.7 get_mtx

Build raw peak-by-cell matrix

input: fragments.tsv.gz file, outputted from the mapping module, and features/peak file, outputted from the call_peak module, separated by a comma
output: sparse peak-by-cell count matrix in Matrix Market format, barcodes and feature files in plain text format, saved in output/raw_matrix/PEAK_CALLER/
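
As a rough illustration of what a peak-by-cell count matrix in Matrix Market format contains, here is a minimal Python sketch. It assumes that a fragment is counted for a peak whenever the two intervals overlap, which may differ from the counting rule scATAC-pro actually uses; the function names and the toy data are hypothetical.

```python
from collections import Counter


def count_fragments(fragments, peaks, barcodes):
    """Count fragments per (peak, barcode) pair.

    fragments -- iterable of (chrom, start, end, barcode)
    peaks     -- list of (chrom, start, end)
    barcodes  -- list of barcodes to keep (matrix columns)
    Returns a Counter keyed by (peak_index, barcode_index), both 0-based.
    """
    bc_index = {bc: j for j, bc in enumerate(barcodes)}
    counts = Counter()
    for chrom, start, end, bc in fragments:
        if bc not in bc_index:
            continue
        # Brute-force scan; fine for a toy example, too slow for real data.
        for i, (p_chrom, p_start, p_end) in enumerate(peaks):
            if chrom == p_chrom and start < p_end and end > p_start:
                counts[(i, bc_index[bc])] += 1
    return counts


def to_matrix_market(counts, n_peaks, n_cells):
    """Render the counts as Matrix Market coordinate lines (1-based indices)."""
    lines = ["%%MatrixMarket matrix coordinate integer general",
             "%d %d %d" % (n_peaks, n_cells, len(counts))]
    for (i, j), value in sorted(counts.items()):
        lines.append("%d %d %d" % (i + 1, j + 1, value))
    return lines


if __name__ == "__main__":
    frags = [("chr1", 100, 250, "AAA"), ("chr1", 120, 260, "AAA"),
             ("chr1", 5000, 5100, "CCC"), ("chr2", 10, 50, "TTT")]
    peaks = [("chr1", 90, 300), ("chr1", 4900, 5200)]
    cells = ["AAA", "CCC"]
    print("\n".join(to_matrix_market(count_fragments(frags, peaks, cells), 2, 2)))
```

Rows correspond to lines of the features file and columns to lines of the barcodes file, which is why those two plain text files accompany the matrix.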

1.8 call_cell

Perform cell calling

input: raw peak-by-barcode matrix file, outputted from the get_mtx module
output: filtered peak-by-cell matrix in Matrix Market format and .rds format, barcodes and features, saved in output/filtered_matrix/PEAK_CALLER/CELL_CALLER/
Parameters:
- CELL_CALLER, cell calling method, default ‘filtered’, three options: filtered, EmptyDrop, cellranger
- EmptyDrop_FDR: FDR for EmptyDrop, default 0.001, (NO NEED TO SPECIFY if EmptyDrop method is not selected)
- FILTER_BC_CUTOFF, filtering rules if ‘filtered’ is selected as the CELL_CALLER, default: --min_uniq_frags 3000 --max_uniq_frags 50000 --min_frac_peak 0.5 --min_frac_tss 0.0 --min_frac_promoter 0 --min_frac_enhancer 0.0 --max_frac_mito 0.1 --min_tss_escore 3 (ignored if CELL_CALLER is set to anything other than filtered)
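
To make the FILTER_BC_CUTOFF rules concrete, the following Python sketch parses the default option string and applies it to one barcode's QC metrics. It is only an illustration: the metric names (uniq_frags, frac_peak and so on) are inferred from the option names and are assumptions, as is the choice to treat the thresholds as inclusive bounds.

```python
import shlex

DEFAULT_CUTOFF = ("--min_uniq_frags 3000 --max_uniq_frags 50000 --min_frac_peak 0.5 "
                  "--min_frac_tss 0.0 --min_frac_promoter 0 --min_frac_enhancer 0.0 "
                  "--max_frac_mito 0.1 --min_tss_escore 3")


def parse_cutoffs(opts):
    """Turn '--min_x 1 --max_y 2' into {'min_x': 1.0, 'max_y': 2.0}."""
    tokens = shlex.split(opts)
    return {flag.lstrip("-"): float(value)
            for flag, value in zip(tokens[0::2], tokens[1::2])}


def passes(qc, cutoffs):
    """Return True if a barcode's QC metrics satisfy every min/max rule.

    qc maps a metric name (assumed, e.g. 'uniq_frags') to its value.
    """
    for rule, threshold in cutoffs.items():
        bound, metric = rule.split("_", 1)  # 'min_frac_peak' -> ('min', 'frac_peak')
        value = qc[metric]
        if bound == "min" and value < threshold:
            return False
        if bound == "max" and value > threshold:
            return False
    return True


if __name__ == "__main__":
    cutoffs = parse_cutoffs(DEFAULT_CUTOFF)
    barcode_qc = {"uniq_frags": 8000, "frac_peak": 0.6, "frac_tss": 0.2,
                  "frac_promoter": 0.1, "frac_enhancer": 0.1,
                  "frac_mito": 0.01, "tss_escore": 5}
    print(passes(barcode_qc, cutoffs))
```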

1.9 get_bam4Cells

Extract bam file for cell barcodes and calculate mapping stats correspondingly

input: A bam file for aggregated data outputted from mapping module and a barcodes.txt file outputted from module call_cell, separated by comma
output: A bam file for cell barcodes, saved in output/mapping_result, and mapping stats for cell barcodes (optional), saved in output/summary

1.10 Perform all process steps in one command

Some of the processing modules can be run together by a single command:

1.10.1 process

processing data - including demplx_fastq, mapping, call_peak, get_mtx, aggr_signal, qc_per_barcode, call_cell and get_bam4Cells

input: either the fastq files for both reads and the index, separated by commas, like fastq1,fastq2,index_fastq1,index_fastq2,index_fastq3…, or the path to a folder of 10x fastq files (PATH_TO_10xfastqs_folder)
output: peak-by-cell matrix and all intermediate results

1.10.2 process_no_dex

Run all processing modules except the demultiplexing step

input: demultiplexed fastq files for both reads, separated by a comma, like fastq1,fastq2
output: peak-by-cell matrix and all intermediate results

1.10.3 process_with_bam

Conduct all processing modules after mapping step

input: bam file for aggregated data, outputted from the mapping module
output: filtered peak-by-cell matrix and all intermediate results

2 Downstream Analysis

2.1 rmDoublets

Remove potential doublets

input: a peak-by-cell matrix file or a seurat object file in .rds format, and the expected fraction of doublets, separated by a comma
output: doublet-removed matrix.rds and barcodes.txt files, and seurat objects with and without doublets, saved in the input directory (plus a umap plot colored by singlet/doublet)

2.2 clustering

cell clustering

input: filtered peak-by-cell matrix file, outputted from the call_cell module (or a seurat.rds file)
output: seurat object with cluster labels in the metadata (.rds file), barcodes with cluster labels (cell_cluster_table.tsv file), and a umap plot colored by cluster label
Parameters to specify (in configure_user.txt file):
- norm_by, normalization method, default tf-idf, other options: log (just log transformation) or NA (no normalization)
- Top_Variable_Features, number/fraction of variable features used for seurat, default 5000, other options: a real number within (0, 1)
- REDUCTION, dimension reduction method, default pca, other option: lda; note that UMAP and TSNE will be calculated automatically
- nREDUCTION, number of reduced dimensions, default 30
- CLUSTERING_METHOD, clustering method, default seurat (the same as Louvain), options: seurat/Louvain/cisTopic/kmeans/LSI/SCRAT/chromVAR/scABC
- K_CLUSTERS, either the number of clusters (an integer) or the resolution parameter (a float number) for the louvain algorithm (implemented by seurat), default 0.2
- prepCello, generate inputs for VisCello (for visualization) or not, default TRUE
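
Two of these parameters accept either an integer or a fraction, which is easy to misread. The Python sketch below shows one plausible way to interpret them. It is an assumption-laden illustration, not the scATAC-pro code: the function names are hypothetical, and treating any value that parses as an integer as a cluster count is a guess based on the description above.

```python
def n_variable_features(top_variable_features, n_features):
    """Resolve Top_Variable_Features to a feature count.

    A value strictly between 0 and 1 is read as a fraction of all features;
    anything else is read as an absolute number of features.
    """
    if 0 < top_variable_features < 1:
        return int(round(top_variable_features * n_features))
    return int(top_variable_features)


def interpret_k_clusters(value):
    """Read K_CLUSTERS as a cluster count (integer) or a resolution (float)."""
    text = str(value).strip()
    try:
        return ("n_clusters", int(text))
    except ValueError:
        # Not an integer literal, so treat it as the Louvain resolution.
        return ("resolution", float(text))


if __name__ == "__main__":
    print(n_variable_features(5000, 120000))
    print(n_variable_features(0.1, 120000))
    print(interpret_k_clusters("0.2"))
    print(interpret_k_clusters("8"))
```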

2.3 motif_analysis

Motif analysis based on chromVAR

input: filtered peak-by-cell matrix file, outputted from the call_cell module
output: a chromVAR object with TF-by-cell deviation score/zscore, and a table and heatmap indicating TF enrichment for each cell cluster, saved in output/downstream_analysis/PEAK_CALLER/CELL_CALLER/

2.4 runDA

Perform differential accessibility analysis for peaks

input: path to a seurat object plus the two groups of clusters to compare, separated by commas, for example:
- seurat_obj.rds,0:1,2 (compares cells in cluster 0 or cluster 1 with cells in cluster 2)
- seurat_obj.rds,0,rest (compares cells in cluster 0 with the rest of the cells)
- seurat_obj.rds,one,rest (compares the cells of each cluster, in turn, with the rest of the cells)
- Note: the groups specified here overwrite the group1 and group2 parameters in the configure_user.txt file
output: differentially accessible peaks in a tsv file, saved in the same folder as the input seurat object
Parameters:
- group1, group one, default 0:1, could be either one or multiple cluster names, separated by colon, or ‘one’
- group2, group two, default 2, cluster names as for group1, or ‘rest’
- test_use, statistical testing method, default wilcox, other options: negbinom, LR, t, DESeq2
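
The input syntax for runDA can be unpacked as in the Python sketch below. It is a hypothetical illustration of the documented format (path, group1, group2 separated by commas, clusters within a group separated by colons), not the parser scATAC-pro uses; both function names are made up.

```python
def parse_da_input(spec):
    """'seurat_obj.rds,0:1,2' -> ('seurat_obj.rds', ['0', '1'], ['2'])."""
    path, group1, group2 = spec.split(",")
    return path, group1.split(":"), group2.split(":")


def expand_comparisons(group1, group2, clusters):
    """List the (group1, group2) cluster pairs implied by 'one' and 'rest'.

    clusters -- all cluster names present in the seurat object
    """
    # 'one' means: use every cluster, in turn, as group1.
    firsts = [[c] for c in clusters] if group1 == ["one"] else [group1]
    pairs = []
    for first in firsts:
        # 'rest' means: every cluster that is not in group1.
        second = [c for c in clusters if c not in first] if group2 == ["rest"] else group2
        pairs.append((first, second))
    return pairs


if __name__ == "__main__":
    path, g1, g2 = parse_da_input("seurat_obj.rds,one,rest")
    for first, second in expand_comparisons(g1, g2, ["0", "1", "2"]):
        print(first, "vs", second)
```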

2.5 runGO

Perform GO term enrichment analysis for genes close to cluster-specific peaks

input: differentially accessible features file, outputted from the runDA module (.tsv file)
output: enriched GO terms in .xlsx format saved in the same directory as the input file

2.6 runCicero

Run cicero for calculating gene activity score and predicting cis chromatin interactions

input: seurat_obj.rds file outputted from the clustering module
output: cicero gene activity in .rds format and predicted interactions in .txt format, saved in output/downstream_analysis/PEAK_CALLER/CELL_CALLER/

2.7 split_bam

Split bam file to generate bam file for each cluster

input: barcodes with cluster label (cell_cluster_table.tsv file, outputted from clustering module)
note: users can specify any two-column text file containing the barcodes and the corresponding cluster/subpopulation labels.
output: .bam file (saved in output/downstream/PEAK_CALLER/CELL_CALLER/data_by_cluster), .bw, .bedgraph (saved in output/signal/) file for each cluster/subpopulation

2.8 footprint

Perform TF footprinting analysis, supports comparison between two sets of cell clusters and one cluster vs the rest of cell clusters (one-vs-rest)

input: two groups of cells (separated by a comma), each group labeled with a combination of cluster labels, default 0:1,2, which compares clusters 0 and 1 with cluster 2
Note: you can also specify ‘one,rest’ to run every one-cluster-vs-the-rest comparison.
output: footprinting summary statistics in tables and heatmap, saved in output/downstream/PEAK_CALLER/CELL_CALLER/

2.9 Perform all downstream analysis by one command

Perform all downstream analyses, including clustering, motif_analysis, split_bam (optional) and footprinting analysis (optional). The corresponding parameters should be set up in the configure_user.txt file.

input: filtered peak-by-cell matrix file, outputted from call_cell module
output: all outputs from each module

3 Data Integration

3.1 mergePeaks

Merge peaks (called from different data sets) if the distance is less than a given size in basepairs (200 if not specified)

input: peak files and a distance parameter separated by comma: peakFile1,peakFile2,peakFile3,200
output: merged peaks saved in file output/peaks/merged.bed
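
The merging rule can be sketched in a few lines of Python. This is a minimal illustration under the assumption that "distance" means the gap between the end of one peak and the start of the next on the same chromosome; the function name merge_peaks is hypothetical and the module itself may rely on a different tool.

```python
def merge_peaks(peaks, max_gap=200):
    """Merge peaks on the same chromosome whose gap is less than max_gap bp.

    peaks -- iterable of (chrom, start, end) pooled from all peak files
    Overlapping peaks have a gap of zero or less, so they are always merged.
    """
    merged = []
    for chrom, start, end in sorted(peaks):
        if merged and merged[-1][0] == chrom and start - merged[-1][2] < max_gap:
            last = merged[-1]
            merged[-1] = (chrom, last[1], max(last[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


if __name__ == "__main__":
    pooled = [("chr1", 100, 300), ("chr1", 450, 600),
              ("chr1", 1000, 1200), ("chr2", 100, 200)]
    for peak in merge_peaks(pooled, 200):
        print(*peak, sep="\t")
```

In the toy input the first two chr1 peaks are 150 bp apart and are merged, while the third is 400 bp away and stays separate.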

3.2 reconstMtx

Reconstruct the peak-by-cell matrix given a peak file, a fragments.tsv.gz file, a barcodes.txt file and an optional path for the output reconstructed matrix

input: the files, separated by commas: peakFilePath,fragmentFilePath,barcodesPath,reconstructMatrixPath
output: reconstructed peak-by-cell matrix saved in reconstructMatrixPath. If reconstructMatrixPath is not specified, a sub-folder reConstruct_matrix will be created under the same path as the input barcodes.txt file

3.3 integrate_mtx

Perform integration of two or more data matrices that have the same rownames (set of peaks)

input: paths to the matrices, separated by commas, like mtx1_path,mtx2_path
output: integrated seurat obj and umap plot, saved in output/integrated/

3.4 integrate

Perform integration of two or more data sets, given the corresponding peaks for each data set.

input: peak/feature files and an optional distance parameter, separated by commas: peak_file1,peak_file2,200
output: merged peaks, reconstructed matrix, integrated seurat obj and umap plot, saved in output/integrated/
Note: this module will search for the corresponding fragments.tsv.gz file and barcodes.txt file for each data set, merge all the peaks within 200bp distance, reconstruct the matrix with the merged peaks, and perform matrix integration. In other words, it is a combination of the modules mergePeaks, reconstMtx and integrate_mtx.

3.5 labelTransfer

Label transfer (cell annotation) from scRNA-seq data

input: paths for a seurat object for scATAC-seq, a seurat object for scRNA-seq data in .rds format, and an optional .gtf file for gene annotation, separated by a comma.
output: an updated seurat object for atac with Predicted_Cell_Type as a metadata variable and a umap plot colored by Predicted_Cell_Type, saved in the same directory as the input atac seurat object.
Note: the cell annotation should be given as a metadata variable (named Cell_Type) in the seurat object of scRNA-seq. Both seurat objects should have pca and umap dimension reduction done.

4 Visualization

4.1 report

Generate summary report in html file

input: directory to QC files, output/summary as default
output: summary report in html format, saved in output/summary; .eps figures for each panel, saved in output/summary/Figures; and tables, saved in output/summary/Tables

4.2 visualize

Interactively visualize the data through VisCello

input: VisCello_obj directory, outputted from the clustering module
output: launches VisCello in a web browser for interactive visualization

5 Relate to 10x bam file

5.1 convert10xbam

Convert bam file in 10x genomics format to bam file in scATAC-pro format

input: bam file (position sorted) in 10x format
output: position sorted bam file in scATAC-pro format saved in output/mapping_result, mapping qc stat and fragment.txt files saved in output/summary/

5.2 addCB2bam

Add cell barcode tag to bam file

input: a bam file generated by scATAC-pro
output: the bam file with column ‘CB:Z:cellbarcode’ added (saved in the same directory as the input bam file)
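
What the added column looks like can be shown on a single SAM text line. The Python sketch below assumes the cell barcode is the part of the read name before the first colon, following the read-name layout produced by demplx_fastq; that assumption, the function name add_cb_tag and the example record are illustrative only, and the module itself works on bam files.

```python
def add_cb_tag(sam_line):
    """Append CB:Z:<barcode> to one SAM alignment line.

    Header lines (starting with '@') are returned unchanged. The barcode is
    assumed to be the prefix of the read name up to the first ':'.
    """
    line = sam_line.rstrip("\n")
    if line.startswith("@"):
        return line
    fields = line.split("\t")
    barcode = fields[0].split(":", 1)[0]
    fields.append("CB:Z:" + barcode)
    return "\t".join(fields)


if __name__ == "__main__":
    record = "\t".join(["GGGG_CCCC_AAAA:read42", "0", "chr1", "100", "60",
                        "4M", "*", "0", "0", "ACGT", "FFFF"])
    print(add_cb_tag(record))
```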
