Customized Secondary Analysis with Cell Ranger ARC Reanalyze

The cellranger-arc reanalyze command reruns secondary analysis performed on the peak-barcode matrix (dimensionality reduction, clustering and visualization) using different parameter settings.

These are the required command line arguments:

Argument	Description
`--id=ID`	Required. A unique run id and output folder name [a-zA-Z0-9_-]+ of maximum length 64 characters.
`--matrix=H5`	Required. Path to a feature-barcode matrix H5 generated by `cellranger-arc count` or `aggr`. If you intend to subset to a set of barcodes then use the raw matrix, otherwise use the filtered feature-barcode matrix.
`--atac-fragments=TSV.GZ`	Required. Path to the `atac_fragments.tsv.gz` generated by `cellranger-arc count` or `aggr`. Note it is assumed that the tabix index file `atac_fragments.tsv.gz.tbi` is present in the same directory.
`--reference=PATH`	Required. Path to folder containing cellranger-arc-compatible reference. Reference packages can be downloaded from support.10xgenomics.com or constructed using the `cellranger-arc mkref` command. Note this reference must match the reference used for the initial `cellranger-arc count` run.
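
Two of the requirements above are easy to check before launching a run: the format of the run id and the presence of the tabix index next to the fragments file. The following is a minimal sketch of such a pre-flight check. The helper names `is_valid_run_id` and `has_fragments_index` are hypothetical and are not part of cellranger-arc.

```shell
# Hypothetical helper: succeed only if the run id uses characters from
# [a-zA-Z0-9_-] and is between 1 and 64 characters long.
is_valid_run_id() {
  case "$1" in
    '' | *[!a-zA-Z0-9_-]*) return 1 ;;
  esac
  [ "${#1}" -le 64 ]
}

# Hypothetical helper: succeed only if the fragments file exists and its
# tabix index (<file>.tbi) is present in the same directory.
has_fragments_index() {
  [ -f "$1" ] && [ -f "$1.tbi" ]
}
```

For example, `is_valid_run_id AGG123_reanalysis && has_fragments_index /home/jdoe/runs/AGG123/outs/atac_fragments.tsv.gz` returns success only when both conditions hold.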

Optional command line parameters are listed below (also available through cellranger-arc reanalyze --help):

Option	Description
`--description=TXT`	Sample description to embed in output files [default: ]
`--barcodes=LIST`	Specify barcodes to use in analysis. The barcodes can be specified in a text file that contains one barcode per line (blank lines are ignored). CSV (with or without a header) is also accepted. Only the first column of the CSV is used, which matches the format of exports from Loupe Browser. Required if neither `--peaks` nor `--params` has been specified.
`--min-atac-count=NUM`	Cell caller override: define the minimum number of ATAC transposition events in peaks (ATAC counts) for a cell barcode. Note: this option must be specified in conjunction with `--min-gex-count`. With `--min-atac-count=X` and `--min-gex-count=Y` a barcode is defined as a cell if it contains at least X ATAC counts AND at least Y GEX UMI counts.
`--min-gex-count=NUM`	Cell caller override: define the minimum number of GEX UMI counts for a cell barcode. Note: this option must be specified in conjunction with `--min-atac-count`. With `--min-atac-count=X` and `--min-gex-count=Y` a barcode is defined as a cell if it contains at least X ATAC counts AND at least Y GEX UMI counts.
`--peaks=BED`	Override peak caller: specify peaks to use in secondary analyses from supplied 3-column BED file. The supplied peaks file must be sorted by position and not contain overlapping peaks; comment lines beginning with `#` are allowed. Required if neither `--barcodes` nor `--params` has been specified.
`--params=CSV`	Specify key-value pairs in CSV format for analysis: any subset of `random_seed`, `k_means_max_clusters`, `feature_linkage_max_dist_mb`, `num_gex_pcs`, `num_atac_pcs`. For example, to set the number of GEX principal components to 15 and the distance threshold for feature linkage computation to 2.5 Mb, the CSV would contain two rows: `num_gex_pcs,15` and `feature_linkage_max_dist_mb,2.5`. Required if neither `--peaks` nor `--barcodes` has been specified.
`--agg=AGGREGATION_CSV`	If the input matrix was produced by `cellranger-arc aggr`, it is possible to pass the same aggregation CSV in order to retain per-library tag information in the resulting `.cloupe` file.
`--jobmode=MODE`	Job manager to use. Valid options: local, sge, lsf, slurm, or path to a `.template` file. Search for help on "Cluster Mode" at support.10xgenomics.com for more details on configuring the pipeline to use a compute cluster [default: local].
`--localcores=NUM`	Set max cores the pipeline may request at one time. Only applies to local jobs.
`--localmem=NUM`	Set max GB the pipeline may request at one time. Only applies to local jobs.
`--localvmem=NUM`	Set max virtual address space in GB for the pipeline. Only applies to local jobs.
`--mempercore=NUM`	Reserve enough threads for each job to ensure enough memory will be available, assuming each core on your cluster has at least this much memory available. Only applies to cluster jobmodes.
`--maxjobs=NUM`	Set max jobs submitted to cluster at one time. Only applies to cluster jobmodes.
`--jobinterval=NUM`	Set delay between submitting jobs to cluster, in ms. Only applies to cluster jobmodes.
`--overrides=PATH`	The path to a JSON file that specifies stage-level overrides for cores and memory. Finer-grained than `--localcores`, `--mempercore`, and `--localmem`.
`--uiport=PORT`	Serve web UI at `http://localhost:PORT`
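
The constraints on the `--peaks` file (three columns, sorted by position, no overlapping peaks, `#` comment lines allowed) can be checked before a run. The sketch below uses a hypothetical helper, `check_peaks_bed`, which is not part of cellranger-arc. It assumes tab-separated fields, treats "sorted" as meaning that each chromosome forms one contiguous block with peaks in ascending order, and assumes that adjacent peaks (one ending where the next starts) do not count as overlapping under BED's half-open coordinates. The pipeline's own validation may differ.

```shell
# Hypothetical helper: print each problem found and return non-zero if the
# BED file breaks any of the assumed rules described above.
check_peaks_bed() {
  awk -F'\t' '
    /^#/ { next }                      # comment lines are allowed
    NF != 3 {
      printf "line %d: expected 3 tab-separated columns\n", NR
      bad = 1
      next
    }
    $1 != chrom {                      # first peak of a new chromosome block
      if ($1 in seen) {
        printf "line %d: chromosome %s appears in more than one block\n", NR, $1
        bad = 1
      }
      seen[$1] = 1
      chrom = $1
      prev_end = 0
    }
    ($2 + 0) < prev_end {              # starts before the previous peak ends
      printf "line %d: peak is out of order or overlaps the previous peak\n", NR
      bad = 1
    }
    { prev_end = $3 + 0 }
    END { exit bad + 0 }
  ' "$1"
}
```

Calling `check_peaks_bed my_peaks.bed` (a placeholder file name) prints nothing and returns success when the file passes all checks.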

After determining input arguments and options, run cellranger-arc reanalyze. This example reanalyzes the results of an aggregation named AGG123 in order to filter out doublet barcodes:


$ cd /home/jdoe/runs
$ ls -1 AGG123/outs/*.gz # verify the input file exists
AGG123/outs/atac_fragments.tsv.gz
$ cellranger-arc reanalyze  --id=AGG123_reanalysis \
                            --barcodes=no_doublets.csv \
                            --matrix=/home/jdoe/runs/AGG123/outs/raw_feature_bc_matrix.h5 \
                            --reference=/home/jdoe/refs/hg19 \
                            --atac-fragments=/home/jdoe/runs/AGG123/outs/atac_fragments.tsv.gz

The pipeline will begin to run, creating a new folder named with the reanalysis ID specified with the --id argument (e.g. /home/jdoe/runs/AGG123_reanalysis) for its output. If this output folder already exists, cellranger-arc will assume it is an existing pipestance and attempt to resume running it.
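
Because an existing output folder is treated as a pipestance to resume, it can be useful to confirm that the folder does not exist yet when a fresh run is intended. The following is a minimal sketch of such a guard. The function name `ensure_fresh_run_dir` is hypothetical and is not part of cellranger-arc.

```shell
# Hypothetical guard: fail with a message if the output folder already exists,
# since cellranger-arc would try to resume it rather than start a new run.
ensure_fresh_run_dir() {
  if [ -e "$1" ]; then
    echo "Output folder '$1' already exists; cellranger-arc would attempt to resume it." >&2
    return 1
  fi
}
```

For example, running `ensure_fresh_run_dir /home/jdoe/runs/AGG123_reanalysis` before the reanalyze command stops a script early if the folder is already present.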

A successful run should conclude with a message similar to this:


2021-04-26 03:30:46 [runtime] (update)          ID.AGG123_reanalysis.SC_ATAC_GEX_REANALYZER_CS.ATAC_GEX_CLOUPE_PREPROCESS.fork0 join_running
2021-04-26 03:36:05 [runtime] (join_complete)   ID.AGG123_reanalysis.SC_ATAC_GEX_REANALYZER_CS.ATAC_GEX_CLOUPE_PREPROCESS

Outputs:
- Secondary analysis outputs:
    clustering:
      atac: {
        ...
      }
      gex:  {
        ...
      }
    dimensionality_reduction:
      atac: {
        ...
      }
      gex:  {
        ...
      }
    feature_linkage:
      ...
    tf_analysis:
      ...
- Filtered feature barcode matrix HDF5:          /home/jdoe/runs/AGG123_reanalysis/outs/filtered_feature_bc_matrix.h5
- Loupe browser visualization file:              /home/jdoe/runs/AGG123_reanalysis/outs/cloupe.cloupe
- ATAC peak annotations based on proximal genes: /home/jdoe/runs/AGG123_reanalysis/outs/atac_peak_annotation.tsv
- Secondary analysis summary:                    /home/jdoe/runs/AGG123_reanalysis/outs/summary.json

Pipestance completed successfully!

Refer to the Multiome ATAC + Gene Expression Analysis page for an explanation of the output.

The CSV file passed to --params should have one row for every parameter that you want to customize. There is no header row. If a parameter is not specified in your CSV, its default value will be used.
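
As an illustration of this format, the sketch below writes a parameter file that overrides two values and leaves the rest at their defaults. The file name `params.csv`, the run id, and the paths in the commented-out invocation are placeholders chosen for this example. The invocation is left commented out because it depends on files that exist only on your system.

```shell
# Write a two-row parameter file: no header, one "name,value" pair per row.
cat > params.csv <<'EOF'
num_gex_pcs,15
feature_linkage_max_dist_mb,2.5
EOF

# Placeholder invocation (adjust the id and paths to your own run):
# cellranger-arc reanalyze --id=AGG123_params \
#     --params=params.csv \
#     --matrix=/home/jdoe/runs/AGG123/outs/filtered_feature_bc_matrix.h5 \
#     --reference=/home/jdoe/refs/hg19 \
#     --atac-fragments=/home/jdoe/runs/AGG123/outs/atac_fragments.tsv.gz
```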

Here are detailed descriptions of each parameter. For parameters that subset the data, a default value of null indicates that no subsetting happens by default.

Parameter	Type	Default Value	Recommended Range	Description
`feature_linkage_max_dist_mb`	float	1	0.1-5, depending on what is a biologically meaningful length scale for the organism	Change the distance over which pairs of features are considered for feature linkage estimation. Increasing this number will increase the number of linkages, but features that are very far apart on the genome are less likely to be causally linked.
`num_atac_pcs`	int	15	10-100, depending on the number of cell populations / clusters you expect to see.	Compute N principal components for LSA. Setting this too high may cause spurious clusters to be called.
`num_gex_pcs`	int	10	10-100, depending on the number of cell populations / clusters you expect to see.	Compute N principal components for PCA. Setting this too high may cause spurious clusters to be called.
`k_means_max_clusters`	int	5	2-5, depending on the number of cell populations / clusters you expect to see.	Compute K-means clustering using K values of 2 to N. Setting this too high may cause spurious clusters to be called.
`random_seed`	int	0	any 64-bit integer	Due to the randomized nature of the algorithms, changing this will produce slightly different results.
