Aggregating Multiple GEM Wells with cellranger-arc aggr

When conducting large studies involving multiple GEM wells, run cellranger-arc count on FASTQ data from each of the GEM wells individually, then pool the results using cellranger-arc aggr, as described here.

cellranger-arc aggr is not designed for combining multiple sequencing runs of the same GEM well. Instead, use the --libraries argument of cellranger-arc count and add a row to the libraries CSV for each sequencing run.

cellranger-arc aggr only accepts data generated by version 2.0+ of cellranger-arc count. If you have data from cellranger-arc 1.0.1 or earlier, you need to rerun cellranger-arc count before the data can be aggregated.

The cellranger-arc aggr command takes as input a CSV file specifying a list of cellranger-arc count output files for each GEM well being aggregated and produces a single feature-barcode matrix containing all the data.

When combining multiple GEM wells, the barcode sequences for each channel are distinguished by a GEM well suffix appended to the barcode sequence.

By default, the reads from each GEM well are subsampled so that all GEM wells have the same effective sequencing depth in both the ATAC and gene expression modalities. For ATAC data, depth is measured as the median number of unique fragments per cell; for gene expression, it is measured as the mean number of reads per cell that are confidently mapped to the transcriptome. This normalization can be turned off altogether (see Depth Normalization).

The first step is to run a single instance of cellranger-arc count on each individual GEM well prepared using the 10x Chromium™ platform, as described in Single Sample Analysis.

For example, suppose you ran three count pipelines as follows:


cd /opt/runs
cellranger-arc count --id=LV123 ...
# ... wait for pipeline to finish ...
cellranger-arc count --id=LB456 ...
# ... wait for pipeline to finish ...
cellranger-arc count --id=LP789 ...
# ... wait for pipeline to finish ...

You can aggregate these three runs to get an aggregated matrix and analysis. To do so, first create an Aggregation CSV.

Create a CSV file with a header line containing the following columns:

library_id: Unique identifier for this input GEM well. This will be used for labeling purposes only; it does not need to match any previous ID assigned to the GEM well.
atac_fragments: Path to the atac_fragments.tsv.gz file produced by cellranger-arc count. For example, if you processed your GEM well by calling cellranger-arc count --id=ID in some directory /DIR, the atac_fragments path would be /DIR/ID/outs/atac_fragments.tsv.gz. Note that the index for the fragments file is assumed to be located alongside it, at /DIR/ID/outs/atac_fragments.tsv.gz.tbi.
per_barcode_metrics: Path to the per_barcode_metrics.csv file produced by cellranger-arc count. This file is used to identify the barcodes that are to be considered as cells for secondary analysis.
gex_molecule_info: Path to the gex_molecule_info.h5 file produced by cellranger-arc count.
(Optional) Additional custom columns containing library metadata (e.g., lab or sample origin). These custom library annotations do not affect the analysis pipeline but can be visualized downstream in Loupe Browser. Note that, unlike other CSV inputs to Cell Ranger ARC, these custom columns may contain characters outside the ASCII range (e.g., non-Latin characters).
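
Because aggr reads every path named in this file, it can save time to confirm that the files exist before launching the pipeline. The following is a minimal sketch in POSIX shell and is not part of Cell Ranger ARC: the function name check_aggr_csv is hypothetical, and the parsing assumes that the first four columns appear in the order listed above and contain no quoted commas.

```shell
# check_aggr_csv FILE  (hypothetical helper, not part of Cell Ranger ARC)
# Assumes columns 1-4 are library_id, atac_fragments, per_barcode_metrics,
# gex_molecule_info, in that order, with no quoted commas.
# Returns 0 if every referenced file exists, 1 otherwise.
check_aggr_csv() {
    csv=$1
    missing=0
    cr=$(printf '\r')
    {
        read -r header || true    # discard the header line
        while IFS=, read -r id frag metrics molinfo rest || [ -n "$id" ]; do
            [ -n "$id" ] || continue      # skip blank lines
            molinfo=${molinfo%"$cr"}      # tolerate Windows line endings
            # The fragments index (.tbi) is expected next to the fragments file.
            for f in "$frag" "$frag.tbi" "$metrics" "$molinfo"; do
                if [ ! -f "$f" ]; then
                    echo "missing for $id: $f" >&2
                    missing=1
                fi
            done
        done
    } < "$csv"
    return "$missing"
}
```

Calling check_aggr_csv on the Aggregation CSV prints one line to standard error for each missing file and returns a non-zero status if anything is missing, so it can gate the aggr call in a wrapper script.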

You can either make the CSV file in a text editor, or create it in Excel and export to CSV. Continuing the example from the previous section, your Excel spreadsheet should look like this:

	A	B	C	D
1	library_id	atac_fragments	per_barcode_metrics	gex_molecule_info
2	LV123	/opt/runs/LV123/outs/atac_fragments.tsv.gz	/opt/runs/LV123/outs/per_barcode_metrics.csv	/opt/runs/LV123/outs/gex_molecule_info.h5
3	LB456	/opt/runs/LB456/outs/atac_fragments.tsv.gz	/opt/runs/LB456/outs/per_barcode_metrics.csv	/opt/runs/LB456/outs/gex_molecule_info.h5
4	LP789	/opt/runs/LP789/outs/atac_fragments.tsv.gz	/opt/runs/LP789/outs/per_barcode_metrics.csv	/opt/runs/LP789/outs/gex_molecule_info.h5

When you save it as a CSV, the result looks like this:


library_id,atac_fragments,per_barcode_metrics,gex_molecule_info
LV123,/opt/runs/LV123/outs/atac_fragments.tsv.gz,/opt/runs/LV123/outs/per_barcode_metrics.csv,/opt/runs/LV123/outs/gex_molecule_info.h5
LB456,/opt/runs/LB456/outs/atac_fragments.tsv.gz,/opt/runs/LB456/outs/per_barcode_metrics.csv,/opt/runs/LB456/outs/gex_molecule_info.h5
LP789,/opt/runs/LP789/outs/atac_fragments.tsv.gz,/opt/runs/LP789/outs/per_barcode_metrics.csv,/opt/runs/LP789/outs/gex_molecule_info.h5
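
The same file can also be generated from the run IDs instead of being typed by hand. The sketch below assumes that every run lives under one parent directory and uses the default outs/ file names described above; the helper name make_aggr_csv is hypothetical.

```shell
# make_aggr_csv RUNS_DIR ID [ID ...]  (hypothetical helper)
# Prints an Aggregation CSV with the four required columns to stdout.
make_aggr_csv() {
    runs_dir=$1
    shift
    # Header line with the required column names.
    printf '%s\n' 'library_id,atac_fragments,per_barcode_metrics,gex_molecule_info'
    # One row per run ID, pointing at that run's outs/ directory.
    for id in "$@"; do
        outs="$runs_dir/$id/outs"
        printf '%s,%s,%s,%s\n' "$id" \
            "$outs/atac_fragments.tsv.gz" \
            "$outs/per_barcode_metrics.csv" \
            "$outs/gex_molecule_info.h5"
    done
}

make_aggr_csv /opt/runs LV123 LB456 LP789
```

Redirecting the output to a file (for example, AGG123_libraries.csv) gives a CSV that can be passed to --csv. The row order matters, because it determines the GEM well numbering in the aggregated output.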

These are the required command line arguments (also available through cellranger-arc aggr --help):

Argument	Description
`--id=ID`	A unique run ID, also used as the output folder name. It must match [a-zA-Z0-9_-]+ and be at most 64 characters long.
`--csv=CSV`	Path to CSV file enumerating `cellranger-arc count` outputs (see Setting up a CSV).
`--reference=PATH`	Path to folder containing cellranger-arc-compatible reference. Reference packages can be downloaded from support.10xgenomics.com or constructed using the `cellranger-arc mkref` command. Note this reference must match the reference used for the initial `cellranger-arc count` run.
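
The --id constraint is easy to check in a wrapper script before a long run is queued. A minimal sketch in POSIX shell, where valid_run_id is a hypothetical helper name and the rule encoded is only the one stated in the table above:

```shell
# valid_run_id ID  (hypothetical helper)
# Succeeds if ID is 1-64 characters long and uses only letters, digits,
# underscores, and hyphens.
valid_run_id() {
    id=$1
    [ -n "$id" ] || return 1              # must not be empty
    [ "${#id}" -le 64 ] || return 1       # at most 64 characters
    case $id in
        *[!a-zA-Z0-9_-]*) return 1 ;;     # contains a disallowed character
    esac
    return 0
}

if valid_run_id "AGG123"; then
    echo "AGG123 is a usable run ID"
fi
```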

Additional optional parameters are available:

Option	Description
`--description=TEXT`	Sample description to embed in output files [default: ]
`--peaks=BED`	Override peak caller: specify peaks to use in downstream analyses from supplied 3-column BED file. The supplied peaks file must be sorted by position and not contain overlapping peaks; comment lines beginning with `#` are allowed
`--normalize=MODE`	Library depth normalization mode [default: depth] [possible values: none, depth]
`--nosecondary`	Skip secondary analysis which includes dimensionality reduction, clustering, and visualization. This is applicable if you plan to use cellranger-arc reanalyze or other custom analyses.
`--jobmode=MODE`	Job manager to use. Valid options: local (default), sge, lsf, slurm or path to a .template file. Search for help on "Cluster Mode" at support.10xgenomics.com for more details on configuring the pipeline to use a compute cluster [default: local]
`--localcores=NUM`	Set max cores the pipeline may request at one time. Only applies to local jobs
`--localmem=NUM`	Set max memory (GB) the pipeline may request at one time. Only applies to local jobs
`--localvmem=NUM`	Set max virtual address space in GB for the pipeline. Only applies to local jobs
`--mempercore=NUM`	Reserve enough threads for each job to ensure enough memory will be available, assuming each core on your cluster has at least this much memory available. Only applies to cluster jobmodes
`--maxjobs=NUM`	Set max jobs submitted to cluster at one time. Only applies to cluster jobmodes
`--jobinterval=NUM`	Set delay between submitting jobs to cluster, in ms. Only applies to cluster jobmodes
`--overrides=PATH`	The path to a JSON file that specifies stage-level overrides for cores and memory. Finer-grained than `--localcores`, `--mempercore` and `--localmem`.
`--uiport=PORT`	Serve web UI at http://localhost:PORT

After specifying input arguments and options, run cellranger-arc aggr:


cd /home/jdoe/runs
cellranger-arc aggr --id=AGG123 \
                  --csv=AGG123_libraries.csv \
                  --normalize=depth \
                  --reference=/home/jdoe/refs/hg19

The pipeline will begin to run, creating a new folder named with the aggregation ID specified with the --id argument (e.g. /home/jdoe/runs/AGG123). If this output folder already exists, cellranger-arc will assume it is an existing pipestance and attempt to resume running it.

When combining data from multiple GEM wells, the cellranger-arc aggr pipeline automatically equalizes the average read depth per cell between groups before merging. Note that it does this for both assay modalities, ATAC and gene expression. When libraries are sequenced to very different read depths per cell, you may observe that cells cluster by library of origin rather than by cell type. This is commonly referred to as a batch effect in the literature. Many factors can cause batch effects in single-cell data, and sequencing depth is only one of them. The downsampling normalization in cellranger-arc aggr specifically addresses sequencing-depth batch effects but not others. Normalization can be turned off with the --normalize option. The none option may be appropriate if you want to maximize sensitivity and plan to deal with depth normalization or more general batch correction in a downstream step.

There are two normalization modes:

none: Do not normalize at all.
depth (default): Subsample reads from higher-depth GEM wells until the following are equal across GEM wells:
- median number of unique fragments per cell for each ATAC library
- mean number of reads that are confidently mapped to the transcriptome per cell for each gene expression library
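
To make the idea of depth equalization concrete, the sketch below computes, for each library, the fraction of reads that would be kept if every library were brought down to the depth of the lowest one. This is a simplified illustration and not the pipeline's actual algorithm: it assumes that the per-cell depth metric scales in proportion to the reads retained, and both the depth values and the helper name subsample_fractions are made up.

```shell
# subsample_fractions: reads "library_id depth" pairs on stdin and prints
# the fraction of reads to keep so that every library matches the lowest depth.
# (Hypothetical helper; assumes depth scales linearly with retained reads.)
subsample_fractions() {
    awk '
        {
            id[NR] = $1
            depth[NR] = $2
            # Track the lowest depth seen so far; it becomes the target.
            if (NR == 1 || $2 < min) min = $2
        }
        END {
            for (i = 1; i <= NR; i++)
                printf "%s %.2f\n", id[i], min / depth[i]
        }'
}

# Made-up mean GEX reads per cell for the three example libraries.
printf '%s\n' 'LV123 40000' 'LB456 20000' 'LP789 50000' | subsample_fractions
```

Here LB456 has the lowest depth, so it is kept whole (1.00), while LV123 and LP789 are reduced to 0.50 and 0.40 of their reads.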

The cellranger-arc aggr pipeline generates output files that contain all of the data from the individual input jobs, aggregated into single output files, for convenient multi-sample analysis. The GEM well suffix of each barcode is updated to prevent barcode collisions, as described below.

Each output file produced by cellranger-arc aggr follows the format described in the Understanding Output section, but includes the union of all the relevant barcodes from each input job.

cellranger-arc aggr combines the cell calls from the input libraries into a final set of cell calls; it does not perform a cell-calling step of its own.

A successful run will conclude with a message like this:


2021-04-26 05:16:01 [runtime] (update)          ID.AGG123.SC_ATAC_GEX_AGGREGATOR_CS.ATAC_GEX_CLOUPE_PREPROCESS.fork0 join_running
2021-04-26 05:20:28 [runtime] (join_complete)   ID.AGG123.SC_ATAC_GEX_AGGREGATOR_CS.ATAC_GEX_CLOUPE_PREPROCESS

Outputs:
- Barcoded and aligned fragment file:           /home/jdoe/runs/AGG123/outs/atac_fragments.tsv.gz
- Fragment file index:                          /home/jdoe/runs/AGG123/outs/atac_fragments.tsv.gz.tbi
- Bed file of all called peak locations:        /home/jdoe/runs/AGG123/outs/atac_peaks.bed
- Filtered peak barcode matrix in hdf5 format:  /home/jdoe/runs/AGG123/outs/raw_feature_bc_matrix.h5
- Filtered peak barcode matrix in mex format:   /home/jdoe/runs/AGG123/outs/raw_feature_bc_matrix
- Filtered peak barcode matrix in hdf5 format:  /home/jdoe/runs/AGG123/outs/filtered_feature_bc_matrix.h5
- Filtered peak barcode matrix in mex format:   /home/jdoe/runs/AGG123/outs/filtered_feature_bc_matrix
- Secondary analysis outputs:
    clustering:
      atac: {
        ...
      }
      gex:  {
        ...
      }
    dimensionality_reduction:
      atac: {
        ...
      }
      gex:  {
        ...
      }
    feature_linkage:
      ...
    tf_analysis:
      ...
- Loupe Browser input file:                     /home/jdoe/runs/AGG123/outs/cloupe.cloupe
- csv summarizing important metrics and values: /home/jdoe/runs/AGG123/outs/summary.csv
- Annotation of peaks with genes:               /home/jdoe/runs/AGG123/outs/atac_peak_annotation.tsv
- HTML summary:                                 /home/jdoe/runs/AGG123/outs/web_summary.html
- Input data supplied for aggregation:          [
    {
        "atac_fragments": "/home/jdoe/runs/LV123/outs/atac_fragments.tsv.gz",
        "gex_molecule_info": "/home/jdoe/runs/LV123/outs/gex_molecule_info.h5",
        "library_id": "LV123",
        "metadata": {},
        "per_barcode_metrics": "/home/jdoe/runs/LV123/outs/per_barcode_metrics.csv"
    },
    {
        "atac_fragments": "/home/jdoe/runs/LB456/outs/atac_fragments.tsv.gz",
        "gex_molecule_info": "/home/jdoe/runs/LB456/outs/gex_molecule_info.h5",
        "library_id": "LB456",
        "metadata": {},
        "per_barcode_metrics": "/home/jdoe/runs/LB456/outs/per_barcode_metrics.csv"
    },
    {
        "atac_fragments": "/home/jdoe/runs/LP789/outs/atac_fragments.tsv.gz",
        "gex_molecule_info": "/home/jdoe/runs/LP789/outs/gex_molecule_info.h5",
        "library_id": "LP789",
        "metadata": {},
        "per_barcode_metrics": "/home/jdoe/runs/LP789/outs/per_barcode_metrics.csv"
    }
  ]
- Input data supplied for aggregation as CSV:   /home/jdoe/runs/AGG123/outs/aggr.csv

Pipestance completed successfully!

Once cellranger-arc aggr has successfully completed, you can browse the resulting summary HTML file in any supported web browser, open the .cloupe file in Loupe Browser, or refer to the Understanding Output section to explore the data by hand. For machine-readable versions of the summary metrics, refer to the cellranger-arc aggr section of the Summary Metrics page.

Each GEM well is a physically distinct set of GEM partitions, but draws barcode sequences randomly from the pool of valid barcodes, known as the barcode whitelist. To keep the barcodes unique when aggregating multiple libraries, we append a small integer identifying the GEM well to the barcode nucleotide sequence, and use that nucleotide sequence plus ID as the unique identifier in the feature-barcode matrix. For example, AGACCATTGAGACTTA-1 and AGACCATTGAGACTTA-2 are distinct cell barcodes from different GEM wells, despite having the same barcode nucleotide sequence.

This number, which indicates which GEM well the barcode sequence came from, is called the GEM well suffix. The numbering of the GEM wells will reflect the order that the GEM wells were provided in the Aggregation CSV.
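
Because the suffix follows the row order of the Aggregation CSV, a barcode from the aggregated matrix can be traced back to its library. The sketch below is one way to do that in POSIX shell; it assumes that suffixes start at 1 (as in the example above) and that the CSV has a single header line, and the helper name library_for_barcode is hypothetical.

```shell
# library_for_barcode CSV BARCODE  (hypothetical helper)
# Prints the library_id of the Aggregation CSV row that the barcode's
# GEM well suffix points to. Assumes suffixes start at 1 and the CSV has
# exactly one header line.
library_for_barcode() {
    csv=$1
    barcode=$2
    well=${barcode##*-}            # text after the last "-", e.g. 2
    # Data row N sits on line N + 1 because of the header.
    sed -n "$((well + 1))p" "$csv" | cut -d, -f1
}

barcode=AGACCATTGAGACTTA-2
echo "sequence: ${barcode%-*}"     # AGACCATTGAGACTTA
echo "GEM well: ${barcode##*-}"    # 2
```

With the example CSV from this page, library_for_barcode AGG123_libraries.csv AGACCATTGAGACTTA-2 would print LB456, the second row.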
