In addition to the MEX format, we also provide matrices in the Hierarchical Data Format (HDF5 or H5). H5 is a binary format that can compress and access data much more efficiently than text formats such as MEX, which is especially useful when dealing with large datasets. H5 files are supported in both R and Python.
For more information on the format, see the Introduction to HDF5.
Cell Ranger also generates an output file with per-molecule information in HDF5 format. The general information about the HDF5 file format given here applies to the molecule_info.h5 or sample_molecule_info.h5 file, but see the documentation for specific details about the Molecule Info HDF5 file.
The top level of the file contains a single HDF5 group, called matrix, and metadata stored as HDF5 attributes. Within the matrix group are datasets containing the dimensions of the matrix, the matrix entries, and the features and cell barcodes associated with the matrix rows and columns, respectively.
Column | Type | Description |
---|---|---|
barcodes | string | Barcode sequences and their corresponding GEM wells (e.g., AAACGGGCAGCTCGAC-1) |
data | uint32 | Nonzero UMI counts in column-major order |
indices | uint32 | Zero-based row index of corresponding element in data |
indptr | uint32 | Zero-based index into data / indices of the start of each column, i.e., the data corresponding to each barcode sequence |
shape | uint64 | Tuple of (# rows, # columns) indicating the matrix dimensions |
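Each entry in the barcodes dataset combines a barcode sequence with a GEM well, as in the example in the table above. The following is a minimal sketch for separating the two parts. It assumes that the text after the last hyphen is the GEM well number; `split_barcode` is a hypothetical helper name and is not part of any 10x Genomics tool.

```python
def split_barcode(barcode):
    """Split a barcode such as 'AAACGGGCAGCTCGAC-1' into (sequence, GEM well).

    Assumption: the suffix after the last hyphen is an integer GEM well.
    """
    sequence, sep, gem_well = barcode.rpartition("-")
    if not sep or not gem_well.isdigit():
        raise ValueError(f"barcode has no numeric GEM well suffix: {barcode!r}")
    return sequence, int(gem_well)


sequence, gem_well = split_barcode("AAACGGGCAGCTCGAC-1")
```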
The matrix entries are stored in Compressed Sparse Column (CSC) format. For more details on the format, see this SciPy introduction. CSC represents the matrix in column-major order, such that each barcode is represented by a contiguous chunk of data values.
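To make the roles of data, indices, and indptr concrete, the sketch below decodes a small hand-made CSC matrix in plain Python. It assumes the standard CSC convention in which indptr has one more entry than the number of columns, so that column j occupies positions indptr[j] up to (but not including) indptr[j + 1]. The toy values and the function names `csc_column` and `csc_to_dense` are illustrative only and are not taken from a real Cell Ranger output.

```python
def csc_column(data, indices, indptr, col):
    """Return {row_index: value} for the nonzero entries of one column (barcode)."""
    start, end = indptr[col], indptr[col + 1]
    return {indices[k]: data[k] for k in range(start, end)}


def csc_to_dense(data, indices, indptr, shape):
    """Expand CSC arrays into a dense list of rows (only sensible for tiny matrices)."""
    n_rows, n_cols = shape
    dense = [[0] * n_cols for _ in range(n_rows)]
    for col in range(n_cols):
        for row, value in csc_column(data, indices, indptr, col).items():
            dense[row][col] = value
    return dense


# Toy matrix: 3 features (rows) x 4 barcodes (columns).
#            bc0 bc1 bc2 bc3
# feature0 [  1   0   0   2 ]
# feature1 [  0   0   3   0 ]
# feature2 [  4   0   5   6 ]
data = [1, 4, 3, 5, 2, 6]      # nonzero counts, column by column
indices = [0, 2, 1, 2, 0, 2]   # row index of each entry in data
indptr = [0, 2, 2, 4, 6]       # start of each column; bc1 is empty (2 to 2)
shape = (3, 4)

dense = csc_to_dense(data, indices, indptr, shape)
counts_for_bc2 = csc_column(data, indices, indptr, 2)
```

Because all counts for one barcode sit in a single contiguous slice, reading one cell's expression profile only requires the two neighbouring indptr values for that column.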
The feature reference is stored as an HDF5 group called features, within the matrix group.
(root)
└── matrix [HDF5 group]
    ├── barcodes
    ├── data
    ├── indices
    ├── indptr
    ├── shape
    └── features [HDF5 group]
        ├─ _all_tag_keys
        ├─ target_sets [for Targeted Gene Expression or Fixed RNA Profiling]
        │   └─ [target set name]
        ├─ feature_type
        ├─ genome
        ├─ id
        ├─ name
        ├─ pattern [Feature Barcode only]
        ├─ read [Feature Barcode only]
        └─ sequence [Feature Barcode only]
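The layout above can be read in Python with the h5py package. The following is a minimal sketch, not an official loader: it assumes h5py is installed, that the file follows the group and dataset names shown in the tree, and that string datasets come back from h5py as bytes. The names `read_feature_bc_matrix` and `_as_str` are hypothetical.

```python
import h5py


def _as_str(values):
    """Decode HDF5 string values (read by h5py as bytes) into Python str."""
    return [v.decode("utf-8") if isinstance(v, bytes) else str(v) for v in values]


def read_feature_bc_matrix(path):
    """Read the CSC arrays and the row/column labels from the matrix group."""
    with h5py.File(path, "r") as f:
        matrix = f["matrix"]
        features = matrix["features"]
        return {
            # CSC arrays, loaded fully into memory as NumPy arrays
            "data": matrix["data"][:],
            "indices": matrix["indices"][:],
            "indptr": matrix["indptr"][:],
            # (number of rows, number of columns)
            "shape": tuple(int(n) for n in matrix["shape"][:]),
            # column labels (barcodes) and row labels (features)
            "barcodes": _as_str(matrix["barcodes"][:]),
            "feature_ids": _as_str(features["id"][:]),
            "feature_names": _as_str(features["name"][:]),
            "feature_types": _as_str(features["feature_type"][:]),
        }
```

If SciPy is available, the three arrays and the shape can be passed to `scipy.sparse.csc_matrix((data, indices, indptr), shape=shape)` to obtain a sparse matrix object. Casting indices and indptr to a signed integer type first is a cautious choice, since they are stored as unsigned integers.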
You can examine the contents of the H5 file using software such as HDFView or the h5dump command, as demonstrated below.
Show the file contents of the entire H5 object:
h5dump -n ./filtered_feature_bc_matrix.h5
HDF5 "filtered_feature_bc_matrix.h5" {
FILE_CONTENTS {
group /
group /matrix
dataset /matrix/barcodes
dataset /matrix/data
group /matrix/features
dataset /matrix/features/_all_tag_keys
dataset /matrix/features/feature_type
dataset /matrix/features/genome
dataset /matrix/features/id
dataset /matrix/features/name
dataset /matrix/indices
dataset /matrix/indptr
dataset /matrix/shape
}
}
Show the top few lines of a specific part of the object (e.g., the matrix/barcodes dataset):
h5dump -d matrix/barcodes ./filtered_feature_bc_matrix.h5 | head -n 15
HDF5 "./filtered_feature_bc_matrix.h5" {
DATASET "matrix/barcodes" {
DATATYPE H5T_STRING {
STRSIZE 18;
STRPAD H5T_STR_NULLPAD;
CSET H5T_CSET_ASCII;
CTYPE H5T_C_S1;
}
DATASPACE SIMPLE { ( 1225 ) / ( 1225 ) }
DATA {
(0): "AAACCCAAGGAGAGTA-1", "AAACGCTTCAGCCCAG-1", "AAAGAACAGACGACTG-1",
(3): "AAAGAACCAATGGCAG-1", "AAAGAACGTCTGCAAT-1", "AAAGGATAGTAGACAT-1",
(6): "AAAGGATCACCGGCTA-1", "AAAGGATTCAGCTTGA-1", "AAAGGATTCCGTTTCG-1",
(9): "AAAGGGCTCATGCCCT-1", "AAAGGGCTCCGTAGGC-1", "AAAGGTACAACTGCTA-1",
(12): "AAAGTCCAGCGGGTTA-1", "AAAGTCCAGTCAACAA-1", "AAAGTCCCACCAGCCA-1",
...
For suggestions on downstream analysis with third-party R and Python tools, see the 10x Genomics Analysis Guides resource.