ARK Portal Data Standards

What are data standards?

Data standards are a set of rules that define how data is recorded, described, and shared. They help ensure that data is consistent, accurate, and compliant with FAIR data practices.

This document, and any accompanying pages, describe standards for common assay data types found in the ARK Portal and outline expectations for data contributions.

When to apply standards?

Please be familiar with ARK Portal data standards requirements and plan to have your data files and metadata tables conform before uploading to Synapse. This means that files should be named and organized according to the conventions outlined here.

Olink data standards

| Folder Name | Data Description | Expectation | File Formats | Data Level* |
| --- | --- | --- | --- | --- |
| raw_data | raw protein abundance measurements | required | parquet or CSV, one per plate | 1 |
| processed_data | processed aggregated data | required | single, aggregated parquet or CSV consisting of all finalized data points | 2+ |
| metadata | target panel | required | a tabular file format (e.g., csv, xlsx) detailing all the protein targets profiled in the experiment | N/A |

At a minimum, all Olink data contributions to the ARK Portal should include all raw data in parquet or CSV format, one file per plate profiled, and a final aggregated data object, in parquet or CSV format, that has been used to derive research findings reported in publications. This finalized data object should include any normalization, integration, transforms, etc. that have been applied to the raw data.

To ensure transparency and to support future ARK Portal developments, we require contributors to also provide a “target panel” file that details all the proteins profiled by the Olink assay. At a minimum, this should be a tabular file (e.g., csv, xlsx; PDFs will not be accepted) that lists all protein targets using established unique identifiers such as UniProt IDs or HGNC-approved gene symbols. This file should be readily available from the manufacturer.
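As a rough illustration of the target panel requirement, the sketch below scans a CSV panel for identifiers that do not look like UniProt accessions. It is a minimal sketch, not an ARK Portal tool: the function name and the `uniprot_id` column name are hypothetical, the regular expression is the accession pattern published in the UniProt help pages, and panels that list several accessions per assay would need extra handling.

```python
import csv
import io
import re

# Accession pattern as documented by UniProt (6- or 10-character accessions).
UNIPROT_ACCESSION = re.compile(
    r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$"
)


def invalid_uniprot_ids(panel_csv, id_column="uniprot_id"):
    """Return values in id_column that do not match the UniProt accession pattern.

    panel_csv is any open text file; id_column is a hypothetical column name.
    """
    reader = csv.DictReader(panel_csv)
    if id_column not in (reader.fieldnames or []):
        raise ValueError(f"target panel is missing the {id_column!r} column")
    bad = []
    for row in reader:
        value = (row[id_column] or "").strip()
        if not UNIPROT_ACCESSION.match(value):
            bad.append(value)
    return bad


# In-memory example; a real panel would be opened from disk instead.
example = io.StringIO("assay,uniprot_id\nIL6,P05231\nTNF,P01375\nBAD,not-an-id\n")
print(invalid_uniprot_ids(example))
```

A check like this only confirms that identifiers are well formed; it does not confirm that an accession exists or matches the named protein.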

Olink data contributors are also required to provide standardized metadata conforming to the ARK Portal Data model. Templates will be provided to guide the collection of these critical metadata.

scRNA-seq/snRNA-seq data standards

| Folder Name | Data Description | Expectation | File Formats (a) | Data Level* |
| --- | --- | --- | --- | --- |
| fastq_files | raw fastq files | required | gzipped fastq files | 1 |
| bam_files | read alignment files | optional | bam or cram files | 2 |
| CellRanger_counts or raw_gene_counts | raw gene counts | preferred | compressed tar archive (e.g., .tgz) or .h5 file - these are readily available after running Cell Ranger counts on 10x Genomics sc/sn-RNA-seq data | 3 |
| processed_data | processed aggregated gene counts | preferred | AnnData (as an h5ad file) or SeuratObj (as an Rds or similarly binary compressed R-compatible file) | 4+ |

* https://ark-portal.github.io/data_model/docs/attributes/dataLevel.html

(a) File name conventions are described at Supplemental Standards: File Names

(b) For data processed by 10x Genomics Cell Ranger software, if contributors wish to upload the MEX output they should first convert either the raw_feature_bc_matrix/ or filtered_feature_bc_matrix/ folder to a gzip compressed tar archive.
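One possible way to produce the gzip compressed tar archive described above is Python's standard `tarfile` module. This is a minimal sketch under stated assumptions: the function name and the choice to name the archive after the folder are illustrative, not ARK Portal requirements.

```python
import tarfile
from pathlib import Path


def archive_mex_folder(mex_dir, out_dir="."):
    """Write <folder name>.tgz containing mex_dir and return the archive path."""
    mex_dir = Path(mex_dir)
    if not mex_dir.is_dir():
        raise NotADirectoryError(str(mex_dir))
    archive = Path(out_dir) / f"{mex_dir.name}.tgz"
    # "w:gz" writes a gzip-compressed tar; arcname stores paths relative to the
    # folder name rather than the full path on the local machine.
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(mex_dir, arcname=mex_dir.name)
    return archive
```

For example, calling `archive_mex_folder("outs/filtered_feature_bc_matrix")` would write `filtered_feature_bc_matrix.tgz` to the current directory, assuming that folder exists at that (hypothetical) path.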

At a minimum, all sc/snRNA-seq data contributions to the ARK Portal should include the raw fastq files. We additionally request that contributors provide raw gene counts, either aggregated or split by library/sample, and a final aggregated AnnData (as an h5ad file) or SeuratObj (as an Rds or similarly binary compressed R-compatible file) of prepared counts data that has been used to derive research findings reported in publications. This finalized data object should include any normalization, integration, transforms, etc. that have been applied to the gene counts, along with critical cell metadata as outlined at Supplemental Standards: Single-cell Metadata.

CITE-seq data standards

| Folder Name | Data Description | Expectation | File Formats (a) | Data Level* |
| --- | --- | --- | --- | --- |
| fastq_files/GEX_fastq | scRNA-seq raw fastq files | required | gzipped fastq files | 1 |
| fastq_files/feature_barcode_fastq | feature barcode raw fastq files | required | gzipped fastq files | 1 |
| CellRanger_counts or raw_counts | raw gene and protein counts | preferred | compressed tar archive (e.g., .tgz) or .h5 file - these are readily available after running Cell Ranger counts on 10x Genomics-derived data | 3 |
| processed_data | processed aggregated gene counts | preferred | AnnData (as an h5ad file) or SeuratObj (as an Rds or similarly binary compressed R-compatible file) | 4+ |

(a) File name conventions are described at Supplemental Standards: File Names

CITE-seq is a multi-modal data type that simultaneously profiles transcript and target protein abundance at single-cell resolution. Protein targets are profiled by sequencing barcodes contained within oligonucleotides conjugated to antibodies that bind to proteins of interest, where each barcode is uniquely associated with a specific protein. These antibody-derived barcodes are sometimes referred to as antibody-derived tags (ADT) or as feature barcodes. The ARK Portal uses the latter terminology.

The protein abundance libraries are created and sequenced as distinct libraries from the scRNA-seq libraries and are treated as a distinct assay type within the ARK Portal data model. Specifically, the ARK Portal classifies these libraries under the ‘feature barcode sequencing’ assay. This is distinct from other antibody-derived barcode methods like hash-tag oligos that are used to demultiplex libraries made of pooled cell suspensions and which do not target specific proteins for the purpose of quantifying protein abundance.

At a minimum, all CITE-seq data contributions to the ARK Portal should include the raw fastq files for both the scRNA-seq libraries and the feature barcode sequencing libraries. We additionally request that contributors provide raw gene and protein counts, either aggregated or split by library/sample, and a final aggregated AnnData (as an h5ad file) or SeuratObj (as an Rds or similarly binary compressed R-compatible file) of prepared counts data that has been used to derive research findings reported in publications. This finalized data object should include any normalization, integration, transforms, etc. that have been applied to the gene counts, along with critical cell metadata as outlined at Supplemental Standards: Single-cell Metadata.
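Before uploading, a contributor could check a local dataset directory against the CITE-seq folder expectations with something like the sketch below. The folder names come from the CITE-seq table; the function name, the return format, and the idea of running such a check locally are assumptions made for illustration.

```python
from pathlib import Path

# Folder expectations taken from the CITE-seq table. Each tuple lists
# alternative folder names, any one of which satisfies the expectation.
REQUIRED = [("fastq_files/GEX_fastq",), ("fastq_files/feature_barcode_fastq",)]
PREFERRED = [("CellRanger_counts", "raw_counts"), ("processed_data",)]


def check_layout(dataset_dir):
    """Return (missing_required, missing_preferred) for a dataset directory."""
    root = Path(dataset_dir)

    def missing(groups):
        return [
            " or ".join(group)
            for group in groups
            if not any((root / name).is_dir() for name in group)
        ]

    return missing(REQUIRED), missing(PREFERRED)
```

An empty first list means every required folder is present; entries in the second list are only reminders, since those folders are preferred rather than required.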

Supplemental Standards

File Names

Single specimen files

In the tables above, File Formats indicates the expected format and extension of data files. Here we describe conventions regarding information to include in your file names. The examples below use fastq files from sequencing-based experiments to demonstrate ARK Portal file name conventions, but the convention is applicable to many other file types as well.

TL;DR - if a data file contains data pertaining to a single specimen, then the biospecimenID should be included in the file name. If a file contains pooled data, particularly raw data files, then the file name should include the identifier for that pool. This identifier will differ depending on the assay. For example, pooled sequencing libraries should use the libraryID, Olink level 1 files should include the plateID, and barcoded and multiplexed FCS files should include the sampleProcessingBatch or dataCollectionBatch.

For single-specimen libraries, i.e., libraries consisting of only a single sample, fastq files should include the biospecimenID of that sample. All fastq files should include the read (R) or index (I) label and note the flow cell lane that the library was sequenced on as this is necessary for indicating libraries that were sequenced across multiple lanes, e.g.,

RASLE_000001_L001_R1_001.fastq.gz

Where RASLE_000001 is the biospecimenID, L001 is the flow cell lane, and R1 is the read of the library fragment sequenced in the file. More details on the Illumina fastq file naming convention are available at BaseSpace Naming Convention.
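The naming pattern in this example can be checked with a regular expression. The sketch below is an illustration only: the pattern, the function name, and the `identifier` field name are assumptions, and it presumes the file name ends in `_001.fastq.gz` as in the example. It does not handle an additional batch label for libraries sequenced across multiple flow cells.

```python
import re

# identifier: biospecimenID (or libraryID for pooled libraries)
# lane:       flow cell lane, e.g. L001
# read:       read or index label, e.g. R1, R2, I1
FASTQ_NAME = re.compile(
    r"^(?P<identifier>.+)_(?P<lane>L\d{3})_(?P<read>[RI]\d)_001\.fastq\.gz$"
)


def parse_fastq_name(file_name):
    """Split a fastq file name into identifier, lane, and read."""
    match = FASTQ_NAME.match(file_name)
    if match is None:
        raise ValueError(f"unexpected fastq file name: {file_name!r}")
    return match.groupdict()


print(parse_fastq_name("RASLE_000001_L001_R1_001.fastq.gz"))
```

For the example file name this yields the identifier `RASLE_000001`, lane `L001`, and read `R1`; a name without a lane or read label raises `ValueError`.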

While not common, there are some scenarios in which a library may be sequenced across multiple flow cells. In these cases it is important to create and assign distinct batch labels that distinguish between these runs. This can be the flow cell ID, a simple letter or number code, etc. This label should then be included in the fastq file name and will also be captured in the corresponding ARK Portal Assay Metadata Template.

For multi-modal assays where multiple libraries are derived from the same biospecimen, e.g., CITE-seq, the fastq file names should follow the above examples with the addition of a short abbreviation distinguishing libraries for each assay type. The table below outlines the different abbreviations that should be used to distinguish between different libraries in a multi-modal experiment:

The abbreviation is added to the beginning of the fastq file name.

| Abbreviation | Assay | Example |
| --- | --- | --- |
| GEX | scRNA-seq (GEX = gene expression) | GEX_RASLE_000002_L001_R1_001.fastq.gz |
| FB | feature barcode sequencing | FB_RASLE_000002_L001_R1_001.fastq.gz |
| VDJ | V(D)J sequencing (TCR + BCR) | VDJ_RASLE_000002_L001_R1_001.fastq.gz |
| TCR | V(D)J sequencing (TCR only) | TCR_RASLE_000002_L001_R1_001.fastq.gz |
| BCR | V(D)J sequencing (BCR only) | BCR_RASLE_000002_L001_R1_001.fastq.gz |
| ATAC | ATAC-seq | ATAC_RASLE_000002_L001_R1_001.fastq.gz |
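To show how the abbreviations combine with the rest of the file name, the sketch below builds a multi-modal fastq file name from its parts. The abbreviations and assays are those listed in the table; the function name, the argument names, and the fixed `_001` ending are assumptions for illustration.

```python
# Abbreviations and assays as listed in the table above.
ASSAY_ABBREVIATIONS = {
    "GEX": "scRNA-seq",
    "FB": "feature barcode sequencing",
    "VDJ": "V(D)J sequencing (TCR + BCR)",
    "TCR": "V(D)J sequencing (TCR only)",
    "BCR": "V(D)J sequencing (BCR only)",
    "ATAC": "ATAC-seq",
}


def multimodal_fastq_name(abbreviation, identifier, lane, read):
    """Build a fastq file name such as FB_RASLE_000002_L001_R1_001.fastq.gz.

    lane is an integer lane number; read is a label such as "R1" or "I1".
    """
    if abbreviation not in ASSAY_ABBREVIATIONS:
        raise ValueError(f"unknown assay abbreviation: {abbreviation!r}")
    return f"{abbreviation}_{identifier}_L{lane:03d}_{read}_001.fastq.gz"


print(multimodal_fastq_name("FB", "RASLE_000002", 1, "R1"))
```

Rejecting unknown abbreviations keeps library names within the set the table defines, so files from different assays on the same biospecimen remain distinguishable.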

Multispecimen libraries

For multispecimen libraries the file names should use the libraryID in place of the biospecimenID.

Single-cell Metadata

Cell-level metadata is a critical component of single-cell and single-nucleus datasets. Contributors are asked to include a minimal set of standardized cell-metadata to streamline data reuse and support a more harmonized metadata infrastructure for ARK Portal data:

  • biospecimenID - by including this ID, each cell will be connected to the associated metadata collected via ARK Biospecimen metadata templates.

  • cellOntologyID (if cell type annotations are included) corresponding to the predicted cell type. The Cell Ontology (CL) is a structured, controlled vocabulary for cell types and provides a set of unique identifiers for specifying cell types. You can explore the Cell Ontology at https://bioportal.bioontology.org/ontologies/CL?p=summary.

Any additional cell-level metadata fields should be defined in an accompanying dictionary to clearly document what information/data is also captured in these tables.
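A simple check of the minimal cell-level fields might look like the sketch below. It treats the cell metadata as a list of dictionaries, one per cell, which is an assumption made to keep the example self-contained; in practice the table would usually live in the AnnData or Seurat object. The function name is hypothetical, and the `CL:` plus seven digits format is the usual Cell Ontology identifier form rather than something this page specifies.

```python
import re

# Cell Ontology identifiers are commonly written as "CL:" plus seven digits,
# e.g. CL:0000084 (T cell).
CL_ID = re.compile(r"^CL:\d{7}$")


def check_cell_metadata(rows, has_annotations=True):
    """Return a list of problems found in per-cell metadata rows (dicts)."""
    problems = []
    for index, row in enumerate(rows):
        if not row.get("biospecimenID"):
            problems.append(f"row {index}: missing biospecimenID")
        if has_annotations:
            cl_id = row.get("cellOntologyID") or ""
            if not CL_ID.match(cl_id):
                problems.append(
                    f"row {index}: cellOntologyID {cl_id!r} is not a CL identifier"
                )
    return problems


cells = [
    {"biospecimenID": "RASLE_000001", "cellOntologyID": "CL:0000084"},
    {"biospecimenID": "", "cellOntologyID": "T cell"},
]
for problem in check_cell_metadata(cells):
    print(problem)
```

Setting `has_annotations=False` skips the cellOntologyID check for datasets that do not include cell type annotations.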

To learn more about “FAIRification” efforts for single-cell data please visit https://sc-fair.org/ and https://github.com/chanzuckerberg/single-cell-curation/tree/main.

Resources

ARK Portal Data Model and Dictionary

The ARK Portal Data Model Dictionary is hosted online at https://ark-portal.github.io/data_model/. This site is built directly from the data model files hosted in a public repository on GitHub at https://github.com/ARK-Portal/data_model/tree/main. All are welcome to review, submit issues, and contribute to the ARK Portal data model.
