ARK Portal Data Standards
What are data standards?
Data standards are a set of rules that define how data is recorded, described, and shared. They help ensure that data is consistent, accurate, and compliant with FAIR (findable, accessible, interoperable, reusable) data practices.
This document and any accompanying pages describe standards for common assay data types found in the ARK Portal and outline expectations for data contributions.
When to apply standards?
Please be familiar with ARK Portal data standards requirements and plan to have your data files and metadata tables conform before uploading to Synapse. This means that files should be named and organized according to the conventions outlined here.
Olink data standards
| Folder Name | Data Description | Expectation | File Formats | Data Level* |
|---|---|---|---|---|
| | raw protein abundance measurements | required | parquet or CSV, one file per plate | 1 |
| | processed aggregated data | required | single, aggregated parquet or CSV file | 2+ |
| | target panel | required | a tabular file format (e.g., csv, xlsx) | N/A |
At a minimum, all Olink data contributions to the ARK Portal should include all raw data in parquet or CSV format, one file per plate profiled, and a final aggregated data object in parquet or CSV format that has been used to derive research findings reported in publications. This finalized data object should include any normalization, integration, transforms, etc. that have been applied to the raw data.
To ensure transparency and to support future ARK Portal developments, we require contributors to also provide a “target panel” file that details all the proteins profiled by the Olink assay. At a minimum, this should be a tabular file (e.g., csv, xlsx; PDFs will not be accepted) that lists all protein targets using established unique identifiers such as UniProt IDs or HGNC-approved gene symbols. This file should be readily available from the manufacturer.
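As an illustration of the target panel requirement, the sketch below checks that every identifier in a CSV target panel looks like a UniProt accession. It is only a minimal example under stated assumptions: the column name `UniProt` and the sample rows are hypothetical, the regular expression is the accession pattern published in the UniProt documentation, and a format match does not prove that an accession actually exists.

```python
import csv
import io
import re

# Accession format as described in the UniProt documentation.
UNIPROT_RE = re.compile(
    r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$"
)


def invalid_uniprot_ids(csv_text, id_column="UniProt"):
    """Return the values in `id_column` that do not look like UniProt accessions.

    `id_column` is a hypothetical column name; adjust it to your panel file.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames is None or id_column not in reader.fieldnames:
        raise ValueError(f"missing column: {id_column}")
    bad = []
    for row in reader:
        value = (row[id_column] or "").strip()
        if not UNIPROT_RE.match(value):
            bad.append(value)
    return bad


# Illustrative panel content only, not a real Olink panel.
example = "Assay,UniProt\nTNF,P01375\nIL6,P05231\nBAD,not_an_id\n"
print(invalid_uniprot_ids(example))  # prints ['not_an_id']
```

A panel keyed on HGNC gene symbols would need a different check, for example a lookup against a downloaded HGNC symbol list.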
Olink data contributors are also required to provide standardized metadata conforming to the ARK Portal Data model. Templates will be provided to guide the collection of these critical metadata.
scRNA-seq/snRNA-seq data standards
| Folder Name | Data Description | Expectation | File Formats (a) | Data Level* |
|---|---|---|---|---|
| | raw fastq files | required | gzipped fastq files | 1 |
| | read alignment files | optional | bam or cram files | 2 |
| | raw gene counts | preferred | compressed tar archive (b) | 3 |
| | processed aggregated gene counts | preferred | AnnData (as an h5ad file) or SeuratObj (as an Rds or similarly binary compressed R-compatible file) | 4+ |
*https://ark-portal.github.io/data_model/docs/attributes/dataLevel.html
(a) File name conventions are described at Supplemental Standards: File Names
(b) For data processed by 10x Genomics Cell Ranger software, if contributors wish to upload the MEX output they should first convert either the raw_feature_bc_matrix/ or filtered_feature_bc_matrix/ folder to a gzip compressed tar archive.
At a minimum, all sc/snRNA-seq data contributions to the ARK Portal should include the raw fastq files. We additionally request that contributors provide raw gene counts, either aggregated or split by library/sample, and a final aggregated AnnData (as an h5ad file) or SeuratObj (as an Rds or similarly binary compressed R-compatible file) of prepared counts data that has been used to derive research findings reported in publications. This finalized data object should include any normalization, integration, transforms, etc. that have been applied to the gene counts, along with critical cell metadata as outlined at Supplemental Standards: Single-cell Metadata.
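For contributors who want to upload Cell Ranger MEX output, the conversion of a matrix folder to a gzip compressed tar archive could be sketched as follows with Python's standard `tarfile` module. The function name `archive_mex_folder` and the output name (`<folder name>.tar.gz`) are assumptions made for this example; the final file name should still follow the file name conventions in Supplemental Standards.

```python
import tarfile
from pathlib import Path


def archive_mex_folder(mex_dir, out_dir="."):
    """Pack a Cell Ranger MEX folder into <folder name>.tar.gz and return its path."""
    mex_dir = Path(mex_dir)
    if not mex_dir.is_dir():
        raise NotADirectoryError(str(mex_dir))
    archive_path = Path(out_dir) / f"{mex_dir.name}.tar.gz"
    # "w:gz" writes a gzip-compressed tar archive. Using arcname keeps only
    # the folder name inside the archive instead of the full source path.
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(mex_dir, arcname=mex_dir.name)
    return archive_path
```

A call such as `archive_mex_folder("filtered_feature_bc_matrix")` would then write `filtered_feature_bc_matrix.tar.gz` to the current directory.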
CITE-seq data standards
| Folder Name | Data Description | Expectation | File Formats (a) | Data Level* |
|---|---|---|---|---|
| | scRNA-seq raw fastq files | required | gzipped fastq files | 1 |
| | feature barcode raw fastq files | required | gzipped fastq files | 1 |
| | raw gene and protein counts | preferred | compressed tar archive | 3 |
| | processed aggregated gene counts | preferred | AnnData (as an h5ad file) or SeuratObj (as an Rds or similarly binary compressed R-compatible file) | 4+ |
(a) File name conventions are described at Supplemental Standards: File Names
CITE-seq is a multi-modal data type that simultaneously profiles transcript and target protein abundance at single-cell resolution. Protein targets are profiled by sequencing barcodes contained within oligonucleotides conjugated to antibodies that bind to proteins of interest, where each barcode is uniquely associated with a specific protein. These antibody-derived barcodes are sometimes referred to as antibody-derived tags (ADT) or as feature barcodes. The ARK Portal uses the latter terminology.
The protein abundance libraries are created and sequenced as distinct libraries from the scRNA-seq libraries and are treated as a distinct assay type within the ARK Portal data model. Specifically, the ARK Portal classifies these libraries under the ‘feature barcode sequencing’ assay. This is distinct from other antibody-derived barcode methods like hash-tag oligos that are used to demultiplex libraries made of pooled cell suspensions and which do not target specific proteins for the purpose of quantifying protein abundance.
At a minimum, all CITE-seq data contributions to the ARK Portal should include the raw fastq files for both the scRNA-seq libraries and the feature barcode sequencing libraries. We additionally request that contributors provide raw gene and protein counts, either aggregated or split by library/sample, and a final aggregated AnnData (as an h5ad file) or SeuratObj (as an Rds or similarly binary compressed R-compatible file) of prepared counts data that has been used to derive research findings reported in publications. This finalized data object should include any normalization, integration, transforms, etc. that have been applied to the gene counts, along with critical cell metadata as outlined at Supplemental Standards: Single-cell Metadata.
Supplemental Standards
File Names
Single specimen files
In the tables above, File Formats indicates the expected format and extension of data files. Here we describe conventions regarding information to include in your file names. The examples below use fastq files from sequencing-based experiments to demonstrate ARK Portal file name conventions, but the convention is applicable to many other file types as well.
TL;DR - if a data file contains data pertaining to a single specimen, then the biospecimenID should be included in the file name. If a file contains pooled data, particularly raw data files, then the file name should include the identifier corresponding to that pool. This identifier will differ depending on the assay. For example, pooled sequencing libraries should use the libraryID, Olink level 1 files should include the plateID, and barcoded and multiplexed FCS files should include the sampleProcessingBatch or dataCollectionBatch.
For single-specimen libraries, i.e., libraries consisting of only a single sample, fastq files should include the biospecimenID of that sample. All fastq files should include the read (R) or index (I) label and note the flow cell lane that the library was sequenced on as this is necessary for indicating libraries that were sequenced across multiple lanes, e.g.,
RASLE_000001_L001_R1_001.fastq.gz
Where RASLE_000001 is the biospecimenID, L001 is the flow cell lane, and R1 is the read of the library fragment sequenced in the file. More details on the Illumina fastq file naming convention are available at BaseSpace Naming Convention.
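To make the convention concrete, the sketch below parses a single-specimen fastq file name into its parts. It assumes the layout shown in the example above, with a three-digit lane, a read or index label such as R1 or I1, and the trailing `001` segment from Illumina naming. The function name `parse_fastq_name` is made up for this illustration, and file names that carry an additional batch label or assay abbreviation would need an adapted pattern.

```python
import re

# <biospecimenID>_L<lane>_<R|I><n>_001.fastq.gz
FASTQ_RE = re.compile(
    r"^(?P<biospecimenID>.+)_L(?P<lane>\d{3})_(?P<read>[RI]\d)_001\.fastq\.gz$"
)


def parse_fastq_name(file_name):
    """Split a fastq file name into biospecimenID, lane, and read labels."""
    match = FASTQ_RE.match(file_name)
    if match is None:
        raise ValueError(f"file name does not follow the convention: {file_name}")
    return match.groupdict()


print(parse_fastq_name("RASLE_000001_L001_R1_001.fastq.gz"))
# prints {'biospecimenID': 'RASLE_000001', 'lane': '001', 'read': 'R1'}
```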
While not common, there are some scenarios in which a library may be sequenced across multiple flow cells. In these cases it is important to create and assign distinct batch labels that distinguish between these runs. This can be the flow cell ID, a simple letter or number code, etc. This label should then be included in the fastq file name and will also be captured in the corresponding ARK Portal Assay Metadata Template.
For multi-modal assays where multiple libraries are derived from the same biospecimen, e.g., CITE-seq, the fastq file names should follow the above examples with the addition of a short abbreviation distinguishing the libraries for each assay type. The table below lists the abbreviations to use for each library type in a multi-modal experiment; the abbreviation is prepended to the fastq file name:
| Abbreviation | Assay | Example |
|---|---|---|
| GEX | scRNA-seq (GEX = gene expression) | GEX_RASLE_000002_L001_R1_001.fastq.gz |
| FB | feature barcode sequencing | FB_RASLE_000002_L001_R1_001.fastq.gz |
| VDJ | V(D)J sequencing (TCR + BCR) | VDJ_RASLE_000002_L001_R1_001.fastq.gz |
| TCR | V(D)J sequencing (TCR only) | TCR_RASLE_000002_L001_R1_001.fastq.gz |
| BCR | V(D)J sequencing (BCR only) | BCR_RASLE_000002_L001_R1_001.fastq.gz |
| ATAC | ATAC-seq | ATAC_RASLE_000002_L001_R1_001.fastq.gz |
Multispecimen libraries
For multispecimen libraries the file names should use the libraryID in place of the biospecimenID.
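The naming rules for multi-modal and multispecimen libraries can be combined in a small helper like the one sketched below. The function name `build_fastq_name` and the libraryID value `LIB_0001` are hypothetical; the abbreviations are those listed in the table above, and the trailing `001` segment follows the fastq examples in this section.

```python
import re

# Assay abbreviations from the multi-modal table above.
ASSAY_ABBREVIATIONS = {"GEX", "FB", "VDJ", "TCR", "BCR", "ATAC"}


def build_fastq_name(identifier, lane, read, abbreviation=None):
    """Build a fastq file name.

    identifier: biospecimenID (single specimen) or libraryID (multispecimen).
    lane: flow cell lane as an integer, e.g. 1 for L001.
    read: read or index label, e.g. "R1", "R2", "I1".
    abbreviation: optional assay abbreviation for multi-modal experiments.
    """
    if abbreviation is not None and abbreviation not in ASSAY_ABBREVIATIONS:
        raise ValueError(f"unknown assay abbreviation: {abbreviation!r}")
    if not re.fullmatch(r"[RI]\d", read):
        raise ValueError(f"read must look like R1, R2, or I1: {read!r}")
    parts = [identifier, f"L{lane:03d}", read, "001"]
    if abbreviation is not None:
        parts.insert(0, abbreviation)
    return "_".join(parts) + ".fastq.gz"


print(build_fastq_name("RASLE_000002", 1, "R1", abbreviation="FB"))
# prints FB_RASLE_000002_L001_R1_001.fastq.gz
print(build_fastq_name("LIB_0001", 2, "I1"))
# prints LIB_0001_L002_I1_001.fastq.gz
```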
Single-cell Metadata
Cell-level metadata is a critical component of single-cell and single-nucleus datasets. Contributors are asked to include a minimal set of standardized cell-metadata to streamline data reuse and support a more harmonized metadata infrastructure for ARK Portal data:
biospecimenID - by including this ID, each cell will be connected to the associated metadata collected via ARK Biospecimen metadata templates.
cellOntologyID (if cell type annotations are included) corresponding to the predicted cell type. The Cell Ontology (CL) is a structured, controlled vocabulary for cell types and provides a set of unique identifiers for specifying cell types. You can explore the Cell Ontology at https://bioportal.bioontology.org/ontologies/CL?p=summary.
Any additional cell-level metadata fields should be defined in an accompanying dictionary to clearly document what information/data is also captured in these tables.
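A quick pre-upload check of these cell-level fields might look like the sketch below. It works on a plain dictionary of column name to per-cell values so that it has no third-party dependencies. The function name `check_cell_metadata` is hypothetical, and the ID pattern assumes Cell Ontology identifiers are written in the usual CURIE form, `CL:` followed by seven digits.

```python
import re

# Cell Ontology IDs in CURIE form, e.g. CL:0000000.
CL_ID_RE = re.compile(r"^CL:\d{7}$")


def check_cell_metadata(columns):
    """Return a list of problems found in cell-level metadata.

    columns: dict mapping column name to a list with one value per cell.
    """
    problems = []
    if "biospecimenID" not in columns:
        problems.append("missing biospecimenID column")
    elif any(value in (None, "") for value in columns["biospecimenID"]):
        problems.append("empty biospecimenID values")
    # cellOntologyID is only expected when cell type annotations are included.
    for value in sorted(set(map(str, columns.get("cellOntologyID", [])))):
        if not CL_ID_RE.match(value):
            problems.append(f"invalid cellOntologyID: {value}")
    return problems


cells = {
    "biospecimenID": ["RASLE_000001", "RASLE_000001"],
    "cellOntologyID": ["CL:0000084", "T cell"],
}
print(check_cell_metadata(cells))  # prints ['invalid cellOntologyID: T cell']
```

For an AnnData object, a dictionary of this shape can typically be obtained from the `obs` data frame with `adata.obs.to_dict("list")`.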
To learn more about “FAIRification” efforts for single-cell data please visit https://sc-fair.org/ and https://github.com/chanzuckerberg/single-cell-curation/tree/main.
Resources
ARK Portal Data Model and Dictionary
The ARK Portal Data Model Dictionary is hosted online at https://ark-portal.github.io/data_model/. This site is built directly from the data model files hosted in a public repository on GitHub at https://github.com/ARK-Portal/data_model/tree/main. All are welcome to review, submit issues, and contribute to the ARK Portal data model.