ARK Portal Data Standards

What are data standards?

Data standards are a set of rules that define how data is recorded, described, and shared. They help ensure that data is consistent, accurate, and compliant with FAIR data practices.

This document, and any accompanying pages, describe standards for common assay data types found in the ARK Portal and outline expectations for data contributions.

When to apply standards?

Please familiarize yourself with the ARK Portal data standards and make sure your data files and metadata tables conform to them before uploading to Synapse. This means that files should be named and organized according to the conventions outlined here.

Olink data standards

Folder Name	Data Description	Expectation	File Formats	Data Level*
`raw_data`	raw protein abundance measurements	required	`parquet` or `CSV`, one per plate	1
`processed_data`	processed aggregated data	required	single, aggregated `parquet` or `CSV` consisting of all finalized data points.	2+
`metadata`	target panel^a, standardized metadata**, user-defined metadata (optional)	required	a tabular file format (e.g., `csv`, `xlsx`)	N/A

At a minimum, all Olink data contributions to the ARK Portal should include all raw data in parquet or CSV format, one file per plate profiled, and a final aggregated data object in parquet or CSV format, that has been used to derive research findings reported in publications. This finalized data object should include any normalization, integration, transforms, etc. that have been applied to the raw data.
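The folder expectations in the table above can be checked before upload. The following is a minimal sketch, not an official ARK Portal tool: the function name `check_olink_layout`, the problem messages, and the demo file names are assumptions made for illustration, and the extension check is limited to the formats named in the table.

```python
from pathlib import Path
import tempfile

# Folder names and formats come from the Olink table above.
# `None` means "do not check extensions" (the metadata row only gives examples).
EXPECTED_FOLDERS = {
    "raw_data": {".parquet", ".csv"},
    "processed_data": {".parquet", ".csv"},
    "metadata": None,
}

def check_olink_layout(root):
    """Return a list of human-readable problems found under `root`."""
    root = Path(root)
    problems = []
    for folder, extensions in EXPECTED_FOLDERS.items():
        path = root / folder
        if not path.is_dir():
            problems.append(f"missing folder: {folder}")
            continue
        files = sorted(p for p in path.iterdir() if p.is_file())
        if not files:
            problems.append(f"empty folder: {folder}")
        if extensions is not None:
            for p in files:
                if p.suffix.lower() not in extensions:
                    problems.append(f"unexpected file type: {folder}/{p.name}")
        # The table asks for a single aggregated file of processed data.
        if folder == "processed_data" and len(files) > 1:
            problems.append("processed_data should hold a single aggregated file")
    return problems

# Demo on a throwaway directory that lacks the metadata folder.
# The file names below are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for folder, name in [("raw_data", "plate_01.parquet"),
                         ("processed_data", "olink_processed.csv")]:
        (root / folder).mkdir()
        (root / folder / name).touch()
    demo_problems = check_olink_layout(root)

print(demo_problems)  # ['missing folder: metadata']
```

An empty list means the layout matches the table; it says nothing about whether the file contents are valid.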

^aFor more details about target panels see Supplemental Standards: Target Panel.

*https://ark-portal.github.io/data_model/docs/attributes/dataLevel.html

**Data contributors are also required to provide standardized metadata conforming to the ARK Portal Data model. Templates will be provided to guide the collection of these critical metadata.

scRNA-seq/snRNA-seq data standards

Folder Name	Data Description	Expectation	File Formats^a	Data Level*
`fastq_files`	raw fastq files	required	gzipped fastq files	1
`bam_files`	read alignment files	optional	bam or cram files	2
`CellRanger_counts` or `raw_gene_counts`	raw gene counts	preferred	compressed tar archive (e.g., `.tgz`)^b or `.h5` file - these are readily available after running Cell Ranger count on 10x Genomics sc/snRNA-seq data	3
`processed_data`	processed aggregated gene counts	preferred	AnnData (as an `h5ad` file) or SeuratObj (as an `Rds` or similarly binary compressed R-compatible file)	4+
`metadata`	standardized metadata**, user-defined metadata (optional)	required	a tabular file format (e.g., `csv`, `xlsx`)	N/A

^aFile name conventions are described at Supplemental Standards: File Names

^bFor data processed by 10x Genomics Cell Ranger software, if contributors wish to upload the MEX output they should first convert either the raw_feature_bc_matrix/ or filtered_feature_bc_matrix/ folder to a gzip compressed tar archive.
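One way to produce the gzip-compressed tar archive described in this note is Python's standard `tarfile` module. This is a sketch under assumptions: the function name `archive_mex_folder` and the choice to write the archive next to the folder are illustrative, and the demo uses empty placeholder files rather than real Cell Ranger output.

```python
import tarfile
import tempfile
from pathlib import Path

def archive_mex_folder(mex_dir, out_path=None):
    """Pack a MEX folder into a gzip-compressed tar archive and return its path."""
    mex_dir = Path(mex_dir)
    if not mex_dir.is_dir():
        raise NotADirectoryError(str(mex_dir))
    if out_path is None:
        # Default: <folder name>.tgz next to the folder (an assumption of this sketch).
        out_path = mex_dir.parent / (mex_dir.name + ".tgz")
    out_path = Path(out_path)
    with tarfile.open(out_path, "w:gz") as tar:
        # arcname stores only the folder name in the archive, not the full path.
        tar.add(mex_dir, arcname=mex_dir.name)
    return out_path

# Demo with empty placeholder files standing in for real MEX output.
with tempfile.TemporaryDirectory() as tmp:
    mex = Path(tmp) / "filtered_feature_bc_matrix"
    mex.mkdir()
    for name in ("barcodes.tsv.gz", "features.tsv.gz", "matrix.mtx.gz"):
        (mex / name).write_bytes(b"")
    archive = archive_mex_folder(mex)
    with tarfile.open(archive, "r:gz") as tar:
        archived_names = sorted(tar.getnames())

print(archive.name)
print(archived_names)
```

The resulting `.tgz` file is what would be placed in the `CellRanger_counts` or `raw_gene_counts` folder.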

At a minimum, all sc/snRNA-seq data contributions to the ARK Portal should include the raw fastq files. We additionally request that contributors provide raw gene counts, either aggregated or split by library/sample, and a final aggregated AnnData (as an h5ad file) or SeuratObj (as an Rds or similarly binary compressed R-compatible file) of prepared counts data that has been used to derive research findings reported in publications. This finalized data object should include any normalization, integration, transforms, etc. that have been applied to the gene counts, along with critical cell metadata as outlined at Supplemental Standards: Single-cell Metadata.

*https://ark-portal.github.io/data_model/docs/attributes/dataLevel.html

**Data contributors are also required to provide standardized metadata conforming to the ARK Portal Data model. Templates will be provided to guide the collection of these critical metadata.

CITE-seq data standards

Folder Name	Data Description	Expectation	File Formats^a	Data Level*
`fastq_files/GEX_fastq`	scRNA-seq raw fastq files	required	gzipped fastq files	1
`fastq_files/feature_barcode_fastq`	feature barcode raw fastq files	required	gzipped fastq files	1
`CellRanger_counts` or `raw_counts`	raw gene and protein counts	preferred	compressed tar archive (e.g., `.tgz`) or `.h5` file - these are readily available after running Cell Ranger count on 10x Genomics-derived data.	3
`processed_data`	processed aggregated gene counts	preferred	AnnData (as an `h5ad` file) or SeuratObj (as an `Rds` or similarly binary compressed R-compatible file)	4+
`metadata`	target panel^b, standardized metadata**, user-defined metadata (optional)	required	a tabular file format (e.g., `csv`, `xlsx`)	N/A

^aFile name conventions are described at Supplemental Standards: File Names

^bFor more details about target panels see Supplemental Standards: Target Panel.

CITE-seq is a multi-modal data type that simultaneously profiles transcript and target protein abundance at single-cell resolution. Protein targets are profiled by sequencing barcodes contained within oligonucleotides conjugated to antibodies that bind to proteins of interest, where each barcode is uniquely associated with a specific protein. These antibody-derived barcodes are sometimes referred to as antibody-derived tags (ADT) or as feature barcodes. The ARK Portal uses the latter terminology.

The protein abundance libraries are created and sequenced as distinct libraries from the scRNA-seq libraries and are treated as a distinct assay type within the ARK Portal data model. Specifically, the ARK Portal classifies these libraries under the ‘feature barcode sequencing’ assay. This is distinct from other antibody-derived barcode methods like hash-tag oligos that are used to demultiplex libraries made of pooled cell suspensions and which do not target specific proteins for the purpose of quantifying protein abundance.

At a minimum, all CITE-seq data contributions to the ARK Portal should include the raw fastq files for both the scRNA-seq libraries and the feature barcode sequencing libraries. We additionally request that contributors provide raw gene and protein counts, either aggregated or split by library/sample, and a final aggregated AnnData (as an h5ad file) or SeuratObj (as an Rds or similarly binary compressed R-compatible file) of prepared counts data that has been used to derive research findings reported in publications. This finalized data object should include any normalization, integration, transforms, etc. that have been applied to the gene counts, along with critical cell metadata as outlined at Supplemental Standards: Single-cell Metadata.

**Data contributors are also required to provide standardized metadata conforming to the ARK Portal Data model. Templates will be provided to guide the collection of these critical metadata.

Supplemental Standards

File Names

Single specimen files

In the tables above, File Formats indicates the expected format and extension of data files. Here we describe conventions regarding information to include in your file names. The examples below use fastq files from sequencing based experiments to demonstrate ARK Portal file name conventions, but the convention is applicable to many other file types as well.

TL;DR: if a data file contains data pertaining to a single specimen, the biospecimenID should be included in the file name. If a file contains pooled data, particularly raw data, the file name should instead include the identifier for that pool. This identifier differs by assay: for example, pooled sequencing libraries should use the libraryID, Olink level 1 files should include the plateID, and barcoded and multiplexed FCS files should include the sampleProcessingBatch or dataCollectionBatch.

For single-specimen libraries, i.e., libraries consisting of only a single sample, fastq files should include the biospecimenID of that sample. All fastq files should include the read (R) or index (I) label and note the flow cell lane that the library was sequenced on as this is necessary for indicating libraries that were sequenced across multiple lanes, e.g.,

RASLE_000001_L001_R1_001.fastq.gz

Where RASLE_000001 is the biospecimenID, L001 is the flow cell lane, and R1 is the read of the library fragment sequenced in the file. More details on the Illumina fastq file naming convention are available at BaseSpace Naming Convention.
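As an illustration, a file name of this shape can be split into its parts with a regular expression. The pattern below is an assumption inferred from the single example above (identifier, lane, read or index label, and a trailing three-digit segment); it is not an official ARK Portal validator, and the function and group names are hypothetical.

```python
import re

# Pattern inferred from the example RASLE_000001_L001_R1_001.fastq.gz.
FASTQ_NAME = re.compile(
    r"^(?P<specimen>.+)_(?P<lane>L\d{3})_(?P<read>[RI]\d)_(?P<chunk>\d{3})\.fastq\.gz$"
)

def parse_fastq_name(file_name):
    """Split a fastq file name into identifier, lane, read label, and chunk."""
    match = FASTQ_NAME.match(file_name)
    if match is None:
        raise ValueError(f"file name does not follow the convention: {file_name}")
    return match.groupdict()

parts = parse_fastq_name("RASLE_000001_L001_R1_001.fastq.gz")
print(parts["specimen"], parts["lane"], parts["read"])  # RASLE_000001 L001 R1
```

A name that omits the lane or read label, such as `RASLE_000001_R1.fastq.gz`, raises `ValueError` in this sketch.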

While not common, there are some scenarios in which a library may be sequenced across multiple flow cells. In these cases it is important to create and assign distinct batch labels that distinguish between these runs. This can be the flow cell ID, a simple letter or number code, etc. This label should then be included in the fastq file name and will also be captured in the corresponding ARK Portal Assay Metadata Template.

For multi-modal assays where multiple libraries are derived from the same biospecimen, e.g., CITE-seq, the fastq file names should follow the above examples with the addition of a short abbreviation distinguishing libraries for each assay type. The table below outlines the abbreviations that should be used to distinguish between different libraries in a multi-modal experiment. The abbreviation is prepended to the fastq file name:

Abbreviation	Assay	Example
GEX	scRNA-seq (GEX = gene expression)	GEX_RASLE_000002_L001_R1_001.fastq.gz
FB	feature barcode sequencing	FB_RASLE_000002_L001_R1_001.fastq.gz
VDJ	V(D)J sequencing (TCR + BCR)	VDJ_RASLE_000002_L001_R1_001.fastq.gz
TCR	V(D)J sequencing (TCR only)	TCR_RASLE_000002_L001_R1_001.fastq.gz
BCR	V(D)J sequencing (BCR only)	BCR_RASLE_000002_L001_R1_001.fastq.gz
ATAC	ATAC-seq	ATAC_RASLE_000002_L001_R1_001.fastq.gz
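The naming pattern in the table can be expressed as a small helper that assembles a file name from its parts. This is a sketch only: the function name `fastq_name`, its parameters, and the set of accepted read labels are assumptions, while the abbreviations and the trailing `_001` segment are taken from the examples in the table.

```python
# Abbreviations from the table above.
ASSAY_ABBREVIATIONS = {"GEX", "FB", "VDJ", "TCR", "BCR", "ATAC"}
# Read (R) and index (I) labels accepted by this sketch (an assumption).
READ_LABELS = {"R1", "R2", "I1", "I2"}

def fastq_name(identifier, lane, read, assay=None):
    """Build a fastq file name; `assay` is only needed for multi-modal experiments."""
    if assay is not None and assay not in ASSAY_ABBREVIATIONS:
        raise ValueError(f"unknown assay abbreviation: {assay}")
    if read not in READ_LABELS:
        raise ValueError(f"unknown read label: {read}")
    stem = f"{identifier}_L{lane:03d}_{read}_001.fastq.gz"
    # The abbreviation, when given, goes at the start of the file name.
    return f"{assay}_{stem}" if assay else stem

print(fastq_name("RASLE_000002", 1, "R1", assay="FB"))
# FB_RASLE_000002_L001_R1_001.fastq.gz
print(fastq_name("RASLE_000001", 1, "R1"))
# RASLE_000001_L001_R1_001.fastq.gz
```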

Multispecimen libraries

For multispecimen libraries the file names should use the libraryID, plateID, slideID, etc. in place of the biospecimenID.

Single-cell Metadata

Cell-level metadata is a critical component of single-cell and single-nucleus datasets. Contributors are asked to include a minimal set of standardized cell metadata to streamline data reuse and support a more harmonized metadata infrastructure for ARK Portal data:

biospecimenID - by including this ID, each cell will be connected to the associated metadata collected via ARK Biospecimen metadata templates.
cellOntologyID - (if cell type annotations are included) corresponding to the predicted cell type. The Cell Ontology (CL) is a structured, controlled vocabulary for cell types and provides a set of unique identifiers for specifying cell types. You can explore the Cell Ontology at https://bioportal.bioontology.org/ontologies/CL?p=summary.

Any additional cell-level metadata fields should be defined in an accompanying dictionary to clearly document what information/data is also captured in these tables.
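The two standardized fields can be checked on a cell-level table before it is stored in the data object (for AnnData, such columns would typically live in `adata.obs`). The sketch below works on plain dictionaries to stay dependency-free. The function name and messages are hypothetical, and the assumption that cellOntologyID values are written as `CL:` followed by seven digits (for example `CL:0000084`, the Cell Ontology term for T cell) should be confirmed against the ARK Portal data model.

```python
import re

# Assumed identifier form: "CL:" followed by seven digits.
CL_ID = re.compile(r"^CL:\d{7}$")

def check_cell_metadata(rows, has_cell_types=True):
    """Check per-cell rows (a list of dicts) for the standardized fields."""
    problems = []
    for i, row in enumerate(rows):
        if not row.get("biospecimenID"):
            problems.append(f"row {i}: missing biospecimenID")
        if has_cell_types:
            cl_id = str(row.get("cellOntologyID") or "")
            if not CL_ID.match(cl_id):
                problems.append(f"row {i}: invalid cellOntologyID {cl_id!r}")
    return problems

cells = [
    {"biospecimenID": "RASLE_000001", "cellOntologyID": "CL:0000084"},
    {"biospecimenID": "", "cellOntologyID": "T cell"},
]
for problem in check_cell_metadata(cells):
    print(problem)
```

Setting `has_cell_types=False` skips the cellOntologyID check for datasets without cell type annotations.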

To learn more about “FAIRification” efforts for single-cell data please visit https://sc-fair.org/ and https://github.com/chanzuckerberg/single-cell-curation/tree/main.

Target Panel

To ensure transparency and to support future ARK Portal developments, certain data types will require the submission of a “target panel” file that details all the molecules (e.g., proteins) targeted and profiled in an assay. At a minimum, this should be a tabular file (e.g., csv, xlsx; PDFs will not be accepted) that lists all targets using established unique identifiers, for example UniProt IDs or HGNC-approved gene symbols. This file is often readily available from manufacturers of pre-defined kits.
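For contributors who need to assemble a target panel by hand, a minimal csv could be written as follows. The column names (`target_name`, `uniprot_id`, `hgnc_symbol`) and the function name are hypothetical, since this page does not prescribe a column layout; the two example rows use the human UniProt accessions and HGNC symbols for interleukin-6 and tumor necrosis factor.

```python
import csv
import io

# Hypothetical column layout; not prescribed by the ARK Portal.
PANEL_COLUMNS = ["target_name", "uniprot_id", "hgnc_symbol"]

def write_target_panel(targets, handle):
    """Write targets as csv, requiring at least one established identifier each."""
    writer = csv.DictWriter(handle, fieldnames=PANEL_COLUMNS)
    writer.writeheader()
    for target in targets:
        if not (target.get("uniprot_id") or target.get("hgnc_symbol")):
            raise ValueError(f"target lacks an identifier: {target}")
        writer.writerow({col: target.get(col, "") for col in PANEL_COLUMNS})

buffer = io.StringIO()
write_target_panel(
    [
        {"target_name": "Interleukin-6", "uniprot_id": "P05231", "hgnc_symbol": "IL6"},
        {"target_name": "Tumor necrosis factor", "uniprot_id": "P01375", "hgnc_symbol": "TNF"},
    ],
    buffer,
)
panel_csv = buffer.getvalue()
print(panel_csv)
```

In practice the `handle` would be a file opened with `open(path, "w", newline="")` inside the `metadata` folder.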

Resources

ARK Portal Data Model and Dictionary

The ARK Portal Data Model Dictionary is hosted online at https://ark-portal.github.io/data_model/. This site is built directly from the data model files hosted in a public repository on GitHub at https://github.com/ARK-Portal/data_model/tree/main. All are welcome to review, submit issues, and contribute to the ARK Portal data model.