Input data

ClinOmicsTrailbc can read several input file formats through which users provide the measurement data to be analyzed. In general, ClinOmicsTrailbc tries to detect the metadata of the uploaded data automatically: the data format, the identifier type, and the organism the data was derived from. If errors arise during this step, it is important to understand which input types ClinOmicsTrailbc supports.

Thus, in the following we discuss the expected input formats and the assumptions ClinOmicsTrailbc makes about their contents.

As ClinOmicsTrailbc is able to process data not only from microarray experiments but also from, e.g., mass-spectrometry experiments, we use the term entity when talking about genes, proteins, miRNAs, etc. Similarly, we use the term identifier whenever we mean the name of such an entity as it is used in a database such as Ensembl, UniProt, or NCBI Gene.

Gene expression

Gene expression values can be provided as a score list in a text-based format containing one identifier per line. The first column of each line holds the identifier; the second column holds the score, a numerical value measuring the relevance of the entity. Please note that positive scores should indicate up-regulated genes, whereas negative scores correspond to down-regulated genes. The two columns are separated by whitespace, preferably a tab character.

GDA 0.05501
SCN3A   -0.017374
SCN3B   0.33427200000000046
RPLP2   -0.10048799999999997
GFER    0.08075766666666603
SNORA68 0.2532145
SNORA65 -0.289492
PIP5KL1 0.267125
BTBD1   -0.824291000000001
RPLP0   0.050174750000000046
BTBD2   -0.424771999999999
BTBD3   0.267594
RPLP1   -0.1359804999999995
ATP6    -0.2206155
...
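
To check a score list before uploading it, the format described above can be parsed with a few lines of Python. This is a minimal sketch, not the parser used by ClinOmicsTrailbc; the function name `read_score_list` and the decision to reject lines that do not have exactly two columns are assumptions made for illustration.

```python
import io


def read_score_list(handle):
    """Parse 'identifier<whitespace>score' lines into a dict (hypothetical helper)."""
    scores = {}
    for line_number, line in enumerate(handle, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split()  # splits on tabs and spaces alike
        if len(fields) != 2:
            raise ValueError(
                f"line {line_number}: expected 2 columns, got {len(fields)}"
            )
        identifier, raw_score = fields
        # float() raises ValueError if the score is not numeric
        scores[identifier] = float(raw_score)
    return scores


# A few lines taken from the example above, mixing space and tab separators.
example = io.StringIO(
    "GDA 0.05501\nSCN3A\t-0.017374\nSCN3B   0.33427200000000046\n"
)
scores = read_score_list(example)

# Positive scores indicate up-regulated genes, negative scores down-regulated genes.
up = sorted(name for name, value in scores.items() if value > 0)
down = sorted(name for name, value in scores.items() if value < 0)
print("up-regulated:", up)
print("down-regulated:", down)
```

For a real file, pass an open file handle instead of the `io.StringIO` object.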

Besides precomputed score files, ClinOmicsTrailbc supports the direct analysis of normalized gene expression matrices. The matrix can be uploaded in a whitespace-separated file format such as TSV and should contain normalized gene expression values, with the measured genes in the rows and the samples in the columns. Data for one tumor sample and one or several healthy control samples need to be provided. Depending on the number of reference samples, either log-fold-quotients or z-scores are then computed to obtain scores of differential expression.

Sample1	Sample2	Sample3
GeneA	0.1	4.3	2.3
GeneB	3.2	-1.2	1.1
GeneC	2.7	9.1	0.3
...
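
The step from a matrix to scores of differential expression can be sketched as follows. The exact formulas and thresholds used by ClinOmicsTrailbc are not stated here, so this sketch rests on several assumptions: the values are already on a log scale (so the log-fold-quotient is a simple difference), a single control yields a log-fold-quotient while two or more controls yield a z-score, and the caller names the tumor column explicitly. The function names `read_matrix` and `differential_scores` are hypothetical.

```python
import io
import statistics


def read_matrix(handle):
    """Read a whitespace-separated matrix: a header of sample names, then gene rows."""
    samples = handle.readline().split()
    matrix = {}
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        gene, values = fields[0], [float(v) for v in fields[1:]]
        if len(values) != len(samples):
            raise ValueError(
                f"{gene}: expected {len(samples)} values, got {len(values)}"
            )
        matrix[gene] = dict(zip(samples, values))
    return samples, matrix


def differential_scores(samples, matrix, tumor):
    """Score the tumor sample against all remaining samples, treated as controls."""
    controls = [s for s in samples if s != tumor]
    if not controls:
        raise ValueError("at least one control sample is required")
    scores = {}
    for gene, row in matrix.items():
        reference = [row[s] for s in controls]
        if len(reference) == 1:
            # Assumption: log-scale input, so the log-fold-quotient is a difference.
            scores[gene] = row[tumor] - reference[0]
        else:
            spread = statistics.stdev(reference)
            # Assumption: genes with constant control values receive no score.
            scores[gene] = (
                (row[tumor] - statistics.mean(reference)) / spread
                if spread > 0
                else None
            )
    return scores


# The example matrix from above, with Sample1 assumed to be the tumor sample.
text = "Sample1\tSample2\tSample3\nGeneA\t0.1\t4.3\t2.3\nGeneB\t3.2\t-1.2\t1.1\n"
samples, matrix = read_matrix(io.StringIO(text))
z_scores = differential_scores(samples, matrix, tumor="Sample1")
print(z_scores)
```

Passing the tumor column by name avoids relying on column order, which the format description above does not fix.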

Tumor vs. control and tumor vs. tumor comparisons

ClinOmicsTrailbc offers the option to consider either scores for a tumor vs. control comparison or to compare a tumor under investigation to other tumor samples. These two types of comparisons provide two different views on the tumor under investigation. While the first type of comparison gives a more general overview of the tumor's aberrations, the second type allows the investigation of more fine-grained differences between tumors that might otherwise be disguised.

More specifically, when comparing a tumor sample to a healthy control (group), it is likely that various pathways (especially those involved in growth-promoting processes) will be upregulated. The investigation of tumor vs. tumor data, on the other hand, can potentially elucidate which pathways and processes are more (or less) active than expected 'on average' and are therefore rather specific to the tumor under investigation. Analyzing both types of comparisons thus yields a more comprehensive picture of the sample under investigation.

Depending on whether the provided gene expression data is specified to be a tumor vs. control or a tumor vs. tumor comparison, respective reference datasets are used in the comparative analyses performed by ClinOmicsTrailbc, i.e. the pathway activity radar chart and the clustering.

Handling of technology biases

When combining and comparing gene expression data from various resources, processing and technology biases are an important aspect to consider. To account for technology biases, ClinOmicsTrailbc differentiates between gene expression data obtained from microarrays and from RNA-Seq by using the respective reference files in its comparative analyses (i.e. the clustering and the radar chart). Ideally, reference files should be provided not just for different technologies but also for specific platforms; this is subject to the further development of ClinOmicsTrailbc.

Genetic variations

Genetic variations need to be provided in variant call format (.vcf). A .vcf file contains optional meta-information lines that start with '##', a header line indicated by '#' and data lines, where each data line contains information about a position in the genome.

##fileformat=VCFv4.1
#CHROM POS     ID        REF    ALT     QUAL FILTER INFO                      FORMAT      NA00001
20     14370   rs6054257 G      A       29   PASS   NS=3;DP=14;AF=0.5;DB;H2   GT:GQ:DP:HQ 0|0:48:1:51,51
20     17330   .         T      A       3    q10    NS=3;DP=11;AF=0.017       GT:GQ:DP:HQ 0|0:49:3:58,50
20     1110696 rs6040355 A      G,T     67   PASS   NS=2;DP=10;AF=0.333;DB    GT:GQ:DP:HQ 1|2:21:6:23,27
20     1230237 .         T      .       47   PASS   NS=3;DP=13;AA=T           GT:GQ:DP:HQ 0|0:54:7:56,60
20     1234567 microsat1 GTC    G,GTCT  50   PASS   NS=3;DP=9;AA=G            GT:GQ:DP    0/1:35:4
...
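
The three kinds of lines described above ('##' meta-information, the '#' header line, and data lines) can be told apart with a short Python sketch. This is only an illustration for inspecting a file before upload, not the reader used by ClinOmicsTrailbc; the function name `read_vcf` is hypothetical, and the sketch splits on any whitespace so that it also accepts the space-aligned example shown above, whereas real VCF files are tab-delimited.

```python
import io


def read_vcf(handle):
    """Split a VCF stream into meta-information lines, header columns and records."""
    meta, columns, records = [], None, []
    for line in handle:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("##"):
            meta.append(line[2:])  # meta-information line
        elif line.startswith("#"):
            columns = line[1:].split()  # header line, e.g. CHROM, POS, ID, ...
        else:
            if columns is None:
                raise ValueError("data line found before the '#CHROM' header line")
            # Assumption: no field contains spaces (see the note above).
            records.append(dict(zip(columns, line.split())))
    return meta, columns, records


# Two data lines taken from the example above.
text = (
    "##fileformat=VCFv4.1\n"
    "#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001\n"
    "20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51\n"
    "20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50\n"
)
meta, columns, records = read_vcf(io.StringIO(text))
passed = [r for r in records if r["FILTER"] == "PASS"]
print(len(records), "records,", len(passed), "with FILTER=PASS")
```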
		

The use of different variant callers and processing pipelines can significantly affect the results, in particular the number of mutations called. This difference becomes especially evident when comparing the tumor mutational burden of a sample under investigation with that of a reference cohort. To account for this, we provide reference data for four variant callers (MuTect2, MuSE, SomaticSniper, and VarScan2), which are listed in the 'Immunotherapy' tab of the results page. Moreover, we recommend the TCGA DNA-Seq Analysis Pipeline as a best-practice guideline.

Copy number variations

Copy number variations need to be provided in segmented data format (.seg). A .seg file is a tab-delimited text file that contains a header line and one line per genetic locus. The first four columns of each data line contain an identifier, the chromosome, the start location, and the end location. The last column contains a numerical value that represents the segment mean, i.e. the log2-transformed quotient of the copy number in the sample under investigation and the reference copy number for this locus.

ID	chrom	loc.start	loc.end	num.mark	seg.mean
GenomeWideSNP_416532	1	51598	76187	14	-0.7116
GenomeWideSNP_416532	1	76204	16022502	8510	-0.029
GenomeWideSNP_416532	1	16026084	16026512	6	-2.0424
GenomeWideSNP_416532	1	16026788	17063449	424	-0.1024
GenomeWideSNP_416532	1	17067742	17134834	61	-0.6868
...
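
Because the segment mean is a log2-transformed quotient, an approximate copy number can be recovered as the reference copy number times 2 raised to the segment mean. The sketch below reads a .seg stream and applies this conversion. It is an illustration only: the function name `read_seg` is hypothetical, and the default reference copy number of 2 (a diploid locus) is an assumption that does not hold, for example, for sex chromosomes.

```python
import io


def read_seg(handle, reference_copies=2.0):
    """Parse a tab-delimited .seg stream and estimate a copy number per segment."""
    handle.readline()  # skip the header line
    segments = []
    for line in handle:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        seg_mean = float(fields[-1])  # the last column holds the segment mean
        segments.append(
            {
                "id": fields[0],
                "chrom": fields[1],
                "start": int(fields[2]),
                "end": int(fields[3]),
                "seg_mean": seg_mean,
                # seg_mean = log2(copies / reference), hence:
                "copies": reference_copies * 2 ** seg_mean,
            }
        )
    return segments


# The first two data lines from the example above.
text = (
    "ID\tchrom\tloc.start\tloc.end\tnum.mark\tseg.mean\n"
    "GenomeWideSNP_416532\t1\t51598\t76187\t14\t-0.7116\n"
    "GenomeWideSNP_416532\t1\t76204\t16022502\t8510\t-0.029\n"
)
segments = read_seg(io.StringIO(text))
for segment in segments:
    print(segment["chrom"], segment["start"], segment["end"], round(segment["copies"], 2))
```

A segment mean of 0 therefore corresponds to the reference copy number, and a segment mean of -1 to half of it.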
		

Methylation data

Methylation data should be provided as a score file or a matrix. A score file is a whitespace-separated text file that contains a gene identifier and the corresponding gene's score per row. For methylation data, these scores might be, e.g., the beta values of a gene's promoter region or precomputed scores of differential methylation. Alternatively, beta values for one tumor sample and one or several healthy control samples can be uploaded. Depending on the number of reference samples, either log-fold-quotients or z-scores are then computed to obtain scores of differential methylation.

GDA 0.055
SCN3A   0.0173
SCN3B   0.534
RPLP2   0.101
GFER    0.084
SNORA68 0.253
BTBD3   0.267
RPLP1   0.735
ATP6    0.820
...

Clinical information

ClinOmicsTrailbc analyzes the status of several standard clinical markers for breast cancer diagnosis and treatment. The hormone receptor status (estrogen receptor and progesterone receptor), HER2/neu amplification, and the menopausal status of the patient can inform eligibility for several types of drugs, including aromatase inhibitors, estrogen receptor-targeting drugs, and antibodies such as trastuzumab or pertuzumab. In addition, information on tumor growth (Ki-67 staining, S-phase fraction), the histopathological subtype, tumor size and grade, and lymph node and metastasis status can be provided, as well as clinical metadata such as a patient ID, the origin of the sample (primary tumor vs. metastasis), the fraction of tumor tissue in the sample, and the date of biopsy.

The tumor stage can be provided based on the TNM Staging System:

  • Primary tumor:
    • TX: Main tumor cannot be measured
    • T0: Main tumor cannot be found
    • T1-4: Refers to the size and/or extent of the main tumor. The higher the number, the larger the tumor or the more it has grown into nearby tissues.
  • Regional lymph nodes:
    • NX: Cancer in nearby lymph nodes cannot be measured
    • N0: There is no cancer in nearby lymph nodes
    • N1-3: Refers to the number and location of lymph nodes that contain cancer. The higher the number, the more lymph nodes contain cancer.
  • Distant metastasis:
    • MX: Metastasis cannot be measured
    • M0: Cancer has not spread to other parts of the body
    • M1: Cancer has spread to other parts of the body
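
When clinical records are prepared outside of ClinOmicsTrailbc, it can help to check that a recorded stage uses only the categories listed above before selecting the values in the web interface. The following sketch is an assumption-laden illustration: the pattern `TNM_PATTERN` and the function `parse_tnm` are hypothetical, and the pattern deliberately covers only the categories listed above, not the subcategories or prefixes that exist in the full staging system.

```python
import re

# Accepts TX/T0-T4, NX/N0-N3 and MX/M0/M1, optionally separated by whitespace.
TNM_PATTERN = re.compile(r"^T(?P<t>X|[0-4])\s*N(?P<n>X|[0-3])\s*M(?P<m>X|[01])$")


def parse_tnm(code):
    """Return the T, N and M components of a code like 'T2 N1 M0', or None if invalid."""
    match = TNM_PATTERN.match(code.strip().upper())
    return match.groupdict() if match else None


for code in ["T2 N1 M0", "txn0m1", "T5 N0 M0"]:
    print(code, "->", parse_tnm(code))
```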

Corresponding values for the clinical markers can be selected from the respective dropdown menus. In cases where some of these markers were not assessed, the default option Unknown/ambiguous can be chosen.

Troubleshooting

ClinOmicsTrailbc does not recognize my score list exported from Excel

MS Excel is a popular tool for managing biological datasets. However, there are some pitfalls, especially when it comes to interoperability with other tools. Excel may silently reformat gene identifiers as dates. For example, the gene Apr1 is routinely recognized as April the first. Please make sure that no such conversions have taken place before exporting your data from Excel.
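
A quick way to spot such conversions is to scan the identifier column for entries that look like dates. The sketch below is a heuristic, not part of ClinOmicsTrailbc: the names `DATE_LIKE` and `find_date_like` are hypothetical, and the pattern assumes that converted identifiers appear as a day and an abbreviated English month (for example `1-Apr` or `Sep-02`). Fully numeric dates or other locales are not detected.

```python
import re

MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
# Matches 'day<sep>month' or 'month<sep>day' with -, /, . or a space as separator.
DATE_LIKE = re.compile(
    rf"^(?:\d{{1,2}}[-/. ](?:{MONTHS})|(?:{MONTHS})[-/. ]\d{{1,2}})$",
    re.IGNORECASE,
)


def find_date_like(identifiers):
    """Return the identifiers that look like dates produced by a spreadsheet."""
    return [name for name in identifiers if DATE_LIKE.match(name.strip())]


suspicious = find_date_like(["GDA", "1-Apr", "Sep-02", "SCN3A", "APR1"])
print(suspicious)
```

Any identifier reported by such a check should be compared against the original data source and restored before the file is uploaded.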

For more information see also Zeeberg et al. [1].

Bibliography

  1. Zeeberg, B. R., Riss, J., Kane, D. W., Bussey, K. J., Uchio, E., Linehan, W. M., Barrett, J. C., and Weinstein, J. N. Mistaken identifiers: gene name errors can be introduced inadvertently when using Excel in bioinformatics. BMC Bioinformatics, BioMed Central Ltd.