ClinOmicsTrailbc 1.0
A visual analytics tool for breast cancer treatment stratification using multi-omics data
Input data
ClinOmicsTrailbc can read various input file formats through which the user provides the measurement data to be analyzed. In general, ClinOmicsTrailbc tries to detect the metadata of the uploaded data automatically: the data format, the identifier type, and the organism the data was derived from. If errors arise during this step, it is important to understand which input types ClinOmicsTrailbc supports.
Thus, in the following we discuss the expected input formats and the assumptions ClinOmicsTrailbc makes about their contents.
An identifier is the name of a biological entity, such as a gene or protein, as it is used in a database such as Ensembl, UniProt, or NCBI Gene.
Gene expression
Gene expression values can be provided as a score list in a text-based format with one identifier per line. The first column of each line contains the identifier, and the second column contains the score, a numerical value measuring the relevance of the entity. Please note that positive scores should indicate up-regulated genes, whereas negative scores correspond to down-regulated genes. The two columns are separated by whitespace, preferably by a tab character.
GDA 0.05501
SCN3A -0.017374
SCN3B 0.33427200000000046
RPLP2 -0.10048799999999997
GFER 0.08075766666666603
SNORA68 0.2532145
SNORA65 -0.289492
PIP5KL1 0.267125
BTBD1 -0.824291000000001
RPLP0 0.050174750000000046
BTBD2 -0.424771999999999
BTBD3 0.267594
RPLP1 -0.1359804999999995
ATP6 -0.2206155
...
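As an illustration of the format described above, the following minimal Python sketch reads a score list into a dictionary and separates up-regulated from down-regulated genes by the sign of the score. The function name `parse_score_list` is hypothetical and not part of ClinOmicsTrailbc; how the tool itself handles malformed lines is not documented here, so rejecting them is an assumption.

```python
def parse_score_list(text):
    """Parse 'identifier<whitespace>score' lines into a dict (hypothetical helper)."""
    scores = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        fields = line.split()  # splits on tabs and spaces alike
        if len(fields) != 2:
            raise ValueError(f"line {line_no}: expected 2 columns, got {len(fields)}")
        identifier, raw_score = fields
        scores[identifier] = float(raw_score)  # raises ValueError for non-numeric scores
    return scores


example = "GDA\t0.05501\nSCN3A\t-0.017374\nSCN3B\t0.33427200000000046\n"
scores = parse_score_list(example)

# Positive scores indicate up-regulation, negative scores down-regulation.
up_regulated = sorted(gene for gene, score in scores.items() if score > 0)
down_regulated = sorted(gene for gene, score in scores.items() if score < 0)
print(up_regulated, down_regulated)
```

Running such a check before uploading can catch lines with a missing score or a non-numeric value early.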
Besides precomputed score files, ClinOmicsTrailbc provides support for directly analyzing normalized gene expression matrices. The gene expression matrix can be uploaded in a whitespace-separated file format such as TSV and should contain normalized gene expression values for the measured genes in the rows and the different samples in the columns. Here, data for one tumor sample and one or several healthy control samples needs to be provided. Based on the number of reference samples, either log-fold-quotients or z-scores are then computed to obtain scores of differential expression.
Sample1 Sample2 Sample3
GeneA 0.1 4.3 2.3
GeneB 3.2 -1.2 1.1
GeneC 2.7 9.1 0.3
...
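The choice between log-fold-quotients and z-scores can be sketched per gene as follows. This is only an illustration under stated assumptions, not the implementation used by ClinOmicsTrailbc: it assumes that a single control sample leads to a log2 quotient, that two or more controls lead to a z-score based on the sample standard deviation, and that values passed to the quotient are positive and on a linear scale. The function name `differential_score` is hypothetical.

```python
import math
import statistics


def differential_score(tumor, controls):
    """Score one gene for a tumor value against control values (hypothetical helper)."""
    if not controls:
        raise ValueError("at least one control sample is required")
    if len(controls) == 1:
        # Assumption: positive, linear-scale values, so the quotient is defined.
        return math.log2(tumor / controls[0])
    mean = statistics.mean(controls)
    sd = statistics.stdev(controls)  # sample standard deviation (n - 1)
    if sd == 0:
        raise ValueError("controls have zero variance; z-score is undefined")
    return (tumor - mean) / sd


# One control sample: log2 quotient of tumor and control.
single_control = differential_score(8.0, [2.0])

# Several control samples: z-score of the tumor value.
several_controls = differential_score(5.0, [1.0, 2.0, 3.0])

print(single_control, several_controls)
```

In both cases a positive result means the gene is more strongly expressed in the tumor than in the reference, which matches the sign convention of the score lists.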
Tumor vs. control and tumor vs. tumor comparisons
ClinOmicsTrailbc offers the option to consider either scores for a tumor vs. control comparison or to compare a tumor under investigation to other tumor samples. These two types of comparisons provide two different views on the tumor under investigation. While the first type of comparison gives a more general overview of the tumor's aberrations, the second type makes it possible to investigate more fine-grained differences between tumors that might otherwise be disguised.
More specifically, when comparing a tumor sample to a healthy control (group), it is likely that various pathways (especially those involved in growth-promoting processes) will be upregulated. The investigation of tumor vs. tumor data, on the other hand, can potentially elucidate which pathways and processes are more (or less) active than expected 'on average' and are therefore rather specific to the tumor under investigation. Analyzing both types of comparisons thus yields a more comprehensive picture of the sample under investigation.
Depending on whether the provided gene expression data is specified to be a tumor vs. control or a tumor vs. tumor comparison, respective reference datasets are used in the comparative analyses performed by ClinOmicsTrailbc, i.e. the pathway activity radar chart and the clustering.
Handling of technology biases
When combining and comparing gene expression data from various resources, processing and technology biases are an important aspect to consider. To account for technology biases, ClinOmicsTrailbc differentiates between gene expression data obtained from microarrays and from RNA-Seq by using the respective reference files in its comparative analyses (i.e. the clustering and the radar chart). Ideally, reference files should be provided not just for different technologies but also for specific platforms; this is subject to the further development of ClinOmicsTrailbc.
Genetic variations
Genetic variations need to be provided in variant call format (.vcf). A .vcf file contains optional meta-information lines that start with '##', a header line indicated by '#' and data lines, where each data line contains information about a position in the genome.
##fileformat=VCFv4.1
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333;DB GT:GQ:DP:HQ 1|2:21:6:23,27
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60
20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4
...
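The three kinds of lines in a .vcf file can be told apart by their prefix, as the following minimal Python sketch shows. It assumes tab-separated columns, as the VCF format prescribes, and uses a hypothetical helper named `read_vcf`; it is not the parser used by ClinOmicsTrailbc, and dedicated VCF libraries are preferable for real analyses.

```python
def read_vcf(text):
    """Split VCF text into meta lines, header columns, and data records (hypothetical helper)."""
    meta, header, records = [], None, []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith("##"):
            meta.append(line[2:])  # meta-information line, e.g. fileformat=VCFv4.1
        elif line.startswith("#"):
            header = line[1:].split("\t")  # header line: CHROM, POS, ID, ...
        else:
            if header is None:
                raise ValueError("data line found before the '#CHROM' header line")
            records.append(dict(zip(header, line.split("\t"))))
    return meta, header, records


example = "\n".join([
    "##fileformat=VCFv4.1",
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]),
    "\t".join(["20", "14370", "rs6054257", "G", "A", "29", "PASS", "NS=3;DP=14;AF=0.5;DB;H2"]),
    "\t".join(["20", "17330", ".", "T", "A", "3", "q10", "NS=3;DP=11;AF=0.017"]),
])

meta, header, records = read_vcf(example)
passed = [record for record in records if record["FILTER"] == "PASS"]
print(len(records), len(passed))
```

Counting the records that pass the caller's filters in this way gives a quick plausibility check on the number of variants before a file is uploaded.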
The use of different variant callers and processing pipelines can significantly affect the results, in particular the number of mutations called. This difference becomes especially evident when comparing the tumor mutational burden of a sample under investigation with that of a reference cohort. To account for this, we provide reference data for four variant callers (MuTect2, MuSE, SomaticSniper, and VarScan2), which are listed in the 'Immunotherapy' tab of the results page. Moreover, we recommend the TCGA DNA-Seq Analysis Pipeline as a best-practice guideline.
Copy number variations
Copy number variations need to be provided in segmented data format (.seg). A .seg file is a tab-delimited text file that contains a header line and one line per genetic locus. Each of the data lines contains in the first four columns an identifier, the chromosome, the start location, and the end location. The last column contains a numerical value that represents the segment mean, i.e. the log-2-transformed quotient of the copy number in the sample under investigation and the reference copy number for this locus.
ID chrom loc.start loc.end num.mark seg.mean
GenomeWideSNP_416532 1 51598 76187 14 -0.7116
GenomeWideSNP_416532 1 76204 16022502 8510 -0.029
GenomeWideSNP_416532 1 16026084 16026512 6 -2.0424
GenomeWideSNP_416532 1 16026788 17063449 424 -0.1024
GenomeWideSNP_416532 1 17067742 17134834 61 -0.6868
...
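Because the segment mean is a log-2-transformed quotient, it can be converted back into an approximate copy number. The following Python sketch reads a tab-delimited .seg file with the column names shown in the example and computes `reference_copies * 2 ** seg.mean`. The helper name `read_seg` and the default of two reference copies (a diploid locus) are assumptions made for illustration; ClinOmicsTrailbc may process the values differently.

```python
import csv
import io


def read_seg(text, reference_copies=2.0):
    """Parse tab-delimited SEG text and estimate copy numbers (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    segments = []
    for row in reader:
        seg_mean = float(row["seg.mean"])
        segments.append({
            "id": row["ID"],
            "chrom": row["chrom"],
            "start": int(row["loc.start"]),
            "end": int(row["loc.end"]),
            "seg_mean": seg_mean,
            # seg.mean = log2(sample copies / reference copies), so invert the log.
            "copies": reference_copies * 2.0 ** seg_mean,
        })
    return segments


example = "\n".join([
    "\t".join(["ID", "chrom", "loc.start", "loc.end", "num.mark", "seg.mean"]),
    "\t".join(["GenomeWideSNP_416532", "1", "51598", "76187", "14", "-0.7116"]),
    "\t".join(["GenomeWideSNP_416532", "1", "76204", "16022502", "8510", "-0.029"]),
])

segments = read_seg(example)
for segment in segments:
    print(segment["chrom"], segment["start"], segment["end"], round(segment["copies"], 2))
```

Under the diploid assumption, a segment mean of 0 corresponds to two copies, -1 to a single copy, and +1 to four copies.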
Methylation data
Methylation data should be provided as a score file or a matrix. A score file is a whitespace-separated text file that contains a gene identifier and the corresponding gene's score in each row. For methylation data, these scores might be, for example, the beta values of a gene's promoter region or precomputed scores of differential methylation. Alternatively, beta values for one tumor sample and one or several healthy control samples can be uploaded. Based on the number of reference samples, either log-fold-quotients or z-scores are then computed to obtain scores of differential methylation.
GDA 0.055
SCN3A 0.0173
SCN3B 0.534
RPLP2 0.101
GFER 0.084
SNORA68 0.253
BTBD3 0.267
RPLP1 0.735
ATP6 0.820
...
Clinical information
ClinOmicsTrailbc analyzes the status of several standard clinical markers for breast cancer diagnosis and treatment. The hormone receptor status (estrogen receptor and progesterone receptor), HER2/neu amplification, and the menopausal status of the patient can inform the eligibility for several types of drugs, including aromatase inhibitors, estrogen receptor-targeting drugs, and antibodies such as trastuzumab or pertuzumab. In addition, information on tumor growth (Ki-67 staining, S-phase fraction), the histopathological subtype, tumor size and grade, and lymph node and metastasis status can be provided to ClinOmicsTrailbc, as well as clinical metadata such as a patient ID, the origin of the sample (primary tumor vs. metastasis), the fraction of tumor tissue in the sample, and the date of biopsy.
The tumor stage can be provided based on the TNM Staging System:
- Primary tumor:
  - TX: Main tumor cannot be measured
  - T0: Main tumor cannot be found
  - T1-4: Refers to the size and/or extent of the main tumor. The higher the number, the larger the tumor or the more it has grown into nearby tissues.
- Regional lymph nodes:
  - NX: Cancer in nearby lymph nodes cannot be measured
  - N0: There is no cancer in nearby lymph nodes
  - N1-3: Refers to the number and location of lymph nodes that contain cancer. The higher the number, the more lymph nodes contain cancer.
- Distant metastasis:
  - MX: Metastasis cannot be measured
  - M0: Cancer has not spread to other parts of the body
  - M1: Cancer has spread to other parts of the body
Corresponding values for the clinical markers can be selected from the respective dropdown menus. In cases where some of these markers were not assessed, the default option Unknown/ambiguous can be chosen.
Troubleshooting
ClinOmicsTrailbc does not recognize my score list exported from Excel
MS Excel is a popular tool for managing biological datasets. However, there are some pitfalls, especially when it comes to interoperability with other tools. It can happen that Excel reformats gene identifiers as dates. For example, the gene Apr1 is routinely recognized as April the first. Please make sure that no such conversions have taken place before exporting your data from Excel.
For more information see also Zeeberg et al. [1].
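A quick way to spot such conversions is to scan the exported identifiers for date-like strings. The Python sketch below is a heuristic only: the patterns (for example `1-Apr`, `Sep-02`, or `2-Mar-2020`) are assumptions about how converted identifiers may appear with English month abbreviations, and other locales or cell formats can produce different strings. The function name `find_date_like_identifiers` is hypothetical.

```python
import re

_MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"

# Matches e.g. "1-Apr", "01-Apr-2020" (day first) or "Sep-02" (month first).
_DATE_LIKE = re.compile(
    rf"^(?:\d{{1,2}}-(?:{_MONTHS})(?:-\d{{2,4}})?|(?:{_MONTHS})-\d{{2,4}})$",
    re.IGNORECASE,
)


def find_date_like_identifiers(identifiers):
    """Return identifiers that look like dates (heuristic, hypothetical helper)."""
    return [name for name in identifiers if _DATE_LIKE.match(name.strip())]


suspicious = find_date_like_identifiers(
    ["GDA", "1-Apr", "SCN3A", "Sep-02", "2-Mar-2020"]
)
print(suspicious)
```

Any identifier reported by such a check should be compared against the original data and restored before the file is uploaded.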
Bibliography
- [1] Zeeberg et al. Mistaken identifiers: gene name errors can be introduced inadvertently when using Excel in bioinformatics. BMC Bioinformatics, BioMed Central Ltd.