Blog Post
May 17, 2018

2018 Kidney Cancer Hackathon: Data

We have a set of TCGA pan-cancer data created by Clemson (Note 1), as well as sequenced blood and tumor data (Note 2), which has been processed by Sean Davis (Note 3). These data sets are available in Google Cloud as well as on 4 disks.

 

Storage locations (device or service: volume name):

  • Seagate Backup Plus: Genome 5TB
  • Google Cloud: Rare kidney cancer _patient_0
  • Google Cloud: p1Rcc
  • My Passport: Passport1 1TB
  • My Passport: Passport2 1TB
  • WD Element: Untitled 2 TB
  • G Drive Mobile: G Drive Mobile 4TB

Name, Description, and availability marks:
BGI_Data_List_F18FTSUSAT0015_20180420-NTFS.pdf ~ X X X X X
20180207Clemson_TCGA_GEM/ ~ X X X X X
20180517PanTCGA_Expression_Data/ Note 1 ~ X
F18FTSUSAT0015_HUMaasR_20180412/ Note 2 X X
F18FTSUSAT0015_HUMaasR/ Note 2 X X
F18FTSUSAT0015_HUMaasR/T1_1A/result_alignment Note 2
F18FTSUSAT0015_HUMaasR/T1_1A/result_variation Note 2
F18FTSUSAT0015_HUMaasR/WBA/result_alignment Note 2
F18FTSUSAT0015_HUMaasR/WBA/result_variation Note 2
report (1).zip Note 2 X
T1_1A.sv.tar.gz Note 2 X
somatic/ Note 3 X

Note 1: Clemson's PanTCGA Expression Data

The folder contains the following files:

  • contains documentation for downloading all available gene expression profiles for cancer types available through the TCGA GDC (https://portal.gdc.cancer.gov/)
  • PanTCGA_FPKM_5_16_18.tab.tar.gz : A gene expression matrix containing FPKM values for 60,483 genes in 11,093 cancer samples. The first row contains file IDs.
    • Note that there is no header for the first column of gene IDs, so the first row is shifted one column to the left. These FPKM values are normalized for gene length and number of mapped reads.
  • Kidney_FPKM_Quantile_No-Outliers_5_17_18.tab : A log2-transformed, quantile-normalized subset of the above matrix. Only kidney cancer samples are present, and 12 outlier samples were removed.
  • PanTCGA_Counts_5_16_18.tab.tar.gz : A gene expression matrix containing raw molecule counts for 11,093 cancer samples.
    • Unless you are performing differential gene expression analysis that requires raw count input, it is likely better to use the FPKM file instead of this one.
  • PanTCGA_Worksheet-v1-20180514.xlsx : An Excel spreadsheet containing important information about sample IDs, tissue types (i.e., "solid tissue normal" vs "primary tumor"), clinical attributes, and exposure data.
    • The 'File-Case-Mapping' sheet in this workbook contains a mapping table that you can use to translate information between the other three sheets.
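
Because the header row of the FPKM matrix has no label for the gene ID column, it helps to confirm the shift before pairing values with file IDs. The following is a minimal sketch using awk, assuming the file is tab-delimited and that the header has exactly one field fewer than each data row. The function names and the file and gene names in the usage note are hypothetical.

```sh
# Report whether the header row has one field fewer than the first data row,
# which means the gene-ID column is unlabeled and the file IDs are shifted
# one column to the left.
check_header_shift() {
    awk -F'\t' '
        NR == 1 { header = NF }
        NR == 2 { data = NF; exit }
        END {
            if (data == header + 1) print "shifted"
            else print "aligned"
        }
    ' "$1"
}

# Print "gene_id<TAB>file_id<TAB>value" for one gene, pairing each value
# with the correct file ID despite the shifted header (assumes the shift).
gene_values() {
    awk -F'\t' -v gene="$2" '
        NR == 1 { for (i = 1; i <= NF; i++) sample[i] = $i; next }
        $1 == gene {
            for (i = 2; i <= NF; i++) print $1 "\t" sample[i - 1] "\t" $i
            exit
        }
    ' "$1"
}
```

For example, `check_header_shift matrix.tab` prints `shifted` for a file laid out as described above, and `gene_values matrix.tab SOME_GENE_ID` lists that gene's value for every file ID. Both read the file in a single pass, which matters for a matrix of this size.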



Note 2: p1RCC DATA DESCRIPTION

Patient Samples

The patient submitted a Whole Blood sample and FFPE Tissue Samples (Formalin-Fixed Paraffin-Embedded; a method for preserving tissue). The Whole Blood sample serves as the source of “normal” DNA and the Tissue Sample serves as the source of “tumour” DNA. Within the project directories, WBA is the Whole Blood Sample and T1_1A is the Tumour Tissue Sample.

Whole Genome Sequencing

The qualified genomic DNA samples were randomly fragmented by Covaris technology and 350bp fragments were obtained after fragment selection. End repair of the DNA fragments was performed and an "A" base was added at the 3'-end of each strand. Adapters were then ligated to both ends of the end-repaired/dA-tailed DNA fragments, which were amplified by ligation-mediated PCR (LM-PCR); single strands were then separated and cyclized. Rolling circle amplification (RCA) was then performed to produce DNA Nanoballs (DNBs). The qualified DNBs were loaded into patterned nanoarrays and paired-end reads were read on the BGISEQ-500 platform. High-throughput sequencing was performed for each library to ensure that each sample met the average sequencing coverage requirement (90x). Sequencing-derived raw image files were processed by the BGISEQ-500 base-calling software with default parameters. The sequence data of each individual sample was generated as paired-end reads, which are defined as "raw data" and stored in FASTQ format.

Video Overview of Methods   | In-depth Info on BGISEQ-500   | Comparison to Illumina HiSeq X Ten

Data

There are two primary directories within the project folder: rarekidneycancer_patient_0 and somatic. Within rarekidneycancer_patient_0 →  F18FTSUSAT0015_HUMaasR you’ll find the data from the BGISEQ-500 DNB sequencing (includes both raw and processed data). Two folders corresponding to the two samples, named WBA and T1_1A, are laid out similarly within the directory. As previously noted, the WBA is the "normal" whole blood sample. The T1_1A sample is the "tumor" sample. Inside each are 3 folders, clean_data, result_alignment, and result_variation.

Inside the clean_data folder, you’ll find data which has been filtered to decrease the noise of the sequencing data. The process included (1) removing reads containing the sequencing adapter; (2) removing reads whose low-quality base ratio (base quality less than or equal to 5) is more than 50%; (3) removing reads whose unknown base ('N' base) ratio is more than 10%. All downstream analysis was performed on this data.
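
The three filtering rules can be illustrated with a short awk sketch. This is not the tool that produced clean_data; it only restates the rules under some assumptions: a plain 4-line-per-record FASTQ file, Phred+33 quality encoding, an exact-match adapter search, and single-end handling (a real paired-end filter would drop both mates together). The function name and the adapter sequence passed to it are hypothetical.

```sh
# filter_fastq FILE ADAPTER
# Print only the reads that pass all three filters.
filter_fastq() {
    awk -v adapter="$2" '
        NR % 4 == 1 { head = $0 }
        NR % 4 == 2 { seq = $0 }
        NR % 4 == 3 { plus = $0 }
        NR % 4 == 0 {
            qual = $0
            len = length(seq)
            n_bases = 0
            low = 0
            for (i = 1; i <= len; i++) {
                if (substr(seq, i, 1) == "N") n_bases++
                # Phred+33: the characters ! through & encode qualities 0 to 5
                if (index("!\"#$%&", substr(qual, i, 1)) > 0) low++
            }
            if (index(seq, adapter) > 0) next   # (1) read contains the adapter
            if (low * 2 > len) next             # (2) more than 50% low-quality bases
            if (n_bases * 10 > len) next        # (3) more than 10% N bases
            print head; print seq; print plus; print qual
        }
    ' "$1"
}
```

The ratio checks use integer arithmetic (`low * 2 > len`, `n_bases * 10 > len`) so that reads sitting exactly on the 50% or 10% boundary are kept, matching the "more than" wording above.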

Inside the result_alignment folder you’ll find the BAM files for each sample for loading as reads. All clean reads were aligned to the human reference genome (GRCh37/hg19) using the BWA-MEM method within the Burrows-Wheeler Aligner (BWA v0.7.12).

Inside the result_variation folder you’ll find directories for cnv (Copy Number Variants), indel (Insertions and Deletions), and snp (Single Nucleotide Polymorphisms), plus sv (Structural Variants) for the WBA sample. Within the indel and snp directories you’ll find the vcf files (both raw and filtered for each) for loading as callsets. Within the sv and cnv folders you’ll find csv/xls files for use. The genomic variations were detected by HaplotypeCaller of GATK (v3.3.0). After that, the variant quality score recalibration (VQSR) method was applied to get high-confidence variant calls. The CNVs were called using the CNVnator [8] v0.2.7 read-depth algorithm. The SVs were detected using BreakDancer or CREST. Then the SnpEff tool was applied to perform a series of annotations for the variants.
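
One quick way to see what the filtering did to a callset is to tally records by the FILTER column (column 7 of a VCF). The sketch below reads VCF text on standard input; the function name is hypothetical, and the actual file names inside the snp and indel directories are not listed here, so any name used with it is an assumption.

```sh
# Tally VCF records by their FILTER column, skipping header lines.
# Output: one "FILTER<TAB>count" line per distinct value, sorted by name.
count_by_filter() {
    awk -F'\t' '
        !/^#/ { n[$7]++ }
        END { for (f in n) print f "\t" n[f] }
    ' | sort
}
```

For a gzipped file, decompress into the function, for example `gzip -dc some_callset.vcf.gz | count_by_filter`, where `some_callset.vcf.gz` stands in for one of the raw or filtered vcf files.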

Within the somatic directory, you’ll find vcf files corresponding to the somatic variants between the tumour and “normal” DNA in the patient’s samples. The inputs to generate the somatic variant calls are the tumor and normal genomes, represented by the normal and tumor BAM files. The README file in the somatic directory contains the actual command lines used as well as links to the software.

Review of Whole Genome Sequencing  |  Review of Somatic Mutations in Cancer  | Wiki on Genetic Variation

Useful Resources

Note 3: Sean Davis' Somatic Data


## strelka calling

See: https://github.com/Illumina/strelka

```{sh}
# Configure the somatic workflow: WBA is the normal sample, T1_1A the tumor
configureStrelkaSomaticWorkflow.py \
    --normalBam ../F18FTSUSAT0015_HUMaasR/WBA/result_alignment/WBA.bam \
    --tumorBam ../F18FTSUSAT0015_HUMaasR/T1_1A/result_alignment/T1_1A.bam \
    --ref ../hg19/hg19.fa \
    --runDir T1_1A_somatic

# Run the configured workflow locally with 32 parallel jobs
T1_1A_somatic/runWorkflow.py -m local -j 32
```



## snpEff

See: http://snpeff.sourceforge.net/SnpEff_manual.html

```{sh}
# Annotate the somatic SNV and indel calls against the GRCh37.75 database
snpEff -Xmx48g -v GRCh37.75 somatic.snvs.vcf.gz | gzip > somatic.snvs.snpeff.vcf.gz
snpEff -Xmx48g -v GRCh37.75 somatic.indels.vcf.gz | gzip > somatic.indels.snpeff.vcf.gz
```
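
To skim the annotated output, the passing records can be reduced to position, effect, impact, and gene. This is a rough sketch, assuming snpEff wrote its annotations to an `ANN=` entry in the INFO column with subfields separated by `|` (older snpEff releases use an `EFF=` field instead, which this does not handle). The function name is hypothetical.

```sh
# For each PASS record, print CHROM, POS, and the effect, impact, and gene
# name taken from the first ANN entry in the INFO column.
first_annotation() {
    awk -F'\t' '
        /^#/ || $7 != "PASS" { next }
        {
            n = split($8, info, ";")
            for (i = 1; i <= n; i++) {
                if (substr(info[i], 1, 4) == "ANN=") {
                    # ANN subfields: allele | effect | impact | gene name | ...
                    split(substr(info[i], 5), ann, "[|]")
                    print $1 "\t" $2 "\t" ann[2] "\t" ann[3] "\t" ann[4]
                    break
                }
            }
        }
    '
}
```

Usage: `gzip -dc somatic.snvs.snpeff.vcf.gz | first_annotation`. Only the first annotation per variant is reported, so variants that affect several transcripts will show one of them.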
