Working with VCF files

Somatic Variant Calling

VarScan

A command line like this works well to generate a VCF file that contains putative somatic mutations:

java64 -Xmx4g -jar VarScan.jar somatic \
   <( samtools mpileup -C50 -BQ0 -d 1000000 -A -f FASTAFILE NORMALBAM ) \
   <( samtools mpileup -C50 -BQ0 -d 1000000 -A -f FASTAFILE TUMORBAM ) \
   PREFIX --output-vcf 1 --min-var-freq 0.05

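FASTAFILE, NORMALBAM, and TUMORBAM are the reference FASTA file, the normal-sample BAM file, and the tumor-sample BAM file, respectively. Replace PREFIX with the prefix for the .snp and .indel files that VarScan writes. Setting --min-var-freq to 0.05 lowers the minimum variant allele frequency required for a call, which increases sensitivity for somatic variants; the extra low-frequency calls may need to be filtered out later.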
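
The later filtering of low-frequency calls could be done in several ways. The following is only a minimal sketch with awk, not part of VarScan or seqtools. The function name filter_by_tumor_freq is hypothetical, and the sketch assumes that the FORMAT column (column 9) contains a FREQ key holding a percentage such as 12.5% and that the tumor sample is column 11; check both against your own files.

```shell
# Hypothetical helper: keep header lines plus records whose tumor FREQ
# is at least the given percentage.
# Assumptions: FORMAT (column 9) has a FREQ key; tumor sample is column 11.
filter_by_tumor_freq() {
    min_pct="$1"
    awk -F'\t' -v min="$min_pct" '
        /^#/ { print; next }                 # pass header lines through
        {
            n = split($9, keys, ":")         # FORMAT keys, e.g. GT:GQ:DP:FREQ
            split($11, vals, ":")            # matching tumor sample values
            for (i = 1; i <= n; i++) {
                if (keys[i] == "FREQ") {
                    freq = vals[i]
                    sub(/%$/, "", freq)      # "12.5%" -> "12.5"
                    if (freq + 0 >= min + 0) print
                }
            }
        }
    '
}
```

For example, filter_by_tumor_freq 10 < INPUTVCF > OUTPUTVCF keeps only calls at 10% or more in the tumor sample, where INPUTVCF and OUTPUTVCF are placeholders.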

After the VarScan VCF files are generated, they need to be “fixed” using the seqtools varscan fixVcf command-line script.

Annotating VCF Files

A few command lines for annotating VCF files with snpEff and snpSift are given below. In addition, run the GATK VariantAnnotator tool to add rich quality-control information to the VCF files.

snpEff

Download data

java -jar /data/CCRBioinfo/biowulf/local/snpEff_3_0/snpEff.jar download -c /data/CCRBioinfo/biowulf/local/snpEff_3_0/snpEff.config GRCh37.66

Run effect prediction

java -jar /data/CCRBioinfo/biowulf/local/snpEff_3_0/snpEff.jar eff -c /data/CCRBioinfo/biowulf/local/snpEff_3_0/snpEff.config GRCh37.66 INPUTVCF > OUTPUTFILE

snpSift

Annotate with dbSNP

java -Xmx8g -jar /data/CCRBioinfo/biowulf/local/SnpSift_latest.jar annotate /data/CCRBioinfo/public/GATK/bundle/1.5/hg19/dbsnp_135.hg19.excluding_sites_after_129.vcf tmp2.vcf > tmp2.dbsnp.vcf

Annotate with dbNSFP

# Download and uncompress the database (you need to do this only once):
# WARNING: The database is 3 GB when compressed and 30 GB uncompressed.
wget http://sourceforge.net/projects/snpeff/files/databases/dbNSFP2.0b3.txt.gz
gunzip dbNSFP2.0b3.txt.gz

java -jar SnpSift.jar dbnsfp /data/CCRBioinfo/public/snpEff/data/dbNSFP2.0b3.txt myFile.vcf > myFile.annotated.vcf

Filtering

The snpSift package allows very flexible filtering options. An example is given here:

java -Xmx8g -jar /data/CCRBioinfo/biowulf/local/SnpSift_latest.jar filter "(na ID) & (ID =~ 'COSM') & !( ID =~ 'rs')" -f INPUTVCF > OUTPUTVCF

Melting to Tab-delimited Text

VCF files are quite difficult to read and filter in a spreadsheet program such as Excel. The term “VCF melting” refers to taking a VCF file and pulling its various parts out into separate tab-delimited columns. See seqtools vcf melt for details of command-line operation or seqtools.vcf.vcfMelt for API usage.
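
To illustrate the idea of melting, here is a minimal awk sketch. It is not the seqtools implementation, and the function name melt_vcf is hypothetical. It writes the fixed VCF columns plus one column for each INFO key you request. It assumes a tab-delimited VCF with INFO in column 8, and it ignores the per-sample columns.

```shell
# Hypothetical helper, not the seqtools implementation: write the fixed VCF
# columns plus one column per requested INFO key as tab-delimited text.
melt_vcf() {
    keys="$1"   # comma-separated INFO keys, e.g. "DP,SS"
    awk -F'\t' -v OFS='\t' -v keys="$keys" '
        BEGIN {
            nk = split(keys, want, ",")
            header = "CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER"
            for (i = 1; i <= nk; i++) header = header "\t" want[i]
            print header
        }
        /^#/ { next }                        # skip meta and header lines
        {
            split("", info)                  # reset the per-record lookup
            n = split($8, fields, ";")       # INFO entries: KEY=VALUE or FLAG
            for (i = 1; i <= n; i++) {
                eq = index(fields[i], "=")
                if (eq > 0) info[substr(fields[i], 1, eq - 1)] = substr(fields[i], eq + 1)
                else        info[fields[i]] = "1"   # flag without a value
            }
            row = $1 OFS $2 OFS $3 OFS $4 OFS $5 OFS $6 OFS $7
            for (i = 1; i <= nk; i++)
                row = row OFS ((want[i] in info) ? info[want[i]] : ".")
            print row
        }
    '
}
```

For example, melt_vcf "DP,SS" < INPUTVCF > OUTPUT.txt produces a file that opens directly in a spreadsheet; INFO keys that are absent from a record are written as a single dot.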