Working with VCF files

Somatic Variant Calling


A command line like this works well to generate a VCF file that contains putative somatic mutations:

java64 -Xmx4g -jar VarScan.jar somatic \
   <( samtools mpileup -C50 -BQ0 -d 1000000 -A -f FASTAFILE NORMALBAM ) \
   <( samtools mpileup -C50 -d 1000000 -A -f FASTAFILE TUMORBAM ) \
   PREFIX --output-vcf 1 --min-var-freq 0.05

FASTAFILE, NORMALBAM, and TUMORBAM are the reference FASTA file and the normal and tumor BAM files, respectively. Replace PREFIX with the prefix that VarScan should use for its .snp and .indel output files. Setting min-var-freq to 0.05 increases sensitivity for low-frequency somatic variants; the extra calls this admits may need to be filtered out later.
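One way to do that later filtering is to apply a threshold to the tumor variant allele frequency. The sketch below is only an illustration and is not part of VarScan or seqtools: filter_tumor_freq is a hypothetical helper, and it assumes that the FORMAT column contains a FREQ key whose values are written as percentages (such as 12.5%) and that the tumor sample is the last column. Records without a FREQ key are dropped.

```shell
# Hypothetical helper: keep header lines plus records whose tumor FREQ
# is at least the given percentage (default 10).
# Assumes FREQ is a FORMAT key with values like "12.5%" and that the
# tumor sample is the last column of the VCF.
filter_tumor_freq() {
    awk -F'\t' -v min="${1:-10}" '
        /^#/ { print; next }                 # pass header lines through
        {
            n = split($9, keys, ":")         # FORMAT keys for this record
            split($NF, vals, ":")            # values for the last sample
            for (i = 1; i <= n; i++) {
                if (keys[i] == "FREQ") {
                    freq = vals[i]
                    sub(/%$/, "", freq)      # strip the trailing percent sign
                    if (freq + 0 >= min + 0) print
                    break
                }
            }
        }
    '
}
```

A call might look like this, where INPUTVCF and OUTPUTVCF are placeholders and the cutoff of 10 percent is an arbitrary example value:

```shell
filter_tumor_freq 10 < INPUTVCF > OUTPUTVCF
```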

After the VarScan VCF files have been generated, they need to be “fixed” using the seqtools varscan fixVcf command-line script.

Annotating VCF Files

A few command lines for annotating VCF files with SnpEff and SnpSift are given below. In addition, run the GATK VariantAnnotator tool to add rich quality-control information to the VCF files.


Download data

java -jar /data/CCRBioinfo/biowulf/local/snpEff_3_0/snpEff.jar download -c /data/CCRBioinfo/biowulf/local/snpEff_3_0/snpEff.config GRCh37.66

Run effect prediction

java -jar /data/CCRBioinfo/biowulf/local/snpEff_3_0/snpEff.jar eff -c /data/CCRBioinfo/biowulf/local/snpEff_3_0/snpEff.config GRCh37.66


Annotate with dbSNP

java -Xmx8g -jar /data/CCRBioinfo/biowulf/local/SnpSift_latest.jar annotate /data/CCRBioinfo/public/GATK/bundle/1.5/hg19/dbsnp_135.hg19.excluding_sites_after_129.vcf tmp2.vcf > tmp2.dbsnp.vcf

Annotate with dbNSFP

# Download and uncompress the database (you need to do this only once):
# WARNING: The database is 3 GB when compressed and 30 GB uncompressed.
gunzip dbNSFP2.0b3.txt.gz

java -jar SnpSift.jar dbnsfp /data/CCRBioinfo/public/snpEff/data/dbNSFP2.0b3.txt myFile.vcf > myFile.annotated.vcf


The SnpSift package supports very flexible filtering using boolean expressions over VCF fields. An example is given here:

java -Xmx8g -jar /data/CCRBioinfo/biowulf/local/SnpSift_latest.jar filter "(na ID) & (ID =~ 'COSM') & !( ID =~ 'rs')" -f
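
Before writing an expression on the ID field, it can help to see how that column is populated in a given file. The following is a rough sketch using plain awk rather than SnpSift; count_id_classes is a hypothetical helper. It assumes that a missing ID is written as a single dot and classifies each record by the first pattern that matches, so an ID containing both COSM and rs is counted as cosmic.

```shell
# Hypothetical helper: count VCF records by the kind of ID they carry.
# Each record is assigned to the first class whose test matches.
count_id_classes() {
    awk -F'\t' '
        /^#/         { next }                   # skip header lines
        $3 == "."    { n["none"]++;   next }    # no ID assigned
        $3 ~ /COSM/  { n["cosmic"]++; next }    # ID contains COSM
        $3 ~ /rs/    { n["dbsnp"]++;  next }    # ID contains rs
                     { n["other"]++ }           # anything else
        END {
            printf "none\t%d\ncosmic\t%d\ndbsnp\t%d\nother\t%d\n", n["none"], n["cosmic"], n["dbsnp"], n["other"]
        }
    '
}
```

For example, with INPUTVCF as a placeholder for an annotated file:

```shell
count_id_classes < INPUTVCF
```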

Melting to Tab-delimited Text

VCF files are quite difficult to read and filter in a tool like Excel. The term “VCF melting” refers to taking a VCF file and pulling its various parts out into separate tab-delimited columns. See seqtools vcf melt for details of command-line operation, or seqtools.vcf.vcfMelt for API usage.
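
To make the idea concrete, here is a minimal sketch of melting with awk. It is not the seqtools implementation, and melt_vcf is a hypothetical helper: it writes the seven fixed VCF columns followed by one column for each requested INFO key, and leaves per-sample FORMAT fields out entirely. Missing keys are written as a dot, and flag-type INFO entries (those without a value) are written as 1.

```shell
# Hypothetical helper: melt_vcf KEY1,KEY2,... < in.vcf > out.tsv
# Emits CHROM..FILTER plus one tab-delimited column per requested INFO key.
melt_vcf() {
    awk -F'\t' -v keys="$1" '
        BEGIN {
            nk = split(keys, want, ",")
            header = "CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER"
            for (i = 1; i <= nk; i++) header = header "\t" want[i]
            print header
        }
        /^#/ { next }                             # skip VCF header lines
        {
            split("", info)                       # clear the per-record INFO map
            m = split($8, parts, ";")
            for (i = 1; i <= m; i++) {
                eq = index(parts[i], "=")
                if (eq > 0) info[substr(parts[i], 1, eq - 1)] = substr(parts[i], eq + 1)
                else        info[parts[i]] = "1"  # flag entry with no value
            }
            row = $1 "\t" $2 "\t" $3 "\t" $4 "\t" $5 "\t" $6 "\t" $7
            for (i = 1; i <= nk; i++)
                row = row "\t" ((want[i] in info) ? info[want[i]] : ".")
            print row
        }
    '
}
```

The keys to extract depend on how the file was annotated. With INPUTVCF and OUTPUTTSV as placeholders and DP and SS as example keys, a call might look like:

```shell
melt_vcf DP,SS < INPUTVCF > OUTPUTTSV
```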