Working with VCF files
Somatic Variant Calling
VarScan
A command line like this works well to generate a VCF file that contains putative somatic mutations:
java64 -Xmx4g -jar VarScan.jar somatic \
<( samtools mpileup -C50 -BQ0 -d 1000000 -A -f FASTAFILE NORMALBAM ) \
<( samtools mpileup -C50 -d 1000000 -A -f FASTAFILE TUMORBAM ) \
PREFIX --output-vcf --min-var-freq 0.05
FASTAFILE, NORMALBAM, and TUMORBAM are the reference FASTA file and the normal and tumor BAM files, respectively. Replace PREFIX with the prefix for the .snp and .indel output files from VarScan. The min-var-freq setting of 0.05 lowers the minimum variant frequency required for a call to 5%, which raises sensitivity for somatic variants; the extra low-frequency calls may need to be filtered out later.
The VarScan VCF files then need to be “fixed” using the seqtools varscan fixVcf command-line script.
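The later filtering step mentioned above can be sketched in a few lines of Python. This is only an illustration, not part of seqtools or VarScan: it assumes that each record carries a per-sample FREQ value written as a percentage (for example 12.5%), that the sample columns are ordered NORMAL then TUMOR, and the function names and the 0.10 cutoff are hypothetical.

```python
def parse_freq(value):
    """Convert a percentage string such as '12.5%' to a fraction (0.125)."""
    return float(value.rstrip("%")) / 100.0


def tumor_freq(line, sample_index=1):
    """Return the FREQ value of one sample column of a VCF data line.

    sample_index=1 assumes the sample columns are ordered NORMAL, TUMOR.
    """
    fields = line.rstrip("\n").split("\t")
    format_keys = fields[8].split(":")
    sample_values = fields[9 + sample_index].split(":")
    sample = dict(zip(format_keys, sample_values))
    return parse_freq(sample["FREQ"])


def filter_by_tumor_freq(lines, min_freq=0.10):
    """Yield header lines unchanged and data lines with tumor FREQ >= min_freq."""
    for line in lines:
        if line.startswith("#") or tumor_freq(line) >= min_freq:
            yield line


# Made-up records for illustration only.
example = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNORMAL\tTUMOR\n",
    "1\t100\t.\tA\tG\t.\tPASS\tDP=60;SS=2\tGT:DP:FREQ\t0/0:30:0%\t0/1:30:6.67%\n",
    "1\t200\t.\tC\tT\t.\tPASS\tDP=80;SS=2\tGT:DP:FREQ\t0/0:40:0%\t0/1:40:25%\n",
]
kept = list(filter_by_tumor_freq(example, min_freq=0.10))
```

With the made-up records above, the variant at position 100 falls below the cutoff and is dropped, while the header lines and the variant at position 200 are kept.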
Annotating VCF Files
A few command lines for annotating VCF files using snpEff and snpSift are given below. In addition, run the GATK VariantAnnotator tool to add rich quality control information to the VCF files.
snpEff
Download data
java -jar /data/CCRBioinfo/biowulf/local/snpEff_3_0/snpEff.jar download -c /data/CCRBioinfo/biowulf/local/snpEff_3_0/snpEff.config GRCh37.66
Run effect prediction
java -jar /data/CCRBioinfo/biowulf/local/snpEff_3_0/snpEff.jar eff -c /data/CCRBioinfo/biowulf/local/snpEff_3_0/snpEff.config GRCh37.66
snpSift
Annotate with dbSNP
java -Xmx8g -jar /data/CCRBioinfo/biowulf/local/SnpSift_latest.jar annotate /data/CCRBioinfo/public/GATK/bundle/1.5/hg19/dbsnp_135.hg19.excluding_sites_after_129.vcf tmp2.vcf > tmp2.dbsnp.vcf
Annotate with dbNSFP
# Download and uncompress the database (you only need to do this once):
# WARNING: The database is 3 GB when compressed and 30 GB uncompressed.
wget http://sourceforge.net/projects/snpeff/files/databases/dbNSFP2.0b3.txt.gz
gunzip dbNSFP2.0b3.txt.gz
java -jar SnpSift.jar dbnsfp /data/CCRBioinfo/public/snpEff/data/dbNSFP2.0b3.txt myFile.vcf > myFile.annotated.vcf
Filtering
The snpSift package allows very flexible filtering options. An example is given here:
java -Xmx8g -jar /data/CCRBioinfo/biowulf/local/SnpSift_latest.jar filter "(na ID) & (ID =~ 'COSM') & ! ( ID =~ 'rs')" -f
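To make the intent of the two pattern terms in that expression concrete, here is a rough Python equivalent. It is a sketch under assumptions, not how snpSift evaluates filters: it covers only the ID =~ 'COSM' and ID =~ 'rs' terms, treats =~ as a regular-expression search on the ID column, and the function name keep_record is hypothetical.

```python
import re


def keep_record(line):
    """Keep a VCF data line whose ID mentions COSM but contains no rs identifier."""
    vcf_id = line.split("\t")[2]  # ID is the third VCF column
    has_cosmic = re.search("COSM", vcf_id) is not None
    has_dbsnp = re.search("rs", vcf_id) is not None
    return has_cosmic and not has_dbsnp


# Made-up records for illustration only.
records = [
    "1\t100\tCOSM1234\tA\tG\t.\tPASS\tDP=60\n",
    "1\t200\trs99;COSM5678\tC\tT\t.\tPASS\tDP=80\n",
    "1\t300\t.\tG\tA\t.\tPASS\tDP=40\n",
]
cosmic_only = [line for line in records if keep_record(line)]
```

Only the first made-up record passes: the second also carries an rs identifier and the third has no ID at all.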
Melting to Tab-delimited Text
VCF files are quite difficult to read and filter in a tool like Excel. The term “VCF melting” refers to taking a VCF file and pulling out its various parts into separate tab-delimited columns. See seqtools vcf melt for details of command-line operation or seqtools.vcf.vcfMelt for API usage.
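The idea of melting can be illustrated with a small, self-contained Python sketch. This is not the seqtools implementation and does not reproduce its output layout: the function names, the INFO. and SAMPLE. column prefixes, and the handling of flag entries are all assumptions made for the example, and the input is assumed to include the #CHROM header line.

```python
FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER"]


def parse_info(info):
    """Split an INFO string into a dict; flag entries without '=' map to 'True'."""
    result = {}
    if info == ".":
        return result
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        result[key] = value if sep else "True"
    return result


def melt(lines):
    """Return (header, rows) with INFO keys and per-sample FORMAT keys as columns."""
    samples, info_keys, format_keys, records = [], [], [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            continue
        info = parse_info(fields[7])
        keys = fields[8].split(":") if len(fields) > 8 else []
        per_sample = [dict(zip(keys, col.split(":"))) for col in fields[9:]]
        # Remember every key in first-seen order so all rows share one header.
        info_keys += [k for k in info if k not in info_keys]
        format_keys += [k for k in keys if k not in format_keys]
        records.append((fields[:7], info, per_sample))
    header = (FIXED_COLUMNS
              + ["INFO." + k for k in info_keys]
              + [s + "." + k for s in samples for k in format_keys])
    rows = []
    for fixed, info, per_sample in records:
        row = list(fixed) + [info.get(k, "") for k in info_keys]
        for sample in per_sample:
            row += [sample.get(k, "") for k in format_keys]
        rows.append(row)
    return header, rows


# Made-up record for illustration only.
example = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNORMAL\tTUMOR\n",
    "1\t100\tCOSM1\tA\tG\t.\tPASS\tDP=60;SS=2\tGT:FREQ\t0/0:0%\t0/1:25%\n",
]
header, rows = melt(example)
for row in [header] + rows:
    print("\t".join(row))
```

Each INFO key and each sample/FORMAT pair ends up in its own column, so the result can be sorted and filtered directly in a spreadsheet.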