VCFx
Version 2.0b
Authors: Erick C. Castelli and Celso T. Mendes-Junior
www.castelli-lab.net/apps/vcfx
Please report bugs to erick.castelli@unesp.br


*****************************************************************************************
VCFX OVERVIEW 
*****************************************************************************************

	VCFx comprises a series of tools to deal with VCF (variant call format) files.

	You may create complete sequences in Fasta format (keeping phase information).
	You can treat uncertain genotypes and refine variants.
	You can convert the VCF file into other formats, including Arlequin, Genepop, and
	Haploview (PED).
	
	This tool was written in C++. The source code and precompiled binaries are available
	to download at www.castelli-lab.net/apps/vcfx
	VCFx is compatible with UNIX-based operating systems (e.g., macOS, Linux).
	
	THIS WORK IS LICENSED UNDER A CREATIVE COMMONS ATTRIBUTION-NONCOMMERCIAL-NODERIVATIVES
4.0 INTERNATIONAL LICENSE. THIS SOFTWARE IS PROVIDED "AS IS" AND THERE ARE NO EXPRESS OR 
IMPLIED WARRANTIES. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, 
INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, 
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE OR DATA). USE IT AT YOUR OWN RISK. 
	

*****************************************************************************************
CURRENT VCFX COMMANDS (OR TOOLS) 
*****************************************************************************************

  Checking Tools                                                                
     checkpl     introduce missing alleles on genotypes with low likelihood     
     checkad     introduce missing alleles on unbalanced genotypes              
     statistics  print some VCF statistics                                      
     hfilter     hard-filtering variants                                        
     evidence    annotate variants with some quality control parameters         
     filter      keep only variants annotated with specific filter tags         
                                                                                
  Conversion Tools
     fasta       create complete sequences from phased VCF                      
     transcript  create complete sequences from phased VCF using a BED file     
     haploview   recode VCF to PED for haploview (unphased)                     
     arlequin    recode VCF to ARP for Arlequin                                 
     genepop     recode VCF to GenePop format     


*****************************************************************************************
HOW TO INSTALL VCFX 
*****************************************************************************************

	The source code and precompiled binaries are available at
	www.castelli-lab.net/apps/vcfx
	
	VCFx depends on the C++ Boost library, which must be installed before
	compiling or using VCFx.
	- For macOS, we recommend installing Boost with Homebrew (brew install boost).
	- For Debian/Ubuntu Linux, use 'sudo apt-get install libboost-all-dev'
	
	VCFx is compiled with CMake. If CMake is not installed:
	- For macOS, you can use "brew install cmake"
	- For Debian/Ubuntu Linux, you can use "sudo apt-get install cmake"
	
	Decompress the .zip file, then run:
	cd [decompressed_folder/source]
	mkdir build
	cd build
	cmake ..
	make
	
	The vcfx executable will be placed in the build folder.
	Optionally, move it to /usr/local/bin.
	
	This procedure works on most systems.

*****************************************************************************************
HOW TO CITE VCFx 
*****************************************************************************************

	This software was first introduced in
	
	"Castelli et al. HLA-E coding and 3' untranslated region variability determined by 
next-generation sequencing in two West-African population samples"
Hum Immunol. 2015 Dec;76(12):945-53.
doi: 10.1016/j.humimm.2015.06.016

	and further described in
	
	"HLA-F coding and regulatory segments variability determined by massively parallel 
sequencing procedures in a Brazilian population sample", Hum Immunol. 2016 Oct;77(10):841-53.
doi: 10.1016/j.humimm.2016.07.231

	Please cite both articles.
	
 
 
*****************************************************************************************
CHECKPL
*****************************************************************************************
	
	This algorithm introduces missing alleles in uncertain genotypes. It uses the PL field
to check the confidence of a given genotype. If the genotype is not accepted, the uncertain
allele is replaced by a missing allele. This software does not impute missing alleles.
	The probability (P) that a given genotype is correct is calculated as follows:
	
	ratio = 1 / (10^(((second lowest PL value) / 10) * -1))
	P = ratio / (ratio + 1)
		
	P is then compared with a threshold. If P is lower than the given threshold, a
missing allele is introduced in place of the uncertain allele, while the most likely
allele is preserved.
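
	The calculation above can be illustrated with a short Python sketch. It is not
the VCFx source code; the function names and the threshold value used in the example
are assumptions made only for illustration.

```python
def genotype_confidence(pl_values):
    """Return P, computed from the second lowest PL value of a genotype."""
    if len(pl_values) < 2:
        raise ValueError("at least two PL values are needed")
    second_lowest = sorted(pl_values)[1]
    # Same as ratio = 1 / (10^((PL / 10) * -1)), written without the reciprocal.
    ratio = 10 ** (second_lowest / 10)
    return ratio / (ratio + 1)


def is_uncertain(pl_values, threshold):
    """True when P falls below the threshold (threshold value chosen by the caller)."""
    return genotype_confidence(pl_values) < threshold


if __name__ == "__main__":
    # PL = 0,20,200: the second lowest value is 20, so ratio = 100 and P = 100/101.
    print(round(genotype_confidence([0, 20, 200]), 4))
    # 0.99 is a hypothetical threshold, not necessarily the VCFx default.
    print(is_uncertain([0, 10, 100], threshold=0.99))
```

	With a second lowest PL of 20, ratio is 100 and P is about 0.990. With a second
lowest PL of 10, P is about 0.909, so the genotype would be flagged under a 0.99
threshold.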
	
	The 'checkpl' command presents several options. You may set each of them by using the
option identifier followed by '=' and the desired parameter, as described below. Do not use
blank spaces before or after '=', or file/directory names with blank spaces.

	Mandatory options: input
			
	A common usage: vcfx checkpl input=test.vcf
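
	When VCFx is called from a script, the 'option=value' rule above can be checked
before the command is launched. The sketch below is only an illustration: the helper
name build_vcfx_command is hypothetical and is not part of VCFx.

```python
def build_vcfx_command(tool, **options):
    """Build an argument list such as ['vcfx', 'checkpl', 'input=test.vcf'].

    Each option is written as key=value with no blank spaces around '=',
    and values containing blank spaces are rejected.
    """
    args = ["vcfx", tool]
    for key, value in options.items():
        text = str(value)
        if any(ch.isspace() for ch in text):
            raise ValueError(f"option {key!r} must not contain blank spaces: {text!r}")
        args.append(f"{key}={text}")
    return args


if __name__ == "__main__":
    print(" ".join(build_vcfx_command("checkpl", input="test.vcf")))
```

	The returned list can be passed to subprocess.run(..., check=True) once the vcfx
executable is available on the PATH.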
	
	Warning: a log file is also created, recording any modifications made.
	

*****************************************************************************************
CHECKAD
*****************************************************************************************
	
	This algorithm deals with uncertain genotypes from VCF files, mainly genotypes called
at low-coverage segments and heterozygous genotypes in which one allele is extremely 
underrepresented. The VCF file must contain the GT (genotype) and AD (Depth Per Allele
By Sample) fields. This software does not impute missing alleles.

	The 'checkad' command presents several options. You may set each of them by using the
option identifier followed by '=' and the desired parameter, as described below. Do not use
blank spaces before or after '=', or file/directory names with blank spaces.

	Mandatory options: input
	
	
	Other options (only those that are not self-explanatory are listed):
	
	alpha=number
			note: minimum number of reads required to accept a homozygous genotype;
			      otherwise, an 'unknown' allele will be introduced. Default: 8 reads.
			      example: alpha=10
	
	delta=float
			note: on heterozygous genotypes in which the underrepresented allele has a
			      proportion above BETA but lower than this value, an 'unknown' allele
			      will be introduced, replacing the underrepresented allele. Default: 0.20
			      example: delta=0.3
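
	The alpha and delta rules can be sketched in Python as shown below. This is an
illustration under stated assumptions, not the VCFx implementation: it assumes that
alpha is compared with the reads supporting the homozygous allele, that the second
allele is the one replaced in that case, and it ignores BETA, which is not described
in this section. The function name check_ad is hypothetical.

```python
def check_ad(genotype, allele_depths, alpha=8, delta=0.20):
    """Return the genotype with '.' in place of an allele judged uncertain.

    genotype      -- pair of allele indices from GT, e.g. (0, 1)
    allele_depths -- read counts per allele from AD, e.g. [18, 2]
    """
    first, second = genotype
    if first == second:
        # Assumption: homozygosity needs at least alpha reads for that allele.
        if allele_depths[first] < alpha:
            return (first, ".")
        return genotype
    total = allele_depths[first] + allele_depths[second]
    if total == 0:
        return genotype  # no reads: left unchanged in this sketch
    # Find which side of the genotype carries the underrepresented allele.
    minor_position = 0 if allele_depths[first] <= allele_depths[second] else 1
    minor_depth = allele_depths[genotype[minor_position]]
    if minor_depth / total < delta:
        updated = list(genotype)
        updated[minor_position] = "."
        return tuple(updated)
    return genotype


if __name__ == "__main__":
    print(check_ad((0, 0), [5, 0]))   # too few reads to accept homozygosity
    print(check_ad((0, 1), [18, 2]))  # allele 1 supported by 10% of the reads
```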

	
	A common usage: vcfx checkad input=test.vcf
	
	Warning: a log file is also created, recording any modifications made.
	



*****************************************************************************************
HFILTER
*****************************************************************************************
	
	This algorithm applies hard filters, selecting variants based on some statistics.

	Mandatory options: input
	
	A common usage: vcfx hfilter input=test.vcf
	
	Warning: a log file is also created, recording the variants that failed to pass the
hard filters.
	
	
*****************************************************************************************
EVIDENCE
*****************************************************************************************
	
	This algorithm calculates a series of quality-control parameters for each variant, 
annotating them in the INFO field.

	Mandatory options: input
	
	A common usage: vcfx evidence input=test.vcf
	

*****************************************************************************************
FILTER
*****************************************************************************************
	
	This algorithm selects variants annotated with specific tags under the FILTER field.

	Mandatory options: input
	
	A common usage: vcfx filter input=test.vcf
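
	The idea of selecting variants by FILTER tag can be sketched in Python as below.
This is an illustration only: the function name, the default tag "PASS", and the rule
of keeping a variant when any of its tags matches are assumptions, not documented VCFx
behavior.

```python
def filter_vcf_lines(lines, wanted_tags=("PASS",)):
    """Keep header lines and the variants whose FILTER column has a wanted tag."""
    wanted = set(wanted_tags)
    kept = []
    for line in lines:
        if line.startswith("#"):
            kept.append(line)  # meta-information and column header lines
            continue
        fields = line.rstrip("\n").split("\t")
        tags = set(fields[6].split(";"))  # FILTER is the 7th VCF column
        if tags & wanted:
            kept.append(line)
    return kept


if __name__ == "__main__":
    example = [
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
        "6\t100\t.\tA\tG\t50\tPASS\t.\n",
        "6\t200\t.\tC\tT\t10\tLowQual\t.\n",
    ]
    print("".join(filter_vcf_lines(example)), end="")
```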



*****************************************************************************************
FASTA 
*****************************************************************************************

    This command creates a fasta file from phased genotypes. You need to provide a phased
VCF file as input, in which '|' indicates a phased genotype. VCFx will extract a reference
sequence from the fasta file you have indicated as the reference, and it will replace each
variable site at the correct position. Two sequences will be generated for each sample
(named _h1 and _h2). 
	Please note that this algorithm ignores any variable site that uses '/' to separate
alleles. 
    The 'fasta' command presents several options. You may set each of them using the option
identifier followed by '=' and the desired parameter, as described below. Do not use blank
spaces before or after '=', or file/directory names with blank spaces.
	If the "start" or "end" parameters are not set (see below), the algorithm will
export the sequence between the first and last variable sites in the VCF file.


Mandatory options:
	
	input=the_phased_vcf_file
		           	
	output=the_fasta_file_to_be_created
			note: the extension .fas will be added automatically
			
	reference=the_reference_in_fasta_format
			note: you need to indicate a fasta file containing the reference to be used.
			      For example, you need to indicate chromosome 6 sequence if you are
			      dealing with a VCF file containing chromosome 6 variable sites.
			      Please note that the VCF file and the reference must use the same
			      genome version.
			      In addition, this reference must be in fasta format and contain just one
			      sequence, i.e., only the sequence of the chromosome you are dealing
			      with. The reference sequence may be formatted as a multi-line sequence or
			      as a single-line sequence.
			      
			
Other options:
	
	chr=chromosome_designation
			note: if set, only variable sites associated with this chromosome identifier 
			      will be considered. You may use 'chr=6' or 'chr=chr6', depending on
			      your VCF structure.
	
	start=position
			note: if set, only the variable sites starting from this position will be
			      considered. An example would be 'start=2900000'
	
	end=position
			note: if set, only the variable sites up to this position will be
			      considered. An example would be 'end=3000000'

	--quiet
			note: do not output any messages or warnings.
			
	
	A common usage is like this:
	vcfx fasta input=test.vcf output=teste start=2000 end=4000 reference=chr1.fas
	
	The output will be in fasta format, as in the example below. Please note that there will
be two sequences for each sample: _h1 (carrying the alleles on the left side of each genotype)
and _h2 (carrying the alleles on the right side of each genotype).

	>Samples0001_h1
	ATCGACCGCATTTTGACAGCATA
	>Samples0001_h2
	GTGGACCGCATTATGACAGCGGATA
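
	How phased genotypes become two sequences can be sketched in Python as below. This
is a simplified illustration, not the VCFx implementation. The function names are
hypothetical, and two choices are assumptions of this sketch: a missing allele ('.')
is written as 'N', and a variant overlapping a previously applied one is skipped.

```python
def build_haplotypes(reference, start, variants):
    """Apply phased variants to a reference segment and return (h1, h2).

    reference -- reference bases of the exported segment
    start     -- 1-based coordinate of reference[0]
    variants  -- list of (pos, ref, alts, gt), e.g. (101, "C", ["T"], "0|1")
    """
    pieces = [[], []]
    cursors = [start, start]  # next reference coordinate to copy, per haplotype
    for pos, ref, alts, gt in sorted(variants, key=lambda v: v[0]):
        if "|" not in gt:
            continue  # sites using '/' are ignored
        alleles = [ref] + list(alts)
        for h, index in enumerate(gt.split("|")[:2]):
            if pos < cursors[h]:
                continue  # overlaps a previous variant: skipped in this sketch
            pieces[h].append(reference[cursors[h] - start:pos - start])
            pieces[h].append("N" if index == "." else alleles[int(index)])
            cursors[h] = pos + len(ref)
    return tuple(
        "".join(pieces[h]) + reference[cursors[h] - start:] for h in range(2)
    )


def format_fasta(sample, h1, h2):
    """Write the two sequences of one sample with the _h1 and _h2 suffixes."""
    return f">{sample}_h1\n{h1}\n>{sample}_h2\n{h2}\n"


if __name__ == "__main__":
    variants = [
        (101, "C", ["T"], "0|1"),    # left allele goes to _h1, right to _h2
        (104, "A", ["AGG"], "1|0"),  # insertion carried by _h1 only
        (106, "G", ["C"], "0/1"),    # unphased, ignored
    ]
    h1, h2 = build_haplotypes("ACGTACGT", 100, variants)
    print(format_fasta("Samples0001", h1, h2), end="")
```

	Because indels change the sequence length, _h1 and _h2 of the same sample may have
different lengths, as in the example output above.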


*****************************************************************************************
TRANSCRIPT
*****************************************************************************************

    Like FASTA, this algorithm creates a fasta file from phased genotypes.
However, the intervals must be provided as a BED file. 
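
	BED files store intervals with a 0-based start and an exclusive end, whereas VCF
positions are 1-based. The Python sketch below converts BED lines into 1-based inclusive
intervals. It is only an illustration: the function name is hypothetical, and how VCFx
combines the intervals is not described here.

```python
def parse_bed(text):
    """Return a list of (chrom, start, end) tuples, 1-based and inclusive."""
    intervals = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue  # skip blank lines, comments, and header lines
        fields = line.split()
        chrom, start0, end0 = fields[0], int(fields[1]), int(fields[2])
        # BED start is 0-based; BED end is exclusive, so it equals the
        # 1-based inclusive end.
        intervals.append((chrom, start0 + 1, end0))
    return intervals


if __name__ == "__main__":
    bed = "chr1\t999\t1200\nchr1\t1499\t1600\n"
    for interval in parse_bed(bed):
        print(interval)
```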

	A common usage is like this:
	vcfx transcript input=test.vcf output=teste bed=gene.bed reference=chr1.fas
	
	The output will be in fasta format, as in the example below. Please note that there will
be two sequences for each sample: _h1 (carrying the alleles on the left side of each genotype)
and _h2 (carrying the alleles on the right side of each genotype).

	>Samples0001_h1
	ATCGACCGCATTTTGACAGCATA
	>Samples0001_h2
	GTGGACCGCATTATGACAGCGGATA


	 
*****************************************************************************************
ARLEQUIN
*****************************************************************************************

    This command creates an input file for the Arlequin program.
	To use this function, you need to indicate a VCF file as input and the name of the
output file. Alleles are coded in the output as follows:
	0 = reference allele
	? = unknown or missing allele
	1 = first alternative
	2 = second alternative
	and so on... 
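
	The coding above keeps the VCF allele indices and only changes the symbol used
for missing alleles. A minimal Python sketch of this recoding is shown below. The
function name is hypothetical and the sketch is not the VCFx implementation.

```python
import re


def recode_genotype(gt):
    """Split a VCF genotype and replace missing alleles ('.') with '?'."""
    return ["?" if allele == "." else allele for allele in re.split(r"[|/]", gt)]


if __name__ == "__main__":
    for gt in ("0|1", "./2", "1/1"):
        print(gt, recode_genotype(gt))
```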

    The 'arlequin' command presents several options. You may set each of them by using the
option identifier followed by '=' and the desired parameter, as described below. Do not use
blank spaces before or after '=', or file/directory names with blank spaces.

Mandatory options:
	
	input=the_input_vcf_file
	
	output=the_arp_file_to_be_created
			note: the extension .arp will be added automatically
			
		
Other options:
	
	chr=chromosome_designation
			note: if set, only variable sites associated with this chromosome identifier 
			      will be considered. You may use 'chr=6' or 'chr=chr6', depending on
			      your VCF structure.
	
	start=position
			note: if set, only the variable sites starting from this position will be
			      considered. An example would be 'start=2900000'
	
	end=position
			note: if set, only the variable sites up to this position will be
			      considered. An example would be 'end=3000000'
			      
	pop=the_population_file
			note: You may indicate a text file (.txt) listing the name of each sample and
			      its group/population, one sample per line, such as this:
			      Sample001,Group_A
			      Sample002,Group_A
			      Sample003,Group_B

	--quiet
			note: do not output any messages or warnings.
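
	The population file described for the 'pop' option can be checked before running
VCFx. The Python sketch below groups samples by population and rejects malformed
lines. The function name is hypothetical, and the strictness of the check is an
assumption of this sketch, not documented VCFx behavior.

```python
from collections import defaultdict


def parse_population_file(text):
    """Return {group: [samples]} from 'sample,group' lines, one sample per line."""
    groups = defaultdict(list)
    for number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue
        parts = [part.strip() for part in line.split(",")]
        if len(parts) != 2 or not all(parts):
            raise ValueError(f"line {number}: expected 'sample,group', got {line!r}")
        sample, group = parts
        groups[group].append(sample)
    return dict(groups)


if __name__ == "__main__":
    text = "Sample001,Group_A\nSample002,Group_A\nSample003,Group_B\n"
    print(parse_population_file(text))
```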
			
	
	A common usage for this command is as follows:
	vcfx arlequin input=test.vcf output=teste
	
    WARNING: VCFx outputs genotypes in the same order as they appear in the input
             VCF file. Thus, if you used a phased VCF file as input, with '|' indicating
             the phase (such as the 1000 Genomes VCF files), you may change the output line 
             'GameticPhase = 0' to 'GameticPhase = 1' to indicate that your data is already
             phased.
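
	The change described in the warning can be made in any text editor. For scripted
use, a minimal Python sketch is shown below. The function name is hypothetical, and
the sketch assumes the line appears in the output as 'GameticPhase = 0', possibly with
different spacing around '='.

```python
import re


def mark_as_phased(arp_text):
    """Replace 'GameticPhase = 0' with 'GameticPhase = 1' in .arp file content."""
    return re.sub(r"GameticPhase\s*=\s*0", "GameticPhase = 1", arp_text)


if __name__ == "__main__":
    print(mark_as_phased("\tGameticPhase = 0\n"), end="")
```

	To apply it to a file, read the .arp content into a string, call mark_as_phased,
and write the result back.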

	
	Arlequin 3.5 reference:
	
	Arlequin suite ver 3.5: a new series of programs to perform population genetics
analyses under Linux and Windows. Excoffier L, Lischer HE.
Mol Ecol Resour. 2010 May;10(3):564-7. doi: 10.1111/j.1755-0998.2010.02847.x.



*****************************************************************************************
GENEPOP
*****************************************************************************************

    Like ARLEQUIN, this command creates an input file for Genepop.

	Mandatory options: input

	
*****************************************************************************************
KNOWN ISSUES
*****************************************************************************************

The FASTA algorithm may crash when exporting short sequences that contain large indels.
