PHASE using GATK ReadBackedPhasing data
When GATK ReadBackedPhasing is used to infer haplotypes from NGS data, some variable sites are phased straightforwardly, but others are not. The sites that remain unphased include indels, multi-allelic loci, and variable sites with no other heterozygous site nearby.
To get haplotypes for all variable sites, this script uses the phased data from a VCF file generated by the GATK ReadBackedPhasing algorithm to create a .known file, which is then used with the PHASE algorithm to fill in the blanks. This .known file is usually "fragmented" because GATK ReadBackedPhasing may phase some groups of variable sites without reporting the association (the phase) between these groups.
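To make the idea of "fragments" concrete, here is a minimal Python sketch that groups heterozygous sites by the phase group they belong to. It is not the script's own code: the function name `group_into_fragments`, the tuple layout, and the `set1`/`set2` labels are all hypothetical, and the input is a simplified stand-in for what would be parsed from the phased VCF, not the VCF or .known format itself.

```python
from collections import defaultdict


def group_into_fragments(sites):
    """Group heterozygous sites by phase-group label (hypothetical layout).

    `sites` is a list of (position, allele_a, allele_b, phase_group) tuples.
    `phase_group` is None for a site that was left unphased.
    Returns (fragments, unphased_positions).
    """
    fragments = defaultdict(list)
    unphased_positions = []
    for position, allele_a, allele_b, phase_group in sites:
        if phase_group is None:
            # No read-backed phase information for this site
            unphased_positions.append(position)
        else:
            # Sites sharing a label are phased relative to each other
            fragments[phase_group].append((position, allele_a, allele_b))
    return dict(fragments), unphased_positions


# Illustrative data only: two phased groups and one unphased indel
example_sites = [
    (100, "A", "G", "set1"),
    (150, "C", "T", "set1"),
    (900, "G", "GA", None),
    (2000, "T", "C", "set2"),
    (2040, "A", "C", "set2"),
]
fragments, unphased_positions = group_into_fragments(example_sites)
```

In this example the phase is known within `set1` and within `set2`, but not between them, and the site at position 900 is not phased at all. These are the blanks that PHASE is asked to fill in.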
Then, the script runs the PHASE algorithm taking each of these fragments into account and compares the PHASE results from multiple runs.
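The text above does not specify how the runs are compared. As an illustration only, the sketch below assumes one plausible approach: for each sample, take the haplotype pair reported most often across runs and record the fraction of runs that agree with it. The function name `consensus_haplotypes`, the dictionary layout, and the sample data are hypothetical.

```python
from collections import Counter


def consensus_haplotypes(runs):
    """Summarize haplotype pairs across runs (assumed majority-agreement rule).

    `runs` is a list of dicts mapping sample name -> (haplotype_1, haplotype_2).
    Returns a dict mapping sample name -> (most_common_pair, support_fraction).
    """
    summary = {}
    for sample in runs[0]:
        # Sort each pair so that the order of the two haplotypes is ignored
        pairs = [tuple(sorted(run[sample])) for run in runs]
        pair, count = Counter(pairs).most_common(1)[0]
        summary[sample] = (pair, count / len(runs))
    return summary


# Illustrative data only: three runs, two samples
runs = [
    {"s1": ("ACG", "GTA"), "s2": ("ACG", "ACG")},
    {"s1": ("GTA", "ACG"), "s2": ("ACG", "ACG")},
    {"s1": ("ACA", "GTG"), "s2": ("ACG", "ACG")},
]
summary = consensus_haplotypes(runs)
```

Here sample `s1` has the same pair in two of three runs, while `s2` is identical in all three. A low support fraction would flag a sample whose haplotypes are unstable across runs.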
This methodology was described and used in the following manuscripts. If you use this script, please cite both.
Castelli EC et al. HLA-G variability and haplotypes detected by massively parallel sequencing procedures in the geographicaly distinct population samples of Brazil and Cyprus. Mol Immunol. 2017 Mar;83:115-126. doi: 10.1016/j.molimm.2017.01.020.
Lima TH et al. HLA-F coding and regulatory segments variability determined by massively parallel sequencing procedures in a Brazilian population sample. Hum Immunol. 2016 Oct;77(10):841-53. doi: 10.1016/j.humimm.2016.07.231.
How to install:
Please visit the official page on GitHub: https://github.com/erickcastelli/phase-readbackedphasing