Cluster based methods such as multidimensional scaling (MDS) and priniciple component analysis (PCA) are traditionally used in identifying samples with genotypic inconsistencies, however, it is important to identify genotypes with high mendelian inconsistencies prior to any genetic or statistical analysis. In this tutorial, I would like to share how one can quickly check samples with high mendelian inconsistencies using genotypic data and pedigree information in bcftools.

Installing bcftools

bcftools is only available for Linux operating system. For detailed information on how to install it please consult user’s guide and further documentation at below link: https://samtools.github.io/bcftools/howtos/install.html


Input data

Two files are required as:

  • Genotype file in VCF format
  • Pedigree file in text (.txt) version

Pedigree file format

The pedigree file should be formatted as shown below:

mother,father,child1
mother,father,child2
mother,father,child3
mother,father,child4
mother,father,child5
mother,father,child6
mother,father,child7
mother,father,child8
mother,father,child9
mother,father,child10
mother,father,child11
mother,father,child12
mother,father,child13
mother,father,child14
mother,father,child15
mother,father,child16

Important! Make sure the parent names are sperated by comma (,) without any spaces in the text file.

running the bcftools

Use below sample command line to run the program:

$ bcftools +mendelian file.vcf -T pedigree.txt -c > output  

Output

The out file contains the following headers:

# [1]nOK  [2]nBad  [3]nSkipped  [4]Trio
334518    4621     996          mother,father,child

Where,

  • nOK .. number of genotypes with no missing sites and no Mendelian error
  • nBad .. number of genotypes with a Mendelian error
  • nSkipped .. number of genotypes at which at least one individual missing and therefore could not be considered
  • 4

Plotting output

The output file can be opened directly in MS Excel and can be interpreted by plotting it.

PLot output

In the above plot, it can be noted that there are four samples (in red) that have the highest mendelian error in comparison to the rest of samples, which should be excluded from any further analysis.


--- End of Tutorial ---

Thank you for reading this tutorial. If you have any questions or comments, please let me know in the comment section below or send me an email.

Bibliography

*Li H., Handsaker B., Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and 1000 Genome Project Data Processing Subgroup (2009) The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics, 25, 2078-9. [PMID: 19505943]