1. What are SNPs?
SNPs are Single Nucleotide Polymorphisms: positions within the DNA at which a single nucleotide differs between individuals or between chromosomes. In diploid organisms, the two alleles inherited from the parents are often largely identical, but at some positions the individual nucleotides differ. These variable positions are called SNPs.
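As a minimal illustration of this definition, the sketch below compares two sequences that are assumed to be already aligned and of equal length, and reports the positions where they differ. The function name `find_snps` and the example sequences are made up for illustration; real data would first need an alignment step.

```python
def find_snps(seq_a, seq_b):
    """Return (position, base_a, base_b) for each mismatch between two aligned sequences.

    Positions are 0-based. Both sequences are assumed to be aligned already,
    so they must have the same length.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    snps = []
    for pos, (a, b) in enumerate(zip(seq_a.upper(), seq_b.upper())):
        # Only count mismatches between unambiguous bases; gaps ("-") and
        # ambiguity codes such as "N" are skipped.
        if a != b and a in "ACGT" and b in "ACGT":
            snps.append((pos, a, b))
    return snps


# Two hypothetical alleles of the same locus.
allele_1 = "ATGGCTAGCTTAGGA"
allele_2 = "ATGGCTCGCTTAGAA"
print(find_snps(allele_1, allele_2))  # [(6, 'A', 'C'), (13, 'G', 'A')]
```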
2. How can SNPs be used?
SNPs can be used to:
- Reveal genetic similarity between accessions
- Identify F1 hybrid genotypes
- Construct a genetic map
- Perform association mapping
- Trace the origin of introgressions
However, to be useful for any of these purposes, the SNPs need to be reliable.
3. How can reliable SNPs be detected?
To detect SNPs, DNA sequences of different origins need to be aligned. Depending on the type of DNA sequence (genomic or transcript) and the availability of a reference genome, this can be more or less challenging. The two main challenges in detecting reliable SNPs in next-generation sequencing (NGS) data are to:
- Distinguish sequencing artefacts from true genetic variation.
- Distinguish paralogous sequences (multi-copy genes) from orthologous sequences (single-copy genes).
3.1 Sequencing artefacts vs true SNPs
The number and nature of sequencing artefacts depend on the technology used and the quality of the library preparation. A sequencing artefact is a misread nucleotide, which should not be mistaken for a SNP. Some of my strategies to avoid taking these false-positive variable positions into account are to:
- Require the presence of multiple independent reads that include the SNP.
- Demand a minimum read quality for a SNP position.
- Require a certain ratio between both alleles (1:1 or 1:2) in cases where Mendelian segregation is expected.
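The three filters above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the pipeline actually used: the function name `call_snp`, the thresholds (3 reads, Phred quality 20, tolerance 0.2) and the input format are all hypothetical, and the reads are assumed to be independent already (for example, duplicates removed upstream). The expected ratio is expressed as the fraction of reads carrying the less frequent allele, so 1:1 corresponds to 0.5 and 1:2 to about 0.33.

```python
from collections import defaultdict


def call_snp(observations, min_reads=3, min_quality=20,
             expected_minor_fraction=0.5, tolerance=0.2):
    """Apply read-count, base-quality and allele-ratio filters to one position.

    observations: list of (base, phred_quality) tuples, one per independent
    read covering the position. Returns (major_allele, minor_allele) if the
    position passes all filters, otherwise None. All thresholds are
    illustrative defaults, not recommended values.
    """
    # Filter 2: ignore base calls below the minimum quality.
    counts = defaultdict(int)
    for base, quality in observations:
        if quality >= min_quality:
            counts[base] += 1

    # Filter 1: each allele must be supported by multiple reads.
    supported = {base: n for base, n in counts.items() if n >= min_reads}
    if len(supported) != 2:
        return None  # monomorphic, or more than two well-supported alleles

    # Filter 3: the allele ratio must be close to the expected ratio.
    (major, n_major), (minor, n_minor) = sorted(
        supported.items(), key=lambda item: (-item[1], item[0]))
    minor_fraction = n_minor / (n_major + n_minor)
    if abs(minor_fraction - expected_minor_fraction) > tolerance:
        return None
    return (major, minor)


# Hypothetical pile-ups: a balanced A/G position and a likely artefact.
true_snp = [("A", 30)] * 6 + [("G", 32)] * 5 + [("T", 8)]
artefact = [("A", 30)] * 10 + [("G", 35)]
print(call_snp(true_snp))
print(call_snp(artefact))
```

In the first pile-up the low-quality T is discarded and the A:G ratio is close to 1:1, so the position is kept. In the second, the G is seen in only one read and the position is rejected.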
3.2 Distinguishing paralogs from orthologs
Paralogous alleles or genes, paralogs for short, are copies of the same gene that occur within one set of chromosomes, or even on the same chromosome, of one individual. Variation between these paralogs is not useful for many of the purposes outlined above. Many plants have evolved complex genomes that contain multiple copies of almost every gene. In maize, for example, it is estimated that ~70% of all genes have multiple copies (1). To distinguish orthologs (the good) from paralogs (the bad), I use the following strategy:
- Haplotype-based SNP selection using QualitySNPpp.
- Uniqueness screening: each SNP, together with its flanking sequence, should be unique within the transcriptome or genome.
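The uniqueness screening step can be illustrated with a small sketch. It is a deliberate simplification and not the method used by QualitySNPpp: it counts exact matches of the SNP plus its flanking sequence on both strands, whereas a real screen would more likely use an alignment tool that tolerates mismatches. The function names, the flank length and the example sequences are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def count_occurrences(reference, query):
    """Count (possibly overlapping) exact occurrences of query in reference."""
    count = 0
    start = reference.find(query)
    while start != -1:
        count += 1
        start = reference.find(query, start + 1)
    return count


def is_unique(snp_context, references):
    """Return True if the SNP context occurs exactly once in the references.

    snp_context: left flank + SNP base + right flank, as one string.
    references: iterable of transcript or genome sequences.
    Both strands are searched; only exact matches are counted (assumption).
    """
    rc = reverse_complement(snp_context)
    hits = 0
    for seq in references:
        hits += count_occurrences(seq, snp_context)
        if rc != snp_context:  # avoid double-counting palindromic contexts
            hits += count_occurrences(seq, rc)
    return hits == 1


# Two hypothetical transcripts that share part of their sequence.
transcripts = ["TTGACCATGGCTAGCTTAGGATCC", "GGGATGGCTAGCAAAACCC"]
print(is_unique("GGCTAGCTTAG", transcripts))  # long flanks: found once
print(is_unique("GGCTAGC", transcripts))      # short flanks: found in both
```

With the longer flanks the context is found only in the first transcript and passes; with the shorter flanks it matches both transcripts and would be discarded as potentially paralogous.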
(1) Ahn & Tanksley. Comparative linkage maps of the rice and maize genomes. PNAS 1993; 90(17): 7980-7984.