Field Genetics - Cervus: Methodology

Cervus uses a likelihood-based approach to assign parentage combined with simulation of parentage analysis to determine the confidence of parentage assignments. This page explains the background to these methods.

For simplicity, the text refers to assignment of parentage to a single parent, but the same principles apply to assignment of parentage to parent pairs.

What is likelihood?

The philosophy behind likelihood analysis is to take data as a starting point and to evaluate hypotheses given that data. The likelihood of one hypothesis is always evaluated relative to another and this is called the likelihood ratio.

In the context of parentage analysis the data are genotypes from the offspring, the known parent (if one is known) and one or more candidate parents. For each candidate parent the two alternative hypotheses are:

1) The candidate parent is the true parent.
2) The candidate parent is not the true parent.

The likelihood of each hypothesis given the observed genotypes is calculated from the probability of obtaining the observed genotypes under that hypothesis.

The likelihood ratio is the likelihood that the candidate parent is the true parent divided by the likelihood that the candidate parent is not the true parent. A large likelihood ratio indicates that the candidate parent is much more likely to be the true parent than not the true parent.
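In symbols, writing D for the observed genotypes, H1 for the hypothesis that the candidate parent is the true parent and H2 for the hypothesis that it is not, the ratio can be written as follows. The notation is introduced here for illustration only and is not taken from Cervus.

```latex
\[
\mathrm{LR} = \frac{L(H_1 \mid D)}{L(H_2 \mid D)} = \frac{P(D \mid H_1)}{P(D \mid H_2)}
\]
```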

Exclusion and likelihood

A simple approach to parentage analysis relies on a process of exclusion. The genotypes of candidate parents are compared against the offspring's genotype (taking account of the other parent's genotype, if available), and are excluded as parents if a mismatch occurs at one or more loci.

With few candidate parents and highly polymorphic loci, this process should usually leave just a single non-excluded candidate parent. However, in less favourable circumstances it is common for multiple candidate parents to remain non-excluded. In this case the exclusionary approach is inadequate because it offers no way to identify which non-excluded candidate parent is the true parent.
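The exclusion process can be sketched in a few lines of Python. This is a minimal illustration, not Cervus code: it assumes no known parent, no missing genotypes and no typing errors, and the function names and example genotypes are hypothetical.

```python
def mismatches(offspring, candidate):
    """True if the candidate shares no allele with the offspring at one locus."""
    return not (set(offspring) & set(candidate))


def non_excluded(offspring_genotypes, candidates):
    """Return the names of candidates that match the offspring at every locus.

    offspring_genotypes: list of (allele, allele) tuples, one per locus.
    candidates: dict mapping a candidate name to genotypes in the same locus order.
    """
    return [
        name
        for name, genotypes in candidates.items()
        if not any(mismatches(o, c) for o, c in zip(offspring_genotypes, genotypes))
    ]


# Hypothetical genotypes at three loci (alleles coded by fragment length).
offspring = [(100, 104), (200, 200), (150, 152)]
candidates = {
    "A": [(100, 102), (200, 204), (152, 156)],
    "B": [(104, 104), (200, 202), (150, 150)],
    "C": [(102, 106), (200, 200), (150, 152)],
}
print(non_excluded(offspring, candidates))
```

In this example candidate C is excluded by a mismatch at the first locus, but A and B both remain, and exclusion alone cannot choose between them.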

Likelihood, on the other hand, can be used to statistically distinguish non-excluded candidate parents. For each locus likelihood captures two sources of information about the candidate parent that exclusion does not:

1) The frequency of the offspring allele or alleles that could have come from the candidate parent.
2) Whether the candidate parent is heterozygous or homozygous.

The purpose of Cervus is to use this information to identify the candidate parent that is most likely to be the true parent.
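To see how these two sources of information enter the calculation, the sketch below computes a likelihood ratio at a single locus for the simplest case: no known parent, no typing error, and genotype frequencies in Hardy-Weinberg proportions. It is an illustration based on Mendelian transmission probabilities, not the equations used in Cervus (which also allow for typing error). The function names and allele frequencies are hypothetical.

```python
def transmit_prob(parent, allele):
    """Probability that a parent with genotype `parent` passes on `allele`."""
    return parent.count(allele) / 2


def genotype_freq(genotype, freqs):
    """Hardy-Weinberg frequency of a genotype given allele frequencies."""
    a, b = genotype
    return freqs[a] ** 2 if a == b else 2 * freqs[a] * freqs[b]


def locus_likelihood_ratio(offspring, candidate, freqs):
    """P(offspring | candidate is a parent) / P(offspring | candidate is unrelated).

    The other parent is assumed to contribute an allele drawn at random
    from the population.
    """
    a, b = offspring
    if a == b:
        numerator = transmit_prob(candidate, a) * freqs[a]
    else:
        numerator = (transmit_prob(candidate, a) * freqs[b]
                     + transmit_prob(candidate, b) * freqs[a])
    return numerator / genotype_freq(offspring, freqs)


# Hypothetical allele frequencies: A is rare, C is common.
freqs = {"A": 0.1, "B": 0.3, "C": 0.6}
offspring = ("A", "C")
for candidate in [("A", "A"), ("A", "B"), ("B", "C"), ("B", "B")]:
    print(candidate, round(locus_likelihood_ratio(offspring, candidate, freqs), 3))
```

A candidate homozygous for the rare allele A scores higher than a heterozygote carrying A, and a candidate that shares only the common allele C scores below 1 even though it is not excluded. A candidate that shares no allele scores zero, which corresponds to exclusion.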

Typing errors and likelihood

If genotypes are determined with 100% accuracy, a mismatch between offspring and candidate parent logically implies non-relationship (an exclusion of parentage). However if genotypes contain errors, a mismatch may be due to non-relationship but may also occur for a true parent due to a typing error in the offspring, the known parent (if one parent is known) or the true parent.

When multiple loci are used, the probability of at least one mismatch across all loci due to typing error can be relatively high, even when the frequency of typing error at any one locus is low.
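A rough calculation shows how quickly errors accumulate. The sketch below assumes that errors occur independently at each locus with the same rate, and it counts typing errors rather than mismatches (not every error produces a mismatch), so it gives an upper bound. The error rate and number of loci are hypothetical.

```python
def prob_at_least_one_error(error_rate, n_genotypes):
    """Probability that at least one of n independently typed genotypes is erroneous."""
    return 1 - (1 - error_rate) ** n_genotypes


# A 1% error rate per locus, 15 loci typed in one individual.
print(round(prob_at_least_one_error(0.01, 15), 3))

# The same rate across an offspring and a candidate parent (30 genotypes).
print(round(prob_at_least_one_error(0.01, 30), 3))
```

With these values, about 14% of individuals carry at least one erroneous genotype, and about 26% of offspring-candidate pairs do.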

For this reason Cervus uses likelihood equations that take account of typing error. An error is defined as the replacement of the true genotype with a genotype selected at random under Hardy-Weinberg assumptions. Under this definition, an erroneous genotype will sometimes be the same as the true genotype.
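This definition of an error can be illustrated with a short simulation. The sketch below is not Cervus code: the function names, allele frequencies and random seed are hypothetical. It draws a genotype under Hardy-Weinberg assumptions by sampling two alleles independently, and replaces the true genotype with such a draw at the given error rate.

```python
import random


def draw_genotype(freqs, rng):
    """Draw a genotype under Hardy-Weinberg assumptions (two independent alleles)."""
    alleles = list(freqs)
    weights = [freqs[a] for a in alleles]
    return tuple(sorted(rng.choices(alleles, weights=weights, k=2)))


def observe(true_genotype, freqs, error_rate, rng):
    """Return the recorded genotype: a random replacement with probability error_rate."""
    if rng.random() < error_rate:
        return draw_genotype(freqs, rng)
    return true_genotype


rng = random.Random(1)
freqs = {"A": 0.5, "B": 0.3, "C": 0.2}
true_genotype = ("A", "B")

# How often does a replacement genotype happen to equal the true genotype?
replaced = [draw_genotype(freqs, rng) for _ in range(10_000)]
same = sum(g == true_genotype for g in replaced) / len(replaced)
print(f"replacements identical to the true genotype: {same:.2f}")
```

The proportion printed should be close to the Hardy-Weinberg frequency of the true genotype, here 2 x 0.5 x 0.3 = 0.30, which is why an erroneous genotype is sometimes indistinguishable from the true one.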

If the rate of typing error is greater than zero, no candidate parent is ever excluded, but those that mismatch at many loci end up with very low likelihood ratios. The advantage of this approach is that a true parent that mismatches at one or even two loci can still usually be identified as the most-likely parent. A potential disadvantage is that non-parents may sometimes only mismatch at one or two loci and therefore may be falsely assigned parentage. In practice best results are usually obtained by allowing for typing errors.

Allowing for typing errors also reduces the impact of two other causes of mismatches in parent-offspring relationships: mutations and null alleles. While it is not statistically ideal to treat mismatches arising from mutations and null alleles as if they were mismatches arising from typing errors, it is a very much better approximation than using such mismatches as a basis for parentage exclusion.

Use of likelihood in Cervus

Cervus calculates likelihood ratios for each candidate parent taking account of possible typing errors. The frequency of typing errors is specified by the user. The overall likelihood ratio for each candidate parent is calculated by multiplying together the likelihood ratios at each locus. This step assumes that loci are inherited independently (i.e. that they are unlinked).

When there is a known parent the likelihood ratio is different from the one used when there is not a known parent. A third form of the likelihood ratio is used to assess parent pairs.

Where available, Cervus makes use of genetic information from the parent of the opposite sex to the one being tested. A sampled parent of the opposite sex to the candidate parent(s) is referred to as a known parent. The overall likelihood ratio is more commonly expressed as a LOD score.

LOD score

In parentage analysis, the LOD score is obtained by taking the natural log (log to base e) of the overall likelihood ratio.
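Because the overall likelihood ratio is a product of per-locus ratios, the LOD score can equally be computed as a sum of per-locus natural logs. The sketch below illustrates this; the function name and the per-locus ratios are hypothetical, and returning `None` for a zero ratio is a choice made here to signal that the logarithm does not exist.

```python
import math


def lod_score(locus_ratios):
    """Natural log of the product of per-locus likelihood ratios.

    Returns None if any ratio is zero, because log(0) is undefined.
    """
    if any(r == 0 for r in locus_ratios):
        return None
    return sum(math.log(r) for r in locus_ratios)


# Hypothetical per-locus likelihood ratios for one candidate parent.
ratios = [5.0, 2.5, 0.4]
print(lod_score(ratios))             # equals log(5.0 * 2.5 * 0.4)
print(lod_score([5.0, 0.0, 2.5]))    # None: one locus has a ratio of zero
```

Summing logs rather than multiplying ratios also avoids numerical underflow when many loci are used.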

A positive LOD score means that the candidate parent is more likely to be the true parent than not the true parent. A true parent almost always has a positive LOD score.

A LOD score of zero means that the candidate parent is equally likely to be the true parent as not the true parent.

A negative LOD score means that the candidate parent is less likely to be the true parent than not the true parent. Negative LOD scores can theoretically occur when the candidate parent and offspring share very common alleles at every locus. More commonly, negative LOD scores indicate that a candidate parent mismatches the offspring at one or more loci. Candidate parents that are unrelated to the offspring typically have negative LOD scores.

If the LOD score of the most likely candidate parent is large enough, parentage can be assigned to that candidate parent. Note that if likelihood ratios are calculated without taking account of genotyping errors, any mismatch leads to a likelihood ratio of zero, and the LOD score is undefined because zero has no logarithm. This is known as exclusion of parentage.

A derivative of the LOD score, Delta, may also be used as a criterion for assignment of parentage. Considering the set of candidate parents with a LOD score greater than zero, Delta is defined as the difference in LOD scores between the most likely candidate parent and the second most likely candidate parent. If only a single candidate parent has a LOD score greater than zero, Delta has the same value as the LOD score. If no candidate parent has a LOD score greater than zero, Delta is undefined.
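The three cases in this definition translate directly into code. The following is a minimal sketch with a hypothetical function name and example scores; `None` stands in for an undefined Delta.

```python
def delta(lod_scores):
    """Delta for one offspring, given the LOD scores of all its candidate parents."""
    positive = sorted((s for s in lod_scores if s > 0), reverse=True)
    if not positive:
        return None                      # no positive LOD score: Delta is undefined
    if len(positive) == 1:
        return positive[0]               # single positive score: Delta equals the LOD score
    return positive[0] - positive[1]     # gap between the two most likely candidates


print(delta([3.2, 2.9, -1.0]))   # small gap between the top two candidates
print(delta([4.0, -2.0]))        # only one positive score
print(delta([-1.0, -3.0]))       # undefined
```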

Delta is especially useful when multiple candidate parents have positive LOD scores. If the two most likely candidate parents had almost equal LOD scores and parentage were assigned to one of them, there would be almost a 50% probability that the assignment was incorrect. However, Delta would be close to zero, reflecting the uncertainty about the identity of the true parent, and typically no assignment would be made.

Simulation of parentage analysis

How large does the LOD or Delta score of the most likely candidate parent need to be for parentage to be assigned to that candidate parent?

LOD or Delta scores cannot be evaluated using a standard distribution such as the chi-square distribution. Therefore, Cervus uses simulation of parentage analysis to evaluate the confidence in assignment of parentage to the most likely candidate parent. When generating genotypes, the simulation uses the observed allele frequencies and also takes account of the number of candidate parents, the proportion of candidate parents sampled, the completeness of genetic typing and the estimated frequency of typing error.

Parentage analysis is carried out with the simulated genotypes as it is with real genotypes, but in the simulation the identity of the true parent is known for each offspring. Cervus compares the distribution of LOD or Delta scores for tests in which the most likely candidate parent is the true parent with the distribution of LOD or Delta scores for tests in which the most likely candidate parent is not the true parent. Confidence in assignment is defined as the proportion of all candidate parents with LOD or Delta scores exceeding a given LOD or Delta score that are true parents.

For example, the simulation can identify the value of LOD or Delta above which 19 out of every 20 scores belong to most likely candidate parents that are true parents, and only 1 out of every 20 belongs to most likely candidate parents that are not true parents, a false positive rate of 1 in 20. Any candidate parent with a LOD or Delta score exceeding this critical value is assigned parentage with 95% confidence.
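The idea of a critical value can be sketched as a search over candidate thresholds. The code below is a simplified illustration, not the procedure implemented in Cervus: the function name and the simulated scores are hypothetical, thresholds are tried only at observed score values, and a score at the threshold counts as meeting it.

```python
def critical_value(true_scores, false_scores, confidence=0.95):
    """Smallest observed score at which the required confidence is reached.

    true_scores:  scores of most likely candidates that are true parents.
    false_scores: scores of most likely candidates that are not true parents.
    Returns None if no threshold reaches the required confidence.
    """
    for threshold in sorted(set(true_scores) | set(false_scores)):
        n_true = sum(s >= threshold for s in true_scores)
        n_false = sum(s >= threshold for s in false_scores)
        total = n_true + n_false
        if total and n_true / total >= confidence:
            return threshold
    return None


# Hypothetical simulated scores, far fewer than a real simulation would use.
true_scores = [0.5, 1.0, 2.0, 3.0, 4.0, 5.0]
false_scores = [0.2, 0.8, 1.5]
print(critical_value(true_scores, false_scores, confidence=0.80))
print(critical_value(true_scores, false_scores, confidence=0.95))
```

With these scores, the 80% threshold is lower than the 95% threshold: demanding higher confidence raises the critical value and leaves fewer offspring with an assigned parent.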

The confidence levels in Cervus are population averages. The confidence of individual parentage assignments may be higher or lower than this average. To ensure satisfactory confidence in individual parentage assignments, 95% should be regarded as a minimum value for the population confidence level set in Cervus.

Further information

This information is also available in the help file that is supplied with Cervus. More details can also be found in Marshall et al. (1998).