Download Report

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
A Pairwise Relatedness Estimator bias-corrected for sample kin structure and size.
Dik Heg* and Dmitry A. Konovalov†
*Department of Behavioural Ecology, Zoological Institute, University of Bern,
Hinterkappelen, Switzerland, †School of Mathematics, Physics & Information
Technology, James Cook University, Townsville, Queensland, Australia
RUNNING TITLE: Unbiased relatedness estimator
Keywords: bias correction, relatedness, rare alleles, allele frequencies, microsatellites
Correspondence: Dik Heg, Fax: +41 31 6319141;
E-mail dik.heg@esh.unibe.ch
1
1
2
3
4
5
6
7
8
9
10
11
12
13
14
Abstract
Currently used pairwise relatedness (r) estimators (based on codominant DNA markers)
are only unbiased when population allele frequencies are known exactly. In practice, the
allele frequencies must often be estimated from a population sample of limited size and
of unknown kin structure. A new estimator of r is presented, which uses population
heterozygosity but not the frequencies of individual alleles. Based on P-values and mean
squared errors obtained from Monte Carlo simulations it is shown that the new estimator
outperforms the other considered estimators for small samples and samples with
internally high average pairwise relatedness. For large samples with low internal
relatedness the new estimator is comparable to the existing estimators. The P-values
calculated using the new estimator are independent of the estimated heterozygosity.
2
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
Introduction
45
(e.g. BROMAN 2001; MCPEEK et al. 2004). Equivalently, the sample X could be
Estimation of pairwise relatedness using microsatellite (codominant) markers is used
extensively in biology, e.g. in studies of kin structure (e.g. Dierkes et al. 2005) and
population management (e.g. Liebert & Starks 2006; Romo et al. 2006). There is a
number of relatedness estimators available with the LR (Lynch & Ritland 1999), QG
(Queller & Goodnight 1989) and W (Wang 2002) estimators being the most commonly
used in practice. All three (LR, QG and W) estimators require population allele
frequencies which would normally be estimated from a population sample. Such
simultaneous estimation of relatedness and allele frequencies affects reliability of the
estimators, e.g. the estimators become biased with the amount of bias depending on the
sample kin (pedigree) structure and size (Wang 2002). Wang (2002) has started to
address the problem by developing the W relatedness estimator which is bias-corrected
for the sample size.
The purpose of the present study is to estimate pairwise relatedness, when
unknown biases due to the sample kin structure may be present, i.e. the allele frequencies
and pairwise relatedness values are estimated from the same sample. Note that this is the
situation most commonly experienced by workers in the field using microsatellite DNA
markers, where due to time and/or money constraints no large independent random
sample of individuals of one generation is available to capture all alleles in the population
and to calculate population frequencies of the alleles with high accuracy.
Here we show that the commonly used (LR, QG and W) relatedness estimators
become severely biased when sample size is small and complex kin relationships are
present in the sample. We derive a new regression-based relatedness estimator (HK)
which depends only on the population heterozygosity averaged across observed loci (in
addition to the actual genotypes). Using Monte Carlo simulations the HK estimator is
compared to the LR, QG and W estimators and found to be less biased than the other
three estimators regardless of the sample kin structure and size. All four estimators have
been integrated into the KINGROUP v2 program (Konovalov et al. 2004) which is freely
available from www.kingroup.org and could be run on all java-enabled computer
platforms such as Unix/Linux, Windows and Mac-OSX (java release 1.5 or higher is
required).
Methods
Terminology and definitions
Let X be a sample from an outbred population containing n individuals who are
genotyped at a single locus (generalization for multiple loci is given below) containing k
codominant alleles { A1 , A2 ,..., Ak } . The sample genotypes are recorded as a matrix
X  [ xm,i ] , where xm ,i is equal to the number of Am alleles observed in the i‘th individual
3
1
2
3
expressed as X  (x1 , x 2 ,..., x n ) , where xi  ( x1i , x2i ,..., xki ) is the i’th column of the
matrix X , i.e. the i’th individual of the sample. For a diploid population each sample
k
genotype xi contains exactly two alleles,  m 1 xmi  2 . For example, genotypes ( A1 , A2 )
4
5
and ( A3 , A3 ) are encoded as (1,1, 0, 0) and (0, 0, 2, 0) , respectively, at a locus with four
(k  4) alleles { A1 , A2 , A3 , A4 } . The i’th index (second index in xm,i ) refers to an
6
7
individual in the sample and could be omitted when working with individuals identified
by other means, e.g. x   x1 , x2 ,..., xk  or y   y1 , y2 ,..., yk  .
8
9
10
11
12
13
14
15
If no further information is available about the sample, the m’th allele frequency
is estimated by (e.g. Nei 1978)
1 n
qm 
(1)
 xmi
2n i 1
and will be referred to as sample allele frequency. Queller & Goodnight (1989) discussed
an approach where two focal individuals (dyad) are excluded from the allele frequencies
calculation, i.e.
n
1
qm (i, j ) 
 xm,i , (2)
2(n  2) ii  j 1
and will be referred to as outer-pairs allele frequencies. Another allele frequency biascorrection of Queller & Goodnight (1989) is the outer-groups allele frequencies, where
the individuals from the focal group (or groups) are excluded from the allele frequencies
calculation for that group(s). The outer-groups correction uses (or assumes) additional
information about the sample, i.e. the knowledge of statistically uncorrelated (unrelated)
groups. When such information is not available (this study), the correction is also
unavailable.
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
The second form could be obtained using common notation (Lynch & Ritland
1999; Wang 2002) and Kronecker delta  ij defined as  i , j i  1 and  i , j i  0 (e.g. Wang
32
33
34
35
2004). Let individuals X  {x(l  1), x(2),..., x( L)} and Y  {y (1), y (2),..., y ( L)} be
genotyped at L unlinked loci and kl alleles at l’th locus. At locus l , denoting two alleles
observed in x(l ) and y (l ) genotypes as (a, b) and (c, d ) , respectively, the QG estimator
is given by
One of the most commonly used relatedness estimator is the QG (Queller &
Goodnight 1989) estimator in its symmetric form as implemented in the
RELATEDNESS, KINSHIP (Goodnight & Queller 1999) and IDENTIX (Belkhir et al.
2002) programs. The QG estimator could be expressed in different forms yielding
identical estimates of r. The first (original) form is given by Queller & Goodnight (1989).
L
rQG ( X , Y ) 
36
  S (ab, cd )  p
2  
l 1
37
a
l 1
L
ab
 pb  pc  pd 
  cd  pa  pb  pc  pd 
,
(3)
where
4
1
2
3
4
5
6
7
8
9
10
11
S (ab, cd )   ac   ad   bc   bd ,
(4)
and where the values of (a, b, c, d ) alleles together with the corresponding population
allele frequencies ( pa , pb , pc , pd ) are locus specific. The explicit dependency on the
locus index (l ) is omitted hereafter to simplify the notation if the locus is implied by
context. Note that the QG estimator permits a number of variations (not just different
forms of the same variant), e.g. a variant studied in Lynch & Ritland (1999) is different
from Equation 3.
The QG estimator could also be rewritten in vector notation (the third form) using
S (a, b, c, d )  x  y , pa  pb  p  x , pc  pd  p  y , 1   ab  12 x  x and 1   cd  12 y  y ,
obtaining
L
12
T
rQG ( X , Y )  
B
x  y  p  x  p  y
l 1
L

l 1
1
2
xx  y y  p x  py
,
(5)
1
2
13
where the dot-product notation x  y   ml 1 xm ym is used to sum over allele index m and
14
where p  p(l )  ( p1 , p2 ,..., pkl ) is the vector of (locus specific) population allele
15
frequencies. Population locus homozygosity and heterozygosity are defined by
k
k
16
 (l )   p(l )  p(l )    pm2 (l ) and h(l )  1   (l ) ,
(6)
m 1
17
18
19
20
21
22
23
24
25
respectively.
New relatedness estimator
The QG estimator becomes biased when the sample or outer-pairs allele frequencies are
used instead of the population allele frequencies, see Appendix. From the derivation of
the expected value of the QG estimator (see Appendix) and after some rearrangement it is
found that E  (xi  x j ) 2   E  (xi  x j )  (xi  x j )   4(1  rij )h(l ) depends only on the
30
31
locus heterozygosity h(l ) . Then an unbiased estimator rijHK of relatedness can be found
within a class of best linear unbiased estimators (BLUE) which is given by, in most
general terms (e.g. McPeek et al. 2004),
L
d 2 (l )
rijHK  1   wl ij , (7)
h(l )
l 1
where
dij2 (l )  14 (xi  x j ) 2 , (8)
32
33
34
and where  w1 , w2 ,..., wL  are the locus weights and the given L loci are assumed to be
statistically independent (unlinked). The estimator is only approximately unbiased since it
2
contains a division, i.e. E  A / B   E  A  / E  B   cov  A, B  / E  B  . Without knowing
26
27
28
29
5


1
the true relatedness matrix the covariance term cov dij2 (l ), h(l ) can not be calculated
2
3
4
exactly and is omitted. Then the weights are selected proportional to the locus
heterozygosity h(l ) obtaining
rijHK (h )  1  dij2 / h , (9)
5
where dij2 
1
L

L
l 1
dij2 (l ) and h 
1
L

L
l 1
h(l ) . The estimator is approximately (in the
6
7
8
9
10
11
12
13
14
15
Taylor expansion sense) unbiased but having the advantage of not depending on the
frequencies of individual alleles. The estimator depends only on the population
heterozygosity h (averaged across loci). Moreover the dependency is trivial, where h is
just a scaling coefficient.
The new estimator essentially reduces the problem of pairwise relatedness
estimation to the problem of heterozygosity estimation which is well studied especially in
relation to inbreeding (e.g. Rousset & Raymond 1995). In the case of outbred
populations, a simple heterozygosity estimator could be obtained by observing that
1
2 E  x i  x i   1   (l )  2  h(l ) is independent of kin structure (and a specific allele
16
17
frequency). A heterozygosity indicator could be defined in vector notation as
(10)
hii  2  12  xi  xi  ,
18
where its expected value yields the locus heterozygosity E  hii   h(l ) . Then an unbiased
19
estimator he (l ) of heterozygosity at a locus can be found in the BLUE terms via
n
20
he (l )   ui hii ,
(11)
i 1

21
where the weights  u1 , u2 ,..., un  are normalized by
22
hii  0 for a heterozygote and homozygote, respectively . If the relatedness matrix rij
23
were known, the most optimal BLUE weights could be found by minimizing var  he  .
24
Since rij is not known, equal weights ui 
25
26
27
28
29
30
necessarily the most efficient) estimator of heterozygosity in the absence of allelespecific frequencies. The he estimate at a locus is simply equal to the number of observed
heterozygotes averaged over the sample size n.
31
1
n
n
u  1 , and where hii  1 and
i 1 i
are used, which yield an unbiased (but not
Since the new estimator requires averaged heterozygosity, the he (l ) estimator is
averaged across loci obtaining
1 L
1 L n
he   he (l ) 
 hii (l ) , (12)
L l 1
nL l 1 i 1
32
where E  he   h and var  he  
33
34
35
36
number of observed heterozygotes averaged over the sample size n and number of loci L.
Note that the new relatedness estimator is unbiased only if it is used together with a
heterozygosity estimator that is at least equally unbiased. For example, the commonly
used heterozygosity estimator (e.g. Nei 1978)
1
L2

L
l 1
var  he (l )  , i.e. the he estimate is equal to the
6
hNei 
1
k
2n 

1

qm2  , (13)


2n  1  m 1 
2
where qm  are the sample allele frequencies, is bias corrected for sample size but not for
3
the sample kin structure, i.e. E  hNei   1    h , where   (n  1)r /(2n  1) and where
4
r
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
2
n ( n 1)
 
n
n
i 1
j i 1 ij
r .
Simulations
Following Milligan (2002), three different population allele-frequency distributions are
considered: equally-frequent – all alleles occur at equal frequency; one-common – a
single allele occurs with a frequency of 0.6 and the remainder are equally frequent;
Dirichlet – allele frequencies are independently drawn from the same Dirichlet
distribution (p.209, Stuart & Ord 1987) with all parameters set to unity. The uniform
Dirichlet distribution is equivalent to the random distribution used in Belkhir et al. (2002)
and produces results very similar to those obtained with the triangular distribution
(Belkhir et al. 2002; Wang 2002).
Four degrees of pairwise relationships are considered: FS – full-sibs (r  12 ) , HS –
half-sibs (r  14 ) , NR – non-related (r  0) and PO - parent-offspring (r  12 ) . In order to
obtain the desired pairwise relationships two sample kin structures are used. The first
structure is inspired by Dierkes et al. (2005) and denoted by G ( g , s) , where g is the
number of parental pairs (groups), s is the number of full-sibs in each group. In order to
examine half-sibling relationships g  1 unrelated genotypes U1 ,U 2 ,...,U g 1 are
23
generated and then the full-sibs of the i’th group are generated by the U i , U i 1  parental
24
25
pair. For example, the family structure G (2,9) encodes n  g ( s  1)  1  21 genotypes
U1 ,U 2 ,U 3 , F1 (1, 2),..., F9 (1, 2), F1 (2,3),..., F9 (2,3) , where Fj (i, i  1) denotes the j’th
26
offspring of the U i , U i 1  parents. Each sample generated according to the structure
27
consists of N 
28
29
30
31
32
33
34
35
36
37
38
39
N HS  ( g  1) s 2  81 , N PO  2sg  36 , and N NR  N  N FS  N HS  N PO  21 yielding
210  21
210  90% of the sample having non-zero pairwise relatedness.
n ( n 1)
2
 210 unique dyads including N FS  12 s ( s  1) g  72 ,
The second structure (denoted single-family) is inspired by Figure 4 of Wang
(2002). A single-family sample consists of n individuals including s full-sibs together
with the two founding parents (in order to study the PO dyads) and n  2  s pairwise
unrelated genotypes. Therefore a single-family sample yields N FS  12 s ( s  1) , N HS  0 ,
N PO  2s , and N NR  12 n(n  1)  12 s ( s  1)  2s .
Statistical behaviour of the new estimator in relation to the existing LR, QG and
W estimators is assessed by comparing: sampling variance and bias (e.g. Wang 2002); the
mean-squared error (MSE) via its square root (RMSE) (e.g. Milligan 2003); and P-values
7
1
2
3
4
5
(e.g. Belkhir et al. 2002). Following Milligan (2002) the RMSE (for a chosen relatedness
estimator) is broken down into the relationship specific components. For example the
RMSE for  ’th kinship relationship (e.g.   PO for parent-offspring) in a sample is
defined by
RMSE(r ) 
1
N

N
i 1
(r (i )  r ) 2 , (14)
6
7
8
where: r (i ) is the i’th estimate of r (e.g. rPO  0.5 ); r is the actual value used to
generate the corresponding dyad in the sample; N is the number of  ’th relationships
in the sample. The RMSE is calculated via MSE’s equivalent expression in terms of
9
10
11
12
sampling variance and bias, RMSE(r )  var(r )  (bias(r )) 2 . Therefore the RMSE
represents an overall estimation error and shows if an estimator is optimized for bias at
the expense of variance and vice-versa. The kinship specific sampling variance and bias
N
N
are calculated via var(r )  ( N11)  i 1 (r (i )  r ) 2 and bias(r )  N1  i 1 (r (i )  r ) ,
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
respectively.
The P-values are calculated based on the resampling procedure of Guo &
Thompson (1992). For each simulated sample and at each locus, 17000 null-hypothesis
dyads are generated by randomly selecting alleles without replacement from the pool of
alleles available in the sample. A chosen relatedness estimator is then used to calculate:
the null-distribution from the null dyads; the pairwise relatedness estimates from the
sample; and the corresponding pairwise P-values.
Once a relatedness estimator, population allele-frequency distribution, and sample
kin structure are selected, a single simulation trial is performed as follows: (1) a sample is
generated based on the selected population allele-frequency distribution and sample kin
structure; (2) sample allele frequencies are calculated; (3) all pairwise relatedness
estimates are calculated using the selected relatedness estimator and the sample (not
population) allele frequencies; (4) sampling variance, bias and RMSE are calculated from
the sample for each of the kinship relationships present in the sample; (5) nulldistribution is calculated and used to calculate P-values. When the Dirichlet distribution
is used, in step 1 a freshly generated set of Dirichlet allele frequencies is used for each
sample. In step 3 the HK estimator does not use the sample allele frequencies but rather
the sample heterozygosity is estimated and used to calculate the HK estimates.
Note that all sampling statistics are calculated on a per-sample basis (single
simulation trial) and then averaged over 10,000 simulated samples. This is done to
simulate as close as possible what actually happens in practice where, effectively, only a
single sample is analysed for a given population at a given location. The bias, variance
and RMSE (broken down by the types of kinship relationships) values could only be
calculated in simulation studies. On the other hand, the pairwise P-values could be
calculated for real data (where the sample kin structure may not be known a priori)
separating unrelated and non-zero related dyads. The P-values for all considered
estimators can be calculated via the KINGROUP v2 (Konovalov et al. 2004) program.
8
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
Results and Discussion
Figure 1 displays results for the single-family sample structure, where each simulated
sample consists of 100 individuals and a varied number of full-sibs. Even though the
equally-frequent allele frequency distribution is chosen in this case, other considered
allele frequency distributions (and sample sizes) produce qualitatively similar results and
hence not shown. The same is applicable to the rest of the Figures where no special
optimisation for the sample structure or allele frequency distribution is performed. A
variety of samples and frequencies are presented to verify that the statistical properties of
the new estimator are consistent across the sample structures and allele frequency
distributions (as expected from the derivation of the estimator). The W estimates are
calculated based on the variant of the estimator which is bias-corrected for a sample size
as per Wang (2002).
First, Figure 1 shows that the LR, QG and W estimators become severely biased
(second column in Figure 1) with the sample allele frequencies if a large part of the
sample consists of related individuals. The reason for the negativity of the bias is
discussed extensively in Queller & Goodnight (1989). In summary, the relatedness
estimators use allele frequencies which are overestimated from the sample due to the
presence of one (as in Figure 1) or more groups of highly related individuals. Therefore
the relatedness values are calculated to be smaller (hence the negative bias) than those
calculated with the correct (but unknown in practice) population allele frequencies.
Second, the bias is entirely due to the error in the estimation of allelic frequencies,
which can be seen by comparing the dashed lines (the QG estimates with the sample
allele frequencies) with the dotted lines (the QG estimates with the correct population
allele frequencies). The classic allele frequency bias-correction (the outer-pairs allele
frequencies) yields little improvement in the QG estimates (compare the dashed and solid
lines in Figure 1). Moreover the outer-pairs frequencies can not be applied to the LR
estimator without further assumptions since the occurrence of rare alleles yield infinite
LR estimates (Wang 2002).
The new HK estimator is significantly less biased (some small residual bias may
always be present due to the statistical nature of the simulations) than the other estimators
throughout. The only possible critique is that the new estimator has slightly larger
variance for the full-sibling dyads (FS in Figure 1) but the absolute magnitude of the
increase is biologically irrelevant (e.g. var(rFS(HK) )  var(rFS(W) )  0.00001 for 60 full-sibs),
38
while the improvement in the bias is manyfold better (e.g. bias(rFS(HK) )  bias(rFS(W) )  0.1
39
40
41
42
for 60 full-sibs). The overall estimation error is measured by the RMSE (third column in
Figure 1) and is dominated by the bias while the contribution from the variance is hardly
(if at all) noticeable. Note that Figure 1 reproduces the previously reported bias
 bias(rPO(W) )  0.04 in the parent-offspring (PO) dyads with 40 full-sibs (Wang 2002).
43
44
45
The forth column in Figure 1 shows the P-values. In the case of the FS and PO
dyads the new estimator rejects the null hypothesis at the lower levels of significance
9
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
(smaller P-values). The P-values for the unrelated dyads remains around 0.5 which could
not lead to the unrelated dyads being misclassified as related. Moreover the average Pvalue of p  0.5 is the by-definition correct value for an unbiased estimator applied to
the unrelated dyads which are equivalent to the null-hypothesis dyads. This could be seen
in Figure 2 where all considered estimators become unbiased as the sample size increases
while the number of related individuals in the sample remains constant.
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
derivation of the new estimator. A similar residual RMSE is present in all other
estimators. However, such low quality loci are most likely to be rejected by not passing
the Hardy Weinberg test, e.g. using GENEPOP (Raymond & Rousset 1995).
Figure 2 reveals that the new estimator becomes asymptotically equivalent to the
other estimators when applied to the large samples with small average pairwise
relatedness. As in Figure 1, the new estimator achieves the lowest RMSE. Interestingly,
for the one-common allele frequency distribution (h  0.6) , the LR estimator achieves
the lowest P-values for the PO and FS dyads starting from sample sizes of about 70. On
the other hand the LR estimates are the most biased (largest RMSE) for the same dyads
(second column in Figure 2). Figure 2 also reveals that the P-values may be hard to
improve upon (see FS and PO in the last column of Figure 2) by just increasing the
sample size. This is due to the null-distribution being mainly controlled by the number of
available loci and their heterozygosities. Figure 2 reminds that one needs to be cautious
with loci of low heterozygosity (e.g. around 0.5 or less). In the case of the HK estimator
the residual RMSE is mainly due to the omission of cov  dij2 (l ), h(l )  / h 2 (l ) in the
Figure 2 confirms that the HK estimator is comparable to the other estimators on
large samples (with low average pairwise relatedness). However the main advantage of
the new estimator is revealed for the samples which are small and/or highly related
internally. In particular, Figure 3 shows that the new estimator is the only estimator that
can resolve the half-sibling (HS) dyads (HS P-values in Figure 3) in the G(3,9) sample
structure. The main reason why the other estimators fail is because the estimators remain
severely biased regardless of the number of loci. The new estimator remains (see also
Figure 1) to be more powerful in resolving the PO and FS dyads (the lowest P-values).
The problem of rare alleles has been already effectively solved by the W (Wang 2002)
estimator. The HK estimator is also unaffected by the presence or absence of rare alleles,
e.g. in Figure 3 in nearly every sample there are one or more rare alleles (alleles present
only in a single group of related individuals).
Another advantage of the HK estimator is that the P-values calculated using the
estimator are independent of the estimated heterozygosity, i.e. the estimated
heterozygosity is only required as a scaling coefficient to reduce the bias of the HK
estimates.
Finally, on the conceptual level, the main reason why the new estimator
outperforms the other considered estimators could be explained within the framework of
the Statistical Learning Theory of Vapnik (1995): given a limited amount of information
“when solving a given problem, try to avoid solving a more general problem as an
10
1
2
3
4
5
6
7
8
intermediate step” (Vapnik 1995). The new estimator does just that by not estimating the
population allele frequencies which is by far a much more difficult task when working
with the small sample sizes. To some extent the W estimator followed a similar path by
not using frequencies of individual alleles. Nevertheless, the W estimator requires the
k
following three observed sums of powers of allele frequencies: a   m 1 qm for
  {2,3, 4} .
11
1
References
2
Belkhir K, Castric V, Bonhomme F (2002) IDENTIX, a software to test for relatedness in
3
4
5
6
a population using permutation methods. Molecular Ecology Notes, 2, 611-614.
Broman KW (2001) Estimation of allele frequencies with data on sibships. Genetic
Epidemiology, 20, 307-315.
Dierkes P, Heg D, Taborsky M, Skubic E, Achmann R (2005) Genetic relatedness in
7
groups is sex-specific and declines with age of helpers in a cooperatively breeding
8
cichlid. Ecology Letters, 8, 968-975.
9
10
11
12
13
Goodnight KF, Queller DC (1999) Computer software for performing likelihood tests of
pedigree relationship using genetic markers. Molecular Ecology, 8, 1231-1234.
Guo SW, Thompson EA (1992) Performing the Exact Test of Hardy-Weinberg
Proportion for Multiple Alleles. Biometrics, 48, 361-372.
Konovalov DA, Manning C, Henshaw MT (2004) KINGROUP: a program for pedigree
14
relationship reconstruction and kin group assignments using genetic markers.
15
Molecular Ecology Notes, 4, 779-782.
16
Liebert AE, Starks PT (2006) Taming of the skew: transactional models fail to predict
17
reproductive partitioning in the paper wasp Polistes dominulus. Animal
18
Behaviour, 71, 913-923.
19
20
21
22
Lynch M, Ritland K (1999) Estimation of Pairwise Relatedness With Molecular Markers.
Genetics, 152, 1753-1766.
McPeek MS, Wu XD, Ober C (2004) Best linear unbiased allele-frequency estimation in
complex pedigrees. Biometrics, 60, 359-367.
12
1
2
3
4
5
6
7
Milligan BG (2003) Maximum-likelihood estimation of relatedness. Genetics, 163, 11531167.
Nei M (1978) Estimation of Average Heterozygosity and Genetic Distance from a Small
Number of Individuals. Genetics, 89, 583-590.
Queller DC, Goodnight KF (1989) Estimating relatedness using genetic markers.
Evolution, 43, 258-275.
Raymond M, Rousset F (1995) Genepop (Version-1.2) - Population-Genetics Software
8
for Exact Tests and Ecumenicism. Journal of Heredity, 86, 248-249.
9
Romo M, Suzuki S, Nakajima M, Taniguchi N (2006) Genetic evaluation of
10
interindividual relatedness for broodstock management of the rare species barfin
11
flounder Verasper moseri using microsatellite DNA markers. Fisheries Science,
12
72, 33-39.
13
14
15
16
Rousset F, Raymond M (1995) Testing Heterozygote Excess and Deficiency. Genetics,
140, 1413-1419.
Stuart A, Ord JK (1987) Kendall's Advanced Theory of Statistics, Vol. 1: Distribution
Theory, 5 edn. Oxford University Press, New York.
17
Vapnik V (1995) The Nature of Statistical Learning Theory Springer, New York.
18
Wang J (2002) An estimator for pairwise relatedness using molecular markers. Genetics,
19
20
21
160, 1203-1215.
Wang J (2004) Sibship reconstruction from genetic data with typing errors. Genetics,
166, 1963-1979.
22
13
1
2
3
4
5
6
7
8
9
10
Acknowledgements
This paper was partly written when D.K. was on sabbatical leave at the University of
Bern. We would like to thank University of Bern and James Cook University and, in
particular, Michael Taborsky and Bruce Litow for supporting this collaborative project,
and Peter Stettler for his hospitality. We thank Francois Rousset, Ross Crozier, Laurent
Excoffier, Nigel Bajema and three anonymous referees for helpful comments on earlier
versions of this manuscript. D.H. is supported by SNF Grant 3100A0-108473.
14
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
Figure Legends
Fig. 1 Sampling variance, bias, root mean-square error (RMSE) and P-values of
relatedness estimates for kin relationships as shown in the row titles. The curves are
based on 10,000 samples, where each sample consists of 100 individuals in the singlefamily kin structure. The number of full-sibs in each sample is varied (x-axis). The dashed
and dotted lines denote the QG estimates calculated with the sample and population
allelic frequencies, respectively. The bold dots, crosses and circles denote the new HK,
LR (Lynch & Ritland 1999) and W (Wang 2002) estimates, respectively. The solid line
denotes the QG estimates with the outer-pairs allele-frequency correction. Ten loci are
used, each having ten alleles with their population frequencies in the equally-frequent
distribution.
Fig. 2 The same as in Figure 1 but for varied sample sizes and 10 full-sibs (including
their two parents) in each sample. Ten loci are used, each having five alleles with their
population frequencies in the one-common allele frequency distribution.
Fig. 3 The same as in Figure 1 but for samples with G (3,9) kin structure. The number of
loci is varied (x-axis), each locus having 20 alleles with their population frequencies in
the Dirichlet distribution.
15
1
2
3
Figure 1.
16
1
2
3
Figure 2
17
1
2
3
4
Figure 3
18
1
2
3
4
5
6
7
8
9
Appendix
Expected value of the QG estimator
Having expressed the QG estimator in vector notation, its expected value could be
obtained as follows. Let an outbred population be in Hardy Weinberg Equilibrium
(HWE) and described by the population allele frequencies p   p1 , p2 ,..., pk  which are
assumed to be known exactly. Then each observed (sample) genotype
xi  ( x1,i , x2,i ,..., xk ,i ) could be represented as a sum of two statistically independent
10
 , obtaining E   mi2   E   mi   pm ,
gamete vectors xi  ε i  εi , i.e. xmi   mi   mi
11
2
var   mi   pm 1  pm  , E  xmi   2 pm , E xmi
 2 pm 1  pm  ,
12
var  xmi   2 pm 1  pm  (e.g. Broman 2001), where    p  p    m 1 pm2 and h  1  
13
14
15
16
17
 
1
2
E  xi  xi   1   and
k
are homozygosity and heterozygosity at the given locus, respectively.
A new form of pairwise relatedness matrix could be defined in the vector notation
by x j  rij xi  (1  rij )z ij , where rii  1 , and where z ij is statistically independent of xi and
x j , obtaining cov  xmi , xmj   2rij pm 1  pm  , E  xmi xmj   2  rij pm (1  pm )  2 pm2  and
18
E  xi  x j   2  rij h  2  . Then the expected values of the nominator and denominator in
19
Equation 5 for individuals xi and x j (from an outbred population sample) become
20
T  E (T )  2rij h and B  E ( B )  2h , respectively. In the case of multiple loci, B  2 Lh
21
and T  2rij Lh , where h 
22
23
the l’th locus. The expected value of the QG estimator is approximately calculated via a
Taylor series by omitting higher orders E T / B   T / B obtaining
24
25
26
27
28
29
30
31
32
33
34
35
E  rQG ( X i , X j )  
1
L

L
l 1
h(l ) , and where h(l ) is population heterozygosity of
2rij LH
 rij ,
(15)
2 LH
where the expected value of the estimator is independent of population heterozygosity
(and the population allele frequencies, for that matter).
Bias in the QG estimator due to estimated allele frequencies
In practice population allele frequencies are rarely known and the frequencies must be
estimated from a sample. Two such frequency estimators are considered to reveal the
effect of frequency estimation error on the QG relatedness estimator. The first frequency
estimator is the sample frequency estimator (Equation 1) which yields the following
expected value of the QG estimator
 L

E    xi  x j  q  xi  q  x j  
rij  Cij
 l 1


eij  E  rQG  X i , X j , q   
, (16)

C
1
 L 1

ij
E    2 xi  xi  12 x j  x j  q  xi  q  x j  
 l 1

19

n
1
where the sample allele frequencies qm 
2
allele frequencies  pm    p1 , p2 ,..., pk  , and where Cij 
3
4
An unbiased relatedness estimates could then be constructed from the QG estimates via
rij  eij  Cij 1  eij  . (17)
5
6
It is interesting to examine some examples. The first example is where all genotypes are
in fact unrelated  rij   ij  . Then for the QG estimator to become unbiased it must be
7
8
calculated via
rij (NR)  eij  21n 1  eij  . (18)
9
10
11
12
13
14
15
1
2n
i 1
xmi are used instead of the population
1
2
 R  R  and
i
j
Ri 
1
n

n
r .
j 1 ij
The second example is where all sample genotypes are full sibs  rij  12 (1   ij )  and the
QG estimator must be bias-corrected via
rij (FS)  eij   12  21n  1  eij  . (19)
The second frequency estimator is the outer-pairs allele frequency estimator
(Equation 2) which yields
r  Cij
, (20)
eij  E  rQG  X i , X j , q(i, j )    ij
1  C
ij

n
16
where the outer-pairs allele frequencies qm (i, j ) 
17
the population allele frequencies  pm  , and where Cij 
18
Ri 
19
and rij (FS)  eij  12 1  eij  revealing the fact that the outer-pairs frequencies yield
20
21
22
23
24
25
26
1
n2

n
1
2( n  2)
1
2
x
i  i  j 1 m ,i 
are used instead of
 R  R  and
i
j
r . The two examples of all unrelated and all full-sibs yield rij (NR)  eij
i  i  j 1 ii 
unbiased estimates only if the outer (in relation to the focal pair) individuals are unrelated
to either of the paired genotypes. In summary, the QG pairwise relatedness estimator
becomes biased if either one or both of the two focal individuals are related to one or
more individuals in the rest of the sample. The outer-pairs frequency correction does not
solve the problem of simultaneous estimation of pairwise relatedness and allele
frequencies.
20