  • Approach to detect related individuals: Our approach is based on the methodology used by GRAB [14], which was designed for unphased diploid genotype or sequencing data. GRAB divides the genome into non-overlapping windows of 1 Mbp each and, for a pair of individuals, compares the alleles inside each window. Each SNP is classified into one of three categories: IBS2 when both alleles are shared, IBS1 when only one allele is shared and IBS0 when no allele is shared. The program calculates the fraction of SNPs in each category (P2, P1 and P0) per window and, based on certain thresholds, uses these fractions for relationship estimation. GRAB can estimate relationships from 1st to 5th degree, but it has not been tested with data from different SNP panels or populations [14].
We assume that our input data stems from whole genome shotgun sequencing of an ancient sample, resulting in low coverage sequencing data. In such situations, a common approach in many ancient DNA studies is to randomly sample one sequencing read per individual and SNP site and then use the allele carried on that read as pseudo-haploid information. Such approaches are restricted to a set of biallelic SNPs ascertained in an external dataset. Consequently, we expect to observe only one allele per individual and SNP site, which is either shared or not shared between the two individuals. READ does not model aDNA damage, so the input is expected to be carefully filtered, e.g. by restricting to sites known to be polymorphic, by excluding transition sites or by rescaling base qualities before SNP calling [61]. Analogous to GRAB [14], we partition the genome into non-overlapping windows of 1 Mbp and calculate the proportions of haploid mismatches and matches, P0 and P1, for each window. Since P0 + P1 = 1, we can use P0 as a single test statistic. The average P0 is calculated from the genome-wide distribution of per-window values.
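The windowed P0 statistic described above can be sketched in a few lines of Python. This is a minimal illustration, not READ's actual code: the function names, the tuple-based input layout and the use of "0" as the missing-data symbol are assumptions made for the example, and the genome-wide average is taken here as an unweighted mean over windows, which the text does not specify.

```python
from collections import defaultdict

WINDOW_SIZE = 1_000_000  # 1 Mbp windows, as described in the text


def window_p0(sites, window_size=WINDOW_SIZE):
    """Return {(chrom, window_index): P0} for one pair of individuals.

    `sites` is an iterable of (chrom, position, allele_a, allele_b) tuples
    holding one pseudo-haploid allele per individual. The layout and the
    missing-data symbol "0" are assumptions of this sketch.
    """
    mismatches = defaultdict(int)
    totals = defaultdict(int)
    for chrom, pos, allele_a, allele_b in sites:
        if allele_a == "0" or allele_b == "0":
            continue  # skip sites where either individual has no call
        key = (chrom, pos // window_size)
        totals[key] += 1
        if allele_a != allele_b:
            mismatches[key] += 1
    # P0 is the fraction of mismatching sites; P1 = 1 - P0 is implied
    return {key: mismatches[key] / totals[key] for key in totals}


def average_p0(per_window):
    """Unweighted mean of the per-window P0 values (an assumption)."""
    return sum(per_window.values()) / len(per_window)
```

Windows without any overlapping site simply do not appear in the result, so they do not contribute to the average.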
To reduce the effect of SNP ascertainment, population diversity and potential batch effects, each pair’s average P0 score is then normalized by dividing it by the average non-normalized P0 score of an unrelated pair of individuals from the same population, ascertained in the same way as the tested pairs. Such a normalization step is not implemented in GRAB [14], suggesting that GRAB might be sensitive to ascertainment bias and general population diversity. The normalization sets the expected score for an unrelated pair to 1, so we can define classification cutoffs that are independent of the diversity within the particular data set. We define three thresholds to classify each pair as unrelated, second degree (i.e. nephew/niece-uncle/aunt, grandparent-grandchild or half-siblings), first degree (parent-offspring or siblings) or identical individuals/identical twins. The general workflow and the decision tree used to classify relationships are shown in Fig 1. There are four possible outcomes when running READ: unrelated (normalized P0 ≥ 0.90625), second degree (0.90625 > normalized P0 ≥ 0.8125), first degree (0.8125 > normalized P0 ≥ 0.625) and identical twins/identical individuals (normalized P0 < 0.625) (Fig 1). The cutoffs were chosen to lie halfway between the probabilities that a randomly chosen allele from one individual is not IBD to a randomly chosen allele from another individual, given their degree of relationship: 1/2 = 0.5 for identical twins/identical individuals, 3/4 = 0.75 for first degree relatives, 7/8 = 0.875 for second degree relatives and 15/16 = 0.9375 for third degree relatives. We do not aim to classify degrees higher than second degree and, therefore, consider all relationships of third degree or higher as ‘unrelated’. This decision keeps the approach conservative and allows for some variation within the group of unrelated individuals. 
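The cutoffs and the resulting decision rule described above can be reproduced with a short sketch. The function names are hypothetical, and the closed form 1 - (1/2)^(d+1) is only shorthand for the four expected values listed in the text (0.5, 0.75, 0.875, 0.9375); it is not taken from the READ source code.

```python
def expected_p0(degree):
    """Expected normalized P0 for a given degree (0 = identical individuals).

    Shorthand for the values listed in the text:
    0 -> 0.5, 1 -> 0.75, 2 -> 0.875, 3 -> 0.9375.
    """
    return 1 - 0.5 ** (degree + 1)


# Each cutoff lies halfway between the expectations of neighbouring degrees.
CUTOFFS = [(expected_p0(d) + expected_p0(d + 1)) / 2 for d in range(3)]
# CUTOFFS == [0.625, 0.8125, 0.90625]


def normalize(average_p0, unrelated_p0):
    """Scale a pair's average P0 so that an unrelated pair is expected at 1."""
    return average_p0 / unrelated_p0


def classify(normalized_p0):
    """Map a normalized P0 score to one of the four outcomes."""
    identical_cut, first_cut, second_cut = CUTOFFS
    if normalized_p0 >= second_cut:
        return "unrelated"
    if normalized_p0 >= first_cut:
        return "second degree"
    if normalized_p0 >= identical_cut:
        return "first degree"
    return "identical twins/identical individuals"
```

For example, a pair with an average P0 of 0.2 in a population whose unrelated pairs average 0.25 has a normalized score of 0.8 and is classified as first degree.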
Furthermore, the 1000 Genomes data contains very few third degree relatives, making it difficult to estimate error rates for this group. READ is implemented to classify pairs of individuals into fixed categories, so it will always output the best fitting degree of relationship based on the point estimate of the average P0. To assess the confidence of that classification, we estimate the standard error of the mean of the distribution of normalized P0 scores (SE = σ̂/√n, where σ̂ is the empirical standard deviation of P0 across all windows and n is the total number of windows) and calculate the distance to the cutoffs in multiples of the standard error (similar to a Z score, also known as ‘standard score’). Furthermore, the user is provided with a graphical output (see Fig 5) showing the average P0 for each pair, its 95% confidence interval, and the cutoffs for classification together with their 95% confidence intervals. Relationship Estimation from Ancient DNA (READ) was implemented in Python 2.7 [73] and GNU R [74]. The input format is TPED/TFAM [8], and READ is publicly available from https://bitbucket.org/tguenther/read and as S1 File.
  • Modern data with reported degrees of relationships: Autosomal Illumina Omni2.5M chip genotype calls from 1,326 individuals from 15 different populations were obtained from the 1000 Genomes Project (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/hd_genotype_chip/) [48]. We used vcftools version 0.1.11 [75] to extract autosomal biallelic SNPs with a minor allele frequency of at least 10% (1,156,468 SNPs in total, similar to the aDNA data set used for the empirical data analysis [35]; see below) and to convert the data to TPED/TFAM files. The data set contains pairs of individuals that were reported as related, 851 of them as first degree relationships and 74 as second degree. 
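The confidence measure described above, the standard error of the mean over windows and the distance of a pair's score to each cutoff in units of that standard error, can be sketched as follows. This is a minimal illustration under assumptions: the function names are hypothetical, the sample standard deviation (n - 1 denominator) is used for the "empirical standard deviation", and the sign convention (positive when the score lies above a cutoff) is a choice made for the example.

```python
import math
import statistics


def standard_error(window_scores):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    n = len(window_scores)
    return statistics.stdev(window_scores) / math.sqrt(n)


def distance_to_cutoffs(mean_score, se, cutoffs=(0.625, 0.8125, 0.90625)):
    """Signed distance from the mean score to each cutoff, in multiples of SE.

    Positive values mean the score lies above the cutoff, negative below.
    """
    return [(mean_score - cutoff) / se for cutoff in cutoffs]


# Usage with three made-up normalized window scores
scores = [0.7, 0.8, 0.9]
se = standard_error(scores)
distances = distance_to_cutoffs(statistics.mean(scores), se)
```

A score many standard errors away from its nearest cutoff indicates a more confident classification than one lying within one or two standard errors of it.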
We randomly sub-sampled 1,000, 2,500, 5,000, 10,000, and 50,000 SNPs and also randomly picked one allele per site in order to mimic extremely low coverage sequencing of ancient samples. In an additional simulation, we introduced different allelic error rates to the data to assess the possible effects of sequencing and mapping errors, contamination and post-mortem damage. Allelic errors were introduced by randomly changing alleles to the alternative allele according to a per site error rate; the per site rates are intended to reflect different error rates in different parts of the genome. Per site error rates were drawn from a Gaussian distribution with a mean corresponding to the average allelic error rate (0.05, 0.1, 0.15 or 0.2) and a standard deviation of 0.01. READ was then applied to these data sets, and the median of all average P0s per population was used to normalize scores, assuming that this would represent an unrelated pair. Additionally, the data was tested employing a 10-fold cross-validation procedure, which allowed us to infer the expected value of P0 for a pair of unrelated individuals from a different subset of the data than the one used to test the relationships, avoiding potential circularity. The average value of P0 obtained for each pair was then used for classification. To evaluate READ’s performance, we calculated false positive and false negative rates. Unrelated individuals classified as related were considered false positives; related individuals classified as unrelated, or as related but not at the proper degree, were considered false negatives. READ’s performance was similar for both normalization approaches (median and cross-validation), so we present the results of using the median in the main text and the cross-validation results in the Supplementary figures (S1 and S2 Figs). 
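The simulation steps described above (sub-sampling SNPs, picking one allele per site, and introducing allelic errors with Gaussian per site rates) can be sketched as below. This is an illustrative reconstruction, not the authors' scripts: all function names and data layouts are hypothetical, and clamping the drawn rates to the interval [0, 1] is an assumption the text does not mention.

```python
import random


def subsample_sites(n_total, n_keep, rng):
    """Pick n_keep distinct site indices out of n_total, in genomic order."""
    return sorted(rng.sample(range(n_total), n_keep))


def pseudo_haploidize(genotype, rng):
    """Randomly pick one of the two alleles of a diploid genotype."""
    return rng.choice(genotype)


def draw_site_error_rates(n_sites, mean_rate, rng, sd=0.01):
    """Per site error rates from a Gaussian, clamped to [0, 1] (assumption)."""
    return [min(1.0, max(0.0, rng.gauss(mean_rate, sd))) for _ in range(n_sites)]


def add_allelic_errors(alleles, site_alleles, error_rates, rng):
    """Flip each sampled allele to the site's other allele with its error rate.

    `site_alleles` holds the two alleles of each biallelic SNP as a pair.
    """
    noisy = []
    for allele, (first, second), rate in zip(alleles, site_alleles, error_rates):
        if rng.random() < rate:
            allele = second if allele == first else first
        noisy.append(allele)
    return noisy


# Usage: 10 of 100 sites, one allele per site, 5% average error rate
rng = random.Random(1)
kept = subsample_sites(100, 10, rng)
site_alleles = [("A", "G")] * len(kept)
sampled = [pseudo_haploidize(("A", "G"), rng) for _ in kept]
rates = draw_site_error_rates(len(kept), 0.05, rng)
noisy = add_allelic_errors(sampled, site_alleles, rates, rng)
```

Passing a seeded `random.Random` instance keeps each simulated data set reproducible.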
The cross-validation approach would require large sample sizes per population, which are not reached in most ancient DNA studies (see the empirical data set below for an example). In addition to the modern data, published ancient data was obtained from the study of Mathieson et al. (2015) [35]. The data set consisted of 230 ancient Europeans from a number of publications [28, 30–33, 76] as well as new individuals from various time periods during the last 8,500 years, with haploid data for up to 1,209,114 SNPs per individual. We extracted only autosomal data for all individuals and applied READ separately to each cultural or geographical group (as defined in the original data set of Mathieson et al. (2015) [35]) with more than four individuals. Shotgun sequencing data was also analyzed separately from SNP capture data to avoid batch effects. The median of all average P0s per group was used for normalization, assuming that this would represent an unrelated pair. Mathieson et al. (2015) [35] report nine pairs of related individuals and infer all of them to be first degree relatives without providing details on how they were classified. Y-chromosome haplotypes of the five individuals shown in Fig 6 were checked using samtools [77] (applying a minimum mapping and base quality of 30) and marker information for the haplotypes R1a and R1b from the International Society of Genetic Genealogy (http://www.isogg.org, accessed January 16, 2017). The results are shown in S1 Table.