nif:isString
|
-
All patients or their representatives, if participants were under age 18, and included relatives, gave their informed consent to this study. All procedures were in accordance with the Helsinki declaration and approved by the local ethics committees/internal review boards of the participating centers. The leading institution was the Ethics Commission of the University and the University Clinic of Tübingen. GGE cohort: This cohort included 196 subjects with genetic generalized epilepsy. All subjects were of European descent (Italian 81, German 54, Finnish 22, Dutch 11, British 9, Danish 8, Turkish 6, Swedish 3, French 1, Greek 1). The cohort included 117 female subjects (60%). The GGE-diagnoses were childhood absence epilepsy (CAE; n = 94), juvenile absence epilepsy (JAE; 21), juvenile myoclonic epilepsy (JME; 47), genetic generalized epilepsy with generalized tonic-clonic seizures (EGTCS, 27), early-onset absence epilepsy (EOAE, 4), epilepsy with myoclonic absences (EMA, 1), and unclassified GGE (2). Age of epilepsy onset ranged from 1 year to 38 years with a median of 8 years. The majority of subjects derived from multiplex families with at least 2 affected family members (n = 183), thereof 90 families with 3 or more affected members. RE cohort: This cohort included 204 unrelated Rolandic patients of European ancestry which were recruited from centers in Austria (n = 107), Germany (n = 84), and Canada (n = 13). Control cohort: We used 445 females and 283 males (728 in total) from the Rotterdam Study as population control subjects [22]. The same cohort was recently used for the screening of 18 GABAA-receptor genes in RE and related syndromes [23].
Our primary analysis workflow included three major steps as shown in Fig 1. These are 1) data pre-processing, 2) SNV/INDEL analysis and 3) copy number variant analysis.
Figure data removed from full text. Figure identifier and caption: 10.1371/journal.pone.0202022.g001 Flowchart of the analysis steps.Parameters used in each step are described in detail in the methods section. Data pre-processing: Sequencing adapters were removed from the FASTQ files with cutadapt [24] and sickle [25]. GATK best practices were followed for the next steps of data pre-processing and variant calling [26]. Alignment to the GRCh37 human reference genome was performed using BWA-MEM [27] with default parameters. Conversion of SAM to BAM files was done with SAMtools [28]. Sorting of BAM files, marking of duplicate reads due to PCR amplification and addition of read group information were done using Picard (https://github.com/broadinstitute/picard) tools with default parameters. Base quality score recalibration and local realignment for INDELs was performed using GATK version 3.2. Coverage: Mean depth of coverage and target coverage of exons were calculated from the BAM files using the depth of coverage tool from GATK. The same files were also used as input for calling of CNVs. Variant calling: The GATK haplotype caller (version 3.2) was chosen to perform multiple sample variant calling and genotyping with default parameters. To include splice site variants in the flanking regions of the exons, exonic intervals were extended by 100 bp each upstream and downstream. Multiple sample calling is advantageous in deciding whether a variant can be identified confidently as it provides the genotype for every sample. It allows filtering variants based on the rate of missing genotypes across all samples and also according to the individual genotype. Sample QC: Samples were excluded from the analysis based on the following criteria: 1) Samples with a mean depth <30x or <70% of exon targets covered at <20x were excluded from further analysis; 2) samples with >3 standard deviations from mean in number of alternate alleles, number of heterozygotes, transition/transversion ratio, number of singletons and call rate as calculated with the PLINK/SEQ i-stats tool (https://atgu.mgh.harvard.edu/plinkseq/); 3) call rate <97%; 4) ethnically unmatched samples as identified by multi-dimensional scaling analysis with PLINK version 1.9 [29]; 5) PI_HAT score>0.25 as calculated by PLINK version 1.9 to exclude related individuals. Variant QC: Initial filtering of variants was performed based on quality metrics over all the samples with the following parameters for VQSR: Tranches chosen, VQSRTrancheSNV99.90to100.00. QC over all samples (INFO column) was done as follows: a) for SNVs, variants were filtered for QD < 2.0, FS > 60.0, MQ< 40.0, MQRankSum < –12.5, ReadPosRankSum < –8.0, DP <10.0, GQ_MEAN < 20.0, VQSLOD < 0, more than 5% missingness, ABHet > 0.75 or < 0.25 and deviation from Hardy-Weinberg equilibrium (Phred scale p-value of > 20); b) for INDELs, the same was done as for SNVs except for the following parameters for variant filtration: QD <2.0, FS >200.0, ReadPosRankSum < –20.0, DP <10.0, GQ_MEAN <20.0, missingness <5%, Hardy-Weinberg Phred scale value of >20, VQSLOD >0. To further exclude low quality variants, we also applied filtering based on quality metrics for each genotype using read depth and quality of individual genotypes. Genotypes with a read depth of <10 and GQ of <20 were converted to missing by using BCFtools [28]. Multi-allelic variants were decomposed using variant-tests [30] and left-normalized using BCFtools [28]. Variant annotation: Variants were annotated with ANNOVAR [31] version 2015, Mar22 using RefSeq and Ensembl versions 20150322 and the dbNSFP [32] version 2.6 annotations including nine scores for missense mutations (SIFT, PolyPhen2 HDIV, PolyPhen2 HVAR, LRT, MutationTaster, MutationAssessor, FATHMM, MetaSVM, MetaLR), the CADD score, and three conservation-based scores from GERP++, PhyloP and SiPhy. Splicing variants were defined to include 2 bp before and after the exon boundary position. To obtain rare variants, we filtered the variants for a minor allele frequency (MAF) of <0.005 in public databases such as 1000 genomes [33], dbSNP [34], ExAC (release 0.3) and the exome variant server (EVS). We defined deleterious variants as those variants that fulfil any of the following three criteria: 1) all the variants except the synonymous variants predicted to be deleterious by at least 5 out of 8 missense prediction scores, CADD score >4.5, or 2 out of 3 conservation scores (GERP>3, PhyloP>0.95, SiPHy>10) show high conservation; 2) variants annotated as “splicing”, "stop gain" or "stop loss"; 3) any insertion or deletion. CNV detection: In the remaining high quality samples, CNVs were detected by using XHMM as described in [35]. In the current study, we focused only on deletions, as the false positive rate for duplications is too high to allow for meaningful interpretation. CNV calls were annotated using bedtools version 2.5 [36]. NCBI RefSeq (hg19, 20150322) was used to identify the genes that lie within the deletion boundaries. CNV filtering: The detected deletions were filtered based on the following criteria: 1) Z score < –3, given by XHMM; 2) Q_SOME score ≥ 60, given by XHMM.
Burden analysis of large and rare deletions: Excess deletion rate of the large deletions (length >400 kb) in subjects with epilepsy compared to the controls was measured as described in [13] using PLINK version 1.9 [29]. We set the overlap fraction to 0.7 (70%) and the internal allele frequency cut-off <0.5% and evaluated the significance empirically by 10,000 case-control label permutations.
The CNVs that are unique for cases (not present in any of the in-house controls) and occur at a low frequency, i.e., present in ≤2 independent cases, while having a frequency of ≤1% in the CNVmap, the DGV gold standard dataset [37] and 1000 genomes SV [38] were selected and subjected to further analysis as described below.
We proceeded by visual inspection of depth variation across exons of the filtered deletions; we also performed qPCR validations of three small deletions, two of which, NCAPD2 and CAPN1, stood the filtering procedure (see Table 1). For RE patients, genomic DNA samples were analysed using the Illumina OmniExpress Beadchip (Illumina, San Diego, CA, USA) [13]. Twenty-three of 60 CNVs present in the RE patients were validated by available array data (S1 Table). Generally, small CNVs cannot be reliably identified with SNP arrays [39]. Indeed, of the 37 CNVs that were not identified in the beadchip data, 23 have a size of <10 kb, whereas only 2 of the 23 validated CNVs have a size of less than 10 kb according to the array data.
Table data removed from full text. Table identifier and caption: 10.1371/journal.pone.0202022.t001 Epilepsy associated microdeletions. Compound heterozygous mutations and protein-protein interactions: We checked for concurrence of a deletion in one allele and a deleterious variant in the second allele. We included the first order interacting partners from the protein-protein interaction network (PPIN) in this analysis [40] and assessed if any gene or its first order interacting partner carries a deletion in one allele and a deleterious variant in the other. We excluded all genes that had no HGNC (HUGO Gene Nomenclature Committee) entry resulting in a network of 13,364 genes and 140,902 interactions. This network was then further filtered for interactions likely to occur in brain tissues using a curated data set of brain-expressed genes [41]. The final brain-specific PPIN consisted of 10,469 genes and 114,533 interactions.
Genes that were expressed in brain [42] and located within deletion boundaries were used as input for an enrichment analysis using the Ingenuity Pathway Analyser (IPA®) [43]. We performed the enrichment analysis with all deleted genes from the RE and GGE samples together as well as for each phenotype separately.
To assess whether the deleted set of genes were enriched in known epilepsy-associated genes, we retrieved genes that were associated with the disease term “epilepsy” from the DisGeNET database [44]. Then we compared the overlap between the brain-expressed genes that are deleted in RE (n = 85), GGE (n = 49) and RE+GGE (n = 134) against the brain-expressed epilepsy-related genes in DisGeNet (n = 674). We used the total number of brain-expressed genes (n = 14,177) as the background. The R GeneOverlap package (https://bioconductor.org/packages/release/bioc/html/GeneOverlap.html) was used to compute the p-value.
The CNV tolerance score was used as defined in [45]. The CNV tolerance and deletion scores for the genes that are deleted in our study were obtained from the ExAC database [46] and their enrichment in GGE and RE cases was assessed by the Wilcoxon rank sum test.
The overlap between the different data sets was obtained by gene symbol matches between the detected gene deletions and the gene lists from different databases; more details are given in the discussion section. A workflow depicting the steps above is shown in Fig 1.
|