  • Introduction to the Rasch model: Some variables can be measured directly (e.g., height and weight); others are measured indirectly through how they manifest (e.g., disability, cognitive function, quality of life). We therefore need a method to transform the manifestations of these “latent” variables into numbers that can be taken as measurements [14]. Rating scales measure latent variables through a set of items, each of which has two or more ordered response categories that are assigned sequential integer scores. Rating scales are usually analyzed with Classical Test Theory, whereby the item scores are summed to give a total score. However, this simple and natural approach has two main limitations [13]. First, scoring the items with sequential integers implies equal differences at the item level (differences between consecutive response categories are assumed to be equal) and at the summed-score level (a change of one point is assumed to represent the same change across the whole range of the scale, whichever item the change concerns). Consequently, such ordinal scores cannot provide a stable frame of reference for the distance between individuals on the ability scale. Second, in Classical Test Theory the latent trait of interest is estimated by a summed score that is difficult to relate to each single item in order to know what an individual can actually perform: individuals with the same summed score may not be able to achieve the same item task. To establish a reliable rating scale, the information on the relative difficulties of the items, which is lost in the summed score, must be taken into account.
As a main alternative to overcome these limitations, Item Response Theory assumes that the probability of a specified score of a person on an item is a function of the person’s ability and the item difficulty [15]:

$$\Pr(X_{ni} = x) = f(\beta_n, \tau_{ki}),$$

where $X_{ni} \in \{0, 1, \dots, m_i\}$ is the integer-valued random score of person $n$ on item $i$, $m_i$ is the maximum score of item $i$, $\beta_n$ is the ability parameter of person $n$ and $\tau_{ki}$ is the difficulty of obtaining score $k$ on item $i$. The higher the person’s ability and the lower the item difficulty, the higher the probability of a high score on that item. The Rasch model is a particular case of Item Response Theory and can be viewed as applying a transformation to the total scores [16]. This has two advantages. First, the Rasch transformation preserves the order of the raw scores while also allowing the distance between individuals to be assessed, not only their rank ordering. Second, the item difficulty and the person ability are defined on the same scale; if a person’s ability is known, we can predict how that person is likely to perform on an item.

The Rasch model has several forms and extensions according to the data. The simplest form is the dichotomous Rasch model, which corresponds to the situation where items have only two response categories (0 and 1). Specifically, the probability of a correct response is modeled as a logistic function of the difference between the person and item parameters:

$$\Pr(X_{ni} = 1) = \frac{\exp(\beta_n - \tau_{1i})}{1 + \exp(\beta_n - \tau_{1i})}.$$

It follows that when the person’s ability equals the item difficulty ($\beta_n = \tau_{1i}$), the probability of score 1 on item $i$ is 0.5. The polytomous Rasch model is a generalization of the dichotomous Rasch model [17]. Here, we consider specifically the Partial Credit model, which allows different difficulty parameters for different items [14]:

$$\Pr(X_{ni} = x) = \frac{\exp \sum_{k=0}^{x} (\beta_n - \tau_{ki})}{\sum_{j=0}^{m_i} \exp \sum_{k=0}^{j} (\beta_n - \tau_{ki})}.$$
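As an illustration of the Partial Credit formula, the following minimal R sketch computes the category probabilities of one person on one item. The function name `pcm_prob` and the threshold values are hypothetical, and the sketch assumes the usual convention that the k = 0 term of the inner sum is zero (any other constant would cancel between numerator and denominator).

```r
# Partial Credit model: probabilities of scores 0..m for one person on one item.
# beta: person ability; tau: step difficulties tau_1, ..., tau_m of the item.
# Assumption: the k = 0 term of the inner sum is taken as 0.
pcm_prob <- function(beta, tau) {
  # cumulative sums of (beta - tau_k) for x = 0, 1, ..., m
  eta <- c(0, cumsum(beta - tau))
  # subtract the maximum before exponentiating for numerical stability
  w <- exp(eta - max(eta))
  w / sum(w)
}

# SNP-like item with three categories (0, 1, 2) and hypothetical thresholds
p <- pcm_prob(beta = 0.5, tau = c(-1, 1))
print(round(p, 3))

# With a single threshold the model reduces to the dichotomous Rasch model:
# both categories have probability 0.5 when beta equals tau
p_dich <- pcm_prob(beta = 0.5, tau = 0.5)
print(p_dich)
```

With a single threshold the function returns the dichotomous Rasch probabilities, which gives a quick consistency check between the two formulas.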
The Rasch model is based on four assumptions: 1) there is only one latent variable of interest in the model, which is the focus of the measurement, and all items tap into this latent variable; 2) the total score over an item or over a person contains sufficient information for estimating the parameters of the model; 3) for a given person, the responses to different items are independent; 4) the relationship between the probability of a given score on an item $i$ and the latent trait is described by a logistic curve. Based on these assumptions, the item difficulty parameters ($\tau_{ki}$) can be estimated by Conditional Maximum Likelihood; the person ability parameters ($\beta_n$) can then be estimated by Maximum Likelihood.

Application of the Rasch model to multi-marker genetic association: The Rasch model is a measurement model with potential application in any context where the objective is to measure a trait or ability through a process in which responses to items are scored with successive integers. For bi-allelic SNPs with possible alleles a and A, a set of SNPs can be considered as a set of items with possible categories 0 (= aa), 1 (= aA or Aa) or 2 (= AA), assuming an additive effect, which is a reasonable hypothesis for complex traits. The set can then be analyzed with the polytomous Rasch model in order to summarize the information into one score, which corresponds to the person’s ability parameter defined previously. In summary, our approach takes the genotypes of a set of SNPs as input and applies the Rasch model to calculate one multi-marker Rasch genetic score per subject. Once this score is estimated for each subject, its association with a trait of interest can be assessed with classical statistical models chosen according to the trait (linear for quantitative traits, logistic for binary traits), with the possibility of adjusting for covariates such as population stratification or gender.
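The covariate adjustment mentioned above can be sketched in base R as follows. All inputs are simulated and hypothetical: `score` stands in for a Rasch genetic score that has already been estimated, and `sex` and `pc1` stand in for covariates such as gender and a population stratification component.

```r
set.seed(42)
n <- 200

# Hypothetical inputs: 'score' stands in for the Rasch genetic score,
# 'sex' and 'pc1' for covariates (gender, a stratification component).
score <- rnorm(n)
sex <- rbinom(n, 1, 0.5)
pc1 <- rnorm(n)
trait_bin <- rbinom(n, 1, plogis(-0.2 + 0.8 * score))
trait_quant <- 1 + 0.5 * score + rnorm(n)

# Binary trait: logistic regression adjusted for the covariates
fit_bin <- glm(trait_bin ~ score + sex + pc1, family = "binomial")
p_bin <- summary(fit_bin)$coefficients["score", "Pr(>|z|)"]

# Quantitative trait: linear regression adjusted for the covariates
fit_quant <- lm(trait_quant ~ score + sex + pc1)
p_quant <- summary(fit_quant)$coefficients["score", "Pr(>|t|)"]

print(c(binary = p_bin, quantitative = p_quant))
```

The p-value attached to the `score` coefficient is the quantity of interest; the covariates only enter as adjustment terms.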
Several software packages and R packages are available for Rasch model analysis, such as ConQuest (https://shop.acer.edu.au/group/CON3), RUMM (www.rummlab.com.au), ltm (cran.r-project.org/package=ltm) and eRm (cran.r-project.org/package=eRm). Considering its flexibility and ease of integration into an analysis pipeline, we chose to use the eRm R package. The following short R script provides the functions used to obtain the multi-marker Rasch genetic score for each subject of a dataset of interest, where ‘Geno’ is a data matrix of genotypes coded by 0, 1 and 2, with subjects in rows and markers in columns:

    > library(eRm)
    > rasch.model = PCM(Geno)
    > score = person.parameter(rasch.model)$theta.table[, 1]

If ‘Trait’ is a binary disease trait coded by 1 for cases and 0 for controls, the association of the multi-marker Rasch genetic score with the disease can then simply be assessed with a logistic model:

    > glm(Trait ~ score, family = "binomial")

If ‘Trait’ is a quantitative trait, the association of the multi-marker Rasch genetic score with the trait can then simply be assessed with a linear model:

    > lm(Trait ~ score)

The performance of our Rasch-based multi-marker genetic association test is first evaluated in terms of false-positive rate and power, using simulations over three scenarios of dependence between SNPs and varying levels of association. For each scenario, we consider:
– a binary disease trait (500 cases and 500 controls) of prevalence $K_p = 0.05$;
– a set of 24 SNPs including 12 disease susceptibility loci (DSL) simulated with relative risks ranging from 1 (no association) to 2 (strong association).
This simulation framework, detailed hereafter, follows principles widely used previously [18–22].

Scenario 1: SNPs are independent: The simulation model for one SNP is based on Wright’s model [23] applied to a bi-allelic marker with alleles a and A having the frequencies $p_a$ and $p_A = 1 - p_a$.
$p_0$, $p_1$ and $p_2$ are the frequencies of genotypes aa, aA/Aa and AA, defined by the following proportions, which reduce to the Hardy-Weinberg proportions when $F = 0$:

$$\begin{cases} p_0 = p_a^2 + F p_a (1 - p_a) \\ p_1 = 2 p_a (1 - p_a) - 2 F p_a (1 - p_a) \\ p_2 = (1 - p_a)^2 + F p_a (1 - p_a) \end{cases}$$

where $F$ is the consanguinity coefficient. This coefficient can indicate a deficit ($F > 0$) or conversely an excess ($F < 0$) of heterozygotes. Here, we consider $F = 0$, so that the locus is under Hardy-Weinberg equilibrium. We then want to compute the genotype frequencies of the SNP for cases and controls, $p_{Di}$ and $p_{Hi}$ where $i$ = 0, 1 or 2, using the disease prevalence $K_p$, the penetrances $f_0$, $f_1$ and $f_2$ of the genotypes and the mode of inheritance. The main modes of inheritance can be defined by considering the relative risks $RR_{i/0} = RR_i = f_i / f_0$, $i$ = 1, 2. By assuming an additive mode of inheritance ($RR_1 = \frac{RR_2 + 1}{2}$), and using $f_0 = K_p / (p_0 + RR_1 \times p_1 + RR_2 \times p_2)$, $f_1 = RR_1 \times f_0$, $f_2 = RR_2 \times f_0$ and Bayes’ formula, we can easily derive the desired frequencies:

$$\begin{cases} (p_{D0}, p_{D1}, p_{D2}) = \left( \dfrac{f_0 \times p_0}{K_p}, \dfrac{f_1 \times p_1}{K_p}, \dfrac{f_2 \times p_2}{K_p} \right) \\ (p_{H0}, p_{H1}, p_{H2}) = \left( \dfrac{(1 - f_0) \times p_0}{1 - K_p}, \dfrac{(1 - f_1) \times p_1}{1 - K_p}, \dfrac{(1 - f_2) \times p_2}{1 - K_p} \right) \end{cases}$$

The 24 SNPs are simulated independently according to this model: the 12 non-associated SNPs with a relative risk of 1 and the 12 DSLs with a relative risk ranging from 1 to 2.

Scenario 2: SNPs in moderate Linkage Disequilibrium: To account for SNPs in Linkage Disequilibrium (LD), our simulation model follows an approach based on the diplotype frequencies of real datasets. These frequencies are used as an empirical distribution of the range of possible diplotypes. First, 12 DSLs are simulated independently from the model described in Scenario 1. Then the remaining SNPs are completed based on a real dataset (here chromosome 6 of the ADNI dataset described below) in order to generate one LD block of moderate magnitude (0.4–0.7) around each DSL. Simulating in this way leads to genetic patterns similar to those found in real data and allows us to finely control the level of LD between SNPs.
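The single-SNP model of Scenario 1 can be sketched in base R as follows. This is only an illustration under stated assumptions: the function name `geno_freqs` and the allele frequency 0.7 are hypothetical, F is set to 0 as in the text, and the control frequencies are normalized by 1 − Kp (the probability of being unaffected) so that they sum to 1.

```r
# Genotype frequencies in cases and controls for one SNP (Scenario 1 sketch).
# pa: frequency of allele a; rr2: relative risk of genotype AA versus aa;
# kp: disease prevalence. Hardy-Weinberg equilibrium (F = 0) is assumed.
geno_freqs <- function(pa, rr2, kp = 0.05) {
  p <- c(pa^2, 2 * pa * (1 - pa), (1 - pa)^2)  # genotypes aa, aA/Aa, AA
  rr <- c(1, (rr2 + 1) / 2, rr2)               # additive mode of inheritance
  f0 <- kp / sum(rr * p)                       # baseline penetrance f0
  pen <- rr * f0                               # penetrances f0, f1, f2
  list(cases = pen * p / kp,
       # controls are normalized by 1 - kp so that the frequencies sum to 1
       controls = (1 - pen) * p / (1 - kp))
}

fr <- geno_freqs(pa = 0.7, rr2 = 1.5)  # hypothetical allele frequency
print(fr)

# Draw genotypes (0, 1, 2) for 500 cases and 500 controls
set.seed(1)
geno_cases <- sample(0:2, 500, replace = TRUE, prob = fr$cases)
geno_controls <- sample(0:2, 500, replace = TRUE, prob = fr$controls)
print(table(geno_cases))
print(table(geno_controls))
```

Setting `rr2 = 1` gives identical frequencies in cases and controls, which corresponds to the null hypothesis of no association.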
Scenario 3: SNPs in strong Linkage Disequilibrium: The simulation is the same as for Scenario 2, except that we consider SNPs in strong LD (0.8–1).

Monte-Carlo estimation of false-positive rate and power: For each scenario and each level of DSL relative risk, we ran B = 1000 simulations in order to provide accurate Monte-Carlo estimates of false-positive rate and power. For each simulation, we obtain a p-value of association for the simulated set of SNPs by applying our Rasch-based multi-marker association test. The false-positive rate is estimated by $\Pr_{H_0}(\text{p-value} \le \alpha)$ and the power is estimated by $\Pr_{H_1}(\text{p-value} \le \alpha)$, with $\alpha$ the significance level, usually set to 5%. Consequently, by placing ourselves under the null hypothesis $H_0$ of no association ($RR_2 = 1$), then under the alternative hypothesis $H_1$ of association ($RR_2 > 1$), we can estimate the false-positive rate and the power of our method, respectively, with the same quantity:

$$\frac{\#(\text{p-value}_i \le \alpha,\ i = 1, \dots, B)}{B},$$

where $\#(\cdot)$ represents the number of p-values less than or equal to $\alpha$.

We compared the performance of our Rasch-based multi-marker association test to four existing methods:
– minP [9] is the simplest and most naive method. It takes the most significant p-value among the SNPs of the set as the p-value of the set. This method is obviously biased since it takes neither multiple testing nor the dependence between tests into account. It is used here as a negative control and also because it is nevertheless the most widely used approach in practice.
– GATES [24] is a multi-marker association test that applies an extended Simes procedure to the SNP-level p-values. The p-values computed by a standard linear trend test of association on each SNP are combined while accounting for the correlation structure: significant p-values in high LD count less than significant p-values of independent SNPs.
– Fisher [12] is the well-known Fisher’s combination of p-values.
For $m$ SNPs, the multi-marker test statistic is given by $T = -2 \sum_{i=1}^{m} \ln(p_i)$, which has a chi-square distribution with $2m$ degrees of freedom under the null hypothesis when the $m$ tests are independent. An adjustment for dependent tests is also available and is used here [25].
– SKAT [26] is the SNP-set Kernel Association Test. It aggregates the individual score test statistics of the SNPs in a set and efficiently computes the set-level p-value. It performs a multiple regression of the phenotype on all variants, computes p-values with the Davies method, can adjust for covariates to account for population stratification, and upweights rare variants.

Application to the Alzheimer ADNI GWAS data: Alzheimer’s disease (AD) is the most common neurodegenerative disorder and affects more than 35 million people worldwide. It is characterized by brain atrophy reflecting neuronal and synaptic loss and by the presence of amyloid plaques and neurofibrillary tangles, leading to a progressive deterioration of cognitive functions involving memory, reasoning, judgment and orientation [27]. AD pathogenic mechanisms are still unclear and the disease remains a condition without a cure. According to age at onset, two main types of AD are distinguished [28]: Early-Onset AD (EOAD), which generally appears before the age of 65, accounts for less than 10% of the AD population and has clear genetic determinants, with mutations found in the APP, PSEN1 and PSEN2 genes; and Late-Onset AD (LOAD), which generally appears after the age of 65, accounts for more than 90% of the AD population and has a complex etiology based on genetic and environmental factors. In recent years, several Genome-Wide Association Studies (GWAS) were performed to detect genetic loci associated with LOAD [29–31]. These studies support the hypothesis that APOE is a major susceptibility gene for LOAD [32]. In addition to APOE, markers within several other genes gave replicated evidence of association with LOAD [33]. The identification of these genes improves our knowledge of AD.
For instance, CR1 has been shown to produce a protein that is up-regulated in AD [34]. Although these new loci have been found, some problems remain unsolved. First, to date none of these loci has proven accurate or sensitive enough to serve as a biomarker. Second, the replication of results is a tedious task in GWAS. To push the boundaries of current knowledge on AD, further studies on GWAS and statistical models are still necessary.

By way of illustration, we applied our Rasch-based multi-marker association test to the genes of the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu) [31]. The study population is made up of 359 cases and 226 controls, genotyped with an Illumina Human 610-Quad array (620901 SNPs). A standard quality control process based on minor allele frequency, Hardy-Weinberg equilibrium, missingness and relatedness excluded 31 cases, 49 controls and 82071 SNPs [35]. The dataset was also reduced, with minimal loss of information, by pruning with Plink (window size of 50 SNPs, shift of 5 SNPs at each step and threshold correlation coefficient of 0.2) [36]. Missing genotypes were imputed with a weighted k-Nearest-Neighbors method [37]. SNPs are considered attached to a gene if they are located within a distance of 20 kb around it. The curated dataset to analyze comprises 16514 genes. For each gene and each subject, a Rasch-based multi-marker genetic score is computed, and the association of this score with the disease is evaluated by a logistic regression model.

The top genes identified by the Rasch analysis were integrated into a hypothetical signalling network. Protein-protein interaction data and functional findings were extracted from QIAGEN’s Ingenuity Pathway Analysis (IPA, QIAGEN Redwood City, www.qiagen.com/ingenuity), manually analysed and supplemented by literature curation.
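The rule used above to attach SNPs to genes (a SNP belongs to a gene if it lies within 20 kb of it) can be sketched in base R as follows. The annotation tables, gene names, SNP identifiers and positions are all made up for illustration; this is not the annotation actually used in the study.

```r
# Hypothetical annotation tables (names and positions are made up)
genes <- data.frame(gene = c("G1", "G2"), chr = c(6, 6),
                    start = c(100000, 150000), end = c(120000, 160000),
                    stringsAsFactors = FALSE)
snps <- data.frame(snp = c("rs_a", "rs_b", "rs_c", "rs_d"),
                   chr = c(6, 6, 6, 6),
                   pos = c(85000, 125000, 135000, 200000),
                   stringsAsFactors = FALSE)

window <- 20000  # 20 kb on each side of the gene

# For each gene, keep the SNPs on the same chromosome whose position falls
# inside the gene boundaries extended by the window
snps_by_gene <- lapply(seq_len(nrow(genes)), function(g) {
  in_window <- snps$chr == genes$chr[g] &
    snps$pos >= genes$start[g] - window &
    snps$pos <= genes$end[g] + window
  snps$snp[in_window]
})
names(snps_by_gene) <- genes$gene
print(snps_by_gene)
```

A SNP can fall in the window of more than one gene, as `rs_c` does here; in that case it contributes to the score of each of those genes.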