nif:isString
|
-
Data were directly downloaded from published studies and no additional ethics approval was needed. Each study is referenced and details on ethics approval are available in each manuscript.
A study by Alcamo et al. [11] mapped Satb2 expression in developing cortex and showed that Satb2 mutant mice display altered expression of 28 genes associated with axon projection, including BCL11B at E18.5. A more recent study by McKenna et al. [24] performed RNA-seq analysis of cortices at postnatal day P0 to study differentially expressed genes (DEGs) between wild type and Satb2-deficient mice. This led to the identification of 74 DEGs in the deep layers and 15 DEGs in the upper layers of Satb2-deficient cortices. The list of genes from these two studies (n = 117) was increased using data from other studies of Satb2 mouse models [2,15,38]. We also included in this set genes considered to be vital components of the NuRD complex [25] as it has been previously shown to facilitate Satb2-mediated repression of Bcl11b during development [5,12]. Altogether, following conversion of murine gene IDs to orthologous human gene IDs, a total of 127 genes (including SATB2, BCL11B and GATAD2A) were included in this first gene-set named SATB2+NuRD (S1 Table). The second gene-set was generated using a dataset of 1,341 ChIP-seq peaks (GEO accession: GSE68910) that map binding sites of SATB2 in cortices of wild type mice at E15.5 [24]. ChIP-seq reads were mapped against the mouse NCBI37/mm9 assembly. Functional annotation tool GREAT (http://bejerano.stanford.edu/great/public/html/index.php) was used to associate both proximal and distal input ChIP-seq peaks with their putative target genes and thereby identify genes that may be regulated by SATB2 [58]. We used the default basal plus extension approach within GREAT where each gene in the genome is assigned a basal regulatory domain of a minimum distance of 5kb upstream and 1kb downstream of the transcription start site of the canonical isoform of the gene (regardless of other nearby genes). The gene regulatory domain is extended in both directions to the nearest gene’s basal domain but no more than the maximum extension of 1,000kb in one direction. In addition, GREAT utilizes a set of literature curated regulatory domains that extend the regulatory domain for each gene to include its known regulatory element. GREAT mapped 1,341 ChIP-seq peaks to 1,800 unique gene IDs. For only 144 of these genes, the peak was located 5kb upstream or 1kb downstream of the gene. Given the large default extension region applied in GREAT, this may have led to a number of spurious results. We filtered the peaks mapping to the remaining 1,656 genes by overlapping them with defined enhancers from ENCODE (http://chromosome.sdsc.edu/mouse) to provide extra support for a potential regulatory role. A total of 452 peaks overlapped with mouse brain-specific enhancers (E14.5) and were mapped back to 712 of the 1,656 genes. This resulted in a final set of 856 mouse genes where a SATB2 ChIP-seq peak maps to regulatory regions of those genes. The Ensembl data-mining tool BioMart (http://www.ensembl.org/index.html) was then used to convert these mouse gene IDs to human gene IDs, which resulted in a final set of 778 human genes. This second gene-set was named SATB2_Cort (S2 Table). The third gene-set was generated using a dataset of 5,027 ChIP-seq peaks (GEO accession: GSE GSE77005) that map binding sites of Satb2 in primary hippocampal cell cultures from wild type mice at postnatal day P0 to P1 [13]. This dataset represents the high-confidence peak list derived from two independent biological ChIP-seq replicates by using the MAnorm to filter out the inconsistent peaks (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3439967/). We used these ChIP peaks to identify 4,138 human gene targets using the same procedure as mentioned above. This third gene-set was named SATB2_Hipp (S3 Table). There is a considerable difference in gene number between the SATB2_Cort and SATB2_Hipp gene-sets. Factors contributing to this difference are likely to include the different functions of SATB2 in the pre- and post-natal brain and that the ChIP-Seq data has been generated from different brain regions (cortex v hippocampus). In addition, the SATB2 ChIP-Seq data was generated under different experimental conditions (tissue v primary neuronal cultures) including use of different antibodies (anti-SATB2 v anti V5-tag antibody (ChIP-grade)). These details are supplied in S8 Table. Sets of ‘brain-expressed’ genes (n = 14,243) and ‘brain-elevated’ genes (n = 1,424) were sourced from the Human Protein Atlas (https://www.proteinatlas.org/humanproteome/brain) and used as covariates in the GSA. Brain-elevated genes are those that show an elevated expression in brain compared to other tissue types.
Summary statistics from the most recent SZ GWAS [21] were obtained from the Walters group data repository on the MRC Centre for Neuropsychiatric Genetics and Genomics website (http://walters.psycm.cf.ac.uk/). This study included data on 40,675 cases and 64,643 controls. Summary statistics from the most recent EA GWAS [23] were obtained from the Social Science Genetic Association Consortium (SSDAG) website (http://ssgac.org/Data.php, Summary data file: EduYears_Main.txt—discovery and replication cohorts except 23andMe). This study reported results for 328,917 individuals. Summary statistics from a GWAS of hippocampal volume (n = 33,536; [32]) and a second GWAS of intracranial volume (n = 32,438 [29]) were obtained from the ENIGMA Consortium website (http://enigma.ini.usc.edu/). GWAS summary statistics were sourced for AD [59], ADHD (https://www.biorxiv.org/content/early/2017/06/03/145581), ASD (https://www.biorxiv.org/content/early/2017/11/27/224774), BPD [60], CAD [61], CD [62], OCD [63], STR [64], T2D [65] and UC [66].
A gene-set analysis (GSA) is a statistical method for simultaneously analysing multiple genetic markers in order to determine their joint effect. We performed GSA using MAGMA [26](http://ctg.cncr.nl/software/magma) and summary statistics from various GWAS. An analysis involved three steps. First, in the annotation step we mapped SNPs with available GWAS results on to genes (GRCh37/hg19 start-stop coordinates +/-20kb). Second, in the gene analysis step we computed gene P values for each GWAS dataset. This gene analysis is based on a multiple linear principal components regression model that accounts for linkage disequilibrium (LD) between SNPs. The European panel of the 1000 Genomes data was used as a reference panel for LD. Third, a competitive GSA based on the gene P values, also using a regression structure, was used to test if the genes in a gene-set were more strongly associated with either phenotype than other genes in the genome. The MHC region is strongly associated in the SZ GWAS data. This region contains high LD and the association signal has been attributed to just a small number of independent variants [67]. However, MAGMA still identifies a very large number of associated genes despite factoring in the LD information. Of 278 genes that map to chromosome 6 (25-35Mb), 130 genes were associated with SZ in our MAGMA analysis. To avoid the excessive number of associated genes biasing the MAGMA GSA, we excluded all genes within the MHC region from our GSA of SZ. MAGMA was chosen because it corrects for LD, gene size and gene density (potential confounders) and has significantly more power than other GSA tools [68]. Numerical data used for all figures displaying MAGMA results are provided in S9 Table.
Analysis of gene-sets using rare variant data: A list of genes harbouring de novo variants identified in patients with SZ, autism spectrum disorder (ASD), intellectual disability (ID) and in unaffected siblings and controls were sourced from Fromer et al. [33]. We used the categories of variant as defined in that study (all, loss-of-function (LoF), non-synonymous (NS) and silent; gene number within each group is detailed in Table 1). We sourced a list primary ID genes (n = 960) from the curated SysID database of ID genes (http://sysid.cmbi.umcn.nl/) [69]. From an exome sequencing of 12,332 unrelated Swedish individuals (4,946 individuals with SZ), we sourced a list of 42 genes that had a significant excess of disruptive and damaging ultra-rare variants (dURVs) in SZ cases compared to controls [34]. We performed enrichment analysis of these gene lists with our gene-sets using 2x2 contingency tables with genes restricted to those annotated as protein coding using a background set of 19,424 genes (https://www.ncbi.nlm.nih.gov/). Bonferroni multiple test correction was performed separately for the tests of de novo variant genes (n = 48 tests), for the tests of SysID genes (n = 3) and for the tests of dURVs in SZ genes (n = 3).
Gene expression datasets and potentially synaptic genes: Human brain expression data from the Protein Atlas (http://www.Proteinatlas.org/humanproteome/brain/) was used to filter the SATB2_Hipp gene-set to only include genes expressed in the brain. This dataset included 14,540 genes expressed in, but not unique to, the human brain. For filtering SATB2 gene-sets to include only neuron-expressed genes, we used an RNA-Seq transcriptome and splicing database of glia, neurons, and vascular cells of the cerebral cortex [70]. We used RNA-Seq data from mouse neurons (https://web.stanford.edu/group/barreslab/brainrnaseq.html) and separated genes into three categories; low, medium and high expressed. Low expressed genes were those with Fragments Per Kilobase of transcript per Million mapped reads (FPKM) values <2.0 (n = 12,161 genes). The median FPKM value for the remaining genes was 9.6, hence that was used to categorize medium and high expressed genes; medium (FPKM = 2.0–9.6; n = 5,107 genes) and high (FPKM>9.6; n = 5,189 genes). Mouse gene IDs were converted to human gene IDs using BioMart. For analysis of the SATB2_Hipp gene-set, we used expression data from the hippocampus from pcw 37 to 1 year (n = 4 samples) from the Brainspan Atlas of the Developing Human Brain (http://www.brainspan.org/). We calculated mean expression values and categorised genes as low (FPKM<2.0; n = 9,931), medium (FPKM = 2.0–7.45; n = 5,619) and high expressed (FPKM>7.45; n = 5,842 genes). We followed a method previously outlined [34] to identify potentially synaptic genes.
ConsensusPathDB-human (http://cpdb.molgen.mpg.de/) was used to perform overrepresentation analysis of gene-sets and we report on enriched gene ontology-based sets[71].
|