We thank Scott I. Adamson, Brenton R. MaPSy data is available as Additional files 2 of this manuscript. JC implemented the software and analysed data. All authors read and approved the final manuscript. Correspondence to Julien Gagneur.
Cheng, J. MMSplice: modular modeling improves the predictions of genetic variant effects on splicing. Genome Biol 20, 48

The clinical files containing survival data were obtained from TCGA. Choosing the threshold that maximizes a test statistic avoids the choice of an arbitrary threshold, while increasing reliability. Minimum p-values were calculated for each gene and splice variant. The logrank test was performed on each possible split between high and low abundance within a threshold range of 0.
This method results in a distribution of p-values that is non-uniform and skewed to the left. To overcome the limitations of choosing a single threshold (an arbitrary threshold) and of minimum p-value methods (non-uniform p-values under the null hypothesis), we developed a new approach that yields null empirically estimated p-values (NEEP).
The method works by estimating the null distribution of the minimum p-value approach, so that p-values under the null hypothesis are uniformly distributed, allowing for p-value adjustment.
When using the logrank test with a single expression threshold t, the set of all possible p-values is finite and corresponds to all possible partitions of patients into the low and high expression groups. By extension, the null distribution of the minimum p-value method is defined as the entire discrete distribution of possible p-values across all values of t.
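The minimum p-value scan described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names, the chi-square (1 degree of freedom) form of the logrank test, and the default `lower=0.2` bound are assumptions, since the threshold range used in the study is not recoverable from the text.

```python
import math

def logrank_p(times, events, in_high):
    """Two-group logrank test p-value (chi-square approximation, 1 df)."""
    obs_minus_exp = 0.0
    variance = 0.0
    # Walk over the distinct times at which at least one death occurred.
    for t in sorted({x for x, e in zip(times, events) if e}):
        n = sum(1 for x in times if x >= t)  # at risk, both groups
        n1 = sum(1 for x, g in zip(times, in_high) if g and x >= t)  # at risk, high group
        d = sum(1 for x, e in zip(times, events) if e and x == t)  # deaths at t
        d1 = sum(1 for x, e, g in zip(times, events, in_high) if e and g and x == t)
        obs_minus_exp += d1 - d * n1 / n
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if variance == 0:
        return 1.0  # one group is empty at every event time
    chi2 = obs_minus_exp ** 2 / variance
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))

def min_p_value(expression, times, events, lower=0.2):
    """Smallest logrank p-value over all rank-based low/high splits.

    `lower` is a hypothetical fraction: splits are scanned only between the
    `lower` and `1 - lower` percentiles. Ties in expression are split by rank.
    """
    n = len(expression)
    order = sorted(range(n), key=lambda i: expression[i])
    first = max(1, math.ceil(lower * n))
    last = min(n - 1, math.floor((1 - lower) * n))
    best = 1.0
    for k in range(first, last + 1):
        high = set(order[k:])  # the n - k patients with the highest expression
        in_high = [i in high for i in range(n)]
        best = min(best, logrank_p(times, events, in_high))
    return best
```

Because the scan only depends on the ordering of patients by expression, two transcripts that rank patients identically receive the same minimum p-value.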
To determine a viable range of thresholds, we first chose minimum power and effect size. Given a sample size of patients with recorded death events and an alpha of 0. This corresponds to a minimum percentile threshold of 0. So, in this study we set l to 0. Considering the size of s, enumerating all possible p-values is computationally infeasible. Therefore, we built a null distribution with 1,, Monte Carlo simulations performed by randomly partitioning samples into the two groups using all possible thresholds and extracting the smallest logrank test p-value.
We estimated p-values as the fraction of simulated values less than or equal to the minimum p-values obtained from the actual data.
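The empirical estimation step can be illustrated with a short sketch. It assumes a caller-supplied `min_p_fn`, a hypothetical callable that returns the minimum logrank p-value for a given random ordering of patients; `build_null` and `empirical_p` are illustrative names, not functions from the published tool.

```python
import bisect
import random

def build_null(n_patients, min_p_fn, n_sim, seed=0):
    """Simulate the null distribution of the minimum p-value.

    Each simulation shuffles the patient order (a random partition into
    low/high groups at every threshold) and records the value returned by
    `min_p_fn`, a hypothetical callable supplied by the caller.
    """
    rng = random.Random(seed)
    order = list(range(n_patients))
    null = []
    for _ in range(n_sim):
        rng.shuffle(order)
        null.append(min_p_fn(list(order)))
    null.sort()  # sorted once so each lookup is a binary search
    return null

def empirical_p(observed_min_p, sorted_null):
    """Fraction of simulated minimum p-values <= the observed one."""
    return bisect.bisect_right(sorted_null, observed_min_p) / len(sorted_null)
```

Since the null depends only on the survival data and the threshold range, it can be built once and reused for every gene and splice variant. The smallest nonzero p-value this procedure can report is one over the number of simulations.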
As expected, the NEEP procedure yielded uniformly distributed p-values. Since NEEP produces a uniform distribution of p-values, the probability that two adjacent p-values are not equal in the sorted list is approximately one minus the ratio of samples to simulations. For simplicity, we rounded up to 1,, It is worth noting that individual expression values do not affect the KM survival curve as long as they result in the same split between high and low expression.
In our dataset, this criterion was met by 21, genes and 69, splice variants. While effect sizes of the logrank test are not as naturally interpretable as those of many other statistical tests, we calculated the hazard rate ratio of the high over low expression curves as well as the mortality rates at 1, 2, and 5 years. We measured robustness of the NEEP method by examining the sensitivity of significant splice variants to changes in the set of patients. Since the number of patients in each simulation is then compared to the actual values, we expect that any change in the adjusted NEEP p-value is due to the smaller sample size.
So we compared ranks instead of p-values to isolate the sensitivity of NEEP p-values to resampling. We compared ranks of the significant isoforms in the full dataset against their rank in the simulations. We utilized data across multiple granularities to construct plausible graphs for the association between splice variant expression and lung cancer survival.
The general structure of these multi-granular graphs (MGG) can be seen in Fig 1. The outcome of the workflow consists of quadripartite networks, connecting splice variants with their potential interaction partners via their domain-domain interaction, as variant-domain-domain-variant paths.
Of these interactions, , involved proteins with an active Ensembl ID. Because PPIs are reported at the gene product level, we expanded the interactions to include all possible variant-variant interactions 73,, Ghost domains are those that are present in survival-insignificant splice variants within the same gene, but are not found in the survival-significant splice variant.
Gained domains are not found in the survival-insignificant splice variants but are present in the survival-significant splice variants. By incorporating these gained and ghost domains directly into their PPI context, we can consider MGGs as potential interaction changes caused by domain exclusion or inclusion. Genes without multiple splice variants were removed from further analysis.
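The ghost and gained domain definitions reduce to set differences within a gene. The sketch below assumes a hypothetical mapping from variant identifiers to sets of domain identifiers; the function name and data layout are illustrative only.

```python
def ghost_and_gained(domains_by_variant, significant_variant):
    """Return (ghost, gained) domain sets for one gene.

    domains_by_variant: dict mapping each splice variant of the gene to the
    set of domains it contains (hypothetical input format).
    significant_variant: key of the survival-significant variant.
    """
    significant = domains_by_variant[significant_variant]
    # Union of domains over the survival-insignificant variants of the gene.
    others = set()
    for variant, domains in domains_by_variant.items():
        if variant != significant_variant:
            others |= domains
    ghost = others - significant   # present elsewhere, absent in the significant variant
    gained = significant - others  # present only in the significant variant
    return ghost, gained
```

A gene with a single variant yields two empty sets, which matches the removal of such genes from further analysis.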
We downloaded the 11, domain-domain interactions (DDI) from 3did [34], where inclusion requires a known protein-protein structure to support the interaction.
After removal of interactions without DDI support, interactions remained. Of these, only had gained or ghost domains. There were 50 genes with an isoform significant for survival for these interactions. We note that this tool may be used independently of NEEP. Mutation profiles for each patient were constructed as described, using the 6 substitution types along with their neighboring bases, for a total of 96 mutation types, and using only mutations found by all four variant callers.
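The count of 96 follows from 6 substitution types times 4 possible 5' bases times 4 possible 3' bases. The sketch below enumerates these types and tallies a patient profile. The pyrimidine-strand convention and the `A[C>T]G` label format are common practice and are assumptions here, as is the input format of `(trinucleotide context, alternate base)` pairs.

```python
from collections import Counter
from itertools import product

SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]
BASES = "ACGT"
# 6 substitutions x 4 upstream bases x 4 downstream bases = 96 types.
MUTATION_TYPES = [
    f"{five}[{sub}]{three}"
    for sub in SUBSTITUTIONS
    for five, three in product(BASES, repeat=2)
]
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def mutation_type(context, alt):
    """Label one mutation; `context` is the trinucleotide centred on the reference base."""
    if context[1] in "AG":
        # Report purine references on the opposite (pyrimidine) strand.
        context = context.translate(COMPLEMENT)[::-1]
        alt = alt.translate(COMPLEMENT)
    return f"{context[0]}[{context[1]}>{alt}]{context[2]}"

def mutation_profile(mutations):
    """Count vector over the 96 types for an iterable of (context, alt) pairs."""
    counts = Counter(mutation_type(context, alt) for context, alt in mutations)
    return [counts[t] for t in MUTATION_TYPES]
```

The resulting 96-element vector is the kind of input that signature-fitting tools decompose into per-signature contributions.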
We used MutationalPatterns [40] in R to find the linear contributions of each of the 30 signatures to a patient profile. To check for possible confounding factors, we considered three smoking variables reported by TCGA and checked whether they were different between the low and high RAD51C expression groups: cigarettes per day was tested using the Welch two sample t-test; years smoked was tested using the Wilcoxon rank sum test; and the binary smoker variable was tested using the two-sample test for equal proportions.
In addition, we checked if survival confounded the relationship between RAD51C and Signature 3. Because the causality between mutations and survival must be directional, we conducted Cox-PH survival analysis of RAD51C with and without the contribution of Signature 3 as a confounding variable.
The exact binomial test was used to determine if the proportion of splice variants significant for Signature 3 was greater than expected by chance.
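A one-sided exact binomial test of this kind sums the upper tail of the binomial distribution. The sketch below is a generic illustration. The function name and the example numbers are hypothetical, and the null proportion used in the study is not stated in the text.

```python
from math import comb

def binomial_test_greater(successes, trials, p_null):
    """P(X >= successes) for X ~ Binomial(trials, p_null)."""
    return sum(
        comb(trials, k) * p_null ** k * (1 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )

# Hypothetical example: 3 significant variants out of 3 tested, null proportion 0.5.
p_value = binomial_test_greater(3, 3, 0.5)
```

A small p-value indicates that the observed proportion of significant splice variants exceeds what the null proportion would produce by chance.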
Counts of genes belonging to each enrichment cluster for the gene, isoform (splice variant), and MGG granularities are displayed as bar lengths. A missing bar does not indicate no membership, only insignificant enrichment of any term in the cluster. The file does not contain the exact paths, only the members of each component of the MGG.

Abstract
Splice variants have been shown to play an important role in tumor initiation and progression and can serve as novel cancer biomarkers.

Author summary
In spite of many recent breakthroughs, there is still a pressing need for better ways to diagnose and treat cancer in ways that are specific to the unique biology of the disease.

Introduction
Large-scale cancer sequencing initiatives have opened up a window into the genome of individual cancers, offering unprecedented opportunities for studying the functional consequences of molecular alterations in human cancers [1].

Fig 1. Workflow for generating multi-granularity graphs (MGGs).

Results
NEEP identifies splice variants significantly associated with patient survival
One of the contributions of this work is the development of a statistically robust and computationally efficient method to identify optimal expression thresholds that yield minimum p-values when performing survival analysis over a large number of transcripts.

Case study of multi-granular graphs linked to DNA repair
After identifying splice variants significantly associated with survival, we followed the procedure described in Materials and Methods and summarized in Fig 1 to generate multi-granular graphs (MGGs).

RAD51C expression is linked to a characteristic mutational signature and lower patient survival
The biological models discussed above suggest a role played by these splice variants in DNA repair.

Table 1.
Table 2. Statistical test results comparing smoking variables to RAD51C expression.

Discussion
The goal of this work was to identify splice variants significantly associated with patient survival and provide possible mechanisms underpinning the associations.
Supporting information. Only exon 15 deletion c. However, none of them affected the recognition of exons 14 and Figure 2. Exon 15 ESE mapping: functional assay of c. B Capillary electrophoresis and sequence results of functional assays of microdeletion c. Potential splicing variants were selected following these criteria: creation or disruption of splice sites according to MES or NNSplice ; disruption of the branch point; disruption of the polypyrimidine tract; elimination of enhancers or creation of silencers.
Some of the selected variants had a combined effect; for example, they were predicted to simultaneously create an ESS and remove an ESE. According to their previous clinical classification, the selection contained 8 benign or likely benign variants, 30 VUS, and 15 pathogenic or likely pathogenic variants.
Table 1. Bioinformatics analysis of BRCA2 exons 14 and 15 selected variants. Bioinformatics indicated that 13 variants disrupted the natural splice sites, three decreased their scores (one disrupted the polypyrimidine tract), 11 created new splice sites, one decreased the branch point score HSF: Exceptionally, variants c. Thus, c. Although 53 variants were initially selected, mutagenesis experiments did not work for c.
The 52 mutant minigenes were checked by Sanger sequencing and assayed in MCF-7 cells. Among these 12 variants, there were 9 intronic, 2 missense and 1 nonsense changes. Functionally, the 9 intronic variants c. Table 2. Quantification of the transcripts found by capillary electrophoresis after functional assays of BRCA2 exons 14 and 15 variants.
Figure 3. A Exon 14 variants. B Exon 15 variants. On the left, capillary electropherograms are shown. Labeled transcripts are shown as blue peaks, LIZ was used as size standard orange peaks. The expected size of the full length transcript is nt — nt according to Peak Scanner. On the right, splicing patterns are represented. While blue boxes are natural exons, red boxes represent aberrant exons; dashed black and red lines show canonical and aberrant splicing events, respectively.
Curiously, while the main outcome of c. The loss of nt at the beginning of exon 14 would generate a PTC 27 codons downstream p. The branch point c. Other exon 15 acceptor variants, such as c. The use of this cryptic acceptor would provoke a frameshift deletion, leading to a PTC in the protein p. Variant c. In summary, we found 5 variants c. Remarkably, all of them showed the total absence of canonical transcript, except for c. Moreover, our results unveiled exon 14 and 15 cryptic splice sites that are only recognized when natural acceptors are disrupted.
Seven variants were predicted to disrupt donor sites: c. Among the exon 14 variants, only c. The E14q5 is an aberrant splicing isoform which leads to PTC p. Surprisingly, this cryptic donor was not detected by NNSplice software, whereas the canonical one was. Regarding exon 15 donor variants, our results showed that all of them c. Only c. This is consistent with the creation of a new donor site that was not detected by the splicing prediction software.
Conversely, none of the exon 15 SRE-variants impaired splicing even though microdeletion tests had revealed a presumed ESE interval c. We have assayed 10 variants of this type, six in exon 14 c. Results showed that two exon 15 variants c. The variant c. The so-called full-length or canonical transcript expected size: nt was amplified with primers placed on vector exon V1 and BRCA2 exon Apart from the canonical transcript, we have detected ten other different ones Figure 3.
A nt transcript of unknown structure could also be detected by capillary electrophoresis. Due to the implementation of Next Generation Sequencing in the clinical setting Slavin et al.
HBOC and the breast cancer susceptibility genes are not exceptions, where thousands of different variants have been reported although many of them are considered as VUS Spurdle et al. In this context, the functional and clinical classifications pose a challenge for Medical Genetics. We found 12 variants that altered splicing, nine of which would severely alter the protein. The following advantages of the minigene technology should be underlined: i analysis of a single allele outcome without the interference of the wt counterpart of a patient sample; ii simple and fast quantification of generated transcripts by fluorescent capillary electrophoresis with minimum hands-on time versus other proposed methods Farber-Katz et al.
In fact, we have previously provided many examples of the minigene reproducibility. In the case of BRCA2 exons 14 and 15, variants c. Moreover, another 31 variants of this and other constructs replicated previously reported patient splicing outcomes Acedo et al. Although the biomolecular mechanisms are different, the principle is the same, that parts of the protein, called inteins instead of introns, are removed.
The remaining parts, called exteins instead of exons, are fused together. Protein splicing has so far been observed in yeast, but not in humans.