Computational Identification Of Genetic And Epigenetic Signal Biology
Recent studies on imprinting show that this process involves the collaboration of multiple genetic and epigenetic factors. This study investigates imprinting control regions (ICRs), in mice and in humans independently, in an attempt to discover DNA sequence features that are necessary to trigger the genomic imprinting process. We aim to find DNA sequence features that discriminate ICRs from other, similar but non-ICR sequences. Such discriminating features can be used for ICR prediction, but are especially useful for the construction and testing of models of the imprinting mechanism.
The first stage of the study is the analysis of ICR sequences to identify any special patterns (sequence motifs) associated with imprinting. The focus on ICRs is motivated by the many experiments confirming that deleting all or part of an ICR causes loss of imprinted expression of many genes within the same cluster. The next stage examines the periodicity of CG dinucleotide spacing within ICRs. This involves calculating the distance between each CG pair and testing the distribution of distances (histogram) for significant periodicity. Finally, we will apply unbiased machine learning techniques to the task of ICR versus non-ICR classification and analyse the results in order to understand the learned decision-making rules and to extract from them novel ICR-specific sequence features.
Genomic Imprinting
Definitions
Genomic imprinting is a form of gene marking in which genes are expressed or repressed in a parent-of-origin-dependent manner. It relies on parent-specific epigenetic modifications that achieve monoallelic gene expression without changing the underlying DNA sequence. These modifications involve DNA methylation and changes in histone modifications. They are established early in the germ line and are maintained in all somatic cells.
Most mammalian genes are inherited as two functionally equivalent and actively expressed alleles (biallelic expression), one from the egg (maternal allele) and the other from the sperm (paternal allele). However, a small but growing number of genes do not follow this rule of inheritance. These genes, in which only one allele is actively transcribed (monoallelic expression), either the maternal one (maternally expressed, like H19) or the paternal one (paternally expressed, like IGF2), are called imprinted genes.
Stages
Each germ cell (gametocyte) carries its own gene programming, which is important in determining the allele-specific expression of imprinted genes. However, during early germ line development in the post-implantation embryo, global DNA demethylation erases all previous gene programming, including the imprints, which are later re-established during gametogenesis. Parental imprints are then inherited at fertilisation, and their effect appears as unequal expression of the parental alleles (Hartl and Jones, 2001; Morgan et al., 2005).
Evolution and ICRs
Genomic imprinting is largely restricted to placental mammals, marsupials (Killian et al., 2000; Renfree et al., 2008) and flowering plants. There are many hypotheses about the evolution of imprinting; one is the genetic conflict theory, in which paternally expressed genes promote growth whilst maternally expressed genes inhibit it (Haig, 1996). The first imprinted gene, Insulin-like growth factor type-2 (IGF2), was identified in the early 1990s on mouse chromosome 7. It is a paternally expressed gene which plays an essential role in growth and development before birth (DeChiara et al., 1991; Barlow et al., 1991). Since the identification of IGF2, extensive studies have identified about (143, 103, 80) (Peters, 2010) imprinted genes in mice and around (100, 68) in humans (Catalogue of parent of origin effects, 2010). Most of these imprinted genes are found in clusters of about 1 Mb throughout the genome, like the H19 and KCNQ1OT1 sub-domains on human chromosome 11 (Verona et al., 2003; O'Neill, 2005). Each cluster may contain paternally and maternally expressed imprinted genes as well as non-imprinted genes, differentially methylated regions (DMRs), non-coding RNAs (ncRNAs) and an ICR (Bartolomei, 2009). ICRs are DNA sequences that are responsible for regulating the expression of imprinted genes; deleting all or part of these regulatory elements results in loss of imprinting for multiple genes within the cluster.
Significance
The regulation of gene expression through imprinting is essential to the normal development of the embryo. Thus Loss of Imprinting (LOI) and/or aberrant expression of imprinted genes play a major role in the development of multiple disorders. For example, LOI of the IGF2 gene is associated with different tumours, such as colorectal, liver, esophageal, adrenocortical and breast cancer (Kaneda and Feinberg, 2005; Chao and D'Amore, 2008). Aberrant imprinting of the H19/IGF2 region is also linked to growth disorders: biallelic expression of H19 with downregulation of IGF2 is associated with the growth restriction disorder Silver-Russell syndrome (SRS) (Gicquel et al., 2005), while altered imprinting in the same region is associated with the overgrowth disorder Beckwith-Wiedemann syndrome (BWS) (Chao and D'Amore, 2008).
Research Area
One challenge in genomic imprinting research is identifying further imprinted genes, since imprinted gene expression is often tissue-specific. Moreover, what initially causes a region to become imprinted remains unclear. The main goal of this study is to identify the genetic and epigenetic markers involved in genomic imprinting by analysing the distinctive features of ICRs using computational tools and statistical measures.
Stages of Progress
Identifying motif sequences in ICRs
ICR analysis requires searching for motifs with distinctive sequences that could form the genetic signals needed to initiate imprinting. This stage uses tools such as MEME to search for those motifs. An ICR may overlap a gene region, lie within an intron of a gene, or lie in the sequence upstream or downstream of a gene. We divide the sequences into three groups: paternal ICRs, maternal ICRs and ICR-like sequences used as negative controls. We will exclude the paternal ICRs since there are only a few of them and they have different mechanisms of imprinting.
[Figure: ICR sequence reads, MEME tool, Position Weight Matrix]
Periodicity
Genomic imprinting is directly accompanied by differential DNA methylation, in which methyl groups are added to DNA to repress gene transcription. This methylation is mediated by DNA methyltransferase enzymes such as Dnmt3a, which shows significant sequence specificity (Ferguson-Smith and Greally, 2007). Dnmt3a and its regulatory factor, DNA methyltransferase 3-like protein (Dnmt3L), work together to methylate CG dinucleotides spaced at a period of 8-10 base pairs (Jia et al., 2007). This stage therefore examines the periodicity and frequency of CG dinucleotides within the ICRs to identify periodic patterns that might be directly related to imprinting and differential DNA methylation. A dataset of CG dinucleotides will be collected from the ICRs, excluding those that occur within codons, within transcribed sequence or within repetitive elements. This dataset will be prepared using the Galaxy tool.
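A minimal sketch of this kind of periodicity test is given here, assuming Python with NumPy. The function names, the `max_dist` cut-off and the choice to count distances between all CG pairs (rather than only adjacent ones) are illustrative assumptions, not part of the study design, and the filtering of codons, transcribed sequence and repeats is not shown.

```python
import re
import numpy as np

def cg_distance_histogram(seq, max_dist=100):
    """Count distances (in bp) between all pairs of CG dinucleotides.

    Assumption: every pair up to max_dist apart is counted, not only
    neighbouring CGs. hist[d] is the number of pairs d bases apart.
    """
    positions = [m.start() for m in re.finditer("CG", seq.upper())]
    hist = np.zeros(max_dist + 1, dtype=int)
    for i, p in enumerate(positions):
        for q in positions[i + 1:]:
            d = q - p
            if d > max_dist:
                break  # positions are sorted, so later pairs are further apart
            hist[d] += 1
    return hist

def periodicity_spectrum(hist):
    """Return (periods in bp, amplitudes) of the distance histogram.

    The histogram is treated as a signal over distance, its mean is
    removed, and a real FFT gives the amplitude at each period.
    """
    signal = hist[1:].astype(float)      # drop the unused distance 0
    signal -= signal.mean()
    amplitude = np.abs(np.fft.rfft(signal))[1:]        # skip zero frequency
    periods = 1.0 / np.fft.rfftfreq(len(signal))[1:]   # period = 1 / frequency
    return periods, amplitude

# Toy example (invented sequence): a CG placed every 10 bp.
toy = "CGATATATAT" * 30
periods, amplitude = periodicity_spectrum(cg_distance_histogram(toy))
```

In real data, a peak in `amplitude` at periods of roughly 8-10 bp would be the pattern of interest. Its significance would still have to be judged against a background, for example the same spectrum computed on the ICR-like control sequences.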
After calculating the CG dinucleotide frequencies and distances, we can plot the histogram of distances and analyse it as a time series, using the Fast Fourier Transform to identify wavelengths with significant amplitudes.
Machine Learning
Machine Learning (ML) techniques are used to classify input sequences into predefined groups (supervised learning) according to patterns that are specific to each group. The main idea behind using ML techniques is to identify these patterns from a large set of training input sequences. We then want to understand the procedure used by the trained classifier to distinguish ICRs from other sequences, so that we can extract ICR-specific sequence features related to imprinting.
ML workflows usually have two phases: a training phase and a testing phase. Our training set will consist of part of the known ICR sequences together with part of the ICR-like sequences as negative controls, and the testing set will be the remainder. We will choose a suitable technique (e.g. a Support Vector Machine (SVM) or a Self-Organizing Map (SOM)), train and test the classifier and, if the predictive performance (sensitivity versus specificity) is promising, analyse the trained classifier to infer classification rules and thus ICR-specific sequence patterns.
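The train/test procedure and the sensitivity/specificity measures can be illustrated with a small self-contained sketch in plain Python. Everything in it is an assumption made for illustration: the k-mer frequency features, the function names and the toy sequences are invented, and a simple nearest-centroid classifier stands in for the SVM or SOM, which would be swapped in for a real analysis.

```python
from collections import Counter
from itertools import product

def kmer_features(seq, k=2):
    """Relative frequency of every DNA k-mer, in a fixed alphabetical order."""
    seq = seq.upper()
    n_windows = max(len(seq) - k + 1, 1)
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts["".join(p)] / n_windows for p in product("ACGT", repeat=k)]

def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train(icr_seqs, control_seqs, k=2):
    """Nearest-centroid stand-in for the SVM/SOM: one centroid per class."""
    icr_c = centroid([kmer_features(s, k) for s in icr_seqs])
    ctrl_c = centroid([kmer_features(s, k) for s in control_seqs])
    return icr_c, ctrl_c, k

def predict(model, seq):
    """True if the sequence is closer to the ICR centroid than to the controls."""
    icr_c, ctrl_c, k = model
    x = kmer_features(seq, k)
    return sq_dist(x, icr_c) < sq_dist(x, ctrl_c)

def sensitivity_specificity(model, icr_test, control_test):
    """Sensitivity = TP / all ICRs; specificity = TN / all controls."""
    tp = sum(predict(model, s) for s in icr_test)
    tn = sum(not predict(model, s) for s in control_test)
    return tp / len(icr_test), tn / len(control_test)

# Toy data (invented): CG-rich "ICRs" versus AT-rich "controls".
model = train(["CGCGCGCGCG", "ACGCGCGTCG"], ["ATATATATAT", "TTATAAATAT"])
sens, spec = sensitivity_specificity(model, ["GCGCGCGCGC"], ["TATATATATA"])
```

With a centroid-based stand-in the learned "rule" is directly readable as the k-mers whose frequencies differ most between the two centroids. For an SVM or SOM the corresponding step is the classifier analysis described in the text.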