Predicting drug resistance of HIV-1 protease using artificial neural networks
1.0 Introduction
Drug resistance is probably the most important factor contributing to the failure of present HIV therapies. It is associated with a loss of contacts between the drug and its receptor [1]. Only a small number of IC90 values have been reported, so we use the available IC90 values of known mutants to estimate the resistance of unknown mutants. The aim of this study, however, is not to predict actual IC90 values but to classify mutants as showing high, intermediate or low resistance. The emergence of antiretroviral drug resistance is not unexpected, as drug resistance has been reported for other viruses such as herpes simplex, varicella-zoster, cytomegalovirus, influenza A and rhinovirus. However, the problem is far more serious in the case of HIV because of the severe final outcome of HIV-related illnesses.
Traditionally, genotyping and phenotyping have been used to determine HIV-1 protease drug resistance:
1.1 Genotyping: A genotyping test lists the mutations found in the protease and reverse transcriptase genes of a person's HIV and, in turn, what they mean in terms of drug resistance. The test amplifies the relevant genes of the person's HIV by PCR, sequences them, and checks the result against a list of mutations known to cause drug resistance. Most drugs have a characteristic pattern of resistance mutations.
Another technique is the probe assay, in which a probe is used to detect each mutation known to lead to drug resistance. [1]
1.2 Phenotyping: This test measures the amount of drug needed to suppress the growth of HIV in a laboratory setting. Known levels of a drug stop the reproduction of non-resistant HIV, whereas resistant HIV requires higher levels of the same drug. [1] Results are reported as the IC90 or IC50 (inhibitory concentration), the amount of drug needed to inhibit 90% or 50% of the virus population.
1.3 Data mining: Data mining is the process of extracting patterns from data. Here we aim to extract sequence or structure patterns from the genetic sequence and 3D structure of HIV in a form the computer can process, and to assign a weight to every input.
Five different learning methods have been used to predict phenotypic drug susceptibility based on viral genotype (the presence or absence of mutations): (1) decision trees, (2) neural networks, (3) support vector regression, (4) linear regression, and (5) least angle regression. [2]
1.4 HIV-1 protease: HIV-1 protease (HIV PR) is an aspartic protease that is essential for the life-cycle of HIV, the retrovirus that causes AIDS. HIV PR cleaves newly synthesized polyprotein at the appropriate places to create the mature protein components of an infectious HIV virion. Without effective HIV PR, HIV virions remain non-infectious. Thus, mutation of HIV PR's active site or inhibition of its activity disrupts HIV's ability to replicate and infect additional cells, which makes HIV PR inhibition the subject of much pharmaceutical research. [1]
1.5 Antiretroviral Drugs: Antiretroviral drugs are medications for the treatment of infection by retroviruses, primarily HIV. When several such drugs, typically three or four, are taken in combination, the approach is known as highly active antiretroviral therapy, or HAART. The American National Institutes of Health and other organizations recommend offering antiretroviral treatment to all patients with AIDS. However, because of the complexity of selecting and following a regimen, the severity of the side-effects and the importance of compliance in preventing viral resistance, such organizations emphasize the importance of involving patients in therapy choices, and recommend analyzing the risks and the potential benefits to patients without symptoms. [1]
1.6 Classes of drugs [5]: Antiretroviral (ARV) drugs are broadly classified by the phase of the retrovirus life-cycle that the drug inhibits.
* Nucleoside and nucleotide reverse transcriptase inhibitors (nRTI) inhibit reverse transcription by being incorporated into the newly synthesized viral DNA and preventing its further elongation.
* Non-nucleoside reverse transcriptase inhibitors (NNRTI) inhibit reverse transcriptase directly by binding to the enzyme and interfering with its function.
* Protease inhibitors (PIs) target viral assembly by inhibiting the activity of protease, an enzyme used by HIV to cleave nascent proteins for final assembly of new virions.
* Integrase inhibitors inhibit the enzyme integrase, which is responsible for integration of viral DNA into the DNA of the infected cell. There are several integrase inhibitors currently under clinical trial, and raltegravir became the first to receive FDA approval in October 2007.
* Entry inhibitors (or fusion inhibitors) interfere with binding, fusion and entry of HIV-1 to the host cell by blocking one of several targets. Maraviroc and enfuvirtide are the two currently available agents in this class.
* Maturation inhibitors inhibit the last step in gag processing in which the viral capsid polyprotein is cleaved, thereby blocking the conversion of the polyprotein into the mature capsid protein (p24). Because these viral particles have a defective core, the virions released consist mainly of non-infectious particles. There are no drugs in this class currently available, though two are under investigation, bevirimat and Vivecon.
* Broad spectrum inhibitors. Some natural antivirals, such as extracts from certain species of mushrooms, may contain multiple pharmacologically active compounds, which attack the virus at various different stages in its lifecycle.
The preferred initial regimens are [5]:
* efavirenz + zidovudine + lamivudine
* efavirenz + tenofovir + emtricitabine
* lopinavir boosted with ritonavir + zidovudine + lamivudine
* lopinavir boosted with ritonavir + tenofovir + emtricitabine
2.0 The problem: Disadvantages of genotyping and phenotyping [2]:
1. May take over a month to get the results.
2. It is a labor-intensive technology that needs expertise to prepare the input and interpret the output.
3. The cost of a genotypic or phenotypic test is approximately US$350 to $900.
4. May not detect minority species of virus present at levels less than 10%-20%.
5. Must be done in a centralized laboratory facility.
A more cost-effective method is therefore needed. We do not need to know the exact level of resistance a mutant shows; an approximate category is sufficient.
3.0 Different approaches to the same problem: Drăghici and Potter used two ways of predicting HIV drug resistance with neural networks [3]:
1. Structure based Data mining.
2. Sequence based Data mining.
3.1.1 Structure based Data mining [3]:
Steps:
1. Construct mutant genotypes and produce 3D structures using Modeller.
2. Use LIGPLOT to analyze the 3D structures and produce a list of contacts between the mutant proteases and the protease inhibitor.
3. Preprocess the contact information (input reduction, normalization).
4. Construct and train a self-organizing map to categorize mutant resistance to the protease inhibitor as high, medium, or low.
5. Test the network and analyze its performance.
We do not know in advance how HIV will mutate, because mutations arise randomly. Some mutations produce greater resistance and some less. Random mutations are therefore introduced into the protease sequence and the resulting mutants are modeled with Modeller, a software suite that models the 3D structure of a protein from its sequence information. LIGPLOT is then used to determine the contacts between the drug and the protein, since different mutants have different contact points with the drug.
The number of atomic contacts between the protease and the inhibitor is typically around 30 [3]. These contacts are spread across 173 possible contact points, so each mutant is described by a vector with non-zero values for the contacts that are present and zero values for those that are absent. Both hydrogen bonds (HB) and non-bonded interactions (NBI) are taken into consideration; weights of 1.0 and 2.0 were given to NBIs and HBs respectively, to reflect the greater strength of hydrogen bonds. Contacts that did not differ from the wild type were discarded, which leaves 22 contacts. The 22-component contact vectors describing each mutant were normalized to a length of 1.
A Kohonen neural network (self-organizing map) was trained on these patterns. When a pattern is presented, the excitation of each unit is equal to the dot product between the input vector and the unit's weight vector. During training, the weight vectors of the most excited unit and its neighbors are adjusted iteratively so that they become more similar to the input pattern. Training stops when the weight changes become negligible. There were 38 patterns in total, of which 31 were used for training and 7 for testing.
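The normalization, dot-product excitation and weight update described above can be sketched in a few lines of Python. This is only an illustrative one-dimensional map, not the network used in [3]: the toy contact vectors, the number of units, the decay schedule and the names `normalize`, `excitation`, `best_unit` and `train_som` are all assumptions made for the example.

```python
import math
import random

def normalize(vec):
    """Scale a contact vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def excitation(weights, pattern):
    """Excitation of a unit: dot product of its weight vector and the input."""
    return sum(w * x for w, x in zip(weights, pattern))

def best_unit(units, pattern):
    """Index of the most excited unit for a given pattern."""
    return max(range(len(units)), key=lambda u: excitation(units[u], pattern))

def train_som(patterns, n_units=3, epochs=50, rate0=0.6, width0=1.0, seed=0):
    """Train a small 1-D Kohonen map; rate and neighborhood width shrink over time."""
    rng = random.Random(seed)
    dim = len(patterns[0])
    units = [normalize([rng.random() for _ in range(dim)]) for _ in range(n_units)]
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs
        rate = rate0 * decay                 # learning rate decays towards zero
        width = max(width0 * decay, 0.1)     # neighborhood width, kept above zero
        for p in patterns:
            winner = best_unit(units, p)
            for u in range(n_units):
                # Units close to the winner on the map are pulled more strongly.
                influence = math.exp(-((u - winner) ** 2) / (2 * width ** 2))
                moved = [w + rate * influence * (x - w) for w, x in zip(units[u], p)]
                units[u] = normalize(moved)
    return units

# Hypothetical contact vectors: 1.0 = non-bonded interaction, 2.0 = hydrogen bond.
raw = [[2.0, 1.0, 0.0, 0.0], [2.0, 2.0, 0.0, 0.0],
       [0.0, 0.0, 1.0, 2.0], [0.0, 0.0, 2.0, 2.0]]
patterns = [normalize(v) for v in raw]
units = train_som(patterns)
assignments = [best_unit(units, p) for p in patterns]
```

After training, each map unit would still have to be labeled as high, medium or low resistance using mutants with known IC90 values; that labeling step is not shown here.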
3.1.2 Prediction accuracy [3]: Coverage is the ratio of patterns that were classified to the total number of test patterns.
Accuracy is defined as the ratio of patterns that were correctly classified to the total number of patterns classified.
Network score is the product of the coverage and accuracy, expressed as a percentage.
Score = Coverage × Accuracy × 100
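As a worked example of these three definitions, the short Python function below computes coverage, accuracy and score from raw counts. The function name `network_score` and the counts used in the example call are illustrative assumptions, not figures from [3].

```python
def network_score(n_total, n_classified, n_correct):
    """Return (coverage, accuracy, score) as defined in the text."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    coverage = n_classified / n_total
    # Accuracy is measured only over the patterns that were actually classified.
    accuracy = n_correct / n_classified if n_classified else 0.0
    return coverage, accuracy, coverage * accuracy * 100

# Hypothetical test set: 10 patterns, 8 classified, 6 of those correct.
coverage, accuracy, score = network_score(10, 8, 6)
```

Because the score multiplies the two ratios, a network that classifies few patterns scores low even when every classification it makes is correct.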
The training ability was estimated using patterns with known resistance values: the patterns were presented to the trained network to see whether it placed them in the correct resistance category. Using leave-one-out training, the network predicted 6 out of 10 cases correctly. The accuracy of the predictor was estimated to be between 60% and 70%.
3.1.3 Sequence based Data Mining [3]: In this approach the amino acid sequence of the virus is used to predict drug resistance. Each residue is represented by a value between 0 and 1. If the residue matches the wild type it is given a value of 0; otherwise it is given a value between 0 and 1 in n equal increments, where n is the number of mutations from the wild type observed at that residue.
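One possible reading of this encoding is sketched below in Python. The text does not say how the n mutations at a position are ordered, so the alphabetical ordering used here is an assumption, as are the toy four-residue sequences and the helper names `build_encoding` and `encode`.

```python
def build_encoding(wild_type, mutants):
    """For each position, map every observed non-wild-type residue to k/n (k = 1..n)."""
    table = []
    for i, wt in enumerate(wild_type):
        # Assumption: the n observed substitutions are ranked alphabetically.
        variants = sorted({m[i] for m in mutants if m[i] != wt})
        n = len(variants)
        table.append({aa: (k + 1) / n for k, aa in enumerate(variants)})
    return table

def encode(sequence, wild_type, table):
    """Encode a sequence as values in [0, 1]; wild-type residues become 0."""
    # A residue never seen when building the table raises KeyError.
    return [0.0 if aa == wt else table[i][aa]
            for i, (aa, wt) in enumerate(zip(sequence, wild_type))]

# Hypothetical toy sequences, far shorter than the real protease.
wild_type = "PQIT"
mutants = ["PQVT", "PQLT", "PKIT"]
table = build_encoding(wild_type, mutants)
encoded = encode("PKLT", wild_type, table)
```

In this toy case position 3 has two observed substitutions (L and V), so they are encoded in increments of 1/2, while position 2 has a single substitution that is encoded as 1.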
3.1.4 Prediction Accuracy [3]: The network with the best prediction accuracy had an 8 × 8 output matrix and a learning rate of 0.6. It reached an accuracy of 100% but a coverage of only 31%. The best single network combined high coverage with an accuracy of 68%.
3.2 Prediction using Recurrent Networks [7]: In this technique Bonet et al. use features related to the 3D structure of the protein, namely amino acid energies, to train the network. These energies relate to the 3D structure because they carry information about the folding of the protein in 3D space.
3.2.1 Feature Set [7]: The feature set used in this study consists of Energy, the contact energy of the amino acids, and ΔEnergy, the difference in energy at an amino acid position between the wild type and the mutant.
Energy: A → R
where A is the set of 20 amino acids and R is the set of real numbers.
The energy variation is
ΔEnergy (Ai) = Energy (AWi) − Energy (Ai)
where AWi is the amino acid at position i of the wild type sequence and Ai is the amino acid at position i of the mutated sequence.
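The ΔEnergy definition above translates directly into code. In the Python sketch below the energy table holds placeholder numbers chosen for easy arithmetic, not real amino acid contact energies, and the function name `delta_energy` and the four-residue sequences are likewise assumptions.

```python
def delta_energy(wild_type, mutant, energy):
    """Per-position ΔEnergy(Ai) = Energy(AWi) - Energy(Ai)."""
    if len(wild_type) != len(mutant):
        raise ValueError("sequences must have the same length")
    return [energy[w] - energy[m] for w, m in zip(wild_type, mutant)]

# Placeholder energies (NOT real contact energies), one value per amino acid.
energy = {"L": -1.0, "V": -0.5, "I": -1.5, "D": 0.25}

wild_type = "LVID"
mutant = "LIID"   # hypothetical mutant with a V -> I substitution at position 2
features = delta_energy(wild_type, mutant, energy)
```

Positions that match the wild type give a ΔEnergy of zero, so only mutated positions contribute non-zero features.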
3.2.2 Network information [7]: The network has 33 input neurons and 2 output neurons. The topology has two context blocks, one referring to the left and the other to the right. The context layers consist of backward (HB), forward (HF) and hidden (HO) layers.
In this approach backpropagation through time was used, based on an unfolding and folding process. In the forward pass the network is unfolded and the input is propagated through it to obtain the corresponding output; in the backward pass the network is folded again, returning to its original form, so that its weights can be updated. The prediction is divided into three tasks. The first task is to split the sequence into three parts, representing the three inputs to the network. The second task is to unfold the network and obtain the three outputs for this input. The third task is to compute the final output, which indicates whether or not the protein is resistant, as an adequate combination of the three previous outputs. In the output layer we obtain three coordinates (O1, O2, O3), where Oi ∈ {0, 1}; the output pattern {0,1} denotes resistant and {1,0} denotes susceptible. [7]
4.0 My Idea: The speed and accuracy of drug resistance prediction have increased many-fold, yet no single method stands out as the best and most accurate. There are, however, some pressing issues to consider when modeling an in vivo process on a computer: we cannot always be sure that a given result is even half true to what really happens in the living system. For example, Drăghici and Potter used LIGPLOT to obtain their feature set, a tool which still has many bugs and a high chance of producing false positives. LIGPLOT uses HBPLUS to calculate hydrogen bonds and non-bonded interactions. The major drawback of HBPLUS is that it was designed to compute hydrogen bonds between protein side chains, so it is unable to recognize the majority of ligands in the PDB. As a result, it may miss certain hydrogen bonds between protein and ligand, and LIGPLOT will not plot these missing interactions, which may help explain the low prediction accuracy they obtained. [Ligplot manual] In my opinion, therefore, all open source bioinformatics tools should be avoided for cutting-edge research. I also believe that amino acid-ligand contact energies make a better feature set, as they are closely related to changes in the 3D structure of the protease.
5.0 Related Work: To date, genotyping and phenotyping are the methods used in hospitals. To speed up resistance prediction, many other approaches have been tried, the most successful of which was developed by Brown et al. at UC Irvine. Other attempts focused on flap and dimer-interface flexibility (Ishima et al., 1999), molecular surface analysis (Pattabiram et al., 1999) and the auto-processing of the HIV-1 protease (Louise et al., 1999).
Computational approaches to the same problem include rule-based algorithms, statistical analysis and machine learning.
5.1 Rule based algorithms: Rule-based algorithms such as HIVdb (http://hivdb.stanford.edu/hiv), ANRS, Rega and VGI use rules that encode information from the medical literature as their knowledge base. The HIVdb system uses mutation scoring tables to calculate a score for each sequence and interprets drug susceptibility as one of five classes ranging from susceptible to high-level resistant. [7]
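The score-then-classify idea behind such systems can be illustrated with a short Python sketch. The penalty values, the cut-offs and the names of the three intermediate classes below are hypothetical placeholders and do not reproduce the actual HIVdb scoring tables; only the overall scheme (sum the mutation scores, then map the total to one of five classes) follows the description above.

```python
# Hypothetical penalty scores for one drug; real tables are curated from the literature.
PENALTIES = {"L10F": 5, "V82A": 30, "I84V": 60}

# Hypothetical upper bounds (exclusive) for the first four of the five classes.
CUTOFFS = [
    (10, "susceptible"),
    (15, "potential low-level resistance"),
    (30, "low-level resistance"),
    (60, "intermediate resistance"),
]

def interpret(mutations):
    """Sum the penalties of the observed mutations and map the total to a class."""
    score = sum(PENALTIES.get(m, 0) for m in mutations)
    for upper, label in CUTOFFS:
        if score < upper:
            return score, label
    return score, "high-level resistance"

result = interpret(["V82A", "I84V"])
```

Mutations that are absent from the table contribute nothing to the score, which mirrors the way a rule base can only reflect mutations already described in the literature.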
5.2 Statistical analysis: Multiple linear regression analysis (REG) was applied to construct a separate regression model for each drug [7]. In the model, the dependent variable is the logarithm of the IC50 fold change, while the independent variables are dummy variables corresponding to mutations. In addition, this technique used the stepwise regression method to optimize the parameters for each independent variable.
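A minimal Python sketch of such a model is given below: each mutation becomes a 0/1 dummy variable, the target is the base-10 logarithm of the fold change, and the coefficients are found by ordinary least squares via the normal equations. The mutation names, fold-change values and function names are invented for illustration, the choice of base 10 is an assumption, and the stepwise selection step is not implemented.

```python
import math

def dummy_row(mutations, catalogue):
    """Intercept term followed by one 0/1 dummy variable per catalogued mutation."""
    return [1.0] + [1.0 if m in mutations else 0.0 for m in catalogue]

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit(samples, catalogue):
    """Least-squares fit of log10(fold change) on mutation dummies."""
    x = [dummy_row(muts, catalogue) for muts, _ in samples]
    y = [math.log10(fold) for _, fold in samples]
    k = len(catalogue) + 1
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(x, y)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical data: (set of mutations, IC50 fold change relative to wild type).
catalogue = ["M46I", "V82A"]
samples = [(set(), 1.0), ({"M46I"}, 10.0), ({"V82A"}, 100.0),
           ({"M46I", "V82A"}, 1000.0)]
coefficients = fit(samples, catalogue)  # [intercept, effect of M46I, effect of V82A]
```

Each fitted coefficient is the estimated change in log fold resistance contributed by one mutation. The normal equations become singular if two mutations always occur together, which is one reason a selection step such as stepwise regression is useful.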
5.3 Machine learning: Machine learning is the most popular approach applied to predict phenotype from genotype. Many supervised learning algorithms have been used to solve this problem, such as decision trees (DT), support vector machines (SVMs) and artificial neural networks (ANNs). These algorithms classify drug susceptibility into one of two classes: susceptible or resistant. Furthermore, the self-organizing map (SOM), an unsupervised learning algorithm, was used to classify drug susceptibility into one of three classes: high, medium, or low resistance. [3]
6.0 Conclusions and further work: In this paper we have reviewed various methods of predicting resistance. Comparing the methods, recurrent neural networks trained with amino acid energies were found to produce better accuracies, as these features are closely related to the 3D structure of the protein. Energy and ΔEnergy are good features for representing the HIV genotype. The best neural network predictor reported to date is a recurrent neural network, with accuracies of 81.4-94.7%.
REFERENCES:
1. HIV/AIDS Antiretroviral Newsletter, August 2003, Issue 9.
2. Sandra E. Sinisi, Eric C. Polley, Maya L. Petersen, Soo-Yon Rhee, Mark J. van der Laan "Super Learning: An Application to the Prediction of HIV-1 Drug Resistance" Statistical Applications in Genetics and Molecular Biology: Vol. 6 : (2007) Iss. 1, Article 7.
3. Sorin Drăghici and R. Brian Potter "Predicting HIV drug resistance with neural networks". Bioinformatics 19 (2003) 98-107.
4. Davies DR "The structure and function of the aspartic proteinases". Annu Rev Biophys Chem 19(1990):189-215.
5. Dybul M, Fauci AS, Bartlett JG, Kaplan JE, Pau AK "Guidelines for using antiretroviral agents among HIV-infected adults and adolescents". Ann. Intern. Med. 137 (5 Pt 2): 381-433.
6. Anantaporn Srisawat and Boonserm Kijsiriku "Using Associative Classification for Predicting HIV-1 Drug Resistance" Proceedings of the Fourth International Conference on Hybrid Intelligent Systems (2004):0-7695-2291-2/04
7. Isis Bonet, Maria M. Garcia, Yvan Saeys, Yves Van de Peer, and Ricardo Grau "Predicting Human Immunodeficiency Virus (HIV) Drug Resistance Using Recurrent Neural Networks" LNCS 4527 (2007) 234-243.