Hidden Markov Models for sequence analysis: Training
Overview
question Questions
- How can a prediction method for transmembrane proteins be built?
- What quality measurements should be checked for a method?
- How can the quality of a prediction be improved?

objectives Learning Objectives
- Understand the ML method and algorithms
- Understand the CML method and algorithms
- Manipulate JUCHMME files (transitions, emissions, weights, configuration, model)
- Assess a prediction method
- Use the JUCHMME tool
- Use tools for quality correction
- Use HMM extensions to improve prediction accuracy
- Process semi-supervised data

requirements Requirements
- Introduction to Transmembrane Proteins
- Introduction to Machine Learning
- Introduction to Hidden Markov Models
time Time estimation: 1 hour 30 minutes
level Level: Introductory
Supporting Materials
last_modification Last modification: Mar 27, 2021
Introduction
Transmembrane Proteins
Hidden Markov Models (HMMs) are probabilistic models widely used in biological sequence analysis. The JUCHMME software can be used for building standard HMMs, CHMMs with labeled sequences, or HNNs. The versatile architecture and parameterization allow easy use, either by modifying the architecture of existing models or by building new, complex models for any kind of sequence analysis problem (e.g. gene finding, protein secondary structure, and so on). The models can be freely parameterized in terms of their architecture (i.e. states and the transitions between them), the alphabet of symbols used, the number of labels, and the sharing of emission probabilities (parameter tying). Using the software, however, does not require any programming skills, since the user only needs to specify the appropriate model and choose the corresponding training and testing options in a configuration file, following the detailed User's Guide that is provided. Thus, JUCHMME can be used for original research as well as for educational purposes.
It is therefore necessary to understand, identify, and exclude modelling choices that may harm the accuracy of sequence analysis. Assessing the quality of a prediction method is an essential first step in your analysis: catching errors early saves time later on.
Agenda
In this tutorial, we will deal with:
Tool Setup
hands_on Hands-on: Tool download
JUCHMME is written in Java and can be run from the command line. It requires a 32-bit or 64-bit Java runtime environment, version 7 or later, freely available from http://www.java.org. The Windows and MacOS X installers contain a suitable Java runtime environment that will be used if one cannot be found on the computer.
How to find the Java version in Windows or Mac (Developer Instructions): open a command prompt and type the command
java -version
Download JUCHMME from the releases page (ask your instructor): https://github.com/pbagos/juchmme/releases
In order to run JUCHMME you must first move into the bin folder. Open a command prompt and type the command
cd bin
To find the JUCHMME version, open a command prompt and type the command
java hmm/Juchmme -V
output Output
JUCHMME :: Java Utility for Class Hidden Markov Models and Extensions
Version 1.0.5; September 2019
Copyright (C) 2019 Pantelis Bagos
Freely distributed under the GNU General Public Licence (GPLv3)
--------------------------------------------------------------------------
Preparing System Arguments
JUCHMME Version 1.0.5

Compile
Researchers can create, run, and quickly adapt an HMM without ever writing a line of code, something that is not possible with many other tools; this makes JUCHMME far more accessible to non-programmers. Most features can be used without any coding, but a few lines of code allow JUCHMME to call external utilities. In that case, the user must compile the code using the command below:
javac -XDignore.symbol.file -sourcepath src/ -d ./bin src/hmm/Juchmme.java
HMM Components
The first step for a researcher is to define the parameter files. The model file, which contains the symbols and the labels, can be created with any text editor. An HMM is specified by the following components:
HMM model
Transition parameter
Emission parameter
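As a toy illustration of what each parameter table contains (this is not the JUCHMME file format; consult the User's Guide for that), the transition and emission parameters of a hypothetical two-state model over a two-symbol alphabet can be written as plain matrices:

```java
// Toy illustration of HMM components for a hypothetical 2-state model
// over a 2-symbol alphabet. NOT the JUCHMME file format.
public class HmmComponents {
    static String[] states  = {"Membrane", "Loop"};  // hypothetical labels
    static char[]   symbols = {'H', 'P'};            // e.g. hydrophobic / polar

    // Transition probabilities a[i][j] = P(state j at t+1 | state i at t).
    static double[][] a = {
        {0.9, 0.1},
        {0.2, 0.8},
    };

    // Emission probabilities e[i][k] = P(symbol k | state i).
    static double[][] e = {
        {0.8, 0.2},
        {0.3, 0.7},
    };

    // Sanity check: every row of a parameter table is a probability
    // distribution, i.e. it sums to 1.
    static boolean rowsSumToOne(double[][] m) {
        for (double[] row : m) {
            double s = 0;
            for (double v : row) s += v;
            if (Math.abs(s - 1.0) > 1e-9) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(rowsSumToOne(a) && rowsSumToOne(e)); // prints "true"
    }
}
```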
Determine the Initial Probabilities of the HMM
JUCHMME provides flexible functionality to parameterize/initialize transition, emission, and weight probabilities. Transition probabilities must be provided as input, since they describe the model itself: in the initial transition probability table, non-zero entries define the allowed transitions between states and thus the model architecture (a transition probability of zero cannot be undone during training). Emission probabilities are required when an HMM is used and are not needed when an HNN is used. In the configuration settings, the user can specify whether transition or emission probabilities should be randomized or made uniform. Moreover, the user can choose to initialize probabilities using the Viterbi algorithm on the training data (the suggested option for real-life applications).
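The key point above is that the zero entries of the initial transition table are structural: initialization (uniform or random) must only touch the allowed transitions. A minimal sketch of this idea, with a hypothetical two-state table (illustrative only; JUCHMME reads these values from its parameter files):

```java
import java.util.Random;

// Sketch: initialize a transition table while preserving its zero structure.
// Zero entries define forbidden transitions (the model architecture) and
// must stay zero; the remaining entries of each row are set uniformly or
// randomly and then renormalized so the row sums to 1.
public class InitProbs {
    static double[][] init(double[][] structure, boolean randomize, long seed) {
        Random rng = new Random(seed);
        double[][] out = new double[structure.length][];
        for (int i = 0; i < structure.length; i++) {
            out[i] = new double[structure[i].length];
            double sum = 0;
            for (int j = 0; j < structure[i].length; j++) {
                if (structure[i][j] > 0) {            // allowed transition
                    out[i][j] = randomize ? rng.nextDouble() : 1.0;
                    sum += out[i][j];
                }
            }
            for (int j = 0; j < out[i].length; j++)
                out[i][j] /= sum;                     // normalize the row
        }
        return out;
    }

    public static void main(String[] args) {
        // Row 1 forbids the transition back to state 0 (the zero stays zero).
        double[][] structure = {{0.5, 0.5}, {0.0, 1.0}};
        double[][] uniform = init(structure, false, 42L);
        System.out.println(uniform[1][0]); // prints "0.0": forbidden stays forbidden
    }
}
```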
Training
Traditionally, HMM training is performed using the Baum-Welch algorithm [1-3], which is a special case of the Expectation-Maximization (EM) algorithm for missing data [4]. The missing data in this case is the path π (i.e. the sequence of states), since, if we knew the exact path, the Maximum Likelihood (ML) estimates could be easily derived by counting the observed transitions and emissions. An alternative to the Baum-Welch algorithm, even though not widely used, is the gradient-descent algorithm proposed by Baldi and Chauvin [5]. In any case, in these models, maximization of the likelihood corresponds to an unsupervised learning problem. In other biological sequence analysis problems, where we want to classify various segments along the sequence, we often use labeled sequences for training. In such cases, each amino acid sequence x is accompanied by a sequence of labels y, with one label for each position i in the sequence (y = y1, y2, …, yL). Consequently, we declare a new probability distribution, the probability δk(yi = c) of a state k having a label c. In most applications, this probability is just a delta function, since a particular state is not allowed to match more than one label. Krogh proposed a simple modification of the forward and backward algorithms in order to incorporate information from labeled data [19]. The likelihood to be maximized in such situations is the joint probability of the sequences (x) and the labels (y) given the model, in which the summation is done only over those paths Πy that are in agreement with the labels y. This typically corresponds to a supervised learning procedure.
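The quantity at the heart of all of this is the likelihood P(x | model), computed by the forward algorithm as a sum over all paths; Baum-Welch builds its E-step on these forward values. A minimal sketch for a toy two-state HMM with made-up parameters (not JUCHMME code):

```java
// Forward algorithm sketch for a toy HMM.
// f[k] holds P(x_1..x_t, state_t = k); summing the final values over k
// gives the likelihood P(x | model), the quantity Baum-Welch maximizes.
public class Forward {
    static double likelihood(int[] x, double[] start, double[][] a, double[][] e) {
        int n = a.length;
        double[] f = new double[n];
        for (int k = 0; k < n; k++)
            f[k] = start[k] * e[k][x[0]];            // initialization
        for (int t = 1; t < x.length; t++) {         // recursion over positions
            double[] g = new double[n];
            for (int k = 0; k < n; k++) {
                double s = 0;
                for (int j = 0; j < n; j++) s += f[j] * a[j][k];
                g[k] = s * e[k][x[t]];
            }
            f = g;
        }
        double total = 0;                            // termination: sum over states
        for (double v : f) total += v;
        return total;
    }

    public static void main(String[] args) {
        double[]   start = {0.5, 0.5};               // made-up initial distribution
        double[][] a = {{0.9, 0.1}, {0.2, 0.8}};     // made-up transitions
        double[][] e = {{0.8, 0.2}, {0.3, 0.7}};     // made-up emissions
        int[] x = {0, 0, 1};                         // observed symbol indices
        System.out.println(likelihood(x, start, a, e));
    }
}
```

For labeled sequences, Krogh's modification restricts the same recursion to states whose label agrees with yi at each position, so the result becomes the joint probability of x and y.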
Maximum Likelihood (ML)
After defining the model and the configuration file, the researcher can continue to the training process. HMMs in their basic formulation correspond to an unsupervised procedure. In complex sequence analysis problems, we often use an extension of HMMs called "class HMMs" (CHMMs), which use labeled sequences for training and correspond to a supervised learning procedure. JUCHMME integrates a wide range of training algorithms for HMMs with labeled sequences. These models are commonly trained using Maximum Likelihood (ML). The tool supports the Baum-Welch algorithm and the extension of it that is necessary to handle labeled data; other alternatives, such as the gradient-descent algorithm and Viterbi training, are also supported.
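As noted earlier, when the state path is known exactly, ML estimation reduces to counting the observed transitions (and, analogously, emissions) and normalizing. A small sketch with a hypothetical labeled path (illustrative only, not JUCHMME's implementation):

```java
// ML estimation sketch for the fully-labeled case: when the state path
// is known, transition probabilities are estimated by counting observed
// transitions and normalizing each row. Emissions are estimated the same
// way from (state, symbol) counts.
public class CountML {
    static double[][] estimateTransitions(int[][] paths, int nStates) {
        double[][] counts = new double[nStates][nStates];
        for (int[] p : paths)
            for (int t = 1; t < p.length; t++)
                counts[p[t - 1]][p[t]] += 1.0;       // count each observed transition
        for (double[] row : counts) {
            double sum = 0;
            for (double v : row) sum += v;
            if (sum > 0)
                for (int j = 0; j < row.length; j++)
                    row[j] /= sum;                   // normalize row to probabilities
        }
        return counts;
    }

    public static void main(String[] args) {
        // One hypothetical known path through 2 states: 0 -> 0 -> 1 -> 1 -> 1
        int[][] paths = {{0, 0, 1, 1, 1}};
        double[][] a = estimateTransitions(paths, 2);
        System.out.println(a[0][0] + " " + a[0][1]); // prints "0.5 0.5"
    }
}
```

Baum-Welch performs the same counting in expectation, replacing the hard counts with posterior probabilities computed by the forward and backward algorithms.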