Hidden Markov Models for sequence analysis: Training
Overview
question Questions
- How can a prediction method for transmembrane proteins be built?
- What quality measurements should be checked for a method?
- How can the quality of a prediction be improved?

objectives Learning Objectives
- Understand the ML method and algorithms
- Understand the CML method and algorithms
- Manipulate JUCHMME files (transitions, emissions, weights, configuration, model)
- Assess a prediction method
- Use the JUCHMME tool
- Use tools for quality correction
- Use HMM extensions to improve prediction accuracy
- Process semi-supervised data

requirements Requirements
- Introduction to Transmembrane Proteins
- Introduction to Machine Learning
- Introduction to Hidden Markov Models
time Time estimation: 1 hour 30 minutes
level Level: Introductory
Supporting Materials
last_modification Last modification: Mar 27, 2021
Introduction
Transmembrane Proteins
Hidden Markov Models (HMMs) are probabilistic models widely used in biological sequence analysis. The JUCHMME software can be used for building standard HMMs, CHMMs with labeled sequences, or HNNs. The versatile architecture and parameterization allow easy use, either by modifying the architecture of existing models or by building new, complex models for any kind of sequence analysis problem (e.g. gene finding, protein secondary structure, and so on). The models can be freely parameterized in terms of their architecture (i.e. states and the transitions between them), the alphabet of symbols used, the number of labels, and the sharing of emission probabilities (parameter tying). Using the software, however, does not require any programming skills, since the user only needs to specify the appropriate model and choose the corresponding training and testing options in a configuration file, following the detailed User's Guide that is provided. Thus, JUCHMME can be used for original research as well as for educational purposes.
It is therefore necessary to understand, identify, and exclude modelling choices that may harm the accuracy of sequence analysis. Assessing the quality of a prediction method is an essential first step in your analysis: catching errors early saves time later on.
Agenda
In this tutorial, we will deal with:
Tool Setup
hands_on Hands-on: Tool download
JUCHMME is written in Java and can be run from the command line. It requires a 32-bit or 64-bit Java runtime environment, version 7 or later, freely available from http://www.java.org. The Windows and MacOS X installers contain a suitable Java runtime environment that will be used if one cannot be found on the computer.
How to find the Java version in Windows or Mac (Developer Instructions): open a command prompt and type the command
java -version
Download JUCHMME from the releases page (ask your instructor): https://github.com/pbagos/juchmme/releases
In order to run JUCHMME you must first move into the bin folder. Open a command prompt and type the command
cd bin
To find the JUCHMME version, open a command prompt and type the command
java hmm/Juchmme -V
output Output
JUCHMME :: Java Utility for Class Hidden Markov Models and Extensions
Version 1.0.5; September 2019
Copyright (C) 2019 Pantelis Bagos
Freely distributed under the GNU General Public Licence (GPLv3)
--------------------------------------------------------------------------
Preparing System Arguments
JUCHMME Version 1.0.5

Compile
Researchers can create, run, and quickly adapt an HMM without ever writing a line of code, something that is not possible with many other tools; this makes JUCHMME far more accessible to non-programmers. Most features can be used without any coding, but a few lines of code allow JUCHMME to call external utilities. In that case, the user must compile the code using the command below:
javac -XDignore.symbol.file -sourcepath src/ -d ./bin src/hmm/Juchmme.java
HMM Components
The first step for a researcher is to define the parameter files. The model file, which contains the symbols and the labels, can be created with any text editor. An HMM is specified by the following components:
HMM model
Transition parameter
Emission parameter
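As a toy illustration of what each parameter table contains (this is not the JUCHMME file format; consult the User's Guide for that), the transition and emission parameters of a hypothetical two-state model over a two-symbol alphabet can be written as plain matrices:

```java
// Toy illustration of HMM components for a hypothetical 2-state model
// over a 2-symbol alphabet. NOT the JUCHMME file format.
public class HmmComponents {
    static String[] states  = {"Membrane", "Loop"};  // hypothetical labels
    static char[]   symbols = {'H', 'P'};            // e.g. hydrophobic / polar

    // Transition probabilities a[i][j] = P(state j at t+1 | state i at t).
    static double[][] a = {
        {0.9, 0.1},
        {0.2, 0.8},
    };

    // Emission probabilities e[i][k] = P(symbol k | state i).
    static double[][] e = {
        {0.8, 0.2},
        {0.3, 0.7},
    };

    // Sanity check: every row of a parameter table is a probability
    // distribution, i.e. it sums to 1.
    static boolean rowsSumToOne(double[][] m) {
        for (double[] row : m) {
            double s = 0;
            for (double v : row) s += v;
            if (Math.abs(s - 1.0) > 1e-9) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(rowsSumToOne(a) && rowsSumToOne(e)); // prints "true"
    }
}
```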
Determine the Initial Probabilities of the HMM
JUCHMME provides flexible functionality to parameterize/initialize transition, emission, and weight probabilities. Transition probabilities must be provided as input, since they describe the model itself: in the initial transition probability table, non-zero entries define the allowed transitions between states and thus the model architecture (a transition probability of zero cannot be undone during training). Emission probabilities are required when an HMM is used and are not needed when an HNN is used. In the configuration settings, the user can specify whether transition or emission probabilities should be randomized or made uniform. Moreover, the user can choose to initialize probabilities using the Viterbi algorithm on the training data (the suggested option for real-life applications).
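The key point above is that the zero entries of the initial transition table are structural: initialization (uniform or random) must only touch the allowed transitions. A minimal sketch of this idea, with a hypothetical two-state table (illustrative only; JUCHMME reads these values from its parameter files):

```java
import java.util.Random;

// Sketch: initialize a transition table while preserving its zero structure.
// Zero entries define forbidden transitions (the model architecture) and
// must stay zero; the remaining entries of each row are set uniformly or
// randomly and then renormalized so the row sums to 1.
public class InitProbs {
    static double[][] init(double[][] structure, boolean randomize, long seed) {
        Random rng = new Random(seed);
        double[][] out = new double[structure.length][];
        for (int i = 0; i < structure.length; i++) {
            out[i] = new double[structure[i].length];
            double sum = 0;
            for (int j = 0; j < structure[i].length; j++) {
                if (structure[i][j] > 0) {            // allowed transition
                    out[i][j] = randomize ? rng.nextDouble() : 1.0;
                    sum += out[i][j];
                }
            }
            for (int j = 0; j < out[i].length; j++)
                out[i][j] /= sum;                     // normalize the row
        }
        return out;
    }

    public static void main(String[] args) {
        // Row 1 forbids the transition back to state 0 (the zero stays zero).
        double[][] structure = {{0.5, 0.5}, {0.0, 1.0}};
        double[][] uniform = init(structure, false, 42L);
        System.out.println(uniform[1][0]); // prints "0.0": forbidden stays forbidden
    }
}
```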
Training
Traditionally, HMM training is performed using the Baum-Welch algorithm [1-3], which is a special case of the Expectation-Maximization (EM) algorithm for missing data [4]. The missing data in this case is the path π (i.e. the sequence of states), since, if we knew the exact path, the Maximum Likelihood (ML) estimates could be easily derived by counting the observed transitions and emissions. An alternative to the Baum-Welch algorithm, even though not widely used, is the gradient-descent algorithm proposed by Baldi and Chauvin [5]. In any case, in these models, maximization of the likelihood corresponds to an unsupervised learning problem. In other biological sequence analysis problems, where we want to classify various segments along the sequence, we often use labeled sequences for training. In such cases, each amino acid sequence x is accompanied by a sequence of labels y, with one label for each position i in the sequence (y = y1, y2, …, yL). Consequently, we declare a new probability distribution, the probability δk(yi = c) of a state k having a label c. In most applications, this probability is just a delta function, since a particular state is not allowed to match more than one label. Krogh proposed a simple modification of the forward and backward algorithms in order to incorporate information from labeled data [19]. The likelihood to be maximized in such situations is the joint probability of the sequences (x) and the labels (y) given the model, in which the summation is done only over those paths Πy that are in agreement with the labels y. This typically corresponds to a supervised learning procedure.
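The quantity at the heart of all of this is the likelihood P(x | model), computed by the forward algorithm as a sum over all paths; Baum-Welch builds its E-step on these forward values. A minimal sketch for a toy two-state HMM with made-up parameters (not JUCHMME code):

```java
// Forward algorithm sketch for a toy HMM.
// f[k] holds P(x_1..x_t, state_t = k); summing the final values over k
// gives the likelihood P(x | model), the quantity Baum-Welch maximizes.
public class Forward {
    static double likelihood(int[] x, double[] start, double[][] a, double[][] e) {
        int n = a.length;
        double[] f = new double[n];
        for (int k = 0; k < n; k++)
            f[k] = start[k] * e[k][x[0]];            // initialization
        for (int t = 1; t < x.length; t++) {         // recursion over positions
            double[] g = new double[n];
            for (int k = 0; k < n; k++) {
                double s = 0;
                for (int j = 0; j < n; j++) s += f[j] * a[j][k];
                g[k] = s * e[k][x[t]];
            }
            f = g;
        }
        double total = 0;                            // termination: sum over states
        for (double v : f) total += v;
        return total;
    }

    public static void main(String[] args) {
        double[]   start = {0.5, 0.5};               // made-up initial distribution
        double[][] a = {{0.9, 0.1}, {0.2, 0.8}};     // made-up transitions
        double[][] e = {{0.8, 0.2}, {0.3, 0.7}};     // made-up emissions
        int[] x = {0, 0, 1};                         // observed symbol indices
        System.out.println(likelihood(x, start, a, e));
    }
}
```

For labeled sequences, Krogh's modification restricts the same recursion to states whose label agrees with yi at each position, so the result becomes the joint probability of x and y.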
Maximum Likelihood (ML)
After defining the model and the configuration file, the researcher can continue to the training process. HMMs in their basic formulation correspond to an unsupervised procedure. In complex sequence analysis problems, we often use an extension of HMMs called "class HMMs" (CHMMs), which use labeled sequences for training and correspond to a supervised learning procedure. JUCHMME integrates a wide range of training algorithms for HMMs with labeled sequences. These models are commonly trained using Maximum Likelihood (ML). The tool supports the Baum-Welch algorithm and the extension of it that is necessary to handle labeled data; other alternatives, such as the gradient-descent algorithm and Viterbi training, are also supported.
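As noted earlier, when the state path is known exactly, ML estimation reduces to counting the observed transitions (and, analogously, emissions) and normalizing. A small sketch with a hypothetical labeled path (illustrative only, not JUCHMME's implementation):

```java
// ML estimation sketch for the fully-labeled case: when the state path
// is known, transition probabilities are estimated by counting observed
// transitions and normalizing each row. Emissions are estimated the same
// way from (state, symbol) counts.
public class CountML {
    static double[][] estimateTransitions(int[][] paths, int nStates) {
        double[][] counts = new double[nStates][nStates];
        for (int[] p : paths)
            for (int t = 1; t < p.length; t++)
                counts[p[t - 1]][p[t]] += 1.0;       // count each observed transition
        for (double[] row : counts) {
            double sum = 0;
            for (double v : row) sum += v;
            if (sum > 0)
                for (int j = 0; j < row.length; j++)
                    row[j] /= sum;                   // normalize row to probabilities
        }
        return counts;
    }

    public static void main(String[] args) {
        // One hypothetical known path through 2 states: 0 -> 0 -> 1 -> 1 -> 1
        int[][] paths = {{0, 0, 1, 1, 1}};
        double[][] a = estimateTransitions(paths, 2);
        System.out.println(a[0][0] + " " + a[0][1]); // prints "0.5 0.5"
    }
}
```

Baum-Welch performs the same counting in expectation, replacing the hard counts with posterior probabilities computed by the forward and backward algorithms.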