Analysis of domains and motifs in protein sequences

Friday June 23, 2006

Introduction to Patterns, Profiles, and HMMs

 
1. Prosite database

In this exercise we will explore the format and documentation of the Prosite database. First, have a look at the following Prosite entry: PS50235.

Question 1: is PS50235 a Pattern or a Profile?

Question 2: is PS50235 known to have false positives and false negatives in Swiss-Prot?

Question 3: what is the function of PS50235? (check the documentation).

Question 4: is PS50235 related to other Prosite entries?

Question 5: is the following protein related to PS50235? Use ScanProsite to perform a scan.

>unknown
MDLFIESKINSLLQFLFGSRQDFLRNFKTWSNNNNNLSIYLLIFGIVVFFYKKPDHLNYI
VESVSEMTTNFRNNNSLSRWLPRSKFTHLDEEILKRGGFIAGLVNDGNTCFMNSVLQSLA
SSRELMEFLDNNVIRTYEEIEQNEHNEEGNGQESAQDEATHKKNTRKGGKVYGKHKKKLN
RKSSSKEDEEKSQEPDITFSVALRDLLSALNAKYYRDKPYFKTNSLLKAMSKSPRKNILL
GYDQEDAQEFFQNILAELESNVKSLNTEKLDTTPVAKSELPDDALVGQLNLGEVGTVYIP
TEQIDPNSILHDKSIQNFTPFKLMTPLDGITAERIGCLQCGENGGIRYSVFSGLSLNLPN
ENIGSTLKLSQLLSDWSKPEIIEGVECNRCALTAAHSHLFGQLKEFEKKPEGSIPEKPIN
AVKDRVHQIEEVLAKPVIDDEDYKKLHTANMVRKCSKSKQILISRPPPLLSIHINRSVFD
PRTYMIRKNNSKVLFKSRLNLAPWCCDINEINLDARLPMSKKEKAAQQDSSEDENIGGEY
YTKLHERFEQEFEDSEEEKEYDDAEGNYASHYNHTKDISNYDPLNGEVDGVTSDDEDEYI
EETDALGNTIKKRIIEHSDVENENVKDNEELQEIDNVSLDEPKINVEDQLETSSDEEDVI
PAPPINYARSFSTVPATPLTYSLRSVIVHYGTHNYGHYIAFRKYRGCWWRISDETVYVVD
EAEVLSTPGVFMLFYEYDFDEETGKMKDDLEAIQSNNEEDDEKEQEQKGVQEPKESQEQG
EGEEQEEGQEQMKFERTEDHRDISGKDVN

[Solutions]

 
2. Iterative training using PSI-blast

Using the NCBI web interface to the BLAST programs, execute three cycles of PSI-blast using the sequence below as the initial query

>sw:ERCC5_XENLA/27-95
LAVDISIWLNQAVKGARDRQGNAIQNAHLLTLFHRLCKLLFFRIRPIFVFDGEAPLLKRQTLAKRRQRT

and SWISS-PROT as the database to search (to keep the list of results manageable!). At every cycle, record the number of matches with E-values equal to or below the threshold, and the E-values obtained for the proteins ERCC5_XENLA, FEN1_HUMAN, and DIN7_YEAST. Complete the table below.

           # matches    ERCC5_XENLA    FEN1_HUMAN    DIN7_YEAST
Cycle 1
Cycle 2
Cycle 3

Can you describe how PSI-blast behaves over successive iterations?

[Solutions]

 
3. Build a pattern

Consider the composition of the columns in the following multiple sequence alignment

Seq1  WFFKGIADKDAERHLLA
Seq2  WFFKNLEQKDAEARLLA
Seq3  WFFKR---KDAERQLLA
Seq4  WFFGTI---DAERQLLA
Seq5  WFFKDIPTKDAERQLLA
Seq6  WYFG----RESERLLLA
Seq7  WYFGKIPLKDAERQLLA
Seq8  WYFGKLRAKDTERLLLL

and try to write down a pattern using the Prosite syntax.
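
As a starting point, the column-by-column reasoning can be sketched in a few lines of Python. This is only an illustrative heuristic, not the reference solution: the function names and the `max_set` cutoff are assumptions made for this sketch. Gap-free columns become a single residue or a bracketed set, and each run of gapped columns is collapsed into a variable-length `x(min,max)` spacer.

```python
ALIGNMENT = [
    "WFFKGIADKDAERHLLA",
    "WFFKNLEQKDAEARLLA",
    "WFFKR---KDAERQLLA",
    "WFFGTI---DAERQLLA",
    "WFFKDIPTKDAERQLLA",
    "WYFG----RESERLLLA",
    "WYFGKIPLKDAERQLLA",
    "WYFGKLRAKDTERLLLL",
]


def column_to_element(residues, max_set=4):
    """Turn one gap-free column into a Prosite element.

    max_set is a hypothetical cutoff: columns with more distinct
    residues than this are treated as a wildcard 'x'.
    """
    distinct = sorted(set(residues))
    if len(distinct) == 1:
        return distinct[0]
    if len(distinct) <= max_set:
        return "[" + "".join(distinct) + "]"
    return "x"


def alignment_to_pattern(seqs, max_set=4):
    """Derive a crude Prosite-style pattern from an alignment."""
    length = len(seqs[0])
    assert all(len(s) == length for s in seqs), "rows must have equal length"
    elements = []
    col = 0
    while col < length:
        column = [s[col] for s in seqs]
        if "-" not in column:
            elements.append(column_to_element(column, max_set))
            col += 1
            continue
        # Extend over the whole run of columns that contain a gap.
        end = col
        while end < length and any(s[end] == "-" for s in seqs):
            end += 1
        # Count the residues each sequence contributes to the run.
        counts = [sum(c != "-" for c in s[col:end]) for s in seqs]
        lo, hi = min(counts), max(counts)
        elements.append(f"x({lo})" if lo == hi else f"x({lo},{hi})")
        col = end
    return "-".join(elements)


print(alignment_to_pattern(ALIGNMENT))
# W-[FY]-F-[GK]-x(1,5)-[DE]-[AST]-E-[AR]-[HLQR]-L-L-[AL]
```

A pattern produced this way only reflects the eight sequences shown, so it is worth asking at each position whether the observed residues should be generalized (for example to chemically similar residues) or restricted further.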

Submit the pattern to the ScanProsite server and search against SWISS-PROT. What can you say about the proteins matching your pattern? One way to validate the output of a pattern search is to search against a randomized database. Use ScanProsite again, but this time against a reversed SWISS-PROT. Do you find any false-positive sequences?
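
To see what a scan against a reversed database amounts to, the idea can be reproduced locally with a minimal Python sketch. Everything here is an assumption made for illustration: the example pattern is just one possible reading of the alignment above, the two-entry `TOY_DB` is hypothetical, and the converter covers only the basic Prosite syntax (`x`, `[..]`, `{..}`, repeat counts, and the `<` / `>` anchors). It is not how ScanProsite itself is implemented.

```python
import re


def prosite_to_regex(pattern):
    """Translate a basic Prosite pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split the element into its core and an optional (n) or (n,m) repeat.
        core, lo, hi = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element
        ).groups()
        if core == "x":
            regex = "."                          # any residue
        elif core.startswith("{"):
            regex = "[^" + core[1:-1] + "]"      # forbidden residues
        else:
            regex = core                         # single letter or [ABC]
        if lo is not None:
            regex += "{" + lo + ("," + hi if hi else "") + "}"
        parts.append(prefix + regex + suffix)
    return "".join(parts)


def scan(regex, database, reverse=False):
    """Return (name, 1-based start, matched text) for every hit."""
    hits = []
    for name, seq in database.items():
        if reverse:
            seq = seq[::-1]                      # reversed "decoy" sequence
        for m in re.finditer(regex, seq):
            hits.append((name, m.start() + 1, m.group()))
    return hits


# Hypothetical pattern and toy database, for illustration only.
PATTERN = "W-[FY]-F-[GK]-x(1,5)-[DE]-[AST]-E-[AR]-[HLQR]-L-L-[AL]"
TOY_DB = {
    "toy1": "MKT" + "WFFKGIADKDAERHLLA" + "GS",
    "toy2": "MSTNPKPQRKTKRNTNRRPQDVKFPGG",
}

regex = prosite_to_regex(PATTERN)
forward_hits = scan(regex, TOY_DB)
reversed_hits = scan(regex, TOY_DB, reverse=True)
print(regex)
print("forward:", forward_hits)
print("reversed:", reversed_hits)
```

Reversing each sequence keeps its length and amino acid composition but destroys any biologically meaningful motif, so hits in the reversed set give a rough estimate of how many matches arise by chance.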

Repeat the exercise with the following multiple sequence alignment:

seq1 ERGLR
seq2 DRASR
seq3 DRLGR
seq4 ERAAR
seq5 ERGVR

What happens when you search with this pattern against a randomized database?
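
One way to reason about this question is to estimate how often a pattern matches purely by chance. The sketch below is a back-of-the-envelope calculation under loudly simplified assumptions: all 20 amino acids are taken as equally frequent, positions are independent, the pattern shape `[DE]-R-x(2)-R` is just one loose reading of the second alignment, and the database size of 100 million residues is a hypothetical round number, not the actual size of SWISS-PROT.

```python
from fractions import Fraction


def chance_probability(allowed_counts, alphabet_size=20):
    """Probability that a random position matches every pattern element.

    allowed_counts lists, for each pattern position, how many of the
    residues are accepted (20 for a wildcard 'x'). Assumes uniform,
    independent residue frequencies, which real proteins do not have.
    """
    p = Fraction(1)
    for n in allowed_counts:
        p *= Fraction(n, alphabet_size)
    return p


# [DE]-R-x-x-R: 2 allowed, 1 allowed, wildcard, wildcard, 1 allowed
short_pattern = [2, 1, 20, 20, 1]
p_short = chance_probability(short_pattern)

DATABASE_RESIDUES = 100_000_000  # hypothetical database size
expected_hits = p_short * DATABASE_RESIDUES

print(p_short, float(expected_hits))
# 1/4000 25000.0
```

Under these assumptions a pattern this short is expected to match thousands of times in any large sequence collection, real or randomized, which is worth keeping in mind when comparing the two ScanProsite results.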

[Solutions]

 
4. Identification of Known Domains in a Protein Sequence

Several databases of protein domains exist. Most of them rely on profile-HMM-based predictors.

What can you tell about the following sequence:

>QUERY
MGQVGDFFIPNKIIFSDEKLVGKKIAMDGNLAYQFLTSIRLRGDSPNLRKRIGETSANYVYGKFTIHLLENIDTPIWVFD
PGEPLKKEKVRTRKRMEKKEEALKMKEIKAKDEFEEAYKAKARVYLSKTPMVEKNCLYSLLMGPIYEVASPEEGQASAMK
YAKGDWVVAVSQYDDALLYGAPRVVRNTLTTKEPMELINLEELVEDLRSILDDLIDIAIMFGDTNPYGVGKIGGFRKYAE
RVLGSVDAVKLKKEVYYEDIEIRKEFKKPVTDNYSLSLLKPDKEIIGKVFDLEDNFYNDRKVKVHKDYLNNLAITKKQKT
ADLWKF

using the following Motif-Scan servers (the list is not exhaustive!). Use the default settings on every web page (except MyHits), and don't hesitate to run the searches in parallel in multiple browser windows, since the servers can be slow to respond.

  • The Pfam database of protein domains is currently the largest one with over 7000 entries.
  • The Prosite database is developed together with SwissProt and is found on the Expasy web site.
  • The SMART database is smaller than Pfam, with a different focus.
  • Interpro is an attempt to federate multiple motif and domain databases and to serve as a tool to annotate TrEMBL.
  • The domain predictors of the NCBI CD-Search are derived from the Pfam and SMART databases with some additional material.
  • The Tigrfam database is a collection of HMMs that are specifically meant for the automated annotation of bacterial proteins.
  • The Hamap database is a collection of generalized profiles that are specifically meant for the automated annotation of bacterial proteins. It is also part of the Expasy server.
  • Our local scan server in the MyHits integrated environment. You must select the databases you want to search!

How different are these predictions? Which server do you prefer? Why?

[Solutions]

 

 

Latest update 2006-06-23