Introduction to Patterns, Profiles, and HMMs
1. Prosite database

In this exercise we will explore the format and documentation of the Prosite database. First, have a look at the following Prosite entry: PS50235.

Question 1: Is PS50235 a pattern or a profile?
Question 2: Is PS50235 known to have false positives and false negatives in Swiss-Prot?
Question 3: What is the function of PS50235? (Check the documentation.)
Question 4: Is PS50235 related to other Prosite entries?
Question 5: Is the following protein related to PS50235? Use ScanProsite to perform a scan.

>unknown
MDLFIESKINSLLQFLFGSRQDFLRNFKTWSNNNNNLSIYLLIFGIVVFFYKKPDHLNYI
VESVSEMTTNFRNNNSLSRWLPRSKFTHLDEEILKRGGFIAGLVNDGNTCFMNSVLQSLA
SSRELMEFLDNNVIRTYEEIEQNEHNEEGNGQESAQDEATHKKNTRKGGKVYGKHKKKLN
RKSSSKEDEEKSQEPDITFSVALRDLLSALNAKYYRDKPYFKTNSLLKAMSKSPRKNILL
GYDQEDAQEFFQNILAELESNVKSLNTEKLDTTPVAKSELPDDALVGQLNLGEVGTVYIP
TEQIDPNSILHDKSIQNFTPFKLMTPLDGITAERIGCLQCGENGGIRYSVFSGLSLNLPN
ENIGSTLKLSQLLSDWSKPEIIEGVECNRCALTAAHSHLFGQLKEFEKKPEGSIPEKPIN
AVKDRVHQIEEVLAKPVIDDEDYKKLHTANMVRKCSKSKQILISRPPPLLSIHINRSVFD
PRTYMIRKNNSKVLFKSRLNLAPWCCDINEINLDARLPMSKKEKAAQQDSSEDENIGGEY
YTKLHERFEQEFEDSEEEKEYDDAEGNYASHYNHTKDISNYDPLNGEVDGVTSDDEDEYI
EETDALGNTIKKRIIEHSDVENENVKDNEELQEIDNVSLDEPKINVEDQLETSSDEEDVI
PAPPINYARSFSTVPATPLTYSLRSVIVHYGTHNYGHYIAFRKYRGCWWRISDETVYVVD
EAEVLSTPGVFMLFYEYDFDEETGKMKDDLEAIQSNNEEDDEKEQEQKGVQEPKESQEQG
EGEEQEEGQEQMKFERTEDHRDISGKDVN
2. Iterative training using PSI-BLAST

Using the NCBI web interface to the BLAST programs, run three cycles of PSI-BLAST using the sequence below as the initial query

>sw:ERCC5_XENLA/27-95
LAVDISIWLNQAVKGARDRQGNAIQNAHLLTLFHRLCKLLFFRIRPIFVFDGEAPLLKRQTLAKRRQRT

and Swiss-Prot as the database to search (to keep the list of results manageable!). At every cycle, record the number of matches with an E-value at or below the threshold, and the E-values obtained for the proteins ERCC5_XENLA, FEN1_HUMAN, and DIN7_YEAST. Complete the table below.

Can you work out how PSI-BLAST behaves over successive iterations?
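To reason about what happens between iterations, a toy model can help. The sketch below is not PSI-BLAST: it uses ungapped five-residue peptides, a uniform background, a fixed pseudocount, and a raw score cutoff instead of an E-value threshold. The peptides in `TOY_DB`, the function names, and the threshold are all made up for illustration. It only shows the general idea of rebuilding a position-specific scoring matrix from the current hits and searching again.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(aligned, pseudocount=1.0):
    """Log-odds scores per column from ungapped, equal-length sequences.

    Assumes a uniform background frequency (a deliberate simplification).
    """
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for column in zip(*aligned):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2((counts[aa] + pseudocount) / total / background)
            for aa in AMINO_ACIDS
        })
    return pssm

def score(pssm, peptide):
    """Sum the column scores of a peptide of the same length as the model."""
    return sum(column[aa] for column, aa in zip(pssm, peptide))

def iterate(query, database, threshold, rounds):
    """Search, rebuild the model from the hits, and search again."""
    model_input = [query]          # round 1 uses the query alone
    history = []
    for _ in range(rounds):
        pssm = build_pssm(model_input)
        hits = [p for p in database if score(pssm, p) >= threshold]
        history.append(hits)
        model_input = hits or [query]
    return history

# Hypothetical toy database; the last entry is an unrelated peptide.
TOY_DB = ["WFFKG", "WFFKN", "WYFKN", "WYFGD", "AAAAA"]

history = iterate("WFFKG", TOY_DB, threshold=3.0, rounds=3)
for round_number, hits in enumerate(history, start=1):
    print(round_number, hits)
```

In this toy setting the hit list grows from two to three to four peptides over the three rounds, while the unrelated peptide is never included. Whether the real search behaves similarly is what the table in this exercise is meant to reveal.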
3. Build a pattern

Consider the composition of the columns in the following multiple sequence alignment

Seq1 WFFKGIADKDAERHLLA
Seq2 WFFKNLEQKDAEARLLA
Seq3 WFFKR---KDAERQLLA
Seq4 WFFGTI---DAERQLLA
Seq5 WFFKDIPTKDAERQLLA
Seq6 WYFG----RESERLLLA
Seq7 WYFGKIPLKDAERQLLA
Seq8 WYFGKLRAKDTERLLLL

and try to write down a pattern using the Prosite syntax. Submit the pattern to the ScanProsite server and search against Swiss-Prot. What can you say about the proteins matching your pattern? One way to validate the output of a pattern search is to search against a randomized database. Use ScanProsite again, but this time against a reversed Swiss-Prot. Do you find any false positives?

Repeat the exercise with the following multiple sequence alignment:

seq1 ERGLR
seq2 DRASR
seq3 DRLGR
seq4 ERAAR
seq5 ERGVR

What happens when you search with this pattern against a randomized database?
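The mechanics of this exercise can also be tried offline. The following is a minimal sketch, not the ScanProsite implementation: `column_composition`, `prosite_to_regex`, and `scan` are illustrative names, the translator handles only a subset of the Prosite syntax (single residues, `x`, `[...]`, `{...}`, repeat counts, and the `<` and `>` anchors as element prefixes or suffixes), and the pattern and peptide used in the demonstration are invented.

```python
import re
from collections import Counter

def column_composition(rows):
    """Return one Counter per alignment column (gap characters included)."""
    assert len({len(row) for row in rows}) == 1, "rows must have equal length"
    return [Counter(column) for column in zip(*rows)]

# One pattern element: residue, x, [allowed] or {forbidden}, optional (n) or (n,m).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")

def prosite_to_regex(pattern):
    """Translate a subset of the Prosite pattern syntax into a Python regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):          # N-terminal anchor
            prefix, element = "^", element[1:]
        if element.endswith(">"):            # C-terminal anchor
            suffix, element = "$", element[:-1]
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":
            core = "."                       # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # forbidden residues
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)

def scan(pattern, sequence):
    """Return (1-based start, matched text) for every start position that matches."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]

# Column composition of the first alignment of this exercise.
ALIGNMENT = [
    "WFFKGIADKDAERHLLA", "WFFKNLEQKDAEARLLA", "WFFKR---KDAERQLLA",
    "WFFGTI---DAERQLLA", "WFFKDIPTKDAERQLLA", "WYFG----RESERLLLA",
    "WYFGKIPLKDAERQLLA", "WYFGKLRAKDTERLLLL",
]
composition = column_composition(ALIGNMENT)
for position, counts in enumerate(composition, start=1):
    print(position, "".join(sorted(counts)))

# Invented pattern and peptide, scanned forward and reversed.
TOY_PATTERN = "C-x(2,4)-C-{P}-[LIVM]"
TOY_PEPTIDE = "GSCAACGLGG"
print(prosite_to_regex(TOY_PATTERN))
print(scan(TOY_PATTERN, TOY_PEPTIDE))
print(scan(TOY_PATTERN, TOY_PEPTIDE[::-1]))
```

Reversing a sequence keeps its length and residue composition but scrambles the order that a true motif depends on, which is why a reversed database is a convenient negative control. Any hit against it is a chance match.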
4. Identification of Known Domains in a Protein Sequence

Several databases of protein domains exist. Most of them rely on profile-HMM-based predictors. What can you say about the following sequence:

>QUERY
MGQVGDFFIPNKIIFSDEKLVGKKIAMDGNLAYQFLTSIRLRGDSPNLRKRIGETSANYVYGKFTIHLLENIDTPIWVFD
PGEPLKKEKVRTRKRMEKKEEALKMKEIKAKDEFEEAYKAKARVYLSKTPMVEKNCLYSLLMGPIYEVASPEEGQASAMK
YAKGDWVVAVSQYDDALLYGAPRVVRNTLTTKEPMELINLEELVEDLRSILDDLIDIAIMFGDTNPYGVGKIGGFRKYAE
RVLGSVDAVKLKKEVYYEDIEIRKEFKKPVTDNYSLSLLKPDKEIIGKVFDLEDNFYNDRKVKVHKDYLNNLAITKKQKT
ADLWKF

using the following Motif-Scan servers (the list is not exhaustive!). Use the default settings on every web page (except MyHits), and don't hesitate to run the searches in parallel in multiple browser windows, since the servers can be slow to respond.

How different are these predictions? Which server do you prefer? Why?