Introduction to wEMBOSS

Friday May 11, 2007

EMBOSS

Practicals: Answers


[Questions]

 

1 Retrieve Sequences and Files

First run showdb to retrieve the list of available databases [output]. The Swiss-Prot database can be accessed using the name "swissprot". To retrieve the sequence of the Swiss-Prot entry P57727 in fasta format, use the program seqret: type swissprot:P57727 as input for the program [input]. The default output format is fasta [output].

The difference between entret and seqret is that entret reads and writes the complete sequence entry, together with the heading annotation (documentation), without attempting to reformat or interpret the data in any way [output]. seqret, on the other hand, reads the entry data, determines which part of it is the sequence, which is the description line and which is the feature table, and then writes the sequence, description and features in the way prescribed by the sequence format requested for output.
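The kind of interpretation seqret performs can be illustrated with a minimal Python sketch. It is not how seqret is implemented: the function name swissprot_to_fasta and the toy entry are made up for illustration, only the ID, DE and SQ lines are interpreted, and the features are simply dropped.

```python
def swissprot_to_fasta(entry: str, width: int = 60) -> str:
    """Pick identifier, description and sequence out of a Swiss-Prot-style entry."""
    identifier, description, residues, in_sequence = "", [], [], False
    for line in entry.splitlines():
        code = line[:2]                      # two-letter line code
        if code == "ID":
            identifier = line.split()[1]
        elif code == "DE":
            description.append(line[5:].strip())
        elif code == "SQ":
            in_sequence = True               # sequence lines follow
        elif code == "//":
            break                            # end of entry
        elif in_sequence:
            residues.append(line.replace(" ", ""))
    sequence = "".join(residues)
    body = "\n".join(sequence[i:i + width] for i in range(0, len(sequence), width))
    return f">{identifier} {' '.join(description)}\n{body}\n"


# Hypothetical toy entry, not a real database record.
toy_entry = "\n".join([
    "ID   TOY_HUMAN    Reviewed;   12 AA.",
    "DE   Made-up demonstration protein.",
    "FT   DOMAIN   2   6   Hypothetical.",
    "SQ   SEQUENCE   12 AA;",
    "     MKWVLTAAHC LG",
    "//",
])

print(swissprot_to_fasta(toy_entry))
```

An entret-like tool would instead print toy_entry unchanged, annotation lines included.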

If you have a look at the Features FT lines of the file retrieved by entret you will find five splice variants annotated (FT lines: VARSPLIC), which lead to isoforms B, T, D and E of the protein [output].

If you have a look at the Cross-References DR lines of the file retrieved by entret you will find 9 cross references to the EMBL database, which correspond to the mRNA sequence for this protein [output].
You can retrieve the DNA sequence corresponding to one of the EMBL entries, e.g. embl:AF201380, with seqret [input] [output].

 
2 Find Protein domains/motifs

The patmatmotifs program reports the occurrence of [input] [output]:
1) The active site of the serine proteases (trypsin family);
2) The low-density lipoprotein (LDL) receptor domain;
3) The scavenger receptor cysteine-rich (SRCR) domain.

In particular, patmatmotifs reports the positions of two (HIS and SER) of the three residues that form the catalytic triad of the serine proteases of the trypsin family. Please read the documentation of the matching patterns for details about the catalytic triad of the active site.
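Pattern matching of this kind can be sketched in a few lines of Python by translating a PROSITE-style pattern into a regular expression. This is only an illustration, not the patmatmotifs implementation: the helper names are made up, the pattern is written in the style of the trypsin-family histidine active-site signature (consult PROSITE for the authoritative entry), and the peptide is a toy sequence.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    regex = []
    for element in pattern.rstrip(".").split("-"):
        # '<' pins the match to the N-terminus, '>' to the C-terminus.
        prefix = "^" if element.startswith("<") else ""
        suffix = "$" if element.endswith(">") else ""
        element = element.strip("<>")
        # Split off an optional repeat count such as x(2) or x(2,4).
        repeat = ""
        if "(" in element:
            element, count = element.rstrip(")").split("(")
            repeat = "{" + count + "}"
        if element == "x":                  # any residue
            core = "."
        elif element.startswith("{"):       # {AB}: any residue except A or B
            core = "[^" + element[1:-1] + "]"
        else:                               # single residue or [ABC] choice
            core = element
        regex.append(prefix + core + repeat + suffix)
    return "".join(regex)


def find_motif(pattern: str, sequence: str):
    """Return (start, end, match) tuples with 1-based inclusive coordinates."""
    compiled = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.end(), m.group()) for m in compiled.finditer(sequence)]


his_like_pattern = "[LIVM]-[ST]-A-[STAG]-H-C"   # illustrative pattern (assumption)
toy_peptide = "MKWVLTAAHCLG"                    # made-up sequence

print(prosite_to_regex(his_like_pattern))
print(find_motif(his_like_pattern, toy_peptide))
```

On the toy peptide the pattern matches residues 5 to 10 (LTAAHC), with the histidine at position 9.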

The occurrence of the domains is also annotated in the FT lines of the Swiss-Prot entry (which you already retrieved in the previous exercise with the entret program) [output]:
1) LDL receptor domain, from amino acid 72 to amino acid 108 of the sequence.
2) SRCR domain, from amino acid 109 to amino acid 205 of the sequence.
3) The peptidase domain containing the active site, from amino acid 217 to amino acid 449 of the sequence.

 
3 Pairwise sequence alignment

You can use needle to perform a global pairwise alignment and water to perform a local pairwise alignment.

The local pairwise alignment algorithm (water) tries to find the best local alignment(s) between the two sequences, which in this case is the alignment between the second and the third domain that the two proteins have in common [input] [output]. In fact, in the splice variant the first domain (the LDL receptor) is missing.
The global pairwise alignment algorithm (needle), on the other hand, aligns the entire sequences [input] [output].
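The difference between the two strategies can be seen in a small Python sketch of the underlying dynamic programming (Needleman-Wunsch for global, Smith-Waterman for local). It is a simplification under stated assumptions: the scores (match 2, mismatch -1, gap -2) are arbitrary, the gap penalty is linear and applies to end gaps too, and no substitution matrix is used, so the numbers are not comparable to needle or water output. The two strings are made-up stand-ins for a full-length protein and a variant lacking its first domain.

```python
def alignment_score(a: str, b: str, local: bool,
                    match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Best alignment score with a linear gap penalty.

    local=False: global alignment, both sequences aligned end to end.
    local=True:  local alignment, best-scoring pair of sub-sequences.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # A global alignment pays for gaps at the very start as well.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)      # a local alignment may restart anywhere
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


full_length = "LDLRSRCRPEPT"   # toy: three "domains" LDLR + SRCR + PEPT
variant = "SRCRPEPT"           # toy: first "domain" missing

print(alignment_score(full_length, variant, local=False))
print(alignment_score(full_length, variant, local=True))
```

The local score (16) comes from the eight shared residues alone, while the global score (8) also pays for the four gap positions opposite the missing first segment.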

 
4 Producing a restriction map

Use, for instance, the EMBL sequence AF201380 and run the program restrict [input] [output] or remap [input] [output]. Type '6' in the input option 'Minimum recognition site length'. In the advanced section, type '1' in the 'Minimum cuts per RE' option and '2' in the 'Maximum cuts per RE' option.
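The effect of these three options can be sketched in Python as a filter over a small enzyme table. This is an illustration only: restrict works from the REBASE enzyme data, whereas the table below holds just four well-known recognition sites, the DNA is a made-up toy sequence, and the function names are hypothetical. All four sites are palindromic, so searching one strand is sufficient here.

```python
def count_sites(sequence: str, site: str) -> int:
    """Count (possibly overlapping) occurrences of a recognition site."""
    sequence, site = sequence.upper(), site.upper()
    return sum(sequence.startswith(site, i)
               for i in range(len(sequence) - len(site) + 1))


def select_enzymes(sequence, enzymes, min_site_len=6, min_cuts=1, max_cuts=2):
    """Keep enzymes whose site is long enough and whose cut count is in range."""
    selected = {}
    for name, site in enzymes.items():
        if len(site) < min_site_len:
            continue                      # 'Minimum recognition site length'
        cuts = count_sites(sequence, site)
        if min_cuts <= cuts <= max_cuts:  # 'Minimum/Maximum cuts per RE'
            selected[name] = cuts
    return selected


enzymes = {"EcoRI": "GAATTC", "BamHI": "GGATCC", "HindIII": "AAGCTT", "AluI": "AGCT"}

# Toy DNA: two EcoRI sites, three BamHI sites, one HindIII site.
toy_dna = "".join(["TT", "GAATTC", "AA", "GGATCC", "TT", "GAATTC", "CC",
                   "AAGCTT", "CC", "GGATCC", "AA", "GGATCC", "TT"])

print(select_enzymes(toy_dna, enzymes))
```

With the settings from this exercise, BamHI is dropped because it cuts three times and AluI because its site is only four bases long.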

 
5 Translation

entret reads and writes the complete sequence entry together with the heading annotation. The coding sequence of the mRNA is reported in the FT lines of the entry: from nucleotide 144 to nucleotide 1511 [output].

You can translate the coding sequence to the corresponding protein product either with the program coderet (extracts CDS automatically from the feature tables) [input] [output] or with the program transeq (by specifying in the input options the begin and the end of the DNA sequence to be translated) [input] [output].
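Translating a region given by its begin and end positions can be sketched in Python as follows. This is a minimal illustration using the standard genetic code, not the transeq implementation; the function name translate_region and the toy mRNA are made up. Positions are 1-based and inclusive, as in the FT lines, so the region 144 to 1511 spans 1511 - 144 + 1 = 1368 nucleotides, i.e. 456 codons.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate_region(dna: str, begin: int, end: int) -> str:
    """Translate dna[begin..end] (1-based, inclusive), starting at 'begin'."""
    region = dna.upper().replace("U", "T")[begin - 1:end]
    usable = len(region) - len(region) % 3          # ignore a trailing partial codon
    codons = (region[i:i + 3] for i in range(0, usable, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)  # X = unknown


toy_mrna = "GGCATGAAATGGTAACCC"    # made-up: 5' flank, ATG AAA TGG TAA, 3' flank

print(translate_region(toy_mrna, 4, 15))
```

For the toy mRNA the coding region 4 to 15 translates to MKW followed by the stop symbol *.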

The application getorf finds and extracts potential open reading frames in all six frames. Since it is a predictive algorithm, errors can occur, especially (as in this example) in predicting the correct start of a reading frame (starting at position 3 instead of position 144) [input] [output].
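One way such a start-position error can arise is sketched below in Python: if an open reading frame is taken to be a stop-free stretch, it begins right after the previous stop codon (or at the start of the sequence), which may lie upstream of the real ATG. Whether this is the exact reason in this exercise is an assumption. The function names and the toy sequence are made up, and only one forward frame is scanned, whereas getorf examines all six.

```python
STOPS = {"TAA", "TAG", "TGA"}


def open_regions(dna: str, frame: int):
    """Stop-free stretches in one forward frame, as 1-based inclusive (start, end)."""
    dna = dna.upper()
    regions, region_start = [], frame
    usable_end = frame + (len(dna) - frame) // 3 * 3   # end of the last full codon
    for i in range(frame, usable_end, 3):
        if dna[i:i + 3] in STOPS:
            if i > region_start:
                regions.append((region_start + 1, i))  # stop codon itself excluded
            region_start = i + 3
    if usable_end > region_start:
        regions.append((region_start + 1, usable_end))
    return regions


def first_atg(dna: str, start: int, end: int):
    """1-based position of the first in-frame ATG inside a region, or None."""
    for i in range(start - 1, end - 2, 3):
        if dna[i:i + 3].upper() == "ATG":
            return i + 1
    return None


toy_mrna = "GGCATGAAATGGTAACCC"    # made-up sequence, coding region 4..15

regions = open_regions(toy_mrna, frame=0)
print(regions)
print(first_atg(toy_mrna, *regions[0]))
```

Here the first stop-free stretch is reported from position 1, although the first ATG in that frame is at position 4.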

 
6 Designing primers

Use the program eprimer3 to design the 6 best primer pairs for the EMBL sequence AF201380 (type '6' in the advanced option 'Number of results to return'). If you check the advanced output option 'Explain flag' you will see that 393 primer pairs are considered OK by the program [input] [output].


To exclude the first and the last 12 base pairs of the sequence, specify the sub-region 12,2404 in the advanced option 'Included region(s)', which defines the region in which primers may be picked [input] [output].

To design an internal oligo to detect one of the sequence variants (for instance isoform B, which starts at amino acid 127), the 'Target region(s)' advanced option can be specified. If one or more targets are specified, then a legal primer pair must flank at least one of them [input] [output].

 

Questions: L. Bordoli (Lorenza.Bordoli@unibas.ch) or L. Falquet (Laurent.Falquet@isb-sib.ch)

 

 

Latest update 2007-05-11