
Protein sequencing

A Beckman-Coulter Porton LF3000G protein sequencing machine

Protein sequencing is a technique to determine the amino acid sequence of a protein, as well as which conformation the protein adopts and the extent to which it is complexed with any non-peptide molecules. Discovering the structures and functions of proteins in living organisms is an important tool for understanding cellular processes, and allows drugs that target specific metabolic pathways to be invented more easily.

The two major direct methods of protein sequencing are mass spectrometry and the Edman degradation reaction. It is also possible to generate an amino acid sequence from the DNA or mRNA sequence encoding the protein, if this is known. However, there are many other reactions that can be used to gain more limited information about protein sequences and can be used as preliminaries to the aforementioned methods of sequencing or to overcome specific inadequacies within them.

Protein sequencer

A protein sequencer is a machine that is used to determine the sequence of amino acids in a protein.

Sequencers work by tagging and removing one amino acid at a time, which is then analysed and identified. This cycle is repeated along the polypeptide until the whole sequence is established.

This method has generally been replaced by nucleic acid technology, and it is often easier to identify the sequence of a protein by looking at the DNA that codes for it.

Determining amino acid composition

It is often desirable to know the unordered amino acid composition of a protein prior to attempting to find the ordered sequence, as this knowledge can be used to facilitate the discovery of errors in the sequencing process or to distinguish between ambiguous results. Knowledge of the frequency of certain amino acids may also be used to choose which protease to use for digestion of the protein. A generalized method often referred to as amino acid analysis[1] for determining amino acid frequency is as follows:

  1. Hydrolyse a known quantity of protein into its constituent amino acids.
  2. Separate and quantify the amino acids in some way.


Hydrolysis is done by heating a sample of the protein in 6 M hydrochloric acid to 100–110 °C for 24 hours or longer. Proteins with many bulky hydrophobic groups may require longer heating periods. However, these conditions are so vigorous that some amino acids (serine, threonine, tyrosine, tryptophan, glutamine, and cysteine) are degraded. To circumvent this problem, Biochemistry Online suggests heating separate samples for different times, analysing each resulting solution, and extrapolating back to zero hydrolysis time. Rastall suggests a variety of reagents to prevent or reduce degradation, such as thiol reagents or phenol to protect tryptophan and tyrosine from attack by chlorine, and pre-oxidising cysteine. He also suggests measuring the quantity of ammonia evolved to determine the extent of amide hydrolysis.
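The extrapolation back to zero hydrolysis time can be sketched as follows. This is a minimal illustration, assuming that the loss of a labile amino acid is approximately first-order, so that the logarithm of the recovered amount falls linearly with time. The function name and the recovery figures are hypothetical and are not taken from the sources mentioned above.

```python
import math

def extrapolate_to_zero(times_h, amounts_nmol):
    """Least-squares fit of ln(amount) against time; returns the amount at t = 0.

    Assumes first-order degradation (an assumption of this sketch).
    """
    n = len(times_h)
    logs = [math.log(a) for a in amounts_nmol]
    mean_t = sum(times_h) / n
    mean_y = sum(logs) / n
    sxx = sum((t - mean_t) ** 2 for t in times_h)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times_h, logs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_t
    return math.exp(intercept)

# Made-up serine recoveries (nmol) after 24, 48 and 72 hours of hydrolysis
times = [24.0, 48.0, 72.0]
serine = [9.0, 8.1, 7.29]
print(round(extrapolate_to_zero(times, serine), 2))
```

With these invented figures, each 24-hour period loses 10% of the remaining serine, so the fit recovers the amount that would have been present before any degradation.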


The amino acids can be separated by ion-exchange chromatography or hydrophobic interaction chromatography. An example of the former is given by the NTRC using sulfonated polystyrene as a matrix, adding the amino acids in acid solution and passing a buffer of steadily increasing pH through the column. Amino acids will be eluted when the pH reaches their respective isoelectric points. The latter technique may be employed through the use of reversed phase chromatography. Many commercially available C8 and C18 silica columns have demonstrated successful separation of amino acids in solution in less than 40 minutes through the use of an optimised elution gradient.

Quantitative analysis

Once the amino acids have been separated, their respective quantities are determined by adding a reagent that will form a coloured derivative. If the amounts of amino acids are in excess of 10 nmol, ninhydrin can be used for this; it gives a yellow colour when reacted with proline, and a vivid purple with other amino acids. The concentration of amino acid is proportional to the absorbance of the resulting solution. With very small quantities, down to 10 pmol, fluorescamine can be used as a marker; it forms a fluorescent derivative on reacting with an amino acid.
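As a rough illustration of the proportionality between absorbance and concentration, the sketch below converts absorbances to amounts with a single-point calibration against a standard and then expresses the result as mole percentages. The function names and all numerical values are hypothetical; a real analysis would more likely use a multi-point standard curve for each amino acid.

```python
def amount_from_absorbance(a_sample, a_standard, standard_nmol):
    """Single-point calibration: amount is assumed to scale linearly with absorbance."""
    if a_standard <= 0:
        raise ValueError("standard absorbance must be positive")
    return standard_nmol * a_sample / a_standard

def mole_percent(amounts_nmol):
    """Convert absolute amounts into percentages of the total."""
    total = sum(amounts_nmol.values())
    return {aa: 100.0 * n / total for aa, n in amounts_nmol.items()}

# Made-up absorbances; each standard is assumed to be 20 nmol giving A = 0.50
absorbances = {"Gly": 0.50, "Ala": 0.25, "Leu": 0.25}
amounts = {aa: amount_from_absorbance(a, 0.50, 20.0) for aa, a in absorbances.items()}
composition = mole_percent(amounts)
print(composition)
```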

N-terminal amino acid analysis

Sanger's method of peptide end-group analysis: (A) derivatization of the N-terminal end with Sanger's reagent (DNFB); (B) total acid hydrolysis of the dinitrophenyl peptide

Determining which amino acid forms the N-terminus of a peptide chain is useful for two reasons: to aid the ordering of individual peptide fragments' sequences into a whole chain, and because the first round of Edman degradation is often contaminated by impurities and therefore does not give an accurate determination of the N-terminal amino acid. A generalised method for N-terminal amino acid analysis follows:

  1. React the peptide with a reagent that will selectively label the terminal amino acid.
  2. Hydrolyse the protein.
  3. Determine the amino acid by chromatography and comparison with standards.

There are many different reagents that can be used to label terminal amino acids. They all react with amine groups and will therefore also bind to amine groups in the side chains of amino acids such as lysine; for this reason, chromatograms must be interpreted carefully to ensure that the right spot is chosen. Two of the more common reagents are Sanger's reagent (1-fluoro-2,4-dinitrobenzene) and dansyl derivatives such as dansyl chloride. Phenylisothiocyanate, the reagent for the Edman degradation, can also be used. The same considerations apply here as in the determination of amino acid composition, except that no stain is needed, because the reagents produce coloured derivatives, and only qualitative analysis is required. The amino acid therefore does not have to be eluted from the chromatography column, only compared with a standard. A further consideration is that, since any amine groups will have reacted with the labelling reagent, ion-exchange chromatography cannot be used, and thin-layer chromatography or high-pressure liquid chromatography should be used instead.

C-terminal amino acid analysis

Far fewer methods are available for C-terminal amino acid analysis than for N-terminal analysis. The most common method is to add carboxypeptidases to a solution of the protein, take samples at regular intervals, and determine the terminal amino acid by analysing a plot of amino acid concentrations against time. This method is particularly useful for polypeptides and proteins with blocked N-termini. C-terminal sequencing would greatly help in verifying the primary structures of proteins predicted from DNA sequences and in detecting any post-translational processing of gene products from known codon sequences.

Edman degradation

Main article: Edman degradation

The Edman degradation is a very important reaction for protein sequencing, because it allows the ordered amino acid composition of a protein to be discovered. Automated Edman sequencers are now in widespread use, and are able to sequence peptides up to approximately 50 amino acids long. A reaction scheme for sequencing a protein by the Edman degradation follows; some of the steps are elaborated on subsequently.

  1. Break any disulfide bridges in the protein with a reducing agent like 2-mercaptoethanol. A protecting group such as iodoacetic acid may be necessary to prevent the bonds from re-forming.
  2. Separate and purify the individual chains of the protein complex, if there is more than one.
  3. Determine the amino acid composition of each chain.
  4. Determine the terminal amino acids of each chain.
  5. Break each chain into fragments under 50 amino acids long.
  6. Separate and purify the fragments.
  7. Determine the sequence of each fragment.
  8. Repeat with a different pattern of cleavage.
  9. Construct the sequence of the overall protein.
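
The final step of the scheme, constructing the overall sequence from fragments produced by two different cleavage patterns, can be sketched as a greedy overlap merge. This is only an illustration under simplifying assumptions: the fragments are error-free, every overlap is genuine, and the fragment strings are invented. The function names are hypothetical.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def assemble(fragments):
    """Greedily merge fragments by their longest suffix/prefix overlap."""
    # Discard duplicates and fragments wholly contained in another fragment.
    unique = sorted(set(fragments))
    frags = [f for f in unique if not any(f != g and f in g for g in unique)]
    while len(frags) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no overlaps left; remaining pieces cannot be ordered
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return sorted(frags)

# Invented fragments from two different cleavage patterns of one chain
digest_1 = ["AGK", "MFPR", "STLK", "WC"]
digest_2 = ["AGKM", "FPRSTLKWC"]
print(assemble(digest_1 + digest_2))
```

A greedy merge can be misled by short chance overlaps in real data, which is one reason the terminal amino acids and the composition of each chain are determined beforehand as independent checks.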

Digestion into peptide fragments

Peptides longer than about 50–70 amino acids cannot be sequenced reliably by the Edman degradation. Because of this, long protein chains need to be broken up into small fragments that can then be sequenced individually. Digestion is done either by endopeptidases such as trypsin or pepsin or by chemical reagents such as cyanogen bromide. Different enzymes give different cleavage patterns, and the overlap between fragments can be used to construct an overall sequence.
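
The idea that different reagents give different cleavage patterns can be illustrated with a small in-silico digest. The sketch below uses the commonly quoted simplified rules that trypsin cleaves after lysine (K) or arginine (R) unless the next residue is proline (P), and that cyanogen bromide cleaves after methionine (M). Real digests are less clean (missed cleavages, chemical modification of the cleaved residue), and the function name and example sequence are hypothetical.

```python
def cleave_after(seq, residues, not_before=""):
    """Split seq after any residue in `residues`, unless the next residue is in `not_before`."""
    fragments, start = [], 0
    for i in range(len(seq) - 1):
        if seq[i] in residues and seq[i + 1] not in not_before:
            fragments.append(seq[start:i + 1])
            start = i + 1
    fragments.append(seq[start:])  # the C-terminal fragment
    return fragments

def trypsin_digest(seq):
    """Simplified trypsin rule: after K or R, but not before P."""
    return cleave_after(seq, "KR", not_before="P")

def cnbr_digest(seq):
    """Simplified cyanogen bromide rule: after M."""
    return cleave_after(seq, "M")

peptide = "AGKMFPRSTLKWC"  # invented example sequence
print(trypsin_digest(peptide))
print(cnbr_digest(peptide))
```

Because the two sets of fragments break the chain at different positions, fragments from one digest span the cut sites of the other, which is what makes the overlap-based reconstruction possible.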

The Edman degradation reaction

The peptide to be sequenced is adsorbed onto a solid surface. One common substrate is glass fibre coated with polybrene, a cationic polymer. The Edman reagent, phenylisothiocyanate (PITC), is added to the adsorbed peptide, together with a mildly basic buffer solution of 12% trimethylamine. This reacts with the amine group of the N-terminal amino acid.

The terminal amino acid can then be selectively detached by the addition of anhydrous acid. The derivative then isomerises to give a substituted phenylthiohydantoin, which can be washed off and identified by chromatography, and the cycle can be repeated. The efficiency of each step is about 98%, which allows about 50 amino acids to be reliably determined.
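
The link between the per-cycle efficiency and the practical read length can be made concrete with a short calculation. The sketch below assumes, as a simplification, that the same efficiency applies to every cycle, so the fraction of chains still in phase after n cycles is the efficiency raised to the power n. The function name is hypothetical.

```python
def cumulative_yield(step_efficiency, cycles):
    """Fraction of peptide chains still in phase after the given number of cycles."""
    return step_efficiency ** cycles

for n in (10, 30, 50, 70):
    print(n, round(cumulative_yield(0.98, n), 3))
```

At 98% per cycle, only a little over a third of the chains remain in phase after 50 cycles, and the signal from the correct residue becomes increasingly hard to distinguish from the background produced by out-of-phase chains.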

Limitations of the Edman degradation

Because the Edman degradation proceeds from the N-terminus of the protein, it will not work if the N-terminal amino acid has been chemically modified or if it is concealed within the body of the protein. It also requires the use of either guesswork or a separate procedure to determine the positions of disulfide bridges.

Mass spectrometry

Main article: Mass spectrometry

The other major direct method by which the sequence of a protein can be determined is mass spectrometry.[2] This method has been gaining popularity in recent years as new techniques and increasing computing power have facilitated it. Mass spectrometry can, in principle, sequence any size of protein, but the problem becomes computationally more difficult as the size increases. Peptides are also easier to prepare for mass spectrometry than whole proteins, because they are more soluble.

One method of delivering the peptides to the spectrometer is electrospray ionization, for which John Bennett Fenn won the Nobel Prize in Chemistry in 2002. The protein is digested by an endoprotease, and the resulting solution is passed through a high-pressure liquid chromatography column. At the end of this column, the solution is sprayed out of a narrow nozzle charged to a high positive potential into the mass spectrometer. The charge on the droplets causes them to fragment until only single ions remain. The peptides are then fragmented and the mass-to-charge ratios of the fragments are measured. It is possible to detect which peaks correspond to multiply charged fragments, because these will have auxiliary peaks corresponding to other isotopes; the distance between these peaks is inversely proportional to the charge on the fragment.

The mass spectrum is analysed by computer and often compared against a database of previously sequenced proteins in order to determine the sequences of the fragments. This process is then repeated with a different digestion enzyme, and the overlaps in the sequences are used to construct a sequence for the protein.
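
The relationship between isotope peak spacing and charge mentioned above can be illustrated with a short calculation. The sketch assumes that adjacent isotope peaks differ by one neutron-equivalent mass, approximated here by the mass difference between carbon-13 and carbon-12 (about 1.00335 Da), so the spacing on the mass-to-charge axis is roughly 1.00335 divided by the charge. The function name and the peak values are hypothetical.

```python
ISOTOPE_SPACING_DA = 1.00335  # approximate 13C - 12C mass difference

def charge_from_isotope_peaks(mz_peaks):
    """Estimate the charge state from the mean spacing of adjacent isotope peaks."""
    peaks = sorted(mz_peaks)
    if len(peaks) < 2:
        raise ValueError("at least two isotope peaks are needed")
    spacings = [b - a for a, b in zip(peaks, peaks[1:])]
    mean_spacing = sum(spacings) / len(spacings)
    return round(ISOTOPE_SPACING_DA / mean_spacing)

# Made-up isotope clusters
print(charge_from_isotope_peaks([500.25, 500.75, 501.25]))  # spacing of about 0.5
print(charge_from_isotope_peaks([800.40, 801.40]))          # spacing of about 1.0
```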

Predicting protein sequence from DNA/RNA sequences

In organisms that do not have introns (e.g., prokaryotes), the amino acid sequence of a protein can also be determined indirectly from the mRNA or the DNA that codes for the protein. If the sequence of the gene is already known, the protein sequence follows directly from it. However, the DNA sequence encoding a newly isolated protein is rarely known, so the gene must first be found if this method is to be used. One way to do this is to sequence a short section of the protein, perhaps 15 amino acids long, by one of the above methods, and then use this sequence to generate a complementary marker for the protein's RNA. The marker can then be used to isolate the mRNA coding for the protein, which can be replicated in a polymerase chain reaction to yield a significant amount of DNA, which can in turn be sequenced relatively easily. The amino acid sequence of the protein can then be deduced from this. However, it is necessary to take into account the possibility of amino acids being removed after the mRNA has been translated.
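
Deducing the amino acid sequence from an intron-free coding sequence amounts to reading it three bases at a time against the genetic code. The following is a minimal sketch using the standard genetic code; it is not the implementation of any particular tool, and the function name, the `frame` parameter and the example sequence are assumptions made for illustration.

```python
BASES = "TCAG"
# Standard genetic code, listed in TCAG order for each codon position
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(sequence, frame=1):
    """Translate DNA or mRNA in forward frame 1, 2 or 3; '*' marks a stop codon."""
    if frame not in (1, 2, 3):
        raise ValueError("frame must be 1, 2 or 3")
    seq = sequence.upper().replace("U", "T")
    protein = []
    for pos in range(frame - 1, len(seq) - 2, 3):
        # Codons containing unrecognised symbols are reported as 'X'
        protein.append(CODON_TABLE.get(seq[pos:pos + 3], "X"))
    return "".join(protein)

print(translate("ATGGCCAAATGA"))
print(translate("ATGGCCAAATGA", frame=2))
```

The sketch does not model removal of residues after translation, so the result is the sequence as encoded, not necessarily that of the mature protein.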

Bioinformatics tools for sequencing

Bioinformatics tools exist that translate nucleic acid sequences into their corresponding polypeptide chains. One such tool is EMBOSS Transeq. Transeq takes a nucleic acid sequence as input and produces the corresponding amino acid sequence based on the selected settings. The tool is designed so that researchers who are not computer scientists can use it simply by accessing the web tool. Use of the tool is broken into three steps: input of sequences, parameter selection, and submission. There are two options for inputting sequences. The first is to enter the sequence in the textbox in a format accepted by Transeq. The second is to upload a file of an accepted type using the “Choose File” button located under the textbox.

Once the data has been input into the tool, the next step is to select the desired parameters. Several parameters can be changed. The first two, available before selecting “More Options”, are Frame and Codon Table. Frame has a default value of 1. “Frame -1 is the reverse-complement of the sequence having the same codon phase as frame 1. Frame -2 is the same phase as frame 2. Frame -3 is the same phase as frame 3”.[3] Codon Table allows the user to select which genetic code table they want to use. When “More Options” is selected, three more parameters become available: Regions, Trim, and Reverse. The Regions parameter allows the user to select which regions will be translated. The default value is START-END, meaning that the entire sequence will be translated. Trim removes the stop and ambiguity symbols from the end of the translation. The default value for Trim is false, meaning that the symbols are not removed from the translation. The last parameter available to the user through the web app is Reverse. If selected, Reverse translates the complement of the sequence. The default value of Reverse is false.

After both the data and parameters are set, the final step is submission. Before submission there is an option to change the Job Title. The title will be associated with the results and may appear in graphical representations. There is also an option to enter an email address so that the user is notified when the job has been completed. Once the “Submit” button has been pressed, the tool will begin to run and will produce an output. The output is saved on the server for 7 days and can be accessed by using the job-id or the URL of the output page. The user can also download the output and toggle colors so that the amino acids are shown in different colors. Each amino acid is represented by its one-letter symbol, stops are indicated with a star (*), and spots of ambiguity are marked with “X”.

Another option is to use a bioinformatics tool such as EMBOSS Backtranseq. EMBOSS Backtranseq is a tool that takes an amino acid sequence and back-translates it into the most likely nucleic acid sequence. Backtranseq uses a codon usage table, which gives the frequency of usage of each codon for each amino acid. For each amino acid in the input sequence, the corresponding most frequently occurring codon is used in the nucleic acid sequence that is output.[4] Backtranseq has a web application available for end-users. The tool is simple to use and has three steps. The first step is sequence input. The input can be entered in the text box in one of the following formats: GCG, FASTA, EMBL, GenBank, PIR, NBRF, PHYLIP or UniProtKB/Swiss-Prot. The input can also be uploaded by using the “Choose File” button under the textbox. The supported file types are the same as the formats accepted when data is entered into the textbox. The second step is to select the parameters. Only one parameter is available in the web app: the codon usage table. This allows the user to select which table they want to use for the back-translation. The final step is submission. Before the job is submitted there is an option to receive an email notification when it is completed, as well as an option to give the job a title. As with Transeq, the job title may be visible on some graphics in the output. After submission the job will run. Once completed, the output is displayed on a web page and is stored on the server for 7 days. There is an option to download the output file as well as an option to toggle colors on the output.
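
The back-translation rule described above, choosing the most frequently used codon for each amino acid, can be sketched in a few lines. This is not Backtranseq's own code. The codon usage frequencies below are invented placeholders covering only four amino acids, and the names `CODON_USAGE` and `back_translate` are hypothetical; a real analysis would load a full usage table for the organism of interest.

```python
# Hypothetical codon usage frequencies (made-up numbers, illustration only)
CODON_USAGE = {
    "M": {"ATG": 1.00},
    "A": {"GCG": 0.36, "GCC": 0.27, "GCA": 0.21, "GCT": 0.16},
    "K": {"AAA": 0.76, "AAG": 0.24},
    "W": {"TGG": 1.00},
}

def back_translate(protein, usage=CODON_USAGE):
    """Replace each amino acid with its most frequently used codon."""
    codons = []
    for aa in protein.upper():
        if aa not in usage:
            raise KeyError(f"no codon usage data for {aa!r}")
        freqs = usage[aa]
        codons.append(max(freqs, key=freqs.get))
    return "".join(codons)

print(back_translate("MAKW"))
```

Because most amino acids are encoded by several codons, the result is only the most likely nucleic acid sequence under the chosen table, not necessarily the sequence actually present in the organism.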

References

  1. ^ Michail A. Alterman; Peter Hunziker (2 December 2011). Amino Acid Analysis: Methods and Protocols. Humana Press. ISBN 978-1-61779-444-5. 
  2. ^ Coon, Joshua J. (April 13, 2009). "Collisions or Electrons? Protein Sequence Analysis in the 21st Century". Anal. Chem. 81 (9): 3208–3215. doi:10.1021/ac802330b. 
  3. ^ "EMBOSS: Transeq". EMBOSS, n.d. Web. 05 Apr. 2015.
  4. ^ "EMBOSS: Backtranseq". EMBOSS, n.d. Web. 05 Apr. 2015.

Further reading

  • Steen, Hanno; Mann, Matthias (2004). "The abc's (and xyz's) of peptide sequencing". Nature Reviews Molecular Cell Biology 5 (9): 699–711. ISSN 1471-0072. doi:10.1038/nrm1468.