Protein Folding

The Concept

P

roteins are one of the four macromolecules critical for life, found amongst nucleic acids, carbohydrates, and lipids as the building blocks in all cells. They perform many of the jobs in the cell and also play a central role in the cell's structure. Proteins are made up of individual monomers called amino acids, which are linked together by peptide bonds. The order of these amino acids is determined by the DNA coding for the particular protein. This order is called the "primary structure" of the protein. There are 20 kinds of amino acids coded for in the DNA. Proteins can be anywhere from short polypeptides of about 50 amino acids to very long chains consisting of thousands. The order of the amino acids is critical for the proper functioning of the protein. Proteins with even one incorrect, missing, or added amino acid can be rendered unusable.


A

fter the amino acids have been bonded together to form a polypeptide, they begin to curl and layer into various secondary structures such as \alpha-helices and \beta-pleated sheets. The tertiary structure is acquired as the protein completely finishes folding itself. When the protein is initially manufactured it may require a chaperone protein to assist it in folding. If the protein is placed in a non-standard environment (different temperature, pressure, etc. than its native conditions) it may become denatured and unfold. While some proteins will again require chaperone proteins to assist in refolding, other proteins have been witnessed to refold automatically when returned to their natural conditions. This project addresses one of the modern mysteries of biology - the ability of some proteins to refold "by themselves" when returned to their native environment.



Image Source
Protein Structure from Boundless.com

The Problem

A

mino acids are bonded to their nearest neighbors by peptide bonds. The bond between two amino acids is not entirely flexible, but it does allow for some movement. If we schematically represent the bond as seen in the figure, we see there are two characteristic relative angles between the amino acids. In most cases reversal of either of these angles is energetically equivalent. Thus for each bond we have four possible configurations. This is approximately true for every bond between neighboring amino acids (ignoring possible interference with the existing chain), and given a polypeptide of N amino acids, we conclude there are \sim4^N different possible configurations.


S

o what's the paradox? Cyrus Levinthal tastefully pointed out that, for example, given a chain of 300 amino acids, the number of accessible tertiary structures is \sim10^{180}. Yet we have discovered that for many proteins, while removing them from their native environment causes them to "denature" (unfold), returning them to their natural surroundings initiates an automatic and in most cases accurate refolding to original structure, within a very short period of time such as a matter of seconds. If the protein was sampling every possible configuration along its folding path, even if it stayed in each configuration for only a tenth of a picosecond (\sim10^{-13} seconds), it should take longer than the age of the universe to find the correct structure!


We conclude (quoted from Nicholas Giordano's Computational Physics page 303):

Since proteins are, in fact, able to fold rapidly into their proper tertiary structure, it would appear that a protein is somehow able to locate the correct structure without searching through all of the possibilities. Precisely how it accomplishes this feat and how it "knows" what the proper tertiary structure should be, is not understood. [Emphasis added] This is the protein-folding problem.



Image Source
Computational Physics by Nicholas Giordano

The Model

M

uch interest has been sparked in the scientific community regarding modeling the folding process, with the hopes that successfully modeling it may shine light on some of its enigmas and may even bring us to the point of predicting the folding of new synthetic molecules. Obviously modeling such a complex process is no trivial matter. We must begin with the simplest of models, and slowly introduce more and more of the complexity of reality as we go. We begin by introducing the most basic model of a protein: a two-dimensional lattice model. Such a model is described by Nicholas Giordano in his book Computational Physics, in chapter 11.


C

onsider a chain of N amino acids. Two attributes of each amino acid must be indicated: the type of amino acid and its location in the chain. There are 20 different types of amino acids for which the DNA contains code. Correspondingly, each amino acid can be arbitrarily assigned a number, 1 through 20. In addition, each amino acid is assigned a second number indicating its location in the chain. This primary structure is labeled:


A(1),A(2),...,A(N)

The model is illustrated by overlaying the labeled chain onto a lattice diagram, placing amino acids (represented by the solid dots) at the lattice structure intersections. The monomers are assumed to be covalently bonded by the peptide bonds mentioned earlier. In addition, bonded monomers must always be found exactly one lattice site apart (vertically or horizontally and not diagonally or further). Now we must consider energetic interactions between amino acids. For the sake of simplicity consider only nearest-neighbor interactions between amino acids that are not covalently bonded. All possible interactions (e.g. hydrogen bonding, interactions with water molecules, Van der Waals forces, etc) are combined into one effective energy value associated with a nearest-neighbor pair. This energy is denoted:


J_{A(m),A(n)}

where m and n are locations in the chain. Then the total interaction energy of the chain is given by


$$E=\sum_{(m,n)}\delta_{m,n}J_{A(m),A(n)}

where the summation is over all possible pairs of amino acids in the chain. The delta function is 1 when the pair are nearest neighbors and thus their interaction energy contributes, and is 0 otherwise. For example, in the above illustration monomers 3 and 6 are nearest neighbors and thus their delta function gives 1, whereas the pair (2,6) are not nearest neighbors and their delta function gives 0.


The program presented here in the "Program" tab uses a more advanced 3-D version of this 2-D model. However the basic concepts are maintained.



Image Source
Computational Physics by Nicholas Giordano

The Method

T

he most accurate way of determining the state of a protein due to energy considerations is an overwhelmingly complex task. However, common behavioral traits allow us to use a statistical process known as the Monte Carlo method to simulate likely results given a limited range of the interaction energies. To initialize the method, several values must first be determined: the primary structure described in "The Model", interaction energies, and the initial tertiary structure. Once these values are set, the method is as follows, as outlined by Nicholas Giordano in his book Computational Physics.

  • A random integer representing a location in the chain in the range 1 to N is selected. The lattice site (x_0,y_0) is assigned to this randomly selected amino acid.
  • One of the site's four nearest neighbors is randomly selected.
  • A check is performed on the occupation of this neighboring site. If a covalent bond would not be broken, and the neighboring site is unoccupied, the amino acid is permitted to move. Permission to move does not mean a move occurs.
  • Once permission is determined, the energy required to make the move is calculated from the difference in the energy states of the current amino acid configuration and the potential new configuration. If the energy required to make a move is greater than zero, the Boltzmann factor is then used to determine whether or not the move is made. If the required energy is less than zero the move is made.
  • Repeat.

Each iteration of this process is called a Monte Carlo time step, and the simulation involves performing many of these time steps. The hope is that as more time steps are performed, the overall state of the protein will gradually approach the ideal folded state.


O

ne may ask, why use a statistical approach that sometimes allows the energy to actually increase? Kirkpatrick explains in his paper "Optimization by Simulated Annealing" (Science, Vol. 220, May 1983, pp 671-680) that only allowing moves that lower the energy is like freezing something rapidly. In such circumstances the substance is very likely to be caught in metastable energy states in which the molecules found a local minimum of energy, but completely "missed" a much better global minimum because of the rapidity of the freezing and the impossibility of "adjusting" after they "choose" a lower energy configuration. Thus this new approach was developed, in which the energy may increase slightly, making possible small adjustments that may allow the molecule to ultimately find a lower energy configuration. The probabilistic handling of the move imitates a molecule interacting thermally with atoms in a heat bath at a given temperature T. In this set-up, the best way to achieve the lowest energy state is to first melt the substance (raising it to very high temperatures), and then slowly reduce the temperature while maintaing thermal equilibrium at each stage. This is called temperature annealing. If thermal equilibrium is violated near the freezing point, the substance will most likely enter local metastable states while failing to reach a more global minimum. In the simulation included below, the user is provided with the option to turn temperature annealing on and to set the initial and final temperatures as well as the number of time steps. This way the user can compare the final state of a polypeptide that experienced temperature annealing as it was folding versus a polypeptide that attempted folding at a constant temperature.


The Program

A

former Technion student by the name of Erez developed a program simulating 3-D protein folding based on the Monte Carlo method described above. His program is included here. The program can be downloaded as a ZIP or as a TAR.


In order to run the program in Windows:

  1. Open the "run" file in MATLAB.
  2. Make sure that MATLAB sets the current folder to be the folder containing the program files by right-clicking on the open tab of the file "run.m" and choosing the option "Change Current Folder to..." as seen here. Note: Clicking on any screenshot presented here will open it up in a new tab in your browser for easier viewing in higher definition. To return to this page simply click on the tab containing this site in your browser.

  1. After making sure the current folder is set correctly, press the green "Run" button or type "run" in the Command Window and hit enter:

In order to familiarize the user with the program, Erez has included a Help button that brings up a Help menu explaining the procedure of usage and the different settings options available to the user. This is the opening screen of his program with the Help button marked.

Clicking on the Help button brings up the following Help menu.

Note: The words in bright teal in the center right of the screenshot say "Use mouse to rotate the protein chain Click right mouse button and drag" [sic].

When inside Erez's running MATLAB program, clicking on a button in the Help menu activates a pop-up window explaining the basic function of the button. The following is a drop-down menu containing all the buttons in the Help Menu in approximate order from top to bottom. Choosing an item from this drop-down menu below will open a new page containing a screenshot of the explanation provided by the pop-up window from Erez's program. Press the back arrow in your browser to return to this page.




In addition, a complete list of all the buttons and their explanations can be viewed conveniently in this table.


Once you are familiar with the various characteristics you can choose for the polypetide in the simulation (e.g. choosing length of polypeptide, choosing number of different kinds of amino acids involved, choosing bond angles and potential functions, choosing temperature, choosing to have temperature annealing or not) as well as the options you have regarding the simulation itself (e.g. number of cycles before seeing an update and total number of cycles), you are ready for your first simulation. The opening screen of the program supplies the user with automatic default settings, and for the first simulation it is recommended to leave the settings on default and go straight to initiating the simulation. In general if the user would like to change these settings, this should be done first, before initiating the simulation.


To actually run the simulation take the following steps:


  1. Click on the "Initiate" button. If this is your first use of the simulator and you've left the settings as default, you should see something like this (the screenshot below) after pressing "Initiate". It's important to note that the amino acids are arranged with some degree of randomness initially (and that the entire method is based on random changes), so don't be surprised if your initial polypeptide looks a little bit different from this. Also note, often one needs to zoom out (click on the "Zoom out" button) to see the whole polypeptide because it can by default be arranged partially outside of the viewing area.
  1. After you've clicked "Initiate" and an unfolded polypeptide has appeared, click "Go!". The program will run till the end without the user needing to do anything. If you'd like to stop it early you can press the "Stop!" button. If you'd like to quit MATLAB click "Quit". If you'd like to reset everything click "Restart".


Included here is a short clip demonstrating the simulation with default settings.

Included below are two examples of different options outside of the default settings that can be chosen in the simulation.

Example 1: Changing number of different kinds of amino acids

Below is an example of an initial polypeptide when the option of three different kinds of amino acids is chosen. Different kinds of amino acids are differentiated in the simulation by the use of different colors. Choosing multiple kinds of amino acids provides the user with the opportunity to define the bond angles and the potential functions between different kinds of amino acids independently, as can be seeen in the screenshot below.

Example 2: Employing temperature annealing

In order to better avoid "getting stuck" in metastable states, temperature annealing may be employed in the simulation (by default temperature annealing is turned off).

Sometimes temperature annealing can effectively enable the system to reach a lower energy level than it would otherwise. For example, compare the following two end results, one without temperature annealing and one with. Both begin with a polypeptide composed of 20 amino acids of all the same type. However, the polypeptide experiencing temperature annealing is initiated at a much higher temperature of 30000 K and is slowly cooled while running through many more Monte Carlo cycles, allowing it to achieve a significantly lower final energy and a more compact shape.


Temperature Annealing Off

Temperature Annealing On

Odds and Ends

Some Closing Remarks

Levinthal's paradox, like many other mysteries in science, may never be resolved. However, as with many other things in science, we are not discouraged by this. Instead we continue theorizing and simulating, in the hopes that even if we do not ever discover a complete explanation or the "why" of the mechanism, we will still learn more about the physical world we live in. And that's the reason we started investigating in the first place. Since Erez's project in 1999, many researchers have written more and more complicated protein folding simulations, ever striving to achieve something more realistic that may shine some light on the phenomenon. Of course, this means much more complexity must be added. For example, a simple approach to temperature annealing in which the protein is heated to extremely high temperatures, as Erez took here, is not realistic in light of the various every day environments proteins encounter in a cell. So although this method proved successful in lowering the overall energy of the system, it still requires some serious adjustments to be modeling the real thing. Many other simplifications had to be made as well considering the nature of the process. In spite of these drawbacks, Erez did excellent work implementing a 3-D Monte Carlo-simulated protein folding method.


Contact Information and Other Odds and Ends

My name is Hadossa Oudean, and I did this project as part of Joan Adler's Computational Physics class.

  • My homepage for this class, which contains a link back to this project, can be found here.
  • Joan Adler's homepage for Computational Physics can be found here.
  • The class's homepage can be found here.
  • I can be contacted at my personal e-mail address: hadossaoudean@gmail.com

As stated before, the program was originally written by a former Technion student by the name of Erez in the year 1999. Here is a link to his website from back then. A very special thanks goes to Alexander Cheplev, a former fellow student of mine, who successfully adjusted the code to run in today's MATLAB. It should be noted that the code runs on Windows but does not yet run successfully on Linux. It is suspected that this is related to the compilation of the C code in Linux and/or some kind of path issue. If you happen to find a way to run it on Linux successfully or have an idea of what I could try to fix this please let me know!