Homology Modeling
In protein structure prediction, homology modeling, also known as comparative modeling, is a class of methods for constructing an atomic-resolution model of a protein from its amino acid sequence
(the "query sequence" or "target"). Almost all homology modeling
techniques rely on the identification of one or more known protein
structures (known as "templates" or "parent structures") likely to
resemble the structure of the query sequence, and on the production of
an alignment
that maps residues in the query sequence to residues in the template
sequence. The sequence alignment and template structure are then used
to produce a structural model of the target. Because protein structures
are more conserved than protein sequences, detectable levels of sequence similarity usually imply significant structural similarity.[1]
The quality of the homology model is dependent on the quality of the
sequence alignment and template structure. The approach can be
complicated by the presence of alignment gaps (commonly called indels)
that indicate a structural region present in the target but not in the
template, and by structure gaps in the template that arise from poor
resolution in the experimental procedure (usually X-ray crystallography) used to solve the structure. Model quality declines with decreasing sequence identity; a typical model has ~2 Å agreement between the matched Cα atoms at 70% sequence identity but only 4-5 Å agreement at 25% sequence identity. Regions of the model that were constructed without a template, usually by loop modeling, are generally much less accurate than the rest of the model, particularly if the loop is long. Errors in side chain
packing and position also increase with decreasing identity, and
variations in these packing configurations have been suggested as a
major reason for poor model quality at low identity.[2]
Taken together, these various atomic-position errors are significant
and impede the use of homology models for purposes that require
atomic-resolution data, such as drug design and protein-protein interaction predictions; even the quaternary structure
of a protein may be difficult to predict from homology models of its
subunit(s). Nevertheless, homology models can be useful in reaching qualitative
conclusions about the biochemistry of the query sequence, especially in
formulating hypotheses about why certain residues are conserved, which
may in turn lead to experiments to test those hypotheses. For example,
the spatial arrangement of conserved residues may suggest whether a
particular residue is conserved to stabilize the folding, to
participate in binding some small molecule, or to foster association
with another protein or nucleic acid.
Homology modeling can produce high-quality structural models when
the target and template are closely related, which has inspired the
formation of a structural genomics consortium dedicated to the production of representative experimental structures for all classes of protein folds.[3] The chief inaccuracies in homology modeling, which worsen with lower sequence identity, derive from errors in the initial sequence alignment and from improper template selection.[4]
Like other methods of structure prediction, current practice in
homology modeling is assessed in a biennial large-scale experiment
known as the Critical Assessment of Techniques for Protein Structure
Prediction, or CASP.
Motivation
The method of homology modeling is based on the observation that protein tertiary structure is better conserved than amino acid sequence.[1]
Thus, even proteins that have diverged appreciably in sequence but
still share detectable similarity will also share common structural
properties, particularly the overall fold. Because it is difficult and
time-consuming to obtain experimental structures from methods such as X-ray crystallography and protein NMR
for every protein of interest, homology modeling can provide useful
structural models for generating hypotheses about a protein's function
and directing further experimental work.
There are exceptions to the general rule that proteins sharing
significant sequence identity will share a fold. For example, a
judiciously chosen set of mutations of less than 50% of a protein can
cause the protein to adopt a completely different fold.[5][6] However, such a massive structural rearrangement is unlikely to occur in evolution, especially since the protein is usually under the constraint that it must fold
properly and carry out its function in the cell. Consequently, the
roughly folded structure of a protein (its "topology") is conserved
longer than its amino-acid sequence and much longer than the
corresponding DNA sequence; in other words, two proteins may share a
similar fold even if their evolutionary relationship is so distant that
it cannot be discerned reliably. For comparison, the function of a
protein is conserved much less than the protein sequence, since relatively few changes in amino-acid sequence are required to take on a related function.
Steps in model production
The homology modeling procedure can be broken down into four
sequential steps: template selection, target-template alignment, model
construction, and model assessment.[1]
The first two steps are often essentially performed together, as the
most common methods of identifying templates rely on the production of
sequence alignments; however, these alignments may not be of sufficient
quality because database search techniques prioritize speed over
alignment quality. These processes can be performed iteratively to
improve the quality of the final model, although quality assessments
that are not dependent on the true target structure are still under
development.
Optimizing the speed and accuracy of these steps for use in
large-scale automated structure prediction is a key component of
structural genomics initiatives, partly because the resulting volume of
data will be too large to process manually and partly because the goal
of structural genomics requires providing models of reasonable quality
to researchers who are not themselves structure prediction experts.[1] Fully automated predictions with no human intervention are studied in a CASP parallel project known as CAFASP.
Template selection and sequence alignment
The critical first step in homology modeling is the identification
of the best template structure, if indeed any are available. The
simplest method of template identification relies on serial pairwise
sequence alignments aided by database search techniques such as FASTA and BLAST. More sensitive methods based on multiple sequence alignment - of which PSI-BLAST is the most common example - iteratively update their position-specific scoring matrix
to successively identify more distantly related homologs. This family
of methods has been shown to produce a larger number of potential
templates and to identify better templates for sequences that have only
distant relationships to any solved structure. Protein threading,
also known as fold recognition or 3D-1D alignment, can also be used as
a search technique for identifying templates to be used in traditional
homology modeling methods.[1] When performing a BLAST search, a reliable first approach is to identify hits with a sufficiently low E-value,
which are considered sufficiently close in evolution to make a reliable
homology model. Other factors may tip the balance in marginal cases;
for example, the template may have a function similar to that of the
query sequence, or it may belong to a homologous operon. However, a template with a poor E-value
should generally not be chosen, even if it is the only one available,
since it may well have a wrong structure, leading to the production of
a misguided model. A better approach is to submit the primary sequence
to fold-recognition servers or, better still, consensus meta-servers
which improve upon individual fold-recognition servers by identifying
similarities (consensus) among independent predictions.
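The E-value screen described above can be sketched as follows. This is only a minimal illustration, assuming the search hits have already been parsed into simple records; the `Hit` type, its field names, the template identifiers, and the 1e-5 cutoff are all assumptions made for the example and are not values given in the text.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    template_id: str  # hypothetical identifier of a solved structure
    e_value: float    # E-value reported by the database search
    coverage: float   # fraction of the query covered by the alignment (0..1)


def select_templates(hits, max_e_value=1e-5):
    """Keep hits at or below the E-value cutoff, lowest E-value first.

    Ties in E-value are broken in favor of higher coverage.
    """
    kept = [h for h in hits if h.e_value <= max_e_value]
    return sorted(kept, key=lambda h: (h.e_value, -h.coverage))


hits = [
    Hit("tmplA", 3e-40, 0.92),
    Hit("tmplB", 2e-7, 0.60),
    Hit("tmplC", 0.8, 0.95),  # poor E-value: rejected despite high coverage
]
selected = select_templates(hits)
print([h.template_id for h in selected])
```

A suitable cutoff depends on the database and the purpose of the model, so the default here should be treated as a placeholder.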
Often several candidate template structures are identified by these
approaches. Although some methods can generate hybrid models from
multiple templates, most methods rely on a single template. Therefore,
choosing the best template from among the candidates is a key step, and
can affect the final accuracy of the structure significantly. This
choice is guided by several factors, such as the similarity of the
query and template sequences, of their functions, and of the predicted
query and observed template secondary structures. Perhaps most important are the coverage
of the aligned regions, that is, the fraction of the query sequence structure
that can be predicted from the template, and the plausibility of the
resulting model. Thus, sometimes several homology models are produced
for a single query sequence, with the most likely candidate chosen only
in the final step.
It is possible to use the sequence alignment generated by the
database search technique as the basis for the subsequent model
production; however, more sophisticated approaches have also been
explored. One proposal generates an ensemble of stochastically
defined pairwise alignments between the target sequence and a single
identified template as a means of exploring "alignment space" in
regions of sequence with low local similarity.[7]
Another approach uses "profile-profile" alignments that first generate a sequence profile of
the target and systematically compare it to the sequence profiles of
solved structures; the coarse-graining inherent in the profile
construction is thought to reduce noise introduced by sequence drift in nonessential regions of the sequence.[8]
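Two quantities that recur when judging a target-template alignment, sequence identity and coverage, can be computed directly from a gapped pairwise alignment. The sketch below assumes the alignment is given as two equal-length strings with `-` for gaps; the function name and the toy sequences are illustrative. Identity is defined here over aligned (non-gap) columns, which is one of several conventions in use.

```python
def alignment_stats(target_aln, template_aln, gap="-"):
    """Return (identity, coverage) for a gapped pairwise alignment.

    identity: identical residues / columns where both sequences have a residue
    coverage: such columns / number of residues in the target
    """
    if len(target_aln) != len(template_aln):
        raise ValueError("aligned sequences must have equal length")
    target_len = sum(1 for c in target_aln if c != gap)
    aligned = 0
    identical = 0
    for t, p in zip(target_aln, template_aln):
        if t != gap and p != gap:
            aligned += 1
            if t == p:
                identical += 1
    identity = identical / aligned if aligned else 0.0
    coverage = aligned / target_len if target_len else 0.0
    return identity, coverage


# Toy alignment: one insertion in the template, two residues missing from it.
identity, coverage = alignment_stats("MKT-AYIAKQR", "MKSLAY--KQR")
print(f"identity={identity:.3f} coverage={coverage:.3f}")
# identity=0.875 coverage=0.800
```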
Model generation
Given a template and an alignment, the information contained therein
must be used to generate a three-dimensional structural model of the
target, represented as a set of Cartesian coordinates for each atom in the protein. Three major classes of model generation methods have been proposed.[9]
Fragment assembly
The original method of homology modeling relied on the assembly of a complete model from conserved structural fragments identified in closely related solved structures. For example, a modeling study of serine proteases in mammals
identified a sharp distinction between "core" structural regions
conserved in all experimental structures in the class, and variable
regions typically located in the loops
where the majority of the sequence differences were localized. Thus
unsolved proteins could be modeled by first constructing the conserved
core and then substituting variable regions from other proteins in the
set of solved structures.[10]
Current implementations of this method differ mainly in the way they
deal with regions that are not conserved or that lack a template.[11]
Segment matching
The segment-matching method divides the target into a series of
short segments, each of which is matched to its own template fitted
from the Protein Data Bank.
Thus, sequence alignment is done over segments rather than over the
entire protein. Selection of the template for each segment is based on
sequence similarity, comparisons of alpha carbon coordinates, and predicted steric conflicts arising from the van der Waals radii of the divergent atoms between target and template.[12]
Satisfaction of spatial restraints
The most common current homology modeling method takes its
inspiration from calculations required to construct a three-dimensional
structure from data generated by NMR spectroscopy. One or more target-template alignments are used to construct a set of geometrical criteria that are then converted to probability density functions for each restraint. Restraints applied to the main protein internal coordinates - protein backbone distances and dihedral angles - serve as the basis for a global optimization procedure that originally used conjugate gradient energy minimization to iteratively refine the positions of all heavy atoms in the protein.[13]
This method has been dramatically expanded to apply specifically to
loop modeling, which can be extremely difficult due to the high
flexibility of loops in proteins in aqueous solution.[14] A more recent expansion applies the spatial-restraint model to electron density maps derived from cryoelectron microscopy
studies, which provide low-resolution information that is not usually
itself sufficient to generate atomic-resolution structural models.[15]
To address the problem of inaccuracies in initial target-template
sequence alignment, an iterative procedure has also been introduced to
refine the alignment on the basis of the initial structural fit.[16] The most commonly used software in spatial restraint-based modeling is MODELLER, and a database called ModBase has been established for reliable models generated with it.[17]
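The idea of turning restraints into probability density functions and then optimizing can be illustrated with a toy example. The sketch below is not MODELLER's implementation: it assumes each restraint is a Gaussian density on a single inter-atomic distance, uses the summed negative log-densities (constants dropped) as the objective, and substitutes plain gradient descent for the conjugate gradient minimization mentioned above. The coordinates, means, and standard deviations are invented for the example.

```python
import math


def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def objective(coords, restraints):
    """Sum over restraints of 0.5 * ((d - mean) / sigma)**2."""
    total = 0.0
    for i, j, mean, sigma in restraints:
        d = distance(coords[i], coords[j])
        total += 0.5 * ((d - mean) / sigma) ** 2
    return total


def gradient_step(coords, restraints, step=0.02):
    """Move every atom one small step down the gradient of the objective."""
    grads = [[0.0, 0.0, 0.0] for _ in coords]
    for i, j, mean, sigma in restraints:
        d = distance(coords[i], coords[j])
        if d == 0.0:
            continue  # direction undefined for coincident atoms
        scale = (d - mean) / (sigma ** 2 * d)
        for k in range(3):
            g = scale * (coords[i][k] - coords[j][k])
            grads[i][k] += g
            grads[j][k] -= g
    return [[c[k] - step * g[k] for k in range(3)] for c, g in zip(coords, grads)]


# Three atoms and three distance restraints: (atom i, atom j, mean, sigma), in Å.
coords = [[0.0, 0.0, 0.0], [5.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
restraints = [(0, 1, 3.8, 0.5), (1, 2, 3.8, 0.5), (0, 2, 5.5, 0.5)]

initial = objective(coords, restraints)
for _ in range(200):
    coords = gradient_step(coords, restraints)
final = objective(coords, restraints)
print(f"objective: {initial:.2f} -> {final:.4f}")
```

A real restraint set would also cover dihedral angles and would be derived from the alignment rather than typed in by hand.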
Loop modeling
Regions of the target sequence that are not aligned to a template are modeled by loop modeling;
they are the most susceptible to major modeling errors and occur with
higher frequency when the target and template have low sequence
identity. The coordinates of unmatched sections determined by loop
modeling programs are generally much less accurate than those obtained
from simply copying the coordinates of a known structure, particularly
if the loop is longer than 10 residues. The first two sidechain dihedral angles (χ1 and χ2)
can usually be estimated within 30° for an accurate backbone structure;
however, the later dihedral angles found in longer side chains such as lysine and arginine are notoriously difficult to predict. Moreover, small errors in χ1 (and, to a lesser extent, in χ2)
can cause relatively large errors in the positions of the atoms at the
terminus of the side chain; such atoms often have a functional importance,
particularly when located near the active site.
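The effect of a dihedral error on distal atoms follows from simple geometry: rotating about a bond by an angle Δχ moves an atom lying at perpendicular distance r from the bond axis along a chord of length 2·r·sin(Δχ/2). The sketch below evaluates this for a 30° error; the two radii are illustrative assumptions, not measurements for any particular residue.

```python
import math


def displacement(radius_angstrom, angle_error_deg):
    """Chord length swept by an atom at the given distance from the rotation axis."""
    return 2.0 * radius_angstrom * math.sin(math.radians(angle_error_deg) / 2.0)


near = displacement(1.5, 30.0)  # atom close to the rotated bond
far = displacement(5.0, 30.0)   # atom near the end of a long side chain
print(f"near: {near:.2f} Å, far: {far:.2f} Å")
# near: 0.78 Å, far: 2.59 Å
```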
Model assessment
Assessment of homology models without reference to the true target structure is usually performed with one of two methods: statistical potentials
or physics-based energy calculations. Both methods produce an estimate
of the energy (or an energy-like analog) for the model or models being
assessed; independent criteria are needed to determine acceptable
cutoffs. Neither of the two methods correlates exceptionally well with
true structural accuracy, especially on protein types underrepresented
in the PDB, such as membrane proteins.
Statistical potentials are empirical methods based on observed
residue-residue contact frequencies among proteins of known structure
in the PDB. They assign a probability or energy score to each possible
pairwise interaction between amino acids
and combine these pairwise interaction scores into a single score for
the entire model. Some such methods can also produce a
residue-by-residue assessment that identifies poorly scoring regions
within the model, though the model may have a reasonable score overall.[18] These methods emphasize the hydrophobic core and solvent-exposed polar amino acids often present in globular proteins. Examples of popular statistical potentials include Prosa and DOPE. Statistical potentials are more computationally efficient than energy calculations.[18]
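A contact-based statistical potential can be sketched with the commonly used inverse-Boltzmann form, in which each pair type is scored as the negative logarithm of its observed contact frequency relative to the frequency expected by chance. The example below is a toy: it reduces the amino acids to two classes (H for hydrophobic, P for polar), and the contact list and class frequencies are invented. It is not the Prosa or DOPE formulation.

```python
import math
from collections import Counter


def pair_key(a, b):
    """Order-independent key for a residue-type pair."""
    return tuple(sorted((a, b)))


def build_potential(observed_contacts, type_frequencies):
    """Score each pair type as -ln(observed fraction / expected fraction)."""
    counts = Counter(pair_key(a, b) for a, b in observed_contacts)
    total = sum(counts.values())
    potential = {}
    for (a, b), n in counts.items():
        expected = type_frequencies[a] * type_frequencies[b]
        if a != b:
            expected *= 2.0  # (a, b) and (b, a) are the same unordered pair
        potential[(a, b)] = -math.log((n / total) / expected)
    return potential


def score_model(model_contacts, potential):
    """Sum pair scores over a model's contacts; unseen pair types score 0."""
    return sum(potential.get(pair_key(a, b), 0.0) for a, b in model_contacts)


# Toy "database": hydrophobic-hydrophobic contacts are over-represented.
observed = [("H", "H")] * 6 + [("H", "P")] * 2 + [("P", "P")] * 2
potential = build_potential(observed, {"H": 0.5, "P": 0.5})

buried_core = score_model([("H", "H"), ("H", "H"), ("P", "P")], potential)
mixed_core = score_model([("H", "P"), ("P", "H"), ("P", "P")], potential)
print(f"buried core: {buried_core:.3f}, mixed core: {mixed_core:.3f}")
```

Lower scores indicate contact patterns that resemble those seen in known structures, which is why the model with the hydrophobic core scores better here.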
Physics-based energy calculations aim to capture the interatomic
interactions that are physically responsible for protein stability in
solution, especially van der Waals and electrostatic interactions. These calculations are performed using a molecular mechanics force field; proteins are normally too large even for semi-empirical quantum mechanics-based calculations. The use of these methods is based on the energy landscape hypothesis of protein folding, which predicts that a protein's native state is also its energy minimum. Such methods usually employ implicit solvation,
which provides a continuous approximation of a solvent bath for a
single protein molecule without necessitating the explicit
representation of individual solvent molecules. A force field
specifically constructed for model assessment is known as the Effective Force Field (EFF) and is based on atomic parameters from CHARMM.[19]
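The two non-bonded terms named above can be written down in their textbook forms: a 12-6 Lennard-Jones term for van der Waals interactions and a Coulomb term for electrostatics. The sketch below is a generic illustration and is not the EFF or CHARMM functional form. It uses a single ε and σ for every atom pair and a constant dielectric, and it sums over all pairs, whereas a real force field uses per-atom-type parameters, excludes bonded neighbors, and adds a solvation term. The parameter values and the three-atom system are invented.

```python
import math

# Coulomb constant in kcal·Å/(mol·e²), the unit system common in molecular mechanics.
COULOMB_KCAL = 332.0637


def lennard_jones(r, epsilon, sigma):
    """12-6 Lennard-Jones energy; its minimum is -epsilon at r = 2**(1/6) * sigma."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def coulomb(r, q1, q2, dielectric=1.0):
    """Electrostatic energy of two point charges (in units of e) at distance r (Å)."""
    return COULOMB_KCAL * q1 * q2 / (dielectric * r)


def nonbonded_energy(atoms, epsilon=0.1, sigma=3.4):
    """Sum both terms over all atom pairs; atoms are (position, charge) tuples."""
    total = 0.0
    for i in range(len(atoms)):
        for j in range(i + 1, len(atoms)):
            (pos_i, q_i), (pos_j, q_j) = atoms[i], atoms[j]
            r = math.dist(pos_i, pos_j)
            total += lennard_jones(r, epsilon, sigma) + coulomb(r, q_i, q_j)
    return total


atoms = [
    ((0.0, 0.0, 0.0), 0.4),
    ((4.0, 0.0, 0.0), -0.4),
    ((0.0, 4.0, 0.0), 0.0),
]
energy = nonbonded_energy(atoms)
print(f"non-bonded energy: {energy:.2f} kcal/mol")
```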
A very extensive model validation report can be obtained using the "What Check" software from Radboud Universiteit Nijmegen, which is one option of the same university's "What If"
software package; it produces a multi-page document with extensive
analyses of nearly 200 scientific and administrative aspects of the
model. "What Check" is available as a free server; it can also be used to validate experimentally determined structures of macromolecules.
One newer method for model assessment relies on machine learning techniques such as neural nets,
which may be trained to assess the structure directly or to form a
consensus among multiple statistical and energy-based methods. Very
recent results using support vector machine
regression on a jury of more traditional assessment methods
outperformed common statistical, energy-based, and machine learning
methods.[20]
Structural comparison methods
The assessment of homology models' accuracy is straightforward when
the experimental structure is known. The most common method of
comparing two protein structures uses the root-mean-square deviation
(RMSD) metric to measure the mean distance between the corresponding
atoms in the two structures after they have been superimposed. However,
RMSD underestimates the accuracy of models in which the core is
essentially correctly modeled, but some flexible loop regions are inaccurate.[21] A method introduced for the modeling assessment experiment CASP is known as the global distance test
(GDT) and measures the total number of atoms whose distance from the
model to the experimental structure lies under a certain distance
cutoff.[21] Both methods can be used for any subset of atoms in the structure, but are often applied to only the alpha carbon or protein backbone atoms to minimize the noise created by poorly modeled side chain rotameric states, which most modeling methods are not optimized to predict.[22]
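The difference between the two metrics can be seen in a small example. The sketch below assumes the model and the experimental structure are already superimposed and are given as matched lists of alpha carbon coordinates. The GDT-style score averages the fraction of atoms within 1, 2, 4, and 8 Å, following the GDT_TS convention; a full GDT calculation also searches over many superpositions, which is omitted here. The coordinates are invented.

```python
import math


def ca_distances(model, reference):
    """Per-atom distances between matched, already superimposed coordinates."""
    if len(model) != len(reference):
        raise ValueError("structures must have the same number of atoms")
    return [math.dist(a, b) for a, b in zip(model, reference)]


def rmsd(distances):
    return math.sqrt(sum(d * d for d in distances) / len(distances))


def gdt_score(distances, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Mean over cutoffs of the fraction of atoms within each cutoff, as a percentage."""
    fractions = [sum(d <= c for d in distances) / len(distances) for c in cutoffs]
    return 100.0 * sum(fractions) / len(fractions)


reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
# Three atoms are placed well; one "loop" atom is 10 Å off.
model = [(0.0, 0.5, 0.0), (3.8, 0.5, 0.0), (7.6, 0.5, 0.0), (11.4, 10.0, 0.0)]

dists = ca_distances(model, reference)
model_rmsd = rmsd(dists)
model_gdt = gdt_score(dists)
print(f"RMSD = {model_rmsd:.2f} Å, GDT-style score = {model_gdt:.1f}")
# RMSD = 5.02 Å, GDT-style score = 75.0
```

A single badly placed atom pushes the RMSD above 5 Å even though three quarters of the atoms are within 0.5 Å, while the GDT-style score still reflects the well-modeled majority.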
Benchmarking
Several large-scale benchmarking efforts have been made to assess the relative quality of various current homology modeling methods. CASP
is a community-wide prediction experiment that runs every two years
during the summer months and challenges prediction teams to submit
structural models for a number of sequences whose structures have
recently been solved experimentally but have not yet been published.
Its partner CAFASP
has run in parallel with CASP but evaluates only models produced via
fully automated servers. Continuously running experiments that do not
have prediction 'seasons' focus mainly on benchmarking publicly
available webservers. LiveBench and EVA
run continuously to assess participating servers' performance in
prediction of imminently released structures from the PDB. CASP and
CAFASP serve mainly as evaluations of the state of the art in modeling,
while the continuous assessments seek to evaluate the model quality
that would be obtained by a non-expert user employing publicly
available tools.
Accuracy
The accuracy of the structures generated by homology modeling is
highly dependent on the sequence identity between target and template.
Above 50% sequence identity, models tend to be reliable, with only
minor errors in side chain packing and rotameric state, and an overall RMSD between the modeled and the experimental structure falling around 1 Å.
This error is comparable to the typical resolution of a structure
solved by NMR. In the 30-50% identity range, errors can be more severe
and are often located in loops. Below 30% identity, serious errors
occur, sometimes resulting in the basic fold being mis-predicted.[9]
This low-identity region is often referred to as the "twilight zone"
within which homology modeling is extremely difficult, and to which it
is possibly less suited than fold recognition methods.[23]
At high sequence identities, the primary source of error in homology
modeling derives from the choice of the template or templates on which
the model is based, while lower identities exhibit serious errors in
sequence alignment that inhibit the production of high-quality models.[4]
It has been suggested that the major impediment to quality model
production is inadequacies in sequence alignment, since "optimal" structural alignments
between two proteins of known structure can be used as input to current
modeling methods to produce quite accurate reproductions of the
original experimental structure.[24]
Attempts have been made to improve the accuracy of homology models built with existing methods by subjecting them to molecular dynamics simulation in an effort to improve their RMSD to the experimental structure. However, current force field
parameterizations may not be sufficiently accurate for this task, since
homology models used as starting structures for molecular dynamics tend
to produce slightly worse structures.[25] Slight improvements have been observed in cases where significant restraints were used during the simulation.[26]
Sources of error
The two most common and large-scale sources of error in homology
modeling are poor template selection and inaccuracies in
target-template sequence alignment.[4][27] Controlling for these two factors by using a structural alignment,
or a sequence alignment produced on the basis of comparing two solved
structures, dramatically reduces the errors in final models; these
"gold standard" alignments can be used as input to current modeling
methods to produce quite accurate reproductions of the original
experimental structure.[24]
Results from the most recent CASP experiment suggest that "consensus"
methods collecting the results of multiple fold recognition and
multiple alignment searches increase the likelihood of identifying the
correct template; similarly, the use of multiple templates in the
model-building step may be worse than the use of the single
correct template but better than the use of a single suboptimal
one.[27]
Alignment errors may be minimized by the use of a multiple alignment
even if only one template is used, and by the iterative refinement of
local regions of low similarity.[1][7] A lesser source of model errors is errors in the template structure. The PDBREPORT
database lists several million, mostly very small but occasionally
dramatic, errors in experimental (template) structures that have been
deposited in the PDB.
Serious local errors can arise in homology models where an insertion or deletion
mutation or a gap in a solved structure results in a region of target
sequence for which there is no corresponding template. This problem can
be minimized by the use of multiple templates, but the method is
complicated by the templates' differing local structures around the gap
and by the likelihood that a missing region in one experimental
structure is also missing in other structures of the same protein
family. Missing regions are most common in loops
where high local flexibility increases the difficulty of resolving the
region by structure-determination methods. Although some guidance is
provided even with a single template by the positioning of the ends of
the missing region, the longer the gap, the more difficult it is to
model. Loops of up to about 9 residues can be modeled with moderate
accuracy in some cases if the local alignment is correct.[1] Larger regions are often modeled individually using ab initio structure prediction techniques, although this approach has met with only isolated success.[28]
The rotameric
states of side chains and their internal packing arrangement also
present difficulties in homology modeling, even in targets for which
the backbone structure is relatively easy to predict. This is partly
due to the fact that many side chains in crystal structures are not in
their "optimal" rotameric state as a result of energetic factors in the
hydrophobic core and in the packing of the individual molecules in a protein crystal.[29]
One method of addressing this problem requires searching a rotameric
library to identify locally low-energy combinations of packing states.[30]
It has been suggested that a major reason that homology modeling is so
difficult when target-template sequence identity lies below 30% is that
such proteins have broadly similar folds but widely divergent side
chain packing arrangements.[2]
Utility
Uses of the structural models include protein-protein interaction prediction, protein-protein docking, molecular docking, and functional annotation of genes identified in an organism's genome.[31]
Even low-accuracy homology models can be useful for these purposes,
because their inaccuracies tend to be located in the loops on the
protein surface, which are normally more variable even between closely
related proteins. The functional regions of the protein, especially its
active site, tend to be more highly conserved and thus more accurately modeled.[9]
Homology models can also be used to identify subtle differences
between related proteins that have not all been solved structurally.
For example, the method was used to identify cation binding sites on the Na+/K+ ATPase and to propose hypotheses about different ATPases' binding affinity.[32] Used in conjunction with molecular dynamics
simulations, homology models can also generate hypotheses about the
kinetics and dynamics of a protein, as in studies of the ion
selectivity of a potassium channel.[33] Large-scale automated modeling of all identified protein-coding regions in a genome has been attempted for the yeast Saccharomyces cerevisiae,
resulting in nearly 1000 quality models for proteins whose structures
had not yet been determined at the time of the study, and identifying
novel relationships between 236 yeast proteins and other previously
solved structures.[34]
References
- [1] Marti-Renom MA, Stuart AC, Fiser A, Sanchez R, Melo F, Sali A. (2000). Comparative protein structure modeling of genes and genomes. Annu Rev Biophys Biomol Struct 29: 291-325.
- [2] Chung SY, Subbiah S. (1996). A structural explanation for the twilight zone of protein sequence homology. Structure 4: 1123–27.
- [3] Williamson AR. (2000). Creating a structural genomics consortium. Nat Struct Biol 7 S1(11s):953.
- [4] Venclovas C, Margelevičius M. (2005). Comparative modeling in CASP6 using consensus approach to template selection, sequence-structure alignment, and structure assessment. Proteins 61(S7):99-105.
- [5] Dalal S, Balasubramanian S, Regan L. (1997). Transmuting alpha helices and beta sheets. Fold Des 2(5):R71-9.
- [6] Dalal S, Balasubramanian S, Regan L. (1997). Protein alchemy: changing beta-sheet into alpha-helix. Nat Struct Biol 4(7):548-52.
- [7] Muckstein U, Hofacker IL, Stadler PF. (2002). Stochastic pairwise alignments. Bioinformatics 18 Suppl 2:S153-60.
- [8] Rychlewski L, Zhang B, Godzik A. (1998). Fold and function predictions for Mycoplasma genitalium proteins. Fold Des 3(4):229-38.
- [9] Baker D, Sali A. (2001). Protein structure prediction and structural genomics. Science 294(5540):93-96.
- [10] Greer J. (1981). Comparative model-building of the mammalian serine proteases. J Mol Biol 153(4):1027-42.
- [11] Wallner B, Elofsson A. (2005). All are not equal: A benchmark of different homology modeling programs. Protein Science 14:1315-1327.
- [12] Levitt M. (1992). Accurate modeling of protein conformation by automatic segment matching. J Mol Biol 226(2): 507-33.
- [13] Sali A, Blundell TL. (1993). Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol 234(3):779-815.
- [14] Fiser A, Sali A. (2003). ModLoop: automated modeling of loops in protein structures. Bioinformatics 19(18):2500-1.
- [15] Topf M, Baker ML, Marti-Renom MA, Chiu W, Sali A. (2006). Refinement of protein structures by iterative comparative modeling and CryoEM density fitting. J Mol Biol 357(5):1655-68.
- [16] John B, Sali A. (2003). Comparative protein structure modeling by iterative alignment, model building and model assessment. Nucleic Acids Res 31(14):3982-92.
- [17] Ursula Pieper, Narayanan Eswar, Hannes Braberg, M.S. Madhusudhan, Fred Davis, Ashley C. Stuart, Nebojsa Mirkovic, Andrea Rossi, Marc A. Marti-Renom, Andras Fiser, Ben Webb, Daniel Greenblatt, Conrad Huang, Tom Ferrin, Andrej Sali. MODBASE, a database of annotated comparative protein structure models, and associated resources. Nucleic Acids Res 32, D217-D222, 2004.
- [18] Sippl MJ. (1993). Recognition of Errors in Three-Dimensional Structures of Proteins. Proteins 17:355-62.
- [19] Lazaridis T. and Karplus M. 1999a. Discrimination of the native from misfolded protein models with an energy function including implicit solvation. J. Mol. Biol. 288: 477–487.
- [20] Eramian D, Shen M, Devos D, Melo F, Sali A, Marti-Renom MA. (2006). A composite score for predicting errors in protein structure models. Protein Science 15:1653-1666.
- [21] Zemla A. (2003). LGA - A Method for Finding 3-D Similarities in Protein Structures. Nucleic Acids Research, 31(13):3370-3374.
- [22] Mount DM. (2004). Bioinformatics: Sequence and Genome Analysis 2nd ed. Cold Spring Harbor Laboratory Press: Cold Spring Harbor, NY.
- [23] Blake JD, Cohen FE. (2001). Pairwise sequence alignment below the twilight zone. J Mol Biol 307(2):721-35.
- [24] Zhang Y and Skolnick J. (2005). The protein structure prediction problem could be solved using the current PDB library. Proc. Natl. Acad. Sci. USA 102(4):1029-34.
- [25] Koehl P, Levitt M. (1999). A brighter future for protein structure prediction. Nat Struct Biol 6(2):108-11.
- [26] Flohil JA, Vriend G, Berendsen HJ. (2002). Completion and refinement of 3-D homology models with restricted molecular dynamics: application to targets 47, 58, and 111 in the CASP modeling competition and posterior analysis. Proteins 48(4):593-604.
- [27] Ginalski K. (2006). Comparative modeling for protein structure prediction. Curr Opin Struct Biol 16(2):172-7.
- [28] Kryshtafovych A, Venclovas C, Fidelis K, Moult J. (2005). Progress over the first decade of CASP experiments. Proteins 61(S7):225-36.
- [29] Vasquez M. (1996). Modeling side-chain conformation. Curr Opin Struct Biol 6(2):217-21.
- [30] Wilson C, Gregoret LM, Agard DA. (1993). Modeling side-chain conformation for homologous proteins using an energy-based rotamer search. J Mol Biol 229(4):996-1006.
- [31] Gopal S, Schroeder M, Pieper U, Sczyrba A, Aytekin-Kurban G, Bekiranov S, Fajardo JE, Eswar N, Sanchez R, Sali A, Gaasterland T. (2001). Homology-based annotation yields 1,042 new candidate genes in the Drosophila melanogaster genome. Nat Genet 27(3):337-40.
- [32] Ogawa H, Toyoshima C. (2002). Homology modeling of the cation binding sites of Na+K+-ATPase. Proc Natl Acad Sci USA 99(25):15977-15982.
- [33] Capener CE, Shrivastava IH, Ranatunga KM, Forrest LR, Smith GR, Sansom MSP. (2000). Homology Modeling and Molecular Dynamics Simulation Studies of an Inward Rectifier Potassium Channel. Biophys J 78(6):2929-2942.
- [34] Sánchez R, Sali A. (1998). Large-scale protein structure modeling of the Saccharomyces cerevisiae genome. Proc Natl Acad Sci USA 95(23):13597-13602.
This article is licensed under the GNU Free Documentation License. It uses material from Wikipedia Encyclopedia article "Homology Modeling"