The important properties of proteins and how to explore them

Introduction: the context for studies and data analysis

KEY CONCEPT
■ Appreciating the complexity of cellular systems in terms of the numbers of distinct protein species present

'There's no big mystery to being an enzymologist. All you have to have is a razor blade and a liver.' Gordon Tomkins to Julius Axelrod, circa 1950.

The characterization of proteins, including enzymatic proteins, requires a good understanding of their important properties and the approaches employed to explore them. This chapter, underpinned by the conceptual toolkit in Chapter 1, aims to enhance your understanding of the goals sought and methods used in characterizing proteins. The simplicity of the approach suggested by Gordon Tomkins fails to convey the challenges associated with applying the key concepts and tools, outlined in Chapters 2-4, to the characterization of proteins within complex biological systems. A typical eukaryotic genome may well have in excess of 10 000 genes which, in turn, can encode probably over 10 times as many distinct proteins, resulting from differential RNA processing and post-translational modification. To establish the role of the several hundred thousand proteins would require the structural and functional characterization of each one. The enormity of this task is compounded by the diverse nature of protein function and structure.

In section 5.2, we shall consider how to establish the function and structure of a protein and how its activity may be regulated. In section 5.3, the range of assays used to monitor the biological activity of proteins is outlined; such assays underpin protein characterization and purification. The classical approach to studying proteins requires their purification from their native source (section 5.4.1) using bespoke purification procedures for individual proteins.
This task has been simplified greatly with the advent of protein expression in recombinant systems (section 5.4.2). In addition, the overexpression of recombinant proteins has enhanced the structural characterization of proteins, as reflected in the large increase in the number of structures deposited in the Protein Data Bank (PDB) in recent years. Section 5.5 will present a brief outline of the methods employed to determine the structures of proteins. Having established the structure and function of a protein, it is important to understand how it is regulated within the cellular environment and the types of interactions in which it is involved. Typically, this is achieved by monitoring the effects of a number of physical and chemical variables on protein activity, as described in sections 5.6 and 5.7. In section 5.8, we shall consider the use of bioinformatics in exploring the properties of proteins and, finally, experimental design will be outlined in section 5.9.

This chapter should lead to a sound understanding of the goals and methods employed to separate, identify, and characterize proteins that will enable data acquisition and handling in the specific examples outlined in subsequent chapters.

As examples, the genomes of the eukaryotic species budding yeast Saccharomyces cerevisiae, nematode worm Caenorhabditis elegans and human contain about 6000, 12 000 and 23 000 genes, respectively.

As described in section 5.8, the PDB is the database for all three-dimensional structures of proteins solved to atomic resolution by X-ray crystallography or nuclear magnetic resonance (NMR). As of July 2008, there were approximately 47 000 structures in the PDB, of which about 85% had been solved by X-ray crystallography. It should be noted that many structures in the PDB refer to the same protein in different forms or complexes and that relatively few structures of membrane proteins have been determined.
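The scale of the characterization task described in the introduction can be made concrete with a little arithmetic. The following is a minimal sketch in Python, assuming the approximate gene counts quoted in the text and treating the 'over 10 times as many distinct proteins' figure as a flat factor of 10. The factor, the dictionary and the function name `estimate_protein_species` are illustrative assumptions, not measured values.

```python
# Approximate gene counts quoted in the text for three eukaryotes.
GENE_COUNTS = {
    "S. cerevisiae": 6_000,
    "C. elegans": 12_000,
    "human": 23_000,
}

# Assumption: a flat factor of 10 distinct proteins per gene, standing in for
# the text's "probably over 10 times as many" (a lower bound, not a measurement).
PROTEINS_PER_GENE = 10


def estimate_protein_species(gene_count, proteins_per_gene=PROTEINS_PER_GENE):
    """Return a rough lower-bound estimate of distinct protein species."""
    return gene_count * proteins_per_gene


estimates = {
    organism: estimate_protein_species(genes)
    for organism, genes in GENE_COUNTS.items()
}

for organism, count in estimates.items():
    print(f"{organism}: at least ~{count:,} distinct protein species")
```

Under these assumptions the human estimate comes to 230 000, which is in line with the 'several hundred thousand' proteins mentioned in the text.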
The key questions about a protein

KEY CONCEPTS
■ Being aware of the range of functions of proteins
■ Understanding the levels of protein structure
■ Identifying appropriate methods to explore protein structure and protein interactions

All proteins share the common structural feature of being composed of amino acids which are linked by peptide bonds (Chapter 1, sections 1.1-1.3); however, it is the sequence of amino acids within a given protein that dictates its unique function and structure. The characterization of a novel protein requires an appreciation of the diversity of protein function and structure. It is also important to appreciate that biological systems are not static entities; they respond to environmental, developmental and metabolic signals with concomitant changes in the structure and function of proteins associated with these processes. Finally, while it is convenient to study proteins in isolation (section 5.4), it must be remembered that they almost always occur within complex cellular environments, interacting with other proteins, metabolites and cellular structures. Thus, to achieve complete characterization, we need to establish how proteins interact with other molecules under physiologically relevant conditions.

5.2.1 What is the function of the protein?

Proteins fulfil a diverse range of roles within the cell which can be categorized into the following general groups:

Fig. 5.1 Representation of the general structure of an antibody molecule with the antigen binding sites located at the N-termini of the heavy (VH) and light (VL) chains. The constant regions (CH and CL) are linked by disulphide bonds (S-S) and contain a carbohydrate modification (CHO).

Fig. 5.2 Structure of a monomeric form of actin (Protein Data Bank code 2A5X). This structural protein usually occurs as part of multimeric structures, such as actin filaments.
■ Signalling proteins: Communication between cells relies on the production of signal molecules in one cell type that are detected by receptors (proteins) located on the surface of a second cell type or target cell; the interaction between any given signal and its receptor leads to a cellular response, a process termed 'signal transduction'. Cellular signal molecules can take the form of small molecules (such as adrenalin, also known as epinephrine) or macromolecules. Insulin, a protein produced in the β cells of the pancreas, is one such signal molecule, which is detected by insulin receptors located in the cell membrane of many cell types. Insulin is produced in response to high blood glucose levels and the subsequent binding of the hormone to insulin receptors leads to a reduction in blood glucose.

■ Motor proteins: Movement within biological systems is a process that is accompanied by the utilization of ATP. A number of motor proteins, including dynein and myosin, can harvest the energy released by ATP hydrolysis to generate movement. Within muscle structures, ATP hydrolysis drives the movement of myosin relative to actin filaments, resulting in muscle contraction.

■ Storage proteins: Storage proteins fulfil an essential role within the cell by storing minerals or essentially acting as a source of amino acid nutrients, poised for release in response to an appropriate metabolic or developmental signal. Seed storage proteins are released and degraded on seed germination to provide essential nutrients for the developing seedling. A number of storage proteins, such as the iron storage protein, ferritin, sequester ligands which may prove toxic to the cell. Iron would tend towards its toxic ferric state within biological systems; however, this essential mineral is stored safely within proteins such as ferritin until it is required for processes such as haem synthesis.
In all cases, the function of each of these protein groups is dependent on the structure of the protein, i.e. form fits function: catalytic proteins have residues in and around the active site, which present an environment to promote specific chemical reactions, and binding proteins present specific binding sites to allow recognition and binding of target molecules. Increasingly, the function of a novel protein is determined by establishing its amino acid sequence (either directly using amino acid sequencing or indirectly by translating the sequence of the gene encoding the novel protein) and then conducting a homology search of the ever-expanding sequence databases (section 5.8.5). Confirmation of the probable function requires the purification of the protein and an assay to test its function, e.g. an enzyme assay to measure the activity of the putative catalytic protein. Assays can also be used to test the possible influence of environmental and metabolic effectors on the activity of the protein. A complete understanding of protein function requires solving (or predicting) its structure.

The terms adrenalin(e), which is used in the UK and Europe, and epinephrine (used in the USA) are both derived from the location of the secretory adrenal gland that is adjacent to the kidneys. The roots are: Latin, ad (against), ren (kidney) or Greek, epi (close to), nephros (kidney).

It has been estimated that the actin-myosin system in skeletal muscle can be up to about 60% efficient in converting the chemical energy of the fuel (ATP hydrolysis) into mechanical work. This is more efficient than power stations powered by coal or gas, which can achieve efficiencies in the range 30-40%.

Under physiological conditions, iron can exist in the ferrous (Fe2+) or ferric (Fe3+) state. The ferric state forms large complexes with anions and hydroxide ions, which are highly insoluble and toxic.
The role of ferritin is to sequester iron in the ferric state, complexed with phosphate and hydroxide ions, until it is required for processes such as haem biosynthesis. Ferritin is composed of 24 identical subunits which associate to form a hollow spherical structure. Each 24-mer can accommodate up to 4500 iron ions within this hollow structure (see Fig. 5.3).

Fig. 5.3 24-mer structure of the iron storage protein ferritin (Protein Data Bank code 1FHA). 24 identical subunits pack together to form a hollow shell that can accommodate up to 4500 iron ions in the ferric (Fe3+) state.

5.2.2 What is the structure of the protein?

Over the past 20 years, our understanding of protein structure has been greatly enhanced by the near exponential growth in the number of structures that have been solved (see Fig. 5.4). A number of factors have contributed to this growth, including: the ability to overexpress many proteins (which has provided the quantity and quality of material required for structural studies), enhanced computing capabilities and developments in the biophysical techniques employed to determine protein structure.

Fig. 5.4 Growth of protein structures solved over the past 35 years. (Bar chart of the number of structures, total and yearly, for each year from 1972 to 2006; the number axis runs to 32,500.)

Many of the DNA-derived protein sequences in the databases correspond to proteins of unknown function or proteins which may not be found in cells, certainly under normal growth conditions.

Complete structural characterization of a protein requires:

■ Determination of the amino acid sequence: This may be deduced from the nucleotide sequence of a gene encoding a particular protein or from direct amino acid sequencing (Chapter 8, section 8.3). The amino acid sequence of a protein can be used to generate a wealth of structural information, including the theoretical mass, potential post-translational modification sites and, with ever-expanding sequence databases and analysis tools, the ability to identify structural and functional motifs (see Chapter 1, section 1.3.2). In addition, the amino acid sequence is essential for determining X-ray crystallographic and high-resolution NMR spectroscopic structures.

■ Experimentally determined mass: The theoretical mass of a protein, calculated on the basis of the amino acid sequence, does not take into account any post-translational modifications, such as disulphide bond formation, phosphorylation, glycosylation, and proteolysis. Experimental techniques, most notably mass spectrometry, can determine the mass of a protein with a level of accuracy that permits characterization of post-translational modifications, e.g. horse lysozyme has an experimental mass which is 8 Da lower than the theoretical value, indicating the formation of four disulphide bonds (see Chapter 1, section 1.3.2.5). In the case of proteins with more complex post-translational modifications, e.g. the hormone protein, insulin (see Fig. 5.5), more elaborate experiments are required to relate the experimentally determined mass to the structure of the post-translationally modified protein.

■ Characterization of secondary and tertiary structure: Proteins adopt three-dimensional structures that are dictated by the amino acid sequence of the protein (see Chapter 1, sections 1.4 and 1.5). An inspection of the tens of thousands of structures which have been solved to date reveals a vast range of three-dimensional structures, from the simplicity of defensin to the elegance of ATP synthase (Fig. 5.6), which enable proteins to fulfil specific roles within the cell.
Despite the variety of three-dimensional structures, a number of common principles determine the structures adopted by proteins: residues with hydrophobic amino acid side chains tend to be buried in the interior of the structure, whereas residues with hydrophilic side chains tend to be exposed on the surface; they adopt close-packed structures and contain elements of secondary structure as a result of main chain polar groups forming hydrogen bonds (see Chapter 1, section 1.5.1). The predominant forms of secondary structure are α-helices and β-sheets.

Fig. 5.5 Maturation of insulin from preproinsulin (stages shown: preproinsulin, proinsulin, insulin). The 24 amino acid signal peptide is initially cleaved from the N-terminus of preproinsulin to generate proinsulin. Two disulphide bonds form within proinsulin, which is then proteolytically cleaved to remove a stretch of 51 amino acids (C-chain) to generate the mature form of insulin, which contains two polypeptide chains (A and B) linked by two disulphide bonds.

Fig. 5.6 Structures of (a) human β-defensin-1 and (b) ATP synthase. Defensin (Protein Data Bank code 1IJU) is a small antimicrobial protein with a relatively simple structure, whereas ATP synthase (Protein Data Bank code 1QO1) is an elegant machine that converts the energy associated with a proton motive force to generate ATP.

The secondary structure of a protein can be inferred theoretically from its amino acid sequence or it can be measured experimentally. Theoretical methods employed to determine the secondary structure include homology modelling and prediction methods. To determine the secondary structure of an unknown protein using homology modelling requires a similar protein (>25% sequence identity) of known structure which can serve as a model to establish the structure of the unknown protein (Sander and Schneider, 1994).
As an example, prior to solving the structure of HIV protease, the structures of aspartic proteases from a number of sources had been determined. As HIV protease shared >25% sequence identity with these aspartic proteases, it was possible to use homology modelling-based methods to predict the secondary structure of HIV protease. Indeed, this predicted structure proved pivotal in the design of drugs which inhibited HIV protease, which in turn inhibited the replication cycle of HIV.

In the absence of a similar protein, prediction methods can be used to predict the secondary structure of the unknown protein. To date, two types of prediction method are used. The first is statistical analysis of the sequence of the unknown protein to calculate the likelihood of a given amino acid occurring in a particular type of secondary structure element (Chou and Fasman, 1974). The second comprises techniques such as PHD (Rost, 1996), which generate multiple sequence alignments (with proteins of lower levels of identity); the alignment is then submitted to a neural network system to predict the secondary structure of the unknown protein. With accuracies of up to 72%, such prediction methods are a useful tool to complement experimental estimates of secondary structure content.

The preferences of amino acids for types of secondary structure are mentioned in Chapter 1, section 1.4.4.

CD is based on the difference in absorption of right and left circularly polarized components of plane-polarized light and can be used to detect chirality (optical activity) in molecules. The regular secondary structures of proteins such as α-helix and β-sheet are chiral.

One technique which has proved valuable in estimating the secondary structure content of proteins is circular dichroism (CD) (Kelly et al., 2005). Regular secondary structure elements within proteins produce characteristic CD spectra which arise from absorption at 190 and 220 nm by peptide bonds (see Fig. 5.7).
Ultimately, the determination of the three-dimensional structure of a protein by X-ray crystallography or high-resolution NMR spectroscopy provides not only a measure of the secondary structure content, but also molecular detail of the length and spatial arrangement of secondary structure elements within the folded polypeptide chain. The overall three-dimensional arrangement of a polypeptide chain, i.e. its tertiary structure, is maintained by multiple weak, non-covalent interactions such as electrostatic, van der Waals', hydrophobic interactions, and hydrogen bonds (see Chapter 1, section 1.5). These interactions are also involved in maintaining subunit-subunit contacts, i.e. the quaternary structure of proteins.

Rubisco catalyses the first dark reaction of photosynthesis (i.e. the addition of CO2 to the five-carbon sugar ribulose bisphosphate). It is a relatively inefficient enzyme in catalytic terms and is thought to be the most abundant protein on earth.

■ Quaternary structure: In general, larger proteins (typically >50 kDa) tend to exist as multiple subunits giving rise to the level of structure known as the quaternary structure. The quaternary structure can range in complexity from two identical subunits, e.g. ribulose-bisphosphate carboxylase (Rubisco: EC 4.1.1.39) from photosynthetic bacteria (2 × 55 kDa), to multiple non-identical subunits, e.g. Rubisco from plants and algae, which exists as a hexadecamer with eight large subunits and eight small subunits (8 × 55 kDa and 8 × 15 kDa). Characterization of multi-subunit proteins requires determination of the overall molecular mass of the protein, identification of the types of subunits and the molecular mass of each type, calculation of the number of each subunit type within the protein and the structural arrangement of the subunits.
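The subunit bookkeeping described above for quaternary structure amounts to simple arithmetic, sketched here in Python using the Rubisco subunit masses quoted in the text. This is an illustration under stated assumptions rather than a prescribed method: the function names are hypothetical, the native mass is taken to be the plain sum of the subunit masses, and the subunit count is rounded to the nearest whole number because measured masses are rarely exact multiples of one another.

```python
def oligomer_mass_kda(composition):
    """Sum the mass of an oligomer from (copies, subunit_mass_kda) pairs."""
    return sum(copies * mass for copies, mass in composition)


def copies_of_identical_subunit(native_mass_kda, subunit_mass_kda):
    """Estimate the number of identical subunits in a homo-oligomer.

    The ratio is rounded to the nearest integer (an assumption that only
    holds when there is a single subunit type and the masses are reliable).
    """
    return round(native_mass_kda / subunit_mass_kda)


# Subunit masses quoted in the text for Rubisco.
bacterial_rubisco = oligomer_mass_kda([(2, 55)])       # two identical subunits
plant_rubisco = oligomer_mass_kda([(8, 55), (8, 15)])  # eight large, eight small

print(f"Bacterial Rubisco: {bacterial_rubisco} kDa")
print(f"Plant Rubisco: {plant_rubisco} kDa")
print("Subunits in bacterial Rubisco:",
      copies_of_identical_subunit(bacterial_rubisco, 55))
```

For proteins with more than one subunit type, the overall mass alone does not fix the stoichiometry, which is why the text also calls for identification of the subunit types and their individual masses.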
The molecular mass of multi-subunit proteins must be determined under non-denaturing conditions to maintain subunit-subunit interactions, using techniques such as gel filtration chromatography and ultracentrifugation (Chapter 8, section 8.2). Analysis by SDS-PAGE (see Chapter 8, section 8.2.1) can reveal the molecular mass of individual subunits and the number of subunit types within a multi-subunit protein: if only one species appears on an SDS-PAGE gel, it usually indicates that only one type of subunit exists within the protein; two species can suggest two subunit types, and so on. It is worth noting that there may be instances when non-identical subunits have similar mobilities on SDS-PAGE (i.e. subunits of similar size but different amino acid sequence appear as one band), and further analysis, such as mass spectrometry or amino acid sequencing, will be required to confirm whether all subunits are identical or not.

It is often possible to determine the number of subunits within a multi-subunit protein knowing the molecular mass of the multi-subunit protein, the number of subunit types and the molecular mass of the individual subunits, e.g. S. cerevisiae phosphoglycerate mutase has an overall molecular mass of 110 kDa and one subunit type with a molecular mass of 27.5 kDa, which suggests it exists as a tetramer. However, in the case of very large proteins with multiple subunits, techniques such as cross-linking (Chapter 8, section 8.5) can determine the number of subunits within a multi-subunit protein. High-resolution X-ray crystallography is required to provide the molecular details of subunit arrangements within multi-subunit proteins. Recently, there has been a marked increase in the number of structures which have been solved for very large multi-subunit proteins, such as the proteasome (Groll et al., 2001) and the heterodecameric RNA polymerase II from S. cerevisiae (Armache et al., 2005).

It should be noted that SDS-PAGE is not a high-resolution technique and it is doubtful if two proteins which had subunits of similar molecular masses (within 2%) would be efficiently separated by this technique. Mass spectrometry (see Chapter 8, section 8.2.5) would be useful for resolving proteins of similar masses.

Fig. 5.7 Far-UV CD spectra of various types of protein secondary structure (plotted against wavelength in nm). Solid line, α-helix; long dashed line, anti-parallel β-sheet; dotted line, type I β-turn; cross dashed line, extended 3₁-helix or poly (Pro) II helix; short dashed line, irregular structure.

5.2.3 What factors might affect the function of the protein?

The turnover rates of different proteins in eukaryotic cells vary enormously, with some proteins having a very short half-life of only a few minutes, whereas others exist within the cell for weeks. The half-life of a protein depends on its function, e.g. proteins involved in the regulation of cell division and transcription have a short half-life (minutes), whereas central metabolic enzymes have a long half-life (several days).

For example, antibodies are available commercially that can distinguish between phosphorylated side chains of serine and tyrosine in proteins.

As indicated at the start of section 5.2, biological systems are dynamic and are capable of responding to environmental, developmental and metabolic changes. Many of these changes act either by affecting the amount of protein present (e.g. by altering the rate of gene expression (transcription and translation) and/or the rate of protein degradation) or by altering the activity of that protein. Characterization of changes in protein function, in response to a factor, requires a biological assay, e.g. a specific enzyme assay for a catalytic protein or a binding assay for a binding protein (see section 5.3).
Using this approach, it has been possible to identify the following factors which can alter protein function within biological systems:

■ Non-covalent binding of other molecules ranging from small molecules to regulatory protein subunits and specific macromolecules, collectively known as effectors. It is possible to characterize these interactions in vitro using binding assays (see Chapter 6, section 6.4) coupled with appropriate analysis of the saturation curves (see Chapter 4, section 4.3).

■ Availability of ligands, including substrates for enzymes or cofactors, which can vary greatly within the cell. This effect is most obvious when varying the substrate concentration at values close to the Michaelis constant (Km), which can lead to sizeable changes in activity (see Chapter 4, section 4.3.3 and Fig. 4.8).

■ Reversible covalent modification, involving the addition and removal of specific chemical groups, can have a dramatic effect on the property of a protein. Many of these modifications are listed in Chapter 1, section 1.3.2.5 and can be readily identified by mass spectrometry or binding of antibodies which recognize specific post-translational modification groups such as the phosphoryl group.

■ Irreversible covalent changes, including the targeted proteolytic cleavage of inactive precursors to generate functionally active proteins, can be characterized using SDS-PAGE, size exclusion chromatography and mass spectrometry.

5.2.4 How does the protein interact with other molecules?

By considering the complexity of the cellular environment, together with the concentration of macromolecules within the cell (a typical prokaryotic cell has a protein concentration >200 mg mL⁻¹, and even higher concentrations are found in erythrocytes (>300 mg mL⁻¹) and the mitochondrial matrix (>500 mg mL⁻¹)), it soon becomes clear that proteins do not exist in isolation.
Proteins interact with many different types of molecules within the cell via non-covalent interactions such as hydrogen bonds, ionic, van der Waals', and hydrophobic interactions (see Chapter 1, section 1.7). As noted in Chapter 4, section 4.3 and in section 5.2.1, the types of molecules with which proteins interact include other proteins, nucleic acids, lipids, low molecular mass molecules, and substrates (for enzymes), and are collectively referred to as ligands. To address how proteins interact with ligands, we should aim to determine:

■ the three-dimensional structure of the protein-ligand complex
■ the site on the protein which interacts with the ligand and the molecular details of the interaction
■ the number of ligands interacting with the protein (stoichiometry)
■ the strength of the interaction and the rate constants involved.

Binding studies, or enzyme kinetic studies for catalytic proteins (as described in Chapter 4, section 4.3), provide information relating to the stoichiometry, rate constants, and the strength of the interaction, whereas structural studies, such as site-directed mutagenesis and side chain modifications (see Chapter 9, section 9.10), indicate the amino acids that are important for protein-ligand interactions.

Assays for biological activity

KEY CONCEPT
■ Appreciating the range of assays for biological activity measurements and their limitations

A specific assay is required for purification of a protein and to address key questions relating to its structure and function. In general, an assay is used during protein purification to gauge the success of the purification protocol, i.e. enhancement of the specific activity and maintenance of a high yield of biologically active protein (Chapter 3, section 3.9). Assays are also used to assess the effect of factors which may influence the biological activity of a protein, providing information relating to the function and structure of the protein.
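The two quantities used above to gauge a purification, specific activity and yield, can be tabulated step by step. The Python sketch below is a hedged illustration only: the step names and all numbers are invented for the example, the class and function names are hypothetical, and the definitions assumed are the conventional ones (specific activity as total activity divided by total protein, yield as the percentage of the starting activity retained, and purification factor as the specific activity relative to the first step).

```python
from dataclasses import dataclass


@dataclass
class PurificationStep:
    name: str
    total_protein_mg: float
    total_activity_units: float

    @property
    def specific_activity(self):
        """Activity per mg of protein (units per mg)."""
        return self.total_activity_units / self.total_protein_mg


def summarize(steps):
    """Return one summary row per step, relative to the first step."""
    first = steps[0]
    rows = []
    for step in steps:
        rows.append({
            "step": step.name,
            "specific_activity": step.specific_activity,
            "yield_percent": 100 * step.total_activity_units
            / first.total_activity_units,
            "purification_fold": step.specific_activity
            / first.specific_activity,
        })
    return rows


# Invented illustrative data, not taken from any real purification.
steps = [
    PurificationStep("crude extract", 1000.0, 2000.0),
    PurificationStep("ion exchange", 100.0, 1500.0),
    PurificationStep("gel filtration", 5.0, 1000.0),
]

table = summarize(steps)
for row in table:
    print(f"{row['step']:<15} "
          f"{row['specific_activity']:>7.1f} units/mg  "
          f"{row['yield_percent']:>5.1f}% yield  "
          f"{row['purification_fold']:>6.1f}-fold")
```

A successful protocol shows the specific activity rising at each step while the yield falls as little as possible; in this invented example the final step reaches a 100-fold purification with half of the starting activity retained.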
A good assay is one which is quick, simple to conduct, highly specific for the protein of interest, and relatively inexpensive. Before looking at some specific examples of different assay types, we must consider the care which must be exercised when interpreting assay data in general:

■ Is the activity measurement due solely to the presence of the protein of interest? There may be other factors present which may contribute to the activity measurement. A series of control assay measurements in the absence of biologically active protein will indicate whether the assay components make a contribution to the measured response, e.g. an increase in absorbance arising from non-enzyme-catalysed substrate degradation in an enzyme assay, or non-specific binding partners in immunoblots and ELISAs.

In a coupled assay, the coupling enzyme(s) is (are) added in a large excess, perhaps 50-fold more than the enzyme being assayed. Any coupling enzymes should therefore be extremely pure.

Factors that can limit an activity measurement include an insufficiency of a coupling enzyme, or a reaction that occurs too quickly for the detection system to give an accurate measurement of the rate.

The quality of assay components is critical to the success of the assay. A good example of this is provided by the enzyme-coupled assay for the glycolytic enzyme, phosphoglycerate mutase (PGAM): 3-phosphoglycerate is converted to 2-phosphoglycerate by PGAM, and 2-phosphoglycerate is then converted to phosphoenolpyruvate (PEP) by the coupling enzyme enolase (see Fig. 5.8). The assay is started by the addition of PGAM, with the resultant formation of PEP monitored by measuring the increase in absorbance at 240 nm. During a study of yeast PGAM, it was noted that prior to the addition of this enzyme to the assay, some unexpectedly high activity measurements were obtained.
Further analysis revealed that the commercial preparation of enolase (purified from horse liver) was contaminated with PGAM.

■ Is the activity measurement proportional to the amount of protein present? If twice as much protein is added to the assay, is a doubling of the activity observed? If the answer is no, this may reflect the fact that some component of the assay is limiting the activity measurements.

The most convenient type of assay is the continuous assay, in which a response (e.g. a change in absorbance, fluorescence or pH) is measured directly following the addition of protein and can be recorded continuously throughout the course of the reaction. Less convenient, but no less useful, are discontinuous assays, in which the reaction is initiated by the addition of a protein and samples are then removed at specific time intervals. These samples are subsequently quenched (i.e. biological activity is stopped) and analysed.

Fig. 5.8 The phosphoglycerate mutase (PGAM) assay contains 3-phosphoglycerate and the coupling enzyme enolase. The reaction is initiated by the addition of PGAM, which converts 3-phosphoglycerate to 2-phosphoglycerate; enolase then converts this to phosphoenolpyruvate (PEP), which is detected at 240 nm.

5.3.1 Catalytic proteins (enzymes)

The aim of assaying an enzyme is to measure the rate of product formation or substrate utilization. This typically relies on a difference in the spectroscopic properties between the substrate and the product. For example, in the following reaction catalysed by alcohol dehydrogenase (EC 1.1.1.1):

ethanol + NAD+ → acetaldehyde (ethanal) + NADH + H+

the reduced form of the nicotinamide adenine dinucleotide coenzyme (NADH) absorbs radiation at 340 nm, whereas the oxidized form (NAD+) does not, allowing the direct continuous monitoring of product formation. Not all enzyme-catalysed reactions have natural reactants and products with suitable spectroscopic properties.
In such cases, synthetic chromogenic substrates may prove useful. For example, the enzyme 4-nitrobenzyl esterase, which is used in the synthesis of the antibiotic loracarbef, catalyses a reaction which produces little spectroscopic change. However, a nitrophenyl derivative of the substrate generates the product 4-nitrophenol, which is readily detected at 400 nm. Alternatively, the reaction of interest can be coupled to a second reaction, which will produce a spectroscopic change, e.g. the assay for hexokinase (EC 2.7.1.1) involves the addition of the coupling enzyme glucose-6-phosphate dehydrogenase (EC 1.1.1.49).

Hexokinase: ATP + D-glucose → ADP + D-glucose 6-phosphate

Glucose-6-phosphate dehydrogenase: D-glucose 6-phosphate + NADP+ → D-glucono-1,5-lactone 6-phosphate + NADPH + H+

Whilst the reaction catalysed by hexokinase produces no spectroscopic change, glucose-6-phosphate dehydrogenase generates NADPH, which absorbs at 340 nm, providing an indirect continuous assay for hexokinase activity. Alternative detection methods, such as change in pH, may provide a means of monitoring enzyme activity.

Loracarbef is a cephalosporin-derived antibiotic. During the synthesis of loracarbef, carboxyl groups are protected with 4-nitrobenzyl alcohol, which is subsequently removed by the enzyme 4-nitrobenzyl esterase.

5.3.2 Binding proteins

Protein binding assays can be used both to identify ligands and to characterize the nature of the protein-ligand interaction. Binding assays can be subdivided into two categories: those which are based on biophysical changes in the protein or ligand upon protein-ligand complex formation and those which employ direct quantitation of free and bound ligand. Biophysical changes can be monitored using a range of spectroscopic methods, such as absorbance, fluorescence, CD and NMR. These techniques rely on significant differences between the spectra of the unbound protein or ligand and those of the complexed state. In titration studies, where the degree of spectral change is assumed to be directly proportional to the concentration of bound ligand, it is possible to determine the degree of binding (see Fig. 5.9). In the absence of a suitable intrinsic spectroscopic change, it is possible to design chemically modified versions of the protein or ligand to facilitate a simple binding assay.

Fig. 5.9 Changes in the spectral properties of proteins on the addition of a ligand. (a) Change in fluorescence, plotted against [molybdate] (µM). Binding of molybdate ions to the molybdate-sensing protein ModE from E. coli. Aliquots of ligand were added to a solution of protein (40 µM) and changes in the fluorescence at 350 nm were monitored. Saturation occurs at a concentration of 40 µM ligand, showing that there is one binding site per polypeptide chain. Binding of ligand leads to an approximately 50% quenching in fluorescence (Boxer et al., 2004). (b) Chemical shift in NMR spectrum (ppm). This stack plot shows the spectral effects of adding increasing amounts of Cu(II) on the tryptophan residue within a prion protein peptide (PHGGGWGQ). The CuSO4 was added in aliquots of 0.0033 mole-equivalents up to 0.02 mole-equivalents (Viles et al., 1999).

The chromatin-associated protein Hbsu from Bacillus subtilis binds DNA in a sequence-independent manner and is important in bacterial nucleoid formation. Wild-type Hbsu does not contain any Trp residues and by introducing a Trp at position 47 it was possible to generate a mutant form of the protein, which was indistinguishable from the wild type while providing a spectroscopic means for determining dissociation constants (Groch et al., 1992).

A variety of means can be used to introduce spectroscopic labels into proteins.
For example, site-directed mutagenesis can be employed to replace a non-fluorescent amino acid with the fluorescent tryptophan (Trp) at a selected position in the protein. An alternative approach to introducing fluorophores involves in vitro chemical modification of amino acids; a fluorescent group can be introduced by reaction of a suitable reagent with Cys side chains, for example the introduction of the fluorescent probe IAEDANS into Trp repressor protein mutants to characterize tryptophan and DNA binding (Chou and Matthews, 1989). Recent advances in bacterial and yeast expression systems allow efficient site-specific introduction of unnatural amino acids in vivo; the use of an engineered tRNA and aminoacyl-tRNA synthetase within these systems permits the introduction of a range of unnatural amino acids, including fluorophores (Magliery, 2005).
Site-specific, in vivo, incorporation of unnatural fluorescent amino acids has been used to modify green fluorescent protein (GFP) at residue 66. The mutant GFPs were successfully overexpressed and purified and were found to have unique spectral properties (Wang et al., 2003).
Another possibility is to introduce the Trp analogue 7-azaTrp in place of Trp in an expressed protein by growth of the host organism on a medium containing this amino acid, with the biosynthetic pathway for Trp inhibited. 7-azaTrp has spectroscopic properties which can be readily distinguished from those of Trp. In the case of proteins that undergo significant conformational changes as a result of ligand binding, it may be possible to monitor these changes by determining the sedimentation coefficients using ultracentrifugation. Traditional direct quantitation methods rely on partitioning techniques, such as equilibrium dialysis and membrane filtration, in which the protein and bound ligand are separated from free ligand.
A typical dialysis binding assay would involve placing the protein within a dialysis membrane, which is then placed in a solution of ligand. At equilibrium, the concentration of free ligand will be the same inside and outside the dialysis membrane. The concentration of bound ligand can be calculated either from the difference between the free ligand concentration at the start of dialysis and the free ligand concentration at equilibrium, or from the measured concentrations of ligand on the protein side of the membrane (free ligand plus bound ligand) and on the other side of the membrane (free ligand). Similarly, the ability of membrane filtration to retain protein and protein complexed with ligand, but not free ligand, can be exploited to detect ligand binding. This simple technique requires a means of measuring the amount of ligand retained by the filter, i.e. complexed with protein. More recently, solid phase techniques, such as surface plasmon resonance (SPR), have been employed to detect and characterize protein-ligand interactions. This technique is based on the detection of an increase in mass resulting from protein-ligand complex formation. A typical assay would involve immobilization of the protein on the surface of a sensor, followed by introduction of a ligand. Protein-ligand interactions, resulting in an increase in mass, give rise to an increase in signal. Likewise, dissociation of a ligand from immobilized protein results in a decrease in mass, producing a decrease in signal (see Fig. 5.10). Thus, solid phase techniques are proving useful in identifying potential ligands and in determining the kinetic and affinity properties in binding assays (see Chapter 10, section 10.7).
Replacement of all tryptophan residues in bacteriophage λ lysozyme with 7-azaTrp has been used to probe the structure and function of this enzyme.
In addition, the 7-azaTrp-modified version of lysozyme facilitated its successful crystallization under the microgravity conditions on space shuttle flights (Evrard et al., 1998).
Binding studies using equilibrium dialysis or membrane filtration are greatly facilitated if the ligand has some convenient spectroscopic property or is radioactively labelled.
Essentially, the SPR technique measures the rate constants for the association (complex formation) and dissociation (complex breakdown) steps. As described in Chapter 4, section 4.3, the equilibrium constant can be derived from the ratio of these rate constants.

5.3.3 Transport proteins
Transport protein assays are designed to measure the rate of transport of a ligand from one location to another, e.g. transport of glucose into erythrocytes by an integral membrane glucose transporter. While binding assays (see section 5.3.2) on purified transport proteins provide information relating to the stoichiometry and affinity of the protein-ligand interaction, they do not necessarily provide a measure of transport. In vivo and in vitro transport assays can involve relatively complex systems, such as whole cells or proteoliposomes, and require some means of monitoring the ligand concentration both at its initial location and at its final destination. One of the most direct methods employed to measure the activity of transport proteins involves the use of isotopically labelled ligands. In a typical assay, a known amount of isotopically labelled ligand is added to the system and incubated for a fixed period of time, after which the reaction is stopped by introducing an inhibitor or by rapid isolation of the final destination of the ligand, e.g. isolation of cells or proteoliposomes using centrifugation or size exclusion chromatography. Functional studies of transport proteins require assays to be conducted in the presence of varying amounts of ligand and in the presence or absence of effectors (see section 5.2.3).
There is a whole family of glucose transporter (GLUT) proteins which share some common structural features, but have different tissue distributions and affinities for glucose. For example, GLUT2 has a low affinity for glucose and is found in the liver and pancreas. GLUT4 has a higher affinity for glucose and is stimulated by insulin; it plays a particularly important role in glucose uptake by muscle and adipose tissue.
Fig. 5.10 A typical surface plasmon resonance (SPR) sensorgram (phases labelled in the figure: baseline in buffer, complex formation, dissociation in buffer, regeneration). A baseline signal is produced by the continuous flow of buffer over the protein immobilized on the surface of a sensor chip. Introduction of ligand into the flow of buffer may result in association of the ligand with the immobilized protein, which is detected as an increase in signal. Subsequent removal of the ligand and a return to the continuous flow of buffer alone generates a decrease in signal which reflects the dissociation of the ligand from the immobilized protein. The rate constants for the association and dissociation steps are obtained by curve-fitting procedures. RU, response units.
SPR occurs when light is reflected off thin metal films. SPR has been exploited to detect protein-ligand interactions by immobilizing a protein on the surface of a thin metal film (sensor chip) and subsequently adding ligands to produce a change in mass (protein plus ligand), which in turn alters the surface of the metal film. Surface changes produce a change in the angle of the reflected light, i.e. a change in the SPR signal.

5.3.4 Other types of proteins
This section will describe how to test the biological function of proteins which cannot be measured using the more conventional assays outlined in the previous sections.
GFP, which occurs naturally in the jellyfish Aequorea victoria and has been used as a reporter molecule in many prokaryotic and eukaryotic systems, is assayed on the basis of its ability to fluoresce at 510 nm following excitation at 395 nm. A more unusual assay has been developed for the taste-modifying glycoprotein miraculin. Miraculin, which is isolated from the red berries of the West African shrub Richadella dulcifica, has the unusual property of making sour tastes seem sweet. The sweet-inducing activity of miraculin is measured by administering a small amount of miraculin to subjects, followed by sour citric acid solutions; the subjects then assign an apparent sweetness value to the citric acid solutions (Theerasilp and Kurihara, 1988). One final example of a less conventional assay is that used to monitor the effects of cytokines. Cytokines are a family of proteins, secreted primarily by leukocytes, which allow cell-cell communication. Cytokines are often assayed by monitoring the effects they have on cell cultures, such as cell proliferation, differentiation, and stimulation of immune functions. In the case of proteins with biological functions which cannot be measured easily using direct assays, the advent of heterologous expression systems has presented a convenient means of monitoring the purification of such proteins, circumventing the need for an assay. Overexpressed proteins can account for a substantial proportion of the total cell protein (see section 5.4.2) and as a result can be readily identified by measurements of molecular mass using SDS-PAGE; the most abundant species with the correct molecular mass (theoretical or known) can be identified in crude extracts and in samples throughout the purification stages (see Fig. 5.11).
When the protein with the correct mass is purified, its identity should be confirmed by mass spectrometry and by partial amino acid sequencing or peptide mass fingerprinting (Chapter 8, section 8.3).
The cytokines are a diverse group of signalling proteins that include interferons, several interleukins, and a range of growth factors. Cytokines are secreted by many different cell types and bind to specific receptor proteins located on the surface of target cells, resulting in a biological response. Each cytokine acting on specific target cells produces a specific response that may be cell differentiation, growth, or tissue development.
[Figure: SDS-PAGE gel, lanes 1-7, with molecular mass markers in kDa (97.6, 66.2, ...).]
[Figure: 5' promoter, gene of interest, CS, tag, 3' terminator; transcription and translation give a protein with the tag fused to the C-terminus of the protein of interest.]
Fig. 5.12 Portion of expression plasmid with tag sequence at the 5' (i) or 3' (ii) end of the gene, producing N- or C-terminally tagged protein, respectively. In some cases, as shown here, a stretch of amino acids containing a target cleavage sequence (CS) (shown in black) is included to allow selective removal of the tag.

5.4 PURIFICATION OF PROTEINS

Fig. 5.13 Affinity purification of tagged proteins. A tag is fused to the N- or C-terminus of the protein of interest to facilitate purification, which relies on a specific interaction between the affinity tag and the binding partner of the affinity tag immobilized on Sepharose.

Affinity tag fused to N- or C-terminus of protein    Immobilized binding partner of affinity tag
β-galactosidase                                      TPEG (substrate analogue of β-galactosidase)
Glutathione-S-transferase                            Glutathione
Protein A                                            Immunoglobulin G
Poly-His or poly-Cys                                 Cu(II), Co(II) or Ni(II)

fusion proteins such as β-galactosidase and glutathione-S-transferase (Chapter 7, section 7.3.4), affinity proteins (e.g. protein A, exploiting its affinity for IgG) or metal-chelating affinity tags (see Fig. 5.13).
Metal affinity tags, which include poly-histidine and poly-cysteine tags, have a high affinity for divalent metal ions such as copper, nickel, and cobalt. This property forms the basis of immobilized metal affinity chromatography: divalent metal ions are immobilized on a chelating chromatography medium (e.g. NTA (nitrilotriacetic acid)-agarose) and selectively retain proteins fused to poly-histidine or poly-cysteine affinity tags. Subsequent elution, for example with imidazole solutions in the case of His-tags, can produce large quantities of pure tagged protein in one chromatographic step (Chapter 7, section 7.3.8).
Removal of purification tags may be necessary to restore function to the protein of interest or to enhance its solubility. Removal of a tag can be achieved by chemical or, more commonly, enzymatic methods. A number of examples of cleavage target sites and associated cleavage agents are given below (↓ marks the bond cleaved).

Cleavage site                  Cleavage agent
Asp-Asp-Asp-Asp-Lys↓X          Enteropeptidase
Leu-Val-Pro-Arg↓X              Thrombin
Ile-(Glu/Asp)-Gly-Arg↓X        Factor Xa
Met↓X                          Cyanogen bromide
Asn↓Gly                        Hydroxylamine

Tag removal can be somewhat complex: chemical conditions tend to be harsh, leading to non-specific cleavage, whereas enzymatic methods can be specific but inefficient.
Finally, recombinant DNA technologies allow us to change the sequence of genes encoding proteins at specific locations. The technique, known as site-directed mutagenesis (Chapter 9, section 9.10.3), requires the specific mutation of a gene and its overexpression to generate the resultant mutant protein. This approach allows us to probe the structure and function of proteins by designing, overexpressing, and purifying mutant proteins.
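When designing a tagged construct, it can be useful to check where a chosen cleavage agent would cut, both at the intended site and at any unintended sites within the protein of interest. The sketch below scans a one-letter amino acid sequence for the recognition sites listed in this section. The pattern dictionary, the function names, and the example sequence are hypothetical, and the patterns encode only the primary recognition sequence; real proteases and chemical reagents have additional preferences that are not modelled here.

```python
import re

# Recognition sites from this section in one-letter code. The lookahead
# requires a following residue, so cleavage always separates two fragments.
CLEAVAGE_PATTERNS = {
    "enteropeptidase": r"DDDDK(?=.)",
    "thrombin": r"LVPR(?=.)",
    "factor Xa": r"I[ED]GR(?=.)",
    "cyanogen bromide": r"M(?=.)",
    "hydroxylamine": r"N(?=G)",
}


def cleavage_positions(sequence, agent):
    """Return 1-based positions of the residues after which `agent` cuts."""
    pattern = re.compile(CLEAVAGE_PATTERNS[agent])
    return [m.end() for m in pattern.finditer(sequence.upper())]


def cleave(sequence, agent):
    """Return the fragments expected from complete cleavage by `agent`."""
    seq = sequence.upper()
    cuts = [0] + cleavage_positions(seq, agent) + [len(seq)]
    return [seq[a:b] for a, b in zip(cuts, cuts[1:])]


# Hypothetical N-terminally His-tagged construct with an enteropeptidase site.
tagged = "MHHHHHHDDDDKMASNGT"
tag, protein = cleave(tagged, "enteropeptidase")
```

Running `cleavage_positions(tagged, "cyanogen bromide")` on the same construct reports cuts after both methionines, which illustrates why a chemical agent with a short recognition site can be a poor choice for tag removal.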
5.5 Structure determination
KEY CONCEPTS
■ Understanding the tools required to explore the different levels of protein structure
■ Appreciating the importance of an integrated experimental approach to provide a more complete picture of protein structure

Prior to the early 1980s, many protein crystal structures were studied without the availability of their amino acid sequences. Nevertheless, in each case it was usually possible to produce an atomic interpretation of the X-ray diffraction pattern and to trace the path of the polypeptide chain, although the later availability of the sequence allowed detailed interpretation of the diffraction data to be undertaken. Many of these proteins, e.g. α-chymotrypsin, lactate dehydrogenase, subtilisin, and myoglobin, were studied because they could be readily purified from their natural source and formed crystals which gave rise to clear, interpretable X-ray diffraction patterns.
The ultimate aim of protein structure determination is to gather and interpret data to reveal the three-dimensional structure of the protein at an atomic level. Whilst X-ray crystallography and high-resolution NMR spectroscopy (for proteins of molecular mass <30 kDa) generate structural data, additional experimental information relating to the primary, secondary, tertiary, and quaternary structure, plus post-translational modifications, is required to interpret these data and solve the structure.
The apparent molecular mass, which is indicative of the number of amino acids within individual subunits, can be estimated by SDS-PAGE, gel filtration, ultracentrifugation, and mass spectrometry. The order of the amino acids within a polypeptide chain can be determined directly or indirectly. The direct approach uses Edman degradation or tandem mass spectrometry (Chapter 8, section 8.3) to sequence peptides derived from the protein (see Fig. 5.14). The indirect approach requires translation of the sequence of the gene encoding the protein.
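Once a sequence is available, by either route, a theoretical molecular mass can be calculated from it. The following minimal sketch assumes an unmodified chain of the 20 standard amino acids with no disulphide bonds, and uses standard average residue masses (these values and the function names are not taken from this chapter). The difference between an observed mass and the theoretical mass is returned without interpretation; as a guide, a single phosphoryl group adds roughly 80 Da.

```python
# Average residue masses in Da (amino acid minus water).
# Assumption: standard tabulated values, not figures from this chapter.
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594,
    "N": 114.1038, "D": 115.0886, "Q": 128.1307, "K": 128.1741,
    "E": 129.1155, "M": 131.1926, "H": 137.1411, "F": 147.1766,
    "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.0153  # one water per chain for the free N- and C-termini


def theoretical_mass(sequence):
    """Average molecular mass (Da) of an unmodified polypeptide chain."""
    return sum(AVERAGE_RESIDUE_MASS[aa] for aa in sequence.upper()) + WATER_MASS


def mass_difference(observed_mass, sequence):
    """Observed minus theoretical mass (Da); a non-zero value may point to
    post-translational modification, processing, or a sequence error."""
    return observed_mass - theoretical_mass(sequence)
```

For example, `theoretical_mass("GG")` gives the mass of glycylglycine, about 132.12 Da.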
A comparison of the theoretical molecular mass calculated from the primary structure with the experimentally determined mass can provide evidence of post-translational modification. The nature of the post-translational modification can be determined by SDS-PAGE combined with specific removal of modifications, immunoblotting using antibodies raised to specific post-translational modifications, and mass spectrometry to analyse the mass of peptides with modifications (Chapter 8, section 8.2). A number of techniques, including chemical modification of surface-exposed residues, fluorescence, CD, and NMR spectroscopy, can indicate the nature of the secondary and tertiary structure of a protein (Chapter 8, section 8.4). The availability of secondary and tertiary structure prediction tools (section 5.2.2), together with the primary structure, can be employed to complement experimentally derived data.
Fig. 5.14 Outline of scheme for the 'direct' determination of the primary structure of a protein using Edman degradation or tandem mass spectrometry. Purified protein is used to generate peptides. For Edman degradation, the peptides are purified by chromatography and then degraded to generate the amino acid sequence. For tandem mass spectrometry, the peptide mixture is applied to the first analyser to separate peptides by mass/charge; individual peptides selected from the first analyser are fragmented in a collision cell and their mass (sequence) determined in the second analyser.
Determination of the quaternary structure of a protein requires combining information relating to the molecular mass and the number and types of individual subunits with the apparent mass of the native multi-subunit protein, estimated by gel filtration or ultracentrifugation. Complex multi-subunit proteins may require further characterization using techniques such as cross-linking (Chapter 8, section 8.5) to establish the number of subunits present.
Combining all of this experimental evidence permits the accurate interpretation of data generated by X-ray crystallography or NMR spectroscopy, and eventual structure determination. The advent of bioinformatics has enabled the determination of theoretical protein structures, using homology searches and prediction methods (sections 5.2.2 and 5.8). Again, this approach can be used to complement experimentally derived data.

5.6 Factors affecting the activity of proteins
KEY CONCEPTS
■ Defining the major factors which influence the activity of proteins
■ Understanding how to monitor their effects on protein structure and function

Studying the effects of factors such as post-translational modification, ligands, pH, and temperature on protein activity can provide insight into the structure and function of proteins. The effects are measured by monitoring their impact on activity assays (section 5.3). This information, combined with data from other activity and structural studies, can suggest the function of a protein within the cell and how it will respond to developmental, environmental, and metabolic signals. The major factors influencing the activity of proteins within the cell are ligand concentration and post-translational modification. Whilst pH and temperature are normally constant within biological systems, in vitro manipulation of these factors can provide valuable information about the structure, stability, and function of the protein (section 5.2.3). In addition, the amount of any given protein in the cell is controlled by the relative rates of synthesis and degradation.
Note that while we can draw some conclusions about how the protein may function within the cell, the in vitro assay conditions will be very different from in vivo conditions (e.g. high protein concentration, low concentration of ligands, protein complex formation, compartmentalization).
5.6.1 pH and temperature
The effects of pH are important not only in studies of activity; they are also an essential consideration in developing successful purification strategies in which isoelectric focusing (Chapter 6, section 6.2.3) or ion-exchange chromatography (Chapter 7, section 7.2.2) is employed. The effects of pH on protein activity can be used to identify residues which are functionally or structurally important by exploiting the characteristic pKa values of amino acid side chains (see Chapter 1, section 1.2.3.2). A typical activity response to changes in pH is shown in Fig. 5.15. Under extreme pH conditions, the lack of activity is due to protein denaturation. The pH conditions which elicit the highest level of activity (the pH optimum) promote the side-chain ionization states which are essential for optimal activity. Identification of important residues requires activity measurements at pH values close to the pH optimum. The resultant changes in the ionization state of key residues produce changes in activity which can be used to estimate pKa values. In some cases, p[...]
[...] 70°C (dotted line).
Thermophilic organisms are categorized according to their growth temperatures: the optimal growth temperature for thermophiles is >50°C, for extreme thermophiles >65°C (e.g. Thermus aquaticus, the source of Taq polymerase), and for hyperthermophiles >90°C (e.g. Pyrococcus furiosus). Mesophilic organisms have growth temperature optima in the range 20-37°C.
Further information can be gleaned by monitoring activity changes in response to temperature changes under varying conditions, such as the presence of ligands and post-translational modifications. These types of experiments are important in characterizing the influence of ligands and post-translational modifications on protein activity.
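The link between the pH-activity profile described earlier in this section and the pKa values of key residues can be illustrated numerically. The sketch below assumes a deliberately simple model that is not taken from this chapter: two ionizable groups control activity, and only the form of the protein in which the group with the lower pKa is deprotonated and the group with the higher pKa is protonated is active. Denaturation at extreme pH is ignored, and the function names are hypothetical.

```python
def fraction_active(ph, pka_low, pka_high):
    """Fraction of protein in the active ionization state at a given pH.

    Model (an assumption for illustration): EH2 <-> EH <-> E, with only the
    singly protonated form EH active, so that
    fraction = 1 / (1 + [H+]/K1 + K2/[H+]).
    """
    h = 10.0 ** (-ph)
    k1 = 10.0 ** (-pka_low)
    k2 = 10.0 ** (-pka_high)
    return 1.0 / (1.0 + h / k1 + k2 / h)


def ph_optimum(pka_low, pka_high):
    """pH of maximum activity in this model: midway between the two pKa values."""
    return (pka_low + pka_high) / 2.0


# Bell-shaped profile for hypothetical groups with pKa values of 6 and 8.
profile = [(ph / 2.0, fraction_active(ph / 2.0, 6.0, 8.0)) for ph in range(8, 21)]
```

In this model the activity falls to roughly half of its maximum near each pKa, which is why measurements at pH values on either side of the optimum can be used to estimate the pKa values of the groups involved.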
Generally, we can correlate temperature-induced activity changes with temperature-induced structural changes, using a number of methods such as spectroscopic techniques, chemical modification, and site-directed mutagenesis. Proteins from thermophilic organisms respond to temperature changes in a similar way; however, in these cases, the optimum temperature is much higher. Although thermophilic proteins are generally structurally similar to their mesophilic counterparts, they have adopted a range of strategies to enhance their thermostability, including additional electrostatic interactions, helix dipoles, helix capping, and shorter surface loops (Cowan, 1995).

5.6.2 Inhibitor and activator molecules
Inhibitors and activators are important effectors of protein activity within the cell. By measuring the influence of these effectors on protein activity in vitro, it is possible to establish how proteins are regulated in a cellular environment. The effect of inhibitor and activator molecules on protein activity assays (section 5.3), combined with saturation curve analyses (Chapter 4, section 4.3), can be used to calculate the protein:effector stoichiometry, the strength of the interaction, and the influence of the effector on the rate constants of individual steps in the process. Typical inhibition studies involve measuring protein activity in the presence and absence of inhibitor. The effect of the inhibitor can be quantified by calculating Ki, the dissociation constant for the binding of the inhibitor to the protein: a small Ki value indicates a high-affinity inhibitor, whereas a large Ki value suggests a low-affinity inhibitor.
Biochemists usually quantify the effects of inhibitors in terms of an inhibitor constant, Ki, whereas pharmacologists quantify the effects of an inhibitor with the term IC50 (or I50), which is the concentration of inhibitor required to reduce the protein activity by 50%.
This is related to the Ki by the Cheng-Prusoff equation, a derivation of which is given in the appendix to this chapter.
In the case of enzymes, it is possible to determine the type of inhibition by measuring the reaction rate at a variety of substrate concentrations in the absence and presence of an inhibitor (Chapter 4, section 4.3.4). Once the type of inhibition has been identified it is possible to calculate Ki (in this case usually denoted as K_EI) using the following equations:

competitive inhibition
\[ K'_{\mathrm{m}} = K_{\mathrm{m}}\left(1 + \frac{[\mathrm{I}]}{K_{\mathrm{EI}}}\right) \tag{5.1} \]

non-competitive inhibition
\[ V'_{\max} = \frac{V_{\max}}{1 + [\mathrm{I}]/K_{\mathrm{EI}}} \tag{5.2} \]

where [I] is the inhibitor concentration, Km and Vmax are the constants in the absence of inhibitor, and K'm and V'max are the constants in the presence of the inhibitor; these are determined as outlined in Chapter 4, section 4.4. A more accurate method of determining Ki involves extending this approach to look at the effect of several concentrations of inhibitor to generate a series of Lineweaver-Burk plots (Engel, 1981).
The activation of proteins by effector molecules can be characterized by conducting protein activity assays in the presence and absence of activator. Changes in saturation curves are reflected in the rate constants, which can be used to measure the affinity of the activator for the protein (KA) and the type of activation, e.g. allosteric activation, in which the binding of the activator promotes cooperative substrate/ligand binding, analogous to oxygen binding to haemoglobin outlined in Chapter 4, section 4.3.2. Repeating these experiments with modified versions of effectors can improve our understanding of protein/effector specificity and provide structural information concerning the nature of the effector binding site (see Fig. 5.17). Over the last decade, generating libraries of modified protein ligands by a technique known as combinatorial chemistry has emerged as a powerful tool in protein characterization and drug discovery.
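The calculation of an inhibitor constant from kinetic constants measured with and without inhibitor can be sketched as follows. The functions assume the standard textbook relationships K'm = Km(1 + [I]/Ki) for competitive inhibition and V'max = Vmax/(1 + [I]/Ki) for non-competitive inhibition, and the form of the Cheng-Prusoff equation that applies to a competitive inhibitor, Ki = IC50/(1 + [S]/Km). The function names are hypothetical, and all concentrations must be supplied in the same units.

```python
def ki_competitive(km, km_apparent, inhibitor_conc):
    """Ki from the increase in apparent Km caused by a competitive inhibitor.

    Rearranged from K'm = Km * (1 + [I]/Ki).
    """
    ratio = km_apparent / km
    if ratio <= 1.0:
        raise ValueError("apparent Km must exceed Km for competitive inhibition")
    return inhibitor_conc / (ratio - 1.0)


def ki_noncompetitive(vmax, vmax_apparent, inhibitor_conc):
    """Ki from the decrease in apparent Vmax caused by a non-competitive inhibitor.

    Rearranged from V'max = Vmax / (1 + [I]/Ki).
    """
    ratio = vmax / vmax_apparent
    if ratio <= 1.0:
        raise ValueError("apparent Vmax must be lower than Vmax")
    return inhibitor_conc / (ratio - 1.0)


def ki_from_ic50_competitive(ic50, substrate_conc, km):
    """Cheng-Prusoff relationship for a competitive inhibitor."""
    return ic50 / (1.0 + substrate_conc / km)
```

As a usage example with made-up numbers, an inhibitor at 10 µM that raises the apparent Km from 2 µM to 6 µM gives `ki_competitive(2.0, 6.0, 10.0)`, i.e. a Ki of 5 µM. A single inhibitor concentration gives only an estimate; as noted above, several concentrations give a more accurate value.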
Fig. 5.17 Mapping binding sites using modified ligands. (a) In the 1960s the active sites of serine proteases (e.g. chymotrypsin) were mapped using substrate mimics of varying length and composition. By varying substrate length (P1, P2, P3, etc.) and composition, it was noted that substrates with large hydrophobic residues at P1 were good substrates for chymotrypsin. Structural studies subsequently revealed that the substrate specificity observed relates to the shape and hydrophobic nature of the active site. (Labels in the figure: hydrophobic pocket, Ser, Asp 102, substrate positions P3 and P2, N-terminus and C-terminus.)
Fig. 5.17 (b) A similar approach has been used more recently to map the active sites of therapeutically important proteases and subsequently to inhibit them. A good example of this is provided by the protease inhibitors designed for the effective treatment of HIV infection: a, representation of HIV protease complexed with a putative substrate; b, a number of effective HIV protease inhibitors which mimic the substrate (from Wlodawer and Vondrasek, 1998). FDA-approved inhibitors shown: Invirase (saquinavir, Ro 31-8959), Norvir (ritonavir, ABT-538), and Crixivan (indinavir, MK-639, L-735,524).
In many cases, activity changes induced by the presence of inhibitors and activators are associated with measurable changes in protein structure. Effector-induced structural changes can be monitored using spectroscopic techniques, such as absorbance, fluorescence, CD, and NMR, and can provide valuable information which complements measurements of effector-induced activity changes.

5.6.3 Post-translational modifications
Post-translational modification is one of the major effectors of protein activity within the cell. Indeed, its importance has been re-examined recently following the completion of numerous genome sequencing projects, in particular those of higher organisms.
These projects have revealed that the complexity of higher organisms is due to the differential RNA processing and post-translational modification of gene products. Thus, the human genome is thought to encode about 23 000 gene products, but these are thought to give rise to several hundred thousand distinct proteins. There are many forms of post-translational modification (see Table 5.1) which can influence protein activity. All of these covalent modifications are the result of enzyme-catalysed reactions which confer rapid and amplified changes in protein activity in response to small cellular signals.
Protein activity assays alongside saturation curve analyses (Chapter 4, section 4.3) can be used to characterize the effects of post-translational modifications. The exploration of these effects requires uniform protein preparations, both with and without post-translational modification. This can be checked using SDS-PAGE, Western blotting with modification-specific antibodies, and/or mass spectrometry. Extending standard activity assays of post-translationally modified protein to include activity measurements in the presence of inhibitors/activators and the effects of temperature can help determine how the protein is regulated in vivo.
Changes in protein activity resulting from post-translational modification may be accompanied by structural changes. Secondary and tertiary structural changes can be monitored using spectroscopic techniques, whereas changes in the quaternary structure can be determined using gel filtration, ultracentrifugation, and cross-linking studies (Chapter 8, section 8.2).

Table 5.1 Examples of post-translational modifications

Modification                                              Example
Amino acid modifications
  Cysteine: disulphide bond formation                     Lysozyme
  Lysine: biotinylation                                   Acetyl CoA carboxylase
  Serine: phosphorylation                                 Glycogen phosphorylase
  Threonine: phosphorylation                              Cyclin-dependent kinase
  Tyrosine: phosphorylation                               Cortactin
Addition of prosthetic groups: thiamine diphosphate       Pyruvate dehydrogenase
Proteolytic processing                                    Chymotrypsin
Protein targeting (signal sequences)                      Penicillin acylase

A good example of a protein which undergoes post-translational modification resulting in activity and structural changes is the enzyme glycogen phosphorylase (EC 2.4.1.1). Phosphorylase exists in two interconvertible forms: phosphorylase a, which is phosphorylated at Ser 14 by a specific kinase, and phosphorylase b, which is not phosphorylated. In the absence of AMP, phosphorylase b is inactive, whereas phosphorylase a actively catalyses the cleavage of glycogen to produce glucose 1-phosphate:
(1,4-α-D-glucosyl)n + Pi ⇌ (1,4-α-D-glucosyl)n-1 + α-D-glucose 1-phosphate
Conversion of phosphorylase b to phosphorylase a results in tertiary and quaternary structural changes which are associated with the regulatory properties of both forms of this key enzyme.

5.7 Interactions with other macromolecules
KEY CONCEPTS
■ Appreciating the importance of protein-protein interactions in vivo
■ Understanding the need to use appropriate experimental conditions to study these interactions

Whilst it is convenient to study isolated proteins in vitro, this does not reflect how they function in vivo. Within the cell, proteins are part of a highly organized system involving subcellular structures and high concentrations of macromolecules, including other proteins, nucleic acids, and lipids. Recent developments in techniques used to study protein interactions (section 5.3.2) are beginning to reveal the complexity of these interactions, many of which are weak and transient, but nonetheless important for the expression of protein function. The best-characterized protein-protein, protein-nucleic acid, or protein-lipid complexes are those which are most stable and have survived the harsh cell disruption procedures used during complex isolation.
The study of the interaction with other molecules can involve the identification of molecules which interact with the protein of interest or characterization of the effect of these molecules on protein activity. Typically, the identification of novel effectors involves the immobilization of the protein of interest which then serves as a bait to bind potential effectors. A great deal of care must be exercised when establishing the conditions which will promote and maintain protein-effector interactions as these interactions are highly sensitive to pH, ionic strength and the presence of other effectors. Failure to optimize conditions can result in false positive results, i.e. certain conditions can promote non-specific interactions, or produce false negative data in which interactions are not identified due to weakened binding. The influence of 'other molecules' on protein activity can be determined using activity assays (section 5.3) and saturation curve analyses (Chapter 4, section 4.3). The binding of 'other molecules' may be accompanied by structural changes which can be monitored as outlined in the previous section, providing structural data to complement activity measurements.

Use of bioinformatics

KEY CONCEPTS
■ Appreciating the range of databases and bioinformatic tools available to assist protein characterization
■ Understanding the theoretical basis of the tools used to calculate properties of a protein from its sequence

The study of proteins is no longer an exclusively laboratory-based activity pursued by biochemists; instead it is possible to use computer-based methods to examine proteins in detail. The field of bioinformatics has harnessed the exponential growth in the amount of information relating to nucleotide sequences, protein sequences, and biomolecular structures. As a result, all of this information has been organized into web-based databases which can be accessed and analysed using a range of computer programs.
In this section, we shall consider some of these databases and how they can be employed to explore protein structure and function. In addition, we shall look at a range of programs which can be used to analyse database information to assist protein characterization.

5.8.1 Web resources and databases

Nucleotide sequences, amino acid sequences, and protein structures are collected within a number of web-based databases. The aim of each of these databases is not only to collect information but also to present it in an annotated form to help biologists understand better the significance of the data. More recently, there has been an effort to integrate data sets (e.g. linking individual nucleotide/protein sequences to related three-dimensional structures, metabolic pathway databases, enzyme databases, disease databases, organism-specific databases, two-dimensional gel databases, and associated references), allowing researchers to characterize more fully the structural and functional properties of proteins.

The major databases for RNA and DNA sequences are EMBL, GenBank, and DDBJ. Sequencing technologies have enabled the completion of increasing numbers of genome-sequencing projects, which in turn has maintained a very large growth of these databases, with the EMBL Nucleotide Sequence Database containing 130 million sequence entries, comprising in excess of 2 × 10¹¹ nucleotides in 2008. Search tools permit retrieval of relevant database entries which can be further analysed using a suite of molecular biology tools (e.g. restriction site identification, sequence comparisons, and translation to identify open reading frames) or cross-referenced with other databases.

The International Nucleotide Sequence Database Collaboration (http://www.insdc.org/) combines the efforts of the European Molecular Biology Laboratory (EMBL) in Hinxton, UK, GenBank in Bethesda, Maryland, USA and the DNA Data Bank of Japan (DDBJ) in Mishima, Japan.

The PDB is currently maintained by the Research Collaboratory for Structural Bioinformatics (RCSB), involving Rutgers (NJ), San Diego (CA) and Madison (WI), and is located at http://www.rcsb.org/pdb. Linked sites are maintained by the European Bioinformatics Institute in Cambridge, UK (http://www.ebi.ac.uk/msd/) and in Osaka, Japan (PDBj; see http://www.pdbj.org).

One of the main protein sequence databases is SwissProt (http://www.ebi.ac.uk/swissprot/), which contains sequences which have been generated either directly by amino acid sequencing or indirectly from translating open reading frames of genes within TrEMBL (a database containing translated coding regions of EMBL/GenBank/DDBJ nucleotide databases). The SwissProt database is linked with about 50 other databases, allowing extensive characterization of the protein of interest.

An increasing number of whole genome-specific databases have become available over the past decade. Each genome is annotated to identify individual genes and these are cross-referenced with protein sequence and three-dimensional structure databases. The genomes characterized to date reflect their importance either as biological models (e.g. E. coli, C. elegans, S. cerevisiae) or as commercial organisms (e.g. rice, maize, chicken, salmon) or as disease-related organisms (human, Helicobacter pylori, Anopheles gambiae). Fig. 5.18 displays the NCBI websites for the C. elegans and the A. gambiae genomes.

The bacterium H. pylori, although isolated over 100 years ago, has only recently been shown to cause stomach or duodenal ulcers. A. gambiae, the mosquito, carries the malaria parasite and is the main vector of malaria in Africa, where it is estimated that a million children die each year from this disease. Characterization of the genomes of such disease agents will enhance the development of disease prevention, detection, and treatment.
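Two of the molecular biology tools mentioned in this section, restriction site identification and translation to identify open reading frames, can be illustrated with a minimal Python sketch. It is not the implementation used by any of the databases named here: the function names find_orfs and find_restriction_sites, the default minimum length and the example sequence are all hypothetical, and only the three forward reading frames are scanned (real tools also examine the complementary strand). The standard genetic code and the EcoRI recognition sequence GAATTC are used.

```python
from itertools import product

# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def find_orfs(dna, min_codons=3):
    """Return (start, end, protein) for each ATG...stop open reading frame
    found in the three forward frames (hypothetical helper; assumption:
    the reverse strand is ignored). Positions are 0-based; end is exclusive
    and includes the stop codon."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None:
                if codon == "ATG":
                    start = i  # first start codon opens the frame
            elif CODON_TABLE.get(codon, "X") == "*":
                # Translate from the start codon up to (not including) the stop.
                protein = "".join(
                    CODON_TABLE.get(dna[j:j + 3], "X")
                    for j in range(start, i, 3)
                )
                if len(protein) >= min_codons:
                    orfs.append((start, i + 3, protein))
                start = None
    return orfs


def find_restriction_sites(dna, site="GAATTC"):
    """Return 0-based positions of a recognition sequence (default: EcoRI)."""
    dna = dna.upper()
    site = site.upper()
    return [
        i for i in range(len(dna) - len(site) + 1)
        if dna[i:i + len(site)] == site
    ]


if __name__ == "__main__":
    example = "CCATGGCTGAATTCTAAGG"  # made-up sequence for illustration only
    print(find_orfs(example))               # [(2, 17, 'MAEF')]
    print(find_restriction_sites(example))  # [8]
```

In practice a sequence retrieved from EMBL, GenBank or DDBJ would be passed to functions of this kind in place of the made-up example, and the translated open reading frames could then be compared with entries in a protein sequence database.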
All three-dimensional biomolecular structures are deposited in the PDB database. A search of the PDB database using the name of a protein or its PDB entry code will provide an image of the protein structure together with information about the protein and its source, reference to the paper describing structure determination, the amino acid sequence, ligands present in the structure, secondary structure composition, and the atomic coordinates. The atomic coordinates can be downloaded from the PDB site and analysed in detail using molecular viewer software such as Rasmol (http://www.openrasmol.org) or its derivative Protein Explorer (http://www.umass.edu/microbio/rasmol). Such tools facilitate an exploration of protein structure, including ligand binding sites, catalytic residues, key structural residues, and protein-protein interactions.

[Fig. 5.18: NCBI Map Viewer pages for the C. elegans and A. gambiae genomes]

5.8.2 Sequence analysis

Protein sequences, derived from experimental data or database entries, can generate a wealth of information that can assist protein characterization. Empirically derived information will be considered in this section; sequence comparisons