PDB 101 Course Notes [https://pdb101.rcsb.org/learn/guide-to-understanding-pdb-data/introduction]
0. Introduction
PDB archive == repository of atomic coordinates (+ other information) describing proteins and other
important biological macromolecules
- structures in the archive are determined using a balanced mixture of experimental observation and
knowledge-based modeling => we should confirm that there is experimental evidence that supports the
structure
To determine the structure (location of each atom relative to the others in the molecule):
- scientist has some experimental data about the structure of the molecule, but on its own it is
usually not sufficient to build an atomic model => need for additional knowledge (e.g. the sequence
of amino acids in the protein + the preferred geometry of atoms in a typical protein, i.e. the bond
lengths and bond angles)
1. X-ray Crystallography
- experimental data: X-ray diffraction pattern
Steps:
1. Purified and concentrated proteins form crystals.
2. Within the crystal, many copies of the protein are arranged in symmetrical arrays.
3. X-ray beams strike the crystal.
4. The X-rays scatter into a spot pattern.
5. X-ray diffraction patterns are then analyzed to determine the positions of atoms in the protein.
- PDB contains 2 types of data for crystal structures:
- 1. coordinate files (atomic positions for the final structure model),
- 2. structure factors (the intensity and phase of the X-ray spots in the diffraction
pattern) - from these, an image of the electron density map can be created (e.g. Astex viewer)
- X-ray crystallography can provide very detailed atomic information, showing every atom in a
protein or nucleic acid along with atomic details of ligands, inhibitors, ions, and other
molecules that are incorporated into the crystal
- crystallization is difficult and limits the types of proteins that may be studied: excellent for
determining structures of rigid proteins (form nice, ordered crystals), but flexible proteins are
more difficult
- crystallography relies on having many molecules aligned in exactly the same orientation, like a
repeated pattern in wallpaper; flexible portions of a protein will often be invisible in
crystallographic electron density maps, since their electron density is smeared over a large space
- accuracy of the atomic structure depends on the quality of the crystals (accuracy measures:
resolution of crystallographic structure - measures the amount of detail that may be seen in the
experimental data, and R-value - measures how well the atomic model is supported by the
experimental data found in the structure factor file)
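The R-value above is commonly written as R = sum(||F_obs| - |F_calc||) / sum(|F_obs|), where F_obs are the observed and F_calc the model-derived structure factor amplitudes. A minimal Python sketch of that ratio follows; the amplitudes are made-up numbers (not from a real entry) and the scale factor usually applied between the two sets is assumed to be 1.

```python
def r_value(f_obs, f_calc):
    """Crystallographic R-value: sum(||Fobs| - |Fcalc||) / sum(|Fobs|)."""
    if len(f_obs) != len(f_calc):
        raise ValueError("observed and calculated lists must have equal length")
    numerator = sum(abs(abs(o) - abs(c)) for o, c in zip(f_obs, f_calc))
    denominator = sum(abs(o) for o in f_obs)
    return numerator / denominator

# Hypothetical structure factor amplitudes, for illustration only.
observed = [100.0, 50.0, 25.0, 25.0]
calculated = [90.0, 55.0, 30.0, 20.0]
print(r_value(observed, calculated))  # lower values mean better agreement
```

A perfect fit gives 0; the larger the value, the less the model agrees with the data.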
2. X-ray Free Electron Lasers (XFEL)
- new method, termed serial femtosecond crystallography, made possible by a new tool (XFEL)
Steps:
1. XFEL is used to create pulses of radiation that are extremely short (lasting only femtoseconds) and
extremely bright.
2. Stream of tiny crystals (nanometers-micrometers) is passed through the beam.
3. Each X-ray pulse produces a diffraction pattern from a crystal (often burning it up in the process).
4. Data set is compiled from as many as tens of thousands of these individual diffraction patterns.
- very powerful method because it allows us to study molecular processes that occur over very
short time scales (e.g. absorption of light by biological chromophores)
3. NMR Spectroscopy (Nuclear Magnetic Resonance)
- experimental data: information on the local conformation and distance between atoms that
are close to one another
Steps:
1. Purified protein is mixed with special solvent and inserted into NMR probe.
2. The sample is exposed to strong magnetic field.
3. The field aligns the spins of the nuclei of certain atoms (e.g. hydrogen).
4. When the sample is probed with radio waves, the nuclei become excited and resonate.
5. The frequencies are measured and recorded (the surrounding atoms determine how high/ low the
frequencies are).
6. Using computational methods, the measurements are converted into a graph that represents the
frequencies as peaks with specific locations for specific atom groups.
7. This information is further refined and combined with additional NMR experiments to determine
the 3D structure.
- technique is currently limited to small/ medium proteins (large proteins present problems with
overlapping peaks in the NMR spectra)
- major advantage: it provides information on proteins in solution (as opposed to those locked in a
crystal/ bound to a microscope grid) => premier method for studying the atomic structures of
flexible proteins
- typical NMR structure includes an ensemble of protein structures, all of which are consistent with
the observed list of experimental restraints
- structures in the ensemble are very similar to each other in regions with strong restraints, and very
different in less constrained portions of the chain - these areas with fewer restraints are the flexible
parts of the molecule => they do not give a strong signal
- PDB contains two types of coordinate entries for NMR structures:
- 1. full ensemble from the structural determination (each structure designated as a
separate model),
- 2. minimized average structure,
- plus a list of restraints (hydrogen bonds, disulfide linkages, distances between hydrogen atoms
that are close to one another, and restraints on the local conformation and stereochemistry of
the chain)
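One way to see which parts of an NMR ensemble are well restrained is to measure, atom by atom, how far the models scatter around their mean position. The sketch below does this in plain Python; the coordinates are invented toy values, and it assumes the models are already superimposed and list their atoms in the same order.

```python
import math

def per_atom_spread(models):
    """RMS distance of each atom from its mean position across all models.

    models: list of models, each a list of (x, y, z) tuples in the same atom order.
    """
    n_models = len(models)
    spreads = []
    for positions in zip(*models):          # one atom across all models
        mean = [sum(p[i] for p in positions) / n_models for i in range(3)]
        msd = sum(
            sum((p[i] - mean[i]) ** 2 for i in range(3)) for p in positions
        ) / n_models
        spreads.append(math.sqrt(msd))
    return spreads

# Toy ensemble of two models with two atoms each (hypothetical coordinates).
ensemble = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)],
]
print(per_atom_spread(ensemble))  # small value = well defined, large = variable
```

Atoms with a small spread correspond to the strongly restrained regions, atoms with a large spread to the less constrained ones.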
4. 3D Electron Microscopy (3DEM)
- experimental data: image of the overall shape of the molecule
Steps:
1. Tiny amount of purified protein is placed onto copper grid.
2. Special machine spreads the sample in single layer on the grid.
3. Sample is frozen in liquid ethane.
4. Sample is loaded into electron microscope.
5. Sample is exposed to a beam of accelerated electrons and images are captured.
6. Usually, the proteins are in many different orientations. Images are grouped by orientation.
7. Images grouped by orientation are computationally combined to reconstruct the 3D shape of
molecule.
8. Atoms are fit into the map to derive the 3D structure of the protein.
- 3DEM used to determine 3D structures of large macromolecular assemblies
- cryo-EM: many thousands of different single particles are imaged while preserved in a thin layer of
non-crystalline ice; these views show the molecule in myriad different orientations, and a
computational approach akin to that used for computerized axial tomography (CAT scans) in medicine
yields a 3D mass density map
- cryo-electron tomography provides structural information at slightly lower resolution (i.e., protein
domains and secondary structural elements)
The practice of combining multiple experimental approaches is often referred to as Integrative or
Hybrid Methods (I/HM) - proven to be very useful for multimolecular structures (complexes of
ribosomes, tRNA and protein factors, and muscle actomyosin structures).
PDB contains structures for:
- ribosomes,
- oncogenes,
- drug targets,
- and even whole viruses.
- we can often find multiple structures for a given molecule, or partial structures, or structures that have
been modified or inactivated from their native form
File formats: PDB, mmCIF, XML
- usually consists of header (info about the protein) + sequence of atoms and their coordinates
- in a typical entry, there is a diverse mixture of biological molecules, small molecules, ions, and water;
the names and chain IDs can be used to help sort these out
In structures determined from crystallography, atoms are annotated with temperature factors that
describe their vibration and occupancies that show if they are seen in several conformations.
NMR structures often include several different models of the molecule.
1. Beginner’s Guide to PDB Structures and the PDBx/mmCIF Format
PDBx/mmCIF format includes data items relevant to macromolecular crystallographic experiments
- overcomes limitations of the legacy PDB file format and supports data representing large
structures, complex chemistry, and new and hybrid experimental methods
- PDB file format is not modified or extended to support new content (will become outdated)
- supported by visualization applications: Jmol, Chimera, and OpenRasMol and structure determination
systems: CCP4, Phenix
Syntax & Format:
_data_item_category_name.data_item_attribute_name
_category.attribute value // key-value data category
_category.attribute_beta 90.00
// tabular data category (multiple values for each token):
loop_
_atom_site.id
_atom_site.label_sth
_atom_site.sth
_atom_site.string_example
1 SOME_LABEL 6.913 'vic, veci, carkou'
2 QUAK 8.888 'quak'
3 NECONECO 1.000 'A.B.C., Neco'
# a hash at the beginning of a line marks a comment; it is used to separate categories
=> category is a tabular data structure where data items are the rows and the stored information are the
columns:
-------------------------------------------------------------------
|                            atom_site                            |
| .id             | 1                   | 2      | 3              |
| .label_sth      | SOME_LABEL          | QUAK   | NECONECO       |
| .sth            | 6.913               | 8.888  | 1.000          |
| .string_example | 'vic, veci, carkou' | 'quak' | 'A.B.C., Neco' |
-------------------------------------------------------------------
- if a data item (or group of data items in the same category) has multiple values, the
category is preceded by a loop_ token
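The key-value and loop_ layouts described above can be read with a few lines of Python. This is only a sketch for simplified snippets like the one in these notes: it assumes every value fits on one line (no multi-line text blocks), uses straight single quotes for strings with spaces, and relies on `shlex.split`, which would stumble on names containing an apostrophe. Real files are better read with a dedicated mmCIF parser.

```python
import shlex

def parse_simple_cif(text):
    """Parse key-value items and loop_ tables from a simplified mmCIF snippet."""
    categories = {}      # category name -> list of row dicts
    loop_items = None    # (category, attribute) pairs of the open loop_, if any
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            loop_items = None                 # a '#' line closes the category
            continue
        if line == "loop_":
            loop_items = []
            continue
        tokens = shlex.split(line)            # keeps 'quoted strings' together
        if tokens[0].startswith("_"):
            category, attribute = tokens[0][1:].split(".", 1)
            if loop_items is not None and len(tokens) == 1:
                loop_items.append((category, attribute))   # loop_ header line
            elif len(tokens) == 2:
                loop_items = None
                categories.setdefault(category, [{}])[0][attribute] = tokens[1]
        elif loop_items:
            row = {attr: value for (_, attr), value in zip(loop_items, tokens)}
            categories.setdefault(loop_items[0][0], []).append(row)
    return categories

# The placeholder example from these notes (item names are not real dictionary items).
example = """
_category.attribute_beta 90.00
#
loop_
_atom_site.id
_atom_site.label_sth
_atom_site.sth
_atom_site.string_example
1 SOME_LABEL 6.913 'vic, veci, carkou'
2 QUAK 8.888 'quak'
3 NECONECO 1.000 'A.B.C., Neco'
#
"""
parsed = parse_simple_cif(example)
print(parsed["category"][0]["attribute_beta"])
print(parsed["atom_site"][0]["string_example"])
```

Each category ends up as a list of dictionaries, one per record, keyed by attribute name.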
Parent-child relationships:
- created when data item occurs in multiple categories (most commonly occurs for labels and
identifiers which are reused throughout the dictionary)
Chemical component dictionary - descriptions of all of the monomers and ligands in PDB structures
(CHEM_COMP_DICTIONARY category group)
2. Dealing with Coordinates
- primary information stored in PDB: coordinate files (list of atoms in each structure and their
3D location in space + summary about the structure, sequence, and experiment)
- files are available in formats: PDBx/mmCIF, PDB, XML
Atomic-level Data
- PDB entry contains atomic coordinates for a collection of proteins, small molecules, ions and
water
- each atom is identified by a sequential number, specific atom name, the name and number of
the residue it belongs to, a one-letter code to specify the chain, its x, y, and z coordinates, and
an occupancy and temperature factor (stored in the _atom_site category)
ATOM record - used to identify proteins or nucleic acid atoms
HETATM record - to identify atoms in small molecules
- most molecular graphics programs let you color selected portions of the molecule, e.g. pick out
all of the carbon atoms and color them, or highlight one particular amino acid (by default,
many molecular graphics programs do not display the water molecules)
Chains
- biological molecules are hierarchical: atoms -> residues -> chains -> assemblies
- coordinate files contain ways to organize and specify molecules at all levels
- in PDBx/mmCIF format, the looping nature of the records makes it easy to represent different
chains and multiple molecules
Segment from entry 4hhb showing the transition from chain A to chain B, where the chain is
designated in the _atom_site.label_asym_id record and further identified in the
_atom_site.label_entity_id record:
loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.type_symbol
_atom_site.label_atom_id
_atom_site.label_alt_id
_atom_site.label_comp_id
_atom_site.label_asym_id
_atom_site.label_entity_id
_atom_site.label_seq_id
_atom_site.pdbx_PDB_ins_code
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
_atom_site.occupancy
_atom_site.B_iso_or_equiv
_atom_site.pdbx_formal_charge
_atom_site.auth_seq_id
_atom_site.auth_comp_id
_atom_site.auth_asym_id
_atom_site.auth_atom_id
_atom_site.pdbx_PDB_model_num
ATOM 1 N N . VAL A 1 1 ? 6.204 16.869 4.854 1.00 49.05 ? 1 VAL A N 1
ATOM 2 C CA . VAL A 1 1 ? 6.913 17.759 4.607 1.00 43.14 ? 1 VAL A CA 1
<snip>
ATOM 1069 O OXT . ARG A 1 141 ? -9.474 13.682 -9.742 1.00 31.52 ? 141 ARG A OXT 1
ATOM 1070 N N . VAL B 2 1 ? 9.223 -20.614 1.365 1.00 46.08 ? 1 VAL B N 1
ATOM 1071 C CA . VAL B 2 1 ? 8.694 -20.026 -0.123 1.00 70.96 ? 1 VAL B CA 1
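To make the chain bookkeeping concrete, the rows of the loop above can be paired with the item names and grouped by _atom_site.label_asym_id. A minimal Python sketch, assuming whitespace-separated values with no quoted strings (true for these rows, not for every mmCIF file); only four rows from the excerpt are used.

```python
from collections import defaultdict

# Item names of the _atom_site loop, in the order listed in the excerpt.
ITEMS = [
    "group_PDB", "id", "type_symbol", "label_atom_id", "label_alt_id",
    "label_comp_id", "label_asym_id", "label_entity_id", "label_seq_id",
    "pdbx_PDB_ins_code", "Cartn_x", "Cartn_y", "Cartn_z", "occupancy",
    "B_iso_or_equiv", "pdbx_formal_charge", "auth_seq_id", "auth_comp_id",
    "auth_asym_id", "auth_atom_id", "pdbx_PDB_model_num",
]

ROWS = [
    "ATOM 1 N N . VAL A 1 1 ? 6.204 16.869 4.854 1.00 49.05 ? 1 VAL A N 1",
    "ATOM 2 C CA . VAL A 1 1 ? 6.913 17.759 4.607 1.00 43.14 ? 1 VAL A CA 1",
    "ATOM 1070 N N . VAL B 2 1 ? 9.223 -20.614 1.365 1.00 46.08 ? 1 VAL B N 1",
    "ATOM 1071 C CA . VAL B 2 1 ? 8.694 -20.026 -0.123 1.00 70.96 ? 1 VAL B CA 1",
]

def group_by_chain(rows):
    """Map each label_asym_id to the list of its atoms (as dicts of strings)."""
    chains = defaultdict(list)
    for row in rows:
        atom = dict(zip(ITEMS, row.split()))
        chains[atom["label_asym_id"]].append(atom)
    return dict(chains)

chains = group_by_chain(ROWS)
for chain_id, atoms in chains.items():
    print(chain_id, "entity", atoms[0]["label_entity_id"], "atoms:", len(atoms))
```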
TER records (legacy PDB format) are used to separate protein and nucleic acid chains:
- a TER record indicates that the chains are not physically connected to each other
ATOM 1067 NH1 ARG A 141 -10.147 7.455 -6.079 1.00 23.24 N
ATOM 1068 NH2 ARG A 141 -8.672 8.328 -4.506 1.00 33.34 N
ATOM 1069 OXT ARG A 141 -9.474 13.682 -9.742 1.00 31.52 O
TER 1070 ARG A 141
ATOM 1071 N VAL B 1 9.223 -20.614 1.365 1.00 46.08 N
ATOM 1072 CA VAL B 1 8.694 -20.026 -0.123 1.00 70.96 C
ATOM 1073 C VAL B 1 9.668 -21.068 -1.645 1.00 69.74 C
ATOM 1074 O VAL B 1 9.370 -22.612 -0.994 1.00 71.82 O
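Splitting a legacy PDB file into chains at TER records can be sketched as below. Real PDB files use fixed columns; this sketch instead takes the first whitespace-separated word as the record name, which is an assumption that fits the collapsed excerpt above but can fail on real files (e.g. when HETATM runs into a five-digit serial number).

```python
def split_at_ter(lines):
    """Return a list of chains, each a list of ATOM/HETATM lines, split at TER."""
    chains, current = [], []
    for line in lines:
        words = line.split()
        if not words:
            continue
        record = words[0]
        if record in ("ATOM", "HETATM"):
            current.append(line)
        elif record == "TER" and current:
            chains.append(current)        # TER closes the current chain
            current = []
    if current:                            # atoms after the last TER
        chains.append(current)
    return chains

excerpt = [
    "ATOM 1068 NH2 ARG A 141 -8.672 8.328 -4.506 1.00 33.34 N",
    "ATOM 1069 OXT ARG A 141 -9.474 13.682 -9.742 1.00 31.52 O",
    "TER 1070 ARG A 141",
    "ATOM 1071 N VAL B 1 9.223 -20.614 1.365 1.00 46.08 N",
    "ATOM 1072 CA VAL B 1 8.694 -20.026 -0.123 1.00 70.96 C",
]
pieces = split_at_ter(excerpt)
print([len(chain) for chain in pieces])
```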
MODEL/ENDMDL keywords indicate multiple molecules in a single file:
- MODEL keyword also used in biological assembly files to separate the many symmetrical copies of the
molecule that are generated from the asymmetric unit
Temperature Factors
If we were able to hold an atom rigidly fixed in one place, we could observe its distribution of electrons in an
ideal situation. The image would be dense towards the center with the density falling off further from the
nucleus. The experimental electron density distributions, however, usually have a wider distribution -> due to
vibration of the atoms/ differences between the many different molecules in the crystal lattice. The observed
electron density will include an average of all these small motions, yielding a slightly smeared image of the
molecule.
- motions + resultant smearing of the electron density are incorporated into the atomic model
by a B-value or temperature factor (amount of smearing is proportional to the magnitude of
the B-value)
_atom_site.B_iso_or_equiv
- value < 10: model of the atom is very sharp (the atom is not moving much => is in the same
position in all of the molecules in the crystal),
- > 50: atom is moving so much that it can barely be seen (often for atoms at the surface of
proteins, where long side chains are free to wag in the surrounding water)
. . .
_atom_site.occupancy
_atom_site.B_iso_or_equiv
_atom_site.pdbx_formal_charge
_atom_site.auth_seq_id
_atom_site.auth_comp_id
_atom_site.auth_asym_id
_atom_site.auth_atom_id
_atom_site.pdbx_PDB_model_num
ATOM 1 N N . VAL A 1 1 ? 6.204 16.869 4.854 1.00 49.05 ? 1 VAL A N 1
ATOM 2 C CA . VAL A 1 1 ? 6.913 17.759 4.607 1.00 43.14 ? 1 VAL A CA 1
ATOM 3 C C . VAL A 1 1 ? 8.504 17.378 4.797 1.00 24.80 ? 1 VAL A C 1
- temperature factors == a measure of confidence in the location of atom
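For an isotropic B-value, the standard relation B = 8 * pi^2 * <u^2> links it to the mean-square displacement <u^2> of the atom (in square angstroms). The Python sketch below converts B-values from the excerpt above into RMS displacements and applies the two rule-of-thumb thresholds from these notes; the label "intermediate" for values in between is an assumption added for illustration.

```python
import math

def rms_displacement(b_factor):
    """RMS displacement in angstroms from an isotropic B-value: sqrt(B / (8*pi^2))."""
    return math.sqrt(b_factor / (8 * math.pi ** 2))

def describe(b_factor):
    """Rule-of-thumb reading of a B-value, using the thresholds in these notes."""
    if b_factor < 10:
        return "sharp"
    if b_factor > 50:
        return "barely visible"
    return "intermediate"   # hypothetical label for the range in between

# B_iso_or_equiv values of the first three atoms in the 4hhb excerpt.
for b in (49.05, 43.14, 24.80):
    print(b, round(rms_displacement(b), 2), describe(b))
```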
Occupancy and Multiple Conformations
- macromolecular crystals are composed of many individual molecules packed into a
symmetrical arrangement
- in some crystals: slight differences between each of these molecules (a sidechain on the surface
may wag back and forth between several conformations/ substrate may bind in two orientations in an
active site/ a metal ion may be bound to only a few of the molecules)
- => occupancy: for most atoms it has a value of 1 (== the atom is found in all of the
molecules in the same place in the crystal); if a metal ion binds to only half of the molecules
in the crystal -> occupancy of 0.5 (we will see a weak image of the ion in the electron density map)
- occupancies are also commonly used to identify side chains or ligands that are observed in
multiple conformations -> they indicate the fraction of molecules that have each
conformation - 2+ atom records are included for each atom, with occupancies like 0.5 and 0.5,
or 0.4 and 0.6, or other fractional occupancies that sum to a total of 1
Alternate conformations in Myoglobin:
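(alternate conformations: _atom_site.label_alt_id, occupancy: _atom_site.occupancy; the records below
are shown in legacy PDB format, where the alternate location letter A/B precedes the residue name)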
. . .
_atom_site.label_alt_id
_atom_site.label_comp_id
_atom_site.label_asym_id
_atom_site.label_entity_id
_atom_site.label_seq_id
_atom_site.pdbx_PDB_ins_code
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
_atom_site.occupancy
_atom_site.B_iso_or_equiv
_atom_site.pdbx_formal_charge
_atom_site.auth_seq_id
_atom_site.auth_comp_id
_atom_site.auth_asym_id
_atom_site.auth_atom_id
_atom_site.pdbx_PDB_model_num
<snip>
ATOM 65 C GLN A 8 7.602 12.149 22.631 1.00 8.08 C
ATOM 66 O GLN A 8 8.769 12.399 22.918 1.00 8.39 O
ATOM 67 CB AGLN A 8 5.987 11.822 24.520 0.57 13.03 C
ATOM 68 CB BGLN A 8 5.948 11.968 24.580 0.43 9.68 C
ATOM 69 CG AGLN A 8 7.030 11.303 25.506 0.57 16.30 C
ATOM 70 CG BGLN A 8 6.967 12.094 25.688 0.43 12.07 C
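A quick consistency check on the excerpt above is that the occupancies of all alternate conformations of one atom add up to 1. The sketch below does this for GLN 8 with values transcribed by hand as (atom name, alternate location, occupancy); keying by atom name alone is only valid because all records belong to one residue.

```python
from collections import defaultdict

# (atom name, alternate location letter, occupancy) for GLN 8 in the excerpt.
records = [
    ("C", "", 1.00),
    ("O", "", 1.00),
    ("CB", "A", 0.57),
    ("CB", "B", 0.43),
    ("CG", "A", 0.57),
    ("CG", "B", 0.43),
]

def occupancy_totals(atom_records):
    """Sum the occupancies of all conformations of each atom in one residue."""
    totals = defaultdict(float)
    for atom_name, _alt_id, occupancy in atom_records:
        totals[atom_name] += occupancy
    return dict(totals)

totals = occupancy_totals(records)
for atom_name, total in totals.items():
    status = "ok" if abs(total - 1.0) < 1e-6 else "check"
    print(atom_name, status)
```

Totals below 1 are not necessarily errors: as noted above, a partially bound ion can legitimately have an occupancy such as 0.5.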
3. Biological Assemblies
- Biological Assembly and Asymmetric Unit are the same for many PDB entries - some are
different (mostly those solved by X-ray crystallography)
The primary coordinate file of a crystal structure typically contains just one crystal asymmetric unit,
which may or may not be the same as the biological assembly. This introduction describes the terms
asymmetric unit and biological assembly, lists where information about these can be found in the various
file formats (PDB and mmCIF), and explains how biological assembly files in the PDB archive are
derived. Since the PDBML format is derived from the mmCIF format, a separate discussion of this
format is not included here.
4. Missing Coordinates and Biological Assemblies
5. Primary Sequences and the PDB Format
6. Hierarchical Structure of Proteins
7. Exploring Carbohydrates in the PDB Archive
8. Small Molecule Ligands
9. Molecular Graphics Programs
10. Computed Structure Models
12. Resolution
13. R-value and R-free
14. Structure Factors and Electron Density