R E S E A R C H A R T I C L E What makes it difficult to refine protein models further via molecular dynamics simulations? Lim Heo | Michael Feig Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, Michigan Correspondence Michael Feig, 603 Wilson Road, Room 218 BCH, East Lansing, MI 48824, USA. Email: feig@msu.edu Abstract Protein structure refinement remains a challenging yet important problem as it has the potential to bring already accurate template-based models to near-native resolution. Refinement based on molecular dynamics simulations has been a highly promising approach and the performance of MD-based refinement in the Feig group during CASP12 is described here. During CASP12, sampling was extended well into the microsecond scale, an improved force field was applied, and new protocol variations were tested. Progress over previous rounds of CASP was found to be limited which is analyzed in terms of the quality of the initial models and dependency on the amount of sampling and refinement protocol variations. As current MD-based refinement protocols appear to be reaching a plateau, detailed analysis is presented to provide new insight into the major challenges towards more extensive structure refinement, focusing in particular on sampling with and without restraints. K E Y W O R D S protein structure prediction, structure refinement, molecular dynamics simulation, CASP, Markov state models, homology models 1 | INTRODUCTION Computational protein structure prediction has become a valuable tool in structural biology.1 Template-based modeling is now applicable to ever more folds as the number of structures in the Protein Data Bank (PDB) continues to grow rapidly, especially because of advances with cryo-electron microscopy (Cryo-EM)2 and structure determination via X-ray free electron lasers (XFEL).3 As a result, the chance of finding the structure of a close homolog for a given sequence is increasing so that accurate protein structure prediction is routinely possible.4 Moreover, modeling based on co-evolutionary information allows high-resolution modeling even in the absence of a homologous structure.5,6 This has been enabled by next-generation genome sequencing (NGS), which rapidly increased the number of known sequences for homologous proteins in different organisms as the basis for predicting residue-residue contacts with high accuracy.7 Complementary to the advances outlined above, protein structure refinement methods have been increasingly successful in further improving initial knowledge-based predictions via ab initio techniques.8 Evidence of significant protein structure refinement has first emerged around 2010, at CASP9.9 In CASP10, refinement based on molecular dynamics (MD) led to a significant step forward after introducing the idea of ensemble averaging rather than attempting to select a single snapshot from a generated structural ensemble.10–12 The progress with MD-based methods has also been catalyzed by continuous improvements in force fields and ability to sample extensively.11 MD-based refinement so far is able to provide moderate and consistent structure improvements,13 and recent protocols have been optimized to limit the computational costs by taking advantage of GPU comput- ing.14 MD-based refinement has also contributed to significant improvements in local structure quality such as correct stereochemistry and avoidance of clashes, which is almost always possible even if the global structure cannot be improved.15 A number of alternate refinement protocols have also been proposed: In CASP11, Della Corte et al. devised a method where homologous structure information was utilized to smooth the potential energy landscape and lower the energy barriers thereby enhancing sampling during MD simulations.16 Lee et al. and Park et al. proposed methods that reconstruct unreliable regions such as loops and termini, followed by global relaxation.17,18 However, all of the refinement methods available to date, are still limited to relatively localized structural changes and perform best when the initial model is already relatively close to Proteins. 2018;86:177–188. wileyonlinelibrary.com/journal/prot VC 2017 Wiley Periodicals, Inc. | 177 Received: 1 August 2017 | Revised: 11 September 2017 | Accepted: 29 September 2017 DOI: 10.1002/prot.25393 the native state. It remains very challenging to consistently improve models with significant errors.19 In this article, further tests of MD-based protein structure refinement methods during CASP12 are described. Commonly, weak restraints have to be applied for MD-based refinement to succeed,12,20 which limits the extent of possible refinement. Since the protein force field has been improved further21 and GPGPU computing has become more widely available, MD simulations were carried out over longer time scales with weaker restraints during CASP12. The hope was to be able to sample more broadly, and, ultimately, refine structures more extensively. Other ideas that were explored during CASP12 involved iterative refinement along the initial direction of refinement. However, while consistent moderate refinement was again possible, we could not significantly expand the extent of refinement indicating that MD-based refinement methods using current protocols may have reached a plateau where simply running longer simulations is not offering further advantages. So, a main focus of this article is to look back and summarize the lessons we have learned until now and discuss where the key barriers are that need to be overcome to move forward. 2 | METHODS The general approach to protein structure refinement during CASP12 followed our previous protocol that involved long, weakly restrained MD simulations to generate structural ensembles. From the ensembles, a subset of structures was subsequently selected and averaged before finishing with detailed refinement of the local stereochemistry using locPREFMD.15 Variants of our previous protocol were tested during CASP12. In the CASP12 protocol, shown as a flow chart in Figure 1, initial models were first subjected to locPREFMD15 to improve the stereochemical quality and obtain better starting structures for the MD simulations. The resulting initial models were submitted as “Model 5” during CASP12 as the most conservative attempt at structure refinement. Afterwards, up to two rounds of MD-based refinement were carried out. Generally, the same scoring and filtering scheme followed by structure averaging was used as described previously and applied in CASP1113 on different ensembles of structures generated from MD trajectories under different conditions (see below). Before submitting final models, locPREFMD was applied twice after ensemble averaging and. depending on which MD ensemble was used, the resulting structures were submitted as Models “1”, “2”, “3”, or “4”, respectively. The primary set of MD simulations in the first round consisted of four runs at 500 ns, two runs at 400 ns, and 14 runs at 200 ns using weak harmonic restraints applied to Ca atoms with respect to the initial structure (force constant: 0.025 kcal/mol/Å2 ) for a total of 5.6 ls of sampling. This was about five times the amount of sampling applied during CASP11.13 The combination of runs at different lengths was predicated by the availability of GPU resources and the time limit for submitting predictions during CASP. While this ambitious amount of sampling could be reached for many targets within the CASP time frame of a few weeks, some targets were too large and simulation lengths had to be reduced in those cases. Each simulation was carried out in explicit solvent starting with randomly assigned initial momenta. The latest CHARMM force field, CHARMM36m,21 was used to take advantage of further improvements in sampling backbone torsions, especially in less-structure segments such as loops. Each protein was solvated in a periodic cubic box with at least a 9 Å solvent buffer to the box edge. The TIP3 water model22 as implemented in CHARMM was used and all systems were neutralized by adding either sodium (Na1 ) or chloride (Cl– ) counter ions, as appropriate. A 10 Å cutoff with switching between 8 and 10 Å was applied to the Lennard-Jones potential and the direct part of the electrostatic potential. ParticleFIGURE 1 Refinement protocol applied by the FEIG group in CASP12. Each colored box with a dashed line depicts a structural ensemble for which a common protocol of scoring, filtering, averaging and final local refinement was applied as depicted at the bottom 178 | HEO AND FEIG mesh Ewald summation was used to evaluate the full electrostatic potential.23 Bonds involving hydrogen atoms were kept rigid using holonomic constraints. Each system was prepared by minimization and subsequent heating to 298 K. Langevin dynamics was performed under NVT condition at 298 K with a 2 fs time step and a friction coefficient of 0.01/ps. The ensemble resulting from this set of simulations was filtered and averaged to result in the structure submitted as “Model 3.” To further enhance the initial conformational ensembles, 200,000 conformations were extracted from the first four 500 ns MD simulations and subsequently clustered into five clusters using the k-center algorithm implemented in MSMBuilder24 using the differences in the pair-wise Ca distance matrix as the distance measure. For each cluster, an averaged structure was generated from which an additional 100 ns of MD simulations were performed in the same manner as explained above except that restraints were applied with respect to the cluster-averaged structure using a reduced force constant of 0.005 kcal/mol/Å2 . The ensemble from the five cluster-initiated simulations was filtered and averaged to result in the structure submitted as “Model 4. The combined set of structures resulting from the initial simulations plus the additional simulations after clustering was used to generate the structure submitted as “Model 1” In some cases, there was not enough time to run the additional set of simulations after clustering. In this case, “Model 3” was submitted also as “Model 1.” A second round of refinement began with “Model 1” and involved an additional MD simulation >200 ns without restraints. From the resulting ensemble, structures were selected that advanced further along the direction of the initial refinement with the idea that those changes would bring the structure closer to the native state. The direction of the initial refinement was determined by comparing “Model 1” with the initial model and the conformations extracted from the second round of MD were accordingly scored using both RWplus25 and the vectorial direction similarity measure Sk calculated for a structure k based on Equations 1 and 2. Sk5 1 NCa XNCa i vrefined i Á vk i (1) vk i 5 1 3 X j2f21;0;1g rk i1j2rinitial i1j   (2) where the dot products of unit vectors of Ca deviation directions between the refined model and a given structure k were averaged over all NCa atoms. The Ca deviation direction was defined as the averaged vector between a given Ca and its two adjacent Ca atoms in a structure k from the initial model after overall least-squares structural superposition according to Equation 2. vrefined i was calculated in the same manner. Only structures generated during round two with Sk>0:4 and r2r rr <22, where r denotes the RWplus score, were selected for structure averaging, to be submitted as “Model 2.” If no such structures were found, that is, the additional second round sampling did not advance further along the initial direction of refinement, “Model 1” was submitted also as “Model 2.” The MD production runs were conducted on GPUs using CHARMM26 with OpenMM27 via API integration, except for the second-round single 200 ns simulation that was run using NAMD.28 Other MD simulations for equilibration, structure averaging, and locPREFMD were carried out using CHARMM. Structure preparation, selection, and analysis tasks relied on the MMTSB Tool Set.29 3 | RESULTS AND DISCUSSION 3.1 | Overall performance in CASP12 The overall results during CASP12 with the refinement protocol shown in Figure 1 are given in Table 1. For one target (TR879), the refinement protocol was applied incorrectly and the initial structure disintegrated as a result. Because the results for this target do not reflect the described protocol, the results were excluded from further analysis. Based on all other targets, GDT-HA scores were improved by 1.6 units, and four out of 36 targets could be refined by >5 GDT-HA units. Refinement was highly consistent as structures were refined for 30 out of 36 targets (83%). The performance was overall similar to previous years, but the extent of refinement was actually a bit less than in CASP1113 and CASP1011 despite the additional sampling and force field improvements. The new protocols tested in CASP12 did not provide significant advantages. Although “Model 1” submissions (that included the snapshots from the enhanced sampling following clustering, see above) seemed to be slightly better than just following the CASP11 protocol, statistical analysis indicates that none of the MD-refinement methods were significantly different from each other (based on P values > 0.05 from the paired t test analysis on the common targets). An interesting observation is that the refined models using just the snapshots from the enhanced sampling following clustering (“Model 4”) showed a greater variation with some models much better and others much worse than the “Model 1” structures. Pearson’s correlation coefficients for GDT-HA improvements between “Model 1” and “Model 3” and “Model 4” are 0.96 and 0.68, respectively. This reflects broader sampling as a result of the weaker restraints and the application of restraints with respect to structures generated in the first round of sampling. However, overall, the extent of refinement did not improve as structures moved away from the native as much as they came closer. The second round of the refinement protocol produced models only for about half of the targets, either because there was not enough time to run additional sampling after the first round was complete or because no structures were generated along the direction of the initial refinement (see Methods). In general, where available, models after the second round were refined to a similar extent as after the first round. Although restraints were not used in the second round sampling, the resulting models did not drift away from the native structure except for two targets (TR891 and TR922-D1, where >2 GDT-HA units were lost during the second round). This confirms an earlier conclusion11 that after initial refinement, models are in a deep energy minimum from which escape and additional refinement are very challenging. HEO AND FEIG | 179 TABLE 1 Overall results for CASP12 refinement targets Target Improvements for model 1 DGDT-HA for each modela DGDT-HA –DRMSD (Å) DGDC-SC MolProbityb 1 2 3 4 5 TR862 2.96 0.06 20.85 0.51 2.96 3.50 2.96 1.61 20.81 TR866-D1 0.72 0.18 21.06 6.00 0.72 2.64 1.68 2.89 21.69 TR868-D1 21.91 20.02 22.41 0.83 –1.91 22.14 22.14 24.28 20.24 TR869 0.00 20.12 20.10 0.80 0.00 20.49 0.00 21.92 20.24 TR870-D1 0.69 20.10 22.20 0.97 0.69 0.00 1.38 0.91 0.46 TR872 21.41 20.04 20.07 0.57 –1.41 20.56 21.13 3.42 0.57 TR874 1.35 20.06 0.33 0.50 1.35 1.35 1.57 3.60 0.22 TR875 1.72 20.02 1.66 0.60 1.72 1.08 2.37 2.37 20.21 TR876-D1 0.74 0.04 20.29 0.55 0.74 0.74 0.49 0.98 0.00 TR877 2.11 0.02 21.63 0.53 2.11 0.17 1.94 20.35 0.35 TR879 218.97 20.19 216.36 2.21 –18.97 218.86 218.18 236.36 21.93 TR881 2.35 0.01 5.80 0.85 2.35 2.72 2.72 0.74 0.50 TR882 3.80 0.09 20.52 0.60 3.80 4.12 3.80 5.39 0.64 TR884 5.63 20.05 0.98 0.50 5.63 5.28 5.98 4.93 20.70 TR885-D1 3.13 0.05 3.03 0.50 3.13 1.69 3.13 1.69 0.97 TR887-D9 5.43 0.12 4.88 0.50 5.43 4.19 4.50 3.26 0.47 TR890 20.54 20.12 21.25 0.95 –0.54 - 0.26 24.13 20.40 TR891 0.22 0.08 21.06 0.72 0.22 22.68 0.00 1.34 0.89 TR893 0.00 20.03 211.62 1.42 0.00 - 21.92 21.77 0.00 TR894 1.85 0.03 6.16 0.50 1.85 - 1.39 8.80 1.39 TR594 1.13 20.05 1.01 0.50 1.13 - 0.57 2.81 20.56 TR694 1.24 20.05 1.78 0.85 - - 1.24 - 20.38 TR895 3.96 0.01 7.14 0.69 3.96 - 5.42 4.17 0.42 TR896 21.45 20.06 0.35 0.90 –1.45 23.19 21.45 22.03 0.29 TR898 2.37 20.05 0.25 0.50 2.37 2.13 0.48 0.95 20.23 TR901 21.01 20.04 0.78 0.65 –1.01 - 20.56 21.34 21.12 TR905 0.30 0.02 22.05 1.68 0.30 - 20.52 21.76 20.83 TR909 1.65 0.00 1.14 1.47 1.65 - 1.87 22.18 20.15 TR910 1.87 0.02 21.74 0.69 - - 1.87 - 20.37 TR912 1.21 0.07 1.39 0.85 1.21 - 2.53 5.43 0.84 TR913 2.29 0.01 1.56 1.11 2.29 - 2.51 211.32 0.37 TR917 5.43 0.08 5.88 0.50 5.43 - 5.37 29.84 20.06 TR920 0.79 0.04 1.21 0.50 - - 0.79 - 20.35 TR520 1.40 0.05 21.37 0.98 - - 1.40 - 20.15 TR921 3.98 20.04 1.12 0.98 3.98 4.53 3.44 5.79 0.54 TR922-D1 3.23 0.00 20.61 0.71 3.23 1.21 2.83 3.23 1.61 TR928 22.79 20.15 0.45 1.56 - - –2.79 - 21.32 TR942 2.72 20.09 3.11 0.66 - - 2.72 - 0.33 (Continues) 180 | HEO AND FEIG Finally, models submitted as “Model 5” for which only the local stereochemistry was refined with locPREFMD did not change much in terms of GDT-HA and RMSD scores. However, although locPREFMD does not target global structure refinement, there were still small improvements in GDT-HA scores suggesting that just the improvement of local structure may also slightly improve the global structure. 3.2 | Sampling vs refinement Under the assumption that the force field is good enough to distinguish the native state as the global free energy minimum, sampling is the limiting factor for being able to traverse the energy landscape until the native state is found. While it remains unclear how much sampling is exactly needed to reach the native state from a nearby initial model on a given landscape, the general expectation is that more sampling (more trajectories as well as longer trajectories) would improve the chances for refinement. In CASP12, we extended the simulation time to >5.6 ls per target. At the same time, the restraint force constant was reduced by half and we expected to sample more diverse structures on non-canonical regions (that is, loops) by introducing the CHARMM36m force field. As already mentioned above, the more extended sampling did not result in better refinement compared to previous rounds of CASP, but we performed post-analysis based on simulation subsets to determine how the simulation length affected the results just within CASP12. Figure 2 illustrates how GDT-HA improvements depend on the amount of sampling. This analysis is based on averages over 24 targets for which we have generated at least 20 trajectories over 200 ns. Generally, increased sampling resulted in higher GDT scores, consistent with expectations, and using the full 4 ls resulted in the maximum improvements. However, the GDT-HA scores for the top 5% percentile of the distribution was almost converged after 1–2 ls of sampling, and additional sampling did not make much of a difference. The next question we investigated was to what degree RMSD and GDT-HA metrics as well as the RWplus score improved with increasing simulation time. As shown in Figure 3, average RMSD values did not change much but the distribution of GDT-HA scores shifted to larger values and RWplus scores moved to lower values up until 50–100 ns, after which there was little change. The significant shift in the RWplus score toward lower energies implies that the interaction within proteins and internal packing were getting better as the simulations progressed. Therefore, it appears that extending simulations beyond 200 ns would not provide much benefits, at least in the presence of positional restraints. The optimal balance between the number and length of simulation with the same amount of sampling is another issue. Many short simulations may explore different regions of conformational space based on different initial momenta and are better suited to typical parallel computing resources than a few long simulations. However, it is more difficult to cross significant kinetic barriers with short simulations so that an optimal balance may be a compromise of a moderate number of moderate-length simulations. Figure 2 confirms this as 5 3 200 ns (1 ls in total) or 10 3 200 ns (2 ls in total) provide slightly better results than 20 3 50 ns or 20 3 100 ns. As there is little improvement in GDT-HA and RWplus scores during 100–200 ns (see Figure 3), the optimal sampling strategy seems to be to run as many as possible 200 ns simulations within the CASP12 refinement protocol. 3.3 | Scoring, filtering, and structure averaging The general approach toward scoring, filtering, and structure averaging was established first during CASP10. In CASP11, DFIRE30 was replaced with RWplus25 and the filtering criteria were slightly adjusted.13 In CASP12 the same protocol was applied as in CASP11. Briefly, the filtering step involves the selection of a subset of structures that have both low RWplus scores and are closer to the initial model. In this section, we reassessed whether this scoring, filtering, and averaging protocol is still optimal for the CASP12 test set. We compared refined models with and without filtering and extracted models at the center of a given ensemble based on the lowest Ca RMSD sum to all other structures instead of averaging. The resulting GDT-HA improvements are summarized in Figure 4. It can be seen, that averaging still provides a significant benefit, either with or without filtering, but, surprisingly, the filtering step did not offer a clear advantage anymore. With TABLE 1 (Continued) Target Improvements for model 1 DGDT-HA for each modela DGDT-HA –DRMSD (Å) DGDC-SC MolProbityb 1 2 3 4 5 TR944 0.79 0.01 0.90 0.85 - - 0.79 - 0.19 TR945 0.53 20.02 0.05 0.94 - - 0.53 - 20.27 TR947 2.29 0.03 0.67 0.78 - - 2.29 - 0.44 TR948 5.38 20.07 0.94 0.71 5.38 - 5.21 0.51 2.52 Average (full)c 1.78 0.00 0.38 0.93 1.78 1.31 1.73 1.73 0.17 Average (all)d 1.61 0.00 0.58 0.91 a Models submitted as “Model 1” are shown in bold characters. b MolProbity scores for refined models. c DGDT-HA averages for 20 targets shown in bold for which the full protocol was applied. d Refinement protocol was not applied correctly for this target and it was excluded from further analysis. If a method was not applied to a target, the corresponding value is shown as a hyphen (-). HEO AND FEIG | 181 averaging, the results were similar whether the ensemble was filtered or not. Moreover, when averaging was not applied, the use of the filter actually resulted in worse performance. This suggests that it may be necessary to reassess when and how to filter MD-generated ensembles as sampling has increased, force fields have improved, and the generation of the initial models may have changed (see below) from when we first optimized our refinement protocol. 3.4 | Refinement as a function of the initial model Our finding that refinement was less successful in CASP12 than in previous rounds of CASP despite increased sampling prompted us to examine again13 how the degree of refinement depended on the initial model quality. As shown in Figure 5, we found little correlation of GDT-HA improvements with initial GDT-HA scores, although the highest average improvement was seen for models with initial GDT-HA scores above 70. Improvements in Ca RMSD were more strongly dependent on the initial Ca RMSD. There was a clear trend of structures with lower RMSD values, below 4 Å in Ca RMSD, being improved in terms of RMSD, while structures with larger initial RMSD values became worse in terms of RMSD after refinement. Finally, MolProbity scores were consistently reduced to less than 1.0 in most cases except when the initial MolProbity score was worse than 3.0, although even in those cases, the MolProbity scores were reduced significantly (to about 1.5). This analysis suggests that the reduced amount of refinement seen in CASP12 vs previous rounds of CASP was probably not because initial models had higher or lower GDT or RMSD values. The next question we posed was whether the origin of the initial model played a role. In CASP12, the initial models came from various protein structure prediction servers, but models built by the Lee and Baker groups each contributed 12 initial structures for the 42 refinement targets. Table 2 summarizes the refinement performance as a function of where the initial models came from. Most notably, models originating from the LEE server could only be improved by modest FIGURE 2 Average improvements in GDT-HA score as a function of simulation time. Improvements in GDT-HA for the best (A) and top 5% percentile (B) of the sampled structure are shown as heat maps. Blue colors indicate the largest improvements in GDT-HA while red colors indicate the least improvement. Contours indicate the same total simulation times of 1 and 2 ls. Values in bold are given for: 5 3 200 ns, 10 3 100 ns, 20 3 50 ns (on the 1 ls line), 10 3 200 ns, 20 3 100 ns (on the 2 ls line), and 20 3 100 ns (upper right corner) FIGURE 3 Z-score distributions for Ca RMSD (A), GDT-HA scores (B), and RWplus scores (C) averaged over 24 targets using 20 trajectories over 200 ns as a function of simulation time (red: 0–10 ns; orange: 10–20 ns; green: 20–50 ns; cyan: 50–100 ns; blue: 100–200 ns). Arrows indicate the direction of change toward longer simulation times 182 | HEO AND FEIG amounts in terms of GDT-HA while more substantial GDT-HA improvements were possible for most other methods. Furthermore, the accuracy of sidechains, measured with the GDC-SC score, actually deteriorated for the LEE models while other models were again improved to different degrees. We performed more detailed analysis of the change in per-residue Ca RMSD from the native as well as the per-residue Ca RMSD with respect to the initial model. As shown in Figure 6, most of the improvements were made when the initial Ca RMSD was in the range of 1–5 Å. This suggests that residues with moderate errors are more likely to move toward the native structure, while refinement of substantially deviated residues is more challenging. Residues that were already close to the native (less than 1 Å) did not improve much as there is little room for refinement but these residues also moved less with respect to the initial structure. This means that residues that were essentially correct in the initial model tended to remain at that position during refinement. This would be expected for residues already in the deep global native minimum but residues close to the experimental structure are also more likely part of the highly packed protein core where there is less room for displacement even if the force field energy was not optimal. Comparing this analysis for models from the LEE and BAKER servers, there are noticeable differences especially in the amount of displacement from the initial model. With LEE models, there was much less displacement from the initial model even for residues with high initial RMSD values. This suggests that the LEE models were already in a FIGURE 4 Average improvements in GDT-HA scores (see color scale) based on different protocols for generating refined models from MD ensembles as a function of the length and number of MD trajectories. Structures obtained by averaging (A, B) are compared with structures obtained as ensemble centers (C, D), and with (A, C) and without (B, D) ensemble filtering (see text) HEO AND FEIG | 183 deep energetic minimum, presumably as a result of refinement using a similar protocol and energy function as what we have applied here. Therefore, our refinement of the LEE models was apparently more akin to a second round refinement. This would explain the much more moderate improvements for the LEE models compared to the refinement of other models and the overall lower success with refinement during CASP12 since the LEE models made up a significant fraction of refinement targets. 3.5 | Limitations toward further progress Moderate structure refinement via MD-based sampling is now consistently possible and has apparently even become integrated in standard automated modeling platforms as evidenced by the LEE server models during CASP12. However, we seem to have reached a plateau where further progress is difficult to achieve. An obvious issue is the use of restraints that limits how far structures can deviate from a given initial model and, thereby, how much initial models can be refined toward the native state. The application of restraints stems from early lessons during the application of MD to the structure refinement problem that have taught us that entirely unrestrained simulations from homology models are more likely to move away from the native state than toward it12,20,31 although it is not entirely clear why exactly. The use of restraints during refinement simulations has been the key to achieving consistency in structure refinement, but it is clear that with simulations restraint to the initial model it will not be possible to reach the native state for any but those initial models that are already very close to the native state. Thus, understanding why exactly unrestrained sampling from initial homology models generally fails to reach the native state is the key issue in our opinion for advancing structure refinement to the next level. In order to gain further insights, we examined one of the CASP12 targets (TR872) in detail. Since the initial model for this target was relatively far from the native state, there was significant room for improvement (initial Ca RMSD and GDT-HA scores were 5.59 Å and 56.8, respectively). During CASP12, we could not improve this model using our MD-based protocol (DCa RMSD and DGDT-HA score are 10.04 Å and 21.4, respectively) and we therefore chose this target to understand the limitations in our protocol better. This target had errors at both termini and at a b-turn. There was also relatively poor packing of FIGURE 5 Refinement performance as a function of on initial model quality. GDT-HA improvements are compared to initial GDT-HA scores (A), RMSD improvements are compared to initial RMSD values (B), and MolProbity scores after refinement are shown as a function of initial MolProbity scores (C). Boxplots summarize the results with orange lines inside the boxes indicating median values, while the top and bottom of the boxes represent first and third quartiles. The minimum and maximum values within 1.5 interquartile from the first and third quartiles are indicated as whiskers. Individual data points are overlaid as red circles TABLE 2 Performance dependency on initial model prediction groups Initial model quality Average improvement for model 1 Server Na GDT-HA RMSD (Å) GDC-SC MolProbity DGDT-HA –DRMSD (Å) DGDC-SC MolProbityb Lee 12 50.30 5.01 29.75 1.80 0.64 20.01 21.12 0.91 Baker 12 58.58 4.13 38.69 1.01 1.96 0.04 0.60 1.12 Bates_BMM 4 41.35 5.00 18.56 2.62 2.04 20.02 4.02 0.64 QUARK 3 39.86 6.48 19.20 2.28 3.77 20.03 1.08 0.56 Pcons 3 42.72 7.49 23.29 0.85 2.13 0.02 1.69 0.52 Otherc 7 43.74 7.72 25.26 2.49 1.29 20.06 0.79 1.04 a Number of targets. b MolProbity scores for refined models. c Less than three targets each came from RaptorX, YASARA, Seok-server, myprotein-me, HHPred1, and BhageerathH-Plus. TR879 (from FFAS-3D) was excluded from the analysis. 184 | HEO AND FEIG sidechains between b-sheets. We complemented the simulations generated during CASP12 with extensive unrestrained simulations that were started either from the native structure or the initial model given during CASP12. The additional MD simulations involved initially a set of 40 runs over 100 ns each. These simulations were carried out in the same manner as the MD simulations during CASP12 (except for the lack of restraints). The generated conformations were then clustered similarly as during CASP12 (see Methods). From each cluster another round of 10 additional simulations over 100 ns were started. This resulted in a total of 22 ls of additional sampling (8 ls started from the native, 14 ls started from the initial model). The resulting conformations, as well as the structures generated during CASP12 with restrained sampling, were then combined and projected onto the two principal components from time-structure independent component analysis32 with a lag time 1 ns and using pair-wise Ca distances as the distance metric. Figure 7 illustrates the conformational sampling for TR872 during our refinement simulations in CASP12 compared with the additional unrestrained simulations started from the native structure and initial model. When started from the native structure, the simulations explore only a relatively narrow conformational space, despite the extensive sampling and a protocol that encourages broad exploration of conformational space. The only significant dynamics is found in the flexible Cterminus and a loop connecting an internal hairpin (see also Figure 8A). The average structure obtained from the MD simulations remains close to the native state with a GDT-HA score of 79.0 and a Ca RMSD value of 2.24 Å for the average structure. This indicates that the native state coincides with a deep free energy minimum in the MD simulations as expected from the protein folding funnel hypothesis.33 While we cannot say for sure based on the still limited sampling whether the native state is indeed at the global free energy minimum, the native state is at least a prominent local minimum with the force field used here from which escape is unlikely if it can be reached during refinement. Our refinement simulations from CASP12 that were started from the CASP-provided initial model but involved weak positional restraints also did not significantly explore conformational space (see Figure 6C). It is readily apparent from Figure 6 that the differences between the initial model and the native state are much too large to be overcome in the presence of the restraints. In contrast, the unbiased simulations started from the initial model generated broad sampling where a number of distinct states were visited (see Figures 7B and 8C-G). The state S0, closest to the initial model, was most favorable and is already significantly refined with respect to the native state. S0 has a GDT-HA score of 65.3 that is about 8.5 units better than the initial model. The state S0 is easily reachable from the initial model without restraints since the initial model and S0 are close in terms of the tICA principal components. However, as S0 is more than 5 Å Ca RMSD away from our model 1 submitted during CASP12, the restraints in our CASP12 protocol clearly prevented this state from being reached as it lies outside the sampling radius in the restrained simulations (see Figure 7B, C). Other states in the unrestrained simulations, S1-S4, were only slightly less favorable than S0 and separated by kinetic barriers. One of the states, S3, is even closer to the native state than S0 (see Figure 8F) with a GDT-HA score of 71.0 and a Ca RMSD value of 2.59 Å. This state FIGURE 6 Per-residue change in Ca RMSD from the native, DRMSDi,native, (A, B, C) and Ca RMSD from the initial model, RMSDi,initial, (D, E, F) vs initial Ca RMSD from the native for a given residue i. The analysis is compared for all targets (A, D) and for targets from the LEE (B, E) and BAKER (C, F) groups. Boxplots are drawn in the same manner as in Figure 5. Individual data points are shown as gray dots. Dashed lines indicate perfect refinement which makes Ca deviation to zero HEO AND FEIG | 185 closely resembles the native state except for a persistent incorrect Nterminal helix that would need to dissolve for the structure to completely match the native state (see Figure 8F). On the other hand, the simulations also visit a state (S4) that is located significantly further away from the native state than the initial model. In S4, the N-terminal is opened up significantly (see Figure 8G). S4 is also about 5 Å Ca FIGURE 7 Conformational sampling of TR872 during refinement projected onto the two principal coordinates obtained from timestructure independent component analysis (tICA). Contour plots are drawn for sampling probabilities (as –log P based on probabilities P and shifted to zero at the minimum value). The coordinates for the native and the initial model are indicated by using a blue and red “X”, respectively. Sampling distributions for different sets of MD trajectories are shown in different panels: (A) started from the native structure without restraints; (B) started from the initial model without restraints; and (C) started from the initial model with restraints according to our CASP protocol. Representative states (S0–4) are indicated in panel B FIGURE 8 Structures for target TR872 during refinement and with additional MD-based sampling. The native structure is shown in yellow in all panels and compared with the MD-simulated average structure of the native state (blue; A), the initial model given during CASP12 for refinement (pink; B), structures sampled via MD from the initial model and designated as states S0-S4 (cf. Figure 7, purple, C-G), and the model submitted during CASP12 (green, H). Structures are shown in two views and for each panel, GDT-HA and Ca RMSD measures relative to the native structure are given 186 | HEO AND FEIG RMSD away from our model 1 and, therefore, relaxing the restraints enough to be able to reach S0 would also allow S4 to be visited. According to Figure 7B, a transition from S0 to S3, would appear as the most direct path to the native state, but a transition via S2 or S1 is kinetically more likely. Both of these states are slightly unfolded relative to S0 and, so, in this example, further refinement from the initially relaxed state S0 would require partial unfolding (via S1 or S2) and refolding (to S3). The final transition to melt the N-terminal helix (which we did not observe) probably also requires an additional partial unfolding and repacking process. The initial relaxation to S0 is essentially what we can reach with our current refinement protocol as long as the restraints are weak enough to allow large enough structural changes from the initial model. For the example chosen here, our restraints were too strong, but weaker restraints should make it possible to reach S0. However, further refinement would not simply proceed downhill on a folding funnel along a direct path. Instead, multiple cycles of partial unfolding and refolding may be required that are hindered severely by restraints as larger conformational changes are required but also involve the danger of unfolding toward states that are further away from the native state if restraints are not applied. Once states further away from the native state are sampled (such as S4), a recovery toward the native state becomes increasingly difficult due to the explosion of configurational space. We believe that the example presented here typifies the general problem of structure refinement. We appear to be well-equipped to achieve modest refinement from an initial model toward locally relaxed structures by using restrained MD simulations aided by conformational averaging. However, further refinement likely requires cycles of partial unfolding and refolding that are essentially impossible in the presence of restraints. As such transitions require the crossing of kinetic barriers to states with similar relative energies, the key challenge going forward seems to be how to favor productive cycles that lead toward the native state while, at the same time, preventing transitions to states that lead away from the native state, but without a priori knowledge of where the native state is located. It does appear, though, that once the native state is reached it can be recognized based on the force field providing a deep free energy minimum. It remains to be seen in future studies how general these findings are and, ultimately, how to transform such insight into successful refinement protocols that can achieve more significant refinement than what is currently possible. 3.6 | Further challenges While the protein structure refinement exercise within CASP focuses on the very specific task of improving a given initial model with respect to the experimental structure, the larger goal is the generation of meaningful structural models of proteins in their biologically most relevant form. This includes the modeling of complete structures in their most prevalent oligomeric states, which involves additional challenges. Reliable templates used to generate initial models may not cover the entire sequence while information about the oligomeric state and especially oligomer interfaces are often lacking. Unfortunately, neither the ab initio modeling of missing fragments nor the application of protein-protein docking techniques to determine likely oligomer configurations are trivial. However, carrying out MD-based structure refinement would be expected to benefit from having complete structures in their oligomeric state as that would represent the most realistic physical environment for a given protein structure. It remains to be assessed in detail, though, what the effect of missing fragments and the neglect of oligomeric states have on refinement success with MD-based techniques. 4 | CONCLUSIONS The refinement of protein structures via MD simulations remains a highly successful approach. Our protocol that combines extensive sampling via MD followed by filtering, scoring and structure averaging remains highly successful in providing moderate but consistent refinement. However, the extent of refinement during CASP12 with our MD-based protocol has seen a decline based on CASP12 targets while we expected improved performance with increased sampling, further improved force fields, and new refinement protocols that were tested during CASP12. To some extent this appears to be a result of MDbased refinement becoming a standard component of modeling pipelines, so that initial models available in the refinement category at CASP are more difficult to refine further. But more generally, we seem to have reached a plateau with the current refinement protocols. A major issue is the use of restraints during sampling which is necessary on one hand to achieve consistency but severely limits the degree to which structures can be refined. A detailed analysis of one of the CASP12 targets suggests that much weaker restraints may be needed to achieve further refinement, and, in particular, to allow partial unfolding and refolding to reach the native state. The key issue remains, however, that very weak, or no restraints, also allow states to be visited that lead away from the native state while it appears that the energy function does not provide strong guidance toward the native state until the native state is actually reached. While extensive sampling is now possible and force fields have become highly realistic, we see the remaining challenge in effectively guiding sampling during refinement to eventually reach the native state rather than drift away toward unfolded states. ACKNOWLEDGEMENTS Funding was provided by the National Institute of Health Grant R01 GM084953 and computing time at NSF XSEDE facilities (TGMCB090003) was used to carry out this research. Office of Cyberinfrastructure, TG-MCB090003,;National Institute of General Medical Sciences, GM084953. ORCID Michael Feig http://orcid.org/0000-0001-9380-6422 REFERENCES [1] Zhang Y. Protein structure prediction: when is it useful?. Curr Opin Struct Biol. 2009;19:145–155. [2] Kuhlbrandt W. Cryo-EM enters a new era. Elife. 2014;3:e03678 HEO AND FEIG | 187 [3] Hirata K, Shinzawa-Itoh K, Yano N, et al. Determination of damagefree crystal structure of an X-ray-sensitive protein using an XFEL. Nat Methods 2014;11:734–736. [4] Zhang Y, Skolnick J. The protein structure prediction problem could be solved using the current PDB library. Proc Natl Acad Sci USA. 2005;102:1029–1034. [5] Thomas J, Ramakrishnan N, Bailey-Kellogg C. Graphical models of residue coupling in protein families. IEEE Trans Comput Biol Bioinform. 2008;5:183–197. [6] Nugent T, Jones DT. Accurate de novo structure prediction of large transmembrane protein domains using fragment-assembly and correlated mutation analysis. Proc Natl Acad Sci USA. 2012;109:E1540– E1547. [7] Ovchinnikov S, Park H, Varghese N, et al. Protein structure determination using metagenome sequence data. Science. 2017;355:294– 298. [8] Feig M. Computational structure refinement: almost there, yet still so far to go. Wiley Interdiscip Rev Comput Mol Sci. 2017;7:e1307. [9] MacCallum JL, Perez A, Schnieders MJ, Hua L, Jacobson MP, Dill KA. Assessment of protein structure refinement in CASP9. Proteins. 2011;79:74–90. [10] Nugent T, Cozzetto D, Jones DT. Evaluation of predictions in the CASP10 model refinement category. Proteins. 2014;82:98–111. [11] Mirjalili V, Noyes K, Feig M. Physics-based protein structure refinement through multiple molecular dynamics trajectories and structure averaging. Proteins. 2014;82:196–207. [12] Mirjalili V, Feig M. Protein structure refinement through structure selection and averaging from molecular dynamics ensembles. J Chem Theory Comput. 2013;9:1294–1303. [13] Feig M, Mirjalili V. Protein structure refinement via moleculardynamics simulations: what works and what does not?. Proteins. 2016;84(Suppl 1):282–292. [14] Heo L, Feig M. PREFMD: a web server for protein structure refinement via molecular dynamics simulations. Bioinformatics 2017. [15] Feig M. Local protein structure refinement via molecular dynamics simulation with locPREFMD. J Chem Inf Model. 2016;56:1304– 1312. [16] Della Corte D, Wildberg A, Schr€oder GF. Protein structure refinement with adaptively restrained homologous replicas. Proteins. 2016;84(Suppl 1):302–313. [17] Park H, Seok C. Refinement of unreliable local regions in templatebased protein models. Proteins. 2012;80:1974–1986. [18] Lee GR, Heo L, Seok C. Effective protein model structure refinement by loop modeling and overall relaxation. Proteins. 2016;84 (Suppl 1):293–301. [19] Modi V, Dunbrack RLJ. Assessment of refinement of templatebased models in CASP11. Proteins. 2016;84(Suppl. 1):260–281. [20] Chen J, Brooks CL III, Can molecular dynamics simulations provide high-resolution refinement of protein structure?. Proteins. 2007;67: 922–930. [21] Huang J, Rauscher S, Nawrocki G, et al. CHARMM36m: an improved force field for folded and intrinsically disordered proteins. Nat Methods. 2017;14:71–73. [22] Jorgensen WL, Chandrasekhar J, Madura JD, Impey RW, Klein ML. Comparison of simple potential functions for simulating liquid water. J Chem Phys. 1983;79:926–935. [23] Darden TA, York D, Pedersen LG. Particle-Mesh Ewald: an N log(N) method for Ewald sums in large systems. J Chem Phys. 1993;98: 10089–10092. [24] Beauchamp KA, Bowman GR, Lane TJ, Maibaum L, Haque IS, Pande VS. MSMBuilder2: modeling conformational dynamics at the picosecond to millisecond scale. J Chem Theory Comput. 2011;7:3412–3419. [25] Zhang J, Zhang Y. A novel side-chain orientation dependent potential derived from random-walk reference state for protein fold selection and structure prediction. PLoS One. 2010;5:e15386. [26] Brooks BR, Brooks CL, Mackerell AD, et al. CHARMM: the biomolecular simulation program. J Comput Chem. 2009;30:1545–1614. [27] Eastman P, Friedrichs MS, Chodera JD, et al. OpenMM 4: a reusable, extensible, hardware independent library for high performance molecular simulation. J Chem Theory Comput. 2013;9:461–469. [28] Phillips JC, Braun R, Wang W, et al. Scalable molecular dynamics with NAMD. J Comput Chem. 2005;26:1781–1802. [29] Feig M, Karanicolas J, Brooks IIICL. MMTSB tool set: enhanced sampling and multiscale modeling methods for applications in structural biology. J Mol Graph Modell. 2004;22:377–395. [30] Zhou HY, Zhou YQ. Distance-scaled, finite ideal-gas reference state improves structure-derived potentials of mean force for structure selection and stability prediction. Protein Sci. 2002;11:2714–2726. [31] Raval A, Piana S, Eastwood MP, Dror RO, Shaw DE. Refinement of protein structure homology models via long, all-atom molecular dynamics simulations. Proteins. 2012;80:2071–2079. [32] Naritomi Y, Fuchigami S. Slow dynamics in protein fluctuations revealed by time-structure based independent component analysis: the case of domain motions. J Chem Phys. 2011;134:065101 [33] Leopold PE, Montal M, Onuchic JN. Protein folding funnels: a kinetic approach to the sequence-structure relationship. Proc Natl Acad Sci U S A. 1992;89:8721–8725. How to cite this article: Heo L, Feig M. What makes it difficult to refine protein models further via molecular dynamics simulations? Proteins. 2018;86:177–188. https://doi.org/10.1002/ prot.25393 188 | HEO AND FEIG