R E S E A R C H A R T I C L E Protein structure model refinement in CASP12 using short and long molecular dynamics simulations in implicit solvent Genki Terashi1 | Daisuke Kihara1,2 1 Department of Biological Sciences, Purdue University, West Lafayette, Indiana 47907 2 Department of Computer Science, Purdue University, West Lafayette, Indiana 47907 Correspondence Daisuke Kihara, Department of Biological Sciences, Purdue University, West Lafayette, Indiana 47907, Email: dkihara@purdue.edu Funding information Division of Biological Infrastructure, Grant number: DBI1262189; National Institute of General Medical Sciences, Grant number: R01GM123055; Division of Information and Intelligent Systems, Grant number: IIS1319551; Division of Integrative Organismal Systems, Grant number: IOS1127027; Division of Mathematical Sciences, Grant number: DMS1614777 Abstract Protein structure prediction has matured over years, particularly those which use structure templates for building a model. It can build a model with correct overall conformation in cases where appropriate templates are available. Models with the correct topology can be practically useful for limited purposes that need residue-level accuracy, but further improvement of the models can allow the models to be used in tasks that need detailed structures, such as molecular replacement in X-ray crystallography or structure-based drug screening. Thus, model refinement is an important final step in protein structure prediction to bridge predictions to real-life applications. Model refinement is one of the categories in recent rounds of critical assessment of techniques in protein structure prediction (CASP) and has recently been drawing more attention due to its realized importance. Here we report our group’s performance in the refinement category in CASP12. Our method is based on inexpensive short molecular dynamics (MD) simulations in implicit solvent. Our performance in CASP12 was among the top, which was consistent with the previous round, CASP11. Our method with short MD runs achieved comparable performance with other methods that used longer simulations. Detailed analyses found that improvements typically occurred in entire regions of a structure rather than only in flexible loop regions. The remaining challenge in the structure refinement includes large conformational refinement which involves substantial motions of secondary structure elements or domains. K E Y W O R D S CASP, computational method, molecular dynamics, protein structure prediction, protein structure refinement, structure modeling, template-based modeling 1 | INTRODUCTION Methodology of protein structure prediction has been intensively studied over decades from various angles, such as bioinformatics, physics, chemistry, statistics, and robotics. Although there are still some areas that need further development, for example, template-free (often also called ab initio or de novo) modeling,1,2 structure prediction has become practical in several situations, which include cases where structures can be modeled using templates (template-based modeling).3–5 The progress of the protein structure modeling field has been objectively monitored in the critical assessment of techniques for protein structure prediction (CASP) from 1994,6 a community-wide assessment of prediction methods that is held every 2 years. In CASP, participating prediction groups/methods are evaluated based on structure models they build for protein target sequences, which are presented by the organizers before determination of their tertiary structures. From CASP7 in 2006, assessors’ evaluation reports show that performance of template-based modeling has not changed much partly reflecting maturity of the methods.4,7–10 Now, biologists routinely use structure prediction to interpret or design experiments.11,12 In the case of template-based modeling, obtaining a structure model of the correct fold, that is, a structure that has a root-mean square deviation (RMSD) of 3–6 Å, can be expected if the template structures found in a structure library has a reasonably high confidence score.5,13 Improving the structure model further through model refinement, an additional procedure to gain a couple of more angstroms in RMSD to the native structure, is critical for bringing a computational model with the correct fold to a level that it is practically useful in various applications. For example, a model at a 5 Å RMSD would be only used for indicating residue positions in the protein such as interpreting Proteins. 2018;86:189–201. wileyonlinelibrary.com/journal/prot VC 2017 Wiley Periodicals, Inc. | 189 Received: 21 June 2017 | Revised: 1 August 2017 | Accepted: 18 August 2017 DOI: 10.1002/prot.25373 and designing residue mutagenesis experiments. However, if the model was refined to within 1.5 Å RMSD, it can be used for molecular replacement in experimental structure determination, virtual drug screening, and investigating enzymatic reactions.14 Thus, model refinement has been one of the foci of recent method developments in the CASP community.4,15 However, model refinement is still not easy. It is well known that running naïve molecular dynamics (MD) simulations on a structure model do not improve model consistently. Rather, it deteriorates the model for almost half of the cases.16 In CASP, until CASP9 held in 2010, there were no methods that could refine models consistently with statistical significance.17 This situation changed in CASP10, when the FEIG and Seok groups showed improvements on the starting models in majority of their cases.18 Particularly, the FEIG group significantly outperformed all the other groups in their Global Distance Test-High Accuracy (GDT-HA) score19 improvement. The FEIG’s approach was based on MD: from the MD trajectories starting of a model to be refined, structures were chosen that did not deviate much from the initial structure to avoid degraded structures and further filtered by using additional scoring function.20 In CASP11, many top performing groups, including FEIG,15,21 used MD-based approaches with some variations, for example, using support vector machine (a machine learning method) to select models after MD,22 remodeling using multiple templates before MD-based refinement,23 taking consensus with homologous structures,24 or using multiple rounds of relaxation.25 Here, we report our group’s method and performance in the model refinement category in CASP12. Our approach is based on MD trajectories, following the FEIG group’s success, with several critical differences: We used an implicit solvent model rather than explicit water molecules in simulation, and moreover, the length of the trajectories is significantly shorter than those used in the FEIG approach. In spite of the computationally inexpensive strategy, our approach was ranked high among participants in CASP1115 as well as in CASP12. Differences of our approach in CASP11 and CAS12 are that we optimized parameters of the method carefully for CASP12 and also changed the implicit solvent model used in the MD simulations. In CASP12, our approach refined 29 out of 42 targets successfully in terms of the GDT-HA score and 24 in terms of the CASP12 assessors’ score that is a linear combination of five scores (http://predictioncenter.org/casp12/zscores_final_ refine.cgi?formula5 assessors) . We showed that the refinement occurred not only at loop regions but also overall in structures cores, and the approach mainly expanded structures (showing that the improvement of structure evaluation scores are not merely due to compression of structures). Drawbacks of the approach are also discussed. 2 | MATERIALS AND METHODS 2.1 | Overview of the refinement protocol Our refinement protocol used in CASP12 is shown in Figure 1. Figure 1A shows the overall flowchart. We performed short and long MD simulations after the energy minimization was applied to a structure model to refine. The short MD consists of sixty 1.2 nanosecond (ns) MD simulations with restraints of increasing strength (0.1, 0.2, 0.4 kcal mol21 Å22 ) applied to Ca atom positions, which was increased every 400 pico seconds (ps). The long MD consists of twenty 20 ns MD simulations with weak restraints (0.05 kcal mol21 Å22 )21 on Ca atoms positions. After the MD simulations, a subset of structures in FIGURE 1 The model refinement protocol of the Kiharalab group in CASP12. A, the flow chart of our refinement protocol. Before performing MD simulations, the energy of the staring model was minimized, and then subjected to the next heating and equilibration step. The equilibrated structure was used as the starting model for the two types of production MD runs (short MD and long MD simulations). In the filtering step, subset of models extracted from the MD trajectories were selected by the filtering criteria. The coordinates of selected models were averaged and then relaxed. B, Definition of the parameters used in the filtering step. The x axis is the Z score of GDT-HA of a model from the initial structure and the y axis is the Z score of the DFIRE scoring function. The distribution shown is from extracted models from short MD trajectories of TR520. Red points represent selected models by the filtering with r 5 2.0, u 5 310, and g 5 35 190 | TERASHI AND KIHARA the trajectories were selected by considering the deviation from the starting model and the statistical potential score (Figure 1B). The selected models were averaged on the Cartesian coordinates of atoms, which were then minimized and relaxed with a10 ps MD simulation. 2.2 | Setting of MD simulation MD runs were performed by CHARMM MD program version 38b2 with CHARM22/CMAP force field. We used a 2 femto seconds (fs) time step for all of simulations. The non-bond interactions were listed using a 14 Å cutoff. Electrostatic interactions were calculated with a shifting function applied to the potential energy at 12 Å. van der Waals interactions were calculated with a switching function applied to the potential energy between 10 and 12 Å. The solvent effect was computed by the FACTS implicit solvent model26 with its default parameters. In CASP11, we used SCPISM27 for implicit solvent but changed it to FACTS following a comparison study of implicit solvents by Hua et al.28 Before running MD simulations, hydrogen atoms were added to the starting model by the CHARMM HBUIld command. To fix the length of bonds involving hydrogen atoms, the CHARMM SHAKE command was used. Energy minimization was performed in total of 12,100 steps of the Adopted Basis Newton-Raphson (ABNR) method. In the first 100 steps of the minimization, the position of all non-hydrogen atoms was fixed, and then we applied harmonic constants that were subsequently decreased from 20.0, 10.0, 5.0, 2.0, 1.0, to 0.5 kcal mol21 Å22 for every 2 000 steps (4 ps). The minimized protein model was subjected to the next heating and equilibration step. The temperature of the system was gradually increased from 50 K to 298 K in 200 000 steps (400 ps) with harmonic restraints of 0.05 kcal mol21 Å22 on Ca atom positions. Then, the equilibrated structure was used as the starting model for production MD runs. All trajectories were calculated with the leapfrog verlet algorithm at 298 K. Model structures are saved every 500 steps (1ps) in each trajectory. In the short and the long MDs, a total of 72 000 (1200*60) and 400 000 (20 000*20) models were extracted from trajectories, respectively. 2.3 | Model filtering Following the FEIG method,20,21 extracted models from MD trajectories were cross-evaluated by a statistical knowledge-based potential (we used DFIRE29 ). We have also calculated the GDT-HA between the starting model and the extracted models after MD (denoted as iGDT_HA). These two parameters for the filtering are illustrated in Figure 1B. The DFIRE score and iGDT_HA of each selected model were normalized by computing Z score, using a distribution of structures from each MD trajectory as follows: FIGURE 2 Results of model selection from MD trajectories with different parameter settings, r, u, and g. For the explanation of the parameters, see Figure 1B. The distribution of the average dGDT-HA of selected models from the short MD (left) and the long MD (right) trajectories are shown in a color scale from dark red to purple for negatively large dGDT-HA, 23 to over 20.5. dGDT-HA is the difference of the GDT-HA to the native structure of the initial model to that of the average of the selected models using a corresponding parameter set. A white region means that no models exist for the corresponding parameter combinations TERASHI AND KIHARA | 191 Z iGDT HAm5 iGDT HAm2iGDT HA riGDT HA (1) Z DFIREm5 DFIREm2DFIRE rDFIRE (2) where iGDT HA and DFIRE are average values, and riGDT HA and rDFIRE are standard deviations of iGDT_HA and DFIRE. For DFIRE, structures with negative Z scores are those which are more geometrically favorable than the average (Z score 5 0). For iGDT-HA, structures with a positive value are those which are more similar to the initial structure than the average. To select models, we applied a filter (Figure 1B) that extracts a pieshaped area between angle u6g degree and the radial distance r from the center of distribution of Z iGDTHAm and Z DFIREm. The criteria for selecting model m were Z iGDT HAm 2 1Z DFIREm 2 > r (3) and arccos Z iGDT HAmcosu1Z DFIREmcosu Z iGDT HAm 2 1Z DFIREm 2   < g (4) which corresponds to the lower-right region in red in Figure 1B. Thus, the intention is that, among models from MD trajectories, we would like to select those which have relatively low (better) statistical potential and less deviation from the initial model. For the extracted models from the Short MD trajectories, we used r5 2.0, u 5 310 and g 5 35 in the filtering step. For the Long MD, we used r5 1.5, u 5 325 and g 5 30. These parameters were chosen based on a benchmark study performed on the CASP11 dataset (discussed in Results). After the filtering, 1 0004 500 and 1 00025 000 models were selected from the short MD and the long MD trajectories. Selected models were averaged, which were subject to the structure relaxation and energy minimization (Model 1 and Model 2 in Figure 1A). This protocol is different from the FEIG method21 used in CASP11 in the following ways: First, the MD runs in our protocol were much shorter. We used 1.2 ns * 60 trajectories (the short MD runs) and 20 ns * 20 trajectories (the long MD runs), thus in total of 472 ns MD runs, while the FEIG method used 1200 ns long runs (40 ns * 30). Second, we used an implicit solvent model while the FEIG method used explicit solvent. These two differences make our protocol more computationally inexpensive and affordable. There are also other technical differences, described in the following. Third, as mentioned above, we applied Ca restraints in the MD runs that increased over time from 0.1 to 0.4 kcal mol21 Å22 where FEIG used a constant value of 0.05 kcal mol21 Å22 . Fourth, the interval of structure snap shots taken was 1 ps in our protocol, while FEIG used 40 ps. Fifth, for the filtering (Figure 1B), we used iGDT-HA and DFIRE, while the FEIG method used iRMSD and RW1.30 2.4 | Structure relaxation and energy minimization Averaged models underwent energy minimization with 500 steps of the steepest descent algorithm and 4 500 steps of ABNR. The minimized models were relaxed with a 40 ps MD simulation. In the minimization and relaxation, we used strong harmonic restraints of 100 kcal mol21 Å22 on C-a atom positions. The model 3–5 were selected from all extracted models with the lowest DFIRE, the lowest GOAP31 score and the highest iGDT-HA (Figure 1A). 2.5 | Computational costs For a 200-residues target protein, the single 0.4 ns equilibration and 1.2 ns MD simulation took about 13.2 and 40 CPU hours, respectively. The total computational cost of the short MD simulations (60 trajectories) was about 3,200 CPU hours per target (2.7 hours on 1 200 cores of Intel Xeon-E5 CPU). A single 20 ns MD simulation took about 640 CPU hours. Consequently, the total computational costs of the long MD (20*20 ns MDs) was about 12,800 CPU hours for a refinement target (64 hours on 200 cores of Intel Xeon-E5 CPU). 3 | RESULTS AND DISCUSSIONS First we describe benchmark results on the CASP11 refinement dataset, then show the results of our group in CASP12. 3.1 | Parameter optimization for model selection using the CASP11 dataset As a preparation for participating in CASP12, we optimized parameters, r, u, and g that were used in the filtering protocol (Figure 1B) on the 36 TABLE 1 Average performance of our protocol in comparison with top10 groups in CASP11 Groupa Group ID GDT-TS GDT-HA RMS_CA FEIG 288 74.63 56.45 3.58 Seok 296 72.45 53.71 3.62 Seok-refine 423 72.61 53.56 3.59 Schroderlab 396 73.27 55.07 3.67 PRINCETON_TIGRESS 106 72.75 53.98 3.65 Kiharalab 333 72.88 54.22 3.57 LEE 169 72.14 53.31 3.65 BAKER 64 71.20 52.25 3.86 nns 038 71.57 52.51 3.66 Seok-server 011 72.66 53.93 3.64 Average of all 53 Groups  n/a 69.13 50.08 3.99 Starting model n/a 71.96 53.03 3.69 Short MD n/a 73.26 54.27 3.62 Long MDa n/a 70.48 50.82 3.84 The names of top 10 groups were obtained ranked according to the CASP11 web site, assessors’ formula. (http://predictioncenter.org/casp11/zscores_final_refine.cgi? formula 5 assessors). a The results are for 31 out of the 36 targets for which the long MD runs finished before the CASP12 has started to release refinement targets. 192 | TERASHI AND KIHARA refinement targets from CASP11. The dataset included the target TR217 to TR857 excluding TR795, whose native structure was not available at the time of the work. The starting models and the native structure of these targets were downloaded from the CASP11 website (http://predictioncenter.org/casp11/). The size of the target proteins ranged from 62 to 288 residues (average: 155). The Ca RMSD of the starting models to their native structures ranged from 1.45 to 12.45 Å (average: 3.69 Å), GDT-TS ranged from 46.17 to 90.24 (average: 71.96), and GDT-HA were from 29.10 to 73.51 (average: 53.03). The parameter optimization was performed separately for the short and the long MD runs. For the short MD runs we generated forty 1.2 ns-long trajectories while for the long MD runs we computed up to three 20 ns-long trajectories for the starting models of the targets. Long MD runs were finished only for 31 out of 36 targets before the first refinement target was released in CASP12. We first tested in total of 3888 (5 3*72*18) combinations of the three parameters (r, u, g) for selecting models (Figure 1B), which come from three values, 1.0, 1.5, and 2.0 for r, 72 values from 0 to 355 with an interval of 5 for u, and 18 values from 5 to 90 with an interval of 5 for g. We examined the average change of GDT-HA of the selected models with each parameter set from that of the initial model (dGDTHA). The results are visualized with a color scale in Figure 2. Higher in the color scale, purple to blue, are better than dark red that is at the bottom of the scale. Several important trends were observed: (1) Interestingly, as shown in Figure 2, none of the parameter combination gave on average better GDT-HA (i.e., positive dGDT-HA) than the starting model. However, as we show later, averaging structures selected by a good parameter set improved GDT-HA from the starting model. (2) Also, it was evident that the short MD runs (left panels) gave better results than the long MD runs (right panels). (3) The parameter space that gave better results locates at the right bottom corner of the panel, where u is around 300 to 350 and g is around 0 to 30 degrees. This region corresponds to the red regions in Figure 1B. (4) For this region, using a larger radius r gave better results. (5) The parameter space that did not perform well are at the middle bottom of the panel, where u is around 150 and g is around 0 to 50 degrees. This is the opposite region from the good performing region shown in red in Figure 1B. We selected 76 and 40 parameter combinations with largest dGDT-HA for the short and long MD runs. Then, for each parameter combination, we generated an averaged structure model from the selected models by the combination. The generated models were evaluated by dGDT-HA. A combination of r 5 2.0, u 5 310, and g 5 35 was found to be optimal for the short MD trajectories and a r 5 1.5, u 5 325, and g 5 30 combination was best for the long MD runs. 3.2 | Refinement results on the CASP11 targets Table 1 summarizes the performance of our protocol using the optimized refinement protocol with short and long MDs (Figure 1) on the CASP11 refinement targets. For comparison, results of the top 10 groups in CASP11 including our group (Kiharalab) are shown. Results are on the first model (Model 1) of the groups. In the table, the average value of three scores, GDT-TS, GDT-HA, and Ca-RMSD (RMS_CA) are shown, which were major evaluation scores used by the CASP assessors. GDT-TS (Global Distance Test-Total Score) and GDT-HA (Global Distance Test-High Accuracy) are the average of the percentage of Ca FIGURE 3 Comparison of the refined models by the short MD protocol on the CASP11 refinement targets. Models were evaluated by GDT-TS in the left column and by GDT-HA in the right column. A, refined models by the short MD in comparison with starting models. B, the short MD results compared with the long MD results. C, The refined models with the short MD runs compared with models submitted by Kiharalab (Group number 333) in CASP11. D, the short MD results compared with models by FEIG (Group number 288) in CASP11 TERASHI AND KIHARA | 193 atoms within four distance cutoffs from the corresponding Ca atoms in the native structure after structure superimposition. GDT-TS uses 1.0, 2.0, 4.0, and 8.0 Å while GDT-HA uses 0.5, 1.0, 2.0, and 4.0 Å for the distance cutoffs. Both scores ranges from 0 to 100 where 100 is the best score for a model. RMS_CA is RMSD of Ca atoms of a model to the native, which was calculated with the LGA program.32 Seven groups, FEIG, Seok, Seok-refine, Schroderlab, PRINCETON_TIGRESS, Kiharalab, and Lee showed improvement over the starting model in terms of all three scores, GDT-TS, GDT-HA, and RMS_CA (Table 1). In CASP11, our group (Kiharalab) used twenty10 ns MD runs with the Screened Coulomb Potential implicit solvent model (SCPISM). The current protocol using the short and long MD runs are shown at the bottom of the table. The Short MD runs performed very well, ranking the third in terms of GDT-HA (54.27) and GDT-TS (73.26) and the fourth for RMS_CA (3.62). The long MD runs performed worse than the short runs. On average, it failed to improve the starting models and ranked lower than the top 10 groups when GDT-TS and GDTHA were considered, and was ranked ninth for RMS_CA. In Figure 3, we examined refinement results with the short MD protocol in comparison with other refinement protocols. Figure 3A shows that short MD improved from the starting models in majority of the cases, 26 (72.2%) and 21 (58.3%) out of 36 targets in terms of GDTTS and GDT-HA, respectively. It is also observed in Figure 3A that improvement by the short MD did not depend on the initial quality of the starting models. Next, we compared with the results with the long MD runs (Figure 3B). Consistent with the data shown in Table 1 and Figure 2, the short MD performed substantially better than the long MD. In terms of GDT-TS and GDT-HA, the models of the short MD were better for 26 and 27 targets (83.9% and 87.1%) than the long MD models among the 31 targets for which we had data with the long MD runs. In the next two panels, we compared the short MD refinement with the models submitted in CASP11 by Kiharalab (Figure 3C) and FEIG (Figure 3D). Compared with CASP11 Kiharalab models, the short MD’s results were comparable. For about half of targets (17/36 targets), the short MD models had better GDT-TS and GDT-HA than the Kiharalab models. On the other hands, for only 7 (19.4%) targets the short MD models had better GDT-TS and GDT-HA than the FEIG models (Figure 3D). However, it is worth noting that the computational cost for the short MD protocol is much smaller than FEIG. The former used in total of 48 ns (40*1.2 ns) MD runs with implicit solvent while the latter used 1.2 ls MD runs with the TIP3P explicit solvent model.21 3.3 | Results in CASP12 Now we discuss our group’s performance in CASP12. In CASP12, we applied our refinement protocol to all 42 refinement targets from TR520 to TR948. The starting models were obtained from CASP website (http://predictioncenter.org/casp12/). The residue length of the targets were 54–414 residues (average: 193). The initial quality of the starting models ranged from 1.15 to 13.86 Å Ca RMSD (average: 5.5 Å), 37.03 to 92.07 in terms of GDT-TS (average: 66.93), and GDT-HA ranged from 24.33 to 78.36 (average: 49.76). First we see our performance in CASP12 relative to other groups. Table 2 summarize the average performance of top10 groups in TABLE 2 Average performances of the top10 groups in the CASP12 refinement targets Groupa Group ID Assessors’ formulab GDT-TS GDT-HA RMS_CA MolProbityc SphGrd QCSe GOAL 220 21.79 67.08 50.13 5.14 2.06 69.05 80.23 Seok 023 18.39 67.79 50.93 5.43 1.11 68.37 81.22 BAKER 247 17.73 67.11 50.48 5.36 0.87 69.28 81.02 Seok-server 250 16.91 67.40 50.34 5.42 1.11 68.05 78.57 SVMQA 208 15.01 67.72 51.01 5.48 1.71 67.30 80.70 FEIG 204 13.66 67.58 50.89 5.51 0.94 66.67 79.81 LEEab 450 13.58 64.94 47.42 5.37 1.83 68.13 78.53 Kiharalab 102 12.38 67.33 50.46 5.52 1.45 66.80 80.44 GOAL_COMPLEX 430 11.87 67.33 50.78 5.52 1.80 66.67 80.45 LEE 011 11.75 65.76 48.42 5.32 1.80 67.66 80.21 Average of All 39 Groups n/a  211.57 62.85 45.85 6.14 2.08 62.87 76.14 Starting model n/a n/a 66.93 49.76 5.50 1.77 66.99 79.74 a The names of Top 10 groups were obtained from CASP12 result page (http://predictioncenter.org/casp12/zscores_final_refine.cgi?formula 5 assessors) ranked by the assessors’ formula. b Combined Z score of GDT-HA, RMS_CA, SphGr, QCS and MolProbity score defined as; 0:17ZGDT2HA20:46ZRMSC A10:20ZSphGr10:15ZQCS20:02ZMolProbity. c MolProbity considers the number of steric clashes and the percentages of outliers in rotamer and backbone conformations. A low MolProbity score indicates that a model is more physically favorable. d SphereGrinder evaluates local structural similarity between a model and the native structure. e Quality Control Score. QCS evaluates the correctness of secondary structure element predictions in a model. 194 | TERASHI AND KIHARA CASP12. Refined model quality was evaluated by GDT-TS, GDT-HA, RMS_CA, MolProbity,33 SphereGrinder (SphGr),34 and Quality Control Score (QCS),35 which were the scores used in assessors’ evaluation. Kiharalab was ranked eighth according to the asessors’ formula (see Table 2 caption). In terms of individual scores, we were ranked fifth in GDT-TS, sixth in GDT-HA, ninth in RMS_CA, fifth in MolProbity, eighth in SphGr, and fifth in QCS. These ranking results are slightly worse than in CASP11, where our group was ranked at fourth (the CASP11 evaluation paper,15 Table 2). It is noted that the differences of individual scores between the top 10 groups are small. Interestingly, the difference between Kiharalab and FEIG is very small, and indeed became smaller in CASP12 relative to CASP11. For example, the average difference of GDT-HA between Kiharalab and FEIG was 2.23 in CASP11, which became as small as 0.43 this time. To reveal the significance of the differences of the group performance, we applied the paired Student’s t test on the common set of predicted targets in CASP12. Table 3 shows the results on GDT-TS and GDT-HA. According to the test, our group (Kiharalab, Group ID: 102) showed significantly better results (P values < 0.05) than LEEab (Group ID: 450), LEE (Group ID: 011), and the starting model for both GDT-TS and GDT-HA. On the other hands, only SVMQA (Group ID: 208) performed significantly better than our group, and the other seven groups were indistinguishable in this test with our group. 3.4 | Analyses of short MD (model 1) models in CASP12 From this section, we analyze our submitted models in details. In Figure 4, we examined our Model 1 models generated by the short MD protocols. First we checked how many cases the Model 1 improved over starting models (Figure 4A). When evaluated by GDT-TS and GDT-HA, TABLE 3 Head-to-head comparison of the top 10 groups in CASP12 (A) Paired t-test on GDT-TS ID 220 023 247 250 208 204 450 102 430 011 Starting model 220 – 0.84 0.52 0.68 0.82 0.75 0.02 0.65 0.65 0.02 0.40 023 0.16 – 0.18 0.03 0.38 0.31 0.00 0.05 0.04 0.00 0.00 247 0.48 0.82 – 0.67 0.82 0.74 0.01 0.64 0.63 0.03 0.39 250 0.32 0.97 0.33 – 0.87 0.65 0.01 0.40 0.40 0.00 0.04 208 0.18 0.63 0.18 0.13 – 0.34 0.00 0.01 0.00 0.00 0.00 204 0.25 0.69 0.26 0.35 0.66 – 0.01 0.22 0.21 0.00 0.05 450 0.99 1.00 0.99 0.99 1.00 1.00 – 0.99 0.99 0.90 0.98 102 0.35 0.95 0.36 0.60 0.99 0.78 0.01 – 0.48 0.00 0.03 430 0.35 0.96 0.37 0.60 1.00 0.79 0.01 0.52 – 0.00 0.01 011 0.98 1.00 0.98 1.00 1.00 1.00 0.10 1.00 1.00 – 0.98 Starting model 0.60 1.00 0.61 0.96 1.00 0.95 0.02 0.97 0.99 0.02 – (B) Paired t test on GDT-HA ID 220 023 247 250 208 204 450 102 430 011 Starting model 220 – 0.86 0.66 0.61 0.87 0.81 0.01 0.69 0.82 0.01 0.30 023 0.14 – 0.29 0.01 0.61 0.47 0.00 0.09 0.31 0.00 0.00 247 0.34 0.71 – 0.43 0.76 0.69 0.00 0.49 0.67 0.01 0.18 250 0.39 0.99 0.58 – 0.98 0.81 0.00 0.66 0.92 0.00 0.04 208 0.13 0.39 0.24 0.02 – 0.40 0.00 0.02 0.13 0.00 0.00 204 0.19 0.53 0.31 0.19 0.60 – 0.00 0.17 0.41 0.00 0.03 450 1.00 1.00 1.00 1.00 1.00 1.00 – 1.00 1.00 0.94 0.99 102 0.31 0.91 0.51 0.34 0.99 0.83 0.00 – 0.92 0.00 0.02 430 0.18 0.69 0.34 0.08 0.87 0.59 0.00 0.08 – 0.00 0.00 011 0.99 1.00 0.99 1.00 1.00 1.00 0.06 1.00 1.00 – 0.97 Starting model 0.71 1.00 0.82 0.97 1.00 0.97 0.01 0.98 1.00 0.03 – The performances of the top 10 groups were compared using paired Student’s t-tests. Statistically significant wins (P values < 0.05) of the group listed in the first column from the left against each group listed in the first raw are shown in bold. TERASHI AND KIHARA | 195 25 (59.5%) and 29 (69.0%) out of 42 targets were improved, respectively. These fractions of improved targets are similar to those observed on the CASP11 dataset (Figure 3A). Consistent with what was observed in the CASP11 dataset (Figure 3A), improvements were observed to starting models of various initial quality. When compared with models from the long MD runs, which were submitted as Model 2 models, the short MD models were better for 30 (71.4%) and 33 (78.6%) out of 42 targets. This result is qualitatively consistent with the observation on the CASP11 dataset, but the fractions of the wins by the short MD models are lower than before (Figure 3B). To understand the better performance of the short MD over the long MD, we performed short-MD-based refinement using the constraints of 0.05 kcal mol21 Å22 , the relatively weak constraint we used in the long-MD refinement. This comparison was performed on FIGURE 4 Comparison of short MD models submitted as Model 1 with other models in CASP12. GDT-TS (on the left column) and GDTHA (right) are used for evaluation. A, comparison with the starting models. B, compared with refined models with the long MD runs that were submitted as Model 2. C, comparison with Model 1 models from the FEIG group 196 | TERASHI AND KIHARA 36 CASP12 refinement targets that had their native structures available. TR944 had its native structure in PDB but was excluded from the analysis because its trajectory files used in CASP12 were corrupted and could not be used at the time of the post-analysis. With the increasing constraint strength of 0.1, 0.2 to 0.4 kcal mol21 Å22 , which was used in CASP12, the average GDT-TS and GDT-HA obtained for the 36 targets were 67.32 and 50.11, respectively. These values were decreased to 66.29, and 48.60 for GDT-TS and GDT-HA, respectively, when the constant 0.05 kcal mol21 Å22 was used. Thus, we would have obtained better results with long-MD-based refinement if a stronger constraint had been used. Long MD performed substantially worse than short MD for highquality targets whose initial GDT-HA were >70 (Figure 5). This is probably because moving target structures far away from initial structures by the long MD was more harmful for targets that were already in high quality. We also examined if the oligomeric state of targets affected to the refinement results. In Table 4, we show the average GDT-HA of monomer and oligomer targets in CASP12. Both short MD and long MD performed worse on the oligomer targets and the difference between the short MD- and the long MD-based refinement was smaller for the oligomer targets. Thus, it is not the case that the long MD worked particularly worse on the oligomer targets. The reason of the worse performance on the oligomer targets would probably because we applied the refinement protocol to a single target protein even for an oligomer target without considering physical interaction to other proteins. This would mean, conversely, a structure would be better refined when its interacting proteins, either other subunits in a complex if there are any, or crystal contacts in the protein crystal is considered in refinement. Next, we compared with the Model 1 models from FEIG, which used a similar MD-based refinement protocol (Figure 4C). The FEIG models were better than Kiharalab in 27 (64.3%) and 25 (59.5%) out of 42 targets in terms of GDT-TS and GDT-HA, respectively. This difference between the two groups is much smaller than on the CASP11 dataset (Figure 3D), which is consistent with the results shown as overall average scores in Table 2 and the results in Table 3 that shows the two groups were statistically indistinguishable. 3.5 | Examples of refined models Examples of models with successful and failed refinement are shown in Figure 6. Improved and deteriorated regions in a model are colored in blue and red, the darker more substantial. Green regions are those which did not move >0.1 Å. The first example in the top row is TR948. For this target, the overall improvement of GDT-HA (dGDT-HA) was large, 4.70, and this model was ranked sixth among all the first models submitted by the 36 groups. As shown in the color code, improvement occurred at almost all the helical regions in the structure and degradation occurred at ends of some helices. The next one, TR912, is another successful example, where dGDT-HA of 4.58 was achieved. This model was the best in GDT-HA among all the 31 groups’ Model 1 models. Similar to the previous example, improvement occurred globally, at almost all b-strands in the structure. In TR868, the third example, there was a modest improvement where dGDT-HA at 0.72 was observed. This model was ranked seventh among 34 groups who submitted models for this target. In this refined model, improved and degraded regions co-exist, mixed in the structure, which is typically observed in models with a small improvement of <1.0 dGDT-HA. The last one, TR891, is a case that our refinement failed. The model had a dGDT-HA of 23.57 and ranked 18th among 36 groups. As shown, the refinement protocol moved all the b-sheets away from the native structure. We observed failure for other three targets of b-barrel structures, too (TR879, dGDT-HA of 25.0; TR891, 23.57; TR928, 22.79). Overall, these examples illustrate that improvement and deterioration occurred globally in a model including the structure core rather than merely moving flexible loop regions. 3.6 | Relaxing or compressing? Observing that the structural change occurs globally to a model by our refinement protocol, we questioned whether the observed FIGURE 5 GDT-HA difference between short MD and long MD models relative to the initial quality of the targets. The x axis shows GDT-HA of the starting models. The y axis shows the difference of GDT-HA of Model 1 (short MD) and Model 2 (long MD). The positive value indicates that GDT-HA of Model 1 was higher than Model 2 TABLE 4 Average GDT-HA of monomer and oligomer targets in CASP12 No. of Targets Short MD Long MD Difference Monomers 24 52.69 49.82 2.88 Oligomers 18 47.49 46.32 1.17 Among the 48 refinement targets, the following targets are oligomers: TR520, 594, 694, 862, 866, 870, 875, 877, 881, 887, 893, 896, 909, 912, 913, 917, 945, 947 (all IDs with TR as prefix). TERASHI AND KIHARA | 197 improvement was due to simply compression of the structure since it is widely known that compression of Ca coordinates of a model decreases the radius of gyration, which contributes to apparent improvement of some quality assessment scores, such as GDT-TS, GDT-HA, and RMSD. To answer this question, we evaluated a compactness of refined models and the starting models by comparing the radius of gyration (Rg) of the structures. It is defined as: Rg5 ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi PN i jvi2vj2 N v u u u t (5) where N is a total number of Ca atoms in the model, vi is the coordinate of ith Ca atom, v is the average coordinate of all Ca atoms of the model. In this calculation, we ignored largely incorrect regions in the FIGURE 6 Examples of successful and failed refinement by our group. The left column shows the native structures (magenta) and the starting models (cyan). The right column shows our Model 1 models that were refined with the short MD runs. The refined models are colored according to the degree of improvement of Ca atom positions from the starting models. Improved regions in a model are colored from light blue to dark blue for small to large improvements. On the other hand, deteriorated regions are colored from light to dark red for slight to large deteriorations. Green represents regions that did not change >0.1 Å. Deviation of the Ca positions of a model from the starting structure was judged after superimposing the starting structure and the model to the native structure using the LGA program with a 4.0 Å threshold 198 | TERASHI AND KIHARA starting model where the distance between the corresponding Ca atom positions to the native structure was larger than 4.0 Å, because these regions are highly unlikely to influence refined model’s GDT-TS and more so for GDT-HA. To assess the compression or relaxation of refined models, we computed the difference of Rg between the starting model and the refine model (dRg), dRg 5 Rg(refined model)2 Rg(starting model). A positive dRg indicates that the refined model was relaxed (expanded) while a negative value shows the model was compressed from the starting model. In Figure 7A, the improvement of models (dGDT-HA) was presented in a color code relative to dRg and GDT-HA of starting models. First, by examining dRg, models for 32 out of 37 targets (five targets were excluded from this analysis because their native structures were not available for computing GDT-HA of their starting models) have positive values, indicating that the refinement actually expanded (or relaxed), not compressed the structures. This figure also shows that the degree of the relaxation did not depend on the quality of the starting models and larger improvement (points in blue) occurred for starting models with a middle range GDT-HA and dRg, namely about 50 and 0.2, respectively. Figure 7B compares dGDT-HA with dRg and the deviation of refined models from their starting structures (iGDT-HA). There is an obvious correlation between iGDT-HA and dRg, which simply shows that the model drifted away from the initial structure as it expanded (larger dRg). An interesting observation is that two FIGURE 7 Refinement results relative to the change of radius of gyration of models. Improvement of models (dGDT-HA) is shown in color code relative to the change of the radius of gyration (dRg) by the short MD refinement protocol. dGDT-HA of over 24.0 to >4.0 is shown in a color scale from red to blue. Each data point represents a Model 1 refined model for the 37 targets that have their native structure available for the analysis. A, The x axis shows the quality, GDT-HA of starting models. B, The x axis is the structural deviation of refined models form the starting model (iGDT-HA) FIGURE 8 Improvement of GDT-HA (dGDT-HA) relative to the quality of starting models (GDT-HA(Starting Model)). Red points represent Model 1 models and open circles represent Model 2 to Model 5. The best dGDT-HA models among the five models for each target are connected with a line TABLE 5 Average performance of Kiharalab model 1–5 in CASP12 GDT-TS GDT-HA RMS_CA MolProbity SphGr QCS Model 1a 67.33 50.46 5.52 1.45 66.80 80.44 Model 2b 65.84 48.32 5.51 1.80 66.11 79.76 Model 3c 65.29 47.19 5.59 2.01 65.85 78.79 Model 4d 64.37 46.22 5.63 1.94 65.40 76.73 Model 5e 66.16 48.38 5.56 2.10 65.59 79.84 Starting Model 66.93 49.76 5.50 1.77 66.99 79.74 The best result among Model 1–5 and the starting model for each score is shown in bold. a Averaged and relaxed model generated from the subset of short MD trajectories. b Averaged and relaxed model generated from the subset of long MD trajectories. c The lowest DFIRE model. d The lowest GOAP model. e The highest iGDT-HA model. TERASHI AND KIHARA | 199 unsuccessful refined models (TR879, dGDT-HA: 25.0; TR928, dGDTHA: 22.8), which are shown in red and orange, are found at high dRg (0.47 and 0.55, respectively) and low iGDT-HA (59.7 and 73.8, respectively), distinct from the other models, in Figure 7B. This result suggests that models of unsuccessful refinement may be better identified by the combination of iGDT-HA and dRg rather than only using iGDT-HA. This idea works particularly well for distinguishing TR879 (the red data point) from the other models that have a similar iGDT-HA value. 3.7 | What went right and what went wrong Following the tradition of the CASP predictors’ reports, we discuss things that worked well and those which need improvement. One thing which clearly worked well was the ranking of the submitted models. Table 5 summarizes the average of evaluation scores of Model 1 to 5 and Figure 8 presents dGDT-HA of individual models relative to the quality (GDT-HA) of the starting models. In the figure, Model 1 models are shown in red and the best model (i.e., the model with the largest dGDT-HA) for each target are connected with lines. From Table 5, Model 1 models were on average the best among the five submitted models for all the evaluation scores except for that of the RMS_CA, where Model 1 was ranked second, following Model 2. Figure 8 visualizes the same conclusion; 28 out of 42, of the Model 1 models were the best for the targets, and if not they were close to the best. Second, as discussed with Figure 4A, our refinement protocol with short MD runs improved models for most of the cases. Additionally, the structure sampling from short MD runs with an increasing Ca constraints, which increased from 0.1, 0.2 to 0.4 kcal mol21 Å22 for every 400 ps, worked well. As shown in Table 6, sampling structures from different portions of MD runs we tried all worked worse than the method we used. Thus, overall, we can conclude that we were successful in exploiting inexpensive short MD runs with an implicit solvent model very effectively. This point becomes evident when our group’s results were compared with a contrasting approach that used significantly more expensive MD runs but had similar performance (Table 2). On the other hand, by design our protocol could not make large refinement to models due to the use of short MD simulations. To make substantial corrections to a model conformation, such as rearrangement of secondary structures or domain moves, a completely different algorithm design, probably with a different protein model, such as a coarse-grained model,2,36,37 is obviously needed. Indeed, this is the challenge left for the whole CASP refinement community. 4 | CONCLUSION We discussed our group’s performance in the CASP12 refinement category. Our protocol makes use of inexpensive short MD simulations with implicit solvent and successfully showed consistent improvements to starting models regardless of the quality of the starting models. By examining submitted models, we found that achieved improvements are due to relaxation of structures rather than compression, which also suggested that the degree of relaxation (dRg) could be another metric to eliminate unsuccessful refined models. However, the protocol does not make large conformational refinement by design due to the use of short MD trajectories, which is still the goal of further development. ACKNOWLEDGMENTS The authors thank Kevin Shim for proofreading the manuscript. This work was partly supported by National Institutes of Health (R01GM123055) and National Science Foundation (IIS1319551, DBI1262189, IOS1127027, DMS1614777). ORCID Daisuke Kihara http://orcid.org/0000-0003-4091-6614 REFERENCES [1] Kinch LN, Li W, Monastyrskyy B, Kryshtafovych A, Grishin NV. Evaluation of free modeling targets in CASP11 and ROLL. Proteins. 2016;84(Suppl 1):51–66. [2] Kihara D, Lu H, Kolinski A, Skolnick J. TOUCHSTONE: an ab initio protein structure prediction method that uses threading-based tertiary restraints. Proc Natl Acad Sci USA. 2001;98(18):10125–10130. [3] Modi V, Xu Q, Adhikari S, Dunbrack RL Jr.Assessment of templatebased modeling of protein structure in CASP11. Proteins. 2016;84 (Suppl 1):200–220. [4] Moult J, Fidelis K, Kryshtafovych A, Schwede T, Tramontano A. Critical assessment of methods of protein structure prediction: progress and new directions in round XI. Proteins. 2016;84(Suppl 1):4–14. [5] Qu X, Swanson R, Day R, Tsai J. A guide to template based structure prediction. Curr Protein Peptide Sci. 2009;10(3):270–285. [6] Moult J, Pedersen JT, Judson R, Fidelis K. A large-scale experiment to assess protein structure prediction methods. Proteins. 1995;23(3):2–4. [7] Kryshtafovych A, Fidelis K, Moult J. Progress from CASP6 to CASP7. Proteins. 2007;69(Suppl 8):194–207. TABLE 6 Results of using different structure sampling form short MD runs Ca restraintsa GDT-TS GDT-HA Section 1 (400 ps) 0.1 66.73 49.21 Section 1, 2 (800 ps) 0.1, 0.2 67.06 49.70 Section 1, 2, 3 (1.2 ns) 0.1, 0.2, 0.4 67.32 50.11 Section 2 (400 ps) 0.2 67.11 49.78 Section 3 (400 ps) 0.4 67.27 50.08 Starting model n/a 67.06 49.80 In the short MD runs of 1.2 ns in total, as described in Methods, we applied an increasing Ca constraints from 0.1, 0.2, to 0.4 to three 400 ps-long subsections sequentially. In this small experiment, structures are sampled from different subsections of the MD runs, which underwent the structure averaging and relaxing procedure to yield a final model. GDT-TS and GDT-HA are average of models for 36 targets, which had the native structure available from PDB. TR944 has its native structure in PDB but was excluded from this data, because the trajectory files used in CASP12 were corrupted and could not be used. a The unit is kcal/mol/Å2 . If more than one value is shown, each was applied subsequently to each subsection in MD trajectories. 200 | TERASHI AND KIHARA [8] Kryshtafovych A, Fidelis K, Moult J. CASP8 results in context of previous experiments. Proteins. 2009;77(Suppl 9):217–228. [9] Kryshtafovych A, Fidelis K, Moult J. CASP9 results compared to those of previous CASP experiments. Proteins. 2011;79(Suppl 10): 196–207. [10] Kryshtafovych A, Fidelis K, Moult J. CASP10 results compared to those of previous CASP experiments. Proteins. 2014;82(Suppl 2): 164–174. [11] Kihara D, ed. Protein Structure Prediction. New York: Humana Press; 2014. [12] Padilla-Sanchez V, Gao S, Kim HR, et al. Structure-function analysis of the DNA translocating portal of the bacteriophage T4 packaging machine. J Mol Biol. 2014;426(5):1019–1038. [13] Marti-Renom MA, Stuart AC, Fiser A, Sanchez R, Melo F, Sali A. Comparative protein structure modeling of genes and genomes. Annu Rev Biophys Biomol Struct. 2000;29:291. [14] Baker D, Sali A. Protein structure prediction and structural genomics. Science. 2001;294(5540):93–96. [15] Modi V, Dunbrack RL Jr. Assessment of refinement of templatebased models in CASP11. Proteins. 2016;84(Suppl 1):260–281. [16] an H, Mark AE. Refinement of homology-based protein structures by molecular dynamics simulation techniques. Protein Sci. 2004;13 (1):211–220. [17] MacCallum JL, Perez A, Schnieders MJ, Hua L, Jacobson MP, Dill KA. Assessment of protein structure refinement in CASP9. Proteins. 2011;79(Suppl 10):74–90. [18] Nugent T, Cozzetto D, Jones DT. Evaluation of predictions in the CASP10 model refinement category. Proteins. 2014;82(Suppl 2):98– 111. [19] Kopp J, Bordoli L, Battey JN, Kiefer F, Schwede T. Assessment of CASP7 predictions for template-based modeling targets. Proteins. 2007;69(Suppl 8):38–56. [20] Mirjalili V, Noyes K, Feig M. Physics-based protein structure refinement through multiple molecular dynamics trajectories and structure averaging. Proteins. 2014;82(Suppl 2):196–207. [21] Feig M, Mirjalili V. Protein structure refinement via moleculardynamics simulations: What works and what does not? Proteins. 2016;84(Suppl 1):282–292. [22] Khoury GA, Smadbeck J, Kieslich CA, et al. Princeton_TIGRESS 2.0: High refinement consistency and net gains through support vector machines and molecular dynamics in double-blind predictions during the CASP11 experiment. Proteins. 2017;85(6):1078–1098. [23] Joung I, Lee SY, Cheng Q, et al. Template-free modeling by LEE and LEER in CASP11. Proteins. 2016;84(Suppl 1):118–130. [24] Della Corte D, Wildberg A, Schroder GF. Protein structure refinement with adaptively restrained homologous replicas. Proteins. 2016;84(Suppl 1):302–313. [25] Lee GR, Heo L, Seok C. Effective protein model structure refinement by loop modeling and overall relaxation. Proteins. 2016;84 (Suppl 1):293–301. [26] Haberthur U, Caflisch A. FACTS: fast analytical continuum treatment of solvation. J Computat Chem. 2008;29(5):701–715. [27] Hassan SA, Guarnieri F, Mehler EL. A general treatment of solvent effects based on screened coulomb potentials. J Phys Chem B. 2000;104:6478–6489. [28] Hua DP, Huang H, Roy A, Post CB. Evaluating the dynamics and electrostatic interactions of folded proteins in implicit solvents. Protein Sci. 2016;25(1):204–218. [29] Zhou H, Zhou Y. Distance-scaled, finite ideal-gas reference state improves structure-derived potentials of mean force for structure selection and stability prediction. Protein Sci 2002.11(11):2714– 2726. [30] Zhang J, Zhang Y. A novel side-chain orientation dependent potential derived from random-walk reference state for protein fold selection and structure prediction. PloS One. 2010;5(10):e15386. [31] Zhou H, Skolnick J. GOAP: a generalized orientation-dependent, allatom statistical potential for protein structure prediction. Biophys J. 2011;101(8):2043–2052. [32] Zemla A. LGA: a method for finding 3D similarities in protein structures. Nucleic Acids Res. 2003;31(13):3370. [33] Chen VB, Arendall WB III, Headd JJ, et al. MolProbity: all-atom structure validation for macromolecular crystallography. Acta Crystallograph Sect D Biol Crystallogr. 2010;66(Part 1):12–21. [34] Kryshtafovych A, Monastyrskyy B, Fidelis K. CASP prediction center infrastructure and evaluation measures in CASP10 and CASP ROLL. Proteins. 2014;82(Suppl 2):7–13. [35] Cong Q, Kinch LN, Pei J, et al. An automatic method for CASP9 free modeling structure prediction assessment. Bioinformatics. 2011; 27(24):3371–3378. [36] Kolinski A. Protein modeling and structure prediction with a reduced representation. Acta Biochim Pol. 2004;51(2):349–371. [37] Sieradzan AK, Krupa P, Scheraga HA, Liwo A, Czaplewski C. Physics-based potentials for the coupling between backbone- and side-chain-local conformational states in the united residue (UNRES) force field for protein simulations. J Chem Theory Computat. 2015; 11(2):817–831. How to cite this article: Terashi G, Kihara D. Protein structure model refinement in CASP12 using short and long molecular dynamics simulations in implicit solvent. Proteins. 2018;86:189– 201. https://doi.org/10.1002/prot.25373 TERASHI AND KIHARA | 201