Using Generalizability Theory to Inform Optimal Design for a Nursing Performance Assessment

Janet O'Brien, Marilyn S. Thompson, and Debra Hagler

Evaluation & the Health Professions, 2019, Vol. 42(3), 297-327. © The Author(s) 2017. Article reuse guidelines: sagepub.com/journals-permissions. DOI: 10.1177/0163278717735565. journals.sagepub.com/home/ehp

Author affiliations: College of Nursing and Health Innovation, Arizona State University, Phoenix, AZ, USA; T. Denny Sanford School of Social and Family Dynamics, Arizona State University, Tempe, AZ, USA.

Corresponding Author: Janet O'Brien, College of Nursing and Health Innovation, Arizona State University, 550 N. 3rd St., Mail Code 3020, Phoenix, AZ 85004, USA. Email: jeobrien@asu.edu

Abstract

The promotion of competency of nurses and other health-care professionals is a goal shared by many stakeholders. In nursing, observation-based assessments are often better suited than paper-and-pencil tests for assessing many clinical abilities. Unfortunately, few instruments for simulation-based assessment of competency have been published that have undergone stringent reliability and validity evaluation. Reliability analyses typically involve some measure of rater agreement, but other sources of measurement error that affect reliability should also be considered. The purpose of this study is three-fold. First, using extant data collected from 18 nurses evaluated on three scenarios by three raters, we utilize generalizability (G) theory to examine the psychometric characteristics of the Nursing Performance Profile, a simulation-based instrument for assessing nursing competency. Results corroborated findings of previous studies of simulation-based assessments showing that obtaining desired score reliability requires substantially greater numbers of scenarios and/or raters. Second, we provide an illustrative exemplar of how G theory can be used to understand the relative magnitudes of sources of error variance—such as scenarios, raters, and items—and their interactions. Finally, we offer general recommendations for the design and psychometric study of simulation-based assessments in health-care contexts.

Keywords: generalizability theory, observation-based assessment, simulation, nursing, reliability analysis

Regulatory boards, educational institutions, and health-care facilities are tasked with ensuring newly graduated health-care professionals are ready to care for patients safely, effectively, and efficiently. These entities invest large amounts of time, resources, and money in the process of competency evaluation of health-care professionals. In addition, maintaining and improving the competency of nurses and other health-care professionals are critical to keeping pace with changes in practice standards and technology.

Those leading efforts to assess competency in the health-care professions face various challenges. Many types of clinical knowledge and applications, as well as the demonstration of professionalism and skilled communication, cannot be adequately assessed by paper-and-pencil tests (Boulet et al., 2003; Goodstone & Goodstone, 2013; Katz, Peifer, & Armstrong, 2010; Swanson & Stillman, 1990). Alternatively, observation-based forms of assessment may be used to measure competency in professional practice contexts that require the simultaneous use of critical thinking and psychomotor skills.
Still, clinical opportunities to evaluate high-risk skills are not readily available, and ensuring patient safety prevents the assessment of many skills in the clinical environment. The need for clinical competency measures has led to the use of simulation as a safe, objective method to assess the performance of health-care professionals. In simulation, standardized patients or human patient simulators (HPSs) replace actual patients, allowing for the assessment of a variety of skills and knowledge in realistic clinical situations. Alinier and Platt (2013) defined simulation as "a technique that recreates a situation or environment to allow learners (in the widest sense of the term) to experience an event or situation for the purpose of assimilating knowledge, developing or acquiring cognitive and psychomotor skills, practicing, testing, or to gain understanding of systems or human actions and behaviors" (p. 1). Simulated encounters may be part of the formative assessment provided in an educational curriculum or may be used as a summative evaluation component required for graduation, certification, or licensure (Alinier & Platt, 2013; Sando et al., 2013; Ziv, Berkenstadt, & Eisenberg, 2013).

The purpose of this article is three-fold. First, we utilize generalizability (G) theory to examine the psychometric characteristics of the Nursing Performance Profile (NPP), an instrument that measures nursing competency using simulation (Hinton et al., 2012; Randolph et al., 2012). Reliability evidence is critical because the NPP is used to provide supporting evidence during regulatory investigations. Second, we provide through our analysis an illustrative exemplar of how G theory can be used to understand the relative magnitudes of sources of error variance—such as scenarios, raters, and items—and their interactions. Generalizability (G) and decision (D) studies supported estimation of the reliability of designs that vary by the numbers of raters and scenarios, using all 41 items from the NPP, informing designs that reduced sources of error variance while optimizing reliability coefficients. Finally, we offer general recommendations for the design and psychometric study of simulation-based assessments in health-care contexts.

Literature Review

Measuring Competence

One area of concern for educational institutions, health-care facilities, and regulatory boards is the gap between newly graduated nurses' knowledge base and the minimum level needed to practice independently, a gap that appears to be widening (Hughes, Smith, Sheffield, & Wier, 2013). As the Nursing Executive Center (2008) reported, almost 90% of academic leaders were confident their graduates were ready to care for patients safely and effectively, but only 10% of hospital leaders agreed (Ashcraft et al., 2013). Experienced nurses have reported their concerns about new graduate nurses' clinical competence, particularly in the areas of critical thinking, clinical/technical skills, communication skills, and general readiness to practice (Missen, McKenna, & Beauchamp, 2016). Unfortunately, the lack of evidence-based performance measures has made it difficult to prescribe solutions for assuring that nursing graduates attain clinical competence and maintain it over the course of a career. Establishing processes for measuring competency is critical.
Both the National Board of Osteopathic Medical Examiners and the National Board of Medical Examiners successfully implemented clinical performance exams for medical students after extensive development and piloting of cases and measurement instruments. However, the National Council of State Boards of Nursing (NCSBN) has not yet implemented a clinical performance examination for licensure, and nursing is reportedly the only health profession in the United States that does not require one (Kardong-Edgren, Hanberg, Keenan, Ackerman, & Chambers, 2011). Reports issued by the Carnegie Foundation for the Advancement of Teaching, the NCSBN, and the Joint Commission on Accreditation of Hospitals have indicated the need for nurses to be better prepared for clinical practice (Meyer, Connors, Hou, & Gajewski, 2011). Recommendations stemming from the Carnegie Foundation Report on Nursing Education have been made to the NCSBN to pursue the development of a set of three national, simulation-based examinations of nursing performance, the first to begin before students graduate from nursing school and the third finalizing licensure after 1 year of a proposed residency program (Kardong-Edgren et al., 2011).

State boards of nursing and nursing schools are increasing efforts to develop performance-based assessments. However, in a review of the literature on simulation-based assessment in the regulation of health-care professionals, Holmboe, Rizzolo, Sachdeva, Rosenberg, and Ziv (2011) found that no states have thus far required a clinical exam for graduating nurses. Further, while research focused on clinical simulation in nursing has increased over the last decade, the development of instruments to measure the learning that takes place or the level of competency attained has not kept pace (Manz, Hercinger, Todd, Hawkins, & Parsons, 2013). Systematic reviews on simulation in nursing and other health sciences have reported a lack of measurement tools to evaluate competency using high-fidelity simulation (Harder, 2010; Yuan, Williams, & Fang, 2011), and the majority of the instruments available have not undergone systematic psychometric testing (Elfrink Cordi, Leighton, Ryan-Wenger, Doyle, & Ravert, 2012; Kardong-Edgren, Adamson, & Fitzgerald, 2010; Prion & Adamson, 2012).

Problems with observation-based assessment in education are well documented (Waters, 2011). The provision of accurate and meaningful assessment data requires the development of reliable and valid measurement. Even after an instrument has undergone rigorous validation, many sources of error can affect reliability, including issues involving raters and the scenarios used in the assessment process. For example, rater subjectivity may result in bias, and although standardization of scoring may be improved through careful rater training, rater scoring often still yields suboptimal reliability. Scenarios used in simulations may be perceived as more or less difficult by participants. A thorough analysis of the variables that affect reliability is critical before assessment data are used for high-stakes decisions such as graduation, licensure, employment, or disciplinary action.

Generalizability Theory

Classical test theory (CTT) has traditionally been used as a framework to examine reliability and measurement error (Boulet, 2005). A major limitation of this approach is that sources of error are undifferentiated.
As an alternative to CTT, G theory may be used to evaluate observational systems and improve the estimation of reliability (Bewley & O'Neil, 2013; Briesch, Swaminathan, Welsh, & Chafouleas, 2014). In G theory, analysis of variance (ANOVA) is used to identify the various sources and magnitudes of error. Rather than emphasizing tests of statistical significance (Boulet, 2005) or F tests (Brennan, 2011) as in ANOVA, G theory focuses on the estimation of variance components (Brennan, 2001), which are partitioned sources of variability in scores on an assessment. In a simulation-based assessment, for example, score variation will result from differences in skills among persons being assessed but may also be attributable, in part, to facets of the assessment design—such as items, scenarios, and raters—as well as their interactions.

Two types of studies are utilized within the conceptual framework of G theory to inform the optimal assessment design in terms of the numbers of conditions within each of the facets (e.g., items, scenarios, and raters): G and D studies (e.g., Boulet, 2005; Brennan, 2001). First, a G study is typically conducted to estimate the variance components and relative magnitudes of sources of measurement error for participants and each facet, along with their interactions, based on data collected using a specified set of conditions for each facet. Then, to better understand how the assessment design might impact reliability of scores, the information about the sources of measurement error gained from the G study is utilized to conduct one or more D studies. In D studies, the number of conditions in a facet is systematically varied to determine how many conditions (e.g., number of scenarios or number of raters) are required to achieve the desired reliability.

Although attention to reliability and validity is increasingly being reported in the literature, often only coefficient α or interrater reliability statistics are provided to satisfy reliability testing, and usually only vague references are made to experts ensuring content validity. To our knowledge, only one article has described the use of G theory to identify a minimum number of scenarios or minimum number of raters to achieve high reliability in observation-based assessment in nursing (Prion, Gilbert, & Haerling, 2016). On the other hand, studies conducted in medical education using standardized patients and HPSs have successfully utilized G theory to determine the number of scenarios and number of raters needed for reasonable reliability estimates (Boulet & Murray, 2010; Boulet et al., 2003).
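To make the G study/D study logic concrete, the sketch below works through a minimal one-facet (persons × raters) example in Python. The score matrix, sample sizes, and variable names are illustrative assumptions rather than anything from the present study (which used a three-facet design analyzed with GENOVA); the expected-mean-square and coefficient formulas follow standard G-theory treatments such as Brennan (2001).

```python
# A minimal one-facet (persons x raters) G-study/D-study sketch with made-up
# scores; this is an illustration, not the study's analysis or data.
import numpy as np

# rows = persons (objects of measurement), columns = raters (facet)
scores = np.array([
    [7, 8, 6],
    [5, 5, 4],
    [9, 9, 8],
    [6, 7, 7],
    [4, 5, 3],
], dtype=float)
n_p, n_r = scores.shape

grand = scores.mean()
person_means = scores.mean(axis=1)
rater_means = scores.mean(axis=0)

# ANOVA sums of squares and mean squares for the crossed p x r design
ss_p = n_r * ((person_means - grand) ** 2).sum()
ss_r = n_p * ((rater_means - grand) ** 2).sum()
ss_total = ((scores - grand) ** 2).sum()
ss_pr = ss_total - ss_p - ss_r
ms_p = ss_p / (n_p - 1)
ms_r = ss_r / (n_r - 1)
ms_pr = ss_pr / ((n_p - 1) * (n_r - 1))

# G study: estimated variance components from the expected mean squares
var_pr = ms_pr                  # residual: person x rater interaction plus error
var_p = (ms_p - ms_pr) / n_r    # universe-score (person) variance
var_r = (ms_r - ms_pr) / n_p    # rater main effect (differences in severity)

# D study: project reliability for alternative numbers of raters
for n_raters in (1, 2, 3, 5, 10):
    rel_error = var_pr / n_raters              # relative error (norm-referenced use)
    abs_error = (var_r + var_pr) / n_raters    # absolute error (criterion-referenced use)
    g_coef = var_p / (var_p + rel_error)       # generalizability coefficient, E-rho^2
    phi = var_p / (var_p + abs_error)          # dependability coefficient, Phi
    print(f"{n_raters} raters: E-rho^2 = {g_coef:.2f}, Phi = {phi:.2f}")
```

The loop at the end is the D study: holding the estimated variance components fixed, it projects how the generalizability (Eρ²) and dependability (Φ) coefficients change as the number of raters is varied.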
NPP

The NPP instrument was developed through a collaboration of three entities: the Arizona State Board of Nursing (ASBN), Arizona State University, and Scottsdale Community College (Hinton et al., 2012; Randolph et al., 2012; Randolph & Ridenour, 2015). Funding from the NCSBN Center for Regulatory Excellence supported the development of an instrument that measures nine categories of clinical competence: professional responsibility, client advocacy, attentiveness, clinical reasoning (noticing), clinical reasoning (understanding), communication, prevention, procedural competency, and documentation (Randolph et al., 2012). The nine categories were developed from modifications of the Taxonomy of Error Root Cause Analysis and Practice Responsibility (TERCAP) categories (Benner et al., 2006) and the NCSBN survey tool, the Clinical Competency Assessment of Newly Licensed Nurses (CCANLN; as cited in NCSBN, 2007; Randolph et al., 2012), a 35-item survey used to measure clinical competency, practice errors, and practice breakdown risk (Randolph et al., 2012). The authors of the NPP received permission to categorize CCANLN items into the modified TERCAP-based categories. Items and categories were added and edited, resulting in the final nine-category NPP instrument consisting of 41 items, with 4-8 items per category. Raters scored each item on a dichotomous scale of safe versus unsafe behavior (Randolph et al., 2012).

The authors of the NPP instrument developed scenarios that involve common adult health situations and require the nursing actions and behaviors involved in the care of a patient. A sample of 21 registered nurse (RN) volunteers each performed nursing care in three high-fidelity simulation scenarios, resulting in 63 videos. Three raters, blinded to participant ability and scenario order to prevent bias, viewed each video independently. The NPP instrument has subsequently been used to provide objective data for assessing nurses referred for evaluation by the ASBN and identifying unsafe nursing practices (Randolph & Ridenour, 2015). Based upon available research, the NPP instrument is one of the few nursing performance instruments that has undergone validity and reliability testing (Hinton et al., 2012; Randolph et al., 2012), and it is the only one used to evaluate postlicensure nursing competency at the state level.

Building upon the research and analysis already conducted on the NPP instrument and the accompanying scenarios, the current study was intended to provide a deeper analysis of the reliability of data obtained by the instrument and to provide guidance for decision-making to ensure a psychometrically sound rating process. During Measuring Competency With Simulation (MCWS) Phase I, reliability was examined using interrater agreement, intrarater reliability, and internal consistency of items (Hinton et al., 2012). Interrater agreement was measured by the percentage of agreement by at least two of the three raters on each item, and internal consistency of items on the NPP was estimated using Cronbach's α. As noted by Boulet and Murray (2010), interrater reliability is important in examining the overall reliability of data obtained by observation-based assessment instruments, but an examination of other sources of error is also critical to achieve a more complete understanding of an assessment's reliability. Prior to the present study, measurement error associated with the scenarios had not been analyzed, and the optimum numbers of raters and scenarios to achieve high reliability had not been identified.

Method

In the following analysis, we conducted G and D studies to inform the optimal assessment design in terms of the numbers of scenarios and raters used in the NPP. First, a G study was conducted to estimate the variance components and relative magnitudes of sources of measurement error for scenarios, raters, items, and nurse participants, as well as their interactions, based on the current simulation-based NPP implemented using three scenarios and three raters.
A series of D studies was then conducted to estimate the impact on reliability of the 41-item NPP when varying the numbers of scenarios and raters from one to nine.

Participants

Nurse participants. The MCWS study (Hinton et al., 2012; Randolph et al., 2012) and this secondary analysis were approved by the appropriate institutional review boards. The MCWS Phase I project included 21 participants. Of these, 18 participants' recorded performances were rated by the same three raters and were included in the G and D studies. All participants were practicing RNs working in either academic or professional settings. The mean age of the 16 participants who provided demographic data was 31.81 years (SD = 8.90), and all were female. The racial/ethnic distribution of participants was 56.25% White, 25% Hispanic, and 18.75% Black. The majority of participants had associate's degrees (75%), and 25% had bachelor's degrees. Ten of the 16 participants reported more than 1 year of experience as an RN, whereas six had been licensed less than 1 year (M = 1.35, SD = .74). No simulation experience was reported by 18.75%, some experience was reported by 68.75%, and frequent simulation experience was reported by 12.50% of nurses.

Raters. Three subject matter experts independently evaluated the videos from each of the three simulation scenarios completed by all 18 participants included in this study; data from a fourth rater, who had assessed videos for only three additional participants, were not included in order to utilize a fully crossed design. Raters were blinded to the order of scenarios. The three raters whose data were used in this secondary analysis had between 3 and 22 years of nursing experience (M = 9.67 years, SD = 10.69), held a bachelor's degree, and had experience evaluating nursing performance. They were White, female, and ranged in age from 32 to 51 years.

Measure and Scenarios

The NPP instrument. Researchers from the MCWS Phase I study reported that the reliability of the NPP was initially evaluated using a pilot scenario. Volunteer nursing students were rated by five expert raters over two measurement occasions (Randolph et al., 2012). The mean percentage of agreement across the five raters over all items was reported at 92%. The Cronbach's α internal consistency estimate of reliability was .93, and intrarater reliability ranged from 85% to 97%, with a mean of 92% across all raters (Randolph et al., 2012).

Scenarios. Three adult health, acute care scenarios were designed in the initial study by a team of expert nurses for use with the NPP tool (Hinton et al., 2012). "Scenarios were intended to measure basic competency with broad applicability and to provide opportunities for individual nurses to exhibit competency on all nursing performance items" (Randolph et al., 2012, p. 544). Three forms of each scenario were developed that included name changes for the patients in each scenario as well as surface changes in the content (e.g., a phone call from a friend vs. a parent during the scenario) that did not affect substantive components. Data from the three forms of each acute care scenario were combined for the current study.

Procedure

Each nurse participant engaged in a randomly selected form of each of the three scenarios in a randomized order. No order effect on ratings was found in previous studies (Hinton et al., 2012). A simulation nurse specialist was trained to conduct the simulations using standardized cues and responses.
Participants were oriented to the simulation environment, and the simulation was recorded at one facility using Meti LearningSpace (CAE Healthcare, 2012), an audiovisual and center management system that provides recording and tracking services integrated with simulation, and at a second facility using a customized system. Raters attended a 3.5-hr training session. During the session, two researchers described the project, provided resources (such as nursing scope of practice regulatory documents), and explained the rating forms. The raters viewed recordings of simulation performances representing a range of safe and unsafe nursing behaviors. After viewing each recording, raters independently completed ratings and documented details to support those ratings. The raters then discussed their rationales with the other raters and the training facilitators before viewing the next performance. Raters were reminded to assign scores based on their judgments of safety rather than judgments of optimal performance. Further discussion and clarification of specific language such as "conflict" and "delegation" also led to improved rater agreement. After viewing three recorded performances and discussing their rationales for ratings, the raters tended to rate the training videos similarly. After raters completed the training session, they were scheduled for independent rating sessions.

During the research project rating session, videos were organized for rating in random order by participant and by scenario. Raters independently viewed each video and scored performance using the 41-item NPP. Raters had the choice of scoring each item as 1 (safe performance), 0 (unsafe performance), or "NA" (not applicable; no opportunity to observe the behavior in the scenario).

Data Analysis

Descriptive statistics. Item and category means and standard deviations were calculated across scenarios and raters. Means and standard deviations were also calculated for each item and category, by scenario across raters, and by scenario for each rater. In addition, scenario means and standard deviations were calculated for each rater across items and across items and raters.

Missing data. Of the 6,642 possible ratings, 12 (.18%) were missing. An additional 70 ratings marked "NA" (1.05%) were treated as missing data after rater reasons for using the "NA" option were reviewed. The 82 missing data points (1.23% of the 6,642 possible observations) were handled using multiple imputation in SPSS Version 21. Five replicate data sets were imputed, and imputed values were not rounded (Enders, 2010). Each data set was used to run a separate G study using GENeralized analysis Of VAriance (GENOVA; Brennan, 2001; Center for Advanced Studies in Measurement and Assessment, 2013; Crick & Brennan, 1983), and the resulting estimated variance components were then combined using Rubin's (1987) rules (Enders, 2010; Wayman, 2003).
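As a concrete illustration of the pooling step, the short sketch below applies the core of Rubin's (1987) rules, averaging each estimate across the imputed data sets and examining the between-imputation spread, to hypothetical variance-component values. The numbers and the small set of component labels shown are placeholders, not the GENOVA output from this study.

```python
# A minimal sketch of pooling variance-component estimates across the five
# imputed data sets; the values below are illustrative assumptions only.
import numpy as np

# Illustrative estimates for a few of the 15 components (p x s x r x i notation),
# one value per GENOVA run (i.e., per imputed data set).
estimates = {
    "p":      [0.0102, 0.0100, 0.0104, 0.0101, 0.0103],
    "ps":     [0.0065, 0.0063, 0.0066, 0.0064, 0.0067],
    "pr":     [0.0021, 0.0022, 0.0020, 0.0021, 0.0023],
    "psri,e": [0.0450, 0.0448, 0.0452, 0.0451, 0.0449],
}

for name, values in estimates.items():
    q = np.asarray(values)
    pooled = q.mean()          # pooled point estimate: average over imputations
    between = q.var(ddof=1)    # between-imputation variance (imputation uncertainty)
    print(f"sigma^2({name}): pooled = {pooled:.4f}, "
          f"between-imputation variance = {between:.2e}")
```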
G study. The design for the G study included a three-facet universe, representing three conditions of measurement: raters, scenarios, and items. Because all raters evaluated all scenarios and all participants using all 41 items on the NPP instrument, raters were crossed with scenarios and items, resulting in a p × s × r × i design, where p = nurse participants, s = scenarios, r = raters, and i = items. The sample of scenarios, raters, and items used was considered to be exchangeable with any other sample of scenarios, raters, and items in the defined universes for these facets, so the design is classified as random. We included items as a facet because we wanted to examine in our G study the percentage of total variance attributed to items, given that the variance component for the item effect captures how much items differ from each other in difficulty. The RNs who participated and were evaluated by raters in the study were the participants, or objects of measurement.

Using the software program GENOVA (Brennan, 2001; Center for Advanced Studies in Measurement and Assessment, 2013; Crick & Brennan, 1983), 15 sources of variability were explored for this three-facet design, including the universe score variability and 14 sources associated with the three facets: the main effects for scenario (s), rater (r), and item (i); 6 two-way interactions; 4 three-way interactions; and the residual for the rater-scenario-participant-item interaction. Variances were estimated for each effect. Total estimated variance, σ²(X_psri), was the sum of the 15 estimated variance components:

σ²(X_psri) = σ²(p) + σ²(s) + σ²(r) + σ²(i) + σ²(ps) + σ²(pr) + σ²(pi) + σ²(sr) + σ²(si) + σ²(ri) + σ²(psr) + σ²(psi) + σ²(pri) + σ²(sri) + σ²(psri,e)

... of .70. The largest Φ obtained in the conducted D studies was .73, with eight scenarios and nine raters. None of the D studies conducted reached sufficiently high dependability coefficients for a high-stakes exam. Alternative D study designs may result in higher reliability estimates, and other factors that improve rater scoring could positively affect results in future studies.
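To show how such D studies are computed for this design, the sketch below projects Eρ² and Φ for varying numbers of scenarios and raters under the fully crossed random p × s × r × i model. The variance-component values are hypothetical placeholders rather than the study's estimates, and the function is our own illustration; the error terms follow the standard G-theory rule of dividing each component by the D-study sample sizes of the facets it contains (Brennan, 2001).

```python
# A minimal D-study sketch for the fully crossed random p x s x r x i design,
# using hypothetical variance components (not the study's GENOVA estimates).
from itertools import product

vc = {
    "p": 0.010, "s": 0.002, "r": 0.003, "i": 0.015,
    "ps": 0.006, "pr": 0.002, "pi": 0.010,
    "sr": 0.001, "si": 0.004, "ri": 0.002,
    "psr": 0.003, "psi": 0.008, "pri": 0.004, "sri": 0.002,
    "psri": 0.045,   # residual, including unexplained error
}

def d_study(n_s, n_r, n_i=41):
    """Return (E-rho^2, Phi) for n_s scenarios, n_r raters, and n_i items."""
    sizes = {"s": n_s, "r": n_r, "i": n_i}

    def scaled(effect):
        # Divide the component by the number of conditions of each facet it contains.
        divisor = 1
        for facet, n in sizes.items():
            if facet in effect:
                divisor *= n
        return vc[effect] / divisor

    # Relative error: interactions of the object of measurement (p) with facets.
    rel = sum(scaled(e) for e in ("ps", "pr", "pi", "psr", "psi", "pri", "psri"))
    # Absolute error adds facet main effects and facet-only interactions.
    abs_err = rel + sum(scaled(e) for e in ("s", "r", "i", "sr", "si", "ri", "sri"))
    g = vc["p"] / (vc["p"] + rel)
    phi = vc["p"] / (vc["p"] + abs_err)
    return g, phi

for n_s, n_r in product((1, 3, 6, 9), (1, 3, 6, 9)):
    g, phi = d_study(n_s, n_r)
    print(f"{n_s} scenarios, {n_r} raters: E-rho^2 = {g:.2f}, Phi = {phi:.2f}")
```

With the 41 items held fixed, increasing the numbers of scenarios and raters shrinks the corresponding error terms, so the projected coefficients rise, with diminishing returns, as more conditions are added.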
Limitations of the Study

The current study involved a secondary analysis of extant data. As such, the sample size and design of data collection were established a priori. Although minimum sample sizes for multiple-facet designs in G theory have not yet been established by researchers, a minimum of 20 persons and 2 conditions per facet has been suggested for a one-facet design (Webb, Rowley, & Shavelson, 1988). However, studies involving fewer persons in conjunction with larger numbers of conditions per facet and a larger number of facets have been successfully conducted (Briesch et al., 2014), so the current study involving 18 participants was considered sufficient, although a larger sample size would have been preferred.

Directions for Future Research

Although G theory has been used more frequently in reliability studies over the last 10+ years (Briesch et al., 2014), it is still not commonly used in research involving the assessment of nursing competency using simulation. For example, a recent article published in Clinical Simulation in Nursing (Shelestak & Voshall, 2014) focused on validity and reliability concerns and described the use of Cronbach's α, intraclass correlation coefficients, κ, and proportion of agreement as suggested methods of assessing reliability but did not mention G theory. The valuable contributions offered by G theory are just beginning to be realized in the measurement of nursing competency using simulation in the broader academic community (Prion et al., 2016). According to Mariani and Doolen (2016), nurses surveyed at an international simulation in nursing conference felt that research in nursing simulation lacked rigor and expressed a need for more research on evaluation methods and psychometric development of tools.

The current study provided an in-depth analysis of reliability for a simulation-based nursing competency assessment by examining multiple sources of variance. One important finding was that a greater number of scenarios and/or raters is needed to achieve sufficiently high reliability for a high-stakes assessment. Future research in this area should focus on rater training methods that would result in decreased variance attributed to raters. Often, for practical reasons, programs do not have the luxury of having the same raters available to score all participants. Either the number of participants is too large for this to be feasible, or, over time, the pool of raters changes. Research designs that include nesting raters within scenarios would provide valuable information that may support a more flexible rater configuration without sacrificing reliability.

Implications for Practice

The calibration of raters is an essential component of rater training, yet lack of faculty training to improve rating reliability is often the norm in the health professions (McGaghie, Butter, & Kaye, 2009). Training to increase awareness of specific errors raters tend to make, providing a frame of reference using examples of differing levels of performance, and providing intensive behavioral observation training through the practice of scoring and discussion among raters to reach consensus are methods used to improve rater agreement (McGaghie et al., 2009; Tekian & Yudkowsky, 2009). To prevent subjective interpretation of rating scales, anchors must be developed that establish the behaviors, agreed upon by raters, that constitute particular scores (Yudkowsky, 2009). Raters need sufficient preparation and continual updating to ensure high reliability and the minimization of threats to validity. Recognizing a need to increase rater agreement prior to evaluating nurses referred to the ASBN for practice violations, rater training conducted subsequent to the MCWS Phase I study was enhanced to increase consensus among raters and standardization of item interpretation for scoring purposes (D. Hagler, personal communication, June 6, 2014).

Although improvement in the reliability of the NPP is evident when additional raters and/or scenarios are included in the design, these improvements diminish for designs that expand beyond three raters or scenarios. However, unless reliability coefficients can be strengthened with fewer scenarios or raters, additional scenarios and/or raters are clearly needed to meet suggested minimum levels of reliability for testing situations. Also, the purpose of the assessment (norm referenced vs. criterion referenced) informs choices about increasing scenarios or raters. Last, the availability and cost of resources to implement a simulation-based assessment may vary from organization to organization. Comparing the costs involved in developing and implementing new scenarios with the costs of adding raters and including additional scenarios in the testing situation involves factors unique to each organization.

Conclusion

Ensuring the safety of patients is a challenge faced by state boards of nursing, health-care facilities, and educational institutions (Scott Tilley, 2008). Confirming that nurses at every level are meeting minimum levels of competency continues to be a challenge (Kardong-Edgren, Hayden, Keegan, & Smiley, 2014).
When designing a system for measuring competency, stakeholders must agree on definitions of minimum competency, instruments must be developed that provide reliable and valid interpretations of data, and scenarios must be designed that provide opportunities for the nurse to demonstrate competency when assessed by trained raters using the instrument. Each component of this process involves tremendous time, work, and expertise. In the literature involving instruments used to measure nursing competency, reported methods of reliability testing have rarely included G theory. Investigation of reliability is often limited to the examination of interrater reliability, using coefficient α or percentage agreement as measurements (Adamson, Kardong-Edgren, & Willhaus, 2012; Hinton et al., 2012; Kardong-Edgren et al., 2010). Continued work is expected as state boards of nursing, accreditation boards, schools, and employers look for defensible methods to assess nursing competency.

Acknowledgments

The authors thank Dr. Janine Hinton and the Measuring Competency With Simulation Phase 1 Team for providing the data.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

References

Adamson, K., Kardong-Edgren, S., & Willhaus, J. (2012, November). An updated review of published simulation evaluation instruments. Clinical Simulation in Nursing, 9, e393-e400. doi:10.1016/j.ecns.2012.09.004

Alinier, G., & Platt, A. (2014, January). International overview of high-level simulation education initiatives in relation to critical care. Nursing in Critical Care, 19(1), 42-49.

Ashcraft, A., Opton, L., Bridges, R., Caballero, S., Veesart, A., & Weaver, C. (2013, March/April). Simulation evaluation using a modified Lasater Clinical Judgment Rubric. Nursing Education Perspectives, 34, 122-126. Retrieved from http://www.nln.org/nlnjournal/

Benner, P., Malloch, K., Sheets, V., Bitz, K., Emrich, L., Thomas, M., ... Farrell, M. (2006). TERCAP: Creating a national database on nursing errors. Harvard Health Policy Review, 7, 48-63. Retrieved from http://nursing2015.files.wordpress.com/2010/02/tercap_201004141512291.pdf

Bewley, W. L., & O'Neil, H. F. (2013). Evaluation of medical simulations. Military Medicine, 178, 64-75. doi:10.7205/MILMED-D-13-00255

Boulet, J. (2005). Generalizability theory: Basics. In B. Everitt & D. Howell (Eds.), Encyclopedia of statistics in behavioral science (Vol. 2, pp. 704-711). Chichester, England: John Wiley.

Boulet, J. R., & Murray, D. J. (2010, April). Simulation-based assessment in anesthesiology: Requirements for practical implementation. Anesthesiology, 112, 1041-1052. doi:10.1097/ALN.0b013e3181cea265

Boulet, J. R., Murray, D., Kras, J., Woodhouse, J., McAllister, J., & Ziv, A. (2003, December). Reliability and validity of a simulation-based acute care skills assessment for medical students and residents. Anesthesiology, 99, 1270-1280. Retrieved from http://journals.lww.com/anesthesiology/pages/default.aspx

Brennan, R. (2001). Generalizability theory. Statistics for Social Science and Public Policy. New York, NY: Springer.

Brennan, R. (2011). Generalizability theory and classical test theory. Applied Measurement in Education, 24, 1-21. doi:10.1080/08957347.2011.532417
Briesch, A. M., Swaminathan, H., Welsh, M., & Chafouleas, S. M. (2014). Generalizability theory: A practical guide to study design, implementation, and interpretation. Journal of School Psychology, 52, 13-35. doi:10.1016/j.jsp.2013.11.008

CAE Healthcare. (2012). MetiLearning. Retrieved from http://www.meti.com/products_learningspace.htm

Center for Advanced Studies in Measurement and Assessment. (2013). GENOVA [Computer program]. Retrieved from http://www.education.uiowa.edu/centers/casma/computer-programs#8f748e48-f88c-6551-b2b8-ff00000648cd

Crick, J. E., & Brennan, R. L. (1983). Manual for GENOVA: A generalized analysis of variance system. Iowa City, IA: The American College Testing Program.

Elfrink Cordi, V. L., Leighton, K., Ryan-Wenger, N., Doyle, T. J., & Ravert, P. (2012, July/August). History and development of the Simulation Effectiveness Tool (SET). Clinical Simulation in Nursing, 8, e199-e210. doi:10.1016/j.ecns.2011.12.001

Enders, C. K. (2010). Applied missing data analysis. New York, NY: Guilford Press.

Goodstone, L., & Goodstone, M. (2013). Use of simulation to develop a medication administration safety assessment tool. Clinical Simulation in Nursing, 9, e609-e615. doi:10.1016/j.ecns.2013.04.017

Harder, B. N. (2010). Use of simulation in teaching and learning in health sciences: A systematic review. Journal of Nursing Education, 49, 23-28. doi:10.3928/01484834-20090828-08

Hinton, J., Mays, M., Hagler, D., Randolph, P., Brooks, R., DeFalco, N., ... Weberg, D. (2012). Measuring post-licensure competence with simulation: The nursing performance profile. Journal of Nursing Regulation, 3, 45-53. Retrieved from http://jnr.metapress.com/home/main.mpx

Holmboe, E., Rizzolo, M. A., Sachdeva, A. K., Rosenberg, M., & Ziv, A. (2011). Simulation-based assessment and the regulation of healthcare professionals. Simulation in Healthcare, 6, S58-S62. doi:10.1097/SIH.0b013e3182283bd7

Hughes, R., Smith, S., Sheffield, C., & Wier, G. (2013, May/June). Assessing performance outcomes of new graduates utilizing simulation in a military transition program. Journal for Nurses in Professional Development, 29, 143-148. doi:10.1097/NND.0b013e318291c468

Johnson, R. L., Penny, J. A., & Gordon, B. (2009). Assessing performance: Designing, scoring, and validating performance tasks. New York, NY: Guilford Press.

Kardong-Edgren, S., Adamson, K. A., & Fitzgerald, C. (2010). A review of currently published evaluation instruments for human patient simulation. Clinical Simulation in Nursing, 6, e25-e35. doi:10.1016/j.ecns.2009.08.004

Kardong-Edgren, S., Hanberg, A. D., Keenan, C., Ackerman, A., & Chambers, K. (2011). A discussion of high-stakes testing: An extension of a 2009 INACSL conference roundtable. Clinical Simulation in Nursing, 7, e19-e24. doi:10.1016/j.ecns.2010.02.002

Kardong-Edgren, S., Hayden, J., Keegan, M., & Smiley, R. (2014). Reliability and validity testing of the Creighton Competency Evaluation Instrument for use in the NCSBN National Simulation Study. Nursing Education Perspectives. Retrieved from http://www.nln.org/nlnjournal/

Katz, G., Peifer, K., & Armstrong, G. (2010). Assessment of patient simulation use in selected baccalaureate nursing programs in the United States. Simulation in Healthcare, 5, 46-51. doi:10.1097/SIH.0b013e3181ba1f46

Kreiter, C. D. (2009). Generalizability theory. In S. Downing & R. Yudkowsky (Eds.), Assessment in health professions education [Kindle e-reader]. New York, NY: Routledge, Taylor, & Francis Group.
Manz, J., Hercinger, M., Todd, M., Hawkins, K., & Parsons, M. (2013, July). Improving consistency of assessment of student performance during simulated experiences. Clinical Simulation in Nursing, 9, e229-e233. doi:10.1016/j.ecns.2012.02.007

Mariani, B., & Doolen, J. (2016, January). Nursing simulation research: What are the perceived gaps? Clinical Simulation in Nursing, 12, 30-36. doi:10.1016/j.ecns.2015.11.004

McGaghie, W., Butter, J., & Kaye, M. (2009). Observational assessment. In S. Downing & R. Yudkowsky (Eds.), Assessment in health professions education [Kindle e-reader]. New York, NY: Routledge, Taylor, & Francis Group.

Meyer, M. N., Connors, H., Hou, Q., & Gajewski, B. (2011). The effect of simulation on clinical performance: A junior nursing student clinical comparison study. Simulation in Healthcare, 6, 269-277. doi:10.1097/SIH.0b013e318223a048

Missen, K., McKenna, L., & Beauchamp, A. (2016). Registered nurses' perceptions of new nursing graduates' clinical competence: A systematic integrative review. Nursing & Health Sciences, 18, 143-153. doi:10.1111/nhs.12249

National Council of State Boards of Nursing. (2007). Attachment C1: The impact of transition experience on practice of newly licensed registered nurses. Business Book: NCSBN 2007 Annual Meeting: Navigating the evolution of nursing regulation. Retrieved from https://www.ncsbn.org/2007_BusinessBook_Section2.pdf

Nursing Executive Center. (2008). Bridging the preparation-practice gap. Volume I: Quantifying new graduate nurse improvement needs. The New Graduate Nurse Preparation Series. Washington, DC: Advisory Board Company. Retrieved from http://www.advisory.com/Research/Nursing-Executive-Center/Studies/2008/Bridging-the-Preparation-Practice-Gap-Volume-I

Prion, S., & Adamson, K. (2012). Making sense of methods and measurement: The need for rigor in simulation research. Clinical Simulation in Nursing, 8, e193. doi:10.1016/j.ecns.2012.02.005

Prion, S. K., Gilbert, G. E., & Haerling, K. A. (2016). Generalizability theory: An introduction with application to simulation evaluation. Clinical Simulation in Nursing, 12, 546-554. doi:10.1016/j.ecns.2016.08.006

Ram, P., Grol, R., Joost Rethans, J., Schouten, B., van der Vleuten, C., & Kester, A. (1999). Assessment of general practitioners by video observation of communicative and medical performance in daily practice: Issues of validity, reliability, and feasibility. Medical Education, 33, 447-454. Retrieved from http://onlinelibrary.wiley.com/journal/10.1111/(ISSN)1365-2923

Randolph, P. (2013, September). Measuring post-licensure competence. Paper presented at Arizona Simulation Network, Mesa, AZ.

Randolph, P., Hinton, J., Hagler, D., Mays, M., Kastenbaum, B., Brooks, R., ... Weberg, D. (2012). Measuring competence: Collaboration for safety. The Journal of Continuing Education in Nursing, 43, 541-547. doi:10.3928/00220124-20121101-59

Randolph, P., & Ridenour, J. (2015, April). Comparing simulated nursing performance to actual practice. Journal of Nursing Regulation, 6, 33-38. doi:10.1016/S2155-8256(15)30007-7

Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. New York, NY: John Wiley.

Sando, C., Coggins, R., Meakim, C., Franklin, A., Gloe, D., Boese, T., ... Borum, J. (2013, June). Standards of best practice: Simulation standard VII: Participant assessment and evaluation. Clinical Simulation in Nursing, 9, S30-S32. doi:10.1016/j.ecns.2013.04.007
Schuwirth, L. W., & van der Vleuten, C. P. (2003). The use of clinical simulations in assessment. Medical Education, 37, 65-71. Retrieved from http://onlinelibrary.wiley.com/journal/10.1111/(ISSN)1365-2923

Scott Tilley, D. D. (2008). Competency in nursing: A concept analysis. Journal of Continuing Education in Nursing, 39, 58-64.

Shavelson, R., & Webb, N. (1991). Generalizability theory: A primer. Thousand Oaks, CA: Sage.

Shelestak, D., & Voshall, B. (2014). Examining validity, fidelity, and reliability of human patient simulation. Clinical Simulation in Nursing, 10, e257-e260. doi:10.1016/j.ecns.2013.12.003

Swanson, D., & Stillman, P. (1990). Use of standardized patients for teaching and assessing clinical skills. Evaluation & the Health Professions, 13, 79-103. doi:10.1177/016327879001300105

Tekian, A., & Yudkowsky, R. (2009). Assessment portfolios. In S. Downing & R. Yudkowsky (Eds.), Assessment in health professions education [Kindle e-reader]. New York, NY: Routledge, Taylor, & Francis Group.

Van der Vleuten, C. P. M., & Swanson, D. B. (1990). Assessment of clinical skills with standardized patients: State of the art. Teaching and Learning in Medicine, 2, 58-76. Retrieved from http://www.tandfonline.com/loi/htlm20

Waters, J. K. (2011, May). 360 degrees of reflection. T.H.E. Journal, 38, 33-35.

Wayman, J. C. (2003). Multiple imputation for missing data: What is it and how can I use it? Paper presented at the 2003 Annual Meeting of the American Educational Research Association, Chicago, IL.

Webb, N. M., Rowley, G. L., & Shavelson, R. J. (1988). Using generalizability theory in counseling and development. Measurement and Evaluation in Counseling and Development, 21, 81-90. Retrieved from http://journals.sagepub.com/loi/mec

Weller, J., Robinson, B., Jolly, B., Watterson, L., Joseph, M., Bajenov, S., ... Larsen, P. (2005). Psychometric characteristics of simulation-based assessment in anaesthesia and accuracy of self-assessed scores. Anaesthesia, 60, 245-250. Retrieved from http://www.aagbi.org/publications/anaesthesia

Yuan, H. B., Williams, B. A., & Fang, J. B. (2011). The contribution of high-fidelity simulation to nursing students' confidence and competence: A systematic review. International Nursing Review, 59, 26-33. Retrieved from http://onlinelibrary.wiley.com/journal/10.1111/(ISSN)1466-7657/issues

Yudkowsky, R. (2009). Performance tests. In S. Downing & R. Yudkowsky (Eds.), Assessment in health professions education [Kindle e-reader]. New York, NY: Routledge, Taylor, & Francis Group.

Ziv, A., Berkenstadt, H., & Eisenberg, O. (2013). Simulation for licensure and certification. In A. Levine, S. DeMaria, A. Schwartz, & A. Sim (Eds.), The comprehensive textbook of healthcare simulation (pp. 161-170). New York, NY: Springer.