1 The University of Queensland and Princess Alexandra Hospital, Australia.
2 National Ageing Research Institute and The University of Melbourne, Australia.
3 Western General Hospital, Melbourne, Australia.
4 Arthritis Foundation of Victoria Centre for Rheumatic Disease and Royal Melbourne Hospital, Australia.
Address correspondence to Terry Haines, PhD, Physiotherapy Department, GARU, Princess Alexandra Hospital, Ipswich Rd., Woolloongabba, Queensland, 4112, Australia. E-mail: terrence_haines{at}health.qld.gov.au
| ABSTRACT |
|---|
Methods. A systematic review was undertaken. Two blinded reviewers classified the methodology of relevant publications using a four-point classification system adapted from multiple sources. The association between study design classification and reported results was examined using linear regression with clustering based on screening tool and robust variance estimates, with point estimates of the Youden Index (= sensitivity + specificity − 1) as the dependent variable. Meta-analysis was then performed pooling data from prospective studies.
Results. Thirty-five publications met inclusion criteria, containing 51 evaluations of fall risk screening tools. Twenty evaluations were classified as retrospective validation evaluations, 11 as prospective (temporal) validation evaluations, and 20 as prospective (external) validation evaluations. Retrospective evaluations had significantly higher Youden Indices (point estimate [95% confidence interval]: 0.22 [0.11, 0.33]). Pooled Youden Indices from prospective evaluations demonstrated the STRATIFY, Morse Falls Scale, and nursing staff clinical judgment to have comparable accuracy.
Discussion. Practitioners should exercise caution in comparing validity of fall risk assessment tools where the evaluation has been limited to retrospective classifications of methodology. Heterogeneity between studies indicates that the Morse Falls Scale and STRATIFY may still be useful in particular settings, but that widespread adoption of either is unlikely to generate benefits significantly greater than that of nursing staff clinical judgment.
Mechanisms for selective deployment of falls prevention interventions and intervention programs have varied widely. Commonly these mechanisms have been referred to as fall risk screening or risk assessment tools; however, they have varied in terms of their method of construction, content, and outcomes. The theory that a combination of two or more risk factors may predict falls more accurately than any one factor alone has driven much of the development of fall risk assessment tools. In keeping with the increasing number of publications in this area, a number of reviews of this field have recently been published (18–20). However, the conclusions arising from these reviews have varied. One review supported nurse-completed screening tools as appropriate for the acute hospital setting (though questionable for the extended [subacute] care setting) and concluded that a suitable tool should be identifiable from existing literature, leaving little need for individual sites to develop their own tool (20). Another suggested that none of the existing tools can be recommended for wholesale implementation and that it may be better to identify modifiable risk factors that could then be targeted for intervention in all patients (19). A third highlighted the need for more research on existing tools by researchers independent of tool designers and in research locations separate from the setting of tool development (18). Thus, hospital clinicians and administrators appear to be little closer to being able to answer the questions of whether a fall risk screening tool should be used in their facility and, if so, which one and how.
Previous reviews have highlighted the need for prospective evaluation of fall risk assessment tools in multiple settings by researchers independent of the tool designers (18,19). These calls have been based on valid theoretical principles from the broader field of screening and prognostic tool development (21). Despite these concerns, little differentiation has been made in previous reviews of studies in terms of methodological quality. Indeed, few systematic reviews of observational studies involve quality assessment (22). The potential impact of study design on results presented in the hospital fall risk screening tool field has not previously been investigated. The sentinel work in this area systematically identified and reviewed 184 original articles, evaluating 218 medical diagnostic tests, employed a single descriptor of predictive accuracy for each test, and entered these along with design covariates into regression analyses to evaluate the effects of various design qualities on reported results (23). This work is recognized by the Standards for Reporting of Diagnostic Accuracy (STARD) initiative as the work that identified specific design features that can produce biased, overoptimistic estimates of diagnostic accuracy (24). This systematic review aims to follow this approach in the field of fall risk screening tools for the hospital setting to determine whether study design is associated with reported study outcomes and to present a pooled analysis of present literature based on the presence or absence of a methodology effect.
| METHOD |
|---|
Publications included in this review had to meet the following criteria:
Titles and abstracts of all publications identified were reviewed by one reviewer (TH), and those clearly not meeting the inclusion criteria were discarded. All remaining manuscripts were reviewed completely by one reviewer (TH) to determine if they met the inclusion criteria; those not meeting the criteria were excluded. For the remaining publications, the study design used was rated according to the classification system described below by two raters (TH, WW) blinded to each other's ratings. These ratings were compared by a third reviewer (KH), who provided a deciding vote where discrepancies in classifications arose. Additional data from each study (sample size, reported results, list of authors, hospital staff blinded to screening tool classifications) were then collected by one reviewer (TH) for use in the analysis.
| DEFINITION OF TERMS |
|---|
Several authors have noted that an important purpose of these tools lies in accurate deployment of falls prevention interventions, minimizing expenditure of resources on patients who do not require them (20,27,29,40,46,47). Hence a primary means for comparing the relative value of alternate tools lies in comparing their accuracy of prediction.
Examples of tools that do not provide a description of the probability of falling, but provide a list of risk factors for falls, are available in current literature (17,48) and are referred to in this review as being fall risk factor assessment tools. As these tools do not provide a measure of risk of falling, their accuracy in prediction cannot be calculated and cannot be compared through the same means as fall risk screening tools. Although these tools may play an important role in falls prevention, they could not be incorporated in this review.
Study Design Classifications
In their sentinel work, Lijmer and colleagues (23) identified eight design characteristics that they included in their analysis. Not all of these design characteristics are relevant to the field of fall risk screening in the hospital setting. We therefore drew on the relevant items from this work and diagnostic accuracy study design considerations subsequently identified by other authors and developed one composite, four-level classification system to describe the study design characteristics of evaluations in this field.
This classification system is closely related to STARD checklist items 6, 8, and 9 and is further described below.
Retrospective Versus Prospective
The terms "prospective" and "retrospective" can relate to a number of factors in an evaluation of a fall risk screening tool. To be classified as prospective in this review, the following must all have been collected or selected before the accuracy of the tool, or of factors within it, was analyzed for that data set: (i) data required for screening tool completion, (ii) content of the screening tool, (iii) weighting of factors included in the screening tool, and (iv) screening tool cutoff point. Neither the content of a screening tool, the weighting of factors within it, nor its cutoff point should be influenced by the data from which the accuracy of the tool is being established. Theoretically, few tools will not fit the data from which they were custom made (49). Other design issues and results from such evaluations provide little generality beyond that data set (50); hence studies not meeting all these criteria for prospective evaluation will be classified as retrospective. It is noted that some studies with partially prospective designs have presented accuracy data on preexisting tools which were then modified by adding another risk factor, modifying the scoring, or changing the cutoff point, presumably to improve the reported accuracy of the tool. Such evaluations have been classified as retrospective unless all modifications were made prior to the study.
Retrospective: Auto Versus Internal Validation
A subclassification of the retrospective study design classification is the internal validation approach. An internal validation collects the data and divides the sample into two groups before the tool is constructed, constructs the tool on one set, and then tests its validity on the other. In its simplest form, this is referred to as "split-sample validation," though more complex cross-validation methods can be used (49). Although this approach is still retrospective using the design classification defined in the previous paragraph, the data with which the tool is being validated are independent from the data with which it was constructed; thus this approach is stronger than the auto validation approach into which all other retrospective designs are placed.
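The split-sample idea can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the simulated patients, the 0 to 5 risk score, the candidate cutoffs, and the function names are assumptions made for illustration and are not drawn from any study in this review. A cutoff is chosen on one half of the sample (so its accuracy there is an auto validation), and the same cutoff is then tested on the other half (an internal validation).

```python
import random


def youden_index(patients, cutoff):
    """patients: list of (score, fell) pairs; score >= cutoff means 'high risk'."""
    tp = sum(1 for score, fell in patients if fell and score >= cutoff)
    fn = sum(1 for score, fell in patients if fell and score < cutoff)
    tn = sum(1 for score, fell in patients if not fell and score < cutoff)
    fp = sum(1 for score, fell in patients if not fell and score >= cutoff)
    # Guard against an empty faller or nonfaller group in a small sample.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity + specificity - 1


def split_sample_validation(patients, candidate_cutoffs, seed=0):
    """Divide the sample first, build the tool on one half, test it on the other."""
    shuffled = list(patients)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    derivation, validation = shuffled[:half], shuffled[half:]
    # "Tool construction" here is only the choice of cutoff on the derivation half.
    cutoff = max(candidate_cutoffs, key=lambda c: youden_index(derivation, c))
    return {
        "cutoff": cutoff,
        "auto": youden_index(derivation, cutoff),      # same data as construction
        "internal": youden_index(validation, cutoff),  # independent half
    }


# Hypothetical simulated cohort: fall probability rises with the risk score.
rng = random.Random(1)
patients = []
for _ in range(400):
    score = rng.randint(0, 5)
    fell = rng.random() < 0.05 + 0.08 * score
    patients.append((score, fell))

result = split_sample_validation(patients, candidate_cutoffs=range(1, 6))
print(result)
```

Because the cutoff is tuned to the derivation half, the "auto" figure tends to be the more optimistic of the two, although in any single split the ordering can go either way.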
Prospective: Temporal Versus External Validation
Prospective validation studies have been further subclassified as temporal validation and external validation designs (51). Temporal validation uses a set of patients from the same location but at a time later than the time during which the tool was constructed to validate the tool. This is a prospective evaluation independent of the original data. A difficulty with this design is that the results may not be generalizable to sites other than the site of original development. External validation addresses the wider issue of generality by collecting new data from an appropriate patient population in a different center, so not only is the validation data set prospective and independent of the original data set, it is also independent of characteristics of the original testing location. This makes external validation designs theoretically more rigorous than temporal designs.
Tool Construction Can Also Affect Classification
Where tools were developed through expert opinion by clinicians working in the research location, clinical knowledge gained from previous patients, quality improvement projects, and research at this location may have influenced tool design. These studies were therefore classified as temporal validation studies, and those developed by others not working at the research location were classified as external validation studies. Where the accuracy of nursing staff clinical judgment in classifying fallers was investigated in isolation (i.e., without being combined with other risk factors) and other conditions for prospective evaluation were met, these results were classified as coming from temporal validation studies.
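The four-level classification described in this section can be summarized as a small decision function. The sketch below is one reading of the rules in the text, not the rating instrument the reviewers used; the field names and the `classify` function are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Design(Enum):
    RETROSPECTIVE_AUTO = "retrospective (auto)"
    RETROSPECTIVE_INTERNAL = "retrospective (internal)"
    PROSPECTIVE_TEMPORAL = "prospective (temporal)"
    PROSPECTIVE_EXTERNAL = "prospective (external)"


@dataclass(frozen=True)
class Evaluation:
    # Conditions (i) to (iv): each was collected or fixed before accuracy
    # was analyzed on this data set.
    data_collected_first: bool
    content_fixed_first: bool
    weights_fixed_first: bool
    cutoff_fixed_first: bool
    # Validation data kept separate from construction data (e.g. split-sample).
    independent_validation_data: bool = False
    # Tool built, or shaped by local expert opinion, at the research location.
    developed_at_study_site: bool = False
    # Nursing staff clinical judgment evaluated in isolation.
    clinical_judgment_alone: bool = False


def classify(ev: Evaluation) -> Design:
    prospective = all((
        ev.data_collected_first,
        ev.content_fixed_first,
        ev.weights_fixed_first,
        ev.cutoff_fixed_first,
    ))
    if not prospective:
        if ev.independent_validation_data:
            return Design.RETROSPECTIVE_INTERNAL
        return Design.RETROSPECTIVE_AUTO
    if ev.developed_at_study_site or ev.clinical_judgment_alone:
        return Design.PROSPECTIVE_TEMPORAL
    return Design.PROSPECTIVE_EXTERNAL


# A preexisting tool whose cutoff was changed after seeing the data:
print(classify(Evaluation(True, True, True, False)).value)
```

Encoding the rules this way makes the ordering explicit: failing any one of the four prospective conditions is sufficient for a retrospective rating, regardless of where the tool was developed.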
| STATISTICAL ANALYSIS |
|---|
To assess the influence of study design on reported results, the mean (standard deviation) Youden Indices of tools evaluated under each study design classification were compared. The effect of methodology on reported accuracy was then investigated using linear regression analyses, treating study design as a categorical independent variable, the Youden Index as a continuous outcome variable, and each evaluation as a subject, with clustering of data based on the screening tool evaluated. Each study design (retrospective-auto, retrospective-internal, prospective-temporal, prospective-external) was entered into the model as a dummy variable. Other variables entered into the model included sample size, whether hospital staff were blinded to screening tool outcomes ("staff blinding"), and whether a member of the authorship team was involved in tool development ("author independence").
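One way to write the regression described above is sketched below. The notation is an assumption introduced here for illustration: $J_{ij}$ is the Youden Index point estimate from evaluation $i$ of screening tool $j$, the design terms are 0/1 indicators with prospective-external taken as the reference category (the category against which the Results report their comparisons), and $n_{ij}$ is sample size. Whether all covariates were entered simultaneously or one at a time is not stated in the text, so the full form should be read as a sketch only.

```latex
J_{ij} = \beta_0
       + \beta_1\,\mathrm{RetroAuto}_{ij}
       + \beta_2\,\mathrm{RetroInternal}_{ij}
       + \beta_3\,\mathrm{ProspTemporal}_{ij}
       + \beta_4\,n_{ij}
       + \beta_5\,\mathrm{StaffBlinding}_{ij}
       + \beta_6\,\mathrm{AuthorIndependence}_{ij}
       + \varepsilon_{ij}
```

Under this coding, $\beta_1$ is the expected difference in Youden Index between retrospective (auto) and prospective (external) evaluations, holding the other terms fixed. Clustering on screening tool $j$ does not change the coefficient estimates; it allows the errors $\varepsilon_{ij}$ to be correlated within a tool when the robust variance estimates are computed.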
Based on the findings of the above analysis, a meta-analysis was undertaken pooling patient-level data from prospective studies for screening tools where more than one evaluation had been undertaken. Pooled Youden Indices with 95% confidence intervals were calculated. To be pooled, publications had to present directly the frequency of true positives, false positives, true negatives, and false negatives from their evaluations, or sufficient data from which these could be calculated. Five evaluations arising from four studies were excluded (29,44,52,53) because insufficient data were presented, and attempts to obtain further data from authors were unsuccessful.
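As an illustration of the pooling step, the sketch below sums the four cell counts across evaluations of one tool and computes the Youden Index (sensitivity + specificity − 1) from the pooled table. The counts are hypothetical. The confidence interval uses a normal (Wald-type) approximation that treats sensitivity and specificity as independent proportions; this is an assumption, since the interval method used in the review is not described here.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class TwoByTwo:
    tp: int  # fallers classified as high risk
    fp: int  # nonfallers classified as high risk
    tn: int  # nonfallers classified as low risk
    fn: int  # fallers classified as low risk


def youden_with_ci(t, z=1.96):
    """Return (Youden Index, lower, upper) using a Wald-type interval (assumed)."""
    fallers = t.tp + t.fn
    nonfallers = t.tn + t.fp
    sens = t.tp / fallers
    spec = t.tn / nonfallers
    j = sens + spec - 1
    se = sqrt(sens * (1 - sens) / fallers + spec * (1 - spec) / nonfallers)
    return j, j - z * se, j + z * se


def pool(tables):
    """Pool patient-level data by summing each cell across evaluations."""
    return TwoByTwo(
        tp=sum(t.tp for t in tables),
        fp=sum(t.fp for t in tables),
        tn=sum(t.tn for t in tables),
        fn=sum(t.fn for t in tables),
    )


# Two hypothetical prospective evaluations of the same tool.
evaluations = [
    TwoByTwo(tp=30, fp=40, tn=160, fn=20),
    TwoByTwo(tp=10, fp=30, tn=70, fn=10),
]
pooled = pool(evaluations)
j, lower, upper = youden_with_ci(pooled)
print(f"pooled Youden Index = {j:.2f} [{lower:.2f}, {upper:.2f}]")
```

Summing cells weights each evaluation by its number of fallers and nonfallers, so larger evaluations dominate the pooled estimate.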
| RESULTS |
|---|
The classification of each of the 52 evaluations from studies included in this review, along with Youden Index (95% confidence interval) results, is presented in Figure 1A–C. These 52 evaluations arose from 36 separate publications. A total of 20 evaluations were classified as retrospective validation evaluations, 11 as prospective (temporal) validation evaluations, and 21 as prospective (external) validation evaluations. Linear regression output examining the effect of prospective (temporal) design and retrospective (auto) design on Youden Index point estimates from each evaluation relative to prospective (external) design is presented in Table 1. This analysis indicates that the Youden Index scores calculated are significantly lower in prospective (external) designs than in retrospective (auto) designs by a magnitude of 0.22 on the Youden Index scale (range −1 to +1). This model explained 18% of variance (R²) in Youden Index scores calculated, and the model was significant (p < .01). A mild trend for higher Youden Index scores was apparent for prospective (temporal) designs relative to prospective (external) designs (magnitude = 0.11), though this was not significant.
Pooled Youden Index scores for screening tools with two or more prospective evaluations are presented in Figure 2, and indicate that the Schmid and Downton fall risk screening tools have the highest predictive accuracy, though these tools have only been subjected to two prospective evaluations each, whereas the STRATIFY (nine evaluations) and nursing staff clinical judgment (five evaluations) have been subjected to many more.
| DISCUSSION |
|---|
In light of this evidence, this study completed the first meta-analysis of predictive accuracy data in this field. Only data from prospective studies were pooled to eliminate the potentially overoptimistic data provided from studies with retrospective design characteristics. From this pooled analysis, it appears that the Schmid and Downton fall risk scales potentially offer the greatest accuracy in predicting which patients will become in-hospital fallers. However, these tools have been subjected to substantially fewer prospective evaluations than the STRATIFY and nursing staff clinical judgment, their pooled number of participants is substantially lower than that for the STRATIFY, Morse Falls Scale, and clinical judgment, and there is considerable variation in results among the individual prospective evaluations of the Schmid and Downton fall risk scales. Further prospective evaluations of the Schmid and Downton fall risk scales are required in a range of settings before stronger recommendations can be made regarding the potential use of these tools as "across the board" fall risk screening tools in hospitals. Other factors that should also be considered in such a recommendation are the time cost required for tool completion, completion rates specific to that tool when used in the clinical context, and whether the tool identifies modifiable risk factors for which interventions are readily available in that clinical context.
Our pooled analysis revealed only moderate predictive accuracy for the STRATIFY, Morse Falls Scale, Falls Risk Assessment Scale for the Elderly, and nursing staff clinical judgment, though the STRATIFY had two prospective evaluations with favorable results excluded based on insufficient data (which would have increased the pooled accuracy estimate if included), while the Morse Falls Scale had one evaluation with more favorable and one with less favorable results than the pooled analysis excluded (which is unlikely to have affected the pooled accuracy estimate if included). There was also substantial heterogeneity in the ward and patient types in which the prospective evaluations were undertaken, and also large variations in the reported results from individual evaluations of the STRATIFY and Morse Falls Scale. Thus it is suitable to conclude that these two tools will provide moderate accuracy when used across the range of settings from which their evaluations have arisen, but they may provide greater or lesser accuracy in specific patient groups. For example, when used among acute hip fracture patients on an orthopedic ward, the STRATIFY performed poorly [Youden Index = 0.26 (55,56)], yet among older patients admitted to general medical units [Youden Index = 0.51 (57)] and among those admitted to Elderly Care Units [Youden Index = 0.81 (29)], it performed very well. Conversely, nursing staff clinical judgment was consistently demonstrated to have moderate levels of predictive accuracy across the range of settings in which it was evaluated. A moderate level of accuracy may be sufficient to guide the selective deployment of some interventions, as previous research has found that in-hospital falls can be reduced through provision of a multifactorial intervention program selectively deployed largely through clinical judgment of hospital staff (7).
Other factors identified as having significant univariate associations with Youden Index scores were sample size and whether the authorship team of the evaluation was independent of the developers of the tool. The association between sample size and Youden Index scores may be attributable to authors prematurely ceasing studies with poorer outcomes at smaller sample sizes. The association between the "author independence" variable and Youden Index scores may have been attributable to its multicollinearity with the study design variable, but conceivably also to a publication bias whereby authors are less willing to proceed to publication with results of their trials if results are poor. Analysis of these factors was not the primary intent of this investigation, and these findings were somewhat serendipitous; however, the potential bias they might introduce is of concern. To account for a sample size bias, one could model the results of smaller studies to be of equivalent sample size to the largest study, pool these results, and then examine whether the pooled results from this adjusted model varied markedly from the results arising from the unadjusted pooled analysis.
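The sample size adjustment suggested above could be sketched as follows. This is only one possible reading of the proposal: each study's 2 x 2 counts are scaled up proportionally to the size of the largest study before pooling, which gives every study equal weight. The counts and function names are hypothetical, and this analysis was not performed in the review.

```python
def youden(tp, fp, tn, fn):
    return tp / (tp + fn) + tn / (tn + fp) - 1


def pooled_youden(tables):
    """tables: iterable of (tp, fp, tn, fn) tuples; cells are summed, then scored."""
    tp, fp, tn, fn = (sum(cells) for cells in zip(*tables))
    return youden(tp, fp, tn, fn)


def rescale_to_largest(tables):
    """Scale every table so its total equals that of the largest study."""
    largest = max(sum(t) for t in tables)
    return [tuple(cell * largest / sum(t) for cell in t) for t in tables]


# Hypothetical data: one large study (n = 500) and one small study (n = 50)
# in which the small study reports the more favorable accuracy.
tables = [(60, 80, 320, 40), (9, 5, 35, 1)]

unadjusted = pooled_youden(tables)
adjusted = pooled_youden(rescale_to_largest(tables))
print(f"unadjusted = {unadjusted:.3f}, adjusted = {adjusted:.3f}")
```

A marked gap between the two figures, as in this made-up example, would suggest that the pooled estimate is sensitive to how much weight the smaller studies carry.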
There are some factors that are likely to further affect the validity of the pooled analysis findings. First, some tools were adapted for specific health care settings (57), and the research conditions within which the evaluation took place varied. For example, missing data (where screening tools were not completed) were excluded in some evaluations (46), but were able to be incorporated in others (58), and interventions were directly provided on the basis of the screening tool in some studies (55), but not in others where clinical staff were blinded to the screening tool classifications for their patients (59). These variations increase heterogeneity of the data that contributed to the pooled data and must be recognized when considering the validity of the pooled results.
Other study limitations require acknowledgment. First, only results relating to whether patients became fallers or nonfallers could be used in the meta-analysis, rather than also analyzing the ability of tools to predict fall event rates. Although arguments for the importance of this latter outcome in the hospital setting are emerging [particularly due to multiple falls being incurred by individual patients, screening tools being applied on multiple occasions, and patients having variable lengths of stay in hospital (60)], there is presently insufficient documentation of these data to justify meta-analysis. Recent advances in methodologies to evaluate the ability of screening tools to predict falls rates (58,60) may promote greater reporting of fall event rate outcomes and predictive accuracy, such that a meta-analysis of these data could be completed in the future.
The classification system used for this analysis grouped together a range of study characteristics and merged them into a four-point classification system. As a result, this study cannot identify the potential impact of each of these characteristics individually. However, whereas the sentinel work by Lijmer and colleagues (23) reviewed 218 evaluations, the present study reviewed only 52; examining each characteristic individually in a multiple regression analysis would therefore have made overfitting of the model likely. Counting each evaluation as an individual subject for inclusion in the regression analysis could be considered a crude method for evaluating the effect of study design on reported results, as it does not take into account variances in sample size or differences between tools. However, data were clustered according to screening tool evaluated, robust variance estimates were used, and regression coefficients were largely unchanged when the sample size variable was entered into the model (adjusting for studies of different sizes).
This study provides several implications for selection of a fall risk screening tool for use in clinical practice and for future research. First, although retrospective evaluations still hold value in generating initial results and identifying tools and cutoff points that may be useful in the clinical setting, their results should not be weighted as heavily as those arising from prospective studies when selecting a screening tool for use in the clinical setting. Second, results from initial studies should not be viewed as definitive, even if a prospective design was used, until they have been replicated in similar populations elsewhere (if selecting a tool for a similar patient group) and in wider hospital-based populations (if selecting a tool for generic hospital-wide use). Third, the meta-analysis has indicated that, of the tools with multiple prospective evaluations and large pooled participant numbers, the STRATIFY, Morse Falls Scale, and nursing staff clinical judgment provide comparable levels of accuracy. For future research, the Schmid and Downton fall risk screening tools are worthy of further investigation. Nursing staff clinical judgment, STRATIFY, or Morse Falls Scale could all be used as comparison instruments, though nursing staff clinical judgment had less variation in results reported in individual prospective studies and, logically, a screening tool is only of use if it improves the accuracy of prediction above that which could be achieved by the staff member without its use. Finally, further research of this nature should be undertaken in the fields of fall risk assessment in residential aged care facilities and among community-dwelling elders to guide health care practitioners in their selection of fall risk screening tools.
Received August 3, 2006
Accepted September 25, 2006