Discussion of The Problem of False Discoveries: How to Balance Objectives
2009 IES Research Conference
David Judkins
Westat

I would like to commend the authors on their fine work. I found nothing to disagree with.
I would like to spend my time talking about the nature of confirmatory versus exploratory analysis, how to group outcomes, how to drill down, and the utility of single dimensional summaries of multi-dimensional outcomes.
Thanks to Andrea Piesse of Westat for valuable comments.
Of course, my remarks are personal and do not necessarily reflect Westat policies.

G.E.P. Box
I haven’t read any work by him directly on multiple comparisons or false discovery control.
But he has written elegantly about the nature of discovery and the use of statistics in that process.
An understanding of his work will help researchers distinguish between exploratory and confirmatory analysis in their own work.

Statistics for Discovery
Box, 2001, Journal of Applied Statistics
Based on his 2000 Deming Lecture
Knowledge development is an iterative process
Alternates between induction and deduction
In the inductive phase, we use new data to improve current models
In the deductive phase, we design and conduct experiments to test the logical consequences of the improved models

Long History
Francis Bacon discussed the iterative nature of knowledge development at the beginning of the Age of Enlightenment.
Steve Stigler told Box that Bishop Robert Grosseteste, one of the founders of Oxford University in the 1200s, also talked about this idea and attributed it to Aristotle.

Box’s Illustration
Model: Today is like every day.
Deduction: My car will be in my parking space.
Data: It isn’t!
Induction: Someone must have taken it.

Box’s Illustration (2)
Model: My car has been stolen.
Deduction: My car will not be in the parking lot.
Data: No. It’s over there!
Induction: Someone took it and brought it back.

Box’s Illustration (3)
Model: A thief took it and brought it back.
Deduction: My car will be broken into.
Data: No. It’s unharmed and locked!
Induction: Someone who had a key took it.

Box’s Illustration (4)
Model: My wife used my car.
Deduction: She has probably left me a note.
Data: Yes. Here it is!

Box on Judge versus Detective
In the trial, there is a judge and jury before whom, under very strict rules, all the evidence must be brought together at one time and the jury must decide, whether the hypothesis of innocence can be rejected beyond all reasonable doubt. This is very much like a statistical test.

Box on Judge versus Detective (2)
However, the apprehension of the defendant by a detective will have been conducted by a very different process. … The approach of the detective closely parallels that of the scientific investigator.

Fitting Randomized Trials into this Paradigm
“Randomized trials” is, I believe, the name favored in education research for experiments.
Much of the tradition for how to run them and analyze them comes from the fields of medical interventions, devices and pharmaceuticals, where, of course, they are known as randomized clinical trials.
What aspects of that tradition are appropriate in education research?

Regulatory Role of CRTs
I think that much of the tradition has arisen from the regulatory role of CRTs.
The FDA panels are much like Box’s juries, and the FDA administrators like Box’s judges.
Of course, there is a huge set of investigators at the drug companies working to synthesize new drugs and to develop new devices.
But there is a severe administrative and legal separation between the two operations.

Education Researchers Wear Both Hats
So when are we acting like judges and when like investigators?
When like the FDA and when like the drug company development arms?
This determines to a large extent whether formal control over family-wise error rates is appropriate and thus whether adjustments must be made for multiple comparisons.

Enshrinement
I would say that we should treat an analysis as a confirmatory analysis in the language of Schochet and Deke if there is a good chance that the findings will become accepted knowledge for years to come.
I also think that there is a fairly strong danger of exploratory analyses being mistaken for confirmatory, so I urge very clear language in the caveats of exploratory analyses.

What Works Clearinghouse
The title suggests that all the guidance to be found is very solid and reliable.
Thus, I think that requiring FWER control for entry into WWC is very appropriate.
But then how do we facilitate the induction phase?
How do we work to improve the models that for the most part are still very primitive in education research?

What Might Work Clearinghouse
Report all the findings from randomized trials with no concern about FWER?
Also, report findings from poorly controlled observational studies?
A resource for experimenters, not for implementers

Grouping
Peter and John mention that grouping outcomes is a powerful way of mitigating the multiple comparison problem.
But how to form them?
In education research, there is a strong urge to treat each assessment as a separate domain.
Are receptive and expressive vocabulary skills really separate domains?

Sources of Resistance to Grouping
Maybe a sense that they want to be doing investigative work rather than judging work?
Pressure from test publishers to see results for their assessments presented separately?

Post-Peek Grouping
What happens when we use conditional grouping rules?
Let X and Y be two outcome variables, and let Z be the average of the two.
Suppose we estimate the effect of T on Z only if we first find that the effects of T on X and Y are not statistically different from each other.
Otherwise, we publish the effects of T on X and T on Y separately, with multiple comparison adjustment.

Post-Peek Grouping (2)
The math is complex.
Preliminary simulations hint, however, that the procedure is too liberal, failing to provide FWER control.
It is also not clear how to generalize to more than two outcomes.

Drilling Down
If there is a significant effect on the composite domain outcome, then there is natural interest in the components.
I think this falls under the rubric of exploratory analysis done to facilitate the induction phase of knowledge building.
If FWER control is attempted for the drilldown, the resampling methods would certainly appear best suited given the strong correlations.

Multi-Domain Outcome Indices
Not every summary measure needs to be built up from a set of correlated items around the same latent construct.
Think of the quality-of-life indices published for cities around the world.
Educational and developmental progress is multidimensional, but that does not mean that every dimension needs to be reported separately.
We should not insist that all outcome measures have high reliability for uni-dimensional latent variables.
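
Post-Peek Grouping: a simulation sketch
The conditional rule from the Post-Peek Grouping slides can be checked by simulation. The sketch below is one possible setup, not the preliminary simulations mentioned there. It assumes T is a two-arm treatment indicator, a global null (no effect on X or Y), unit-variance normal outcomes with correlation rho, large-sample z-tests, and a Bonferroni adjustment on the separate-outcome branch. All function names, sample sizes, and defaults are hypothetical choices made for illustration.

```python
import random
from statistics import NormalDist, fmean, variance


def z_stat(treated, control):
    """Large-sample z statistic for a difference in means between two arms."""
    se = (variance(treated) / len(treated) + variance(control) / len(control)) ** 0.5
    return (fmean(treated) - fmean(control)) / se


def simulate_arm(n, rho, rng):
    """Draw n (X, Y) pairs with unit variances and correlation rho (0 <= rho <= 1).

    No treatment effect is added, so both arms come from the same distribution.
    """
    xs, ys = [], []
    for _ in range(n):
        shared = rng.gauss(0.0, 1.0)  # component common to X and Y
        xs.append(rho ** 0.5 * shared + (1 - rho) ** 0.5 * rng.gauss(0.0, 1.0))
        ys.append(rho ** 0.5 * shared + (1 - rho) ** 0.5 * rng.gauss(0.0, 1.0))
    return xs, ys


def post_peek_rejects(n, rho, alpha, rng):
    """Run the conditional grouping rule once; return True if any null is rejected."""
    xt, yt = simulate_arm(n, rho, rng)  # treatment arm
    xc, yc = simulate_arm(n, rho, rng)  # control arm
    crit = NormalDist().inv_cdf(1 - alpha / 2)       # unadjusted two-sided cutoff
    crit_bonf = NormalDist().inv_cdf(1 - alpha / 4)  # Bonferroni cutoff for two tests

    # Peek: the difference between the two effects equals the effect on D = X - Y.
    dt = [x - y for x, y in zip(xt, yt)]
    dc = [x - y for x, y in zip(xc, yc)]
    if abs(z_stat(dt, dc)) <= crit:
        # Effects not distinguishable: test the composite Z = (X + Y) / 2 only.
        zt = [(x + y) / 2 for x, y in zip(xt, yt)]
        zc = [(x + y) / 2 for x, y in zip(xc, yc)]
        return abs(z_stat(zt, zc)) > crit
    # Effects differ: test X and Y separately with the adjusted cutoff.
    return abs(z_stat(xt, xc)) > crit_bonf or abs(z_stat(yt, yc)) > crit_bonf


def estimate_fwer(reps=2000, n=100, rho=0.5, alpha=0.05, seed=0):
    """Share of simulated trials with at least one false rejection."""
    rng = random.Random(seed)
    hits = sum(post_peek_rejects(n, rho, alpha, rng) for _ in range(reps))
    return hits / reps


if __name__ == "__main__":
    print("Estimated FWER under the global null:", estimate_fwer())
```

Comparing the printed rate with alpha across several values of rho and n gives a rough sense of whether the rule holds its nominal level in this simplified setting. Partial nulls, where T affects one outcome but not the other, would need a separate design.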