OBJECTIVES This study aims to review, critically, the suitability of Kirkpatrick's levels for appraising interventions in medical education, to review empirical evidence of their application in this context, and to explore alternative ways of appraising research evidence. METHODS The mixed methods used in this research included a narrative literature review, a critical review of theory and qualitative empirical analysis, conducted within a process of cooperative inquiry. RESULTS Kirkpatrick's levels, introduced to evaluate training in industry, involve so many implicit assumptions that they are suitable for use only in relatively simple instructional designs, short-term endpoints and beneficiaries other than learners. Such conditions are met by perhaps one-fifth of medical education evidence reviews. Under other conditions, the hierarchical application of the levels as a critical appraisal tool adds little value and leaves reviewers to make global judgements of the trustworthiness of the data. CONCLUSIONS Far from defining a reference standard critical appraisal tool, this research shows that 'quality' is defined as much by the purpose to which evidence is to be put as by any invariant and objectively measurable quality. Pending further research, we offer a simple way of deciding how to appraise the quality of medical education research.