Using randomization to compare AI- and expert-generated formative assessment questions in medical education

Background: AI-generated content is being used across the education spectrum and is useful for creating complex multiple-choice questions, such as those used in medical education. However, evaluating AI-generated content is challenging, and existing testing and evaluation methods fall short. This study uses randomization to compare medical students' performance on, and subjective evaluation of, AI- versus expert-generated questions. We hypothesized that there would be no difference in student performance or subjective evaluation between AI- and expert-generated questions.

Methods: We designed a single-center, randomized study in which medical student participants received one AI- or expert-generated question per day over four weeks (28 days).
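
The paper does not publish its assignment code; purely as an illustrative sketch of the kind of per-participant daily randomization such a design implies, the Python below assigns each day's question source at random. The function name, the seeding scheme, and the 50/50 split are assumptions, not the authors' actual procedure.

```python
import random

DAYS = 28  # four weeks, one question per day

def assign_daily_sources(participant_id: str, seed: str = "study-seed") -> list[str]:
    """Randomly assign each day's question source ("ai" or "expert")
    for one participant. Seeding per participant keeps each schedule
    reproducible. Hypothetical helper, not the study's actual code."""
    rng = random.Random(f"{seed}:{participant_id}")
    return [rng.choice(["ai", "expert"]) for _ in range(DAYS)]

# Example: the first week of one (hypothetical) participant's schedule.
print(assign_daily_sources("P001")[:7])
```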

Results: Participants had similar perceptions of AI- versus expert-generated questions (p = 0.18), with no significant difference between the distributions of the proportion of correct responses. The cumulative proportion of questions answered correctly over the 28 days was consistent across the two question sets. However, participants rated 53% of AI-generated questions as very easy or easy, compared with only 31% of expert-generated questions.
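
The abstract does not name the statistical test behind p = 0.18. As a hedged sketch only, one common way to compare two distributions of per-question proportion-correct without assuming normality is a Mann-Whitney U test; the arrays below are hypothetical placeholders, not the study's data.

```python
from scipy.stats import mannwhitneyu

# Hypothetical placeholder values: proportion of participants answering
# each question correctly (one entry per question). NOT the study's data.
ai_correct     = [0.82, 0.74, 0.91, 0.66, 0.88, 0.79, 0.85]
expert_correct = [0.71, 0.80, 0.69, 0.77, 0.84, 0.63, 0.75]

# Nonparametric comparison of the two distributions.
stat, p = mannwhitneyu(ai_correct, expert_correct, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p:.3f}")
```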

Discussion: Randomization was crucial in showing that AI-generated questions were nearly indistinguishable from expert-generated questions, underscoring the need for additional evaluation methods to compare AI- and expert-generated medical education content.

https://www.tandfonline.com/doi/full/10.1080/10872981.2026.2671586