The DECIDE-AI guideline is the result of an international consensus process involving a diverse group of experts spanning a wide range of professional backgrounds and experience. The level of interest across stakeholder groups and the high response rate among the invited experts speak to the perceived need for more guidance in the reporting of studies presenting the development and evaluation of clinical AI systems and to the growing value placed on comprehensive clinical evaluation to guide implementation. The emphasis placed on the role of human-in-the-loop decision-making was guided by the Steering Group’s belief that AI will, at least in the foreseeable future, augment, rather than replace, human intelligence in clinical settings. In this context, thorough evaluation of the human–computer interaction and the roles played by the human users will be key to realizing the full potential of AI.
The DECIDE-AI guideline is the first stage-specific AI reporting guideline to be developed. This stage-specific approach echoes recognized development pathways for complex interventions1,8,9,29 and aligns conceptually with proposed frameworks for clinical AI6,30,31,32, although no commonly agreed nomenclature or definition has so far been published for the stages of evaluation in this field. Given the current state of clinical AI evaluation, and the apparent deficit in reporting guidance for the early clinical stage, the DECIDE-AI Steering Group considered it important to crystallize current expert opinion into a consensus, to help improve reporting of these studies. Besides this primary objective, the DECIDE-AI guideline will hopefully also support authors during study design, protocol drafting and study registration, by providing them with clear criteria around which to plan their work. As with other reporting guidelines, the overall effect on the standard of reporting will need to be assessed in due course, once the wider community has had a chance to use the checklist and explanatory documents; this real-world use is likely to prompt modification and fine-tuning of the DECIDE-AI guideline. Although the outcome of this process cannot be pre-judged, there is evidence that the adoption of consensus-based reporting guidelines (such as CONSORT) does, indeed, improve the standard of reporting33.
The Steering Group paid special attention to the integration of DECIDE-AI within the broader scheme of AI guidelines (for example, TRIPOD-AI, STARD-AI, SPIRIT-AI and CONSORT-AI). It also focused on DECIDE-AI being applicable to all types of decision support modalities (that is, detection, diagnostic, prognostic and therapeutic). The final checklist should be considered as minimum scientific reporting standards and does not preclude reporting additional information, nor are the standards a substitute for other regulatory reporting or approval requirements. The overlap between scientific evaluation and regulatory processes was a core consideration during the development of the DECIDE-AI guideline. Early-stage scientific studies can be used to inform regulatory decisions (for example, based on the stated intended use within the study) and are part of the clinical evidence generation process (for example, clinical investigations). The initial item list was aligned with information commonly required by regulatory agencies, and regulatory considerations are introduced in the E&E paragraphs. However, given the somewhat different focuses of scientific evaluation and regulatory assessment34, as well as differences between regulatory jurisdictions, it was decided to make no reference to specific regulatory processes in the guideline, nor to define the scope of DECIDE-AI within any particular regulatory framework. The primary focus of DECIDE-AI is scientific evaluation and reporting, for which regulatory documents often provide little guidance.
Several topics led to more intense discussion than others, both during the Delphi process and the Consensus Group discussion. Regardless of whether the corresponding items were included, these represent important issues that the AI and healthcare communities should consider and continue to debate. First, we discussed at length whether users (see glossary of terms) should be considered as study participants. The consensus reached was that users are a key study population, about whom data will be collected (for example, reasons for variation from the AI system recommendation and user satisfaction), and who might logically be consented as study participants and, therefore, should be considered as such. Because user characteristics (for example, experience) can affect intervention efficacy, both patient and user variability should be considered when evaluating AI systems and reported adequately.
Second, the relevance of comparator groups in early-stage clinical evaluation was considered. Most studies retrieved in the literature search described a comparator group (commonly the same group of clinicians without AI support). Such comparators can provide useful information for the design of future large-scale trials (for example, information on the potential effect size). However, comparator groups are often unnecessary at this early stage of clinical evaluation, when the focus is on issues other than comparative efficacy. Small-scale clinical investigations are also usually underpowered to draw statistically robust conclusions about efficacy once both patient and user variability are accounted for. Moreover, the additional information gained from comparator groups in this context can often be inferred from other sources, such as previous data on unassisted standard of care in the case of the expected effect size. Comparison groups are, therefore, mentioned in item VII but considered optional.
Third, output interpretability is often described as important to increase user and patient trust in the AI system, to contextualize the system’s outputs within the broader clinical information environment19 and potentially for regulatory purposes35. However, some experts argued that an output’s clinical value may be independent of its interpretability and that the practical relevance of evaluating interpretability is still debatable36,37. Furthermore, there is currently no generally accepted way of quantifying or evaluating interpretability. For this reason, the Consensus Group decided not to include an item on interpretability at the current time.
Fourth, the notion of users’ trust in the AI system and its evolution over time were discussed. As users accumulate experience with, and receive feedback from, the real-world use of AI systems, they will adapt their level of trust in the systems’ recommendations. Whether appropriate or not, this level of trust will influence, as recently demonstrated by McIntosh et al.38, how much effect the system has on the final decision-making and, therefore, the overall clinical performance of the AI system. Understanding how trust evolves is essential for planning user training and determining the optimal timepoints at which to start data collection in comparative trials. However, as for interpretability, there is currently no commonly accepted way to measure trust in the context of clinical AI. For this reason, the item about user trust in the AI system was not included in the final guideline. The fact that interpretability and trust were not included highlights the tendency of consensus-based guideline development toward conservatism, because only widely agreed-upon concepts reach the level of consensus needed for inclusion. However, changes of focus in the field, as well as new methodological developments, can be integrated into subsequent guideline iterations. From this perspective, the issues of interpretability and trust are far from irrelevant to future AI evaluations, and their exclusion from the current guideline reflects less a lack of interest than a need for further research into how these concepts can best be operationalized as metrics for the evaluation of AI systems.
Fifth, the notion of modifying the AI system (the intervention) during the evaluation received mixed opinions. During comparative trials, changes made to the intervention during data collection are questionable unless they are part of the study protocol; some authors even consider them impermissible, on the basis that they would make valid interpretation of study results difficult or impossible. However, the objective of early clinical evaluation is often not to draw definitive conclusions on effectiveness. Iterative design–evaluation cycles, if performed safely and reported transparently, offer opportunities to tailor an intervention to its users and beneficiaries and to increase the chances of adoption of an optimized, fixed version during later summative evaluation8,9,39,40.
Sixth, several experts noted the benefit of conducting human factors evaluation before clinical implementation and considered that, therefore, human factors should be reported separately. However, even robust preclinical human factors evaluation will not reliably characterize all the potential human factors issues that might arise during the use of an AI system in a live clinical environment, warranting a continued human factors evaluation at the early stage of clinical implementation. The Consensus Group agreed that human factors play a fundamental role in AI system adoption in clinical settings at scale and that the full appraisal of an AI system’s clinical utility can happen only in the context of its clinical human factors evaluation.
Finally, several experts raised concerns that the DECIDE-AI guideline prescribes an evaluation that is too exhaustive to be reported within a single manuscript. The Consensus Group acknowledged the breadth of topics covered and the practical implications. However, reporting guidelines aim to promote transparent reporting of studies rather than mandating that every aspect covered by an item must have been evaluated within the studies. For example, if a learning curve evaluation has not been performed, then fulfilment of item 14b would be simply to state that this was not done, with an accompanying rationale. The Consensus Group agreed that appropriate AI evaluation is a complex endeavour necessitating the interpretation of a wide range of data, which should be presented together as far as possible. It was also felt that thorough evaluation of AI systems should not be limited by a word count and that publications reporting on such systems might benefit from special formatting requirements in the future. The information required by several items might already be reported in previous studies or in the study protocol, which could be cited rather than described in full again. The use of references, online supplementary materials and open-access repositories (for example, the Open Science Framework (OSF)) is recommended to allow the sharing and connecting of all required information within one main published evaluation report.
Our work has several limitations that should be considered. First, potential biases, which apply to any consensus process, must be acknowledged. These include anchoring and participant selection biases41. The research team tried to mitigate bias through the survey design, using open-ended questions analyzed through a thematic analysis, and by adapting the expert recruitment process, but it is unlikely that bias was eliminated entirely. Despite an aim for geographical diversity and several actions taken to foster it, representation was skewed toward Europe and, more specifically, the United Kingdom. This could be explained, in part, by the following factors: a likely selection bias in the Steering Group’s expert recommendations; a higher interest in our open invitation to contribute coming from European/United Kingdom scientists (25 of 30 experts approaching us, 83%); and a lack of control over the response rate and self-reported geographical location of participating experts. Considerable attention was also paid to diversity and balance among stakeholder groups, even though clinicians and engineers were the most represented, partly owing to the profile of researchers who contacted us spontaneously after the public announcement of the project. Stakeholder group analyses were performed to identify any marked disagreements from underrepresented groups. Finally, as also noted by the authors of the SPIRIT-AI and CONSORT-AI guidelines25,26, few examples of studies reporting on the early-stage clinical evaluation of AI tools were available at the time that we started developing the DECIDE-AI guideline. This might have affected the exhaustiveness of the initial item list created from the literature review. However, the wide range of stakeholders involved and the design of the first Delphi round allowed identification of several additional candidate items, which were added in the second iteration of the item list.
The introduction of AI into healthcare needs to be supported by sound, robust and comprehensive evidence generation and reporting. This is essential both to ensure the safety and efficacy of AI systems and to gain the trust of patients, practitioners and purchasers, so that this technology can realize its full potential to improve patient care. The DECIDE-AI guideline aims to improve the reporting of early-stage live clinical evaluation of AI systems, which lays the foundations for both larger clinical studies and later widespread adoption.