According to Swedish and Dutch experts, AI programs designed to grade prostate cancer (PCa) aggressiveness often suffer from a bias problem: the developers who train an algorithm on a particular set of data are also the people who evaluate its accuracy. Where’s the objectivity in that method? When applied to cases outside their own system, the algorithm’s predictions may be invalid.
In articulating this concern, Bulten et al. (2022) write, “This can result in algorithms that perform poorly outside the cohorts used for their development. Moreover, shortcomings in validating the algorithms’ performance on additional cohorts may lead to such deficiencies in generalization going unnoticed.”[i] A process is therefore needed to weed out poorly performing or biased algorithms while at the same time developing highly accurate AI programs for grading PCa.
Who doesn’t love a competition?
Competition has a role to play in healthcare. Economic competition, for instance, is a key driver in reducing costs, though that kind of organic cost reduction is not the result of a formal contest. Formal competitions, however, have been shown to promote medical innovation in devices, pharmaceuticals, imaging, and more. For example, MedTech Innovator, a nonprofit, is transforming healthcare through its annual global competition and the visibility it brings. Its mission is to improve human health by bringing new products to market. There is no cost to apply, and its 420 alumni companies have received billions of dollars in equity funding. Their videos hold numerous inspiring stories.
Recognizing how much vitality a competition can rouse, the Swedish/Dutch research team designed and implemented the PANDA (Prostate cANcer graDe Assessment) challenge in 2020 to accelerate the development of valid, reproducible, high-quality PCa grading algorithms. It was “the largest histopathology competition to date, joined by 1,290 developers” from 65 countries. The researchers assembled an international collection of 12,625 whole-slide digitized prostate biopsy images spanning different patient populations and laboratories. From these, four data sets were generated:
- Development set (10,616 biopsies to develop grading algorithms during the 3-month competition, Apr. 21 – Jul. 23)
- Tuning set (393 slides for performance evaluation during the competition phase)
- Internal validation set (545 slides during the post-competition phase)
- External validation set (1,071 slides during the post-competition phase)
As a reference standard, the slides were independently reviewed and graded by experienced pathologists.
The developers were all given the development set to train their AI grading algorithms, and each developer/team submitted at least one algorithm. Throughout the competition phase, teams were allowed to test their work on the tuning set, which let them improve their algorithms before final submission. The final algorithms were then simultaneously and blindly validated on the internal validation set. All told, more than 34,000 algorithm versions were submitted for evaluation over the course of the competition, and with fine-tuning, scores (algorithm agreement with pathologists) quickly improved.
As the authors describe, “The first team to achieve an agreement with the uropathologists of >0.90 on the internal validation set already occurred within the first 10 days of the competition. In the 33rd day of the competition, the median performance of all teams exceeded a score of 0.85.” At the end, 8 of the 10 highest-ranking teams were chosen to form the PANDA consortium. Each was asked to share “all data and code necessary for reproducing the exact version of their algorithm that resulted in the final competition submission.”
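Agreement scores of this kind for ordinal grading tasks are typically computed as a quadratically weighted Cohen’s kappa, which penalizes distant disagreements (say, grading a benign slide as high-grade cancer) far more than near-misses, with 1.0 meaning perfect agreement and 0.0 meaning chance-level agreement. A minimal sketch, using hypothetical grade data rather than anything from the study:

```python
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred, n_classes=6):
    """Quadratically weighted Cohen's kappa between two ordinal graders
    (e.g., ISUP grade groups 0-5)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # Observed confusion matrix between the two graders
    O = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        O[t, p] += 1
    # Quadratic disagreement weights: distant grades cost more
    idx = np.arange(n_classes)
    W = (idx[:, None] - idx[None, :]) ** 2 / (n_classes - 1) ** 2
    # Expected confusion matrix if the two graders were independent
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / O.sum()
    return 1 - (W * O).sum() / (W * E).sum()

# Hypothetical illustration: one prediction is a single grade off
truth = [0, 1, 2, 3, 4, 5, 2, 3]
preds = [0, 1, 2, 3, 4, 5, 3, 3]
print(round(quadratic_weighted_kappa(truth, preds), 3))  # prints 0.972
```

Note how a single one-grade disagreement barely dents the score; the quadratic weighting reserves its penalties for gross errors, which is why it suits Gleason-style grading.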
Comparing pathologists and AI
Thanks to the element of competition, AI-based grading of PCa tissue slides improved rapidly over a 3-month period, after which objective validation and reproducibility testing were completed. The study demonstrated high levels of agreement with experienced pathologists. In one respect, the AI’s performance was superior: the algorithms averaged 96.4% sensitivity for tumor vs. 91.9% for pathologists. On the other hand, the AI averaged 75% specificity vs. 95% for pathologists. Yet on the whole, “the algorithms missed 1.9% of cancers, whereas the pathologists missed 7.3%.”
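For readers unfamiliar with the two metrics: sensitivity is the fraction of tumor-containing biopsies correctly flagged as cancer, and specificity is the fraction of benign biopsies correctly left unflagged. A minimal sketch, using hypothetical case counts chosen to reproduce the AI averages above (the study’s actual counts are in the paper):

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Binary tumor screening metrics.
    tp = tumor biopsies flagged, fn = tumor biopsies missed,
    tn = benign biopsies cleared, fp = benign biopsies flagged."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity

# Hypothetical: 1,000 biopsies, 500 with tumor and 500 benign
sens, spec = sensitivity_specificity(tp=482, fn=18, tn=375, fp=125)
print(f"sensitivity {sens:.1%}, specificity {spec:.1%}")
# prints: sensitivity 96.4%, specificity 75.0%
```

The trade-off in the article falls out directly: the AI misses very few cancers (high sensitivity) at the cost of flagging more benign tissue for review (lower specificity), which is exactly the profile you want in a safety net that a pathologist double-checks.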
The research team concluded that “a diverse set of submitted algorithms reached pathologist-level performance on independent cross-continental cohorts.” The PANDA consortium is to be congratulated on advancing AI algorithms for grading PCa. One of the study’s authors, Professor Lars Egevad of Sweden’s Karolinska Institute, articulates its importance: “The idea is not for AI to replace human experts, but rather to function as a safety net to avoid pathologists missing cancer cases and to help in standardising the assessments. AI can also be an option in those parts of the world that today completely lack pathology expertise.”[ii]
In addition, the PANDA consortium is making its own contribution to the field: the development set of digitized slides, as well as the code for the algorithms, will be publicly available to all interested developers.
NOTE: This content is solely for purposes of information and does not substitute for diagnostic or medical advice. Talk to your doctor if you are experiencing pelvic pain, or have any other health concerns or questions of a personal medical nature.
[i] Bulten W, Kartasalo K, Chen PC, Ström P, et al. Artificial intelligence for diagnosis and Gleason grading of prostate cancer: the PANDA challenge. Nat Med. 2022 Jan;28(1):154-163.
[ii] “Study shows AI systems can accurately identify and grade prostate cancer.” NewsMedical, Jan. 13, 2022. https://www.news-medical.net/news/20220113/Study-shows-AI-systems-can-accurately-identify-and-grade-prostate-cancer.aspx