Comparative analysis of AI tools for TB detection: performance variability and call for standardization

A retrospective South African study evaluated 12 commercially available AI tools for detecting tuberculosis (TB) on chest X-rays. The study included 774 participants: 258 bacteriologically positive and 516 bacteriologically negative. The AI products, including CAD4TB, ChestEye, InferRead DR Chest, Lunit INSIGHT CXR, and qXR (all listed on Health AI Register), were assessed on how well they identify TB-associated abnormalities in a setting with high TB and HIV prevalence.

Performance varied among the tools. Lunit INSIGHT CXR achieved the highest accuracy, with an AUC of 0.902, followed by qXR, InferRead DR Chest, CAD4TB, and ChestEye, each with an AUC between 0.8 and 0.9. Sensitivity and specificity also differed across tools, and some maintained high sensitivity (>90%) across a broad range of thresholds. The performance discrepancies were attributed to differing neural network architectures and to subgroup biases, such as reduced accuracy in older individuals and in those with a history of tuberculosis. The study concluded that these AI tools can improve TB screening, particularly in resource-limited settings, and emphasized the need for continuous validation and context-specific threshold adjustment.

Computer-aided detection of tuberculosis from chest radiographs in a tuberculosis prevalence survey in South Africa: external validation and modelled impacts of commercially available artificial intelligence software

The Lancet Digital Health, 2024

Summary

Background

Computer-aided detection (CAD) can help identify people with active tuberculosis left undetected. However, few studies have compared the performance of commercially available CAD products for screening in high tuberculosis and high HIV settings, and there is poor understanding of threshold selection across products in different populations. We aimed to compare CAD products' performance, with further analyses on subgroup performance and threshold selection.

Methods

We evaluated 12 CAD products on a case–control sample of participants from a South African tuberculosis prevalence survey. Only those with microbiological test results were eligible. The primary outcome was a comparison of products' accuracy using the area under the receiver operating characteristic curve (AUC) against microbiological evidence. Threshold analyses were performed based on pre-defined criteria and across all thresholds. We conducted subgroup analyses by age, gender, HIV status, previous tuberculosis history, presence of symptoms, and current smoking status.
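To make the primary outcome concrete, the following is a minimal sketch of how an AUC can be computed from a product's abnormality scores and a binary microbiological reference label. It uses the rank-based (Mann-Whitney) definition of AUC and is not the study's analysis code. The function name `auc_from_scores` and the toy scores and labels are hypothetical and are not study data.

```python
def auc_from_scores(scores, labels):
    """AUC as the probability that a randomly chosen reference-positive
    case scores higher than a randomly chosen reference-negative case.
    Ties count as half a win. labels: 1 = positive, 0 = negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data (not from the study): six scores, three per class.
toy_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
toy_labels = [1, 1, 1, 0, 0, 0]
toy_auc = auc_from_scores(toy_scores, toy_labels)
print(f"toy AUC = {toy_auc:.3f}")
```

The pairwise loop is quadratic in sample size, which is acceptable for a sample of a few hundred radiographs. A sorting-based rank computation would give the same value on larger datasets.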

Findings

Of the 774 people included, 516 were bacteriologically negative and 258 were bacteriologically positive. Diverse accuracy was noted: Lunit and Nexus had AUCs near 0·9, followed by qXR, JF CXR-2, InferRead, Xvision, and ChestEye (AUCs 0·8–0·9). XrayAME, RADIFY, and TiSepX-TB had AUCs under 0·8. Thresholds varied notably across these products and across different versions of the same product. Certain products (Lunit, Nexus, JF CXR-2, and qXR) maintained high sensitivity (>90%) across a wide threshold range while reducing the number of individuals requiring confirmatory diagnostic testing. All products generally performed worst in older individuals, people with previous tuberculosis, and people with HIV. Variations in thresholds, sensitivity, and specificity existed across groups and settings.

Interpretation

Several previously unevaluated products performed similarly to those evaluated by WHO. Thresholds differed across products and demographic subgroups. The rapid emergence of products and versions necessitates a global strategy to validate new versions and software to support CAD product and threshold selections.