Clostridioides difficile infection (CDI) poses major clinical and operational challenges. Hospitals have both quality and economic motivations to manage CDI effectively. Universal admission screening is rarely recommended, and prior modeling efforts often relied on limited samples, overly complex feature sets, or black-box techniques. Our goal was to create models that use patient information to estimate the likelihood of a positive test with strong discrimination, clear interpretability, and a practical set of long-term health indicators. We used records from 157,493 UC San Diego Health patients seen between January 1, 2016, and July 3, 2019, who had at least 6 months of medication history. Pregnant individuals, patients under 18, and incarcerated persons were excluded. We trained Logistic Regression, Random Forest, and Ensemble models using hyperparameters tuned through 10-fold cross-validation. Performance was evaluated by the area under the receiver operating characteristic curve (AUROC). Logistic Regression coefficients were examined via odds ratios and p-values; Random Forest feature contributions were assessed using Gini importance. We also compared false-positive and false-negative predictions at selected thresholds.
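The evaluation steps described above can be illustrated with a minimal sketch, using only the Python standard library. It computes AUROC in its pairwise (Mann-Whitney) form and counts false positives and false negatives at a chosen threshold. The toy labels and scores are invented for illustration and are not study data. The abstract does not state how the Ensemble model was constructed, so the simple averaging of two models' predicted probabilities shown here is an assumption, not the authors' method.

```python
from typing import List, Sequence, Tuple


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the fraction of (positive, negative) pairs in which the
    positive case receives the higher score; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def error_counts(
    labels: Sequence[int], scores: Sequence[float], threshold: float
) -> Tuple[int, int]:
    """Return (false_positives, false_negatives) when a score at or above
    the threshold is treated as a predicted positive test."""
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    return fp, fn


def average_ensemble(a: Sequence[float], b: Sequence[float]) -> List[float]:
    """ASSUMPTION: combine two models by averaging their probabilities.
    The source does not specify the ensembling scheme."""
    return [(x + y) / 2 for x, y in zip(a, b)]


# Hypothetical toy data: 1 = positive test, 0 = negative test.
labels = [0, 0, 1, 1, 0, 1]
lr_scores = [0.10, 0.40, 0.35, 0.80, 0.20, 0.70]  # stand-in for Logistic Regression
rf_scores = [0.20, 0.30, 0.60, 0.90, 0.50, 0.40]  # stand-in for Random Forest
ens_scores = average_ensemble(lr_scores, rf_scores)

if __name__ == "__main__":
    for name, scores in [("LR", lr_scores), ("RF", rf_scores), ("Ensemble", ens_scores)]:
        fp, fn = error_counts(labels, scores, threshold=0.5)
        print(f"{name}: AUROC={auroc(labels, scores):.3f}, FP={fp}, FN={fn}")
```

The pairwise form is quadratic in the number of patients, so it is suited only to small illustrations; on a cohort of this size a rank-based implementation from an established library would be the practical choice.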
The Logistic Regression, Random Forest, and Ensemble models produced AUROCs of 0.839, 0.851, and 0.866, respectively. Variables associated with elevated risk included age, use of immunosuppressive therapies, previous antibiotic exposure, and certain gastrointestinal medications. All models demonstrated strong discrimination (AUROC >0.83). Across analytic methods, similar predictors emerged as influential, many of which are consistent with established clinical risk factors for Clostridioides difficile. These human-readable models help identify factors shaping a patient’s likelihood of a positive test and the associated infection risk.
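As an illustration of why Logistic Regression coefficients are human-readable, the sketch below converts a coefficient to an odds ratio and shows its effect on a baseline probability. The coefficient and the 10% baseline are hypothetical values chosen for the arithmetic, not estimates from this study, and the function names are illustrative.

```python
import math


def odds_ratio(coefficient: float) -> float:
    """A logistic regression coefficient is a change in log-odds per unit
    of the predictor, so exponentiating it gives the odds ratio."""
    return math.exp(coefficient)


def apply_odds_ratio(baseline_probability: float, ratio: float) -> float:
    """Scale the baseline odds by the odds ratio and convert back to a probability."""
    odds = baseline_probability / (1.0 - baseline_probability)
    new_odds = odds * ratio
    return new_odds / (1.0 + new_odds)


# Hypothetical example: a binary predictor with coefficient ln(2),
# i.e. an odds ratio of 2, applied to a 10% baseline probability.
beta = math.log(2.0)
ratio = odds_ratio(beta)
risk_with_factor = apply_odds_ratio(0.10, ratio)

if __name__ == "__main__":
    print(f"odds ratio = {ratio:.2f}, probability with factor = {risk_with_factor:.3f}")
```

An odds ratio of 2 doubles the odds, not the probability: here the probability rises from 0.10 to 2/11, or about 0.18.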