Researchers Explore AI Analysis of WGS Data for Foodborne Illness Source Attribution

In a recent study published in the U.S. Centers for Disease Control and Prevention’s (CDC’s) Emerging Infectious Diseases, researchers from CDC, the U.S. Food and Drug Administration (FDA), and the U.S. Department of Agriculture (USDA) trained a model to use whole-genome sequencing (WGS) data to estimate the food source attribution of Salmonella illnesses. The model demonstrated a promising level of accuracy, and when tested against a dataset of isolates with unknown sources, predicted that more than 33 percent of isolates from human salmonellosis cases originated from chicken, while 27 percent were from vegetables.
The purpose of the study was to explore whether machine learning algorithms, trained on WGS data from foodborne pathogen surveillance and foodborne illness cases, can attribute human infections by foodborne pathogens to their sources. The researchers used a random forest algorithm.
To train the model, the researchers compiled data for Salmonella isolates from food, or from cecal samples taken from food animals at slaughter, that were available in the National Institutes of Health’s (NIH’s) National Center for Biotechnology Information (NCBI), as well as additional related metadata from USDA’s Food Safety and Inspection Service (USDA-FSIS), FDA, and CDC. The researchers then manually identified isolates that could be definitively attributed to one of 15 food categories used by the Interagency Food Safety Analytics Collaboration (IFSAC) scheme for foodborne illness source attribution estimates. In total, the model was trained with 18,661 Salmonella isolates.
When asked to attribute the food sources for the database of known Salmonella isolates, the overall accuracy of the model using all isolate predictions was 81 percent. The model performed best for chicken isolates (95 percent accuracy) and also performed well for other common sources, such as turkey (88 percent), pork (83 percent), vegetables (82 percent), and beef (77 percent). However, the model was not as accurate for less common sources of Salmonella, such as game (10 percent), dairy (29 percent), and other meat (39 percent). The overall accuracy of the model increased to 91 percent when the researchers retained only its confident predictions; specifically, the 14,888 isolates for which the maximum predicted probability of an isolate originating from a single source category was greater than or equal to 50 percent.
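The confidence filter described above can be illustrated with a short sketch. This is not the study's code: the function names, the toy records, and the three-category probability tables are hypothetical, and only the rule itself (retain a prediction when its maximum predicted probability is at least 50 percent) comes from the article.

```python
# Illustrative sketch only: toy data and hypothetical names, not the study's code.
from typing import Dict, List, Tuple

THRESHOLD = 0.5  # from the article: retain predictions with max probability >= 50 percent


def top_prediction(probs: Dict[str, float]) -> Tuple[str, float]:
    """Return the category with the highest predicted probability and that probability."""
    category = max(probs, key=probs.get)
    return category, probs[category]


def accuracy(pairs: List[Tuple[str, str]]) -> float:
    """Fraction of (true source, predicted source) pairs that match."""
    if not pairs:
        return 0.0
    return sum(true == pred for true, pred in pairs) / len(pairs)


def evaluate(records, threshold: float = THRESHOLD):
    """Compute accuracy over all predictions and over confident predictions only."""
    all_pairs = []
    confident_pairs = []
    for true_source, probs in records:
        predicted, p_max = top_prediction(probs)
        all_pairs.append((true_source, predicted))
        if p_max >= threshold:
            confident_pairs.append((true_source, predicted))
    return accuracy(all_pairs), accuracy(confident_pairs), len(confident_pairs)


# Toy records: (known source, predicted probabilities per category). Not study data.
records = [
    ("chicken", {"chicken": 0.90, "turkey": 0.05, "vegetables": 0.05}),
    ("turkey", {"chicken": 0.20, "turkey": 0.70, "vegetables": 0.10}),
    ("vegetables", {"chicken": 0.45, "turkey": 0.15, "vegetables": 0.40}),
    ("game", {"chicken": 0.40, "turkey": 0.35, "vegetables": 0.25}),
]

overall, confident, n_confident = evaluate(records)
print(f"all predictions: {overall:.0%}")
print(f"confident predictions: {confident:.0%} ({n_confident} isolates retained)")
```

In this toy example, the two low-confidence predictions are also the two wrong ones, so dropping them raises accuracy. The same mechanism would explain why accuracy in the study rose when only confident predictions were kept.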
The model was also tested against 6,470 Salmonella clinical isolates with unknown source of illness submitted to CDC’s Foodborne Diseases Active Surveillance Network (FoodNet) database from 2014 to 2017. Chicken and vegetables were the most common predicted sources of salmonellosis, which is consistent with previous analyses; however, the model found chicken to be linked to a substantially higher percentage of illnesses (46 percent) than in recent attribution estimates for poultry products based on outbreak data (17 percent). The difference may arise because outbreak data reflect only foods as consumed, whereas the model analyzes data from earlier points in the farm-to-fork continuum, or because the risks associated with outbreaks differ from those associated with sporadic infections.
Before the dataset was adjusted to include only isolates attributed to a single category with 50 percent or greater probability, the model estimated that 34 percent and 30 percent of human salmonellosis cases were attributable to chicken and vegetables, respectively. After the dataset was adjusted to retain only confident model predictions, approximately 73 percent of salmonellosis cases were attributed to chicken and vegetables together. Additionally, cases were attributed to all 15 modeled categories, indicating that illnesses likely arise from various sources, which is consistent with the most recent IFSAC foodborne illness source attribution report.
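To show how attribution percentages can shift once low-confidence predictions are dropped, here is a minimal sketch that tallies the top predicted category per isolate, with and without a 50 percent threshold. The function name `attribution_shares`, the three categories, and the probabilities are hypothetical toy values, and the study's own estimation procedure may differ from this simple tally.

```python
# Illustrative sketch only: toy data and hypothetical names, not the study's method.
from collections import Counter
from typing import Dict, List


def attribution_shares(prob_rows: List[Dict[str, float]], threshold: float = 0.0) -> Dict[str, float]:
    """Percentage of retained isolates assigned to each top-predicted category.

    An isolate is retained when its maximum predicted probability is >= threshold.
    """
    counts = Counter()
    for probs in prob_rows:
        category = max(probs, key=probs.get)
        if probs[category] >= threshold:
            counts[category] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: 100.0 * n / total for category, n in counts.items()}


# Toy predicted probabilities for five clinical isolates of unknown source.
clinical = [
    {"chicken": 0.80, "vegetables": 0.10, "pork": 0.10},
    {"chicken": 0.10, "vegetables": 0.70, "pork": 0.20},
    {"chicken": 0.60, "vegetables": 0.30, "pork": 0.10},
    {"chicken": 0.30, "vegetables": 0.30, "pork": 0.40},
    {"chicken": 0.20, "vegetables": 0.45, "pork": 0.35},
]

all_shares = attribution_shares(clinical)                       # every prediction counted
confident_shares = attribution_shares(clinical, threshold=0.5)  # confident predictions only

print("all predictions:", all_shares)
print("confident only:", confident_shares)
```

Here the filter removes the isolates whose predictions were spread across several categories, so the remaining cases concentrate in the categories the model predicts with high probability. That is one plausible reason the combined share for chicken and vegetables is larger after filtering.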
The model also estimated the serotypes that cause human illnesses most frequently for different food categories. Specifically, chicken was the most common estimated source of S. Enteritidis, S. Typhimurium, S. Heidelberg, and S. Infantis; pork was the most common source of S. 4,[5],12:i:-; and vegetables were the most common source of S. Javiana and S. Newport.
Overall, the researchers believe that algorithmic analysis of WGS data shows promising utility for foodborne illness source attribution. With further research, models similar to the one tested in the present study could be leveraged with existing genomic surveillance systems to support source identification in outbreak investigations and to help inform regulatory priorities.
The study was led by Erica Billig Rose, Ph.D., an epidemiologist in the Predict Division of CDC’s Center for Forecasting and Analytics; and Molly K. Steele, Ph.D., M.Sc., M.P.H., an epidemiologist in the Division of Foodborne, Waterborne, and Environmental Diseases in CDC’s National Center for Emerging and Zoonotic Infectious Diseases.