Machine Learning Model Could Transform Hepatitis B Screenings

Author(s):

The machine learning model outperformed a comparison logistic regression model in identifying high-risk patients for HBV.

Nathan S. Ramrakhiani

While universal screening for hepatitis B virus (HBV) is useful in areas of high prevalence, it is less cost-effective in areas like the US, where prevalence is low.

However, utilizing machine learning technology could help identify high-risk individuals.

A team led by Nathan S. Ramrakhiani, Division of Gastroenterology and Hepatology, Stanford University Medical Center, identified patients with HBV using newly developed logistic regression and machine learning models that leverage demographic data from a population-based dataset.

Chronic HBV infection affects more than 290 million people worldwide, yet only 10% of those individuals have been officially diagnosed. Even though guidelines have called for screening high-risk individuals for the last 2 decades, cases remain underreported.

Demographic Data

In the study, the investigators identified patients with data on hepatitis B surface antigen (HBsAg), birth year, sex, race and ethnicity, and birthplace using 10 cycles of the National Health and Nutrition Examination Survey (NHANES) from 1999 to 2018. The median birth year for the patient population was 1973.

The participants were divided into 2 separate cohorts: training (cycles 2, 3, 5, 6, 8, 10; n = 39,119) and validation (cycles 1, 4, 7, 9; n = 21,569).
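The cycle-based split can be sketched in Python as follows. This is an illustration only, not the investigators' code: the cycle numbers come from the study description above, while the `cycle` record key and the function names are hypothetical.

```python
# Cycle assignments as reported for the two cohorts.
TRAINING_CYCLES = {2, 3, 5, 6, 8, 10}
VALIDATION_CYCLES = {1, 4, 7, 9}


def assign_cohort(cycle: int) -> str:
    """Return 'training' or 'validation' for an NHANES cycle index (1-10)."""
    if cycle in TRAINING_CYCLES:
        return "training"
    if cycle in VALIDATION_CYCLES:
        return "validation"
    raise ValueError(f"unknown cycle: {cycle}")


def split_by_cycle(records):
    """Partition records (dicts with a hypothetical 'cycle' key) by cohort."""
    cohorts = {"training": [], "validation": []}
    for record in records:
        cohorts[assign_cohort(record["cycle"])].append(record)
    return cohorts
```

Splitting on whole survey cycles, rather than on individual participants, keeps every participant from a given cycle in the same cohort.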

Next, the investigators developed and tested the new logistic regression and machine learning models.

The primary outcome for the logistic regression model was HBV infection, defined as a positive HBsAg result, with demographic variables as the primary predictors. Both univariable and multivariable logistic regression were performed in the training set.

Comparing the 2 Models

In the machine learning model, the investigators determined the demographic factors and birthplace associated with the primary outcome. The model used the training cohort with down-sampling of the controls and 10-fold cross-validation to determine test characteristics of the model.
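A minimal Python sketch of down-sampling controls and building 10 cross-validation folds is shown here. It rests on several assumptions: the article does not report the control-to-case ratio (1:1 is assumed), the random seed, or whether down-sampling was repeated within each fold, and all function names are hypothetical.

```python
import random


def downsample_controls(labels, ratio=1.0, seed=0):
    """Keep every case (label 1) and a random subset of controls (label 0).

    `ratio` is the assumed number of controls kept per case.
    Returns the sorted indices of the retained records.
    """
    rng = random.Random(seed)
    cases = [i for i, y in enumerate(labels) if y == 1]
    controls = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(controls), int(round(ratio * len(cases))))
    kept_controls = rng.sample(controls, n_keep)
    return sorted(cases + kept_controls)


def k_fold_indices(indices, k=10, seed=0):
    """Shuffle indices and yield (train, held_out) index lists for k folds."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        held_out = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, held_out
```

Down-sampling addresses the class imbalance expected in a low-prevalence population, where HBsAg-positive participants are far outnumbered by controls.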

Using multivariable logistic regression, the investigators identified several factors associated with HBV infection: birth year 1991 or after (adjusted odds ratio [aOR], 0.28; 95% CI, 0.14-0.55; P < 0.001), male sex (aOR, 1.49; 95% CI, 1.11-2.01; P = 0.0080), Black and Asian/Other vs. White race (aOR, 5.23 and 9.13; 95% CI, 3.10-8.83 and 5.23-15.96; P < 0.001 for both), and being US-born vs. foreign-born (aOR, 0.14; 95% CI, 0.10-0.21; P < 0.001). An aOR below 1, as for birth year 1991 or after and for US birth, indicates lower odds of infection.
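To show how an adjusted odds ratio relates to the underlying regression coefficient, the sketch below converts a reported aOR and 95% CI back to a log-odds coefficient, an approximate standard error, and a two-sided P value. It assumes the intervals are Wald intervals that are symmetric on the log scale, which the article does not state, so recomputed values will only approximate the published ones. The function names are hypothetical.

```python
import math


def log_odds_summary(odds_ratio, ci_low, ci_high, z=1.96):
    """Return (coefficient, standard error) on the log-odds scale.

    Assumes a Wald 95% CI: exp(beta - z*se) to exp(beta + z*se).
    """
    beta = math.log(odds_ratio)
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z)
    return beta, se


def wald_p_value(beta, se):
    """Two-sided P value for beta / se under a standard normal reference."""
    z_stat = abs(beta / se)
    return math.erfc(z_stat / math.sqrt(2))


# Reported values for male sex: aOR 1.49 (95% CI, 1.11-2.01).
beta_male, se_male = log_odds_summary(1.49, 1.11, 2.01)
p_male = wald_p_value(beta_male, se_male)
```

For male sex the coefficient is positive, consistent with higher odds of infection. For birth year 1991 or after (aOR, 0.28) the same calculation gives a negative coefficient.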

Machine Learning Model Bests Logistic Regression

Ultimately, the machine learning model was superior, with a higher area under the receiver operating characteristic curve (AUROC) in the validation cohort (0.83 vs. 0.75; P < 0.001) and better differentiation of high- and low-risk individuals.

In the training cohort, the AUROC was significantly higher for the machine learning model at 0.90 (95% CI, 0.88-0.92) than for the logistic regression model at 0.81 (95% CI, 0.79-0.84; P < 0.001 by DeLong test).

In the validation cohort, the AUROC was similarly higher in the machine learning model (0.83; 95% CI, 0.78-0.88 vs. 0.75; 95% CI, 0.70-0.80; P < 0.001).
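For readers unfamiliar with the metric, the AUROC can be read as the probability that a randomly chosen infected participant receives a higher risk score than a randomly chosen uninfected one. The short Python sketch below computes it directly from that definition. It is a generic illustration with made-up scores, not the study's data or code, and it does not implement the DeLong test used to compare the two models.

```python
def auroc(scores, labels):
    """AUROC as the fraction of case-control pairs ranked correctly.

    A pair counts 1 when the case (label 1) scores higher than the
    control (label 0), and 0.5 when the two scores are tied.
    """
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                wins += 1.0
            elif case_score == control_score:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Illustrative scores only: 3 of the 4 case-control pairs are ranked correctly.
example = auroc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0])
```

A value of 0.5 corresponds to ranking no better than chance and 1.0 to perfect separation. The pairwise loop is quadratic in sample size, so a rank-based implementation would be preferable for cohorts of the size described here.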

“Our machine learning model consistently outperformed the logistic regression model, laying the groundwork for what could eventually be a practical and cost-effective HBV screening strategy for low prevalence regions with more ‘imported’ HBV infection such as the United States or Western Europe,” the authors wrote. “We also advocate for additional risk-based screening for populations with specific exposure risks as per professional society and CDC guidelines.”

The study, “Optimizing Hepatitis B Virus Screening in the United States Using a Simple Demographics-Based Model,” was published online in Hepatology.