Using artificial intelligence to predict adverse outcomes in emergency department patients with hyperglycemic crises in real time

Background Hyperglycemic crises are associated with high morbidity and mortality. Previous studies have proposed methods to predict adverse outcomes of patients in hyperglycemic crises; however, artificial intelligence (AI) has never been used for this purpose. We implemented an AI model integrated with the hospital information system (HIS) to clarify whether AI could predict these adverse outcomes. Methods We included 2,666 patients with hyperglycemic crises from emergency departments (EDs) between 2009 and 2018. The patients were randomly divided into a 70%/30% split for AI model training and testing. Twenty-two feature variables were collected from the electronic medical records. The performance of the multilayer perceptron (MLP), logistic regression, random forest, Light Gradient Boosting Machine (LightGBM), support vector machine (SVM), and K-nearest neighbor (KNN) algorithms was compared, and the best algorithm was selected to construct an AI model predicting sepsis or septic shock, intensive care unit (ICU) admission, and all-cause mortality within 1 month. After the model was integrated with the HIS, outcomes were compared between the non-AI and AI groups, and the model was compared with the predicting the hyperglycemic crisis death (PHD) score. Results The MLP had the best performance in predicting the three adverse outcomes, compared with the random forest, logistic regression, SVM, KNN, and LightGBM models. The areas under the curves (AUCs) of the MLP model were 0.852 for sepsis or septic shock, 0.743 for ICU admission, and 0.796 for all-cause mortality. Furthermore, we integrated the AI predictive model with the HIS to assist decision making in real time. No significant differences in ICU admission or all-cause mortality were detected between the non-AI and AI groups. The AI model performed better than the PHD score for predicting all-cause mortality (AUC 0.796 vs. 0.693).
Conclusions A real-time AI predictive model is a promising method for predicting adverse outcomes in ED patients with hyperglycemic crises. Further studies recruiting more patients are warranted. Supplementary Information The online version contains supplementary material available at 10.1186/s12902-023-01437-9.


Background
Diabetic ketoacidosis (DKA) and hyperosmolar hyperglycemic state (HHS) are severe acute complications of diabetes [1]. Precipitating factors include uncontrolled type 1 and type 2 diabetes, infection, new-onset diabetes, pancreatitis, acute coronary syndrome, stroke, and medications [2,3]. Visits to the emergency department (ED) for DKA and HHS have been increasing annually in the United States; in 2015, there were 3.1 visits for DKA and 2.9 visits for HHS per 10,000 adults with diabetes [1]. Although treatment includes hydration, insulin therapy, and electrolyte replacement, the mortality rate for hyperglycemic crises remains high [4,5], and these crises can also increase the risk of subsequent adverse cardiovascular events, end-stage renal disease, and long-term mortality [6-8]. Risk stratification (e.g., for sepsis, intensive care unit [ICU] admission, and mortality) may improve outcomes in hyperglycemic crises [2,3]. Prior studies identified mortality predictors, such as age, mental status, severe coexisting diseases, serum pH < 7.0, a high insulin dose within the first 12 h, and serum glucose > 16.7 mmol/L after 12 h [4,5,8], but a clinical prediction rule may be more practical.
In 2013, the predicting the hyperglycemic crisis death (PHD) score was proposed as a tool to help ED physicians stratify mortality risk and make decisions regarding patients in hyperglycemic crises [7]. It consists of six predictors and stratifies patients into low-, intermediate-, and high-risk groups. Although the area under the curve (AUC) for the rule was 0.925 in the validation set, the PHD score was limited by a small derivation sample and the need for manual calculation [7]. In recent years, artificial intelligence (AI) techniques have become a promising method of assisting medical decisions, and several AI models predicting adverse outcomes have been implemented in the ED [6,9-11]. However, no study has yet evaluated the feasibility and accuracy of real-time AI prediction of adverse outcomes in ED patients with hyperglycemic crises [12,13]. Therefore, we carried out this study to address this gap.

Study design, setting, and participants
We established a multidisciplinary team at the Chi Mei Medical Center (CMMC), including emergency physicians, data scientists, information engineers, nurse practitioners, and quality managers, to implement big data and AI. Adults (age ≥ 20 years) with hyperglycemic crises who visited the EDs of three hospitals (CMMC, Chi Mei Liouying Hospital, and Chi Mei Chiali Hospital) between 2009 and 2018 were recruited (Fig. 1). We selected patients aged ≥ 20 years because the criterion for adulthood in Taiwan is ≥ 20 years, and this cutoff has been adopted in many studies [6,11]. Hyperglycemic crises were defined as a final ED diagnosis of DKA or HHS, using International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM) codes 250.1 or 250.2 and ICD-10 codes E11.1 or E11.0. Patients who had no record of subsequent follow-up and those who visited the ED for multiple hyperglycemic crises were excluded.

Outcome measurements
We defined three adverse outcomes: sepsis or septic shock < 1 month (ICD-9-CM: 038, 790.7 or ICD-10: A40-A41, R65, R7881), ICU admission < 1 month, and all-cause mortality < 1 month following the time of presentation in the ED. The general criteria for ICU admission in the study hospitals were unstable vital signs and the need for intensive monitoring and treatment. All-cause mortality was defined as a record in the electronic medical records (EMRs) of death certification or of discharge against medical advice in a patient in critical condition. We defined " < 1 month" for outcomes according to previous studies of hyperglycemic crises and AI [7,11].

Ethical statement
This study was approved by the Institutional Review Board of the CMMC and was conducted according to the Declaration of Helsinki. Informed consent was waived because this study was retrospective and used de-identified information, which did not affect the rights or welfare of the patients.

Data processing, comparison, and application
The study had two phases: pre- and post-implementation. In the pre-implementation phase, we developed an AI predictive model and integrated it with the HIS; in the post-implementation phase, we compared outcomes between the non-AI and AI groups. The feature of sex was encoded as 1 (male) or 0 (female). Missing or ambiguous data were defined by a team comprising emergency physicians, data scientists, information engineers, nurse practitioners, and quality managers, and records with missing feature variables were deleted or imputed with an average value. We then divided the data into training (70%) and test (30%) datasets according to previous studies [6,11,21]. Because the outcomes, particularly ICU admission, were relatively few, the data were imbalanced; we therefore used the synthetic minority over-sampling technique (SMOTE) to reduce the imbalance in the training dataset [22]. Machine learning (ML) and deep learning (DL) are the two major fields of AI [23]. ML, including random forest, logistic regression, support vector machine (SVM), K-nearest neighbor (KNN), and Light Gradient Boosting Machine (LightGBM), is the ability of a computer system to automatically improve its function, or "learn," from continuous data [23]. DL, represented by the multilayer perceptron (MLP) in this study, uses a more complex network of nodes between the inputs and outputs to solve complex problems more accurately [23]. Because the case number was small, we used the MLP, a classical neural network method, to represent DL; the MLP has been adopted successfully in our previous studies [6,9,11,24,25]. We used the five-fold cross-validation technique to build all models. We compared the random forest, logistic regression, SVM, KNN, LightGBM, and MLP algorithms for accuracy, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), F1, and AUC. Accuracy was defined as the fraction of cases that the model correctly predicted [26]. Sensitivity was the fraction of positive cases predicted as positive, and specificity the fraction of negative cases predicted as negative [26]. PPV was the fraction of true positives among all cases that the model predicted to be positive, and NPV the fraction of true negatives among all cases predicted to be negative [26]. F1 was the harmonic mean of PPV and sensitivity [26]. Accuracy, PPV, NPV, and F1 depend on the prevalence of adverse outcomes [26]. We used the AUC to determine the best model for further implementation [13-15] because the AUC considers the predictive performance for both positive and negative outcomes. An AUC of 0.5 suggests no discrimination, 0.7-0.8 is acceptable, 0.8-0.9 is excellent, and > 0.9 is outstanding [26]. The tuning parameters used to refine our models are shown in Supplementary Table 1. We performed the DeLong test to assess overfitting between the training and test models and plotted the learning curves for the best model [27]. The p-value of the DeLong test for the best model (the MLP) was > 0.05, indicating no significant difference between the training and test models and therefore no significant overfitting. In the learning curve [28] (Supplementary Fig. 1), we likewise observed no significant overfitting as the number of samples increased, with the training score (F1 score) curve gradually approaching and overlapping the testing score curve. Subsequently, we integrated the AI predictive model into the HIS, deployed it as an AI web service, and launched it for real-time decision-making assistance for ED physicians. To obtain the real-time prediction, a physician simply pressed the AI button in the HIS. We then conducted a retrospective impact study between December 1, 2019, and April 30, 2021, in which all ED patients with hyperglycemic crises were identified and divided into non-AI and AI groups to compare outcomes. The use of AI was an aid to decision making and depended on the physician's discretion.
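The performance measures defined above can be expressed directly in terms of confusion-matrix counts. A minimal sketch (the counts are hypothetical, not from the study):

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the performance measures used in the study from
    confusion-matrix counts (true/false positives and negatives)."""
    total = tp + fp + tn + fn
    sensitivity = tp / (tp + fn)   # fraction of positive cases predicted positive
    specificity = tn / (tn + fp)   # fraction of negative cases predicted negative
    ppv = tp / (tp + fp)           # precision
    npv = tn / (tn + fn)
    accuracy = (tp + tn) / total
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)  # harmonic mean of PPV and sensitivity
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "ppv": ppv, "npv": npv, "f1": f1}

# Hypothetical counts: 80 true positives, 20 false negatives,
# 90 true negatives, 10 false positives.
m = classification_metrics(tp=80, fp=10, tn=90, fn=20)
```

Note that accuracy, PPV, NPV, and F1 shift with the prevalence of the outcome, which is why the AUC was preferred for model selection.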

ML algorithms used in this study
MLP is an artificial neural network that maps input data to appropriate outputs through an input layer, hidden layer, and output layer, each connected by a synaptic weight matrix with nonlinear activation functions, and trained via backpropagation [29]. Its multiple layers and activation functions enable it to distinguish data that are not linearly separable [29]. In a study predicting adverse outcomes from pneumonia, the MLP had AUCs of 0.749, 0.792, and 0.802 for sepsis or septic shock, respiratory failure, and mortality, respectively [6].
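As a concrete illustration of this layered structure, a single forward pass through a one-hidden-layer MLP with sigmoid activations can be sketched as follows (toy weights only; the study's actual network architecture and weights are not shown here):

```python
import math

def sigmoid(z):
    """Nonlinear activation squashing any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def mlp_forward(x, w_hidden, b_hidden, w_out, b_out):
    """One forward pass: input layer -> hidden layer -> output probability.
    Each hidden node applies the activation to a weighted sum of the inputs;
    backpropagation (not shown) would adjust the weights during training."""
    hidden = [sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
              for w, b in zip(w_hidden, b_hidden)]
    return sigmoid(sum(wo * h for wo, h in zip(w_out, hidden)) + b_out)

# Hypothetical toy weights: 2 inputs, 3 hidden nodes, 1 output.
w_hidden = [[0.5, -0.2], [0.1, 0.8], [-0.3, 0.4]]
b_hidden = [0.0, 0.1, -0.1]
w_out = [1.0, -0.5, 0.7]
p = mlp_forward([1.0, 2.0], w_hidden, b_hidden, w_out, b_out=0.0)
```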
Random forest is an efficient ensemble technique that combines multiple optimized decision trees; it is useful for classification and regression and resists overfitting, achieving high accuracy even on incomplete datasets [30]. Random forest has been widely used for prediction in medical AI studies [31], including a study of older ED patients with influenza in which the random forest model achieved AUCs of 0.840 for hospitalization, 0.765 for pneumonia, 0.857 for sepsis or septic shock, 0.885 for ICU admission, and 0.875 for in-hospital mortality [9].
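The core idea of an ensemble of randomized trees can be sketched in miniature with bootstrapped one-split "stumps" voting on a one-dimensional toy problem (an illustration of the technique, not the study's model):

```python
import random

def fit_stump(xs, ys):
    """Fit a one-split 'tree': pick the threshold on the (bootstrapped)
    sample that minimizes misclassifications, predicting 1 above it."""
    best_t, best_err = 0.0, float("inf")
    for t in xs:
        err = sum((x > t) != y for x, y in zip(xs, ys))
        if err < best_err:
            best_t, best_err = t, err
    return best_t

def random_forest_predict(x, xs, ys, n_trees=25, seed=0):
    """Train n_trees stumps on bootstrap resamples and take a majority vote,
    the mechanism by which a random forest averages out individual trees."""
    rng = random.Random(seed)
    votes = 0
    for _ in range(n_trees):
        idx = [rng.randrange(len(xs)) for _ in range(len(xs))]
        t = fit_stump([xs[i] for i in idx], [ys[i] for i in idx])
        votes += x > t
    return int(votes > n_trees / 2)

# Toy data: class 1 when the feature exceeds 5.
xs = [1, 2, 3, 4, 6, 7, 8, 9]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
```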
Logistic regression is a statistical approach and supervised ML algorithm for classification problems; it maps features to categorical targets and predicts the probability that a new case belongs to a target class [32]. In a recent study predicting major adverse cardiac events in ED patients with chest pain, logistic regression achieved AUCs of 0.868 for acute myocardial infarction at < 1 month and 0.716 for all-cause mortality at < 1 month [11].
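The probability mapping at the heart of logistic regression can be sketched directly; the coefficients below are hypothetical, not fitted to the study's data:

```python
import math

def logistic_predict(x, weights, bias):
    """Map a feature vector to a class probability via the logistic
    (sigmoid) function of a weighted linear combination."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients for two illustrative features.
weights, bias = [0.04, 1.5], -3.0
p_high = logistic_predict([70, 0.6], weights, bias)  # higher-risk profile
p_low = logistic_predict([30, 0.0], weights, bias)   # lower-risk profile
```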
LightGBM is a high-performance gradient boosting framework that uses tree-based learning algorithms and incorporates Gradient-based One-Side Sampling and Exclusive Feature Bundling for selective sampling and reduced dimensionality [33]. A study using LightGBM reported AUCs of 0.774 for sepsis or septic shock, 0.847 for respiratory failure, and 0.835 for mortality [6].
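The stage-wise additive idea underlying gradient boosting (on top of which LightGBM adds its sampling and feature-bundling optimizations) can be sketched for a one-dimensional squared-error target: start from the mean and repeatedly fit small trees to the residuals.

```python
def fit_stump_reg(xs, residuals):
    """Find the split that best reduces squared error, predicting the
    mean residual on each side of the threshold."""
    best = None
    for t in xs:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - (lm if x <= t else rm)) ** 2
                  for x, r in zip(xs, residuals))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, n_rounds=20, lr=0.3):
    """Gradient boosting sketch: each round fits a stump to the current
    residuals and adds it, scaled by a learning rate."""
    pred = [sum(ys) / len(ys)] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, pred)]
        stump = fit_stump_reg(xs, residuals)
        pred = [p + lr * stump(x) for p, x in zip(pred, xs)]
    return pred

xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
pred = boost(xs, ys)
mse_before = sum((y - 0.5) ** 2 for y in ys) / len(ys)
mse_after = sum((y - p) ** 2 for y, p in zip(ys, pred)) / len(ys)
```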
SVM is a versatile algorithm that can address regression, binary, and multi-class classification problems by identifying the hyperplane that maximizes the distance between classes in the feature space [34]. When the classes are not linearly separable, the kernel trick is used to project the feature vectors into a higher-dimensional space [34]. SVM is widely used in medicine; one study reported AUCs of 0.840 for hospitalization, 0.733 for pneumonia, 0.806 for sepsis or septic shock, 0.778 for ICU admission, and 0.762 for in-hospital mortality in older patients with influenza [9].
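The kernel trick mentioned above can be illustrated with a hypothetical one-dimensional dataset that no single threshold separates, yet becomes separable after mapping the feature to a higher-dimensional representation (here, its square):

```python
# Class 1 lies on both sides of class 0: no threshold on x separates them.
xs = [-3, -2, -0.5, 0, 0.5, 2, 3]
ys = [1, 1, 0, 0, 0, 1, 1]

def separable_by_threshold(values, labels):
    """True if some threshold puts every class-0 value below every
    class-1 value, i.e., the classes are linearly separable in 1-D."""
    return max(v for v, y in zip(values, labels) if y == 0) < \
           min(v for v, y in zip(values, labels) if y == 1)

flat = separable_by_threshold(xs, ys)                      # raw feature: no
mapped = separable_by_threshold([x * x for x in xs], ys)   # x -> x^2: yes
```

A kernelized SVM achieves the same effect implicitly, computing inner products in the higher-dimensional space without constructing it.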
KNN is a non-parametric, supervised learning classifier that predicts the class of an individual data point from its proximity to other data points [35]. A study using KNN to predict major adverse cardiac events in ED patients with chest pain reported AUCs of 0.865 for acute myocardial infarction at < 1 month and 0.969 for all-cause mortality at < 1 month [11].
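This proximity-based prediction can be sketched directly (toy 2-D feature vectors, not the study's data):

```python
import math
from collections import Counter

def knn_predict(query, points, labels, k=3):
    """Classify a query point by majority vote among the k nearest
    training points under Euclidean distance."""
    dists = sorted((math.dist(query, p), y) for p, y in zip(points, labels))
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]

# Hypothetical 2-D feature vectors with binary labels.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = [0, 0, 0, 1, 1, 1]
```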
Results

The MLP model outperformed the other algorithms, with AUCs of 0.852 for sepsis or septic shock, 0.743 for ICU admission, and 0.796 for all-cause mortality in the test dataset (Table 2 and Supplementary Fig. 2) [36]. After a consensus was reached, the MLP was chosen for AI implementation. SHapley Additive exPlanations (SHAP) values were used to identify feature associations and importance (Supplementary Fig. 3). A model for predicting ICU admission < 48 h was also developed, with an AUC of 0.780 in the test dataset, again outperforming the other algorithms. A DeLong test was used to compare AUC values between algorithms (Supplementary Table 4).
Meanwhile, it is crucial for models to be well calibrated when used in real-world, patient-level scenarios, as inaccuracies in individual predicted probabilities may lead physicians to inappropriate decisions. To assess the calibration of our models, we generated calibration plots depicting the distribution of observed and predicted case states across absolute probability bins. A calibration curve that closely follows the diagonal indicates better calibration of the corresponding model. As shown in Figs. 2, 3 and 4, the calibration curves of all MLP models did not deviate markedly from the diagonal. Therefore, these models can be considered suitable for implementation in a prediction system.
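The construction behind such a calibration plot can be sketched by binning predicted probabilities and comparing each bin's mean prediction with its observed event rate; the predictions below are hypothetical and perfectly calibrated by design:

```python
def calibration_bins(probs, outcomes, n_bins=5):
    """Group predictions into equal-width probability bins and return
    (mean predicted probability, observed event fraction) per non-empty
    bin; a well-calibrated model yields pairs near the diagonal."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        i = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into last bin
        bins[i].append((p, y))
    result = []
    for b in bins:
        if b:
            mean_p = sum(p for p, _ in b) / len(b)
            obs = sum(y for _, y in b) / len(b)
            result.append((mean_p, obs))
    return result

# Toy example: events occur at exactly the predicted rates.
probs = [0.1] * 10 + [0.9] * 10
outcomes = [1] + [0] * 9 + [1] * 9 + [0]
pairs = calibration_bins(probs, outcomes)
```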
Patients with hyperglycemic crises (n = 271) between December 1, 2019 and April 30, 2021 were identified to compare the adverse outcomes between the non-AI and AI groups (Table 3). The AI group tended to have a lower ICU admission rate (11.1% vs. 19.8%) and lower all-cause mortality (11.1% vs. 15.0%) than the non-AI group; however, the differences were not significant. In addition, we used the same data to validate the PHD score and found that the AI model using the MLP performed better than the PHD score for predicting all-cause mortality (Table 4).

Discussion
We developed an AI prediction model using the MLP for ED patients with hyperglycemic crises that provided real-time decision-making assistance to physicians. The AUCs of the model were 0.852 for sepsis or septic shock, 0.743 for ICU admission, and 0.796 for all-cause mortality within 1 month. The impact study showed that the AI group tended to have lower ICU admission and all-cause mortality rates than the non-AI group, but the differences were not significant.
Clinical decision rules (CDRs) such as the PHD score can support critical decisions about patient care [37-39], but they have limitations. CDRs are designed to simplify complexity, and they should be externally validated in diverse settings to ensure applicability [37,38]. They may not apply to a user's clinical setting or target population, and they require manual calculation, which can be inconvenient in a busy ED [37,38].
AI is a breakthrough in healthcare with the potential to improve the system. The MLP, an important artificial neural network model, is preferred for solving nonlinear problems; it consists of input, hidden, and output layers and mimics the human brain [40]. Unlike other computerized tools, AI learns, tests, and generates autonomously by analyzing big data [23,41]. AI offers various opportunities for ED care, including image interpretation, prediction of patient outcomes, monitoring of vital signs, reduction of the documentation burden with natural language processing, home monitoring systems, and outbreak prediction tools [41-44].
We integrated an AI prediction model into the HIS, which overcame barriers between AI research and clinical practice, but implementation barriers remained. Hospital policies and cooperation from the hospital information department were crucial for successful implementation. In addition, incorporating AI into the HIS was technically challenging and may require overhauling existing information technology systems. Finally, concerns regarding malpractice, accuracy, and the replacement of physicians by AI may affect physician acceptance of AI implementation [23].
Based on the same dataset, the AUC of all-cause mortality of the best model in our study was superior to that of the PHD score (0.796 vs. 0.693), suggesting that our AI model may be a better tool for predicting adverse outcomes in ED patients with hyperglycemic crises than the conventional PHD score.
We used the AUC, a recognized and comprehensive metric, to select the algorithm for our study [6,9-11]. A major advantage of the AUC is that it measures the ranking of predictions rather than their absolute values and is classification-threshold-invariant [45]. However, the choice of metric depends on the study's aim [10]. For instance, if high sensitivity for predicting sepsis or septic shock had been the aim, we might have chosen LightGBM, which had the best sensitivity (0.803) in our study.
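The ranking property noted above is easy to verify: the AUC equals the probability that a randomly chosen positive case is scored above a randomly chosen negative one, so any strictly increasing transformation of the scores leaves it unchanged. A minimal sketch with hypothetical scores:

```python
def auc(scores, labels):
    """Mann-Whitney form of the AUC: the fraction of positive-negative
    pairs ranked correctly, counting ties as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9]
a1 = auc(scores, labels)
a2 = auc([s ** 3 for s in scores], labels)  # monotone transform: same ranking
```

Because cubing preserves the ordering of these positive scores, `a1` and `a2` are identical, illustrating threshold and scale invariance.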
We used SHAP values, a recent method for increasing the transparency of AI prediction, to identify the importance of each feature variable in determining adverse outcomes [36]. In the SHAP summary plot, red and blue indicate high and low values of a feature, respectively, and the horizontal position shows whether that value pushed the prediction toward or away from the adverse outcome [36].
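The additive character of SHAP can be seen in the special case of a linear model, where each feature's exact Shapley value is its coefficient times the feature's deviation from the average, and the attributions sum to the gap between the prediction and the baseline. This is a simplified illustration with hypothetical coefficients; the study applies SHAP to the MLP, where the values must be approximated.

```python
def linear_model(x, weights):
    """A plain linear predictor f(x) = sum_i w_i * x_i."""
    return sum(w * xi for w, xi in zip(weights, x))

def linear_shap(x, weights, baseline):
    """Exact Shapley values for a linear model with independent features:
    phi_i = w_i * (x_i - mean_i)."""
    return [w * (xi - m) for w, xi, m in zip(weights, x, baseline)]

weights = [0.5, -1.2, 2.0]   # hypothetical coefficients
baseline = [1.0, 0.0, 0.5]   # hypothetical feature means
x = [2.0, 1.0, 0.0]

phi = linear_shap(x, weights, baseline)
# Additivity: attributions sum to f(x) - f(baseline).
gap = linear_model(x, weights) - linear_model(baseline, weights)
```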
A major strength of this study is that it implemented a real-time AI prediction model integrated into the HIS to predict adverse outcomes in ED patients with hyperglycemic crises. However, there were some limitations. The AUC for predicting ICU admission was lower than those for sepsis or septic shock and all-cause mortality, possibly because ICU admission decisions are partly subjective [46]. The results of the DeLong test (Table 2) indicate a potential for overfitting in most models other than the MLP, which should be interpreted with caution; increasing the size of the dataset may mitigate this issue and improve model performance. The "black box" phenomenon remains a problem [23], although using SHAP values may help increase transparency [36]. The impact of AI prediction on clinical practice was not fully evaluated, and further studies are needed. The AI prediction model may not be generalizable to other hospitals, and ethical and legislative issues may arise from using AI predictions as a tool. There were also limitations in the ICD-based measures [47,48]. Lastly, the sample size of new patients was small, and more patients should be recruited to delineate this issue.

Fig. 1
Fig. 1 Study flow chart. CMMC, Chi Mei Medical Center; ED, emergency department; AI, artificial intelligence; HIS, hospital information system

Fig. 2 Fig. 3
Fig. 2 Calibration plot: predicted and true probability results for sepsis and septic shock
Fig. 3 Calibration plot: predicted and true probability results for ICU admission

Fig. 4
Fig. 4 Calibration plot: predicted and true probability results for all-cause mortality

Table 1
Characteristics of all ED patients with hyperglycemic crises in the three hospitals

Table 1 (continued). Columns: Total; Sepsis or septic shock (No, Yes, p-value); ICU admission (No, Yes, p-value); All-cause mortality (No, Yes, p-value)
Data are presented as n (%) or mean ± SD. The independent t-test was used to analyze continuous variables, and the Chi-square test was used to examine categorical variables. ED Emergency department, ICU Intensive care unit, BMI Body mass index, hs-CRP High sensitivity C-reactive protein, PHD Predicting the hyperglycemic crisis death, SD Standard deviation

Table 2
Comparison of performance among the random forest, logistic regression, SVM, KNN, LightGBM, and MLP algorithms for adverse outcomes in ED patients with hyperglycemic crises. MLP Multilayer perceptron, LightGBM Light Gradient Boosting Machine, SVM Support vector machine, KNN K-nearest neighbors, ED Emergency department, PPV Positive predictive value, NPV Negative predictive value, F1 2 × (precision × recall)/(precision + recall), AUC Area under the curve, CI Confidence interval, ICU Intensive care unit. *The DeLong test was used to compare the AUC between the training and test models [27]

Table 3
Comparison of clinical characteristics and adverse outcomes between the non-AI and AI groups in new ED patients with hyperglycemic crises between December 1, 2019 and April 30, 2021. Data are presented as % or mean ± SD. The independent t-test was used to analyze continuous variables, and the Chi-square test was used to examine categorical variables. AI Artificial intelligence, ED Emergency department, ICU Intensive care unit, BMI Body mass index, hs-CRP High sensitivity C-reactive protein, PHD Predicting the hyperglycemic crisis death, SD Standard deviation. *Because the number of patients in the AI group in the age category "20-34" was 0, we conducted the test only for the other four age subgroups

Table 4
Comparison of predicting the ICU admission and all-cause mortality rates between the AI model using MLP and the PHD score. ICU Intensive care unit, AI Artificial intelligence, MLP Multilayer perceptron, PHD Predicting the hyperglycemic crisis death, PPV Positive predictive value, NPV Negative predictive value, F1 2 × (precision × recall)/(precision + recall), AUC Area under the curve. *The DeLong test was used to compare the AUC between the MLP model and the PHD score [27]. Note: We adjusted the classification threshold to approach the same level of sensitivity as the prediction using the PHD score