


Original Research

A Machine Learning Pipeline for Mortality Prediction in the ICU


Yang Sun,

Department of Statistics, North Carolina State University, Raleigh, NC, US

Yi-Hui Zhou

Department of Biological Science, North Carolina State University, Raleigh, NC; Bioinformatics Research Center, North Carolina State University, Raleigh, NC, US


Mortality risk prediction for patients admitted to the intensive care unit (ICU) is a crucial and challenging task, as it enables clinicians to respond with timely and appropriate clinical intervention. The task has become more urgent during the COVID-19 global pandemic. In recent years, electronic health records (EHR) have been widely adopted, and have the potential to greatly improve clinical services and diagnostics. However, the large proportion of missing data in EHR poses challenges that may reduce the accuracy of prediction methods. We propose a cohort study that builds a pipeline that extracts ICD-9 codes and laboratory tests from publicly available electronic ICU databases, and improves in-hospital mortality prediction accuracy using a combination of a neural network missing data imputation approach and a decision tree based outcome prediction algorithm. We show that the proposed approach achieves a higher area under the ROC curve, ranging from 0.88 to 0.98, compared with other well-known machine learning methods applied to similar target populations. It also offers clinical interpretations through variable selection. Our analysis also shows that mortality prediction for neonates is more challenging than for adults, and that prediction accuracy decreases as patients stay longer in the ICU.

How to Cite: Sun Y, Zhou Y-H. A Machine Learning Pipeline for Mortality Prediction in the ICU. International Journal of Digital Health. 2022;2(1):3. DOI:
  Published on 12 May 2022
 Accepted on 28 Apr 2022            Submitted on 24 Mar 2022

1 Introduction

The intensive care unit (ICU) admits patients with life-threatening injuries and the most severe illnesses [1]. Patients in the ICU receive much more clinical care than other patients, require consistent monitoring, and are at high risk of death. For example, during the COVID-19 global pandemic, the mortality rates of COVID-19 patients admitted to the ICU are between 30% and 50%. Among the various aspects of clinical research, ICU mortality prediction plays an important role, as it not only helps researchers to identify the patients in danger, but in principle can save ICU beds for the patients most in need [2, 3]. Therefore, building accurate mortality models and identifying important risk factors are needed more than ever to inform clinical decisions and answer urgent research questions.

Existing literature has developed scoring systems that pre-process data using baseline patient characteristics. For example, the Acute Physiology And Chronic Health Evaluation system (APACHE) [4] and the Simplified Acute Physiology Score (SAPS) [5, 6] mostly rely on clinical expertise for variable selection and importance weights. However, none of these models is consistently suitable for all patients in ICUs, due to insufficient accuracy or lack of generality [7]. To increase the accuracy of mortality prediction, researchers have designed various models and scoring systems with specific requirements. For example, Moridani et al. [8] proposed a mortality scoring system to predict heart disease. Although the model gave acceptable results when risks were predicted at an early stage, the sample size in this study was too small to represent the target population (only 90 selected patients), and only one disease factor was considered.

Electronic health records (EHR) contain valuable information such as demographics, laboratory tests, vital signs and disease diagnosis codes. In the United States, more than 30 million hospital patient visits happen each year, of which 83% generate electronic health records. For researchers, data mining using such a rich data source could potentially provide deeper understanding and achieve satisfactory prediction results. One common challenge in utilizing electronic health records is that data are not always available in a clean and tidy format, and much of the primary work is to prepare the data into a standard form amenable to analysis. This issue is pressing for ICU prediction because clinical decisions have to be made in a timely manner.

Analysis of EHR data can be significantly challenging because there is considerable missing data. The characteristics of the patient records vary greatly among heterogeneous patients, which results in inconsistent and incomplete data. Improper handling of missing data can lead to significantly biased results, especially for data with complex structure. This issue is growing as EHR data become more widespread and datasets increase in scope and size. In EHR data, missing values frequently outnumber non-missing values, with missing rates ranging from 20% to 80% [9]. In most EHR data, non-missing features may account for part of the missingness patterns among different patients. Therefore, in principle much of the missingness might be “recoverable” in an approximately unbiased manner, given a sufficiently sophisticated approach to imputation. Numerous methods have been developed to impute missing values. These include spectral analysis [10], kernel methods [11], the EM algorithm [12] for certain models, matrix completion [13] and matrix factorization [14], to name a few. Multiple imputation [15, 16] can be further applied with the above imputation methods to reduce the variance, by repeating the imputation steps multiple times and taking the average of the results, which also provides estimates of uncertainty. However, existing generative imputation methods have limitations. The EM methods require assumptions about the distributions and are cumbersome to apply to mixed features including both categorical and continuous data, which are very common in EHR data (e.g. ICD-9 codes and laboratory test values). Recent developments in deep learning have shown promising alternatives. For example, the generative adversarial imputation nets (GAIN) [17] train a generator and a discriminator adversarially to impute missing data according to the true underlying distribution.

In this paper, we build an accurate ICU prediction pipeline that extracts ICD-9 codes and laboratory tests from the Multiparameter Intelligent Monitoring in Intensive Care (MIMIC) III database. Our model handles large numbers of patients with various diseases and significant missing data. We show that the proposed prediction model based on LightGBM outperforms other ICU mortality prediction models reported in the literature for similar target populations.

2 Materials and Methods

2.1 MIMIC Electronic Health Records data

The Multiparameter Intelligent Monitoring in Intensive Care (MIMIC) [18] is a public critical care database which includes all patients admitted to the ICUs of Beth Israel Deaconess Medical Center in Boston, MA since 2001. Three major versions of the MIMIC database are available for research analysis. MIMIC-II includes 24,508 adult patients (age > 15, as defined by the MIMIC database) from 2001 to 2008, and MIMIC-III augments it with the data collected between 2008–2012, resulting in a total of 38,597 unique adult patients and 7,870 neonates. In addition, the data in MIMIC-III are more reliable and complete than in MIMIC-II, which facilitates further research [19]. The MIMIC database provides information about patients’ demographics, diagnosis codes, laboratory tests, and clinical events, with over 350 million values across various sources of data. Among the collected values, laboratory tests are one of the most important data sources, constituting more than 90% of total events. Table 1 summarizes the patient characteristics in different versions of the MIMIC database. Note that the identifiers have changed across versions. Although SUBJECT_ID (unique to a patient) has been kept consistent between MIMIC-II and MIMIC-III, HADM_ID (unique to a patient hospital stay) and ICUSTAY_ID (unique to a patient ICU stay) have been regenerated and will not match between the two databases. Also, some item-ID mappings and the schema have changed, and many new tables have been added to MIMIC-III.

Table 1

Adult patient characteristics in different versions of MIMIC database.

| Characteristic | MIMIC-II | MIMIC-III |
| --- | --- | --- |
| Distinct adult patients | 24,508 | 38,597 |
| Hospital admissions | 26,870 | 49,785 |
| Distinct ICU stays | 31,782 | 55,423 |
| Age (yrs) | 65.5 (51.9, 77.7) | 65.8 (52.8, 77.8) |
| Gender (male) | 17,857 (56.2%) | 27,983 (55.9%) |
| ICU length of stay (days) | 2.1 (1.1, 4.3) | 2.1 (1.2, 4.6) |
| Hospital length of stay (days) | 7 (4, 13) | 6.9 (4.1, 11.9) |
| Hospital mortality | 3,092 (11.5%) | 5,748 (11.5%) |

We use MIMIC-III database to build a machine learning pipeline for ICU patients’ in-hospital mortality prediction, which includes 38,597 unique adults and 7,870 neonates. Separate models are built for adults and neonates due to their distinct event rates, ICD-9 diagnoses and covariate distributions. To protect patients’ privacy, all patients’ data are de-identified, and the reported times of the events are randomly drifted for each patient so that only the relevant time intervals within clinical events are known. There are 6,984 features among ICD-9 codes, 726 features among laboratory tests and tens of other demographics features. After feature engineering, a total of 601 features including demographics (age and gender), diagnosis codes (ICD-9 Code) and laboratory tests are selected in the machine learning models (Figure 1). ICD-9 Code documents diagnoses and procedures associated with hospital utilization. For neonates’ ICD-9 records, we removed the V codes because they are designed for occasions when circumstances other than a disease or injury result in an encounter or are recorded by providers as problems or factors that influence care.

Figure 1 

Feature Engineering Pipeline of MIMIC-III database.

2.2 Data preprocessing and Feature Engineering

2.2.1 ICU patients cohort

All patients in the MIMIC-III database are included, resulting in 38,597 unique adult patients, and 7,870 unique neonates. For each patient, we include their first admission to the ICU.

2.2.2 ICD-9 code

Each patient’s ICD-9 codes are documented at the end of the ICU stay and cover the whole stay; they are not supposed to contain information recorded after discharge. We extract the distinct ICD-9 codes that appear in the database and calculate their frequencies. The number of ICD-9 codes can be as high as tens of thousands (14,567 by definition and 6,984 in our data), which could cause not only low predictive performance but also memory issues. The observed distribution of ICD-9 codes shows that a few high-frequency codes dominate the diagnoses dataset. As shown in Table 5, the top 10 most common ICD-9 codes cover 76.9% of the dataset and the top 50 cover 93.6%. We therefore use adjusted ICD-9 categories to reduce the feature dimension. The first 3 digits describe the general condition of a patient and have therefore been commonly used to represent disease categories [20]. We follow this tradition and convert the features into 1070 dummy variables (one-hot encoding) as indicators for each small group of disease codes. The last two lines of Table 5 show that, after grouping the ICD-9 codes into categories, the top 10 and top 50 most common categories cover 84.2% and 96.8%, respectively [21].

Although we already reduced the dimension of the diagnosis codes by extracting the first 3 or 4 digits, some features remain too sparse, which may increase the space and time complexity of the models and cause algorithms to behave unpredictably due to overfitting. To further simplify the feature space, we drop the adjusted ICD-9 codes that were documented for fewer than 48 patients, following our observation that the median of the 3-digit frequency in the database was 48.5. In this manner we are able to reduce the diagnosis code space to 535 unique values (as in Figure 1).
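The grouping and filtering steps above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names `icd9_category` and `build_icd9_features` and the toy input are hypothetical, codes are assumed to be stored without a decimal point, and E codes are assumed to keep four characters (the letter plus three digits) while all other codes keep three.

```python
from collections import Counter


def icd9_category(code: str) -> str:
    """Map a full ICD-9 code to its category (assumed rule: E codes keep 4 chars, others 3)."""
    code = code.strip().upper().replace(".", "")
    return code[:4] if code.startswith("E") else code[:3]


def build_icd9_features(diagnoses, min_patients=48):
    """diagnoses: dict patient_id -> iterable of ICD-9 codes.

    Returns (kept_categories, rows), where rows maps patient_id to a 0/1
    indicator list aligned with kept_categories (one-hot encoding).
    """
    # Each patient contributes a category at most once.
    per_patient = {
        pid: {icd9_category(c) for c in codes} for pid, codes in diagnoses.items()
    }
    # Count the number of patients per category.
    counts = Counter(cat for cats in per_patient.values() for cat in cats)
    # Drop categories documented for fewer than min_patients patients.
    kept = sorted(cat for cat, n in counts.items() if n >= min_patients)
    rows = {
        pid: [1 if cat in cats else 0 for cat in kept]
        for pid, cats in per_patient.items()
    }
    return kept, rows


# Toy example with a lowered threshold so that the filter is visible.
toy = {1: ["4019", "42731"], 2: ["4010", "E8788"], 3: ["V3000"]}
kept, rows = build_icd9_features(toy, min_patients=2)
```

In this toy input only category 401 is shared by two patients, so it is the only indicator column that survives the threshold.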

2.2.3 Laboratory tests

The Lab table in MIMIC-III spans more than just the patient’s ICU stay, and in fact covers their whole hospital stay (and sometimes includes outpatient stays). As a result, the extracted laboratory test values make it possible to predict the patient’s mortality risk before admission to the ICU.

Among the 726 unique laboratory test features, we select only the tests documented after the patient’s first inpatient admission. We also aggregate the laboratory tests by day. If the same laboratory test is recorded more than once for a patient within the same day, we take the average value. If a laboratory test does not exist for a patient on a given day, we treat it as a missing value. After the initial aggregation, our lab data are organized into a 3-dimensional patient-lab array, whose dimensions represent patient ID, date and lab test feature, respectively.
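The daily aggregation described above can be sketched as below. The event tuple layout, the function names and the use of `None` for missing values are assumptions made for illustration; the actual MIMIC table columns are not reproduced here.

```python
from collections import defaultdict
from statistics import mean


def aggregate_daily(lab_events):
    """lab_events: iterable of (patient_id, day, test_name, value).

    Returns dict[(patient_id, day, test_name)] -> mean of that day's values.
    """
    buckets = defaultdict(list)
    for pid, day, test, value in lab_events:
        buckets[(pid, day, test)].append(value)
    return {key: mean(values) for key, values in buckets.items()}


def to_patient_day_array(daily, patients, days, tests):
    """Dense patient x day x test nested list; None marks a missing value."""
    return [
        [[daily.get((p, d, t)) for t in tests] for d in days]
        for p in patients
    ]


# Toy example: two BUN measurements on the same day are averaged.
events = [(1, 0, "BUN", 20.0), (1, 0, "BUN", 30.0), (1, 1, "PLT", 150.0)]
daily = aggregate_daily(events)
array = to_patient_day_array(daily, patients=[1], days=[0, 1], tests=["BUN", "PLT"])
```

A sparse dictionary keyed by (patient, day, test) is used first because most cells of the dense array are empty; the dense form is only built once the set of patients, days and tests is fixed.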

For ICU patients, there are no universal standard rules for ordering laboratory tests, resulting in a large and sparse lab feature space. Frassica [22] in 2005 compared 45,188 lab tests and profiles in three ICUs (surgical ICU, pediatric ICU, and medical ICU), and discovered that 80% of the tests and profiles in the three ICUs are covered by <25 tests. Sharafoddini [23] combined the previous information and modified the lab test items based on the Medical Information Mart for Intensive Care III database, and then selected 36 lab items for predicting in-hospital mortality. Inspired by the above analyses, we hypothesized that using more laboratory test items might not improve the prediction accuracy, as the three-dimensional patient-lab-day array is extremely sparse and has a high missing data rate. Figure 5 presents the missing proportion of 36 common laboratory tests. Many lab features have a missing rate greater than 50%, and the missing rates can be as high as 80%, potentially posing important downstream effects on accuracy for some prediction methods.

To reduce the dimension of the laboratory test features, we first take the average of the patient’s test values prior to the target (response) day as the patient’s corresponding laboratory test value. In this manner, the sparse time series of laboratory tests are transformed into scalar values. As a result, the 3-dimensional patient-lab array becomes a patient-lab matrix. While this mitigates the sparsity problem of the laboratory tests, the missing rates still remain problematic.

Second, we filter out the laboratory tests with missing rate greater than 70% (as being effectively uninformative). This reduces the dimension of laboratory items to 64. The remaining missing data are imputed to improve the prediction accuracy. Details about missing data imputation are provided in Section 2.3.
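The two reduction steps above can be sketched as follows. The function names are hypothetical, missing values are represented by `None`, and "prior to the target day" is interpreted here as strictly before the given day index, which is an assumption.

```python
def collapse_before_target(array, target_day_index):
    """Average each test over the days before target_day_index.

    array: patient x day x test nested list with None for missing values.
    Returns a patient x test matrix; a cell is None if the test was never measured.
    """
    matrix = []
    for patient in array:
        n_tests = len(patient[0])
        row = []
        for t in range(n_tests):
            values = [day[t] for day in patient[:target_day_index] if day[t] is not None]
            row.append(sum(values) / len(values) if values else None)
        matrix.append(row)
    return matrix


def drop_sparse_columns(matrix, names, max_missing=0.70):
    """Remove columns whose missing rate is greater than max_missing."""
    n = len(matrix)
    keep = [
        j for j in range(len(names))
        if sum(row[j] is None for row in matrix) / n <= max_missing
    ]
    return [names[j] for j in keep], [[row[j] for j in keep] for row in matrix]


# Toy example: test "B" is missing for 2 of 3 patients.
toy_array = [
    [[1.0, None], [3.0, None]],
    [[None, None], [5.0, None]],
    [[2.0, 4.0], [None, None]],
]
toy_matrix = collapse_before_target(toy_array, target_day_index=2)
kept_names, kept_matrix = drop_sparse_columns(toy_matrix, ["A", "B"], max_missing=0.5)
```

With the threshold lowered to 0.5 for the toy data, column "B" (missing rate 2/3) is dropped, while the default of 0.70 mirrors the cut-off stated in the text.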

2.2.4 Data Normalization

Some of the prediction methods are sensitive to the scale of the values, and data normalization provides a more uniform effective weight. We re-scale all the features using the min-max normalization technique,

$$X_i = \frac{X_{i,\mathrm{raw}} - \min(X_{i,\mathrm{raw}})}{\max(X_{i,\mathrm{raw}}) - \min(X_{i,\mathrm{raw}})},$$

where $X_i$ is the scaled value for the $i$th feature, and $X_{i,\mathrm{raw}}$ is the original value. Other approaches are possible, but min-max scaling has the advantage of preserving interpretability for binary variables, as [0,1] maps to [0,1] after scaling.
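A minimal sketch of min-max scaling on a single feature column is given below. Leaving missing values (`None`) untouched and mapping a constant column to 0 are conventions assumed for this illustration; the text does not specify them.

```python
def min_max_scale(column):
    """Scale observed values to [0, 1]; None (missing) entries are left as None."""
    observed = [v for v in column if v is not None]
    if not observed:
        return list(column)  # nothing to scale
    lo, hi = min(observed), max(observed)
    if hi == lo:
        # Constant column: assumed convention is to map every observed value to 0.
        return [None if v is None else 0.0 for v in column]
    return [None if v is None else (v - lo) / (hi - lo) for v in column]


lab_scaled = min_max_scale([10.0, None, 20.0, 15.0])  # continuous lab value
flag_scaled = min_max_scale([0, 1, 1])                # binary indicator is unchanged
```

The binary column illustrates the interpretability point: indicator values of 0 and 1 are returned as 0.0 and 1.0.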

The missing data imputation step recovers the training data, so that, when a proper imputation method is applied, the machine learning models are better trained to capture the structure of the features. In the training step, we apply GAIN to the training data to obtain the complete dataset for model training. In the prediction step, we combine the original training data and the prediction data, and then apply GAIN to the whole dataset to obtain the complete prediction features as input to the models.

2.3 Missing Data Imputation via GAIN

The goal of imputation in the mortality prediction context is not to provide unbiased estimates of parameters governing the data structure, as is often the case in formal missing data methods, but to use missing data imputation as a device to increase prediction accuracy for ICU outcome [37]. However, the general rationale behind missing data imputation is valid for prediction even if the predicted values are biased, as prediction does not use the final outcome measure. Here we operate under the standard assumption of missing-at-random as in the missing data literature.

Generative Adversarial Imputation Nets (GAIN) [17] is a missing data imputation algorithm based on the well-known Generative Adversarial Nets (GAN) framework. The generator and discriminator are trained adversarially to learn the desired missing data distribution. GAIN was defined for irregularly-sampled temporal data and has been shown to substantially outperform previous methods including Multiple Imputation by Chained Equations, Matrix Completion and Expectation Maximization on a variety of UCI datasets [17]. A review paper for deep learning methods for medical time series data by Sun et al. also suggested superiority of GAN based missing data imputation methods for EHR data [24].

Let $X = (X_1, \dots, X_D)^T$ be a vector of random variables (either continuous for laboratory tests or binary for the re-coding of ICD-9 codes), whose distribution follows $P(X)$. In the presence of missing values, we use a mask vector $M = (M_1, \dots, M_D)^T$ with $M_d \in \{0, 1\}$ for $d = 1, \dots, D$ to inform the generator which values are missing and which are present.

The generator in GAIN, $G$, takes realized values of $X$, $M$ and a random noise vector $Z$ as input, and outputs a vector of imputed values $\hat{X}$. The noise vector $Z$ is introduced to add randomness to the imputed values so that they are not the same every time, and thus to avoid model overfitting.

The discriminator of GAIN, D, is used as an adversary of G to determine whether a value is observed or imputed. In the discriminator, a hint vector H is further supplied to ensure that the discriminator forces the generator to generate the desired distribution.

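The loss of GAIN is defined by

$$V(D, G) = \mathbb{E}_{\hat{X}, M, H}\left[ M^T \log D(\hat{X}, H) + (1 - M)^T \log\left(1 - D(\hat{X}, H)\right) \right],$$

where log is the element-wise logarithm and $V(D, G)$ depends on $G$ through the corresponding imputed clinical features matrix $\hat{X}$. The training objective of GAIN is a minimax problem given by $\min_G \max_D V(D, G)$.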

We apply GAIN on the whole dataset to obtain the complete feature dataset as input of the subsequent outcome prediction models. Figure 2 summarizes the pipeline of our missing data imputation procedure.

Figure 2 

Pipeline of the missing data imputation procedure. The zeros in the Mask Matrix indicate the positions of missing data, while ones indicate the observed data. Data Matrix is generated from Raw Data with missing values imputed by 0. Random Matrix includes random noise at the missing positions and 0 otherwise.
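The three input matrices named in the caption, and the final step that keeps observed entries and fills only the missing ones, can be sketched as follows. This is not a GAIN implementation: no generator or discriminator is trained, the `generated` matrix is a stand-in for the generator output, the hint vector is omitted, and the noise range `noise_scale` is an assumed value.

```python
import random


def gain_inputs(raw, rng, noise_scale=0.01):
    """Build the mask, data and random matrices from raw data (None = missing)."""
    # Mask matrix: 1 where observed, 0 where missing.
    mask = [[0 if v is None else 1 for v in row] for row in raw]
    # Data matrix: raw values with missing entries filled by 0.
    data = [[0.0 if v is None else float(v) for v in row] for row in raw]
    # Random matrix: noise at missing positions, 0 otherwise.
    noise = [
        [0.0 if m else rng.uniform(0.0, noise_scale) for m in mask_row]
        for mask_row in mask
    ]
    return mask, data, noise


def combine(mask, data, generated):
    """Keep observed values; use the generator output only where data was missing."""
    return [
        [m * x + (1 - m) * g for m, x, g in zip(mask_row, data_row, gen_row)]
        for mask_row, data_row, gen_row in zip(mask, data, generated)
    ]


raw = [[0.2, None], [None, 0.8]]
mask, data, noise = gain_inputs(raw, random.Random(0))
generated = [[0.9, 0.5], [0.4, 0.1]]  # placeholder for a trained generator's output
imputed = combine(mask, data, generated)
```

The `combine` step shows why imputation never overwrites an observed laboratory value: the generator output at observed positions (0.9 and 0.1 in the toy input) is discarded.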

2.4 Outcome Prediction Models Under Comparison

We compare a number of binary outcome prediction models for the ICU mortality prediction task, including several popular machine learning methods. These models are briefly reviewed as follows.

2.5 Regularized logistic regression

We consider logistic regression as a baseline to predict the mortality probability. To avoid over-fitting, regularization is used to control the complexity of the model by penalizing large regression coefficients. The loss function of regularized logistic regression with $L_K$ penalty is

$$J(\theta) = -\sum_{i=1}^{n}\left[ y_i \log h_\theta(X_i) + (1 - y_i) \log\left(1 - h_\theta(X_i)\right) \right] + \sum_{k=1}^{K} \lambda_k \lVert \theta \rVert_k^k,$$

where $h_\theta(X)$ is the predicted probability of event $y$ occurring given $X$, $\theta$ is a vector of the regression coefficients, and $\lambda_k$ controls the amount of regularization applied to the model. When $K = 1$, the loss function with L1 penalty corresponds to Lasso logistic regression. When $K = 2$ and $\lambda_1 = 0$, the loss function with L2 penalty corresponds to Ridge logistic regression. Finally, when $K = 2$ and $\lambda_1, \lambda_2 \neq 0$, the loss function with weighted L1 and L2 penalties corresponds to Elastic Net logistic regression. We compare the performance of these three regularized logistic regression models for ICU mortality prediction. There is one tuning parameter for each of the Lasso and Ridge models (L1 and L2 penalty). For Elastic Net, we set the same weight for the L1 and L2 penalty terms, so that it also has one tuning parameter. In the presence of a large number of features, the L1 penalty might be a better choice, since it results in sparse coefficients and a simpler model with fewer features. For each penalty tuning parameter, ten candidate values on a logarithmic scale between 1e–4 and 1e4 are considered, and the value is selected by cross-validation according to prediction accuracy.
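The penalized loss and the candidate grid described above can be sketched as follows. The sketch assumes an unnormalized sum over patients (dividing by n would only rescale the penalty weights), omits the intercept, and uses hypothetical names `penalized_log_loss`, `lam1` and `lam2`; it evaluates the loss and does not fit a model.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def penalized_log_loss(theta, X, y, lam1=0.0, lam2=0.0):
    """Negative log-likelihood plus L1 and squared-L2 penalties.

    lam1 > 0, lam2 = 0 gives Lasso; lam1 = 0, lam2 > 0 gives Ridge;
    lam1 = lam2 > 0 gives the equally weighted Elastic Net described in the text.
    """
    nll = 0.0
    for x_i, y_i in zip(X, y):
        p = sigmoid(sum(t * v for t, v in zip(theta, x_i)))
        nll -= y_i * math.log(p) + (1 - y_i) * math.log(1 - p)
    l1 = sum(abs(t) for t in theta)
    l2 = sum(t * t for t in theta)
    return nll + lam1 * l1 + lam2 * l2


# Ten candidate penalty weights, log-spaced between 1e-4 and 1e4.
lambda_grid = [10 ** (-4 + 8 * i / 9) for i in range(10)]

X_toy = [[1.0, 2.0], [3.0, 4.0]]
y_toy = [1, 0]
unpenalized = penalized_log_loss([1.0, -1.0], X_toy, y_toy)
lasso = penalized_log_loss([1.0, -1.0], X_toy, y_toy, lam1=0.5)
```

In a cross-validation loop, each value in `lambda_grid` would be tried in turn and the one with the best held-out accuracy retained.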

2.5.1 LightGBM

Ke et al. [25] developed an efficient boosting decision tree using gradient-based one-side sampling (GOSS) and exclusive feature bundling (EFB), called LightGBM. LightGBM significantly improves the training time based on decision tree algorithms, while achieving the same or even better accuracy. To introduce the algorithm of LightGBM, we first review the Gradient Boosted Decision Trees (GBDT) method.

GBDT is a popular machine learning algorithm that iteratively constructs an ensemble of weak decision tree learners through boosting [26]. Given a training set $\{(X_1, y_1), (X_2, y_2), \dots, (X_n, y_n)\}$, where $X$ are the feature samples and $y$ are the mortality labels, the classification function $F(X) = \sum_{m=1}^{T} f_m(X)$ is trained iteratively to minimize the loss function $V(y, F(X))$:

$$f_m = \arg\min_{f} \sum_{i=1}^{n} V\left(y_i, F_{m-1}(X_i) + f(X_i)\right), \qquad F_{m-1}(X) = \sum_{j=1}^{m-1} f_j(X),$$

where $T$ is the number of iterations and the newly added decision tree $f_m$ is chosen to minimize the aggregated loss. However, if the data have a large sample size or high dimension, GBDT is not able to provide accurate and efficient results.

LightGBM combines exclusive features into one feature, which reduces the training time complexity from O(n*D) to O(n*g) [27], where g is the number of groups of exclusive features. In addition, GOSS skips most of the data with small gradients, so that not all data need to be visited during the update steps. As a result, LightGBM is able to achieve the same accuracy as GBDT, but with significantly reduced training time and memory usage on large datasets.
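The sampling idea behind GOSS can be sketched as below. This is an illustration of the technique and not LightGBM's internal code: the function name `goss_sample` is hypothetical, and the rates 0.2 and 0.1 are example settings. Instances with the largest absolute gradients are all kept, a random subset of the remainder is drawn, and the drawn instances are up-weighted by (1 - top_rate) / other_rate so that their total contribution is approximately preserved.

```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, rng=None):
    """Return (indices, weights) selected by gradient-based one-side sampling."""
    rng = rng or random.Random()
    n = len(gradients)
    # Rank instances by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)
    top = order[:n_top]      # always kept
    rest = order[n_top:]     # small-gradient instances
    sampled = rng.sample(rest, min(n_other, len(rest)))
    # Up-weight the sampled small-gradient instances to compensate for subsampling.
    amplify = (1.0 - top_rate) / other_rate
    indices = top + sampled
    weights = [1.0] * len(top) + [amplify] * len(sampled)
    return indices, weights


grads = [0.9, -0.8, 0.1, 0.05, -0.02, 0.01, 0.3, -0.4, 0.02, 0.03]
idx, w = goss_sample(grads, rng=random.Random(0))
```

With ten instances, the two largest-gradient instances (indices 0 and 1) are kept with weight 1, and one of the remaining eight is sampled with weight 8, so only three instances are visited in this step.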

2.5.2 Recurrent neural network

As computational power has increased, deep learning methods have become more powerful in recent years. A Recurrent Neural Network (RNN) [28] arranges an input layer, several hidden layers, and an output layer sequentially to “recall” historical outcomes. This is made possible by the structural design of the hidden layers, in which the output of a previous layer is used as part of the input for the current layer. Hochreiter and Schmidhuber [29] proposed a special RNN model called Long short-term memory (LSTM) to capture subtle effects over long-term memory. LSTM includes three gates besides the essential parts of an RNN: the input gate quantifies the importance of the input information, the forget gate decides which information needs to be carried over, and the output gate generates the information for the next hidden layer. Together, these internal gates help to avoid the vanishing and exploding gradient problems that can occur with RNNs. The LSTM has been shown to work well for modeling sequential data. In this paper, we apply LSTM models to handle the correlations between features and the longitudinal measurements from the laboratory tests. The last sigmoid layer outputs the predicted ICU mortality probability.
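The role of the three gates can be illustrated with a single scalar LSTM step. This is a toy sketch with one input and one hidden unit; the parameter names in the dictionary `p` and the output weights are hypothetical, and the architecture actually used in the study is not specified in the text.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step for scalar input x, hidden state h_prev and cell state c_prev."""
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate cell value
    c = f * c_prev + i * g   # keep part of the old memory, add part of the new input
    h = o * math.tanh(c)     # expose part of the memory as the hidden state
    return h, c


def predict_mortality(sequence, p, w_out=1.0, b_out=0.0):
    """Run the cell over a sequence of daily values; a final sigmoid gives a probability."""
    h, c = 0.0, 0.0
    for x in sequence:
        h, c = lstm_step(x, h, c, p)
    return sigmoid(w_out * h + b_out)


zero_params = {k: 0.0 for k in
               ["wi", "ui", "bi", "wf", "uf", "bf", "wo", "uo", "bo", "wg", "ug", "bg"]}
h1, c1 = lstm_step(1.0, 0.0, 1.0, zero_params)
prob = predict_mortality([0.1, 0.2, 0.3], zero_params)
```

With all parameters at zero, every gate equals 0.5, so the step halves the previous cell state; trained weights would instead let the gates open or close depending on the input.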

3 Experimental Settings

The data preprocessing and feature engineering steps described in Section 2.2 result in a total of 601 features (536 binary and 65 continuous). We apply logistic regression with L1 penalty, logistic regression with L2 penalty, logistic elastic net, LightGBM and LSTM separately for ICU mortality prediction. Performance is evaluated by the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC). High values of AUROC and AUPRC indicate that the prediction model has adequate discrimination to distinguish patients who died in the hospital from those who did not. The AUPRC is more robust on highly skewed datasets (such as the neonate cohort in this study), and gives a more informative picture of an algorithm’s performance [30]. Various rules of thumb suggest that the AUC of a binary classifier should be greater than 0.8 or 0.9. We obtain estimated 95% confidence intervals (CIs) of AUROC and AUPRC via 100 bootstrap samples. In addition to the adult and neonate cohorts, we also examine the performance of the methods on four adult sub-cohorts: patients whose total length of stay in the ICU is within 24 hours (adult within 1 day), between 24 and 48 hours (adult within 2 day), more than 24 hours (adult >1 day), and more than 48 hours (adult >2 day). These subgroups are of clinical interest and have distinct underlying distributions. Evaluating the predictions on these sub-cohorts allows a more accurate classification and facilitates the understanding of potential differences between patients in different stages.
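A bootstrap confidence interval for AUROC can be sketched as follows. The percentile method, the seed, and the pairwise AUROC computation are choices made for this illustration; the text states only that 100 bootstrap samples are used. The pairwise formula is quadratic in the cohort size and is meant for small examples.

```python
import random


def auroc(labels, scores):
    """Probability that a random positive is scored above a random negative (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_boot=100, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AUROC, resampling patients with replacement."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain a single class
            stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    k = int(alpha / 2 * n_boot)
    return stats[k], stats[n_boot - 1 - k]


toy_labels = [0, 0, 1, 1, 0, 1, 0, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.5, 0.9]
point = auroc(toy_labels, toy_scores)
lo, hi = bootstrap_ci(toy_labels, toy_scores)
```

The same resampling loop applies to AUPRC by swapping the metric function.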

Comparison of the above 5 algorithms relies on 8-fold cross-validation at the patient level. For each patient, only the first ICU admission is used. That is, we randomly split the whole data set into 8 mutually exclusive blocks. The models are first trained on 7 blocks (training data set, including 87.5% of the cohort) and then used to predict on the remaining 12.5% of the cohort (validation data set). This process is repeated 8 times to iterate through all 8 blocks. For each prediction algorithm, the hyper-parameters are selected via cross-validation to minimize the error, and performance measures are aggregated over all 8 iterations.
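The patient-level split described above can be sketched as follows. The function name, the seed and the round-robin assignment after shuffling are assumptions for illustration.

```python
import random


def k_fold_patient_splits(patient_ids, k=8, seed=0):
    """Yield (train_ids, validation_ids) pairs over k mutually exclusive blocks."""
    ids = sorted(set(patient_ids))        # one entry per patient
    random.Random(seed).shuffle(ids)
    folds = [ids[i::k] for i in range(k)]  # k disjoint blocks of near-equal size
    for i in range(k):
        validation = folds[i]
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        yield train, validation


splits = list(k_fold_patient_splits(range(80)))
```

Splitting on patient identifiers rather than on rows guarantees that no patient contributes to both the training and the validation block of the same iteration.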

4 Results

4.1 Comparison on the adult cohort

Table 2 summarizes the in-hospital mortality rates of the cohorts under study. The mortality rate of the overall adult cohort is 25.63%, with adults whose ICU stay lasted within 1 day having a much higher mortality rate (42.26%) than those who stayed more than 1 day (24.18%). In contrast, the overall neonate cohort has a very low mortality rate (0.8%). Because this cohort is highly imbalanced, AUPRC is a more informative evaluation metric.

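Table 2

The mortality rate, accuracy (ACC), AUROC, and AUPRC returned from LightGBM in different patient cohorts.

| Cohort | Patients | ICU length of stay | Survived | Died | Mortality rate | ACC | AUROC | AUPRC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Adult Total | 38557 | | 28675 | 9882 | 25.63% | 82.47% | 90.42% | 87.22% |
| Adult >1 day | 35474 | > 24 hr | 26895 | 8579 | 24.18% | 81.62% | 89.00% | 74.52% |
| Adult >2 day | 33029 | > 48 hr | 24957 | 8072 | 24.44% | 80.95% | 88.62% | 73.60% |
| Adult 1 day | 3083 | 0–24 hr | 1780 | 1303 | 42.26% | 89.78% | 97.03% | 96.50% |
| Adult 2 day | 2445 | 24–48 hr | 1938 | 507 | 20.74% | 81.75% | 89.78% | 76.80% |
| Neonate Total | 7649 | | 7588 | 61 | 0.80% | 95.01% | 98.92% | 63.66% |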

The performance of the 5 candidate algorithms on the overall adult cohort, as evaluated by AUROC and AUPRC, is presented in Figures 3 and 4. LightGBM achieves the highest prediction accuracy and discrimination, with an average AUROC of 90% (95% CI: 0.902, 0.906) and AUPRC of 87% (95% CI: 0.8660, 0.8746). The RNN LSTM model performs slightly worse than LightGBM, with AUROC 88% (95% CI: 0.869, 0.891) and AUPRC 83% (95% CI: 0.822, 0.849). The logistic models have the lowest prediction accuracy. The L1 and L2 penalties achieve similar performance, with AUROC 87% (95% CI: 0.870, 0.881) and AUPRC 82% (95% CI: 0.816, 0.833). The logistic elastic net is the worst among all the methods, with AUROC 83% (95% CI: 0.821, 0.838) and AUPRC 75% (95% CI: 0.740, 0.768). These results are expected, as LightGBM and RNN LSTM are non-parametric models involving more hyper-parameters, and are thus more flexible in learning complex relationships between features and mortality status.

Figure 3 

Cross-validation area under receiver operating characteristic curve (AUROC) for 5 candidate algorithms on the adult cohort.

Figure 4 

Cross-validation area under precision-recall curve (AUPRC) for 5 candidate algorithms on the adult cohort.

Among the models, LightGBM not only produces more accurate predictions, but also provides a relatively interpretable model through feature selection. We examine the top 30 features identified by LightGBM (as in Figure 6). The most important features are all laboratory tests, including platelet count, red cell distribution width, alanine aminotransferase and blood urea nitrogen. None of the ICD-9 codes are selected by the model, probably due to their relative sparsity in the database.

4.2 Comparison among Sub-cohorts

Table 2 presents the AUROC and AUPRC for the additional sub-cohorts using the algorithm with the best performance (LightGBM). The AUROCs of the adult >1 day, adult >2 day and adult within 2 day sub-cohorts are similar to that of the overall adult cohort, at about 90%, but their AUPRCs are lower. The prediction accuracy in the adult within 1 day sub-cohort is higher, with 97.0% AUROC and 96.5% AUPRC. The AUROC of neonates is 98.9%, much higher than the 90.4% for adults, but the AUPRC is only 63.7%. This is due to the extremely low mortality rate in the neonate cohort.
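The effect of a low event rate on precision, and hence on AUPRC, can be illustrated with a short calculation. The operating point below (sensitivity 0.90, specificity 0.95) is hypothetical and chosen only for illustration; the cohort counts are those reported in Table 2.

```python
def precision_at(sensitivity, specificity, prevalence):
    """Precision (positive predictive value) implied by an operating point and an event rate."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


neonate_prevalence = 61 / 7649    # deaths / patients, neonate cohort
adult_prevalence = 9882 / 38557   # deaths / patients, overall adult cohort

neonate_precision = precision_at(0.90, 0.95, neonate_prevalence)
adult_precision = precision_at(0.90, 0.95, adult_prevalence)
```

At the same operating point, precision is about 0.13 for the neonate event rate and about 0.86 for the adult event rate, which is why a classifier with a very high AUROC can still have a modest AUPRC when deaths are rare.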

4.3 Comparison with Other Literature

Several previous publications have also proposed mortality prediction models using the MIMIC database. Each study adopted a different version of the data and unique inclusion criteria, resulting in sample sizes ranging from hundreds to tens of thousands. In addition, different evaluation metrics and specifically designed features have been chosen to suit each model. Therefore, it is difficult to fairly compare model performance across studies. To provide an approximate comparison to our results, we exclude studies with overly specific cohort inclusion criteria, without a reported AUROC, or with a sample size of less than 9,000. Table 3 shows the prediction performance reported in the literature for in-hospital mortality using the MIMIC database. Among these studies, Harutyunyan et al. [31] used the data and criteria most similar to our study. Our pipeline achieves a 4 percentage point increase in AUROC compared with their results. Overall, while some of the reported AUROCs are close to 90%, our pipeline produces the highest AUROC for in-hospital mortality prediction compared with similar studies.

Table 3

Literature reported prediction performance on in-hospital mortality using MIMIC database.

| Reference | Sample size | Model type | Method | AUROC | MIMIC version | Inclusion criteria |
| --- | --- | --- | --- | --- | --- | --- |
| [32] | 14,739 | Linear | LR | 82% | II | Have SAPS-I, LOS 24hr, first ICU stay only |
| [33] | 19,308 | Linear | SVM | 84% | II | Age > 18, > 100 words across all notes |
| [34] | 9,683 | Non-Linear | AutoTriage | 88% | III | Age > 18, In MICU, > 1 obs. for all features, 500hr ≥ LOS ≥ 17hr |
| [31] | 42,276 | Non-Linear | LSTM | 86% | III | Age > 18, only one ICU stay during the hospital admission |
| [35] | 24,508 | Non-Linear | SuperLearner | 88% | II | Age > 15 |

4.4 Validation of the pipeline using eICU database

The eICU Collaborative Research Database collects data from many critical care units throughout the continental United States. The data in the collaborative database cover 139,367 patients who were admitted to critical care units in 2014 and 2015. To validate the method of constructing the proposed pipeline and test the generalizability of the results, we rebuild the mortality prediction pipeline using adult patients (age > 15, as defined in MIMIC-III) with their first admission records in the eICU database. Fig A.3 presents the corresponding data preprocessing and feature engineering pipeline. As a result, there are 117,456 unique patients and 436 features (389 ICD-9 categories, 45 laboratory tests and 2 demographic features), with an in-hospital mortality rate of 9.7%. Fig A.4 shows that our pipeline provides satisfactory prediction accuracy on this database. LightGBM still outperforms the other methods, achieving the highest AUROC of 0.92 (95% CI: 0.91, 0.92).

5 Discussion

We propose a pipeline that extracts ICD-9 codes and laboratory tests from publicly available electronic ICU databases, and improves in-hospital mortality prediction accuracy using a combination of neural network missing data imputation and decision tree based algorithms. Among the 5 candidate algorithms, LightGBM yields the highest prediction accuracy. Our analysis also shows that although the proposed model achieves a very high AUROC for neonates, the AUPRC is relatively low, indicating that mortality prediction is more challenging for neonates than for adults because of the extremely low event probability. Prediction accuracy decreases as patients stay longer in the ICU. This is probably because patients who stay longer are more heterogeneous and have more complicated disease status. For our data, LightGBM has a higher AUROC than other well-known machine learning and deep learning methods, and it also yields a more interpretable model by identifying the most important features.
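The gap between a high AUROC and a low AUPRC at low event rates can be reproduced with a small simulation. The sketch below is illustrative only: the two prevalences (10% and 1%) and the Gaussian score distributions are assumptions chosen for the example, not values from our cohorts. Average precision is used as a step-wise estimate of AUPRC, and the scores are continuous, so ties are ignored.

```python
import random


def auroc(labels, scores):
    """AUROC from the rank sum of positives (assumes no tied scores)."""
    ranked = sorted(zip(scores, labels))
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, (_, y) in enumerate(ranked, start=1) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def average_precision(labels, scores):
    """Mean of the precision values at each true positive, ranked by score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for k, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / k  # precision among the top-k ranked patients
    return total / hits


def simulate(prevalence, n=20000, seed=1):
    """Same score separation between classes; only the event rate changes."""
    rng = random.Random(seed)
    n_pos = round(prevalence * n)
    labels = [1] * n_pos + [0] * (n - n_pos)
    scores = [rng.gauss(2.0 if y else 0.0, 1.0) for y in labels]
    return labels, scores


labels_common, scores_common = simulate(0.10)  # hypothetical adult-like rate
labels_rare, scores_rare = simulate(0.01)      # hypothetical rare-event rate

auroc_common = auroc(labels_common, scores_common)
auroc_rare = auroc(labels_rare, scores_rare)
ap_common = average_precision(labels_common, scores_common)
ap_rare = average_precision(labels_rare, scores_rare)

print(f"10% events: AUROC={auroc_common:.3f}, AP={ap_common:.3f}")
print(f" 1% events: AUROC={auroc_rare:.3f}, AP={ap_rare:.3f}")
```

Because AUROC depends only on how well the two classes are separated, it stays roughly constant across the two settings, whereas precision is diluted by the much larger pool of survivors when events are rare. A random classifier has an expected AUPRC equal to the event rate, so AUPRC values should be read against that baseline.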

Beyond the improvement in AUROC, a precise mortality prediction before the patient enters the ICU is particularly valuable when ICU resources are scarce, especially during crises such as the recent COVID-19 pandemic. Using our pipeline, we are able to predict the mortality probability of severely ill patients as long as the required laboratory test results are available, even if the patients have not been admitted to the ICU. Extensions of this work might therefore be used for the allocation of scarce ICU resources. We also consider this work useful for clinical support and decision-making. The top 30 features selected by LightGBM have been reported by clinical experts to be important indicators associated with mortality rate [23], which further validates the feature importance ranking returned by our LightGBM model. We note that resource allocation depends on a complex relationship between the likelihood of survival and the benefits to the patient or society at large, which include considerations of remaining years of life. Our methods to improve prediction of neonatal mortality are thus of special interest, as patient age is effectively a meaningful predictor of mortality, an important stratification variable, and an important indicator of patient benefit through the remaining years of life. These aspects raise important ethical considerations that are beyond our scope. However, we believe our approaches provide an essential input into these considerations. Further extensions of our approaches may also involve different phenotypes or predictors of outcomes, which may be useful in hypothesizing the effect of different treatments, as a step toward justifying experimental clinical trials.

Although we propose a mortality prediction pipeline that outperforms the performance reported in the literature, this work has some limitations. First, in the MIMIC-III and eICU databases, the ICD-9 codes are documented at the end of a patient's ICU stay. The ICD-9 features therefore include information from during the ICU stay, and there is no way to filter them by time. We follow the approach used by El-Rashidy et al. [36] and Huang et al. [21], who used ICD-9 codes in MIMIC-III in a similar way to provide information about patients' diseases. However, if timestamps accompany the ICD-9 codes in other databases, the disease diagnoses could be used more flexibly for dynamic prediction. Second, both MIMIC and eICU data are collected from US hospitals, which may cause generalization issues when applying the models in non-US hospitals. However, the three types of features, namely demographics, diagnosis codes (ICD-9: International Classification of Diseases) and laboratory tests, are used worldwide. As a result, in non-US hospital settings, integrating the proposed pipeline into local, national, and international healthcare systems could still be useful and help save lives, as long as the three types of data are collected and the models are properly trained.

Additional File

The additional file for this article can be found as follows:


A.1 Additional Figures, A.2 Additional Tables and A.3 Additional Information. DOI:


Dr. Zhou’s start-up funding and CFF KNOWLE18XX0 have supported this study.

Competing Interests

The authors have no competing interests to declare.


  1. Marjut Varpula, Minna Tallgren, Katri Saukkonen, Liisa-Maria Voipio-Pulkki, Ville Pettilä. Hemodynamic variables related to outcome in septic shock. Intensive care medicine. 2005; 31(8): 1066–1071. DOI: 

  2. Jean-Louis Vincent, Daniel De Backer. Circulatory shock. New England Journal of Medicine. 2013; 369(18): 1726–1734. DOI: 

  3. Daniel De Backer, Patrick Biston, Jacques Devriendt, Christian Madl, Didier Chochrad, Cesar Aldecoa, Alexandre Brasseur, Pierre Defrance, Philippe Gottignies, Jean-Louis Vincent. Comparison of dopamine and norepinephrine in the treatment of shock. New England Journal of Medicine. 2010; 362(9): 779–789. DOI: 

  4. William A Knaus, Elizabeth A Draper, Douglas P Wagner, Jack E Zimmerman. APACHE II: a severity of disease classification system. Critical care medicine. 1985; 13(10): 818–829. DOI: 

  5. Jean-Roger Le Gall, Stanley Lemeshow, Fabienne Saulnier. A new simplified acute physiology score (SAPS II) based on a European/North American multicenter study. JAMA. 1993; 270(24): 2957–2963. DOI: 

  6. Jason Waechter, Anand Kumar, Stephen E Lapinsky, John Marshall, Peter Dodek, Yaseen Arabi, Joseph E Parrillo, R Phillip Dellinger, Allan Garland, Cooperative Antimicrobial Therapy of Septic Shock Database Research Group, et al. Interaction between fluids and vasoactive agents on mortality in septic shock: a multicenter, observational study. Critical care medicine. 2014; 42(10): 2158–2168. DOI: 

  7. Marc Leone, Pierre Asfar, Peter Radermacher, Jean-Louis Vincent, Claude Martin. Optimizing mean arterial pressure in septic shock: a critical reappraisal of the literature. Critical care. 2015; 19(1): 101. DOI: 

  8. Maurizio Cecconi, Daniel De Backer, Massimo Antonelli, Richard Beale, Jan Bakker, Christoph Hofer, Roman Jaeschke, Alexandre Mebazaa, Michael R Pinsky, Jean Louis Teboul, et al. Consensus on circulatory shock and hemodynamic monitoring. Task force of the European Society of Intensive Care Medicine. Intensive care medicine. 2014; 40(12): 1795–1815. DOI: 

  9. Kitty S Chan, Jinnet B Fowles, Jonathan P Weiner. Electronic health records and the reliability and validity of quality measures: a review of the literature. Medical Care Research and Review. 2010; 67(5): 503–527. DOI: 

  10. Debashis Mondal, Donald B Percival. Wavelet variance analysis for gappy time series. Annals of the Institute of Statistical Mathematics. 2010; 62(5): 943–966. DOI: 

  11. Kira Rehfeld, Norbert Marwan, Jobst Heitzig, Jürgen Kurths. Comparison of correlation analysis techniques for irregularly sampled time series. Nonlinear Processes in Geophysics. 2011; 18(3): 389–404. DOI: 

  12. Pedro J García-Laencina, José-Luis Sancho-Gómez, Aníbal R Figueiras-Vidal. Pattern classification with missing data: a review. Neural Computing and Applications. 2010; 19(2): 263–282. DOI: 

  13. Rahul Mazumder, Trevor Hastie, Robert Tibshirani. Spectral regularization algorithms for learning large incomplete matrices. The Journal of Machine Learning Research. 2010; 11: 2287–2322. 

  14. Yehuda Koren, Robert Bell, Chris Volinsky. Matrix factorization techniques for recommender systems. Computer. 2009; 42(8): 30–37. DOI: 

  15. Ian R White, Patrick Royston, Angela M Wood. Multiple imputation using chained equations: issues and guidance for practice. Statistics in medicine. 2011; 30(4): 377–399. DOI: 

  16. Melissa J Azur, Elizabeth A Stuart, Constantine Frangakis, Philip J Leaf. Multiple imputation by chained equations: what is it and how does it work? International journal of methods in psychiatric research. 2011; 20(1): 40–49. DOI: 

  17. Jinsung Yoon, James Jordon, Mihaela Van Der Schaar. GAIN: Missing data imputation using generative adversarial nets. arXiv preprint arXiv:1806.02920; 2018. 

  18. Alistair EW Johnson, Tom J Pollard, Lu Shen, Li-wei H Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, Roger G Mark. MIMIC-III, a freely accessible critical care database. Scientific data. 2016; 3(1): 1–9. DOI: 

  19. Zheng Dai, Siru Liu, Jinfa Wu, Mengdie Li, Jialin Liu, Ke Li. Analysis of adult disease characteristics and mortality on MIMIC-III. PloS one. 2020; 15(4): e0232176. DOI: 

  20. Wei-Qi Wei, Lisa A Bastarache, Robert J Carroll, Joy E Marlo, Travis J Osterman, Eric R Gamazon, Nancy J Cox, Dan M Roden, Joshua C Denny. Evaluating phecodes, clinical classification software, and ICD-9-CM codes for phenome-wide association studies in the electronic health record. PloS one. 2017; 12(7): e0175508. DOI: 

  21. Jinmiao Huang, Cesar Osorio, Luke Wicent Sy. An empirical evaluation of deep learning for ICD-9 code assignment using MIMIC-III clinical notes. Computer methods and programs in biomedicine. 2019; 177: 141–153. DOI: 

  22. Joseph J Frassica. Frequency of laboratory test utilization in the intensive care unit and its implications for large-scale data collection efforts. Journal of the American Medical Informatics Association. 2005; 12(2): 229–233. DOI: 

  23. Anis Sharafoddini, Joel A Dubin, David M Maslove, Joon Lee. A new insight into missing data in intensive care unit patient profiles: observational study. JMIR medical informatics. 2019; 7(1): e11605. DOI: 

  24. Chenxi Sun, Shenda Hong, Moxian Song, Hongyan Li. A review of deep learning methods for irregularly sampled medical time series data. arXiv preprint arXiv:2010.12493; 2020. 

  25. Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, Tie-Yan Liu. LightGBM: A highly efficient gradient boosting decision tree. In Advances in neural information processing systems. 2017; 3146–3154. 

  26. Jerome H Friedman. Greedy function approximation: a gradient boosting machine. Annals of statistics. 2001; 1189–1232. DOI: 

  27. Cheng Chen, Qingmei Zhang, Qin Ma, Bin Yu. LightGBM-PPI: Predicting protein-protein interactions through LightGBM with multi-information fusion. Chemometrics and Intelligent Laboratory Systems. 2019; 191: 54–64. DOI: 

  28. Rafal Jozefowicz, Wojciech Zaremba, Ilya Sutskever. An empirical exploration of recurrent network architectures. In International conference on machine learning. 2015; 2342–2350. PMLR. 

  29. Sepp Hochreiter, Jürgen Schmidhuber. Long short-term memory. Neural Computation. 1997; 9(8): 1735–1780. DOI: 

  30. Jesse Davis, Mark Goadrich. The relationship between precision-recall and ROC curves. In Proceedings of the 23rd international conference on Machine learning. 2006; 233–240. DOI: 

  31. Hrayr Harutyunyan, Hrant Khachatrian, David C Kale, Greg Ver Steeg, Aram Galstyan. Multitask learning and benchmarking with clinical time series data. Scientific data. 2019; 6(1): 1–18. DOI: 

  32. Li-wei Lehman, Mohammed Saeed, William Long, Joon Lee, Roger Mark. Risk stratification of ICU patients using topic models inferred from unstructured progress notes. In AMIA Annual Symposium Proceedings. 2012; 505. 

  33. Marzyeh Ghassemi, Tristan Naumann, Finale Doshi-Velez, Nicole Brimmer, Rohit Joshi, Anna Rumshisky, Peter Szolovits. Unfolding physiological state: Mortality modeling in intensive care units. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. 2014; 75–84. DOI: 

  34. Jacob Calvert, Qingqing Mao, Jana L Hoffman, Melissa Jay, Thomas Desautels, Hamid Mohamadlou, Uli Chettipally, Ritankar Das. Using electronic health record collected clinical variables to predict medical intensive care unit mortality. Annals of Medicine and Surgery. 2016; 11: 52–57. DOI: 

  35. Romain Pirracchio, Maya L Petersen, Marco Carone, Matthieu Resche Rigon, Sylvie Chevret, Mark J van der Laan. Mortality prediction in intensive care units with the Super ICU Learner Algorithm (SICULA): a population-based study. The Lancet Respiratory Medicine. 2015; 3(1): 42–52. DOI: 

  36. Nora El-Rashidy, Shaker El-Sappagh, Tamer Abuhmed, Samir Abdelrazek, Hazem M El-Bakry. Intensive care unit mortality prediction: an improved patient-specific stacking ensemble model. IEEE Access. 2020; 8: 133541–133564. DOI: 

  37. Yi-Hui Zhou, Ehsan Saghapour. ImputEHR: a visualization tool of imputation for the prediction of biomedical data. Frontiers in Genetics. 2021; 12. DOI: