Introduction

Millions of people undergo surgery each year, encountering significant risks and substantial costs, even in advanced healthcare systems. 

Atidia, in collaboration with Austin Health, conducted research into the use of Machine Learning with Electronic Health Records (EHR) data. The study shows Machine Learning’s potential to improve risk prediction accuracy and empower clinical decision making. It was published in BMC Medical Informatics and Decision Making (Kowadlo et al.).

Since conducting this research, we have explored how Machine Learning models could be used effectively within a clinical application, Atidia’s ‘Patient Optimiser’. 

We welcome collaboration with others working on similar research or interested in applying these methods in clinical practice. If you would like to learn more, please send us a message (info@atidia.health).

The Study

This research explored whether Machine Learning algorithms could be used to generate useful predictions for four key surgical outcomes:

  • Any Complication: occurrence of any post-surgical complication
  • Mortality: in-hospital death post-surgery
  • Readmission: repeat hospital admission after surgery and discharge
  • Length of Stay (LOS): duration between admission and discharge. LOS was segmented into ‘low’ (<=31 hours), ‘medium’ (between 31 hours and 117 hours), or ‘high’ (> 117 hours)

A risk prediction model was developed using readily available EHR data, which included demographic and procedure information, medications, pathology results, diagnoses (ICD-10) and comorbidities, involving 11,475 adults who underwent elective procedures.

We also explored whether the models could be interpretable by clinicians to foster trust and enhance decision making.

Contribution to research 

This project resulted in several insights that expanded on prior research:

  1. A small dataset with limited features can be sufficient for useful predictions: the dataset used for this research was of limited size (11,475 admissions) with a relatively small set of variables, many of which had missing data. While these factors presented research challenges, we believe they provide useful insights into real-world implementation of machine learning, because they are representative of real hospital datasets. 
  2. The Precision-Recall (PR) curve should be used to evaluate rare-condition models: most research uses the Receiver Operating Characteristic (ROC) curve to evaluate performance, which is useful for many datasets, but not when the predicted outcome is rare (for example, only 41 mortality cases, 0.36% of total samples). We used the PR curve as an alternative, which was not commonly used in similar studies, and published the results to demonstrate its usefulness and for future reference.
  3. Length of Stay prediction: to the best of our knowledge, ML risk predictors developed prior to this paper do not consider LOS, except for a model from May 2018 (Rajkomar et al.). That model predicted ‘long length-of-stay’, defined as ‘at least 7 days’, whereas our model predicts a multivalue LOS: low, medium or high. A multivalue LOS makes it possible to have a dynamic definition of ‘prolonged’ that depends on factors such as the procedure and the patient. For example, a predicted medium stay (two to four nights) could indicate a risk of a ‘prolonged’ stay for a procedure that is typically short-stay (one night) in healthy patients, but would not be considered ‘prolonged’ for other types of surgery. 

Collaboration between technical and clinical experts

Data extraction from an EHR designed for clinical workflow, not data analytics, posed challenges such as missing values and overlapping categories. Resolving these required both technical and domain knowledge, with clinicians playing a crucial role in interpreting the data and guiding the process. 

Clinical knowledge was invaluable in feature engineering by simplifying data into clinically meaningful categories (e.g., assigning risk levels to procedures) or generating new features; for example, summing the number of diagnoses as a proxy for ‘patient risk’ (i.e. a patient who has been diagnosed with more clinical conditions may be more likely to experience surgical complications).
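
As a minimal sketch of this kind of feature engineering, assuming a hypothetical diagnosis table with one row per (admission, ICD-10 code) pair and illustrative column names, a diagnosis count per admission could be derived as follows:

```python
import pandas as pd

# Hypothetical diagnosis table: one row per (admission, ICD-10 code) pair.
diagnoses = pd.DataFrame({
    "admission_id": [1, 1, 1, 2, 3, 3],
    "icd10_code": ["E11", "I10", "N18", "I10", "J45", "E66"],
})

# Count distinct diagnoses per admission as a simple proxy for 'patient risk'.
diagnosis_count = (
    diagnoses.groupby("admission_id")["icd10_code"]
    .nunique()
    .rename("diagnosis_count")
    .reset_index()
)
print(diagnosis_count)
```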

Note: utilising human expertise can introduce bias into the model, as it relies on the experience and opinions of particular clinicians. However, it remains an effective way to capture existing clinical knowledge, which can enhance the model’s predictive power.

The process was typical for Machine Learning projects. If you are interested in more information, please read the published article.

Model selection and training

The target predictions required classification models. These are designed to predict whether the target outcomes are likely for each patient admission, i.e. each prediction answered a ‘yes’ or ‘no’ question (for LOS: ‘low’, ‘medium’ or ‘high’), rather than predicting a quantity such as the number of complications or days in hospital. 
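
As an illustration (not the study’s exact pipeline, and using made-up hours), the three LOS classes defined earlier could be derived from the duration between admission and discharge like this:

```python
import pandas as pd

# Made-up lengths of stay in hours for a handful of admissions.
los_hours = pd.Series([24, 48, 150, 30, 117, 200], name="los_hours")

# Segment into the classes used in the study:
# 'low' (<= 31 h), 'medium' (31-117 h), 'high' (> 117 h).
los_class = pd.cut(
    los_hours,
    bins=[0, 31, 117, float("inf")],
    labels=["low", "medium", "high"],
)
print(pd.concat([los_hours, los_class.rename("los_class")], axis=1))
```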

The research explored various algorithms, including ‘eXtreme Gradient Boosted Trees’ (XGBoost), logistic regression, and neural networks. Simple methods like logistic regression and XGBoost proved both effective and efficient, outperforming more complex models.

The process of model training included automated feature selection, hyperparameter tuning, and bootstrapping to evaluate performance. 
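
A minimal sketch of such a training and evaluation loop, using scikit-learn and XGBoost on synthetic data (the grid, metric and parameters are illustrative, not the values used in the study; automated feature selection is omitted):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.utils import resample
from xgboost import XGBClassifier

# Synthetic, imbalanced stand-in for the EHR feature matrix.
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Hyperparameter tuning over a small illustrative grid,
# scored by area under the Precision-Recall curve.
grid = GridSearchCV(
    XGBClassifier(eval_metric="logloss", random_state=0),
    param_grid={"max_depth": [3, 5], "n_estimators": [100, 300], "learning_rate": [0.05, 0.1]},
    scoring="average_precision",
    cv=5,
)
grid.fit(X_train, y_train)

# Bootstrap the held-out set to estimate the spread of a performance metric.
auroc_samples = []
for seed in range(100):
    X_b, y_b = resample(X_test, y_test, random_state=seed)
    auroc_samples.append(roc_auc_score(y_b, grid.predict_proba(X_b)[:, 1]))

print(grid.best_params_)
print("AUROC 95% interval:", np.percentile(auroc_samples, [2.5, 97.5]))
```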

How the models were evaluated

In a clinical setting, it is important to correctly identify patients who are likely to suffer an adverse condition (‘true positives’, to identify required interventions) and avoid incorrectly classifying patients who would not suffer an adverse condition as if they would (‘false positives’, to avoid unnecessary treatments).

There is a trade-off between the two objectives. As an illustration, consider a model that predicts all patients as ‘positive’ (likely to have an adverse condition). It will not miss any patients who get the condition; however, it will incorrectly classify patients who will not get the condition, which may result in unnecessary, and in some cases harmful, interventions. 

Mapping these trade-offs is done using curves which plot one metric against the other. The curve can then be used to generate a score of overall performance, by calculating the area under the curve (larger area means better score, because it allows higher performance on both metrics).

For many datasets, a useful method for evaluating performance is the ‘Receiver Operating Characteristic’ (ROC) curve. It plots model performance on two axes: how often the model correctly predicts positive cases (‘True Positive Rate’, y axis) against how often it mistakenly predicts negative cases as positive (‘False Positive Rate’, x axis). ROC is commonly used in the research literature; we used it to evaluate all models and compare them to other studies.

As an example, in the context of mortality prediction, the ‘Receiver Operating Characteristic’ axes are:

  • x: what proportion of patients who will not die, were predicted to die
  • y: what proportion of patients who will die, were predicted to die 

For mortality, this was not a useful assessment because the outcome was rare: there were only 41 cases of mortality (0.36% of total cases). In such scenarios ROC can be misleading and model performance may be overstated. 

To improve the evaluation, we used the ‘Precision-Recall’ (PR) curve, which focuses on how accurate the model is when it predicts positive cases. It plots model performance on two axes: the proportion of positive cases the model correctly identifies out of all positive cases (recall, x axis) against the proportion of correct predictions out of all the positive predictions made (precision, y axis).

In the context of mortality prediction, the ‘Precision-Recall’ curve axes are:

  • x: what proportion of patients who will die, were predicted to die
  • y: what proportion of patients who were predicted to die, will actually die

In the published plots, the mortality model appeared effective under ROC but was shown to be ineffective by the PR curve. 
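
To make the contrast concrete, here is a small sketch on synthetic data (not the study’s dataset): with a rare outcome, a weakly informative risk score can still produce a respectable-looking AUROC while the area under the PR curve remains very low.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)

# Synthetic rare outcome (~0.4% positives) and a weakly informative risk score.
y_true = rng.random(100_000) < 0.004
score = rng.normal(size=100_000) + 1.0 * y_true

print("prevalence:", y_true.mean())
print("AUROC:", roc_auc_score(y_true, score))             # looks healthy
print("AUPRC:", average_precision_score(y_true, score))   # remains very low
```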

Results

Determining at what level of performance a model is considered ‘useful’ is context-specific. At a minimum, models should be better than random predictions, i.e. the ‘Area Under the ROC curve’ (AUROC) should be higher than 0.5. In practice, performance just above that level is unlikely to be considered useful; a more reasonable level would be higher than 0.7. A threshold for the area under the PR curve (AUPRC) is not as straightforward, but the objective is an ‘acceptable’ balance between precision and recall.

Overall, the models were effective in predicting LOS and ‘any complication’; they did not perform well on predictions of mortality and readmission, probably because there were insufficient occurrences of these cases in the data.

Prediction            AUROC    AUPRC
Any complication      0.755    0.651
Length-of-stay        0.841    0.741

For length of stay and complications, the results are comparable to similar studies, despite fewer data types and a much smaller dataset. Unfortunately, AUPRC was not reported in those studies, so we could not compare it to gain a fuller picture.

For the full results, refer to the published paper.

Explainability

Clinician trust, which is essential for successful implementation, can be facilitated by providing explanations: showing which factors led to a particular prediction. This understanding also assists decision making; for example, if a prediction is based on a particular health condition, the clinician could decide to address that condition prior to surgery.

We used SHapley Additive exPlanations (SHAP) to calculate and visualise these explanations, assigning scores to model features that reflect their positive or negative impact on a prediction. 
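
As a minimal sketch, assuming a tree-based classifier like the XGBoost models above and synthetic data, per-feature SHAP contributions can be computed and summarised as follows:

```python
import shap
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

# Synthetic stand-in for the admission-level feature matrix.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model = XGBClassifier(eval_metric="logloss", random_state=0).fit(X, y)

# TreeExplainer assigns each feature a positive or negative contribution
# to every individual prediction.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Contributions for a single admission (here: the first row)...
print(shap_values[0])

# ...and a global summary of how features push predictions up or down.
shap.summary_plot(shap_values, X)
```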

Future potential 

Ultimately, Atidia’s goal is to improve patient outcomes and operational efficiency. 

This study shows Machine Learning’s potential for surgical risk prediction using standard EHR data. The prediction models are relevant, useful, and explainable. If used within clinical applications, they could inform decision making, leading to a wide range of potential benefits, such as preventing health complications, or improving efficiency by predicting demand for resources such as operating theatres and ward beds.

This is an ongoing area of R&D at Atidia, and we welcome opportunities for collaboration. If you would like to explore collaboration opportunities, please contact us (info@atidia.health).

References and further reading

  1. Kowadlo, G., Mittelberg, Y., Ghomlaghi, M. et al. Development and validation of ‘Patient Optimizer’ (POP) algorithms for predicting surgical risk with machine learning. BMC Med Inform Decis Mak 24, 70 (2024). https://doi.org/10.1186/s12911-024-02463-w
  2. Story DA. Postoperative complications in Australia and New Zealand (the REASON study). Perioper Med. 2013;2(1):2–4. https://doi.org/10.1186/2047-0525-2-16
  3. Rajkomar, A., Oren, E., Chen, K. et al. Scalable and accurate deep learning with electronic health records. npj Digital Med 1, 18 (2018). https://doi.org/10.1038/s41746-018-0029-1
  4. XGBoost Documentation
  5. Precision-Recall Curves: How to Easily Evaluate Machine Learning Models in No Time | Better Data Science
  6. Receiver Operating Characteristic
  7. SHAP documentation