Jaw-dropping AI demos are now commonplace. We should all feel excited about the translation of those capabilities to health applications, but can we reasonably expect that to happen? I strongly believe that a key ingredient is the availability of large public datasets – but they’re unfortunately very rare, which limits progress.

Let’s pause and reflect on the path that was taken to breakthroughs in other fields. AI first began to rival human performance through Machine Learning (ML). Instead of using features and rules handcrafted by engineers, ML systems could learn directly from data. This shift led to major breakthroughs, fuelled by large public datasets.

One well-known example is ImageNet, a large-scale visual database designed for use in object recognition research. It contains over 14 million images organised into more than 20,000 categories. The associated competition, known as the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), began in 2010 and quickly became a benchmark for progress in computer vision and deep learning. The ILSVRC challenged participants to develop algorithms for tasks such as object detection and image classification. It gained significant attention in 2012 when a deep convolutional neural network called AlexNet, developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton (who went on to share the 2024 Nobel Prize in Physics for foundational work on neural networks), dramatically outperformed all previous approaches. This breakthrough moment is often cited as the beginning of the deep learning revolution in AI.

The competition not only spurred rapid advancements in computer vision but also demonstrated the power of large, well-annotated datasets in driving AI progress. It led to the development of increasingly sophisticated neural network architectures, such as VGGNet, GoogLeNet, and ResNet, each pushing the boundaries of what was possible in image recognition and classification.

Another notable benchmark is GLUE (General Language Understanding Evaluation), which has been instrumental in advancing natural language processing models. These benchmarks have not only provided standardised evaluation metrics but have also driven competition and innovation in the AI community.

In healthcare, we have seen significant strides by leveraging large private medical datasets to improve diagnostics, treatment planning, and patient care. In particular, deep-learning models trained on extensive collections of medical images have shown remarkable accuracy in detecting various conditions, from cancer to eye disease. However, these types of projects take years to incubate due to slow and complex hospital bureaucracy, patient privacy and IP concerns. As a result, progress is limited. 

If there were large, easily accessible health datasets comparable to ImageNet and GLUE, academics, startups and large corporations would race to advance their research and gain competitive advantage, bringing new ideas and energy to the table. As in other domains, there would be an explosion of innovation, creating value that would far outstrip any perceived cost.

Such datasets also promote transparency and enable independent verification of research findings, both of which are rare in health yet crucial for clinician trust and adoption.

Perhaps the most significant barrier is patient privacy; health data is among the most sensitive of personal data. The need to protect patient privacy is undeniable, but it’s a solvable problem. It’s possible to strip away identifying information, and to use techniques such as differential privacy, so that patients cannot practically be re-identified, while preserving most of the data’s research value.
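To make the idea of differential privacy a little more concrete, here is a minimal sketch of one of its simplest building blocks, the Laplace mechanism, applied to a count. It is an illustration only, not a recipe for releasing a real dataset: the toy records, the function names and the choice of epsilon are all assumptions made for this example, and it protects an aggregate statistic rather than row-level data.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one sample from a Laplace(0, scale) distribution.

    The difference of two independent exponential draws with rate
    1/scale follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Return a noisy count of the records that match `predicate`.

    Adding or removing one patient changes a count by at most 1, so the
    sensitivity is 1 and the Laplace noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical toy records, invented for illustration: (age, has_condition)
patients = [(34, True), (51, False), (67, True), (45, True), (72, False)]

rng = random.Random(0)  # fixed seed only so the sketch is repeatable
noisy = private_count(patients, lambda p: p[1], epsilon=1.0, rng=rng)
print(f"Noisy count of patients with the condition: {noisy:.2f}")
```

A smaller epsilon adds more noise and gives stronger privacy; a larger epsilon gives a more accurate answer with weaker privacy. That trade-off is the price of the guarantee, and choosing it well is part of making the problem solvable.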

Another challenge is a growing awareness of the value of data and a fear of giving that value away. But if you’ve got an orange, the best way to get orange juice is to team up with someone who has a juicer. What’s more, this data-orange can be juiced 100 different ways and it still won’t be used up.

Hospitals will accrue value as they remain a vital part of the full research and development lifecycle. For example, it is possible to do a retrospective study on a public dataset, but the models must be validated and tested prospectively in collaboration with clinicians. In addition, any research will ultimately benefit health systems, creating a virtuous cycle of improvement. So rather than giving away value, it could increase competitive advantage. 

Datasets themselves are a major research contribution. Academic institutions, including hospitals and universities, would therefore have a big research impact, reflected in citations of the dataset. We’ve seen this with many other ML benchmarks (like ImageNet, mentioned above), which boost academic prestige for the institution and attract students and researchers.

Publishing data is an opportunity for governments to foster innovation, attract investment, and reduce healthcare costs; after all, governments should have a say in how health data is used. By facilitating breakthroughs in medical AI, governments can position themselves as leaders in healthcare technology, potentially creating new economic opportunities and improving the overall health of their populations.

Despite these benefits, taking a new approach to sharing data may be perceived as a risk. But health data should be considered the patients’ data – it’s not solely for the benefit of the organisation that holds it. There is a moral obligation to share it so that patients can benefit from innovation, even if part of that innovation comes from competitors.

There are some new examples of public datasets, such as the UK Biobank. It’s a great initiative containing detailed genetic and health information from half a million UK participants. However, accessing data is still challenging; it’s only available to established researchers, and the process can be lengthy.

We urgently need initiatives from governments and hospital groups to make large, de-identified datasets publicly available in a safe way. Illness touches everyone, directly or through loved ones, and society deserves the opportunity to pursue better health outcomes as rapidly as possible.