Researchers Use Machine Learning Algorithms To Uncover New Diabetes Sub-Groups


Researchers from the Departments of Endocrinology, Personalized Medicine, Genomic Sciences, and Health Policy at the Icahn School of Medicine at Mount Sinai have jointly published new findings from a computational approach to studying type 2 diabetes. The team used machine-learning algorithms to sift through thousands of patient records, analyzing hundreds of individual data points within each record, to learn more about the underlying workings of the disease. While researchers have been studying diabetes for decades, the depth and speed with which advanced algorithms can analyze large datasets are helping them uncover new information about it.


The team employed a machine learning-based sorting technique to create a patient similarity network, a network map of the dataset in which patients with like symptoms or conditions are grouped together. The algorithm worked its way through the records of 2,500 diabetic patients, sorting them based on hundreds of data points in their charts. The resulting scatter plot clearly shows three distinct subgroups within the overall type 2 diabetes patient population. Endocrinologists had long known through observation that type 2 diabetes seemed to present in a number of distinct groups, but until now these groups had been neither defined nor studied.
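
To make the idea of a patient similarity network concrete, here is a minimal sketch in Python. It is not the method used in the study, which the article does not describe in detail. It assumes each patient is reduced to a short numeric vector, uses cosine similarity as the similarity measure, links two patients when their similarity reaches a chosen threshold, and reads the groups off as connected components of the resulting network. The records, the threshold, and all function names are hypothetical.

```python
import math
from collections import deque


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_similarity_network(records, threshold):
    """Link every pair of patients whose similarity is at least `threshold`."""
    neighbors = {i: set() for i in range(len(records))}
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if cosine_similarity(records[i], records[j]) >= threshold:
                neighbors[i].add(j)
                neighbors[j].add(i)
    return neighbors


def connected_groups(neighbors):
    """Return the connected components of the network as sorted index lists."""
    seen = set()
    groups = []
    for start in neighbors:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        group = []
        while queue:  # breadth-first walk over one component
            node = queue.popleft()
            group.append(node)
            for nxt in sorted(neighbors[node]):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        groups.append(sorted(group))
    return groups


# Hypothetical toy records, not study data: three made-up features per
# patient, already scaled to the range 0..1.
records = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.1, 0.9, 0.1],
    [0.0, 0.8, 0.2],
    [0.1, 0.1, 0.9],
    [0.2, 0.0, 0.8],
]

network = build_similarity_network(records, threshold=0.9)
groups = connected_groups(network)
print(groups)  # → [[0, 1], [2, 3], [4, 5]]
```

With real chart data the features would first need cleaning and scaling, and the pairwise comparison above grows quadratically with the number of patients, so a dataset of 2,500 records with hundreds of features would likely call for a dedicated clustering or network library.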

Once the subgroups were established, researchers returned to the data to see what they could learn about each group. Interestingly, they found that classic features of type 2 diabetes, like obesity, high blood sugar, kidney disease, and eye disease, were predominant in only one of the three newly discovered subtypes, while the other two groups showed higher rates of cancer and neurological disorders. Researchers next turned to genetic data about the patients, examining underlying genetic mutations and finding that each subgroup had its own unique genetic profile. “The fact that these genetic factors matched up so nicely with the clinical factors suggests that there’s actual biology underlying the differences between these patients,” explains lead researcher Joel Dudley, director of biomedical informatics at Mount Sinai.
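
The kind of per-group comparison described above can be sketched as a simple prevalence calculation: for each subgroup, the share of its patients who carry a given condition. This is only an illustration of the idea, not the study's analysis. The group labels, the condition flags, and the function name are all hypothetical, and the numbers are invented.

```python
from collections import defaultdict


def prevalence_by_group(group_labels, has_condition):
    """Return {group: fraction of that group's patients with the condition}."""
    # group -> [patients with the condition, total patients]
    counts = defaultdict(lambda: [0, 0])
    for group, flag in zip(group_labels, has_condition):
        counts[group][1] += 1
        if flag:
            counts[group][0] += 1
    return {group: with_cond / total for group, (with_cond, total) in counts.items()}


# Hypothetical toy data, not study results: one subgroup label and one
# condition flag per patient.
group_labels = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
has_kidney_disease = [True, True, True, False,
                      False, False, True, False,
                      False, False, False, False]

rates = prevalence_by_group(group_labels, has_kidney_disease)
for group in sorted(rates):
    print(f"subgroup {group}: {rates[group]:.0%}")
```

A real comparison would also need a statistical test to judge whether a difference in rates between subgroups is larger than chance would explain.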

The team cautioned that, despite analyzing 2,500 records, their dataset was too small to establish a causal relationship. Larger datasets with this level of detail are in the process of being created. Google is ramping up data collection within its Life Sciences Division through its Baseline Study, which it hopes will lead to a database containing thousands of medical and genealogical data points collected from thousands of patients. President Obama’s Precision Medicine Initiative aims to create an even larger dataset, to which one million participants will contribute data from EHRs, genetic data, chemical and microbiological data, and lifestyle and environmental data.

With efforts currently underway to build increasingly comprehensive datasets, and researchers advancing machine-learning analysis and discovery methods, big data may well get credit for ramping up the rate of discovery in medical research.
