The genome acts as a blueprint for the body, influencing everything from the shape of the face to the arches of the feet and even the development of certain diseases. Some diseases, such as cystic fibrosis, are linked to a single gene and can be reliably predicted from an individual’s genetic data. Many others, such as autism spectrum disorders, Alzheimer’s disease, depression, and obesity, are not.
Over the past 15 years, scientists have used genome-wide association studies (GWAS), which compare the genomes of large groups of people, to identify hundreds of thousands of genetic variations associated with traits and diseases.1 This method has helped scientists unravel the biology and risk factors underlying complex diseases, and has also led to the discovery of new drug targets. Despite these advances, GWAS has limitations, which scientists have attempted to address with the help of artificial intelligence (AI). However, in two studies published in Nature Genetics, researchers at the University of Wisconsin-Madison found that these new approaches may introduce pervasive bias when working with large but incomplete datasets.2,3
GWAS relies on large biobanks with extensive patient data. However, these repositories often have gaps, ranging from blood reports, scans, and patient medical histories to family data. Even in a thorough study, challenges such as a lack of data on late-onset disease in cohorts of young participants can seriously hamper researchers’ plans.
To address the data gap, scientists have developed two approaches: AI-assisted GWAS, which relies on machine learning to predict missing trait data, and GWAS-by-proxy (GWAX), which relies on family history data as a proxy for late-onset disease. Many researchers combine GWAS and GWAX to improve statistical power. But a team of researchers at the University of Wisconsin-Madison has found that these “solutions” can incorrectly link genetic variants to disease.
“Leveraging advances in machine learning has become very popular in recent years, so researchers now use these advanced machine learning AI models to predict complex traits and disease risk even with limited data,” said Qiongshi Lu, a biostatistician at the University of Wisconsin-Madison and coauthor of the studies, in a press release.
Using AI-assisted GWAS, Lu and his colleagues found spurious associations between genetic variants and type 2 diabetes. For example, four genetic variants were strongly associated with the disease in AI-assisted GWAS, but not when using traditional GWAS approaches. Previous studies have shown that although these genes act on cellular pathways indirectly related to blood sugar levels, they do not have a strong effect on blood sugar itself.
In cohorts where all samples have genetic data but only some have the required phenotypic data, an AI-assisted GWAS algorithm attempts to fill in the gaps based on learned patterns. However, without knowledge of physiological complexity, this approach can lead researchers down the wrong path.
“The problem is that if you trust the diabetes risk predicted by machine learning as the actual risk, you would think all these genetic variations are correlated with actual diabetes, even though they are not,” Lu said.
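The mechanism Lu describes can be illustrated with a small simulation. The sketch below is not the researchers' method or data; it is a toy model under stated assumptions: a continuous trait driven by one causal variant, a biomarker that tracks the trait but is also influenced by a second, unrelated variant, and a simple linear predictor standing in for the machine learning model. All names (`simulate`, `assoc_stat`, `g_offtarget`, the effect sizes) are hypothetical.

```python
import math
import random


def pearson_r(a, b):
    """Pearson correlation between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def assoc_stat(genotype, phenotype):
    """t-statistic for the slope of phenotype ~ genotype (close to a z-score for large n)."""
    r = pearson_r(genotype, phenotype)
    return r * math.sqrt((len(genotype) - 2) / (1 - r * r))


def simulate(n=20000, maf=0.3, seed=1):
    rng = random.Random(seed)

    def genotype():
        # Number of minor alleles (0, 1, or 2) at one variant.
        return float((rng.random() < maf) + (rng.random() < maf))

    g_causal = [genotype() for _ in range(n)]
    g_offtarget = [genotype() for _ in range(n)]

    # Assumption: the true trait depends on g_causal only.
    trait = [0.3 * g + rng.gauss(0, 1) for g in g_causal]
    # Assumption: the biomarker tracks the trait but has its own genetic influence.
    marker = [t + 0.3 * g + rng.gauss(0, 1) for t, g in zip(trait, g_offtarget)]

    # The trait is measured in roughly half of the cohort.
    observed = [rng.random() < 0.5 for _ in range(n)]
    m_obs = [m for m, o in zip(marker, observed) if o]
    t_obs = [t for t, o in zip(trait, observed) if o]
    g_obs = [g for g, o in zip(g_offtarget, observed) if o]

    # "Train" a linear predictor of the trait from the biomarker on observed samples.
    mean_m, mean_t = sum(m_obs) / len(m_obs), sum(t_obs) / len(t_obs)
    slope = sum((m - mean_m) * (t - mean_t) for m, t in zip(m_obs, t_obs)) / sum(
        (m - mean_m) ** 2 for m in m_obs
    )
    intercept = mean_t - slope * mean_m
    predicted = [intercept + slope * m for m in marker]

    return {
        # Off-target variant tested against the measured trait.
        "measured": assoc_stat(g_obs, t_obs),
        # Same variant tested against the predicted trait in the full cohort.
        "predicted": assoc_stat(g_offtarget, predicted),
    }


if __name__ == "__main__":
    print(simulate())
```

In this toy setup the off-target variant has no effect on the trait, yet it appears strongly associated with the predicted trait, because the predictor inherits the biomarker's own genetics. Treating the prediction as if it were the measured trait is what produces the false signal.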
Filling gaps in biobank data with proxies also causes problems. For example, when analyzing the correlation between multiple traits and the risk of developing Alzheimer’s disease, Lu observed discrepancies with GWAS results based on real-world data. The main discrepancy concerned the association between educational attainment and risk of Alzheimer’s disease. Multiple groups have reported an inverse association between these variables, and this finding is supported by GWAS. However, Lu observed a positive correlation when using the GWAX approach. Additionally, the proxy-based approach failed to show an association between the disease and subsequent cognitive decline, unlike previous data and GWAS findings.
The research team proposed a new statistical method that researchers can use to correct for these biases and increase the reliability of their research results. They call on the research community to report findings transparently and to adopt a more rigorous and cautious outlook when drawing conclusions from these methods.
“Our group’s recent work provides a humbling example and highlights the importance of statistical rigor in biobank-scale research studies,” Lu said.