Machine learning and artificial intelligence aim to develop computer algorithms that improve with experience. These algorithms can help analyze very large data sets, including data from genomic sequencing.
Image Credit: Gorodenkoff / Shutterstock
Machine learning methods
Machine learning methods are typically carried out in three stages. First, a researcher develops an algorithm that they expect will lead to successful learning.
Next, the algorithm is provided with a large collection of data. The data includes both positive and negative examples, so the algorithm can learn to distinguish between the two. The known outcome attached to each example is called a label, and the algorithm processes the labeled examples and stores what it has learned as a model.
Lastly, new unlabeled data is given to the algorithm, which uses the model to predict the labels for the new set of data. If the learning was successful, the predicted labels for the new data will be correct.
This method is referred to as supervised learning. It is used when the goal is for the algorithm to learn to recognize a specific, known category or value in a set of data.
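The three stages can be sketched in a few lines of Python. This is a minimal illustration, not a method from any particular study: the nearest-centroid rule, the function names `train` and `predict`, and the toy feature values are all assumptions chosen to keep the example short.

```python
from statistics import mean

def train(examples):
    """Stage 2: turn labeled examples into a model (one centroid per label)."""
    by_label = {}
    for features, label in examples:
        by_label.setdefault(label, []).append(features)
    # The stored "model" is the mean feature vector of each label.
    return {label: [mean(column) for column in zip(*rows)]
            for label, rows in by_label.items()}

def predict(model, features):
    """Stage 3: give new data the label whose centroid is closest."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: squared_distance(model[label]))

# Hypothetical labeled training data: (feature vector, label).
training = [
    ([0.9, 0.8], "positive"), ([1.0, 0.7], "positive"),
    ([0.1, 0.2], "negative"), ([0.2, 0.1], "negative"),
]
model = train(training)

# New, unlabeled data for which the model predicts labels.
new_data = [[0.95, 0.9], [0.05, 0.15]]
predictions = [predict(model, x) for x in new_data]
print(predictions)
```

Stage 1, choosing the algorithm, corresponds here to the decision to use a nearest-centroid rule in the first place.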
Unsupervised learning methods do not provide the algorithm with labeled examples to aid learning but give the algorithm raw data in the hope that it can find a structure within the data set.
The researcher must use what they already know about the data to design a suitable model for the algorithm to apply.
Applications of machine learning to genetics
Machine learning algorithms can be used to analyze large sets of genomic sequencing data. Supervised learning methods for gene identification require the input of labeled DNA sequences that specify the start and end locations of each gene.
The algorithm then uses these labeled sequences to learn the general properties of genes, such as DNA sequence patterns and the location of stop codons.
After this training, the model can use these learned properties to identify additional genes from new data sets that resemble the genes in the training set.
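As a rough illustration of this idea, the sketch below learns two properties from labeled sequences: which codons are more common inside genes than outside them, and which codons appear at the end of a gene. Everything here is a toy assumption, including the sequences, the function names, and the scoring rule. Real gene finders are considerably more elaborate.

```python
import math
from collections import Counter

def codons(seq):
    """Split a sequence into consecutive, non-overlapping triplets."""
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]

def train(labeled):
    """labeled: list of (sequence, gene_start, gene_end), end exclusive."""
    inside, outside, terminal = Counter(), Counter(), set()
    for seq, start, end in labeled:
        gene = codons(seq[start:end])
        inside.update(gene)
        outside.update(codons(seq[:start]) + codons(seq[end:]))
        terminal.add(gene[-1])  # learn which codons close a gene
    vocab = set(inside) | set(outside)
    n_in, n_out = sum(inside.values()), sum(outside.values())
    # Log-odds of seeing each codon inside vs. outside a gene,
    # with add-one smoothing so unseen counts do not give log(0).
    log_odds = {
        c: math.log((inside[c] + 1) / (n_in + len(vocab)))
           - math.log((outside[c] + 1) / (n_out + len(vocab)))
        for c in vocab
    }
    return {"log_odds": log_odds, "terminal": terminal}

def looks_like_gene(model, candidate):
    """Predict a label for a new candidate region."""
    cs = codons(candidate)
    if not cs:
        return False
    score = sum(model["log_odds"].get(c, 0.0) for c in cs)
    return score > 0 and cs[-1] in model["terminal"]

# Hypothetical training set: each gene spans positions 6 to 21.
training = [
    ("CCCCCC" + "ATGGCTGCTAAATAA" + "CCCGGG", 6, 21),
    ("GGGCCC" + "ATGAAAGCTGCTTGA" + "GGGGGG", 6, 21),
]
model = train(training)
print(looks_like_gene(model, "ATGGCTAAAGCTTGA"))
print(looks_like_gene(model, "CCCGGGCCCGGGCCC"))
```

The first candidate resembles the training genes in codon content and ends in a learned terminal codon, while the second resembles the flanking regions.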
For deep learning algorithms to work successfully, loss functions (which measure how far a single prediction is from the true label) and risk functions (which give the average loss across the training set) are used during training to correct the algorithm's false predictions.
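The relationship between the two functions can be made concrete with a short sketch. The choice of logarithmic loss is an assumption made for illustration, since the text does not name a specific loss function, and the predicted probabilities are invented.

```python
import math

def log_loss(p, y):
    """Loss for one example.

    p is the predicted probability that the label is 1,
    y is the true label (0 or 1).
    """
    eps = 1e-12  # keep p away from 0 and 1 so log() stays finite
    p = min(max(p, eps), 1 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def empirical_risk(predictions, labels, loss=log_loss):
    """Risk: the average loss across the whole training set."""
    return sum(loss(p, y) for p, y in zip(predictions, labels)) / len(labels)

labels = [1, 0, 1, 0]
confident_and_right = [0.9, 0.1, 0.8, 0.2]  # hypothetical model outputs
mostly_wrong = [0.4, 0.6, 0.5, 0.7]         # hypothetical model outputs

print(empirical_risk(confident_and_right, labels))
print(empirical_risk(mostly_wrong, labels))
```

Training then amounts to adjusting the model's parameters so that the risk decreases. The predictions that agree with the labels give a lower risk than those that do not.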
When training data is not available, unsupervised learning methods are used. An example of when this may be needed is during the interpretation of heterogeneous genomic data.
Histone modification, chromatin accessibility, and transcription factor binding along the genome can provide information regarding the activity of the genome. An unsupervised algorithm can then use this information to assign a set of labels to regions of the genome.
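One way to picture this is to cluster genomic bins by their signal values and treat the cluster index as the label. The sketch below uses k-means clustering, which is an assumption made for illustration and not necessarily the method used in practice. The three signal values per bin are invented.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=20):
    """Group points into k clusters without using any labels."""
    # Deterministic start: take initial centroids spread across the input.
    step = len(points) // k
    centroids = [list(points[i * step]) for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical bins along the genome, each described by
# (histone modification, chromatin accessibility, TF binding) signals.
bins = [
    (0.90, 0.80, 0.70), (0.80, 0.90, 0.90), (0.10, 0.20, 0.00),
    (0.00, 0.10, 0.10), (0.85, 0.75, 0.80), (0.05, 0.10, 0.20),
]
labels, centroids = kmeans(bins, k=2)
print(labels)
```

The algorithm is never told which bins are active. It only discovers that the bins fall into two groups, and a researcher would still need to interpret what each group means biologically.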
Both methods can be used to discover genes of interest and other information about a sequenced genome.
Recent advancements in genetics using machine learning
Principal component analysis (PCA) is an example of unsupervised learning which is used to discover the strength of unknown relationships among individuals.
PCA takes a mixture of different genotypes (with very high dimensionality) and produces a lower-dimension summary that reveals how genotypes cluster.
PCA has previously been used to show how relationships among European individuals mirror geography.
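The dimension reduction that PCA performs can be sketched with a small genotype matrix. The code below finds only the first principal component, using power iteration on the covariance matrix. The genotype values (0, 1, or 2 copies of an allele at each site) and the two groups of individuals are invented for illustration, and a real analysis would use a numerical library and far more sites.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return the first principal axis and each row's score along it."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix between sites (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration: repeatedly multiply by the covariance matrix and
    # renormalize. The uneven start vector is an arbitrary choice, and the
    # sketch assumes the data has some variance (norm is never zero).
    v = [1.0 / (j + 1) for j in range(d)]
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project each individual onto the axis: one number per individual.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores

# Hypothetical genotypes: 6 individuals (rows) at 4 sites (columns).
genotypes = [
    [0, 0, 2, 2], [0, 1, 2, 2], [0, 0, 2, 1],  # first three individuals
    [2, 2, 0, 0], [2, 1, 0, 0], [2, 2, 0, 1],  # last three individuals
]
v, scores = first_principal_component(genotypes)
print([round(s, 2) for s in scores])
```

Each individual is reduced from four numbers to one, and the sign of that number separates the first three individuals from the last three. This is the clustering that the lower-dimension summary reveals.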
Supervised machine learning methods were recently used to discriminate between genomic regions experiencing purifying selection and those free of selective constraint. This was achieved using only population genomic data.
This study discovered candidate regions of the genome that were highly enriched in the regulatory domains of genes that are important for the proper development of the central nervous system.
The presence of the candidate regions near a gene can predict human-specific changes of expression in the brain.
Perspectives and the future of machine learning in genetics
In conclusion, machine learning is a broad and complex field. Its algorithms can allow for more accurate analysis of data than many existing methods.
The method of machine learning that is used will depend on the nature of the data that is available and what the researchers are trying to discover.
More research into machine learning and artificial intelligence will provide more accurate ways to analyze genomic data in the future, which will lead to more discoveries.