MIT scientists find that AI built to understand words can also spot new coronavirus mutations

Galileo once said that nature is written in mathematics. Biology may be written in words. Natural language processing (NLP) algorithms can now generate protein sequences and predict viral mutations, including key mutations that help the new coronavirus evade the immune system.
This is possible thanks to a key insight: many properties of biological systems can be interpreted in terms of words and sentences. “We’re learning the language of evolution,” says Bonnie Berger, a computational biologist at MIT.
In the past few years, several researchers, including geneticist George Church’s lab and a team at Salesforce, have shown that protein sequences and genetic codes can be modeled using NLP techniques.
Recently, Berger and her colleagues published their research in the journal Science. In the study, they brought several of these strands of work together and used NLP to predict mutations that could help a virus avoid antibodies in the human immune system. A virus evading antibodies in this way is called “viral immune escape”. The basic idea of the study is that the immune system interprets a virus much as a human interprets a sentence.
Salesforce scientist Ali Madani, who is using NLP to predict protein sequences, said: “The paper is well written and builds on the momentum of previous work.”
Berger’s team uses two linguistic concepts: grammar and semantics (or meaning). A virus’s ability to infect a host, along with other aspects of its genetic or evolutionary fitness, can be interpreted in terms of grammatical correctness. A virus that is infectious is grammatically correct; one that is not infectious is grammatically incorrect.
Similarly, mutations can be interpreted in terms of semantics. For example, if a protein on the surface of the virus mutates, some antibodies may no longer be able to detect the virus. Mutations that make a virus look different to the things in its environment change its meaning. Viruses with different mutations can have different meanings, and a virus with a different meaning may need different antibodies to read it.
To model these properties, the researchers used an LSTM, a type of neural network that predates the transformer-based networks now used by large language models such as GPT-3. These older networks can be trained on far less data than transformers and still perform well in many applications.
Instead of millions of sentences, the researchers trained the NLP model on thousands of genetic sequences from three viruses: 45,000 unique sequences for influenza strains, 60,000 for HIV strains, and between 3,000 and 4,000 for new coronavirus strains. Brian Hie, a graduate student at MIT, built the model. “There is less data for the new coronavirus because there has been less surveillance of it,” he said.
An NLP model encodes words in a mathematical space, so that words with similar meanings sit closer together than words with different meanings. This representation is called an “embedding”. For viruses, embedding the genetic sequences groups the viruses according to how similar their mutations are.
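To make the idea of distance in an embedding space concrete, here is a minimal sketch. The three-dimensional vectors and the names `wild_type`, `mutant_a` and `mutant_b` are invented for illustration, and cosine distance is only one common choice of measure; the article does not say which measure the team used.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity: 0 for same direction, larger for less similar."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical embeddings (made-up numbers, not output from a trained model).
embeddings = {
    "wild_type": [0.90, 0.10, 0.20],
    "mutant_a": [0.88, 0.12, 0.21],  # assumed to be semantically close to wild type
    "mutant_b": [0.10, 0.90, 0.40],  # assumed to be semantically distant
}

d_a = cosine_distance(embeddings["wild_type"], embeddings["mutant_a"])
d_b = cosine_distance(embeddings["wild_type"], embeddings["mutant_b"])
print(f"wild_type vs mutant_a: {d_a:.4f}")
print(f"wild_type vs mutant_b: {d_b:.4f}")
```

In this toy setup `mutant_a` lands much closer to `wild_type` than `mutant_b` does, which is the sense in which an embedding groups similar sequences together.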
The overall aim of the approach is to identify mutations that might let a virus escape the immune system without making it less infectious, that is, mutations that change a virus’s meaning without making it grammatically incorrect. To test the model, the team used a common metric for assessing predictions made by machine-learning models, which scores accuracy on a scale from 0.5 (no better than chance) to 1 (perfect).
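One way to picture the search for mutations that score high on both criteria is to rank the candidates on each and add the ranks. The sketch below assumes each candidate already has a semantic-change score and a grammaticality score. The candidate names, the numbers and the rank-sum rule are illustrative assumptions, not the team’s published procedure.

```python
def rank_ascending(values):
    """Return ranks 1..n, where rank 1 goes to the smallest value."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks

def prioritize(candidates):
    """Order candidates so those high on BOTH scores come first.

    candidates: list of (name, semantic_change, grammaticality) tuples.
    """
    names = [c[0] for c in candidates]
    change_ranks = rank_ascending([c[1] for c in candidates])
    grammar_ranks = rank_ascending([c[2] for c in candidates])
    combined = {n: cr + gr for n, cr, gr in zip(names, change_ranks, grammar_ranks)}
    return sorted(names, key=lambda n: combined[n], reverse=True)

# Hypothetical candidates: (name, semantic change, grammaticality).
candidates = [
    ("mut_1", 0.90, 0.05),  # large change in meaning, but probably not viable
    ("mut_2", 0.80, 0.60),  # large change in meaning and still viable
    ("mut_3", 0.10, 0.50),  # viable, but meaning barely changes
    ("mut_4", 0.05, 0.10),  # low on both
]

ranking = prioritize(candidates)
print(ranking[0])  # the candidate to test first in the lab
```

With these made-up scores `mut_2` comes out on top, because it is the only candidate that changes meaning substantially while staying grammatical.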
In the study, the researchers took the top mutations identified by the model and, using real viruses in the lab, checked how many of them actually helped the virus escape immunity. The lowest score was 0.69 for HIV strains and the highest was 0.85 for new coronavirus strains. The researchers said these results were better than those of other state-of-the-art models.
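The article does not name the metric, but a score running from 0.5 (chance) to 1 (perfect) matches the area under the ROC curve (AUC), which is what this sketch assumes. It computes AUC as the fraction of (escape, non-escape) pairs in which the escape mutation receives the higher model score. The scores and labels are invented for illustration.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical model scores and lab results (1 = confirmed escape mutation).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 0, 1, 1, 0, 0]

auc = roc_auc(scores, labels)
print(f"AUC = {auc:.3f}")
```

A model that scored every confirmed escape mutation above every non-escape mutation would reach 1.0, while a model that gave every mutation the same score would sit at 0.5.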
Forewarning
Knowing which mutations may be coming makes it easier for hospitals and public health authorities to plan ahead. For example, asking the model how much the meaning of an influenza strain has changed since 2020 gives a sense of how well the antibodies people have already produced will work this year.
The team said it is now running the model on new variants of the coronavirus, including the variants found in the UK, in Danish mink, and in South Africa, Singapore and Malaysia. The researchers have found that these variants may have a high potential for immune escape, although this has not yet been tested outside the laboratory.
However, the model failed to predict one mutation in the South African variant. The mutation has raised concerns that it may help the virus evade vaccines, and the researchers are trying to work out why the model missed it. “The South African variant contains multiple mutations, and we believe that the combined effect of these mutations may lead to immune escape,” Berger said.
Using NLP speeds up what used to be a slow process: taking the virus from a COVID-19 patient in hospital, sequencing its genome, then recreating and studying the mutations in the laboratory. Bryan Bryson, a biologist at MIT who worked on the project, said this could take weeks. The NLP model predicts potential mutations directly, which gives the lab work a focus and speeds it up.
“The whole thing has been eye-opening,” Bryson said. New virus sequences arrive every week. “It’s amazing to update the model and then run to the lab to test it,” he said. “That’s what’s good about computational biology.”
But that’s just the beginning. Treating genetic mutations as changes in meaning could be applied in different areas of biology. “A good analogy can make a big difference,” Bryson said.
For example, Hie believes the team’s approach can be applied to the study of drug resistance. “Think of cancer proteins that become resistant to chemotherapy, or bacterial proteins that become resistant to antibiotics,” he said. These mutations can also be seen as changes in meaning. “We can be very creative in how we interpret language models.”
“I think biology is on the verge of a revolution,” Madani said. “Instead of just collecting a lot of data, we’re moving to learning how to understand it in depth.”
In general, researchers are following developments in NLP and looking for new analogies between language and biology so that they can take advantage of that progress. However, Bryson, Berger and Hie all believe the crossover between biology and NLP could go both ways, with new NLP algorithms inspired by concepts from biology. “Biology has its own language,” Berger said.
