The number of studies about COVID-19 has risen sharply since the start of the pandemic, from around 20,000 in early March to over 30,000 as of late June. In an effort to help clinicians digest the vast amount of biomedical knowledge in the literature, researchers affiliated with Columbia, Brandeis, DARPA, UCLA, and UIUC developed a framework, COVID-KG, that draws on papers to answer natural language questions about drug repurposing and more.
The sheer volume of COVID-19 research makes it difficult to sort the wheat from the chaff. Some false information has been promoted on social media and in publication venues like journals. And many results about the virus from different labs and sources are redundant, complementary, or apparently in conflict.
COVID-KG aims to solve the challenge by reading papers to build multimedia knowledge graphs consisting of nodes and edges. The nodes represent entities and concepts extracted from papers’ text and images, while the edges represent relations involving these entities.
COVID-KG ingests entity types including genes, diseases, chemicals, and organisms; relations like mechanisms, therapeutics, and increased expressions; and events such as gene expression, transcription, and localization. It also draws on entities annotated from an open source data set tailored for COVID-19 studies, which includes entity types like coronaviruses, viral proteins, evolution, materials, and immune response.
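To make the node-and-edge structure concrete, here is a minimal Python sketch of a typed knowledge graph. It is not the researchers' implementation: the `Entity` and `KnowledgeGraph` classes, the `add_relation` and `neighbors` methods, and the paper identifier are all hypothetical names chosen for illustration, and the single example edge is a well-known fact about Losartan rather than output from COVID-KG.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    """A node in the graph. The type labels follow the article's examples."""
    name: str
    entity_type: str  # e.g. "gene", "disease", "chemical", "organism"


@dataclass
class KnowledgeGraph:
    """Hypothetical adjacency-list graph: source -> [(relation, target, paper_id)]."""
    edges: dict = field(default_factory=lambda: defaultdict(list))

    def add_relation(self, source, relation, target, paper_id):
        # Each edge keeps the paper it came from, so an answer can cite evidence.
        self.edges[source].append((relation, target, paper_id))

    def neighbors(self, source, relation=None):
        # Return every edge leaving `source`, optionally filtered by relation type.
        return [
            (rel, tgt, pid)
            for rel, tgt, pid in self.edges.get(source, [])
            if relation is None or rel == relation
        ]


# Illustrative usage. "example-paper-1" is a placeholder, not a real identifier.
kg = KnowledgeGraph()
losartan = Entity("Losartan", "chemical")
hypertension = Entity("hypertension", "disease")
kg.add_relation(losartan, "therapeutic", hypertension, "example-paper-1")

for rel, target, paper in kg.neighbors(losartan, relation="therapeutic"):
    print(f"{losartan.name} --{rel}--> {target.name} (source: {paper})")
```

Storing the source paper on every edge is one plausible way to let a question like "what is this drug approved to treat?" be answered by following typed edges while still pointing back to the supporting literature.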
COVID-KG extracts visual information from figure images (e.g., microscopic images, dosage response curves, and relational diagrams) to enrich the knowledge graph. After detecting and isolating the figures in each document, together with the associated text in their captions or referring context, it applies computer vision to spot and separate non-overlapping regions and recognize the molecular structures within each figure.
COVID-KG provides semantic visualizations like tag clouds and heat maps that allow researchers to get a view of selected relations from hundreds or thousands of papers at a single glance. This, in turn, allows for the identification of relationships that would typically be missed by keyword searches or simple word cloud or heat map displays.
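As a rough illustration of the data that could sit behind such a heat map, the Python sketch below counts how many distinct papers support each source-target pair and arranges the counts in a matrix. This is an assumption about how relations might be aggregated, not a description of COVID-KG's visualization code; the function names and the placeholder entities and paper IDs are hypothetical.

```python
from collections import Counter


def relation_counts(triples):
    """Count the distinct papers supporting each (source, target) pair."""
    seen = set()
    counts = Counter()
    for source, target, paper_id in triples:
        key = (source, target, paper_id)
        if key in seen:
            continue  # the same paper mentioning a pair twice counts once
        seen.add(key)
        counts[(source, target)] += 1
    return counts


def to_matrix(counts):
    """Lay the counts out as rows (sources) by columns (targets)."""
    rows = sorted({source for source, _ in counts})
    cols = sorted({target for _, target in counts})
    matrix = [[counts.get((r, c), 0) for c in cols] for r in rows]
    return rows, cols, matrix


# Placeholder data: the entity names and paper IDs are invented for illustration.
triples = [
    ("drug_a", "disease_x", "p1"),
    ("drug_a", "disease_x", "p1"),  # duplicate mention in the same paper
    ("drug_a", "disease_x", "p2"),
    ("drug_b", "disease_y", "p3"),
]
rows, cols, matrix = to_matrix(relation_counts(triples))
print(rows)    # ['drug_a', 'drug_b']
print(cols)    # ['disease_x', 'disease_y']
print(matrix)  # [[2, 0], [0, 1]]
```

Each cell of the resulting matrix could then be mapped to a color intensity, so a pair supported by many papers stands out at a glance.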
In a case study, the researchers posed to COVID-KG a series of 11 questions typically answered in a drug repurposing report, like “Was the drug identified by manual or computation screen?” and “Has the drug shown evidence of systemic toxicity?” With three drugs suggested by DARPA biologists (Benazepril, Losartan, and Amodiaquine) as targets, they used COVID-KG to construct a knowledge base from 25,534 peer-reviewed papers.
Given the question “What is the drug class and what is it currently approved to treat?” COVID-KG responded with:
The team reports that in the opinion of clinicians and medical school students who reviewed the results, COVID-KG’s answers were “informative, valid, and sound.” In the future, the coauthors plan to extend the system to automate the creation of new hypotheses by predicting new links. They also hope to produce a common semantic space for literature and apply it to improve COVID-KG’s cross-media knowledge grounding, inference, and transfer.
“With COVID-KG, researchers and clinicians are able to obtain trustworthy and non-trivial answers from scientific literature, and thus focus on more important hypothesis testing, and prioritize the analysis efforts for candidate exploration directions,” the coauthors wrote. “In our ongoing work we have created a new ontology that includes 77 entity subtypes and 58 event subtypes, and we are re-building an end-to-end joint neural … system following this new ontology.”