Five ways AI supports key drug research decisions with unstructured data

AI is able to incorporate millions of documents in every analysis

The volume of publicly available biomedical literature, life science journals, and other unstructured data is staggering. PubMed alone currently holds 29 million citations[1]. And, according to the National Science Foundation, the annual global output of scientific articles has increased steadily over the past two decades, reaching 2.6 million articles in 2016[2]. With this much information available to researchers, the potential for novel insights is higher than ever before. However, the volume exceeds the capacity of any person, or even any team of people, to manually read, digest, and collate into a comprehensive view. Thankfully, AI methods have been developed to tackle just this type of challenge. By ingesting and analyzing millions of documents in seconds, AI can construct dynamic, comprehensive networks of relationships between a wide variety of concepts across a vast corpus of data, providing the inclusive view needed to make informed decisions.

AI understands the context within each document to extract key insights

Unstructured data, in the form of literature, is very rich in information but poses a significant obstacle to automated assessment. Unlike databases and other forms of structured data, human written language has not traditionally been accessible to computational understanding, given the difficulty of training a machine to grasp the nuance of free text. Furthermore, the density of data within scientific literature is on the rise, having doubled over the last twenty years[3]. To overcome this obstacle and return real insights, AI, in the form of Natural Language Processing (NLP), applies form and function to the unstructured data through annotation, normalization and ontologies. Annotation, or named entity recognition, identifies the words and phrases within a document that are of a certain type, such as genes. Normalization, or named entity resolution, uses the context of the free text to identify which specific gene the authors are referring to. Together, these processes define the relevant meaning of each word within a document and so provide directional context among concepts. Placing these concepts within hierarchical, ontological structures then provides an overarching organizational context for them across the corpus of data. Leading solutions, such as IBM Watson for Drug Discovery, take this a step further by grounding these capabilities in a fundamental understanding of biological concepts and processes.
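To make the two steps concrete, here is a minimal sketch of annotation followed by normalization. It is a toy, dictionary-based illustration and not how any particular product works: the lexicon, the cue words, and the function names `annotate` and `normalize` are all assumptions made up for this example, and real systems typically rely on trained models and curated ontologies instead.

```python
import re

# Hypothetical toy lexicon (an assumption for illustration only):
# alias in lowercase -> candidate gene symbols, each with context cue words.
LEXICON = {
    "psa": {
        "KLK3": {"prostate", "serum", "antigen"},
        "NPEPPS": {"aminopeptidase", "puromycin"},
    },
    "tdp-43": {"TARDBP": set()},
}


def annotate(text):
    """Annotation (named entity recognition): find spans matching a known alias."""
    mentions = []
    for alias in LEXICON:
        # Require that the alias is not glued to other word characters or hyphens.
        pattern = r"(?<![\w-])" + re.escape(alias) + r"(?![\w-])"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            mentions.append((match.start(), match.end(), match.group(0)))
    return sorted(mentions)


def normalize(mention, text):
    """Normalization (named entity resolution): choose the candidate gene
    whose cue words overlap most with the words around the mention."""
    context = set(re.findall(r"[a-z0-9-]+", text.lower()))
    candidates = LEXICON[mention.lower()]
    # On a tie, max() keeps the first candidate listed in the lexicon.
    return max(candidates, key=lambda symbol: len(candidates[symbol] & context))


sentence_1 = "Serum PSA levels rose in prostate cancer patients."
sentence_2 = "PSA is a puromycin-sensitive aminopeptidase."

for sentence in (sentence_1, sentence_2):
    for start, end, surface in annotate(sentence):
        print(surface, "->", normalize(surface, sentence))
```

The same surface form, "PSA", resolves to a different gene symbol in each sentence because the surrounding words differ, which is the role context plays in normalization.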

AI looks beyond what is explicitly reported to predict what is inferred across the literature

Applying form and function to the literature is the first step to delivering a deep, contextual analysis. Through a semantic understanding of individual sentences, NLP is able to provide a rich assessment of the relationships between entities that are explicitly defined within an article. Moreover, one key advantage of some of the more advanced AI technologies is the ability to run predictive analytics on top of this foundational knowledge. For example, the AI can create a unique fingerprint for an entity (or indeed any defined concept) based on the words and phrases used to describe it within the literature. It can then approximate the functional similarity between entities by computing the similarity between these “semantic fingerprints”, predicting possible connections and commonalities between entities regardless of their “proximity” to each other within the literature. In this way, researchers have a powerful tool for making truly novel discoveries, as can be seen in the recent discovery of five RNA-binding proteins previously unlinked to amyotrophic lateral sclerosis (ALS)[4].
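The fingerprint idea can be sketched in a few lines. The example below assumes the simplest possible fingerprint, a bag-of-words count vector compared with cosine similarity; this is a stand-in chosen for illustration, not the method used by any specific product. The entity names and the descriptive sentences are invented.

```python
import math
import re
from collections import Counter


def fingerprint(descriptions):
    """Build a word-count vector from the sentences that describe an entity."""
    words = re.findall(r"[a-z]+", " ".join(descriptions).lower())
    return Counter(words)


def cosine(a, b):
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented descriptions for three hypothetical entities.
protein_a = fingerprint(["binds RNA and regulates splicing"])
protein_b = fingerprint(["regulates splicing of RNA transcripts"])
protein_c = fingerprint(["membrane channel transporting calcium ions"])

print("A vs B:", round(cosine(protein_a, protein_b), 2))
print("A vs C:", round(cosine(protein_a, protein_c), 2))
```

Note that the two similar entities never need to appear in the same document: the comparison depends only on how each one is described. A production system would likely use weighted terms or learned embeddings rather than raw counts.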

AI can span numerous disciplines or areas of focus

With the wealth of information now available to us, it is common for researchers to specialize in specific diseases, disciplines or functions in order to absorb as much knowledge as possible in a given area. This is a boon in terms of individual expertise, but it can leave an organization dependent on staffing across many different areas and on establishing robust knowledge-sharing mechanisms in order to produce the comprehensive, homogeneous view needed for a given decision. Yet valuable insights can span many areas of expertise, as can be seen in recent efforts to find new treatment options for L-DOPA induced dyskinesia[5]. AI is able to apply a broad understanding of life science, instilled in the foundational training of the technology, to assemble the relevant direct, indirect and inferred relationships. It can then present the assessment to researchers in a way that fully leverages their individual expertise, so that they can make a more informed decision.

Advanced AI technologies sift through the noise to generate confident insights

A major concern for researchers is the accuracy, or reproducibility, of the data within individual publications and, in turn, the validity of the insights generated from these publications as a whole. There are many strategies for addressing this concern, with varying levels of success, but the more advanced AI technologies offer powerful mechanisms to overcome this issue. One such mechanism is independent verification across multiple publications. In this way, more common, and therefore more reproducible, signals are given a higher confidence rating, while facts that occur in only a single (potentially inaccurate or contradictory) publication will generally be labeled with low confidence. This mechanism, in concert with other built-in safeguards, enables IBM Watson for Drug Discovery to provide robust results that have been validated with subsequent laboratory verification[6].
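As a rough illustration of verification across publications, the sketch below counts how many distinct documents support each extracted fact and labels the fact accordingly. The two-source threshold, the "high"/"low" labels, the function name `score_assertions`, and the sample extractions are all assumptions for this example; an actual system would presumably weigh many more signals.

```python
from collections import defaultdict


def score_assertions(extractions, min_sources=2):
    """Label each (subject, relation, object) fact by its independent support.

    extractions: iterable of (document_id, subject, relation, object) tuples.
    min_sources: assumed threshold of distinct documents for "high" confidence.
    """
    sources = defaultdict(set)
    for doc_id, subject, relation, obj in extractions:
        # A set ensures repeated mentions within one document count only once.
        sources[(subject, relation, obj)].add(doc_id)
    return {
        fact: ("high" if len(docs) >= min_sources else "low", len(docs))
        for fact, docs in sources.items()
    }


# Invented extractions from three hypothetical documents.
extractions = [
    ("doc1", "gene_X", "upregulates", "gene_Y"),
    ("doc1", "gene_X", "upregulates", "gene_Y"),  # same document, counted once
    ("doc2", "gene_X", "upregulates", "gene_Y"),
    ("doc3", "gene_X", "inhibits", "gene_Z"),
]

for fact, (label, count) in score_assertions(extractions).items():
    print(fact, label, count)
```

Counting distinct documents rather than raw mentions is the key design choice here: a claim repeated many times in one paper is still only one source.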

Watch our recent webinar to see more on how IBM Watson for Drug Discovery is supporting key decisions in drug discovery research.