AI for Code Encourages Collaborative, Open Scientific Discovery

We have seen significant recent progress in pattern analysis and machine intelligence applied to images, audio and video signals, and natural language text, but not as much applied to another artifact produced by people: computer program source code. In a paper to be presented at the FEED Workshop at KDD 2018, we showcase a system that makes progress towards the semantic analysis of code. By doing so, we provide the foundation for machines to truly reason about program code and learn from it.

The work, also recently demonstrated at IJCAI 2018, is conceived and led by IBM Science for Social Good fellow Evan Patterson and focuses specifically on data science software. Data science programs are a special kind of computer code, often fairly short, but full of semantically rich content that specifies a sequence of data transformation, analysis, modeling, and interpretation operations. Our technique executes a data analysis (imagine an R or Python script) and captures all of the functions that are called in the analysis. It then connects those functions to a Data Science Ontology we have created, performs several simplification steps, and produces a semantic flow graph representation of the program. As an example, the flow graph below is produced automatically from an analysis of rheumatoid arthritis data.

[Figure: Semantic flow graph representation produced automatically from an analysis of rheumatoid arthritis data.]
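To make the pipeline described above more concrete, here is a minimal sketch in Python. It is not our implementation: the `Recorder` class, the three toy functions, the concept names in `ANNOTATIONS`, and the data are all hypothetical and chosen only for illustration. The sketch records each function call while a small analysis runs, links calls by tracking which call produced each argument, and then relabels the calls with abstract concepts. The simplification steps of the real system are not modeled.

```python
import functools

# Hypothetical annotations: concrete function name -> abstract concept.
# These concept names are invented, not taken from the Data Science Ontology.
ANNOTATIONS = {
    "load_table": "read-data",
    "drop_missing": "clean-data",
    "fit_line": "fit-model",
}


class Recorder:
    """Records traced calls and which earlier call produced each input."""

    def __init__(self):
        self.calls = []      # function names, in call order
        self.edges = []      # (producer call index, consumer call index)
        self._producer = {}  # id(value) -> index of the call that returned it
        self._keep = []      # keeps results alive so their ids are not reused

    def trace(self, func):
        @functools.wraps(func)
        def wrapper(*args):
            index = len(self.calls)
            self.calls.append(func.__name__)
            for arg in args:
                if id(arg) in self._producer:
                    self.edges.append((self._producer[id(arg)], index))
            result = func(*args)
            self._producer[id(result)] = index
            self._keep.append(result)
            return result
        return wrapper

    def semantic_graph(self):
        """Return the data-flow edges with calls replaced by concepts."""
        concepts = [ANNOTATIONS[name] for name in self.calls]
        return [(concepts[a], concepts[b]) for a, b in self.edges]


rec = Recorder()


@rec.trace
def load_table():
    # Toy (x, y) rows; None marks a missing value.
    return [(1.0, 2.0), (2.0, None), (3.0, 6.0)]


@rec.trace
def drop_missing(rows):
    return [row for row in rows if None not in row]


@rec.trace
def fit_line(rows):
    # Least-squares slope of a line through the origin.
    return sum(x * y for x, y in rows) / sum(x * x for x, _ in rows)


slope = fit_line(drop_missing(load_table()))
graph = rec.semantic_graph()
print(graph)
```

The resulting `graph` is a list of concept-to-concept edges, which is the kind of object the rest of this post treats as a single data point.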

The technique is applicable across choices of programming language and package. The three code snippets below are written in R, Python with the NumPy and SciPy packages, and Python with the Pandas and Scikit-learn packages. All produce exactly the same semantic flow graph.

[Figures: code snippets in R, in Python with the NumPy and SciPy packages, and in Python with the Pandas and Scikit-learn packages, followed by the single semantic flow graph that all three produce.]
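The idea that different languages and packages collapse to one representation can be sketched as a lookup from concrete functions to shared concepts. The example below is an assumption-laden illustration, not the snippets from the figures: the k-means analysis, the call traces, and the concept names are hypothetical, and only the function names refer to real R and Python functions. Dropping unannotated calls stands in, very loosely, for the simplification steps.

```python
# Hypothetical annotation table: (language, qualified function) -> concept.
# Concept names are invented for illustration.
ANNOTATIONS = {
    ("r", "utils::read.csv"): "read-tabular-file",
    ("r", "stats::kmeans"): "k-means-clustering",
    ("python", "numpy.genfromtxt"): "read-tabular-file",
    ("python", "scipy.cluster.vq.kmeans2"): "k-means-clustering",
    ("python", "pandas.read_csv"): "read-tabular-file",
    ("python", "sklearn.cluster.KMeans.fit"): "k-means-clustering",
}


def to_concepts(language, trace):
    """Translate a recorded call trace into concepts, skipping unannotated calls."""
    return [
        ANNOTATIONS[(language, name)]
        for name in trace
        if (language, name) in ANNOTATIONS
    ]


# Hypothetical call traces for the same analysis written three ways.
r_trace = ["utils::read.csv", "stats::kmeans"]
scipy_trace = ["numpy.genfromtxt", "scipy.cluster.vq.kmeans2"]
sklearn_trace = [
    "pandas.read_csv",
    "pandas.DataFrame.drop",  # not annotated here, so it is skipped
    "sklearn.cluster.KMeans.fit",
]

results = [
    to_concepts("r", r_trace),
    to_concepts("python", scipy_trace),
    to_concepts("python", sklearn_trace),
]
all_equal = all(result == results[0] for result in results)
print(results[0], all_equal)
```

Because the comparison happens at the level of concepts rather than function names, adding support for a new package in this sketch only means adding rows to the annotation table.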

We can think of the semantic flow graph we extract as a single data point, just like an image or a paragraph of text, on which to perform further higher-level tasks. With the representation we have developed, we can enable several useful functionalities for practicing data scientists, including intelligent search and auto-completion of analyses, recommendation of similar or complementary analyses, visualization of the space of all analyses conducted on a particular problem or dataset, translation or style transfer, and even machine generation of novel data analyses (i.e. computational creativity)—all predicated on the truly semantic understanding of what the code does.

The Data Science Ontology is written in a new ontology language we have developed named Monoidal Ontology and Computing Language (Monocl). This line of work was initiated in 2016 in partnership with the Accelerated Cure Project for Multiple Sclerosis.

We encourage you to contribute to the Data Science Ontology by annotating your favorite data science packages and also to use our semantic flow graph extraction algorithm, available on GitHub. Furthermore, plan to attend the AAAI Spring Symposium: Towards Artificial Intelligence for Collaborative, Open Scientific Discovery (TACOS) in March 2019 to go even deeper into this topic.
