Posted in: Thomas J Watson Research Center

Mining Web API Specifications

Online services like Facebook, Twitter, PayPal, LinkedIn, The Weather Company (an IBM business) and IBM Watson, despite their obvious differences, all share one characteristic: application programming interfaces (APIs). APIs make data and capabilities available to third-party applications.

For application developers, web APIs offer tremendous opportunities to integrate vast amounts of data, like social networks or weather data, and advanced functionalities, like payment processing or machine learning capabilities. However, web APIs also present challenges for developers. One central challenge is how to integrate with a web API correctly and keep that integration intact as the API changes over time.

I recently encountered an example illustrating the severity of this challenge: The payment API Adyen (eBay recently announced that it will be replacing PayPal with Adyen as its primary payment provider) observes an average of 60,000 errors daily from code that application developers wrote to access its web API. Now, imagine the number of errors across the thousands of web APIs. How much productivity, or even money, is lost by using web APIs incorrectly?

Web API Specifications

At IBM Research, we are painfully aware of this challenge, because we face it ourselves whenever we build systems or applications that use web APIs. We focused our attention on how to support application developers in using web APIs correctly. One viable path quickly became clear: web API specifications can really make the lives of developers easier! Specification formats such as Swagger or OpenAPI describe web APIs in a machine-understandable way and thus allow machines to help developers avoid pitfalls. For example, tools can use a specification to automatically generate application code for interacting with an API, to test the API, or to document and visualize the capabilities it offers.
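To make "machine-understandable" a bit more concrete, here is a minimal sketch in Python. The dictionary below is a hypothetical, heavily simplified description loosely modeled on the OpenAPI layout (servers, paths, parameters); the API at api.example.com, the endpoint and the helper build_request are all made up for illustration and are not taken from any real service or tool. The point is only that once the URL template and its parameters are written down as data, a program can check a call and assemble the request for the developer.

```python
from urllib.parse import quote, urlencode

# Hypothetical, simplified description of a single endpoint (assumption:
# loosely follows the OpenAPI structure, but is not a complete document).
SPEC = {
    "servers": [{"url": "https://api.example.com/v1"}],
    "paths": {
        "/users/{userId}/posts": {
            "get": {
                "parameters": [
                    {"name": "userId", "in": "path", "required": True},
                    {"name": "limit", "in": "query", "required": False},
                ]
            }
        }
    },
}


def build_request(spec, path, method, args):
    """Return (HTTP method, URL) for a call, validated against the spec."""
    operation = spec["paths"][path][method]
    url_path = path
    query = {}
    for param in operation["parameters"]:
        name = param["name"]
        if name not in args:
            # The spec tells us which parameters a caller must not omit.
            if param.get("required"):
                raise ValueError(f"missing required parameter: {name}")
            continue
        if param["in"] == "path":
            # Fill the placeholder in the URL template.
            value = quote(str(args[name]), safe="")
            url_path = url_path.replace("{" + name + "}", value)
        elif param["in"] == "query":
            query[name] = args[name]
    url = spec["servers"][0]["url"] + url_path
    if query:
        url += "?" + urlencode(query)
    return method.upper(), url


if __name__ == "__main__":
    print(build_request(SPEC, "/users/{userId}/posts", "get",
                        {"userId": 42, "limit": 10}))
```

Real code generators and testing tools do far more than this, but they rest on the same idea: the specification, not a human reading an HTML page, is the source of truth for how a request has to look.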

Despite these advantages, web API specifications aren’t always easy to come by. Providers tend to document their web APIs using HTML pages, which are semi-structured, idiosyncratic and targeted at humans, thus neglecting the potential for machines to support application developers. In fact, we typically come across two general types of documentation, as illustrated in Figure 1: Reference-style documentation that provides information about URLs, parameters, response data and so on in a structured way, and example-style documentation that merely focuses on examples to illustrate the use of the web API.


Figure 1: Web API documentation styles

While different in style, both kinds of documentation contain much of the information that would also go into a web API specification. So, in collaboration with Jinqiu Yang (who interned with IBM Research), Lin Tan from the University of Waterloo and Annie Ying from EquitySim (and previously IBM), we created a system that would read online documentation of web APIs like the ones shown in Figure 1, and automatically generate web API specifications from them.

A major challenge in devising our system is that the information to extract is diverse (e.g., URLs, parameters of different types, data being sent or returned from the API) and located in different places.

In recently published work, we showed that carefully selected and designed machine learning methods help us overcome this challenge. Using supervised learning approaches (classification) as well as unsupervised learning approaches (clustering), we showed that URLs, web API endpoints and methods can be extracted with high accuracy. These initial results are promising, and motivate us to tackle other parts of a specification in the future, like extracting data definitions or human-readable description texts. Our vision is that, when in need of a specification that is otherwise not available, web API consumers can use our technology to automatically generate one themselves.
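To give a feel for the extraction task, here is a deliberately naive sketch in Python. It is not the method from our paper: a regular expression stands in for the learned classifier that spots API calls, and a simple rule (numeric path segments become a placeholder) stands in for clustering example URLs into endpoints. The documentation snippet, the api.example.com URLs and the names CALL_PATTERN, template_of and extract_endpoints are all hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical excerpt of example-style documentation text.
DOC_TEXT = """
To list a user's posts, send GET https://api.example.com/v1/users/42/posts
Another example: GET https://api.example.com/v1/users/7/posts
Create a user with POST https://api.example.com/v1/users
"""

# Naive stand-in for a classifier: an HTTP verb followed by a URL.
CALL_PATTERN = re.compile(
    r"\b(GET|POST|PUT|PATCH|DELETE)\s+(https?://[^\s\"'<>]+)"
)


def template_of(url):
    """Generalize a concrete URL by replacing numeric segments with {id}."""
    scheme, rest = url.split("://", 1)
    segments = rest.split("/")
    generalized = ["{id}" if seg.isdigit() else seg for seg in segments]
    return scheme + "://" + "/".join(generalized)


def extract_endpoints(text):
    """Group the example calls found in text by (method, URL template)."""
    groups = defaultdict(list)
    for method, url in CALL_PATTERN.findall(text):
        groups[(method, template_of(url))].append(url)
    return dict(groups)


if __name__ == "__main__":
    for (method, template), examples in extract_endpoints(DOC_TEXT).items():
        print(method, template, f"({len(examples)} example(s))")
```

A rule like this breaks quickly on real documentation, where identifiers are not always numeric and URLs appear in tables, prose and code samples alike. That diversity is exactly why learned approaches are worth the effort.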

If you are interested in the details, see our paper, “Towards Extracting Web API Specifications from Documentation,” which received the best paper award at this year’s Mining Software Repositories (MSR) conference. You can find it in the ACM library or read the preprint here.
