January 11, 2017 | Written by: James Young
Categorized: Community | Data Analytics
At Offline Camp California, between sunset hayrides and dodging tarantulas on the California coast, the New Builders Podcast caught up with Max Ogden, founder of the Dat Project—a grant-funded, open source, decentralized data sharing tool for efficiently versioning and syncing changes to data.
In particular, the Dat Project is aimed at helping scientific researchers and other academics preserve and share large scientific datasets, both online and offline. Talking to Max raised some interesting ideas about the challenges of dealing with these large datasets—and with the scientists who create them.
As Ogden says: “The scientists we work with—and I think this is true of data scientists in general—their goal isn’t to write a bunch of code; it’s to use the code strategically to answer a scientific question. The code to them is not as much of an art form as it is to a software developer—it’s a lot more pragmatic in certain ways.”
The difference is born of necessity: as a software developer, when you build a new system, you can make your own decisions about what data structures to use, and you generally aim to design a clean, structured database that complements the elegance of your code. Even if you are maintaining an existing system rather than building your own, it’s reasonable to hope that the engineers who designed it followed similar principles and aimed to manage their data in a clean and rational way.
By contrast, as a data scientist, you generally need to work with the data you already have, which may be unstructured, multi-structured, incomplete, sparse, redundant, unreliable—or have a combination of all of these problems. Beauty and elegance aren’t the main criteria; you just need to build something that works.
Finding a middle ground to make big data work for you
As businesses begin to focus on harnessing big data in their applications, developers and data scientists need to work together to find a sensible middle ground between these two different mindsets. The question becomes: how do you build a sane, bulletproof business system that can reliably deal with the large and haphazard datasets generated by big data sources such as social media or the Internet of Things?
The Dat Project provides an interesting example of how a developer changed his way of thinking to accommodate a user community of data scientists. When Ogden began working on Dat, he thought in database terms: structure as much of the data as possible first, and then add whatever data can’t be structured afterwards.
However, the current version of Dat has evolved to take almost the opposite approach. It acts more like a file system, capturing data in whatever format it was produced in, then adding structured metadata to help with version control and data retrieval.
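The file-system-style approach can be illustrated with a minimal Python sketch. This is not Dat's actual implementation or metadata format: the function names (file_digest, build_manifest, changed_files) and the manifest layout are assumptions made for illustration. The idea is simply that the files stay in their original formats, while a small structured record per file makes changes detectable.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files never need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict:
    """Map each file's relative path to structured metadata; file contents are left untouched."""
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            manifest[path.relative_to(root).as_posix()] = {
                "bytes": path.stat().st_size,
                "sha256": file_digest(path),
            }
    return manifest


def changed_files(old: dict, new: dict) -> list:
    """Names of files added, removed, or modified between two manifests."""
    return sorted(name for name in old.keys() | new.keys() if old.get(name) != new.get(name))


# Hypothetical demo: a CSV is edited and a binary file in another format is added.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "stars.csv").write_bytes(b"id,mag\n1,4.2\n")
    before = build_manifest(root)
    (root / "stars.csv").write_bytes(b"id,mag\n1,4.3\n")
    (root / "sky.fits").write_bytes(b"\x00\x01")
    after = build_manifest(root)
    changes = changed_files(before, after)

print(changes)
```

Because the manifest only records sizes and digests, it works the same way for CSV, images, or any domain-specific format.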
Dealing with unconventional datasets
Fundamentally, the problem with a developer’s ideal, structured-data approach is that it requires a level of standardization that big datasets and their users simply may not be able to adhere to. As Ogden says of his efforts to get scientists to use an early database-centric version of Dat: “Trying to get their data into JSON was like pulling teeth. They were saying, ‘We’ve been using this format for astronomy since the ’70s, and we’re not going to move off of it—and in fact there are 10 different astronomy formats that we need to use.’”
The sheer scale of the data also makes it impractical to restructure it into a new format. Ogden says, “We worked with one astronomy dataset that was just one 600 million-line CSV, with one line for every star that they could see. It was split across 40 files, each of which was one segment of the sky. It was hundreds of gigabytes, and that was just the tabular data. For every star, they also had a picture of the sky that they took with a telescope, and the pictures were 40 terabytes, and they do it all again every 10 years. And that’s just one team.”
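One common way to work with a table of this shape without restructuring it is to stream the segment files row by row. The Python sketch below is an illustration only: the column names (star_id, magnitude) and the three tiny stand-in files are hypothetical, not the layout of the dataset Ogden describes.

```python
import csv
import tempfile
from pathlib import Path


def iter_rows(segment_paths):
    """Yield one dict per CSV row, reading each segment file lazily in turn."""
    for path in segment_paths:
        with open(path, newline="") as handle:
            yield from csv.DictReader(handle)


with tempfile.TemporaryDirectory() as tmp:
    # Three tiny stand-in segments; a real dataset would have far larger files.
    segments = []
    for index, rows in enumerate([["a,4.2", "b,1.1"], ["c,6.0"], ["d,2.5", "e,3.3"]]):
        path = Path(tmp) / f"segment_{index}.csv"
        path.write_text("star_id,magnitude\n" + "\n".join(rows) + "\n")
        segments.append(path)

    star_count = 0
    brightest = None
    for row in iter_rows(segments):
        star_count += 1
        # A lower magnitude means a brighter star.
        if brightest is None or float(row["magnitude"]) < float(brightest["magnitude"]):
            brightest = row

print(star_count, brightest["star_id"])
```

Only one row is held in memory at a time, so the same loop applies whether the segments total a few bytes or hundreds of gigabytes.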
So from both a human and a technical perspective, developers in the big data space have found that they need to cut their coat according to their cloth. In most use cases, using some kind of database is still the best approach. For example, even though Dat looks like a file system to its users, it relies on LevelDB, an open source key-value store, for its core operations. However, the important point is that the choice of database should depend on the data that needs to be managed, not on the developer’s preferences.
Choosing the right tools for the job
Although each use case may be different, there are still some general rules we can establish about the kinds of databases that are best (or worst) suited to dealing with large scientific or other research datasets. For example, it’s interesting to note that when Ogden first started working on the Dat Project, he didn’t pick a traditional relational database—he chose Apache CouchDB™, a JSON document store with strong offline capabilities.
One likely reason for choosing a non-relational database is that when you are dealing with extremely varied datasets, where one record has very different attributes from the next, a relational database is probably not going to be an optimal way of storing the data. A relational schema would need a huge number of columns or tables to capture data for all the possible record types, and if a new type of record should be introduced, you would have to extend the whole schema to capture the new data properly.
By contrast, a NoSQL database like open source CouchDB or a fully managed alternative like IBM Compose for MongoDB or IBM Cloudant doesn’t require individual records to comply with a fixed schema. Each record can carry its own attributes, none of which needs to be shared by any other record. This unlocks an enormous amount of flexibility for storing different types of data while retaining traditional database advantages such as the ability to retrieve, aggregate, and update records by querying their attributes.
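The schema-free idea can be sketched in plain Python. This is not the query API of CouchDB, MongoDB, or Cloudant: the records, field names, and the find helper are all hypothetical, and serve only to show that records with different attributes can live side by side and still be queried by attribute.

```python
# Three records in one collection; no two share the same set of attributes.
records = [
    {"_id": "star-1", "type": "star", "magnitude": 4.2, "constellation": "Lyra"},
    {"_id": "img-1", "type": "image", "telescope": "T1", "size_bytes": 1024},
    {"_id": "star-2", "type": "star", "magnitude": 1.1},
]


def find(docs, **criteria):
    """Return the documents whose attributes match every given key/value pair."""
    return [
        doc
        for doc in docs
        if all(key in doc and doc[key] == value for key, value in criteria.items())
    ]


stars = find(records, type="star")
star_ids = [doc["_id"] for doc in stars]

# Aggregation still works; documents lacking the attribute are simply never selected.
mean_magnitude = sum(doc["magnitude"] for doc in stars) / len(stars)

print(star_ids, round(mean_magnitude, 2))
```

Adding a new kind of record here means appending another dictionary. In a relational design, the equivalent change would mean extending the schema.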
A second useful characteristic of IBM Cloudant and CouchDB in particular is their ability to sync data easily between offline and online repositories. When data scientists are running analysis on the kinds of huge datasets that Dat manages, performance considerations mean that it’s important to have a local copy of the data to work on, rather than having to continuously fetch it from across a network.
So your solution should adopt “offline-first” principles: the user should be able to do the vast majority of their work on their local device, without any need for a persistent internet connection. And whenever a connection is available, it should be easy to replicate with an online repository to get the data back in sync.
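As a rough illustration of the offline-first pattern, the Python sketch below keeps a local store and a remote store, each tracking a revision number per document, and reconciles them when a connection is available. It is a deliberately simplified assumption, not the replication protocol that CouchDB or Cloudant actually use; the Store class and sync function are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Store:
    """A toy document store: doc id -> (revision number, body)."""
    docs: dict = field(default_factory=dict)

    def put(self, doc_id, body):
        # Each write bumps the document's revision number by one.
        revision, _ = self.docs.get(doc_id, (0, None))
        self.docs[doc_id] = (revision + 1, body)


def sync(local: Store, remote: Store) -> list:
    """Copy the higher revision of each document in both directions.

    Returns the ids of documents with equal revisions but different bodies,
    which this sketch leaves unresolved.
    """
    conflicts = []
    for doc_id in sorted(local.docs.keys() | remote.docs.keys()):
        mine, theirs = local.docs.get(doc_id), remote.docs.get(doc_id)
        if mine == theirs:
            continue
        if theirs is None or (mine is not None and mine[0] > theirs[0]):
            remote.docs[doc_id] = mine      # push the newer local copy
        elif mine is None or theirs[0] > mine[0]:
            local.docs[doc_id] = theirs     # pull the newer remote copy
        else:
            conflicts.append(doc_id)        # same revision, different content
    return conflicts


local, remote = Store(), Store()
remote.put("run-1", {"status": "raw"})
sync(local, remote)                         # initial download while online

# Work continues offline against the local copy only.
local.put("run-1", {"status": "cleaned"})
local.put("run-2", {"status": "raw"})

conflicts = sync(local, remote)             # connection is back
print(conflicts, remote.docs == local.docs)
```

All reads and writes go to the local store, and the network is only needed for the occasional sync call. A production system would also need a real conflict-resolution strategy, which this sketch only flags.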
In the case of the Dat Project, this offline-first capability works much like a combination of Git and BitTorrent—the files are simply downloaded locally. But if you have a use case where you want to use a more database-like approach instead, the ability to run a local instance of the database on your own machine and seamlessly sync it with a central repository is a great advantage.
Looking to the cloud
Having a database that is hosted and managed in the cloud is a big plus. In many cases, neither data scientists nor developers are experts in database administration. When you are aiming to build a solution that has both the robust elegance desired by software engineers and the flexibility required by data scientists, you don’t want to have to worry about database and infrastructure issues as well. Cloud data services take that complexity out of your hands and let you focus on developing the applications and algorithms you need.
To learn more about Max’s work with the Dat Project, the serious artistic mission behind his cat photo database, and the challenges of building new services for open data, listen to the full New Builders podcast episode, “All Dat and More.”
For more stories from The New Builders podcast, find us on SoundCloud, IBM developerWorks TV and on iTunes and Google Play. Please send your thoughts, feedback and guest ideas in the comments.