Mertz: This is David Mertz, again reporting on OSCON 2011 for IBM developerWorks. I had an opportunity to also speak with Bradford Stephens, director of the OSCON Data track. Do you have any comments or insight into why this was spun off from the general OSCON program, when it was a unified program in previous years?
Stephens: The main reason we spun off OSCON Data as a separate subconference is that data infrastructure challenges are so prevalent among people in the open source community today that we needed to dedicate a lot of energy toward making a conference that addresses those problems.
So everything from distributed systems, to SQL database tuning, to NoSQL, to low-level discussions of flaws in virtualization: OSCON Data is a place for people who are building interesting things, either at scale or in depth, to come together and learn how to do it better.
Mertz: I asked Bradford about the comparison of storage technologies and analysis technologies.
Stephens: It's sort of a feedback loop. Because storage is so cheap these days, we're storing more. And because we can easily provision things in the cloud, it's quite simple to accumulate more and more data, just as it's simple to add more servers. I think that's one key aspect of the data explosion, or of the new interest in data infrastructure.
Another part of it is the nature of the data: existing databases just collapse once you try to distribute them, or they lose a lot of functionality. That's where I think a lot of the NoSQL stores have come into play. There are non-distributed stores such as MongoDB, which focus on, as you put it, more [INAUDIBLE] master queries of document data sets.
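The document model Stephens describes can be sketched in a few lines. This is an illustrative toy in plain Python, not MongoDB's actual API: records are schemaless nested dicts, and a query is a partial document matched field by field.

```python
# Toy sketch of a document-store query (illustrative only, not MongoDB's API):
# documents are schemaless dicts, and a query matches on top-level fields.

def find(collection, query):
    """Return documents whose fields match every key/value pair in `query`."""
    return [doc for doc in collection
            if all(doc.get(k) == v for k, v in query.items())]

users = [
    {"name": "ada", "lang": "python", "tags": ["dev"]},
    {"name": "bob", "lang": "go"},                       # no "tags" field: fine
    {"name": "cam", "lang": "python", "city": "Austin"},
]

print(find(users, {"lang": "python"}))  # matches ada and cam
```

Note that documents need not share a schema: "bob" simply lacks the `tags` field, and no query breaks because of it.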
Then there are distributed databases like HBase, which don't have a lot of functionality but can scale to massive quantities of data.
And then, [INAUDIBLE] said too, storing this data is just one of the problems. Actually using it, analyzing it, and applying "data science" to it is a whole other realm of possibility and [INAUDIBLE]. You can have most of the nice parts of SQL, and schemas, and all that. But some parts of SQL are fundamentally unable to be distributed, like joins. You can never do joins across a network, because they're so memory- and CPU-constrained.
And they're I/O-constrained. They're constrained in basically every direction you can imagine. So you'll never be able to do that over a network.
But it turns out, both mathematically and in practice, that you don't need joins to have nicely organized large data sets; you don't need to separate data into separate tables. That's how we approached the problem.
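The point about avoiding joins is usually called denormalization. A minimal sketch, with hypothetical `customers` and `orders` data, contrasting the relational approach (join the tables at query time) with the document approach (embed the needed fields up front, so each node can answer locally with no cross-network join):

```python
# Hypothetical data illustrating denormalization.

# Relational style: two "tables", joined at query time.
customers = {1: {"name": "Acme", "region": "US"}}
orders = [{"order_id": 101, "customer_id": 1, "total": 250.0}]

# The join: look up each order's customer row and merge the fields.
# Distributed, this means shipping rows across the network.
joined = [{**o, **customers[o["customer_id"]]} for o in orders]

# Document style: the order already embeds the customer fields it needs,
# so "orders with customer name" is answered with no join at all.
denormalized = [{"order_id": 101, "total": 250.0,
                 "customer": {"name": "Acme", "region": "US"}}]

print(joined[0]["name"], denormalized[0]["customer"]["name"])
```

The trade-off is duplication: if Acme changes its region, every embedded copy must be updated, which is why this design suits read-heavy, large-scale workloads.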
Mertz: I asked Bradford about domains for big data.
Stephens: I think the main areas are, of course, anything social, because that's a classic network problem: the more people you have in a social network, the more [INAUDIBLE] exponentially the data grows. Anything in social gaming, or social BI or analytics, that whole world.
And then the newer, more machine-to-machine things: sensor networks, the smart grid, even server logs. Anywhere machines talk to each other, they can do it at a rate that produces a massive amount of data. Status updates and things like that are a kind of text, but technically small text, so you're not so interested in drilling into the structure of a whole book, or whatever is being indexed, to find [stack] relationships.
There's a lot of unstructured data out there, and there's often structured metadata surrounding it; a tweet is a perfect example. At a very low level, unstructured data really just requires a different kind of indexing than structured data.
And you can still do it. Something we discovered when we were building out our platform is that there's an interesting intersection where you want to use unstructured data to explore structured data.
You see this a lot in the analytics and BI world. In that world, as you pointed out, very deep analytics aren't really necessary; what people want is just to be able to search things.
There are more difficult things you can do with unstructured data as well, things that are quite difficult in the big data world, such as, you know, detailed document ranking. So I think there's a lot of growth still to be had on the unstructured data side of things.
Mertz: What file systems scale well for storing not necessarily complex and deeply transactional, but very large, bodies of data that need to be well distributed?
Stephens: Well, certainly the Hadoop and MapR file systems are great at storing massive amounts of data. MapR is fairly low latency; Hadoop is not. But if you don't need to store very complex data, and you just need to store huge, multi-terabyte files, Hadoop and MapR are sort of the state of the art in that department.
Mertz: How do both of those handle reliability issues? Do they have built-in failover modes and [INAUDIBLE] guarantees?
Stephens: Yes. Hadoop has pretty strong redundancy: all data is written three times, and it can be either actively or passively replicated. In Hadoop, if you lose any node, another one will automatically come up with the data. There is a drawback in Hadoop, though, in that there is a central file system table called the NameNode, and if that data is erased, it can be very difficult to recover the data on all the nodes.
MapR, which is sort of a closed-source clone of Hadoop written to be much faster and more efficient, also has a distributed NameNode, so losing data in one NameNode will not cause you to lose data throughout the [INAUDIBLE].
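The three-way replication Stephens mentions corresponds to HDFS's default replication factor, which is set per cluster (or per file) with the `dfs.replication` property. A minimal configuration fragment, assuming a standard Apache Hadoop deployment:

```xml
<!-- hdfs-site.xml: replication factor, matching the "written three times"
     default Stephens describes. Property name per Apache HDFS docs. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```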
Mertz: Thanks for listening again to IBM developerWorks' coverage of OSCON 2011. This has been David Mertz; please tune in for future podcasts.