Jeff Jonas on next-gen identity analytics

Jeff Jonas, IBM Distinguished Engineer and Chief Scientist, Entity Analytic Solutions, IBM Software Group, discusses the responsibility of shaping the overall technical strategy of next generation identity analytics.


Scott Laningham, developerWorks Podcast Editor, IBM developerWorks

Scott Laningham, host of developerWorks podcasts, was previously editor of developerWorks newsletters. Prior to IBM, he was an award-winning reporter and director for news programming featured on Public Radio International, a freelance writer for the American Communications Foundation and CBS Radio, and a songwriter/musician.

10 March 2009

You can listen to this podcast HERE.

developerWorks: This is a developerWorks podcast. I'm Scott Laningham here with Todd Watson. Joining us this time is Jeff Jonas, IBM Distinguished Engineer and Chief Scientist, Entity Analytic Solutions, IBM Software Group.

Jeff is responsible for shaping the overall technical strategy of next generation identity analytics and the use of this new capability in the overall IBM technology strategy. You can check out his blog by Googling Jeff Jonas. It comes right up. Welcome, Jeff.

Jonas: Hello.

developerWorks: I wanted to kick this off looking back really at 9/11. You related your perspective on the part that non-obvious relationship awareness could have played in preventing that. What led you to put those pieces together and why? And we're also wondering what the response was within the U.S. intelligence and national security community?

Jonas: Well, right after 9/11, I found myself building ... or helping organizations chase down a list they were provided. The government was passing around a list of suspected terrorists or people related to the 9/11 incident and they were handing [it] out to corporate America. And corporate America, you know, they don't really know that there's 128 ways or so to spell "Mohammed" so they're having some difficulty trying to see if their data relates to these subjects of interest. So that put me into the awareness of this particular list and the data.

Guest: Jeff Jonas

Jeff Jonas is chief scientist of the IBM Entity Analytics group and an IBM Distinguished Engineer. The IBM Entity Analytics group was formed based on technologies developed by Systems Research and Development (SRD), founded by Jonas in 1984 and acquired by IBM in January 2005. Prior to IBM's acquisition of SRD, Jonas led it through the design and development of a number of extraordinary systems, including technology used by the surveillance intelligence arm of the gaming industry. Leveraging facial recognition, this technology enabled the gaming industry to protect itself from aggressive card-counting teams, the most notable being the MIT team, the subject of the book "Bringing Down the House" and the recent movie "21." Today, possibly half the casinos in the world use technology created by Jonas and his SRD team. This work is frequently featured on the Discovery Channel, the Learning Channel, and the Travel Channel.

And at the same time, I was watching newspaper stories and good investigative journalism and they were showing network diagrams of how all these people are related. Here's Mohammed Atta, related to this guy, related to this cleric over here. And there was some speculation the shape of the network was really the tell-tale signature of terrorist planning.

And I've seen a lot of data in my life, having built many, many systems and many with very large data sets, and the shape of a network really isn't that important. In fact, if you get too caught up in that, you end up finding soccer teams and say, people moving around for family reunions.

What I did was I took the open source information that was coming out in all these investigatory reports and I told a different story. Instead of saying the shape of the network mattered, the real story is shapes of the networks don't matter unless you know the entrance point: At what point you're peeking in. And you know, Mohammed Atta had a bench warrant, but if you think, you know, if the airport started screening everybody that had bench warrants maybe the airports would shut down.

So the question is, what part of the network, where was the network exposable? Where could you have peeked in? And two of the individuals, Nawaf al-Hazmi and Khalid al-Mihdhar, the CIA knew they were bad people and our government was in fact looking for them. And that's the entrance point. And what I demonstrated is that by just starting with those two, the number of known terrorists in the United States right now happens to be a very small list. It was then, I'm sure is still today.

But when you peek in, you very quickly and easily can find these two individuals. They're in the phone book in San Diego. They're operating in plain sight using their real identities and they made plane reservations just a few days later after the government started looking for them.

And on these plane reservations they use addresses that tie you to, quickly allow you to discover Mohammed Atta. And if you just keep digging through that way very narrowly, not far out and wide and shapes of networks, but just link analysis, we quickly get to at least 13 of the 19 hijackers.
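To make the entrance-point idea concrete, here is a minimal Python sketch (not something shown on the podcast). It assumes records are simple (person, attribute) pairs and walks outward from a small seed list through shared attributes, a few hops at a time. The data, the function name expand_from_seeds, and the max_hops parameter are all hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical toy records: (person, attribute) pairs such as a shared
# address or phone number. None of this is real data.
RECORDS = [
    ("person_a", "addr:12 Elm St"),
    ("person_b", "addr:12 Elm St"),
    ("person_b", "phone:555-0100"),
    ("person_c", "phone:555-0100"),
    ("person_d", "addr:99 Oak Ave"),  # shares nothing with the seed
]

def expand_from_seeds(records, seeds, max_hops=2):
    """Return {person: hops} for people reachable from the seeds via shared attributes."""
    people_by_attr = defaultdict(set)
    attrs_by_person = defaultdict(set)
    for person, attr in records:
        people_by_attr[attr].add(person)
        attrs_by_person[person].add(attr)

    hops = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        current = queue.popleft()
        if hops[current] >= max_hops:
            continue  # stay narrow: do not expand past the hop limit
        for attr in attrs_by_person[current]:
            for neighbor in people_by_attr[attr]:
                if neighbor not in hops:
                    hops[neighbor] = hops[current] + 1
                    queue.append(neighbor)
    return hops

found = expand_from_seeds(RECORDS, seeds=["person_a"])
print(found)
```

The hop limit is what keeps the search narrow: nothing is examined unless it connects back to a known starting point, and person_d never appears in the result.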

So I packaged that up as a little presentation to talk about how plain as day it was to find these folks. And I presented it in public once and somebody that I knew from the national security community kind of tapped me on the shoulder and said, hey, geez, you know, you don't really have to ... can I have a copy of that and could you not show that any more? And I thought, ah, sure.


And then about, I don't know, within a year it had been leaked and the exact link analysis, my work, word-for-word made a policy report. And I called my ... this friend of mine and I said, now, geez-louise, you leaked it. And he said, yes, I didn't really mean to do that, sorry about that. So it's quite public now, it's seen a bit of media.

But the principle is that it's, shapes of networks don't matter and it's entrance points that do. And I've also used it to make the point that at that time you didn't need new laws and you didn't actually need new technology to find those guys.

I'm not saying that we can't have laws that are better and better suited for counter-terrorism, and we don't...we certainly could use improved technology, especially technologies that enable information sharing to address these counter-terrorism problems. But at that time, around 9/11 specifically, neither were needed.

Todd: So, Jeff, this is Todd. There seems to be a consistent theme in your work, and I'm maybe characterizing this inappropriately but I think you'll get the gist of it, and that is that you don't know what you don't know until you know it, right?

And I think that this example of 9/11, and the hijackers who could have been kind of finger-pointed at least before the event, is one good example. There was another that you had brought up I think in some of your Homeland Security testimony that you gave out in Vegas last fall that I thought was interesting that I actually hadn't heard about. In this case, it was around Hurricane Katrina and the fact there were all these people who couldn't find their way back to their loved ones even though we had multiple, multiple sources of data, i.e., bulletin boards or Web sites where people were self-identifying and saying, I'm looking for so and so ....

And that simply what needed to be done was to find a way of aggregating that and providing a view into it so that one dot could get connected to another. Could you talk a little bit about that particular incident, because I think it's another that was just ... it had such immediacy and such impact, and yet there was immediate value that could be provided to those folks if we just put the dots together in a way that they were meaningful.

Jonas: Yes. So right after Katrina hit, 50 Web sites at least popped up that were naming missing and found. And some of them are real popular ones like National Center For Missing and Exploited Children and the Red Cross Web site. But other Web sites, one put up by a TV station in the Louisiana area and another one put up by a kid who is a teenager in his garage. And so when you locate somebody, for example, a daughter looking for her father who is a diabetic, she may name him on five or seven of the Web sites that she can find saying, I'm looking for my father. And the father in fact maybe is registered on the 10th Web site saying I'm here, I'm here, I'm alive. But they, they're separated. It's across all these silos.

So we did a project with the Governor's office, we took the 15 biggest Web sites, somewhere around 1.5 million named identities. But when the daughter names her father five times, it would count as five times. So some of the work that we've done over the years is, well, you could call it identity resolution. In this case, you're kind of ... you're creating a consolidated dossier, you're saying, hey there's this one person and you've been named on five Web sites by the daughter and named on the sixth Web site by the father himself.

So this was used, the one point something million identities named across all these Web sites turned into 36,000 unique people. And then the difference you're looking for is between people naming a missing and naming a found and this resulted in over 100 loved ones being reunited through a reunification project.
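The consolidation described here can be sketched in a few lines of Python. This is only an illustration under strong assumptions: the postings are invented, and the match key (lowercased name plus city) is far cruder than the fuzzy identity resolution Jonas describes. The names identity_key, resolve, and reunification_candidates are hypothetical.

```python
from collections import defaultdict

# Hypothetical postings gathered from several sites: (site, status, name, city)
POSTINGS = [
    ("site1", "missing", "Robert  Smith", "New Orleans"),
    ("site2", "missing", "robert smith", "new orleans"),
    ("site3", "missing", "ROBERT SMITH", "New Orleans"),
    ("site9", "found", "Robert Smith", "New Orleans"),
    ("site2", "missing", "Ann Lee", "Biloxi"),
]

def identity_key(name, city):
    """Crude match key: lowercase and collapse whitespace."""
    return (" ".join(name.lower().split()), " ".join(city.lower().split()))

def resolve(postings):
    """Group postings into one dossier per resolved person."""
    dossiers = defaultdict(lambda: {"missing": set(), "found": set()})
    for site, status, name, city in postings:
        dossiers[identity_key(name, city)][status].add(site)
    return dossiers

def reunification_candidates(dossiers):
    """People named as missing somewhere and as found somewhere else."""
    return sorted(key for key, d in dossiers.items() if d["missing"] and d["found"])

dossiers = resolve(POSTINGS)
print(len(POSTINGS), "postings ->", len(dossiers), "unique people")
print(reunification_candidates(dossiers))
```

Five postings collapse to two people, and only the person who appears both as missing and as found is flagged as a candidate for reunification.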

developerWorks: Wow.

Jonas: An interesting side note is it takes a little while to set the policy up on these things because what you don't want to have happen is somebody using the witness relocation program to be located by somebody that's hunting for him.

developerWorks: Sure.

Jonas: And you don't want debt collectors to use this to locate people that they're trying to get to collect debts. You really wanted to make sure it was used for the purpose it was created.

By the way, the way you remedy that is if I say you are lost and you say you're found, then we tell you that I'm looking for you. We don't tell the person looking for somebody where to find them, we tell the person who has been lost who's trying to find them.

Todd: So here we are on, I guess Monday, March the 2nd, and we were talking earlier before you started recording that the market was heading south again and it seems to be a general trend....

And referring back again I think to some of the things you said in your testimony, I think at one point you had made a comment about Goldman Sachs and that they had indicated in some discussions that you'd had with them that every millisecond that they gained, for example, in their trading programs, was worth to them $100 million a year.

Now, that was obviously probably a while back, before some of the financial downturn started to kick in. And I'm just curious in light of what's going on in the markets and all this data about the pending crisis with everything from collateralized debt obligations to the other things that I think led to this capital crunch.... What could be done moving forward to put more checks and balances into place, specifically using technology? Is there an answer there in terms of some of the things that you've already talked about in terms of the relationship awareness that in a system that is financially oriented, some of those same triggers could be identified and the dots connected?

Jonas: So, sometimes I have characterized the work that I've done and the technology that IBM acquired from my SRD company, I've characterized it as perpetual analytics. It's really this notion that you have to be able to use ... you want to be able to use new observations so you can change your mind about the past. I did a blog post on this called Smart Systems Flip Flop. You're in real big trouble if your systems make a decision based on available information, but then as information changes, they never reconsider earlier decisions. Risky, risky moves. You know, the Federal government will do a background clearance on somebody once every five years. I mean, I visualize that like taking a pen and sticking it into an XY matrix about what the current risk is and then turning your back on it and looking at it five years later to see where it is then.

Well, if you're trying to be more competitive, if you're trying to prevent really bad things from happening, you really want to be able to monitor risk and trends as they're happening. So that means when you establish whether something's got opportunity or risk on it and you're giving it some score, you want to be able to use new observations, new data that arrives in the enterprise, and you want to be able to monitor its motion and see that it's in the green area and it's drifting towards a dangerous area, maybe it turns to yellow and it crosses a threshold.

And you know, I've been thinking about what this might look like in terms of mortgage-backed securities, had there been some perpetual analytics on the risk maybe three years ago, long before the crisis. You would have seen an aggregate drift, they wouldn't have been all red, but you would have seen an aggregate drift towards red line. What one would have done about that I don't know, but anybody who could see it first might have had some distinct benefit.
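As a rough illustration of re-scoring on every new observation rather than on a fixed schedule, the following Python sketch keeps a running risk score per entity and reports when the score crosses a band. The thresholds, the additive scoring, and the names RiskTracker, observe, and band are all assumptions made for the example, not anything described on the podcast.

```python
from dataclasses import dataclass, field

YELLOW, RED = 0.5, 0.8  # hypothetical thresholds on a 0..1 risk scale

def band(score):
    """Map a numeric score to a colored risk band."""
    if score >= RED:
        return "red"
    if score >= YELLOW:
        return "yellow"
    return "green"

@dataclass
class RiskTracker:
    """Keeps a running score per entity and re-evaluates it on each observation."""
    scores: dict = field(default_factory=dict)

    def observe(self, entity, delta):
        """Apply an observation; return (old_band, new_band) if the band changed, else None."""
        old = self.scores.get(entity, 0.0)
        new = min(1.0, max(0.0, old + delta))
        self.scores[entity] = new
        if band(old) != band(new):
            return band(old), band(new)
        return None

tracker = RiskTracker()
events = [("loan_17", 0.3), ("loan_17", 0.3), ("loan_17", 0.3), ("loan_17", -0.5)]
changes = [tracker.observe(entity, delta) for entity, delta in events]
print(changes)
```

The last event moves the entity back to green, which is the "change your mind about the past" behavior: an earlier conclusion is revisited as soon as new data arrives.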

Todd: You can't do anything about it if you can't see it, so that's kind of where I was going with the question. So that's interesting. Scott?

developerWorks: Yeah. You know, Jeff, that really kind of dovetails into something I was thinking about as well, that you blogged last week about macrotrends and one of those was the fact that data is being created faster than organizations can make sense of it. It's kind of similar to what you're talking about. It's not the same issue, but it is an issue of how often, what's the frequency of analyzing that data. And so maybe the statement there is organizations are getting dumber in a sense. And of course, IBM information management technologies is focused on solutions in that space. What do you see in terms of how companies are addressing this and maybe some common things that they're doing or that they need to be doing?

Jonas: Well, yes, the notion of that post that you're referencing is as computers are getting faster, organizations are getting dumber. And that's because the volume of information that's being created in the world that a company has access to is growing faster than their ability to make sense of it.

developerWorks: Right.

Jonas: Which means if you can make sense of seven percent of what you know today as a company, in 2 years maybe you can only make sense of four percent of what you know. And in 10 years you'll be able to make sense of one percent of what you know. So to that extent, to do what's knowable, organizations are getting less intelligent.

So I did this other blog post called Algorithms at Dead End, You Cannot Squeeze Knowledge Out of a Pixel. If you have a red pixel, you don't know if it's fire or a fire engine. You could use an infinite amount of computing power, time, and energy and still really know nothing about it.

Today, what's been happening, I think most organizations have been instrumenting pixels -- I mean by that instrumenting transactions, an atomic transaction: Somebody applies for a loan, somebody fills out a job application, somebody's transferring money from one account to another.

Organizations have been tending to try to use algorithms -- math and instrumentation -- on each of these atomic transactions. And the problem is it's hard to get much knowledge out of there to make a good decision.

And what you really want to do is take pixels or think of those as puzzle pieces and you want to stitch them into puzzles -- meaning, you want to put information in context. You want to take a transaction that's happening and you want to say what does our organization know about this person, about this product, about this bank account?

And you want to know the net sum of what the enterprise knows. And when you have that much richness, it is really your best opportunity to make the right decision whether you're trying to improve opportunities -- cross-sell/up-sell -- or mitigate risk.

I mean, heck, today organizations will buy marketing lists and unbeknownst to them, they're sending credit card offers to people who are in jail who have already been arrested for stealing from them.


developerWorks: Right. That's called no contextualizing at all, right?

Jonas: Well, yes, I mean, I call it enterprise amnesia. I mean, we found, we took data from one of the real large retailers in the U.S. and showed them that two out of every thousand people they were hiring had already been arrested for stealing from them at the same store.

And that's what happens when you stare at information out of context, when you're just staring at an individual job applicant and you're not saying, how does this relate to the net sum of what we know?

In fact, the smartest an organization can be is the net sum of their perceptions, right? But perceptions are all those data that they've been collecting across all of these many silos. That's the smartest they can be, but because it's been scattered and the yellow puzzle pieces are laying in this pile and there's people and systems studying the yellow pieces and then over on the other side of the building there's the blue pieces and there's systems and people studying those. And they're wondering why they're missing the obvious.

Well, this is really about putting information together. And that's really how to make organizations smarter. And I'll tell you what, as I watch the economy tank, there's just no time to waste to be a smarter organization.

developerWorks: What areas of study do you think support individuals in wanting to be well adapted for careers in this kind of space?

Jonas: So the broadest category would be information management, like how to better manage information. I think some of the underpinnings of their ... it turns out I believe at least that the smartest an organization can be is computationally most efficient if you try to make sense at the moment the transaction's happening.

So this is the story of doing real-time and streaming data. So I think the study area about real-time transactional systems, ultra low latency subsecond response, doing that you end up having a real good understanding of information structures because the structure's going to govern the function like the kind of schemas and data models you use.

And then on top of that, to get context out of information, probably the most important thing is broadly called semantic reconciliation. That means recognizing when two things are the same despite having been described differently. So one is Bob Ricard and the other one is Robert Ricard, you've got to be able to figure out they're the same person.

developerWorks: Right.

Jonas: Starbucks Number 123 and the Starbucks on the corner of Sahara and Maryland Parkway are the same place. If you can't see in your data when like things are the same, you basically have a problem of inability to count. And if you can't count things, how could you possibly have good prediction? Those would be maybe the first few things that come to mind.
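The link between reconciliation and counting can be shown with a very small Python sketch. It assumes a tiny, hand-written nickname table and treats the first word of a name as the first name; real semantic reconciliation is much richer than this. NICKNAMES, canonical_name, and count_distinct are hypothetical names.

```python
# Tiny, hypothetical nickname table; a real system would use far more evidence.
NICKNAMES = {"bob": "robert", "rob": "robert", "jeff": "jeffrey"}

def canonical_name(full_name):
    """Lowercase the name and map a known nickname to its formal first name."""
    first, *rest = full_name.lower().split()
    return " ".join([NICKNAMES.get(first, first), *rest])

def count_distinct(names):
    """Count people after reconciliation rather than raw records."""
    return len({canonical_name(name) for name in names})

names = ["Bob Ricard", "Robert Ricard", "ROBERT  RICARD", "Pat Example"]
print(len(names), "records,", count_distinct(names), "distinct people")
```

Without the reconciliation step the same list would count as four people, which is the "inability to count" problem in miniature.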

Todd: So, Jeff, just building on this notion that you mentioned earlier about becoming smarter, as you know, IBM has launched this strategic initiative starting I think all the way at the top of the organization with Sam Palmisano and his speech to the Council of Economic Advisors last fall about this idea of a Smarter Planet. And yet as I look at some of what we're saying, we're talking about putting sensors into everything from shipping containers to passports. Recognizing that privacy has been playing an increasingly critical role in your work, especially with the national security community as it relates to civil liberties, I'm interested in how do you think we -- the collective we, IBM -- individuals strike that balance so that privacy is respected while also fulfilling the promise of this notion of new intelligence? Because clearly there are benefits to the new intelligence and yet when I look out at some of the discussions that are going on, sometimes I think privacy is kind of distinctively absent from the conversation.

Jonas: Well, that's a funny thing. I spend, maybe if there's any one area I spend the most time on, it's thinking about how can governments protect society and at the same time not unravel the privacy and civil liberties of the population, the shutting down of the Fourth Amendment, like that's a really interesting and hard problem.

In terms of macrotrends, I've come to conclude that surveillance societies are not only inevitable and irreversible, but the more interesting thing is they're irresistible. And they're irresistible because the consumers are clamoring to optimize their lives -- so, GPS on our phones, everyone, I mean, you just start to love it. You can figure out where you're going, you can find Starbucks, you know where your kids are. You're going to love RFID everywhere. You're going to put them ... you're going to make sure it's in your glasses because you'll never lose them again. So I just bring that up because that's really the trend.

So the next question is, consumers for the most part are not reading their privacy statements. And one of the things that I've been trying to do is get more technologists to spend more time with people in the privacy community because as innovators we're going to end up waking up in the bed we've made, right? The toothpaste is going to get out of the tube.

And some of the goals that you have when you think about responsible innovation and things that I'm out encouraging people building the systems, and these would be consistent with things that I've been trying to build into our systems, is one of them is data tethering. It says that if you're going to have data and transfer data, you'd better make sure it stays current through the ecosystem. So if somebody produces a watch list and they pass the watch list off to a secondary system, if that source system -- the ownership of the watch listing record -- were to change the watch list record or delete a record, you'd better be able to know everywhere it's been transferred so you can correct it throughout the whole system.
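One way to picture data tethering is a source system that remembers every place it has sent a record, so that later changes and deletions can follow the data. The Python sketch below is only an illustration under simple assumptions: downstream systems are modeled as plain dictionaries, and the class name TetheredSource and its methods are hypothetical.

```python
class TetheredSource:
    """Source system that remembers which downstream copies hold each record."""

    def __init__(self):
        self.records = {}
        self.recipients = {}  # record_id -> list of downstream stores (plain dicts)

    def publish(self, record_id, value, downstream):
        """Store the record, copy it downstream, and remember where it went."""
        self.records[record_id] = value
        downstream[record_id] = value
        self.recipients.setdefault(record_id, []).append(downstream)

    def update(self, record_id, value):
        """Change the record at the source and in every known copy."""
        self.records[record_id] = value
        for store in self.recipients.get(record_id, []):
            store[record_id] = value

    def delete(self, record_id):
        """Remove the record at the source and from every known copy."""
        self.records.pop(record_id, None)
        for store in self.recipients.pop(record_id, []):
            store.pop(record_id, None)

source = TetheredSource()
partner_a, partner_b = {}, {}
source.publish("w1", "entry v1", partner_a)
source.publish("w1", "entry v1", partner_b)
source.update("w1", "entry v2")
after_update = (partner_a["w1"], partner_b["w1"])
source.delete("w1")
print(after_update, partner_a, partner_b)
```

The sketch only covers copies the source itself made; if a recipient passes the record on again, that hop would need its own tether.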

Another principle is to try to minimize the amount of data that is actually transferred because the more you transfer it, the harder it is to keep it all current. And some work that I've done in this area is around the area of anonymization. One of the, in the health care field if you're trying to protect the identity of your patients which is you have to do that by law because of HIPAA, you would strike off the health care records the name and the address, you would strike off the phone number and the tax ID number and make it non identifiable. But the problem is, if the parties that want to share data all remove the identity data you no longer have the ability to count things, you can't tell whether it's five cases of lupus or one case reported five times.

So some work that I've done in the area of semantically reconciling identities -- or often called identity resolution -- it was formerly called Anna and it's been renamed now that IBM's bought my company, but what it does is it allows you to anonymize the identity, so it takes the name, address, tax ID, date of birth.... And using something called one-way hashes, basically it renders it non-human readable and not mathematically reversible. But the Reader's Digest version of this is if you take a pig and put it through a grinder, I can give you the sausage and the grinder and you can't go backwards and make a pig. So to that extent, the grinder's a one-way hash function.

But this technique allows multiple organizations or multiple holders of data to grind their data up and anonymize it. And after it's been anonymized the personally identifiable information is now in a protected form, there's still an ability to make sense of it, to do identity resolution very robustly with fuzzy matching qualities: Jeffrey versus Jeff and 123 South Main Street versus 123 Main Avenue. You can still find matches and count things.
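A minimal Python sketch of matching on one-way hashes follows. It is not the product's method: the podcast does not say which hash is used or how fuzzy matching survives hashing. The sketch assumes the parties share a secret key, use HMAC-SHA256 from the standard library, and standardize values before hashing, so it only tolerates differences in case and spacing. SHARED_KEY, normalize, and anonymize are hypothetical names, and the records are invented.

```python
import hashlib
import hmac

# Assumption: the data holders agree on this key out of band.
SHARED_KEY = b"hypothetical-shared-secret"

def normalize(value):
    """Standardize before hashing, since hashes match only on identical input."""
    return " ".join(value.lower().split())

def anonymize(record):
    """Replace each identifying field with a keyed one-way hash (HMAC-SHA256)."""
    return {
        name: hmac.new(SHARED_KEY, normalize(value).encode("utf-8"), hashlib.sha256).hexdigest()
        for name, value in record.items()
    }

clinic_a = anonymize({"name": "Pat  Example", "dob": "1970-01-01"})
clinic_b = anonymize({"name": "pat example", "dob": "1970-01-01"})
clinic_c = anonymize({"name": "Sam Sample", "dob": "1970-01-01"})
print(clinic_a == clinic_b, clinic_a["name"] == clinic_c["name"])
```

Two clinics can compare the hashed records and see that they describe the same patient without either one revealing the name, which is what restores the ability to count.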

So this opens the door to doing deeper levels of analytics and doing it in ways that are more privacy protected. And especially, companies are getting pretty tense about this notion of having their data run away on them. And every copy an organization makes of their data puts them at greater risk.

I had this notion that anonymization is going to be a pretty big wave -- the reason being is I kind of picture a conversation with a CEO and it says, look, here's a way to share your data in an anonymized form and you can get a materially similar result.

Well, the question is, why would an organization want to share their data any other way? You know, a lot of copies of data are being moved around and you have to bring data together to make sense of it. You can't leave it in 50 piles and run algorithms across 50 locations and assume they're all up, assume they all have zero latency, assume they all have the indexes that you need.

So to create intelligence, this whole new intelligence, parts of it that are really smart are going to be the commingling of some data. And the question is, how many copies of data does there need to be? And any time that you can either reduce the number of copies or protect it better is going to be really important as organizations shift their agenda to the information and making more sense of it.

Todd: So, Jeff, I think I mentioned to you earlier, I work in Web marketing and as I look out on the Web marketing landscape continuing this meme on privacy, it really does rear its head in everything. I mean, certainly Google has taken its fair share of shots, some warranted, probably some not. We also have seen this coming up as a discussion in behavioral targeting. And when you made that comment earlier, it kind of made me think back to Scott McNealy from Sun way back when, when he said, you have zero privacy, get over it. I mean, is he right? Is that where we're at? Or do you think that that balance that you're describing can be struck between the need to get more efficient in the markets, especially in light of what's been happening recently, using some of these technologies, balancing that out with an individual's ability to follow the basic fair information practices and it was ... I think it was described in the Fair Credit Reporting Act, of notice, choice, consent, those things. Can we strike that balance?

Jonas: Well, it's important, I think, to note, and I've heard many say this, but the first that I heard say this is David Brin, he wrote The Transparent Society, is the idea about what privacy means is changing generationally. And you know, with the advent of MySpace and Facebook, where you can give up so much data, there's a lot of people giving up a lot of data.

So I think the bigger thing is that I spend a lot of time talking to people in the privacy community, from EFF to EPIC to ACLU and others, and if I were to synthesize it down to like, what's the shortest way to get into the heads of a privacy advocate, I guess the shortest thing would be, avoid consumer surprise.

So as organizations build systems and try to better service their customers, any thinking that one does about, hey, let's do it in a way that's consistent with what we tell them on our Web site about how they use their data, right, that's full disclosure and how you're going to use the data, let's use your data just the way that we've said and I think those are really good first steps for privacy.

I think a lot of organizations are paying a lot more attention, especially around the data breach. And I think there's been some high-profile cases of organizations that have got into a bit of trouble for using data in ways that were unanticipated. And the world's given them some pretty loud feedback. And I think that causes companies to be a little more careful about what agreements they're making with their customers and keeping to them.

developerWorks: Jeff, it's interesting what you're talking about there. And I know you're a parent, so you've got kids growing up in this whole, as you say, MySpace, Facebook, I-want-to-talk-about-myself-in-detail generational thing. How do you see a generation, a new generation embracing the positives of all this stuff, this global sharing and openness, and at the same time developing the street smarts and maybe long-term vision to know what to share and what not to share? How do you think about that?

Jonas: You know, I mean, I encourage my kids to be active in the social networking space. I mean, they got there before me. I mean, heck, they realized text messaging was cool for years before I got this new girlfriend and realized text messaging is cool, you know.


They were busy in MySpace, and I took a look and they've learned some things on the journey, that they've all run into a few creepy people here and there. But you know, I don't ... it's a funny thing, is I don't really think the world's a more dangerous place. This is another macrotrend I like to point out. It's not more dangerous. I mean, the media's ability to take every bad thing that happens on earth and package and throw it in your face all day long makes the world appear more dangerous. But you know, in the late 1800s, early 1900s, the average lifespan in western Europe was 37. Today the average lifespan, including Africa, is 67. You're going to live older today than at any time in the history of mankind.

So I think the world's a really fun place and there's a lot of opportunity and it's really not that scary. And so I encourage kids and adults and people that haven't quite gotten to the ramp to not be afraid, get in there and play around. And if something just doesn't make sense, like why don't you get all your money out of the bank and come over to my house, you know, just ... you wouldn't do that if they called you and asked you that over the phone, either.

developerWorks: This has been a great chat, by the way, and we could go on forever, but I know how busy you are and I'm wondering if you might want to just kind of as a wrap up talk about maybe any...some future scenarios that you'd like to see around the use of entity analytics?

Jonas: Well, today organizations have all this data sitting in all these different silos, you know the blue puzzle piece is over there, the yellow puzzle piece is over there. And I think that organizations are going to have to get way more efficient. And to get way more efficient they need to be able to make more sense of what they know so they can better predict who's their better customers, who's the customers that really aren't great customers, and where there's criminal activity so they're not ... so they can shut the door on that earlier than later.

When organizations get smarter like that it's going to create more confidence. It's going to show in earnings per share because they will be more efficient than their competitor and it will create more confidence in the marketplace.

And I think that's just going to be a real giant push. And that's been an area of my focus, is how do you make more sense of information or what you might call sense making, how do you do it in real-time, at the moment the transaction's happening so you can do something about it while it's occurring?

This whole notion of waiting until Friday night and running a batch job over the weekend so you can be smarter on Monday isn't going to cut it. And any window you have of any latency longer than makes sense of what you know and respond accordingly is going to leave organizations out in the wind, and they're going to be not competitive.

developerWorks: Our guest has been Jeff Jonas, IBM Distinguished Engineer and Chief Scientist, Entity Analytic Solutions, IBM Software Group. Jeff, it's been a pleasure, man. Thanks for your time.

Jonas: Thanks. I enjoyed it.

developerWorks: Again, check out Jeff's blog, just Google Jeff Jonas, J-O-N-A-S, and his blog is one of the first things to come up.

Also visit Todd Watson's or my blog at Just look in the list for Todd Watson or Scott Laningham. We'd love to have your comments about what you heard or what you'd like to hear more of in the future.

developerWorks is IBM's premium technical resource for software developers with tools, code and education on IBM products and open standards technology. I'm Scott Laningham. Talk to you next time.


