We are often asked where to find the sample files for our text analytics tutorial. It's a small set of IBM quarterly reports. I've uploaded them here; just click the link to download them. Happy coding!!
Peggy Zagelow - a view from the lab
PeggyZ 060000UWWS 1,097 Visits
I think that most of you reading this work for large companies, and large U.S. companies tend to have pretty active legal departments. One of the hot topics these days around litigation is the investigation of email to answer legal requirements for evidence. Yep, they're likely keeping all of your email, and are required to comply when asked to provide the relevant ones as part of a lawsuit. Getting that set right is a big deal.
Now, I'm not a lawyer. I do happen to come from a family of lawyers, but that's neither here nor there for this discussion. The group where I work in IBM's Information Management has just produced a pretty cool part of the eDiscovery puzzle. It's called eDiscovery Analyzer. As you can see in the announcement letter, it works in conjunction with other IBM products to analyze email content in a repository.
The cool part is what's under the hood here. Based on open search and text analytics built on the Unstructured Information Management Architecture (known as UIMA to those who know and love it), this product processes the text inside the emails as well as the associated information about them. This processing in turn allows a legal analyst reviewing email to work with and filter on entities extracted from the email, such as people and company names, and on metadata like sender, recipient and date. Combine that with powerful free-text search and you really have some amazing capability to categorize, gather, flag... this really helps a legal staff when they're asked to provide exactly what's needed and no more.
Now... what if you had this kind of capability on other information besides legal email repositories in your enterprise? What would you do with it? What other business problems could this kind of technology solve for you?
I found this article online today, which highlights the importance of enterprise search.
Company networks contain mountains of structured and unstructured data archived in numerous formats, some of them decades old and stored in secure servers.
IBM also is building a portfolio of enterprise search tools and services, under the OmniFind brand.
Of course you know that DB2 for z/OS data contains mountains of information! This is what our just-released text search support addresses for DB2 for z/OS data - character, binary, and XML. And it's built on OmniFind technology. With this support, you can do text search queries using the built-in CONTAINS() function. It's provided with DB2 9 for z/OS and the no-charge accessories suite.
Now, I know that this is just one piece of enterprise search. In fact, I joke with my colleagues that all of the work that we've put into this is "just an SQL statement". :-) But hey, it's an important piece - it can keep the DB2 for z/OS data where it is and "let the searches come to us".
Last week in Athens I attended a presentation on spatial support in DB2 by Julian Stuhler of Triton Consulting. I was of course aware of the support, but more from a DB2 internals point of view. It was great to get an external perspective on it from someone who has been working with it.
The key information is that spatial data can be points, lines, or polygons (including multi-part polygons). If you think about it, this is really powerful. One example that Julian used is that an address is a point, and a flood zone is a polygon. So now you can ask "is the house in a flood zone", which is "is this point inside the polygon"? Cool stuff! I can really imagine how this could be used by some situational applications to use data in DB2 alongside other data.
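The "is this point inside the polygon" question can be sketched in a few lines of plain Python. This is only an illustration of the idea using the classic ray-casting test on flat x/y coordinates; it is not how DB2's spatial support is implemented, and the `point_in_polygon` function, the flood zone rectangle and the house coordinates are all made up for the example.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def point_in_polygon(point: Point, polygon: List[Point]) -> bool:
    """Ray casting: cast a ray to the right of the point and count edge crossings.

    An odd number of crossings means the point is inside the polygon.
    Assumes a simple (non self-intersecting) polygon in planar coordinates.
    """
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]  # wrap around to close the polygon
        # Only edges that straddle the horizontal line through the point can cross the ray
        if (y1 > y) != (y2 > y):
            # x coordinate where this edge meets that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical data: a rectangular flood zone and two house locations
flood_zone = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0)]
house_in_zone = (1.0, 1.0)
house_outside = (5.0, 1.0)

print(point_in_polygon(house_in_zone, flood_zone))
print(point_in_polygon(house_outside, flood_zone))
```

A real spatial query would also have to deal with coordinate systems, multi-part polygons and points that sit exactly on a boundary, which this sketch ignores.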
Complete documentation on the spatial features can be found in this book.
Hi all, and welcome to my new blog!
All this week, I am busy at the International DB2 Users Group conference here in Athens, Greece. The two topics I'm presenting are "Native SQL stored procedures in DB2 9 for z/OS" and "Java stored procedures".
I was very glad to get the chance to get across this key point in yesterday's presentation on native SQL stored procedures -- yes, when they execute they are eligible for redirect to the zIIP processor, but only when they are invoked from a remote client, and then at the same percentage as other DDF work. Lots of presentations lately have stated this a bit more broadly, leaving people with the wrong impression. So, let's be clear - when a remote thread comes into DB2, it executes on an enclave SRB, and DB2 dials the zIIP redirect to a certain percentage. A DB2 9 native SQL procedure executes on the invoking execution block and not on a WLM-SPAS TCB - that's one of their big advantages. Thus, when a native SQL stored procedure is invoked from a remote client over a TCP/IP connection, it runs on the enclave SRB and thus picks up that same DDF zIIP redirect percentage. On the other hand, when a native SQL stored procedure is invoked locally on z/OS, it is executed on the TCB that PC'd in from CICS or batch, and that is not eligible for zIIP redirect.
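The rule described above can be written down as a tiny decision function. This is only a restatement of the paragraph in Python, not anything DB2 or WLM exposes: the `Invocation` enum, the `ziip_redirect_share` function and the `ddf_redirect_pct` parameter are hypothetical names, and the percentage passed in is a placeholder rather than the real DDF redirect figure.

```python
from enum import Enum


class Invocation(Enum):
    """How a native SQL stored procedure was invoked (hypothetical labels)."""
    REMOTE_DDF = "remote client over TCP/IP, runs on the enclave SRB"
    LOCAL_TCB = "local caller such as CICS or batch, runs on the caller's TCB"


def ziip_redirect_share(invocation: Invocation, ddf_redirect_pct: float) -> float:
    """Return the share of the procedure's work that is eligible for zIIP redirect.

    A remote invocation picks up the same percentage as other DDF work;
    a local invocation is not eligible at all.
    """
    if invocation is Invocation.REMOTE_DDF:
        return ddf_redirect_pct
    return 0.0


# Placeholder percentage, chosen only to show the shape of the rule
print(ziip_redirect_share(Invocation.REMOTE_DDF, 0.5))
print(ziip_redirect_share(Invocation.LOCAL_TCB, 0.5))
```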
Lots of other good stuff going on here - tomorrow is a DB2 for z/OS Special Interest Group as well as our IBM query panel.
I had the pleasure this week of participating in a couple of live IBM Academy of Technology events - the first for me in several years. It was great to reconnect with some of my colleagues, and get a chance to talk over IBM technology and client engagements.
Our academy president, Rashik Parmar, had arranged for a session with Bill Gajda from Visa. Bill's role at Visa includes mobile strategy as well as global innovation. We had a frank and open discussion about Visa's use of data, who their customers are, and IBM's technology role.
Please note that any errors in here are likely mine. I didn't take notes; I was far too engaged and inspired thinking about all the data (!) during the discussion.
Clearly, credit card transactions are "big data". I don't recall the exact numbers, but here's an article from over 5 years ago which references 300 million transactions a day. Bill Gajda told us that Visa keeps around 5-7 years of past transactions. They primarily use this information for real-time fraud scoring of individual transactions. So, when you charge something, Visa gets the approval request, and attaches a score which indicates the likelihood that the transaction is fraudulent. This is based on several factors, such as your usual spending patterns and the location of the transaction. That is the primary use for the historical data.
Bill also told us of a partnership they did with The Gap, where if a Gap customer in their loyalty program opted in with their phone number, Visa would tell The Gap when a purchase was being made near a Gap store, and then The Gap could text an offer to their customer for a discount at The Gap. I was able to find an article describing that, and was surprised to see it was from 2011!
Now, if you're a geek like me, you'll think for a second about how the data flows. (This is just Peggy speculating, I have no further knowledge of the internals of this...) So say you charge your lunch on your Visa card, and the approval data flows to Visa. After Visa processes the approval, it also looks at the zip code and compares it to a list of zip codes it has from The Gap of their store locations (I'm making this easy by using zip codes rather than a latitude/longitude-based proximity lookup). When there's a match, Visa initiates a message to The Gap to tell them you're close to a Gap store... and The Gap then sends you a text message with a promotion. Phew! Do you think Visa also has an indicator on your credit card that you're a Gap customer? They must... I can't imagine they do this on "every" Visa transaction... ! Actually, I guess there is a "Gap Visa" card, so it's probably those which are targeted.
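The speculated flow (approval arrives, check the opt-in, compare the zip code against the store list, notify the retailer) can be sketched in Python. Like the paragraph above, this is pure speculation made concrete: the `Approval` record, the `nearby_store_alert` function, the card identifiers and the zip codes are all hypothetical and say nothing about how Visa or The Gap actually built this.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass(frozen=True)
class Approval:
    """A processed card approval, reduced to the two fields this sketch needs."""
    card_id: str
    merchant_zip: str


def nearby_store_alert(
    approval: Approval,
    store_zips: Set[str],
    opted_in_cards: Set[str],
) -> Optional[str]:
    """Return the card id the retailer should be told about, or None.

    A notification is produced only when the cardholder has opted in
    and the purchase happened in a zip code that contains a store.
    """
    if approval.card_id not in opted_in_cards:
        return None  # not in the loyalty program, so skip the lookup entirely
    if approval.merchant_zip not in store_zips:
        return None  # purchase was not near any store
    return approval.card_id


# Hypothetical reference data supplied by the retailer
store_zips = {"95141", "94105"}
opted_in_cards = {"card-001"}

lunch = Approval(card_id="card-001", merchant_zip="95141")
print(nearby_store_alert(lunch, store_zips, opted_in_cards))
```

Checking the opt-in set first reflects the guess that the lookup would not run on "every" transaction, only on cards already flagged as belonging to the retailer's customers.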
I don't know about you, but I have fun thinking about this kind of stuff. Some other interesting facts I learned are that Visa doesn't have your name or personal information - that data all resides with the issuing bank. One of my colleagues asked Bill Gajda whether Visa has information across the cards when a person has multiple Visa cards. At first, he answered "yes", but then with another colleague's question about matching the names, he said "no" - Visa has no idea that you are the same person when you have multiple cards. Interesting thought about whether we could perhaps match spending patterns to identify who might be the same person? That might not be something Visa wants to do, of course, I was thinking of it more as a theoretical exercise. Also, Visa doesn't really consider cardholders customers, despite its advertising budget - its customers are the issuing banks and the accepting retailers, so they are the ones who would be likely consumers of the volume of data or aggregated insights from it. Though I guess the banks already have it, too.
In any case, again, data about people and their habits is big data and big business. I personally don't find any of this "scary" but I guess some people might. I just like people, so data about people is interesting data to me. I think those who do find it scary might just have to take a closer look at some of those tiny print privacy agreements!
We now have the capability in DB2 9 for z/OS to search text data that is stored in DB2 for z/OS using SQL statements. Wahoo!
You mean you missed the announcement?
And you just followed that link and still couldn't find it? It's under "utilities", no, it's not that kind of utility, but still, that's where it is.
What's been added is the built-in functions contains() and score(), along with a text search server which runs on a separate, non-z/OS server. For more details, see the announcements!
One prerequisite for this is to have a WLM application environment set up to run a Java user-defined function. The early customers I've been working with have stumbled most on this part of it. So, this is something you can set up even if you are not quite on DB2 9 for z/OS yet. I'll post some more on the setup steps for that.
So, what kind of data are you going to search, and what kinds of searches are you going to do?
I thought I'd share a nice little AQL output example that might be of help in your Big Insights text analytics programming.
Consider that you've created a very complex AQL view, one which extracts a very detailed concept from text. This might take into account many different text constructs and match a large variety of different text in documents.
If you want to take that view and use it to simply identify which documents have an occurrence of this advanced concept, you can do it like this. In this case, I've assumed the complex view is called "Division".
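create view DivisionCount as
select Count(*) as dc from Division D;
create view DivisionBoolean as
-- The opening of the select/case and the output column name were lost from this listing.
-- They are reconstructed here as an assumption; hasDivision is a placeholder name.
select case when Equals(DC.dc,0) then 'no'
else 'yes' as hasDivision
from Document R,DivisionCount DC;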
To test this using the Text Analytics tutorial example, I created a new document called text.txt which doesn't have any matches for Division in it. When I run it and select to see DivisionBoolean in the output, I get this as a result:
I'm really enjoying being back working at IBM's Silicon Valley Lab, after spending some time on assignment in IBM's Global Business Services division. Since November, I've been working as one of the product architects for IBM's Big Insights product, focusing on Text Analytics.
This week we were working closely with our top-notch design team, and it occurred to me that the four big areas of data activity we were discussing spelled out LOVE.
Load, Organize, Visualize, Export.
Since we started just after Valentine's day, we decided this was appropriate. And for me, this has all truly been a labor of LOVE. I'm trying to make it stick as a code name. You heard it here first!