A Graph-Based Movie Recommendation Solution Based on Titan DB (HBase) + TinkerPop (SparkGraphComputer) + Gephi

Background

Titan is a graph database designed to store and process large graphs at scale, supporting both real-time traversals and analytical queries. This post introduces Titan and walks through its usage on a real data model: a movie recommender engine.

Data Source

A publicly available sample data set is used in this post (the code below reads it from an ml-1m directory). Now consider the same scenario at a much larger scale:

Billions of users (N)
Billions of movies (M)
Up to N*M ratings

What challenges will we face?

Diverse data must be collected from different sources (YouTube, AMC, and so on)
Diverse data must be managed in a unified way
High concurrent insert rates from multiple parallel sources
Efficient processing and querying
Some operations require complex queries and searches

Titan graph-based solution

High performance: distributed in-memory computation, versus traditional SQL databases that incur long disk I/O delays
Data is distributed across a multi-machine cluster
Flexibility: attributes can be changed freely, versus traditional databases with a fixed schema
Property graph: each vertex and edge is managed through its properties
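To make the property-graph idea concrete, here is a minimal Python sketch. It is not Titan's API: the `Vertex` and `Edge` classes and the sample values are hypothetical and exist only to show that vertices and edges each carry a free-form property map, so a new attribute can be attached without a schema change.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Vertex:
    """A vertex is an id plus a free-form property map."""
    id: int
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Edge:
    """An edge has a label, two endpoints, and its own property map."""
    label: str
    out_v: Vertex
    in_v: Vertex
    properties: Dict[str, Any] = field(default_factory=dict)


# Illustrative values only.
user = Vertex(1, {"type": "User", "userId": 1, "gender": "F"})
movie = Vertex(2, {"type": "Movie", "movieId": 1193, "title": "Sample Movie (1975)"})
rated = Edge("rated", user, movie, {"stars": 5})

# A new attribute is added later without touching any schema definition.
movie.properties["language"] = "en"
```

In a relational design the same change would normally require altering a table definition first.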

Implementation of the Movie Recommender Engine:

Generate the MovieRating Graph for movie recommendation
Analyze the MovieRating Graph for movie recommendation

Generate MovieRating Graph for movie recommendation:

We use the sample data to simulate this scenario. The downloaded data is inserted into the Titan graph database by the following code. This is not the only way to build the Movie Graph in Titan; it is provided only as a reference.

[Image: Movie_Graph]

Open titan graph and define occupations:

  // Open the Titan graph and create a traversal source
  g = TitanFactory.open('/usr/iop/current/titan-client/conf/titan-hbase-solr.properties')
  t = g.traversal()
  // Occupation codes used in users.dat
  occupations = [0:'other', 1:'academic/educator', 2:'artist',
    3:'clerical/admin', 4:'college/grad student', 5:'customer service',
    6:'doctor/health care', 7:'executive/managerial', 8:'farmer',
    9:'homemaker', 10:'K-12 student', 11:'lawyer', 12:'programmer',
    13:'retired', 14:'sales/marketing', 15:'scientist', 16:'self-employed',
    17:'technician/engineer', 18:'tradesman/craftsman', 19:'unemployed', 20:'writer']

Parsing Movie Data:

  // $FilePATH is a placeholder for the directory that contains ml-1m
  new File('$FilePATH/ml-1m/movies.dat').eachLine { def line ->
    // Line format: movieId::title::genre1|genre2|...
    def components = line.split('::')
    def movieVertex = g.addVertex("type", "Movie", "movieId", components[0].toInteger(), "title", components[1])
    components[2].split('\\|').each { def genera ->
      // Reuse the genre vertex if it already exists, otherwise create it
      def hits = t.V().has("genera", genera)
      def generaVertex = hits.hasNext() ? hits.next() : g.addVertex("type", "Genera", "genera", genera)
      movieVertex.addEdge("hasGenera", generaVertex)
    }
  }
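The movie loader combines two ideas: splitting each `movieId::title::genres` line, and a get-or-create lookup so that every genre is represented by a single shared vertex. The following Python sketch shows that logic on in-memory data, without Titan. The function names and the dictionary-based "vertices" are hypothetical, and the two sample lines only follow the file's format.

```python
from typing import Dict, Iterable, List, Tuple


def parse_movie_line(line: str) -> Tuple[int, str, List[str]]:
    """Split 'movieId::title::genre1|genre2' into its three parts."""
    movie_id, title, genres = line.rstrip("\n").split("::")
    return int(movie_id), title, genres.split("|")


def load_movies(lines: Iterable[str]):
    """Build movie vertices, shared genre vertices, and hasGenera edges."""
    movies: List[dict] = []
    genre_vertices: Dict[str, dict] = {}  # genre name -> single shared vertex
    edges: List[Tuple[int, str, str]] = []
    for line in lines:
        movie_id, title, genres = parse_movie_line(line)
        movies.append({"type": "Movie", "movieId": movie_id, "title": title})
        for genre in genres:
            # Get-or-create: only the first occurrence creates the vertex.
            genre_vertices.setdefault(genre, {"type": "Genera", "genera": genre})
            edges.append((movie_id, "hasGenera", genre))
    return movies, genre_vertices, edges


sample = [
    "1::Toy Story (1995)::Animation|Children's|Comedy",
    "2::Jumanji (1995)::Adventure|Children's|Fantasy",
]
movies, genre_vertices, edges = load_movies(sample)
```

Here the two movies share one `Children's` genre vertex, which is the same effect the `hasNext() ? next() : addVertex(...)` pattern aims for in the Groovy code.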

Parsing User Data:

  new File('$FilePATH/ml-1m/users.dat').eachLine { def line ->
    // Line format: userId::gender::age::occupationCode::...
    def components = line.split('::')
    def userVertex = g.addVertex("type", "User", "userId", components[0].toInteger(), "gender", components[1], "age", components[2].toInteger())
    def occupation = occupations[components[3].toInteger()]
    // Reuse the occupation vertex if it already exists, otherwise create it
    def hits = t.V().has("occupation", occupation)
    def occupationVertex = hits.hasNext() ? hits.next() : g.addVertex("type", "Occupation", "occupation", occupation)
    userVertex.addEdge('hasOccupation', occupationVertex)
  }

Parsing Rating Data:

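  new File('$FilePATH/ml-1m/ratings.dat').eachLine { def line ->
    // Line format: userId::movieId::stars::...
    def components = line.split('::')
    // Look up the user and movie vertices, then connect them with a "rated" edge
    def ratedEdge = t.V().has("userId", components[0].toInteger()).next().addEdge("rated", t.V().has("movieId", components[1].toInteger()).next())
    ratedEdge.property('stars', components[2].toInteger())
  }
  // Commit everything that was added
  g.tx().commit()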
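Each rating line triggers two vertex lookups by id, so the cost of loading ratings is dominated by how fast those lookups are. The Python sketch below illustrates the same step with in-memory dictionaries standing in for an index on `userId` and `movieId`. Whether such indexes exist in the Titan deployment is an assumption, since the post does not show index creation. The function names and sample data are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def parse_rating_line(line: str) -> Tuple[int, int, int]:
    """Split 'userId::movieId::stars::timestamp' and keep the first three fields."""
    user_id, movie_id, stars, _timestamp = line.strip().split("::")
    return int(user_id), int(movie_id), int(stars)


def build_rated_edges(
    rating_lines: Iterable[str],
    users_by_id: Dict[int, dict],
    movies_by_id: Dict[int, dict],
) -> List[dict]:
    """Create one 'rated' edge per line, using dictionary lookups as the index."""
    edges = []
    for line in rating_lines:
        user_id, movie_id, stars = parse_rating_line(line)
        if user_id not in users_by_id or movie_id not in movies_by_id:
            continue  # skip ratings that reference unknown vertices
        edges.append({"label": "rated", "out": user_id, "in": movie_id, "stars": stars})
    return edges


# Illustrative data in the same format as ratings.dat.
users_by_id = {1: {"type": "User"}, 2: {"type": "User"}}
movies_by_id = {10: {"type": "Movie"}, 20: {"type": "Movie"}}
lines = ["1::10::5::978300760", "2::20::3::978302109", "3::10::4::978301968"]
edges = build_rated_edges(lines, users_by_id, movies_by_id)
```

The third sample line is skipped because user 3 was never loaded. The Groovy version would instead fail on `next()` in that situation.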

Now that the Movie data has been parsed and represented as a graph, the data is ready to be traversed (i.e. queried).

Analyze the MovieRating Graph for movie recommendation

Traverse the Movie Graph in Titan DB

[Image: TitanDB_A]

Import the data source and generate the graph
Populate the graph in Titan
Execute graph traversals

  graph = TitanFactory.open('/usr/iop/current/titan-client/conf/titan-hbase-solr.properties')
  graph.traversal()
  ...........
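The traversal itself is elided above. As one possible illustration, and not necessarily the traversal the author ran, a simple co-rating recommendation walks from a movie to the users who rated it highly and then on to the other movies those users rated highly, ranking them by how often they are reached. The Python sketch below applies that idea to plain `(userId, movieId, stars)` tuples. The function name, the `min_stars` threshold, and the data are all hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Rating = Tuple[int, int, int]  # (userId, movieId, stars)


def recommend(ratings: Iterable[Rating], movie_id: int,
              min_stars: int = 4, top_n: int = 3) -> List[Tuple[int, int]]:
    """Rank other movies liked by the users who liked `movie_id`."""
    ratings = list(ratings)
    # Step 1: movie -> users who rated it at least min_stars.
    fans = {u for (u, m, s) in ratings if m == movie_id and s >= min_stars}
    # Step 2: those users -> other movies they rated at least min_stars.
    counts = Counter(
        m for (u, m, s) in ratings
        if u in fans and m != movie_id and s >= min_stars
    )
    # Step 3: most frequently reached movies first.
    return counts.most_common(top_n)


ratings = [
    (1, 10, 5), (1, 20, 4), (1, 30, 2),
    (2, 10, 4), (2, 20, 5), (2, 40, 4),
    (3, 10, 2), (3, 50, 5),
]
print(recommend(ratings, movie_id=10))  # [(20, 2), (40, 1)]
```

User 3 disliked movie 10, so movie 50 is never reached. In a graph database the same three steps map onto edge hops along the `rated` edges rather than repeated scans of a ratings table.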

Perform Movie Graph data analysis in Spark

[Image: Spark_A]

Import the data source / read the graph from HBase
Load the graph into Spark memory
Execute a graph traversal / VertexProgram

  // Read the graph from HBase
  graph = GraphFactory.open('/usr/iop/current/titan-client/conf/hadoop-graph/hadoop-hbase-read.properties')
  t = graph.traversal().withComputer(SparkGraphComputer)
  t.V().valueMap()
  ............
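Analytical (whole-graph) jobs differ from real-time traversals in that every vertex takes part in the computation. The following Python sketch imitates that vertex-centric style in two rounds: users send their star values along `rated` edges, then each movie reduces the messages it received into an average. This is only a conceptual model, not the SparkGraphComputer or VertexProgram API, and the function name and data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def average_stars(rated_edges: Iterable[Tuple[int, int, int]]) -> Dict[int, float]:
    """Compute the average rating per movie in a message-passing style."""
    # Round 1: every user vertex sends the star value along each outgoing edge.
    inbox = defaultdict(list)
    for _user_id, movie_id, stars in rated_edges:
        inbox[movie_id].append(stars)
    # Round 2: every movie vertex reduces its own inbox independently,
    # which is what allows the work to be spread across a cluster.
    return {movie_id: sum(msgs) / len(msgs) for movie_id, msgs in inbox.items()}


edges = [(1, 10, 5), (2, 10, 3), (1, 20, 4)]
averages = average_stars(edges)
```

Because each movie only needs its own messages, the second round can run in parallel across partitions.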

Visualize Movie Graph with Gephi

[Image: Gephi]

Get the graph
Set the configuration
Traversal visualization: the Gephi plugin for Gremlin Console uses Gephi's Graph Streaming API to allow for graph and traversal visualization

How to setup Gephi

The Gephi plugin for Titan Client (Gremlin Console) is compatible with Gephi 0.9.x. Download Gephi from https://gephi.org/

Note: Both Titan and Gephi require JDK 1.8.

Install Graph Streaming plugin

a. Open Tools > Plugins.

[Image: Gephi11]

b. Click Available Plugins, search for "graph streaming", check the Graph Streaming plugin, and click Install.

[Image: Gephi10]

c. Follow the instructions to install the plugin and click Finish to restart Gephi.

[Image: Gephi9]

d. Launch Gephi and click New Project.

[Image: Gephi8]

e. In the lower left view, click the “Streaming” tab, open the Master drop down, and right click Master Server > Start. This starts the Graph Streaming server in Gephi, which by default accepts requests at http://localhost:8080/workspace1.
Important: The Gephi Streaming plugin only supports localhost connections.

[Image: Gephi7]

Note: The Gephi Streaming plugin doesn’t detect port conflicts and will appear to start successfully even if something is already active on the port it wants to use (8080 by default). Before starting the plugin, make sure that nothing else is running on that port. Otherwise the console will appear to submit requests to Gephi successfully, but nothing will render.
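Since the plugin does not report port conflicts, it can help to check the port yourself before starting the Master Server. The following Python sketch is one way to do that with the standard library. The helper name `port_in_use` is hypothetical, and 8080 is simply the default port mentioned above.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already accepting connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 when the connection succeeds, i.e. a listener exists.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    if port_in_use(8080):
        print("Port 8080 is already in use; stop that process before starting Gephi streaming.")
    else:
        print("Port 8080 looks free.")
```

Run it before starting the Master Server. Once the server is started, the same check should report the port as in use.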

f. Start the Gremlin Console and activate the Gephi plugin:

  gremlin> :plugin use tinkerpop.gephi
  ==>tinkerpop.gephi activated

  gremlin> graph = TinkerFactory.createModern()
  ==>tinkergraph[vertices:6 edges:6]

  gremlin> :remote connect tinkerpop.gephi
  ==>Connection to Gephi - http://localhost:8080/workspace1 with stepDelay:1000, startRGBColor:[0.0, 1.0, 0.5], colorToFade:g, colorFadeRate:0.7, startSize:10.0,sizeDecrementRate:0.33

  gremlin> :> graph
  ==>tinkergraph[vertices:6 edges:6]
  ==>false

The above Gremlin session activates the Gephi plugin, creates the “modern” TinkerGraph, uses the :remote command to set up a connection to the Graph Streaming server in Gephi (with default parameters that will be explained below), and then uses :> (shorthand for :submit) to send the vertices and edges of the graph to the Gephi Streaming Server. The resulting graph appears in Gephi as shown below.

[Image: Gephi6]
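To give a feel for what "sending the vertices and edges" means, the Python sketch below builds JSON events in the style of Gephi's Graph Streaming format, where an `an` event adds a node and an `ae` event adds an edge. It does not contact Gephi, and it is an assumption that the Gremlin plugin emits events of exactly this shape; the attributes it actually sends may differ. The helper names are hypothetical, and the sample covers only two vertices and one edge of the "modern" graph.

```python
import json
from typing import Any


def add_node_event(node_id: Any, **attrs: Any) -> str:
    """Build an 'add node' event: {"an": {"<id>": {...attributes...}}}."""
    return json.dumps({"an": {str(node_id): attrs}})


def add_edge_event(edge_id: Any, source: Any, target: Any,
                   directed: bool = True, **attrs: Any) -> str:
    """Build an 'add edge' event: {"ae": {"<id>": {"source", "target", ...}}}."""
    body = {"source": str(source), "target": str(target), "directed": directed}
    body.update(attrs)
    return json.dumps({"ae": {str(edge_id): body}})


# marko -knows-> vadas from the "modern" TinkerGraph.
events = [
    add_node_event(1, label="marko"),
    add_node_event(2, label="vadas"),
    add_edge_event(7, 1, 2, label="knows", weight=0.5),
]
for event in events:
    print(event)
```

Nodes are sent before the edge that connects them, so the receiver can resolve the `source` and `target` ids when the edge arrives.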

g. Now that the graph is visualized in Gephi, it may look awkward at first. This is where graph layout algorithms and graph settings come in.
1) Choose a graph layout in the Layout tab. Many layouts are available; here we use the Fruchterman Reingold layout and click Run.

[Image: Gephi5]

2) Increase the node size, decrease the edge scale, and display the id, name, and weight attributes.

[Image: Gephi4]

3) The graph should now look like the following:

[Image: Gephi3]

For layout selection, the recommendations below cover general scenarios:

[Image: Gephi2]

For more detailed tutorials on Layout Algorithm and Graph Settings, please take a look at the following slides provided by Gephi:
https://gephi.org/users/tutorial-layouts/
https://gephi.org/users/tutorial-visualization/

Traversal Visualization

Visualizing a Traversal works differently: the visualization occurs while the Traversal is executing, showing a real-time view of its execution. A Traversal must be “configured” to operate in this mode, which requires the visualTraversal option on the config function of the :remote command:

  gremlin> :remote config visualTraversal graph //(1)
  ==>Connection to Gephi - http://localhost:8080/workspace1 with stepDelay:1000, startRGBColor:[0.0, 1.0, 0.5], colorToFade:g, colorFadeRate:0.7, startSize:10.0,sizeDecrementRate:0.33

  gremlin> traversal = vg.V(2).in().out('knows').
                          has('age',gt(30)).outE('created').
                          has('weight',gt(0.5d)).inV();[] //(2)

  gremlin> :> traversal //(3)
  ==>v[5]
  ==>false

1. Configure a “visual traversal” from your “graph”, which must be a Graph instance. This command creates a new TraversalSource called “vg” that must be used to visualize any spawned traversals in Gephi.
2. Define the traversal to be visualized. Note that ending the line with ;[] simply prevents iteration of the traversal before it is submitted.
3. Submit the Traversal to Gephi for visualization.

When the :> line is called, each step of the Traversal that produces or filters vertices generates events to Gephi. The events update the color and size of the vertices at that step with startRGBColor and startSize respectively. After the first step is visualized, the plugin sleeps for the configured stepDelay in milliseconds. On the second step, it decays the configured colorToFade of all the vertices visited in prior steps by multiplying each vertex's current colorToFade value by the colorFadeRate. Setting the colorFadeRate value to 1.0 prevents the color decay. The screenshots below show how the visualization evolves over the four steps:

[Image: Gephi1]
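The color decay described above is a repeated multiplication, which is easy to work through. With the default settings shown in the console output (startRGBColor [0.0, 1.0, 0.5], colorToFade g, colorFadeRate 0.7), a vertex highlighted at the first step starts with a green value of 1.0 and loses 30% of its current value at every later step. The Python sketch below computes that sequence. The function name is hypothetical, and it models only the color fade, not the size change.

```python
from typing import List


def fade_history(start_value: float, fade_rate: float, steps: int) -> List[float]:
    """Value of the fading color channel at each step for a vertex visited at step 1."""
    values = [start_value]
    for _ in range(steps - 1):
        # Each later step multiplies the current value by the fade rate.
        values.append(values[-1] * fade_rate)
    return values


green = fade_history(start_value=1.0, fade_rate=0.7, steps=4)
print([round(v, 3) for v in green])  # [1.0, 0.7, 0.49, 0.343]
```

With a fade rate of 1.0 the list would stay at 1.0 throughout, which matches the note that this setting disables the decay.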
