Background
Titan is a graph database designed to scale graph data processing for real-time traversals and analytical queries. This blog post introduces Titan and walks through its usage with a real data model: a Movie Recommender Engine.
Data Source
A publicly available sample data source is here. Consider the following scenario, which involves a large volume of data:
- N billion users
- M billion movies
- Up to N*M billion ratings
What challenges will we face?
- Diverse data must be collected from different sources (YouTube, AMC, and so on)
- Diverse data should be managed in a unified way
- High concurrent insert rates from multiple sources in parallel
- Efficient processing and querying
- Some operations require complex queries and searches
Titan graph-based solution
- High performance: distributed in-memory computation, versus traditional SQL databases, which incur long disk I/O delays
- Data is distributed across a multi-machine cluster
- Flexibility: attributes can change freely, versus traditional databases, which have a fixed schema
- Property-based graph: each vertex and edge is easily managed through its properties
Implementation of the Movie Recommender Engine:
- Generate MovieRating Graph for movie recommendation
- Analyze the MovieRating Graph for movie recommendation
Generate MovieRating Graph for movie recommendation:
We have sample data to simulate this scenario. The following code inserts the downloaded data into the graph database Titan. Of course, this is not the only way to build the Movie Graph in Titan. It is just for reference.
- Open the Titan graph and define occupations:
    g = TitanFactory.open('/usr/iop/current/titan-client/conf/titan-hbase-solr.properties')
    t = g.traversal()
    occupations = [0:'other', 1:'academic/educator', 2:'artist', 3:'clerical/admin', 4:'college/grad student',
                   5:'customer service', 6:'doctor/health care', 7:'executive/managerial', 8:'farmer', 9:'homemaker',
                   10:'K-12 student', 11:'lawyer', 12:'programmer', 13:'retired', 14:'sales/marketing', 15:'scientist',
                   16:'self-employed', 17:'technician/engineer', 18:'tradesman/craftsman', 19:'unemployed', 20:'writer']

    // $FilePATH is a placeholder for the directory that holds the ml-1m data.
    // Load movies; each distinct genre becomes one shared Genera vertex.
    new File('$FilePATH/ml-1m/movies.dat').eachLine { def line ->
        def components = line.split('::')
        def movieVertex = g.addVertex("type", "Movie", "movieId", components[0].toInteger(), "title", components[1])
        components[2].split('\\|').each { def genera ->
            // Look the vertex up by property; V() takes vertex ids, so filter with has().
            def hits = t.V().has("genera", genera)
            def generaVertex = hits.hasNext() ? hits.next() : g.addVertex("type", "Genera", "genera", genera)
            movieVertex.addEdge("hasGenera", generaVertex)
        }
    }

    // Load users; each distinct occupation becomes one shared Occupation vertex.
    new File('$FilePATH/ml-1m/users.dat').eachLine { def line ->
        def components = line.split('::')
        def userVertex = g.addVertex("type", "User", "userId", components[0].toInteger(), "gender", components[1], "age", components[2].toInteger())
        def occupation = occupations[components[3].toInteger()]
        def hits = t.V().has("occupation", occupation)
        def occupationVertex = hits.hasNext() ? hits.next() : g.addVertex("type", "Occupation", "occupation", occupation)
        userVertex.addEdge('hasOccupation', occupationVertex)
    }

    // Load ratings as "rated" edges from User to Movie, with the score stored in 'stars'.
    new File('$FilePATH/ml-1m/ratings.dat').eachLine { def line ->
        def components = line.split('::')
        def userVertex = t.V().has("userId", components[0].toInteger()).next()
        def movieVertex = t.V().has("movieId", components[1].toInteger()).next()
        def ratedEdge = userVertex.addEdge("rated", movieVertex)
        ratedEdge.property('stars', components[2].toInteger())
    }
    g.tx().commit()
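The loading code relies on two conventions of the data files: fields are separated by '::', and a movie's genres are separated by '|'. As a minimal, Titan-independent sketch of that parsing step, the Python below splits one record of each kind into typed fields. The names Movie, parse_movie_line and parse_rating_line, and the sample lines, are illustrative assumptions and not part of the original code.

```python
from typing import List, NamedTuple, Tuple


class Movie(NamedTuple):
    movie_id: int
    title: str
    genres: List[str]


def parse_movie_line(line: str) -> Movie:
    """Split one '::'-delimited movie record into typed fields."""
    movie_id, title, genres = line.rstrip("\n").split("::")
    # Genres are '|'-separated inside the third field.
    return Movie(int(movie_id), title, genres.split("|"))


def parse_rating_line(line: str) -> Tuple[int, int, int]:
    """Return (user_id, movie_id, stars); any trailing fields are ignored."""
    user_id, movie_id, stars = line.rstrip("\n").split("::")[:3]
    return int(user_id), int(movie_id), int(stars)


if __name__ == "__main__":
    # Sample lines are made up to match the '::' layout used above.
    print(parse_movie_line("7::Example Movie (1999)::Comedy|Drama"))
    print(parse_rating_line("3::7::4::12345"))
```

Parsing into typed records first makes it easy to check the input before any vertex or edge is written to the graph.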
Now that the Movie data has been parsed and represented as a graph, the data is ready to be traversed (i.e. queried).
Analyze the MovieRating Graph for movie recommendation
- Traverse the Movie Graph in Titan DB
  - Import data source and generate graph
  - Populate graph in Titan
  - Execute graph traversal
    graph = TitanFactory.open('/usr/iop/current/titan-client/conf/titan-hbase-solr.properties')
    graph.traversal()
    ...........
- Perform Movie Graph data analysis in Spark
  - Import data source / read graph from HBase
  - Load graph into Spark memory
  - Execute graph traversal / VertexProgram
    graph = GraphFactory.open('/usr/iop/current/titan-client/conf/hadoop-graph/hadoop-hbase-read.properties') // Read graph from HBase
    t = graph.traversal().withComputer(SparkGraphComputer)
    t.V().valueMap()
    ............
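To make the idea of a recommendation query concrete, here is a rough, library-independent sketch in Python. It follows the same path a graph traversal would take (from a movie, back along "rated" edges to the users who liked it, then forward to the other movies those users liked) and ranks the results by how many users they share. This is one simple co-rating heuristic chosen for illustration, not necessarily the analysis the author ran. The function name recommend, the min_stars threshold and the sample ratings are all assumptions.

```python
from collections import Counter, defaultdict
from typing import Iterable, List, Tuple


def recommend(
    ratings: Iterable[Tuple[int, int, int]],
    liked_movie: int,
    min_stars: int = 4,
    top_n: int = 3,
) -> List[Tuple[int, int]]:
    """Rank movies liked by the same users who liked `liked_movie`.

    `ratings` holds (user_id, movie_id, stars) triples, mirroring the
    User -rated-> Movie edges and their 'stars' property.
    """
    liked_by_user = defaultdict(set)
    for user, movie, stars in ratings:
        if stars >= min_stars:  # keep only "liked" edges
            liked_by_user[user].add(movie)

    counts = Counter()
    for movies in liked_by_user.values():
        if liked_movie in movies:
            counts.update(movies - {liked_movie})

    # Most shared fans first; ties broken by movie id for a stable order.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


if __name__ == "__main__":
    sample = [
        (1, 10, 5), (1, 20, 4), (1, 30, 2),
        (2, 10, 4), (2, 20, 5), (2, 40, 5),
        (3, 20, 5), (3, 40, 5),
    ]
    print(recommend(sample, liked_movie=10))
```

At graph-database scale the same pattern would be expressed as a traversal over the "rated" edges instead of an in-memory dictionary.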
Visualize Movie Graph with Gephi
- Get Graph
- Set Configuration
- Traversal Visualization: the Gephi plugin for Gremlin Console uses Gephi's Graph Streaming API to allow for graph and traversal visualization
How to set up Gephi
- The Gephi plugin for Titan Client (Gremlin Console) is compatible with Gephi 0.9.x. Download Gephi from https://gephi.org/
- Install Graph Streaming plugin
Note: Both Titan and Gephi require JDK 1.8.
a. In Gephi, open Tools > Plugins.
b. Click Available Plugins, search for "graph streaming", check the Graph Streaming plugin, and click Install.
c. Follow the instructions to install the plugin, then click Finish to restart Gephi.
d. Launch Gephi and click New Project.
e. In the lower-left view, click the “Streaming” tab, open the Master drop-down, and right-click Master Server > Start. This starts the Graph Streaming server in Gephi, which by default accepts requests at http://localhost:8080/workspace1.
Important: The Gephi Streaming plugin only supports localhost connections.
Note: The Gephi Streaming plugin doesn’t detect port conflicts and will appear to start successfully even if something is already active on the port it wants to use (8080 by default). Make sure nothing is running on that port before starting the plugin. Otherwise, the console will appear to submit requests to Gephi successfully, but nothing will render.
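Since the plugin does not report port conflicts itself, a quick check before starting it can save some confusion. The following is a minimal sketch in Python that tries to connect to the port and reports whether anything answers. The helper name port_in_use is an assumption, and 8080 is simply the default port mentioned above.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1", timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    port = 8080  # default Graph Streaming port mentioned above
    if port_in_use(port):
        print(f"Port {port} is already taken; free it before starting the plugin.")
    else:
        print(f"Port {port} looks free.")
```

Run it before right-clicking Master Server > Start. Once the streaming server is running, the same check should report the port as taken.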
f. Start the Gremlin Console and activate the Gephi plugin:
    gremlin> :plugin use tinkerpop.gephi
    ==>tinkerpop.gephi activated
    gremlin> graph = TinkerFactory.createModern()
    ==>tinkergraph[vertices:6 edges:6]
    gremlin> :remote connect tinkerpop.gephi
    ==>Connection to Gephi - http://localhost:8080/workspace1 with stepDelay:1000, startRGBColor:[0.0, 1.0, 0.5], colorToFade:g, colorFadeRate:0.7, startSize:10.0,sizeDecrementRate:0.33
    gremlin> :> graph
    ==>tinkergraph[vertices:6 edges:6]
    ==>false
The Gremlin session above activates the Gephi plugin, creates the “modern” TinkerGraph, uses the :remote command to set up a connection to the Graph Streaming server in Gephi (with default parameters that are explained below), and then uses :> (shorthand for :submit) to send the vertices and edges of the graph to the Gephi Streaming Server. The resulting graph appears in Gephi as shown below.
g. The graph is now visualized in Gephi, but it may look awkward at first. This is where graph layout algorithms and graph settings come in.
1) Choose a graph layout in the Layout tab. Many layouts are available; here we use the Fruchterman Reingold layout. Select it and click Run.
2) Increase the node size, decrease the edge scale, and display the id, name, and weight attributes.
3) The graph should now look like the following:
For layout selection, the recommendations below cover general scenarios:
For more detailed tutorials on Layout Algorithm and Graph Settings, please take a look at the following slides provided by Gephi:
https://gephi.org/users/tutorial-layouts/
https://gephi.org/users/tutorial-visualization/
Traversal Visualization
Visualizing a Traversal takes a different approach: the visualization happens while the Traversal executes, giving a real-time view of its execution. A Traversal must be “configured” to operate in this mode, which requires the visualTraversal option on the config function of the :remote command:
    gremlin> :remote config visualTraversal graph //(1)
    ==>Connection to Gephi - http://localhost:8080/workspace1 with stepDelay:1000, startRGBColor:[0.0, 1.0, 0.5], colorToFade:g, colorFadeRate:0.7, startSize:10.0,sizeDecrementRate:0.33
    gremlin> traversal = vg.V(2).in().out('knows').
               has('age',gt(30)).outE('created').
               has('weight',gt(0.5d)).inV();[] //(2)
    gremlin> :> traversal //(3)
    ==>v[5]
    ==>false
1. Configure a “visual traversal” from your “graph” – this must be a Graph instance. This command will create a new TraversalSource called “vg” that must be used to visualize any spawned traversals in Gephi.
2. Define the traversal to be visualized. Note that ending the line with ;[] simply prevents iteration of the traversal before it is submitted.
3. Submit the Traversal to visualize to Gephi.
When the :> line is called, each step of the Traversal that produces or filters vertices generates events to Gephi. The events update the color and size of the vertices at that step with startRGBColor and startSize respectively. After the first step visualization, it sleeps for the configured stepDelay in milliseconds. On the second step, it decays the configured colorToFade of all the previously visited vertices in prior steps, by multiplying the current colorToFade value for each vertex with the colorFadeRate. Setting the colorFadeRate value to 1.0 will prevent the color decay. The screenshots below show how the visualization evolves over the four steps:
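The fade rule described above is a repeated multiplication, which a few lines of Python can make concrete. This is a minimal numeric sketch, not the plugin's code. It assumes the defaults from the connection message (colorToFade g starting at 1.0, colorFadeRate 0.7) and tracks only the faded channel of the vertices highlighted in the first step. The function name fade_schedule is hypothetical.

```python
from typing import List


def fade_schedule(start: float, fade_rate: float, steps: int) -> List[float]:
    """Value of the faded color channel at each step for a vertex lit in step 1."""
    values = [start]
    for _ in range(steps - 1):
        # Each later step multiplies the current channel value by the fade rate.
        values.append(values[-1] * fade_rate)
    return values


if __name__ == "__main__":
    for step, value in enumerate(fade_schedule(1.0, 0.7, 4), start=1):
        print(f"step {step}: green = {value:.3f}")
```

With these defaults the green channel of a first-step vertex falls from 1.0 to about 0.7, 0.49 and 0.343 over the four steps, while a colorFadeRate of 1.0 leaves it unchanged.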