A Graph-Based Movie Recommendation Solution based on Titan DB(Hbase) + TinkerPop(sparkgraphcomputer) + Gephi - Hadoop Dev

Technical Blog Post


Body

Background

Titan is a distributed graph database designed to scale graph data processing for both real-time traversals and analytical queries. This post introduces Titan and explains how to use it with a real data model: a movie recommender engine.

Data Source

A publicly available sample data source is used in this post. Consider the following scenario, with plenty of data:

  • N billion user records
  • M billion movie records
  • Up to N*M billion rating records

What challenges will we face?

  • Diverse data must be collected from different sources (YouTube, AMC, and so on)
  • Diverse data must be managed in a unified way
  • High concurrent insert rates from different sources writing in parallel
  • Efficient processing and querying
  • Some operations require complex queries and searches

Titan Graph based solution

  • High performance: distributed in-memory computation, versus traditional SQL databases, which incur long disk I/O delays
  • Scalability: data is distributed across a multi-machine cluster
  • Flexibility: attributes can be changed freely, versus a traditional database with a fixed schema
  • Property graph: each vertex and edge is managed through its properties
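
To make the property graph idea concrete, here is a minimal in-memory sketch in plain Python. It is not Titan's API; the class names (`Vertex`, `Edge`, `PropertyGraph`) and the sample values are assumptions made up for illustration. It only shows that vertices and edges carry free-form property maps, so a new attribute can be added without a schema change.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Vertex:
    """A vertex whose attributes live in a free-form property map."""
    vid: int
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Edge:
    """A labeled, directed edge that can carry its own properties."""
    label: str
    out_v: Vertex
    in_v: Vertex
    properties: Dict[str, Any] = field(default_factory=dict)


class PropertyGraph:
    """Minimal in-memory property graph (illustration only, not Titan)."""

    def __init__(self) -> None:
        self.vertices: List[Vertex] = []
        self.edges: List[Edge] = []

    def add_vertex(self, **properties: Any) -> Vertex:
        vertex = Vertex(vid=len(self.vertices), properties=dict(properties))
        self.vertices.append(vertex)
        return vertex

    def add_edge(self, label: str, out_v: Vertex, in_v: Vertex,
                 **properties: Any) -> Edge:
        edge = Edge(label, out_v, in_v, dict(properties))
        self.edges.append(edge)
        return edge


# Hypothetical sample values, invented for this sketch.
graph = PropertyGraph()
user = graph.add_vertex(type="User", userId=1, gender="F", age=25)
movie = graph.add_vertex(type="Movie", movieId=10, title="Example Movie")
rated = graph.add_edge("rated", user, movie, stars=5)

# A new attribute can be attached later without any schema migration.
movie.properties["language"] = "en"
```

The same shape (typed vertices, labeled edges, properties on both) is what the Titan loading code later in this post builds.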

Implementation of the Movie Recommender Engine:

  • Generate the MovieRating Graph for movie recommendation
  • Analyze the MovieRating Graph for movie recommendation

Generate MovieRating Graph for movie recommendation:

We use sample data to simulate this scenario. The following code inserts the downloaded data into the Titan graph database. This is not the only way to build the Movie Graph in Titan; it is provided only as a reference.

  • Open the Titan graph and define occupations:

    g = TitanFactory.open('/usr/iop/current/titan-client/conf/titan-hbase-solr.properties')
    t = g.traversal()
    occupations = [0:'other', 1:'academic/educator', 2:'artist',
      3:'clerical/admin', 4:'college/grad student', 5:'customer service',
      6:'doctor/health care', 7:'executive/managerial', 8:'farmer',
      9:'homemaker', 10:'K-12 student', 11:'lawyer', 12:'programmer',
      13:'retired', 14:'sales/marketing', 15:'scientist', 16:'self-employed',
      17:'technician/engineer', 18:'tradesman/craftsman', 19:'unemployed', 20:'writer']

  • Parse the movie data:

    // $FilePATH is a placeholder for the directory that holds the ml-1m data
    new File('$FilePATH/ml-1m/movies.dat').eachLine { def line ->
      def components = line.split('::');
      def movieVertex = g.addVertex("type","Movie", "movieId",components[0].toInteger(), "title",components[1]);
      components[2].split('\\|').each { def genera ->
        // Reuse the Genera vertex if it already exists; otherwise create it
        def hits = t.V().has("genera",genera).iterator();
        def generaVertex = hits.hasNext() ? hits.next() : g.addVertex("type","Genera", "genera",genera);
        movieVertex.addEdge("hasGenera", generaVertex);
      }
    }

  • Parse the user data:

    new File('$FilePATH/ml-1m/users.dat').eachLine { def line ->
      def components = line.split('::');
      def userVertex = g.addVertex("type","User", "userId",components[0].toInteger(), "gender",components[1], "age",components[2].toInteger());
      def occupation = occupations[components[3].toInteger()];
      // Reuse the Occupation vertex if it already exists; otherwise create it
      def hits = t.V().has("occupation",occupation).iterator();
      def occupationVertex = hits.hasNext() ? hits.next() : g.addVertex("type","Occupation", "occupation",occupation);
      userVertex.addEdge('hasOccupation',occupationVertex);
    }

  • Parse the rating data:

    new File('$FilePATH/ml-1m/ratings.dat').eachLine { def line ->
      def components = line.split('::');
      // Connect the user vertex to the movie vertex and store the rating on the edge
      def ratedEdge = t.V().has("userId",components[0].toInteger()).next().addEdge("rated",t.V().has("movieId",components[1].toInteger()).next());
      ratedEdge.property('stars', components[2].toInteger());
    }
    g.tx().commit()
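
The record layout that the loading code relies on can be sketched without a Titan instance. The snippet below is plain Python, not Gremlin, and is only an illustration: the function names and the sample lines are invented, and the field positions are taken from the indexes used in the loading code (`components[0]`, `components[1]`, `components[2]`). It also mirrors the get-or-create pattern used for genre vertices, so each genre is stored once and shared by every movie that has it.

```python
from typing import Dict, List, Tuple


def parse_movie(line: str) -> Tuple[int, str, List[str]]:
    """Split 'movieId::title::genre1|genre2' into typed parts."""
    components = line.rstrip("\n").split("::")
    return int(components[0]), components[1], components[2].split("|")


def parse_rating(line: str) -> Tuple[int, int, int]:
    """Keep (userId, movieId, stars); any trailing fields are ignored."""
    components = line.rstrip("\n").split("::")
    return int(components[0]), int(components[1]), int(components[2])


# Invented sample lines in the same '::'-delimited shape.
sample_movies = ["1::Example Movie A::Comedy|Drama", "2::Example Movie B::Drama"]
sample_ratings = ["10::1::5::0", "10::2::3::0", "11::1::4::0"]

# Get-or-create: a genre entry is created on first sight and reused afterwards.
genre_to_movies: Dict[str, List[int]] = {}
for movie_line in sample_movies:
    movie_id, _title, genres = parse_movie(movie_line)
    for genre in genres:
        genre_to_movies.setdefault(genre, []).append(movie_id)

ratings = [parse_rating(rating_line) for rating_line in sample_ratings]
```

Splitting on `::` assumes that no title contains that delimiter; a title that did would need a bounded split instead.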

Now that the Movie data has been parsed and represented as a graph, the data is ready to be traversed (i.e. queried).

Analyze the MovieRating Graph for movie recommendation

  1. Traverse the Movie Graph in Titan DB

    • Import the data source and generate the graph
    • Populate the graph in Titan
    • Execute the graph traversal

    graph = TitanFactory.open('/usr/iop/current/titan-client/conf/titan-hbase-solr.properties')
    graph.traversal()
    ...........

  2. Perform Movie Graph data analysis in Spark

  • Import the data source / read the graph from HBase
  • Load the graph into Spark memory
  • Execute the graph traversal / VertexProgram

    graph = GraphFactory.open('/usr/iop/current/titan-client/conf/hadoop-graph/hadoop-hbase-read.properties') // Read the graph from HBase
    t = graph.traversal().withComputer(SparkGraphComputer)
    t.V().valueMap()
    ............
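
The traversals themselves are elided in this post. As one illustration of what a recommendation query over the "rated" edges might compute, the sketch below walks a common collaborative-filtering path (user, liked movies, other users who liked them, movies those users liked) and counts the paths that reach each candidate. It is plain Python over an in-memory dictionary, not the traversal used here, and the user names, movie ids, ratings, and the `min_stars` threshold are all invented assumptions.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical ratings: user -> {movie: stars}. Values are invented.
ratings: Dict[str, Dict[str, int]] = {
    "alice": {"m1": 5, "m2": 4},
    "bob": {"m1": 5, "m3": 5, "m4": 2},
    "carol": {"m2": 4, "m3": 4, "m5": 5},
}


def recommend(user: str, min_stars: int = 4, limit: int = 3) -> List[Tuple[str, int]]:
    """Rank unseen movies by the number of user-movie-user-movie paths."""
    # Step 1: movies this user rated highly, and everything they have rated.
    liked = {m for m, s in ratings[user].items() if s >= min_stars}
    seen = set(ratings[user])
    scores: Counter = Counter()
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        # Step 2: keep only users who also liked one of those movies.
        shared = liked & {m for m, s in other_ratings.items() if s >= min_stars}
        if not shared:
            continue
        # Step 3: each shared movie contributes one path to every unseen
        # movie that the other user rated highly.
        for movie, stars in other_ratings.items():
            if stars >= min_stars and movie not in seen:
                scores[movie] += len(shared)
    return scores.most_common(limit)


top = recommend("alice")
```

With the sample data, "m3" is reached through both "bob" and "carol", so it ranks ahead of "m5"; "m4" is skipped because its rating is below the threshold.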

Visualize Movie Graph with Gephi

  • Get Graph
  • Set Configuration
  • Traversal Visualization: the Gephi plugin for Gremlin Console uses Gephi's Graph Streaming API to allow for graph and traversal visualization

How to setup Gephi

  1. The Gephi plugin for Titan Client (Gremlin Console) is compatible with Gephi 0.9.x. Download Gephi from https://gephi.org/
     Note: Both Titan and Gephi require JDK 1.8.

  2. Install the Graph Streaming plugin
    a. Open Tools > Plugins.

    b. Click Available Plugins, search for "graph streaming", check the Graph Streaming plugin, and click Install.

    c. Follow the instructions to install the plugin and click Finish to restart Gephi.

    d. Launch Gephi and click New Project.

    e. In the lower-left view, click the “Streaming” tab, open the Master drop-down, and right-click Master Server > Start. This starts the Graph Streaming server in Gephi, which by default accepts requests at http://localhost:8080/workspace1.
    Important: The Gephi Streaming plugin only supports localhost connections.

    Note: The Gephi Streaming plugin does not detect port conflicts and will appear to start successfully even if something is already active on the port it wants to use (8080 by default). Make sure nothing is running on that port before starting the plugin. Otherwise the console will appear to submit requests to Gephi successfully, but nothing will render.

    f. Start the Gremlin Console and activate the Gephi plugin:

      gremlin> :plugin use tinkerpop.gephi
      ==>tinkerpop.gephi activated

      gremlin> graph = TinkerFactory.createModern()
      ==>tinkergraph[vertices:6 edges:6]

      gremlin> :remote connect tinkerpop.gephi
      ==>Connection to Gephi - http://localhost:8080/workspace1 with stepDelay:1000, startRGBColor:[0.0, 1.0, 0.5], colorToFade:g, colorFadeRate:0.7, startSize:10.0,sizeDecrementRate:0.33

      gremlin> :> graph
      ==>tinkergraph[vertices:6 edges:6]
      ==>false

    The above Gremlin session activates the Gephi plugin, creates the “modern” TinkerGraph, uses the :remote command to set up a connection to the Graph Streaming server in Gephi (with default parameters that will be explained below), and then uses :> (shorthand for :submit) to send the vertices and edges of the graph to the Gephi Streaming Server. The resulting graph appears in Gephi as displayed below.

    [Screenshot: the “modern” TinkerGraph as rendered in Gephi]

    g. The graph is now visualized in Gephi, but it may look awkward. This is where graph layout algorithms and graph settings come in.
    1) Choose a graph layout in the Layout tab. Many layouts are available; here we select the Fruchterman Reingold layout and click Run.

    2) Increase the node size, decrease the edge scale, and display the id, name, and weight attributes.

    3) The graph should now look like the following

    [Screenshot: the graph after applying the layout and display settings]
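
The port-conflict note in step e can be checked before starting the Master Server. The following is a minimal sketch in Python, assuming the default port 8080 mentioned above; the helper name `port_in_use` is hypothetical and is not part of Gephi or the Gremlin Console. It simply tries to open a TCP connection to the port and reports whether something accepted it.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something already accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception.
        return probe.connect_ex((host, port)) == 0


if __name__ == "__main__":
    # 8080 is the default Graph Streaming port named in step e.
    if port_in_use(8080):
        print("Port 8080 is busy; free it before starting the Master Server.")
    else:
        print("Port 8080 looks free.")
```

Run this before clicking Master Server > Start. If it is run after the plugin has started, a busy port is the expected result.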

For layout selection, the figure below gives recommendations for general scenarios.

[Figure: recommended layouts for general scenarios]

For more detailed tutorials on Layout Algorithm and Graph Settings, please take a look at the following slides provided by Gephi:
https://gephi.org/users/tutorial-layouts/
https://gephi.org/users/tutorial-visualization/

Traversal Visualization

Visualization of a Traversal has a different approach as the visualization occurs as the Traversal is executing, thus showing a real-time view of its execution. A Traversal must be “configured” to operate in this format and for that it requires use of the visualTraversal option on the config function of the :remote command:

  gremlin> :remote config visualTraversal graph //(1)
  ==>Connection to Gephi - http://localhost:8080/workspace1 with stepDelay:1000, startRGBColor:[0.0, 1.0, 0.5], colorToFade:g, colorFadeRate:0.7, startSize:10.0,sizeDecrementRate:0.33

  gremlin> traversal = vg.V(2).in().out('knows').
                         has('age',gt(30)).outE('created').
                         has('weight',gt(0.5d)).inV();[] //(2)

  gremlin> :> traversal //(3)
  ==>v[5]
  ==>false

1. Configure a “visual traversal” from your “graph” – this must be a Graph instance. This command will create a new TraversalSource called “vg” that must be used to visualize any spawned traversals in Gephi.
2. Define the traversal to be visualized. Note that ending the line with ;[] simply prevents iteration of the traversal before it is submitted.
3. Submit the Traversal to visualize to Gephi.
When the :> line is called, each step of the Traversal that produces or filters vertices generates events to Gephi. The events update the color and size of the vertices at that step with startRGBColor and startSize respectively. After the first step visualization, it sleeps for the configured stepDelay in milliseconds. On the second step, it decays the configured colorToFade of all the previously visited vertices in prior steps, by multiplying the current colorToFade value for each vertex with the colorFadeRate. Setting the colorFadeRate value to 1.0 will prevent the color decay. The screenshots below show how the visualization evolves over the four steps:

[Screenshots: the visualization at each of the four steps]
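
The color decay described above is a repeated multiplication, which the following Python sketch works through. It uses the defaults shown in the connection output (colorToFade g, whose start value in startRGBColor is 1.0, and colorFadeRate 0.7); the function name `faded_values` is hypothetical and the code only reproduces the arithmetic, not the plugin itself.

```python
from typing import List


def faded_values(start: float, fade_rate: float, steps: int) -> List[float]:
    """Faded channel value of a vertex at the step it is visited and after."""
    values = [start]
    for _ in range(steps - 1):
        # Each later step multiplies the current value by the fade rate.
        values.append(values[-1] * fade_rate)
    return values


# Green channel of a vertex visited in the first of four steps.
green = faded_values(start=1.0, fade_rate=0.7, steps=4)

# With a fade rate of 1.0 the color never decays.
no_decay = faded_values(start=1.0, fade_rate=1.0, steps=4)
```

With the defaults, the green channel goes from 1.0 to roughly 0.7, 0.49, and 0.343 over the four steps, so vertices visited earlier appear progressively dimmer than those visited most recently.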


UID

ibm16260015
