JanusGraph Reindexing

Source: Internet  Editor: 程序博客网  Date: 2024/05/22 02:17

Chapter 31.1 Reindexing

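Section 9.1, “Graph Index”, and Section 9.2, “Vertex-centric Indexes”, describe how to build global and vertex-centric indexes to improve query performance. If the indexed keys and labels are newly created in the same transaction as the index, the index takes effect immediately and no reindex is needed. If the indexed keys and labels already existed before the index was built, the whole graph must be reindexed so that the index also covers the elements added earlier. This chapter describes the reindex process.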

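Warning: reindexing is a manual process made up of several steps. The steps must be carried out in the correct order, otherwise the index can become inconsistent.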

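31.1.1. Overview

As soon as an index is defined, JanusGraph can begin writing incremental index updates. However, before the index is complete and usable, JanusGraph must also make a one-time read pass over all existing elements associated with the newly indexed schema types. Once the reindex completes, the index is fully populated with the existing data; it must then be in the ENABLED state before queries can use it.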
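
To make the reasoning above concrete, here is a minimal sketch of why an index defined after data already exists misses the older elements until a one-time pass backfills it. This is a toy in-memory model, not JanusGraph code: the class `ReindexSketch` and its methods `add`, `defineIndex`, `reindex` and `lookup` are hypothetical names invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy in-memory model (NOT JanusGraph code): an index defined after data
// exists only sees new writes until a one-time reindex pass backfills it.
public class ReindexSketch {
    // "Storage": element id -> value of the indexed property.
    private final Map<Integer, String> elements = new HashMap<>();
    // "Index": property value -> ids of elements with that value.
    // Stays null until the index is defined.
    private Map<String, Set<Integer>> index = null;
    private int nextId = 0;

    // Writes always reach storage; they reach the index only once it exists.
    public int add(String value) {
        int id = nextId++;
        elements.put(id, value);
        if (index != null) {
            index.computeIfAbsent(value, k -> new TreeSet<>()).add(id);
        }
        return id;
    }

    // Defining the index starts incremental updates but ignores old data.
    public void defineIndex() {
        index = new HashMap<>();
    }

    // One-time read pass over every stored element.
    public void reindex() {
        requireIndex();
        for (Map.Entry<Integer, String> e : elements.entrySet()) {
            index.computeIfAbsent(e.getValue(), k -> new TreeSet<>()).add(e.getKey());
        }
    }

    // Index-only lookup: returns just what the index currently knows.
    public Set<Integer> lookup(String value) {
        requireIndex();
        return index.getOrDefault(value, new TreeSet<>());
    }

    private void requireIndex() {
        if (index == null) {
            throw new IllegalStateException("index has not been defined yet");
        }
    }

    public static void main(String[] args) {
        ReindexSketch g = new ReindexSketch();
        g.add("foo");                        // id 0, written before the index exists
        g.defineIndex();
        g.add("foo");                        // id 1, picked up incrementally
        System.out.println(g.lookup("foo")); // prints [1]
        g.reindex();
        System.out.println(g.lookup("foo")); // prints [0, 1]
    }
}
```

Before the backfill the lookup returns only the element written after the index was defined. That incomplete result is what the reindex step in this chapter is meant to prevent.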

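31.1.2. Before Reindexing

The starting point of the reindex process is the creation of an index. See Chapter 9 for how to create global and vertex-centric indexes. Note that a global index is uniquely identified by its name. A vertex-centric index is uniquely identified by the combination of its name and the edge label or property key on which it is defined. The name of that label or key is what the rest of this chapter calls the index type, and it applies only to vertex-centric indexes.

After creating a new index on existing schema elements, wait a few minutes for the index to be announced to the other nodes in the cluster. Make a note of the index name (and the index type, for a vertex-centric index), since both are required when reindexing.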
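
Rather than waiting a fixed number of minutes, the examples later in this chapter block until the index reports the expected status. The idea behind that kind of wait can be sketched as a generic polling loop with a timeout. This is only an illustration of the pattern, not JanusGraph's implementation: `AwaitStatusSketch`, its `Status` enum and the `await` method are hypothetical stand-ins, and the status source is simulated.

```java
import java.time.Duration;
import java.util.function.Supplier;

// Generic polling sketch (NOT JanusGraph's implementation): wait until a
// status supplier reports the wanted value or a timeout expires.
public class AwaitStatusSketch {
    // Hypothetical stand-in for the index states named in this chapter.
    public enum Status { INSTALLED, REGISTERED, ENABLED }

    // Returns true if `wanted` was observed before `timeout` elapsed.
    public static boolean await(Supplier<Status> current, Status wanted,
                                Duration timeout, Duration pollInterval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (current.get() == wanted) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out without seeing the wanted status
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
    }

    public static void main(String[] args) {
        // Simulated cluster: the index reports REGISTERED on the third poll.
        int[] polls = {0};
        Supplier<Status> simulated =
            () -> ++polls[0] < 3 ? Status.INSTALLED : Status.REGISTERED;
        boolean ok = await(simulated, Status.REGISTERED,
                           Duration.ofSeconds(5), Duration.ofMillis(10));
        System.out.println(ok + " after " + polls[0] + " polls"); // prints "true after 3 polls"
    }
}
```

In a real deployment the status would come from JanusGraph itself. The point of the sketch is that the reindex step should start only after the REGISTERED status has actually been observed.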

31.1.3. Preparing to Reindex

A reindex job can run on either of two execution frameworks:

  • MapReduce
  • JanusGraphManagement

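Reindexing on MapReduce supports large, horizontally distributed databases. Reindexing on JanusGraphManagement runs as a single-machine OLAP job. It is intended for convenience and speed on databases small enough for one machine to handle.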

Reindexing requires:

  • The index name
  • The index type (for a vertex-centric index, the name of the edge label or property key it is built on; not needed for other indexes)

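31.1.4. Executing a Reindex Job on MapReduce

The recommended way to create and run a reindex job on MapReduce is through the MapReduceIndexManagement class. The rough steps for running a reindex job with this class are: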

  • Open a JanusGraph instance
  • Pass the graph instance to the MapReduceIndexManagement constructor
  • Call updateIndex(index, SchemaAction.REINDEX) on the MapReduceIndexManagement instance
  • If the index is not yet enabled, enable it through JanusGraphManagement

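The MapReduceIndexManagement class implements an updateIndex method that supports only the REINDEX and REMOVE_INDEX values of SchemaAction. The class starts a Hadoop MapReduce job using the Hadoop configuration and the jars on the classpath. Both Hadoop 1 and Hadoop 2 are supported. The class obtains metadata about the index and the storage backend (for example, the Cassandra partitioner) from the JanusGraph instance passed to its constructor.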

graph = JanusGraphFactory.open(...)
mgmt = graph.openManagement()
mr = new MapReduceIndexManagement(graph)
mr.updateIndex(mgmt.getRelationIndex(mgmt.getRelationType("battled"), "battlesByTime"), SchemaAction.REINDEX).get()
mgmt.commit()

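31.1.4.1. MapReduce Reindexing Example

The following Gremlin snippet walks through every step of the MapReduce reindex process in one self-contained example, using the Cassandra storage backend: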

// Open a graph
graph = JanusGraphFactory.open("conf/janusgraph-cassandra-es.properties")
g = graph.traversal()
// Define a property
mgmt = graph.openManagement()
desc = mgmt.makePropertyKey("desc").dataType(String.class).make()
mgmt.commit()
// Insert some data
graph.addVertex("desc", "foo bar")
graph.addVertex("desc", "foo baz")
graph.tx().commit()
// Run a query -- note the planner warning recommending the use of an index
g.V().has("desc", containsText("baz"))
// Create an index
mgmt = graph.openManagement()
desc = mgmt.getPropertyKey("desc")
mixedIndex = mgmt.buildIndex("mixedExample", Vertex.class).addKey(desc).buildMixedIndex("search")
mgmt.commit()
// Rollback or commit transactions on the graph which predate the index definition
graph.tx().rollback()
// Block until the SchemaStatus transitions from INSTALLED to REGISTERED
report = mgmt.awaitGraphIndexStatus(graph, "mixedExample").call()
// Run a JanusGraph-Hadoop job to reindex
mgmt = graph.openManagement()
mr = new MapReduceIndexManagement(graph)
mr.updateIndex(mgmt.getGraphIndex("mixedExample"), SchemaAction.REINDEX).get()
// Enable the index
mgmt = graph.openManagement()
mgmt.updateIndex(mgmt.getGraphIndex("mixedExample"), SchemaAction.ENABLE_INDEX).get()
mgmt.commit()
// Block until the SchemaStatus is ENABLED
mgmt = graph.openManagement()
report = mgmt.awaitGraphIndexStatus(graph, "mixedExample").status(SchemaStatus.ENABLED).call()
mgmt.rollback()
// Run a query -- JanusGraph will use the new index, no planner warning
g.V().has("desc", containsText("baz"))
// Concerned that JanusGraph could have read cache in that last query, instead of relying on the index?
// Start a new instance to rule out cache hits.  Now we're definitely using the index.
graph.close()
graph = JanusGraphFactory.open("conf/janusgraph-cassandra-es.properties")
// Recreate the traversal source: the old g is bound to the closed graph
g = graph.traversal()
g.V().has("desc", containsText("baz"))

31.1.5. Executing a Reindex Job on JanusGraphManagement

To run a reindex job on JanusGraphManagement, call JanusGraphManagement.updateIndex with the SchemaAction.REINDEX argument. For example:

m = graph.openManagement()
i = m.getGraphIndex('indexName')
m.updateIndex(i, SchemaAction.REINDEX).get()
m.commit()

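31.1.5.1. JanusGraphManagement Reindexing Example

The following example uses a JanusGraph database backed by BerkeleyDB. It loads some data, defines an index afterwards, reindexes with JanusGraphManagement, and finally waits for the index to become enabled before using it: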

import org.janusgraph.graphdb.database.management.ManagementSystem
// Load some data from a file without any predefined schema
graph = JanusGraphFactory.open('conf/janusgraph-berkeleyje.properties')
g = graph.traversal()
m = graph.openManagement()
m.makePropertyKey('name').dataType(String.class).cardinality(Cardinality.LIST).make()
m.makePropertyKey('lang').dataType(String.class).cardinality(Cardinality.LIST).make()
m.makePropertyKey('age').dataType(Integer.class).cardinality(Cardinality.LIST).make()
m.commit()
graph.io(IoCore.gryo()).readGraph('data/tinkerpop-modern.gio')
graph.tx().commit()
// Run a query -- note the planner warning recommending the use of an index
g.V().has('name', 'lop')
graph.tx().rollback()
// Create an index
m = graph.openManagement()
m.buildIndex('names', Vertex.class).addKey(m.getPropertyKey('name')).buildCompositeIndex()
m.commit()
graph.tx().commit()
// Block until the SchemaStatus transitions from INSTALLED to REGISTERED
ManagementSystem.awaitGraphIndexStatus(graph, 'names').status(SchemaStatus.REGISTERED).call()
// Reindex using JanusGraphManagement; get() waits for the job to finish before committing
m = graph.openManagement()
i = m.getGraphIndex('names')
m.updateIndex(i, SchemaAction.REINDEX).get()
m.commit()
// Block until the index is ENABLED
ManagementSystem.awaitGraphIndexStatus(graph, 'names').status(SchemaStatus.ENABLED).call()
// Run a query -- JanusGraph will use the new index, no planner warning
g.V().has('name', 'lop')
graph.tx().rollback()
// Concerned that JanusGraph could have read cache in that last query, instead of relying on the index?
// Start a new instance to rule out cache hits.  Now we're definitely using the index.
graph.close()
graph = JanusGraphFactory.open("conf/janusgraph-berkeleyje.properties")
g = graph.traversal()
g.V().has('name', 'lop')