JanusGraph Reindexing

Source: Internet  Editor: 程序博客网  Time: 2024/05/22 02:17

Chapter 31.1 Reindexing

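Section 9.1, “Graph Index” and Section 9.2, “Vertex-centric Indexes” describe how to build graph-global and vertex-centric indexes to improve query performance. If the property keys or labels covered by an index are newly defined in the same management transaction as the index, the index takes effect immediately and no reindex is needed. If the indexed key or label already existed before the index was built, a reindex must be run over all graph elements covered by the index so that the index also contains the elements added earlier. This section describes the reindexing process.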

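Warning: Reindexing is a manual process made up of multiple steps. The steps must be carried out in the correct order, otherwise the index can become inconsistent.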

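31.1.1. Overview

JanusGraph can begin writing incremental index updates right after an index is defined. However, before the index is complete and usable, JanusGraph must also take a one-time read pass over all existing graph elements associated with the newly indexed schema types. Once this reindex job has completed, the index is fully populated with the existing data. The index must be in the ENABLED state before it is used during query processing; if it is not yet enabled after the reindex, it has to be enabled explicitly.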
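As a rough illustration of this lifecycle, the status of an index can be inspected from the Gremlin Console. The sketch below assumes an already opened `graph` and a graph index named 'names' (the name used in the JanusGraphManagement example later in this section); both are placeholders rather than part of the original text.

```groovy
// Assumes `graph` is an open JanusGraph instance and that a graph index
// named 'names' exists; both are placeholders for illustration.
mgmt = graph.openManagement()
index = mgmt.getGraphIndex('names')

// A graph index reports its status per indexed property key.
// INSTALLED and REGISTERED mean queries do not use it yet; ENABLED means they do.
index.getFieldKeys().each { key ->
    println "${index.name()} / ${key.name()}: ${index.getIndexStatus(key)}"
}

// Read-only inspection, so discard the management transaction.
mgmt.rollback()
```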

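31.1.2. Prior to Reindex

The starting point of the reindexing process is the construction of an index. See Chapter 9 for how to build global graph indexes and vertex-centric indexes. Note that a global graph index is uniquely identified by its name. A vertex-centric index is uniquely identified by the combination of its name and the edge label or property key on which it is defined. The name of that label or key is called the index type in the rest of this section and applies only to vertex-centric indexes.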
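To make the two identification schemes concrete, here is a minimal sketch of building one index of each kind. It assumes an open `graph` whose schema already contains the property keys 'name' and 'time' and the edge label 'battled'; these names are illustrative and should be adapted to the actual schema.

```groovy
// Assumes `graph` is open and that the property keys 'name' and 'time' and the
// edge label 'battled' already exist in the schema (illustrative names).
mgmt = graph.openManagement()

// Global graph index: identified by its name alone ('names').
mgmt.buildIndex('names', Vertex.class).addKey(mgmt.getPropertyKey('name')).buildCompositeIndex()

// Vertex-centric index: identified by its name ('battlesByTime') together with
// the edge label it is defined on ('battled', the index type).
battled = mgmt.getEdgeLabel('battled')
time = mgmt.getPropertyKey('time')
mgmt.buildEdgeIndex(battled, 'battlesByTime', Direction.BOTH, time)

mgmt.commit()
```

Because 'name' and 'battled' exist before these indexes are defined, both indexes would need a reindex before they cover the existing data.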

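After building a new index against existing schema elements, it is recommended to wait a few minutes for the index to be announced to the other instances in the cluster. Make a note of the index name (and of the index type for a vertex-centric index), because this information is required when reindexing.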
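Instead of waiting a fixed amount of time, one option is to block until the index reports the REGISTERED status. The sketch below assumes an open `graph`, a graph index called 'names', and a vertex-centric index 'battlesByTime' on the edge label 'battled'; the five-minute timeout is an arbitrary choice.

```groovy
import java.time.temporal.ChronoUnit
import org.janusgraph.graphdb.database.management.ManagementSystem

// Global graph index: only the index name is needed.
report = ManagementSystem.awaitGraphIndexStatus(graph, 'names').
        status(SchemaStatus.REGISTERED).
        timeout(5, ChronoUnit.MINUTES).   // arbitrary limit chosen for this sketch
        call()
println report

// Vertex-centric index: the index name plus the index type
// (the edge label or property key the index is defined on).
ManagementSystem.awaitRelationIndexStatus(graph, 'battlesByTime', 'battled').
        status(SchemaStatus.REGISTERED).
        call()
```

The printed report indicates whether the target status was reached within the timeout.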

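31.1.3. Preparing to Reindex

A reindex job can be run on either of two execution frameworks:

MapReduce
JanusGraphManagement

Reindex on MapReduce supports large, horizontally distributed databases. Reindex on JanusGraphManagement spawns a single-machine OLAP job. It is intended for convenience and speed on databases small enough to be handled by one machine.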

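Reindexing requires:

The index name
The index type (the name of the edge label or property key on which a vertex-centric index is built; required only for vertex-centric indexes, not for global graph indexes)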
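The two inputs map directly onto how the index handle is looked up before a reindex. The helper below, `resolveIndex`, is hypothetical and not part of JanusGraph; it only wraps the `getGraphIndex` and `getRelationIndex` lookups. The index names are assumptions taken from the examples in this section.

```groovy
// Hypothetical helper (not part of JanusGraph): look up the index handle
// from an index name and an optional index type.
resolveIndex = { mgmt, String indexName, String indexType = null ->
    if (indexType == null) {
        // Global graph index: the name alone identifies it.
        return mgmt.getGraphIndex(indexName)
    }
    // Vertex-centric index: the name plus the edge label or property key.
    return mgmt.getRelationIndex(mgmt.getRelationType(indexType), indexName)
}

// Assumes `graph` is open and that both indexes exist.
mgmt = graph.openManagement()
println resolveIndex(mgmt, 'names')
println resolveIndex(mgmt, 'battlesByTime', 'battled')
mgmt.rollback()
```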

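31.1.4. Executing a Reindex Job on MapReduce

The recommended way to generate and run a reindex job on MapReduce is through the MapReduceIndexManagement class. A rough outline of the steps to run a reindex job with this class:

Open a JanusGraph instance
Pass the graph instance to the MapReduceIndexManagement constructor
Call updateIndex(index, SchemaAction.REINDEX) on the MapReduceIndexManagement instance
If the index has not yet been enabled, enable it through JanusGraphManagement

MapReduceIndexManagement implements an updateIndex method that supports only the REINDEX and REMOVE_INDEX actions for its SchemaAction parameter. The class starts a Hadoop MapReduce job using the Hadoop configuration and jars on the classpath. Both Hadoop 1 and Hadoop 2 are supported. The class obtains metadata about the index and the storage backend (for example the Cassandra partitioner) from the JanusGraph instance given to its constructor.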

    graph = JanusGraphFactory.open(...)
    mgmt = graph.openManagement()
    mr = new MapReduceIndexManagement(graph)
    mr.updateIndex(mgmt.getRelationIndex(mgmt.getRelationType("battled"), "battlesByTime"), SchemaAction.REINDEX).get()
    mgmt.commit()
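The last step of the outline, enabling the index only if it is not enabled yet, could be sketched as follows. This assumes the same open `graph` and the 'battlesByTime' index on 'battled' used in the snippet above, and that the MapReduce job has already finished.

```groovy
// Assumes `graph` is open and the MapReduce reindex job has completed.
mgmt = graph.openManagement()
index = mgmt.getRelationIndex(mgmt.getRelationType('battled'), 'battlesByTime')

// A vertex-centric index carries a single status; enable it only when needed.
if (index.getIndexStatus() != SchemaStatus.ENABLED) {
    mgmt.updateIndex(index, SchemaAction.ENABLE_INDEX).get()
}
mgmt.commit()
```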

31.1.4.1. Example for MapReduce

The following Gremlin snippet walks through every step of a MapReduce reindex in one self-contained example, using Cassandra as the storage backend:

    import org.janusgraph.graphdb.database.management.ManagementSystem
    import org.janusgraph.hadoop.MapReduceIndexManagement

    // Open a graph
    graph = JanusGraphFactory.open("conf/janusgraph-cassandra-es.properties")
    g = graph.traversal()

    // Define a property
    mgmt = graph.openManagement()
    desc = mgmt.makePropertyKey("desc").dataType(String.class).make()
    mgmt.commit()

    // Insert some data
    graph.addVertex("desc", "foo bar")
    graph.addVertex("desc", "foo baz")
    graph.tx().commit()

    // Run a query -- note the planner warning recommending the use of an index
    // (textContains is the full-text predicate from org.janusgraph.core.attribute.Text)
    g.V().has("desc", textContains("baz"))

    // Create an index
    mgmt = graph.openManagement()
    desc = mgmt.getPropertyKey("desc")
    mixedIndex = mgmt.buildIndex("mixedExample", Vertex.class).addKey(desc).buildMixedIndex("search")
    mgmt.commit()

    // Rollback or commit transactions on the graph which predate the index definition
    graph.tx().rollback()

    // Block until the SchemaStatus transitions from INSTALLED to REGISTERED
    report = ManagementSystem.awaitGraphIndexStatus(graph, "mixedExample").status(SchemaStatus.REGISTERED).call()

    // Run a JanusGraph-Hadoop job to reindex
    mgmt = graph.openManagement()
    mr = new MapReduceIndexManagement(graph)
    mr.updateIndex(mgmt.getGraphIndex("mixedExample"), SchemaAction.REINDEX).get()
    mgmt.commit()

    // Enable the index
    mgmt = graph.openManagement()
    mgmt.updateIndex(mgmt.getGraphIndex("mixedExample"), SchemaAction.ENABLE_INDEX).get()
    mgmt.commit()

    // Block until the SchemaStatus is ENABLED
    report = ManagementSystem.awaitGraphIndexStatus(graph, "mixedExample").status(SchemaStatus.ENABLED).call()

    // Run a query -- JanusGraph will use the new index, no planner warning
    g.V().has("desc", textContains("baz"))

    // Concerned that JanusGraph could have read cache in that last query, instead of relying on the index?
    // Start a new instance to rule out cache hits.  Now we're definitely using the index.
    graph.close()
    graph = JanusGraphFactory.open("conf/janusgraph-cassandra-es.properties")
    // The old traversal source belongs to the closed graph, so create a new one
    g = graph.traversal()
    g.V().has("desc", textContains("baz"))

31.1.5. Executing a Reindex Job on JanusGraphManagement

To run a reindex job on JanusGraphManagement, call JanusGraphManagement.updateIndex with the SchemaAction.REINDEX argument. For example:

    m = graph.openManagement()
    i = m.getGraphIndex('indexName')
    m.updateIndex(i, SchemaAction.REINDEX).get()
    m.commit()
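The future returned by updateIndex resolves to the metrics of the scan job, which can serve as a quick sanity check on how much the reindex did. The sketch below assumes a composite graph index named 'indexName' as in the snippet above, and assumes that the IndexRepairJob.ADDED_RECORDS_COUNT counter is reported by the JanusGraph version in use; treat the counter name as version dependent.

```groovy
import org.janusgraph.graphdb.olap.job.IndexRepairJob

// Assumes `graph` is open and 'indexName' is a REGISTERED composite index.
m = graph.openManagement()
i = m.getGraphIndex('indexName')

// get() blocks until the reindex job finishes and returns its ScanMetrics.
metrics = m.updateIndex(i, SchemaAction.REINDEX).get()
m.commit()

// Number of index records the repair job added (assumed counter name).
println "index records added: ${metrics.getCustom(IndexRepairJob.ADDED_RECORDS_COUNT)}"
```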

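31.1.5.1. Example for JanusGraphManagement

The following example loads some data into a BerkeleyDB-backed JanusGraph database, defines an index after the fact, reindexes using JanusGraphManagement, and finally waits for the index to become enabled and uses it: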

    import org.janusgraph.graphdb.database.management.ManagementSystem

    // Load some data from a file without any predefined schema
    graph = JanusGraphFactory.open('conf/janusgraph-berkeleyje.properties')
    g = graph.traversal()
    m = graph.openManagement()
    m.makePropertyKey('name').dataType(String.class).cardinality(Cardinality.LIST).make()
    m.makePropertyKey('lang').dataType(String.class).cardinality(Cardinality.LIST).make()
    m.makePropertyKey('age').dataType(Integer.class).cardinality(Cardinality.LIST).make()
    m.commit()
    graph.io(IoCore.gryo()).readGraph('data/tinkerpop-modern.gio')
    graph.tx().commit()

    // Run a query -- note the planner warning recommending the use of an index
    g.V().has('name', 'lop')
    graph.tx().rollback()

    // Create an index
    m = graph.openManagement()
    m.buildIndex('names', Vertex.class).addKey(m.getPropertyKey('name')).buildCompositeIndex()
    m.commit()
    graph.tx().commit()

    // Block until the SchemaStatus transitions from INSTALLED to REGISTERED
    ManagementSystem.awaitGraphIndexStatus(graph, 'names').status(SchemaStatus.REGISTERED).call()

    // Reindex using JanusGraphManagement; get() blocks until the job has finished
    m = graph.openManagement()
    i = m.getGraphIndex('names')
    m.updateIndex(i, SchemaAction.REINDEX).get()
    m.commit()

    // Block until the index is ENABLED
    ManagementSystem.awaitGraphIndexStatus(graph, 'names').status(SchemaStatus.ENABLED).call()

    // Run a query -- JanusGraph will use the new index, no planner warning
    g.V().has('name', 'lop')
    graph.tx().rollback()

    // Concerned that JanusGraph could have read cache in that last query, instead of relying on the index?
    // Start a new instance to rule out cache hits.  Now we're definitely using the index.
    graph.close()
    graph = JanusGraphFactory.open("conf/janusgraph-berkeleyje.properties")
    g = graph.traversal()
    g.V().has('name', 'lop')
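Besides watching for the planner warning, one way to check how the query is executed is the TinkerPop profile() step. This is a sketch under the assumption that `graph` is open and the 'names' index from the example above is ENABLED; the exact content of the profile output depends on the JanusGraph version.

```groovy
// Assumes `graph` is open and the 'names' index is ENABLED.
g = graph.traversal()

// profile() appends execution metrics to the traversal; the annotations on the
// JanusGraph step help show whether an index-backed lookup was used.
println g.V().has('name', 'lop').profile().next()

graph.tx().rollback()
```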
