Graph database project: a summary of Titan features

Source: Internet | Posted by: 图象扭曲算法 | Editor: 程序博客网 | Time: 2024/06/15 16:58

I put these notes together over the last couple of days. Please credit the source when reposting. Thanks.

Titan
1. Overview
1.1 Structure
Titan is a graph database engine. Titan itself is focused on compact graph serialization, rich graph data modeling, and efficient query execution. In addition, Titan utilizes Hadoop for graph analytics and batch graph processing. Titan implements robust, modular interfaces for data persistence, data indexing, and client access. Titan’s modular architecture allows it to interoperate with a wide range of storage, index, and client technologies; it also eases the process of extending Titan to support new ones.

1.2 General Titan Benefits
• Support for very large graphs. Titan graphs scale with the number of machines in the cluster.
• Support for very many concurrent transactions and operational graph processing. Titan’s transactional capacity scales with the number of machines in the cluster and answers complex traversal queries on huge graphs in milliseconds.
• Support for global graph analytics and batch graph processing through the Hadoop framework.
• Support for geo, numeric range, and full text search for vertices and edges on very large graphs.
• Native support for the popular property graph data model exposed by Blueprints.
• Native support for the graph traversal language Gremlin.
• Easy integration with the Rexster graph server for programming language agnostic connectivity.
• Numerous graph-level configurations provide knobs for tuning performance.
• Vertex-centric indices provide vertex-level querying to alleviate issues with the infamous super node problem.
• Provides an optimized disk representation to allow for efficient use of storage and speed of access.
• Open source under the liberal Apache 2 license.
1.3 Benefits of Titan with Cassandra
• Continuously available with no single point of failure.
• No read/write bottlenecks to the graph as there is no master/slave architecture.
• Elastic scalability allows for the introduction and removal of machines.
• Caching layer ensures that continuously accessed data is available in memory.
• Increase the size of the cache by adding more machines to the cluster.
• Integration with Hadoop.
• Open source under the liberal Apache 2 license.

2. Graph Model

A property graph has these elements:
2.1 A set of vertices
o each vertex has a unique identifier.
o each vertex has a set of outgoing edges.
o each vertex has a set of incoming edges.
o each vertex has a collection of properties defined by a map from key to value.
2.2 A set of edges
o each edge has a unique identifier.
o each edge has an outgoing tail vertex.
o each edge has an incoming head vertex.
o each edge has a label that denotes the type of relationship between its two vertices.
o each edge has a collection of properties defined by a map from key to value.
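The element list above maps fairly directly onto a small data structure. The following is a minimal in-memory sketch of that model, assuming nothing about how Titan or Blueprints store data internally. The class name PropertyGraphSketch and its methods are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a toy property graph, not Titan's or Blueprints' implementation.
class PropertyGraphSketch {

    static class Vertex {
        final long id;                                          // unique identifier
        final List<Edge> outEdges = new ArrayList<>();          // edges leaving this vertex
        final List<Edge> inEdges = new ArrayList<>();           // edges entering this vertex
        final Map<String, Object> properties = new HashMap<>(); // key -> value map

        Vertex(long id) { this.id = id; }
    }

    static class Edge {
        final long id;       // unique identifier
        final Vertex tail;   // vertex the edge goes out of
        final Vertex head;   // vertex the edge comes into
        final String label;  // type of relationship between tail and head
        final Map<String, Object> properties = new HashMap<>();

        Edge(long id, Vertex tail, String label, Vertex head) {
            this.id = id;
            this.tail = tail;
            this.label = label;
            this.head = head;
        }
    }

    // One counter shared by vertices and edges keeps every identifier unique.
    private long nextId = 0;

    Vertex addVertex() {
        return new Vertex(nextId++);
    }

    Edge addEdge(Vertex tail, String label, Vertex head) {
        Edge e = new Edge(nextId++, tail, label, head);
        tail.outEdges.add(e); // register on the outgoing side
        head.inEdges.add(e);  // register on the incoming side
        return e;
    }

    public static void main(String[] args) {
        PropertyGraphSketch g = new PropertyGraphSketch();
        Vertex a = g.addVertex();
        Vertex b = g.addVertex();
        a.properties.put("name", "marko");
        b.properties.put("name", "peter");
        Edge e = g.addEdge(a, "knows", b);
        e.properties.put("since", 2006);
        System.out.println("e[" + e.id + "][" + e.tail.id + "-" + e.label + "->" + e.head.id + "]");
    }
}
```

Keeping both adjacency lists on the vertex is what makes traversal in either direction cheap. A real engine such as Titan adds persistence, indexing, and transactions on top of this shape.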
3. Data storage
Between Titan and the disks sit one or more storage and indexing adapters. Titan ships with the following adapters, and its modular architecture also supports third-party adapters.
o Cassandra
o HBase
o BerkeleyDB
Opening a TitanGraph on Cassandra:
TitanGraph g = TitanFactory.build()
    .set("storage.backend", "cassandra")
    .set("storage.hostname", "127.0.0.1")
    .open();
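Switching adapters is mostly a matter of changing configuration values. The sketch below only assembles the settings as a java.util.Properties object and does not call Titan at all. The keys storage.backend and storage.hostname come from the example above. The key storage.directory and the backend name "hbase" are assumptions based on the Titan 0.5 documentation and should be checked against the version in use. The class BackendConfigSketch is hypothetical.

```java
import java.util.Properties;

// Illustrative only: builds configuration values, does not open a graph.
class BackendConfigSketch {

    // location is a hostname for the remote stores and a directory for BerkeleyDB.
    static Properties forBackend(String backend, String location) {
        Properties p = new Properties();
        p.setProperty("storage.backend", backend);
        switch (backend) {
            case "cassandra":
            case "hbase":
                // Remote stores are reached over the network.
                p.setProperty("storage.hostname", location);
                break;
            case "berkeleyje":
                // BerkeleyDB is an embedded store that writes to a local directory
                // (assumed key name: storage.directory).
                p.setProperty("storage.directory", location);
                break;
            default:
                throw new IllegalArgumentException("Unknown backend: " + backend);
        }
        return p;
    }

    public static void main(String[] args) {
        System.out.println(forBackend("cassandra", "127.0.0.1"));
        System.out.println(forBackend("berkeleyje", "/tmp/titan"));
    }
}
```

Each key/value pair would be passed to the graph builder in the same way as the .set(...) calls shown above.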
4. Access method
Broadly speaking, applications can interact with Titan in two ways:
• Method calls to Titan’s Java-language APIs, which include
o Titan’s native Blueprints API implementation
o A superset of Blueprints functionality called TitanGraph, which provides some Titan features that aren’t part of the vendor-neutral Blueprints spec
• TinkerPop stack utilities built atop Blueprints, such as
o The Gremlin query language
o The Rexster graph server
4.1 Gremlin
object {
string "_id";
array { string } inEdges;
array { string } outEdges;
object { }* properties;
};
object {
string "_id";
string label;
string inVertex;
string outVertex;
object { }* properties;
};
Gremlin:
gremlin> $v := g:add-v($g)
==>v[0]
gremlin> $u := g:add-v($g)
==>v[1]
gremlin> $e := g:add-e($g, $v, 'related_to', $u)
==>e[2][0-related_to->1]
4.2 Rexster
Rexster can be wrapped around each Titan instance defined in the previous subsection. In this way, the end-user application need not be a Java-based application as it can communicate with Rexster over REST.
This type of deployment is great for polyglot architectures where various components written in different languages need to reference and compute on the graph.
http://rexster.titan.machine1/mygraph/vertices/1
http://rexster.titan.machine2/mygraph/tp/gremlin?script=g.v(1).out('follows').out('created')
See the Basic REST API documentation: https://github.com/tinkerpop/rexster/wiki/Basic-REST-API
• Limitation: Titan assigns identifiers automatically, so an edge cannot be POSTed with an identifier. In other words, POST to http://localhost/graphs/titan/edges and not to http://localhost/graphs/titan/edges/1234. Titan uses key indices and does not support manual indices, so operations on the indices resource are not supported. Use key indices instead.
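Because Rexster is reached over plain HTTP, a client in any language only needs to build the right URLs. The sketch below assembles the three kinds of URL mentioned in this section using only the Java standard library. The class RexsterUrlSketch and its method names are hypothetical, the base URL is the one from the limitation note, and no request is actually sent.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Illustrative only: builds Rexster-style URLs, performs no network I/O.
class RexsterUrlSketch {

    // URL of a single vertex, e.g. <base>/vertices/1
    static String vertexUrl(String graphBase, long vertexId) {
        return graphBase + "/vertices/" + vertexId;
    }

    // URL of the edge collection. New edges are POSTed here without an id,
    // because Titan assigns identifiers itself.
    static String edgeCollectionUrl(String graphBase) {
        return graphBase + "/edges";
    }

    // URL that runs a Gremlin script; the script must be URL-encoded.
    static String gremlinUrl(String graphBase, String script) {
        try {
            return graphBase + "/tp/gremlin?script=" + URLEncoder.encode(script, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String base = "http://localhost/graphs/titan";
        System.out.println(vertexUrl(base, 1));
        System.out.println(edgeCollectionUrl(base));
        System.out.println(gremlinUrl(base, "g.v(1).out('follows').out('created')"));
    }
}
```

Encoding the script matters because characters such as parentheses and quotes are not safe inside a query string.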
4.3 Blueprints
Blueprints is a generic Java graph API that binds to various graph backends (i.e. frameworks and databases). It has connectors to popular graph databases including Neo4j, OrientDB, Titan, DEX, and InfiniteGraph.
// This example uses the Neo4j connector; other Blueprints Graph implementations expose the same calls.
Graph graph = new Neo4jGraph("/tmp/my_graph");
// Passing null as the id lets the backend assign one.
Vertex a = graph.addVertex(null);
Vertex b = graph.addVertex(null);
a.setProperty("name", "marko");
b.setProperty("name", "peter");
// Edge from a to b with the label "knows".
Edge e = graph.addEdge(null, a, b, "knows");
e.setProperty("since", 2006);
graph.shutdown();

5. Three deployments
5.1 Local server mode

5.2 Remote server mode
Each Rexster server would be configured to connect to the Cassandra cluster.

5.3 Embedded mode
Titan internally starts a Cassandra daemon. Titan no longer connects to an existing cluster but is its own cluster.

6. Transactional scope
All graph elements (vertices, edges, and types) are associated with the transactional scope in which they were retrieved or created. Under Blueprints' default transactional semantics, transactions are automatically created with the first operation on the graph and closed explicitly using commit() or rollback(). Once the transaction is closed, all graph elements associated with that transaction become stale and unavailable. However, Titan will automatically transition vertices and types into the new transactional scope, as shown in this example:
TitanGraph g = TitanFactory.open("berkeleyje:/tmp/titan");
Vertex juno = g.addVertex(null); // Automatically opens a new transaction
g.commit(); // Ends transaction
juno.setProperty("name", "juno"); // Vertex is automatically transitioned
Edges, on the other hand, are not automatically transitioned and cannot be accessed outside their original transaction. They must be explicitly transitioned.
Edge e = juno.addEdge("knows", g.addVertex(null));
g.commit(); // Ends transaction
e = g.getEdge(e); // Need to refresh edge
e.setProperty("time", 99);
7. Performance
To provide a foundational layer of data, the Twitter graph as of 2009 was first loaded into the Titan cluster. This data includes 41.7 million user vertices and 1.47 billion follows edges. After loading, the 40 m1.small machines are put into a “while(true) loop” (in fact, there are 10 concurrent threads on each worker running 125,000 iterations). During each iteration of the loop, a worker selects an operation to enact using a biased coin toss (see the diagram on the left). The distribution heavily favors stream reading as this is typically the most prevalent operation in such online social systems. Next, if a recommendation is provided, then there is a 30% chance that the user will follow one of the recommended users. This is how follows edges are added to the graph.

The normal load simulation ran for 2.3 hours and during that time, 49 million transactions occurred. This comes to approximately 5,900 transactions a second. Assuming that a human user does a transaction every 5-10 seconds (e.g. reads their stream and then publishes a tweet, etc.), this Titan cluster is supporting approximately 50,000 concurrent users. In the table below, the number of transactions per operation, the average transaction times, the standard deviation of those times, and the 3 sigma times are presented. 3 sigma is 3 standard deviations greater than the mean and represents the expected worst case time that 0.1% of the users will experience. Finally, note that creating an account is a slower transaction because it is a locking operation that ensures that no two users have the same username (i.e. Twitter handle).
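The throughput and concurrency figures quoted above can be rechecked with a few lines of arithmetic. The sketch below uses only the numbers given in the text (49 million transactions, 2.3 hours, one transaction per user every 5 to 10 seconds). The class LoadMathSketch and its method names are hypothetical.

```java
// Illustrative only: re-derives the throughput figures quoted in the text.
class LoadMathSketch {

    // Average transactions per second over the whole run.
    static double transactionsPerSecond(long transactions, double hours) {
        return transactions / (hours * 3600.0);
    }

    // If each user issues one transaction every secondsPerTransaction seconds,
    // this many users together generate the given throughput.
    static double concurrentUsers(double tps, double secondsPerTransaction) {
        return tps * secondsPerTransaction;
    }

    public static void main(String[] args) {
        double tps = transactionsPerSecond(49_000_000L, 2.3);
        System.out.printf("throughput: %.0f tx/s%n", tps);
        System.out.printf("users at 5 s per tx:  %.0f%n", concurrentUsers(tps, 5));
        System.out.printf("users at 10 s per tx: %.0f%n", concurrentUsers(tps, 10));
    }
}
```

The throughput comes out at roughly 5,900 transactions per second, and the 5 to 10 second assumption gives a range of roughly 30,000 to 59,000 users, so the quoted figure of about 50,000 sits toward the upper end of that range.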

Normal Load performance (2.3 hours):

Peak Load performance (1.3 hours):

8. Graph DB comparison
News of Titan: The prize in DataStax's acquisition of open-source firm Aurelius is not the Titan database but rather engineering expertise, which will be used in developing a new graph database.
http://www.zdnet.com/article/datastax-snaps-up-aurelius-and-its-titan-team-to-build-new-graph-database/

9. References
http://thinkaurelius.com/2012/08/06/titan-provides-real-time-big-graph-data/
http://s3.thinkaurelius.com/docs/titan/0.5.4/index.html
https://github.com/tinkerpop/gremlin
http://www.zdnet.com/article/datastax-snaps-up-aurelius-and-its-titan-team-to-build-new-graph-database/
http://db-engines.com/en/ranking/graph+dbms
http://db-engines.com/en/system/Neo4j%3BOrientDB%3BTitan

0 0