Apache Cassandra Learning Step by Step (4): Data Modeling
来源:互联网 发布:淘宝号查询网站 编辑:程序博客网 时间:2024/05/10 14:24
22 Feb 2012, by Bright Zheng (IT进行时)
写在这章前面的几点牢骚或感慨:
1. 我发现建模是比较别扭的一件事情,尤其是你的脑子里都是RDBMS的ERD的时候;
2. 本人试图通过两者的建模过程体现思考要点,但感觉在NoSQL的建模上有点“那个”——如果不在大型项目上吃亏过或者直接受教于前辈,总感觉缺那么点味道;
3. 这篇是我写的最郁闷的一篇,而且可能后面需要无数个补丁,但管不了了,有错误才有感悟
5. Data Modeling
Data Modeling is one of the most important things in experiencing Cassandra, especially to those who have lots of experiences with RDBMS data modeling.
By admiring Twissandra project, we name it as Jtwissandra as an example. If possible, I’ll try to create and implement it and share it in GitHub.
This is a simple example to showcase the NoSQL concepts by admiring the Twitter via Cassandra.
5.1. Tranditional RDBMS Data Modeling
Following are the core Entities & Relationships if we’re modeling in RDBMS concepts.
Here are some pseudo codes for demonstrating the business logic/requirements:
1. Adding a new user:
USER.insert(user_id, user_name, user_password, create_timestamp);
2. Following a friend:
FRIEND.insert(user_id, followed_id, create_timestamp)
as ($current_user_id, user_id, create_timestamp);
FOLLOWER.insert(user_id, follower_id, create_timestamp)
as (user_id, $current_user_id, create_timestamp);
3. Tweetting:
FRIEND.insert(user_id, followed_id, create_timestamp)
as ($current_user_id, user_id, create_timestamp);
FOLLOWER.insert(user_id, follower_id, create_timestamp)
as (user_id, $current_user_id, create_timestamp);
4. Getting Tweets (that are twitted by self and friends):
select * from TWEET t
where
t.user_id = $current_user_id
or t.user_id in (
select followed_id from FRIEND
where user_id = $current_user_id
)
Comment:: What a bottleneck is here!! That’s also the most important reason why Twitter has to migrate to NoSQL solutions.
5.2. NoSQL Data Modeling
Before we go deeper of NoSQL data modeling with Cassandre, we must understand the key design points of it.
1. Cassandra is a key-value based model
2. Cassandra supports more complex modeling by importing the concept of Super Column
3. The data can be stored in two ways: as column names or as values (it’s really confusing for the beginners sometimes, but you will be free if you understand more especially on the indexing)
4. The Columns, normal Columns or Super ones, in the Column Family is sorted by Column Names, not values
So let’s get started.
We need to create the Keyspace first.
create keyspace JTWISSANDRA
with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy'
and strategy_options = [{replication_factor:1}];
Under this Keyspace, we’ll be working on the data modeling one by one.
5.2.1. User
The key points should be under consideration:
- The key we can simply use Time UUID
- The user_name must be (secondary) indexed because we may use it for search
- The create_timestamp should be (secondary) indexed because we may use it for search or some kinds of partitioning
- The password must be encoded as base64. No more CSDN story please.
So the sample data model will be as following:
ColumnFamily: USER
Key
Columns
550e8400-e29b-41d4-a716-446655440000
name
value
“user_name”
“itstarting”
“password”
"******"
“create_timestamp”
1329836819890000
550e8400-e29b-41d4-a716-446655440001
name
value
“user_name”
“test1”
“password”
"******"
“create_timestamp”
1329836819890001
Here is the create script:
create column family USER
with comparator = UTF8Type
and key_validation_class = UTF8Type
and default_validation_class = UTF8Type
and column_metadata = [
{column_name: user_name, validation_class: UTF8Type,
index_name:user_name_idx, index_type:KEYS }
{column_name: user_password, validation_class: UTF8Type}
{column_name: create_timestamp, validation_class: LongType,
index_name:create_timestamp_idx, index_type:KEYS}
];
And the insert script/CLI for showcase only:
// insert user 550e8400-e29b-41d4-a716-446655440000
set USER[‘550e8400-e29b-41d4-a716-446655440000’][‘user_name’] = ‘itstarting’;
set USER[‘550e8400-e29b-41d4-a716-446655440000’][‘password’] = ‘111222’;
set USER[‘550e8400-e29b-41d4-a716-446655440000’][‘create_timestamp’] =1329836819890000;
// insert user 550e8400-e29b-41d4-a716-446655440001
set USER[‘550e8400-e29b-41d4-a716-446655440001’][‘user_name’] = ‘test1;
set USER[‘550e8400-e29b-41d4-a716-446655440001’][‘password’] = ‘222111’;
set USER[‘550e8400-e29b-41d4-a716-446655440001’][‘create_timestamp’] =1329836819890001;
5.2.2. Friend
The friends mean: who are the user X following?
The key points should be under consideration:
- The key should be the uuid of the user X
- The timestamp when the relationship is built is the column (for friend sorting) and the friend’s uuid is the value. Wow again here. Right?
Let’s say the two users we created are friends each other.
So the sample data model will be as following:
ColumnFamily: FRIEND
Key
Columns
550e8400-e29b-41d4-a716-446655440000
name
value
“1329836819859000”
“550e8400-e29b-41d4-a716-446655440001”
If the guy has more friends, insert colums here
550e8400-e29b-41d4-a716-446655440001
name
value
“1329836819781000”
“550e8400-e29b-41d4-a716-446655440000”
If the guy has more friends, insert colums here
The first record means the user X is 550e8400-e29b-41d4-a716-446655440000 and his/her friend is 550e8400-e29b-41d4-a716-446655440001 and the relationship is established at timestamp of 1329836819859000.
Here is the create script:
create column family FRIEND
with comparator = UTF8Type LongType
and key_validation_class = UTF8Type
and default_validation_class = UTF8Type;
No more column name definitions here? Yes, Cassandra is a so-called schema-free data store. Wow!
And the insert script/CLI for showcase only:
set FRIEND[‘550e8400-e29b-41d4-a716-446655440000’][‘1329836819859000’]
= ‘550e8400-e29b-41d4-a716-446655440001;
set FRIEND[‘550e8400-e29b-41d4-a716-446655440001’][‘1329836819781000’]
= ‘550e8400-e29b-41d4-a716-446655440000;
5.2.3. Follower
The Follower is a reversed concept compared to Friend: Who are following user X?
The key points should be under consideration:
- The key should be the uuid of the user X
- The timestamp when the relationship is built is the column (for follower sorting) and the follower’s user uuid is the value.
Actually the logic should be within the same transaction of friend creation. So we’d like to follow the sample in Friend chapter.
So the sample data model will be as following:
ColumnFamily: FOLLOWER
Key
Columns
550e8400-e29b-41d4-a716-446655440001
name
value
“1329836819859000”
“550e8400-e29b-41d4-a716-446655440000”
If the guy has more friends, insert colums here
550e8400-e29b-41d4-a716-446655440000
name
value
“1329836819781000”
“550e8400-e29b-41d4-a716-446655440001”
If the guy has more friends, insert colums here
Here is the create script:
create column family FOLLOWER
with comparator = UTF8Type LongType
and key_validation_class = UTF8Type
and default_validation_class = UTF8Type;
And the insert script/CLI for showcase only:
set FOLLOWER[‘550e8400-e29b-41d4-a716-446655440001’][‘1329836819859000’’]
= ‘550e8400-e29b-41d4-a716-446655440000;
set FOLLOWER[‘550e8400-e29b-41d4-a716-446655440000’][‘329836819781000’]
= ‘550e8400-e29b-41d4-a716-446655440001;
5.2.4. Tweet & Timeline
The tweets are the soul of Twitter.
The key points should be under consideration:
- How to get my tweets?
- How to get my friends’ tweets without join?
- How to sort all tweets including mine and my friends’.
That’s why Twitter imported the concept of Timeline.
Let’s imagine something like this (please correct me if I’m wrong on following discussions):
<!--
Copied from: http://user.services.openoffice.org/en/forum/viewtopic.php?f=9&t=32185
All the events (tweets) are going along the time.
The Timeline means the line with the specified user’s all related events including
- The events (tweets) I sent
- The events (tweets) my friends sent
So the tweets are inserted to CF of Tweet but need to add one more CF: Timeline.
CAUTION: The following learning experiences/exercises might be not correct, please take your own risks if you still want to read on. But of course, any feedback is welcome.
5.2.4.1. Tweet
ColumnFamily: TWEET
Key
Columns
550e8400-e29b-41d4-a716-446655440011
name
value
“user_uuid”
“550e8400-e29b-41d4-a716-446655440000”
“tweet_content”
“Hello world: 11”
550e8400-e29b-41d4-a716-446655440012
name
value
“user_uuid”
“550e8400-e29b-41d4-a716-446655440000”
“tweet_content”
“Hello world: 12”
550e8400-e29b-41d4-a716-446655440021
name
value
“user_uuid”
“550e8400-e29b-41d4-a716-446655440001”
“tweet_content”
“Hello world: 21”
550e8400-e29b-41d4-a716-446655440022
name
value
“user_uuid”
“550e8400-e29b-41d4-a716-446655440001”
“tweet_content”
“Hello world: 22”
Here is the create script:
create column family TWEET
with comparator = UTF8Type
and key_validation_class = UTF8Type
and default_validation_class = UTF8Type
and column_metadata = [
{column_name: user_uuid, validation_class: UTF8Type}
{column_name: tweet_content, validation_class: UTF8Type}
];
And the insert script/CLI for showcase only:
set TWEET['550e8400-e29b-41d4-a716-446655440011']['user_uuid']
= '550e8400-e29b-41d4-a716-446655440000';
set TWEET['550e8400-e29b-41d4-a716-446655440011']['tweet_content'] = 'Hello world: 11';
set TWEET['550e8400-e29b-41d4-a716-446655440012']['user_uuid']
= '550e8400-e29b-41d4-a716-446655440000';
set TWEET['550e8400-e29b-41d4-a716-446655440012']['tweet_content'] = 'Hello world: 12';
set TWEET['550e8400-e29b-41d4-a716-446655440021']['user_uuid']
= '550e8400-e29b-41d4-a716-446655440001';
set TWEET['550e8400-e29b-41d4-a716-446655440021']['tweet_content'] = 'Hello world: 21';
set TWEET['550e8400-e29b-41d4-a716-446655440022']['user_uuid']
= '550e8400-e29b-41d4-a716-446655440001';
set TWEET['550e8400-e29b-41d4-a716-446655440022']['tweet_content'] = 'Hello world: 22';
5.2.4.2. Timeline
ColumnFamily: TIMELINE
Key
Columns
550e8400-e29b-41d4-a716-446655440000
name
value
“1329883039824000”
“550e8400-e29b-41d4-a716-446655440011”
“1329883039825000”
“550e8400-e29b-41d4-a716-446655440021”
“1329883039934000”
“550e8400-e29b-41d4-a716-446655440012”
“1329883039935000”
“550e8400-e29b-41d4-a716-446655440022”
550e8400-e29b-41d4-a716-446655440001
name
value
“1329883039824000”
“550e8400-e29b-41d4-a716-446655440011”
“1329883039825000”
“550e8400-e29b-41d4-a716-446655440021”
“1329883039934000”
“550e8400-e29b-41d4-a716-446655440012”
“1329883039935000”
“550e8400-e29b-41d4-a716-446655440022”
Here is the create script:
create column family TIMELINE
with comparator = UTF8Type LongType
and key_validation_class = UTF8Type
and default_validation_class = UTF8Type;
And the insert script/CLI for showcase only:
set TIMELINE['550e8400-e29b-41d4-a716-446655440000']['1329883039824000']
= '550e8400-e29b-41d4-a716-446655440011';
set TIMELINE['550e8400-e29b-41d4-a716-446655440000']['1329883039825000']
= '550e8400-e29b-41d4-a716-446655440021';
set TIMELINE['550e8400-e29b-41d4-a716-446655440000']['1329883039834000']
= '550e8400-e29b-41d4-a716-446655440012';
set TIMELINE['550e8400-e29b-41d4-a716-446655440000']['1329883039835000']
= '550e8400-e29b-41d4-a716-446655440022';
set TIMELINE['550e8400-e29b-41d4-a716-446655440001']['1329883039824000']
= '550e8400-e29b-41d4-a716-446655440011';
set TIMELINE['550e8400-e29b-41d4-a716-446655440001']['1329883039825000']
= '550e8400-e29b-41d4-a716-446655440021';
set TIMELINE['550e8400-e29b-41d4-a716-446655440001']['1329883039834000']
= '550e8400-e29b-41d4-a716-446655440012';
set TIMELINE['550e8400-e29b-41d4-a716-446655440001']['1329883039835000']
= '550e8400-e29b-41d4-a716-446655440022';
- Apache Cassandra Learning Step by Step (4): Data Modeling
- Apache Cassandra Learning Step by Step (1): Setup to play
- Apache Cassandra Learning Step by Step (2): Core Concepts
- Apache Cassandra Learning Step by Step (3): Samples ABC
- Apache Cassandra Learning Step by Step (5): 实战性的JTwissandra项目
- Step By Step Learning
- Learning URDF Step by Step
- Learning URDF Step by Step
- <Deep Learning>Step by Step
- Securing Apache: Step-by-Step
- [Data Pump]Learning Data Pump Step by Step -- (2) Data Pump Export
- Step-by-step learning C + +(chapter fivesummary____Expressions)
- Learning MVC 3.0 Step by Step
- Learning Python Step by Step Using Example
- angularjs(step by step):Data Binding
- Securing Apache 2: Step-by-Step
- [学习] Apache + mod_python + Django Step by Step
- Apache+php+MySQL安装 Step By Step
- 02/26/周日
- java中static的作用
- Problem 1004
- 反射机制的理解及其用途
- js:创建对象
- Apache Cassandra Learning Step by Step (4): Data Modeling
- 周末
- JDK源码阅读 java.util.concurrent.Executors相关分析
- Problem 1005
- MyEclipse注册机源代码,在线注册
- AutoCompleteTextView
- Problem 1006
- 第一次用ps制作网页
- 改良女人健康 找准最佳机会