Napkin math for MongoDB performance
Source: http://rickosborne.org/blog/2010/02/napkin-math-for-mongodb-performance/
As we all know, there are lies, damned lies, and statistics. What I’m about to present shouldn’t even qualify as statistics—it’s just a bunch of damned lies. I’m not set up to do any sort of rigorous performance testing, so these should not be construed as anything but what they are: one guy’s half-assed and probably flawed measurements.
I was playing around with MapReduce on MongoDB, trying to figure out how to code the equivalent of SQL’s COUNT(DISTINCT column) functionality. The short answer is: don’t do it. Or, if you do it, figure out a better way than I did. Along the way, I gathered some metrics on what types of operations cause what kinds of performance hits.
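For readers who haven't tried this, here is a rough sketch in plain Python (not the query from this post, and not MongoDB's API) of why emulating COUNT(DISTINCT column) with map and reduce functions gets expensive. It assumes documents with `movie` and `cust` fields, groups by `movie`, and counts distinct `cust` values; the grouping key, the function names, and the sample data are all made up for illustration. The catch is that every partial result has to carry the full set of values seen so far, because reduce may be called again on its own output.

```python
from collections import defaultdict

def map_doc(doc):
    # Emit (key, value). The value carries a per-key "seen" dict so that
    # distinct customers can still be merged correctly in later reduce passes.
    return doc["movie"], {"count": 1, "custs": {doc["cust"]: True}}

def reduce_values(values):
    # Output has the same shape as the input values, so it is safe to
    # re-reduce partial results (as a sharded map-reduce would need).
    merged = {"count": 0, "custs": {}}
    for value in values:
        merged["count"] += value["count"]
        merged["custs"].update(value["custs"])
    return merged

def finalize(value):
    # Only at the very end can the "seen" dict collapse into a number.
    return {"count": value["count"], "distinct_custs": len(value["custs"])}

def map_reduce(docs):
    grouped = defaultdict(list)
    for doc in docs:
        key, value = map_doc(doc)
        grouped[key].append(value)
    return {key: finalize(reduce_values(vals)) for key, vals in grouped.items()}

# Hypothetical sample data, not taken from the Netflix set.
sample = [
    {"movie": 1, "cust": 10},
    {"movie": 1, "cust": 11},
    {"movie": 1, "cust": 10},
    {"movie": 2, "cust": 10},
]
result = map_reduce(sample)
```

With a few million documents, those per-key `custs` dicts are what get copied and merged over and over, which is one plausible reason this pattern hurts.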
The Setup
My setup is a database of 3,397,115 records, all of which look something like this:
Yeah, I just took the Netflix prize data and inserted ~3M records. I did the inserts across 3 shard services, all running on the same machine, which led to 9 chunks of roughly equal size. I let MongoDB handle the sharding—I didn’t manually split the shards. I ensured one index on the collection, over movie and cust, which isn’t really used for the query in question, but I thought it was worth mentioning.
Yeah, I know performance is going to suffer because I’m running 3 shards from the same hard drive. That’s kind of the point.
I ran all of this on my MacBook Pro, which is a 2.66 GHz Core 2 Duo with 4GB of 1067 MHz DDR3. I continued to do other light-duty tasks while running the tests, but nothing that should have interfered greatly.
The Queries
Here’s the starting query’s SQL equivalent:
And the MapReduce query itself, as I wrote it:
Those nasty bits with the for-in loops are for the COUNT(DISTINCT column) logic. This query produces the following result set:
The Results
All times below are in mm:ss format. (Minutes, not hours.)
Lessons Learned
- Queries scream when a single shard is left to its own devices—but when parallelism is attempted on the same shard you get a massive performance hit. Don't run different shards off the same hard drive—no matter how many cores you have.
- Don't try to emulate COUNT(DISTINCT). Really.
I have to wonder if mongos can be tweaked to serialize queries against chunks on the same shard, to prevent disk contention issues?
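To make that question concrete, here is a toy sketch in Python of what "serialize per shard" could look like: jobs for the same shard run one after another, while different shards still run in parallel. This says nothing about how mongos actually schedules work; `run_serialized_per_shard`, the shard names, and the job callables are all hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def run_serialized_per_shard(chunk_jobs, max_workers=4):
    """Run (shard_name, callable) jobs: sequential within a shard,
    concurrent across shards. Returns {shard_name: [results in order]}."""
    by_shard = defaultdict(list)
    for shard, job in chunk_jobs:
        by_shard[shard].append(job)

    def drain(jobs):
        # One worker walks a shard's queue in order, so two chunk queries
        # never hit the same shard at the same time.
        return [job() for job in jobs]

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {shard: pool.submit(drain, jobs)
                   for shard, jobs in by_shard.items()}
        return {shard: future.result() for shard, future in futures.items()}
```

In a setup like the one above, where every shard sits on the same hard drive, the grouping key would arguably need to be the physical disk rather than the shard name.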
Recommended reading: MongoDB: Terrible MapReduce Performance
MongoDB's performance on aggregation queries
Is this Map Reduce performance normal or I am missing something