An Introduction to Using HyperLogLogs in Redis

Source: Internet  Editor: 程序博客网  Time: 2024/05/20 04:51

A HyperLogLog is a probabilistic data structure used to count unique things (technically, this is referred to as estimating the cardinality of a set). Counting unique items usually requires an amount of memory proportional to the number of items you want to count, because you need to remember the elements you have already seen in order to avoid counting them more than once. However, there is a family of algorithms that trade precision for memory: you end up with an estimate that carries a standard error, which in the case of the Redis implementation is less than 1%. The magic of this algorithm is that you no longer need memory proportional to the number of items counted and can use a constant amount instead: 12k bytes in the worst case, or a lot less if your HyperLogLog (we'll just call them HLLs from now on) has seen very few elements.

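A HyperLogLog is a probabilistic data structure for counting unique things (in technical terms, estimating the cardinality of a set). Counting normally consumes storage in proportion to the number of items, because every item already counted has to be recorded so that it is not counted again. There are, however, algorithms that give up some precision to save storage and return an estimate with a bounded error; in the Redis implementation that error is below 1%. The appeal of the algorithm is that storage no longer grows with the number of elements: it uses a fixed amount of memory, only 12 KB even in the worst case, and less when there are few elements.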
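To make the fixed-memory idea concrete, here is a minimal Python sketch of a HyperLogLog-style counter. It is an illustration only and not the Redis implementation: the class name `ToyHyperLogLog`, the use of SHA-1 as the hash function, and the choice of 2^14 one-byte registers are all assumptions made for this example.

```python
import hashlib
import math


class ToyHyperLogLog:
    """Illustrative HyperLogLog-style counter (not Redis's encoding or hash)."""

    def __init__(self, p=14):
        self.p = p                        # number of index bits (assumed)
        self.m = 1 << p                   # number of registers: 16384 for p = 14
        self.registers = bytearray(self.m)

    def add(self, item):
        # Take a 64-bit hash from SHA-1 (an arbitrary choice for this sketch).
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        index = h >> (64 - self.p)                # top p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)     # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit within the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[index]:
            self.registers[index] = rank

    def count(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)          # bias constant, valid for m >= 128
        estimate = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * m and zeros:
            # Small-range correction: fall back to linear counting.
            estimate = m * math.log(m / zeros)
        return round(estimate)


hll = ToyHyperLogLog()
for i in range(10000):
    hll.add(str(i))
    hll.add(str(i))   # adding a duplicate never changes the registers

estimate = hll.count()
print(estimate, len(hll.registers))
```

However many items are added, the state stays at 16384 registers; only the register values change. The estimate should land close to 10000 but is not guaranteed to match it exactly.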


HLLs in Redis, while technically a different data structure, are encoded as Redis strings, so you can call GET to serialize an HLL, and SET to deserialize it back to the server.

Conceptually the HLL API is like using Sets to do the same task. You would SADD every observed element into a set, and would use SCARD to check the number of elements inside the set, which are unique since SADD will not re-add an existing element.
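The Set-based approach described above can be sketched without Redis, using a plain Python set as a stand-in for SADD and SCARD. The function name `count_unique_exact` and the sample data are hypothetical.

```python
def count_unique_exact(items):
    """Exact distinct count: memory grows with the number of distinct items."""
    seen = set()
    for item in items:
        seen.add(item)    # like SADD: re-adding an existing element is a no-op
    return len(seen)      # like SCARD: number of elements in the set


queries = ["a", "b", "c", "d", "a", "b"]
unique = count_unique_exact(queries)
print(unique)
```

The count is exact, but every distinct element has to be kept in memory, which is the cost an HLL avoids.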

While you don’t really add items into an HLL, because the data structure only contains a state that does not include actual elements, the API is the same:

  • Every time you see a new element, you add it to the count with PFADD.
  • Every time you want to retrieve the current approximation of the number of unique elements added with PFADD so far, you use PFCOUNT.

    > pfadd hll a b c d
    (integer) 1
    > pfcount hll
    (integer) 4

An example use case for this data structure is counting the unique queries performed by users in a search form every day.

Redis is also able to perform the union of HLLs; please check the full documentation for more information.
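In Redis the union is exposed through the PFMERGE command. Conceptually, two HyperLogLog states built with the same hash function and register count can be combined by taking the larger value of each pair of registers. The Python sketch below shows that idea only; the helper names `add`, `build`, and `merge`, the SHA-1 hash, and the 1024-register size are assumptions for illustration and do not reflect how Redis stores its data.

```python
import hashlib

P = 10        # index bits (assumed for this sketch)
M = 1 << P    # number of registers


def add(registers, item):
    """Update one register from the hash of item."""
    h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
    index = h >> (64 - P)
    rest = h & ((1 << (64 - P)) - 1)
    rank = (64 - P) - rest.bit_length() + 1
    registers[index] = max(registers[index], rank)


def build(items):
    """Build a fresh register array from an iterable of strings."""
    registers = [0] * M
    for item in items:
        add(registers, item)
    return registers


def merge(a, b):
    """Union of two states: register-wise maximum."""
    return [max(x, y) for x, y in zip(a, b)]


monday = build(f"query-{i}" for i in range(0, 600))
tuesday = build(f"query-{i}" for i in range(400, 1000))
both_days = merge(monday, tuesday)

# The merged state is identical to one built from all queries directly.
same_as_direct = both_days == build(f"query-{i}" for i in range(0, 1000))
print(same_as_direct)
```

Because each register only ever holds a maximum, merging loses nothing: queries seen on both days are counted once, just as if all of them had been added to a single HLL.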


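In Redis a HyperLogLog is technically a distinct data structure, but it is handled as a Redis string: you can use the GET command to serialize a HyperLogLog, and the SET command to deserialize it and store it back on the server.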

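Using the HyperLogLog API is conceptually similar to doing the same job with Sets. You add an element to a set with SADD and count the elements in the set with SCARD; the elements always stay unique, because SADD does not add an element that already exists.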

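The difference is that you never really add elements to a HyperLogLog, because the structure only stores a state and does not contain the elements themselves. The API is as follows: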

  • Use the PFADD command to add a new element.
  • Use the PFCOUNT command to get the approximate number of unique elements in a HyperLogLog.

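One use case for HyperLogLog is counting how many distinct combinations of query conditions a search form runs each day. Redis can also compute the union of multiple HyperLogLogs; see the full documentation for details.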

PHP code demonstration:


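<?php
// Connect to a local Redis server (requires the phpredis extension).
$redis = new Redis();
$redis->connect('127.0.0.1');
$redis->del('hhl');   // start from an empty key

for ($i = 0; $i < 10000; $i++) {
    // The documentation consulted did not explain how to use $redis->pfadd,
    // so PFADD is sent as a raw command. Each pass adds the MD5 hash of $i,
    // for 10000 distinct values in total.
    $redis->rawCommand('pfadd', 'hhl', md5($i));
}

echo $redis->pfcount('hhl') . PHP_EOL;
$redis->close();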

Result:

9923

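As the official site explains, HyperLogLog counts are not perfectly exact. In this run 9923 was reported for 10000 distinct values, an error of 0.77%, so the structure is not suitable for statistics that need an exact figure.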
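The size of the error in this run can be checked with a little arithmetic. The Python sketch below compares it with the usual HyperLogLog standard-error formula, 1.04 / sqrt(m), where m is the number of registers. The value m = 16384 with 6 bits per register is taken from the Redis documentation rather than from this article, so treat it as an assumption here.

```python
import math

true_count = 10000   # distinct values added by the PHP script
reported = 9923      # value printed by PFCOUNT in the run above

relative_error = abs(true_count - reported) / true_count

registers = 16384          # assumption: register count used by Redis
bits_per_register = 6      # assumption: register width used by Redis
standard_error = 1.04 / math.sqrt(registers)
memory_bytes = registers * bits_per_register // 8

print(f"relative error: {relative_error:.2%}")
print(f"standard error: {standard_error:.4%}")
print(f"register memory: {memory_bytes} bytes")
```

Under these assumptions the observed error of 0.77% sits just inside the theoretical standard error of about 0.81%, and the register array comes to 12288 bytes, which matches the 12k figure quoted at the start of the article.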



