Exploring the Top-K Problem: A Min-Heap Implementation in Java

Source: Internet. Editor: 程序博客网. Time: 2024/05/24 00:23

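The Top-K problem asks how to find the K largest items in a large data set (the items are mutually comparable, and larger ones rank first).

Note: "number" is used loosely here. The items need not be numeric; they can be any objects that can be compared with one another.

Real-world scenarios: a search engine returning the 10 highest-scoring articles, or a music library listing the 10 songs with the highest download rate.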

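Let's look at the possible implementations:

Method 1: store all the data in an array, sort the array in descending order, and take the first K items.

This is the most direct and obvious approach. With a large data set, however, storing and sorting everything costs O(N) memory and O(N log N) comparisons just to keep K items, so it is not recommended.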
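As a baseline, Method 1 can be sketched as follows. The class name TopKBySort and the use of int values are illustrative assumptions, not part of the original article; the sketch copies the input so that the caller's array is left untouched.

```java
import java.util.Arrays;

final class TopKBySort {
    // Returns the k largest values in descending order by sorting a full copy.
    static int[] topK(int[] data, int k) {
        int[] copy = Arrays.copyOf(data, data.length); // O(N) extra memory
        Arrays.sort(copy);                              // ascending, O(N log N)
        int n = Math.min(k, copy.length);
        int[] result = new int[n];
        for (int i = 0; i < n; i++) {
            result[i] = copy[copy.length - 1 - i];      // read from the large end
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [9, 8, 7]
        System.out.println(Arrays.toString(topK(new int[] {5, 1, 9, 3, 7, 8}, 3)));
    }
}
```

The whole input is held in memory and fully sorted even though only K items are wanted, which is the cost the text objects to.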

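Method 2: take K items from the data and store them in an array a of size K, then sort a in ascending order so that a[0] is the minimum. Read the remaining items one at a time and compare each with a[0]. If the item is less than or equal to a[0], move on to the next item. Otherwise discard a[0], use binary search to find the position where the new item belongs, shift the elements before that position one slot toward the front, and store the new item in the freed slot. Repeat until the data is exhausted.

This is far more efficient than Method 1, but when K is large the shifting itself becomes expensive: each insertion can move up to K elements.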
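A minimal sketch of Method 2, assuming int data and at least K input items. The class name TopKSortedArray is hypothetical. Arrays.binarySearch locates the slot, and System.arraycopy performs the shift that becomes the bottleneck for large K.

```java
import java.util.Arrays;

final class TopKSortedArray {
    // Assumes data.length >= k >= 1. Returns the k largest values, ascending.
    static int[] topK(int[] data, int k) {
        int[] a = Arrays.copyOf(data, k);
        Arrays.sort(a);                          // a[0] is now the minimum
        for (int i = k; i < data.length; i++) {
            int x = data[i];
            if (x <= a[0]) {
                continue;                        // cannot be in the top K
            }
            int idx = Arrays.binarySearch(a, x);
            int pos = idx >= 0 ? idx : -idx - 1; // a[0..pos-1] <= x; pos >= 1 here
            // Drop a[0] by shifting a[1..pos-1] one slot toward the front.
            System.arraycopy(a, 1, a, 0, pos - 1);
            a[pos - 1] = x;                      // place x in the freed slot
        }
        return a;
    }

    public static void main(String[] args) {
        // prints [7, 8, 9]
        System.out.println(Arrays.toString(topK(new int[] {5, 1, 9, 3, 7, 8}, 3)));
    }
}
```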

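A more efficient solution to this problem is the min-heap.

A min-heap is a data structure that is first of all a complete binary tree, and in which every parent node's value is less than or equal to the values of its children.

A min-heap can be stored in an array or in a linked structure. An array is the usual choice, because a complete binary tree maps directly onto consecutive indices; a linked structure is more flexible when the capacity is not known in advance.

Below is one Java implementation of a min-heap (taken from the Lucene source code):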
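To make the array layout concrete, here is a small sketch that assumes a 1-based convention: slot 0 is unused, the parent of index i sits at i >>> 1, and its children sit at 2i and 2i + 1. The class and method names are illustrative only.

```java
final class HeapLayout {
    // Checks the min-heap property for elements stored in heap[1..size].
    // heap[0] is unused, so the parent of index i is i >>> 1.
    static boolean isMinHeap(int[] heap, int size) {
        for (int i = 2; i <= size; i++) {
            int parent = i >>> 1;
            if (heap[parent] > heap[i]) {
                return false; // a child is smaller than its parent
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isMinHeap(new int[] {0, 1, 3, 2, 7, 4}, 5)); // prints true
        System.out.println(isMinHeap(new int[] {0, 5, 3, 2}, 3));       // prints false
    }
}
```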

import org.apache.lucene.util.ArrayUtil; // supplies MAX_ARRAY_LENGTH; needed when this class lives outside Lucene's own package

public abstract class PriorityQueue<T> {
  private int size;
  private final int maxSize;
  private final T[] heap;

  public PriorityQueue(int maxSize) {
    this(maxSize, true);
  }

  @SuppressWarnings("unchecked")
  public PriorityQueue(int maxSize, boolean prepopulate) {
    size = 0;
    int heapSize;
    if (0 == maxSize) {
      // We allocate 1 extra to avoid if statement in top()
      heapSize = 2;
    } else {
      if (maxSize > ArrayUtil.MAX_ARRAY_LENGTH) {
        throw new IllegalArgumentException("maxSize must be <= " + ArrayUtil.MAX_ARRAY_LENGTH + "; got: " + maxSize);
      } else {
        // NOTE: we add +1 because all access to heap is
        // 1-based not 0-based.  heap[0] is unused.
        heapSize = maxSize + 1;
      }
    }
    heap = (T[]) new Object[heapSize]; // T is unbounded type, so this unchecked cast works always
    this.maxSize = maxSize;

    if (prepopulate) {
      // If sentinel objects are supported, populate the queue with them
      T sentinel = getSentinelObject();
      if (sentinel != null) {
        heap[1] = sentinel;
        for (int i = 2; i < heap.length; i++) {
          heap[i] = getSentinelObject();
        }
        size = maxSize;
      }
    }
  }

  /** Determines the ordering of objects in this priority queue.  Subclasses
   *  must define this one method.
   *  @return <code>true</code> iff parameter <tt>a</tt> is less than parameter <tt>b</tt>.
   */
  protected abstract boolean lessThan(T a, T b);

  /**
   * This method can be overridden by extending classes to return a sentinel
   * object which will be used by the {@link PriorityQueue#PriorityQueue(int,boolean)}
   * constructor to fill the queue, so that the code which uses that queue can always
   * assume it's full and only change the top without attempting to insert any new
   * object.<br>
   *
   * Those sentinel values should always compare worse than any non-sentinel
   * value (i.e., {@link #lessThan} should always favor the
   * non-sentinel values).<br>
   *
   * By default, this method returns null, which means the queue will not be
   * filled with sentinel values. Otherwise, the value returned will be used to
   * pre-populate the queue. Adds sentinel values to the queue.<br>
   *
   * If this method is extended to return a non-null value, then the following
   * usage pattern is recommended:
   *
   * <pre class="prettyprint">
   * // extends getSentinelObject() to return a non-null value.
   * PriorityQueue<MyObject> pq = new MyQueue<MyObject>(numHits);
   * // save the 'top' element, which is guaranteed to not be null.
   * MyObject pqTop = pq.top();
   * <...>
   * // now in order to add a new element, which is 'better' than top (after
   * // you've verified it is better), it is as simple as:
   * pqTop.change().
   * pqTop = pq.updateTop();
   * </pre>
   *
   * <b>NOTE:</b> if this method returns a non-null value, it will be called by
   * the {@link PriorityQueue#PriorityQueue(int,boolean)} constructor
   * {@link #size()} times, relying on a new object to be returned and will not
   * check if it's null again. Therefore you should ensure any call to this
   * method creates a new instance and behaves consistently, e.g., it cannot
   * return null if it previously returned non-null.
   *
   * @return the sentinel object to use to pre-populate the queue, or null if
   *         sentinel objects are not supported.
   */
  protected T getSentinelObject() {
    return null;
  }

  /**
   * Adds an Object to a PriorityQueue in log(size) time. If one tries to add
   * more objects than maxSize from initialize an
   * {@link ArrayIndexOutOfBoundsException} is thrown.
   *
   * @return the new 'top' element in the queue.
   */
  public final T add(T element) {
    size++;
    heap[size] = element;
    upHeap();
    return heap[1];
  }

  /**
   * Adds an Object to a PriorityQueue in log(size) time.
   * It returns the object (if any) that was
   * dropped off the heap because it was full. This can be
   * the given parameter (in case it is smaller than the
   * full heap's minimum, and couldn't be added), or another
   * object that was previously the smallest value in the
   * heap and now has been replaced by a larger one, or null
   * if the queue wasn't yet full with maxSize elements.
   */
  public T insertWithOverflow(T element) {
    if (size < maxSize) {
      add(element);
      return null;
    } else if (size > 0 && !lessThan(element, heap[1])) {
      T ret = heap[1];
      heap[1] = element;
      updateTop();
      return ret;
    } else {
      return element;
    }
  }

  /** Returns the least element of the PriorityQueue in constant time. */
  public final T top() {
    // We don't need to check size here: if maxSize is 0,
    // then heap is length 2 array with both entries null.
    // If size is 0 then heap[1] is already null.
    return heap[1];
  }

  /** Removes and returns the least element of the PriorityQueue in log(size)
   *  time. */
  public final T pop() {
    if (size > 0) {
      T result = heap[1];       // save first value
      heap[1] = heap[size];     // move last to first
      heap[size] = null;        // permit GC of objects
      size--;
      downHeap();               // adjust heap
      return result;
    } else {
      return null;
    }
  }

  /**
   * Should be called when the Object at top changes values. Still log(n) worst
   * case, but it's at least twice as fast to
   *
   * <pre class="prettyprint">
   * pq.top().change();
   * pq.updateTop();
   * </pre>
   *
   * instead of
   *
   * <pre class="prettyprint">
   * o = pq.pop();
   * o.change();
   * pq.add(o);
   * </pre>
   *
   * @return the new 'top' element.
   */
  public final T updateTop() {
    downHeap();
    return heap[1];
  }

  /** Returns the number of elements currently stored in the PriorityQueue. */
  public final int size() {
    return size;
  }

  /** Removes all entries from the PriorityQueue. */
  public final void clear() {
    for (int i = 0; i <= size; i++) {
      heap[i] = null;
    }
    size = 0;
  }

  private final void upHeap() {
    int i = size;
    T node = heap[i];          // save bottom node
    int j = i >>> 1;
    while (j > 0 && lessThan(node, heap[j])) {
      heap[i] = heap[j];       // shift parents down
      i = j;
      j = j >>> 1;
    }
    heap[i] = node;            // install saved node
  }

  private final void downHeap() {
    int i = 1;
    T node = heap[i];          // save top node
    int j = i << 1;            // find smaller child
    int k = j + 1;
    if (k <= size && lessThan(heap[k], heap[j])) {
      j = k;
    }
    while (j <= size && lessThan(heap[j], node)) {
      heap[i] = heap[j];       // shift up child
      i = j;
      j = i << 1;
      k = j + 1;
      if (k <= size && lessThan(heap[k], heap[j])) {
        j = k;
      }
    }
    heap[i] = node;            // install saved node
  }

  /** This method returns the internal heap array as Object[].
   * @lucene.internal
   */
  protected final Object[] getHeapArray() {
    return (Object[]) heap;
  }
}

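The abstract class above encapsulates the basic min-heap operations: initializing the heap, adding an element, popping an element, and moving the root to its proper position. Each operation preserves the heap property, namely that every parent's value is less than or equal to its children's values.

Because the class is abstract, a subclass must extend it and provide its own implementation of lessThan.

A subclass must first call one of the superclass constructors, public PriorityQueue(int maxSize) or public PriorityQueue(int maxSize, boolean prepopulate).

Note: public PriorityQueue(int maxSize) simply delegates to the second constructor with prepopulate set to true.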

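A subclass can override protected T getSentinelObject() to decide whether the heap is pre-filled. When that method returns a non-null value and the constructor is called with prepopulate set to true, the heap is pre-filled: every element of the T[] heap array except heap[0] (which is unused) is assigned a non-null value, and size is set to maxSize, meaning the array already holds maxSize elements.

Note: size is the number of elements currently in the heap; maxSize is the total the heap can hold, that is, its capacity.

One more point: if getSentinelObject() returns a non-null value, every call must return a newly created object, and that object must have a priority lower than or equal to that of any other object, that is, lessThan(sentinelObject, otherObject) returns true.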

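After the heap is initialized, if it was not pre-filled, call add to insert elements. If it was pre-filled, call top to get the element at the top of the tree (the root) and modify its values; whenever the root changes, the caller must invoke public final T updateTop() to move it to its proper position.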
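The pre-filled pattern can be illustrated without Lucene by a tiny int-only heap. This is a sketch under assumptions: the class SentinelTopK is hypothetical, Integer.MIN_VALUE plays the role of the sentinel (so it must never occur as real data), and downHeap follows the same sift-down logic as the class above.

```java
import java.util.Arrays;

final class SentinelTopK {
    private final int[] heap; // 1-based: heap[0] is unused
    private final int size;   // always full, thanks to the sentinels

    SentinelTopK(int k) {
        size = k;
        heap = new int[k + 1];
        // Sentinels compare lower than any real value, so the first k real
        // values simply overwrite them one by one through the root.
        Arrays.fill(heap, Integer.MIN_VALUE);
    }

    // Replaces the root only when x beats the current minimum.
    void offer(int x) {
        if (x > heap[1]) {
            heap[1] = x;
            downHeap(); // the "updateTop" step
        }
    }

    int top() {
        return heap[1];
    }

    private void downHeap() {
        int i = 1;
        int node = heap[i];
        int j = i << 1;                       // left child
        if (j + 1 <= size && heap[j + 1] < heap[j]) {
            j++;                              // pick the smaller child
        }
        while (j <= size && heap[j] < node) {
            heap[i] = heap[j];                // move the child up
            i = j;
            j = i << 1;
            if (j + 1 <= size && heap[j + 1] < heap[j]) {
                j++;
            }
        }
        heap[i] = node;
    }

    public static void main(String[] args) {
        SentinelTopK q = new SentinelTopK(3);
        for (int x : new int[] {5, 1, 9, 3, 7, 8}) {
            q.offer(x);
        }
        System.out.println(q.top()); // prints 7, the smallest of the top 3
    }
}
```

Because the heap is full from the start, the caller never needs a separate "still filling" branch; every item takes the same compare-and-replace path.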

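Note: because the heap exists to hold the top K elements, once it already contains K elements each candidate must be compared with the root before it goes in. If the candidate is less than or equal to the root, do not add it. Calling add on a full heap throws ArrayIndexOutOfBoundsException, so a larger candidate has to replace the root instead; insertWithOverflow in the class above packages this comparison and replacement (it also replaces the root on a tie). Likewise, before changing the root's values, compare first: only change the root when the new value is greater than the old one, and then call updateTop().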
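The comparison rule in this note can be sketched with the JDK's own java.util.PriorityQueue, which is a min-heap under natural ordering. It is used here only as a stand-in so that the example runs without Lucene; the class name TopKWithJdkHeap and the int data are assumptions.

```java
import java.util.PriorityQueue;

final class TopKWithJdkHeap {
    // Returns a min-heap holding the k largest values seen (assumes k >= 1).
    static PriorityQueue<Integer> collect(int[] data, int k) {
        PriorityQueue<Integer> heap = new PriorityQueue<>(k);
        for (int x : data) {
            if (heap.size() < k) {
                heap.offer(x);          // still filling: always add
            } else if (x > heap.peek()) {
                heap.poll();            // evict the current minimum
                heap.offer(x);          // admit the larger candidate
            }
            // otherwise x <= root: skip it, the heap is untouched
        }
        return heap;
    }

    public static void main(String[] args) {
        PriorityQueue<Integer> heap = collect(new int[] {5, 1, 9, 3, 7, 8}, 3);
        System.out.println(heap.peek()); // prints 7
    }
}
```

Most items in a large data set fail the comparison against the root and cost a single check, which is where the approach gains its speed.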

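Finally, to retrieve the K elements from the heap, call pop() in a loop; each call removes the root, which is then stored in an array or list. By the min-heap property, each popped element is less than or equal to every element popped after it, so the top K elements come out sorted in ascending order; reverse the result if the largest should come first.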
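Draining can be sketched like this, again using java.util.PriorityQueue as a stand-in for the class above (its poll() corresponds to pop()). Filling the output from the back is one way to get largest-first order; DrainHeap is a hypothetical name.

```java
import java.util.Arrays;
import java.util.PriorityQueue;

final class DrainHeap {
    // Empties the min-heap and returns its contents in descending order.
    static int[] toDescending(PriorityQueue<Integer> heap) {
        int[] out = new int[heap.size()];
        // poll() yields the smallest remaining element, so write back to front.
        for (int i = out.length - 1; i >= 0; i--) {
            out[i] = heap.poll();
        }
        return out;
    }

    public static void main(String[] args) {
        PriorityQueue<Integer> heap = new PriorityQueue<>(Arrays.asList(7, 9, 8));
        System.out.println(Arrays.toString(toDescending(heap))); // prints [9, 8, 7]
    }
}
```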

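Adding an element to a min-heap takes O(log n) time, where n is the number of elements in the heap. In the Top-K setting the heap never holds more than K elements, so each insertion costs at most O(log K).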

1 1