不使用随机数的洗牌算法/Shuffle sort and other fallacies of randomization

来源：互联网发布：淘宝网首页女装秋装编辑：程序博客网时间：2024/04/30 05:52

Quick, how do you write code that shuffles a collection of objects? In the real world, it’s fairly easy to see how a deck of cards is shuffled – but how would you do that in code? Obviously, you would have to use some sort of random number generator, but beyond that it’s not straightforward. Furthermore, how do you ensure that the shuffling is fair; that is, all permutations appear with equal probability?

The astute among you will know that one way is by implementing the Fisher-Yates shuffle algorithm. But, let’s investigate what happens when other, seemingly adequate solutions, are used instead.

Shuffler’s dilemma

One example that highlights the implications of using seemingly random and unbiased “shuffling” involves the Windows EU Browser Ballot.

The purpose of the Browser Ballot was so that Microsoft could comply with an EU directivethat ordered them to provide users with a clear choice of which browser to use with Windows. The ballot was supposed to have been carefully designed; the first section contained the five most common browsers (Chrome, Firefox, Internet Explorer, Opera and Safari) with the second section containing seven lesser-used browsers.

Since it is generally considered an advantage to be placed first in a ballot (because users who don’t care one way or another might just pick the first option), the EU directed Microsoft to randomize the order in which the browsers appeared on the ballot in each of the sections. Thus, when users visited the ballot page, they were supposed to get a random ordering of the top five browsers followed by a random ordered of the lesser seven.

Looking into the cards

The team in charge of the ballot design came up with a solution to use a sorting function with a randomized comparator. In this way, they hoped to apply a randomized sort to the list of browsers, producing a random ordering that met the rules of the EU. While this did produce a random ordering, (and thus technically was in compliance with the EU directive), it did not produce unbiased results, as the ordering of some browsers came up more often than others. This lead to allegations that Microsoft was trying yet again to stifle its competition to unscrupulous practices.

However, this probably wasn’t the case, and was likely an instance of a seemingly adequate solution being used without proper testing/analysis.

To understand the problem, we first have to understand sorting algorithms, specifically those that use comparisons as their basis. (i.e. comparison sorts) An example of a comparison sort is Insertion sort. Through each step of the sorting process, the sorted portion of the list increases by one as we take the next element from the unsorted portion and position it properly within the sorted portion. This is done by comparing the element to be inserted with each element in the sorted portion until we find the correct position.

A comparator is simply a function that compares two objects, and tells us which one is greater than the other, or whether they are equal. In Java, the general structure of aComparator would be:

new Comparator<T>() {  @Override  public int compare(final T o1, final T o2) {    // If o1 is "less than" o2, return a negative integer.    // If o1 is "equal to" o2, return 0.    // If o1 is "greater than" o2, return a positive integer.  }};

In this way, the exact semantics of comparisons can be tailored to meet the needs of the application. In the case of the Browser Ballot, the team implemented something akin to the following: (Though the language was JavaScript, not Java, as the Browser Ballot was a website)

new Comparator<T>() {  @Override  public int compare(final T o1, final T o2) {    return Math.random() < 0.5 ? 1 : -1;  }};

In this comparator, the objects being compared are not even considered; instead the either-1 or 1 is returned with equal probability. Using this sort of “comparison” with a sorting algorithm might seem to provide random, unbiased results – but in reality, such “common sense” did not make so much sense after further analysis.

Looking deeper

To look at the problem a bit more in-depth, I decided to write an example in Java that would test the results of a “random sort” on a set of four objects, using the following sorting algorithms: A modified Merge sort (provided by Java’s Collections.sort()), Quicksort, Insertion sort and Selection sort. These results would then be compared to using the proper method, a Fisher-Yates shuffle.

Some more details:

The list to be shuffled started off with the ordering of [1, 2, 3, 4]
I used my own implementation of Quicksort that just selected the pivot from the middle of the list.
Bubble sort was not considered, since with a randomized comparator, the algorithm may never terminate!

Here are the results:

As you can see, the Fisher-Yates shuffling algorithm produces the expected near-uniform distribution, represented by the almost-straight line. The “randomized” sorting algorithms most definitely did not produce a uniform distribution. Interestingly, the modified merge sort used by Collections.sort() produced the same distribution as Insertion sort; This may mean that this specific implementation of merge sort uses an insertion sort when the collection size is sufficiently low.

Of the sorting algorithms tested, Quicksort had the least difference between the “peaks” and “troughs”, though the bias was still very evident. I’ve heard that Quicksort may produce a near-uniform distribution when used with a randomized comparator, but I’m guessing that would depend greatly on the choice of pivot.

Conclusion

Randomization is never an easy thing; one must fully consider the effects of a specific algorithm before jumping to conclusions as “common sense” doesn’t really make much sense in this area. Come to think of it, “common sense” is rarely sensible.

Careful analysis is always needed whenever dealing with distribution of random variables. In general, for a random variable A that has a certain distribution, f(A) is not guaranteed to have the same distribution. This is the point that was missed when designing the ballot.

Analysis of the Fisher-Yates algorithm is fairly easy: The reason why it works is that at each step of selecting an element from the collection, each remaining element has an equally-likely probability of being selected. Analysis of each of the randomized sorting algorithms to understand why each distribution was produced is more difficult, and is left as an exercise to the reader.

It’s also worthwhile to note that the correct shuffling algorithm (Fisher-Yates) runs in O(n)time, while all of the randomized sorting algorithms have a longer asymptotic run time.