Sorting in Perl

John Klassa / Raleigh Perl Mongers / June, 2000

Introduction

Perl has a built-in "sort" function.
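At the time of this talk it used a quicksort algorithm, which has good (on average, O(n log n)) performance; Perl 5.8 and later use a mergesort by default, which is also O(n log n). (Note, however, that a simple bubble sort can be faster than most other sorts for very short lists.)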
It's easy to use. You say:
```
@s = sort @a;
```
and you've got a sorted array...

So Why Do We Need a Talk?

In order to do useful sorts, you need to add a bit of code to your "sort" calls.
If you haven't had the need to do a complex sort, you will.
The built-in "sort" compares the sortkeys O(n log n) times. How (and how often) you generate the sortkeys, and what you sort them with, makes a huge difference.

The Basics

Ascending lexicographic sort:

    @s = sort @a;  # does {$a cmp $b} implicitly

Ascending numeric sort:
```
@s = sort {$a <=> $b} @a;
```
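The difference between the two shows up as soon as the list holds numbers of different widths. A minimal sketch, using a made-up list (the variable names are illustrative only):

```perl
use strict;
use warnings;

# Hypothetical sample data.
my @nums = (10, 9, 100, 1);

my @lexical = sort @nums;                 # compares the elements as strings
my @numeric = sort { $a <=> $b } @nums;   # compares the elements as numbers

print "lexical: @lexical\n";   # lexical: 1 10 100 9
print "numeric: @numeric\n";   # numeric: 1 9 10 100
```

The default sort puts "100" before "9" because it compares character by character, which is rarely what you want for numbers.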

More Basics

Descending lexicographic sort:
```
@s = sort {$b cmp $a} @a;
@s = reverse sort @a;  # faster
```
Comment from Uri Guttman: "The reverse is only faster in the real world of your data. With a long enough list and perl now (or soon) recognizing and optimizing simple compare blocks, the reverse would be extra work."
Descending numeric sort:
```
@s = sort {$b <=> $a} @a;
```
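A quick check that the two descending forms agree, as a sketch on made-up data (all names here are illustrative):

```perl
use strict;
use warnings;

# Hypothetical sample data with no duplicate elements.
my @words = qw(pear apple fig);
my @nums  = (10, 9, 100, 1);

my @desc_block   = sort { $b cmp $a } @words;   # swap $a and $b
my @desc_reverse = reverse sort @words;         # sort ascending, then flip
my @desc_numeric = sort { $b <=> $a } @nums;    # numeric, largest first

print "@desc_block\n";     # pear fig apple
print "@desc_reverse\n";   # pear fig apple
print "@desc_numeric\n";   # 100 10 9 1
```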

Variations

Case-insensitive sort:
```
@s = sort {lc $a cmp lc $b} @a;
```
Element-length sort:
```
@s = sort {length $a <=> length $b} @a;
```
Any function will do...
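Here is a small sketch of those variations side by side. The word list is made up, and it was chosen so that no two words have the same length (so the length sort has no ties to worry about):

```perl
use strict;
use warnings;

# Hypothetical sample data.
my @words = qw(pear apple Fig banana);

my @plain     = sort @words;                               # uppercase sorts before lowercase in ASCII
my @no_case   = sort { lc $a cmp lc $b } @words;           # compare lowercased copies
my @by_length = sort { length $a <=> length $b } @words;   # compare lengths numerically

print "plain:     @plain\n";       # plain:     Fig apple banana pear
print "no case:   @no_case\n";     # no case:   apple banana Fig pear
print "by length: @by_length\n";   # by length: Fig pear apple banana
```

Note that the original strings come back unchanged; "lc" and "length" are applied only for the comparison.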

Combination Sorts

Length first, then lexicographic:

    @s = sort { length $a <=> length $b || $a cmp $b } @a;

Size of file, then age of file:

    @s = sort { -s $a <=> -s $b || -M $b <=> -M $a } @a;
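The trick is that "<=>" and "cmp" return 0 for a tie, so "||" falls through to the next comparison only when the first one can't decide. A minimal sketch of the length-then-lexicographic sort on a made-up word list:

```perl
use strict;
use warnings;

# Hypothetical sample data: one 3-letter word, four 4-letter words, one 5-letter word.
my @words = qw(pear fig apple kiwi date plum);

my @sorted = sort {
    length $a <=> length $b    # primary key: shorter words first
        ||
    $a cmp $b                  # tie-breaker: alphabetical within a length
} @words;

print "@sorted\n";   # fig date kiwi pear plum apple
```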

Sort Subroutines

Useful when your sort criteria get a bit involved. $a and $b are set automatically.

    @s = sort mycriteria @a;

    sub mycriteria {
        my($aa) = $a =~ /(\d+)/;
        my($bb) = $b =~ /(\d+)/;
        sin($aa) <=> sin($bb) || $aa*$aa <=> $bb*$bb;
    }

Advanced Sorting: Motivation

Everything that happens in your sort subroutine/clause happens O(n log n) times.
If you do something expensive (like an extraction via a regexp, or perhaps a "stat" on a file), this is a Bad Thing.
The Goal: Extract the sortkeys just once.
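To see the cost, count how often a key extractor actually runs inside a sort block. This is only a sketch: "extract_key" is a hypothetical stand-in for something expensive, and the exact count depends on your Perl version and the input order, so no number is promised here.

```perl
use strict;
use warnings;

my $calls = 0;

# Hypothetical "expensive" key extractor: returns the first run of digits.
sub extract_key {
    my ($item) = @_;
    $calls++;
    my ($n) = $item =~ /(\d+)/;
    return $n;
}

# Made-up input: file100.txt down to file1.txt.
my @items = map { "file$_.txt" } reverse 1 .. 100;

# The extractor runs twice per comparison, not once per element.
my @sorted = sort { extract_key($a) <=> extract_key($b) } @items;

print "items: ", scalar(@items), ", key extractions: $calls\n";
```

Every comparison extracts two keys, so the count always comes out larger than the number of items.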

Solution #1: "Orcish" Maneuver

Cache the computed values in a hash (so that you're only computing them once). Use an "or" to set missing values. An "or-cache".
```
@s = sort {
    ($hash{$a} ||= fn($a)) cmp ($hash{$b} ||= fn($b))
} @a;
```

Problems with the OM

Performs an extra test (the "||=") after each sortkey lookup.
A sortkey that is false (0, "0", the empty string, or undef) fails the "||=" test, so it is recomputed on every comparison.
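The false-value problem is easy to demonstrate. In this sketch "fn" is a hypothetical key function that counts its own calls, and the items are made up so that exactly one of them has a sortkey of 0. A numeric comparison is used because the keys are numbers.

```perl
use strict;
use warnings;

my %calls;   # how many times fn() ran for each item

# Hypothetical key function: the (possibly negative) number after "t=".
sub fn {
    my ($item) = @_;
    $calls{$item}++;
    my ($n) = $item =~ /(-?\d+)/;
    return $n;
}

# Made-up data; "t=0" has a false sortkey.
my @items = qw(t=3 t=0 t=-2 t=5 t=-1);

my %hash;
my @sorted = sort {
    ($hash{$a} ||= fn($a)) <=> ($hash{$b} ||= fn($b))
} @items;

print "$_: $calls{$_} call(s)\n" for @sorted;
```

Every item except "t=0" is computed exactly once. The cached "0" never satisfies "||=", so "t=0" is recomputed each time it takes part in a comparison.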

A Better Way: The Schwartzian Transform

The Schwartzian Transform creates a sorted list by transforming the original list into an intermediate form, where the sortkeys are cached, and then pulling the original list back out.

But First, A Digression...

Just as @a = (1, 2, 3) creates an array, $aref = [1, 2, 3] creates a reference to an anonymous array.
Just as $a[0] is the first element of @a, $aref->[0] is the first element of the anonymous array to which $aref refers.
Understanding this is central to understanding the ST.
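A short sketch of the digression in code. The filenames and numbers are invented for illustration:

```perl
use strict;
use warnings;

my @pair     = ('report.txt', 3.5);   # a named array
my $pair_ref = ['report.txt', 3.5];   # a reference to an anonymous array

print "$pair[0] $pair[1]\n";               # report.txt 3.5
print "$pair_ref->[0] $pair_ref->[1]\n";   # report.txt 3.5

# A list of such references: each holds a value in slot 0 and a sortkey in slot 1.
my @records = (['b.txt', 2], ['a.txt', 7], ['c.txt', 1]);

# Sort the references by slot 1, then pull slot 0 back out.
my @names = map { $_->[0] } sort { $a->[1] <=> $b->[1] } @records;
print "@names\n";   # c.txt b.txt a.txt
```

Inside the sort block, $a and $b are themselves array references, which is why the comparison reads $a->[1] rather than $a.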

The Schwartzian Transform

Goal: Sort a list of filenames by age (oldest last), efficiently.
The naive approach does O(n log n) "stat" operations, so it's inefficient:
```
@s = sort {-M $a <=> -M $b} @a;
```

ST: Mechanics

Map the list into a new one that contains the extracted sortkeys and the original values.
Sort on the sortkeys.
Map the resulting list into a new one that contains the original values in the sorted order.

ST: Verbose Approach

Verbosely, in code:

    # @a exists, and contains filenames
    @x  = map { [ $_, -M ] } @a;              # transform: value, sortkey
    @sx = sort { $a->[1] <=> $b->[1] } @x;    # sort
    @s  = map { $_->[0] } @sx;                # restore original values

ST: Final

Put it all together? The key is to read it backwards.

    # @a exists, and contains filenames
    @s = map  { $_->[0] }                 # restore original values
         sort { $a->[1] <=> $b->[1] }     # sort
         map  { [$_, -M] } @a;            # transform: value, sortkey
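The same shape works for any sortkey, not just file ages. Here is a self-contained sketch that doesn't touch the filesystem: the data is made up, and "key_of" is a hypothetical extractor that counts its calls so you can confirm each key is pulled out only once.

```perl
use strict;
use warnings;

my $extractions = 0;

# Hypothetical key extractor: the number at the end of the line.
sub key_of {
    my ($line) = @_;
    $extractions++;
    my ($n) = $line =~ /(\d+)/;
    return $n;
}

# Made-up input lines.
my @lines = ('carol 42', 'alice 7', 'bob 19', 'dave 3');

my @sorted =
    map  { $_->[0] }                 # restore original values
    sort { $a->[1] <=> $b->[1] }     # sort on the cached sortkey
    map  { [ $_, key_of($_) ] }      # transform: value, sortkey
    @lines;

print "$_\n" for @sorted;
print "extractions: $extractions\n";   # extractions: 4
```

Four lines, four extractions, no matter how many comparisons the sort makes.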

Can We Do Better?

I didn't think so until I read the Guttman-Rosler paper.
Turns out, yes. Use packed sortkeys and the default sort.

The "Packed Default" Sort

So named because it uses packed sortkeys, then sorts them with the default "sort" (i.e., no sort subroutine or sort clause, just the native, all-in-C comparison routine).
The benefits: fast, one-time sortkey generation; fast comparison; fast extraction.

PD: The Mechanics

Pack the sortkeys into a single string (tack on subkeys, if any).
Tack on the original values (or an index, if the original values are complex data structures).
Sort.
Retrieve original values via "substr", "split" or whatever.
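The steps above can be sketched for the "index" case, where the original values are data structures that can't be tacked onto a string. Everything here is illustrative: the records are made up, and the sketch assumes the sortkey is a non-negative integer below 2**32, so that pack's 'N' format (32-bit, big-endian) produces strings whose byte order matches numeric order.

```perl
use strict;
use warnings;

# Hypothetical records; too complex to append to a string, so we append an index.
my @records = (
    { name => 'carol', score => 300   },
    { name => 'alice', score => 7     },
    { name => 'bob',   score => 65536 },
);

my @sorted =
    map  { $records[ unpack('N', substr($_, 4)) ] }             # index back to record
    sort                                                        # default sort, no block
    map  { pack('N', $records[$_]{score}) . pack('N', $_) }     # 4-byte key . 4-byte index
    0 .. $#records;

print join(' ', map { $_->{name} } @sorted), "\n";   # alice carol bob
```

The first four bytes of each string are the sortkey and the last four are the position in @records, so "substr" plus "unpack" is all it takes to get the record back.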

PD: Example

Sort "dotted-quad" values:

    @out =
        map substr($_, 4) =>
        sort
        map pack('C4', /(\d+)\.(\d+)\.(\d+)\.(\d+)/)
            . $_ => @a;

Again, read it in reverse...
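A runnable sketch of the dotted-quad sort with some made-up addresses (the variable names are illustrative). Each address gets a four-byte packed prefix, the default sort orders the prefixed strings, and "substr" strips the prefix off again.

```perl
use strict;
use warnings;

# Hypothetical sample addresses.
my @addresses = qw(10.0.0.12 192.168.1.1 10.0.0.2 9.255.255.255);

my @out =
    map substr($_, 4) =>                                 # drop the 4-byte key
    sort                                                 # default string sort
    map pack('C4', /(\d+)\.(\d+)\.(\d+)\.(\d+)/) . $_ => # key . original value
    @addresses;

print "$_\n" for @out;
# 9.255.255.255
# 10.0.0.2
# 10.0.0.12
# 192.168.1.1
```

A plain "sort @addresses" would put 10.0.0.12 before 10.0.0.2 and 192.168.1.1 before 9.255.255.255; the packed bytes are what make the string comparison agree with the numeric order.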

Conclusion

Using "sort" is always O(n log n).
For complicated sorts, how you pull out the sortkeys and how you compare them is what matters.
The ST is my personal favorite. It's easy to remember, and it's fast.
The PD sort is faster, but it's also a bit more cryptic (unless you're a natural with "pack", and have a desire to really understand your data). By the way, Uri Guttman started on a Sort::Records module that does a PD sort under the covers, but did not finish it or publish it to CPAN. He has, however, offered to give us the current source, design ideas, help, etc. if anyone in raleigh.pm would like to pick it up.

References

More About the Schwartzian Transform (Joseph Hall), at: http://www.stllinux.org/meeting_notes/1997/0918/schwtr.html
A Fresh Look at Efficient Perl Sorting (Uri Guttman and Larry Rosler), at: http://www.sysarch.com/perl/sort_paper.html

Revisions

November 20, 2002: Rob West: Update based on feedback from Uri Guttman.
August 22, 2003: Rob West: Updated References links.

0 0