Sorting in Perl

John Klassa / Raleigh Perl Mongers / June, 2000

 

Introduction

  • Perl has a built-in "sort" function.
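  • At the time of this talk it uses a quicksort algorithm, which has good (O(n log n) on average) performance. Perl 5.8 and later use a merge sort by default, which is also O(n log n). (Note, however, that a simple bubble sort can be faster than most other sorts for very short lists.)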
  • It's easy to use. You say:
    @s = sort @a;
    and you've got a sorted array...

 

So Why Do We Need a Talk?

  • In order to do useful sorts, you need to add a bit of code to your "sort" calls.
  • If you haven't had the need to do a complex sort, you will.
  • The built-in "sort" compares the sortkeys O(n log n) times. How (and how often) you generate the sortkeys, and what you sort them with, makes a huge difference.

 

The Basics

  • Ascending lexicographic sort:
    @s = sort @a;  (does {$a cmp $b} implicitly)
  • Ascending numeric sort:
    @s = sort {$a <=> $b} @a;
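
The difference between the two defaults is easiest to see on numbers. The following is a minimal sketch; the sample data is made up for illustration and is not from the talk.

```perl
use strict;
use warnings;

# Sample data chosen for illustration only.
my @nums = (10, 9, 100, 1);

my @lexical = sort @nums;                # default: compares as strings
my @numeric = sort { $a <=> $b } @nums;  # compares as numbers

print "lexical: @lexical\n";  # 1 10 100 9
print "numeric: @numeric\n";  # 1 9 10 100
```

The default sort puts "100" before "9" because it compares character by character.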

 

More Basics

  • Descending lexicographic sort:
    @s = sort {$b cmp $a} @a;
    @s = reverse sort @a; (faster)
    Comment from Uri Guttman: "The reverse is only faster in the real world of your data. With a long enough list and perl now (or soon) recognizing and optimizing simple compare blocks, the reverse would be extra work."
  • Descending numeric sort:
    @s = sort {$b <=> $a} @a;
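
As a quick check that the two descending forms agree, here is a small sketch with made-up sample data. It says nothing about which one is faster on a given Perl.

```perl
use strict;
use warnings;

# Sample data chosen for illustration only.
my @words = qw(pear apple fig);
my @nums  = (3, 100, 20);

my @desc_block   = sort { $b cmp $a } @words;  # swap $a and $b
my @desc_reverse = reverse sort @words;        # sort ascending, then reverse
my @desc_numeric = sort { $b <=> $a } @nums;

print "@desc_block\n";    # pear fig apple
print "@desc_reverse\n";  # pear fig apple
print "@desc_numeric\n";  # 100 20 3
```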

 

Variations

  • Case-insensitive sort:
    @s = sort {lc $a cmp lc $b} @a;
  • Element-length sort:
    @s = sort {length $a <=> length $b} @a;
  • Any function will do...
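
A short sketch of both variations side by side, using made-up words whose lengths are all different so the length sort has no ties:

```perl
use strict;
use warnings;

# Sample data chosen for illustration only.
my @words = qw(banana Fig cherries kiwi);

my @plain     = sort @words;                                # uppercase sorts first in ASCII
my @no_case   = sort { lc $a cmp lc $b } @words;            # compare lowercased copies
my @by_length = sort { length $a <=> length $b } @words;    # shortest first

print "plain:     @plain\n";      # Fig banana cherries kiwi
print "no case:   @no_case\n";    # banana cherries Fig kiwi
print "by length: @by_length\n";  # Fig kiwi banana cherries
```

Note that lc and length are applied only inside the comparison; the elements themselves come back unchanged.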

 

Combination Sorts

  • Length first, then lexicographic:
    @s = sort {length $a <=> length $b || $a cmp $b} @a;
  • Size of file, then age of file:
    @s = sort {-s $a <=> -s $b || -M $b <=> -M $a} @a;
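
The || works because <=> and cmp return 0 for a tie, and 0 is false, so the second comparison only runs when the first cannot decide. A minimal sketch of the length-then-lexicographic case, with made-up sample data:

```perl
use strict;
use warnings;

# Sample data chosen for illustration only: three of the words have length 4.
my @words = qw(pear fig apple kiwi date);

my @sorted = sort {
    length $a <=> length $b   # primary key: length
        ||
    $a cmp $b                 # tie-breaker: lexicographic
} @words;

print "@sorted\n";  # fig date kiwi pear apple
```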

 

Sort Subroutines

  • Useful when your sort criteria get a bit involved. $a and $b are automatic.
    @s = sort mycriteria @a;
    sub mycriteria {
        my($aa) = $a =~ /(\d+)/;
        my($bb) = $b =~ /(\d+)/;
        sin($aa) <=> sin($bb) || $aa*$aa <=> $bb*$bb;
    }
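
A simpler named comparison may be easier to try out. The sketch below sorts filenames by the first number embedded in each name; the subroutine name by_number and the sample filenames are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical comparison routine: compare by the first run of digits.
# $a and $b are set by sort; they are not passed in @_.
sub by_number {
    my ($aa) = $a =~ /(\d+)/;
    my ($bb) = $b =~ /(\d+)/;
    return $aa <=> $bb;
}

# Sample data chosen for illustration only.
my @files  = qw(file10.txt file2.txt file1.txt);
my @sorted = sort by_number @files;

print "@sorted\n";  # file1.txt file2.txt file10.txt
```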

 

Advanced Sorting: Motivation

  • Everything that happens in your sort subroutine/clause happens O(n log n) times.
  • If you do something expensive (like an extraction via a regexp, or perhaps a "stat" on a file), this is a Bad Thing.
  • The Goal: Extract the sortkeys just once.
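
To see the cost, one can count how often a key is extracted during a sort. This is a rough sketch: key_of is a hypothetical extractor, the data is synthetic, and the exact count depends on the Perl version and the input order, so none is quoted here.

```perl
use strict;
use warnings;

my $calls = 0;

# Hypothetical key extractor: pulls the number out of a string and
# counts how often it is invoked.
sub key_of {
    my ($item) = @_;
    $calls++;
    my ($n) = $item =~ /(\d+)/;
    return $n;
}

# 100 items in a scrambled but deterministic order: 101 is prime, so
# ($_ * 37) % 101 visits each of 1..100 exactly once.
my @items = map { sprintf 'item%d', ($_ * 37) % 101 } 1 .. 100;

# Two extractions per comparison, repeated for every comparison.
my @sorted = sort { key_of($a) <=> key_of($b) } @items;

printf "%d elements, %d key extractions\n", scalar @items, $calls;
```

Ideally the second number would equal the first; with extraction inside the comparison it comes out several times larger.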

 

Solution #1: "Orcish" Maneuver

  • Cache the computed values in a hash (so that you're only computing them once). Use an "or" to set missing values. An "or-cache".
    @s = sort {
        ($hash{$a} ||= fn($a)) cmp ($hash{$b} ||= fn($b))
    } @a;
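
A runnable sketch of the same idea follows. Here fn is a stand-in key function that lowercases its argument and counts its own calls, and the word list is made up for illustration.

```perl
use strict;
use warnings;

my %cache;
my $calls = 0;

# Stand-in for an expensive key function; counts how often it runs.
sub fn {
    my ($item) = @_;
    $calls++;
    return lc $item;
}

# Sample data chosen for illustration only.
my @words = qw(delta Alpha charlie Bravo echo);

# ||= stores the key the first time an element is seen and reuses it afterwards.
my @sorted = sort {
    ($cache{$a} ||= fn($a)) cmp ($cache{$b} ||= fn($b))
} @words;

print "@sorted\n";           # Alpha Bravo charlie delta echo
print "fn calls: $calls\n";  # 5, one per element, since every key here is a true value
```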

 

Problems with the OM

  • Performs an extra test after each sortkey lookup.
  • False values (such as 0 or the empty string) are recomputed each time, because "||=" cannot tell a cached false value from a missing one.
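
The second problem can be made visible by sorting on keys that are often 0. The sketch below compares the "||=" cache with one possible workaround, an explicit exists test; that workaround is an illustration, not something the talk proposes, and count_of, cached_count and the data are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical data: "name:count" strings, to be sorted by the count.
my @items = qw(a:0 b:3 c:0 d:1 e:0 f:2 g:0 h:5);

sub count_of { my ($n) = $_[0] =~ /:(\d+)/; return $n }

my ($or_calls, $exists_calls) = (0, 0);

# Orcish Maneuver: a cached 0 is false, so ||= computes it again.
my %or_cache;
my @s1 = sort {
    ($or_cache{$a} ||= do { $or_calls++; count_of($a) })
        <=>
    ($or_cache{$b} ||= do { $or_calls++; count_of($b) })
} @items;

# Alternative (an assumption, not from the talk): test for existence instead of truth.
my %cache;
sub cached_count {
    my ($item) = @_;
    unless (exists $cache{$item}) {
        $exists_calls++;
        $cache{$item} = count_of($item);
    }
    return $cache{$item};
}
my @s2 = sort { cached_count($a) <=> cached_count($b) } @items;

print "||= cache:    $or_calls key computations\n";
print "exists cache: $exists_calls key computations\n";
```

The exists version computes each key once, at the price of a subroutine call per lookup.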

 

A Better Way: The Schwartzian Transform

  • The Schwartzian Transform creates a sorted list by transforming the original list into an intermediate form, where the sortkeys are cached, and then pulling the original list back out.

 

But First, A Digression...

  • Just as @a = (1, 2, 3) creates an array, $aref = [1, 2, 3] creates a reference to an anonymous array.
  • Just as $a[0] is the first element of @a, $aref->[0] is the first element of the anonymous array to which $aref refers.
  • Understanding this is central to understanding the ST.
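
A few lines to try the notation out. The [value, sortkey] pair at the end is the shape the ST relies on; the filename and number in it are made up.

```perl
use strict;
use warnings;

my @a    = (1, 2, 3);   # a named array
my $aref = [1, 2, 3];   # a reference to an anonymous array

print $a[0], "\n";          # first element of @a: 1
print $aref->[0], "\n";     # first element via the reference: 1

# A two-element anonymous array holding a value and its sortkey
# (hypothetical filename and age).
my $pair = ['notes.txt', 4.5];
print "value: $pair->[0], sortkey: $pair->[1]\n";
```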

 

The Schwartzian Transform

  • Goal: Sort a list of filenames by age (oldest last), efficiently.
  • The naive approach does O(n log n) "stat" operations, so it's inefficient:
    @s = sort {-M $a <=> -M $b} @a;

 

ST: Mechanics

  • Map the list into a new one that contains the extracted sortkeys and the original values.
  • Sort on the sortkeys.
  • Map the resulting list into a new one that contains the original values in the sorted order.

 

ST: Verbose Approach

  • Verbosely, in code:
    # @a exists, and contains filenames
    @x = map { [ $_, -M ] } @a;             # transform: value, sortkey
    @sx = sort { $a->[1] <=> $b->[1] } @x;  # sort
    @s = map { $_->[0] } @sx;               # restore original values

 

ST: Final

  • Put it all together? The key is to read it backwards.
    # @a exists, and contains filenames
    @s = map { $_->[0] }                # restore original values
         sort { $a->[1] <=> $b->[1] }   # sort
         map { [$_, -M] } @a;           # transform: value, sortkey
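
The same shape works for any sortkey, not only file ages. The sketch below avoids the filesystem so it can be run anywhere: it sorts made-up job names by their embedded number and counts key extractions. key_of and the data are hypothetical.

```perl
use strict;
use warnings;

my $calls = 0;

# Hypothetical key extractor standing in for an expensive operation like -M.
sub key_of {
    my ($item) = @_;
    $calls++;
    my ($n) = $item =~ /(\d+)/;
    return $n;
}

# Sample data chosen for illustration only.
my @items = qw(job42 job7 job1000 job13);

my @sorted =
    map  { $_->[0] }                  # 3. restore original values
    sort { $a->[1] <=> $b->[1] }      # 2. sort on the cached sortkey
    map  { [ $_, key_of($_) ] }       # 1. transform: [value, sortkey]
    @items;

print "@sorted\n";                    # job7 job13 job42 job1000
print "key extractions: $calls\n";    # 4, one per element
```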

 

Can We Do Better?

  • I didn't think so until I read the Guttman-Rosler paper.
  • Turns out, yes. Use packed sortkeys and the default sort.

 

The "Packed Default" Sort

  • So named because it uses packed sortkeys, then sorts them with the default "sort" (i.e., no sort subroutine or sort clause, just the native, all-in-C comparison routine).
  • The benefits: fast, one-time sortkey generation; fast comparison; fast extraction.

 

PD: The Mechanics

  • Pack the sortkeys into a single string (tack on subkeys, if any).
  • Tack on the original values (or an index, if the original values are complex data structures).
  • Sort.
  • Retrieve original values via "substr", "split" or whatever.

 

PD: Example

  • Sort "dotted-quad" values:
    @out =
        map substr($_, 4) =>
        sort
        map pack('C4', /(\d+)\.(\d+)\.(\d+)\.(\d+)/) . $_ => @a;
  • Again, read it in reverse...
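
A runnable version with sample addresses, written with block-style map for readability. The addresses are made up, and the sketch assumes every input is a well-formed dotted quad with each part in the range 0 to 255.

```perl
use strict;
use warnings;

# Sample data chosen for illustration only.
my @ips = qw(10.0.0.2 192.168.1.10 10.0.0.10 9.255.255.255);

my @out =
    map  { substr($_, 4) }    # 3. drop the 4-byte packed key, keep the original
    sort                      # 2. default string sort compares the packed bytes first
    map  { pack('C4', /(\d+)\.(\d+)\.(\d+)\.(\d+)/) . $_ }   # 1. 4-byte key + original
    @ips;

print "$_\n" for @out;
# 9.255.255.255
# 10.0.0.2
# 10.0.0.10
# 192.168.1.10
```

Each octet becomes one byte, so a plain string comparison of the 4-byte prefix orders the addresses numerically, which a default sort on the text form would not ("10.0.0.10" would sort before "10.0.0.2").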

 

Conclusion

  • Using "sort" is always O(n log n).
  • For complicated sorts, how you pull out the sortkeys and how you compare them is what matters.
  • The ST is my personal favorite. It's easy to remember, and it's fast.
  • The PD sort is faster, but it's also a bit more cryptic (unless you're a natural with "pack", and have a desire to really understand your data). By the way, Uri Guttman started on a Sort::Records module that does a PD sort under the covers, but did not finish it or publish it to CPAN. He has, however, offered to give us the current source, design ideas, help, etc. if anyone in raleigh.pm would like to pick it up. 

References

  • More About the Schwartzian Transform (Joseph Hall), at: http://www.stllinux.org/meeting_notes/1997/0918/schwtr.html
  • A Fresh Look at Efficient Perl Sorting (Uri Guttman and Larry Rosler), at: http://www.sysarch.com/perl/sort_paper.html

Revisions

  1. November 20, 2002: Rob West: Update based on feedback from Uri Guttman.
  2. August 22, 2003: Rob West: Updated References links.