Effective C# Item 10: Understand the Pitfalls of GetHashCode()

Source: Internet. Editor: 程序博客网. Time: 2024/04/29 05:48

Item 10: Understand the Pitfalls of GetHashCode()

This is the only item in this book dedicated to one function that you should avoid writing. GetHashCode() is used in one place only: to define the hash value for keys in a hash-based collection, typically the Hashtable or Dictionary containers. That's good because there are a number of problems with the base class implementation of GetHashCode(). For reference types, it works but is inefficient. For value types, the base class version is often incorrect. But it gets worse. It's entirely possible that you cannot write GetHashCode() so that it is both efficient and correct. No single function generates more discussion and more confusion than GetHashCode(). Read on to remove all that confusion.

If you're defining a type that won't ever be used as the key in a container, this won't matter. Types that represent window controls, web page controls, or database connections are unlikely to be used as keys in a collection. In those cases, do nothing. All reference types will have a hash code that is correct, even if it is very inefficient. Value types should be immutable (see Item 7), in which case the default implementation always works, although it is also inefficient. In most types that you create, the best approach is to avoid the existence of GetHashCode() entirely.

One day, you'll create a type that is meant to be used as a hashtable key, and you'll need to write your own implementation of GetHashCode(), so read on. Hash-based containers use hash codes to optimize searches. Every object generates an integer value called a hash code. Objects are stored in buckets based on the value of that hash code. To search for an object, you request its key and search just that one bucket. In .NET, every object has a hash code, determined by System.Object.GetHashCode(). Any override of GetHashCode() must follow these three rules:

1. If two objects are equal (as defined by operator==), they must generate the same hash value. Otherwise, hash codes can't be used to find objects in containers.

2. For any object A, A.GetHashCode() must be an instance invariant. No matter what methods are called on A, A.GetHashCode() must always return the same value. That ensures that an object placed in a bucket is always in the right bucket.

3. The hash function should generate a random distribution among all integers for all inputs. That's how you get efficiency from a hash-based container.
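The bucket idea behind these rules can be sketched with a toy container. This is only an illustration under stated assumptions: ToyHashSet, its bucket count, and its method names are hypothetical, and the real Hashtable and Dictionary are implemented differently.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical toy container, for illustration only; it is not how
// Hashtable or Dictionary are actually implemented.
public class ToyHashSet<T>
{
    private readonly List<T>[] buckets;

    public ToyHashSet(int bucketCount)
    {
        buckets = new List<T>[bucketCount];
        for (int i = 0; i < bucketCount; i++)
            buckets[i] = new List<T>();
    }

    // Map a hash code to a bucket index (mask off the sign bit first).
    private int BucketIndex(T item)
    {
        return (item.GetHashCode() & 0x7FFFFFFF) % buckets.Length;
    }

    public void Add(T item)
    {
        buckets[BucketIndex(item)].Add(item);
    }

    // Only the one bucket selected by the hash code is searched.
    public bool Contains(T item)
    {
        foreach (T candidate in buckets[BucketIndex(item)])
        {
            if (candidate.Equals(item))
                return true;
        }
        return false;
    }
}
```

If equal objects produced different hash codes (rule 1), or an object's hash code changed after Add (rule 2), Contains would look in the wrong bucket. If most hash codes landed in one bucket (rule 3), Contains would degrade to a linear scan.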

Writing a correct and efficient hash function requires extensive knowledge of the type to ensure that rule 3 is followed. The versions defined in System.Object and System.ValueType do not have that advantage. These versions must provide the best default behavior with almost no knowledge of your particular type. Object.GetHashCode() uses an internal field in the System.Object class to generate the hash value. Each object created is assigned a unique object key, stored as an integer, when it is created. These keys start at 1 and increment every time a new object of any type gets created. The object identity field is set in the System.Object constructor and cannot be modified later. Object.GetHashCode() returns this value as the hash code for a given object.

Now examine Object.GetHashCode() in light of those three rules. If two objects are equal, Object.GetHashCode() returns the same hash value, unless you've overridden operator==. System.Object's version of operator==() tests object identity. GetHashCode() returns the internal object identity field. It works. However, if you've supplied your own version of operator==, you must also supply your own version of GetHashCode() to ensure that the first rule is followed. See Item 9 for details on equality.

The second rule is followed: After an object is created, its hash code never changes.

The third rule, a random distribution among all integers for all inputs, does not hold. A numeric sequence is not a random distribution among all integers unless you create an enormous number of objects. The hash codes generated by Object.GetHashCode() are concentrated at the low end of the range of integers.

This means that Object.GetHashCode() is correct but not efficient. If you create a hashtable based on a reference type that you define, the default behavior from System.Object is a working, but slow, hashtable. When you create reference types that are meant to be hash keys, you should override GetHashCode() to get a better distribution of the hash values across all integers for your specific type.
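The "correct" half of that conclusion can be checked with a small sketch. PlainKey and DefaultHashDemo are hypothetical names introduced for illustration. The sketch assumes a key type that overrides neither Equals() nor GetHashCode(), so both fall back to object identity.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical key type that overrides neither Equals nor GetHashCode.
public class PlainKey
{
    public readonly string Label;
    public PlainKey(string label) { Label = label; }
}

public static class DefaultHashDemo
{
    // True when a second instance with the same content can find the entry.
    public static bool FindsByContent()
    {
        var map = new Dictionary<PlainKey, int>();
        map.Add(new PlainKey("a"), 1);
        return map.ContainsKey(new PlainKey("a"));
    }

    // True when the same instance finds its own entry.
    public static bool FindsByIdentity()
    {
        var key = new PlainKey("a");
        var map = new Dictionary<PlainKey, int>();
        map.Add(key, 1);
        return map.ContainsKey(key);
    }
}
```

FindsByIdentity() returns true and FindsByContent() returns false: the default hash code and the default equality agree with each other because both are based on identity, which is what rule 1 requires.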

Before covering how to write your own override of GetHashCode(), this section examines ValueType.GetHashCode() with respect to those same three rules. System.ValueType overrides GetHashCode(), providing the default behavior for all value types. Its version returns the hash code from the first field defined in the type. Consider this example:

    public struct MyStruct
    {
        private String msg;
        private Int32 id;
        private DateTime epoch;
    }

The hash code returned from a MyStruct object is the hash code generated by the msg field. Under the default behavior described here, the following code snippet always returns true:

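    MyStruct s = new MyStruct();
    // msg is private, so this comparison only compiles inside MyStruct,
    // and msg must be non-null for the right-hand call to succeed.
    return s.GetHashCode() == s.msg.GetHashCode();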

 

Translator's note: when I tested this while translating, it always returned false.

The first rule says that two objects that are equal (as defined by operator==()) must have the same hash code. This rule is followed for value types under most conditions, but you can break it, just as you could for reference types. ValueType.operator==() compares the first field in the struct, along with every other field. That satisfies rule 1. As long as any override that you define for operator== uses the first field, it will work. Any struct whose first field does not participate in the equality of the type violates this rule, breaking GetHashCode().

The second rule states that the hash code must be an instance invariant. That rule is followed only when the first field in the struct is an immutable field. If the value of the first field can change, so can the hash code. That breaks the rules. Yes, GetHashCode() is broken for any struct that you create when the first field can be modified during the lifetime of the object. It's yet another reason why immutable value types are your best bet (see Item 7).

The third rule depends on the type of the first field and how it is used. If the first field generates a random distribution across all integers, and the first field is distributed across all values of the struct, then the struct generates an even distribution as well. However, if the first field often has the same value, this rule is violated. Consider a small change to the earlier struct:

    public struct MyStruct
    {
        private DateTime epoch;
        private String msg;
        private Int32 id;
    }

If the epoch field is set to the current date (not including the time), all MyStruct objects created on a given date will have the same hash code. That prevents an even distribution among all hash code values.
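To make the effect concrete without depending on what the runtime's default actually does, the sketch below hard-codes a hash taken from the first field only. DatedMessage and DistributionDemo are hypothetical names, and the explicit GetHashCode() override is an assumption that stands in for the default behavior the text describes.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical struct whose hash is explicitly taken from its first field,
// mimicking the default behavior the text describes.
public struct DatedMessage
{
    public readonly DateTime Epoch;
    public readonly string Msg;
    public readonly int Id;

    public DatedMessage(DateTime epoch, string msg, int id)
    {
        Epoch = epoch;
        Msg = msg;
        Id = id;
    }

    public override int GetHashCode()
    {
        return Epoch.GetHashCode();
    }
}

public static class DistributionDemo
{
    // Counts the distinct hash codes among `count` values created on one date.
    public static int DistinctHashCodes(int count)
    {
        DateTime day = new DateTime(2024, 4, 29);
        var seen = new HashSet<int>();
        for (int i = 0; i < count; i++)
            seen.Add(new DatedMessage(day, "message " + i, i).GetHashCode());
        return seen.Count;
    }
}
```

However many values are created, DistinctHashCodes returns 1, so every one of them would land in the same bucket.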

Summarizing the default behavior, Object.GetHashCode() works correctly for reference types, although it does not necessarily generate an efficient distribution. (If you have overridden Object.operator==(), you can break GetHashCode().) ValueType.GetHashCode() works only if the first field in your struct is read-only. ValueType.GetHashCode() generates an efficient hash code only when the first field in your struct contains values across a meaningful subset of its inputs.

If you're going to build a better hash code, you need to place some constraints on your type. Examine the three rules again, this time in the context of building a working implementation of GetHashCode().

First, if two objects are equal, as defined by operator==(), they must return the same hash value. Any property or data value used to generate the hash code must also participate in the equality test for the type. Ideally, the same properties used for equality are also used for hash code generation. It's possible to have properties participate in equality that are not used in the hash code computation. The default behavior for System.ValueType does just that, but it often means that rule 3 gets violated. The same data elements should participate in both computations.
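A minimal sketch of that advice follows. ProductCode is a hypothetical type, not one from the book, and it expresses equality through an Equals() override, which is the method the hash-based containers call. Both computations read the same two fields.

```csharp
using System;

// Hypothetical immutable type; Equals and GetHashCode read the same two fields.
public sealed class ProductCode
{
    private readonly string vendor;
    private readonly int number;

    public ProductCode(string vendor, int number)
    {
        this.vendor = vendor ?? "";
        this.number = number;
    }

    public override bool Equals(object obj)
    {
        ProductCode other = obj as ProductCode;
        return other != null && vendor == other.vendor && number == other.number;
    }

    public override int GetHashCode()
    {
        return vendor.GetHashCode() ^ number.GetHashCode();
    }
}
```

Because no field feeds the hash code without also feeding the equality test, two ProductCode objects that compare equal always produce the same hash code.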

The second rule is that the return value of GetHashCode() must be an instance invariant. Imagine that you defined a reference type, Customer:

    public class Customer
    {
        private String name;
        private decimal revenue;

        public Customer(String name)
        {
            // "this." is required: the parameter shadows the field.
            this.name = name;
        }

        public String Name
        {
            get { return name; }
            set { name = value; }
        }

        public override Int32 GetHashCode()
        {
            return name.GetHashCode();
        }
    }

Suppose that you execute the following code snippet:

    Customer c1 = new Customer("Acme Products");
    myHashMap.Add(c1, orders);
    // Oops, the name is wrong:
    c1.Name = "Acme Software";

c1 is lost somewhere in the hash map. When you placed c1 in the map, the hash code was generated from the string "Acme Products". After you change the name of the customer to "Acme Software", the hash code value changed. It's now being generated from the new name: "Acme Software". c1 is stored in the bucket defined by "Acme Products", but it should be in the bucket defined for "Acme Software". You've lost that customer in your own collection. It's lost because the hash code is not an object invariant. You've changed the correct bucket after storing the object.
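The lost entry can be reproduced with a small sketch. MutableCustomer and LostKeyDemo are hypothetical names, a Dictionary stands in for the hash map, and the key type is modeled on the mutable Customer class so the sketch stands alone.

```csharp
using System;
using System.Collections.Generic;

// Mutable key type modeled on the Customer class in the text
// (renamed here so the sketch stands alone).
public class MutableCustomer
{
    private string name;

    public MutableCustomer(string name) { this.name = name; }

    public string Name
    {
        get { return name; }
        set { name = value; }
    }

    public override int GetHashCode()
    {
        return name.GetHashCode();
    }
}

public static class LostKeyDemo
{
    // Returns (found before rename, found after rename, hash code changed).
    public static Tuple<bool, bool, bool> Run()
    {
        var c1 = new MutableCustomer("Acme Products");
        var map = new Dictionary<MutableCustomer, string>();
        map.Add(c1, "orders");

        int oldHash = c1.GetHashCode();
        bool foundBefore = map.ContainsKey(c1);

        // Oops, the name is wrong:
        c1.Name = "Acme Software";
        bool foundAfter = map.ContainsKey(c1);
        bool hashChanged = c1.GetHashCode() != oldHash;

        return Tuple.Create(foundBefore, foundAfter, hashChanged);
    }
}
```

Whenever the rename changes the hash code, which it does unless the two strings happen to collide, the second lookup fails even though the entry is still physically inside the dictionary.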

The earlier situation can occur only if Customer is a reference type. Value types misbehave differently, but they still cause problems. If Customer were a value type, a copy of c1 would get stored in the hash map. The last line changing the value of the name would have no effect on the copy stored in the hash map. Because boxing and unboxing make copies as well, it's very unlikely that you can change the members of a value type after that object has been added to a collection.
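The copy semantics can be seen in a short sketch. CustomerValue and CopyDemo are hypothetical names, and the struct is deliberately mutable only to show what happens; it is not a recommended design.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical mutable struct, used only to show copy semantics.
public struct CustomerValue
{
    public string Name;

    public CustomerValue(string name) { Name = name; }

    public override int GetHashCode()
    {
        return Name == null ? 0 : Name.GetHashCode();
    }
}

public static class CopyDemo
{
    // Returns the name held by the key stored inside the dictionary
    // after the caller's own variable has been changed.
    public static string StoredNameAfterLocalChange()
    {
        var c1 = new CustomerValue("Acme Products");
        var map = new Dictionary<CustomerValue, string>();
        map.Add(c1, "orders");          // the dictionary stores a copy of c1

        c1.Name = "Acme Software";      // changes only the local variable

        return map.Keys.First().Name;
    }
}
```

The stored key still reads "Acme Products": the entry is not lost, but the correction never reached it, so a lookup with the updated c1 is a lookup for a different key.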

The only way to address rule 2 is to define the hash code function to return a value based on some invariant property or properties of the object. System.Object abides by this rule using the object identity, which does not change. System.ValueType hopes that the first field in your type does not change. You can't do better without making your type immutable. When you define a value type that is intended for use as a key type in a hash container, it must be an immutable type. Violate this recommendation, and the users of your type will find a way to break hashtables that use your type as keys. Revisiting the Customer class, you can modify it so that the customer name is immutable:

 

    public class Customer
    {
        private readonly String name;
        private Decimal revenue;

        public Customer(String name) : this(name, 0)
        {
        }

        public Customer(String name, Decimal revenue)
        {
            // "this." is required: the parameters shadow the fields.
            this.name = name;
            this.revenue = revenue;
        }

        public String Name
        {
            get { return name; }
        }

        // Change the name, returning a new object:
        public Customer ChangeName(String newName)
        {
            return new Customer(newName, revenue);
        }

        public override Int32 GetHashCode()
        {
            return name.GetHashCode();
        }
    }

Making the name immutable changes how you must work with customer objects to modify the name:

    Customer c1 = new Customer("Acme Products");
    myHashMap.Add(c1, orders);
    // Oops, the name is wrong:
    Customer c2 = c1.ChangeName("Acme Software");
    Order o = myHashMap[c1] as Order;
    myHashMap.Remove(c1);
    myHashMap.Add(c2, o);

You have to remove the original customer, change the name, and add the new customer object to the hashtable. It looks more cumbersome than the first version, but it works. The previous version allowed programmers to write incorrect code. By enforcing the immutability of the properties used to calculate the hash code, you enforce correct behavior. Users of your type can't go wrong. Yes, this version is more work. You're forcing developers to write more code, but only because it's the only way to write the correct code. Make certain that any data members used to calculate the hash value are immutable.

The third rule says that GetHashCode() should generate a random distribution among all integers for all inputs. Satisfying this requirement depends on the specifics of the types you create. If a magic formula existed, it would be implemented in System.Object and this item would not exist. A common and successful algorithm is to XOR all the return values from GetHashCode() on all fields in a type. If your type contains some mutable fields, exclude those fields from the calculations.
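The XOR approach might look like the following sketch. Account and its fields are hypothetical, chosen only to show one mutable field being left out of the calculation.

```csharp
using System;

// Hypothetical type with two immutable fields and one mutable field.
public sealed class Account
{
    private readonly string owner;   // immutable: used in the hash
    private readonly int branch;     // immutable: used in the hash
    private decimal balance;         // mutable: excluded from the hash

    public Account(string owner, int branch)
    {
        this.owner = owner ?? "";
        this.branch = branch;
    }

    public decimal Balance
    {
        get { return balance; }
    }

    public void Deposit(decimal amount)
    {
        balance += amount;
    }

    // XOR the hash codes of the immutable fields only.
    public override int GetHashCode()
    {
        return owner.GetHashCode() ^ branch.GetHashCode();
    }
}
```

Because balance never feeds the calculation, calling Deposit() leaves the hash code unchanged, so an Account used as a key stays in its original bucket.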

GetHashCode() has very specific requirements: Equal objects must produce equal hash codes, and hash codes must be object invariants and must produce an even distribution to be efficient. All three can be satisfied only for immutable types. For other types, rely on the default behavior, but understand the pitfalls.
