Effective C# 10: Understand the Pitfalls of GetHashCode()

Item 10: Understand the Pitfalls of GetHashCode()


This is the only item in this book dedicated to one function that you should avoid writing. GetHashCode() is used in one place only: to define the hash value for keys in a hash-based collection, typically the Hashtable or Dictionary containers. That's good because there are a number of problems with the base class implementation of GetHashCode(). For reference types, it works but is inefficient. For value types, the base class version is often incorrect. But it gets worse. It's entirely possible that you cannot write GetHashCode() so that it is both efficient and correct. No single function generates more discussion and more confusion than GetHashCode(). Read on to remove all that confusion.


If you're defining a type that won't ever be used as the key in a container, this won't matter. Types that represent window controls, web page controls, or database connections are unlikely to be used as keys in a collection. In those cases, do nothing. All reference types will have a hash code that is correct, even if it is very inefficient. Value types should be immutable (see Item 7), in which case the default implementation always works, although it is also inefficient. In most types that you create, the best approach is to avoid writing GetHashCode() entirely.


One day, you'll create a type that is meant to be used as a hashtable key, and you'll need to write your own implementation of GetHashCode(), so read on. Hash-based containers use hash codes to optimize searches. Every object generates an integer value called a hash code. Objects are stored in buckets based on the value of that hash code. To search for an object, you compute the hash code of its key and search just that one bucket. In .NET, every object has a hash code, determined by System.Object.GetHashCode(). Any override of GetHashCode() must follow these three rules:


1. If two objects are equal (as defined by Equals() and, if you define it, operator==), they must generate the same hash value. Otherwise, hash codes can't be used to find objects in containers.


2. For any object A, A.GetHashCode() must be an instance invariant. No matter what methods are called on A, A.GetHashCode() must always return the same value. That ensures that an object placed in a bucket is always in the right bucket.


3. The hash function should generate a random distribution among all integers for all inputs. That's how you get efficiency from a hash-based container.
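
To make the bucket idea concrete, here is a toy sketch of how a container might turn a hash code into a bucket index. The BucketSketch class and its BucketIndex helper are hypothetical names invented for this illustration; the real Hashtable and Dictionary containers are more elaborate, so treat this as a picture of the principle rather than their actual algorithm.

```csharp
using System;
using System.Collections.Generic;

// Toy illustration only. BucketSketch and BucketIndex are hypothetical
// names, not part of the .NET libraries.
public static class BucketSketch
{
    // Maps a key's hash code to one of bucketCount buckets.
    public static int BucketIndex(object key, int bucketCount)
    {
        // Clear the sign bit so the remainder is never negative.
        int hash = key.GetHashCode() & 0x7FFFFFFF;
        return hash % bucketCount;
    }

    public static void Main()
    {
        var buckets = new List<int>[4];
        for (int i = 0; i < buckets.Length; i++)
        {
            buckets[i] = new List<int>();
        }

        // An Int32 returns its own value as its hash code, which keeps
        // this example deterministic.
        foreach (int key in new[] { 5, 8, 13, 21 })
        {
            buckets[BucketIndex(key, buckets.Length)].Add(key);
        }

        // A lookup recomputes the index and scans only that one bucket.
        int wanted = 13;
        bool found = buckets[BucketIndex(wanted, buckets.Length)].Contains(wanted);
        Console.WriteLine(found);
    }
}
```

Rules 1 and 2 are what make the final lookup land in the same bucket the key was stored in; rule 3 is what keeps any single bucket from growing long.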


Writing a correct and efficient hash function requires extensive knowledge of the type to ensure that rule 3 is followed. The versions defined in System.Object and System.ValueType do not have that advantage. These versions must provide the best default behavior with almost no knowledge of your particular type. Object.GetHashCode() uses an internal field in the System.Object class to generate the hash value. Each object created is assigned a unique object key, stored as an integer, when it is created. These keys start at 1 and increment every time a new object of any type gets created. The object identity field is set in the System.Object constructor and cannot be modified later. Object.GetHashCode() returns this value as the hash code for a given object.


Now examine Object.GetHashCode() in light of those three rules. If two objects are equal, Object.GetHashCode() returns the same hash value, unless you've overridden Equals() or defined operator==. System.Object's version of Equals() tests object identity, as does the default operator== for reference types. GetHashCode() returns the internal object identity field. It works. However, if you've supplied your own version of equality, you must also supply your own version of GetHashCode() to ensure that the first rule is followed. See Item 9 for details on equality.
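
As an illustration of keeping equality and hashing in step, the following sketch defines a hypothetical Isbn type (the name and its single code field are assumptions made for this example). It supplies its own equality and a matching GetHashCode(), so two distinct objects holding the same code compare equal and hash alike.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical type for illustration: equality is based on the code field,
// so GetHashCode() is based on the same field.
public sealed class Isbn
{
    private readonly string code; // assumed non-null for this sketch

    public Isbn(string code)
    {
        if (code == null) throw new ArgumentNullException("code");
        this.code = code;
    }

    public override bool Equals(object obj)
    {
        Isbn other = obj as Isbn;
        return !object.ReferenceEquals(other, null) && code == other.code;
    }

    public static bool operator ==(Isbn left, Isbn right)
    {
        return object.Equals(left, right);
    }

    public static bool operator !=(Isbn left, Isbn right)
    {
        return !object.Equals(left, right);
    }

    // Rule 1: equal objects must produce equal hash codes.
    public override int GetHashCode()
    {
        return code.GetHashCode();
    }
}

public static class IsbnDemo
{
    public static void Main()
    {
        var titles = new Dictionary<Isbn, string>();
        titles.Add(new Isbn("0-321-24566-0"), "some title");

        // A different object with the same code finds the entry.
        Console.WriteLine(titles.ContainsKey(new Isbn("0-321-24566-0")));
    }
}
```

If the GetHashCode() override were removed, the two Isbn objects would still compare equal but would fall back to identity-based hash codes, and the lookup could no longer be relied on.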

The second rule is followed: After an object is created, its hash code never changes.


The third rule, a random distribution among all integers for all inputs, does not hold. A numeric sequence is not a random distribution among all integers unless you create an enormous number of objects. The hash codes generated by Object.GetHashCode() are concentrated at the low end of the range of integers.


This means that Object.GetHashCode() is correct but not efficient. If you create a hashtable based on a reference type that you define, the default behavior from System.Object is a working, but slow, hashtable. When you create reference types that are meant to be hash keys, you should override GetHashCode() to get a better distribution of the hash values across all integers for your specific type.


Before covering how to write your own override of GetHashCode(), this section examines ValueType.GetHashCode() with respect to those same three rules. System.ValueType overrides GetHashCode(), providing the default behavior for all value types. Its version returns the hash code from the first field defined in the type. Consider this example:


    public struct MyStruct
    {
        private String msg;
        private Int32 id;
        private DateTime epoch;
    }

The hash code returned from a MyStruct object is the hash code generated by the msg field. The following code snippet, placed in a member of MyStruct so that it can read the private field, always returns true once msg has been assigned a non-null string:


    // Inside a member of MyStruct, with msg already assigned:
    MyStruct s = this;
    return s.GetHashCode() == s.msg.GetHashCode();




The first rule says that two objects that are equal must have the same hash code. This rule is followed for value types under most conditions, but you can break it, just as you could for reference types. ValueType.Equals() compares the first field in the struct, along with every other field. That satisfies rule 1. As long as any equality that you define yourself (an Equals() override or operator==) uses the first field, it will work. Any struct whose first field does not participate in the equality of the type violates this rule, breaking GetHashCode().


The second rule states that the hash code must be an instance invariant. That rule is followed only when the first field in the struct is an immutable field. If the value of the first field can change, so can the hash code. That breaks the rules. Yes, GetHashCode() is broken for any struct that you create when the first field can be modified during the lifetime of the object. It's yet another reason why immutable value types are your best bet (see Item 7).

The third rule depends on the type of the first field and how it is used. If the first field generates a random distribution across all integers, and the first field is distributed across all values of the struct, then the struct generates an even distribution as well. However, if the first field often has the same value, this rule is violated. Consider a small change to the earlier struct:


    public struct MyStruct
    {
        private DateTime epoch;
        private String msg;
        private Int32 id;
    }

If the epoch field is set to the current date (not including the time), all MyStruct objects created on a given date will have the same hash code. That prevents an even distribution among all hash code values.
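
The effect on distribution can be sketched without relying on the runtime's own ValueType.GetHashCode(). The hypothetical DatedRecord struct below exposes two explicit hash methods, both invented for this illustration: one that looks only at the date field, standing in for the first-field behavior described above, and one that also folds in the id.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical struct for illustration. The two hash methods are explicit
// stand-ins, not the runtime's ValueType.GetHashCode().
public struct DatedRecord
{
    private readonly DateTime epoch;
    private readonly int id;

    public DatedRecord(DateTime epoch, int id)
    {
        this.epoch = epoch;
        this.id = id;
    }

    // Hashes on the first field alone.
    public int FirstFieldHash()
    {
        return epoch.GetHashCode();
    }

    // Folds the second field into the hash as well.
    public int CombinedHash()
    {
        return epoch.GetHashCode() ^ id.GetHashCode();
    }
}

public static class DistributionDemo
{
    // Counts how many distinct hash values n same-day records produce.
    public static int CountDistinct(Func<DatedRecord, int> hash, int n)
    {
        DateTime day = new DateTime(2024, 1, 15); // a date with no time of day
        var seen = new HashSet<int>();
        for (int i = 0; i < n; i++)
        {
            seen.Add(hash(new DatedRecord(day, i)));
        }
        return seen.Count;
    }

    public static void Main()
    {
        Console.WriteLine(CountDistinct(r => r.FirstFieldHash(), 100));
        Console.WriteLine(CountDistinct(r => r.CombinedHash(), 100));
    }
}
```

With the date alone, every record created on the same day shares one hash value, so they would all crowd into a single bucket. Folding in the id gives each of the 100 records its own value.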


Summarizing the default behavior, Object.GetHashCode() works correctly for reference types, although it does not necessarily generate an efficient distribution. (If you have redefined equality for the type without also overriding GetHashCode(), you can break it.) ValueType.GetHashCode() works only if the first field in your struct is read-only. ValueType.GetHashCode() generates an efficient hash code only when the first field in your struct contains values across a meaningful subset of its inputs.


If you're going to build a better hash code, you need to place some constraints on your type. Examine the three rules again, this time in the context of building a working implementation of GetHashCode().


First, if two objects are equal, as defined by Equals() (and operator==, if you define it), they must return the same hash value. Any property or data value used to generate the hash code must also participate in the equality test for the type. Ideally, the same properties used for equality are used for hash code generation. It's possible to have properties participate in equality that are not used in the hash code computation. The default behavior for System.ValueType does just that, but it often means that rule 3 gets violated. The same data elements should participate in both computations.


The second rule is that the return value of GetHashCode() must be an instance invariant. Imagine that you defined a reference type, Customer:


    public class Customer
    {
        private String name;
        private decimal revenue;

        public Customer(String name)
        {
            // Qualify with this: a bare "name = name" assigns the parameter to itself.
            this.name = name;
        }

        public String Name
        {
            get { return name; }
            set { name = value; }
        }

        // The hash code depends on a field that can change.
        public override Int32 GetHashCode()
        {
            return name.GetHashCode();
        }
    }

Suppose that you execute the following code snippet:


    Customer c1 = new Customer("Acme Products");
    myHashMap.Add(c1, orders);
    // Oops, the name is wrong:
    c1.Name = "Acme Software";

c1 is lost somewhere in the hash map. When you placed c1 in the map, the hash code was generated from the string "Acme Products". After you change the name of the customer to "Acme Software", the hash code value changed. It's now being generated from the new name: "Acme Software". c1 is stored in the bucket defined by "Acme Products", but it should be in the bucket defined for "Acme Software". You've lost that customer in your own collection. It's lost because the hash code is not an object invariant. You've changed the correct bucket after storing the object.
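
The same loss can be reproduced in a few lines. This sketch uses a hypothetical MutableKey type whose hash code comes from an integer rather than a string, an assumption made only so that the outcome does not depend on how strings happen to hash; the mechanism is the one described for Customer.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical key type for illustration. Its hash code depends on
// mutable state, which violates rule 2.
public class MutableKey
{
    private int number;

    public MutableKey(int number)
    {
        this.number = number;
    }

    public int Number
    {
        get { return number; }
        set { number = value; }
    }

    public override int GetHashCode()
    {
        return number;
    }
}

public static class LostKeyDemo
{
    // Returns { found before the change, found after the change }.
    public static bool[] Run()
    {
        var map = new Dictionary<MutableKey, string>();
        var key = new MutableKey(1);
        map.Add(key, "orders");

        bool before = map.ContainsKey(key);

        // Changing the key changes its hash code while it sits in the map.
        key.Number = 2;
        bool after = map.ContainsKey(key);

        return new[] { before, after };
    }

    public static void Main()
    {
        bool[] result = Run();
        Console.WriteLine(result[0]);
        Console.WriteLine(result[1]);
    }
}
```

The entry is still inside the dictionary after the change; it simply cannot be reached through its own key, because the lookup now computes a hash code that no longer matches the one recorded when the entry was added.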

The earlier situation can occur only if Customer is a reference type. Value types misbehave differently, but they still cause problems. If Customer is a value type, a copy of c1 gets stored in the hash map. The last line changing the value of the name has no effect on the copy stored in the hash map. Because boxing and unboxing make copies as well, it's very unlikely that you can change the members of a value type after that object has been added to a collection.
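
A short sketch of the value type case, using a hypothetical StructKey with a single public integer field (both names are assumptions for this example). The dictionary keeps its own copy of the key, so a later change to the local variable never reaches the stored entry.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical mutable value type for illustration only.
public struct StructKey
{
    public int Id;

    public StructKey(int id)
    {
        Id = id;
    }
}

public static class StructCopyDemo
{
    // Returns { stored key still has Id 1, original value found, changed local found }.
    public static bool[] Run()
    {
        var map = new Dictionary<StructKey, string>();
        StructKey key = new StructKey(1);
        map.Add(key, "orders"); // the dictionary stores a copy of key

        key.Id = 2; // changes only the local variable

        bool storedUnchanged = false;
        foreach (StructKey stored in map.Keys)
        {
            storedUnchanged = stored.Id == 1;
        }

        bool originalFound = map.ContainsKey(new StructKey(1));
        bool changedFound = map.ContainsKey(key);
        return new[] { storedUnchanged, originalFound, changedFound };
    }

    public static void Main()
    {
        foreach (bool b in Run())
        {
            Console.WriteLine(b);
        }
    }
}
```

The stored entry stays reachable under its original value, but the local variable no longer matches it, which is the different kind of misbehavior the paragraph above refers to.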


The only way to address rule 2 is to define the hash code function to return a value based on some invariant property or properties of the object. System.Object abides by this rule using the object identity, which does not change. System.ValueType hopes that the first field in your type does not change. You can't do better without making your type immutable. When you define a value type that is intended for use as a key type in a hash container, it must be an immutable type. Violate this recommendation, and the users of your type will find a way to break hashtables that use your type as keys. Revisiting the Customer class, you can modify it so that the customer name is immutable:



    public class Customer
    {
        private readonly String name;
        private Decimal revenue;

        public Customer(String name) : this(name, 0)
        {
        }

        public Customer(String name, Decimal revenue)
        {
            // Qualify with this: bare "name = name" would assign each parameter to itself.
            this.name = name;
            this.revenue = revenue;
        }

        public String Name
        {
            get { return name; }
        }

        // Change the name, returning a new object:
        public Customer ChangeName(String newName)
        {
            return new Customer(newName, revenue);
        }

        // The hash code now depends only on a readonly field.
        public override Int32 GetHashCode()
        {
            return name.GetHashCode();
        }
    }

Making the name immutable changes how you must work with customer objects to modify the name:


    Customer c1 = new Customer("Acme Products");
    myHashMap.Add(c1, orders);
    // Oops, the name is wrong:
    Customer c2 = c1.ChangeName("Acme Software");
    Order o = myHashMap[c1] as Order;
    myHashMap.Remove(c1);
    myHashMap.Add(c2, o);

You have to remove the original customer, change the name, and add the new customer object to the hashtable. It looks more cumbersome than the first version, but it works. The previous version allowed programmers to write incorrect code. By enforcing the immutability of the properties used to calculate the hash code, you enforce correct behavior. Users of your type can't go wrong. Yes, this version is more work. You're forcing developers to write more code, but only because it's the only way to write the correct code. Make certain that any data members used to calculate the hash value are immutable.


The third rule says that GetHashCode() should generate a random distribution among all integers for all inputs. Satisfying this requirement depends on the specifics of the types you create. If a magic formula existed, it would be implemented in System.Object and this item would not exist. A common and successful algorithm is to XOR all the return values from GetHashCode() on all fields in a type. If your type contains some mutable fields, exclude those fields from the calculations.
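
A minimal sketch of the XOR approach, using a hypothetical StockItem type whose field names are assumptions made for this example. The two readonly fields drive both equality and the hash code; the mutable quantity is left out of both.

```csharp
using System;

// Hypothetical type for illustration; none of these names come from the text.
public sealed class StockItem
{
    private readonly string sku;      // immutable: used for equality and hashing
    private readonly int warehouseId; // immutable: used for equality and hashing
    private int quantity;             // mutable: deliberately excluded from both

    public StockItem(string sku, int warehouseId, int quantity)
    {
        if (sku == null) throw new ArgumentNullException("sku");
        this.sku = sku;
        this.warehouseId = warehouseId;
        this.quantity = quantity;
    }

    public int Quantity
    {
        get { return quantity; }
        set { quantity = value; }
    }

    public override bool Equals(object obj)
    {
        StockItem other = obj as StockItem;
        return other != null
            && sku == other.sku
            && warehouseId == other.warehouseId;
    }

    // XOR the hash codes of the immutable fields only.
    public override int GetHashCode()
    {
        return sku.GetHashCode() ^ warehouseId.GetHashCode();
    }
}

public static class XorHashDemo
{
    public static void Main()
    {
        var a = new StockItem("A-100", 7, 5);
        var b = new StockItem("A-100", 7, 9);
        int before = a.GetHashCode();
        a.Quantity = 42;

        Console.WriteLine(a.Equals(b));
        Console.WriteLine(a.GetHashCode() == b.GetHashCode());
        Console.WriteLine(before == a.GetHashCode());
    }
}
```

One caveat worth knowing: XOR is symmetric, so two fields of the same type whose values are swapped produce the same hash code, and two fields with identical hash codes cancel to zero. Whether that matters depends on the data your type actually holds.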


GetHashCode() has very specific requirements: Equal objects must produce equal hash codes, and hash codes must be object invariants and must produce an even distribution to be efficient. All three can be satisfied only for immutable types. For other types, rely on the default behavior, but understand the pitfalls.

