Tags: , , | Posted by Kevin Babcock on 7/1/2009 11:59 PM | Comments (0)

LINQ’s set operations provide an easy (and fast!) way to filter or combine collections of objects in .NET. The four extension methods that perform set operations – Distinct, Except, Intersect, and Union – are available through the IEnumerable<T> and IQueryable<T> interfaces and use an instance of IEqualityComparer<T> to produce a result based on the presence or absence of equivalent elements.

IEqualityComparer<T>

When performing set operations against collections of built-in types you don’t need to use your own IEqualityComparer<T> implementation; using the default, EqualityComparer<T>, is sufficient. However, if you are performing these operations against collections of your own complex types you will need to create your own and pass it in to the method call. The details of how IEqualityComparer<T> works is probably the single most important thing to understand when using set operations.

The IEqualityComparer<T> interface requires implementation of two methods, Equals() and GetHashCode(). The first is an obvious requirement: you need to check the members of the set against each other for equality. The second has been a source of much confusion. If you take a look at the implementation of set operations in .NET Reflector, you’ll see that when iterating through collections to perform equality checks, it uses both Equals() and GetHashCode() to check for equality. 

int hashCode = this.InternalGetHashCode(value);
for (int i = this.buckets[hashCode % this.buckets.Length] - 1; 
     i >= 0; 
     i = this.slots[i].next)
{
    if ((this.slots[i].hashCode == hashCode) 
        && this.comparer.Equals(this.slots[i].value, value))
    {
        return true;
    }
}

Here we see that the hash code is retrieved from a method called InternalGetHashCode(), which in turn calls the IEqualityComparer<T>’s GetHashCode() method. The actual values are stored in a private member array called slots and then another private array, buckets, is used to map values to an index based on the hash code. This is a hash table implementation, making value lookups very fast. It makes sense that we’d need two values considered to be equal to also return an identical hash code. Otherwise, the hash table would not work correctly and we’d have to iterate through the entire list of values to check each one for equality, which would seriously degrade performance. The use of hash tables is what makes the set operations so fast.

So with this newfound understanding of how IEqualityComparer<T> is used with set operations, we can design custom implementations correctly. Let’s create two simple implementations to cover the general cases that I have run into: one in which the equality of an object is based on the value of a single member, and another in which equality is based on two or more members.

For the first case, implementing IEqualityComparer<T> is simple:

public class SimpleData
{
    public string Value { get; set; }
}

public class SimpleDataComparer : IEqualityComparer<SimpleData>
{
    public bool Equals(SimpleData x, SimpleData y)
    {
        return x.Value == y.Value;
    }

    public int GetHashCode(SimpleData obj)
    {
        return obj.Value.GetHashCode();
    }
}

I have a class, SimpleData, that has a single member I want to use to test my class for equality. Notice in the implementation of SimpleDataComparer that I don’t call the object’s GetHashCode() method, but instead call GetHashCode() on the member being used to compare the data. This is an important distinction. Consider the following code snippet in which I use two SimpleData objects, each of which contains identical values. Calling GetHashCode() on these identical objects returns different values, whereas calling Value.GetHashCode() returns identical values.

var data1 = new SimpleData { Value = "SomeValue" };
var data2 = new SimpleData { Value = "SomeValue2" };
var hashCode1 = data1.GetHashCode();        // returns 37121646
var hashCode2 = data2.GetHashCode();        // returns 45592480
var hashCode3 = data1.Value.GetHashCode();  // returns -840951029 
var hashCode4 = data2.Value.GetHashCode();  // returns -840951029

As you can see, calling the object’s GetHashCode() method returns different values, whereas calling it on the specific member we are comparing returns the same value. Of course, we could call SimpleData.GetHashCode() if we provided an overloaded implementation that returned identical values for equal objects. As an aside, don’t get too hung up on the negative hash code value that was returned. Internally the leftmost bit in the value is actually flipped, making it a positive value, before it is used.

The second case I want to bring up is one in which you might want to use more than one member to determine equality. Consider the following example, in which I use a custom class, ComplexData.

var data1 = new ComplexData 
    { FirstValue = 0.12, SecondValue = 1.45m, ThirdValue = "MyValue" };
var data2 = new ComplexData 
    { FirstValue = 0.12, SecondValue = 1.45m, ThirdValue = "MyValue" };
var hashCode1 = data1.GetHashCode();                // returns 22008501
var hashCode2 = data2.GetHashCode();                // returns 9008175
var hashCode3 = data1.FirstValue.GetHashCode()  ^   // returns -228796266 
                data1.SecondValue.GetHashCode() ^
                data1.ThirdValue.GetHashCode();     
var hashCode4 = data2.FirstValue.GetHashCode()  ^   // returns -228796266
                data2.SecondValue.GetHashCode() ^
                data2.ThirdValue.GetHashCode();

If I want to test for equality using multiple members of my custom class I must ensure that GetHashCode() returns a unique value based on these each unique set of values for my objects. As you’ve seen in the example above, this can be accomplished very easily by calling GetHashCode() on each value used in the comparison and then combining the results with an exclusive OR operator (^). Consider the following implementation:

public class ComplexDataDataComparer : IEqualityComparer<ComplexData>
{
    public bool Equals(ComplexData x, ComplexData y)
    {
        return x.FirstValue  == y.FirstValue  && 
               x.SecondValue == y.SecondValue && 
               x.ThirdValue  == y.ThirdValue;
    }

    public int GetHashCode(ComplexData obj)
    {
        return obj.FirstValue.GetHashCode()  ^
               obj.SecondValue.GetHashCode() ^
               obj.ThirdValue.GetHashCode();
    }
}

So there you have it. It’s been a somewhat length explanation, but I’ve seen quite a few questions in forums and on blogs regarding how to properly implement IEqualityComparer<T> (and why). And since this is a crucial element when performing set operations on complex types, it must be understood. So let’s move on to set operations and how we can use what we’ve learned so far to take advantage of this great LINQ functionality.

Distinct

The Distinct() method returns all objects that are unique in the set.

var numbers = new List<int> { 1, 2, 3, 2, 4, 1 };
// returns { 1, 2, 3, 4 }
var distinctNumbers = numbers.Distinct();          

You can also pass in an instance of your own class to compare for equality:

var set = new List<ComplexData>
{
    new ComplexData { FirstValue = 1.2, SecondValue = 0.1m, ThirdValue = "Ab" },
    new ComplexData { FirstValue = 3.1, SecondValue = 9.7m, ThirdValue = "Cd" },
    new ComplexData { FirstValue = 1.2, SecondValue = 0.1m, ThirdValue = "Ab" },
    new ComplexData { FirstValue = 2.1, SecondValue = 3.4m, ThirdValue = "Xy" }
};
var comparer = new ComplexDataDataComparer();
var distinctSet = set.Distinct<ComplexData>(comparer);

distinct set operation

Except

The Except() method returns the unique set of objects from one set that do not appear in a second set.

var numbers = new List<int> { 1, 2, 3, 2, 4, 1 };
var otherNumbers = new List<int> { 3, 4, 1, 9, 4, 7 };
// returns { 2 }
var filteredNumbers = numbers.Except(otherNumbers);    

You can also pass in an instance of your own class to compare for equality:

var set = new List<ComplexData>
{
    new ComplexData { FirstValue = 1.2, SecondValue = 0.1m, ThirdValue = "Ab" },
    new ComplexData { FirstValue = 3.1, SecondValue = 9.7m, ThirdValue = "Cd" },
    new ComplexData { FirstValue = 1.2, SecondValue = 0.1m, ThirdValue = "Ab" },
    new ComplexData { FirstValue = 2.1, SecondValue = 3.4m, ThirdValue = "Xy" }
};
var secondSet = new List<ComplexData>
{
    new ComplexData { FirstValue = 2.1, SecondValue = 3.4m, ThirdValue = "Xy" },
    new ComplexData { FirstValue = 7.1, SecondValue = 4.5m, ThirdValue = "Ef" },
    new ComplexData { FirstValue = 3.1, SecondValue = 9.7m, ThirdValue = "Cd" },
    new ComplexData { FirstValue = 0.4, SecondValue = 3.9m, ThirdValue = "Rs" }
};
var comparer = new ComplexDataDataComparer();
var filteredSet = set.Except(secondSet, comparer);

except set operation

Intersect

The Intersect() method returns all objects that two sets share in common.

var numbers = new List<int> { 1, 2, 3, 2, 4, 1 };
var otherNumbers = new List<int> { 3, 4, 1, 9, 4, 7 };
// returns { 1, 3, 4 }
var commonNumbers = numbers.Intersect(otherNumbers); 

You can also pass in an instance of your own class to compare for equality:

var set = new List<ComplexData>
{
    new ComplexData { FirstValue = 1.2, SecondValue = 0.1m, ThirdValue = "Ab" },
    new ComplexData { FirstValue = 3.1, SecondValue = 9.7m, ThirdValue = "Cd" },
    new ComplexData { FirstValue = 1.2, SecondValue = 0.1m, ThirdValue = "Ab" },
    new ComplexData { FirstValue = 2.1, SecondValue = 3.4m, ThirdValue = "Xy" }
};
var secondSet = new List<ComplexData>
{
    new ComplexData { FirstValue = 2.1, SecondValue = 3.4m, ThirdValue = "Xy" },
    new ComplexData { FirstValue = 7.1, SecondValue = 4.5m, ThirdValue = "Ef" },
    new ComplexData { FirstValue = 3.1, SecondValue = 9.7m, ThirdValue = "Cd" },
    new ComplexData { FirstValue = 0.4, SecondValue = 3.9m, ThirdValue = "Rs" }
};
var comparer = new ComplexDataDataComparer();
var commonSet = set.Intersect(secondSet, comparer);

intersect set operation

Union

The Union() method returns the combined set of unique objects from two sets.

var numbers = new List<int> { 1, 2, 3, 2, 4, 1 };
var otherNumbers = new List<int> { 3, 4, 1, 9, 4, 7 };
// returns { 1, 2, 3, 4, 9, 7 }
var combinedUniqueNumbers = numbers.Union(otherNumbers); 

You can also pass in an instance of your own class to compare for equality:

var set = new List<ComplexData>
{
    new ComplexData { FirstValue = 1.2, SecondValue = 0.1m, ThirdValue = "Ab" },
    new ComplexData { FirstValue = 3.1, SecondValue = 9.7m, ThirdValue = "Cd" },
    new ComplexData { FirstValue = 1.2, SecondValue = 0.1m, ThirdValue = "Ab" },
    new ComplexData { FirstValue = 2.1, SecondValue = 3.4m, ThirdValue = "Xy" }
};
var secondSet = new List<ComplexData>
{
    new ComplexData { FirstValue = 2.1, SecondValue = 3.4m, ThirdValue = "Xy" },
    new ComplexData { FirstValue = 7.1, SecondValue = 4.5m, ThirdValue = "Ef" },
    new ComplexData { FirstValue = 3.1, SecondValue = 9.7m, ThirdValue = "Cd" },
    new ComplexData { FirstValue = 0.4, SecondValue = 3.9m, ThirdValue = "Rs" }
};
var comparer = new ComplexDataDataComparer();
var combinedUniqueValues = set.Union(secondSet, comparer);

union set operation

Wrap Up

Using set operations is very easy and can be used to perform fast set operations on your collections. When performing set operations on collections of complex types, it is important to understand how to properly implement IEqualityComparer<T>. I used LINQ to Objects in the examples of this blog post, but you can just as easily take this concepts to LINQ to SQL, LINQ to Entities, or other LINQ implementations.

Have fun!

kick it on DotNetKicks.com