Computatus: Adventures with Pangram

I was recently given a fun little puzzle. A Pangram. I'm not going to pretend that I knew what a Pangram was before it was explained to me. Basically it is a sentence that has every letter of the alphabet. For instance 'The quick brown fox jumped over the lazy dog'. Wow, I've used Pangrams without even knowing it!

First Concept

Not so great. My first thought is to have an array of the alphabet chars, iterate through the input and check to see if the current char is used and store the result somewhere. Meh, it works but it's not that exciting really. It leaves a lot to be desired as far as elegant algorithm is concerned much less performance. There is a lot of checking you have to do and I wanted to find a better solution (without googling the answer of course). I'm not so concerned with readability as I am with performance (as this is a puzzle and not real production code).

Second Concept

I walked away and when I got into my car it immediately dawned on me another possible solution. 26 letters in the alphabet could be represented by a single 32 bit integer. If all the letters in the alphabet were represented by bit positions (a=0, b=1, c=2...) and all the letters in the alphabet were present then all the bits in the integer would be 1s instead of 0s. You would already know that the answer is:

0011 1111 1111 1111 1111 1111 1111

67,108,863

This is an intriguing idea as the operation to check if the integer value of the bits your flipping >= 67,108,863 is very fast and very simple. Sounds good right? Well as it turns out... it works great! Here is the end result:

If you go to the 'Gist' snippet at https://gist.github.com/hobeau/5b941124719a10ca2ce6 you can see the original implementation along with the revisions.

First Iteration

My first iteration was to use BitVector32 which is a special .net collection that is very good and VERY fast at bit operations with a 32 bit integer. It is a bit difficult to grasp at first as it requires a few things such as a 'mask' array to point your code to each bit based on a zero based array of bits. To see more check out the source code here. Basically it uses a private uint field to keep track of the current value of the 32 bit integer and creates a zero-based indexer with a set operation that performs an OR or AND bit operation based on the boolean value that is passed from the calling code. The CreateMask function literally just takes in an integer value and bit shifts it by 1 to return an integer result that is the bit value of the current mask. All that to say, it is a helper method to build an array that looks like:

[1,2,4,8,16...]

This allows you to flip bits by the internal |= operation. So if you are rusty on your binary math, here's a quick refresher. If you 4 |= 8 you are essentially saying:

00100000 |
00010000

= 00110000 => 12

(note I'm only doing 8 bit bytes instead of 32 bit integers for simplicity and going left to right)

Second Iteration

My second iteration was to realize that setting up the mask and using BitVector32 was great but it also had more overhead than I really needed. Why not just do the bit operations manually? This increased performance greatly.

Next Iterations...

The next iterations included simple things such as:

Creating a char[] at the beginning instead of dealing with the string objects indexer
Changing the loop into an unrolled loop. (the CLR only has to do bound-checking for half the iterations instead of every one)
Checking to see if the iterator ('i') was greater than 25 (no need to check the value of the integer if the iterator wasn't passed the possible 26 chars in the alphabet
Performing bit operations by hand vs using BitVector32

Performance

After I finished I figured I would check to see how this algorithm performs compared to some other algorithms out there. I found http://rosettacode.org/wiki/Pangram_checker#C.23 which was just what I needed. The test I run is using System.StopWatch and ran 100 rounds of 100,000 iterations of my code vs the examples I found on rosettacode. The first example uses Linq and performance isn't really in mind. Ellapsed milliseconds is about 500 on average. The second example runs at about 56-58 milliseconds which is much better. The algorithm I wrote runs on average of 21-23 milliseconds so even faster with all of the tweaks/modifications.

Conclusion

This was a really fun little project. Working with low level bits tends to be very efficient. There is a lot more to talk about with BitVector32 and I'll talk about it more in another post. However, in this case I found it to be faster and more efficient to do the bit operations by hand vs using BitVector32. It tends to be a bit more confusing code but again this is really more of an algorithm (a toy algorithm at that) so performance is all I'm going for here and not readability.

Addendum

Note that as this is really just a toy algorithm, this assumes an ASCII string.

Computatus

Saturday, February 21, 2015

Adventures with Pangram