De Gan filter implementation

Joel De Gan - Mon, April 26 2004

The purpose of this one-way-hash filter is to have large scale hashing in a small subset of data.

It is similar to a bloom filter except that instead of using a binary system the filter uses an a-z system where each point can hold more that one value and is not limited to binary values.

  • View source code v.01 going to add in bitwise shifting soon.
  • View example of a spellchecker done entirely in php with a 300,000 word dictionary (2.5M), the hash table is a 50x50 (1 in 3,250,000 probabilty) at that size the hash table is 419k serialized and uncompressed, it could be done in a 25x25 hash table which would be around 200k however the probabilty of false positives is much much higher (Compare 25x25 matrix example against the above example).


    Why use these?

    A 250x250 matrix (large) has a unique factor of over 40 million and is around 16 megabytes blank. The structure is designed for large hashes. It would be easily possible to hold the entire domain name system hashed in around 20 megs. What this means is you could have very fast lookups on if a name is in the system.


    How does this filter work?

    Each item hashed in the database will have a unique value set to each of the above markers. For lookups, the letter must be in two exact places withing the entire filter. In the above example of the spelling checker, there is a 1/1.625M probabily of a duplicate at any given point (due to doubling data inserting).

    These filters can take a while to generate, they are very processor intensive to create. These filters use a matrix format for speed in lookups, each matrix point holds a small filter.


    How is that this filter can hold so much information?

    Example small filter

    0000000000

    Insert "d" at position '2' (note:Counting starts at zero)

    00d0000000

    Insert "z" at position '5' (note:Counting starts at zero)

    00d00z0000

    Now it becomes interesting. insert "f" at position '2' (on top of another letter in place)

    00Df00z0000

    If another needs to be positioned at the same point. insert "g" at position '2' (on top of another two letters in place)

    00DFg00z0000

    The array maintains all positions and if it encounters an uppercase letter it concats until a non-uppercase letter is found.

    Array
    (
        [0] => 0
        [1] => 0
        [2] => DFg
        [3] => 0
        [4] => 0
        [5] => z
        [6] => 0
        [7] => 0
        [8] => 0
        [9] => 0
    )
    

    Multiple hashing algorithms are used for placements.

    TO DO: