NAME

src/hash.c - Hash table

DESCRIPTION

A hashtable contains an array of bucket indexes. Buckets are nodes in a linked list, each containing a STRING key and a value. The value is currently stored as a HASH_ENTRY, which maybe makes sense for some hashes but probably doesn't for what they're currently used for, which is PerlHash and Hash PMCs (since those should probably just be hashes of STRINGs mapping to PMCs.)

To minimize memory overhead, buckets are carved out of a pool that is allocated normally by parrot's memory subsystem. That means that the pool can get moved around a lot.

Non-ASCII Keys

The USE_STRING_EQUAL define allows 2 different strategies to deal with non-ASCII hash keys.

#define USE_STRING_EQUAL 1: Original code. Keys are stored as is. Each compare (which can happen more then once per lookup) possibly transcodes keys and then compares these keys.
#define USE_STRING_EQUAL 0: This is the default. Ascii keys are stored as is. As soon as the first non-ASCII key is stored in the Hash, all keys are converted to utf8 and the key is stored in that encoding too.; Keys are transcoded once before insert or lookup. The only (theoretical?) problem is, that keys are either just stored or (after transcoding) a copy of a key is stored. This could probably be solved by transcoding the string in place.

Timing of (string, hash -O3, Athlon-800)

    % parrot -C examples/benchmarks/hash-utf8.pasm
    #define USE_STRING_EQUAL 1
    0.35
    8.2
    #define USE_STRING_EQUAL 0
    0.36
    0.54

Functions

static PARROT_INLINE HashBucket *getBucket(Hash *hash, BucketIndex idx): Returns the bucket at bucket index idx.
static PARROT_INLINE BucketIndex lookupBucketIndex(Hash *hash, HashIndex slot): Returns bucket index for hash index slot.
static PARROT_INLINE HashBucket *lookupBucket(Hash *hash, HashIndex slot): Returns the bucket for hash index slot.
static size_t key_hash_STRING(Interp *interpreter, Hash *hash, void *value): Return the hashed value of the string value.
static int STRING_compare(Parrot_Interp interp, void *a, void *b): Compares the two strings, return 0 if they are identical.; a is the search key, b is the bucket key.
static size_t key_hash_cstring(Interp *interpreter, Hash *hash, void *value)
static int cstring_compare(Parrot_Interp interp, void *a, void *b): C string versions of the key_hash and compare functions.
void dump_hash(Interp *interpreter, Hash *hash): Print out the hash in human-readable form.
void mark_hash(Interp *interpreter, Hash *hash): Marks the hash and its contents as live.
void hash_visit(Interp *interpreter, Hash *hash, void *pinfo): This is used by freeze/thaw to visit the contents of the hash.; pinfo is the visit info, (see include/parrot/pmc_freeze.h>).
static void expand_hash(Interp *interpreter, Hash *hash): For a hashtable of size N, we use MAXFULL_PERCENT % of N as the number of buckets. This way, as soon as we run out of buckets on the free list, we know that it's time to resize the hashtable.; Algorithm for expansion: We exactly double the size of the hashtable. Keys are assigned to buckets with the formula; so when doubling the size of the hashtable, we know that every key is either already in the correct bucket, or belongs in the current bucket plus hash_size (the old hash_size). In fact, because the hashtable is always a power of two in size, it depends only on the next bit in the hash value, after the ones previously used.; So we scan through all the buckets in order, moving the buckets that need to be moved. No bucket will be scanned twice, and the cache should be reasonably happy because the hashtable accesses will be two parallel sequential scans. (Of course, this also mucks with the ->next pointers, and they'll be all over memory.)
static BucketIndex new_bucket(Interp *interpreter, Hash *hash, void *key, void *value): Returns a new bucket with key and value set.
static HashBucket *find_bucket(Interp *interpreter, Hash *hash, BucketIndex head, void *key): Returns the bucket whose key is equal to key.
void new_hash(Interp *interpreter, Hash **hptr): Returns a new Parrot string hash in hptr.; new_pmc_hash(Interp *interpreter, PMC *container)>; Create a new Parrot string hash in PMC_struct_val(container)
void new_cstring_hash(Interp *interpreter, Hash **hptr): Returns a new C string hash in hptr.
void new_hash_x(Interp *interpreter, Hash **hptr, PARROT_DATA_TYPES val_type, size_t val_size, Hash_key_type hkey_type, hash_comp_fn compare, hash_hash_key_fn keyhash, hash_mark_key_fn mark): Returns a new hash in hptr.; FIXME: This function can go back to just returning the hash struct pointer once Buffers can define their own custom mark routines.; The problem is: During DODs stack walking the item on the stack must be a PMC. When an auto Hash* is seen, it doesn't get properly marked (only the Hash* buffer is marked, not its contents). By passing the **hptr up to the PerlHash's or Hash's init function, the newly constructed PMC is on the stack including this newly constructed Hash, so that it gets marked properly.
void new_pmc_hash_x(Interp *interpreter, PMC *container, PARROT_DATA_TYPES val_type, size_t val_size, Hash_key_type hkey_type, hash_comp_fn compare, hash_hash_key_fn keyhash, hash_mark_key_fn mark): Like above but w/o the decribed problems. The passed in container PMC gets stored in the Hash end the newly created Hash is in PMC_struct_val(container).
INTVAL hash_size(Interp *interpreter, Hash *hash): Return the number of used entries in the hash.
STRING *hash_get_idx(Interp *interpreter, Hash *hash, PMC *key): Called by interator.
HashBucket *hash_get_bucket(Interp *interpreter, Hash *hash, void *key): Returns the bucket for key.
void *hash_get(Interp *interpreter, Hash *hash, void *key): Returns the bucket for key or NULL if no bucket is found.
INTVAL hash_exists(Interp *interpreter, Hash *hash, void *key): Returns whether the key exists in the hash.
HashBucket *hash_put(Interp *interpreter, Hash *hash, void *key, void *value): Puts the key and value into the hash. Note that key is not copied.
void hash_delete(Interp *interpreter, Hash *hash, void *key): Deletes the key from the hash.
void hash_clone(Interp *interp, Hash *hash, Hash **dest): Clones hash to dest.

HISTORY

Initial version by Jeff G. on 2001.12.05

Substantially rewritten by Steve F.

2003.10.25

leo add function pointer for compare, hash, mark

hash keys are now (void *)

add new_cstring_hash() function

2003.11.04

bucket->value is now a plain pointer, no more an HASH_ENTRY

With little changes, we can again store arbitrary items if needed, see TODO in code.

2003.11.06

boemmels renamed HASH and HASHBUCKET to Hash and HashBucket

2003.11.11

leo randomize key_hash seed

extend new_hash_x() init call by value_type and _size.

2003.11.14

leo USE_STRING_EQUAL define, see comment above

TODO

Future optimizations:

Stop reallocating the bucket pool, and instead add chunks on. (Saves pointer fixups and copying during realloc.)

Hash contraction (dunno if it's worth it)