NAME

docs/pdds/pdd28_strings.pod - Parrot Strings

ABSTRACT

This PDD describes the conventions for strings in Parrot, including but not limited to support for multiple character sets, encodings, and languages.

VERSION

$Revision$

DEFINITIONS

Character

A character is the abstract description of a symbol. It's the smallest chunk of text a computer knows how to deal with. Internally to the computer, a character (just like everything else) is a number, so a few further definitions are needed.

Character Set

The Unicode Standard prefers the concepts of character repertoire (a collection of characters) and character code (a mapping which tells you what number represents which character in the repertoire). Character set is commonly used to mean the standard which defines both a repertoire and a code.

Codepoint

A codepoint is the numeric representation of a character according to a given character set. So in ASCII, the character A has codepoint 0x41.

Encoding

An encoding determines how a codepoint is represented inside a computer. Simple encodings like ASCII define that the codepoints 0-127 simply live as their numeric equivalents inside an eight-bit bytes. Other fixed-width encodings like UTF-16 use more bytes to encode more codepoints. Variable-width encodings like UTF-8 use one byte for codepoints 0-127, two bytes for codepoints 127-2047, and so on.

Character sets and encodings are related but separate concepts. An encoding is the lower-level representation of a string's data, whereas the character set determines higher-level semantics. Typically, character set functions will ask a string's encoding functions to retrieve data from the string, and then process the retrieved data.

Combining Character

A combining character is a Unicode concept. It is a character which modifies the preceding character. For instance, accents, lines, circles, boxes, etc. which are not to be displayed on their own, but to be composed with the preceding character.

Grapheme

In linguistics, a grapheme is a single symbol in a writing system (letter, number, punctuation mark, kanji, hiragana, Arabic glyph, Devanagari symbol, etc), including any modifiers (diacritics, etc).

The Unicode Standard defines a grapheme cluster (commonly simplified to just grapheme) as one or more characters forming a visible whole when displayed, in other words, a bundle of a character and all of its combining characters. Because graphemes are the highest-level abstract idea of a "character", they're useful for converting between character sets.

Normalization Form

A normalization form standardizes the representation of a string by transforming a sequence of combining characters into a more complex character (composition), or by transforming a complex character into a sequence of composing characters (decomposition). The decomposition forms also define a standard order for the composing characters, to allow string comparisons. The Unicode Standard defines four normalization forms: NFC and NFKC are composition, NFD and NFKD are decomposition. See Unicode Normalization Forms for more details.

Grapheme Normalization Form

Grapheme normalization form (NFG) is a normalization which allocates exactly one codepoint to each grapheme.

DESCRIPTION

Parrot supports multiple string formats, and so users of Parrot strings must be aware at all times of string encoding issues and how these relate to the string interface.

Parrot provides an interface for interacting with strings and converting between character sets and encodings.

Operations that require understanding the semantics of a string must respect the character set of the string.

Operations that require understanding the layout of the string must respect the encoding of the string.

In addition to common string formats, Parrot provides an additional string format that is a sequence of 32-bit Unicode codepoints in NFG.

IMPLEMENTATION

Parrot was designed from the outset to support multiple string formats: multiple character sets and multiple encodings. We don't standardize on Unicode internally, converting all strings to Unicode strings, because for the majority of use cases it's still far more efficient to deal with whatever input data the user sends us.

Consumers of Parrot strings need to be aware that there is a plurality of string encodings inside Parrot. (Producers of Parrot strings can do whatever is most efficient for them.) To put it in simple terms: if you find yourself writing *s++ or any other C string idioms, you need to stop and think if that's what you really mean. Not everything is byte-based anymore.

Grapheme Normalization Form

Unicode characters can be expressed in a number of different ways according to the Unicode Standard. This is partly to do with maintaining compatibility with existing character encodings. For instance, in Serbo-Croatian and Slovenian, there's a letter which looks like an i without the dot but with two grave (`) accents (ȉ). Unicode can represent this letter as a composed character 0x209, also known as LATIN SMALL LETTER I WITH DOUBLE GRAVE, which does the job all in one go. It can also represent this letter as a decomposed sequence: LATIN SMALL LETTER I (0x69) followed by COMBINING DOUBLE GRAVE ACCENT (0x30F). We use the term grapheme to refer to a "letter" whether it's represented by a single codepoint or multiple codepoints.

String operations on this kind of variable-byte encoding can be complex and expensive. Operations like comparison and traversal require a series of computations and lookaheads, because any given grapheme may be a sequence of combining characters. The Unicode Standard defines several "normalization forms" that help with this problem. Normalization Form C (NFC), for example, decomposes everything, then re-composes as much as possible. So if you see the integer stream 0x69 0x30F, it needs to be replaced by 0x209. However, Unicode's normalization forms don't go quite far enough to completely solve the problem. For example, Serbo-Croat is sometimes also written with Cyrillic letters rather than Latin letters. Unicode doesn't have a single composed character for the Cyrillic equivalent of the Serbo-Croat LATIN SMALL LETTER I WITH DOUBLE GRAVE, so it is represented as a decomposed pair CYRILLIC SMALL LETTER I (0x438) with COMBINING DOUBLE GRAVE ACCENT (0x30F). This means that even in the most normalized Unicode form, string manipulation code must always assume a variable-byte encoding, and use expensive lookaheads. The cost is incurred on every operation, though the particular string operated on might not contain combining characters. It's particularly noticeable in parsing and regular expression matches, where backtracking operations may re-traverse the characters of a simple string hundreds of times.

In order to reduce the cost of variable-byte operations and simplify some string manipulation tasks, Parrot defines an additional normalization: Normalization Form G (NFG). In NFG, every grapheme is guaranteed to be represented by a single codepoint. Graphemes that don't have a single codepoint representation in Unicode are given a dynamically generated codepoint unique to the NFG string.

An NFG string is a sequence of signed 32-bit Unicode codepoints. It's equivalent to UCS-4 except for the normalization form semantics. UCS-4 specifies an encoding for Unicode codepoints from 0 to 0x7FFFFFFF. In other words, any codepoints with the first bit set are undefined. NFG interprets the unused bit as a sign bit, and reserves all negative codepoints as dynamic codepoints. A negative codepoint acts as an index into a lookup table, which maps between a dynamic codepoint and its associated decomposition.

In practice, this goes as follows: When our Russified Serbo-Croat string is converted to NFG, it is normalized to a single character having the codepoint 0xFFFFFFFFF (in other words, -1 in 2's complement). At the same time, Parrot inserts an entry into the string's grapheme table at array index -1, containing the Unicode decomposition of the grapheme 0x00000438 0x000000030F.

Parrot will provide both grapheme-aware and codepoint-aware string operations, such as iterators for string traversal and calculations of string length. Individual language implementations can choose between the two types of operations depending on whether their string semantics are character-based or codepoint-based. For languages that don't currently have Unicode support, the grapheme operations will allow them to safely manipulate Unicode data without changing their string semantics.

Advantages

Applications that don't care about graphemes can handle a NFG codepoint in a string as if it's any other character. Only applications that care about the specific properties of Unicode characters need to take the load of peeking inside the grapheme table and reading the decomposition.

Using negative numbers for dynamic codepoints allows Parrot to check if a particular codepoint is dynamic using a single sign-comparison operation. It also means that NFG can be used without conflict on encodings from 7-bit (signed 8-bit integers) to 63-bit (using signed 64-bit integers) and beyond.

Because any grapheme from any character set can be represented by a single NFG codepoint, NFG strings are useful as an intermediate representation for converting between string types.

Disadvantages

A 32-bit encoding is quite large, considering the fact that the Unicode codespace only requires up to 0x10FFFF. The Unicode Consortium's FAQ notes that most Unicode interfaces use UTF-16 instead of UTF-32, out of memory considerations. This means that although Parrot will use 32-bit NFG strings for optimizations within operations, for the most part individual users should use the native character set and encoding of their data, rather than using NFG strings directly.

The conceptual cost of adding a normalization form beyond those defined in the Unicode Standard has to be considered. However, to fully support Unicode, Parrot already needs to keep track of what normalization form a given string is in, and provide functions to convert between normalization forms. The conceptual cost of one additional normalization form is relatively small.

The grapheme table

When constructing strings in NFG, graphemes not expressible as a single character in Unicode are represented by a dynamic codepoint index into the string's grapheme table. When Parrot comes across a multi-codepoint grapheme, it must first determine whether or not the grapheme already has an entry in the grapheme table. Therefore the table cannot strictly be an array, as that would make lookup inefficient. The grapheme table is represented, then, as both an array and a hash structure. The array interface provides forward-lookup and the hash interface reverse lookup. Converting a multi-codepoint grapheme into a dynamic codepoint can be demonstrated with the following Perl 5 pseudocode, for the grapheme 0x438 0x30F:

   $codepoint = ($grapheme_lookup->{0x438}{0x30F} ||= do {
                   push @grapheme_table, "\x{438}\x{30F}";
                   ~ $#grapheme_table;
                });
   push @string, $codepoint;

String API

Strings in the Parrot core should use the Parrot STRING structure. Parrot developers generally shouldn't deal with char * or other string-like types outside of this abstraction. It's also best not to access members of the STRING structure directly. The interpretation of the data inside the structure is determined by the data's encoding. Parrot's strings are encoding-aware so your functions don't need to be.

Parrot's internal strings (STRINGs) have the following structure:

  struct parrot_string_t {
      UnionVal                      cache;
      Parrot_UInt                   flags;
      UINTVAL                       bufused;
      UINTVAL                       strlen;
      UINTVAL                       hashval;
      const struct _encoding       *encoding;
      const struct _charset        *charset;
      const struct _normalization  *normalization;
  };

The fields are:

cache: A structure that holds the buffer for the string data and the size of the buffer in bytes.; {{ NOTE: this is currently called "cache" for compatibility with PMC structures. As we eliminate the cache from PMCs, we will flatten out this union value in the string structure to two members: a string buffer and the size of the buffer used. }}
flags: Binary flags used for garbage collection, copy-on-write tracking, and other metadata.
bufused: The amount of the buffer currently in use, in bytes.
strlen: The length of the string, in bytes. {{NOTE, not in characters, as characters may be variably sized.}}
hashval: A cache of the hash value of the string, for rapid lookups when the string is used as a hash key.
encoding: How the data is encoded (e.g. fixed 8-bit characters, UTF-8, or UTF-32). Note that this specifies encoding only -- it's valid to encode EBCDIC characters with the UTF-8 algorithm. Silly, but valid.; The encoding structure specifies the encoding (by index number and by name, for ease of lookup), the maximum number of bytes that a single character will occupy in that encoding, as well as functions for manipulating strings with that encoding.
charset: What sort of string data is in the buffer, for example ASCII, EBCDIC, or Unicode.; The charset structure specifies the character set (by index number and by name) and provides functions for transcoding to and from that character set.
normalization: What normalization form the string data is in, one of the four Unicode normalization forms or NFG. This structure stores the current normalization form, function pointers for composition and decomposition for the current normalization form, and a pointer to the grapheme table for NFG.

{{DEPRECATION NOTE: the enum parrot_string_representation_t will be removed from the parrot string structure. It's been commented out for years.}}

{{DEPRECATION NOTE: the char * pointer strstart will be removed. It complicates the entire string subsystem for a tiny optimization on substring operations, and offset math is messy with encodings that aren't byte-based.}}

Conversions between normalization form, encoding, and charset

Conversion will be done with a function called Parrot_string_grapheme_copy:

    INTVAL string_grapheme_copy(STRING *src, STRING *dst)

Converting a string from one format to another involves creating a new empty string with the required attributes, and passing the source string and the new string to string_grapheme_copy. This function iterates through the source string one grapheme at a time, using the character set function pointer get_grapheme (which may read ahead multiple characters with strings that aren't in NFG). For each source grapheme, the function will call set_grapheme on the destination string (which may append multiple characters in non-NFG strings). This conversion effectively uses an intermediate NFG representation.

String Interface Functions

The current string functions will be maintained, with some modifications for the addition of the NFG string format. Many string functions that are part of Parrot's external API will be renamed for the standard "Parrot_*" naming conventions.

Parrot_string_set (was string_set)

Set one string to a copy of the value of another string.

Parrot_string_new_COW (was Parrot_make_COW_reference)

Create a new copy-on-write string. Creating a new string header, clone the struct members of the original string, and point to the same string buffer as the original string.

Parrot_string_reuse_COW (was Parrot_reuse_COW_reference)

Create a new copy-on-write string. Clone the struct members of the original string into a passed in string header, and point the reused string header to the same string buffer as the original string.

Parrot_string_write_COW (was Parrot_unmake_COW)

If the specified Parrot string is copy-on-write, copy the string's contents to a new string buffer and clear the copy-on-write flag.

Parrot_string_concat (was string_concat)

Concatenate two strings. Takes three arguments: two strings, and one integer value of flags. If both string arguments are null, return a new string created according to the integer flags.

Parrot_string_append (was string_append)

Append one string to another and return the result. In the default case, the return value is the same as the first string argument (modifying that argument in place). If the first argument is COW or read-only, then the return value is a new string.

Parrot_string_from_cstring (was string_from_cstring)

Create a Parrot string from a C string (a char *). Takes two arguments, a C string, and an integer length of the string (number of characters). If the integer length isn't passed, the function will calculate the length.

{{NOTE: the integer length isn't really necessary, and is under consideration for deprecation.}}

Parrot_string_new

Return a new string with the default encoding and character set. Accepts one argument, a C string (char *) to initialize the value of the string.

Parrot_string_new_noinit (was string_make_empty)

Returns a new empty string with the default encoding and chararacter set.

Parrot_string_new_init (was string_make_direct)

Returns a new string of the requested encoding, character set, and normalization form, initializing the string value to the value passed in. The five arguments are a C string (char *), an integer length of the string argument in bytes, and struct pointers for encoding, character set, and normalization form structs. If the C string (char *) value is not passed, returns an empty string. If the encoding, character set, or normalization form are passed as null values, default values are used.

{{ NOTE: the crippled version of this function, string_make, used to accept a string name for the character set. This behavior is no longer supported, but Parrot_find_encoding and Parrot_find_charset can look up the encoding or character set structs. }}

Parrot_constant_string_new (was const_string)

Creates and returns a new Parrot constant string. Takes one C string (a char *) as an argument, the value of the constant string. The length of the C string is calculated internally.

Parrot_string_resize (was string_grow)

Resize the string buffer of the given string adding the number of bytes passed in the integer argument. If the argument is negative, remove the given number of bytes. Throws an exception if shrinking the string buffer size will truncate the string (if strlen will be longer than buflen).

Parrot_string_length (was string_compute_strlen)

Returns the number of characters in the string. Combining characters are each counted separately. Variable-width encodings may lookahead.

Parrot_string_grapheme_length

Returns the number of graphemes in the string. Groups of combining characters count as a single grapheme.

Parrot_string_byte_length (was string_length)

Returns the number of bytes in the string. The character width of variable-width encodings is ignored. Combining characters are not treated any differently than other characters. This is equivalent to accessing the strlen member of the STRING struct directly.

Parrot_string_index (was string_index)

Returns the character at the specified index (the Nth character from the start of the string). Combining characters are counted separately. Variable-width encodings will lookahead to capture full character values.

Parrot_string_grapheme_index

Returns the grapheme at the given index (the Nth grapheme from the string's start). Groups of combining characters count as a single grapheme, so this function may return multiple characters.

Parrot_string_find_substr (was string_str_index)

Search for a given substring within a string. If it's found, return an integer index to the substring's location (the Nth character from the start of the string). Combining characters are counted separately. Variable-width encodings will lookahead to capture full character values. Returns -1 unless the substring is found.

init: Initialize a new String PMC.
new_from_string: Create a new String PMC from a Parrot string argument.
clone: Clone a String PMC.
mark: Mark the string value of the String PMC as live.
get_integer: Return the integer representation of the string.
get_number: Return the floating-point representation of the string.
get_bignum: Return the big number representation of the string.
get_string: Return the string value of the String PMC.
get_bool: Return the boolean value of the string.
set_integer_native: Set the string to an integer value, transforming the integer to its string equivalent.
set_bool: Set the string to a boolean (integer) value, transforming the boolean to its string equivalent.
set_number_native: Set the string to a floating-point value by transforming the number to its string equivalent.
set_string_native: Set the String PMC's stored string value to be the string argument. If the passed in string is a constant, store a copy.
assign_string_native: Set the String PMC's stored string value to a copy of the string argument.
set_string_same: Set the String PMC's stored string value to the same as another String PMC's stored string value. {{NOTE: uses direct access into the storage of the two PMCs, very ugly.}}
set_pmc: Set the String PMC's stored string value to the same as another PMC's string value, as returned by that PMC's get_string vtable function.
*bitwise*: All the bitwise string vtable functions, for AND, OR, XOR, and NOT, both inplace and standard return.
is_equal: Compares the string values of two PMCs and returns true if they match exactly.
is_equal_num: Compares the numeric values of two PMCs (first transforming any strings to numbers) and returns true if they match exactly.
is_equal_string: Compares the string values of two PMCs and returns true if they match exactly. {{ NOTE: the documentation for the PMC says that it returns FALSE if they match. This is not the desired behavior. }}
is_same: Compares two PMCs and returns true if they are the same PMC class and contain the same string (not an equivalent string value, but aliases to the same low-level string).
cmp: Compares two PMCs and returns 1 if SELF is shorter, 0 if they are equal length strings, and -1 if the passed in string argument is shorter.
cmp_num: Compares the numeric values of two PMCs (first changing those values to numbers) and returns 1 if SELF is smaller, 0 if they are equal, and -1 if the passed in string argument is smaller.
cmp_string: Compares two PMCs and returns 1 if SELF is shorter, 0 if they are equal length strings, and -1 if the passed in string argument is shorter.
substr: Extract a substring of a given length starting from a given offset (in graphemes) and store the result in the string argument.
substr_str: Extract a substring of a given length starting from a given offset (in graphemes) and return the string.
exists_keyed: Return true if the Nth grapheme in the string exists. Negative numbers count from the end.
get_string_keyed: Return the Nth grapheme in the string. Negative numbers count from the end.
set_string_keyed: Insert a string at the Nth grapheme position in the string. {{NOTE: this is different than the current implementation.}}
get_integer_keyed: Returns the integer value of the Nth char in the string. {{DEPRECATE}}
set_integer_keyed: Replace the char at the Nth character position in the string with the char that corresponds to the passed integer value key. {{DEPRECATE}}

Methods

The String PMC provides the following methods.

replace: Replace every occurance of one string with another.
to_int: Return the integer equivalent of a string.
lower: Change the string to all lowercase.
trans: Translate an ASCII string with entries from a translation table.; {{NOTE: likely to be deprecated.}}
reverse: Reverse a string, one grapheme at a time. {{ NOTE: Currenly only works for ASCII strings, because it reverses one char at a time. }}
is_integer: Checks if the string is just an integer. {{ NOTE: Currently only works for ASCII strings, fix or deprecate. }}

REFERENCES

http://sirviente.9grid.es/sources/plan9/sys/doc/utf.ps - Plan 9's Runes are not dissimilar to NFG strings, and this is a good introduction to the Unicode world.

http://www.unicode.org/reports/tr15/ - The Unicode Consortium's explanation of different normalization forms.

http://unicode.org/reports/tr29/ - "grapheme clusters" in the Unicode Standard Annex

"Unicode: A Primer", Tony Graham - Arguably the most readable book on how Unicode works.

"Advanced Perl Programming", Chapter 6, "Unicode"

parrotcode: Parrot Strings
Contents \| Documentation