NAME

pirlexer.c - lexical analysis for Parrot Intermediate Representation

THOUGHTS FOR LATER

Implement the dictionary as a hash table, which should make lookups much faster.

Optimize small functions (using #define to inline them) and use 'smarter' implementations where appropriate. read_char() does a lot of work on every call, which might slow things down.

TODO: implement POD parsing

TODO: read input files a few blocks at a time, instead of the whole file at once. This is to prevent failure when compiling 100M files.

Place and remove checks for EOF where appropriate; they are currently scattered throughout the code. Clean that up.

Check for 'correct' use of data types (unsigned etc.); should characters be stored in chars or ints?

KEYWORDS

The dictionary contains *all* keywords, directives, flags and other (descriptions of) tokens that are recognized by the lexer.

  global    goto
  if        n_operators
  int       null
  num       pmc
  string    unless
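
A dictionary lookup over the keyword list above might look like the following sketch. It is only an illustration: the name check_dictionary_sketch, its signature, the NULL-terminated array layout and the value of T_NOT_FOUND are assumptions, not taken from pirlexer.c.

```c
#include <assert.h>
#include <string.h>

#define T_NOT_FOUND (-1)   /* assumed value; the real constant may differ */

/* The keywords listed above, terminated by NULL (layout is an assumption). */
static const char *const keywords[] = {
    "global", "goto", "if", "n_operators", "int",
    "null", "num", "pmc", "string", "unless",
    NULL
};

/* Linear scan: return the index of token in dict, or T_NOT_FOUND. */
static int check_dictionary_sketch(const char *const dict[], const char *token) {
    assert(dict != NULL && token != NULL);
    for (int i = 0; dict[i] != NULL; i++) {
        if (strcmp(dict[i], token) == 0)
            return i;
    }
    return T_NOT_FOUND;
}
```

Every lookup in this sketch compares the token against each entry in turn, which is the cost that the hash table idea under THOUGHTS FOR LATER is meant to avoid.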

DIRECTIVES

The following are PIR directives.

  .arg               .const      .constant    .emit             .end
  .endnamespace      .endm       .eom         .get_results      .global
  .globalconst       .HLL        .HLL_map     .include          .invocant
  .lex               .loadlib    .local       .macro            .meth_call
  .namespace         .nci_call   .param       .pcc_begin        .pcc_begin_return
  .pcc_begin_yield   .pcc_call   .pcc_end     .pcc_end_return   .pcc_end_yield
  .pcc_sub           .pragma     .result      .return           .sub
  .sym               .yield

FLAGS

The following are flags for subroutines:

  :anon     :immediate   :init        :lex         :load        :main
  :method   :multi       :outer       :postcomp    :vtable      :named

The following are flags for parameters/arguments.

  :opt_flag
  :optional
  :slurpy
  :flat
  :unique_reg

STRING ENCODINGS

The following are string encoding specifiers:

  ascii:
  binary:
  iso-8859-1:
  unicode:

file_buffer structure

Structure that represents a file. First, it contains the filename of the file that is represented by this buffer. The buffer field is an array that holds the complete file contents; the whole file is read at once for efficiency (instead of reading character by character from disk). The curchar field acts like a cursor that points to the current character.

The field filesize contains the size of the file in bytes, line keeps track of the current line number, and linepos counts the number of characters since the last newline character. The field lastchar stores the previous character (that is, the character before the one pointed to by curchar). This field is used to decide whether the previous character was a newline; if so, curchar is at the start of a line (needed for heredoc delimiters).

The field prevbuffer points to another file_buffer; if the current file was .included, then prevbuffer points to the file_buffer that represents the including file. An example:

 $ cat main.pir

 .include "util.pir"

 .sub main
 # ...
 .end

 $ cat util.pir

 .sub foo
 # ...
 .end

In this case, when parsing the file main.pir, prevbuffer is NULL, because this file was not included. Then, when the file util.pir is included, a new file_buffer is created for that file, and prevbuffer is set to the file_buffer representing main.pir.

The file_buffer structure is shown below:

 typedef struct file_buffer {
     char     *filename;              /* the name of this file */
     char     *buffer;                /* buffer holding contents of this file */
     char     *curchar;               /* pointer to the current char. */
     unsigned  filesize;              /* size of this file in bytes */
     unsigned  long line;             /* line number */
     unsigned  short linepos;         /* position on the current line */
     char      lastchar;              /* the previous character that was read. */
     struct file_buffer *prevbuffer;  /* pointer to 'including' file if any */
 } file_buffer;
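
The chain of including files formed by prevbuffer behaves like a stack. The sketch below illustrates this with a trimmed-down stand-in for file_buffer; the helpers push_file() and pop_file() are hypothetical names introduced for this example and do not appear in pirlexer.c.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Trimmed-down stand-in for file_buffer: only the fields needed here. */
typedef struct file_buffer {
    const char         *filename;
    struct file_buffer *prevbuffer;  /* the 'including' file, or NULL */
} file_buffer;

/* Hypothetical helper: make `name` the current file and remember the
 * including file in prevbuffer. */
static file_buffer *push_file(file_buffer *current, const char *name) {
    file_buffer *fb = malloc(sizeof *fb);
    assert(fb != NULL);
    fb->filename   = name;
    fb->prevbuffer = current;
    return fb;
}

/* Hypothetical helper: finish the current file and return the file
 * that included it (NULL when the top-level file is finished). */
static file_buffer *pop_file(file_buffer *current) {
    file_buffer *prev = current->prevbuffer;
    free(current);
    return prev;
}
```

With the main.pir and util.pir example, pushing main.pir and then util.pir gives a current buffer for util.pir whose prevbuffer is main.pir; popping once resumes main.pir, and popping again yields NULL.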

lexer_state structure

Structure representing the lexer. It holds a pointer to the current file being read, a buffer holding the current token, and a pointer to add characters to the token buffer.

 typedef struct lexer_state {
     struct file_buffer *curfile;    /* pointer to the current file */
     char *token_chars;              /* characters of the current token */
     char *charptr;                  /* used for adding/removing token chars */
 } lexer_state;
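
How token_chars and charptr cooperate can be sketched as follows. The struct is reduced to the two token fields, and TOKEN_MAX, the function names and their signatures are assumptions made for this example; the real buffer_char() and clear_buffer() may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOKEN_MAX 128  /* assumed capacity; the real limit is not documented here */

/* Reduced stand-in for lexer_state: only the token buffer fields. */
typedef struct lexer_state {
    char *token_chars;  /* start of the token buffer */
    char *charptr;      /* where the next character will be stored */
} lexer_state;

/* Reset the token buffer to the empty string. */
static void clear_buffer_sketch(lexer_state *lexer) {
    assert(lexer != NULL);
    lexer->charptr  = lexer->token_chars;
    *lexer->charptr = '\0';
}

/* Append one character, keeping the buffer NUL-terminated.
 * Returns 1 on success, 0 if the buffer is full. */
static int buffer_char_sketch(lexer_state *lexer, char c) {
    assert(lexer != NULL);
    if ((size_t)(lexer->charptr - lexer->token_chars) >= TOKEN_MAX - 1)
        return 0;
    *lexer->charptr++ = c;
    *lexer->charptr   = '\0';
    return 1;
}
```

Keeping the buffer terminated after every append means the current token can be read as an ordinary C string at any point.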

ACCESSOR FUNCTIONS

find_keyword(): Get the spelling of a keyword based on the specified token.
get_current_token(): Returns a constant pointer to the current token buffer.
get_current_file(): Returns a constant pointer to the current file name.
get_current_line(): Returns the current line number.
get_current_linepos(): Returns the current line position (in other words, how many characters have been read on the current line).
get_current_filepos(): Returns the number of characters read in the current file so far.
print_error_context(): Print some surrounding text from the file to indicate where the error occurred. This may make finding the error easier.

INTERNAL FUNCTIONS

buffer_char(): Store a character in the lexer's buffer.
read_char(): Return the next character from the buffer. It's a good idea to check for "c == EOF_MARKER" after each call.
unread_char(): Push back the last read character.
print_buffer(): Debug function to show the rest of the current buffer, starting from the current character.
clear_buffer(): Clears the buffer in which the current token is stored.
read_file(): Allocate a new file_buffer structure, allocate memory for the file's contents and read all contents into this buffer. The file_buffer structure is returned.
destroy_buffer(): Destructor for file_buffer.
do_include_file(): Calls read_file(), which returns a file_buffer structure. This file_buffer's previous buffer is set to the current file_buffer. The new file buffer is then assigned to the lexer's current file buffer.
is_start_of_line(): Checks whether the current pointer in the specified file buffer is at the beginning of a line.
check_dictionary(): Checks whether the current token is a member of the specified dictionary. If it is, the index of the word in the dictionary is returned. If not, T_NOT_FOUND is returned.
switch_buffer(): Sets the current file_buffer to the previous one stored in the field prevbuffer. Processing then continues in the .include'ing file.
read_digits(): Helper function to read as many digits as possible into the current token's buffer. Returns the number of digits read.
update_line(): Updates the line number in the lexer, and resets the line position pointer.
read_string(): Read a quoted string.
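
The interplay of read_char(), update_line() and is_start_of_line() can be illustrated with the sketch below. It is a simplified model, not the actual implementation: the function names carry a _sketch suffix, init_buffer_sketch() is hypothetical, and the value of EOF_MARKER as well as the choice to treat the start of a file as the start of a line are assumptions.

```c
#include <assert.h>
#include <string.h>

#define EOF_MARKER '\0'  /* assumed value; the real marker may differ */

/* Reduced stand-in for file_buffer: cursor and position fields only. */
typedef struct file_buffer {
    const char    *buffer;
    const char    *curchar;
    unsigned       filesize;
    unsigned long  line;
    unsigned short linepos;
    char           lastchar;
} file_buffer;

/* Hypothetical helper: point the buffer at an in-memory string. */
static void init_buffer_sketch(file_buffer *fb, const char *text) {
    assert(fb != NULL && text != NULL);
    fb->buffer   = text;
    fb->curchar  = text;
    fb->filesize = (unsigned)strlen(text);
    fb->line     = 1;
    fb->linepos  = 0;
    fb->lastchar = '\n';  /* assumption: start of file counts as start of line */
}

/* Return the next character and advance the cursor, updating the
 * line number, line position and lastchar along the way. */
static char read_char_sketch(file_buffer *fb) {
    char c;
    if ((unsigned)(fb->curchar - fb->buffer) >= fb->filesize)
        return EOF_MARKER;
    c = *fb->curchar++;
    if (c == '\n') {      /* the job described for update_line() */
        fb->line++;
        fb->linepos = 0;
    } else {
        fb->linepos++;
    }
    fb->lastchar = c;
    return c;
}

/* The cursor is at the start of a line if the previous char was a newline. */
static int is_start_of_line_sketch(const file_buffer *fb) {
    return fb->lastchar == '\n';
}
```

unread_char() is left out of this sketch because pushing back a newline also requires restoring the previous line position, which needs extra state.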

LEXER API

read_heredoc(): Reads heredoc text up to the specified heredoc label. Returns T_HEREDOC_STRING if successful, or T_EOF if the end of the file is encountered first. The heredoc string is stored in the token buffer.
read_macro(): Just skips all tokens until ".endm" (or the end of the file) is found. This can be improved later.
new_lexer(): Constructor for the lexer.
destroy_lexer(): Destructor for lexer.
include_file(): This function takes a quoted string, found in the current token, and removes the quotes. Then the file is included through do_include_file().
close_file(): Opposite of include_file(), it sets the current file in the lexer to the 'including' file (found through the 'prevbuffer' pointer).
next_token(): Reads a token from the current file buffer.
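
The quote removal that include_file() performs before calling do_include_file() could be written along these lines. This is a minimal sketch under assumptions: strip_quotes() is a hypothetical helper, and it is assumed that the token holds the string with its surrounding quotes and that both single and double quotes are accepted.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: strip one pair of matching quotes in place.
 * Returns 1 if quotes were removed, 0 if the token was not quoted
 * (in which case the token is left unchanged). */
static int strip_quotes(char *token) {
    size_t len;
    char   quote;
    assert(token != NULL);
    len = strlen(token);
    if (len < 2)
        return 0;
    quote = token[0];
    if ((quote != '"' && quote != '\'') || token[len - 1] != quote)
        return 0;
    memmove(token, token + 1, len - 2);  /* shift the contents left by one */
    token[len - 2] = '\0';
    return 1;
}
```

For the directive .include "util.pir", the token "util.pir" (with quotes) becomes util.pir, which can then be passed on as a file name.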

LEXICAL SPECIFICATION

Comments

Comments start with the pound sign ('#') and continue up to the end of the line.

POD comments are not yet supported.
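
Skipping a '#' comment amounts to advancing to the end of the line. A minimal sketch, assuming a NUL-terminated input string and a hypothetical helper name skip_comment():

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: if p points at '#', return a pointer to the
 * newline (or terminating NUL) that ends the comment; otherwise
 * return p unchanged. The newline itself is not consumed. */
static const char *skip_comment(const char *p) {
    assert(p != NULL);
    if (*p != '#')
        return p;
    while (*p != '\0' && *p != '\n')
        p++;
    return p;
}
```

Leaving the newline in place lets the caller handle line counting in one spot.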

Tokens

Any whitespace in the specification is merely for readability. Significant whitespace is indicated explicitly.

  PASM-REG        -> PASM-PREG | PASM-SREG | PASM-NREG | PASM-IREG

  PASM-PREG       -> 'P' DIGIT+

  PASM-SREG       -> 'S' DIGIT+

  PASM-NREG       -> 'N' DIGIT+

  PASM-IREG       -> 'I' DIGIT+

  IDENT           -> [a-zA-Z_][a-zA-Z_0-9]*

  LABEL           -> IDENT ':'

  INVOCANT-IDENT  -> IDENT '.'

  PARROT-OP       -> IDENT

  MACRO-IDENT     -> '.' IDENT

  MACRO-LABEL     -> '$' IDENT ':'

  PIR-REGISTER    -> '$' PASM-REG

  HEREDOC-IDENT   -> << STRINGC

  STRING-CONSTANT -> ' <characters> ' | " <characters> "

  INT-CONSTANT    -> [-] DIGIT+ | 0 [xX] DIGIT+ | 0 [bB] DIGIT+

  NUM-CONSTANT    -> [-] DIGIT+ '.' DIGIT*

  DIGIT           -> [0-9]
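
Some of the rules above translate directly into small matching functions. The sketch below covers IDENT, PASM-REG and PIR-REGISTER; the function names are hypothetical, and each function returns the number of characters matched at the start of the string (0 for no match). Note that a string such as P10 satisfies both the PASM-REG and the IDENT rule; which one the real lexer prefers is not stated here.

```c
#include <assert.h>
#include <stddef.h>

static int is_digit(char c)       { return c >= '0' && c <= '9'; }
static int is_ident_start(char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
}
static int is_ident_char(char c)  { return is_ident_start(c) || is_digit(c); }

/* IDENT -> [a-zA-Z_][a-zA-Z_0-9]* */
static size_t match_ident(const char *s) {
    size_t n = 0;
    assert(s != NULL);
    if (!is_ident_start(s[0]))
        return 0;
    while (is_ident_char(s[n]))
        n++;
    return n;
}

/* PASM-REG -> ('P' | 'S' | 'N' | 'I') DIGIT+ */
static size_t match_pasm_reg(const char *s) {
    size_t n = 1;
    assert(s != NULL);
    if (s[0] != 'P' && s[0] != 'S' && s[0] != 'N' && s[0] != 'I')
        return 0;
    while (is_digit(s[n]))
        n++;
    return n > 1 ? n : 0;  /* at least one digit is required */
}

/* PIR-REGISTER -> '$' PASM-REG */
static size_t match_pir_register(const char *s) {
    size_t n;
    assert(s != NULL);
    if (s[0] != '$')
        return 0;
    n = match_pasm_reg(s + 1);
    return n ? n + 1 : 0;
}
```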

Special tokens

  ( ) [ ] , ;

Operators

Due to PIR's simplicity, all operators share a single level of precedence.

Unary operators

    -   !   ~

Binary operators

    **  *  %  /  //  +  -  >>  >>>  <<  ~   ~~   &  &&  |  ||  .

Augmented operators

    **=   *=    %=   /=   //=   +=   -=  .=  >>=  >>>=   <<=  &=   |=   ~=

Conditional operators

    <    >   ==   <=   >=  !=

Miscellaneous operators

    ->   =>   ..
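
Because many operators share a prefix (for example >, >>, >>>, >>= and >>>=), a lexer typically picks the longest operator that matches at the current position. The sketch below shows that longest-match rule over the operators listed in this section. It is an illustration only: match_operator() is a hypothetical name, and the actual lexer may recognize operators in a different way.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* All operators listed in this section. */
static const char *const operators[] = {
    /* unary */         "-", "!", "~",
    /* binary */        "**", "*", "%", "/", "//", "+", ">>", ">>>", "<<",
                        "~~", "&", "&&", "|", "||", ".",
    /* augmented */     "**=", "*=", "%=", "/=", "//=", "+=", "-=", ".=",
                        ">>=", ">>>=", "<<=", "&=", "|=", "~=",
    /* conditional */   "<", ">", "==", "<=", ">=", "!=",
    /* miscellaneous */ "->", "=>", ".."
};

/* Return the length of the longest listed operator that s starts
 * with, or 0 if s does not start with any of them. */
static size_t match_operator(const char *s) {
    size_t best = 0;
    size_t i;
    assert(s != NULL);
    for (i = 0; i < sizeof operators / sizeof operators[0]; i++) {
        size_t len = strlen(operators[i]);
        if (len > best && strncmp(s, operators[i], len) == 0)
            best = len;
    }
    return best;
}
```

Since every entry is tried and only the longest match is kept, the order of the table does not matter.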
