NAME

pirlexer.c - lexical analysis for Parrot Intermediate Representation

THOUGHTS FOR LATER

Implement dictionary as hashtable, which will be MUCH faster

Optimize small functions (using #define to inline them) and use 'smarter' implementations where appropriate. read_char() does a lot of work, which might slow things down.

TODO: implement POD parsing

TODO: read input files a few blocks at a time, instead of the whole file at once. This is to prevent failure when compiling 100M files.

Place and remove checks for EOF where appropriate. They are currently scattered throughout the code and should be cleaned up.

Check for 'correct' use of data types (unsigned etc.). Should characters be stored in chars or ints?

KEYWORDS

The dictionary contains *all* keywords, directives, flags and other (descriptions of) tokens that are recognized by the lexer.

  global    goto
  if        n_operators
  int       null
  num       pmc
  string    unless

DIRECTIVES

The following are PIR directives. (this should be re-ordered)

  .arg               .const      .macro_const      .end
  .endnamespace      .endm       .get_results      .globalconst
  .HLL               .HLL_map    .include          .invocant          .lex
  .loadlib           .local      .macro            .meth_call         .namespace
  .nci_call          .param      .begin_call       .begin_return      .begin_yield
  .call              .end_call   .end_return       .end_yield         .pragma
  .result            .return     .sub              .yield

FLAGS

The following are flags for subroutines:

  :anon     :immediate   :init        :lex         :load        :main
  :method   :multi       :outer       :postcomp    :vtable      :named

The following are flags for parameters/arguments.

  :opt_flag
  :optional
  :slurpy
  :flat
  :unique_reg

STRING ENCODINGS

The following are string encoding specifiers:

  ascii:
  binary:
  iso-8859-1:
  unicode:

file_buffer structure

Structure that represents a file. Its layout is shown below. First, it contains the filename of the file that is represented by this buffer. The buffer field is an array that holds the complete file contents. The whole file is kept in memory for efficiency (instead of reading character by character from disk). The curchar field acts as a cursor that points to the current character. The field filesize contains the size of the file in bytes, line keeps track of the current line number, and linepos counts the number of characters since the last newline character.

The field lastchar stores the previous character (that is, the character before the one pointed to by curchar). This field is used to decide whether the previous character was a newline. If so, curchar is at the start of a line (needed for heredoc delimiters).

The field prevbuffer points to another file_buffer; if the current file was .included, then prevbuffer points to the file_buffer that represents the including file. An example:

 $ cat main.pir

 .include "util.pir"

 .sub main
 # ...
 .end

 $ cat util.pir

 .sub foo
 # ...
 .end

In this case, when parsing the file main.pir, prevbuffer is NULL, because this file was not included. Then, when the file util.pir is included, a new file_buffer is created for that file, and prevbuffer is set to the file_buffer representing main.pir.
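The chain formed by prevbuffer behaves like a stack of open files. The following is a minimal sketch of that idea, not the code from pirlexer.c: the structure is a reduced stand-in with only two fields, and push_include(), pop_include() and include_depth() are hypothetical helper names introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced stand-in for file_buffer: only the fields needed for this sketch. */
typedef struct file_buffer {
    const char         *filename;    /* the name of this file */
    struct file_buffer *prevbuffer;  /* 'including' file, or NULL */
} file_buffer;

/* Hypothetical helper: make 'included' the current buffer and remember
 * 'current' as the file that included it. Returns the new current buffer. */
static file_buffer *push_include(file_buffer *current, file_buffer *included) {
    assert(included != NULL);
    included->prevbuffer = current;
    return included;
}

/* Hypothetical helper: go back to the including file. Returns NULL when the
 * buffer that ends was not included by any other file. */
static file_buffer *pop_include(file_buffer *current) {
    assert(current != NULL);
    return current->prevbuffer;
}

/* Hypothetical helper: number of .include levels above the current buffer. */
static unsigned include_depth(const file_buffer *current) {
    unsigned depth = 0;
    while (current != NULL && current->prevbuffer != NULL) {
        depth++;
        current = current->prevbuffer;
    }
    return depth;
}
```

With main.pir as the starting buffer, pushing util.pir gives a depth of 1, and popping it makes main.pir current again.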

The file_buffer structure is shown below:

 typedef struct file_buffer {
     char     *filename;              /* the name of this file */
     char     *buffer;                /* buffer holding contents of this file */
     char     *curchar;               /* pointer to the current char */
     unsigned  filesize;              /* size of this file in bytes */
     unsigned  long line;             /* line number */
     unsigned  short linepos;         /* position on the current line */
     char      lastchar;              /* the previous character that was read */
     struct file_buffer *prevbuffer;  /* pointer to 'including' file, if any */
 } file_buffer;
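To illustrate how the cursor fields could cooperate, here is a hedged sketch of reading one character while keeping line, linepos and lastchar up to date. It is not the actual read_char(): the value of EOF_MARKER, the init_buffer() helper, the sketch_ prefixed names, starting lastchar as a newline, and updating the line number inside the read function (the real lexer has a separate update_line()) are all assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define EOF_MARKER '\0'   /* assumption: the real value is not documented here */

/* Stand-in for file_buffer without the filename and prevbuffer fields. */
typedef struct file_buffer {
    const char     *buffer;    /* complete file contents */
    const char     *curchar;   /* cursor: the next character to read */
    unsigned        filesize;  /* size in bytes */
    unsigned long   line;      /* current line number */
    unsigned short  linepos;   /* characters read since the last newline */
    char            lastchar;  /* the previous character that was read */
} file_buffer;

/* Hypothetical helper: point a buffer at an in-memory string. */
static void init_buffer(file_buffer *buf, const char *text) {
    assert(buf != NULL && text != NULL);
    buf->buffer   = text;
    buf->curchar  = text;
    buf->filesize = (unsigned)strlen(text);
    buf->line     = 1;
    buf->linepos  = 0;
    buf->lastchar = '\n';  /* assumption: start of file counts as start of line */
}

/* Return the next character, or EOF_MARKER once filesize bytes were read. */
static char sketch_read_char(file_buffer *buf) {
    char c;
    assert(buf != NULL);
    if ((size_t)(buf->curchar - buf->buffer) >= buf->filesize)
        return EOF_MARKER;
    c = *buf->curchar++;
    buf->lastchar = c;
    if (c == '\n') {       /* new line: bump the counter, reset the position */
        buf->line++;
        buf->linepos = 0;
    } else {
        buf->linepos++;
    }
    return c;
}

/* The cursor is at the start of a line if the previous char was a newline. */
static int sketch_is_start_of_line(const file_buffer *buf) {
    assert(buf != NULL);
    return buf->lastchar == '\n';
}
```

Keeping lastchar next to the cursor means the start-of-line test never has to look backwards in the buffer, which matters at the very first character of a file.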

lexer_state structure

Structure representing the lexer. It holds a pointer to the current file being read, a buffer holding the current token, and a pointer to add characters to the token buffer.

 typedef struct lexer_state {
     struct file_buffer *curfile;    /* pointer to the current file */
     char *token_chars;              /* characters of the current token */
     char *charptr;                  /* used for adding/removing token chars */
 } lexer_state;
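The relationship between token_chars and charptr can be sketched as follows. This is an illustration under assumptions, not the lexer's own code: the token buffer is a fixed array here (the documented field is a char *), MAX_TOKEN_LEN is an invented capacity, and the sketch_ functions only mimic what buffer_char() and clear_buffer() are described as doing.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TOKEN_LEN 128  /* assumption: the real capacity is not documented */

/* Stand-in for lexer_state without the curfile field. */
typedef struct lexer_state {
    char  token_chars[MAX_TOKEN_LEN];  /* characters of the current token */
    char *charptr;                     /* where the next character is stored */
} lexer_state;

/* Reset the token buffer to the empty string. */
static void sketch_clear_buffer(lexer_state *lexer) {
    assert(lexer != NULL);
    lexer->charptr  = lexer->token_chars;
    *lexer->charptr = '\0';
}

/* Append one character, keeping the buffer NUL-terminated.
 * Returns 1 on success, 0 if the buffer is full. */
static int sketch_buffer_char(lexer_state *lexer, char c) {
    assert(lexer != NULL);
    if (lexer->charptr - lexer->token_chars >= MAX_TOKEN_LEN - 1)
        return 0;
    *lexer->charptr++ = c;
    *lexer->charptr   = '\0';
    return 1;
}
```

Because charptr always points at the terminating NUL, removing the last character would only require moving the pointer back by one and writing a new terminator.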

ACCESSOR FUNCTIONS

char const *find_keyword(token t): Returns the spelling of a keyword based on the specified token.
char *const get_current_token(lexer_state const *s): Returns a constant pointer to the current token buffer.
char *const get_current_file(struct lexer_state *s): Returns a constant pointer to the current file name.
long get_current_line(struct lexer_state *s): Returns the current line number.
unsigned short get_current_linepos(struct lexer_state *s): Returns the current line position (in other words, the number of characters read so far on the current line).
long get_current_filepos(struct lexer_state *s): Returns the number of characters read in the current file so far.
void print_error_context(struct lexer_state *s): Prints some surrounding text from the file to indicate where the error occurred. This may make finding the error easier.

INTERNAL FUNCTIONS

static void buffer_char(lexer_state *lexer, char c): Store a character in the lexer's buffer.
static char read_char(file_buffer *buf): Return the next character from the buffer. It's a good idea to check for "c == EOF_MARKER" after each call.
static void unread_char(file_buffer *buf): Push back the last read character.
static void print_buffer(lexer_state *lexer): Debug function to show the rest of the current buffer (starting from the current character).
static void clear_buffer(lexer_state *lexer): Clears the buffer in which the current token is stored.
static file_buffer *read_file(char const *filename): Allocate a new file_buffer structure, allocate memory for the file's contents and read all contents into this buffer. The file_buffer structure is returned.
static void destroy_buffer(file_buffer *buf): Destructor for file_buffer.
static void do_include_file(lexer_state *lexer, char const *filename): Calls read_file(), which returns a file_buffer structure. This file_buffer's previous buffer is set to the current file_buffer. The new file buffer is then assigned to the lexer's current file buffer.
static int is_start_of_line(file_buffer *buf): Checks whether the current pointer in the specified file buffer is at the beginning of a line.
static token check_dictionary(lexer_state *lexer, char const *dictionary[]): Checks whether the current token is a member of the specified dictionary. If it is, the index of the word in the dictionary is returned. If not, T_NOT_FOUND is returned.
static void switch_buffer(lexer_state *lexer): Sets the current file_buffer to the previous one stored in the field prevbuffer. Processing of the .include'ing file then continues.
static int read_digits(lexer_state *lexer): Helper function to read as many digits as possible into the current token's buffer. Returns the number of digits read.
static void update_line(lexer_state *lexer): Updates the line number in the lexer, and resets the line position pointer.
static token read_string(lexer_state *lexer, char delimiter): Read a quoted string.
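As an illustration of the lookup that check_dictionary() is described as performing, here is a minimal linear-search sketch. It takes the word directly instead of a lexer_state, assumes the dictionary array is terminated by NULL, and assumes T_NOT_FOUND is -1. None of these details are confirmed by this document.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define T_NOT_FOUND (-1)  /* assumption: the real value is not documented here */

/* Return the index of 'word' in a NULL-terminated dictionary,
 * or T_NOT_FOUND if the word is not present. */
static int sketch_check_dictionary(const char *word,
                                   const char *const dictionary[]) {
    int i;
    assert(word != NULL && dictionary != NULL);
    for (i = 0; dictionary[i] != NULL; i++) {
        if (strcmp(dictionary[i], word) == 0)
            return i;
    }
    return T_NOT_FOUND;
}
```

A scan like this costs time proportional to the number of dictionary entries for every identifier, which is presumably the motivation for the hashtable idea listed under THOUGHTS FOR LATER.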

LEXER API

token read_heredoc(lexer_state *lexer, char *heredoc_label): Reads heredoc text up to the specified heredoc label. Returns T_HEREDOC_STRING if successful, or T_EOF if end of file is encountered first. The heredoc string is stored in the token buffer.
token read_macro(lexer_state *lexer): Just skips all tokens until ".endm" (or end of file) is found. This can be improved later.
lexer_state *new_lexer(char const *filename): Constructor for the lexer.
void destroy_lexer(lexer_state *lexer): Destructor for the lexer.
void open_include_file(lexer_state *lexer): This function takes a quoted string, to be found in the current token, and removes the quotes. Then the file is included through do_include_file().
void close_include_file(NOTNULL(lexer_state *lexer)): Opposite of open_include_file(); it sets the current file in the lexer to the 'including' file (found through the 'prevbuffer' pointer).
token next_token(lexer_state *lexer): Reads a token from the current file buffer.

LEXICAL SPECIFICATION

Comments

Comments start with the pound sign ('#') and continue up to the end of the line.

POD comments are not yet supported.
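The comment rule can be sketched as a small helper that advances past a comment. This is an illustration only: sketch_skip_comment() is a hypothetical name, it works on a NUL-terminated string rather than a file_buffer, and it leaves the newline itself unread so that line counting can happen elsewhere.

```c
#include <assert.h>
#include <stddef.h>

/* If 'p' points at '#', return a pointer to the end of that comment, i.e. to
 * the newline that ends the line or to the terminating NUL.
 * Otherwise return 'p' unchanged. */
static const char *sketch_skip_comment(const char *p) {
    assert(p != NULL);
    if (*p != '#')
        return p;
    while (*p != '\0' && *p != '\n')
        p++;
    return p;
}
```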

Tokens

Any whitespace in the specification is merely for readability. Significant whitespace is indicated explicitly.

  PASM-REG        -> PASM-PREG | PASM-SREG | PASM-NREG | PASM-IREG

  PASM-PREG       -> 'P' DIGIT+

  PASM-SREG       -> 'S' DIGIT+

  PASM-NREG       -> 'N' DIGIT+

  PASM-IREG       -> 'I' DIGIT+

  IDENT           -> [a-zA-Z_][a-zA-Z_0-9]*

  LABEL           -> IDENT ':'

  INVOCANT-IDENT  -> IDENT '.'

  PARROT-OP       -> IDENT

  MACRO-IDENT     -> '.' IDENT

  MACRO-LABEL     -> '$' IDENT ':'

  PIR-REGISTER    -> '$' PASM-REG

  HEREDOC-IDENT   -> << STRINGC

  STRING-CONSTANT -> ' <characters> ' | " <characters> "

  INT-CONSTANT    -> [-] DIGIT+ | 0 [xX] DIGIT+ | 0 [bB] DIGIT+

  NUM-CONSTANT    -> [-] DIGIT+ '.' DIGIT*

  DIGIT           -> [0-9]
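A few of these rules translate directly into character tests. The sketch below covers IDENT, PASM-REG and PIR-REGISTER only. The function names are hypothetical, the functions operate on NUL-terminated strings, and the isalpha()/isalnum() calls assume the default "C" locale so that they match exactly [a-zA-Z] and [a-zA-Z0-9].

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* IDENT -> [a-zA-Z_][a-zA-Z_0-9]*
 * Returns the length of the identifier at the start of 's', or 0 if none. */
static size_t sketch_match_ident(const char *s) {
    size_t n = 0;
    assert(s != NULL);
    if (!(isalpha((unsigned char)s[0]) || s[0] == '_'))
        return 0;
    while (isalnum((unsigned char)s[n]) || s[n] == '_')
        n++;
    return n;
}

/* PASM-REG -> ('P' | 'S' | 'N' | 'I') DIGIT+
 * Returns 1 if the whole string 's' is a PASM register, 0 otherwise. */
static int sketch_is_pasm_reg(const char *s) {
    size_t i;
    assert(s != NULL);
    if (s[0] != 'P' && s[0] != 'S' && s[0] != 'N' && s[0] != 'I')
        return 0;
    if (!isdigit((unsigned char)s[1]))
        return 0;                      /* DIGIT+ needs at least one digit */
    for (i = 1; s[i] != '\0'; i++) {
        if (!isdigit((unsigned char)s[i]))
            return 0;
    }
    return 1;
}

/* PIR-REGISTER -> '$' PASM-REG */
static int sketch_is_pir_register(const char *s) {
    assert(s != NULL);
    return s[0] == '$' && sketch_is_pasm_reg(s + 1);
}
```

Note that a string such as P0 also satisfies the IDENT rule, so a lexer built on these rules has to decide which one is tried first. This document does not say how pirlexer.c resolves that.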

Special tokens

  ( ) [ ] , ;

Operators

Due to PIR's simplicity, there are no different levels of precedence for operators.

Unary operators

    -   !   ~

Binary operators

    **  *  %  /  //  +  -  >>  >>>  <<  ~   ~~   &  &&  |  ||  .

Augmented operators

    **=   *=    %=   /=   //=   +=   -=  .=  >>=  >>>=   <<=  &=   |=   ~=

Conditional operators

    <    >   ==   <=   >=  !=

Miscellaneous operators

    =>   ..
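Several operators are prefixes of longer ones (for example >, >>, >>> and >>>=), so a lexer has to choose how far to read. One common approach, assumed here and not confirmed for pirlexer.c, is to take the longest operator that matches. The sketch below uses a small NULL-terminated table containing only the operators that start with '>'; the table and the function name are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Operators from this specification that start with '>'. */
static const char *const gt_ops[] = {
    ">", ">=", ">>", ">>=", ">>>", ">>>=", NULL
};

/* Return the length of the longest operator in 'ops' that is a prefix of
 * 's', or 0 if no operator matches. */
static size_t sketch_longest_operator(const char *s, const char *const ops[]) {
    size_t best = 0;
    int i;
    assert(s != NULL && ops != NULL);
    for (i = 0; ops[i] != NULL; i++) {
        size_t len = strlen(ops[i]);
        if (len > best && strncmp(s, ops[i], len) == 0)
            best = len;
    }
    return best;
}
```

With this table, the input ">>>= 1" yields a length of 4 and ">> 1" yields 2, so the shorter operators never shadow the longer ones.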
