NAME

docs/pdds/pdd07_codingstd.pod - Conventions and Guidelines for Parrot Source Code

ABSTRACT

This document describes the various rules, guidelines and advice for those wishing to contribute to the source code of Parrot, in such areas as code structure, naming conventions, comments etc.

DESCRIPTION

One of the criticisms of Perl 5 is that its source code is impenetrable to newcomers, due to such things as inconsistent or obscure variable naming conventions, lack of comments in the source code, and so on. We don't intend to make the same mistake when writing Parrot. Hence this document.

We define three classes of conventions. Those that say "must" are mandatory, and code will not be accepted (apart from in exceptional circumstances) unless it follows these rules. Those that say "should" are strong guidelines that should normally be followed unless there is a sensible reason to do otherwise. Finally, those that say "may" are tentative suggestions to be used at your discretion.

Note this particular PDD makes some recommendations that are specific to the C programming language. This does not preclude Parrot (or Perl 6) being implemented in other languages, but in this case, additional PDDs may need to be authored for the extra language-specific features.

IMPLEMENTATION

Coding style

The following must apply:

The following should apply

To enforce the spacing, indenting, and bracing guidelines mentioned above, the following arguments to GNU Indent should be used:

   -kr -nce -sc -cp0 -l79 -lc79 -psl -nut -cdw -ncs -lps

This expands out to:

-nbad

Do not force blank lines after declarations.

-bap

Force blank lines after procedure bodies.

-bbo

Prefer to break long lines before boolean operators.

-nbc

Do not force newlines after commas in declarations

-br

Put braces on line with if, etc.

-brs

Put braces on struct declaration line.

-c33

Put comments to the right of code in column 33 (not recommended)

-cd33

Put declaration comments to the right of code in column 33

-ncdb

Do not put comment delimiters on blank lines.

-nce

Do not cuddle } and else.

-cdw

Do cuddle do { } while.

-ci4

Continuation indent of 4 spaces

-cli0

Case label indent of 0 spaces

-ncs

Do not put a space after a cast operator.

-d0

Set indentation of comments not to the right of code to 0 spaces.

-di1

Put declaration variables 1 space after their types

-nfc1

Do not format comments in the first column as normal.

-nfca

Do not format any comments

-hnl

Prefer to break long lines at the position of newlines in the input.

-i4

4-space indents

-ip0

Indent parameter types in old-style function definitions by 0 spaces.

-l79

Maximum line length for non-comment lines is 79 characters.

-lc79

Maximum line length for comment lines is 79 characters.

-lp

Line up continued lines at parentheses.

-npcs

Do not put a space after the function in function calls.

-nprs

Do not put a space after every '(' and before every ')'.

-saf

Put a space after each for.

-sai

Put a space after each if.

-saw

Put a space after each while.

-sc

Put the '*' character at the left of comments.

-nsob

Do not swallow optional blank lines.

-nss

Do not force a space before the semicolon after certain statements

-nut

Use spaces instead of tabs.

-lps

Leave space between '#' and preprocessor directive.

-psl

Put the type of a procedure on the line before its name. (.c files), or

-npsl

Leave a procedure declaration's return type alone (.h files)

Please note that it is also necessary to include all typedef types with the "-T" option to ensure that everything is formatted properly.

A script (tools/dev/run_indent.pl) is provided which runs indent properly automatically.

Naming conventions

Subsystems and APIs

The Parrot core will be split into a number of subsystems, each with an associated API. For the purposes of naming files, data structures, etc, each subsystem will be assigned a short nickname, eg pmc, gc, io. All code within the core will belong to a subsystem; miscellaneous code with no obvious home will be placed in the special subsystem called misc.

Filenames

Filenames must be assumed to be case-insensitive, in the sense that you may not have two different files called Foo and foo. Normal source-code filenames should be all lower-case; filenames with upper-case letters in them are reserved for notice-me-first files such as README, and for files which need some sort of pre-processing applied to them or which do the preprocessing - eg a script foo.SH might read foo.TEMPLATE and output foo.c.

The characters making up filenames must be chosen from the ASCII set A-Z,a-z,0-9 plus .-_

An underscore should be used to separate words rather than a hyphen (-). A file should not normally have more than a single '.' in it, and this should be used to denote a suffix of some description. The filename must still be unique if the main part is truncated to 8 characters and any suffix truncated to 3 characters. Ideally, filenames should be restricted to 8.3 in the first place, but this is not essential.
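
The uniqueness rule above can be checked mechanically. The following is only a sketch: the helper names to_8_3() and clashes_8_3() are hypothetical and not part of Parrot, and it assumes names drawn from the permitted ASCII set with at most one '.'.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: write the lower-cased 8.3 form of `name` into
 * `out`, which must hold at least 13 bytes (8 + '.' + 3 + NUL). */
static void
to_8_3(const char *name, char out[13])
{
    const char *dot = strrchr(name, '.');
    size_t main_len = dot ? (size_t)(dot - name) : strlen(name);
    size_t suffix_len = dot ? strlen(dot + 1) : 0;
    size_t i, n = 0;

    if (main_len > 8) {
        main_len = 8;
    }
    if (suffix_len > 3) {
        suffix_len = 3;
    }

    /* Lower-casing makes the later comparison case-insensitive. */
    for (i = 0; i < main_len; i++) {
        out[n++] = (char)tolower((unsigned char)name[i]);
    }
    if (dot) {
        out[n++] = '.';
        for (i = 0; i < suffix_len; i++) {
            out[n++] = (char)tolower((unsigned char)dot[1 + i]);
        }
    }
    out[n] = '\0';
}

/* Hypothetical helper: non-zero if the two names would collide once
 * case is ignored and both are truncated to 8.3. */
static int
clashes_8_3(const char *a, const char *b)
{
    char short_a[13], short_b[13];

    to_8_3(a, short_a);
    to_8_3(b, short_b);
    return strcmp(short_a, short_b) == 0;
}
```

Under this sketch, interpreter.c and interpreted.c clash (both truncate to interpre.c), as do Foo.h and foo.h, while string.c and string.h do not.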

Each subsystem foo should supply the following files. This arrangement is based on the assumption that each subsystem will - as far as is practical - present an opaque interface to all other subsystems within the core, as well as to extensions and embeddings.

foo.h

This contains all the declarations needed for external users of that API (and nothing more), ie it defines the API. It is permissible for the API to include different or extra functionality when used by other parts of the core, compared with its use in extensions and embeddings. In this case, the extra stuff within the file is enabled by testing for the macro PERL_IN_CORE.

foo_private.h

This contains declarations used internally by that subsystem, and which must only be included within source files associated with the subsystem. This file defines the macro PERL_IN_FOO so that code knows when it is being used within that subsystem. The file will also contain all the 'convenience' macros used to define shorter working names for functions without the perl prefix (see below).

foo_globals.h

This file contains the declaration of a single structure containing the private global variables used by the subsystem (see the section on globals below for more details).

foo.sym

This file (format and contents TBD) contains information about global symbols associated with the subsystem, and may be used by scripts to auto-generate such stuff as the include files mentioned above, linker map tables, documentation etc, based upon portability and extensibility requirements.

foo_bar.[ch] etc

All other source files associated with the subsystem will have the prefix foo_

Header Files

All .h files should include the following "guards" to prevent multiple-inclusion:

    /* file header comments */

    #if !defined(PARROT_<FILENAME>_H_GUARD)
    #define PARROT_<FILENAME>_H_GUARD

    /* body of file */

    #endif /* PARROT_<FILENAME>_H_GUARD */

Names of code entities

Code entities such as variables, functions, macros etc (apart from strictly local ones) should all follow these general guidelines.

Global Variables

Global variables must never be accessed directly outside the subsystem in which they are used. Some other method, such as accessor functions, must be provided by that subsystem's API. (For efficiency the 'accessor functions' may occasionally actually be macros, but then the rule still applies in spirit at least).

All global variables needed for the internal use of a particular subsystem should all be declared within a single struct called foo_globals for subsystem foo. This structure's declaration is placed in the file foo_globals.h. Then somewhere a single compound structure will be declared which has as members the individual structures from each subsystem. Instances of this structure are then defined as a one-off global variable, or as per-thread instances, or whatever is required.

[Actually, three separate structures may be required, for global, per-interpreter and per-thread variables.]

Within an individual subsystem, macros are defined for each global variable of the form GLOBAL_foo (the name being deliberately clunky). So we might for example have the following macros:

    /* perl_core.h or similar */

    #ifdef HAS_THREADS
    #  define GLOBALS_BASE (aTHX_->globals)
    #else
    #  define GLOBALS_BASE (Perl_globals)
    #endif

    /* pmc_private.h */

    #define GLOBAL_foo   GLOBALS_BASE.pmc.foo
    #define GLOBAL_bar   GLOBALS_BASE.pmc.bar
    ... etc ...
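
To make the arrangement concrete, here is a minimal single-file sketch of the non-threaded case. The members foo and bar, the struct name all_globals, and the accessors pmc_get_foo() and pmc_set_foo() are assumptions made for illustration; in a real tree the pieces would live in the separate header files named in the comments.

```c
/* pmc_globals.h: every private global of the pmc subsystem in one struct
 * (the members here are placeholders) */
struct pmc_globals {
    int foo;
    int bar;
};

/* perl_core.h or similar: one compound struct, one member per subsystem
 * (the name all_globals is hypothetical) */
struct all_globals {
    struct pmc_globals pmc;
};

/* Non-threaded case only: a single one-off global instance */
static struct all_globals Perl_globals;
#define GLOBALS_BASE (Perl_globals)

/* pmc_private.h */
#define GLOBAL_foo   GLOBALS_BASE.pmc.foo
#define GLOBAL_bar   GLOBALS_BASE.pmc.bar

/* pmc.c: other subsystems go through accessors like these and never
 * touch GLOBAL_foo directly (accessor names are hypothetical) */
int
pmc_get_foo(void)
{
    return GLOBAL_foo;
}

void
pmc_set_foo(int value)
{
    GLOBAL_foo = value;
}
```

Because callers only ever see the accessors, switching GLOBALS_BASE to a per-thread structure later should not require changes outside the subsystem.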

Code comments

The importance of good code documentation cannot be stressed enough. To make your code understandable by others (and indeed by yourself when you come to make changes a year later :-), the following conventions apply to all source files.

Developer files

Each source file (eg a foo.c foo.h pair), should contain inline POD documentation containing information on the implementation decisions associated with the source file. (Note that this is in contrast to PDDs, which describe design decisions). In addition, more discursive documentation can be placed in *.dev files in the docs/dev directory. This is the place for mini-essays on how to avoid overflows in unsigned arithmetic, or on the pros and cons of differing hash algorithms, and why the current one was chosen, and how it works.

In principle, someone coming to a particular source file for the first time should be able to read the inline documentation file and gain an immediate overview of what the source file is for, the algorithms it implements, etc.

The POD documentation should follow the standard POD layout:

Copyright

The Parrot copyright statement.

SVN

An SVN id string.

NAME

src/foo.c - Foo

SYNOPSIS

When appropriate, some simple examples of usage.

DESCRIPTION

A description of the contents of the file, how the implementation works, data structures and algorithms, and anything that may be of interest to your successors, eg benchmarks of differing hash algorithms, essays on how to do integer arithmetic.

HISTORY

Record major changes to the file, eg "we moved from a linked list to a hash table implementation for storing Foos, as it was found to be much faster".

SEE ALSO

Links to pages and books that may contain useful info relevant to the stuff going on in the code - eg the book you stole the hash function from.

Per-section comments

If there is a collection of functions, structures or whatever which are grouped together and have a common theme or purpose, there should be a general comment at the start of the section briefly explaining their overall purpose. (Detailed essays should be left to the developer file). If there is really only one section, then the top-of-file comment already satisfies this requirement.

Per-entity comments

Every non-local named entity, be it a function, variable, structure, macro or whatever, must have an accompanying comment explaining its purpose. This comment must be in the special format described below, in order to allow automatic extraction by tools - for example, to generate per-API man pages, perldoc -f style utilities and so on.

Often the comment need only be a single line explaining its purpose, but sometimes more explanation may be needed. For example, "return an Integer Foo to its allocation pool" may be enough to demystify the function del_I_foo().

Each comment should be of the form

    /*

    =item C<function(arguments)>

    Description.

    =cut

    */

This inline POD documentation is parsed to HTML by running:

    % perl tools/docs/write_docs.pl -s
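
As an illustration of the comment format, a filled-in per-entity comment might look like the sketch below. The function clamp_int() is hypothetical and exists only to give the comment something to describe.

```c
/*

=item C<int clamp_int(int value, int min, int max)>

Returns C<value> limited to the inclusive range C<min>..C<max>.
The caller must ensure that C<min> is not greater than C<max>.

=cut

*/

int
clamp_int(int value, int min, int max)
{
    if (value < min) {
        return min;
    }
    if (value > max) {
        return max;
    }
    return value;
}
```

The =item line repeats the full signature so that an extraction tool can build an index without parsing the C code itself.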
Optimizations

Whenever code has deliberately been written in an odd way for performance reasons, you should point this out - if nothing else, to avoid some poor schmuck trying subsequently to replace it with something 'cleaner'.

    /* The loop is partially unrolled here as it makes it a lot faster.
     * See the .dev file for the full details
     */

General comments

While there is no need to go mad commenting every line of code, it is immensely helpful to provide a "running commentary" every 10 or so lines say; if nothing else, this makes it easy to quickly locate a specific chunk of code. Such comments are particularly useful at the top of each major branch, eg

    if (FOO_bar_BAZ(**p+*q) <= (r-s[FOZ & FAZ_MASK]) || FLOP_2(z99)) {
        /* we're in foo mode: clean up lexicals */
        ... (20 lines of gibberish) ...
    }
    else if (...) {
        /* we're in bar mode: clean up globals */
        ... (20 more lines of gibberish) ...
    }
    else {
        /* we're in baz mode: self-destruct */
        ....
    }

Extensibility

If Perl 5 is anything to go by, the lifetime of Perl 6 will be at least seven years. During this period, the source code will undergo many major changes never envisaged by its original authors - cf threads, Unicode in Perl 5. To this end, your code should balance out the assumptions that make things possible, fast or small, with the assumptions that make it difficult to change things in future. This is especially important for parts of the code which are exposed through APIs - the requirements of source or binary compatibility for such things as extensions can make it very hard to change things later on.

For example, if you define suitable macros to set/test flags in a struct, then you can later add a second word of flags to the struct without breaking source compatibility. (Although you might still break binary compatibility if you're not careful.) Of the following two methods of setting a common combination of flags, the second doesn't assume that all the flags are contained within a single field:

    foo->flags |= (FOO_int_FLAG | FOO_num_FLAG | FOO_str_FLAG);
    FOO_valid_value_SETALL(foo);
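
A macro of this kind might be defined along the following lines. This is a sketch under assumptions: the struct layout, the flag values, and the companion FOO_valid_value_TESTALL macro are invented for illustration.

```c
/* Hypothetical struct: today all the flags fit in one word */
struct Foo {
    unsigned int flags;
};

#define FOO_int_FLAG 0x1u
#define FOO_num_FLAG 0x2u
#define FOO_str_FLAG 0x4u

/* Callers name the operation, not the field. If the flags are later
 * split across two words, only these macro bodies need to change. */
#define FOO_valid_value_MASK \
    (FOO_int_FLAG | FOO_num_FLAG | FOO_str_FLAG)
#define FOO_valid_value_SETALL(f) \
    ((f)->flags |= FOO_valid_value_MASK)
#define FOO_valid_value_TESTALL(f) \
    (((f)->flags & FOO_valid_value_MASK) == FOO_valid_value_MASK)
```

Code written as FOO_valid_value_SETALL(foo) keeps compiling after such a change, whereas code that touches foo->flags directly would need to be found and edited by hand.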

Similarly, avoid using a char* (or {char*,length}) if it is feasible to later use a PMC* at the same point: cf UTF-8 hash keys in Perl 5.

Of course, private code hidden behind an API can play more fast and loose than code which gets exposed.

Portability

Related to extensibility is portability. Perl runs on many, many platforms, and will no doubt be ported to ever more bizarre and obscure ones over time. You should never assume an operating system, processor architecture, endian-ness, word size, or whatever. In particular, don't fall into any of the following common traps:

Internal data types and their utility functions (especially for strings) should be used over a bare char * whenever possible. Ideally there should be no char * in the source anywhere, and no use of C's standard string library.

Don't assume GNU C, and don't use any GNU extensions unless protected by #ifdefs for non-GNU-C builds.
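
One common way to follow this rule is to hide the extension behind a macro that expands to nothing on other compilers. The sketch below uses the GNU C unused attribute as the example; the macro name HYPO_UNUSED and the function always_zero() are hypothetical.

```c
/* Use the GNU C attribute only when the compiler claims GNU C
 * compatibility; elsewhere the macro expands to nothing. */
#ifdef __GNUC__
#  define HYPO_UNUSED __attribute__((unused))
#else
#  define HYPO_UNUSED
#endif

/* The parameter is deliberately ignored; the annotation silences the
 * unused-parameter warning under GNU C and vanishes otherwise. */
int
always_zero(int ignored HYPO_UNUSED)
{
    return 0;
}
```

The rest of the source then uses only the macro, so the #ifdef appears once rather than at every use.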

TBC ... Any contributions welcome !!!

Defensive programming

The const keyword on arguments

Use the const keyword as often as possible on pointers. It lets the compiler know when you intend to modify the contents of something. For example, take this definition:

    int strlen( const char *p );

The const qualifier tells the compiler that the argument will not be modified. The compiler can then tell you that this is an uninitialized variable:

    char *p;
    int n = strlen(p);

Without the const, the compiler has to assume that strlen() is actually initializing the contents of p.

The const keyword on variables

If you're declaring a temporary pointer, declare it const, with the const to the right of the *, to indicate that the pointer should not be modified.

    Wango * const w = get_current_wango();
    w->min = 0;
    w->max = 14;
    w->name = "Ted";

This prevents you from modifying w inadvertently.

    new_wango = w++; /* Error */

If you're not going to modify the target of the pointer, put a const to the left of the type, as in:

    const Wango * const w = get_current_wango();
    if ( n < w->min || n > w->max ) {
        /* do something */
    }
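
Putting both forms together, a compilable sketch might look like this. The Wango fields follow the fragments above; the static instance and the functions reset_wango() and wango_in_range() are assumptions added so the example stands alone.

```c
typedef struct Wango {
    int min;
    int max;
    const char *name;
} Wango;

/* Hypothetical backing store for the "current" wango */
static Wango current_wango = { 0, 0, "" };

Wango *
get_current_wango(void)
{
    return &current_wango;
}

/* The pointer is const, the target is not: fields may be assigned,
 * but w itself cannot be re-pointed or incremented. */
void
reset_wango(void)
{
    Wango * const w = get_current_wango();

    w->min = 0;
    w->max = 14;
    w->name = "Ted";
}

/* Both pointer and target are const: this function can only read. */
int
wango_in_range(int n)
{
    const Wango * const w = get_current_wango();

    return n >= w->min && n <= w->max;
}
```

Adding a line such as w->max = 99; to wango_in_range() would be rejected by the compiler, which is the point of the second const.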

Localizing variables

Declare variables in the innermost scope possible.

    if ( foo ) {
        int i;
        for ( i=0; i<n; i++ ) {
            do_something(i);
        }
    }

Don't reuse unrelated variables. Localize as much as possible, even if the variables happen to have the same names.

    if ( foo ) {
        int i;
        for ( i=0; i<n; i++ ) {
            do_something(i);
        }
    }
    else {
        int i;
        for ( i=14; i>0; i-- ) {
            do_something_else(i*i);
        }
    }

You could hoist the int i; outside the test, but now you'll have an i that's visible after it's used, which is confusing at best.

Performance

We want Perl to be fast. Very fast. But we also want it to be portable and extensible. Based on the 90/10 principle (or 80/20, or 95/5, depending on who you speak to), most performance is gained or lost in a few small but critical areas of code. Concentrate your optimization efforts there.

Note that the most overwhelmingly important factor in performance is in choosing the correct algorithms and data structures in the first place. Any subsequent tweaking of code is secondary to this. Also, any tweaking that is done should as far as possible be platform independent, or at least likely to cause speed-ups in a wide variety of environments, and do no harm elsewhere. Only in exceptional circumstances should assembly ever even be considered, and then only if generic fallback code is made available that can still be used by all other non-optimized platforms.

Probably the dominant factor (circa 2001) that affects processor performance is the cache. Processor clock rates have increased far in excess of main memory access rates, and the only way for the processor to proceed without stalling is for most of the data items it needs to be found to hand in the cache. It is reckoned that even a 2% cache miss rate can cause a slowdown in the region of 50%. It is for this reason that algorithms and data structures must be designed to be 'cache-friendly'.

A typical cache may have a block size of anywhere between 4 and 256 bytes. When a program attempts to read a word from memory and the word is already in the cache, then processing continues unaffected. Otherwise, the processor is typically stalled while a whole contiguous chunk of main memory is read in and stored in a cache block. Thus, after incurring the initial time penalty, you then get all the memory adjacent to the initially read data item for free. Algorithms that make use of this fact can experience quite dramatic speedups. For example, the following pathological code ran four times faster on my machine by simply swapping i and j.

    int a[1000][1000];

    ... (a gets populated) ...

    int i, j, k = 0;
    for (i=0; i<1000; i++) {
        for (j=0; j<1000; j++) {
            k += a[j][i];
        }
    }
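
For comparison, the two traversal orders can be written side by side as below. This is a sketch with hypothetical function names; both functions compute the same sum, and how much faster the row-first version runs will depend on the machine and compiler.

```c
#define N 1000

static int a[N][N];

/* Populate the array with small values so the sum fits in a long. */
void
fill_matrix(void)
{
    int i, j;

    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            a[i][j] = (i + j) % 7;
        }
    }
}

/* Pathological order: consecutive reads are N ints apart in memory,
 * so each read is likely to touch a different cache block. */
long
sum_column_first(void)
{
    long k = 0;
    int i, j;

    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            k += a[j][i];
        }
    }
    return k;
}

/* Swapped indices: consecutive reads are adjacent in memory, so each
 * cache block fetched is fully used before moving on. */
long
sum_row_first(void)
{
    long k = 0;
    int i, j;

    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            k += a[i][j];
        }
    }
    return k;
}
```

C stores a[i][0], a[i][1], ... contiguously, which is why the inner loop should vary the last index.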

This all boils down to: keep things near to each other that get accessed at around the same time. (This is why the important optimizations occur in data structure and algorithm design rather than in the detail of the code.) This rule applies both to the layout of different objects relative to each other, and to the relative positioning of individual fields within a single structure.

If you do put an optimization in, time it on as many architectures as you can, and be suspicious of it if it slows down on any of them! Perhaps it will be slow on other architectures too (current and future). Perhaps it wasn't so clever after all? If the optimization is platform specific, you should probably put it in a platform-specific function in a platform-specific file, rather than cluttering the main source with zillions of #ifdefs.

And remember to document it.

Loosely speaking, Perl tends to optimize for speed rather than space, so you may want to code for speed first, then tweak to reclaim some space while not affecting performance.

REFERENCES

The section on coding style is based on Perl5's Porting/patching.pod by Daniel Grisinger. The section on naming conventions grew from some suggestions by Paolo Molaro <lupus@lettere.unipd.it>. Other snippets came from various P5Pers. The rest of it is probably my fault.

VERSION

CURRENT

   Maintainer: Dave Mitchell <davem@fdgroup.com>
   Class: Internals
   PDD Number: 7
   Version: 1
   Status: Developing
   Last Modified: 6 August 2001
   PDD Format: 1
   Language: English

HISTORY

Based on an earlier draft which covered only code comments.

CHANGES

None. First version

