NAME ^

docs/pdds/pdd22_io.pod - Parrot I/O

ABSTRACT ^

Parrot's I/O subsystem.

VERSION ^

$Revision$

DEFINITIONS ^

A "stream" allows input or output operations on a source/destination such as a file, keyboard, or text console. Streams are also called "filehandles", though only some of them have anything to do with files.

DESCRIPTION ^

- Parrot I/O objects support both streams and network I/O.

- Parrot has both synchronous and asynchronous I/O operations.

- Asynchronous operations must interact safely with Parrot's other concurrency models.

IMPLEMENTATION ^

Composition ^

Currently, the Parrot I/O subsystem uses a per-interpreter stack to provide a layer-based approach to I/O. Each layer implements a subset of the ParrotIOLayerAPI vtable. To find an I/O function, the layer stack is searched downwards until a non-NULL function pointer is found for that particular slot.

This implementation will be replaced with a composition model. Rather than living in a stack, the module fragments that make up the ParrotIO class will be composed and any conflicts resolved when the class is loaded. This strategy eliminates the need to search a stack on each I/O call, while still allowing a "layered" combination of functionality for different platforms.
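The difference between the two lookup strategies can be sketched in Python. This is only a model under stated assumptions: the slot names, the two example layers, and the "topmost layer wins" conflict rule are hypothetical and are not taken from the ParrotIOLayerAPI.

```python
# Hypothetical model: slot names and layer contents are illustrative only.
from typing import Callable, Dict, List, Optional

Layer = Dict[str, Optional[Callable[..., str]]]

def lookup_in_stack(stack: List[Layer], slot: str) -> Callable[..., str]:
    """Current scheme: walk the stack from the top on every call."""
    for layer in stack:  # index 0 is the top of the stack
        fn = layer.get(slot)
        if fn is not None:
            return fn
    raise LookupError(f"no layer implements {slot!r}")

def compose(stack: List[Layer]) -> Dict[str, Callable[..., str]]:
    """Proposed scheme: resolve every slot once, when the class is loaded."""
    composed: Dict[str, Callable[..., str]] = {}
    for layer in stack:
        for slot, fn in layer.items():
            # Assumed conflict rule: the topmost non-empty entry wins,
            # and lower layers only fill the remaining gaps.
            if fn is not None and slot not in composed:
                composed[slot] = fn
    return composed

buffer_layer: Layer = {"read": lambda: "buffered read", "open": None}
unix_layer: Layer = {"read": lambda: "unix read", "open": lambda: "unix open"}
stack = [buffer_layer, unix_layer]

table = compose(stack)  # after this, each call is a single dictionary lookup
```

Both strategies select the same function for each slot; the composed table simply pays the search cost once instead of on every I/O call.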

Concurrency Model for Asynchronous I/O ^

Currently, Parrot only implements synchronous I/O operations. For the 1.0 release the asynchronous operations will be implemented separately from the synchronous ones. There may be an implementation that uses one variant to implement the other someday, but it's not an immediate priority.

Synchronous opcodes are differentiated from asynchronous opcodes by the presence of a callback argument in the asynchronous calls. Asynchronous calls that don't supply callbacks (perhaps if the user wants to manually check later if the operation succeeded) are enough of a fringe case that they don't need opcodes. They can access the functionality via methods on ParrotIO objects.

The asynchronous I/O implementation will use the composition model to allow some platforms to take advantage of their built-in asynchronous operations, layered behind Parrot's asynchronous I/O interface.

Asynchronous operations use a lightweight concurrency model. At the user level, Parrot follows the callback function model of asynchronous I/O. At the interpreter level, each asynchronous operation registers a task with the interpreter's concurrency scheduler. The registered task could represent a simple Parrot asynchronous I/O operation, a platform-native asynchronous I/O call, or even synchronous code in a full Parrot thread (rare but possibly useful for prototyping new features, or for mock objects in testing).

Communication between the calling code and the asynchronous operation task is handled by a shared status object. The operation task updates the status object whenever the status changes, and the calling code can check the status object at any time. The status object contains a reference to the returned result of an asynchronous I/O call. In order to allow sharing of the status object, asynchronous ops both pass the status object to the callback PMC, and return it to the calling code.

The lightweight tasks typically used by the asynchronous I/O system capture no state other than the arguments passed to the I/O call, and share no variables with the calling code other than the status object.
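As a rough illustration of this flow, the following Python sketch models a scheduler, a task, and a shared status object. All names here (Scheduler, Status, async_read, the "running"/"done" strings) are hypothetical stand-ins and do not describe Parrot's scheduler.

```python
from collections import deque
from typing import Any, Callable, Deque, Optional

class Status:
    """Shared status object (hypothetical fields)."""
    def __init__(self) -> None:
        self.state = "running"      # becomes "done" when the task finishes
        self.result: Any = None

class Scheduler:
    """Toy stand-in for the interpreter's concurrency scheduler."""
    def __init__(self) -> None:
        self.tasks: Deque[Callable[[], None]] = deque()

    def register(self, task: Callable[[], None]) -> None:
        self.tasks.append(task)

    def run_pending(self) -> None:
        while self.tasks:
            self.tasks.popleft()()

def async_read(sched: Scheduler, data: bytes, nbytes: int,
               callback: Optional[Callable[[Status], None]] = None) -> Status:
    status = Status()

    def task() -> None:
        # The task captures only the call's arguments and the status object.
        status.result = data[:nbytes]
        status.state = "done"
        if callback is not None:
            callback(status)

    sched.register(task)
    return status  # returned to the caller and later passed to the callback

sched = Scheduler()
seen = []
status = async_read(sched, b"hello world", 5, seen.append)
state_before = status.state   # the caller can inspect the status at any time
sched.run_pending()
```

The caller and the callback see the very same object, which is what lets the caller poll for completion without any other shared variables.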

[See http://en.wikipedia.org/wiki/Asynchronous_I/O for a relatively comprehensive list of asynchronous I/O implementation options.]

I/O PMC API ^

Methods

[Over and over again throughout this section, I keep wanting an API that isn't possible with current low-level PMCs. This could mean that low-level PMCs need a good bit of work to gain the same argument passing capabilities as higher-level Parrot objects (which is true, long-term). It could mean that Parrot I/O objects would be better off defined in a higher-level syntax, with embedded C (via NCI, or a lighter-weight embedding mechanism) for those pieces that really are direct C access. Or, it could mean that I'll come back and rip this interface down to a bare minimum.]

new

  $P0 = new ParrotIO
Creates a new I/O stream object. [Note that this is usually performed via the open opcode.]

open

  $P0 = $P1.open()
  $P0 = $P1.open($S2)
  $P0 = $P1.open($S2, $S3)
Opens a stream on an existing I/O stream object, and returns a status object. With no arguments, it can be used to reopen a previously opened I/O stream. $S2 is a file path and $S3 is an optional mode for the stream (read, write, read/write, etc), using the same format as the open opcode: 'r' for read, 'w' for write, 'a' for append, and 'p' for pipe. When the mode is set to write or append, a file is created without warning if none exists. When the mode is read (without write), a nonexistent file is an error.

The asynchronous version takes a PMC callback as an additional final argument. When the open operation is complete, it invokes the callback with a single argument: a status object containing the opened stream object.
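The mode rules for the synchronous form can be illustrated with Python's built-in open() standing in for the Parrot method. This is an analogy only: open_stream is a hypothetical helper, the 'p' (pipe) mode is not modelled, and the truncation behaviour of "wb" is a property of Python rather than something this document specifies.

```python
import os
import tempfile

def open_stream(path: str, mode: str = "r"):
    """Map the mode letters from the text onto Python's built-in open()."""
    if mode == "r":
        return open(path, "rb")       # raises FileNotFoundError if missing
    if mode == "w":
        return open(path, "wb")       # creates the file if none exists
    if mode == "a":
        return open(path, "ab")       # creates if missing, appends otherwise
    raise ValueError(f"unsupported mode {mode!r}")  # 'p' is not modelled here

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "example.txt")

try:
    open_stream(path, "r")            # read mode on a nonexistent file
    read_missing_failed = False
except FileNotFoundError:
    read_missing_failed = True

with open_stream(path, "a") as fh:    # file did not exist; it is created
    fh.write(b"first\n")
with open_stream(path, "a") as fh:
    fh.write(b"second\n")
with open_stream(path, "r") as fh:
    contents = fh.read()
```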

close

  $P0 = $P1.close()
  $P0 = $P1.close($P2)
Closes an I/O stream, but leaves destruction of the I/O object to the GC. The close method returns a PMC status object.

The asynchronous version takes an additional final PMC callback argument $P2. When the close operation is complete, it invokes the callback, passing it a status object. [There's not really much advantage in this over just leaving the object for the GC to clean up, but it does give you the option of executing an action when the stream has been closed.]

print

  $P0 = $P1.print($I2)
  $P0 = $P1.print($N2)
  $P0 = $P1.print($S2)
  $P0 = $P1.print($P2)
  $P0 = $P1.print($I2, $P3)
  $P0 = $P1.print($N2, $P3)
  $P0 = $P1.print($S2, $P3)
  $P0 = $P1.print($P2, $P3)
Writes an integer, float, string, or PMC value to an I/O stream object. Returns a PMC status object.

The asynchronous version takes an additional final PMC callback argument $P3. When the print operation is complete, it invokes the callback, passing it a status object.

read

  $S0 = $P1.read($I2)
  $P0 = $P1.read($I2, $P3)
Retrieves a specified number of bytes, $I2, from a stream, $P1, into a string, $S0. By default it reads in bytes, but the ParrotIO object can be configured to read in code points instead, by applying a utf8 or similar role to the object [the syntax for applying a role to an object has yet to be defined in PDD 15]. If there are fewer bytes remaining in the stream than specified in the read request, it returns the remaining bytes (with no error).

The asynchronous version takes an additional final PMC callback argument $P3, and only returns a status object $P0. When the read operation is complete, it invokes the callback, passing it a status object. The status object contains the return value: a string that may be in bytes or codepoints depending on the read mode of the I/O object. [The callback doesn't need to know the read mode of the original operation, as the information about the character encoding of the return value is contained in the string.]
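The difference between counting in bytes and counting in code points, and the short read at the end of a stream, can be seen with Python's standard io module. This is an analogy using Python's own stream classes, not the ParrotIO object or its roles.

```python
import io

data = "héllo".encode("utf-8")       # 6 bytes, 5 code points

byte_stream = io.BytesIO(data)
first = byte_stream.read(2)          # byte mode: the read splits the 2-byte "é"
rest = byte_stream.read(100)         # fewer than 100 bytes remain; no error

text_stream = io.TextIOWrapper(io.BytesIO(data), encoding="utf-8")
chars = text_stream.read(2)          # code point mode: two whole characters
```

In byte mode the caller may receive a partial character, which is one reason the code point mode is worth having for encoded text.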

readline

  $S0 = $P1.readline()
  $P0 = $P1.readline($P2)
Retrieves a single line from a stream $P1 into a string $S0. Calling readline flags the stream as operating in line-buffer mode (see the buffer_type method below). The readline operation respects the read mode of the I/O object the same as read does. Newlines are not removed from the end of the string.

The asynchronous version takes an additional final PMC callback argument $P2, and only returns a status object $P0. When the readline operation is complete, it invokes the callback, passing it a status object that contains the line read as its return value.

record_separator

  $S0 = $P1.record_separator()
  $P0.record_separator($S1)
Accessor (get and set) for the I/O stream's record separator attribute. The default value is a newline (CR, LF, CRLF, etc. depending on the platform).
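A minimal Python sketch of how a configurable record separator interacts with readline follows. The helper read_record is hypothetical; it reads one byte at a time for clarity rather than speed, and it keeps the separator on the returned record, as the readline description requires.

```python
import io

def read_record(stream: io.BufferedIOBase, separator: bytes = b"\n") -> bytes:
    """Read up to and including the next separator; the separator is kept."""
    record = bytearray()
    while not record.endswith(separator):
        byte = stream.read(1)
        if not byte:          # end of stream: return whatever was read
            break
        record += byte
    return bytes(record)

stream = io.BytesIO(b"a;b;c")
records = [read_record(stream, b";") for _ in range(4)]
```

The last record in the stream has no trailing separator, and a further call at end of stream yields an empty string.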

buffer_type

  $I0 = $P1.buffer_type()
  $S0 = $P1.buffer_type()
  $P0.buffer_type($I1)
  $P0.buffer_type($S1)
Accessor (get and set) for the I/O stream's buffer type attribute. The attribute is returned as an integer value of one of the following constants, or a string value of 'unbuffered', 'line-buffered', or 'full-buffered'.

  0    PIO_NONBUF
           Unbuffered I/O. Bytes are sent as soon as possible.
  1    PIO_LINEBUF
           Line buffered I/O. Bytes are sent when a record separator is
           encountered.
  2    PIO_FULLBUF
           Fully buffered I/O. Bytes are sent when the buffer is full.
           [Note, the constant was called "BLKBUF" because bytes are
           sent as a block, but line buffering also sends them as a
           block, so changed to "FULLBUF".]

buffer_size

  $I0 = $P1.buffer_size()
  $P0.buffer_size($I1)
Accessor (get and set) for the I/O stream's buffer size attribute. The size is specified in bytes (positive integer value), though the buffer may hold a varying number of characters when dealing with an encoding of multi-byte codepoints. The role that implements the handling of a particular character set must provide the logic that marks the buffer as "full" when it can't hold the next codepoint even if there are empty bytes in the buffer.

The buffer size can be set no matter what buffering mode is in use, but it's only relevant when in line buffering or full buffering mode (and line buffering mode will rarely reach the maximum buffer size).

It is recommended to only change the buffer size before starting IO operations, or after flushing the buffer. If the new size is larger than the existing data in the buffer, a size change is non-disruptive, but if the new size is smaller, it will truncate the buffer with a warning.
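The three buffering modes and the role of the buffer size can be modelled with a toy writer in Python. The class BufferedWriter and its `sent` list are hypothetical; only the constant names and values come from the table above, and the model counts plain bytes rather than multi-byte code points.

```python
from typing import List

PIO_NONBUF, PIO_LINEBUF, PIO_FULLBUF = 0, 1, 2

class BufferedWriter:
    """Toy writer; `sent` records each block handed to the underlying stream."""
    def __init__(self, buffer_type: int, buffer_size: int = 8,
                 separator: bytes = b"\n") -> None:
        self.buffer_type = buffer_type
        self.buffer_size = buffer_size
        self.separator = separator
        self.pending = bytearray()
        self.sent: List[bytes] = []

    def flush(self) -> None:
        if self.pending:
            self.sent.append(bytes(self.pending))
            self.pending.clear()

    def write(self, data: bytes) -> None:
        if self.buffer_type == PIO_NONBUF:
            if data:
                self.sent.append(data)    # unbuffered: send immediately
            return
        for i in range(len(data)):
            self.pending += data[i:i + 1]
            full = len(self.pending) >= self.buffer_size
            at_separator = (self.buffer_type == PIO_LINEBUF
                            and self.pending.endswith(self.separator))
            if full or at_separator:      # the size limit applies to both modes
                self.flush()

raw_writer = BufferedWriter(PIO_NONBUF)
raw_writer.write(b"ab")

line_writer = BufferedWriter(PIO_LINEBUF, buffer_size=8)
line_writer.write(b"ab\ncd")

full_writer = BufferedWriter(PIO_FULLBUF, buffer_size=4)
full_writer.write(b"ab\ncdef")
```

In line-buffered mode the separator usually triggers the flush long before the size limit is reached, which matches the remark that line buffering rarely hits the maximum buffer size.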

get_fd [RT #48312]

  $I0 = $P1.get_fd()
For stream objects that are simple wrappers around a Unix filehandle, get_fd retrieves the Unix integer file descriptor of the object. This method doesn't exist on stream objects that aren't Unix filehandles, so check with does for the appropriate role, or with can for the method, before calling it.

No asynchronous version.

{{ NOTE: use a config probe (behind does or can) to determine support }}

Status Object PMC API ^

get_integer (vtable)

  $I0 = $P1
Returns an integer status for the status object: 1 for successful completion, -1 for an error, and 0 while still running. [Discuss: This is largely to preserve current expectations of -1 for an error. If we move away from that, is there a better representation?]

get_bool (vtable)

  if $P0 goto ...
Returns a boolean status for the status object, true for successful completion or while still running, false for an error.

return

  $P0 = $P1.return()
Retrieves the return value of the asynchronous operation from the status object. Returns a null PMC while still running, or if the operation had no return value.

error

  $P0 = $P1.error()
Retrieves the error object from the status object, if the execution of the asynchronous operation terminated with an error. The error object is derived from Exception, and can be thrown from the callback. If there was no error, or the asynchronous operation is still running, returns a null PMC.

throw

  $P0.throw()
Throws an exception from the status object if it contains an error object, otherwise does nothing.
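Taken together, the status object API in this section behaves roughly like the following Python class. This is a hypothetical analogue, not the PMC itself: the complete and fail methods are invented here to drive the example, `return` is renamed to result because it is a Python keyword, and None stands in for the null PMC.

```python
from typing import Any, Optional

class Status:
    """Hypothetical Python analogue of the status object PMC."""
    def __init__(self) -> None:
        self._done = False
        self._value: Any = None
        self._error: Optional[Exception] = None

    def complete(self, value: Any = None) -> None:   # helper for this sketch
        self._done, self._value = True, value

    def fail(self, error: Exception) -> None:        # helper for this sketch
        self._done, self._error = True, error

    def __int__(self) -> int:          # get_integer: 1, -1, or 0
        if self._error is not None:
            return -1
        return 1 if self._done else 0

    def __bool__(self) -> bool:        # get_bool: false only on error
        return self._error is None

    def result(self) -> Any:           # the `return` method
        return self._value

    def error(self) -> Optional[Exception]:
        return self._error

    def throw(self) -> None:           # does nothing when there is no error
        if self._error is not None:
            raise self._error

running, ok, failed = Status(), Status(), Status()
ok.complete("data")
failed.fail(IOError("disk full"))
```

Note that the boolean value cannot distinguish "still running" from "succeeded"; a caller that needs that distinction has to use the integer value.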

I/O Iterator PMC API ^

[Implementation NOTE: this may either be the default Iterator object applied to a ParrotIO object, a separate Iterator object for I/O objects, or an Iterator role applied to I/O objects.]

new

  new $P0, 'Iterator', $P1
Create a new iterator object $P0 from I/O object $P1.

shift

  shift $S0, $P1
Retrieve the next line/block $S0 from the I/O iterator $P1. The amount of data retrieved in each iteration is determined by the I/O object's buffer_type setting: unbuffered, line-buffered, or fully-buffered.

get_bool (vtable)

  unless $P0 goto iter_end
Returns a boolean value for the iterator, true if there is more data to pull from the I/O object, false if the iterator has reached the end of the data. [NOTE: this means that an iterator always checks for the next line/block of data when it retrieves the current one.]
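The lookahead implied by get_bool can be sketched in Python as follows. LineIterator is a hypothetical class that always operates line by line; it does not model the other buffer_type settings.

```python
import io
from typing import Optional

class LineIterator:
    """Iterator whose truth value reports whether more data is available."""
    def __init__(self, stream: io.TextIOBase) -> None:
        self._stream = stream
        self._next: Optional[str] = self._fetch()   # prefetch the first line

    def _fetch(self) -> Optional[str]:
        line = self._stream.readline()
        return line if line else None

    def __bool__(self) -> bool:                      # get_bool
        return self._next is not None

    def shift(self) -> str:
        if self._next is None:
            raise IndexError("iterator is exhausted")
        # Hand back the prefetched line and immediately fetch the one after it.
        current, self._next = self._next, self._fetch()
        return current

it = LineIterator(io.StringIO("one\ntwo\n"))
lines = []
while it:
    lines.append(it.shift())
```

One consequence of the prefetch is that the iterator reads from the stream one step ahead of what the caller has consumed, which matters for interactive sources.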

I/O Opcodes ^

The signatures for the asynchronous operations are nearly identical to the synchronous operations, but the asynchronous operations take an additional argument for a callback, and the only return value from the asynchronous operations is a status object. When the callbacks are invoked, they are passed the status object as their sole argument. Any return values from the operation are stored within the status object.

The listing below says little about whether the opcodes return error information. For now assume that they can either return a status object, or return nothing. Error handling is discussed more thoroughly below in "Error Handling".

I/O Stream Opcodes ^

Opening and closing streams

open

  $P0 = open $S1
  $P0 = open $S1, $S2
  $P0 = open $P1
  $P0 = open $P1, $S2
Opens a stream object based on a file path in $S1 and returns it. The stream object defaults to read/write mode. The optional string argument $S2 specifies the mode of the stream (read, write, append, read/write, etc.). Currently the mode of the stream is set with a string argument similar to Perl 5 syntax, but a language-agnostic mode string is preferable, using 'r' for read, 'w' for write, 'a' for append, and 'p' for pipe.

The asynchronous version takes a PMC callback as an additional final argument. When the open operation is complete, it invokes the callback with a single argument: a status object containing the opened stream object.

close

  close $P0
  close $P0, $P1
Closes a stream object. It takes a single stream object argument and returns a status object.

The asynchronous version takes an additional final PMC callback argument. When the close operation is complete, it invokes the callback, passing it a status object.

Retrieving existing streams

These opcodes do not have asynchronous variants.

Writing to streams

print

  print $I0
  print $N0
  print $S0
  print $P0
  print $P0, $I1
  print $P0, $N1
  print $P0, $S1
  print $P0, $P1
  print $P0, $I1, $P2
  print $P0, $N1, $P2
  print $P0, $S1, $P2
  print $P0, $P1, $P2
Writes an integer, float, string, or PMC value to a stream. It writes to standard output by default, but optionally takes a PMC argument to select another stream to write to.

The asynchronous version takes an additional final PMC callback argument. When the print operation is complete, it invokes the callback, passing it a status object.

printerr

  printerr $I0
  printerr $N0
  printerr $S0
  printerr $P0
Writes an integer, float, string, or PMC value to standard error.

There is no asynchronous variant of printerr. [It's just a shortcut. If they want an asynchronous version, they can use print.]

Reading from streams

read

  $S0 = read $I1
  $S0 = read $P1, $I2
  $P0 = read $P1, $I2, $P3
Retrieves a specified number of bytes, $I2, from a stream, $P1, into a string, $S0. [Note this is bytes, not codepoints.] By default it reads from standard input, but it also takes an alternate stream object source as an optional argument.

The asynchronous version takes an additional final PMC callback argument, and only returns a status object. When the read operation is complete, it invokes the callback, passing it a status object that contains the string of bytes read.

readline

  $S0 = readline $P1
  $P0 = readline $P1, $P2
Retrieves a single line from a stream into a string. Calling readline flags the stream as operating in line-buffer mode (see pioctl below).

The asynchronous version takes an additional final PMC callback argument, and only returns a status object. When the readline operation is complete, it invokes the callback, passing it a status object that contains the line read.

peek

  $S0 = peek
  $S0 = peek $P1
['peek', 'seek', 'tell', and 'poll' are all candidates for moving from opcodes to ParrotIO object methods.]

peek retrieves the next byte from a stream into a string, but doesn't remove it from the stream. By default it reads from standard input, but it also takes a stream object argument for an alternate source.

There is no asynchronous version of peek. [Does anyone have a line of reasoning why one might be needed? The concept of "next byte" seems to be a synchronous one.]

Retrieving and setting stream properties

seek

  seek $P0, $I1, $I2
  seek $P0, $I1, $I2, $I3
  seek $P0, $I1, $I2, $P3
  seek $P0, $I1, $I2, $I3, $P4
Sets the current file position of a stream object, $P0, to an integer byte offset, $I1, from an integer starting position, $I2 (0 for the start of the file, 1 for the current position, and 2 for the end of the file). It also has a 64-bit variant that sets the byte offset by two integer arguments, $I1 and $I2 (one for the first 32 bits of the 64-bit offset, and one for the second 32 bits), with the starting position in $I3. [The two-register emulation for 64-bit integers may be deprecated in the future.]

The asynchronous version takes an additional final PMC callback argument. When the seek operation is complete, it invokes the callback, passing it a status object and the stream object it was called on.

tell

  $I0 = tell $P1
  ($I0, $I1) = tell $P2
Retrieves the current file position of a stream object. It also has a 64-bit variant that returns the byte offset as two integers (one for the first 32 bits of the 64-bit offset, and one for the second 32 bits). [The two-register emulation for 64-bit integers may be deprecated in the future.]
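The two-register emulation used by the 64-bit variants of seek and tell amounts to the arithmetic below. The sketch assumes that "the first 32 bits" means the high-order half, which the text does not state explicitly; join_offset and split_offset are hypothetical helper names. Python's stream classes happen to use the same 0/1/2 starting-position values.

```python
import io
from typing import Tuple

MASK32 = 0xFFFFFFFF

def join_offset(high: int, low: int) -> int:
    """Combine two 32-bit halves into one 64-bit offset (high word first)."""
    return ((high & MASK32) << 32) | (low & MASK32)

def split_offset(offset: int) -> Tuple[int, int]:
    """Split a 64-bit offset into (high, low) 32-bit halves."""
    return (offset >> 32) & MASK32, offset & MASK32

stream = io.BytesIO(b"0123456789")
stream.seek(join_offset(0, 4), 0)      # 0: from the start of the file
after_start = stream.tell()
stream.seek(2, 1)                      # 1: from the current position
after_current = stream.tell()
stream.seek(-1, 2)                     # 2: from the end of the file
after_end = stream.tell()
```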

No asynchronous version.

poll

  $I0 = poll $P1, $I2, $I3, $I4
Polls a stream or socket object for particular types of events (an integer flag) at a frequency set by seconds and microseconds (the final two integer arguments). [At least, that's what the documentation in src/io/io.c says. In actual fact, the final two arguments seem to be setting the timeout, exactly the same as the corresponding argument to the system version of poll.]

See the system documentation for poll to see the constants for event types and return status.

This opcode is inherently synchronous (poll is "synchronous I/O multiplexing"), but it can retrieve status information from a stream or socket object whether the object is being used synchronously or asynchronously.
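If the final two arguments are read as a timeout, as the note above suggests, the behaviour resembles the following Python sketch. It uses select.select rather than the system poll call, and wait_readable is a hypothetical helper; it checks only for readability, not for the full set of event flags.

```python
import select
import socket

def wait_readable(sock: socket.socket, seconds: int, microseconds: int) -> bool:
    """Return True if `sock` becomes readable before the timeout expires."""
    timeout = seconds + microseconds / 1_000_000
    readable, _, _ = select.select([sock], [], [], timeout)
    return bool(readable)

left, right = socket.socketpair()
idle = wait_readable(right, 0, 10_000)      # nothing sent yet: times out
left.sendall(b"x")
ready = wait_readable(right, 1, 0)          # data pending: returns promptly
left.close()
right.close()
```

The call blocks the caller for at most the given timeout, which is why the operation is described as inherently synchronous even when the socket itself is used asynchronously.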

Deprecated opcodes

Filesystem Opcodes ^

[Okay, I'm seriously considering moving most of these to methods on the ParrotIO object. More than that, moving them into a role that is composed into the ParrotIO object when needed. For the ones that have the form 'opcodename parrotIOobject, arguments', I can't see that it's much less effort than 'parrotIOobject.methodname(arguments)' for either manually writing PIR or generating PIR. The slowest thing about I/O is I/O, so I can't see that we're getting much speed gain out of making them opcodes. The ones to keep as opcodes are 'unlink', 'rmdir', and 'opendir'.]

Network I/O Opcodes ^

Most of these opcodes conform to the standard UNIX interface, but the layer API allows alternate implementations for each.

[These I'm also considering moving to methods in a role for the ParrotIO object. Keep 'socket' as an opcode, or maybe just make 'socket' an option on creating a new ParrotIO object.]

Error Handling ^

Currently some of the networking opcodes (connect, recv, send, poll, bind, and listen) return an integer indicating the status of the call, -1 or a system error code if unsuccessful. Other I/O opcodes (such as accept) have various different strategies for error notification, and others have no way of marking errors at all. We want to unify all I/O opcodes so they use a consistent strategy for error notification.

Synchronous operations

Synchronous I/O operations return an integer status code indicating success or failure in addition to their ordinary return value(s). This approach has the advantage of being lightweight: returning a single additional integer is cheap.

[Discuss: should synchronous operations take the same error handling strategy as asynchronous ones?]

Asynchronous operations

Asynchronous I/O operations return a status object. The status object contains an integer status code, string status/error message, and boolean success value.

An error callback may be set on a status object, though it isn't required. This callback will be invoked if the asynchronous operation terminates in an error condition. The error callback takes one argument, which is the status object containing all information about the failed call. If no error callback is set, then the standard callback will be invoked, and the user will need to check for error conditions in the status object as the first operation of the handler code.
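The dispatch rule for the two callbacks might look like this in Python. The sketch assumes that exactly one callback runs per operation, which the text implies but does not state outright; Status, finish, and the integer codes are hypothetical.

```python
from typing import Callable, List, Optional

class Status:
    def __init__(self) -> None:
        self.code = 0                 # assumed: 0 running, 1 success, -1 error
        self.message = ""
        self.error_callback: Optional[Callable[["Status"], None]] = None

def finish(status: Status, callback: Callable[[Status], None],
           code: int, message: str = "") -> None:
    """Record the outcome, then invoke exactly one of the two callbacks."""
    status.code, status.message = code, message
    if code < 0 and status.error_callback is not None:
        status.error_callback(status)
    else:
        callback(status)   # without an error callback, this must check the code

log: List[str] = []

def on_done(s: Status) -> None:
    # The standard handler checks for errors as its first step.
    if s.code < 0:
        log.append("done-with-error: " + s.message)
    else:
        log.append("done")

def on_error(s: Status) -> None:
    log.append("error: " + s.message)

a = Status()
a.error_callback = on_error
finish(a, on_done, -1, "connection reset")   # routed to the error callback

b = Status()
finish(b, on_done, -1, "connection reset")   # no error callback is set

c = Status()
c.error_callback = on_error
finish(c, on_done, 1)                        # success: standard callback
```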

Exceptions

At some point in the future, I/O objects may also provide a way to throw exceptions on error conditions. This feature will be enabled by calling a method on the I/O object to set an internal flag. The exception throwing will be implemented as a method call on the status object.

Note that exception handlers for asynchronous I/O operations will likely have to be set at a global scope because execution will have left the dynamic scope of the I/O call by the time the error occurs.

IPv6 Support ^

The transition from IPv4 to IPv6 is in progress, though not likely to be complete anytime soon. Most operating systems today offer at least dual-stack IPv6 implementations, so they can use either IPv4 or IPv6, depending on what's available. Parrot also needs to support either protocol. For the most part, the network I/O opcodes should internally handle either addressing scheme, without requiring the user to specify which scheme is being used.

The IETF recommends defaulting to IPv6 connections and falling back to IPv4 connections when IPv6 fails. This would give us more solid testing of Parrot's compatibility with IPv6, but may be too slow. Either way, it's a good idea to make setting the default (or selecting one exclusively) an option when compiling Parrot.
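The "prefer IPv6, fall back to IPv4" policy can be sketched with Python's standard socket module. This is one possible approach, not a description of Parrot's network layer; prefer_ipv6 and connect are hypothetical helpers, and the five-second timeout is an arbitrary choice.

```python
import socket
from typing import List, Tuple

AddrInfo = Tuple[int, int, int, str, tuple]

def prefer_ipv6(infos: List[AddrInfo]) -> List[AddrInfo]:
    """Stable sort that puts AF_INET6 entries ahead of everything else."""
    return sorted(infos, key=lambda info: info[0] != socket.AF_INET6)

def connect(host: str, port: int, timeout: float = 5.0) -> socket.socket:
    """Try each resolved address in preference order; raise the last error."""
    last_error: OSError = OSError(f"no addresses found for {host!r}")
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    for family, socktype, proto, _name, address in prefer_ipv6(infos):
        try:
            sock = socket.socket(family, socktype, proto)
        except OSError as exc:        # e.g. address family not supported
            last_error = exc
            continue
        sock.settimeout(timeout)
        try:
            sock.connect(address)
            return sock
        except OSError as exc:        # fall through to the next address
            last_error = exc
            sock.close()
    raise last_error
```

The cost concern raised above is visible here: when IPv6 is resolvable but unreachable, the caller waits for the first attempt to fail or time out before IPv4 is tried.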

The most important issues for Parrot to consider with IPv6 are:

See the relevant IETF RFCs: "Application Aspects of IPv6 Transition" (http://www.ietf.org/rfc/rfc4038.txt) and "Basic Socket Interface Extensions for IPv6" (http://www.ietf.org/rfc/rfc3493.txt).

ATTACHMENTS ^

None.

FOOTNOTES ^

None.

REFERENCES ^

  src/io/io.c
  src/ops/io.ops
  include/parrot/io.h
  runtime/parrot/library/Stream/*
  src/io/io_unix.c
  src/io/io_win32.c
  Perl 5's IO::AIO
  Perl 5's POE

