Perl and XS: Concepts

Concepts

To call C from Perl, control and data pass from Perl to C.

Data representation

C Data representation

C generally uses the processor's native formats for data. A character is one byte, and on most current platforms an int is a 32-bit (four-byte) number. Complex data types are aggregates of simple data types. For example, int x[2]; occupies eight bytes in memory: four bytes for x[0], followed by four for x[1].

struct S
{
    int   a;
    char  b[4];
};

is four bytes for a, followed by four bytes for b.
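The layout described above can be checked from C itself. The following is a minimal sketch, assuming a platform where int is four bytes and the compiler inserts no padding between a and b; the helper name check_layout is made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>   /* offsetof, size_t */

struct S
{
    int   a;
    char  b[4];
};

/* Hypothetical helper: verifies the layout claims made in the text. */
void check_layout(void)
{
    int x[2];

    /* An array of two ints is two ints back to back. */
    assert(sizeof x == 2 * sizeof(int));

    /* a sits at the start of the struct, b directly after it. */
    assert(offsetof(struct S, a) == 0);
    assert(offsetof(struct S, b) == sizeof(int));
}
```

On a platform with a different int size the byte counts change, but the ordering of the members does not.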

Perl Data representation

Inside Perl, data objects are C structures. For example, a scalar looks, in simplified form, like

typedef enum
{
    IOK = 0x01,		/* has valid integer value */
    POK = 0x02,		/* has valid string  value */
} Flags;

struct Scalar
{
    int   refcnt;	/* how many references to us 	*/
    Flags flags;	/* what we are			*/
    char *pv;		/* pointer to malloc'd string	*/
    int	  cur;		/* length of pv as a C string	*/
    int	  len;		/* allocated size of pv		*/
    int	  iv;		/* integer value		*/
};

The Scalar struct allows Perl to manage the type information for each scalar. For example, when Perl executes

my $x = 42;

it allocates a Scalar struct, sets refcnt to 1, sets iv to 42, sets the IOK flag, and clears the POK flag.

If we later write

print "$x";

the interpreter allocates space for pv, calls sprintf(pv, "%d", iv) to convert iv to a string, and sets the POK flag.
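The two steps above can be sketched in C using the simplified Scalar struct from this page. This is only a toy model: the function names scalar_new_iv, scalar_get_pv, and scalar_free are invented here and are not part of the Perl C API.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

typedef enum
{
    IOK = 0x01,   /* has valid integer value */
    POK = 0x02    /* has valid string  value */
} Flags;

typedef struct Scalar
{
    int   refcnt;
    Flags flags;
    char *pv;
    int   cur;
    int   len;
    int   iv;
} Scalar;

/* Toy model of: my $x = 42; */
Scalar *scalar_new_iv(int iv)
{
    Scalar *sv = calloc(1, sizeof *sv);
    if (sv == NULL)
        return NULL;
    sv->refcnt = 1;
    sv->iv     = iv;
    sv->flags  = IOK;          /* integer is valid, POK is clear */
    return sv;
}

/* Toy model of using the scalar as a string: pv is filled in on first use. */
const char *scalar_get_pv(Scalar *sv)
{
    assert(sv != NULL);
    if (!(sv->flags & POK)) {
        int   need = snprintf(NULL, 0, "%d", sv->iv);   /* characters needed */
        char *buf  = malloc((size_t)need + 1);
        if (buf == NULL)
            return NULL;
        snprintf(buf, (size_t)need + 1, "%d", sv->iv);
        sv->pv    = buf;
        sv->cur   = need;
        sv->len   = need + 1;
        sv->flags = (Flags)(sv->flags | POK);           /* IOK stays set */
    }
    return sv->pv;
}

void scalar_free(Scalar *sv)
{
    if (sv != NULL) {
        free(sv->pv);
        free(sv);
    }
}
```

After the first call to scalar_get_pv, both IOK and POK are set, so later uses as either a number or a string need no further conversion.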

When a reference to a scalar is created, the interpreter increments refcnt, and when a reference to the scalar goes away, the interpreter decrements refcnt. When the last reference to the scalar (including $x) goes away, refcnt reduces to zero, and the interpreter frees the Scalar.
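Reference counting as described here might look like the following in C. This is a sketch under the same simplified model; scalar_new, scalar_inc, scalar_dec, and the live_scalars counter are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct Scalar
{
    int   refcnt;   /* how many references to us */
    char *pv;       /* pointer to malloc'd string, or NULL */
    int   iv;       /* integer value */
} Scalar;

int live_scalars = 0;   /* demo-only counter of allocated Scalars */

/* A new scalar starts with one reference: the variable itself. */
Scalar *scalar_new(int iv)
{
    Scalar *sv = calloc(1, sizeof *sv);
    if (sv == NULL)
        return NULL;
    sv->refcnt = 1;
    sv->iv     = iv;
    live_scalars++;
    return sv;
}

/* Called when a new reference to the scalar is created. */
void scalar_inc(Scalar *sv)
{
    sv->refcnt++;
}

/* Called when a reference goes away; frees the scalar at zero. */
void scalar_dec(Scalar *sv)
{
    assert(sv->refcnt > 0);
    if (--sv->refcnt == 0) {
        free(sv->pv);
        free(sv);
        live_scalars--;
    }
}
```

The caller must not touch the scalar after the scalar_dec call that drops the count to zero.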

In Perl, unlike C, the programmer does not need to specify the data's type. Perl knows the type of data and converts as necessary. The programmer does not need to manage storage allocation. Perl knows the size and location of data, and allocates and frees as necessary.

Program Execution

Running a C program involves two steps. First the compiler translates the source code into machine code. Then the CPU executes the machine code. Running a Perl program divides into two similar steps. First the interpreter translates the source code into a syntax tree. Then the interpreter executes the syntax tree.

The Syntax Tree

The Perl interpreter translates Perl source code into a "syntax tree". The nodes of the tree are operations such as + and =, called "opcodes". The children of a node represent its operands, such as numbers to be added. After the interpreter builds the syntax tree, it executes the program by "walking" the nodes of the tree in "postfix" order (also called postorder). "Postfix" means walking the children of a node before the node itself.

Walking a node typically yields a value, so we also speak of "evaluating" a node. The interpreter keeps these values on a stack. To evaluate a node, the interpreter takes its operands off the stack, carries out the operation, and puts the result back onto the stack. Suppose we have the Perl statement

	$x = $y + 3;

This is parsed into a syntax tree

	   =
	  / \
	$x   +
	    / \
	  $y   3

The nodes are evaluated in the order

Step    1      2        3           4        5
Node    $x     $y       3           +        =
Stack   \$x    42 \$x   3 42 \$x    45 \$x   (empty)

Here is what happens in each step.

1. Since the = node will assign to $x, the interpreter pushes a reference to $x, not its value.
2. The value of $y turns out to be 42.
3. 3 evaluates to 3.
4. The + node pops the values 3 and 42, adds them, and pushes the sum onto the stack.
5. The = pops the value 45 and the reference to $x, and carries out the assignment. Now the stack is empty.
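The walk can be imitated with a small C program. The sketch below is a toy tree walker, not the interpreter's real data structures: the Node and Slot types, the OP_* names, and demo_assign are all invented for this example, and values are plain ints rather than Perl scalars.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { OP_CONST, OP_VAR, OP_LVALUE, OP_ADD, OP_ASSIGN } OpType;

typedef struct Node
{
    OpType       type;
    int          value;         /* OP_CONST: the literal              */
    int         *var;           /* OP_VAR, OP_LVALUE: variable storage */
    struct Node *left, *right;  /* OP_ADD, OP_ASSIGN: the operands     */
} Node;

/* A stack entry is either a value or a reference to a variable. */
typedef struct { int value; int *ref; } Slot;

enum { STACK_MAX = 16 };
Slot stack[STACK_MAX];
int  sp = 0;                    /* number of entries on the stack */

void push(Slot s) { assert(sp < STACK_MAX); stack[sp++] = s; }
Slot pop(void)    { assert(sp > 0); return stack[--sp]; }

/* Walk children first, then the node itself (postfix order). */
void walk(const Node *n)
{
    Slot s = { 0, NULL };

    if (n->left)  walk(n->left);
    if (n->right) walk(n->right);

    switch (n->type) {
    case OP_CONST:  s.value = n->value; push(s); break;
    case OP_VAR:    s.value = *n->var;  push(s); break;
    case OP_LVALUE: s.ref   = n->var;   push(s); break;  /* reference, not value */
    case OP_ADD: {
        Slot b = pop();
        Slot a = pop();
        s.value = a.value + b.value;
        push(s);
        break;
    }
    case OP_ASSIGN: {
        Slot val    = pop();
        Slot target = pop();
        *target.ref = val.value;
        break;
    }
    }
}

/* Builds the tree for $x = $y + 3, walks it, and returns the new $x. */
int demo_assign(int y_value)
{
    int  x = 0, y = y_value;
    Node nx  = { OP_LVALUE, 0, &x,   NULL, NULL };
    Node ny  = { OP_VAR,    0, &y,   NULL, NULL };
    Node n3  = { OP_CONST,  3, NULL, NULL, NULL };
    Node add = { OP_ADD,    0, NULL, &ny,  &n3  };
    Node asg = { OP_ASSIGN, 0, NULL, &nx,  &add };

    walk(&asg);
    return x;
}
```

Calling demo_assign(42) visits the nodes in the order $x, $y, 3, +, = and leaves the stack empty, matching the five steps listed above.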

Subroutine calls

Perl

The Perl interpreter creates a separate syntax tree for each subroutine in the program. Each syntax tree is managed by a code reference. A code reference is the thing that you get when you write

my $coderef = sub { ... };

or

sub foo { ... }
my $coderef = \&foo;

in Perl. Internally, a code reference is represented by a C struct. This struct has a pointer, called the root pointer, to the root of the syntax tree for the subroutine.

A subroutine call in the program source is represented in the syntax tree by a fragment that looks like this:

	    entersub
	       |
	  +----+--...---+
	  |    |        |
	arg1 arg2 ... argN

The entersub opcode transfers control to the called subroutine. Its children evaluate the arguments of the call.

To execute a subroutine call, the interpreter first walks each child node, and pushes the result onto the Perl stack. When all the arguments are on the stack, the interpreter walks the entersub node.

An entersub opcode holds a pointer to a code reference. When the interpreter walks an entersub, it follows this pointer to the code reference, and then follows the root pointer in the code reference to the syntax tree for the subroutine. Then it executes the subroutine.

The things on the Perl stack are not the C structs that represent the arguments of the subroutine, but pointers to the structs. In other words, Perl passes parameters by reference, unlike C.

If the subroutine returns any values, it pushes pointers to them onto the stack, in the same locations where its parameters were. After the subroutine returns, the caller retrieves the return values from the stack.

eXternal Subroutines

A code reference also has a C function pointer, a field that contains the address of the entry point of a compiled C subroutine. We'll call this field the xsub pointer, and the C subroutine it points to the xsub.

When the interpreter executes entersub, it first checks the xsub pointer in the code reference. If the xsub pointer is null, it follows the root pointer to the syntax tree for the subroutine and walks it.

If the xsub pointer is not null, the interpreter ignores the root pointer. Instead, it gets the address of the xsub from the xsub pointer and calls the xsub. At that point, control passes from Perl to C.
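The dispatch decision can be sketched in C as follows. This is a toy model of the logic described above, not the interpreter's source: CodeRef, XsubPtr, entersub, walk_tree, and the two counters are names made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Node    Node;      /* syntax tree node; details not needed here */
typedef struct CodeRef CodeRef;
typedef void (*XsubPtr)(CodeRef *cv);

struct CodeRef
{
    Node   *root;   /* root pointer: syntax tree of a Perl subroutine */
    XsubPtr xsub;   /* xsub pointer: NULL unless the sub is written in C */
};

typedef enum { RAN_PERL_SUB, RAN_XSUB } Dispatch;

int tree_walks = 0;   /* demo-only counters */
int xsub_calls = 0;

/* Stand-in for walking the subroutine's syntax tree. */
void walk_tree(Node *root)
{
    (void)root;
    tree_walks++;
}

/* Toy version of what happens when the interpreter walks entersub. */
Dispatch entersub(CodeRef *cv)
{
    assert(cv != NULL);
    if (cv->xsub != NULL) {
        cv->xsub(cv);            /* control passes from Perl to C */
        return RAN_XSUB;
    }
    walk_tree(cv->root);         /* ordinary Perl subroutine */
    return RAN_PERL_SUB;
}

/* An example xsub for the demo. */
void my_xsub(CodeRef *cv)
{
    (void)cv;
    xsub_calls++;
}
```

A CodeRef whose xsub field is NULL takes the tree-walking branch; one with xsub set to my_xsub never looks at its root pointer.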

Loading, Linking, and Installation

For a C subroutine to become an xsub, the subroutine has to be loaded into memory, and the interpreter has to set the xsub pointer in a code reference to the entry point of the subroutine.

The xsub pointer is set as follows. The Perl C API includes a routine newXS. Leaving out its return value and a third argument that names the source file, it looks like

newXS (char *name, void (*fp)())

Given the name of a Perl subroutine in name, and the address of the entry point of a C subroutine in fp, newXS installs fp as the xsub pointer in the code reference for name. Once this happens, Perl code that calls name() will invoke the C subroutine.

The name can be in any package. For example, to install a subroutine called new() in the package Align::NW, we pass the string "Align::NW::new" for name.
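To make the idea concrete, here is a toy registry in C that mimics what installing an xsub pointer under a fully qualified name involves. It does not use the Perl C API: toy_newXS, toy_call, find_sub, and the fixed-size table are assumptions made for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef void (*XsubPtr)(void);

typedef struct
{
    const char *name;   /* fully qualified name, e.g. "Align::NW::new" */
    XsubPtr     xsub;   /* xsub pointer, NULL if none installed */
} CodeRef;

enum { MAX_SUBS = 32 };
CodeRef symtab[MAX_SUBS];   /* stand-in for Perl's symbol table */
size_t  nsubs = 0;

CodeRef *find_sub(const char *name)
{
    for (size_t i = 0; i < nsubs; i++)
        if (strcmp(symtab[i].name, name) == 0)
            return &symtab[i];
    return NULL;
}

/* Install fp as the xsub pointer in the code reference for name. */
CodeRef *toy_newXS(const char *name, XsubPtr fp)
{
    CodeRef *cv = find_sub(name);
    if (cv == NULL) {
        assert(nsubs < MAX_SUBS);
        cv = &symtab[nsubs++];
        cv->name = name;        /* caller keeps the string alive */
    }
    cv->xsub = fp;
    return cv;
}

/* Call name(); returns 1 if an xsub ran, 0 if none is installed. */
int toy_call(const char *name)
{
    CodeRef *cv = find_sub(name);
    if (cv == NULL || cv->xsub == NULL)
        return 0;
    cv->xsub();
    return 1;
}

int new_calls = 0;                         /* demo-only counter */
void align_nw_new(void) { new_calls++; }   /* the C subroutine  */
```

After toy_newXS("Align::NW::new", align_nw_new), a call to toy_call("Align::NW::new") reaches the C function; before it, the same call finds nothing to run.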

There are two ways to link C subroutines in a library to an executable: static linking and dynamic linking. In static linking, the Perl interpreter is linked with the library when it is compiled, creating a modified Perl executable that includes the C subroutines. In dynamic linking, a Perl program can "load" a library while running and look up the subroutine entry point in the library's symbol table.

Dynamic linking is done by a Perl module called XSLoader. The Perl part of an XS module typically starts like this:

package My::Module;
our $VERSION = '0.01';
use XSLoader;
XSLoader::load 'My::Module', $VERSION;

When the module loads, it calls XSLoader::load. This locates the library, loads it, and calls the library's bootstrap routine, which in turn calls newXS for each xsub in the library.

Parameter Passing

The Perl interpreter puts things on the Perl stack, but C expects to find things on the processor stack. So the xsub has to convert between Perl and C data representations. Typically, the xsub uses facilities in the Perl C API to get parameters from the Perl stack and convert them to C data values. To return a value, the xsub creates a Perl data object and leaves a pointer to it on the Perl stack.
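The conversion work an xsub does can be sketched with a toy Perl stack. None of this is the real Perl C API: the Scalar struct is reduced to two fields, and perl_stack, push_sv, pop_sv, scalar_new, c_add, and xsub_add are hypothetical names for this illustration.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct Scalar
{
    int refcnt;
    int iv;      /* integer value */
} Scalar;

enum { STACK_MAX = 16 };
Scalar *perl_stack[STACK_MAX];   /* the stack holds pointers, not structs */
int     sp = 0;

void    push_sv(Scalar *sv) { assert(sp < STACK_MAX); perl_stack[sp++] = sv; }
Scalar *pop_sv(void)        { assert(sp > 0); return perl_stack[--sp]; }

Scalar *scalar_new(int iv)
{
    Scalar *sv = malloc(sizeof *sv);
    if (sv == NULL)
        return NULL;
    sv->refcnt = 1;
    sv->iv     = iv;
    return sv;
}

/* An ordinary C routine: it knows nothing about Perl. */
int c_add(int a, int b)
{
    return a + b;
}

/* The xsub: glue between the Perl stack and the C routine. */
void xsub_add(void)
{
    /* 1. Get the parameters from the Perl stack (last argument on top). */
    Scalar *b = pop_sv();
    Scalar *a = pop_sv();
    assert(a != NULL && b != NULL);

    /* 2. Convert to C values and call the C routine the usual way. */
    int result = c_add(a->iv, b->iv);

    /* 3. Wrap the result in a new Perl data object and leave a
          pointer to it where the first parameter was. */
    push_sv(scalar_new(result));
}
```

The caller pushes pointers to two scalars, calls xsub_add, and then finds a single pointer to the result on the stack. XS exists largely so that this kind of glue does not have to be written by hand.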

XS is a macro language which allows us to declare C routines and to specify how Perl data types correspond to C data types. The xsubpp program reads XS code and outputs C.

Notes on these adapted articles

These pages are an adaptation of articles written in 2000 by Steven W. McDougall. My goal in modifying these articles is to simplify and update them. I hope you find these adapted versions of the articles useful. You can find the original articles at the link at the bottom of this page. The major changes in this update are:

h2xs is not used;
XSLoader is used in place of DynaLoader;
It is assumed that the reader understands the basic concepts of C and Perl programming.

This adaptation is a work in progress and many of the links on these pages may not work.

XS Mechanics by Steven W. McDougall is licensed under a Creative Commons Attribution 3.0 Unported License.

For comments, questions, and corrections, please email Ben Bullock (benkasminbullock@gmail.com).