The Common Gateway Interface

How a CGI program runs
How the server talks to the program
- Environment variables
CGI output
- HTTP headers
- Malformed headers
Sending information from the user
- Percent encodings
Post requests
- Simple forms
- File uploads
Remembering the user
Output options
- The MD5 checksum
- Compressing the output
Footnotes

How a CGI program runs

A web browser sends a request to a web server. The request goes to port number 80 on the serving computer.^[1] A request looks like this:

GET /program.cgi HTTP/1.0
Host: www.lemoda.net

Each line is followed by a carriage return and a line feed character.^[2] At the end of the request, there is a blank line.

If the browser requests a file such as index.html, the web server software searches for the file and sends it.

A web server which can run CGI programs,^[3] on receiving the instruction above, checks it, then looks for a program /program.cgi. If the web server does not find the program, it returns a "not found" error to the user.^[4]

If the program is found, and the web server software has been set up to run CGI programs, the web server software starts the program, and then sends the program's output to the user.

How the server talks to the program

The web server software sends information to a CGI program like /program.cgi via environment variables. It gets information back from the program via the program's standard output. Web servers do not get information back from the CGI program via environment variables. Web servers do not use command line options. In some cases (see Post requests), web servers use the program's standard input to send information to a program.

Environment variables

The server starts the program. If the program expects input, it may read the input from environment variables. In C, the value of an environment variable is obtained using the getenv library call. (See Set and get environment variables in C.)

The type of request which the user has made is given by the environment variable REQUEST_METHOD. This usually has the value GET or POST. The POST request is a request where the user sends information to the CGI program via the program's standard input. See Post requests.
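As a minimal sketch of this step in C, the helpers below read REQUEST_METHOD with getenv and classify it. The names request_kind, classify_method and current_request_kind are invented for this illustration and are not part of CGI or of any library.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical names, chosen only for this sketch. */
enum request_kind { REQUEST_GET, REQUEST_POST, REQUEST_OTHER };

/* Classify a REQUEST_METHOD value. NULL means the variable was not
   set, which suggests the program was not started by a web server. */
enum request_kind classify_method(const char *method)
{
    if (method == NULL)
        return REQUEST_OTHER;
    if (strcmp(method, "GET") == 0)
        return REQUEST_GET;
    if (strcmp(method, "POST") == 0)
        return REQUEST_POST;
    return REQUEST_OTHER;
}

/* Read REQUEST_METHOD from the environment set up by the server. */
enum request_kind current_request_kind(void)
{
    return classify_method(getenv("REQUEST_METHOD"));
}
```

A program would typically call current_request_kind() once at startup and read standard input only when the result is REQUEST_POST.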

CGI output

The CGI program returns a message to the user via its "standard output". In the case of a C program, that means that printf statements are enough to send the message from the CGI program via the web server.

The message consists of two parts. The first consists of HTTP headers, and the second can take any form. Between the two parts there is a blank line.

HTTP headers

The message returned by a CGI program must begin with correctly formatted HTTP (HyperText Transfer Protocol^[5]) headers. The bare minimum allowed is the Content-Type header with a MIME^[6] type for the content to follow:

Content-Type: text/plain

hello world

Here the MIME type of the message is text/plain, indicating that the output of the CGI program is "just text", with no formatting added. For example, the Figlet server at LeMoDa.net returns its output as plain text. Another common option for CGI output is text/html, which says that the output of the CGI program is HTML (HyperText Markup Language).

The headers output by the CGI program do not need to end with carriage return plus line feed. The server guarantees that it will alter any line feeds into carriage return plus line feeds before sending the output over the internet.

At the end of the HTTP headers, there is a blank line. This blank line indicates the end of the headers and the start of the text returned by the CGI program.
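The layout described above (header, blank line, content) might be produced in C along the following lines. This is only a sketch; write_cgi_response is a hypothetical helper name, and it takes the output stream as a parameter so that it can be tried on something other than stdout.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: write a minimal CGI response to `out`
   (normally stdout). Returns 0 on success, -1 on a write error. */
int write_cgi_response(FILE *out, const char *mime_type, const char *body)
{
    assert(out != NULL && mime_type != NULL && body != NULL);

    /* The header comes first, followed by the blank line that
       marks the end of the headers. */
    if (fprintf(out, "Content-Type: %s\n\n", mime_type) < 0)
        return -1;

    /* Everything after the blank line is the content itself. */
    if (fputs(body, out) == EOF)
        return -1;
    return 0;
}
```

In a real CGI program the call would be something like write_cgi_response(stdout, "text/plain", "hello world\n").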

A CGI program may also print a Status header which indicates the status of the request in the form

Status: 200 OK

The web server turns this into an HTTP header

HTTP/1.1 200 OK

Here 200 means "success". A full list of status codes can be found on this website under "HTTP status codes as C defines". Common status codes are 404, which occurs when something is not found, and 500, which occurs when a program or the server fails.

A CGI program does not need to print a status line, since the server adds a 200 status if no Status: header is found in the CGI output.
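When the program does want to report something other than success, it could print a Status header before the Content-Type header. The sketch below assumes a hypothetical helper, write_cgi_status_page, which sends a short plain-text page repeating the status code and reason.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: write a Status header, a Content-Type header,
   the blank line, and a one-line plain-text body to `out`.
   Returns 0 on success, -1 on a write error. */
int write_cgi_status_page(FILE *out, int code, const char *reason)
{
    assert(out != NULL && reason != NULL);

    /* The server turns this line into the HTTP status line. */
    if (fprintf(out, "Status: %d %s\n", code, reason) < 0)
        return -1;
    if (fprintf(out, "Content-Type: text/plain\n\n") < 0)
        return -1;
    if (fprintf(out, "%d %s\n", code, reason) < 0)
        return -1;
    return 0;
}
```

For example, write_cgi_status_page(stdout, 404, "Not Found") reports a missing resource.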

Malformed headers

A common problem when creating CGI programs is accidental output of text before the HTTP header. This error may be caused by print statements added for debugging or by forgetting to print the header first. The print statements appear to be part of the HTTP header. The web server software sends an error message to the user about "malformed HTTP headers" with a status of 500, instead of the intended message.

Sending information from the user

A request to a CGI program can also send information from an HTML form. HTML forms are described in the HTML 4.01 specification. If the HTML form uses the "GET" method, then the contents of the form are converted into a part of the URL of the request itself. For example, a form like

<FORM action='http://www.lemoda.net/games/figlet/figlet.cgi'
      method='GET'>
<INPUT name='text' type='text' value='monty'>
<INPUT type='submit'>
</FORM>

sends a request to the server of the form

http://www.lemoda.net/games/figlet/figlet.cgi?text=monty

where the "action" is the URL of the CGI script itself, the name of the form field "text" is the word after the question mark, and the value given to that field in the form is the word after the equals sign.

If there are multiple fields in the form, they are separated with an ampersand. For example,

<FORM action='http://www.lemoda.net/games/figlet/figlet.cgi'
      method='GET'>
<INPUT name='text' type='text' value='monty'>
<INPUT name='width' type='text' value='80'>
<INPUT type='submit'>
</FORM>

sends a request to the server of the form

http://www.lemoda.net/games/figlet/figlet.cgi?text=monty&width=80

This URL is sent to the web server. The web server separates out the query string from the URL and passes it to the CGI program in an environment variable QUERY_STRING. The CGI program must parse the query string to extract the parameters; the web server does not do that job.
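The parsing job left to the CGI program could be sketched in C as follows. The helper query_param is a hypothetical name; it looks up one field in a query string and copies out its value, which is still percent-encoded at this stage.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy the raw value of parameter `name` from
   `query` into `out`. Returns 0 if found, -1 if the parameter is
   absent or its value does not fit in `out`. */
int query_param(const char *query, const char *name,
                char *out, size_t out_size)
{
    assert(query != NULL && name != NULL && out != NULL);
    size_t name_len = strlen(name);
    const char *p = query;

    while (*p != '\0') {
        /* Each field runs up to the next ampersand or the end. */
        size_t field_len = strcspn(p, "&");

        if (field_len > name_len && strncmp(p, name, name_len) == 0
            && p[name_len] == '=') {
            size_t value_len = field_len - name_len - 1;
            if (value_len >= out_size)
                return -1;
            memcpy(out, p + name_len + 1, value_len);
            out[value_len] = '\0';
            return 0;
        }
        p += field_len;
        if (*p == '&')
            p++;            /* step over the separator */
    }
    return -1;
}
```

A program would normally pass getenv("QUERY_STRING") as the first argument, after checking that it is not NULL.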

Percent encodings

Some characters are not allowed to appear in a URL. Percent encoding, also called URL encoding, works around this restriction by replacing each disallowed character with a percent sign followed by the character's value as a two-digit hexadecimal number.

For example, if one types @&=+?# into the above HTML form, the browser converts this string of disallowed characters into the form

http://www.lemoda.net/games/figlet/figlet.cgi?text=%40%26%3D%2B%3F%23

where each character is now represented by a percent sign and its hexadecimal ASCII value. For example, %40 represents @, and %3F represents ?.

The web server software does not decode the percent encodings before sending them to the CGI program. It is left up to the CGI program to do this.

Percent encodings may also be used to encode non-ASCII characters. In this case, the interpretation of the bytes depends on the text encoding used in the original web page.
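A decoder for percent encodings can be quite short. The following C sketch decodes a string in place; percent_decode and hex_value are hypothetical names. As a simplification, it does not translate "+" into a space, which form submissions also use, and it leaves any interpretation of non-ASCII bytes to the caller.

```c
#include <assert.h>
#include <stddef.h>

/* Value of one hexadecimal digit, or -1 if `c` is not a hex digit. */
int hex_value(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Hypothetical helper: decode percent encodings in `s` in place and
   return the decoded length. A '%' that is not followed by two hex
   digits is copied unchanged. */
size_t percent_decode(char *s)
{
    assert(s != NULL);
    char *out = s;

    for (const char *in = s; *in != '\0'; in++) {
        if (in[0] == '%') {
            int hi = hex_value((unsigned char)in[1]);
            /* Only look at in[2] when in[1] was a valid digit. */
            int lo = hi >= 0 ? hex_value((unsigned char)in[2]) : -1;
            if (lo >= 0) {
                *out++ = (char)(hi * 16 + lo);
                in += 2;    /* skip the two digits just consumed */
                continue;
            }
        }
        *out++ = *in;
    }
    *out = '\0';
    return (size_t)(out - s);
}
```

Decoding in place is safe here because the decoded text is never longer than the encoded text.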

Post requests

A "post" request is a request to the web server where the user sends some input along with the request. For example, if the user types some text into a form, or uploads an image, these are usually sent as "post" requests.

The user's input is sent to the CGI program via the program's "standard input".^[7]

Post requests involve two new variables, CONTENT_LENGTH and CONTENT_TYPE. CONTENT_LENGTH gives the number of bytes to expect on standard input. CONTENT_TYPE is the value of the Content-Type: header sent with the user's request from the web browser, and tells the CGI program what kind of content it will receive.
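Reading the posted data might look like the following C sketch. The helper read_post_body and the limit MAX_POST_BYTES are assumptions made for this example; a real program would pass stdin and getenv("CONTENT_LENGTH").

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Assumed upper limit for this sketch; choose one to suit the program. */
#define MAX_POST_BYTES (1024L * 1024L)

/* Hypothetical helper: read exactly the number of bytes announced by
   `content_length` (the CONTENT_LENGTH string) from `in`. Returns a
   malloc'd, NUL-terminated buffer which the caller must free, or NULL
   on a bad length, a short read, or an allocation failure. */
char *read_post_body(FILE *in, const char *content_length,
                     size_t *length_out)
{
    assert(in != NULL && length_out != NULL);
    if (content_length == NULL)
        return NULL;

    char *end;
    long n = strtol(content_length, &end, 10);
    if (end == content_length || *end != '\0'
        || n < 0 || n > MAX_POST_BYTES)
        return NULL;

    char *body = malloc((size_t)n + 1);
    if (body == NULL)
        return NULL;

    /* Read only the announced number of bytes. */
    if (fread(body, 1, (size_t)n, in) != (size_t)n) {
        free(body);
        return NULL;
    }
    body[n] = '\0';
    *length_out = (size_t)n;
    return body;
}
```

The length is returned separately because uploaded data may contain zero bytes, so strlen on the buffer is not reliable.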

Simple forms

There are two main types of content which may be sent. The default behaviour of an HTML form is to send the form's contents in URL-encoded form. For example,

<form action='http://www.lemoda.net/games/figlet/figlet.cgi'
      method='POST'>
<input type='text' name='text' value='dandy!'>
<input type='text' name='width' value='10'>
<input type='submit'>
</form>

sends a message of the form

text=dandy%21&width=10

with a CONTENT_LENGTH of 22, the number of bytes in the above text. No carriage return or other line ending is added at the end. This uses exactly the same ampersand-separated, percent-encoded format as for GET requests.

It is possible to specify this encoding by setting the enctype attribute of the HTML form element to the value "application/x-www-form-urlencoded":

<form action='http://www.lemoda.net/games/figlet/figlet.cgi'
      method='POST'
      enctype='application/x-www-form-urlencoded'>

but this is not necessary since this is already the default.

As with the GET method, the web server software does not split the parameters or decode the percent encoding, so the CGI software must do this itself.

File uploads

The encoding described in Simple forms is inefficient for transferring large binary files such as images, since each non-ASCII byte turns into three bytes if the percent encoding is used. The "multipart/form-data" MIME type is suitable for transferring binary files.

This encoding sends the values from each field of a form as individual parts of a MIME message separated by a boundary. The boundary is extracted from the CONTENT_TYPE variable. For example, a form like

<form action='http://www.lemoda.net/games/figlet/figlet.cgi'
      method='POST'
      enctype='multipart/form-data'>
<input type='text' name='text' value='dandy!'>
<input type='text' name='width' value='10'>
<input type='submit'>
</form>

turns into a CONTENT_TYPE value of the form

multipart/form-data; boundary=----WebKitFormBoundaryHhVVGE1xV4vmz0WV

and standard input of the following kind:

------WebKitFormBoundaryHhVVGE1xV4vmz0WV
Content-Disposition: form-data; name="text"

dandy!
------WebKitFormBoundaryHhVVGE1xV4vmz0WV
Content-Disposition: form-data; name="width"

10
------WebKitFormBoundaryHhVVGE1xV4vmz0WV--

The boundary text is repeated to split the message into pieces. On standard input, each delimiter line consists of two minus signs followed by the boundary string, which is why the lines above begin with six minus signs while the CONTENT_TYPE value has four. The exact form of the boundary text is random and depends on the web browser the user has. The above example is from Google Chrome. The CGI program has to extract the boundary string from the CONTENT_TYPE variable and then split standard input. The very last delimiter is marked by two further minus signs added at the end.
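Extracting the boundary string could be sketched in C as below. The helper multipart_boundary is a hypothetical name, and the sketch assumes an unquoted boundary value like the one in the example above; quoted values are not handled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy the boundary parameter of a
   multipart/form-data CONTENT_TYPE value into `out`. Returns 0 on
   success, -1 if there is no boundary or it does not fit. */
int multipart_boundary(const char *content_type,
                       char *out, size_t out_size)
{
    assert(content_type != NULL && out != NULL);
    static const char key[] = "boundary=";

    const char *start = strstr(content_type, key);
    if (start == NULL)
        return -1;
    start += sizeof key - 1;

    /* The value runs to the next semicolon or the end of the string. */
    size_t len = strcspn(start, ";");
    if (len == 0 || len >= out_size)
        return -1;
    memcpy(out, start, len);
    out[len] = '\0';
    return 0;
}
```

The program can then search standard input for lines that start with the extracted string, allowing for the extra minus signs around it.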

Unlike the URL encoding, characters like ! are not percent-encoded in this format. Thus this method is more suitable for a transfer of a large amount of binary data. Percent encoding triples the length of most of the binary data, but multipart/form-data leaves it in its original form.

Remembering the user

The usual behaviour of a CGI program is to run, create the web page or other data requested, and then halt. Therefore, even if the user sends some identification with a first message, the CGI program does not remember the user for the next request. The user therefore needs to inform the CGI program of his identity each time.

The most common way for a user to give an identity is by means of "cookies". Cookies are small pieces of information supplied by the server to the user, which the user then returns to the server with each request.

The server sets a cookie on the user's computer with

Set-Cookie: name=value

sent as part of the HTTP headers (see HTTP headers). The cookie is then remembered on the user's computer until he closes his browser. It is also possible to make a cookie persist beyond the point when the user closes the browser by setting a date.^[8]

Set-Cookie: name=value; expires=Wed, 27-Oct-2010 10:10:10 GMT

It is also possible to delete a cookie by setting the expiry date to before the current time.
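A Set-Cookie line with an expiry date might be built in C as in the sketch below. The helper format_set_cookie is a hypothetical name. The sketch assumes the program has not changed the locale with setlocale, so that strftime produces English day and month names.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical helper: format a Set-Cookie header line whose expiry
   date is given in GMT. Returns the length of the line, or -1 if the
   date cannot be formatted or the line does not fit in `out`. */
int format_set_cookie(char *out, size_t out_size, const char *name,
                      const char *value, time_t expires)
{
    assert(out != NULL && name != NULL && value != NULL);
    char date[40];

    /* gmtime converts to GMT regardless of the local time zone. */
    struct tm *gmt = gmtime(&expires);
    if (gmt == NULL)
        return -1;
    if (strftime(date, sizeof date,
                 "%a, %d-%b-%Y %H:%M:%S GMT", gmt) == 0)
        return -1;

    int n = snprintf(out, out_size, "Set-Cookie: %s=%s; expires=%s\n",
                     name, value, date);
    return (n < 0 || (size_t)n >= out_size) ? -1 : n;
}
```

Passing a time in the past, such as 0, produces a line which deletes the cookie, and passing time(NULL) plus a number of seconds sets an expiry in the future.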

The user sends back his identifying cookie with each subsequent request to the web server using the Cookie: field of the HTTP request. One page may have more than one cookie set.

The value of the Cookie field of the request is made available to the CGI program via the environment variable HTTP_COOKIE. The CGI program must extract the relevant cookie from a list of semicolon-separated cookies.
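Picking one cookie out of the list could be done along these lines in C. The helper cookie_value is a hypothetical name, and the cookie names used in the comment are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: find cookie `name` in a cookie list such as
   "session=abc123; theme=dark" and copy its value into `out`.
   Returns 0 if found, -1 if absent or too long for `out`. */
int cookie_value(const char *cookies, const char *name,
                 char *out, size_t out_size)
{
    assert(cookies != NULL && name != NULL && out != NULL);
    size_t name_len = strlen(name);
    const char *p = cookies;

    while (*p != '\0') {
        /* Skip the separator and any spaces before each cookie. */
        p += strspn(p, "; ");
        size_t pair_len = strcspn(p, ";");

        if (pair_len > name_len && strncmp(p, name, name_len) == 0
            && p[name_len] == '=') {
            size_t value_len = pair_len - name_len - 1;
            if (value_len >= out_size)
                return -1;
            memcpy(out, p + name_len + 1, value_len);
            out[value_len] = '\0';
            return 0;
        }
        p += pair_len;
    }
    return -1;
}
```

The first argument would normally come from getenv("HTTP_COOKIE"), which is NULL when the browser sent no cookies.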

Output options

The MD5 checksum

The MD5 checksum is a way of checking the integrity of data as it travels across the internet. The checksum is calculated from the body of the message as it is sent, including any compression (see Compressing the output) announced with a Content-Encoding header. It does not cover the HTTP headers. The header carries the 128-bit checksum in base64 form rather than as hexadecimal digits:

Content-MD5: 8NawWewruZdHSG8mAtZNIQ==

See RFC 1864 for more on the Content-MD5 header.

Compressing the output

There are various ways to compress the output of a CGI script. Whether or not these may be sent by the CGI program is controlled by the Accept-Encoding header of the HTTP request which the browser sends. Its value is available to a CGI program via the environment variable HTTP_ACCEPT_ENCODING. Most browsers and web crawlers accept data compressed using the gzip and deflate formats.^[9]
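Before compressing anything, the program could check the list the browser sent. The C sketch below uses a hypothetical helper, accepts_coding. As simplifications, it compares names case-sensitively and does not interpret quality values such as "gzip;q=0".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return 1 if `coding` (for example "gzip") is
   listed in an Accept-Encoding value, else 0. A NULL list, meaning
   the header was not sent, counts as "not accepted" here. */
int accepts_coding(const char *accept_encoding, const char *coding)
{
    assert(coding != NULL);
    if (accept_encoding == NULL)
        return 0;

    size_t coding_len = strlen(coding);
    const char *p = accept_encoding;

    while (*p != '\0') {
        /* Skip commas and white space between list elements. */
        p += strspn(p, ", \t");
        /* The name ends at a comma, a parameter, or white space. */
        size_t token_len = strcspn(p, ",; \t");
        if (token_len == coding_len
            && strncmp(p, coding, coding_len) == 0)
            return 1;
        /* Move past the rest of this element and its parameters. */
        p += strcspn(p, ",");
    }
    return 0;
}
```

A program would call accepts_coding(getenv("HTTP_ACCEPT_ENCODING"), "gzip") and fall back to uncompressed output when the result is 0.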

Only the part of the message after the HTTP headers is compressed. The HTTP headers are never compressed.

The CGI program informs the client that it is sending compressed content using the Content-Encoding header. For example, if it sends gzip-compressed output, this looks like

Content-Encoding: gzip

Compression is usually applied to data in text form, such as with the MIME types text/html or text/plain, but not to image files such as JPEG or PNG files, which are already compressed.

For a simple example of compressing CGI output, see Compressing CGI output with Perl.

Footnotes

[1] A port, short for "internet socket port number", is a number which specifies an endpoint for communications on the internet. Ports originated in RFC 36. Almost all web servers use port number eighty. A web server on a different port will appear in its URL in the form http://www.example.com:8080/, where 8080 is the port number being used.
[2] The use of carriage return plus line feed to end lines is an internet standard originating in RFC 139. This is the same line-ending convention as used in Microsoft Windows, but different from that used in Unix-like operating systems such as Linux and FreeBSD, or the Apple Macintosh operating system.
[3] CGI was originally created for the NCSA web server. It is now defined by RFC 3875.
[4] The "not found" error has the HTTP status code 404.
[5] HTTP, the "HyperText Transfer Protocol" which underlies the world wide web, is defined in RFC 1945 (version 1.0) and RFC 2616 (version 1.1).
[6] MIME means "Multipurpose Internet Mail Extensions", a way of specifying file formats over the internet. MIME is specified in RFC 2045.
[7] Every Unix program has three files associated with it: its standard input, standard output, and standard error. In C, these are defined in the stdio.h header file, and they are called stdin, stdout and stderr respectively.
[8] The date format is defined in RFC 822. Cookie dates must be in GMT (Greenwich Mean Time).
[9] The gzip format is defined by RFC 1952. The deflate coding used in HTTP is the zlib format defined by RFC 1950, which wraps data compressed with the DEFLATE algorithm of RFC 1951.