The Common Gateway Interface
How a CGI program runs
A web browser sends a request to a web server. The request goes to port number 80 on the serving computer.[1] A request looks like this:
GET /program.cgi HTTP/1.0
Host: www.lemoda.net
Each line is followed by a carriage return and a line feed character.[2] At the end of the request, there is a blank line.
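The request format described above can be sketched in C as follows. This is only an illustration of the line endings: the function name build_get_request is made up for this example, and a real browser sends more header lines than these two.

```c
#include <stdio.h>

/* Write a minimal HTTP/1.0 GET request into buf. Every line, including
   the final blank line, ends with carriage return plus line feed.
   Returns the length written, or -1 if buf is too small. */
int build_get_request(char *buf, size_t size, const char *path, const char *host)
{
    int n = snprintf(buf, size, "GET %s HTTP/1.0\r\nHost: %s\r\n\r\n", path, host);
    return (n < 0 || (size_t) n >= size) ? -1 : n;
}
```

Calling build_get_request (buf, sizeof buf, "/program.cgi", "www.lemoda.net") produces the request shown above.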
If the browser requests a file such as index.html, the web server software searches for the file and sends it.
A web server which can run CGI programs,[3] on receiving the instruction above, checks it, then looks for a program /program.cgi. If the web server does not find the program, it returns a "not found" error back over the internet to the user.[4]
If the program is found, and the web server software has been set up to run CGI programs, the web server software starts the program, and then sends the program's output to the user.
How the server talks to the program
The web server software sends information to a CGI program like /program.cgi via environment variables. It gets information back from the program via the program's standard output. Web servers do not get information back from the CGI program via environment variables. Web servers do not use command line options. In some cases (see Post requests), web servers use the program's standard input to send information to a program.
Environment variables
The server starts the program. If the program expects input, it may read the input from environment variables. In C, the value of an environment variable is obtained using the getenv library call. (See Set and get environment variables in C.)
The type of request which the user has made is given by the environment variable REQUEST_METHOD. This usually has the value GET or POST. The POST request is a request where the user sends information to the CGI program via the program's standard input. See Post requests.
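As a minimal sketch of reading REQUEST_METHOD with getenv, the following C code sorts the value into one of three cases. The enum and the function names are invented for this example.

```c
#include <stdlib.h>
#include <string.h>

enum method { METHOD_GET, METHOD_POST, METHOD_OTHER };

/* Classify a REQUEST_METHOD value. NULL means the variable was not set,
   for example when the program is run from a shell rather than by a
   web server. */
enum method classify_method(const char *value)
{
    if (value == NULL) {
        return METHOD_OTHER;
    }
    if (strcmp(value, "GET") == 0) {
        return METHOD_GET;
    }
    if (strcmp(value, "POST") == 0) {
        return METHOD_POST;
    }
    return METHOD_OTHER;
}

/* Look up the method of the current request in the environment. */
enum method request_method(void)
{
    return classify_method(getenv("REQUEST_METHOD"));
}
```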
CGI output
The CGI program returns a message to the user via its "standard output". In the case of a C program, that means that printf statements are enough to send the message from the CGI program via the web server.
The message consists of two parts. The first consists of HTTP headers, and the second can take any form. Between the two parts there is a blank line.
HTTP headers
The message returned by a CGI program must begin with correctly formatted HTTP (HyperText Transfer Protocol[5]) headers. The bare minimum allowed is the Content-Type header with a MIME[6] type for the content to follow:
Content-Type: text/plain

hello world
Here the MIME type of the message is text/plain, indicating that the output of the CGI program is "just text", with no formatting added. For example, the Figlet server at LeMoDa.net returns its output as plain text. Another common option for CGI output is text/html, which says that the output of the CGI program is HTML (HyperText Markup Language).
The headers output by the CGI program do not need to end with carriage return plus line feed. The server guarantees that it will alter any line feeds into carriage return plus line feeds before sending the output over the internet.
At the end of the HTTP headers, there is a blank line. This blank line indicates the end of the headers and the start of the text returned by the CGI program.
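The two-part message can be sketched in C like this. The function name write_plain_reply is made up for the example, and it takes the output stream as an argument only so that it is easy to try out. A CGI program would pass stdout.

```c
#include <stdio.h>

/* Write a complete CGI reply to out: one header, the blank line that
   ends the headers, then the body. Plain line feeds are used because
   the server converts them before sending the reply on. */
void write_plain_reply(FILE *out, const char *body)
{
    fputs("Content-Type: text/plain\n", out);
    fputs("\n", out);                /* blank line: end of the headers */
    fputs(body, out);
}
```

For example, write_plain_reply (stdout, "hello world\n") prints the message shown above.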
A CGI program may also print a Status header which indicates the status of the request in the form
Status: 200 OK
The web server turns this into an HTTP header
HTTP/1.1 200 OK
Here 200 means "success". A full set of possible statuses can be found on this website as HTTP status codes as C defines. Common status codes are 404, which occurs when something is not found, and 500, which occurs when a program or the server fails.
A CGI program does not need to print a status line, since the server adds a 200 status if no Status: header is found in the CGI output.
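A reply with a status other than 200 might be produced along the following lines. This is a sketch with an invented function name, write_status_reply. The reason text is whatever the program chooses to send after the number.

```c
#include <stdio.h>

/* Write a reply whose first header is a Status header, for example
   "Status: 404 Not Found", followed by a plain text body. */
void write_status_reply(FILE *out, int code, const char *reason, const char *body)
{
    fprintf(out, "Status: %d %s\n", code, reason);
    fputs("Content-Type: text/plain\n\n", out);
    fputs(body, out);
}
```

A program which cannot find what the user asked for could call write_status_reply (stdout, 404, "Not Found", "No such page.\n").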
Malformed headers
A common problem when creating CGI programs is accidental output of text before the HTTP header. This error may be caused by print statements added for debugging or by forgetting to print the header first. The print statements appear to be part of the HTTP header. The web server software sends an error message to the user about "malformed HTTP headers" with a status of 500, instead of the intended message.
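One way to keep debugging messages out of the headers is to send them to standard error instead of standard output. The sketch below assumes that the web server keeps standard error apart from the reply. Many servers write it to an error log, but that depends on the server. The function name is invented for this example.

```c
#include <stdio.h>

/* Send a debugging note to errlog (stderr in a real CGI program) and
   write the reply to out (stdout), so that the note cannot end up in
   front of the HTTP headers. */
void reply_with_debug_note(FILE *out, FILE *errlog, const char *note)
{
    fprintf(errlog, "debug: %s\n", note);
    fputs("Content-Type: text/plain\n\n", out);
    fputs("hello world\n", out);
}
```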
Sending information from the user
A request to a CGI program can also send information from an HTML form. HTML forms are described in the HTML 4.01 specification. If the HTML form uses the "GET" method, then the contents of the form are converted into a part of the URL of the request itself. For example, a form like
<FORM action='http://www.lemoda.net/games/figlet/figlet.cgi' method='GET'>
<INPUT name='text' type='text' value='monty'>
<INPUT type='submit'>
</FORM>
sends a request to the server of the form
http://www.lemoda.net/games/figlet/figlet.cgi?text=monty
where the "action" is the URL of the CGI script itself, the name of the form field "text" is the word after the question mark, and the value given to that field in the form is the word after the equals sign.
If there are multiple fields in the form, they are separated with an ampersand. For example,
<FORM action='http://www.lemoda.net/games/figlet/figlet.cgi' method='GET'>
<INPUT name='text' type='text' value='monty'>
<INPUT name='width' type='text' value='80'>
<INPUT type='submit'>
</FORM>
sends a request to the server of the form
http://www.lemoda.net/games/figlet/figlet.cgi?text=monty&width=80
This URL is sent to the web server. The web server separates out the query string from the URL and passes it to the CGI program in an environment variable QUERY_STRING. The CGI program must parse the query string to extract the parameters; the web server does not do that job.
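A minimal sketch of the parsing job, in C, is the following. The function name query_param is made up for this example. It finds one named parameter and copies its value without decoding any percent encodings.

```c
#include <stddef.h>
#include <string.h>

/* Copy the raw (still percent-encoded) value of parameter name from an
   ampersand-separated query string into out. Returns 1 if the parameter
   was found and fits in out, otherwise 0. */
int query_param(const char *query, const char *name, char *out, size_t size)
{
    size_t name_len = strlen(name);
    const char *p = query;

    while (p != NULL && *p != '\0') {
        const char *end = strchr(p, '&');
        size_t pair_len = end ? (size_t) (end - p) : strlen(p);

        /* A match is "name=" at the start of the pair. */
        if (pair_len > name_len && strncmp(p, name, name_len) == 0
            && p[name_len] == '=') {
            size_t value_len = pair_len - name_len - 1;
            if (value_len >= size) {
                return 0;
            }
            memcpy(out, p + name_len + 1, value_len);
            out[value_len] = '\0';
            return 1;
        }
        p = end ? end + 1 : NULL;
    }
    return 0;
}
```

In a CGI program the first argument would be getenv ("QUERY_STRING"). With the query string text=monty&width=80, asking for "width" gives the value 80.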
Percent encodings
Some characters are not allowed to appear in a URL. Percent encoding, also called URL encoding, gets around this restriction by replacing each disallowed character with a percent sign followed by the character's value as a two-digit hexadecimal number.
For example, if one types @&=+?# into the above HTML form, the browser converts this string of disallowed characters into the form
http://www.lemoda.net/games/figlet/figlet.cgi?text=%40%26%3D%2B%3F%23
where each character is now represented by a percent sign and its hexadecimal ASCII value. For example, %40 represents @, and %3F represents ?.
The web server software does not decode the percent encodings before sending them to the CGI program. It is left up to the CGI program to do this.
Percent encodings may also be used to encode non-ASCII characters. In this case, the interpretation of the bytes depends on the text encoding used in the original web page.
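Decoding can be sketched in C as follows. The function name percent_decode is invented for this example. It handles only the percent sign form described above. Browsers also send a space in a form field as a plus sign, and this sketch leaves plus signs alone.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>

/* Decode percent encodings from in into out (at most size - 1 bytes
   plus a terminating zero). Returns the decoded length, or -1 if out
   is too small. A '%' not followed by two hex digits is copied as-is. */
long percent_decode(const char *in, char *out, size_t size)
{
    size_t n = 0;

    if (size == 0) {
        return -1;
    }
    while (*in != '\0') {
        char c = *in++;
        if (c == '%' && isxdigit((unsigned char) in[0])
            && isxdigit((unsigned char) in[1])) {
            char hex[3] = { in[0], in[1], '\0' };
            c = (char) strtol(hex, NULL, 16);
            in += 2;
        }
        if (n + 1 >= size) {
            return -1;
        }
        out[n++] = c;
    }
    out[n] = '\0';
    return (long) n;
}
```

Given the encoded text %40%26%3D%2B%3F%23 from the example above, this returns the original six characters @&=+?#.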
Post requests
A "post" request is a request to the web server where the user sends some input along with the request. For example, if the user types some text into a form, or uploads an image, these are usually sent as "post" requests.
The user's input is sent to the CGI program via the program's "standard input".[7]
Post requests involve two new variables, CONTENT_LENGTH and CONTENT_TYPE. CONTENT_LENGTH gives the number of bytes to expect on standard input. CONTENT_TYPE is the value of the Content-Type: header sent with the user's request from the web browser, and tells the CGI program what kind of content it will receive.
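Reading the user's input might look like the following sketch in C. The function name read_post_body and the max argument are made up for this example. The limit is there only so that a very large CONTENT_LENGTH does not make the program allocate a huge buffer.

```c
#include <stdio.h>
#include <stdlib.h>

/* Read exactly the number of bytes announced by CONTENT_LENGTH from in
   (stdin in a real CGI program). Returns a malloc'd, zero-terminated
   buffer which the caller frees, or NULL on error. max is an upper
   limit chosen by the caller. */
char *read_post_body(FILE *in, const char *content_length, size_t max, size_t *length)
{
    char *end;
    unsigned long n;
    char *body;

    if (content_length == NULL) {
        return NULL;
    }
    n = strtoul(content_length, &end, 10);
    if (end == content_length || *end != '\0' || n > max) {
        return NULL;
    }
    body = malloc(n + 1);
    if (body == NULL) {
        return NULL;
    }
    if (fread(body, 1, n, in) != n) {
        free(body);
        return NULL;
    }
    body[n] = '\0';
    *length = n;
    return body;
}
```

A CGI program would call this as read_post_body (stdin, getenv ("CONTENT_LENGTH"), limit, & length), where limit is whatever size the program is prepared to accept.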
Simple forms
There are two main types of content which may be sent. The default behaviour of an HTML form is to send the form's contents as url-encoded. For example,
<form action='http://www.lemoda.net/games/figlet/figlet.cgi' method='POST'>
<input type='text' name='text' value='dandy!'>
<input type='text' name='width' value='10'>
<input type='submit'>
</form>
sends a message of the form
text=dandy%21&width=10
with a CONTENT_LENGTH of 22, the number of bytes in the above text. This uses exactly the same ampersand-separated, percent-encoded format as for GET requests.
It is possible to specify this encoding by setting the enctype attribute of the HTML form element to the value "application/x-www-form-urlencoded":
<form action='http://www.lemoda.net/games/figlet/figlet.cgi'
method='POST'
enctype='application/x-www-form-urlencoded'>
but this is not necessary since this is already the default.
As with the GET method, the web server software does not split the parameters or decode the percent encoding, so the CGI software must do this itself.
File uploads
The encoding described in Simple forms is inefficient for transferring large binary files such as images, since each non-ASCII byte turns into three bytes if the percent encoding is used. The "multipart/form-data" MIME type is suitable for transferring binary files.
This encoding sends the values from each field of a form as individual parts of a MIME message separated by a boundary. The boundary is extracted from the CONTENT_TYPE variable. For example, a form like
<form action='http://www.lemoda.net/games/figlet/figlet.cgi' method='POST' enctype='multipart/form-data'>
<input type='text' name='text' value='dandy!'>
<input type='text' name='width' value='10'>
<input type='submit'>
</form>
turns into a CONTENT_TYPE value of the form
multipart/form-data; boundary=----WebKitFormBoundaryHhVVGE1xV4vmz0WV
and standard input of the following kind:
------WebKitFormBoundaryHhVVGE1xV4vmz0WV
Content-Disposition: form-data; name="text"

dandy!
------WebKitFormBoundaryHhVVGE1xV4vmz0WV
Content-Disposition: form-data; name="width"

10
------WebKitFormBoundaryHhVVGE1xV4vmz0WV--
The boundary text is repeated to split the message into pieces. The exact form of the boundary text is random and depends on the web browser the user has. The above example is from Google Chrome. The CGI program has to extract the boundary string from the CONTENT_TYPE variable and then split standard input wherever the boundary appears. In standard input, each boundary line consists of two minus signs followed by the boundary string, and the very last boundary line has two more minus signs added at the end.
Unlike the URL encoding, characters like ! are not percent-encoded in this format. Thus this method is more suitable for a transfer of a large amount of binary data. Percent encoding triples the length of most of the binary data, but multipart/form-data leaves it in its original form.
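The first step, getting the boundary string out of CONTENT_TYPE, can be sketched in C as follows. The function name extract_boundary is invented for this example, and the sketch does not handle a boundary value written inside double quotes.

```c
#include <stddef.h>
#include <string.h>

/* Copy the boundary parameter of a multipart/form-data CONTENT_TYPE
   value into out. Returns 1 on success, 0 if there is no boundary or
   it does not fit in out. */
int extract_boundary(const char *content_type, char *out, size_t size)
{
    static const char key[] = "boundary=";
    const char *start;
    size_t len;

    if (content_type == NULL) {
        return 0;
    }
    start = strstr(content_type, key);
    if (start == NULL) {
        return 0;
    }
    start += sizeof key - 1;
    len = strcspn(start, "; ");     /* stop at the next parameter, if any */
    if (len == 0 || len >= size) {
        return 0;
    }
    memcpy(out, start, len);
    out[len] = '\0';
    return 1;
}
```

With the CONTENT_TYPE value of the example above, this gives ----WebKitFormBoundaryHhVVGE1xV4vmz0WV. The program then looks in standard input for lines consisting of two minus signs followed by that string.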
Remembering the user
The usual behaviour of a CGI program is to run, create the web page or other data requested, and then halt. Therefore, even if the user sends some identification with a first message, the CGI program does not remember the user for the next request. The user therefore needs to inform the CGI program of his identity each time.
The most common way for a user to give an identity is by means of "cookies". Cookies are small pieces of information supplied by the server to the user, which the user then returns to the server with each request.
The server sets a cookie on the user's computer with
Set-Cookie: name=value
sent as part of the HTTP headers (see HTTP headers). The cookie is then remembered on the user's computer until he closes his browser. It is also possible to make a cookie persist beyond the point when the user closes the browser by setting a date.[8]
Set-Cookie: name=value; expires=Wed, 27-Oct-2010 10:10:10 GMT
It is also possible to delete a cookie by setting the expiry date to before the current time.
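A Set-Cookie line with an expiry date could be built in C as in the sketch below. The function name format_set_cookie is made up for this example. The date is written in the day-month-year style of the example above, with "GMT" added at the end since cookie dates must be in GMT.

```c
#include <stdio.h>
#include <time.h>

/* Write "Set-Cookie: name=value; expires=<date>" into buf, with the
   expiry time formatted in GMT. Returns 1 on success, 0 if buf is too
   small or the time cannot be converted. */
int format_set_cookie(char *buf, size_t size, const char *name,
                      const char *value, time_t expires)
{
    char date[64];
    struct tm *gmt = gmtime(&expires);
    int n;

    if (gmt == NULL
        || strftime(date, sizeof date, "%a, %d-%b-%Y %H:%M:%S GMT", gmt) == 0) {
        return 0;
    }
    n = snprintf(buf, size, "Set-Cookie: %s=%s; expires=%s", name, value, date);
    return n >= 0 && (size_t) n < size;
}
```

Passing time (NULL) plus some number of seconds as the last argument gives a cookie which outlives the browser session, and passing a time in the past deletes the cookie.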
The user sends back his identifying cookie with each subsequent request to the web server using the Cookie: field of the HTTP request. One page may have more than one cookie set.
The value of the Cookie field of the request is made available to the CGI program via the environment variable HTTP_COOKIE. The CGI program must extract the relevant cookie from a list of semicolon-separated cookies.
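Extracting one cookie from the list can be sketched in C as follows. The function name find_cookie is invented for this example, and the sketch assumes the usual layout in which each semicolon is followed by a space.

```c
#include <stddef.h>
#include <string.h>

/* Copy the value of cookie name from a semicolon-separated list such as
   "a=1; b=2" into out. Returns 1 if the cookie was found and fits in
   out, otherwise 0. */
int find_cookie(const char *cookies, const char *name, char *out, size_t size)
{
    size_t name_len = strlen(name);
    const char *p = cookies;

    while (p != NULL && *p != '\0') {
        size_t pair_len;

        p += strspn(p, " ");            /* skip the space after a semicolon */
        pair_len = strcspn(p, ";");
        if (pair_len > name_len && strncmp(p, name, name_len) == 0
            && p[name_len] == '=') {
            size_t value_len = pair_len - name_len - 1;
            if (value_len >= size) {
                return 0;
            }
            memcpy(out, p + name_len + 1, value_len);
            out[value_len] = '\0';
            return 1;
        }
        p += pair_len;
        if (*p == ';') {
            p++;
        }
    }
    return 0;
}
```

In a CGI program the first argument would be getenv ("HTTP_COOKIE"), which is NULL if the user sent no cookies.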
Output options
The MD5 checksum
The MD5 checksum is a way of ensuring the integrity of data as it travels across the internet. The value of the MD5 checksum is calculated from the content of the page as it is sent, after any compression (see Compressing the output) is applied. It does not apply to the HTTP headers.
Content-MD5: f0d6b059ec2bb99747486f2602d64d21
See RFC 1864 for more on the Content-MD5 header.
Compressing the output
There are various ways to compress the output of a CGI script. Whether or not these may be sent by the CGI program is controlled by the Accept-Encoding header of the HTTP request which the browser sends. Its value is available to a CGI program via the environment variable HTTP_ACCEPT_ENCODING. Most browsers and web crawlers accept data compressed using the gzip and deflate formats.[9]
Only the part of the message after the HTTP headers is compressed. The HTTP headers are never compressed.
The CGI program informs the client that it is sending compressed content using the Content-Encoding header. For example, if it sends gzip-compressed output, this looks like
Content-Encoding: gzip
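Before sending compressed output, a CGI program needs to look at HTTP_ACCEPT_ENCODING. The following C sketch checks whether a format is listed there. The function name accepts_encoding is made up for this example. It compares names exactly as written and does not interpret quality values such as "gzip;q=0", so it is only a rough check.

```c
#include <string.h>

/* Return 1 if the Accept-Encoding value lists coding as one of its
   comma-separated entries, otherwise 0. */
int accepts_encoding(const char *accept_encoding, const char *coding)
{
    size_t coding_len = strlen(coding);
    const char *p = accept_encoding;

    while (p != NULL && *p != '\0') {
        size_t token_len;

        p += strspn(p, ", ");               /* skip separators */
        token_len = strcspn(p, ",; ");      /* the entry name ends here */
        if (token_len == coding_len && strncmp(p, coding, coding_len) == 0) {
            return 1;
        }
        p += strcspn(p, ",");               /* move on to the next entry */
    }
    return 0;
}
```

A program would print the Content-Encoding: gzip header and compress its output only if accepts_encoding (getenv ("HTTP_ACCEPT_ENCODING"), "gzip") returns 1.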
Compression is usually applied to data in text form, such as with the MIME types text/html or text/plain, but not to image files such as JPEG or PNG files, which are already compressed.
For a simple example of compressing CGI output, see Compressing CGI output with Perl.
Footnotes
- A port, short for "internet socket port number", is a number which specifies an endpoint for communications on the internet. Ports originated in RFC 36. Almost all web servers use port number eighty. A web server on a different port will appear in its URL in the form http://www.example.com:8080/, where 8080 is the port number being used.
- The use of carriage return plus line feed to end lines is an internet standard originating in RFC 139. This is the same line-ending convention as used in Microsoft Windows, but different from that used in Unix-like operating systems such as Linux and FreeBSD, or the Apple Macintosh operating system.
- CGI was originally created for the NCSA web server. It is now defined by RFC 3875.
- The "not found" error has the HTTP status code 404.
- HTTP, the "HyperText Transfer Protocol" which underlies the world wide web, is defined in RFC 1945 (version 1.0) and RFC 2616 (version 1.1).
- MIME means "Multipurpose Internet Mail Extensions", a way of specifying file formats over the internet. MIME is specified in RFC 2045.
- Every Unix program has three files associated with it, its standard input, standard output, and standard error. In C, these are defined in the stdio.h header file, and they are called stdin, stdout and stderr respectively.
- The date format is defined in RFC 822. Cookie dates must be in GMT (Greenwich Mean Time).
- The gzip format is defined by RFC 1952 and the deflate format is defined by RFC 1950.