In 1999, J. David Blackstone, or, as he is known online, jdavidb, was taking CSE 4344 (Network Concepts) at UT Arlington. Those were the glory days of Sparc Solaris, and Blackstone wrote, for his college course, a C program called tinyhttpd. It is, essentially, a very short version of the immensely complex programs that seem to run the world these days: web servers. Unlike the million-line behemoths (think Apache, nginx, et cetera), tinyhttpd is an HTTP 1.1 web server in 532 lines of well-commented C.
HTTP 1.1 is a ubiquitously supported protocol that is useful for a great many applications, and in this modern era of embedded (a.k.a. “Internet of Things”) computing applications, small web servers have never been more important.
This program is also a small, manageable example of a legacy application - an old program written for an obsolete operating system that still gets the job done, but exposes any organization using it to not only the cost of maintaining ancient operating systems and hardware, but also to the risk of the security vulnerabilities present in tinyhttpd itself and the software it needs to run.
For the purposes of these posts, I’ll be looking at tinyhttpd from the perspective of a company that uses it internally, and wants to transition to a more modular, portable, and maintainable design, rather than one which either ships it as a product or buys it as a product from another company and wants to replace it; these situations are similar, but have additional challenges.
The first thing to do is to analyze the existing source. I’ve gone ahead and created a GitHub repository to host both the old and new source code, and I’ll link to specific commits in these posts. For instance, here is the commit with nothing but the unmodified source of the legacy app.
The next step is to build the existing app. In order not to clutter the repository with object files, I created a .gitignore file from GitHub’s default C gitignores. Now all I have to do is run the build using the project’s Makefile.
What’s this, it doesn’t compile? Well, you’ll remember I mentioned it was written for an ancient version of Sparc Solaris - that’s the whole reason we’re rewriting it. Luckily, the original author anticipated this. Looking at legacy/httpd.c (where the error is), I see a comment at the top listing the changes needed to build it elsewhere, including commenting out the pthread_create calls and removing -lsocket from the Makefile.
I made a note of this in my analysis folder and made those changes - except that they didn’t quite apply. The Makefile didn’t have -lsocket, and there was only one occurrence of pthread_create. The changes did make the app build, but it didn’t work!
In order to figure out what’s happening, I looked up pthread_create on man7.org. It’s part of the POSIX threading API, and it is definitely available on Linux. Furthermore, if we look at the main() function, we can see why commenting out those lines caused a problem - it’s an infinite loop that does nothing but accept connections, so with the thread creation removed, nothing ever handles them!
So, we need to get POSIX threads working to make this app run properly. (Note that this problem isn’t an uncommon one when looking at legacy apps; there is often not a good set of build instructions.)
In our case, luckily, this is easy: just revert the commenting and change -lpthread in the Makefile to -pthread, as mentioned on the manual page.
Doing this allows the app to build and run correctly, binding to port 9999. When I open localhost:9999 in my web browser, I get a page back. Success!
Now that we have a compiling and running version of the legacy tinyhttpd, it’s time to go through the source code. Luckily for us, tinyhttpd is entirely contained in a single file. Let’s start off with the top:
This header is typical of legacy programs: a short note about the author and purpose of the program, and some (in this case out-of-date and inaccurate) instructions for building and running it. Removing the misleading lines makes this section a lot more concise and is probably a good idea.
After the #includes, which aren’t very helpful in this case, we find two macro definitions, SERVER_STRING and ISspace.
The SERVER_STRING definition is pretty straightforward; it’s an identifier of the software, which will be sent to clients. In our version, I would prefer not to include the \r\n terminator in the definition itself. As for the ISspace definition, though, I’m not immediately sure. A quick search of the source shows no definition of a function isspace taking an integer, so it’s probably coming from one of the includes.
If this program had multiple files, I’d search through them next; but, as there are none, I’m going straight to the Internet. It turns out that isspace does just what you’d expect - it checks whether a given integer represents a whitespace character. The ISspace macro simply allows calling it directly on char values without writing out an explicit cast every time. I’ve made a note of this in my analysis documents.
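For the rewrite, both definitions have straightforward Rust counterparts. The following is only a sketch, not code from the repository: the names SERVER_NAME, server_header, and is_space are made up for illustration, and the identifier string is a placeholder rather than the value from the legacy source. One detail worth noting is that Rust's u8::is_ascii_whitespace does not count vertical tab (0x0B) as whitespace, while C's isspace does in the default locale, so the sketch adds it back.

```rust
/// Placeholder identifier (an assumption); the real value would be taken
/// from the legacy source.
const SERVER_NAME: &str = "tinyhttpd-rs/0.1.0";

/// Builds the full header line, adding the CRLF terminator at the point of
/// use instead of baking it into the constant.
fn server_header() -> String {
    format!("Server: {}\r\n", SERVER_NAME)
}

/// Byte-level whitespace test mirroring C's isspace in the default "C"
/// locale. Works directly on bytes, so no cast is needed at call sites.
fn is_space(byte: u8) -> bool {
    byte.is_ascii_whitespace() || byte == 0x0B
}
```

Keeping the terminator out of the constant means the same string can be reused in logs or version output without trimming.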
After the head macros, we can see forward declarations of all the functions used in the program.
Because they have no comments, these declarations are not particularly useful, so let’s go down to the bottom of the page and look at which functions are called in the program’s entry function, main.
Let’s break this down further. This function takes void, meaning that the program has no arguments or command line options. This probably means it’s not very customizable, something I’d like to change in the rewritten version.
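As a rough idea of what a more customizable entry point could look like in the rewrite, here is a small sketch of reading an optional port from the command line. The function name parse_port and the convention that the port is the first positional argument are assumptions of mine for illustration; nothing in the legacy program defines a command-line interface.

```rust
/// Hypothetical CLI convention: an optional first argument gives the port.
/// Returns Ok(None) when no argument was supplied, so the caller can decide
/// on a default.
fn parse_port<I: Iterator<Item = String>>(mut args: I) -> Result<Option<u16>, String> {
    match args.next() {
        None => Ok(None),
        Some(raw) => raw
            .parse::<u16>()
            .map(Some)
            .map_err(|e| format!("invalid port {:?}: {}", raw, e)),
    }
}
```

A caller would typically pass std::env::args().skip(1) so that the program name is not mistaken for the port.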
After the function signature come the definitions of some local variables:
server_sock and client_sock are just ints, but they represent file handles, as we’ll see in a moment.
port is clearly a port number.
client_name is the address of the client, and client_name_len is its length.
Below that, the program uses signal() to ignore SIGPIPE, the signal that programs receive when they write to a pipe or socket whose other end has been closed. It seems to me that this should be handled more appropriately in the rewrite.
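In Rust the starting point is different: the standard runtime ignores SIGPIPE at startup, so writing to a connection the client has closed shows up as an io::Error of kind BrokenPipe instead of killing the process. A minimal sketch of handling that case explicitly might look like the following. The names send_response and ClosedPipe are invented for this example, and ClosedPipe exists only to simulate a closed socket.

```rust
use std::io::{self, Write};

/// Writes a response. Ok(false) means the client had already gone away,
/// which is treated as a normal outcome rather than a fatal error.
fn send_response<W: Write>(out: &mut W, bytes: &[u8]) -> io::Result<bool> {
    match out.write_all(bytes) {
        Ok(()) => Ok(true),
        Err(e) if e.kind() == io::ErrorKind::BrokenPipe => Ok(false),
        Err(e) => Err(e),
    }
}

/// Stand-in for a socket whose peer has closed; used only to exercise the
/// sketch without real networking.
struct ClosedPipe;

impl Write for ClosedPipe {
    fn write(&mut self, _buf: &[u8]) -> io::Result<usize> {
        Err(io::Error::from(io::ErrorKind::BrokenPipe))
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}
```

Whether a vanished client should be logged, counted, or silently dropped is a design decision the legacy code never had to make, since it ignores the signal wholesale.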
Immediately afterward, the server_sock variable is filled by the result of the function startup, which is given a pointer to the port number. This seems odd to me - why does it need a reference and not just the value? - so I look at that function’s definition. It is commented with:
That makes more sense now - it allows dynamically generating a port number. That’s useful, but the functionality isn’t exposed through the command line interface, which is annoying. In our program, I’d like to expose that, and I’d also like to move away from the C convention of modifying inputs. In the rewrite, I think I’ll return a tuple. Since this is a fairly complex idea, I’ll take this time to write some notes down.
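To make the tuple idea concrete, here is a minimal sketch of what such a function could look like with Rust's standard library. It assumes the usual socket convention that port 0 means "pick any free port", and it binds to localhost only, which is my choice for the example and not something taken from the legacy code.

```rust
use std::io;
use std::net::TcpListener;

/// Binds to `port` on localhost; 0 asks the OS for any free port.
/// Returns the listener together with the port actually in use, instead of
/// writing the port back through a pointer as the C version does.
fn startup(port: u16) -> io::Result<(TcpListener, u16)> {
    let listener = TcpListener::bind(("127.0.0.1", port))?;
    let actual_port = listener.local_addr()?.port();
    Ok((listener, actual_port))
}
```

The caller's input is left untouched, and a bind failure is reported through the Result rather than by exiting inside the function.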
That’s enough to understand a bit more about main. After a simple status message, the program moves on to the main loop:
This is an infinite loop which accepts a connection, as can be seen if we look up accept(), which is where client_sock gets its value. It returns a file handle representing the socket, or -1 if it fails for some reason, and the next few lines check for that eventuality. This is another suboptimal design imposed by C’s lack of algebraic data types - in Rust, this idea can be represented with an Option or a Result.
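To illustrate the difference, here is a small sketch of wrapping a C-style "descriptor or -1" return value in a Result. The helper names check_fd and describe are hypothetical. In an actual rewrite this wrapper would likely be unnecessary, since the standard library's TcpListener::accept already returns an io::Result.

```rust
use std::io;

/// Converts a C-style "descriptor or -1" return value into a Result.
fn check_fd(ret: i32) -> io::Result<i32> {
    if ret == -1 {
        Err(io::Error::new(io::ErrorKind::Other, "call returned -1"))
    } else {
        Ok(ret)
    }
}

/// Example caller: the match has to cover both outcomes, so the error case
/// cannot be forgotten the way an unchecked -1 can.
fn describe(ret: i32) -> String {
    match check_fd(ret) {
        Ok(fd) => format!("got descriptor {}", fd),
        Err(e) => format!("failed: {}", e),
    }
}
```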
The next few lines try (and handle errors for) spawning a new thread that runs accept_request. Looking at the comments here is not quite as illuminating as one might hope:
I’m not really sure what processing the request “appropriately” entails. For now, though, it’s enough to know that this is the main function for dealing with incoming requests.
The only code after this is cleanup code we won’t need in the rewrite, so we have enough info to write a short pseudocode summary of the server:
That’s a lot simpler than one might have imagined from the length of this post, and it doesn’t tell us much about the actual functionality of the server, but it gives you a good idea of the process one often has to go through to understand legacy code.
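For comparison, the same overall structure (bind, loop on accept, hand each connection to a new thread) can be sketched in Rust as follows. This is an illustration under assumptions, not the rewrite itself: run_server and handle_connection are names I've invented, the handler just sends a fixed empty response, and the limit parameter exists only so the loop can be stopped in a test.

```rust
use std::io::{self, Write};
use std::net::{TcpListener, TcpStream};
use std::thread;

/// Placeholder for the real request handling (accept_request in the C code).
fn handle_connection(mut stream: TcpStream) -> io::Result<()> {
    stream.write_all(b"HTTP/1.1 200 OK\r\nContent-Length: 0\r\nConnection: close\r\n\r\n")
}

/// Accepts connections and hands each one to its own thread.
/// `limit` is Some(n) to stop after n connections; None runs forever.
fn run_server(listener: TcpListener, limit: Option<usize>) -> io::Result<()> {
    let mut accepted = 0;
    for stream in listener.incoming() {
        // A failed accept ends the loop with an error, as in the original.
        let stream = stream?;
        thread::spawn(move || {
            if let Err(e) = handle_connection(stream) {
                eprintln!("connection error: {}", e);
            }
        });
        accepted += 1;
        if Some(accepted) == limit {
            break;
        }
    }
    Ok(())
}
```

One thread per connection mirrors the legacy design; a thread pool or async runtime would be a separate decision for later.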
Now that we have examined the basic structure of the server’s execution, I’m going to dive into the actual functionality and logic of the server, which is encapsulated primarily in the function accept_request, whose signature is void accept_request(void *arg). This is a signature that is totally unrevealing, and which in Rust would require an unsafe block; this function takes a raw pointer with no type information at all. We’ll have to do quite a bit of work to understand what the function actually does.
First of all, are there any clues about what the argument might represent? Well, we can look back at how the function is called:
pthread_create(&newthread , NULL, (void *)accept_request, (void *)&client_sock)
This is a little complicated, but essentially a new thread is being spawned which will execute accept_request(&client_sock). This is the only place this function is called, so the argument is presumably expected to be only a pointer to an integer file descriptor to a socket - but the compiler knows none of that! That’s a lot of unchecked assumptions and unsafe memory access. Rust, and more importantly the Rust standard library, has better invariant checking, which will make the re-implementation a great deal safer and thus easier to extend.
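A sketch of how the same hand-off could be typed in Rust is below. The names serve_client and spawn_handler are hypothetical, and the handler body is a trivial placeholder. The point is the signature: the argument is known to be a readable and writable connection, and the move closure transfers ownership of it to the new thread, so the thread never holds a pointer to a variable that the accepting loop goes on to reuse.

```rust
use std::io::{self, Read, Write};
use std::thread;

/// Typed counterpart of accept_request: the handler owns its connection.
/// Returns the number of request bytes read (placeholder behavior).
fn serve_client<S: Read + Write>(mut client: S) -> io::Result<usize> {
    let mut request = [0u8; 1024];
    let n = client.read(&mut request)?;
    client.write_all(b"ok\n")?;
    Ok(n)
}

/// Spawns a thread that takes ownership of `client`. The bounds are checked
/// at compile time, unlike the casts to void * in the C call.
fn spawn_handler<S>(client: S) -> thread::JoinHandle<io::Result<usize>>
where
    S: Read + Write + Send + 'static,
{
    thread::spawn(move || serve_client(client))
}
```

Because the handler is generic over Read + Write, it can be exercised with an in-memory buffer as well as a real TcpStream.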
Moving on to the body of the function, we see the creation of a lot of local variables which I’ll go into as they’re used. It is important to note, though, that there is a group of buffers created with absolute lengths. These appear, at first glance, to be possible introduction points for overflow vulnerabilities - something that is mitigated by the Rust idiom of defaulting to using Vecs instead of arrays.
One of these buffers, of length 1024, is populated using the function get_line, which, according to the comments above its definition, reads a line into a buffer and null-terminates it, with length checking, and returns the number of bytes stored. That buffer is printed and dissected over the course of the next 90 lines or so.
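A length-limited line read can be sketched with the standard library's BufRead trait. This is an assumption-laden illustration and not a port of get_line: the name read_line_limited is mine, the result is a growable String with no null terminator, and, unlike the C version, it fails on input that is not valid UTF-8.

```rust
use std::io::{self, BufRead, Read};

/// Reads up to and including the next '\n', but never more than `max_len`
/// bytes. An over-long line is cut off at the limit; the rest stays in the
/// reader for the next call.
fn read_line_limited<R: BufRead>(reader: &mut R, max_len: u64) -> io::Result<String> {
    let mut line = String::new();
    // take() caps how many bytes read_line is allowed to consume.
    reader.by_ref().take(max_len).read_line(&mut line)?;
    Ok(line)
}
```

The returned String carries its own length, so there is no separate byte count to keep in sync with the buffer.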
Now that it’s clear how I dissect each line of code, I’m going to move a bit faster, translating the entire program into pseudocode function by function. What we currently have is this:
And we’re examining the idea of “dealing with the client”. This is all done in the accept_request function, whose pseudocode looks a bit like this:
This analysis is pretty revealing: essentially all this function does is determine some properties of a request and then pass it off to be handled appropriately by other functions.
This particular function should be fairly easy to translate into more efficient Rust code, especially if we look at using Rust’s more advanced type system. In particular, rather than having a large number of buffers, I’d like to use slices and ADTs. For example, I might create an enum describing the kind of request. Then I could use a match expression to dispatch the request appropriately, whether to the static server, CGI handler, or error response.
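One possible shape for that enum and match is sketched below. Everything here is illustrative: the names RequestKind, classify, and dispatch are invented, and the classification rules (GET without a query string is static, GET with a query string or any POST goes to CGI, anything else is unimplemented) are a simplifying assumption for the example. The handler functions themselves are the subject of the next post, so dispatch only names the handler it would call.

```rust
/// Kinds of request the server distinguishes. The fields borrow slices of
/// the request line instead of copying into separate fixed-size buffers.
#[derive(Debug, PartialEq)]
enum RequestKind<'a> {
    Static { path: &'a str },
    Cgi { path: &'a str, query: &'a str },
    Unimplemented { method: &'a str },
    Malformed,
}

/// Classifies a request line such as "GET /index.html HTTP/1.1".
/// The rules are assumptions for this sketch (see the text above).
fn classify(request_line: &str) -> RequestKind<'_> {
    let mut parts = request_line.split_whitespace();
    let (method, url) = match (parts.next(), parts.next()) {
        (Some(method), Some(url)) => (method, url),
        _ => return RequestKind::Malformed,
    };
    let (path, query) = match url.split_once('?') {
        Some((path, query)) => (path, Some(query)),
        None => (url, None),
    };
    match (method, query) {
        ("GET", None) => RequestKind::Static { path },
        ("GET", Some(query)) => RequestKind::Cgi { path, query },
        ("POST", query) => RequestKind::Cgi { path, query: query.unwrap_or("") },
        _ => RequestKind::Unimplemented { method },
    }
}

/// Names the handler each kind would be routed to. The compiler rejects
/// this match if a variant is left out.
fn dispatch(kind: &RequestKind<'_>) -> &'static str {
    match kind {
        RequestKind::Static { .. } => "static file server",
        RequestKind::Cgi { .. } => "CGI handler",
        RequestKind::Unimplemented { .. } | RequestKind::Malformed => "error response",
    }
}
```

Adding a new variant later forces every match on RequestKind to be revisited, which is the kind of checking the buffer-based C version cannot offer.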
In the next post, I’ll take a look at the handler functions and how they handle the various conditions and actions a request can trigger - unimplemented method, resource not found, static file serving, and CGI execution. I’ll also discuss the Rust idioms that can be used to better model the intended behavior and internal structure of this server.