[ planet-factor ]

Daniel Ehrenberg: Parsing with regular expressions and group capture

Though I'd rather avoid it, string parsing is a crucial part of programming. Since we're more than 60 years into the use of modern computers, it seems like we should have a pretty good handle on how to build abstractions over parsing. Indeed, there are tons of great tools out there, like GNU Yacc, Haskell's Parsec, newer Packrat-based parsers like Factor's EBNF syntax for PEGs, and a bunch of other high level parsing libraries. These libraries are relatively easy to use once you understand the underlying structure (each one parses a different subset of context-free grammars), because they expose the programmer to a tree-like view of the string.

However, these incur too much overhead to be used for certain domains, like the parsing that goes on in an HTTP server or client. They're really overkill when, as in the case of HTTP interchanges, what you're dealing with is a regular language and processing can be done on-line. (I'll get back later to what I mean by those two things.) The main tools that exist to deal with this are Lex and Ragel.

Ragel seems like a really interesting solution for this domain. The entire description of parsing is eventually put in one regular expression, which is compiled to a DFA, where states and transitions can be annotated by actions. Fine-grained control is given to limit non-determinism. But the user must be acutely aware of how regular expressions correspond to DFAs in order to use the abstraction. So it is somewhat leaky. Also, it's difficult to get a tree-like view on things: actions are used purely for their side effect.

So, here's an idea: let's find a middle ground. Let's try to use regular expressions, with all of their efficiency advantages, but get an abstract tree-like view of the grammar and an ability to use parsing actions like high-level abstractions allow. Ideally, the user won't have to know about the implementation beyond two simple facts: regular languages can't use general recursion, and nondeterminism should be minimized.

This isn't something that I've implemented, but I have a pretty good idea for the design of such as system, and I wanted to share it with you. First, a little background.

DFAs and regular expressions


I'm taking a computer science class about this right now, so I'm gonna be totally pedantic. When I say regular expression, I mean an expression that describes a regular language. Perl regexes aren't regular expressions (and Larry Wall knows this). If you don't feel like putting on your theoretician's hat, this blog post will be mostly meaningless.

What's a regular language? First off, a language is a set of strings. We care about infinite sets of strings, since finite sets are trivial to represent. If a string is in the language, that means that the language matches the string, intuitively. A regular language is one which can be represented by a deterministic finite automaton (DFA) without extensions, also called a finite state machine (FSM) for some reason. Many useful languages are regular, and many are not.

The idea of a DFA is a finite number of states and a transition function, which takes the current state and a character of a string and returns the next state. The transition function is defined on all states and all characters in the alphabet. There is a set of final states, and if the string runs out when the machine is in a final state, then the string is accepted. The language of the DFA is the set of strings accepted by the DFA. For a given DFA, that DFA can be run in linear time with respect to the length of the input string and constant space. It can also be run "on-line", that is, without the whole string in advance, going incrementally.

A related construction is an NFA, or nondeterministic finite automaton. Imagine the previous idea, but instead of a transition function, there is a transition relation. That is, for any character and current state, there are zero or more next states to go to, and the NFA always picks the right one. This is called nondeterminism (at least that's what it means here). Amazingly, NFAs can accept only regular languages and nothing more, because NFAs can be translated into DFAs. Basically, you build a DFA which picks all possible states at once, given all possible paths through the NFA. Potentially, though, there's an exponential blowup in the number of states.

Every regular expression can be converted into an equivalent NFA, which can be converted into a DFA, which can then be converted back into a regular expression. They're all equivalent. So then what's a regular expression? There are different ways to define it. One is that you can build up a regular expression from the following elements: the epsilon regex (matching the empty string), the empty regex (matching nothing), single character regexes (matching just a single character), concatenation (one followed by another), disjunction (or) and the Kleene star (0 or more copies of something). Counterintuitively, it's possible to construct regexes which support negation, conjunction, lookahead and other interesting things.

The most important distinction from Perl regexes is that regular expressions cannot contain backreferences, because these are provably impossible to express in a DFA. It's impossible to parse something with backreferences in the same linear time and constant space that you get from regexes which are regular. In fact, parsing patterns with backreferences is NP-hard and not believed possible in polynomial time (with respect to the length of the input string). Since regular expressions which are regular give us such nice properties, I'm going to stick to them.

Regular expressions in practice in parsing today


The formal study of regular languages is a basically solved problem within the formalism itself: they are equivalent to DFAs, and satisfy a convenient set of properties summarized by the pumping lemmaand the Myhill-Nerode theorem. The problem is just, is the given string a member of the language? What languages are regular?

This was solved in the 1950s and 1960s, and the basic results are in most introductory compiler books. Those books use the solution to build lexers, like Lex. Lex basically takes a list of regular expressions, each associated with an action, finds one of them to match maximally with the current input, executes the associated action on the portion that matches, and then repeats with the rest of the string. This is useful to build lexers, but the programmer has very little context for things, so it's difficult to use for much else.

More recently, Ragel has been used as a way to parse more complicated things using regular expressions. Its strategy is to turn its compile-time input into one big regular expression, annotated with actions on certain states or transitions. The actions are fragments of C code, and they form the processing power of the machine. However, their behavior can get a little unintuitive if too much nondeterminism is used, so Ragel provides a bunch of tools to limit that. Also, Ragel lets you explicitly specify a DFA through transitions, which seems useful but low-level.

Group capture with regexes


One of the most useful features of Perl's regular expressions is group capture. By this, I mean how you can do something like s/^(1*)(0*)$/$2$1/ to swap ones and zeros in a string. This is different from backreferences (like the non-regular expression /(.*)$1/) because it's only used in subsequent code, to figure out what got matched to what part of the regex. It doesn't parse any languages which aren't regular, but it's a useful tool for processing.

Curiously, this has been ignored both by academics and DFA implementors so far. I hypothesize that it's been ignored by theorists for two reasons: (1) It's easy to confuse with backreferences, which make the language non-regular, which is totally uninteresting to theorists (2) They're not part of the formalism of regular expressions as previously expressed.

Implementors of (non-Perl-based) regular expression-based parsing mechanisms tend to avoid group capture because, in the general case, it's not fast enough and can't be done on-line. Also, as far as I can tell, it hasn't been implemented any other way than interpreting an NFA, using backtracking, and keeping track of where the parser is within the regex to determine group boundaries. This would be terrible for the domain of Lex and Ragel. By "on-line" I don't mean on the internet but rather that an algorithm that can be performed incrementally, getting little pieces (characters, say) of the input and doing computation incrementally, as the input is received, without storing the whole thing and running the algorithm all at once.

So how can we do group capture on-line? Well, in general, we can't. Consider the regular expression (1*)1 where you're trying to capture the group 1*. As the input is being processed, we don't know when we've gotten to the end of the group until the entire input is over, since if there are two more 1's, then the first one must be in the group. However, in many cases group capture can in fact be done on-line, as in (0*)(1*), where the groups captured are 0* and 1*. As the regex is processing on the string, it knows that, if there is a match, the group boundary is just before the first 1. This can be formalized as a "boundary of determinism": a point where, in the subset construction to form a DFA from an NFA gets a subset of exactly one state.

I believe this can handle most cases of group capture in practice, if the regular expression is well-written, but surely not all of them. I have an idea for how to do group capture in the few remaining circumstances, but unfortunately it takes linear space and it's not online. I'll blog about it once I have a proof of correctness.

Hierarchical parsing using group capture


Using this group capture mechanism, we can build a hierarchical parsing mechanism with actions on different things, which can be built to parse regular languages in a higher-level way. Regular expressions can't use arbitrary recursion like context-free grammars can, so the parse tree will be of fixed size, but it could still be useful. In designing this, I'm thinking specifically about making a SAX-like XML parser. It'd be awkward to write everything out as one big regular expression, but split into smaller things, each with their own little steps in processing, it could be much more elegant. My goal for syntax is something like EBNF syntax, as Chris Double's PEGs library in Factor does. Here's some future pseudocode for how it could look in parsing an XML tag, simplified. (In this code, > is used like Ragel :>>, to indicate that when the expression afterwards can be matched by the regex, it is, as soon as possible (basically).)

REG: tag
chars = "&" entity:any* > ";" [[ entity lookup-entity ]]
| any
string = "\"" str:chars > "\"" [[ str ]]
| "'" str:chars > "'" [[ str ]]
xml-name = name-start-char name-char*
attribute = name:xml-name ws "=" ws str:string [[ name str 2array ]]
tag = "<" ws closer?:("/"?) name:xml-name attrs:(attribute*) ws contained?:("/"?) ws ">" [[ ... ]]

Conclusion


Though I haven't implemented this yet, and probably shouldn't even be talking about it, I'm really excited about this idea. I even came up with a stupid little name with it: Hegel, both for High-level Ragel and because it represents the synthesis of the dialectic (as described by Hegel) of slower, higher-level parsing and fast low-level parsing into fast, high-level parsing of regular languages. I hope it works.

Sat, 10 May 2008 11:05:00

Slava Pestov: Garbage collection throughput improvements

The problem


About a month ago, I made some changes to the garbage collector. These changes included drastically decreasing the default nursery size from 16mb to 1mb. This turned out to have a negative effect on performance, with GC consuming as much as 60% of run time on some tests.

Instead of making the default nursery larger, which would just mask the problem until we hit workloads with 16 times as much allocation (or when the compiler began to generate code that was 16 times faster, or any combination of the above), I decided to investigate the root cause of the problem.

Turns out that every minor GC was taking on the order of a few milliseconds at the minimum, which is too much time. Most of the time was spent scanning the cards array for marked cards. Even if there weren't any marked cards, with a 64mb heap and 64 byte card size, that's a million array entries to check.

Faster card scanning


The first thing I did was change how the card scan loop operates. Instead of iterating over the array a byte at a time, and checking if each byte satisfies the card mark mask, we iterate over the array four bytes at a time, checking each group of four bytes against the mask. GCC should really be doing this optimization for me, because its a simple case of vectorization, but it doesn't, so I just wrote the code out by hand:
u32 *quad_ptr;
u32 quad_mask = mask | (mask << 8) | (mask << 16) | (mask << 24);

for(quad_ptr = (u32 *)first_card; quad_ptr < (u32 *)last_card; quad_ptr++)
{
if(*quad_ptr & quad_mask)
{
F_CARD *ptr = (F_CARD *)quad_ptr;

int card;
for(card = 0; card < 4; card++)
{
if(ptr[card] & mask)
{
collect_card(&ptr[card],gen,here);
ptr[card] &= ~unmask;
}
}
}
}

This improved card scan performance by a factor of 3, which makes sense, because we do four times less loop iterations, but at some point, you still have to hit memory, which dampens things out a bit.

This optimization improved performance, but not sufficiently so. I realized that card scanning hurt the most in benchmarks which didn't write into objects in the old generation at all; for example, a benchmark which operates on bignums, where the intermediate values are extremely short-lived, fills up the nursery very rapidly, and every minor GC checks all cards, which not only fills up the cache with junk, but doesn't achieve anything useful as no cards will be marked.

On the other hard, a pointer recording approach where all stores are written into a buffer is also unacceptable, because in code with a high rate of mutation, you pay a hefty price every time the buffer overflows. So I had to find a compromise.

Introducing card decks


My solution is something I haven't seen done anywhere else before, but it is pretty trivial if you think about it. It amounts to two-level card marking. We have an array of cards, where each card maps to a small amount of memory: 64 bytes in my case, but it could be anything; and an array of card decks. Each card deck corresponds to 1024 cards, and if a card is marked, then its deck must be marked too. This way, on a minor GC, we scan the deck array first, and then only check cards corresponding to marked decks.

This complicates the write barrier, because now on every store we have to mark a card and a deck, not just a card as before. I decided to counterweight this by simplifying the write barrier in another way. Formerly, it would read the card, "or" it with a mask, and store it back. This was done because the card contained an additional piece of information, and that is the offset within the card where the first object begins. Cards have two mark bits, denoting a possible pointer from this generation to either aging space or the nursery, and the other six bits were reserved for the offset of the first object. By moving object offsets (I call them "allot markers") into a separate array, I removed a read operation from the write barrier; now it simply has to write the mark mask to the card, without caring about its existing contents. This also freed up two bits in the allot marker, allowing me to switch to 256 byte cards (and 256 kilobyte card decks).

These improvements really helped on benchmarks which allocated large numbers of extremely short-lived objects, however on benchmarks which allocated longer-lived objects the improvements were still major but not so drastic. It turns out that if objects survives long enough to be copied from the nursery into aging space, then it is possible for this copying to begin to dominate running time.

Low-level optimizations


The inner loop of the GC looked like so:
void copy_handle(CELL *handle)
{
CELL pointer = *handle;

if(!immediate_p(pointer) && should_copy(pointer))
*handle = copy_object(pointer);
}

CELL collect_next(CELL scan)
{
do_slots(scan,copy_handle);

if(collecting_gen == TENURED)
do_code_slots(scan);

return scan + untagged_object_size(scan);
}

This is very elegant code; do_slots is a higher-order function which applies a function pointer to each slot of the object. However, GCC doesn't try to optimize higher-order code at all, after all it is a C compiler not an ML compiler, and it isn't Sufficiently Smart Either! So I rewrote the above function by manually inlining do_slots() and replacing the function pointer invocation with the body of copy_handle(). The next thing I realized was that should_copy() contains a large series of conditional tests which did not depend on its input parameters, only global variables whose value was invariant for the duration of the collection:
/* test if the pointer is in generation being collected, or a younger one. */
INLINE bool should_copy(CELL untagged)
{
if(in_zone(newspace,untagged))
return false;
if(collecting_gen == TENURED)
return true;
else if(HAVE_AGING_P && collecting_gen == AGING)
return !in_zone(&data_heap->generations[TENURED],untagged);
else if(HAVE_NURSERY_P && collecting_gen == NURSERY)
return in_zone(&nursery,untagged);
else
{
critical_error("Bug in should_copy",untagged);
return false;
}
}

So again, I did a bit of hand-optimization. If foo() does not depend on the loop variable or the side effects of the body, then the following are equivalent:
while(A)
{
if(foo()) B else C;
}

if(foo)
{
while(A) B;
}
else
{
while(A) C;
}

In the optimizing compiler literature, this is known as loop unswitching.

The new collect_next() is much longer, but also much faster. It looks like this:
CELL collect_next(CELL scan)
{
CELL *obj = (CELL *)scan;
CELL *end = (CELL *)(scan + binary_payload_start(scan));

obj++;

CELL newspace_start = newspace->start;
CELL newspace_end = newspace->end;

if(HAVE_NURSERY_P && collecting_gen == NURSERY)
{
CELL nursery_start = nursery.start;
CELL nursery_end = nursery.end;

for(; obj < end; obj++)
{
CELL pointer = *obj;

if(!immediate_p(pointer)
&& (pointer >= nursery_start && pointer < nursery_end))
*obj = copy_object(pointer);
}
}
else if(HAVE_AGING_P && collecting_gen == AGING)
{
F_ZONE *tenured = &data_heap->generations[TENURED];

CELL tenured_start = tenured->start;
CELL tenured_end = tenured->end;

for(; obj < end; obj++)
{
CELL pointer = *obj;

if(!immediate_p(pointer)
&& !(pointer >= newspace_start && pointer < newspace_end)
&& !(pointer >= tenured_start && pointer < tenured_end))
*obj = copy_object(pointer);
}
}
else if(collecting_gen == TENURED)
{
for(; obj < end; obj++)
{
CELL pointer = *obj;

if(!immediate_p(pointer)
&& !(pointer >= newspace_start && pointer < newspace_end))
*obj = copy_object(pointer);
}

do_code_slots(scan);
}
else
critical_error("Bug in collect_next",0);

return scan + untagged_object_size(scan);
}

That's a hell of a code explosion, but it made a real difference to the performance of the garbage collector.

Benchmarks


The first benchmark just allocates 3-element arrays in a loop:
: garbage 1 2 3 3array ;

: garbage-loop 150000000 [ garbage drop ] times ;

[ garbage-loop ] time

Here are the results with different nursery sizes; all times are in milliseconds and I ran each benchmark multiple times to ensure I was getting reliable results:
BeforeAfter
1mb nursery:89084461
2mb nursery:64554451
4mb nursery:53364582
8mb nursery:47794582

The second benchmark allocates larger arrays in a loop:
: garbage 100 f <array> ;

: garbage-loop 15000000 [ garbage drop ] times ;

[ garbage-loop ] time

The results are similar:
BeforeAfter
1mb nursery:104182158
2mb nursery:6364 2178
4mb nursery:4966 3223
8mb nursery:4249 3294

An interesting phenomenon occurs where after the GC changes, a larger nursery actually begins to hurt performance. Dan and I hypothesized that this is due to the larger nursery hurting locality, with poor algorithms masking this effect with the older GC.

The next benchmark allocates a lot of temporary arrays but stores them into larger arrays which are somewhat longer lived:
: garbage 100 f <array> ;

: garbage-loop 150 [ 10000 [ drop garbage ] map drop ] times ;

[ garbage-loop ] time

The size of the aging generation makes a difference here so I tried the benchmark with more parameters:
BeforeAfter
1mb nursery/2mb aging:86935525
2mb nursery/2mb aging:73185255
4mb nursery/2mb aging:39392628
8mb nursery/2mb aging:23141526
1mb nursery/8mb aging:34551493
4mb nursery/4mb aging:30241894
4mb nursery/4mb aging:31571577
8mb nursery/8mb aging:15331003

The above benchmarks are somewhat unrealistic, however other benchmarks showed speedups too, some more drastic than others:


BenchmarkBeforeAfter
SHA1121987282
Random1898914459
Sort1568513476
Raytracer2770012966
Mandelbrot46331847

Before the shrinking of the nursery, the runtimes of these benchmarks looked much like they do now; it was only until a month ago that GC time began to dominate benchmarks like that. However I like having a small nursery, and making it smaller forced me to optimize the garbage collector for higher throughput in constrained-memory conditions.

Looking ahead


The next big change I will be making to the garbage collector is switching to using mark/sweep for the old generation. This will reduce memory consumption by almost 50%.

The other side of this coin is compiler optimizations. If the compiler performed more advanced static analysis, it could eliminate much of the dynamic memory allocation in the above benchmarks, cutting GC time further. Of course, if we have a great compiler and a great GC, we'll just have great performance all around.

Sat, 10 May 2008 00:33:00

Daniel Ehrenberg: Interval maps in Factor

Recently, I wrote a little library in Factor to get the script of a Unicode code point. It's in the Factor git repository in the vocab unicode.script. Initially, I relatively simple representation of the data: there was a byte array, where the index was the code point and the elements were bytes corresponding to scripts. (It's possible to use a byte array because there are only seventy-some scripts to care about.) Lookup consisted of char>num-table nth num>name-table nth. But this was pretty inefficient. The largest code point (that I wanted to represent here) was something around number 195,000, meaning that the byte array took up almost 200Kb. Even if I somehow got rid of that empty space (and I don't see an obvious way how, without a bunch of overhead), there are 100,000 code points whose script I wanted to encode.

But we can do better than taking up 100Kb. The thing about this data is that scripts are in a bunch of contiguous ranges. That is, two characters that are next to each other in code point order are very likely to have the same script. The file in the Unicode Character Database encoding this information actually uses special syntax to denote a range, rather than write out each one individually. So what if we store these intervals directly rather than store each element of the intervals?

A data structure to hold intervals with O(log n) lookup and insertion has already been developed: interval trees. They're described in Chapter 14 of Introduction to Algorithms starting on page 311, but I won't describe them here. At first, I tried to implement these, but I realized that, for my purposes, they're overkill. They're really easy to get wrong: if you implement them on top of another kind of balanced binary tree, you have to make sure that balancing preserves certain invariants about annotations on the tree. Still, if you need fast insertion and deletion, they make the most sense.

A much simpler solution is to just have a sorted array of intervals, each associated with a value. The right interval, and then the corresponding value, can be found by simple binary search. I don't even need to know how to do binary search, because it's already in the Factor library! This is efficient as long as the interval map is constructed all at once, which it is in this case. By a high constant factor, this is also more space-efficient than using binary trees. The whole solution takes less than 30 lines of code.

(Note: the intervals here are closed and must be disjoint. <=> must be defined on them. They don't use the intervals in math.intervals to save space, and since they're overkill. Interval maps don't follow the assoc protocol because intervals aren't discrete, eg floats are acceptable as keys.)

First, the tuples we'll be using: an interval-map is the whole associative structure, containing a single slot for the underlying array.


TUPLE: interval-map array ;

That array consists of interval-nodes, which have a beginning, end and corresponding value.

TUPLE: interval-node from to value ;

Let's assume we already have the sorted interval maps. Given a key and an interval map, find-interval will give the index of the interval which might contain the given key.

: find-interval ( key interval-map -- i )
[ from>> <=> ] binsearch ;

interval-contains? tests if a node contains a given key.

: interval-contains? ( object interval-node -- ? )
[ from>> ] [ to>> ] bi between? ;

Finally, interval-at* searches an interval map to find a key, finding the correct interval and returning its value only if the interval contains the key.

: fixup-value ( value ? -- value/f ? )
[ drop f f ] unless* ;

: interval-at* ( key map -- value ? )
array>> [ find-interval ] 2keep swapd nth
[ nip value>> ] [ interval-contains? ] 2bi
fixup-value ;

A few convenience words, analogous to those for assocs:

: interval-at ( key map -- value ) interval-at* drop ;
: interval-key? ( key map -- ? ) interval-at* nip ;

So, to construct an interval map, there are a fewi things that have to be done. The input is an abstract specification, consisting of an assoc where the keys are either (1) 2arrays, where the first is the beginning of an interval and the second is the end (2) numbers, representing an interval of the form [a,a]. This can be converted into a form of all (1) with the following:

: all-intervals ( sequence -- intervals )
[ >r dup number? [ dup 2array ] when r> ] assoc-map
{ } assoc-like ;

Once that is done, the objects should be converted to intervals:

: >intervals ( specification -- intervals )
[ >r first2 r> interval-node boa ] { } assoc>map ;

After that, and after the intervals are sorted, it needs to be assured that all intervals are disjoint. For this, we can use the monotonic? combinator, which checks to make sure that all adjacent pairs in a sequence satisfy a predicate. (This is more useful than it sounds at first.)

: disjoint? ( node1 node2 -- ? )
[ to>> ] [ from>> ] bi* < ;

: ensure-disjoint ( intervals -- intervals )
dup [ disjoint? ] monotonic?
[ "Intervals are not disjoint" throw ] unless ;

And, to put it all together, using a tuple array for improved space efficiency:

: <interval-map> ( specification -- map )
all-intervals [ [ first second ] compare ] sort
>intervals ensure-disjoint >tuple-array
interval-map boa ;

All in all, in the case of representing the table of scripts, a table which was previously 200KB is now 20KB. Yay!

Wed, 7 May 2008 21:50:00

Slava Pestov: I/O changes, and process pipeline support

I made some improvements to the I/O system today.

Default stream variables


The stdio variable has been replaced by input-stream and output-stream, and there are four new words:
  • with-input-stream
  • with-output-stream
  • with-input-stream*
  • with-output-stream*

The first two close the stream after, the latter do not. The with-stream and with-stream* words are still around, they expect a duplex stream, unpack it, and bind both variables.

I've changed many usages of with-stream to use one of the unidirectional variants instead. This means that you can now write code like the following:
"foo.txt" utf8 [
"blah.txt" utf8 [
... read from the first file, write to the second file,
using read and write ...
] with-file-writer
] with-file-reader

Before you had to use this really ugly "design pattern":
"foo.txt" utf8 <file-reader> [
"blah.txt" utf8 <file-writer> [
<duplex-stream> [
...
] with-stream
] with-disposal
] with-disposal

Speaking of duplex streams, because they're not used by anything in the core anymore I have moved them to extra. They are still used by <process-stream> and <client>. I added a <process-reader> and <process-writer> word for those cases where you only want a pipe in one direction; they express intention better.

Pipes


The <process-stream> word has been around for a while, and this word used pipes internally, but they were not exposed in a nice, cross-platform way, until now.

The io.pipes vocabulary contains two words:
  • <pipe> creates a new pipe and wraps a pair of streams around it. The streams are packaged into a single duplex stream; any data written to the stream can be read back from the same stream (presumably, in a different thread). This is actually implemented with native pipes on Unix and Windows.
  • The run-pipeline word takes a sequence of quotations or launch descriptors, and runs them all in parallel with input wired up as if it were a Unix shell pipe. For example,
    { "cat foo.txt" "grep x" "sort" "uniq" } run-pipeline

    Corresponds to the following shell command:
    cat foo.txt | grep x | sort | uniq

    In addition, being able to place process objects and quotations in the pipeline gives you a lot of expressive power.

Appending process output to files


The io.launcher vocabulary supports the full array of input and output redirection features, and now, pipelines. There was one missing component: redirecting process output to a file opened for appending. Now this is possible. The following Factor code:
<process>
"do-stuff" >>command
"log.txt" <appender> >>stderr
run-process

Corresponds to this shell command:
do-stuff 2>> do-stuff.txt

Of course, shell script is a DSL for launching processes so it is more concise than Factor. However, Factor's io.launcher library now supports all of the features that the shell does, and its pretty easy to build a shell command parser using Chris Double's PEG library, which translates shell commands to sequences of Factor process descriptors in a pipeline.

And now, I present a concise illustration of the difference between the Unix philosophy and the Windows philosophy. Here we have two pieces of code, which do the exact same thing: create a new pipe, open both ends, return a pair of handles.

Unix:
USING: system alien.c-types kernel unix math sequences
qualified io.unix.backend io.nonblocking ;
IN: io.unix.pipes
QUALIFIED: io.pipes

M: unix io.pipes:(pipe) ( -- pair )
2 "int" <c-array>
dup pipe io-error
2 c-int-array> first2
[ [ init-handle ] bi@ ] [ io.pipes:pipe boa ] 2bi ;


Windows:
USING: alien alien.c-types arrays destructors io io.windows libc
windows.types math.bitfields windows.kernel32 windows namespaces
kernel sequences windows.errors assocs math.parser system random
combinators accessors io.pipes io.nonblocking ;
IN: io.windows.nt.pipes

: create-named-pipe ( name -- handle )
{ PIPE_ACCESS_INBOUND FILE_FLAG_OVERLAPPED } flags
PIPE_TYPE_BYTE
1
4096
4096
0
security-attributes-inherit
CreateNamedPipe
dup win32-error=0/f
dup add-completion
f <win32-file> ;

: open-other-end ( name -- handle )
GENERIC_WRITE
{ FILE_SHARE_READ FILE_SHARE_WRITE } flags
security-attributes-inherit
OPEN_EXISTING
FILE_FLAG_OVERLAPPED
f
CreateFile
dup win32-error=0/f
dup add-completion
f <win32-file> ;

: unique-pipe-name ( -- string )
[
"\\\\.\\pipe\\factor-" %
pipe counter #
"-" %
32 random-bits #
"-" %
millis #
] "" make ;

M: winnt (pipe) ( -- pipe )
[
unique-pipe-name
[ create-named-pipe dup close-later ]
[ open-other-end dup close-later ]
bi pipe boa
] with-destructors ;

The Windows API makes things difficult for no reason. Anonymous pipes do not support overlapped I/O, so you have to open a named pipe with a randomly-generated name (I'm not making this up, many other frameworks do the same thing and Microsoft even recommends this approach).

On the flipside, the nice thing about supporting both Unix and Windows is that I get to come up with true high-level abstractions that make sense, instead of being able to get away with having thin wrappers over Unix system calls as so many other language implementations do. For example, Ocaml's idea of "high-level I/O" is some basic POSIX bindings, together with incomplete emulation of Unix semantics on Windows. Look forward to writing a page of code if you want to map a file into memory or run a process and read its output into a string. And of course Java doesn't support pipes, I/O redirection for launching processes, or file system change monitoring, at all.

Mon, 5 May 2008 18:02:00

Daniel Ehrenberg: A couple GC algorithms in more detail

In previous posts on garbage collection, I've given a pretty cursory overview as to how things actually work. In this post, I hope to give a somewhat more specific explanation of two incremental (and potentially concurrent or parallel, but we'll ignore that for now) GC algorithms: Yuasa's snapshot-at-the-beginning incremental mark-sweep collector, and the MC2 algorithm. Yuasa's collector is very widely used, for example in Java 5 when an incremental collector is requested. MC2 is a more recent algorithm designed to reduce the fragmentation that mark-sweep creates, and appears to get great performance, though it isn't used much yet. In their practical implementation, both collectors are generational.

Yuasa's mark-sweep collector


The idea is pretty simple: take a mark-sweep collector and split up the work, doing a little bit on each allocation. When the heap occupancy passes a certain threshold, say 80%, switch into "mark phase", and on each allocation, mark the right amount of the heap so that everything's marked by the time the heap is full. (You can ensure this by making the amount of marking proportional to the amount of memory allocated.) Then, switch into sweep phase, and on each allocation sweep the heap by a certain amount. If a big object is allocated, sweeping continues until there's enough room. Once sweeping is done, the collector returns to a neutral state and allocation takes place without any special collection actions until the free space dips below the threshold.

Making this work


This is a neat little way to specify a GC algorithm. The implementor has three knobs at their disposal: the threshold to begin collection, the speed of marking, and the speed of sweeping. But there's a problem: the algorithm, as I described it, doesn't work. See, the graph of interconnections in the heap may change during the course of marking, and that's a problem. As I described in a previous post, if a pointer gets moved to another location, it might evade marking and get swept, causing memory corruption.

In a snapshot-at-the-beginning incremental marking GC, the technique to save this is to trap all pointer writes and execute a little bit of code: if the collector is in the marking phase, and if the old pointer value isn't marked, it needs to get marked and get pushed on the marking stack so that its children get marked. (The marking stack is the explicit stack used for depth-first traversal of the heap, to mark everything it reaches.) This piece of code is called the write barrier, and it goes on in addition to the generational write barrier, if one is necessary.

Conservativeness


One more thing: objects are allocated as marked, if an object is allocated during a GC cycle. This means that they can't be collected until the next time around. Unfortunately, this means that any generational GC will be ineffective while marking is going on: everything is effectively allocated in the oldest generation. Nevertheless, generations still provide a significant performance advantage, since most time is spent in the neural non-GC state.

This is called snapshot-at-the-beginning not because an actual snapshot is made, but because everything is saved that had something referring to it at the beginning of the marking phase. (Everything that gets a reference to it during the cycle is also saved.) Of all incremental mark-sweep GC algorithms, a snapshot-at-the-beginning collector is the most conservative, causing the most floating garbage to lie around and wait, uncollected, until the next cycle. Other algorithms have techniques to avoid this, but it often comes at other costs.

MC2


Unfortunately, no matter what strategy is used to minimize fragmentation, there is a program which will cause bad fragmentation of the heap, making it less usable and allocation more expensive. For this reason, a compaction strategy is helpful, and the MC2 algorithm (Memory-Constrained Compaction), created by Narendran Sachindran, provides one within an incremental and generational system. The details are somewhat complicated, and in this blog post I'll offer a simplified view. You can also look at the full paper.

MC


The idea is based on the Mark-Copy (MC) algorithm. The heap is divided up into a number of equally sized windows, say 40. One of these is the nursery, and the others act as tenured space. (I don't know why, but the papers about this seem to use a two-generation rather than three-generation model. I think it could easily be updated to use three generations, but I'll stick with this for now.) Each window has a logical number, with the nursery having the highest number.

Nursery collections go on as I've described in a previous post. A tenured space collection is triggered when there is only one (non-nursery) window left free. At this point, the heap is fully marked. During marking, remembered sets of pointers into each window are made. In turn, each window is copied (using Cheney's copying collector) to the open space that exists, starting in the one free window. The remembered sets can be used to update pointers that go to things that were moved. If the lowest number window is copied first, the remembered sets only need to contain pointers from higher windows to lower windows.

New modifications


MC2 adds a few things to this, to make the algorithm incremental and give low upper bounds on space overhead. The first change is that incremental marking is done. This is similar to the incremental snapshot-at-the-beginning marker described above, though the creators of MC2 opted for a version called incremental update, which is less conservative and more complicated but equally sound. The next change is in the copying technique. If a window is determined to have high occupancy (like more than 95%), it is left as it is without copying. Otherwise, windows are collected into groups whose remaining data can fit into one window. Those groups are incrementally copied into a new window.

Other changes make sure that the space overhead is bounded. The size of remembered sets is limited by switching to a card marking system in the event of an overflow. Objects with many references to them are put in semi-permanent storage in the lowest possible window number, minimizing the size of remembered set that they need.

In a benchmark included in the MC2 paper, it is demonstrated that MC2 has the same or slightly better performance compared to non-incremental generational mark-sweep or generational mark-compact, the alternatives for the domain of memory-constrained systems. Pauses more than 30ms are rare, and performance appears to be consistent over a wide range of Java programs.

Sat, 3 May 2008 00:51:00

Slava Pestov: USA Zip code database

Recently someone posted a freely available Zip code database on reddit. The database is in CSV format.

Phil Dawes contributed a CSV parser to Factor a few days ago, and I thought this would be the perfect use-case to test out the parser.

I added the library to extra/usa-cities. It begins with the usual boilerplate:

USING: io.files io.encodings.ascii sequences sequences.lib
math.parser combinators kernel memoize csv symbols inspector
words accessors math.order sorting ;
IN: usa-cities

Then, we define some singleton types for the various states of the union. While this isn't strictly necessary, it allows us to write generic words which dispatch on states; for example, I'm sure Doug's taxes library could use this:
SINGLETONS: AK AL AR AS AZ CA CO CT DC DE FL GA HI IA ID IL IN
KS KY LA MA MD ME MI MN MO MS MT NC ND NE NH NJ NM NV NY OH OK
OR PA PR RI SC SD TN TX UT VA VI VT WA WI WV WY ;

: states ( -- seq )
{
AK AL AR AS AZ CA CO CT DC DE FL GA HI IA ID IL IN KS KY
LA MA MD ME MI MN MO MS MT NC ND NE NH NJ NM NV NY OH OK
OR PA PR RI SC SD TN TX UT VA VI VT WA WI WV WY
} ; inline

ERROR: no-such-state name ;

M: no-such-state summary drop "No such state" ;

MEMO: string>state ( string -- state )
dup states [ word-name = ] with find nip
[ ] [ no-such-state ] ?if ;

Next up, we define a data type storing rows from the CSV database:
TUPLE: city
first-zip name state latitude longitude gmt-offset dst-offset ;

Now a word which reads the database, parses it as CSV, and then parses each column into a specific data type:
MEMO: cities ( -- seq )
"resource:extra/usa-cities/zipcode.csv" ascii <file-reader>
csv rest-slice [
7 firstn {
[ string>number ]
[ ]
[ string>state ]
[ string>number ]
[ string>number ]
[ string>number ]
[ string>number ]
} spread city boa
] map ;

This word is tricky; some notes on its workings:
  • We begin by opening a stream for reading from the CSV file with ASCII encoding.
  • The csv word reads CSV data from a stream.
  • The first line of the file consists of column headings and not actual data, so we ignore it by using the non-copying variant of rest, rest-slice (recall that the primary sequence type in Factor is an array, so removing the first element makes a copy; a slice is a virtual sequence presenting a view of a subsequence of an array).
  • The spread combinator takes a sequence of quotations, Q1,...,QN, and n values from the stack, X1,...XN, and outputs Q1[X1],...,QN[XN]. In this case we're taking the first 7 elements of each row of CSV data (each row has exactly 7 columns so in effect we're pushing every element of the sequence on the stack), then using spread to convert some columns from their initial text format into something more useful to us; a state singleton or a number.
  • Finally, we use city boa to construct a city tuple "by order of arguments"; this slurps the 7 stack values and stores them into a new instance of city (note that the definition of the city type has exactly 7 slots and they are defined in the same order as the columns of the file).
  • Finally, we map over the sequence of rows to perform the above steps on each row of the file. The result is a sequence of city instances.

The word is memoized so of course it will only load the database once.

We can now define words to query it:
MEMO: cities-named ( name -- cities )
cities [ name>> = ] with filter ;

MEMO: cities-named-in ( name state -- cities )
cities [
tuck [ name>> = ] [ state>> = ] 2bi* and
] with with filter ;

: find-zip-code ( code -- city )
cities [ first-zip>> <=> ] binsearch* ;

Now, let's look at some examples.

First, let's look up my Zip code:
( scratchpad ) 55406 find-zip-code .
T{ city f 55406 "Minneapolis" MN 44.938615 -93.22082 -6 1 }

And another famous Zip code:
( scratchpad ) 90210 find-zip-code .
T{ city f 90210 "Beverly Hills" CA 34.088808 -118.40612 -8 1 }

How many states have a city named "Minneapolis"?
( scratchpad ) "Minneapolis" cities-named [ state>> ] map prune .
V{ NC MN KS }

What is the possible range of Zip codes for Austin?
( scratchpad ) "Austin" TX cities-named-in [ first-zip>> ] map [ infimum . ] [ supremum . ] bi
73301
78972

There are many possible applications for this library, including form validation in web apps. It could be extended further: if the database was loaded into a quadtree sorted by latitude/longitude, you could perform queries such as finding all towns within 50 miles of a given city.

Tue, 29 Apr 2008 21:45:00

Slava Pestov: An addendum to "The new HTTP server, part 2"

If you run the web app presented in the last blog post verbatim, you will get a "500 Internal server error" with no further indication of what's going wrong. This is because the code I presented has a minor omission.

The opaque error message is intentional: if your web app crashes, you don't necessarily want to expose internal details to every user that comes along (one famous case was reddit.com, which leaked a portion of their Python codebase inside a stack trace at some point). However, if you set the development-mode global variable to a true value, the behavior of the HTTP server changes in two respects:

  • If an error occurs, the error page contains the error message as well as the full stack trace.
  • Every request begins by calling refresh-all, thus interactive testing of web app changes becomes very straightforward.

If we enable development mode, we see the real error message, "No such table: SESSIONS". This is because I didn't mention that one must initialize the database by creating the table for storing sessions first:
"counter.db" sqlite-db [ init-sessions-table ] with-db

In the next installment of this series, which hopefully won't take as long as the second one did, I will discuss form validation and the templating system.

Tue, 29 Apr 2008 17:45:00

Slava Pestov: Write once, run anywhere

Apple finally released Java 6 for Mac OS X, about a year late, and it only runs on 64-bit Intel Macs. It also requires Leopard, which isn't such a big deal because anyone can upgrade. I sold my G5 but a lot of people out there still use PowerPC Macs and I guess they're just screwed. It's a shame too because Java 6 is the first release that's starting to approach true usability for desktop applications.

Tue, 29 Apr 2008 17:25:00

Daniel Ehrenberg: Potential ideas to explore

I haven't written in a while, and it's a little hard to get started back up, so here are just a bunch of random ideas in my head that I'd like to share with you guys. Sorry if it's a little incoherent...

Possible extensions to Inverse

I've been thinking about possible ways to generalize my system for concatenative pattern matching, currently in extra/inverse. There are two ways to go about it: making a more general constraint solving system, and giving access to the old input when inverting something, as in the Harmony project. A third way is to add backtracking (in a different place than constraint solving would put it). To someone familiar with Inverse, these might seem like they're coming from nowhere, but they're actually very closely related. (To someone not familiar with it, see my previous blog post describing Inverse.)

Constraint solving

The idea of resolving constraints is to figure out as much as you can about a situation given certain facts. This is easy in some cases, but impossible in others, even if enough facts are known to, potentially, figure out what everything is. For example, Diophantine equations can be solved by a fully general constraint-solving system, but they're known to be undecidable in general.

So what can constraint solving get you in Inverse? Well, imagine an inverse to bi. It's not difficult to make one within the current framework, but some information is lost: everything must be completely determined. Think about inverting [ first ] [ second ] bi. Inverting this should get the same result as first2 (which has a hard-coded inverse right now, inverting to 2array). But it won't work.

A way for [ first ] [ second ] bi to work would be using the following steps:
  1. Initialize a logic variable X as unbound
  2. Unify X with the information, "the first element is what's second from the top of the stack (at runtime)". Now it's known that X is a sequence of length at least 1.
  3. Unify X with the information, "the second element is what's on the top of the stack (at runtime)". Now it's know that X is a sequence of length at least two.
  4. From the information we have about X, produce a canonical representation, since the inverted quotation is over: an array of the minimum possible length.

This isn't easy to do in general, but it should be possible, in theory. It'd be extremely cool if it worked out.

Formally, you can think of Inverse as already a reasonable constraint solving system, for a limited problem domain. Given [ f ], and the statement about stacks A and B that f(A) = B, and given B, find a possible value for A. The strategy used right now is mathematically sound, and I hope to write it up some day. But, a more general use of logic variables is possible: explicit logic variables in code. This could be used to make a better-integrated logic language in Factor.

The Harmony Project


The Harmony Project, led by Benjamin C. Pierce, is an attempt to solve the "view-update problem" using a new programming language and type system which is largely invertible. The view-update problem is that we want to convert different storage formats into an abstract representation, manipulate that representation and put it back without duplicating code about the representation. Everything operates on edge-labeled trees.

Within the Harmony framework, it's possible to do all your work in bijections (one-to-one onto functions, similar but not identical to the domain of Inverse right now), but there's extra power included: the function to put the abstract representation back into the original form has access to the original. This adds a huge amount of power, giving the possibility of conditionals and recursion, in limited cases. Also, it gives the power to ignore certain things about the surface structure when looking at the abstract form. (Harmony also has ideas about tree merging, and of course a new type system, but I'm not as interested in that right now.)

So far, only relatively trivial things have been made with Harmony, but the idea looks really useful, though there are two problems: (1) I don't really understand it fully (like constraints) and (2) I have no idea how it can fit together with Inverse as it is right now.

Backtracking

In Mark Tullsen's paper on first-class patterns, there was an interesting idea that Inverse could adopt. Tullsen used monads to sequence the patterns. It's the simplest to use the Maybe monad, and that corresponds to how pattern matching systems normally work. But if the List monad is used instead, then you easily get backtracking. This could be ported to Factor either by using monads or, maybe easier, by using continuations. Years ago, Chris Double implemented amb in Factor using continuations, though the code won't work anymore. The sequencing and backtracking I'm talking about is relevant in things like switch statements, rather than undo itself. I'm not sure if it'd actually be useful in practice.

Garbage collection research ideas

Because the summer's coming up, and I'll be participating in Harvey Mudd's Garbage Collection REU, I've been coming up with a few research ideas. The suggested one is to continue with the work of previous years' REUs and think about simplifiers and collecting certain persistent data structures and weak hashtables, but here are a couple more:
  • Figure out how efficient garbage collection on Non-Uniform Memory Access systems can work. The problem (if it is a problem) is that plain old garbage collection on multiprocessor NUMA systems isn't as fast as it could be, because a chunk of memory allocated for a thread may be far away from where it's used. One way to ensure locality is to give each processor (at least) its own heap, where the heap is guaranteed to be stored in the closest memory. But if data needs to be shared between processors, this can be too limiting. A piece of data can be kept on the RAM closest the processor which made the allocating call, but maybe it'd be beneficial to collect data on which processor is using which data, and dynamically move data around to different places in RAM to put it closest to where it's used. A related issue is maximizing locality when actually performing the tracing in the GC, which I have no ideas about.

  • Run a real benchmark comparing several GC algorithms. Probably the most annoying thing for programming language implementors trying to pick a good GC algorithm is that there's no comprehensive benchmark to refer to. No one really knows which algorithm is the fastest, so there are two strategies remaining: pick the one that sounds the fastest, or do trial and error among just a few. Each paper about a new algorithm reports speed improvements—over significantly older algorithms. It'd be a big project, but I think it's possible to make a good benchmark suite and test how long it takes for these algorithms to run, in terms of absolute throughput and pause length and frequency, given different allocation strategies. If it's possible, it'd be nice to know what kind of GC performs best given a particular memory use pattern.

  • Garbage collector implementation in proof-carrying code. There are a couple invariants that garbage collectors have, that must be preserved. For example, the user can't be exposed to any forwarding pointers, and a new garbage collection can't be started when forwarding pointers exist. The idea of proof-carrying code (an explicit proof, which is type-checked to be accurate, is given with the code) isn't new; it's mostly been used to prove memory consistency safety given untrusted code. But maybe it could be used to prove that a GC implementation is correct.

These ideas are really difficult, but I think they're interesting, and with four other smart people working with me, maybe in a summer we can do something really cool, like this or whatever other idea they come up with.

Ragel-style state machines in Factor

In my Automata and Computability class at Carleton, we've been studying (what else) finite automata, and it got me thinking about regular expressions and their utility in Factor. By regular expression, I mean an expression denoting a regular language: a real, academic regexp. A regular language is one that can be written as a deterministic finite automaton (finite state machine). Hopefully, I'll explain more about this in a future blog post.

Anyway, if you've heard of Ragel, it's basically what I want to do. But the form it'd take is basically the same as PEGs (Chris Double's Pacrat parser), with the one restriction that no recursion is allowed. In return for this restriction, there is no linear space overhead. Basically everything else, as far as I know, could stay the same.

I'm thinking I'll redo the XML parser with this. The SAX-like view will be done with this regular languages parser (since all that's needed is a tokenizer), and then that can be formed into a tree using PEGs (since linear space overhead is acceptable there). Linear space overhead, by the way, is unacceptable for the SAX-like view, since it should be usable for extremely large documents that couldn't easily fit in memory all at once.

(By the way, I know Ragel also allows you to explicitly make state charts, but I won't include this until I see a place where I want to use it.)

Tue, 29 Apr 2008 15:48:00

Slava Pestov: The new HTTP server, part 2

It's been a month and a half since the first part of this series. Why the long delay? I've been busy with other things. I implemented inheritance, various compiler optimizations, and many other things. In the last couple of weeks I've been working on the web framework again, tying up some loose ends and porting more existing web applications over (namely, the pastebin and planet factor).

In this entry I will talk about session management. Session management was one of the first things I implemented in the new framework when I started working on it, but recently I gave the code an overhaul.

Session management


The basic idea behind session management is that while HTTP is a stateless protocol, we can simulate state by sending a token to the client -- either in the form of a hidden element on the page, or a cookie, which the client sends back to the server with a later request. This token is associated with an object on the server and the object holds state between requests.

Another approach for session management is to store state entirely on the client; instead of sending the client a session ID identifying an object on the server, you send the session data itself to the client. Traditionally this approach has only been used for user preferences and such where security is immaterial, but it can even be used for more sensitive data by encrypting it with a private key only known to the server. The client receives an opaque blob of binary data which cannot be inspected or tampered with (unless the public key encryption algorithm being used is compromised).

Currently Factor's session manager does not support client-side sessions, but it will soon, using Doug Coleman's public-key encryption code. Server-side sessions are supported, however.

The session manager uses two main strategies to pass state to the client:
  • For GET and HEAD requests, a cookie is used. The cookie's value is a randomly-generated session ID.
  • For POST requests, the form must define a hidden field with the session ID. The value of the cookie is ignored to thwart cross-site scripting attacks.

The idea is to strike a balance between security and convenience; we don't want to add a session ID to every link and start a new session if the user navigates to the site by directly entering a URL, but on the other hand we don't want potentially destructive POST requests to be accepted unless they were sent by a form generated from within the session itself.

In Factor, a session is simply a hashtable where values can be stored. Keys are known as "session variables" and values can be read and written with the sget and sset words, there's also a schange combinator which applies a quotation is applied to an existing session variable to yield a new value. This all entirely analogous to the get/set/change words for dynamic variables.

Session namespaces are serialized and stored in a database using Doug's db.tuples O/R mapper. I originally supported pluggable "session storage" backends, with database storage and in-memory storage as the two options, however I decided to simplify the code and hardcode database storage. This has the side-effect that you'll need to set up a database to use the session management feature, however SQLite presents a lightweight option which requires no configuration, so I don't think this is a big deal at all.

I will show a small example of a 'counter' web application, much like the counter example for the Seaside framework.

We start off with a vocabulary search path:
USING: math kernel accessors http.server http.server.actions
http.server.sessions http.server.templating.fhtml locals ;
IN: webapps.counter

Now, we define a symbol used to key a session variable:
SYMBOL: count

Next, we define a pair of actions which increment the counter value, using the schange combinator. The display slot of an action contains code to be executed upon a GET request; it is expected to output a response object. In our case, the word outputs an action which applies the quotation to the current counter value; the action outputs a response which redirects back to the main page:
:: <counter-action> ( quot -- action )
<action> [
count quot schange
"" f <standard-redirect>
] >>display ;

The action to decrement the counter is entirely analogous:
: <dec-action> ( -- action )
<action> [ count [ 1- ] schange f ] >>display ;

Note that this word constructs actions, instead of invoking them. This approach is more flexible than the old "furnace" web framework, where actions were mapped directly to word execution, because it allows one to write "higher-order actions" parametrized by values more easily.

Here is the default action; it displays the counter value using a template:
: <counter-action> ( -- action )
<action> [
"resource:extra/webapps/counter/counter.fhtml" <fhtml>
] >>display ;

Finally we put everything together in a dispatcher:
: <counter-app> ( -- responder )
counter-app new-dispatcher
[ 1+ ] <counter-action> "inc" add-responder
[ 1- ] <counter-action> "dec" add-responder
<display-action> "" add-responder
<sessions> ;

We create a dispatcher, add instances of our actions to it, and wrap the whole thing in a session manager.

Now, the template:
<% USING: io math.parser http.server.sessions webapps.counter ; %>

<html>
<body>
<h1><% count sget number>string write %></h1>

<a href="inc">++</a>
<a href="dec">--</a>
</body>
</html>

Finally, once we have all the parts, we can create the counter responder and start the HTTP server:
<counter-app> "test.db" sqlite-db <db-persistence> main-responder set
8888 httpd

Note that here we wrap the counter responder in another layer of indirection, this time for database persistence; while the counter web app doesn't use persistence the session manager does, and we chose to use SQLite since it requires no configuration or external services.

Navigating over to http://localhost:8888/ should now display the counter app, and clicking the increment and decrement links should have an effect on the displayed value. Sessions persist between server restarts and time out after 20 minutes of inactivity by default. Looking at your web browser's cookie manager will show that a factorsessid cookie has been set.

As an aside, the Seaside version uses continuations to maintain state. The Factor version explicitly maintains state. Even though I ported Chris Double's modal web framework over to the new HTTP server, I'm avoiding continuations in favor of explicit state for now. I am building up a form component framework with validation, easy persistence, and user authentication without resorting to continuations, and I plan on building a state-machine model with a page flow DSL, much like jBPM, to handle more complex multi-page flows such as shopping carts. While this will result in more work for me, I believe the benefits include transparent support for load-balancing and fail-over, readable URLs, and ultimately, simpler and more reusable web application code because page flow can be decoupled from logic and expressed in a custom DSL intended for that purpose.

The Seaside version is also somewhat shorter; it is easy to express with idiomatic Seaside (transparent session management, presentation logic mixed in with web app code). I will add better abstractions to make up for some of the difference, and for larger applications there should be no difference in code size; in fact since the scope of Factor's framework is wider than Seaside (it covers persistence, authentication and validation, and soon, versioning of persistent entities) you might even need less code to accomplish the same thing.

Virtual hosting


The other topic I promised to cover last time was virtual hosting. Virtual hosting is done with dispatchers, much like nested directory structure is. You create a virtual host dispatcher with <vhost-dispatcher> and add responders for various virtual hosts using add-responder; the >>default slot can be used to set the default virtual host. The key difference between the new approach and the old HTTP server virtual hosting implementation, which relied on a global hashtable mapping virtual host names to responders, is flexibility; the virtual host dispatcher does not necessarily have to be your top-level responder.

For example, the <boilerplate> responder gives you a way of enforcing a common look and feel across a set of web apps, by adding common headers and footers to every page. While I will describe boilerplate responders and the template system in more detail in a later post, for now here is an example:
<vhost-dispatcher>
<online-store> "store.acme.com" add-responder
<support-site> "support.acme.com" add-responder
<main-site> "acme.com" add-responder
<boilerplate>
"acme-site.xml" >>template
acme-db <db-persistence>
<sessions> main-responder set

Here, all virtual hosts share the same session management, database persistence, and common theme, and the virtual host dispatch only happens after the request filters through the mentioned layers of functionality. This would not be possible with the old HTTP server without duplicating code.

Cookies


Finally I promised to talk about cookies. The session management support is great but sometimes you just want to get and set cookies directly. This can be done by reading the cookies slot of the request object, and writing the cookies slot of the response object. The slot contains a sequence of cookie objects, which are parsed and unparsed from their HTTP representation for you. A cookie object contains a series of slots, such as name, value, expiration date (as a Factor timestamp object), max-age (as a Factor duration object), path, and host. While the expiration date is deprecated as of HTTP/1.1, most sites still use it in favor of max-age because older browsers don't support max-age. Factor's HTTP server sets the date header on each response so that expiration dates can work correctly.

Here is an example of using the HTTP client (which shares the cookie code with the server) to look at Google's ridiculously long-lived cookies:
( scratchpad ) "http://www.google.com" http-get-stream drop cookies>> first describe
cookie instance
"delegate" f
"name" "pref"
"value" "ID=c0f4c074cd87502e:TM=1209466656:LM=1209466656:S=_6gGEKtuTgP..."
"path" "/"
"domain" ".google.com"
"expires" T{ timestamp f 2010 4 29 10 57 36 ~duration~ }
"max-age" f
"http-only" f

Tue, 29 Apr 2008 03:04:00

Doug Coleman: Word renaming (part 2)

Here are the word names that have changed:

  • find* -> find-from
  • find-last* -> find-last-from
  • index* -> index-from
  • last-index* -> last-index-from
  • subset -> filter
New shorthand words:
  • 1 tail -> rest
  • 1 tail-slice -> rest-slice
  • swap compose -> prepose
Changes to existing word behavior:
  • reverse stack effect of assoc-diff, diff
  • before? after? before=? after=? are now generic
  • min, max can compare more objects than before, such as timestamps
  • between? can compare more objects
  • <=> returns symbols
There are several motivations at work here. One is that words named foo* are a variant of foo, but otherwise the * is no help to what the word actually does differently. We're trying to move away from word names with stars to something more meaningful.

Code with common patterns that come up a lot, like 1 tail and swap compose, is more clearly understood if these patterns are given a single name.

Factor's subset is not equivalent to the mathematical definition of subset, so it was renamed to filter to avoid confusion. Along these same lines, diff and assoc-diff are now the more mathematically intuitive; you can think of diff like set subtraction now, seq1 seq2 diff is like seq1 - seq2.

Finally, the "UFO operator" <=> now returns symbols +lt+ +eq+ +gt+ instead of negative, zero, and positive numbers. The before? and after? words can compare anything that defines a method on this operator. Since between?, min, and max
are defined in terms of these comparison words, they also work on more objects.

Please let me know if you have any more suggestions for things words that have awkward argument orders, imprecise names, or if you can suggest alternate names for words with stars in their names.

Mon, 28 Apr 2008 10:49:00

Slava Pestov: Performance improvements

Over the last three days I spent some time improving Factor's compiler. I made the following improvements:

  • Partial dispatch is performed on integer arithmetic operations. Previously, Factor's compiler would convert generic arithmetic to machine arithmetic when it knew the exact types involved; so if you had two floats on the stack, + would become float+, and if you had two fixnums it would become fixnum+ or fixnum+fast, depending on whether interval inference determined if the overflow check was needed or not. While this worked well in many cases there were a lot of instances where the compiler could only infer you were dealing with integers, but not fixnums in particular; either because interval inference was not smart enough, or because the values really could be out of bounds for a fixnum. Now, if it knows that both inputs are integers, it compiles a call to a special word which still performs a dispatch, but with only two possibilities. This is a win, because a conditional is faster than a jump table as used in the generic +. It is an even bigger win if one of the inputs is known to be a fixnum, because then the two jump table dispatches are replaced by a single conditional.
  • Improved overflow check elimination. First, there is an enabling optimization. Suppose we have a positive integer on the stack. Then, the following two are equivalent:
    4096 mod
    4095 bitand

    The optimizer now recognizes the case where mod is applied with a power of two to a positive integer, and converts it to a bitwise and. For the rem word an even more general optimization is possible; we only have to assume the first input is an integer and the second is a power of two.

    While on a modern CPU, this is not a big in itself for mod (it is for rem, which is more expensive), it does enable the following optimization. Note that the following two expressions are equivalent:
    4095 bitand
    16777215 bitand 4095 bitand

    That is, if you mask off the first n bits, then the first m bits, and m<n, you get the same result as masking off the first m bits. This might seem trivial and useless and first, but then you realize that a truncating conversion of a positive bignum to a fixnum is simply masking off bits! So the following two are equivalent:
    4095 bitand
    >fixnum 4095 bitand

    But of course, both inputs to the latter bitand are fixnums now, and can be converted to:
    4095 bitand
    >fixnum 4095 fixnum-bitand

    It gets even better. Consider the following piece of code from project-euler.150:
    : next ( t -- new-t s )
    615949 * 797807 + 20 2^ rem dup 19 2^ - ;

    Constant folding gives us:
    : next ( t -- new-t s )
    615949 * 797807 + 1048576 rem dup 524288 - ;

    The above optimization, together with an overflow removal on the -, gives us:
    : next ( t -- new-t s )
    15949 * 797807 + >fixnum 1048575 fixnum-bitand dup 524288 fixnum-fast ;

    There is an existing optimization takes advantage of these identities:
    * >fixnum == [ >fixnum ] bi@ fixnum*fast
    + >fixnum == [ >fixnum ] bi@ fixnum+fast

    By applying the above, the compiler converts this code to:
    : next ( t -- new-t s )
    >fixnum 15949 fixnum*fast 797807 fixnum+fast 1048575 fixnum-bitand dup 524288 fixnum-fast ;

    All generic arithmetic and overflow checks have been removed, because of a single rem in the middle of the calculation!

    In project-euler.150, this word was actually used inside a loop, where it was iteratively applied to a starting generator value. With the new modular arithmetic optimization, Factor's existing interval inference code managed to infer the result of the word is always in the interval (-2^20,2^20), and since the initial value is 0, even the call to >fixnum was optimized away.
  • I improved Factor's type inference algorithm to better deal with local recursive code blocks. While type inference worked okay with loops, it didn't fare so well with binary recursive functions because it wasn't able to obtain any information about return values of the recursive calls. Everybody's favorite binary recursive benchmark, fibonacci, was one example of this situation. Solving this required a bit of thought.

    Previously, type inference for recursive functions was done in a pretty ad-hoc manner. Factor's type inference is only used for optimization so it is okay if it is conservative. It worked pretty well, however today I decided to add better support for recursion, and I also found a bug where it would infer invalid types, so I decided to redo it a little.

    Type inference of recursive functions is now an iterative process, where you begin by assuming the recursive calls take the top type as inputs and the bottom type as outputs, then you infer the type of the body under these assumptions, then apply the resulting types to the recursive calls, then infer again, and so on, until a fixed point is reached. There are smarter ways of doing this with other type systems however in Factor, the type functions of some words are pretty complex. For example, the output type of + depends on not only the types of its inputs, but the interval ranges they lie in, so I'm not sure if there's any other solution.

    With this out of the way, there is one major remaining limitation in Factor's type inference, and that is it only works within words. It does not attempt to infer types across word boundaries. In practice this means many optimizations only kick in if a lot of words are inlined, which is not practical in some cases due to code size explosion. My next project in this area is to cache type signatures for words and use this information when inferring types of callers.
  • I improved performance of inline object allocation. In the VM, the structure holding the allocation pointer is now stored directly in a global variable, instead of a pointer to it being stored in the global variable. This is one less indirection for inline allocators. Another improvement is that the heap exhaustion check is done in the allocator itself, so a call into the VM is avoided if the heap is not full. This saves a subroutine call in the common case of course, but also some saving and restoring of registers.

These changes have let to some performance improvements on the two benchmarks I was working with. My computer here is a MacBook Pro with a 2.4 GHz Intel Core Duo 2.

The project-euler.150 benchmark saw its runtime decrease from 87 seconds to 22 seconds.

The benchmark.recursive benchmark, which is a Factor implementation of the recursive benchmark from the Computer Language Shootout, saw its runtime decrease from 27 seconds to 9 seconds.

For comparison, SBCL runs the recursive benchmark in 3.5 seconds, and Java runs it in 1.6 seconds.

Using the java -Xint result of 22 seconds, I guesstimated from the results for all languages on the shootout that Factor at around the performance of the Erlang HIPE JIT, and slightly faster than the Python Psyco JIT and GForth. By anybody's standard, this is anywhere between "not very fast" and "bloody slow", but its slowly improving.

Now I'll end with a rant about the language shootout.

The only so-called "dynamic" language implementations which come close to the performance of Java and C on this benchmark are SBCL, Ikarus Scheme, and Chicken Scheme. However, all these benchmarks are actually static programs in disguise, peppered with type declarations and even unsafe low-level features. The Ikarus and Chicken versions even implement the same functions twice, once for integer inputs and using integer arithmetic primitives, and another time for float inputs and using float arithmetic primitives. Unless their goal is really to promote their languages as nothing more than C with s-expression syntax, this is disappointing and dishonest.

The Haskell benchmarks suffer from this also. Many benchmarks seem to use malloc and pointer arithmetic, and exclamation points (strictness annotations) abound. The n-body benchmark contains this gem:
planets :: Ptr Double
planets = unsafePerformIO $ mallocBytes (7 * nbodies * 8) -- sizeOf double = 8

I would expect better from the Haskell people than hardcoding the size of a double to do pointer arithmetic on manually-managed memory!

Right now I have no idea how idiomatic Haskell performs in practice; from looking at these benchmarks, it is clear to me that if one writes C in Haskell, pretty decent performance is possible, but that's not saying much. You can write C in any language.

The runtime of the Factor recursive benchmark further improves by two-fold if I manually replace generic arithmetic with unsafe machine arithmetic, however I'm not interested in writing such code. I don't want to expose low-level primitives to the user, document their use as necessary for performance critical code, and declare that my job is done.

Sat, 19 Apr 2008 04:56:00

Slava Pestov: My top 8 shell commands

I wrote a short Factor script to check my bash history and tally up the most frequently occurring commands.

    git
./factor
bf
cdf
ls
cd
fc
push

There are some aliases here:
  • cdf is aliased to cd ~/factor
  • bf is aliased to ./factor -i=boot.x86.32.image
  • push is aliased to git push origin master
  • fc is alised to ssh factorcode.org

It seems that the only thing I use my computer for is running Factor!

Sat, 19 Apr 2008 03:52:00

Doug Coleman: Word renaming

Several words have been renamed and moved around to make Factor more consistent:

  • new -> new-sequence
  • construct-empty -> new
  • construct-boa -> boa
  • diff -> assoc-diff
  • union -> assoc-union
  • intersect -> assoc-intersect
  • seq-diff -> diff
  • seq-intersect -> intersect
To make things symmetrical, a new word union operates on sequences.

Somehow, seq-diff and seq-intersect were implemented as O(n^2) algorithms. Now, they use hashtables and are O(n).

Lastly, a new vocabulary named ``sets'' contains the set theoretic words, along with a new word unique that converts a sequence to a hash table whose keys and values are the same. An efficient union and intersect are implemented in terms of this word.

Mon, 14 Apr 2008 12:25:00

Doug Coleman: Adding a new primitive

I added two primitives to the Factor VM to allow setting and unsetting of environment variables. It's not that hard to do, but you have to edit several C files in the VM and a couple .factor files in the core.  Really they should not be primitives, so eventually they will be moved into the core.

The primitives that I added are defined as follows:

IN: system
PRIMITIVE: set-os-env ( value key -- )
PRIMITIVE: unset-os-env ( key -- )

Adding a primitive to vm/

Since Factor's datatypes are not the same as C's datatypes, and because of the garbage collector, there are C functions for accessing and manipulating Factor objects. The data conversion functions are not documented yet, so here's a sampling of a few of them:
  • unbox_u16_string() - pop a Factor string off the datastack and return it as a F_CHAR*
  • from_u16_string() - convert a C string to a Factor object
  • REGISTER_C_STRING() - register a C string with Factor's garbage collector
  • UNREGISTER_C_STRING() - unregister a registered C string
  • dpush() - push an object onto the datastack
  • dpop() - pop an object off of the datastack
Registering a C string with the garbage collector is required when VM code calls code that may trigger a garbage collection (gc).  Any call to Factor from the VM might trigger a gc, and if that happened the object could be moved, thus invalidating your C pointer.  When a pointer is unregistered, it's popped from a gc stack with the corrected pointer value.

Here is the call to set-os-env:
DEFINE_PRIMITIVE(set_os_env)
{
F_CHAR *key = unbox_u16_string();
REGISTER_C_STRING(key);
F_CHAR *value = unbox_u16_string();
UNREGISTER_C_STRING(key);
if(!SetEnvironmentVariable(key, value))
general_error(ERROR_IO, tag_object(get_error_message()), F, NULL);
}
The function is defined with a macro DEFINE_PRIMITIVE that takes only the function name.  A corresponding DECLARE_PRIMITIVE goes in run.h as your function declaration.  Not all primitives use these C preprocessor macros, for instance bignums don't because it doesn't improve the performance.  Parameters to your primitive are popped or unboxed off the data stack, so a primitive's declaration expands to:
F_FASTCALL primitive_set_os_env_impl(void);

F_FASTCALL void primitive_set_os_env(CELL word, F_STACK_FRAME *callstack_top) {
save_callstack_top(callstack_top);
primitive_set_os_env_impl();
}
INLINE void primitive_set_os_env_impl(void)
F_FASTCALL is a wrapper around FASTCALL, which on x86 will pass the first two arguments in registers as an optimization.  Note that while it declares that it takes no arguments (void), most primitives will do something to the data stack.

Since unbox_u16_string() allocates memory for the Factor object, it could trigger a gc, so it's registered as a string.  You can also register values using REGISTER_ROOT for cells, REGISTER_BIGNUM for bignums, and REGISTER_UNTAGGED for arrays, words, and other Factor object pointers for which the type is known.  The key string can immediately be unregistered after calling unbox on the next stack value since the rest of the function will not cause a gc.  If the win32 call fails, there's a function general_error() that throws an exception.  In this case, it's an ERROR_IO that calls a helper function to return the Windows error message.

Now that the function is written, you have to add it to the list of primitives in primitives.c.  The important thing is that this list remains in the same order as the list in core/ which you will edit in the next section.  Also, add a prototype to the run.h file.

Adding a primitive to core/

Everything in Factor compiles down to primitives.  Because they are by definition "primitive", the compiler cannot infer the stack effect and argument types. To make a primitive's stack effect "known", edit core/inference/known-words/known-words.factor:
\ set-os-env { string string } { } <effect> set-primitive-effect
The next step is to put your word in the file core/bootstrap/primitives.factor in the same order as in vm/primitives.c.

Sometime in the future there might be a PRIMITIVE: word that will reduce the number of different places to edit to add a primitive. If it used Factor's FFI, you could add a new primitive without even having to bootstrap again.

Sun, 13 Apr 2008 14:20:00

Slava Pestov: Multi-touch gestures in the Factor UI

Recently I acquired a MacBook Pro. The new models come with a multi-touch keypad and I've found it extremely useful in Safari to be able to go back, go forward and zoom with the mouse. However, I missed the ability to do this in other applications, especially the Factor UI. The Factor UI already supports vertical and horizontal scrolling gestures, and I wanted to be able to use the other gestures as well.

Apparently, Apple's official stance is that no multitouch API will be made public until 10.6. I was slightly discouraged by this but some more Googling turned up a blog post by Elliott Harris detailing the undocumented API for receiving these events.

While normally I shy away from relying on undocumented functionality in this case the API is dead-simple and it is almost an oversight on Apple's part not to document it. And if they break it, well, it will be easy to update Factor too.

Here are the changes I had to make. First, I added some new UI gestures to extra/ui/gestures/gestures.factor; these are cross-platform and theoretically the Windows and X11 UI backends could produce them too, perhaps as a result of button presses on those "Internet" keyboards:

TUPLE: left-action ;        C: <left-action> left-action
TUPLE: right-action ; C: <right-action> right-action
TUPLE: up-action ; C: <up-action> up-action
TUPLE: down-action ; C: <down-action> down-action

TUPLE: zoom-in-action ; C: <zoom-in-action> zoom-in-action
TUPLE: zoom-out-action ; C: <zoom-out-action> zoom-out-action

Next, I edited extra/ui/cocoa/views/views.factor with some new methods for handling the new multitouch gestures, and translating them to Factor gestures:
{ "magnifyWithEvent:" "void" { "id" "SEL" "id" }
[
nip
dup -> deltaZ sgn {
{ 1 [ T{ zoom-in-action } send-action$ ] }
{ -1 [ T{ zoom-out-action } send-action$ ] }
{ 0 [ 2drop ] }
} case
]
}

{ "swipeWithEvent:" "void" { "id" "SEL" "id" }
[
nip
dup -> deltaX sgn {
{ 1 [ T{ left-action } send-action$ ] }
{ -1 [ T{ right-action } send-action$ ] }
{ 0
[
dup -> deltaY sgn {
{ 1 [ T{ up-action } send-action$ ] }
{ -1 [ T{ down-action } send-action$ ] }
{ 0 [ 2drop ] }
} case
]
}
} case
]
}

Note that I'm throwing away useful information for the sake of simplicity; the zoom gesture gives you a precise zoom amount, not just +1/-1. The swipe gestures seem to be completely discrete though. I haven't implemented rotation gestures yet because I haven't figured out what to use them for.

With the above code written, the UI now sends multi-touch gestures to gadgets however no gadgets used them yet. I fired up the gesture logger, "gesture-logger" run, and tested that the new actions actually get sent. They were, except the first time I messed up the code and got the signs the wrong way round.

With that fixed, I could proceed to add gesture handlers to the various UI tools. I edited various source files:
browser-gadget "multi-touch" f {
{ T{ left-action } com-back }
{ T{ right-action } com-forward }
} define-command-map

inspector-gadget "multi-touch" f {
{ T{ left-action } &back }
} define-command-map

workspace "multi-touch" f {
{ T{ zoom-out-action } com-listener }
{ T{ up-action } refresh-all }
} define-command-map

I think its pretty clear from the above what the default gesture assignments are:
  • In the browser tool, swipe left/right navigate the help history.
  • In the inspector, swipe left goes back.
  • In any tool, a pinch (zoom out) closes the current tool, leaving only the listener visible. A swipe up reloads any changed source files (I'm not sure if I like this yet).

15 minutes of Google, 20 minutes of hacking, and now Factor supports the fanciest feature of Apple's latest hardware.

Fri, 11 Apr 2008 23:34:00

Slava Pestov: Improvements to io.monitors; faster refresh-all

Factor's io.monitors library previously supported Mac OS X, Windows and Linux. Now it also supports BSD, but in a much more restricted fashion than the other platforms. Basically you cannot monitor directories, just individual files. This is because kqueue() only provides very limited functionality in this regard. However, having something is better than nothing, and the functionality provided on BSDs is still useful for monitoring log files and such.

On Linux, inotify doesn't directly support monitoring recursive directory hierarchies so Factor's monitors didn't support recursive monitoring, but a mailing list post by Robert Love discusses how to solve this issue in user-space. I used his solution to implement recursive monitors on Linux.

Another oddity relating to inotify is that if you add the same directory twice to the same inotify, you get the same watch ID both times, and events are only reported once. This means that the previous implementation where there was one global inotify instance shared by all monitors wasn't really as general as one would hope, because you couldn't run two programs that monitor overlapping portions of the file system. I thought of several possible fixes but in the end just changed the monitors API to accommodate this case. All monitor operations must now be wrapped in a with-monitors combinator. On Linux, it creates an inotify instance and stores it in a dynamically-scoped variable, so that subsequent calls to <monitor use this inotify. Independent inotifies in different threads don't interact at all. On Mac OS X, BSD and WIndows, with-monitors just calls the quotation without doing any special setup.

Another issue I fixed was that on Mac OS X, monitors would only work when used from the UI because no run loop was running otherwise. I made a run loop run all of the time and this allows monitors to work in the terminal listener.

Now that monitors are working better, I was able to use them to make refresh-all. This word finds all changed source files in the vocabulary roots and reloads them. It does this by comparing cached CRC32 checksums with the actual checksum of the file. Previously it would also compare modification times, but I took that code out because filesystem meta-data queries got moved out of the VM and into the native I/O code, which isn't available during bootstrap. A side-effect of this is that refresh-all became much slower, because it had to read all files. Using monitors I was able to make this faster than it has ever been. A thread waiting on a monitor is started on startup. Then, the source tree only has to be checksummed in its entirety the first time refresh-all is used in a session. Subsequently, only files for which the monitor reported changes have to be scanned. So refresh-all runs instantly if there are no changes, and so on.

Fri, 11 Apr 2008 13:19:00

Slava Pestov: The golden rule of writing comments

This is something I've wanted to say for a while, and I think many (most?) programmers don't realize it:

Never comment out code.

Comments are for natural-language descriptions of code, or pseudocode maybe.

If you comment out some code, then the parser isn't checking the syntax, the compiler isn't checking semantics, and the unit tests are not unit testing it.

So the code may as well not work. Why have non-working code in your program, especially if it's not called anywhere?

But perhaps the code is there so that you can see the what the program used to do. In that case, just fire up your favorite graphical GIT/SVN/P4/whatever frontend and check out an older revision.

Tue, 8 Apr 2008 20:36:00

Slava Pestov: Multi-methods and hooks

For a while now Factor has had 'hooks', which are generic words dispatching on a dynamically scoped variable. Hooks can be used with variables which are essentially global: the current OS, current CPU, UI backend, etc -- or variables which are truly context-specific, such as the current database connection.

I added support for hooks to the extra/multi-methods library, which is going in the core soon. While doing this I was able to significantly generalize the concept. Suppose we want to define a piece of functionality which depends on both the operating system and processor type.

We can begin with defining an ordinary generic word:

GENERIC: cache-size ( -- l1 l2 )

Notice that it is defined as taking no inputs from the stack.

I don't really know the APIs involved here, but suppose that Linux gives us a way to get this info that works across all CPUs. So we define a method specializing on the os dynamic variable (which is really treated like global):
METHOD: cache-size { { os linux } } ... read something from /proc ... ;

Now suppose on Mac OS X we use different APIs per CPU:
METHOD: cache-size { { os macosx } { cpu ppc } } ... ;

METHOD: cache-size { { os macosx } { cpu x86.32 } } ... ;

METHOD: cache-size { { os macosx } { cpu x86.64 } } ... ;

But perhaps on ARM CPUs, we just use an instruction to read the cache size without any OS-specific calls at all:
METHOD: cache-size { { cpu arm } } ... ;

Now in this case, you have an issue where if you're on Linux and ARM, the method that ends up being called depends on the order in which they were defined. If you wanted to explicitly resolve this ambiguity, you would define a new method on { { os linux } { cpu arm } }; because it is more specific than the other two, it is always called first.

The powerful thing about this new implementation of hooks is that not only can you dispatch on multiple variables, but you can add methods to any old generic which dispatches on a variable and the original designer of the generic does not have to explicitly take this into account.

For example, Factor's compiler currently has a large number of hooks dispatching on the CPU type (as an aside, Phil Dawes wrote an excellent introduction to the Factor compiler recently). If those hooks need to be further refined by OS, as is often the case with FFI-related components, the method implementation on the hook needs to perform its own dispatch; this is the "double dispatch" pattern and design patterns are something to be avoided if one wants to write quality code. When multi-methods go in the core, the compiler will simply define a series of generic words taking no inputs from the stack, and each method will specialize on the CPU, and maybe an OS too.

Another new capability is dispatching off stack values and variables in the same method. Among other things, this will be useful in eliminating a case of double dispatch in the core right now, where the <client> word for opening a client socket has to dispatch off the address type on the stack, and then call another hook which dispatches on the I/O backend stored in a variable. This can be combined into a single generic word where some methods dispatch on stack values, and others dispatch on the I/O backend variable.

The other nice thing about this is that the multi-method GENERIC: word unifies and generalizes four words in the core, GENERIC:, GENERIC# for dispatching on a value further down on the stack, HOOK: for dispatching on a variable, and MATH: which performs double dispatch on numbers only.

One of my goals for Factor 1.0 is to get the object system "right", and with the new-style slot accessors, inheritance, and singletons, we're almost there. All that remains to be done is to merge the multi-methods code. The code is still not quite ready to go in the core, though. The only feature that single dispatch has and multiple-dispatch lacks is call-next-method, which is easy to implement. A bigger hurdle to clear is performance; right now multi-methods are implemented in a naive way, where the dispatch time is O(mn) with m methods and n stack positions and variables to dispatch on. This can be improved significantly and I will find time reading the literature on optimizing method dispatch over the next few weeks.

Tue, 8 Apr 2008 20:14:00

Phil Dawes: Digging into Factor’s compiler

I wrote this post partly as an advocacy piece and partly to put down a bunch of things I’ve learnt about the factor compiler over the last few weeks. I should point out that I’m no expert in this area and so there are probably inaccuracies and omissions - hopefully Slava or one of the factor gurus will point them out. With that in mind, check this out!:

Factor has an optimising compiler which generates machine code as you type code into the REPL. If you have gdb on your system you can see this in action by firing up factor and using the tools.disassembler vocab:

( scratchpad ) : hello "hello" print ;           ! defines the word hello
( scratchpad ) USING: tools.disassembler ;
( scratchpad ) \ hello disassemble

Using host libthread_db library “/lib/tls/i686/cmov/libthread_db.so.1″.
[Thread debugging using libthread_db enabled]
[New Thread -1213614400 (LWP 25688)]
0xffffe410 in __kernel_vsyscall ()
Dump of assembler code from 0xb0f694b0 to 0xb0f694c7:
0xb0f694b0: mov    $0xb0f694b0,%ecx
0xb0f694b5: mov    0xb0f694e4,%ebx
0xb0f694bb: mov    %ebx,0×4(%esi)
0xb0f694be: add    $0×4,%esi
0xb0f694c1: jmp    0xb123e7d0
…
End of assembler dump.

This is very cool in itself, but for me the real beauty of the factor compiler is the very modular design, composed of small pieces that you can pull apart and tinker with in isolation. This makes the compiler accessible to people both as a learning tool and for those wanting to generate highly optimized code for tight loops.

The three stages of the compiler are

  1. Parsing the code and generating a ‘dataflow’ abstract syntax tree. (Also called ‘IR’ - intermediate representation)
  2. Optimizing the dataflow tree
  3. Generating machine code from the dataflow tree

I’ll dig into each of these steps in order:

Stage 1: Parsing factor code to dataflow IR

The first step parses factor code into a dataflow datastructure. You can run and inspect the results of this yourself using the dataflow word:

USE: inference
[ "hello" print ] dataflow pprint

=> T{
    #push
    T{
        node
        f
        f
        f
        V{ T{ value f “hello” 673850 f } }
        f
        f
        f
        f
        f
        f
        T{
            #call
            T{
                node
                f
                print
                V{ T{ value f “hello” 673850 f } }
                V{ }
                f
                f
                f
                f
                f
                f
                T{
                    #return
                    T{ node f f V{ } f f f f f f f f f }
                }
                f
            }
        }
        f
    }
}

Obviously inspecting this datastructure manually is pretty cumbersome, so fortunately there’s some dataflow inspection functionality in the ‘optimizer.debugger’ vocab. The dataflow>quot word renders the dataflow structure back into quotations (code blocks) that you can print and inspect. I use it here to define some words for dataflow pretty-printing:

USE: optimizer.debugger
: print-dataflow f dataflow>quot pprint nl ;
: print-annotated-dataflow t dataflow>quot pprint nl ;

So now we can turn quotations into dataflow graphs and back again:

[ "hello" print ] dataflow print-dataflow
=> [ “hello” print ]

(N.B. there are already words in the optimizer.debugger vocab for displaying optimized dataflows, but for this post I wanted to be able to print dataflows prior to optimisation)

This also works for pre-defined words using ‘word-dataflow’ :

: print-hello "hello" print ;
USE: generator
\ print-hello word-dataflow print-dataflow
=> [ “hello” print ]

In most cases the output quotation will be the same as the input quotation, however there are a couple of expansions that happen at this stage. The first is that words marked ‘inline’ are inlined directly into the dataflow:

: inlinedword "this" "is" "an" "inlined" "word" ; inline
[ inlinedword ] dataflow print-dataflow
=> [ “this” “is” “an” “inlined” “word” ]

Also any compiler-transforms (macros) are evaluated at this stage.

USE shuffle
[ 1 2 2 ndup ] dataflow print-dataflow      ! ndup is a macro
=> [ 1 2 2 drop 2 drop >r dup r> swap 2 drop >r dup r> swap ]

Stage 2: Dataflow Optimisation

Here’s where the fun starts. You can get a feel for how this stage works by looking at ‘optimizer.factor’. Here it is in its entirety:

! Copyright (C) 2006, 2008 Slava Pestov.
! See http://factorcode.org/license.txt for BSD license.
USING: kernel namespaces optimizer.backend optimizer.def-use
optimizer.known-words optimizer.math optimizer.control
optimizer.inlining inference.class ;
IN: optimizer

: optimize-1 ( node -- newnode ? )
    [
        H{ } clone class-substitutions set
        H{ } clone literal-substitutions set
        H{ } clone value-substitutions set
        dup compute-def-use
        kill-values
        dup detect-loops
        dup infer-classes
        optimizer-changed off
        optimize-nodes
        optimizer-changed get
    ] with-scope ;

: optimize ( node -- newnode )
    optimize-1 [ optimize ] when ;

‘optimize’ iteratively calls ‘optimize-1′ until nothing changes in the output graph - i.e. that it has reached a fixed point and no more optimizations can be performed. If you dig into the words used by optimize-1 (try executing them individually and inspecting the dataflow result) you’ll find that optimize-1 performs a number of inferences and optimizations:

  • It tracks the types (classes) of stack elements created within the code block
  • It inlines specific generic word implementations (methods) when it can deduce the class instance on the stack
  • It prunes unused literals and flushable words. (this is actually more useful than it sounds, since other optimisations can generate unused code)
  • It performs branch analysis, marking tail calls in loops and pruning branches that can’t be executed
  • It evaluates ‘foldable’ words at compile time if the values of arguments are known
  • It executes any custom inference code attached to words, allowing words to evaluate their results at compile time if inputs are known

Examples:

evaluating foldable words at compile time

‘+’ is a foldable word (see help for ‘+’), so the optimizer evaluates it at compile time if the values of both arguments are known. Here’s the dataflow before and after optimization:

[ 2 3 + ] dataflow print-dataflow            ! before optimization
=> [ 2 3 + ]

[ 2 3 + ] dataflow optimize print-dataflow   ! after optimization
=> [ 5 ]

type inference and inlining generic word implementations

To illustrate this we first create two tuples (classes) with constructors, and a generic word

TUPLE: classa ;
C: <classa> classa
TUPLE: classb ;
C: <classb> classb 

GENERIC: dosomething ( obj -- val )

Now we create an implementation of the generic word specialised for each class:

M: classa dosomething drop "something for class a" ;
M: classb dosomething drop "something for class b" ;

Finally, some code which calls ‘dosomething’ with an instance of ‘classa’ on the stack, before and after optimization:

[ <classa> dosomething ] dataflow print-dataflow            ! before optimization
=> [  classa drop  classa 2 <tuple -boa> dosomething ]

[ <classa> dosomething ] dataflow optimize print-dataflow    ! after optimization
=> [  classa 2 <tuple -boa> drop “something for class a” ]

It’s a little messy because of the inlined tuple creation, but you can see that prior to optimization ‘dosomething’ is a word call in the dataflow, and afterwards the optimizer has inlined the implementation of ‘dosomething’ specialized on ‘classa’. (if you look you can also see an example of pruning literals here, as the first dataflow has resulted in a superflous ‘\ classa drop’).

This is easier to see if I cheat a bit and use factor’s ‘declare’ word, which declares that elements on the top of the stack are instances of specific classes. So this quotation assumes that top stack element before it is called is of type ‘classa’:

[ { classa } declare dosomething ] dataflow optimize print-dataflow
=> [ drop “something for class a” ]

N.B. you wouldn’t normally use ‘declare’ in user code, but it could be really handy for optimizing performance sensitive tight loops where the results of an external word call are known to the programmer but not the compiler.

conditional folding

The compiler optimizes out conditional branches when it can deduce the outcome of the conditional at compile time:

[ 1 0 =  [ "do if true" ] [ "do if false" ] if ] dataflow optimize print-dataflow
=> [ “do if false” ]

1 isn’t equal to 0 so it optimizes this whole block into the contents of the false quotation.
This is a simple example, but it turns out to be really cool in performance sensitive code (e.g. tight loops) because you can use a generic library function whose behaviour depends on a conditional, specialize it with a hardcoded ‘f’ and the compiler will optimize the conditional branch right out of the resulting code. You get the elegance of the generic combinator with the speed of a hand coded loop.

Stage 3: Machine Code Generation

Code generation is implemented by the ‘generate’ word. This iterates through the nodes calling ‘generate-node’ on each.

: generate-nodes ( node -- )
    [ node@ generate-node ] iterate-nodes end-basic-block ;

: generate ( node word label -- )
    [
        init-generate-nodes
        [ generate-nodes ] with-node-iterator
    ] with-generator ;

‘generate-node’ is a generic word with specialized implementations for each type of dataflow node.

As described in this post from Slava’s excellent Factor blog there are a number of dataflow node types, the important ones being:

* #push - push literals on the data stack
* #shuffle - permute the elements of the data or call stack
* #call - call a word
* #label - an inlined recursive block (loop, etc)
* #if - conditional with two child nodes
* #dispatch - jump table with multiple nodes; jumps to the node indexed by a number on the data stack

The generate-node implementations invoke lower level words in the ‘architecture’ vocabulary, which in turn are generic words that write out small pieces of machine code specialized for each CPU architecture.

The machine code generation code is particularly cool and easy to follow because factor has an assembler DSL for each cpu architecture it supports. The assembler words match the commands and registers of the target cpu architecture and evaluate to their corresponding machine code.

You can even try this out in the REPL using ‘make’ to collect the results into an array. I’m on x86 so I load the cpu.x86.assembler vocabulary:

USE: cpu.x86.assembler

[ EAX 35 MOV ] { } make .   ! postfix assembler evaluates to machine code!

=> { 184 35 0 0 0 }

The assembler DSLs make code generation easy to follow because you can see the assembler in the generation code and then check it against the disassembled machine code using the ‘disassemble’ word we used at the start of the post. When following the code it helps to know which registers are used for what purpose. I found this information in assembler files in the factor VM source - I’m an x86 so for me the declares are in the ‘cpu-x86.32.S’ file:

#define ARG0 %eax
#define ARG1 %edx
#define XT_REG %ecx
#define STACK_REG %esp
#define DS_REG %esi
#define RETURN_REG %eax

#define CELL_SIZE 4

#define PUSH_NONVOLATILE 
	push %ebx ; 
	push %ebp

#define POP_NONVOLATILE 
	pop %ebp ; 
	pop %ebx

register CELL ds asm("esi");
register CELL rs asm("edi");

So lets check this against some generated code for a really basic word that just pops the number 42 on the stack:

: myfunc 42 ;
 myfunc disassemble
=>
Using host libthread_db library “/lib/tls/i686/cmov/libthread_db.so.1″.
[Thread debugging using libthread_db enabled]
[New Thread -1213696320 (LWP 32499)]
0xffffe410 in __kernel_vsyscall ()
Dump of assembler code from 0xb136b230 to 0xb136b247:
0xb136b230: mov    $0xb136b230,%ecx
0xb136b235: mov    $0×150,%ebx
0xb136b23a: mov    %ebx,0×4(%esi)
0xb136b23d: add    $0×4,%esi
0xb136b240: ret 

The first assembler line puts the address of the word into the XT_REG, which is %ecx. For some reason the start of each function puts it’s address into this register - not quite sure why.
The second line puts the number 42 into the ebx register. Note that factor uses the first 3 bits of a value (’cell’) to store its type (called a tag - see layouts.h). In this case it’s a fixnum which is 000. 42 shifted left 3 bits is 336, which in hex is 0×150.
The third line puts the number onto the stack, and the forth updates the stack pointer to point to the new top of the stack.

code generation optimizations

Factor has another couple of tricks up its sleeve during the code generation stages:

optimizing shuffle words

The first is that stack shuffle words (e.g. dup, swap, tuck etc..) don’t get translated into machine code. Instead the compiler has a compile time ‘phantom stack’ which records the positions of items in the stack. When it generates the machine code values are accessed from the stack out of order (the runtime stack is after all a random access piece of memory). This makes stack shuffling words and the retain stack effectively ‘free’ within a code block. A #merge node in the dataflow signifies a code boundary (usually before a subroutine call) which causes the compiler to output instructions which synchronise the physical runtime stack with its phantom stack.

word intrinsics

The second trick is word ‘intrinsics’. Word intrinsics are essentially blocks of open-coded assembler that are output in place of a subroutine call. They are associated with the word via a ‘word-property’, which is a nifty feature of factor that allows meta information to be attached to each word. For example the ‘fixnum+fast’ word has intrinsics which you can see using ‘word-prop’:

 fixnum+fast "intrinsics" word-prop pprint
=>
{
    {
        [ “x” operand “y” operand ADD ]
        H{
            { +output+ { “x” } }
            { +input+ { { f “x” } { [ small-tagged? ] “y” } } }
        }
    }
    {
        [ “x” operand “y” operand ADD ]
        H{
            { +output+ { “x” } }
            { +input+ { { f “x” } { f “y” } } }
        }
    }
}

This tells the compiler to inline the x86 ADD instruction instead of making a subroutine call to the implementation of fixnum+fast. You can add assembler intrinsics to existing words with ‘define-intrinsics’; Here’s a description from the help for the define-intrinsics word:

Defines a set of assembly intrinsics for the word. When a call to the word is being compiled, each intrinsic is tested in turn; the first applicable one will be called to generate machine code. If no suitable intrinsic is found, a simple call to the word is compiled instead.

What I particularly like about this feature is that it neatly provides the ability to specialize a highly optimized implementation for a particular hardware set, and then fall back gracefully on other architectures.

That concludes my ad-hoc tour of the factor compiler. I’ve skipped over a number of things and no doubt there are bits I haven’t discovered yet and some inaccuracies, but I hope I’ve supplied enough information to spark interest in this excellent compiler.

As I mentioned in a previous post I got interested in factor as a direct result of tinkering with jonesforth, which takes you through the entire forth bootstrap process starting with raw assembly. I’ve been delighted to find that factor retains a lot of the ‘right-down-to-the-metal’ accessibility of its low level cousin.

Tue, 8 Apr 2008 08:27:16

planet-factor is an Atom/RSS aggregator that collects the contents of Factor-related blogs. It is inspired by