Daniel Ehrenberg: Parsing with regular expressions and group captureThough I'd rather avoid it, string parsing is a crucial part of programming. Since we're more than 60 years into the use of modern computers, it seems like we should have a pretty good handle on how to build abstractions over parsing. Indeed, there are tons of great tools out there, like GNU Yacc, Haskell's Parsec, newer Packrat-based parsers like Factor's EBNF syntax for PEGs, and a bunch of other high level parsing libraries. These libraries are relatively easy to use once you understand the underlying structure (each one parses a different subset of context-free grammars), because they expose the programmer to a tree-like view of the string. DFAs and regular expressionsI'm taking a computer science class about this right now, so I'm gonna be totally pedantic. When I say regular expression, I mean an expression that describes a regular language. Perl regexes aren't regular expressions (and Larry Wall knows this). If you don't feel like putting on your theoretician's hat, this blog post will be mostly meaningless. What's a regular language? First off, a language is a set of strings. We care about infinite sets of strings, since finite sets are trivial to represent. If a string is in the language, that means that the language matches the string, intuitively. A regular language is one which can be represented by a deterministic finite automaton (DFA) without extensions, also called a finite state machine (FSM) for some reason. Many useful languages are regular, and many are not. The idea of a DFA is a finite number of states and a transition function, which takes the current state and a character of a string and returns the next state. The transition function is defined on all states and all characters in the alphabet. There is a set of final states, and if the string runs out when the machine is in a final state, then the string is accepted. The language of the DFA is the set of strings accepted by the DFA. For a given DFA, that DFA can be run in linear time with respect to the length of the input string and constant space. It can also be run "on-line", that is, without the whole string in advance, going incrementally. A related construction is an NFA, or nondeterministic finite automaton. Imagine the previous idea, but instead of a transition function, there is a transition relation. That is, for any character and current state, there are zero or more next states to go to, and the NFA always picks the right one. This is called nondeterminism (at least that's what it means here). Amazingly, NFAs can accept only regular languages and nothing more, because NFAs can be translated into DFAs. Basically, you build a DFA which picks all possible states at once, given all possible paths through the NFA. Potentially, though, there's an exponential blowup in the number of states. Every regular expression can be converted into an equivalent NFA, which can be converted into a DFA, which can then be converted back into a regular expression. They're all equivalent. So then what's a regular expression? There are different ways to define it. One is that you can build up a regular expression from the following elements: the epsilon regex (matching the empty string), the empty regex (matching nothing), single character regexes (matching just a single character), concatenation (one followed by another), disjunction (or) and the Kleene star (0 or more copies of something). Counterintuitively, it's possible to construct regexes which support negation, conjunction, lookahead and other interesting things. The most important distinction from Perl regexes is that regular expressions cannot contain backreferences, because these are provably impossible to express in a DFA. It's impossible to parse something with backreferences in the same linear time and constant space that you get from regexes which are regular. In fact, parsing patterns with backreferences is NP-hard and not believed possible in polynomial time (with respect to the length of the input string). Since regular expressions which are regular give us such nice properties, I'm going to stick to them. Regular expressions in practice in parsing todayThe formal study of regular languages is a basically solved problem within the formalism itself: they are equivalent to DFAs, and satisfy a convenient set of properties summarized by the pumping lemmaand the Myhill-Nerode theorem. The problem is just, is the given string a member of the language? What languages are regular? This was solved in the 1950s and 1960s, and the basic results are in most introductory compiler books. Those books use the solution to build lexers, like Lex. Lex basically takes a list of regular expressions, each associated with an action, finds one of them to match maximally with the current input, executes the associated action on the portion that matches, and then repeats with the rest of the string. This is useful to build lexers, but the programmer has very little context for things, so it's difficult to use for much else. More recently, Ragel has been used as a way to parse more complicated things using regular expressions. Its strategy is to turn its compile-time input into one big regular expression, annotated with actions on certain states or transitions. The actions are fragments of C code, and they form the processing power of the machine. However, their behavior can get a little unintuitive if too much nondeterminism is used, so Ragel provides a bunch of tools to limit that. Also, Ragel lets you explicitly specify a DFA through transitions, which seems useful but low-level. Group capture with regexesOne of the most useful features of Perl's regular expressions is group capture. By this, I mean how you can do something like s/^(1*)(0*)$/$2$1/ to swap ones and zeros in a string. This is different from backreferences (like the non-regular expression /(.*)$1/) because it's only used in subsequent code, to figure out what got matched to what part of the regex. It doesn't parse any languages which aren't regular, but it's a useful tool for processing.Curiously, this has been ignored both by academics and DFA implementors so far. I hypothesize that it's been ignored by theorists for two reasons: (1) It's easy to confuse with backreferences, which make the language non-regular, which is totally uninteresting to theorists (2) They're not part of the formalism of regular expressions as previously expressed. Implementors of (non-Perl-based) regular expression-based parsing mechanisms tend to avoid group capture because, in the general case, it's not fast enough and can't be done on-line. Also, as far as I can tell, it hasn't been implemented any other way than interpreting an NFA, using backtracking, and keeping track of where the parser is within the regex to determine group boundaries. This would be terrible for the domain of Lex and Ragel. By "on-line" I don't mean on the internet but rather that an algorithm that can be performed incrementally, getting little pieces (characters, say) of the input and doing computation incrementally, as the input is received, without storing the whole thing and running the algorithm all at once. So how can we do group capture on-line? Well, in general, we can't. Consider the regular expression (1*)1 where you're trying to capture the group 1*. As the input is being processed, we don't know when we've gotten to the end of the group until the entire input is over, since if there are two more 1's, then the first one must be in the group. However, in many cases group capture can in fact be done on-line, as in (0*)(1*), where the groups captured are 0* and 1*. As the regex is processing on the string, it knows that, if there is a match, the group boundary is just before the first 1. This can be formalized as a "boundary of determinism": a point where, in the subset construction to form a DFA from an NFA gets a subset of exactly one state. I believe this can handle most cases of group capture in practice, if the regular expression is well-written, but surely not all of them. I have an idea for how to do group capture in the few remaining circumstances, but unfortunately it takes linear space and it's not online. I'll blog about it once I have a proof of correctness. Hierarchical parsing using group captureUsing this group capture mechanism, we can build a hierarchical parsing mechanism with actions on different things, which can be built to parse regular languages in a higher-level way. Regular expressions can't use arbitrary recursion like context-free grammars can, so the parse tree will be of fixed size, but it could still be useful. In designing this, I'm thinking specifically about making a SAX-like XML parser. It'd be awkward to write everything out as one big regular expression, but split into smaller things, each with their own little steps in processing, it could be much more elegant. My goal for syntax is something like EBNF syntax, as Chris Double's PEGs library in Factor does. Here's some future pseudocode for how it could look in parsing an XML tag, simplified. (In this code, > is used like Ragel :>>, to indicate that when the expression afterwards can be matched by the regex, it is, as soon as possible (basically).)
ConclusionThough I haven't implemented this yet, and probably shouldn't even be talking about it, I'm really excited about this idea. I even came up with a stupid little name with it: Hegel, both for High-level Ragel and because it represents the synthesis of the dialectic (as described by Hegel) of slower, higher-level parsing and fast low-level parsing into fast, high-level parsing of regular languages. I hope it works. Sat, 10 May 2008 11:05:00 Slava Pestov: Garbage collection throughput improvementsThe problemAbout a month ago, I made some changes to the garbage collector. These changes included drastically decreasing the default nursery size from 16mb to 1mb. This turned out to have a negative effect on performance, with GC consuming as much as 60% of run time on some tests. Instead of making the default nursery larger, which would just mask the problem until we hit workloads with 16 times as much allocation (or when the compiler began to generate code that was 16 times faster, or any combination of the above), I decided to investigate the root cause of the problem. Turns out that every minor GC was taking on the order of a few milliseconds at the minimum, which is too much time. Most of the time was spent scanning the cards array for marked cards. Even if there weren't any marked cards, with a 64mb heap and 64 byte card size, that's a million array entries to check. Faster card scanningThe first thing I did was change how the card scan loop operates. Instead of iterating over the array a byte at a time, and checking if each byte satisfies the card mark mask, we iterate over the array four bytes at a time, checking each group of four bytes against the mask. GCC should really be doing this optimization for me, because its a simple case of vectorization, but it doesn't, so I just wrote the code out by hand: u32 *quad_ptr; This improved card scan performance by a factor of 3, which makes sense, because we do four times less loop iterations, but at some point, you still have to hit memory, which dampens things out a bit. This optimization improved performance, but not sufficiently so. I realized that card scanning hurt the most in benchmarks which didn't write into objects in the old generation at all; for example, a benchmark which operates on bignums, where the intermediate values are extremely short-lived, fills up the nursery very rapidly, and every minor GC checks all cards, which not only fills up the cache with junk, but doesn't achieve anything useful as no cards will be marked. On the other hard, a pointer recording approach where all stores are written into a buffer is also unacceptable, because in code with a high rate of mutation, you pay a hefty price every time the buffer overflows. So I had to find a compromise. Introducing card decksMy solution is something I haven't seen done anywhere else before, but it is pretty trivial if you think about it. It amounts to two-level card marking. We have an array of cards, where each card maps to a small amount of memory: 64 bytes in my case, but it could be anything; and an array of card decks. Each card deck corresponds to 1024 cards, and if a card is marked, then its deck must be marked too. This way, on a minor GC, we scan the deck array first, and then only check cards corresponding to marked decks. This complicates the write barrier, because now on every store we have to mark a card and a deck, not just a card as before. I decided to counterweight this by simplifying the write barrier in another way. Formerly, it would read the card, "or" it with a mask, and store it back. This was done because the card contained an additional piece of information, and that is the offset within the card where the first object begins. Cards have two mark bits, denoting a possible pointer from this generation to either aging space or the nursery, and the other six bits were reserved for the offset of the first object. By moving object offsets (I call them "allot markers") into a separate array, I removed a read operation from the write barrier; now it simply has to write the mark mask to the card, without caring about its existing contents. This also freed up two bits in the allot marker, allowing me to switch to 256 byte cards (and 256 kilobyte card decks). These improvements really helped on benchmarks which allocated large numbers of extremely short-lived objects, however on benchmarks which allocated longer-lived objects the improvements were still major but not so drastic. It turns out that if objects survives long enough to be copied from the nursery into aging space, then it is possible for this copying to begin to dominate running time. Low-level optimizationsThe inner loop of the GC looked like so: void copy_handle(CELL *handle) This is very elegant code; do_slots is a higher-order function which applies a function pointer to each slot of the object. However, GCC doesn't try to optimize higher-order code at all, after all it is a C compiler not an ML compiler, and it isn't Sufficiently Smart Either! So I rewrote the above function by manually inlining do_slots() and replacing the function pointer invocation with the body of copy_handle(). The next thing I realized was that should_copy() contains a large series of conditional tests which did not depend on its input parameters, only global variables whose value was invariant for the duration of the collection:/* test if the pointer is in generation being collected, or a younger one. */ So again, I did a bit of hand-optimization. If foo() does not depend on the loop variable or the side effects of the body, then the following are equivalent:while(A) In the optimizing compiler literature, this is known as loop unswitching. The new collect_next() is much longer, but also much faster. It looks like this:CELL collect_next(CELL scan) That's a hell of a code explosion, but it made a real difference to the performance of the garbage collector. BenchmarksThe first benchmark just allocates 3-element arrays in a loop: : garbage 1 2 3 3array ; Here are the results with different nursery sizes; all times are in milliseconds and I ran each benchmark multiple times to ensure I was getting reliable results:
The second benchmark allocates larger arrays in a loop: : garbage 100 f <array> ; The results are similar:
An interesting phenomenon occurs where after the GC changes, a larger nursery actually begins to hurt performance. Dan and I hypothesized that this is due to the larger nursery hurting locality, with poor algorithms masking this effect with the older GC. The next benchmark allocates a lot of temporary arrays but stores them into larger arrays which are somewhat longer lived: : garbage 100 f <array> ; The size of the aging generation makes a difference here so I tried the benchmark with more parameters:
The above benchmarks are somewhat unrealistic, however other benchmarks showed speedups too, some more drastic than others:
Before the shrinking of the nursery, the runtimes of these benchmarks looked much like they do now; it was only until a month ago that GC time began to dominate benchmarks like that. However I like having a small nursery, and making it smaller forced me to optimize the garbage collector for higher throughput in constrained-memory conditions. Looking aheadThe next big change I will be making to the garbage collector is switching to using mark/sweep for the old generation. This will reduce memory consumption by almost 50%. The other side of this coin is compiler optimizations. If the compiler performed more advanced static analysis, it could eliminate much of the dynamic memory allocation in the above benchmarks, cutting GC time further. Of course, if we have a great compiler and a great GC, we'll just have great performance all around. Sat, 10 May 2008 00:33:00 Daniel Ehrenberg: Interval maps in FactorRecently, I wrote a little library in Factor to get the script of a Unicode code point. It's in the Factor git repository in the vocab
That array consists of interval-nodes, which have a beginning, end and corresponding value.
Let's assume we already have the sorted interval maps. Given a key and an interval map, find-interval will give the index of the interval which might contain the given key.
interval-contains? tests if a node contains a given key.
Finally, interval-at* searches an interval map to find a key, finding the correct interval and returning its value only if the interval contains the key.
A few convenience words, analogous to those for assocs:
So, to construct an interval map, there are a fewi things that have to be done. The input is an abstract specification, consisting of an assoc where the keys are either (1) 2arrays, where the first is the beginning of an interval and the second is the end (2) numbers, representing an interval of the form [a,a]. This can be converted into a form of all (1) with the following:
Once that is done, the objects should be converted to intervals:
After that, and after the intervals are sorted, it needs to be assured that all intervals are disjoint. For this, we can use the monotonic? combinator, which checks to make sure that all adjacent pairs in a sequence satisfy a predicate. (This is more useful than it sounds at first.)
And, to put it all together, using a tuple array for improved space efficiency:
All in all, in the case of representing the table of scripts, a table which was previously 200KB is now 20KB. Yay! Wed, 7 May 2008 21:50:00 Slava Pestov: I/O changes, and process pipeline supportI made some improvements to the I/O system today. Default stream variablesThe stdio variable has been replaced by input-stream and output-stream, and there are four new words:
The first two close the stream after, the latter do not. The with-stream and with-stream* words are still around, they expect a duplex stream, unpack it, and bind both variables.I've changed many usages of with-stream to use one of the unidirectional variants instead. This means that you can now write code like the following:"foo.txt" utf8 [ Before you had to use this really ugly "design pattern": "foo.txt" utf8 <file-reader> [ Speaking of duplex streams, because they're not used by anything in the core anymore I have moved them to extra. They are still used by <process-stream> and <client>. I added a <process-reader> and <process-writer> word for those cases where you only want a pipe in one direction; they express intention better.PipesThe <process-stream> word has been around for a while, and this word used pipes internally, but they were not exposed in a nice, cross-platform way, until now.The io.pipes vocabulary contains two words:
Appending process output to filesThe io.launcher vocabulary supports the full array of input and output redirection features, and now, pipelines. There was one missing component: redirecting process output to a file opened for appending. Now this is possible. The following Factor code:<process> Corresponds to this shell command: do-stuff 2>> do-stuff.txt Of course, shell script is a DSL for launching processes so it is more concise than Factor. However, Factor's io.launcher library now supports all of the features that the shell does, and its pretty easy to build a shell command parser using Chris Double's PEG library, which translates shell commands to sequences of Factor process descriptors in a pipeline.And now, I present a concise illustration of the difference between the Unix philosophy and the Windows philosophy. Here we have two pieces of code, which do the exact same thing: create a new pipe, open both ends, return a pair of handles. Unix: USING: system alien.c-types kernel unix math sequences Windows: USING: alien alien.c-types arrays destructors io io.windows libc The Windows API makes things difficult for no reason. Anonymous pipes do not support overlapped I/O, so you have to open a named pipe with a randomly-generated name (I'm not making this up, many other frameworks do the same thing and Microsoft even recommends this approach). On the flipside, the nice thing about supporting both Unix and Windows is that I get to come up with true high-level abstractions that make sense, instead of being able to get away with having thin wrappers over Unix system calls as so many other language implementations do. For example, Ocaml's idea of "high-level I/O" is some basic POSIX bindings, together with incomplete emulation of Unix semantics on Windows. Look forward to writing a page of code if you want to map a file into memory or run a process and read its output into a string. And of course Java doesn't support pipes, I/O redirection for launching processes, or file system change monitoring, at all. Mon, 5 May 2008 18:02:00 Daniel Ehrenberg: A couple GC algorithms in more detailIn previous posts on garbage collection, I've given a pretty cursory overview as to how things actually work. In this post, I hope to give a somewhat more specific explanation of two incremental (and potentially concurrent or parallel, but we'll ignore that for now) GC algorithms: Yuasa's snapshot-at-the-beginning incremental mark-sweep collector, and the MC2 algorithm. Yuasa's collector is very widely used, for example in Java 5 when an incremental collector is requested. MC2 is a more recent algorithm designed to reduce the fragmentation that mark-sweep creates, and appears to get great performance, though it isn't used much yet. In their practical implementation, both collectors are generational. Yuasa's mark-sweep collectorThe idea is pretty simple: take a mark-sweep collector and split up the work, doing a little bit on each allocation. When the heap occupancy passes a certain threshold, say 80%, switch into "mark phase", and on each allocation, mark the right amount of the heap so that everything's marked by the time the heap is full. (You can ensure this by making the amount of marking proportional to the amount of memory allocated.) Then, switch into sweep phase, and on each allocation sweep the heap by a certain amount. If a big object is allocated, sweeping continues until there's enough room. Once sweeping is done, the collector returns to a neutral state and allocation takes place without any special collection actions until the free space dips below the threshold. Making this workThis is a neat little way to specify a GC algorithm. The implementor has three knobs at their disposal: the threshold to begin collection, the speed of marking, and the speed of sweeping. But there's a problem: the algorithm, as I described it, doesn't work. See, the graph of interconnections in the heap may change during the course of marking, and that's a problem. As I described in a previous post, if a pointer gets moved to another location, it might evade marking and get swept, causing memory corruption. In a snapshot-at-the-beginning incremental marking GC, the technique to save this is to trap all pointer writes and execute a little bit of code: if the collector is in the marking phase, and if the old pointer value isn't marked, it needs to get marked and get pushed on the marking stack so that its children get marked. (The marking stack is the explicit stack used for depth-first traversal of the heap, to mark everything it reaches.) This piece of code is called the write barrier, and it goes on in addition to the generational write barrier, if one is necessary. ConservativenessOne more thing: objects are allocated as marked, if an object is allocated during a GC cycle. This means that they can't be collected until the next time around. Unfortunately, this means that any generational GC will be ineffective while marking is going on: everything is effectively allocated in the oldest generation. Nevertheless, generations still provide a significant performance advantage, since most time is spent in the neural non-GC state. This is called snapshot-at-the-beginning not because an actual snapshot is made, but because everything is saved that had something referring to it at the beginning of the marking phase. (Everything that gets a reference to it during the cycle is also saved.) Of all incremental mark-sweep GC algorithms, a snapshot-at-the-beginning collector is the most conservative, causing the most floating garbage to lie around and wait, uncollected, until the next cycle. Other algorithms have techniques to avoid this, but it often comes at other costs. MC2Unfortunately, no matter what strategy is used to minimize fragmentation, there is a program which will cause bad fragmentation of the heap, making it less usable and allocation more expensive. For this reason, a compaction strategy is helpful, and the MC2 algorithm (Memory-Constrained Compaction), created by Narendran Sachindran, provides one within an incremental and generational system. The details are somewhat complicated, and in this blog post I'll offer a simplified view. You can also look at the full paper. MCThe idea is based on the Mark-Copy (MC) algorithm. The heap is divided up into a number of equally sized windows, say 40. One of these is the nursery, and the others act as tenured space. (I don't know why, but the papers about this seem to use a two-generation rather than three-generation model. I think it could easily be updated to use three generations, but I'll stick with this for now.) Each window has a logical number, with the nursery having the highest number. Nursery collections go on as I've described in a previous post. A tenured space collection is triggered when there is only one (non-nursery) window left free. At this point, the heap is fully marked. During marking, remembered sets of pointers into each window are made. In turn, each window is copied (using Cheney's copying collector) to the open space that exists, starting in the one free window. The remembered sets can be used to update pointers that go to things that were moved. If the lowest number window is copied first, the remembered sets only need to contain pointers from higher windows to lower windows. New modificationsMC2 adds a few things to this, to make the algorithm incremental and give low upper bounds on space overhead. The first change is that incremental marking is done. This is similar to the incremental snapshot-at-the-beginning marker described above, though the creators of MC2 opted for a version called incremental update, which is less conservative and more complicated but equally sound. The next change is in the copying technique. If a window is determined to have high occupancy (like more than 95%), it is left as it is without copying. Otherwise, windows are collected into groups whose remaining data can fit into one window. Those groups are incrementally copied into a new window. Other changes make sure that the space overhead is bounded. The size of remembered sets is limited by switching to a card marking system in the event of an overflow. Objects with many references to them are put in semi-permanent storage in the lowest possible window number, minimizing the size of remembered set that they need. In a benchmark included in the MC2 paper, it is demonstrated that MC2 has the same or slightly better performance compared to non-incremental generational mark-sweep or generational mark-compact, the alternatives for the domain of memory-constrained systems. Pauses more than 30ms are rare, and performance appears to be consistent over a wide range of Java programs. Sat, 3 May 2008 00:51:00 Slava Pestov: USA Zip code databaseRecently someone posted a freely available Zip code database on reddit. The database is in CSV format. USING: io.files io.encodings.ascii sequences sequences.lib Then, we define some singleton types for the various states of the union. While this isn't strictly necessary, it allows us to write generic words which dispatch on states; for example, I'm sure Doug's taxes library could use this:SINGLETONS: AK AL AR AS AZ CA CO CT DC DE FL GA HI IA ID IL IN Next up, we define a data type storing rows from the CSV database: TUPLE: city Now a word which reads the database, parses it as CSV, and then parses each column into a specific data type: MEMO: cities ( -- seq ) This word is tricky; some notes on its workings:
The word is memoized so of course it will only load the database once. We can now define words to query it: MEMO: cities-named ( name -- cities ) Now, let's look at some examples. First, let's look up my Zip code: ( scratchpad ) 55406 find-zip-code . And another famous Zip code: ( scratchpad ) 90210 find-zip-code . How many states have a city named "Minneapolis"? ( scratchpad ) "Minneapolis" cities-named [ state>> ] map prune . What is the possible range of Zip codes for Austin? ( scratchpad ) "Austin" TX cities-named-in [ first-zip>> ] map [ infimum . ] [ supremum . ] bi There are many possible applications for this library, including form validation in web apps. It could be extended further: if the database was loaded into a quadtree sorted by latitude/longitude, you could perform queries such as finding all towns within 50 miles of a given city. Tue, 29 Apr 2008 21:45:00 Slava Pestov: An addendum to "The new HTTP server, part 2"If you run the web app presented in the last blog post verbatim, you will get a "500 Internal server error" with no further indication of what's going wrong. This is because the code I presented has a minor omission.
If we enable development mode, we see the real error message, "No such table: SESSIONS". This is because I didn't mention that one must initialize the database by creating the table for storing sessions first: "counter.db" sqlite-db [ init-sessions-table ] with-db In the next installment of this series, which hopefully won't take as long as the second one did, I will discuss form validation and the templating system. Tue, 29 Apr 2008 17:45:00 Slava Pestov: Write once, run anywhereApple finally released Java 6 for Mac OS X, about a year late, and it only runs on 64-bit Intel Macs. It also requires Leopard, which isn't such a big deal because anyone can upgrade. I sold my G5 but a lot of people out there still use PowerPC Macs and I guess they're just screwed. It's a shame too because Java 6 is the first release that's starting to approach true usability for desktop applications. Tue, 29 Apr 2008 17:25:00 Daniel Ehrenberg: Potential ideas to exploreI haven't written in a while, and it's a little hard to get started back up, so here are just a bunch of random ideas in my head that I'd like to share with you guys. Sorry if it's a little incoherent... Possible extensions to InverseI've been thinking about possible ways to generalize my system for concatenative pattern matching, currently inextra/inverse. There are two ways to go about it: making a more general constraint solving system, and giving access to the old input when inverting something, as in the Harmony project. A third way is to add backtracking (in a different place than constraint solving would put it). To someone familiar with Inverse, these might seem like they're coming from nowhere, but they're actually very closely related. (To someone not familiar with it, see my previous blog post describing Inverse.)Constraint solvingThe idea of resolving constraints is to figure out as much as you can about a situation given certain facts. This is easy in some cases, but impossible in others, even if enough facts are known to, potentially, figure out what everything is. For example, Diophantine equations can be solved by a fully general constraint-solving system, but they're known to be undecidable in general.So what can constraint solving get you in Inverse? Well, imagine an inverse to bi. It's not difficult to make one within the current framework, but some information is lost: everything must be completely determined. Think about inverting [ first ] [ second ] bi. Inverting this should get the same result as first2 (which has a hard-coded inverse right now, inverting to 2array). But it won't work.A way for [ first ] [ second ] bi to work would be using the following steps:
This isn't easy to do in general, but it should be possible, in theory. It'd be extremely cool if it worked out. Formally, you can think of Inverse as already a reasonable constraint solving system, for a limited problem domain. Given [ f ], and the statement about stacks A and B that f(A) = B, and given B, find a possible value for A. The strategy used right now is mathematically sound, and I hope to write it up some day. But, a more general use of logic variables is possible: explicit logic variables in code. This could be used to make a better-integrated logic language in Factor. The Harmony ProjectThe Harmony Project, led by Benjamin C. Pierce, is an attempt to solve the "view-update problem" using a new programming language and type system which is largely invertible. The view-update problem is that we want to convert different storage formats into an abstract representation, manipulate that representation and put it back without duplicating code about the representation. Everything operates on edge-labeled trees. Within the Harmony framework, it's possible to do all your work in bijections (one-to-one onto functions, similar but not identical to the domain of Inverse right now), but there's extra power included: the function to put the abstract representation back into the original form has access to the original. This adds a huge amount of power, giving the possibility of conditionals and recursion, in limited cases. Also, it gives the power to ignore certain things about the surface structure when looking at the abstract form. (Harmony also has ideas about tree merging, and of course a new type system, but I'm not as interested in that right now.) So far, only relatively trivial things have been made with Harmony, but the idea looks really useful, though there are two problems: (1) I don't really understand it fully (like constraints) and (2) I have no idea how it can fit together with Inverse as it is right now. BacktrackingIn Mark Tullsen's paper on first-class patterns, there was an interesting idea that Inverse could adopt. Tullsen used monads to sequence the patterns. It's the simplest to use the Maybe monad, and that corresponds to how pattern matching systems normally work. But if the List monad is used instead, then you easily get backtracking. This could be ported to Factor either by using monads or, maybe easier, by using continuations. Years ago, Chris Double implemented amb in Factor using continuations, though the code won't work anymore. The sequencing and backtracking I'm talking about is relevant in things likeswitch statements, rather than undo itself. I'm not sure if it'd actually be useful in practice.Garbage collection research ideasBecause the summer's coming up, and I'll be participating in Harvey Mudd's Garbage Collection REU, I've been coming up with a few research ideas. The suggested one is to continue with the work of previous years' REUs and think about simplifiers and collecting certain persistent data structures and weak hashtables, but here are a couple more:
These ideas are really difficult, but I think they're interesting, and with four other smart people working with me, maybe in a summer we can do something really cool, like this or whatever other idea they come up with. Ragel-style state machines in FactorIn my Automata and Computability class at Carleton, we've been studying (what else) finite automata, and it got me thinking about regular expressions and their utility in Factor. By regular expression, I mean an expression denoting a regular language: a real, academic regexp. A regular language is one that can be written as a deterministic finite automaton (finite state machine). Hopefully, I'll explain more about this in a future blog post.Anyway, if you've heard of Ragel, it's basically what I want to do. But the form it'd take is basically the same as PEGs (Chris Double's Pacrat parser), with the one restriction that no recursion is allowed. In return for this restriction, there is no linear space overhead. Basically everything else, as far as I know, could stay the same. I'm thinking I'll redo the XML parser with this. The SAX-like view will be done with this regular languages parser (since all that's needed is a tokenizer), and then that can be formed into a tree using PEGs (since linear space overhead is acceptable there). Linear space overhead, by the way, is unacceptable for the SAX-like view, since it should be usable for extremely large documents that couldn't easily fit in memory all at once. (By the way, I know Ragel also allows you to explicitly make state charts, but I won't include this until I see a place where I want to use it.) Tue, 29 Apr 2008 15:48:00 Slava Pestov: The new HTTP server, part 2It's been a month and a half since the first part of this series. Why the long delay? I've been busy with other things. I implemented inheritance, various compiler optimizations, and many other things. In the last couple of weeks I've been working on the web framework again, tying up some loose ends and porting more existing web applications over (namely, the pastebin and planet factor). Session managementThe basic idea behind session management is that while HTTP is a stateless protocol, we can simulate state by sending a token to the client -- either in the form of a hidden element on the page, or a cookie, which the client sends back to the server with a later request. This token is associated with an object on the server and the object holds state between requests. Another approach for session management is to store state entirely on the client; instead of sending the client a session ID identifying an object on the server, you send the session data itself to the client. Traditionally this approach has only been used for user preferences and such where security is immaterial, but it can even be used for more sensitive data by encrypting it with a private key only known to the server. The client receives an opaque blob of binary data which cannot be inspected or tampered with (unless the public key encryption algorithm being used is compromised). Currently Factor's session manager does not support client-side sessions, but it will soon, using Doug Coleman's public-key encryption code. Server-side sessions are supported, however. The session manager uses two main strategies to pass state to the client:
The idea is to strike a balance between security and convenience; we don't want to add a session ID to every link and start a new session if the user navigates to the site by directly entering a URL, but on the other hand we don't want potentially destructive POST requests to be accepted unless they were sent by a form generated from within the session itself. In Factor, a session is simply a hashtable where values can be stored. Keys are known as "session variables" and values can be read and written with the sget and sset words, there's also a schange combinator which applies a quotation is applied to an existing session variable to yield a new value. This all entirely analogous to the get/set/change words for dynamic variables.Session namespaces are serialized and stored in a database using Doug's db.tuples O/R mapper. I originally supported pluggable "session storage" backends, with database storage and in-memory storage as the two options, however I decided to simplify the code and hardcode database storage. This has the side-effect that you'll need to set up a database to use the session management feature, however SQLite presents a lightweight option which requires no configuration, so I don't think this is a big deal at all.I will show a small example of a 'counter' web application, much like the counter example for the Seaside framework. We start off with a vocabulary search path: USING: math kernel accessors http.server http.server.actions Now, we define a symbol used to key a session variable: SYMBOL: count Next, we define a pair of actions which increment the counter value, using the schange combinator. The display slot of an action contains code to be executed upon a GET request; it is expected to output a response object. In our case, the word outputs an action which applies the quotation to the current counter value; the action outputs a response which redirects back to the main page::: <counter-action> ( quot -- action ) The action to decrement the counter is entirely analogous: : <dec-action> ( -- action ) Note that this word constructs actions, instead of invoking them. This approach is more flexible than the old "furnace" web framework, where actions were mapped directly to word execution, because it allows one to write "higher-order actions" parametrized by values more easily. Here is the default action; it displays the counter value using a template: : <counter-action> ( -- action ) Finally we put everything together in a dispatcher: : <counter-app> ( -- responder ) We create a dispatcher, add instances of our actions to it, and wrap the whole thing in a session manager. Now, the template: <% USING: io math.parser http.server.sessions webapps.counter ; %> Finally, once we have all the parts, we can create the counter responder and start the HTTP server: <counter-app> "test.db" sqlite-db <db-persistence> main-responder set Note that here we wrap the counter responder in another layer of indirection, this time for database persistence; while the counter web app doesn't use persistence the session manager does, and we chose to use SQLite since it requires no configuration or external services. Navigating over to http://localhost:8888/ should now display the counter app, and clicking the increment and decrement links should have an effect on the displayed value. Sessions persist between server restarts and time out after 20 minutes of inactivity by default. Looking at your web browser's cookie manager will show that a factorsessid cookie has been set.As an aside, the Seaside version uses continuations to maintain state. The Factor version explicitly maintains state. Even though I ported Chris Double's modal web framework over to the new HTTP server, I'm avoiding continuations in favor of explicit state for now. I am building up a form component framework with validation, easy persistence, and user authentication without resorting to continuations, and I plan on building a state-machine model with a page flow DSL, much like jBPM, to handle more complex multi-page flows such as shopping carts. While this will result in more work for me, I believe the benefits include transparent support for load-balancing and fail-over, readable URLs, and ultimately, simpler and more reusable web application code because page flow can be decoupled from logic and expressed in a custom DSL intended for that purpose. The Seaside version is also somewhat shorter; it is easy to express with idiomatic Seaside (transparent session management, presentation logic mixed in with web app code). I will add better abstractions to make up for some of the difference, and for larger applications there should be no difference in code size; in fact since the scope of Factor's framework is wider than Seaside (it covers persistence, authentication and validation, and soon, versioning of persistent entities) you might even need less code to accomplish the same thing. Virtual hostingThe other topic I promised to cover last time was virtual hosting. Virtual hosting is done with dispatchers, much like nested directory structure is. You create a virtual host dispatcher with <vhost-dispatcher> and add responders for various virtual hosts using add-responder; the >>default slot can be used to set the default virtual host. The key difference between the new approach and the old HTTP server virtual hosting implementation, which relied on a global hashtable mapping virtual host names to responders, is flexibility; the virtual host dispatcher does not necessarily have to be your top-level responder.For example, the <boilerplate> responder gives you a way of enforcing a common look and feel across a set of web apps, by adding common headers and footers to every page. While I will describe boilerplate responders and the template system in more detail in a later post, for now here is an example:<vhost-dispatcher> Here, all virtual hosts share the same session management, database persistence, and common theme, and the virtual host dispatch only happens after the request filters through the mentioned layers of functionality. This would not be possible with the old HTTP server without duplicating code. CookiesFinally I promised to talk about cookies. The session management support is great but sometimes you just want to get and set cookies directly. This can be done by reading the cookies slot of the request object, and writing the cookies slot of the response object. The slot contains a sequence of cookie objects, which are parsed and unparsed from their HTTP representation for you. A cookie object contains a series of slots, such as name, value, expiration date (as a Factor timestamp object), max-age (as a Factor duration object), path, and host. While the expiration date is deprecated as of HTTP/1.1, most sites still use it in favor of max-age because older browsers don't support max-age. Factor's HTTP server sets the date header on each response so that expiration dates can work correctly.Here is an example of using the HTTP client (which shares the cookie code with the server) to look at Google's ridiculously long-lived cookies: ( scratchpad ) "http://www.google.com" http-get-stream drop cookies>> first describe Tue, 29 Apr 2008 03:04:00 Doug Coleman: Word renaming (part 2)Here are the word names that have changed:
foo* are a variant of foo, but otherwise the * is no help to what the word actually does differently. We're trying to move away from word names with stars to something more meaningful.Code with common patterns that come up a lot, like 1 tail and swap compose, is more clearly understood if these patterns are given a single name.Factor's subset is not equivalent to the mathematical definition of subset, so it was renamed to filter to avoid confusion. Along these same lines, diff and assoc-diff are now the more mathematically intuitive; you can think of diff like set subtraction now, seq1 seq2 diff is like seq1 - seq2.Finally, the "UFO operator" <=> now returns symbols +lt+ +eq+ +gt+ instead of negative, zero, and positive numbers. The before? and after? words can compare anything that defines a method on this operator. Since between?, min, and maxare defined in terms of these comparison words, they also work on more objects. Please let me know if you have any more suggestions for things words that have awkward argument orders, imprecise names, or if you can suggest alternate names for words with stars in their names. Mon, 28 Apr 2008 10:49:00 Slava Pestov: Performance improvementsOver the last three days I spent some time improving Factor's compiler. I made the following improvements:
These changes have let to some performance improvements on the two benchmarks I was working with. My computer here is a MacBook Pro with a 2.4 GHz Intel Core Duo 2. The project-euler.150 benchmark saw its runtime decrease from 87 seconds to 22 seconds.The benchmark.recursive benchmark, which is a Factor implementation of the recursive benchmark from the Computer Language Shootout, saw its runtime decrease from 27 seconds to 9 seconds.For comparison, SBCL runs the recursive benchmark in 3.5 seconds, and Java runs it in 1.6 seconds. Using the java -Xint result of 22 seconds, I guesstimated from the results for all languages on the shootout that Factor at around the performance of the Erlang HIPE JIT, and slightly faster than the Python Psyco JIT and GForth. By anybody's standard, this is anywhere between "not very fast" and "bloody slow", but its slowly improving.Now I'll end with a rant about the language shootout. The only so-called "dynamic" language implementations which come close to the performance of Java and C on this benchmark are SBCL, Ikarus Scheme, and Chicken Scheme. However, all these benchmarks are actually static programs in disguise, peppered with type declarations and even unsafe low-level features. The Ikarus and Chicken versions even implement the same functions twice, once for integer inputs and using integer arithmetic primitives, and another time for float inputs and using float arithmetic primitives. Unless their goal is really to promote their languages as nothing more than C with s-expression syntax, this is disappointing and dishonest. The Haskell benchmarks suffer from this also. Many benchmarks seem to use malloc and pointer arithmetic, and exclamation points (strictness annotations) abound. The n-body benchmark contains this gem: planets :: Ptr Double I would expect better from the Haskell people than hardcoding the size of a double to do pointer arithmetic on manually-managed memory! Right now I have no idea how idiomatic Haskell performs in practice; from looking at these benchmarks, it is clear to me that if one writes C in Haskell, pretty decent performance is possible, but that's not saying much. You can write C in any language. The runtime of the Factor recursive benchmark further improves by two-fold if I manually replace generic arithmetic with unsafe machine arithmetic, however I'm not interested in writing such code. I don't want to expose low-level primitives to the user, document their use as necessary for performance critical code, and declare that my job is done. Sat, 19 Apr 2008 04:56:00 Slava Pestov: My top 8 shell commandsI wrote a short Factor script to check my bash history and tally up the most frequently occurring commands. git There are some aliases here:
It seems that the only thing I use my computer for is running Factor! Sat, 19 Apr 2008 03:52:00 Doug Coleman: Word renamingSeveral words have been renamed and moved around to make Factor more consistent:
union operates on sequences.Somehow, seq-diff and seq-intersect were implemented as O(n^2) algorithms. Now, they use hashtables and are O(n).Lastly, a new vocabulary named ``sets'' contains the set theoretic words, along with a new word unique that converts a sequence to a hash table whose keys and values are the same. An efficient union and intersect are implemented in terms of this word.Mon, 14 Apr 2008 12:25:00 Doug Coleman: Adding a new primitiveI added two primitives to the Factor VM to allow setting and unsetting of environment variables. It's not that hard to do, but you have to edit several C files in the VM and a couple .factor files in the core. Really they should not be primitives, so eventually they will be moved into the core. IN: system Adding a primitive to vm/Since Factor's datatypes are not the same as C's datatypes, and because of the garbage collector, there are C functions for accessing and manipulating Factor objects. The data conversion functions are not documented yet, so here's a sampling of a few of them:
Here is the call to DEFINE_PRIMITIVE(set_os_env)The function is defined with a macro DEFINE_PRIMITIVE that takes only the function name. A corresponding DECLARE_PRIMITIVE goes in run.h as your function declaration. Not all primitives use these C preprocessor macros, for instance bignums don't because it doesn't improve the performance. Parameters to your primitive are popped or unboxed off the data stack, so a primitive's declaration expands to:F_FASTCALL primitive_set_os_env_impl(void);F_FASTCALL is a wrapper around FASTCALL, which on x86 will pass the first two arguments in registers as an optimization. Note that while it declares that it takes no arguments (void), most primitives will do something to the data stack. Since unbox_u16_string() allocates memory for the Factor object, it could trigger a gc, so it's registered as a string. You can also register values using REGISTER_ROOT for cells, REGISTER_BIGNUM for bignums, and REGISTER_UNTAGGED for arrays, words, and other Factor object pointers for which the type is known. The key string can immediately be unregistered after calling unbox on the next stack value since the rest of the function will not cause a gc. If the win32 call fails, there's a function general_error() that throws an exception. In this case, it's an ERROR_IO that calls a helper function to return the Windows error message.Adding a primitive to core/Everything in Factor compiles down to primitives. Because they are by definition "primitive", the compiler cannot infer the stack effect and argument types. To make a primitive's stack effect "known", edit core/inference/known-words/known-words.factor:\ set-os-env { string string } { } <effect> set-primitive-effectThe next step is to put your word in the file core/bootstrap/primitives.factor in the same order as in vm/primitives.c.Sometime in the future there might be a PRIMITIVE: word that will reduce the number of different places to edit to add a primitive. If it used Factor's FFI, you could add a new primitive without even having to bootstrap again.Sun, 13 Apr 2008 14:20:00 Slava Pestov: Multi-touch gestures in the Factor UIRecently I acquired a MacBook Pro. The new models come with a multi-touch keypad and I've found it extremely useful in Safari to be able to go back, go forward and zoom with the mouse. However, I missed the ability to do this in other applications, especially the Factor UI. The Factor UI already supports vertical and horizontal scrolling gestures, and I wanted to be able to use the other gestures as well. TUPLE: left-action ; C: <left-action> left-action Next, I edited extra/ui/cocoa/views/views.factor with some new methods for handling the new multitouch gestures, and translating them to Factor gestures:{ "magnifyWithEvent:" "void" { "id" "SEL" "id" }Note that I'm throwing away useful information for the sake of simplicity; the zoom gesture gives you a precise zoom amount, not just +1/-1. The swipe gestures seem to be completely discrete though. I haven't implemented rotation gestures yet because I haven't figured out what to use them for. With the above code written, the UI now sends multi-touch gestures to gadgets however no gadgets used them yet. I fired up the gesture logger, "gesture-logger" run, and tested that the new actions actually get sent. They were, except the first time I messed up the code and got the signs the wrong way round. With that fixed, I could proceed to add gesture handlers to the various UI tools. I edited various source files: browser-gadget "multi-touch" f {I think its pretty clear from the above what the default gesture assignments are:
15 minutes of Google, 20 minutes of hacking, and now Factor supports the fanciest feature of Apple's latest hardware. Fri, 11 Apr 2008 23:34:00 Slava Pestov: Improvements to io.monitors; faster refresh-allFactor's Fri, 11 Apr 2008 13:19:00 Slava Pestov: The golden rule of writing commentsThis is something I've wanted to say for a while, and I think many (most?) programmers don't realize it: Tue, 8 Apr 2008 20:36:00 Slava Pestov: Multi-methods and hooksFor a while now Factor has had 'hooks', which are generic words dispatching on a dynamically scoped variable. Hooks can be used with variables which are essentially global: the current OS, current CPU, UI backend, etc -- or variables which are truly context-specific, such as the current database connection. GENERIC: cache-size ( -- l1 l2 ) Notice that it is defined as taking no inputs from the stack. I don't really know the APIs involved here, but suppose that Linux gives us a way to get this info that works across all CPUs. So we define a method specializing on the os dynamic variable (which is really treated like global):METHOD: cache-size { { os linux } } ... read something from /proc ... ;Now suppose on Mac OS X we use different APIs per CPU: METHOD: cache-size { { os macosx } { cpu ppc } } ... ;But perhaps on ARM CPUs, we just use an instruction to read the cache size without any OS-specific calls at all: METHOD: cache-size { { cpu arm } } ... ;Now in this case, you have an issue where if you're on Linux and ARM, the method that ends up being called depends on the order in which they were defined. If you wanted to explicitly resolve this ambiguity, you would define a new method on { { os linux } { cpu arm } }; because it is more specific than the other two, it is always called first.The powerful thing about this new implementation of hooks is that not only can you dispatch on multiple variables, but you can add methods to any old generic which dispatches on a variable and the original designer of the generic does not have to explicitly take this into account. For example, Factor's compiler currently has a large number of hooks dispatching on the CPU type (as an aside, Phil Dawes wrote an excellent introduction to the Factor compiler recently). If those hooks need to be further refined by OS, as is often the case with FFI-related components, the method implementation on the hook needs to perform its own dispatch; this is the "double dispatch" pattern and design patterns are something to be avoided if one wants to write quality code. When multi-methods go in the core, the compiler will simply define a series of generic words taking no inputs from the stack, and each method will specialize on the CPU, and maybe an OS too. Another new capability is dispatching off stack values and variables in the same method. Among other things, this will be useful in eliminating a case of double dispatch in the core right now, where the <client> word for opening a client socket has to dispatch off the address type on the stack, and then call another hook which dispatches on the I/O backend stored in a variable. This can be combined into a single generic word where some methods dispatch on stack values, and others dispatch on the I/O backend variable.The other nice thing about this is that the multi-method GENERIC: word unifies and generalizes four words in the core, GENERIC:, GENERIC# for dispatching on a value further down on the stack, HOOK: for dispatching on a variable, and MATH: which performs double dispatch on numbers only.One of my goals for Factor 1.0 is to get the object system "right", and with the new-style slot accessors, inheritance, and singletons, we're almost there. All that remains to be done is to merge the multi-methods code. The code is still not quite ready to go in the core, though. The only feature that single dispatch has and multiple-dispatch lacks is call-next-method, which is easy to implement. A bigger hurdle to clear is performance; right now multi-methods are implemented in a naive way, where the dispatch time is O(mn) with m methods and n stack positions and variables to dispatch on. This can be improved significantly and I will find time reading the literature on optimizing method dispatch over the next few weeks.Tue, 8 Apr 2008 20:14:00 Phil Dawes: Digging into Factor’s compilerI wrote this post partly as an advocacy piece and partly to put down a bunch of things I’ve learnt about the factor compiler over the last few weeks. I should point out that I’m no expert in this area and so there are probably inaccuracies and omissions - hopefully Slava or one of the factor gurus will point them out. With that in mind, check this out!: Factor has an optimising compiler which generates machine code as you type code into the REPL. If you have gdb on your system you can see this in action by firing up factor and using the tools.disassembler vocab: ( scratchpad ) : hello "hello" print ; ! defines the word hello ( scratchpad ) USING: tools.disassembler ; ( scratchpad ) \ hello disassemble Using host libthread_db library “/lib/tls/i686/cmov/libthread_db.so.1″. [Thread debugging using libthread_db enabled] [New Thread -1213614400 (LWP 25688)] 0xffffe410 in __kernel_vsyscall () Dump of assembler code from 0xb0f694b0 to 0xb0f694c7: 0xb0f694b0: mov $0xb0f694b0,%ecx 0xb0f694b5: mov 0xb0f694e4,%ebx 0xb0f694bb: mov %ebx,0×4(%esi) 0xb0f694be: add $0×4,%esi 0xb0f694c1: jmp 0xb123e7d0 … End of assembler dump. This is very cool in itself, but for me the real beauty of the factor compiler is the very modular design, composed of small pieces that you can pull apart and tinker with in isolation. This makes the compiler accessible to people both as a learning tool and for those wanting to generate highly optimized code for tight loops. The three stages of the compiler are
I’ll dig into each of these steps in order: Stage 1: Parsing factor code to dataflow IRThe first step parses factor code into a dataflow datastructure. You can run and inspect the results of this yourself using the dataflow word:
USE: inference
[ "hello" print ] dataflow pprint
=> T{
#push
T{
node
f
f
f
V{ T{ value f “hello” 673850 f } }
f
f
f
f
f
f
T{
#call
T{
node
f
print
V{ T{ value f “hello” 673850 f } }
V{ }
f
f
f
f
f
f
T{
#return
T{ node f f V{ } f f f f f f f f f }
}
f
}
}
f
}
}
Obviously inspecting this datastructure manually is pretty cumbersome, so fortunately there’s some dataflow inspection functionality in the ‘optimizer.debugger’ vocab. The dataflow>quot word renders the dataflow structure back into quotations (code blocks) that you can print and inspect. I use it here to define some words for dataflow pretty-printing: USE: optimizer.debugger : print-dataflow f dataflow>quot pprint nl ; : print-annotated-dataflow t dataflow>quot pprint nl ; So now we can turn quotations into dataflow graphs and back again: [ "hello" print ] dataflow print-dataflow => [ “hello” print ] (N.B. there are already words in the optimizer.debugger vocab for displaying optimized dataflows, but for this post I wanted to be able to print dataflows prior to optimisation) This also works for pre-defined words using ‘word-dataflow’ : : print-hello "hello" print ; USE: generator \ print-hello word-dataflow print-dataflow => [ “hello” print ] In most cases the output quotation will be the same as the input quotation, however there are a couple of expansions that happen at this stage. The first is that words marked ‘inline’ are inlined directly into the dataflow: : inlinedword "this" "is" "an" "inlined" "word" ; inline [ inlinedword ] dataflow print-dataflow => [ “this” “is” “an” “inlined” “word” ] Also any compiler-transforms (macros) are evaluated at this stage. USE shuffle [ 1 2 2 ndup ] dataflow print-dataflow ! ndup is a macro => [ 1 2 2 drop 2 drop >r dup r> swap 2 drop >r dup r> swap ] Stage 2: Dataflow OptimisationHere’s where the fun starts. You can get a feel for how this stage works by looking at ‘optimizer.factor’. Here it is in its entirety:
! Copyright (C) 2006, 2008 Slava Pestov.
! See http://factorcode.org/license.txt for BSD license.
USING: kernel namespaces optimizer.backend optimizer.def-use
optimizer.known-words optimizer.math optimizer.control
optimizer.inlining inference.class ;
IN: optimizer
: optimize-1 ( node -- newnode ? )
[
H{ } clone class-substitutions set
H{ } clone literal-substitutions set
H{ } clone value-substitutions set
dup compute-def-use
kill-values
dup detect-loops
dup infer-classes
optimizer-changed off
optimize-nodes
optimizer-changed get
] with-scope ;
: optimize ( node -- newnode )
optimize-1 [ optimize ] when ;
‘optimize’ iteratively calls ‘optimize-1′ until nothing changes in the output graph - i.e. that it has reached a fixed point and no more optimizations can be performed. If you dig into the words used by optimize-1 (try executing them individually and inspecting the dataflow result) you’ll find that optimize-1 performs a number of inferences and optimizations:
Examples: evaluating foldable words at compile time‘+’ is a foldable word (see help for ‘+’), so the optimizer evaluates it at compile time if the values of both arguments are known. Here’s the dataflow before and after optimization: [ 2 3 + ] dataflow print-dataflow ! before optimization => [ 2 3 + ] [ 2 3 + ] dataflow optimize print-dataflow ! after optimization => [ 5 ] type inference and inlining generic word implementationsTo illustrate this we first create two tuples (classes) with constructors, and a generic word TUPLE: classa ; C: <classa> classa TUPLE: classb ; C: <classb> classb GENERIC: dosomething ( obj -- val ) Now we create an implementation of the generic word specialised for each class: M: classa dosomething drop "something for class a" ; M: classb dosomething drop "something for class b" ; Finally, some code which calls ‘dosomething’ with an instance of ‘classa’ on the stack, before and after optimization: [ <classa> dosomething ] dataflow print-dataflow ! before optimization => [ classa drop classa 2 <tuple -boa> dosomething ] [ <classa> dosomething ] dataflow optimize print-dataflow ! after optimization => [ classa 2 <tuple -boa> drop “something for class a” ] It’s a little messy because of the inlined tuple creation, but you can see that prior to optimization ‘dosomething’ is a word call in the dataflow, and afterwards the optimizer has inlined the implementation of ‘dosomething’ specialized on ‘classa’. (if you look you can also see an example of pruning literals here, as the first dataflow has resulted in a superflous ‘\ classa drop’). This is easier to see if I cheat a bit and use factor’s ‘declare’ word, which declares that elements on the top of the stack are instances of specific classes. So this quotation assumes that top stack element before it is called is of type ‘classa’:
[ { classa } declare dosomething ] dataflow optimize print-dataflow
=> [ drop “something for class a” ]
N.B. you wouldn’t normally use ‘declare’ in user code, but it could be really handy for optimizing performance sensitive tight loops where the results of an external word call are known to the programmer but not the compiler. conditional foldingThe compiler optimizes out conditional branches when it can deduce the outcome of the conditional at compile time: [ 1 0 = [ "do if true" ] [ "do if false" ] if ] dataflow optimize print-dataflow => [ “do if false” ] 1 isn’t equal to 0 so it optimizes this whole block into the contents of the false quotation. Stage 3: Machine Code GenerationCode generation is implemented by the ‘generate’ word. This iterates through the nodes calling ‘generate-node’ on each.
: generate-nodes ( node -- )
[ node@ generate-node ] iterate-nodes end-basic-block ;
: generate ( node word label -- )
[
init-generate-nodes
[ generate-nodes ] with-node-iterator
] with-generator ;
‘generate-node’ is a generic word with specialized implementations for each type of dataflow node. As described in this post from Slava’s excellent Factor blog there are a number of dataflow node types, the important ones being:
The generate-node implementations invoke lower level words in the ‘architecture’ vocabulary, which in turn are generic words that write out small pieces of machine code specialized for each CPU architecture. The machine code generation code is particularly cool and easy to follow because factor has an assembler DSL for each cpu architecture it supports. The assembler words match the commands and registers of the target cpu architecture and evaluate to their corresponding machine code. You can even try this out in the REPL using ‘make’ to collect the results into an array. I’m on x86 so I load the cpu.x86.assembler vocabulary:
USE: cpu.x86.assembler
[ EAX 35 MOV ] { } make . ! postfix assembler evaluates to machine code!
=> { 184 35 0 0 0 }
The assembler DSLs make code generation easy to follow because you can see the assembler in the generation code and then check it against the disassembled machine code using the ‘disassemble’ word we used at the start of the post. When following the code it helps to know which registers are used for what purpose. I found this information in assembler files in the factor VM source - I’m an x86 so for me the declares are in the ‘cpu-x86.32.S’ file:
#define ARG0 %eax
#define ARG1 %edx
#define XT_REG %ecx
#define STACK_REG %esp
#define DS_REG %esi
#define RETURN_REG %eax
#define CELL_SIZE 4
#define PUSH_NONVOLATILE
push %ebx ;
push %ebp
#define POP_NONVOLATILE
pop %ebp ;
pop %ebx
register CELL ds asm("esi");
register CELL rs asm("edi");
So lets check this against some generated code for a really basic word that just pops the number 42 on the stack: : myfunc 42 ; myfunc disassemble => Using host libthread_db library “/lib/tls/i686/cmov/libthread_db.so.1″. [Thread debugging using libthread_db enabled] [New Thread -1213696320 (LWP 32499)] 0xffffe410 in __kernel_vsyscall () Dump of assembler code from 0xb136b230 to 0xb136b247: 0xb136b230: mov $0xb136b230,%ecx 0xb136b235: mov $0×150,%ebx 0xb136b23a: mov %ebx,0×4(%esi) 0xb136b23d: add $0×4,%esi 0xb136b240: ret The first assembler line puts the address of the word into the XT_REG, which is %ecx. For some reason the start of each function puts it’s address into this register - not quite sure why. code generation optimizationsFactor has another couple of tricks up its sleeve during the code generation stages: optimizing shuffle wordsThe first is that stack shuffle words (e.g. dup, swap, tuck etc..) don’t get translated into machine code. Instead the compiler has a compile time ‘phantom stack’ which records the positions of items in the stack. When it generates the machine code values are accessed from the stack out of order (the runtime stack is after all a random access piece of memory). This makes stack shuffling words and the retain stack effectively ‘free’ within a code block. A #merge node in the dataflow signifies a code boundary (usually before a subroutine call) which causes the compiler to output instructions which synchronise the physical runtime stack with its phantom stack. word intrinsicsThe second trick is word ‘intrinsics’. Word intrinsics are essentially blocks of open-coded assembler that are output in place of a subroutine call. They are associated with the word via a ‘word-property’, which is a nifty feature of factor that allows meta information to be attached to each word. For example the ‘fixnum+fast’ word has intrinsics which you can see using ‘word-prop’:
fixnum+fast "intrinsics" word-prop pprint
=>
{
{
[ “x” operand “y” operand ADD ]
H{
{ +output+ { “x” } }
{ +input+ { { f “x” } { [ small-tagged? ] “y” } } }
}
}
{
[ “x” operand “y” operand ADD ]
H{
{ +output+ { “x” } }
{ +input+ { { f “x” } { f “y” } } }
}
}
}
This tells the compiler to inline the x86 ADD instruction instead of making a subroutine call to the implementation of fixnum+fast. You can add assembler intrinsics to existing words with ‘define-intrinsics’; Here’s a description from the help for the define-intrinsics word:
What I particularly like about this feature is that it neatly provides the ability to specialize a highly optimized implementation for a particular hardware set, and then fall back gracefully on other architectures. – That concludes my ad-hoc tour of the factor compiler. I’ve skipped over a number of things and no doubt there are bits I haven’t discovered yet and some inaccuracies, but I hope I’ve supplied enough information to spark interest in this excellent compiler. As I mentioned in a previous post I got interested in factor as a direct result of tinkering with jonesforth, which takes you through the entire forth bootstrap process starting with raw assembly. I’ve been delighted to find that factor retains a lot of the ‘right-down-to-the-metal’ accessibility of its low level cousin. Tue, 8 Apr 2008 08:27:16 |
planet-factor is an Atom/RSS aggregator that collects the contents of Factor-related blogs. It is inspired by |