[ planet-factor ]

John Benediktsson: Stack Complexity

We had a recent discussion on the Factor Discord server about how to deal with stack shuffling, and when and how you might use the various dataflow combinators that are available in Factor. Some of the impetus for this discussion was my article about intersecting ranges and the contrasting implementations that it presented.

The question that started it all:

Overall very complex stack states.

How do you write code like that?

Are you in the listener or debugger making these changes one by one?

When I write even simple code I get lost when there’s 3+ things on the stack I have to work with and spend more time figuring out stack order and correct combinators than with the actual problem. I wonder if there’s a good workflow where I can write the code iteratively.

Let’s go over some of the advice that came up in that conversation:

Mockups

I often write Factor code iteratively lately. Usually I come up with a tiny increment that preserves the stack effect, then reload the code to compile it.

For example, if I need to create a word with effect ( a b -- ), I will start with the minimal implementation like

: word ( a b -- )
    2drop ;

Mocking stuff every step of the way, compiling after each step.

For stuff inside a word that seems complex from the outset, I would create a mock (word) with a defined stack effect. I will inline it later if it turns out to be simpler than expected (because I find something in the standard library that would do most of the work).

Combinators

Some things come with practice, that’s true. But some tips:

  1. Keep the stack shallow; 1-3 items at a time often works best.
  2. If you need more than that, don’t be afraid to use named local variables.
  3. Try to use spread / apply / cleave combinators to represent your logic as dataflow.

Monoliths

I like writing monoliths, so sometimes I write a :: version first and refactor to : later, especially if I don’t know how the data will flow exactly.

When I’m porting code from other languages, I start with locals to keep the code blocks similar. Structure starts to become obvious once you have a big block working.

Practice

Sometimes writing inline comments showing what the stack contents are after each block of code is useful for visualizing the stack effects of the intermediate parts.

Sometimes there are existing words, like the vector operations, that clean up an iterative piece of code.

But mostly, keep practicing. Ask questions. Maybe get feedback on your code from us here or other devs you like to work with.

And sometimes, despite trying, it stays a big block of code that works; you shrug and move on to the next thing.

Variables

One way to reduce stack depth is, for example, to put some state on the namestack as a variable. That is how our stream words work (e.g., read, write, print). It’s how some words that accumulate output work – either using the make vocabulary or an equivalent vector or hash-set variable to accumulate into.

Reordering

Oh, and one advanced tip: sometimes reordering the arguments on the stack, to avoid shuffles or to allow using with / map / curry / fry, helps a lot.

It takes a long time to “get” it, but there are some rules that help. The order of the arguments to your words matters a lot; if you find yourself juggling constantly, consider a different argument order for your word. Stuff new context “under” your arguments with dip, e.g. a vector to collect results in. Look at accessors and see why they’re designed with a specific argument order; typically things that are “parameters” go on the top of the stack (e.g. quotations). Also, assocs have a certain logical order to them. If you understand that, it becomes more natural, with the occasional stack shuffle still needed.

And don’t forget about tuples; they quickly reduce the number of things on the stack.

Good luck and happy coding!

Thu, 11 Jul 2024 14:00:00

Doug Coleman: Magic Combinator

Magic Combinator

In response to a blog post about Stack Complexity in Factor, I figured I would add one way that I structure programs when possible.

Suppose you want to make a server that runs sql queries where a route name is a filename. Some requirements of this are connecting to a database, opening a directory of sql templates, loading them all into a hashtable to query by filename, handling incoming connections, doing the query, writing the results back out–you obviously can’t reason about all of this state on the stack at the same time. It would be ten items deep, and we want to keep the stack depth to 3-4 things at once in the same stack effect. (Of course the stack keeps growing deeper, but we need to reason about only a few things at any time.)

You can imagine a magic combinator, composed of other combinators, that will do most of this for you. Suppose you start with a with-database combinator that will connect and tear down the connection and handle exceptions. Next, you need to read the templates–with-template-directory to the rescue! To handle all the incoming requests you will need a with-server combinator. We should also set up some routes that map to the filenames, maybe call it with-routes.

Now that we have all the parts, let’s combine them into a combinator that works with dynamic variables and you have your magic combinator!

! some setup variables
SYMBOL: db-settings   ! initial settings for db
SYMBOL: server-port   ! port

! state within the combinators
SYMBOL: db-connection
SYMBOL: templates
SYMBOL: routes
SYMBOL: client-connection

: with-networked-database-template-directory ( template-directory quot: ( args template -- ) -- )
    [ db-settings get ] 2dip '[
        _ [
            [
                _ with-routes
            ] with-server
        ] with-template-directory
    ] with-database ; inline

Now your quot magically has the db connection, the templates, the routes, the client connection, and you can just concentrate on templating the query and getting the results, confident that this combinator takes care of the setup. You can access any of the state with code like db-connection get. If you need more state, just add another combinator. If there are problems, you can debug each combinator individually.
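
The same layered setup/teardown idea can be sketched in Python with nested context managers. Everything below is illustrative only: there is no real database, directory, or server here, and all the names are invented stand-ins for the combinators above.

```python
from contextlib import contextmanager

# Dynamic state that the nested "combinators" establish.
state = {}

@contextmanager
def with_database(settings):
    state["db-connection"] = f"connected({settings})"  # stand-in for a real connection
    try:
        yield
    finally:
        del state["db-connection"]  # teardown runs even if an exception occurs

@contextmanager
def with_template_directory(path):
    state["templates"] = {f"{path}/hello.sql": "SELECT 1"}  # pretend we read a directory
    try:
        yield
    finally:
        del state["templates"]

@contextmanager
def with_server(port):
    state["server-port"] = port
    try:
        yield
    finally:
        del state["server-port"]

@contextmanager
def magic_combinator(settings, path, port):
    # The "magic" is just composition: each layer sets up and tears down its own state.
    with with_database(settings), with_template_directory(path), with_server(port):
        yield

with magic_combinator("db.cfg", "templates", 8080):
    seen = sorted(state)  # all the state is visible here, like `db-connection get`
print(seen)  # ['db-connection', 'server-port', 'templates']
```

As with the Factor version, the body only sees fully set-up state, and each layer can be debugged on its own.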

Hopefully this will help you handle state in your Factor programs, or even in Python using the with keyword.

Thu, 11 Jul 2024 08:17:00

John Benediktsson: Random Distributions

As was the case when I got distracted by color support, I recently was distracted by random probability distributions. Other programming languages have these – for example the Python module numpy.random as well as the Julia module Distributions.jl.

In particular, I wanted to make sure we supported a bunch of the commonly used distributions, both continuous and discrete. I have added a few recently, and we now support quite a few in the random vocabulary:

TUPLE: bernoulli-distribution p ;
TUPLE: beta-distribution alpha beta ;
TUPLE: binomial-distribution n p ;
TUPLE: cauchy-distribution median scale ;
TUPLE: chi-square-distribution dof ;
TUPLE: exponential-distribution lambda ;
TUPLE: f-distribution dof-num dof-den ;
TUPLE: gamma-distribution alpha beta ;
TUPLE: geometric-distribution p ;
TUPLE: gumbel-distribution loc scale ;
TUPLE: inv-gamma-distribution shape scale ;
TUPLE: laplace-distribution mean scale ;
TUPLE: logistic-distribution loc scale ;
TUPLE: lognormal-distribution < normal-distribution ;
TUPLE: logseries-distribution p ;
TUPLE: normal-distribution mean sigma ;
TUPLE: pareto-distribution k alpha ;
TUPLE: poisson-distribution mean ;
TUPLE: power-distribution alpha ;
TUPLE: rayleigh-distribution mode ;
TUPLE: student-t-distribution dof ;
TUPLE: triangular-distribution low high ;
TUPLE: uniform-distribution min max ;
TUPLE: von-mises-distribution mu kappa ;
TUPLE: wald-distribution mean scale ;
TUPLE: weibull-distribution alpha beta ;
TUPLE: zipf-distribution a ;

For each of these, we define a convenient foo-random word, an implementation word foo-random* that takes a random-generator, and a foo-distribution tuple that can be passed to randoms to take many samples more quickly. For example, using the binomial distribution:

IN: scratchpad 100,000 [ 10 0.6 binomial-random ] replicate histogram .
H{
    { 0 21 }
    { 1 157 }
    { 2 1058 }
    { 3 4228 }
    { 4 11202 }
    { 5 20002 }
    { 6 25151 }
    { 7 21537 }
    { 8 11981 }
    { 9 4045 }
    { 10 618 }
}

IN: scratchpad 100,000
               T{ binomial-distribution { n 10 } { p 0.6 } }
               randoms histogram .
H{
    { 0 6 }
    { 1 164 }
    { 2 1044 }
    { 3 4255 }
    { 4 11169 }
    { 5 19928 }
    { 6 25103 }
    { 7 21414 }
    { 8 12335 }
    { 9 3973 }
    { 10 609 }
}

I would love to get some additional per-distribution generic methods to support calculating things like mean, variance, skewness, kurtosis, entropy, probability density/mass functions, etc. And, of course, would love more distributions to be available in Factor.
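
As a reference point for those per-distribution methods, the closed-form moments and probability mass function for the binomial case are standard results; here is a small Python sketch (not Factor code, just the math):

```python
import math

def binomial_pmf(n, p, k):
    """Probability mass function of the binomial distribution."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.6
mean = n * p                                  # 6.0
variance = n * p * (1 - p)                    # 2.4
skewness = (1 - 2 * p) / math.sqrt(variance)  # negative: left-skewed when p > 0.5

# The pmf lines up with the sampled histograms above:
# P(X = 6) ~ 0.2508, and 100,000 trials gave counts near 25,000 for 6.
print(mean, variance, round(binomial_pmf(n, p, 6), 4))
```

Generic words returning these quantities per distribution tuple would slot naturally alongside the existing foo-random words.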

There are always things to add!

Tue, 9 Jul 2024 17:00:00

John Benediktsson: Magic Forest

About ten years ago, there was a blog post about a Goats, Wolves, and Lions puzzle that was problem #30 from the 2014 Austrian “Math Kangaroo” contest. The puzzle was kind of fun and the implementation prompted some decent follow-up at the time.

There are three species of animals in a magic forest: lions, wolves and goats. Wolves can devour goats, and lions can devour wolves and goats. ("The stronger animal eats the weaker one".) As this is a magic forest, a wolf, after having devoured a goat, is transmuted into a lion; a lion, after having devoured a goat, is transmuted into a wolf; and a lion having devoured a wolf becomes a goat.

At the very beginning, there are 17 goats, 55 wolves and 6 lions in the forest. After every meal, there is one animal fewer than before; therefore after some time, there is no devouring possible any more.

What is the maximum number of animals who can live in the forest then?

There are two versions of this puzzle: one that doesn’t give any possible answers and the kangaroo version that gives these multiple choice options:

(A) 1
(B) 6
(C) 17
(D) 23
(E) 35

The original post was followed by a description of three ways to solve the Goats, Wolves, and Lions problem, then a brute-force solution in Fortran, then a comparison of solutions in different programming languages (of which the fastest was a C++ version), and then someone else described experiments and improvements in some other languages. The best answer, maybe, was the one that described using linear programming with lpsolve for an algorithmic solution that beats all the previous versions and even reduces to a simple formula that you can calculate by hand!

I had created a solution in Factor at the time, but had never written about it. Below, I want to go over that approach and compare its performance with the “fastest” C++ version.

We are going to store our forest state as a tuple representing the number of goats, wolves, and lions.

TUPLE: forest goats wolves lions ;

C: <forest> forest

: >forest< ( forest -- goats wolves lions )
    [ goats>> ] [ wolves>> ] [ lions>> ] tri ;

We can build words to represent each possible next state, or f if that next state is not possible:

: wolf-devours-goat ( forest -- forest/f )
    >forest< { [ pick 0 > ] [ over 0 > ] } 0&&
    [ [ 1 - ] [ 1 - ] [ 1 + ] tri* <forest> ] [ 3drop f ] if ;

: lion-devours-goat ( forest -- forest/f )
    >forest< { [ pick 0 > ] [ dup 0 > ] } 0&&
    [ [ 1 - ] [ 1 + ] [ 1 - ] tri* <forest> ] [ 3drop f ] if ;

: lion-devours-wolf ( forest -- forest/f )
    >forest< { [ dup 0 > ] [ over 0 > ] } 0&&
    [ [ 1 + ] [ 1 - ] [ 1 - ] tri* <forest> ] [ 3drop f ] if ;

We can track the set of next states by adding them to a set:

: next-forests ( set forest -- set' )
    [ wolf-devours-goat [ over adjoin ] when* ]
    [ lion-devours-goat [ over adjoin ] when* ]
    [ lion-devours-wolf [ over adjoin ] when* ] tri ;

Given a sequence of forest states, we can produce the next states after a meal has occurred:

: meal ( forests -- forests' )
    [ length 3 * <hash-set> ] keep [ next-forests ] each members ;

A forest is stable if there are no goats and either no wolves or no lions, or goats and no wolves and no lions:

: stable? ( forest -- ? )
    >forest< rot zero? [ [ zero? ] either? ] [ [ zero? ] both? ] if ;

We can say that devouring is possible if there are no stable forests:

: devouring-possible? ( forests -- ? )
    [ stable? ] none? ;

And maybe we want to be able to get all the stable forests:

: stable-forests ( forests -- stable-forests )
    [ stable? ] filter ;

So, to find our answer, we can just iterate, until we find our first stable forest state:

: find-stable-forests ( forest -- forests )
    1array [ dup devouring-possible? ] [ meal ] while stable-forests ;

And, now we can answer the original question – which is (D):

IN: scratchpad T{ forest f 17 55 6 } find-stable-forests .
{ T{ forest { goats 0 } { wolves 0 } { lions 23 } } }

We can compare the performance of Factor with the C++ version from that earlier blog post and see that ours is around 5.5x slower:

IN: scratchpad [ T{ forest f 317 355 306 } find-stable-forests ] time .
Running time: 4.625607999 seconds
{ T{ forest { goats 0 } { wolves 0 } { lions 623 } } }

versus

$ time ./magic_forest 317 355 306
0, 0, 623
./magic_forest 317 355 306  0.80s user 0.04s system 99% cpu 0.848 total

One of the reasons for that is we’re doing a lot of generic dispatch, not specifying types anywhere, and passing tuples around that we are accessing and allocating frequently. We could fix some of these issues and make ours faster…

But, approaching this as a linear programming exercise beats all the iterative approaches anyway:

IN: scratchpad T{ forest f 17 55 6 } >forest< min + .
23

IN: scratchpad T{ forest f 317 355 306 } >forest< min + .
623

IN: scratchpad T{ forest f 900006 900055 900017 } >forest< min + .
1800023
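
The brute-force search translates almost mechanically to Python, which makes it easy to cross-check the closed-form shortcut. This is a sketch mirroring the Factor words above, not a performance contender:

```python
def next_states(g, w, l):
    """The three devouring rules from the puzzle."""
    if g > 0 and w > 0: yield (g - 1, w - 1, l + 1)  # wolf devours goat, becomes lion
    if g > 0 and l > 0: yield (g - 1, w + 1, l - 1)  # lion devours goat, becomes wolf
    if w > 0 and l > 0: yield (g + 1, w - 1, l - 1)  # lion devours wolf, becomes goat

def stable(g, w, l):
    """A forest is stable when no devouring is possible."""
    return next(next_states(g, w, l), None) is None

def find_stable_forests(g, w, l):
    """Frontier iteration mirroring the Factor version: advance every
    reachable forest one meal at a time until a stable forest appears."""
    frontier = {(g, w, l)}
    while not any(stable(*f) for f in frontier):
        frontier = {n for f in frontier for n in next_states(*f)}
    return {f for f in frontier if stable(*f)}

print(find_stable_forests(17, 55, 6))  # {(0, 0, 23)}
print(17 + min(55, 6))                 # 23, the closed-form answer
```

The closed form is just goats + min(wolves, lions), which is what `>forest< min +` computes above.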

Still, there are probably performance lessons to be learned here…

Sat, 29 Jun 2024 13:00:00

John Benediktsson: Man or Boy

The man or boy test was proposed by Donald Knuth:

“There are quite a few ALGOL60 translators in existence which have been designed to handle recursion and non-local references properly, and I thought perhaps a little test-program may be of value. Hence I have written the following simple routine, which may separate the man-compilers from the boy-compilers.”

— Donald Knuth

The Rosetta Code project has a list of implementations to compare with, and today we are going to contribute one in Factor. While the original was framed in ALGOL 60, let’s look at one contributed in Python, which is probably more readable for most of the audience:

#!/usr/bin/env python
import sys
sys.setrecursionlimit(1025)

def a(k, x1, x2, x3, x4, x5):
    def b():
        b.k -= 1
        return a(b.k, b, x1, x2, x3, x4)
    b.k = k
    return x4() + x5() if b.k <= 0 else b()

x = lambda i: lambda: i
print(a(10, x(1), x(-1), x(-1), x(1), x(0)))

In particular, this is challenging because it involves creating a “tree of B call frames that refer to each other and to the containing A call frames, each of which has its own copy of k that changes every time the associated B is called.” As a result, many languages have difficulty computing it for large values of k due to recursion or depth limits.

In Factor, we have the ability to create inner computations by making quotations, which is essentially an anonymous version of the inner B function. These quotations are allowed to access the variables defined in their outer scope. Thus, we can implement the word reasonably simply by making a B quotation, binding it to a variable for reference, and then calling it:

:: a ( k! x1 x2 x3 x4 x5 -- n )
    k 0 <= [
        x4 call( -- n ) x5 call( -- n ) +
    ] [
        f :> b!
        [ k 1 - dup k! b x1 x2 x3 x4 a ] b!
        b call( -- n )
    ] if ;

The k argument is an integer, the others are quotations with stack effect of ( -- n ), meaning they take no arguments, and produce a number when called. We can call it and demonstrate that it works and produces the first few values of the output of Knuth’s “man or boy” test for varying k:

IN: scratchpad 13 [0..b] [
                   [ 1 ] [ -1 ] [ -1 ] [ 1 ] [ 0 ] a .
               ] each
1
0
-2
0
1
0
1
-1
-10
-30
-67
-138
-291
-642
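
For cross-checking, the Python listing above can be adapted to print the same first few values (repeated here so the snippet is self-contained, with a raised recursion limit):

```python
import sys
sys.setrecursionlimit(10000)

def a(k, x1, x2, x3, x4, x5):
    def b():
        b.k -= 1
        return a(b.k, b, x1, x2, x3, x4)
    b.k = k
    return x4() + x5() if b.k <= 0 else b()

x = lambda i: lambda: i

# the same sequence as the Factor output, for k = 0..7
values = [a(k, x(1), x(-1), x(-1), x(1), x(0)) for k in range(8)]
print(values)  # [1, 0, -2, 0, 1, 0, 1, -1]
```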

Presently, larger values of k produce a “retain stack overflow” with the default Factor settings. I looked briefly into using our bend vocabulary, but it currently doesn’t support accessing the outer variable scope, and requires passing arguments on the stack. That would be a nice feature, and theoretically would then look like this simple example:

:: a ( k! x1 x2 x3 x4 x5 -- n )
    k 0 <= [
        x4 call( -- n ) x5 call( -- n ) +
    ] [
        BEND[ k 1 - dup k! [ fork ] x1 x2 x3 x4 a ]
    ] if ;

However, that doesn’t work at the moment. Maybe sometime in the future!

Wed, 26 Jun 2024 15:00:00

Doug Coleman: It's 2024 Already

We survived covid and we’re foolishly gearing up for WW3. What a great time to be a Factor blogger!

I got a job doing cmake and build systems in 2016 and since 2017 I’ve been doing JS/python/React/SOLR. I’m still working on Factor, so there should be a lot to post about.

Thanks to John for the sweet blogging setup!

Tue, 25 Jun 2024 08:17:00

John Benediktsson: Quit

There is a funny recurring meme that we are all living inside a simulation or maybe even a matrix-in-a-matrix. Some people have even taken this as far as selling funny computer simulation t-shirts:

Well, I have resisted for a long time adding a special end-program command, quit, to Factor. We have always supported Ctrl-D to end the listener session (which works in both the UI listener and in the command-line on most POSIX systems). You have also been able to exit with an error code, or even to stop after the last window has been closed. And if you really needed something, you could define your own end-program word in your startup initialization file.

This came up again recently in a discussion with @nomennescio, one of our contributors, who has been pushing for a quit command that would behave kind of like Python’s, as described in How to Exit a Python Program in the Terminal:

Python 3.11.9 (main, Apr  2 2024, 08:25:04) [Clang 15.0.0 (clang-1500.3.9.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> quit
Use quit() or Ctrl-D (i.e. EOF) to exit
>>> quit()

This has become more important since they added saving listener history to the UI tools (currently saved to a history file when the listener window closes). This was paired with a couple of changes to make Factor attempt to close-all-windows and cleanup nicely using a shutdown hook and include the system vocabulary in the default list of interactive vocabularies.

This means that this works now, in both the command-line listener as well as the UI listener:

Factor 0.100 x86.64 (2271, heads/master-250db4215b, Jun  4 2024 18:08:03)
[Clang (GCC Homebrew Clang 18.1.6)] on macosx
IN: scratchpad quit

If you’re curious how it works, quit is an alias for calling exit with an error code of 0:

IN: system

: quit ( -- * ) 0 exit ;

This is available in a development version of Factor and will be in the next release.

Wed, 12 Jun 2024 15:00:00

John Benediktsson: Bend

The Bend programming language is a massively parallel, high-level programming language. It has some really neat dataflow concepts that make it inherently parallel because it infers the flow of variables and then naturally performs divide and conquer to utilize all available computing resources, both CPU and GPU.

This results in some impressive speedups:

Yet, since it uses a divide-and-conquer approach, which is inherently parallel, Bend will run it multi-threaded. Some benchmarks:

CPU, Apple M3 Max, 1 thread: 12.15 seconds

CPU, Apple M3 Max, 16 threads: 0.96 seconds

GPU, NVIDIA RTX 4090, 16k threads: 0.21 seconds

That’s a 57x speedup by doing nothing.

One of our current challenges in Factor is that we use green threads and are effectively single-threaded for computations. We have a non-blocking I/O implementation as well as integration with UI event loops, so it ends up feeling more concurrent than it actually is.

We hope to solve that in the future, both by supporting native threads and also by a multi-vm approach that is somewhat equivalent to the Python multiprocessing module. Some early ideas have been contributed towards this goal, but so far it is not complete.

Bend Example

One of the Bend examples that we are going to try and implement in Factor uses the bend and fork operations, which create concurrency points in a kind of fork-join model:

# given a shader, returns a square image
def render(depth, shader):
  bend d = 0, i = 0:
    when d < depth:
      color = (fork(d+1, i*2+0), fork(d+1, i*2+1))
    else:
      width = depth / 2
      color = shader(i % width, i / width)
  return color

# given a position, returns a color
# for this demo, it just busy loops
def demo_shader(x, y):
  bend i = 0:
    when i < 5000:
      color = fork(i + 1)
    else:
      color = 0x000001
  return color

# renders a 256x256 image using demo_shader
def main:
  return render(16, demo_shader)

Factor Syntax

A few days ago, Keldan Chapman contributed an early version of a bend vocabulary. And then, we spent some time working together on the Factor Discord server on some alternative implementations, one of which I want to go over below.

We are going to create a BEND[ quotation that provides support for a fork word that implicitly recurses (but not currently through CPU or GPU parallelism) and then joins the results together. We create an uninterned word to hold our computation and use with-words to make the fork word valid only in the parsing scope, where it refers back to that word to recurse.

SYNTAX: BEND[
    gensym dup
    dup "fork" associate [ parse-quotation ] with-words
    dup infer define-declared suffix! ;

A simple use-case of this might be to compute the factorial through recursion:

: factorial ( n -- n! )
    BEND[ dup 1 > [ dup 1 - fork * ] when ] ;

Or, perhaps computing Fibonacci numbers through recursion:

: fibonacci ( n -- fib )
    BEND[ dup 1 > [ [ 1 - fork ] [ 2 - fork ] bi + ] when ] ;
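
Setting aside Factor’s stack semantics and any parallelism, the shape of BEND[ ... fork ... ] can be modeled in Python as a body that receives itself to recurse with. The `bend` helper below is invented for illustration, not part of any library:

```python
def bend(step):
    """Rough sketch of what BEND[ ... ] provides: fork is the quotation
    calling back into itself (plain recursion, no parallelism here)."""
    def fork(*args):
        return step(fork, *args)
    return fork

# factorial: BEND[ dup 1 > [ dup 1 - fork * ] when ]
factorial = bend(lambda fork, n: n * fork(n - 1) if n > 1 else n)

# fibonacci: BEND[ dup 1 > [ [ 1 - fork ] [ 2 - fork ] bi + ] when ]
fibonacci = bend(lambda fork, n: fork(n - 1) + fork(n - 2) if n > 1 else n)

print(factorial(5), fibonacci(10))  # 120 55
```

In Bend proper, each fork is a point where the runtime may split the work across threads; here it simply recurses.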

Factor Example

With that syntax defined, we can now translate the above example to the following form:

: render ( depth shader -- color )
    0 0 BEND[
        [let :> ( depth shader d i )
            d depth < [
                depth shader d 1 + i 2 * fork
                depth shader d 1 + i 2 * 1 + fork
                2array
            ] [
                i depth 2/ /mod swap shader call( x y -- color )
            ] if
        ]
    ] ;

:: demo-shader ( x y -- color )
    0 BEND[ dup 5000 < [ 1 + fork ] [ drop 0x000001 ] if ] ;

: main ( -- color )
    16 [ demo-shader ] render ;

Obviously, this fork doesn’t (currently) increase performance, and it might shadow the fork from the unix.process vocabulary, but it could represent the start of a new computational approach in Factor.

I can’t wait to see more from the Bend programming language and I also hope to see more of these ideas appearing and improving in Factor!

Wed, 5 Jun 2024 15:00:00

planet-factor is an Atom/RSS aggregator that collects the contents of Factor-related blogs. It is inspired by Planet Lisp.
