Jaro-Winkler distance is a measure of string similarity and edit distance between two sequences:
The higher the Jaro–Winkler distance for two strings is, the less similar the strings are. The score is normalized such that 0 means an exact match and 1 means there is no similarity. The original paper actually defined the metric in terms of similarity, so the distance is defined as the inversion of that value (distance = 1 − similarity).
There are actually two different concepts – and Rosetta Code tasks – implied by this algorithm: the Jaro similarity and the Jaro-Winkler distance.
Let’s build an implementation of these in Factor!
The base that all of these are built upon is the Jaro similarity. It is calculated as a score by measuring the number of matches (m) between the strings, counting the number of transpositions divided by 2 (t), and then returning a weighted score using the lengths of each sequence (|s1| and |s2|) in the formula:

sim_j = 0, if m = 0
sim_j = (1/3) × (m / |s1| + m / |s2| + (m − t) / m), otherwise

In particular, it considers a matching character to be one that is found in the other string within a match distance away, calculated by the formula:

match distance = ⌊max(|s1|, |s2|) / 2⌋ − 1
There are multiple ways to go about this, with varying performance, but I decided one longer function was simpler to understand than breaking out the steps into their own words. We use a bit-array to efficiently track which characters have been matched already as we iterate:
:: jaro-similarity ( s1 s2 -- n )
s1 s2 [ length ] bi@ :> ( len1 len2 )
len1 len2 max 2/ 1 [-] :> delta
len2 <bit-array> :> flags
s1 [| ch i |
i delta [-] :> from
i delta + 1 + len2 min :> to
from to [| j |
j flags nth [ f ] [
ch j s2 nth = dup j flags set-nth
] if
] find-integer-from
] filter-index
[ 0 ] [
[ length ] keep s2 flags [ nip ] 2filter [ = not ] 2count
:> ( #matches #transpositions )
#matches len1 /f #matches len2 /f +
#matches #transpositions 2/ - #matches /f + 3 /
] if-empty ;
The Jaro distance is then just a subtraction:
: jaro-distance ( s1 s2 -- n )
jaro-similarity 1.0 swap - ;
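As a quick sanity check of that convention – 0 for an exact match and 1 for no similarity – we can try two obvious cases in the listener (outputs shown as expected from the code above):

IN: scratchpad "factor" "factor" jaro-distance .
0.0
IN: scratchpad "abc" "xyz" jaro-distance .
1.0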
I’m curious if anyone else has a simpler implementation – please share!
The Jaro-Winkler similarity builds upon this by factoring in the length of the common prefix (l, considering at most 4 characters) times a constant scaling factor (p) that is usually set to 0.1 in most implementations I've seen:

sim_w = sim_j + l × p × (1 − sim_j)
We can implement this by calculating the Jaro similarity, computing the length of the common prefix, and then generating the result:
:: jaro-winkler-similarity ( s1 s2 -- n )
s1 s2 jaro-similarity :> jaro
s1 s2 min-length 4 min :> len
s1 s2 [ len head-slice ] bi@ [ = ] 2count :> #common
1 jaro - #common 0.1 * * jaro + ;
The Jaro-Winkler distance is again just a subtraction:
: jaro-winkler-distance ( a b -- n )
jaro-winkler-similarity 1.0 swap - ;
The Wikipedia article compares the similarity of FARMVILLE and FAREMVIEL:
IN: scratchpad "FARMVILLE" "FAREMVIEL" jaro-similarity .
0.8842592592592592
We can also see that the algorithm considers the transposition of two close characters to be less of a penalty than the transposition of two characters farther away from each other. It also penalizes additions and substitutions of characters that cannot be expressed as transpositions.
IN: scratchpad "My string" "My tsring" jaro-winkler-similarity .
0.9740740740740741
IN: scratchpad "My string" "My ntrisg" jaro-winkler-similarity .
0.8962962962962963
We can compare rough performance against Julia using the same algorithm:
julia> using Random
julia> s = randstring(10_000)
julia> t = randstring(10_000)
julia> @time jarowinklerdistance(s, t)
1.492011 seconds (108.32 M allocations: 2.178 GiB, 1.87% gc time)
0.19016926812348256
Note: I'm not a Julia developer, I just play one on TV. I adapted this implementation in Julia, which originally took over 4.5 seconds. A better developer could probably improve it quite a bit. In fact, it was pointed out that we are indexing UTF-8 String in a loop, and should instead collect the Char into a Vector first. That does indeed make it super fast.
Compare that to the implementation in Factor that we built above, which runs quite a bit faster:
IN: scratchpad USE: random.data
IN: scratchpad 10,000 random-string
10,000 random-string
gc [ jaro-winkler-distance ] time .
Running time: 0.259643166 seconds
0.1952856823031448
That's not bad for a first version that uses safe indexing with unnecessary bounds-checking, uses generic iteration over integers when the indices are usually fixnum instances (something I hope to fix automatically someday), and should probably order the input sequences by length for consistency.
If we fix those problems, it gets even faster:
IN: scratchpad USE: random.data
IN: scratchpad 10,000 random-string
10,000 random-string
gc [ jaro-winkler-distance ] time .
Running time: 0.068086625 seconds
0.19297898770334765
This is available in the development version in the math.similarity and the math.distances vocabularies.
The “Best shuffle” is a Rosetta Code task that was not yet implemented in Factor:
Task
Shuffle the characters of a string in such a way that as many of the character values are in a different position as possible.
A shuffle that produces a randomized result among the best choices is to be preferred. A deterministic approach that produces the same sequence every time is acceptable as an alternative.
Display the result as follows:
original string, shuffled string, (score)
The score gives the number of positions whose character value did not change.
There are multiple ways to approach this problem, but the way that most solutions seem to take is to shuffle two sets of indices, and then iterate through them swapping the characters in the result if they are different.
I wanted to contribute a solution in Factor, using local variables and short-circuit combinators:
:: best-shuffle ( str -- str' )
str clone :> new-str
str length :> n
n <iota> >array randomize :> range1
n <iota> >array randomize :> range2
range1 [| i |
range2 [| j |
{
[ i j = ]
[ i new-str nth j new-str nth = ]
[ i str nth j new-str nth = ]
[ i new-str nth j str nth = ]
} 0|| [
i j new-str exchange
] unless
] each
] each
new-str ;
And we can write some code to display the result as requested:
: best-shuffle. ( str -- )
dup best-shuffle 2dup [ = ] 2count "%s, %s, (%d)\n" printf ;
And then print some test cases:
IN: scratchpad {
"abracadabra"
"seesaw"
"elk"
"grrrrrr"
"up"
"a"
} [ best-shuffle. ] each
abracadabra, raabaracdab, (0)
seesaw, easwse, (0)
elk, lke, (0)
grrrrrr, rrrrgrr, (5)
up, pu, (0)
a, a, (1)
This is reminiscent of the recent work I did on derangements and generating a random derangement. While this approach does not generate a perfect derangement of the indices – and happens to be accidentally quadratic – it is somewhat similar, with the additional step of checking not only that the indices are different, but also that the contents are different before swapping.
Dotenv is an informal file specification, a collection of implementations in different languages, and an organization providing cloud-hosting services. They describe the .env file format and some extensions:
The .env file format is central to good DSX and has been since it was introduced by Heroku in 2012 and popularized by the dotenv node module (and other libraries) in 2013.
The .env file format starts where the developer starts - in development. It is added to each project but NOT committed to source control. This gives the developer a single secure place to store sensitive application secrets.
Can you believe that prior to introducing the .env file, almost all developers stored their secrets as hardcoded strings in source control. That was only 10 years ago!
Besides official and many unofficial .env parsers available in a lot of languages, the Dotenv organization provides support for dotenv-vault cloud services in Node.js, Python, Ruby, Go, PHP, and Rust.

Today, I wanted to show how you might implement a .env parser in Factor.

The .env files are a relatively simple format with key-value pairs that are separated by an equal sign. These values can be un-quoted, single-quoted, double-quoted, or backtick-quoted strings:
SIMPLE=xyz123
INTERPOLATED="Multiple\nLines"
NON_INTERPOLATED='raw text without variable interpolation'
MULTILINE = `long text here,
e.g. a private SSH key`
There are a lot of ways to build a parser – everything from manually spinning through bytes using a hand-coded state machine, to higher-level parsing grammars like PEG, to explicit parsing syntax forms like EBNF.
We are going to implement a .env parser using standard PEG parsers, beginning with some parsers that look for whitespace, comment lines, and newlines:
: ws ( -- parser )
[ " \t" member? ] satisfy repeat0 ;
: comment ( -- parser )
"#" token [ CHAR: \n = not ] satisfy repeat0 2seq hide ;
: newline ( -- parser )
"\n" token "\r\n" token 2choice ;
The .env keys are specified simply:
For the sake of portability (and sanity), environment variable names (keys) must consist solely of letters, digits, and the underscore ( _ ) and must not begin with a digit. In regex-speak, the names must match the following pattern:
[a-zA-Z_]+[a-zA-Z0-9_]*
We can build a key parser by looking for those characters:
: key-parser ( -- parser )
CHAR: A CHAR: Z range
CHAR: a CHAR: z range
[ CHAR: _ = ] satisfy 3choice
CHAR: A CHAR: Z range
CHAR: a CHAR: z range
CHAR: 0 CHAR: 9 range
[ CHAR: _ = ] satisfy 4choice repeat0
2seq [ first2 swap prefix "" like ] action ;
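As a quick check, we can run the key parser by itself with parse from the peg vocabulary (a hypothetical listener session, assuming the definitions above are loaded):

IN: scratchpad USE: peg
IN: scratchpad "DATABASE_URL" key-parser parse .
"DATABASE_URL"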
The .env values can be un-quoted, single-quoted, double-quoted, or backtick-quoted strings. Only double-quoted strings support general escape characters, while single-quoted and backtick-quoted strings only support escaping backslashes and the quote character itself (a single-quote or backtick, respectively).
: single-quote ( -- parser )
"\\" token hide [ "\\'" member? ] satisfy 2seq [ first ] action
[ CHAR: ' = not ] satisfy 2choice repeat0 "'" dup surrounded-by ;
: backtick ( -- parser )
"\\" token hide [ "\\`" member? ] satisfy 2seq [ first ] action
[ CHAR: ` = not ] satisfy 2choice repeat0 "`" dup surrounded-by ;
: double-quote ( -- parser )
"\\" token hide [ "\"\\befnrt" member? ] satisfy 2seq [ first escape ] action
[ CHAR: " = not ] satisfy 2choice repeat0 "\"" dup surrounded-by ;
: literal ( -- parser )
[ " \t\r\n" member? not ] satisfy repeat0 ;
Before we implement our value parser, we should note that some values can be interpolated:
Interpolation (also known as variable expansion) is supported in environment files. Interpolation is applied for unquoted and double-quoted values. Both braced (${VAR}) and unbraced ($VAR) expressions are supported.

- Direct interpolation: ${VAR} -> value of VAR
- Default value: ${VAR:-default} -> value of VAR if set and non-empty, otherwise default
- ${VAR-default} -> value of VAR if set, otherwise default
And some values can have command substitution:
Add the output of a command to one of your variables in your .env file. Command substitution is applied for unquoted and double-quoted values.
- Direct substitution: $(whoami) -> value of $ whoami
We can implement an interpolate parser that acts on strings and replaces observed variables with their interpolated or command-substituted values. This uses a regular expression and re-replace-with to substitute values appropriately.
: interpolate-value ( string -- string' )
R/ \$\([^)]+\)|\$\{[^\}:-]+(:?-[^\}]*)?\}|\$[^(^{].+/ [
"$(" ?head [
")" ?tail drop process-contents [ blank? ] trim
] [
"${" ?head [ "}" ?tail drop ] [ "$" ?head drop ] if
":-" split1 [
[ os-env [ empty? not ] keep ] dip ?
] [
"-" split1 [ [ os-env ] dip or ] [ os-env ] if*
] if*
] if
] re-replace-with ;
: interpolate ( parser -- parser )
[ "" like interpolate-value ] action ;
We can use that to build a value parser, remembering that only un-quoted and double-quoted values are interpolated, and making sure to convert the result to a string:
: value-parser ( -- parser )
[
single-quote ,
double-quote interpolate ,
backtick ,
literal interpolate ,
] choice* [ "" like ] action ;
Combining those, we can make a key-value parser that ignores whitespace around the = token and uses set-os-env to update the environment variables:
: key-value-parser ( -- parser )
[
key-parser ,
ws hide ,
"=" token hide ,
ws hide ,
value-parser ,
] seq* [ first2 swap set-os-env ignore ] action ;
And finally, we can build a parsing word that looks for these key-value pairs while ignoring optional comments and whitespace:
PEG: parse-dotenv ( string -- ast )
ws hide key-value-parser optional
ws hide comment optional hide 4seq
newline list-of hide ;
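A quick sketch of the parser in action – note that, because of the action in the key-value parser, parsing has the side effect of setting the environment variable (hypothetical listener session):

IN: scratchpad "FOO=bar" parse-dotenv drop
IN: scratchpad "FOO" os-env .
"bar"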
We can load a file by reading the file-contents and then parsing it into environment variables:
: load-dotenv-file ( path -- )
utf8 file-contents parse-dotenv drop ;
These .env files are usually located somewhere above the current directory, typically at a project root. For now, we make a word that traverses from the current directory up to the root, looking for the first .env file that exists:
: find-dotenv-file ( -- path/f )
f current-directory get absolute-path [
nip
[ ".env" append-path dup file-exists? [ drop f ] unless ]
[ ?parent-directory ] bi over [ f ] [ dup ] if
] loop drop ;
And now, finally, we can find and then load the relevant .env file, if there is one:
: load-dotenv ( -- )
find-dotenv-file [ load-dotenv-file ] when* ;
We can make a simple .env file:
$ cat .env
HOST="${HOST:-localhost}"
PORT="${PORT:-80}"
URL="https://${HOST}:${PORT}/index.html"
And then try it out, overriding the PORT environment variable:
$ PORT=8080 ./factor
IN: scratchpad USE: dotenv
IN: scratchpad load-dotenv
IN: scratchpad "URL" os-env .
"https://localhost:8080/index.html"
There are some additional features of .env files that we might want to follow up on.

This is available in the latest development version. Check it out!
Factor has a neat feature in the prettyprint vocabulary that allows printing objects, typically as valid source literal expressions. There are small caveats to that regarding circularity, depth limits, and other prettyprint control variables, but it's roughly true that you can pprint most everything and have it be useful.
At some point in the past few years, I noticed that Xcode and Swift Playground have support for color literals that are rendered in the source code. You can see that in this short video describing how it works:
Inspired by that – and a past effort at color tab completion – I thought it would be fun to show how you might extend our color support to allow colors to be prettyprinted with a little gadget in the UI that renders their colors.
First, we need to define a section object that holds a color and renders it using a colored border gadget.
TUPLE: color-section < section color ;
: <color-section> ( color -- color-section )
1 color-section new-section swap >>color ;
M: color-section short-section
" " <label> { 5 0 } <border>
swap color>> <solid> >>interior
COLOR: black <solid> >>boundary
output-stream get write-gadget ;
Next, we extend pprint* with a custom implementation for any color type as well as our named colors that adds a color section to the output block:
M: color pprint*
<block
[ call-next-method ]
[ <color-section> add-section ] bi
block> ;
M: parsed-color pprint*
<block
[ \ COLOR: pprint-word string>> text ]
[ <color-section> add-section ] bi
block> ;
And, now that we have that, we can push some different colors to the stack and see how they are all displayed:
Pretty cool.
I did not commit this yet – partly because I’m not sure we want this as-is and also partly because it needs to only display the gadget if the UI is running. We also might want to consider the UI theme and choose a nice contrasting color for the border element.
Peter Bengtsson wrote about building a Python dict that can report which keys you did not use:
This can come in handy if you’re working with large Python objects and you want to be certain that you’re either unit testing everything you retrieve or certain that all the data you draw from a database is actually used in a report.
For example, you might have a SELECT fieldX, fieldY, fieldZ FROM ... SQL query, but in the report you only use fieldX, fieldY in your CSV export.

class TrackingDict(dict):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._accessed_keys = set()

    def __getitem__(self, key):
        self._accessed_keys.add(key)
        return super().__getitem__(key)

    @property
    def accessed_keys(self):
        return self._accessed_keys

    @property
    def never_accessed_keys(self):
        return set(self.keys()) - self._accessed_keys
We can build a version of this in Factor intended to show off a few language features. The original version in Python used inheritance rather than composition to implement the data structure. Instead, we build a data structure that wraps an existing assoc and delegates to it.
First, a simple tuple definition that will have the underlying assoc and a set of accessed keys:
TUPLE: tracking-assoc underlying accessed-keys ;
: <tracking-assoc> ( underlying -- tracking-assoc )
HS{ } clone tracking-assoc boa ;
INSTANCE: tracking-assoc assoc
We then implement the assoc protocol by using delegation to the underlying assoc, with an override for tracking accessed keys:
CONSULT: assoc-protocol tracking-assoc underlying>> ;
M: tracking-assoc at*
[ underlying>> at* ] [ accessed-keys>> adjoin ] 2bi ;
And for fun – since we could have built a normal word to do this – we define a protocol slot that we then implement to compute the never accessed keys:
SLOT: never-accessed-keys
M: tracking-assoc never-accessed-keys>>
[ underlying>> keys ] [ accessed-keys>> ] bi diff ;
And we show it works using a simple example from the original blog post:
H{
{ "name" "John Doe" }
{ "age" 30 }
{ "email" "jd@example.com" }
} <tracking-assoc>
"name" over at "John Doe" assert=
[ accessed-keys>> "Accessed keys: %u\n" printf ]
[ never-accessed-keys>> "Never accessed keys: %u\n" printf ] bi
Which prints this:
Accessed keys: HS{ "name" }
Never accessed keys: HS{ "email" "age" }
Fun!
Admittedly, I struggle sometimes when I read the word “monoid”. It always seems to remind me of that saying “A Monad is just a Monoid in the Category of Endofunctors”, which is a tongue-twister, requires repeated effort to understand, and is sometimes used in casual conversation when jealously describing the features and capabilities of the Haskell programming language.
In any event, the topic of monoids came up recently on the Factor Discord server. Slava Pestov, the original creator of the Factor programming language, was describing recent work he was doing on some fun mathematical problems:
I’m searching for examples of finitely-presented monoids that cannot be presented by finite complete rewriting systems:
- ⟨a, b | aba=aa, baa=aab⟩ – my first result in this space.
- ⟨a, b | bab=aaa, bbb=bb⟩ – explore the equivalence class of a^8 in this remarkable monoid.
He clarified in the discussion that “the Knuth-Bendix algorithm can solve many cases but not these two, which is how I found them in the first place”.
The second link above – made extra fun because it uses a=🍎 and b=🍌 to make a more emojiful experience – describes this specific problem in more detail and presents it as a simple game to play. You can see the available rules, the current state, and the next possible states achieved by applying either of the rules, which are bi-directional.
Your pie recipe calls for 10 apples, but you only have 8 apples.
Can you turn your 8 apples into 10 apples with these two magic spells?
- 🍌🍎🍌 ↔️ 🍎🍎🍎
- 🍌🍌🍌 ↔️ 🍌🍌
Current state:
🍎🍎🍎🍎🍎🍎🍎🍎
Tap to cast a spell:
- 🍌🍎🍌🍎🍎🍎🍎🍎
- 🍎🍌🍎🍌🍎🍎🍎🍎
- 🍎🍎🍌🍎🍌🍎🍎🍎
- 🍎🍎🍎🍌🍎🍌🍎🍎
- 🍎🍎🍎🍎🍌🍎🍌🍎
- 🍎🍎🍎🍎🍎🍌🍎🍌
When exploring things like this, many questions come to mind.
After the link was shared, I must have clicked through about 5000 different state transitions hoping to randomly stumble upon the solution. And, eventually, I recognized that it might be a good idea – or even possibly poetic – to do that exploration using Factor.
Warning: Spoilers ahead!
Let’s start by writing the rules:
CONSTANT: rules {
{ "bab" "aaa" }
{ "bbb" "bb" }
}
For convenience, we will make a sequence containing all the rules – since these are bi-directional and can be applied in either direction – using a literal expression.
CONSTANT: all-rules $[
rules dup [ swap ] assoc-map append
]
We can make a word that takes a from node and applies a quotation to each result of applying a rule a -> b. Notice that we're able to use our previous work on finding subsequences:
:: each-move ( from a b quot: ( next -- ) -- )
from dup a subseq-indices [
cut a length tail b glue quot call
] with each ; inline
And then a word that returns all the next states:
: all-moves ( from -- moves )
[ all-rules [ first2 [ , ] each-move ] with each ] { } make ;
It's often good practice to try each step out during development, so let's do that and show that the six possible next states match the ones from the game:
IN: scratchpad "aaaaaaaa" all-moves .
{
"babaaaaa"
"ababaaaa"
"aababaaa"
"aaababaa"
"aaaababa"
"aaaaabab"
}
The next states are nice to have, but we're generally going to be accumulating paths, which are series of states achieved by traversing the graph of all possible states:
:: all-paths% ( path -- )
path last all-rules [
first2 [ path swap suffix , ] each-move
] with each ;
: all-paths ( paths -- paths' )
[ [ all-paths% ] each ] { } make members ;
So, these are the first two steps in traversing the graph. You can see that some of the possible second moves end up circling back to the starting position, which makes sense since the rules are bi-directional and any rule that is applied can be un-applied on the next step.
IN: scratchpad { { "aaaaaaaa" } } all-paths dup . nl all-paths .
{
{ "aaaaaaaa" "babaaaaa" }
{ "aaaaaaaa" "ababaaaa" }
{ "aaaaaaaa" "aababaaa" }
{ "aaaaaaaa" "aaababaa" }
{ "aaaaaaaa" "aaaababa" }
{ "aaaaaaaa" "aaaaabab" }
}
{
{ "aaaaaaaa" "babaaaaa" "aaaaaaaa" }
{ "aaaaaaaa" "babaaaaa" "babbabaa" }
{ "aaaaaaaa" "babaaaaa" "babababa" }
{ "aaaaaaaa" "babaaaaa" "babaabab" }
{ "aaaaaaaa" "ababaaaa" "aaaaaaaa" }
{ "aaaaaaaa" "ababaaaa" "ababbaba" }
{ "aaaaaaaa" "ababaaaa" "abababab" }
{ "aaaaaaaa" "aababaaa" "aaaaaaaa" }
{ "aaaaaaaa" "aababaaa" "aababbab" }
{ "aaaaaaaa" "aaababaa" "aaaaaaaa" }
{ "aaaaaaaa" "aaababaa" "babbabaa" }
{ "aaaaaaaa" "aaaababa" "aaaaaaaa" }
{ "aaaaaaaa" "aaaababa" "babababa" }
{ "aaaaaaaa" "aaaababa" "ababbaba" }
{ "aaaaaaaa" "aaaaabab" "aaaaaaaa" }
{ "aaaaaaaa" "aaaaabab" "babaabab" }
{ "aaaaaaaa" "aaaaabab" "abababab" }
{ "aaaaaaaa" "aaaaabab" "aababbab" }
}
Let's solve for the shortest paths: we keep track of states we've previously seen to avoid cycles, and we iterate using breadth-first search until we find any solutions:
:: shortest-paths ( from to -- moves )
HS{ from } clone :> seen
{ { from } } :> stack!
f [
drop
! find all next possibilities
stack all-paths
! reject ones that circle back to visited nodes
[ last seen in? ] reject
! reject any that are over the length of ``to``
to length '[ last length _ > ] reject stack!
! add the newly visited nodes
stack [ last seen adjoin ] each
! stop when we find any solutions
stack [ last to = ] filter dup empty?
] loop ;
Note: we reject any states that are longer than our goal state. This provides a nice way to cull the graph and make the search performance more reasonable. You could also choose not to do that, and exhaustively search into that area. However, while this is not generally a valid approach to solving these types of problems, it is specifically a valid approach to this one.
There are quite a few shortest paths:
IN: scratchpad "aaaaaaaa" "aaaaaaaaaa" shortest-paths length .
560
Each of those contains 16 nodes, which means 15 rules were applied:
IN: scratchpad "aaaaaaaa" "aaaaaaaaaa" shortest-paths first length .
16
But they only go through a seemingly small number of nodes:
IN: scratchpad "aaaaaaaa" "aaaaaaaaaa" shortest-paths concat members length .
43
How many nodes are there in total in the graph? Let’s find out!
:: full-graph ( from to -- seen )
HS{ from } clone :> seen
{ { from } } [
! find all next possibilities
all-paths
! reject any that are over the length of ``to``
to length '[ last length _ > ] reject
! only include ones that visit new nodes
[ last seen ?adjoin ] filter
] until-empty seen ;
We can see that the shortest solutions go through about 15% of the nodes:
IN: scratchpad "aaaaaaaa" "aaaaaaaaaa" full-graph cardinality .
279
We can use our graph traversal approach and Graphviz to visualize where solutions are found, showing how some areas of the graph are quite hard to randomly escape from in order to get back on the correct path to a solution. We draw the starting node in green, the ending node in blue, and the nodes involved in the shortest paths in gray:
And that’s kind of interesting, but if we cluster nodes by their depth when first discovered, some other patterns show up:
Such a fun problem!
Phil Eaton issued a HYIBY? – have you implemented bitcask yet? – challenge yesterday. Of course, I immediately realized that I have not and also that it would be fun to build in Factor.
Bitcask is described in the original Bitcask paper as a “log-structured hash-table for fast key/value data”, and was part of the software developed by Basho Technologies as part of the Riak distributed database. Besides the original paper, various developers over the years have bumped into Bitcask and implemented it in different programming languages. Arpit Bhayani, for example, has a nice blog post describing Bitcask that is worth reading for more background.
At its core, Bitcask describes an append-only storage mechanism for building a key-value database. It consists of one-or-more data files, each of which has an optional index file to allow faster recovery when initializing the database, and generally supports GET, PUT, and DELETE operations.
Our data file contains a series of entry records. Each record consists of a key length, value length, key bytes, and value bytes. A simple word provides a way to write these bytes to a file:
: write-entry-bytes ( key value -- )
[ dup length 4 >be write ] bi@ [ write ] bi@ ;
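The two length prefixes are four-byte big-endian integers. For example, a length of 10 encodes like this (a quick listener check using the >be word from the standard library's endian support):

IN: scratchpad 10 4 >be .
B{ 0 0 0 10 }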
Then, using the serialize vocabulary we can store Factor objects quite simply:
: write-entry ( key value -- )
[ object>bytes ] bi@ write-entry-bytes ;
We need the ability to store tombstone records which indicate that a key has been deleted from the database. In this case, we choose to store a zero-sized value to indicate that:
: write-tombstone ( key -- )
object>bytes f write-entry-bytes ;
Assuming that a data file has had its seek position moved to the beginning of an entry record, we can read the value that it contains, or return a boolean indicating that it is not found because it was stored as a tombstone:
: read-entry ( -- value/f ? )
4 read be> 4 read be> [
drop f f
] [
[ seek-relative seek-input ]
[ read bytes>object t ] bi*
] if-zero ;
Our index file contains hints that provide a way to recover the record offsets into the data files. These hints consist of a series of index records. Each record consists of a key length, key bytes, and file offset.

We can write our index mapping of keys to offsets:
: write-index ( index -- )
[
[ object>bytes dup length 4 >be write write ]
[ 4 >be write ] bi*
] assoc-each ;
And then read it back into memory:
: read-index ( -- index )
H{ } clone [
4 read [
be> read bytes>object 4 read be>
swap pick set-at t
] [ f ] if*
] loop ;
We want to make the index files optional: we recover the index by first seeking to the last entry that we have in the index, and then iterating across the remaining records in the data file to recover the full index, making sure to delete any items that are subsequently observed to contain tombstone entries:
: recover-index ( index -- index' )
dup values [ maximum seek-absolute seek-input ] unless-empty
[
tell-input 4 read [
be> 4 read be> [ read bytes>object ] dip
[ pick delete-at drop ] [
[ pick set-at ]
[ seek-relative seek-input ] bi*
] if-zero t
] [ drop f ] if*
] loop ;
The associative mapping protocol describes the words that an assoc should support. This type of object provides a mapping of key to value, with ways to add, update, and delete these mappings.

We want our bitcask type to use a single data file, reading and recovering from an index file, and then providing ways to modify – by appending to the data file – the database.
TUPLE: bitcask path index ;
:: <bitcask> ( path -- bitcask )
path dup touch-file
path ".idx" append dup touch-file
binary [ read-index ] with-file-reader
path binary [ recover-index ] with-file-reader
bitcask boa ;
INSTANCE: bitcask assoc
The application should control when and how these index files are persisted:
: save-index ( bitcask -- )
dup path>> ".idx" append binary
[ index>> write-index ] with-file-writer ;
The first operation we support will be set-at, updating the index after writing the entry:
M:: bitcask set-at ( value key bitcask -- )
bitcask path>> binary [
tell-output
key value write-entry
key bitcask index>> set-at
] with-file-appender ;
Next, we support at*, to look up a value by seeking in the data file and reading the entry:
M:: bitcask at* ( key bitcask -- value/f ? )
key bitcask index>> at* [
bitcask path>> binary [
seek-absolute seek-input read-entry
] with-file-reader
] [ drop f f ] if ;
And finally, delete-at removes a key from the index after writing a tombstone:
M:: bitcask delete-at ( key bitcask -- )
key bitcask index>> key? [
bitcask path>> binary [
key write-tombstone
key bitcask index>> delete-at
] with-file-appender
] when ;
The assoc-size of our database is the size of the index:
M: bitcask assoc-size
index>> assoc-size ;
It is helpful to implement >alist to provide a conversion to an association list, although if the database gets quite large, this might be of less practical value:
M:: bitcask >alist ( bitcask -- alist )
bitcask path>> binary [
bitcask index>> [
seek-absolute seek-input read-entry t assert=
] { } assoc-map-as
] with-file-reader ;
And a way to clear-assoc by writing tombstones and clearing the index:
M:: bitcask clear-assoc ( bitcask -- )
bitcask path>> binary [
bitcask index>>
dup keys [ write-tombstone ] each
clear-assoc
] with-file-appender ;
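Putting it all together, here is a rough sketch of using it from the listener, assuming a writable path like /tmp/test.db (hypothetical session):

IN: scratchpad USE: bitcask
IN: scratchpad "/tmp/test.db" <bitcask>
IN: scratchpad "Factor" "language" pick set-at
IN: scratchpad "language" over at .
"Factor"
IN: scratchpad dup save-index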
There are some elements desirable in a production database that are not implemented here; for example, the original paper also describes CRC checks on each record, multiple data files, and a merge process to compact older entries.
This is now available in the development version in the bitcask vocabulary!
One of my first and most memorable graphical programs was implementing John Conway’s Game of Life. At the time, that implementation was as a Java applet. I’ve revisited it periodically in different programming languages including several years ago when I started to implement the Game of Life in Factor – something I’ve always wanted to write about.
The Game of Life is a two-dimensional grid of square cells with fairly simple logic. Each cell can be either live or dead. Each cell interacts with its eight neighboring cells, with the following rules determining the next state of the game board:
- Any live cell with two or three live neighbors survives to the next generation.
- Any dead cell with exactly three live neighbors becomes a live cell.
- All other live cells die, and all other dead cells stay dead.
You can run this in any release since Factor 0.98:
IN: scratchpad "game-of-life" run
And it will look something like this:
Let’s go ahead and build it!
We will model our two-dimensional game board as an array of arrays. And in particular, since each cell has only two states, we will use bit-arrays to reduce the memory requirements by efficiently storing the state, one bit for each cell.
: <grid> ( rows cols -- grid )
'[ _ <bit-array> ] replicate ;
: grid-dim ( grid -- rows cols )
[ length ] [ first length ] bi ;
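For example, a small empty grid is just an array of bit-arrays (a quick listener check):

IN: scratchpad 2 3 <grid> .
{ ?{ f f f } ?{ f f f } }
IN: scratchpad 2 3 <grid> grid-dim 2array .
{ 2 3 }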
Making a random grid, which is useful in testing:
: random-grid ( rows cols -- grid )
'[ _ { t f } ?{ } randoms-as ] replicate ;
And a word we can use for debugging, to print a grid out:
: grid. ( grid -- )
[ [ CHAR: # CHAR: . ? ] "" map-as print ] each ;
Some implementations choose to make the game boards infinite, but we are instead going to build a wraparound game board. This allows, for example, a glider shape to fly off the bottom right and then re-appear on the top left of the board, which is a lot more fun to watch.
A useful word calculates adjacent indices for a cell – that wrap at a max value of rows or columns:
:: adjacent-indices ( n max -- n-1 n n+1 )
n [ max ] when-zero 1 -
n
n 1 + dup max = [ drop 0 ] when ;
Test it out, showing how it might work in a hypothetical 10 x 10
grid:
! in the middle
IN: scratchpad 3 10 adjacent-indices 3array .
{ 2 3 4 }
! at the start, wrapped around
IN: scratchpad 0 10 adjacent-indices 3array .
{ 9 0 1 }
! at the end, wrapped around
IN: scratchpad 9 10 adjacent-indices 3array .
{ 8 9 0 }
The main game logic requires counting neighbors for each cell. Since each cell can have 8 neighbors, we can store this count in a half-byte – a nibble – which can hold the values [0..15]. In the batteries-included standard library, we have a nibble-arrays vocabulary that makes this easy.
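For example, a fresh nibble array of counts starts out as all zeros (a quick listener check):

IN: scratchpad 5 <nibble-array> .
N{ 0 0 0 0 0 }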
The simplest implementation would just iterate across the game board, and for each cell that is live, increment the count for the neighboring indices around it:
:: count-neighbors ( grid -- counts )
grid grid-dim :> ( rows cols )
rows [ cols <nibble-array> ] replicate :> neighbors
grid [| row j |
j rows adjacent-indices
[ neighbors nth ] tri@ :> ( above same below )
row [| cell i |
cell [
i cols adjacent-indices
[ [ above [ 1 + ] change-nth ] tri@ ]
[ nip [ same [ 1 + ] change-nth ] bi@ ]
[ [ below [ 1 + ] change-nth ] tri@ ]
3tri
] when
] each-index
] each-index neighbors ;
Then the last piece of game logic we need is to adjust the grid cells according to the rules – making some transition from live to dead, and others from dead to live based on their state and the neighboring counts.
:: next-step ( grid -- )
grid count-neighbors :> neighbors
grid [| row j |
j neighbors nth :> neighbor-row
row [| cell i |
i neighbor-row nth
cell [
2 3 between? i row set-nth
] [
3 = [ t i row set-nth ] when
] if
] each-index
] each-index ;
Before we move on to creating a graphical user interface for the game, let’s try it out in the Factor listener:
! Create a random 10x10 grid
IN: scratchpad 10 10 random-grid
! Print it out
IN: scratchpad dup grid.
#..#..#.##
##....####
..###.####
.##...#..#
.##....###
..###..#.#
...###.#..
.###....##
#...###.##
.##..#.#..
! Compute the neighbors for each cell
IN: scratchpad dup count-neighbors .
{
N{ 5 5 4 1 2 3 4 6 5 5 }
N{ 5 3 4 4 3 4 4 7 7 7 }
N{ 6 5 4 3 1 4 4 6 6 5 }
N{ 5 4 5 5 2 3 3 6 7 4 }
N{ 5 4 5 5 2 2 3 3 5 3 }
N{ 3 3 4 5 4 3 4 3 6 2 }
N{ 3 3 6 6 5 2 3 2 5 3 }
N{ 4 2 3 4 6 5 4 4 4 4 }
N{ 4 5 5 4 3 3 3 4 4 4 }
N{ 5 3 2 3 4 4 5 4 5 6 }
}
! Compute the next generation
IN: scratchpad dup next-step
! Print it out
IN: scratchpad dup grid.
.....#....
.#..#.....
...#......
.....##...
......##.#
##...#.#.#
##...###.#
.##.......
....###...
.###......
It works!
In Factor, one of the ways we can build user interfaces is using gadgets and OpenGL rendering instructions.

We start by modeling our game as a gadget with a grid object, a size that specifies the rendered pixels-per-cell, and a timer to control the speed of repainting new generations.
TUPLE: grid-gadget < gadget grid size timer ;
Our default gadget will have cells that are 20 pixels square, and repaint 10 times per second:
: <grid-gadget> ( grid -- gadget )
grid-gadget new
swap >>grid
20 >>size
dup '[ _ dup grid>> next-step relayout-1 ]
f 1/10 seconds <timer> >>timer ;
Gadgets are grafted onto the render hierarchy, and then later ungrafted when they are removed. We handle that state change by stopping the timer before delegating to the parent to clean up further:
M: grid-gadget ungraft*
[ timer>> stop-timer ] [ call-next-method ] bi ;
The default dimension for our gadget is the grid dimension times the pixel size:
M: grid-gadget pref-dim*
[ grid>> grid-dim swap ] [ size>> '[ _ * ] bi@ 2array ] bi ;
If the grid size changes – for example, by using the mouse scroll wheel to zoom in or out – we can create and store a new grid, keeping the cells that are visible in the same state they were in:
:: update-grid ( gadget -- )
gadget dim>> first2 :> ( w h )
gadget size>> :> size
h w [ size /i ] bi@ :> ( new-rows new-cols )
gadget grid>> :> grid
grid grid-dim :> ( rows cols )
rows new-rows = not cols new-cols = not or [
new-rows new-cols <grid> :> new-grid
rows new-rows min [| j |
cols new-cols min [| i |
i j grid nth nth
i j new-grid nth set-nth
] each-integer
] each-integer
new-grid gadget grid<<
] when ;
We can draw the cells that are live as black squares:
:: draw-cells ( gadget -- )
COLOR: black gl-color
gadget size>> :> size
gadget grid>> [| row j |
row [| cell i |
cell [
i j [ size * ] bi@ 2array { size size } gl-fill-rect
] when
] each-index
] each-index ;
And then draw the gray lines that define the grid of cells:
:: draw-lines ( gadget -- )
gadget size>> :> size
gadget grid>> grid-dim :> ( rows cols )
COLOR: gray gl-color
cols rows [ size * ] bi@ :> ( w h )
rows 1 + [| j |
j size * :> y
{ 0 y } { w y } gl-line
] each-integer
cols 1 + [| i |
i size * :> x
{ x 0 } { x h } gl-line
] each-integer ;
Putting this together, we draw our gadget by updating the grid, drawing the cells, and drawing the lines:
M: grid-gadget draw-gadget*
[ update-grid ] [ draw-cells ] [ draw-lines ] tri ;
And, with the “visual REPL”, you can directly render the grid gadget, to see it work:
We now need to build the interactive parts. Let's first start by handling a click, to toggle the state of a cell, and storing which state it was toggled to in the last-click variable:
SYMBOL: last-click
:: on-click ( gadget -- )
gadget grid>> :> grid
gadget size>> :> size
grid grid-dim :> ( rows cols )
gadget hand-rel first2 [ size /i ] bi@ :> ( i j )
i 0 cols 1 - between?
j 0 rows 1 - between? and [
i j grid nth
[ not dup last-click set ] change-nth
] when gadget relayout-1 ;
That allows us to build a drag feature where, as we drag, we continue to set cells to either live or dead according to what the first click did:
:: on-drag ( gadget -- )
gadget grid>> :> grid
gadget size>> :> size
grid grid-dim :> ( rows cols )
gadget hand-rel first2 [ size /i ] bi@ :> ( i j )
i 0 cols 1 - between?
j 0 rows 1 - between? and [
last-click get i j
grid nth set-nth
gadget relayout-1
] when ;
We implement a scrolling feature to adjust the size of the rendered cells, clamping the value when it gets too small or too large:
: on-scroll ( gadget -- )
[
scroll-direction get second {
{ [ dup 0 > ] [ -2 ] }
{ [ dup 0 < ] [ 2 ] }
[ 0 ]
} cond nip + 4 30 clamp
] change-size relayout-1 ;
And we store these as "gestures"
that are supported by the gadget:
grid-gadget "gestures" [
{
{ T{ button-down { # 1 } } [ on-click ] }
{ T{ drag { # 1 } } [ on-drag ] }
{ mouse-scroll [ on-scroll ] }
} assoc-union
] change-word-prop
The last bit we need is to make the toolbar, which has a few commands we can run:
:: com-play ( gadget -- )
gadget timer>> restart-timer ;
:: com-stop ( gadget -- )
gadget timer>> stop-timer ;
:: com-clear ( gadget -- )
gadget dup grid>> [ clear-bits ] each relayout-1 ;
:: com-random ( gadget -- )
gadget dup grid>> [ [ drop { t f } random ] map! drop ] each relayout-1 ;
:: com-glider ( gadget -- )
gadget dup grid>> :> grid
{ { 2 1 } { 3 2 } { 1 3 } { 2 3 } { 3 3 } }
[ grid nth t -rot set-nth ] assoc-each relayout-1 ;
:: com-step ( gadget -- )
gadget dup grid>> next-step relayout-1 ;
And then store these as the "toolbar"
command map:
grid-gadget "toolbar" f {
{ T{ key-down { sym "1" } } com-play }
{ T{ key-down { sym "2" } } com-stop }
{ T{ key-down { sym "3" } } com-clear }
{ T{ key-down { sym "4" } } com-random }
{ T{ key-down { sym "5" } } com-glider }
{ T{ key-down { sym "6" } } com-step }
} define-command-map
And finally, we can wrap the grid gadget with something that makes a toolbar, and creates a main window when launched:
TUPLE: life-gadget < track ;
: <life-gadget> ( -- gadget )
vertical life-gadget new-track
20 20 <grid> <grid-gadget>
[ <toolbar> format-toolbar f track-add ]
[ 1 track-add ] bi ;
M: life-gadget focusable-child* children>> second ;
MAIN-WINDOW: life-window
{ { title "Game of Life" } }
<life-gadget> >>gadgets ;
As with anything, there are probably things we could continue to improve in our UI framework, but one of the biggest missing pieces is examples of working code, which is largely what motivated writing about this today.
Check it out!
And maybe think about how you might adjust it to be an infinite game board, increase performance when computing the next generation, improve the OpenGL rendering logic, persist the game board between launches, or do things like communicate the age of each cell by the color it is rendered with.