[ planet-factor ]

John Benediktsson: Cuckoo Filters

A Cuckoo filter is a Bloom filter replacement that allows for space-efficient probabilistic membership checks. Cuckoo filters provide the ability to add and remove items dynamically without significantly degrading space and performance. False positive rates are typically low.

This data structure is explained by Bin Fan, Dave Andersen, Michael Kaminsky, and Michael Mitzenmacher in two papers: Cuckoo Filter: Better Than Bloom and Cuckoo Filter: Practically Better Than Bloom. There is also an implementation in C++ that can be referred to.

The Cuckoo filter is basically a dense hash table that can support high load factors (up to 95%) without degraded performance. Instead of storing objects, we will store a hashed fingerprint.

Buckets

First, we need to create a number of buckets. Each bucket will hold 4 fingerprints. Load factors over 96% will cause us to grow our capacity to the next-power-of-2.

! The number of fingerprints to store in each bucket
CONSTANT: bucket-size 4

! The maximum load factor we allow before growing the capacity
CONSTANT: max-load-factor 0.96

: #buckets ( capacity -- #buckets )
[ bucket-size /i next-power-of-2 ] keep
over / bucket-size / max-load-factor > [ 2 * ] when ;
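
As a quick check of the sizing rule, here is a small sketch, assuming the #buckets word above is already loaded in the current vocabulary. A capacity of 100 fits in 32 buckets at roughly a 78% load factor, while a capacity of 128 would fill 32 buckets completely, so the count doubles:

```factor
USING: tools.test ;

! 100 4 /i is 25, rounded up to 32 buckets; load is 100/128, under 0.96
{ 32 } [ 100 #buckets ] unit-test

! 128 4 /i is 32 buckets; load is 128/128, over 0.96, so we double
{ 64 } [ 128 #buckets ] unit-test
```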

Making our buckets is then just an array of arrays:

: <cuckoo-buckets> ( capacity -- buckets )
#buckets [ bucket-size f <array> ] replicate ;

Given a fingerprint, we can check if it is in a bucket by calling member?:

: bucket-lookup ( fingerprint bucket -- ? )
member? ;

To insert a fingerprint into the bucket, we find the first empty slot and replace it with the fingerprint. We return a boolean value indicating if we were able to insert it or not:

: bucket-insert ( fingerprint bucket -- ? )
dup [ not ] find drop [ swap set-nth t ] [ 2drop f ] if* ;

To delete a fingerprint, we find its index (if present) and set that slot to f:

: bucket-delete ( fingerprint bucket -- ? )
[ f ] 2dip [ index ] keep over [ set-nth t ] [ 3drop f ] if ;

If the bucket is full, we need to be able to swap a fingerprint into the bucket, replacing/removing an existing one:

: bucket-swap ( fingerprint bucket -- fingerprint' )
[ length random ] keep [ swap ] change-nth ;
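
To see how these bucket words behave on a single bucket, here is a short sketch written as unit tests. It assumes the four bucket words above are loaded; the fingerprint values (42, 9, and so on) are made up for illustration:

```factor
USING: arrays kernel sequences tools.test ;

! Inserting into an empty bucket fills the first free slot.
{ { 42 f f f } } [ 4 f <array> 42 over bucket-insert drop ] unit-test

! A full bucket refuses the insert.
{ f } [ 42 { 1 2 3 4 } clone bucket-insert ] unit-test

! Deleting clears only the matching slot.
{ { 1 f 3 4 } } [ { 1 2 3 4 } clone 2 over bucket-delete drop ] unit-test

! Swapping evicts a random fingerprint, so the new one is always stored.
{ t } [ { 1 2 3 4 } clone 9 over bucket-swap drop 9 swap member? ] unit-test
```

The literal arrays are cloned because these words mutate the bucket in place.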

Hashing

Our hashing strategy will be to generate the SHA-1 hash value for a given byte-array and take two 32-bit values from it (a 32-bit fingerprint and a 32-bit index value). We will also generate an alternate index value, using a constant from MurmurHash to mix the fingerprint with the primary index:

: hash-index ( hash -- fingerprint index )
4 over <displaced-alien> [ uint deref ] bi@ ;

: alt-index ( fingerprint index -- alt-index )
[ 0x5bd1e995 w* ] [ bitxor ] bi* ;

: hash-indices ( bytes -- fingerprint index alt-index )
sha1 checksum-bytes hash-index 2dup alt-index ;
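
One property of alt-index is worth spelling out: because it XORs the index with a value derived only from the fingerprint, applying it twice gets back the original index. That means either bucket can be computed from the other using just the stored fingerprint. A small sketch, assuming alt-index above is loaded (the fingerprint and index values are arbitrary):

```factor
USING: kernel tools.test ;

! XOR-ing with the same mixed fingerprint twice cancels out,
! so the alternate of the alternate is the original index.
{ 12345 } [ 0xdeadbeef dup 12345 alt-index alt-index ] unit-test

! With fingerprint 1 and index 0, the result is just the mixing constant.
{ 0x5bd1e995 } [ 1 0 alt-index ] unit-test
```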

Insert/Lookup/Delete

Our Cuckoo filter holds our buckets:

TUPLE: cuckoo-filter buckets ;

: <cuckoo-filter> ( capacity -- cuckoo-filter )
<cuckoo-buckets> cuckoo-filter boa ;

To insert an item into the Cuckoo filter, we calculate its hash-indices and then try inserting it into the bucket specified by the first index, then the bucket specified by the second index. If those buckets are full, we go through a "kickdown" process to move fingerprints from other buckets until we find a bucket that has space, or exceed the maximum number of attempts:

! The maximum number of times we kick down items/displace from
! their buckets
CONSTANT: max-cuckoo-count 500

:: cuckoo-insert ( bytes cuckoo-filter -- ? )
bytes hash-indices :> ( fp! i1 i2 )
cuckoo-filter buckets>> :> buckets
buckets length :> n
{
[ fp i1 n mod buckets nth bucket-insert ]
[ fp i2 n mod buckets nth bucket-insert ]
} 0|| [
t
] [
2 random zero? i1 i2 ? :> i!
max-cuckoo-count [
drop
fp i n mod buckets nth bucket-swap fp!
fp i alt-index i!

fp i n mod buckets nth bucket-insert
] find-integer >boolean
] if ;

To look up an item, we calculate the hash-indices and then check the two buckets to see if the fingerprint can be found.

:: cuckoo-lookup ( bytes cuckoo-filter -- ? )
bytes hash-indices :> ( fp i1 i2 )
cuckoo-filter buckets>> :> buckets
buckets length :> n
{
[ fp i1 n mod buckets nth bucket-lookup ]
[ fp i2 n mod buckets nth bucket-lookup ]
} 0|| ;

To delete an item, we calculate the hash-indices and then try to remove it from the first bucket, or from the second bucket if it is not found in the first.

:: cuckoo-delete ( bytes cuckoo-filter -- ? )
bytes hash-indices :> ( fp i1 i2 )
cuckoo-filter buckets>> :> buckets
buckets length :> n
{
[ fp i1 n mod buckets nth bucket-delete ]
[ fp i2 n mod buckets nth bucket-delete ]
} 0|| ;
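
Putting the three operations together, a round trip might look like the sketch below. It assumes the words above are loaded; the capacity of 100 and the byte-array B{ 1 2 3 } are arbitrary choices for illustration:

```factor
USING: combinators kernel tools.test ;

! Insert an item, find it, delete it, and confirm it is gone.
{ t t t f } [
    100 <cuckoo-filter> {
        [ B{ 1 2 3 } swap cuckoo-insert ]
        [ B{ 1 2 3 } swap cuckoo-lookup ]
        [ B{ 1 2 3 } swap cuckoo-delete ]
        [ B{ 1 2 3 } swap cuckoo-lookup ]
    } cleave
] unit-test
```

The final lookup is certain to return f only because the filter is otherwise empty; in a populated filter, a different item with the same fingerprint and bucket would show up as a false positive.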

This is available in the cuckoo-filters vocabulary along with some tests, documentation, and a few extra features.

Tue, 9 Aug 2016 00:00:00

John Benediktsson: Backticks

Most languages support running arbitrary commands using something like the Linux system function. Often, this support has both quick-and-easy and full-featured-but-complex versions.

In Python, you can use os.system:

>>> os.system("ls -l")

In Ruby, you can use system as well as "backticks":

irb(main):001:0> system("ls -l")

irb(main):002:0> `ls -l`

Basically, the difference between "system" and "backticks" is:

  • "system" executes a command, returning the exit code of the process.
  • "backticks" executes a command, returning the standard output of the process.

Factor has extensive cross-platform support for launching processes, but I thought it would be fun to show how custom syntax can be created to implement "backticks", capturing and returning standard output from the process:

SYNTAX: `
"`" parse-multiline-string '[
_ utf8 [ contents ] with-process-reader
] append! ;

You can use this in a similar fashion to Ruby or Perl:

IN: scratchpad ` ls -l`

Note: This syntax currently requires a space after the leading backtick. In the future, we have plans for an improved lexer that removes this requirement.

This is available in the backticks vocabulary.

Fri, 15 Jul 2016 16:49:00

John Benediktsson: Clock Angles

Programming Praxis posted about calculating clock angles, specifically to:

Write a program that, given a time as hours and minutes (using a 12-hour clock), calculates the angle between the two hands. For instance, at 2:00 the angle is 60°.

Wikipedia has a page about clock angle problems that we can pull a few test cases from:

{ 0 } [ "12:00" clock-angle ] unit-test
{ 60 } [ "2:00" clock-angle ] unit-test
{ 180 } [ "6:00" clock-angle ] unit-test
{ 18 } [ "5:24" clock-angle ] unit-test
{ 50 } [ "2:20" clock-angle ] unit-test

The hour hand moves 360° in 12 hours and depends on the number of hours and minutes (treating an hour of 12, that is midnight and noon, as 0):

:: hour° ( hour minutes -- degrees )
hour [ 12 = 0 ] keep ? minutes 60 / + 360/12 * ;

The minute hand moves 360° in 60 minutes:

: minute° ( minutes -- degrees )
360/60 * ;

Using these words, we can calculate the clock angle from a time string:

: clock-angle ( string -- degrees )
":" split1 [ string>number ] bi@
[ hour° ] [ minute° ] bi - abs ;
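
To see where the 18° for 5:24 comes from, we can look at the two hands separately. This is a small sketch, assuming hour° and minute° above are loaded. The hour hand sits at (5 + 24/60) × 30 = 162°, the minute hand at 24 × 6 = 144°, and the difference is 18°:

```factor
USING: tools.test ;

! Hour hand at 5:24
{ 162 } [ 5 24 hour° ] unit-test

! Minute hand at 24 minutes past
{ 144 } [ 24 minute° ] unit-test

! An hour of 12 is treated as 0
{ 0 } [ 12 0 hour° ] unit-test
```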

Wed, 6 Jul 2016 23:37:00

John Benediktsson: left-pad

In the wake of an epic ragequit where Azer Koçulu removed all of his modules from npm (the node.js package manager), there have been so many entertaining discussions and explanations covering what happened.

Today, Programming Praxis posted the leftpad challenge, pointing out that the original solution ran in quadratic time due to its use of character-by-character string concatenation (but not pointing out that it only works with strings).

First, the original code in JavaScript:

function leftpad (str, len, ch) {
str = String(str);
var i = -1;
if (!ch && ch !== 0) ch = ' ';
len = len - str.length;
while (++i < len) {
str = ch + str;
}
return str;
}

Now, a (simpler? faster? more general?) version in Factor:

:: left-pad ( seq n elt -- newseq )
seq n seq length [-] elt <repetition> prepend ;

Using it, you can see it works:

IN: scratchpad "hello" 3 CHAR: h left-pad .
"hello"

IN: scratchpad "hello" 10 CHAR: h left-pad .
"hhhhhhello"

And it even works with other types of sequences:

IN: scratchpad { 1 2 3 } 3 0 left-pad .
{ 1 2 3 }

IN: scratchpad { 1 2 3 } 10 0 left-pad .
{ 0 0 0 0 0 0 0 1 2 3 }

I should also point out that Factor has pad-head that does this in the standard library and node.js has a pad-left module that solves the quadratic time problem (but still only works with strings).
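
For comparison, the same examples using pad-head from the sequences vocabulary, which takes its arguments in the same order:

```factor
USING: sequences tools.test ;

{ "hello" } [ "hello" 3 CHAR: h pad-head ] unit-test
{ "hhhhhhello" } [ "hello" 10 CHAR: h pad-head ] unit-test
{ { 0 0 0 0 0 0 0 1 2 3 } } [ { 1 2 3 } 10 0 pad-head ] unit-test
```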

Fri, 25 Mar 2016 16:39:00

John Benediktsson: ISBN

Most books are issued a unique International Standard Book Number (ISBN). Often different formats of the same book will have different ISBNs. On a print book, you can usually find the ISBN on a barcode on the back cover.

Most countries seem to have a national ISBN registration agency. In some countries this is a free service provided by a government agency. In other countries, this is operated by a commercial entity. In the United States, one company (R.R. Bowker LLC) has an apparent monopoly on issuing ISBN numbers which can cost $125 for one (less if you buy in bulk).

The ISBN is 13 digits long if assigned starting in 2007, and 10 digits long if assigned before 2007. Each ISBN contains a check digit which is used for basic error detection. We are going to build a few words in Factor to calculate the check digits and validate ISBNs.

We need to turn an ISBN (which might include spaces or dashes) into numeric digits:

: digits ( str -- digits )
[ digit? ] filter string>digits ;

For ISBN-10, each of the 10 digits is multiplied by a weight (descending from 10 to 1), and the check digit is chosen so that the sum of these products is zero modulo 11. This word computes that sum modulo 11:

: isbn-10-check ( digits -- n )
0 [ 10 swap - * + ] reduce-index 11 mod ;

For ISBN-13, each of the 13 digits is multiplied by a weight (alternating between 1 and 3), and the check digit is chosen so that the sum of these products is zero modulo 10. This word computes that sum modulo 10:

: isbn-13-check ( digits -- n )
0 [ even? 1 3 ? * + ] reduce-index 10 mod ;

We can validate an ISBN by grabbing the digits and running either the ISBN-10 or ISBN-13 check and verifying that the result is zero.

: valid-isbn? ( str -- ? )
digits dup length {
{ 10 [ isbn-10-check ] }
{ 13 [ isbn-13-check ] }
} case 0 = ;
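
As a worked example, take the ISBN-10 0-306-40615-2 and its ISBN-13 form 978-0-306-40615-7, both commonly used to illustrate the check digit calculation. The weighted sums are 132 (a multiple of 11) and 100 (a multiple of 10), so both validate. A short sketch, assuming the words above are loaded:

```factor
USING: tools.test ;

! 0*10 + 3*9 + 0*8 + 6*7 + 4*6 + 0*5 + 6*4 + 1*3 + 5*2 + 2*1 = 132
{ 0 } [ { 0 3 0 6 4 0 6 1 5 2 } isbn-10-check ] unit-test

! Weights 1 and 3 alternate, giving a sum of 100
{ 0 } [ { 9 7 8 0 3 0 6 4 0 6 1 5 7 } isbn-13-check ] unit-test

{ t } [ "0-306-40615-2" valid-isbn? ] unit-test
{ t } [ "978-0-306-40615-7" valid-isbn? ] unit-test

! Changing the last digit breaks the check
{ f } [ "978-0-306-40615-8" valid-isbn? ] unit-test
```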

The code (and some tests) for this is on my GitHub.

Sat, 19 Sep 2015 21:36:00

John Benediktsson: Pig Latin

Pig Latin is a somewhat ridiculous language game which modifies words in such a funny way that it is hard to figure out if you don't know how it works, but easy if you do. Using Factor, we will build a converter from English to Pig Latin words.

There are two basic rules we should implement:

  1. For words that begin with consonant sounds, the initial consonant or consonant cluster is moved to the end of the word, and "ay" is added to the end.
{ "igpay" } [ "pig" pig-latin ] unit-test
{ "ananabay" } [ "banana" pig-latin ] unit-test
{ "ashtray" } [ "trash" pig-latin ] unit-test
{ "appyhay" } [ "happy" pig-latin ] unit-test
{ "uckday" } [ "duck" pig-latin ] unit-test
{ "oveglay" } [ "glove" pig-latin ] unit-test
  2. For words that begin with a vowel sound or silent letter, add "way" to the end.
{ "eggway" } [ "egg" pig-latin ] unit-test
{ "inboxway" } [ "inbox" pig-latin ] unit-test
{ "eightway" } [ "eight" pig-latin ] unit-test

We can implement our two basic rules:

: pig-latin ( str -- str' )
dup [ "aeiou" member? ] find drop [
"way" append
] [
cut swap "ay" 3append
] if-zero ;
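
The word hinges on two steps: finding the index of the first vowel, and cutting the string at that index. A small sketch of those steps on their own, using only words from the sequences vocabulary:

```factor
USING: kernel sequences tools.test ;

! The first vowel in "glove" is at index 2 ...
{ 2 } [ "glove" [ "aeiou" member? ] find drop ] unit-test

! ... so the word is cut into the consonant cluster and the rest.
{ "gl" "ove" } [ "glove" 2 cut ] unit-test

! A leading vowel gives index 0, which selects the "way" branch.
{ 0 } [ "egg" [ "aeiou" member? ] find drop ] unit-test
```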

We could improve this by:

  • better handling of words that start with capital vowels or are all consonants
  • reverse the rules to convert Pig Latin back to English
  • variations such as adding "yay" (or "i") instead of "way"
  • different rules like adding "ag" before each vowel ("pagig lagatagin")
  • support language games in other languages

Anyway, this is available on my GitHub.

Sat, 12 Sep 2015 01:19:00

John Benediktsson: Bowling Scores

Today we are going to explore building a bowling score calculator using Factor. In particular, we will be scoring ten-pin bowling.

There are a lot of ways to "golf" this, including this short version in F#, but we will build this in several steps through transformations of the input. The test input is a string representation of the hits, misses, spares, and strikes. The output will be a number which is your total score. We will assume valid inputs and not do much error-checking.

A sample game might look like this:

12X4--3-69/-98/8-8-

Our first transformation is to convert each character to a number of pins that have been knocked down for each ball. Strikes are denoted with X, spares with /, misses with -, and normal hits with a number.

: pin ( last ch -- pin )
{
{ CHAR: X [ 10 ] }
{ CHAR: / [ 10 over - ] }
{ CHAR: - [ 0 ] }
[ CHAR: 0 - ]
} case nip ;

We use this to convert the entire string into a series of pins knocked down for each ball.

: pins ( str -- pins )
f swap [ pin dup ] { } map-as nip ;

A single frame will be either one ball, if a strike, or two balls. We are going to use cut-slice instead of cut because it will be helpful later.

: frame ( pins -- rest frame )
dup first 10 = 1 2 ? short cut-slice swap ;

A game is 9 "normal" frames and then a last frame that could have up to three balls in it.

: frames ( pins -- frames )
9 [ frame ] replicate swap suffix ;

Some frames will trigger a bonus. Strikes add the value of the next two balls. Spares add the value of the next ball. We build this by "un-slicing" the frame and calling sum on the next balls.

: bonus ( frame -- bonus )
[ seq>> ] [ to>> tail ] [ length 3 swap - ] tri head sum ;

We can score the frames by checking for frames where all ten pins are knocked down (either spares or strikes) and adding their bonus.

: scores ( frames -- scores )
[ [ sum ] keep over 10 = [ bonus + ] [ drop ] if ] map ;
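
It can help to look at the intermediate values for the sample game. The sketch below assumes the words above are loaded, and converts the frame slices to arrays only so they can be compared with literals:

```factor
USING: arrays sequences tools.test ;

! Each character becomes the number of pins knocked down by that ball.
{ { 1 2 10 4 0 0 3 0 6 9 1 0 9 8 2 8 0 8 0 } }
[ "12X4--3-69/-98/8-8-" pins ] unit-test

! The balls are grouped into ten frames; the strike is a frame by itself.
{ { { 1 2 } { 10 } { 4 0 } { 0 3 } { 0 6 } { 9 1 } { 0 9 } { 8 2 } { 8 0 } { 8 0 } } }
[ "12X4--3-69/-98/8-8-" pins frames [ >array ] map ] unit-test

! The strike earns 10 + 4 + 0, the spares earn 10 + 0 and 10 + 8.
{ { 3 14 4 3 6 10 9 18 8 8 } }
[ "12X4--3-69/-98/8-8-" pins frames scores ] unit-test
```

The per-frame scores add up to 83.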

We can solve the original goal by just adding all the scores:

: bowl ( str -- score )
pins frames scores sum ;

And write a bunch of unit tests to make sure it works:

{ 0 } [ "---------------------" bowl ] unit-test
{ 11 } [ "------------------X1-" bowl ] unit-test
{ 12 } [ "----------------X1-" bowl ] unit-test
{ 15 } [ "------------------5/5" bowl ] unit-test
{ 20 } [ "11111111111111111111" bowl ] unit-test
{ 20 } [ "5/5-----------------" bowl ] unit-test
{ 20 } [ "------------------5/X" bowl ] unit-test
{ 40 } [ "X5/5----------------" bowl ] unit-test
{ 80 } [ "-8-7714215X6172183-" bowl ] unit-test
{ 83 } [ "12X4--3-69/-98/8-8-" bowl ] unit-test
{ 150 } [ "5/5/5/5/5/5/5/5/5/5/5" bowl ] unit-test
{ 144 } [ "XXX6-3/819-44X6-" bowl ] unit-test
{ 266 } [ "XXXXXXXXX81-" bowl ] unit-test
{ 271 } [ "XXXXXXXXX9/2" bowl ] unit-test
{ 279 } [ "XXXXXXXXXX33" bowl ] unit-test
{ 295 } [ "XXXXXXXXXXX5" bowl ] unit-test
{ 300 } [ "XXXXXXXXXXXX" bowl ] unit-test
{ 100 } [ "-/-/-/-/-/-/-/-/-/-/-" bowl ] unit-test
{ 190 } [ "9/9/9/9/9/9/9/9/9/9/9" bowl ] unit-test

This is available on my GitHub.

Sun, 30 Aug 2015 19:30:00

John Benediktsson: Haikunator

The Haikunator is a project to provide "Heroku-like memorable random names". These names usually consist of an adjective, a noun, and a random number or token. The original repository is implemented in Ruby, with ports to Go, JavaScript, Python, PHP, Elixir, .NET, Java, and Dart.

We will be implementing this in Factor using the qw vocabulary that provides a simple way to make "arrays of strings" using the qw{ syntax.

First, a list of adjectives:

CONSTANT: adjectives qw{
autumn hidden bitter misty silent empty dry dark summer icy
delicate quiet white cool spring winter patient twilight
dawn crimson wispy weathered blue billowing broken cold
damp falling frosty green long late lingering bold little
morning muddy old red rough still small sparkling throbbing
shy wandering withered wild black young holy solitary
fragrant aged snowy proud floral restless divine polished
ancient purple lively nameless lucky odd tiny free dry
yellow orange gentle tight super royal broad steep flat
square round mute noisy hushy raspy soft shrill rapid sweet
curly calm jolly fancy plain shinny
}

Next, a list of nouns:

CONSTANT: nouns qw{
waterfall river breeze moon rain wind sea morning snow lake
sunset pine shadow leaf dawn glitter forest hill cloud
meadow sun glade bird brook butterfly bush dew dust field
fire flower firefly feather grass haze mountain night pond
darkness snowflake silence sound sky shape surf thunder
violet water wildflower wave water resonance sun wood dream
cherry tree fog frost voice paper frog smoke star atom band
bar base block boat term credit art fashion truth disk
math unit cell scene heart recipe union limit bread toast
bonus lab mud mode poetry tooth hall king queen lion tiger
penguin kiwi cake mouse rice coke hola salad hat
}

We will make a token out of digits:

CONSTANT: token-chars "0123456789"

Finally, a simple haikunate implementation:

: haikunate ( -- str )
adjectives random
nouns random
4 [ token-chars random ] "" replicate-as
"%s-%s-%s" sprintf ;

We can try it a few times, to see how it works:

IN: scratchpad haikunate .
"odd-water-8344"

IN: scratchpad haikunate .
"flat-tooth-9324"

IN: scratchpad haikunate .
"wandering-lion-8346"

IN: scratchpad haikunate .
"yellow-mud-9780"

IN: scratchpad haikunate .
"patient-unit-4203"

IN: scratchpad haikunate .
"floral-feather-1023"

Some versions of "haikunate" in other languages include features such as:

  • allow customization of the delimiter (dots are popular)
  • allow the token to be specified as a range of possible numbers
  • allow the token to be restricted to a maximum length
  • allow the token to be represented using hex digits
  • allow the token to be represented with custom character sets
  • etc.
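
As an illustration of the first item, a custom delimiter could be sketched like this. The word name haikunate-with is hypothetical (it is not part of the code above), and the sketch assumes the adjectives, nouns, and token-chars constants are loaded:

```factor
USING: arrays kernel random sequences splitting tools.test ;

! Hypothetical variant of haikunate that takes the delimiter as input.
: haikunate-with ( delimiter -- str )
    [
        adjectives random
        nouns random
        4 [ token-chars random ] "" replicate-as
        3array
    ] dip join ;

! Whatever words are picked, splitting on the delimiter gives three parts.
{ 3 } [ "." haikunate-with "." split length ] unit-test
```

Collecting the three parts into an array and using join avoids building a format string for each delimiter.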

This is available on my GitHub.

Thu, 27 Aug 2015 02:02:00


planet-factor is an Atom/RSS aggregator that collects the contents of Factor-related blogs. It is inspired by Planet Lisp.