[ planet-factor ]

John Benediktsson: Bit Test

Factor has a bit? generic that is used to test an integer to see if a particular bit is set. When operating on fixnum integers, this is implemented by the fixnum-bit? word:

: fixnum-bit? ( x n -- ? )
{ fixnum fixnum } declare
dup 0 >= [ neg shift even? not ] [ 2drop f ] if

Many CPU architectures have instructions for performing a Bit Test. On x86, the BT instruction is available. Below, we are going to implement a compiler intrinsic in the hopes of speeding up this operation in Factor.

Intrinsic

In cpu.architecture, add a generic %bit-test based on our CPU:

HOOK: %bit-test cpu ( dst src1 src2 temp -- )

In cpu.x86, implement %bit-test on x86 (returning a boolean using a temporary register and the CMOVB instruction to check the carry flag which holds the result of the bit test):

M:: x86 %bit-test ( dst src1 src2 temp -- )
src1 src2 BT
dst temp \ CMOVB (%boolean) ;

In compiler.cfg.instructions, add a ##bit-test instruction:

FOLDABLE-INSN: ##bit-test
def: dst/tagged-rep
use: src1/int-rep src2/int-rep
temp: temp/int-rep ;

In compiler.codegen, we link the ##bit-test instruction with the %bit-test word.

CODEGEN: ##bit-test %bit-test

In compiler.cfg.intrinsics, enable replacing fixnum-bit? with the %bit-test intrinsic:

: enable-bit-test ( -- )
{
{ fixnum-bit? [ drop [ ^^bit-test ] binary-op ] }
} enable-intrinsics ;

Disassemble

The old assembly using fixnum-bit? looked like this:

000000010ea00250: mov [rip-0x1ff256], eax
000000010ea00256: sub rsp, 0x8
000000010ea0025a: call 0x10eedf3b0 (integer>fixnum-strict)
000000010ea0025f: mov rax, [r14]
000000010ea00262: cmp rax, 0x0
000000010ea00266: jl 0x10ea002a7 (M\ fixnum bit? + 0x57)
000000010ea0026c: neg rax
000000010ea0026f: mov [r14], rax
000000010ea00272: call 0x10ebec060 (fixnum-shift)
000000010ea00277: mov rax, [r14]
000000010ea0027a: test rax, 0x10
000000010ea00281: mov rax, 0x1
000000010ea0028b: mov rbx, 0x1010eff4c
000000010ea00295: cmovnz rax, rbx
000000010ea00299: mov [r14], rax
000000010ea0029c: mov [rip-0x1ff2a2], eax
000000010ea002a2: add rsp, 0x8
000000010ea002a6: ret
000000010ea002a7: sub r14, 0x8
000000010ea002ab: mov qword [r14], 0x1
000000010ea002b2: mov [rip-0x1ff2b8], eax
000000010ea002b8: add rsp, 0x8
000000010ea002bc: ret

The new assembly looks like this with BT:

000000010e656ec0: mov [rip-0x293ec6], eax
000000010e656ec6: sub rsp, 0x8
000000010e656eca: call 0x10e467ea0 (integer>fixnum-strict)
000000010e656ecf: mov rax, [r14]
000000010e656ed2: mov rbx, [r14-0x8]
000000010e656ed6: sar rbx, 0x4
000000010e656eda: sar rax, 0x4
000000010e656ede: bt rbx, rax
000000010e656ee2: mov rbx, 0x1
000000010e656eec: mov rcx, 0x1018f169c
000000010e656ef6: cmovb rbx, rcx
000000010e656efa: sub r14, 0x8
000000010e656efe: mov [r14], rbx
000000010e656f01: mov [rip-0x293f07], eax
000000010e656f07: add rsp, 0x8
000000010e656f0b: ret

Performance

Turns out that this speeds up fixnum-bit? by a lot. Using a simple test case:

: bench-fixnum-bit ( x n -- ? )
{ fixnum fixnum } declare
[ 0 100,000,000 ] 2dip
'[ _ _ bit? [ 1 + ] when ] times ;

Our old version was decently fast:

IN: scratchpad gc [ 0b101010101 0 bench-fixnum-bit ] time
Running time: 0.821433838 seconds

But our new version is much faster!

IN: scratchpad gc [ 0b101010101 0 bench-fixnum-bit ] time
Running time: 0.239439108 seconds

This has been committed to the development version and is available now.

Fri, 19 Jun 2015 15:30:00

John Benediktsson: Compressed Sieve of Eratosthenes

In my previous post, we used "2-3-5-7" wheel factorization to produce a faster Sieve of Eratosthenes with a compression factor of 2 (by storing only the odd numbers). Samuel Tardieu reminded me that the sieve variation used by Factor uses less memory with a higher compression factor of 3.75.

I thought would be a fun implementation to show below.

Version 6

The "2-3-5" wheel with the first 30 numbers that are not divisible by 2, 3, or 5.

CONSTANT: wheel-2-3-5 $[
30 [1,b] [
{ 2 3 5 } [ divisor? ] with any? not
] B{ } filter-as
]
Note: We use 30 here because the "{ 2 3 5 } product" cycle is 30.

If you look at the wheel, you see that there are 8 candidates in every 30 numbers:

IN: scratchpad wheel-2-3-5 .
{ 1 7 11 13 17 19 23 29 }

The number 8 is interesting because it suggests that we can use a single byte to store the 8 candidates in each block of 30 numbers, using a bitmask for possible primes or f:

CONSTANT: masks $[
30 [0,b) [
wheel-2-3-5 index [ 7 swap - 2^ ] [ f ] if*
] map
]

We can index into this array to find the byte/mask for every number:

: byte-mask ( n -- byte mask/f )
30 /mod masks nth ;

Using the byte/mask we can check if a number is marked in the sieve byte-array.

:: marked? ( n sieve -- ? )
n byte-mask :> ( byte mask )
mask [ byte sieve nth mask bitand zero? not ] [ f ] if ;

And we can unmark a number in the sieve in a similar fashion:

:: unmark ( n sieve -- )
n byte-mask :> ( byte mask )
mask [ byte sieve [ mask bitnot bitand ] change-nth ] when ;

Using that, we can unmark multiples of prime numbers:

:: unmark-multiples ( i upper sieve -- )
i sq upper i 2 * <range> [ sieve unmark ] each ;

That's all we need to build our sieve, starting from a byte-array initialized with all numbers marked and then progressively unmarking multiples of primes:

:: sieve6 ( n -- sieve )
n 30 /i 1 + [ 0xff ] B{ } replicate-as :> sieve
sieve length 30 * 1 - :> upper
3 upper sqrt 2 <range> [| i |
i sieve marked? [
i upper sieve unmark-multiples
] when
] each sieve ;

As you can see, storing prime numbers up to 10 million takes only 333 KB!

IN: scratchpad 10,000,000 sieve6 length .
333334

This approach is used by the math.primes vocabulary to quickly check primality of any number less than 9 million using only 300 KB of memory.

Neat!

Wed, 17 Jun 2015 17:10:00

John Benediktsson: Genuine Sieve of Eratosthenes

Iain Gray posted on the mailing list about a paper called the Genuine Sieve of Eratosthenes by Melissa O'Neill, a Professor of Computer Science at Harvey Mudd College.

It begins with a discussion about some clever looking Haskell code that is just trial division and not the Sieve of Eratosthenes. At the end of the paper is an interesting discussion of performance improvements that I thought might be fun to implement in Factor as a followup to the three versions that I posted about recently.

Version 4

We are going to use wheel factorization as a way of reducing the search space by ignoring some multiples of small prime numbers. It is a bit similar to ignoring all even numbers in our previous "Version 3". In this case, we want to ignore all multiples of the first four prime numbers.

To calculate the "2-3-5-7" wheel, we start with 11 and filter the next 210 numbers that are not divisible by 2, 3, 5, or 7.

CONSTANT: wheel-2-3-5-7 $[
11 dup 210 + [a,b] [
{ 2 3 5 7 } [ divisor? ] with any? not
] B{ } filter-as differences
]
Note: We use 210 here because the "{ 2 3 5 7 } product" cycle is 210.

Next, we want a way to iterate across all the numbers that might be prime, calling a quotation on each number that is prime.

:: each-prime ( upto sieve quot -- )
11 upto '[ dup _ <= ] [
wheel-2-3-5-7 [
over dup 2/ sieve nth [ drop ] quot if
+
] each
] while drop ; inline

For each prime found, we will want to mark all odd multiples as not prime.

:: mark-multiples ( i upto sieve -- )
i sq upto i 2 * <range> [| j |
t j 2/ sieve set-nth
] each ; inline

We will use a bit-array that is sized in 210-bit blocks, so the number of bits needed is:

: sieve-bits ( n -- bits )
210 /i 1 + 210 * 2/ 6 + ; inline

Finally, we calculate our sieve, first for 11 and then for all the primes above 11.

:: sieve4 ( n -- primes )
n sieve-bits <bit-array> :> sieve
t 0 sieve set-nth
t 4 sieve set-nth
n sqrt sieve [ n sieve mark-multiples ] each-prime
V{ 2 3 5 7 } clone :> primes
n sieve [
dup n <= [ primes push ] [ drop ] if
] each-prime primes ;

Calculating prime numbers up to 10 million is getting even faster.

IN: scratchpad gc [ 10,000,000 sieve4 ] time
Running time: 0.08523477 seconds

Version 5

If you use the compiler.tree.debugger to inspect the optimized output, you can see some dynamic dispatch with branches for fixed-width (fixnum) and arbitrary-precision (bignum) integers.

It would be nice to perform fixnum specialization or improve type propagation in the compiler, but for now we are going to write some ugly code to force fixnum math for performance.

We make a version of each-prime that inlines the iteration:

:: each-prime2 ( upto sieve quot -- )
11 upto >fixnum '[ dup _ <= ] [
wheel-2-3-5-7 [
over dup 2/ sieve nth-unsafe [ drop ] quot if
fixnum+fast
] each
] while drop ; inline

And similarly for mark-multiples:

:: mark-multiples2 ( i upto sieve -- )
i 2 fixnum*fast :> step
i i fixnum*fast upto >fixnum '[ dup _ <= ] [
t over 2/ sieve set-nth-unsafe
step fixnum+fast
] while drop ; inline

And use them in our sieve computation:

:: sieve5 ( n -- primes )
n sieve-bits <bit-array> :> sieve
t 0 sieve set-nth
t 4 sieve set-nth
n sqrt sieve [ n sieve mark-multiples2 ] each-prime2
V{ 2 3 5 7 } clone :> primes
n sieve [
dup n <= [ primes push ] [ drop ] if
] each-prime2 primes ;

Calculating prime numbers up to 10 million is super fast!

IN: scratchpad gc [ 10,000,000 sieve5 ] time
Running time: 0.033914378 seconds

We can compute all the 11,078,937 prime numbers up to 200 million in about a second!

IN: scratchpad gc [ 200,000,000 sieve5 ] time
Running time: 0.970876855 seconds

And all the 50,847,534 prime numbers up to 1 billion in about six seconds!

IN: scratchpad gc [ 1,000,000,000 sieve5 ] time
Running time: 6.141224805 seconds

Not quite primegen or primesieve, but not bad for a laptop!

This is available in the math.primes.erato.fast vocabulary on our nightly builds!

Mon, 8 Jun 2015 22:48:00

John Benediktsson: Sieve of Eratosthenes

I noticed that the Crystal Programming Language homepage has an example of the Sieve of Eratosthenes algorithm for finding prime numbers. I thought it would be fun to show a few simple variations of that algorithm in Factor.

The algorithm works by iteratively marking as not prime the multiples of each prime, starting with the multiples of 2. In pseudocode, it looks like this:

Input: an integer n > 1

Let A be an array of Boolean values, indexed by integers 2 to n,
initially all set to true.

for i = 2, 3, 4, ..., not exceeding n:
if A[i] is true:
for j = i2, i2+i, i2+2i, i2+3i, ..., not exceeding n :
A[j] := false

Output: all i such that A[i] is true.

We will be looking at the performance and memory usage to calculate the 664,579 prime numbers between 1 and 10,000,000.

Version 1

Our first version, is a straightforward implementation from the pseudocode:

:: sieve1 ( n -- primes )
n 1 + t <array> :> sieve
f 0 sieve set-nth
f 1 sieve set-nth

2 n sqrt [a,b] [| i |
i sieve nth [
i sq n i <range> [| j |
f j sieve set-nth
] each
] when
] each

sieve [ drop ] selector [ each-index ] dip ;

We can show that it works for the first few prime numbers:

IN: scratchpad 25 sieve1 .
V{ 2 3 5 7 11 13 17 19 23 }

Calculating prime numbers up to 10 million takes less than half a second:

IN: scratchpad [ 10,000,000 sieve1 ] time
Running time: 0.408065579 seconds

The memory used is basically an array with 10 million elements in it, each element of the array is a boolean (64 bits on my laptop). Total memory used is 80 MB.

Version 2

Storing booleans can be more memory-efficient using bit-arrays, where each boolean is stored as a single bit. The logic is the same, however primes are marked by false rather than true.

:: sieve2 ( n -- primes )
n 1 + <bit-array> :> sieve
t 0 sieve set-nth
t 1 sieve set-nth

2 n sqrt [a,b] [| i |
i sieve nth [
i sq n i <range> [| j |
t j sieve set-nth
] each
] unless
] each

sieve [ drop not ] selector [ each-index ] dip ;

Calculating prime numbers up to 10 million is faster:

IN: scratchpad [ 10,000,000 sieve2 ] time
Running time: 0.241894524 seconds

The memory used is now one bit per number, so only 10 million bits or 1.25 MB.

Version 3

Since we know that even numbers (except 2) are not prime, we can only work with odd numbers and start our search from 3. For each number found, we can check only the odd multiples by marking off i2 + k*i for only even k.

:: sieve3 ( n -- sieve )
n dup odd? [ 1 + ] when 2/ <bit-array> :> sieve
t 0 sieve set-nth

3 n sqrt 2 <range> [| i |
i 2/ sieve nth [
i sq n i 2 * <range> [| j |
t j 2/ sieve set-nth
] each
] unless
] each

V{ 2 } clone :> primes

sieve [
swap [ drop ] [ 2 * 1 + primes push ] if
] each-index

primes ;
Note: the convenience of selector is not available because we want to initialize the list of primes with 2. Instead, we just do it manually.

Calculating prime numbers up to 10 million is much faster!

IN: scratchpad [ 10,000,000 sieve3 ] time
Running time: 0.110387127 seconds

As for memory usage, it is storing one bit for each odd number, so only 5 million bits or 625 KB.

Sun, 7 Jun 2015 15:21:00

John Benediktsson: SEND + MORE = MONEY?

There's a classic verbal arithmetic puzzle that was published in 1924 by Henry Dudeney, where you solve the equation:

    S E N D
+ M O R E
-----------
= M O N E Y
Note: Each letter is a unique digit and S and M can't be zero.

A few days ago, I noticed a blog post with solutions in Perl 6 that references another blog post with a solution in Haskell. I thought I'd show a solution in Factor using the backtrack vocabulary that provides something like John McCarthy's amb ambiguous operator.

First, we need a list of available digits, just the numbers 0 through 9:

CONSTANT: digits { 0 1 2 3 4 5 6 7 8 9 }

Next, we turn a sequence of digits into a number (e.g., { 1 2 3 4 } becomes 1234):

: >number ( seq -- n ) 0 [ [ 10 * ] dip + ] reduce ;

We can then implement our solver, by choosing digits while progressively restricting the possible values for future digits using the ones we've chosen so far (using local variables to store the digits):

:: send-more-money ( -- )
[
digits { 0 } diff amb-lazy :> s
digits { s } diff amb-lazy :> e
digits { s e } diff amb-lazy :> n
digits { s e n } diff amb-lazy :> d
{ s e n d } >number :> send

digits { 0 s e n d } diff amb-lazy :> m
digits { s e n d m } diff amb-lazy :> o
digits { s e n d m o } diff amb-lazy :> r
{ m o r e } >number :> more

digits { s e n d m o r } diff amb-lazy :> y
{ m o n e y } >number :> money

send more + money = [
send more money " %s\n+ %s\n= %s\n" printf
] when

] amb-all ;
Note: We search all solutions using amb-all (even though there is only one). In this case, it is effectively an iterative search, which we could implement without backtracking. If we wanted the first solution, we could use if-amb.

So, what's the answer? Let's see!

IN: scratchpad send-more-money
9567
+ 1085
= 10652

Neat! And it's fast too -- solving this puzzle in about 2.5 seconds on my laptop.

The code for this is on my GitHub.

Tue, 2 Jun 2015 00:52:00

John Benediktsson: File Monitor

Factor has a cross-platform file-system change monitor which supports detecting changes to file names, attributes and contents under a specified directory.

There is some minor platform differences between Mac OS X, Windows, and Linux which may be worth looking at if you are building on top of the io.monitors vocabulary. I was curious about what kind of events are generated for various test-cases and built a small utility to experiment with it.

Some code to monitor for changed paths recursively in a directory and print each one out:

: watch-loop ( monitor -- )
dup next-change path>> print flush watch-loop ;

: watch-directory ( path -- )
[ t [ watch-loop ] with-monitor ] with-monitors ;

I've committed this as the file-monitor tool (with support for an optional command-line argument to specify which directory to monitor as well as printing the change descriptors). You can run it very simply:

factor -run=file-monitor [path]

An example session on Linux, monitoring some simple changes to files in /tmp:


$ factor -run=file-monitor /tmp &
Monitoring /tmp

$ touch /tmp/a
{ +add-file+ } /tmp/a
{ +add-file+ } /tmp/a
{ +modify-file+ } /tmp/a

$ echo "test" > /tmp/a
{ +modify-file+ } /tmp/a

$ rm /tmp/a
{ +remove-file+ } /tmp/a

Tue, 5 May 2015 22:31:00

John Benediktsson: File Server

Python has a neat feature that lets you serve files from the current directory.

# Python 2
python2 -m SimpleHTTPServer

# Python 3
python3 -m http.server

I always thought this was a quick and useful way to share files on a local network. Given that Factor has a HTTP Server, we should be able to implement this!

We already have support for serving static content and serving CGI scripts, so we can very simply implement a script to create and launch a HTTP server for the current directory (or the one specified on the command-line), logging HTTP connections to stdout.

This is available in the file-server vocabulary, now you can:

factor -run=file-server [--cgi] [path]

Currently, this defaults to serving files on port 8080 from all available network interfaces. In the future, it would be nice to add the ability to specify port and network interfaces to bind.

Sat, 2 May 2015 00:17:00

John Benediktsson: Burrows-Wheeler Transform

The Burrows–Wheeler transform is a reversible method of rearranging text used to improve the performance of compression algorithms, such as bzip2.

We will implement transform, bwt, and inverse transform, ibwt, in both Python and Factor. First with a slow and simple algorithm, and then second with a faster version.

Version 1

We start with the pseudocode suggested in the Wikipedia article:

function BWT (string s)
append an 'EOF' character to s
create a table, rows are all possible rotations of s
sort rows alphabetically
return (last column of the table)

In Python, this might look like:

def bwt(s):
s = s + '\0'
n = len(s)
m = sorted(s[i:] + s[:i] for i in range(n))
return ''.join(x[-1] for x in m)

In Factor, using all-rotations, it might look like this:

: bwt ( seq -- seq' )
0 suffix all-rotations natural-sort [ last ] map ;

The pseudocode to perform the inverse transform:

function inverseBWT (string s)
create empty table

repeat length(s) times
// first insert creates first column
insert s as a column of table before first column
sort rows of the table alphabetically
return (row that ends with the 'EOF' character)

In Python, this might look like:

def ibwt(s):
n = len(s)
m = [''] * n
for _ in range(n):
m = sorted(s[i] + m[i] for i in range(n))
return [x for x in m if x.endswith('\0')][0][:-1]

In Factor, we could implement it like this:

: ibwt ( seq -- seq' )
[ length [ "" <array> ] keep ] keep
'[ _ [ prefix ] 2map natural-sort ] times
[ last 0 = ] find nip but-last ;

Unfortunately, this is very slow, with most of the performance loss in the invert transform.

Version 2

Another way to increase the speed of BWT inverse is to use an algorithm that returns an index into the sorted rotations along with the transform.

In Python, it looks like this:

def bwt(s):
n = len(s)
m = sorted(s[i:] + s[:i] for i in range(n))
return m.index(s), ''.join(x[-1] for x in m)

In Factor, it might look like this:

: bwt ( seq -- i seq' )
dup all-rotations natural-sort
[ index ] [ [ last ] map ] bi ;

In Python, the inverse transform looks like this:

def ibwt(k, s):
def row(k):
permutation = sorted((t, i) for i, t in enumerate(s))
for _ in s:
t, k = permutation[k]
yield t
return ''.join(row(k))

In Factor, that roughly translates to:

: ibwt ( i seq -- seq' )
[ length ] [ <enum> sort-values ] bi
'[ _ nth first2 ] replicate nip ;

An improved version 2 is available in the development version. In particular, it uses rotated virtual sequences for increased performance and returns transformations that match the type of the input sequence.

Wed, 29 Apr 2015 22:49:00

Blogroll


planet-factor is an Atom/RSS aggregator that collects the contents of Factor-related blogs. It is inspired by Planet Lisp.

Syndicate