@hackage IORefCAS 0.0.1.1

Atomic compare and swap for IORefs and CASRefs.

See the Haddock documentation in Data.CAS.

A few notes on performance results

[2011.11.12] Initial Measurements

An initial round of tests gives the following results when executing 100K CAS's from each of four threads on a 3.33GHz Nehalem desktop:

RAW Haskell CAS:  0.143s        25% productivity, 58MB alloc,  204,693 successes
'Fake' CAS:       0.9s - 1.42s  89% productivity, 132MB alloc, 104,044 successes
Foreign CAS:      1.6s          23% productivity, 82MB alloc,  264,821 successes

"Successes" counts the total number of CAS operations that actually succeeded.

This microbenchmark is spending a lot of its time in Gen 0 garbage collection.
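For readers who want to reproduce the shape of this measurement, here is a minimal sketch of such a harness. It is not the package's test code: `fakeCAS`, `worker`, and `runBench` are hypothetical names, and the CAS is a stand-in built on `atomicModifyIORef'` with an `Eq` comparison rather than any of the three implementations measured above.

```haskell
{-# LANGUAGE BangPatterns #-}
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Monad (forM)
import Data.IORef (IORef, newIORef, readIORef, atomicModifyIORef')

-- Stand-in CAS (an assumption, not the package's implementation):
-- compares by value inside atomicModifyIORef'. Returns whether the
-- swap happened, together with the value that was observed.
fakeCAS :: Eq a => IORef a -> a -> a -> IO (Bool, a)
fakeCAS ref old new = atomicModifyIORef' ref $ \cur ->
  if cur == old then (new, (True, cur)) else (cur, (False, cur))

-- One worker: make n CAS attempts, each trying to bump the shared
-- counter by one, and return how many attempts succeeded.
-- The success count is a strict Int rather than a [Bool] list.
worker :: IORef Int -> Int -> IO Int
worker ref n = go n 0
  where
    go :: Int -> Int -> IO Int
    go 0 !hits = return hits
    go k !hits = do
      seen <- readIORef ref
      (ok, _) <- fakeCAS ref seen (seen + 1)
      go (k - 1) (if ok then hits + 1 else hits)

-- Fork the given number of workers, wait for all of them, and return
-- (total successes, final counter value).
runBench :: Int -> Int -> IO (Int, Int)
runBench threads n = do
  ref <- newIORef 0
  dones <- forM [1 .. threads] $ \_ -> do
    mv <- newEmptyMVar
    _ <- forkIO (worker ref n >>= putMVar mv)
    return mv
  hits <- sum <$> mapM takeMVar dones
  final <- readIORef ref
  return (hits, final)

main :: IO ()
main = do
  (hits, final) <- runBench 4 100000
  putStrLn ("successes: " ++ show hits ++ ", final counter: " ++ show final)
```

Because every successful CAS adds exactly one to the counter, the two numbers printed should agree; the success count itself depends on scheduling and will vary from run to run.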

Next, a million CAS's per thread:

RAW Haskell CAS:  0.65 - 1.0s   20% productivity, 406MB alloc (can stack overflow)
'Fake' CAS:       14.3 - 17s    270% CPU, 92% productivity, 1.3GB alloc, 1,008,468 successes
Foreign CAS:      Stack overflow after 28.5s ... 300-390% CPU, 15% productivity

                  After bumping the stack size up it takes a long
                  time to finish, even after it has printed the
                  sample success bit vectors.

                  With -K100M:
                  78s           6% productivity, 898MB alloc, 3,324,943 successes
                                67 seconds elapsed in Gen 0 GC!!

Something odd is going on here. How could it spend so long in GC for so little allocation? With only two threads the Foreign implementation drops to 22s, but 17.97s of that is still elapsed time in Gen 0 GC.

What about the Raw Haskell CAS? It will also stack overflow with the current version of the test. With a 1G stack it can do 5Mx4 CAS's (8,9M successful) in 6.7 seconds, and 10M in 17s. And STILL not seeing the previous segfault with the Raw CAS version...

[2011.11.12] Simple Stack Overflow Fix

After making sure that all the (+1)'s in the test are strict, the stack overflow goes away and the numbers change (Raw does 5M in 3.3s instead of 6.7s). BUT there's still quite a lot of time spent in GC. Here's 1Mx4 CAS's again:

RAW Haskell CAS:  0.7s  (23% prod, 0.8s total Gen 0 GC)
'Fake' CAS:       11.8s (91% prod, 0.8s total Gen 0 GC)
Foreign CAS:      52s  (6% prod)

Adding -A1M then makes a negligible change in runtime for Raw, but reduces the number of Gen 0 collections from 484 to 235.
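The strictness fix mentioned in this entry can be illustrated with a small sketch. This is not the test's actual code; it uses plain `modifyIORef` versus `modifyIORef'` from base as a simplified stand-in for the CAS loop, and `bumpLazy`, `bumpStrict`, and `countWith` are hypothetical names.

```haskell
import Data.IORef (IORef, newIORef, readIORef, modifyIORef, modifyIORef')

-- Lazy: each step stores an unevaluated (+1) thunk wrapped around the
-- previous contents. Reading the final value has to unwind the whole
-- chain, which is what can exhaust the stack for large n.
bumpLazy :: IORef Int -> Int -> IO ()
bumpLazy ref n = mapM_ (\_ -> modifyIORef ref (+ 1)) [1 .. n]

-- Strict: modifyIORef' forces the new value before storing it, so the
-- reference always holds an evaluated Int.
bumpStrict :: IORef Int -> Int -> IO ()
bumpStrict ref n = mapM_ (\_ -> modifyIORef' ref (+ 1)) [1 .. n]

-- Run one of the two variants from a fresh counter and read the result.
countWith :: (IORef Int -> Int -> IO ()) -> Int -> IO Int
countWith bump n = do
  ref <- newIORef 0
  bump ref n
  readIORef ref

main :: IO ()
main = do
  a <- countWith bumpLazy 1000
  b <- countWith bumpStrict 1000
  print (a, b)  -- prints (1000,1000)
```

Both variants compute the same answer; they differ only in when the additions are performed, which is why the fix changes stack usage and timing but not the results.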

OK, how about testing on a 3.1GHz Westmere? Wow, just ran into this:

cc1: internal compiler error: Segmentation fault
Please submit a full bug report,
with preprocessed source if appropriate.
See <http://bugzilla.redhat.com/bugzilla> for instructions.
make: *** [all] Error 1

On a different machine it worked (default runtime flags, 1Mx4 CAS's):

RAW Haskell CAS:  0.7s  
'Fake' CAS:       8.1s
Foreign CAS:      46s

The lack of hyperthreading may also be helping.

The primary source of allocation in this example is the accumulation of the [Bool] lists of success and failure. I should disable those and test again.

[2011.11.13] Testing specialized CAS.Foreign instance

All of the above results were for a cell containing an Int. That would not have triggered the specialized (Storable-based) instance in Foreign.hs. There SHOULD be special cases for all word-sized scalars, but currently there's just one for Word32. Let's test that one.

Word32 1Mx4 CAS's:
RAW Haskell CAS:  0.57s  (13.7% prod)
Foreign CAS:      0.64s  (15% prod)

Wow, the foreign one is doing as well as the Haskell one even though there's some extra silliness in the Foreign.CASRef type (causing a runtime case dispatch to unpack).

[2011.11.13] Testing atomicModify based on CAS

An atomicModify based on CAS offers a drop-in replacement that could improve performance. I implemented one which will try CAS until it fails a certain number of times ("30" for now, but needs to be tuned).
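The retry scheme described above might look roughly like the following sketch. Several things here are assumptions rather than the package's code: the name `atomicModifyViaCAS`, the stand-in `fakeCAS` (built on `atomicModifyIORef'` with `Eq`, so that the example needs only base), and in particular the fallback to the ordinary `atomicModifyIORef` once the retry budget is used up, which is one plausible reading of "until it fails a certain number of times".

```haskell
import Data.IORef (IORef, newIORef, readIORef, atomicModifyIORef, atomicModifyIORef')

-- Stand-in CAS (an assumption, not the package's implementation):
-- compares by value and reports whether the swap happened.
fakeCAS :: Eq a => IORef a -> a -> a -> IO (Bool, a)
fakeCAS ref old new = atomicModifyIORef' ref $ \cur ->
  if cur == old then (new, (True, cur)) else (cur, (False, cur))

-- Hypothetical CAS-based atomicModify. The first argument is the
-- retry budget (the notes mention 30 as an untuned starting point).
atomicModifyViaCAS :: Eq a => Int -> IORef a -> (a -> (a, b)) -> IO b
atomicModifyViaCAS maxTries ref f = loop maxTries
  where
    loop n
      -- Budget exhausted: assumed fallback to the standard operation.
      | n <= 0 = atomicModifyIORef ref f
      | otherwise = do
          old <- readIORef ref            -- snapshot the current value
          let (new, result) = f old       -- compute the update outside the CAS
          (ok, _) <- fakeCAS ref old new  -- publish only if nobody interfered
          if ok then return result else loop (n - 1)

main :: IO ()
main = do
  ref <- newIORef (0 :: Int)
  prev <- atomicModifyViaCAS 30 ref (\x -> (x + 1, x))
  now <- readIORef ref
  print (prev, now)  -- prints (0,1)
```

The function has the same shape as `atomicModifyIORef` apart from the extra budget argument, which is what would let a version with a fixed budget act as a drop-in replacement.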

These seem to work well. They are cheaper than you would think given how long it takes to get successful CAS attempts under contention:

1Mx4 CAS attempts or atomicModifies:

0.19s  -- CAS attempts, 1.07M successful
0.02s  -- 1M atomicModifyIORefCAS on 1 thread
0.13s  -- 1Mx2 atomicModifyIORefCAS on 2 threads
0.37s  -- 1Mx4 atomicModifyIORefCAS on 4 threads

And they are cheaper than the real atomicModifyIORef (which also seems to have a stack space problem right now because of its laziness). But with a huge stack (1G) it will succeed:

1.27s  -- 1Mx4 atomicModifyIORef

But inserting extra strictness (an evaluate call) to avoid the stack-leak actually makes it slower:

2.08s  -- 1Mx4 atomicModifyIORef, stricter
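The notes do not show exactly where the evaluate call was inserted, so the following is only one possible form of the "stricter" variant, with hypothetical names `incrLazy` and `incrStrict`. It contrasts a plain lazy increment with one that returns the new value and forces it right after the update.

```haskell
import Control.Exception (evaluate)
import Data.IORef (IORef, newIORef, readIORef, atomicModifyIORef)

-- Lazy variant: atomicModifyIORef stores a thunk for the new value,
-- so repeated calls can build a long chain of pending (+1)s.
incrLazy :: IORef Int -> IO ()
incrLazy ref = atomicModifyIORef ref (\x -> (x + 1, ()))

-- Stricter variant (an assumed reconstruction): hand the new value
-- back as the result and force it with evaluate after each update,
-- so no chain of thunks accumulates between calls.
incrStrict :: IORef Int -> IO ()
incrStrict ref = do
  new <- atomicModifyIORef ref (\x -> let y = x + 1 in (y, y))
  _ <- evaluate new
  return ()

main :: IO ()
main = do
  lazyRef <- newIORef 0
  strictRef <- newIORef 0
  mapM_ (\_ -> incrLazy lazyRef) [1 .. 1000 :: Int]
  mapM_ (\_ -> incrStrict strictRef) [1 .. 1000 :: Int]
  a <- readIORef lazyRef
  b <- readIORef strictRef
  print (a, b)  -- prints (1000,1000)
```

In this form the forcing happens after the atomic update rather than inside it, so the extra cost is an additional evaluation step per call on top of the thunk that atomicModifyIORef still allocates.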