@hackage inline-asm0.5.0.2

Inline some Assembly in ur Haskell!

inline-asm

Build Status

When inline C is too safe.

Did you try inline-c, but it's not enough? You need more? Nothing seems to satisfy? inline-asm to the rescue!

And, since the inline assembly is just a usual Haskell value (even if manipulated at compile-time), there's a lot of pretty cool stuff one can do, like, for instance, explicit compile-time loop unrolling.

Examples

Swapping two Ints and also incrementing one of them by two:

defineAsmFun "swap2p1"
  [asmTy| (a : Int) (b : Int) | (_ : Int) (_ : Int)]
  [asm|
  xchg {a}, {b}
  add $2, {b}
  |]

This provides a function swap2p1 :: Int -> Int -> (Int, Int) that, well, swaps two Ints. Note that the resulting function is pure, and the {a}, {b} antiquoters.

Getting the last character of a ByteString, or a default character if it's empty:

defineAsmFun "lastChar"
  [asmTy| (bs : ByteString) (def : Word) | (w : Word) |]
  [asm|
  test {bs:len}, {bs:len}
  jz is_zero
  movzbq -1({bs:ptr},{bs:len}), {w}
  RET_HASK
is_zero:
  mov {def}, {w}
  |]

This provides a function lastChar :: ByteString -> Word -> Word. Note the special {bs:ptr} and {bs:len} antiquoters, as well as RET_HASK command to return early.

SIMD-accelerated character occurrences count in a string:

defineAsmFun "countCharSSE42"
  [asmTy| (ch : Word8) (ptr : Ptr Word8) (len : Int) | (cnt : Int) |] $
  unroll "i" [12..15]
  [asm|
  push %r{i}|] <> [asm|
  vmovd {ch}, %xmm15
  vpxor %xmm0, %xmm0, %xmm0
  vpshufb %xmm0, %xmm15, %xmm15

  shr $7, {len}

  mov $16, %eax
  mov $16, %edx

  xor {cnt}, {cnt}

  {move ptr rdi}
loop: |] <> unrolls "i" [1..8] [
  [asm|
  vmovdqa {(i - 1) * 0x10}({ptr}), %xmm{i}
  |], [asm|
  vpcmpestrm $10, %xmm15, %xmm{i}
  vmovdqa %xmm0, %xmm{i}
  |], [asm|
  vmovq %xmm{i}, %r{i + 7}
  |], [asm|
  popcnt %r{i + 7}, %r{i + 7}
  |], [asm|
  add %r{i + 7}, {cnt}
  |]
  ] <>
  [asm|
  add $128, {ptr}
  dec {len}
  jnz loop|] <> unroll "i" [15,14..12] [asm|
  pop %r{i} |]

This provides a function countCharSSE42 :: Word8 -> Ptr Word8 -> Int -> Int. Note the unroll/unrolls Haskell function for compile-time code generation and loop unrolling with arithmetic expressions in the templates.

Impure computation depending on some external state, like reading the CPU's time stamp counter:

defineAsmFunM "rdtsc"
  [asmTy| | (out : Word64) |]
  [asm|
  rdtsc
  mov %rdx, {out}
  shl $32, {out}
  add %rax, {out}
  |]

This provides a function rdtsc :: PrimMonad m => m Word64 which can be used in ST or IO contexts. Note the M letter in defineAsmFunM.

Basic usage

The entry point is the defineAsmFun function from Language.Asm.Inline as well as the asm and asmTy quasiquoters from Language.Asm.Inline.QQ.

First, enable some extensions required for Template Haskell and for the unlifted FFI marshalling stuff:

{-# LANGUAGE TemplateHaskell, QuasiQuotes #-}
{-# LANGUAGE GHCForeignImportPrim, UnliftedFFITypes, UnboxedTuples #-}

then one can just do

defineAsmFun "funName"
  [asmTy| (someInt : Int) (somePtr : Ptr Word) (someStr : BS.ByteString) | (len : Int) (count : Int) |]
  [asm|
  ; your asm code follows
  |]

to define a function funName of the type Int -> Ptr Word -> BS.ByteString -> (Int, Int).

Antiquotation

Good news: it's not necessary to memorize the GHC calling convention to access the arguments and the output values slots! Instead, one can use the {someInt} syntax to refer to the argument or the return slot named someInt.

NB: despite that, be careful to not accidentally overwrite an input parameter for now by picking a wrong register for temporary computations. We might introduce some syntax to pick unused registers in a future version, but for now care must be taken.

ByteString parameters are supported, but, being composite objects, they are a bit special: an input parameter of type ByteString actually takes two registers: one for the address of the string, and one for its length.

Sometimes it might be handy to reassociate an input parameter with another register. For this, the {move argName newReg} antiquoter can be used (for instance, {move someInt rdi}). This will both update the mapping from argument names to register names as well as issue an assembly mov command.

In case you need to return early to the Haskell-land, just write RET_HASK, which gets substituted by the actual command to return to Haskell.

Explicit loop unrolling

The asm quasiquoter basically produces a string, so it can be manipulated at compile-time with arbitrary Haskell functions. In particular, a string template can be replicated at compile time.

The unroll function unrolls a single asm code block, calculating arithmetic expressions involving the unroll variable, so

unroll "i" [1..8] [asm|vmovdqa {(i - 1) * 0x10}({ptr}), %xmm{i}|]

is equivalent to

vmovdqa 0x0({ptr}), %xmm1
vmovdqa 0x10({ptr}), %xmm2
vmovdqa 0x20({ptr}), %xmm3
vmovdqa 0x30({ptr}), %xmm4
vmovdqa 0x40({ptr}), %xmm5
vmovdqa 0x50({ptr}), %xmm6
vmovdqa 0x60({ptr}), %xmm7
vmovdqa 0x70({ptr}), %xmm8

unrolls works analogously, but it takes a list of asm code blocks (instead of a single block), unrolls each of them and then concatenates the results. Equationally,

unrolls var ints codes = foldMap (unroll var ints) codes

The countCharSSE42 function above might be a pretty good example.

Impure functions

Most of the functions above are actually pure: they return the same result given the same parameters and have no side effects. Perhaps unsurprisingly, this is not always the case. Let's consider the rdtsc example again and assume we've written

defineAsmFun "rdtsc"
  [asmTy| (_ : Unit) | (out : Word64) |]
  [asm|
  rdtsc
  mov %rdx, {out}
  shl $32, {out}
  add %rax, {out}
  |]

(BTW we need a dummy Unit input here since otherwise the type of the generated imported Assembly function would be just Word64#, which is not a function but a value, and thus disallowed by GHC).

How do we use this function to measure something? We'd probably write something like

measure = do
  let r1 = rdtsc Unit
  runLongComputation
  let r2 = rdtsc Unit
  print $ r2 - r1

The problem is that the compiler is very keen on transforming this into

measure = do
  let r1 = rdtsc Unit
  runLongComputation
  let r2 = r1
  print $ r2 - r1

so every computation is executed instantly according to our measurements, and no amount of {-# NOINLINE #-} and the likes will fix it.

The only fix is to actually thread through the State# token, which is what happens when we use the monadic function defineAsmFunM:

defineAsmFunM "rdtsc"
  [asmTy| | (out : Word64) |]
  [asm|
  rdtsc
  mov %rdx, {out}
  shl $32, {out}
  add %rax, {out}
  |]

In this case, the generated Assembly function would be imported with the type forall s. State# s -> (# State# s, Word64# #), and what happens to State# s is completely opaque to the compiler, so it can no longer "optimize" this, and the following works as expected:

measure = do
  r1 <- rdtsc
  runLongComputation
  r2 <- rdtsc
  print $ r2 - r1

By the way, now that there's the state token parameter, we no longer need a dummy Unit.

Safety and notes

  • Firstly, all of this is utterly unsafe.
  • While this package provides some shortcuts, understanding the calling convention (in particular, which registers are actually used) is important. The best source of truth is, of course, GHC's source code.
  • Each function is compiled in its own .S file, so one can freely pick arbitrary naming for the labels and so on, but, on the other hand, one cannot access labels in other functions. This can be remedied somewhat easily — consider throwing up an issue if that's actually desired.
  • Finally, all of this is utterly unsafe.