hotel-california

Version 0.0.6.0 (Hackage)

hotel-california is inspired by Trace and otel-cli, two utilities for tracing shell scripts. We needed something like this to instrument local builds, so we could understand how much time people spend waiting on builds, tests, and so on.

Usage

The binary is named hotel. Currently, the only supported command is exec:

$ hotel exec --help
Usage: hotel exec [-s|--span-name SPAN_NAME]
                  [-i|--set-sigint-status SPAN_STATUS]
                  [-a|--attribute KEY=VALUE]...
                  (COMMAND [ARGUMENT]... | --shell SCRIPT)

  Execute the given command with tracing enabled

Available options:
  -h,--help                Show this help text
  -s,--span-name SPAN_NAME The name of the span that the program reports. By
                           default, this is the script you pass in.
  -i,--set-sigint-status SPAN_STATUS
                           The status reported when the process is killed with
                           SIGINT.
  -a,--attribute KEY=VALUE A string attribute to add to the span.
  --shell SCRIPT           Run an arbitrary shell script instead of running an
                           executable command
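The -i/--set-sigint-status option exists because interrupting a local build with Ctrl-C is routine and is not necessarily a failure. The following is a hypothetical illustration of the kind of mapping such an option controls. The function name and the ok/error status names are assumptions, not hotel's actual logic or its accepted SPAN_STATUS values. It relies on the common shell convention of reporting a child killed by signal N as exit status 128 + N, so SIGINT (signal 2) shows up as 130.

```sh
# Hypothetical sketch: pick a span status from a child's exit status.
# The "ok"/"error" names and this mapping are assumptions, not hotel's code.

# Usage: span_status EXIT_CODE [STATUS_FOR_SIGINT]
span_status() {
  exit_code=$1
  # Stand-in for what -i/--set-sigint-status would configure.
  sigint_status=${2:-error}
  case $exit_code in
    0) echo ok ;;
    # Most shells report "killed by signal N" as 128 + N; SIGINT is 2.
    130) echo "$sigint_status" ;;
    *) echo error ;;
  esac
}
```

Without an override, an interrupted run is reported the same as any other failure; with one, a Ctrl-C can be recorded as something less alarming.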

Currently, the program reads its configuration only from environment variables:

  • OTEL_EXPORTER_OTLP_ENDPOINT (defaults to localhost:4317)
  • OTEL_SERVICE_NAME
  • OTEL_EXPORTER_OTLP_HEADERS
  • OTEL_RESOURCE_ATTRIBUTES
  • and probably others; see hs-opentelemetry-sdk for more information
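As a starting point, these variables can be exported in your shell before running hotel. This sketch assumes you send spans directly to Honeycomb (mentioned later in this README) rather than to a local collector. The API key, service name, and attribute values are placeholders. Headers and resource attributes both use the OpenTelemetry convention of comma-separated key=value pairs.

```sh
# Placeholder configuration for hotel; adjust values for your backend.
# Assumption: spans go straight to Honeycomb rather than a local collector
# (without this line, hotel uses localhost:4317).
export OTEL_EXPORTER_OTLP_ENDPOINT="https://api.honeycomb.io:443"
# Service name shown in the tracing backend (hypothetical value).
export OTEL_SERVICE_NAME="local-builds"
# Exporter headers as comma-separated key=value pairs;
# x-honeycomb-team carries the Honeycomb API key.
export OTEL_EXPORTER_OTLP_HEADERS="x-honeycomb-team=YOUR_API_KEY"
# Resource attributes attached to every span (hypothetical values).
export OTEL_RESOURCE_ATTRIBUTES="deployment.environment=dev,team=platform"
```

With these set, a command such as hotel exec --span-name build --shell 'make build' should report a span named build to that backend.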

Background/FAQ

Lol, what's up with the name?

Well, otel-cli is a great name for the tool. But this is a Haskell implementation, so I needed an h in there somewhere. hotel-cli sounds good. But wait... cli... what else could I do with that? Aha!

Sorry

Difference from Trace

The Trace tool has the following basic workflow:

$ TRACE_PARENT=$(trace start "build")
$ make build
$ trace finish

When you call trace start "trace-name", it generates a TraceId and a SpanId and records them in a file in the temporary directory. The filename carries the trace ID and span ID. The file contains the name of the trace, the start time, and any other metadata you provide.

When you call trace finish, the tool looks for the TRACE_PARENT environment variable. It then looks in the $TMP/traces/state directory for a file that matches the TRACE_PARENT. It loads the file, creates a Span with the timestamp given in the file, and then calls span.End. The tool then makes a network request to report this data.
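Here is a rough model of that workflow in POSIX shell. It is not Trace's implementation. The function names, the state directory under ${TMPDIR:-/tmp}, the traceid-spanid handle format, and the two-line file layout are all assumptions. Instead of exporting a span, it only prints the duration it would report. It also uses date +%s, which is widely available but not strictly POSIX.

```sh
# Rough model of Trace's start/finish workflow (not its real code).
# Assumed: state directory, "traceid-spanid" handle, two-line file format.
STATE_DIR="${TMPDIR:-/tmp}/traces/state"

# Print N random lowercase hex characters (N must be even).
rand_hex() {
  od -An -N"$(( $1 / 2 ))" -tx1 /dev/urandom | tr -d ' \n'
}

# Like "trace start NAME": mint IDs, persist name and start time,
# and print the handle the caller stores in TRACE_PARENT.
trace_start() {
  trace_id=$(rand_hex 32)
  span_id=$(rand_hex 16)
  mkdir -p "$STATE_DIR"
  # The filename carries both IDs; the contents hold name and start time.
  printf '%s\n%s\n' "$1" "$(date +%s)" > "$STATE_DIR/$trace_id-$span_id"
  printf '%s-%s\n' "$trace_id" "$span_id"
}

# Like "trace finish": find the state file named by TRACE_PARENT, rebuild
# the span's timing from it, and delete it. A real tool would now create a
# span with these timestamps and send it over the network.
trace_finish() {
  state_file="$STATE_DIR/${TRACE_PARENT:-}"
  if [ -z "${TRACE_PARENT:-}" ] || [ ! -f "$state_file" ]; then
    echo "no state found for TRACE_PARENT='${TRACE_PARENT:-}'" >&2
    return 1
  fi
  name=$(sed -n 1p "$state_file")
  start=$(sed -n 2p "$state_file")
  end=$(date +%s)
  rm -f "$state_file"
  echo "span '$name' ran for $(( end - start ))s"
}
```

The model makes the cost structure visible: each finish step is a separate process that must rediscover its state from disk and then, in the real tool, make its own network request.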

The Problems

Performance

The tool also lets you start groups and run commands that each produce an individual span, so you can see the structure of the overall trace. trace group start is similar to trace start: it creates a new span ID, attaches it to the parent span, and writes it to the temporary directory. trace task does something a bit more idiomatic: it creates a span, runs the command you provide, and then calls span.End. trace group finish is similar to trace finish: it loads the parent span information for the group, creates a span, and calls span.End with the timestamps loaded from the group.

Here is the toy example I ran:

$ TRACE_PARENT=$(trace start "why"); \
  GROUP=$(trace group start "why-1"); \
  trace task "$GROUP" -- go build; \
  trace group finish "$GROUP"; \
  trace finish

According to Honeycomb, this spends 170ms in go build and then roughly another 650ms finishing the rest of the process: the why-1 group takes about 370ms extra, and the final trace finish call adds another 300ms.

This overhead may not matter much for what the tool appears to be designed for, instrumenting CI builds, but it is a problem for my use case: instrumenting local developer workflows. 300ms is noticeable but tolerable if paid once. But paying it for every step we want to record? That's a problem.

The out-of-the-box solution is to run an OpenTelemetry collector locally on the machine, so each call only hands its spans to a local process that reports them periodically in the background. That is an extra deployment step, so it would be nice to avoid it if possible.

Nesting

The API requires you to work in this manner:

$ TRACE_PARENT=$(trace start "my-trace")
$ OUTER_GROUP=$(trace group start "neat")
$ INNER_GROUP=$(trace group start "neat" "$OUTER_GROUP")
$ trace task "$INNER_GROUP" -- make build
$ trace group finish "$INNER_GROUP"
$ trace group finish "$OUTER_GROUP"
$ trace finish

So every trace start creates an identifier for a parent span, but every trace group finish looks up the relevant parent span ID and only then actually creates a child span.

This makes it difficult to create a span and simply "know" whether you're at the root or not. You need that to provide a composable interface: shell scripts calling other shell scripts, all of which can record spans.
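One way to get that "am I the root?" knowledge for free is to pass trace context down through the environment, so that each nested wrapper process can read it. The following is a hypothetical illustration assuming a W3C Trace Context style TRACEPARENT variable ("00-<trace id>-<span id>-<flags>"). It is not hotel's or otel-cli's code, and span_role, span_exec, and rand_hex are made-up names. A wrapper with no TRACEPARENT in its environment starts a new trace as the root. Otherwise it keeps the trace ID, records the incoming span ID as its parent, and hands its own span ID to the command it runs.

```sh
# Hypothetical sketch, not hotel's code: an exec-style wrapper that decides
# whether it is the root span by checking a W3C-style TRACEPARENT variable
# ("00-<32 hex trace id>-<16 hex parent span id>-<2 hex flags>").

# Print N random lowercase hex characters (N must be even).
rand_hex() {
  od -An -N"$(( $1 / 2 ))" -tx1 /dev/urandom | tr -d ' \n'
}

# Print "root", or "child-of <span id>" when a parent context is present.
span_role() {
  if [ -n "${TRACEPARENT:-}" ]; then
    echo "child-of $(printf '%s' "$TRACEPARENT" | cut -d- -f3)"
  else
    echo root
  fi
}

# Run "$@" as a new span: reuse the incoming trace id (or mint one at the
# root) and give the child a TRACEPARENT naming this span as its parent.
span_exec() {
  if [ -n "${TRACEPARENT:-}" ]; then
    trace_id=$(printf '%s' "$TRACEPARENT" | cut -d- -f2)
  else
    trace_id=$(rand_hex 32)
  fi
  span_id=$(rand_hex 16)
  echo "span $span_id ($(span_role))" >&2
  TRACEPARENT="00-$trace_id-$span_id-01" "$@"
}
```

Because each nested wrapper is its own process reading TRACEPARENT from its environment, a script that calls another traced script needs no extra bookkeeping: the inner call simply sees the outer span as its parent.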

Difference from otel-cli

Well, otel-cli solves most of the above problems. Its main entry point is otel-cli exec, which runs a command for you and reports a span for it. You can nest otel-cli exec calls arbitrarily, which works nicely. However, it had issues too, the most challenging being a bug around signal handling. I simply couldn't figure out how signals behave in Go, and none of the advice I found online was much help, so I decided to spike out this tool instead.