@hackage conduit-vfs0.1.0.3

Virtual file system for Conduit; disk, pure, and in-memory impls.

Virtual File System (VFS) Conduit for Haskell

Motivation

I was writing a Haskell program to process a lot of JSON-formatted log records: far too many to hold in memory. This is a perfect situation for conduits, so I built out my pipeline. However, I encountered a bit of a hitch: many of the files were stored in .tar.gz formats, so what I wanted was to recursively navigate into the .tar.gz file and feed its embedded files into my pipeline. What would be even cooler is if I could slurp in S3 files, or files from Dropbox, or any of that, and then process them all the same way.

Concepts

This library is an abstraction of a filesystem that operates via conduits. A filesystem is defined as a particular kind of key-value store. The keys are hierarchical strings, with different levels of the hierarchy being seperated by </>. These levels are referred to as directories. Directories may be listed and navigated, but are neither directly created nor directly destroyed. The leaf values are lazy ByteString values. These values are referred to as files. Files can be listed, removed, and written. A file's path consists of the names of its parent directories, from most distant to immediate, concatenated by </>. The directories and files of a filesystem are that filesystem's nodes. A directory that contains no child nodes (ie: an empty directory) may or may not be deleted, which may cause recursive deletes of ancestor directories if the ancestor is now empty. When a file is written, any missing directories in the file's path will be created.

Provided Virtual File Systems

There are three virtual file system implementations provided as a part of this library:

  • InMemory -- This filesystem uses an MVar to store the state of the filesystem in memory. All the data for all the files are currently stored in-memory as the raw bytes: future versions of this library may persist larger files into compressed bytestrings or temp files.
  • Pure -- This filesystem uses StateT to persist the state of the filesystem. Unlike InMemory, the state is not portable and changes in state are not persisted after the monad resolves. However, this implementation is pure, which means it can be used with pure conduits or as a way to make testing faster and idempotent.
  • Disk -- This virtual file system is the traditional disk-based filesystem, with folders persisted to the operating system's filesystem.

Other Virtual File Systems

There are other VFS implementations in the works:

  • Zip -- Treat a .zip file as a filesystem, and provide conduits that will auto-expand zip entries provided by upstream VFS conduits.
  • Tar -- Treat a .tar file as a filesystem, and provide conduits that will auto-expand tar entries provided by upstream VFS conduits.
  • S3 -- Treat an AWS S3 bucket as a filesystem.
  • Twitter -- Treat Twitter as a file system. (eg: /users/AnthroPunk/status/1042294921216978944 goes to https://twitter.com/AnthroPunk/status/1042294921216978944).
  • SQL -- Treat an HDBC implementation as a filesystem. (eg: /foos/id/12345 retrieves the tuple from the foos table where the id field is 12345)

In addition, we are looking to implement a conduit which automatically uncompresses or compresses .xz, .lzma, .z, and .gz files provided by upstream VFS conduits.