@hackage cassava-records 0.1.0.3

Auto-generation of record data types.

cassava-records

A library extension for cassava (the Haskell CSV parser library) that automatically creates a Record data type from a columnar input file, e.g. a CSV file.

What is this tool for?

Say you are working on a project that involves processing a number of comma-separated or tab-separated files. Assuming you are using cassava to load the input files, here is a typical workflow you would follow:

a. Inspect the file that contains Salaries of Employees.

b. Create a Record data type called Salaries to reflect the columns and types found in the file.

c. Create the instances of the Salaries data type that cassava requires to load the file.

Now, imagine that the file you are inspecting contains tens or hundreds of columns. As a good Haskeller you will want to automate steps (a) and (b) to the extent possible. That is precisely what this library does.

cassava-records performs the following tasks. Given an input file (comma- or tab-separated, for example), it reads the whole file, infers a basic data type for each column and automatically creates a Record of the inferred types using Template Haskell.
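To make the last step concrete, here is a minimal sketch of how a record declaration can be assembled with Template Haskell from a list of column names and type names. It is not the library's implementation: mkRecord and preview are hypothetical helpers written for this illustration, and the column types are passed in by hand instead of being inferred from a file.

```haskell
import Language.Haskell.TH

-- | Build a single-constructor record declaration from
-- (column name, type name) pairs. Hypothetical helper, not part of
-- cassava-records.
mkRecord :: String -> String -> [(String, String)] -> Q [Dec]
mkRecord tyName prefix cols =
  pure [DataD [] name [] Nothing [RecC name fields] []]
  where
    name   = mkName tyName
    -- Plain lazy field without strictness or unpack annotations.
    lazy   = Bang NoSourceUnpackedness NoSourceStrictness
    fields = [ (mkName (prefix ++ col), lazy, ConT (mkName ty))
             | (col, ty) <- cols ]

-- | Pretty-print the declaration that a splice of 'mkRecord' would
-- introduce, without actually splicing it.
preview :: IO String
preview = pprint <$> runQ (mkRecord "Salaries" "_"
  [("emp_no", "Integer"), ("salary", "Double"), ("status", "Bool")])
```

Evaluating preview >>= putStrLn in GHCi prints the generated declaration. Splicing it with $(mkRecord ...) would require mkRecord to live in a separate module from the splice, because of Template Haskell's stage restriction.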

Quick Start

Example 1

Using data/salaries_simple.csv:

emp_no,name,salary,status,years
1,John Doe,100.0,True,1
2,Jill Doe, 200.10,False,2
3,John Doe Sr,101.0,T,3
4,Jill Doe Sr, 10101.10,f,4.2
5,John Doe Jr,1010101.0,true,5.1
6,Jill Doe Jr, 10101.10,false,6


{-# LANGUAGE TemplateHaskell #-}
{-# LANGUAGE DeriveGeneric #-}

import Data.Cassava.Records
import Data.Csv
import qualified Data.ByteString.Lazy as BL
import Data.Vector as V
import Data.Text as DT

$(makeCsvRecord "Salaries" "data/salaries_simple.csv" "_" commaOptions)

The makeCsvRecord function takes 4 arguments:

  1. "Salaries": a String that is used as the name of the Record.
  2. "data/salaries_simple.csv": the path to the input file.
  3. "_": a string used to prefix each field name. Useful if we need to build lenses for this record.
  4. commaOptions: the defaultDecodeOptions defined in the cassava library (comma-delimited).
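Item 4 suggests that the last argument is a cassava DecodeOptions value. Assuming that holds, a tab-separated file could be described with cassava's decDelimiter field as sketched below. The name tabDecodeOptions is made up for this sketch, and cassava-records may already export an equivalent value.

```haskell
import Data.Char (ord)
import Data.Csv (DecodeOptions (..), defaultDecodeOptions)

-- | Decode options for tab-separated input, built with cassava's own
-- record field. The name is illustrative, not taken from cassava-records.
tabDecodeOptions :: DecodeOptions
tabDecodeOptions =
  defaultDecodeOptions { decDelimiter = fromIntegral (ord '\t') }
```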

Loading the module with the Salaries splice in GHCi shows the generated type:

1 >:info Salaries
data Salaries
  = Salaries {
    _emp_no:: Integer,
    _name :: Text,
    _salary :: Double,
    _status :: Bool,
    _years:: Double}

Note that all column names are in lower case and "_" has been prefixed to the column names.

To be consistent, cassava-records converts all column headers to lower case before creating the corresponding field name for each header. Therefore, if the column headers are all upper case, we need to provide a field modifier when creating the ToNamedRecord and FromNamedRecord instances for cassava.

Note that mixed-case column headers are trickier. The current version of the library does not work very well with them.
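The name mapping described here can be sketched with plain String functions. The helpers below are hypothetical and only mirror the behaviour described in the text: toFieldName shows how a header becomes a field name, upperModifier recovers all-upper-case headers, and lookupModifier is one possible way to handle mixed-case headers by consulting the original header list.

```haskell
import Data.Char (toLower, toUpper)

-- | Field name as described above: the prefix followed by the
-- lower-cased column header.
toFieldName :: String -> String -> String
toFieldName prefix header = prefix ++ map toLower header

-- | Modifier for files whose headers are all upper case:
-- drop the leading underscore and upper-case the rest.
upperModifier :: String -> String
upperModifier ('_' : rest) = map toUpper rest
upperModifier name         = name

-- | Modifier for mixed-case headers: map each field name back to the
-- original header via a lookup table. Fields without an entry fall
-- back to the name with leading underscores removed. Headers that
-- differ only by case collide; the first one wins.
lookupModifier :: [String] -> String -> String
lookupModifier headers field =
  case lookup field table of
    Just header -> header
    Nothing     -> dropWhile (== '_') field
  where
    table = [ (toFieldName "_" h, h) | h <- headers ]
```

Either modifier has the type String -> String, so it can be used as the fieldLabelModifier of cassava's generic Options.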

There is a convenience function called makeInstance that creates the instances required by cassava. The instances it creates use the default fieldModifierOptions settings shown below.


fieldModifierOptions :: Options
fieldModifierOptions = defaultOptions { fieldLabelModifier = rmUnderscore }
  where
    rmUnderscore ('_':str) = DT.unpack . DT.toUpper . DT.pack $ str
    rmUnderscore str = str

-- The ToNamedRecord and FromNamedRecord instances need the field
-- modifier because each field name is prefixed with "_".
instance ToNamedRecord Salaries where
  toNamedRecord = genericToNamedRecord fieldModifierOptions

instance FromNamedRecord Salaries where
  parseNamedRecord = genericParseNamedRecord fieldModifierOptions

main :: IO ()
main = do
  v <- loadData "data/salaries_simple.csv":: IO (V.Vector Salaries)
  putStrLn . show $ v

In GHCi we see (formatted for clarity)

2 >loadData "data/salaries.csv"
[Salaries {_emp_no = 1, _name = "John Doe", _salary = 100.0, _status = False, _years = 1.0},
 Salaries {_emp_no = 2, _name = "Jill Doe", _salary = 200.1, _status = False, _years = 2.0},
 Salaries {_emp_no = 3, _name = "John Doe Sr", _salary = 101.0, _status = True, _years = 3.0},
 Salaries {_emp_no = 4, _name = "Jill Doe Sr", _salary = 10101.1, _status = False, _years = 4.2},
 Salaries {_emp_no = 5, _name = "John Doe Jr", _salary = 1010101.0, _status = False, _years = 5.1},
 Salaries {_emp_no = 6, _name = "Jill Doe Jr", _salary = 10101.1, _status = False, _years = 6.0}]

Note that the type inference in the above example works as follows:

  1. If all values in a column come from the set {true, t, false, f} (ignoring case), the inferred type is Bool.
  2. If all values in a column are numeric, Integer is tried first; if any value is not an Integer, Double is inferred. For example, the inferred type for emp_no is Integer whereas for years it is Double.
  3. In all other cases, Text is inferred.
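The three rules can be sketched as a pure function over the cells of one column. This is an illustration of the rules as stated, not the library's code: ColType and inferColumn are hypothetical names, and numeric parsing uses readMaybe, so it follows Haskell's literal syntax, which may accept or reject some inputs differently from the library.

```haskell
import Data.Char (isSpace, toLower)
import Data.List (dropWhileEnd)
import Data.Maybe (isJust)
import Text.Read (readMaybe)

-- | The basic column types named in the rules above.
data ColType = TBool | TInteger | TDouble | TText
  deriving (Eq, Show)

-- | Remove surrounding whitespace, e.g. the leading space in " 200.10".
trim :: String -> String
trim = dropWhileEnd isSpace . dropWhile isSpace

isBool, isInteger, isDouble :: String -> Bool
isBool s    = map toLower s `elem` ["true", "t", "false", "f"]
isInteger s = isJust (readMaybe s :: Maybe Integer)
isDouble s  = isJust (readMaybe s :: Maybe Double)

-- | Apply the rules in order: Bool, then Integer, then Double,
-- and Text as the fallback. An empty column is treated as Text.
inferColumn :: [String] -> ColType
inferColumn cells
  | null vals          = TText
  | all isBool vals    = TBool
  | all isInteger vals = TInteger
  | all isDouble vals  = TDouble
  | otherwise          = TText
  where
    vals = map trim cells
```

With the columns of data/salaries_simple.csv, this sketch classifies emp_no as TInteger, years as TDouble (because of 4.2 and 5.1), status as TBool and name as TText.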

Example 2 (Missing Values)

The library also supports type inference when values are missing. For example, in data/salaries_mixed_input.csv

emp_no,name,salary,status,years
1,John Doe,100.0,True,1
2,Jill Doe, 200.10,False,2
3,John Doe Sr,101.0,T,
4,Jill Doe Sr,10101.10,,4.2
5,John Doe Jr,,true,5.1
6,, 10101.10,false,6

the years value for John Doe Sr, the status for Jill Doe Sr, the salary for John Doe Jr and the name in the last row are missing. In such cases, the inferred type is wrapped in a Maybe type.

The generated record type is then as follows:

3 >:info Salaries
data Salaries
  = Salaries {
    _emp_no:: Integer,
    _name :: Maybe Text,
    _salary :: Maybe Double,
    _status :: Maybe Bool,
    _years:: Maybe Double}
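One way to picture the Maybe wrapping: blank cells are set aside, the base type is inferred from the remaining cells, and the result is wrapped in Maybe only if at least one cell was blank. The helpers below are hypothetical and sketch just the blank handling; the base type is passed in as a string.

```haskell
import Data.Char (isSpace)

-- | Split a column into a "has missing values" flag and its
-- non-blank cells. A cell counts as blank if it is empty or
-- contains only whitespace.
splitMissing :: [String] -> (Bool, [String])
splitMissing cells = (any isBlank cells, filter (not . isBlank) cells)
  where
    isBlank = all isSpace

-- | Render a field type, wrapping it in Maybe when values are missing.
fieldType :: Bool -> String -> String
fieldType hasMissing base
  | hasMissing = "Maybe " ++ base
  | otherwise  = base
```

For the years column above, splitMissing ["1", "2", "", "4.2", "5.1", "6"] reports a missing value, so fieldType yields "Maybe Double", while emp_no has no blanks and stays a plain Integer.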

The following code loads this data:

{-# LANGUAGE TemplateHaskell #-}
{-# LANGUAGE DeriveGeneric #-}

import Data.Cassava.Records
import Data.Csv
import qualified Data.ByteString.Lazy as BL
import Data.Vector as V
import Data.Text as DT

$(makeCsvRecord "SalariesMixed" "data/salaries_mixed_input.csv" "_" commaOptions)
$(makeInstance "SalariesMixed")
-- ^ note that we can use this function instead of manually defining
-- all instances required by Cassava

main :: IO ()
main = do
  v <- loadData "data/salaries_mixed_input.csv":: IO (V.Vector SalariesMixed)
  putStrLn . show $ v

The output will be as follows:

[SalariesMixed {_emp_no = 1, _name = Just "John Doe", _salary = Just 100.0, _status = Just False, _years = Just 1.0},
 SalariesMixed {_emp_no = 2, _name = Just "Jill Doe", _salary = Just 200.1, _status = Just False, _years = Just 2.0},
 SalariesMixed {_emp_no = 3, _name = Just "John Doe Sr", _salary = Just 101.0, _status = Just True, _years = Nothing},
 SalariesMixed {_emp_no = 4, _name = Just "Jill Doe Sr", _salary = Just 10101.1, _status = Nothing, _years = Just 4.2},
 SalariesMixed {_emp_no = 5, _name = Just "John Doe Jr", _salary = Nothing, _status = Just False, _years = Just 5.1},
 SalariesMixed {_emp_no = 6, _name = Nothing, _salary = Just 10101.1, _status = Just False, _years = Just 6.0}]
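Downstream code then has to deal with the Maybe fields. The sketch below uses a hand-written stand-in for the generated record (with String in place of Text and only three fields, so that it has no dependencies) to show two common patterns: skipping missing values in an aggregate and substituting a default. The functions totalSalary and displayName are illustrative names.

```haskell
import Data.Maybe (fromMaybe, mapMaybe)

-- Hand-written stand-in for the generated record; the real type is
-- produced by the makeCsvRecord splice.
data SalariesMixed = SalariesMixed
  { _emp_no :: Integer
  , _name   :: Maybe String
  , _salary :: Maybe Double
  } deriving (Show)

-- | Sum the salaries that are present, ignoring missing ones.
totalSalary :: [SalariesMixed] -> Double
totalSalary = sum . mapMaybe _salary

-- | Name to display, with a placeholder when the name is missing.
displayName :: SalariesMixed -> String
displayName = fromMaybe "<unknown>" . _name
```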

Here is a complete working program that uses both examples:

{-# LANGUAGE TemplateHaskell #-}
{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE DuplicateRecordFields #-}
{-# LANGUAGE DeriveDataTypeable #-}

module Main where

import Data.Cassava.Records
import Data.Csv
import qualified Data.ByteString.Lazy as BL
import Data.Vector as V
import Data.Text as DT
import qualified Text.PrettyPrint.Tabulate as T
import Language.Haskell.TH
-- import Control.Lens hiding (element)

$(makeCsvRecord "Salaries" "data/salaries_simple.csv" "_" commaOptions)
-- $(makeInstance "Salaries")

$(makeCsvRecord "SalariesMixed" "data/salaries_mixed_input.csv" "_" commaOptions)
$(makeInstance "SalariesMixed")

-- The options and instances below are not required if the
-- $(makeInstance "Salaries") splice above is enabled (it is commented
-- out in this example).
myOptions :: Options
myOptions = defaultOptions { fieldLabelModifier = rmUnderscore }
  where
    rmUnderscore ('_':str) = DT.unpack . DT.toUpper . DT.pack $ str
    rmUnderscore str = str

instance ToNamedRecord Salaries where
  toNamedRecord = genericToNamedRecord myOptions

instance FromNamedRecord Salaries where
  parseNamedRecord = genericParseNamedRecord myOptions

main :: IO ()
main = do
  v <- loadData "data/salaries_simple.csv" :: IO (V.Vector Salaries)
  v1 <- loadData "data/salaries_mixed_input.csv" :: IO (V.Vector SalariesMixed)
  putStrLn . show $ v
  putStrLn . show $ v1

Caveats (or list of future enhancements)

  1. Each column name, together with the prefix ("_"), must be a valid Haskell field name. For example, column names cannot contain spaces or other characters that are not allowed in record field names.
  2. The library loads the whole file during compilation to infer types, so large files increase compile time. Alternative workflows, such as trimming the file to a sample of rows or dumping the generated splice into a file, are recommended. In the future, the makeCsvRecord function could take a parameter that specifies the maximum number of rows used to infer the types.
  3. The inferred types are limited to Text, Bool, Integer, Double and the Maybe variants of those. Future support may include DateTime.
  4. Mixed-case column headers are not supported automatically. A more complex fieldLabelModifier than the one in fieldModifierOptions needs to be provided.
  5. Currently there is no option to provide custom types.
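For caveat 2, trimming the input to a sample can be done with a few lines of plain Haskell before compiling. The sketch below keeps the header line plus the first n data rows. sampleRows and writeSample are hypothetical helpers, and the line-based split assumes that no quoted field contains a newline. Keep in mind that types inferred from a sample may be too narrow for the full file, for example if missing values only occur in later rows.

```haskell
-- | Keep the header line plus the first n data rows.
-- Assumes one record per line (no newlines inside quoted fields).
sampleRows :: Int -> String -> String
sampleRows n = unlines . take (n + 1) . lines

-- | Write a sample of the source file to a destination file.
-- The two paths must differ, because the source is read lazily.
writeSample :: Int -> FilePath -> FilePath -> IO ()
writeSample n src dst = readFile src >>= writeFile dst . sampleRows n
```

The sample file can then be given to makeCsvRecord in place of the full file, while the full file is still loaded at run time.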