step 04: Error

Failure Modes

In the last step, we concluded our parser API design for the time being. Let’s turn to error handling.

One common approach to handling failure is to distinguish between two types of errors: those that are a direct consequence of user input, and those that just happen - “internal” errors. There’s a strong correlation between this distinction and the notion of pure and side-effecting functions. User errors usually arise from pure computations (they only depend on the input), whereas internal errors result from side effects (I/O failures, etc.).

User errors can either be handled at some point, or they should be brought to the attention of the user. It is useful to be pretty clear about which computations can yield which kind of user errors. Internal errors, on the other hand, are of no interest to the user and will mostly just end up being logged/reported for future investigation. We don’t really care what kind of internal error may arise from a computation, only whether it can happen at all.

Representation

To encode this distinction, a common pattern is to have a specific set of user error types, typically an ADT, for each module. The result of a potentially failing application will be either a successful result of the appropriate type or a user error, often represented by the Either data type. Internal errors may be propagated as exceptions - not necessarily by throwing them and letting them bubble up the stack, but by embedding them into a representation of a side-effecting computation, such as Scala’s built-in Future or cats-effect’s IO.

Let’s assume that a successful result is of type T, the user error type is E and the side-effecting computation context is F. This gives rise to the following cases for function result types:

T - A pure computation that cannot raise user errors.
Either[E, T] - A pure computation that may raise user errors.
F[T] - A side-effecting computation that may suffer from internal errors.
F[Either[E, T]] - A side-effecting computation with potential internal and user errors.
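
As a rough illustration of these four shapes, here is a minimal sketch that uses only the standard library, with Future standing in for F. The names ParseError, trim, parseAge, readFirstLine and readAge are hypothetical and exist only for this example; they are not part of the parser.

```scala
import java.nio.file.{Files, Path}
import scala.concurrent.{ExecutionContext, Future}

// Hypothetical user error type, for this sketch only.
enum ParseError:
  case NotANumber(input: String)

// T: a pure computation that cannot raise user errors.
def trim(s: String): String = s.trim

// Either[E, T]: a pure computation that may raise a user error.
def parseAge(s: String): Either[ParseError, Int] =
  s.toIntOption.toRight(ParseError.NotANumber(s))

// F[T]: a side-effecting computation; an I/O problem (or an empty file)
// surfaces as a failed Future, i.e. as an internal error.
def readFirstLine(path: Path)(using ExecutionContext): Future[String] =
  Future(Files.readAllLines(path).get(0))

// F[Either[E, T]]: internal errors live in the Future, user errors in the Either.
def readAge(path: Path)(using ExecutionContext): Future[Either[ParseError, Int]] =
  readFirstLine(path).map(line => parseAge(trim(line)))
```

Note how the user error stays visible in the signature all the way up, while the internal error is only implied by the presence of Future.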

Obviously we’ll need some API for Either and F to handle threading results through these representations - and we’ll find that this will often be based on generic abstractions just like Functor and Applicative…

So much for theory, let’s put it into practice. We will focus on equipping our core CSV parser with explicit user errors - i.e. for now we’ll continue to let internal errors bubble up as exceptions, and we’ll ignore the file reading part altogether.

Either

Let’s start by defining our user error types. So far we have encountered two error modes in the core parser: Row exhaustion, and failure to parse a column to the expected type.

enum CSVParseFailure:
  case ColumnParseFailure(cause: Throwable)
  case RowExhaustionFailure

We’ll provide a type alias for the Either instantiation for this error and make it the return type of our parse operation.

type CSVResult[T] = Either[CSVParseFailure, T]

trait RowParser[T]:
  def parse(row: Row): CSVResult[(T, Row)]

We can easily convert #string. Note that Either has an Applicative instance that threads computations through its “success” (i.e. right-hand) side, so we can use #pure.

val string: RowParser[String] =
  case h :: t => (h, t).pure
  case Nil => RowExhaustionFailure.asLeft[(String, Row)]
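
To see how the remaining row is threaded from one parse step to the next, here is a small standalone sketch. It assumes that Row is an alias for List[String] (suggested by the patterns above, but not shown in this post) and uses plain Left/Right instead of the cats syntax so that it runs without any dependencies. The values first, second and third are only for illustration.

```scala
// Assumption: Row is a list of column strings.
type Row = List[String]

enum CSVParseFailure:
  case ColumnParseFailure(cause: Throwable)
  case RowExhaustionFailure

type CSVResult[T] = Either[CSVParseFailure, T]

trait RowParser[T]:
  def parse(row: Row): CSVResult[(T, Row)]

// Same behaviour as #string above, written with plain Either constructors.
val string: RowParser[String] = row =>
  row match
    case h :: t => Right((h, t))
    case Nil    => Left(CSVParseFailure.RowExhaustionFailure)

// Each step consumes one column and hands the rest of the row to the next step.
val first  = string.parse(List("a", "b"))
val second = first.flatMap { case (_, rest) => string.parse(rest) }
val third  = second.flatMap { case (_, rest) => string.parse(rest) }
```

Here first and second succeed with "a" and "b" respectively, while third runs out of columns and ends up as a Left holding RowExhaustionFailure.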

Monad

What about #int and #date, though? We cannot keep using #map() over #string. We already get a CSVResult from the string parser, and we want to produce another CSVResult that depends on the success value of the former upon conversion.

Fortunately there’s another abstraction built on top of Applicative that supports exactly this kind of dependent computation chaining. Monad adds another function #flatMap() (and its operator alias >>=) to our toolbox. It is already supported to some extent by the Scala core language and library - many types, including Either, have a #flatMap() method, and there are for expressions that provide syntactic sugar for nested #flatMap()/#map() expressions.
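
The relationship between #flatMap() and for expressions can be sketched with the standard library’s Either alone. The functions parseInt, divide and the two quotient variants are hypothetical examples, unrelated to the CSV parser; String serves as a stand-in error type.

```scala
def parseInt(s: String): Either[String, Int] =
  s.toIntOption.toRight(s"not a number: $s")

def divide(a: Int, b: Int): Either[String, Int] =
  if b == 0 then Left("division by zero") else Right(a / b)

// Explicit chaining: each step depends on the success value of the previous one.
def quotientExplicit(x: String, y: String): Either[String, Int] =
  parseInt(x).flatMap(a => parseInt(y).flatMap(b => divide(a, b)))

// The same computation as a for expression; the compiler rewrites it
// into nested flatMap/map calls.
def quotientFor(x: String, y: String): Either[String, Int] =
  for
    a <- parseInt(x)
    b <- parseInt(y)
    q <- divide(a, b)
  yield q
```

Both variants stop at the first Left they encounter, so a failure in an early step skips all later ones.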

To make it easier to digest, let’s split the task of providing a Monad-based replacement for our previous #map() usage in two: First define a function #emap() that converts a RowParser given a function A => CSVResult[B], implemented using #flatMap(). Then define a function #guardMap() that takes an (impure!) function A => B, converts it to A => CSVResult[B] by catching potential exceptions, and feeds this to #emap().

extension[A](p: RowParser[A])
  def emap[B](f: A => CSVResult[B]): RowParser[B] =
    p.parse(_) >>= { case (res, rem) => f(res).map(_ -> rem) }
  def guardMap[B](f: A => B): RowParser[B] =
    emap { a => Either.catchNonFatal(f(a)).leftMap(ColumnParseFailure(_)) }

This gives us our new implementations for #int and #date.

val int: RowParser[Int] = string.guardMap(_.toInt)
val date: RowParser[LocalDate] = string.guardMap(LocalDate.parse)
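
To try these parsers without cats on the classpath, the following sketch re-creates them with the standard library only. It assumes Row is List[String]; scala.util.Try stands in for Either.catchNonFatal (both catch only non-fatal exceptions), and plain #flatMap() replaces >>=. It is an approximation for experimenting, not the code from the repository.

```scala
import java.time.LocalDate
import scala.util.Try

type Row = List[String] // assumption: a row is a list of column strings

enum CSVParseFailure:
  case ColumnParseFailure(cause: Throwable)
  case RowExhaustionFailure
import CSVParseFailure.*

type CSVResult[T] = Either[CSVParseFailure, T]

trait RowParser[T]:
  def parse(row: Row): CSVResult[(T, Row)]

val string: RowParser[String] = row =>
  row match
    case h :: t => Right((h, t))
    case Nil    => Left(RowExhaustionFailure)

extension [A](p: RowParser[A])
  // Run p, then feed its success value to f, keeping the remaining row.
  def emap[B](f: A => CSVResult[B]): RowParser[B] = row =>
    p.parse(row).flatMap { case (res, rem) => f(res).map(_ -> rem) }
  // Wrap an exception-throwing conversion into a ColumnParseFailure.
  def guardMap[B](f: A => B): RowParser[B] =
    p.emap(a => Try(f(a)).toEither.left.map(ColumnParseFailure(_)))

val int: RowParser[Int] = string.guardMap(_.toInt)
val date: RowParser[LocalDate] = string.guardMap(s => LocalDate.parse(s))
```

With these definitions, int.parse(List("42", "x")) succeeds with 42 and the remaining column "x", int.parse(List("abc")) yields a ColumnParseFailure wrapping the NumberFormatException, and int.parse(Nil) yields RowExhaustionFailure.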

We’ll also need to convert our Applicative instance and the inductive row parser derivation step to thread a CSVResult through monad chaining - please refer to the code for details.
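
To give an idea of what threading a CSVResult through two parsers looks like, here is a sketch of a hypothetical #product() combinator that runs one parser after the other. It is not necessarily how the repository implements its Applicative instance; it assumes Row is List[String] and relies on the standard library only.

```scala
type Row = List[String] // assumption: a row is a list of column strings

enum CSVParseFailure:
  case ColumnParseFailure(cause: Throwable)
  case RowExhaustionFailure

type CSVResult[T] = Either[CSVParseFailure, T]

trait RowParser[T]:
  def parse(row: Row): CSVResult[(T, Row)]

val string: RowParser[String] = row =>
  row match
    case h :: t => Right((h, t))
    case Nil    => Left(CSVParseFailure.RowExhaustionFailure)

extension [A](pa: RowParser[A])
  // Hypothetical combinator: run pa, then run pb on whatever pa left over.
  // If either step fails, the Left short-circuits the rest.
  def product[B](pb: RowParser[B]): RowParser[(A, B)] = row =>
    pa.parse(row).flatMap { case (a, rest) =>
      pb.parse(rest).map { case (b, rest2) => ((a, b), rest2) }
    }

val twoStrings: RowParser[(String, String)] = string.product(string)
```

Parsing List("a", "b", "c") with twoStrings yields the pair ("a", "b") plus the remaining List("c"), whereas List("a") fails with RowExhaustionFailure in the second step.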

Traverse

There’s one last pitfall… In our top level #parse(), we used to #map() over the rows (yes, List has Functor/Applicative/Monad instances, as well). Doing this with the new API gives us a List[CSVResult[T]]. But we just want to fail the whole computation if any row cannot be parsed - that is, we want a CSVResult[List[T]] instead.

Another abstraction to the rescue! List has a Traverse instance, which gives us functions that turn this kind of nesting inside out and work with any Applicative.

def parse[T](file: Path)(p: RowParser[T]): CSVResult[List[T]] =
  lines(file).map(row).traverse(parseRow(p))
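
For intuition, this is roughly what #traverse() does when specialized to List and Either. The sketch below is a hand-rolled, standard-library-only approximation, not the cats implementation; traverseEither and parseInt are hypothetical names, and String serves as a stand-in error type.

```scala
// Apply f to every element; return all results, or the first (leftmost) error.
def traverseEither[E, A, B](as: List[A])(f: A => Either[E, B]): Either[E, List[B]] =
  as.foldLeft[Either[E, List[B]]](Right(Nil)) { (acc, a) =>
    for
      bs <- acc  // once acc is a Left, f is no longer called
      b  <- f(a)
    yield b :: bs
  }.map(_.reverse) // results were accumulated in reverse order

def parseInt(s: String): Either[String, Int] =
  s.toIntOption.toRight(s"not a number: $s")

// map keeps one result per element...
val mapped = List("1", "x").map(parseInt)
// ...while traverse collapses them into a single overall result.
val allGood = traverseEither(List("1", "2", "3"))(parseInt)
val oneBad  = traverseEither(List("1", "x", "y"))(parseInt)
```

Here mapped is a List[Either[String, Int]], allGood is a Right holding all three numbers, and oneBad is the Left produced for "x"; the element "y" is never parsed.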

Resolution

Finally, we need to resolve the two Either modes at the top level of our program, e.g. via pattern matching.

CSVParser.parse(Paths.get(csvFile))(userParser) match
  case Left(f) => println(s"ERROR: $f")
  case Right(r) => r.foreach(println)

Success,…

sbt:nanocsv> runMain de.sangamon.nanocsv.step04.main data/users.csv
1,Torsten Test,1970-01-01
2,Andrea Anders,2000-02-20
User(1,Torsten Test,1970-01-01)
User(2,Andrea Anders,2000-02-20)

…, user error,…

sbt:nanocsv> runMain de.sangamon.nanocsv.step04.main data/users.csv
1,Torsten Test,1970-01-01
2,Andrea Anders,2000-02-20x
ERROR: ColumnParseFailure(java.time.format.DateTimeParseException:
  Text '2000-02-20x' could not be parsed, unparsed text found at index 10)

…and I/O failure.

sbt:nanocsv> runMain de.sangamon.nanocsv.step04.main data/xyz.csv
[error] (run-main-1) java.io.FileNotFoundException:
  data/xyz.csv (No such file or directory)

Whew. This was quite a ride, and the benefit/cost ratio of this approach to failure handling may seem somewhat questionable. In its defense…

We now have a very explicit representation of failure modes - actually they have become part of the normal program flow now, which seems quite right. After all, you need to treat them with the same diligence as the “happy path” - keeping them kind of invisible seems somewhat absurd.
The prospect of mentally juggling all these new abstractions may feel daunting, but the good news is: There are only so many of them (you can go a long way with Functor, Applicative, Monad[Error], Traverse and Foldable), and they’re almost universally applicable.

I hope the benefit will become even clearer in subsequent posts. The full code for this post can be found in package de.sangamon.nanocsv.step04. In the next step, we will extend failure handling to the file level.