step 06: Positions

Error Location

In the last step, we integrated some file-level errors into our user error representation. Let’s go back to the core parser errors.

As things stand now, we signal parse errors to the user, but we don’t give any indication as to where in the CSV file the error occurred. It would be really helpful if we could extend parser errors with the row and column index at which they occurred.

case class ParserPos(rowIdx: Int, colIdx: Int)

enum CSVParseFailure:
  case ColumnParseFailure(cause: Throwable, pos: ParserPos)
  case RowExhaustionFailure(pos: ParserPos)

Figuring out the row index is no problem - we can just #zipWithIndex the rows. However, we’d either have to pass the current index from the top level “loop” down to the individual column parsers, or we’d have to intercept user errors on their way up and somehow inject the row index at the top level. Both approaches would significantly pollute our API.

Column indices are even trickier. We don’t know exactly how many columns have been consumed at any given point, so it has to be the job of the column parsers to keep track of this.

Stating Positions

We’re already threading information through the parsers: the remaining row, as the parsers consume it. We can simply add the position information to that state.

case class ParserState(pos: ParserPos, prevColIdx: Int, remainder: Row)

trait RowParser[T]:
  def parse(st: ParserState): CSVResult[(T, ParserState)]

Why prevColIdx? Well, remember #emap() and #guardMap() - these can fail and would want to specify their position. But by the time our mapping function runs, the parser that produced its input has already advanced the current pos to the next column to be consumed, whereas in these cases we want to report the column most recently consumed by that parser. And we cannot just assume that the difference is one column - it may be more or less (think of constant parsers, which consume nothing).

The only parser we have (so far) that consumes anything and needs to advance the state is string. On success, it must advance all state values; otherwise, it should encode its position in the error response.

val string: RowParser[String] =
  case ParserState(ParserPos(r, c), _, h :: t) => 
    (h, ParserState(ParserPos(r, c + 1), c, t)).pure
  case ParserState(p, _, Nil) => 
    RowExhaustionFailure(p).asLeft[(String, ParserState)]
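To see the state move, and why prevColIdx cannot simply be derived as colIdx - 1, here is a minimal sketch that runs string and a constant parser by hand. It swaps the cats syntax for plain standard-library Either so that it stands alone. The Row and CSVResult aliases are assumptions about the earlier steps, and const is a hypothetical stand-in for a constant parser.

```scala
// Assumed alias; the real definition lives in an earlier step.
type Row = List[String]

case class ParserPos(rowIdx: Int, colIdx: Int)

enum CSVParseFailure:
  case ColumnParseFailure(cause: Throwable, pos: ParserPos)
  case RowExhaustionFailure(pos: ParserPos)
import CSVParseFailure.*

// Assumed alias; the series' CSVResult may carry a wider error type.
type CSVResult[T] = Either[CSVParseFailure, T]

case class ParserState(pos: ParserPos, prevColIdx: Int, remainder: Row)

trait RowParser[T]:
  def parse(st: ParserState): CSVResult[(T, ParserState)]

// Same logic as the string parser above, with Right/Left instead of cats syntax.
val string: RowParser[String] = {
  case ParserState(ParserPos(r, c), _, h :: t) =>
    Right((h, ParserState(ParserPos(r, c + 1), c, t)))
  case ParserState(p, _, Nil) =>
    Left(RowExhaustionFailure(p))
}

// Hypothetical constant parser: yields a value, leaves the state untouched.
def const[T](value: T): RowParser[T] =
  st => Right((value, st))

val start = ParserState(ParserPos(0, 0), 0, List("a", "b"))

// Nothing consumed: pos stays at column 0, so "colIdx - 1" would be -1.
val afterConst = const(42).parse(start).map(_._2)

// One column consumed, then a constant: pos points at column 1,
// prevColIdx at column 0, the column that was actually read.
val afterStringThenConst =
  string.parse(start).flatMap { case (_, st) => const(42).parse(st) }.map(_._2)
```

Here afterConst is the start state unchanged, while afterStringThenConst ends with pos at column 1 and prevColIdx at column 0.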

Place Failure

Now we have to update our failure sites. In end we want to state the current (advanced) position:

  val end: RowParser[Unit] = {
    case s@ParserState(_, _, Nil) => ((), s).asRight
    case ParserState(pos, _, _ :: _) =>
      ColumnParseFailure(new IllegalStateException("trailing data"), pos).asLeft
  }

In #guardMap(), however, we want to trace back to the last consumed column.

def guardMap[B](f: A => B): RowParser[B] =
  p.parse(_) >>= {
    case (res, st@ParserState(ParserPos(r, _), pc, _)) =>
      Either.catchNonFatal(f(res))
        .leftMap(ColumnParseFailure(_, ParserPos(r, pc)))
        .map(_ -> st)
  }
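As a quick check of which column gets reported, the following sketch rebuilds #guardMap() on top of the standard library only, with scala.util.Try standing in for Either.catchNonFatal. The Row and CSVResult aliases are assumptions about the earlier steps, and intColumn is a hypothetical parser introduced just for this example.

```scala
import scala.util.Try

// Assumed alias; the real definition lives in an earlier step.
type Row = List[String]

case class ParserPos(rowIdx: Int, colIdx: Int)

enum CSVParseFailure:
  case ColumnParseFailure(cause: Throwable, pos: ParserPos)
  case RowExhaustionFailure(pos: ParserPos)
import CSVParseFailure.*

// Assumed alias; the series' CSVResult may carry a wider error type.
type CSVResult[T] = Either[CSVParseFailure, T]

case class ParserState(pos: ParserPos, prevColIdx: Int, remainder: Row)

trait RowParser[T]:
  def parse(st: ParserState): CSVResult[(T, ParserState)]

val string: RowParser[String] = {
  case ParserState(ParserPos(r, c), _, h :: t) =>
    Right((h, ParserState(ParserPos(r, c + 1), c, t)))
  case ParserState(p, _, Nil) =>
    Left(RowExhaustionFailure(p))
}

extension [A](p: RowParser[A])
  // On a throwing f, report the column the underlying parser consumed last
  // (prevColIdx), not the already advanced pos.
  def guardMap[B](f: A => B): RowParser[B] = st =>
    p.parse(st).flatMap { case (res, next) =>
      Try(f(res)).toEither
        .left.map { e =>
          ColumnParseFailure(e, ParserPos(next.pos.rowIdx, next.prevColIdx))
        }
        .map(b => (b, next))
    }

// Hypothetical integer column built from string.
val intColumn: RowParser[Int] = string.guardMap(_.toInt)

// Row 1: column 0 is consumed as a string, column 1 ("x") fails to parse as Int.
val start = ParserState(ParserPos(1, 0), 0, List("2", "x"))
val result = string.parse(start).flatMap { case (_, st) => intColumn.parse(st) }

// Extract the reported position from the failure, if any.
val reportedPos = result.left.toOption.collect {
  case ColumnParseFailure(_, pos) => pos
}
```

The failure carries ParserPos(1, 1), the column holding "x", even though the state had already moved on to column 2 when the conversion threw.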

Note that we cannot reuse #emap() for now.

Now the position just needs to be initialized:

private def parseRow[T](p: RowParser[T])(row: Row, rowIdx: Int): CSVResult[T] =
  p.parse(ParserState(ParserPos(rowIdx, 0), 0, row)).map(_(0))

def parseLines[T](p: RowParser[T])(lines: List[String]): CSVResult[List[T]] =
  lines.map(row).zipWithIndex.traverse(parseRow(p))
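The wiring can be tried out in isolation with the sketch below. It is a standard-library approximation rather than the series' actual code: traverse is hand-rolled instead of coming from cats, row is a naive comma split standing in for the real line splitter, the Row and CSVResult aliases are assumed, and twoStrings is a hypothetical two-column parser.

```scala
// Assumed alias; the real definition lives in an earlier step.
type Row = List[String]

case class ParserPos(rowIdx: Int, colIdx: Int)

enum CSVParseFailure:
  case ColumnParseFailure(cause: Throwable, pos: ParserPos)
  case RowExhaustionFailure(pos: ParserPos)
import CSVParseFailure.*

// Assumed alias; the series' CSVResult may carry a wider error type.
type CSVResult[T] = Either[CSVParseFailure, T]

case class ParserState(pos: ParserPos, prevColIdx: Int, remainder: Row)

trait RowParser[T]:
  def parse(st: ParserState): CSVResult[(T, ParserState)]

val string: RowParser[String] = {
  case ParserState(ParserPos(r, c), _, h :: t) =>
    Right((h, ParserState(ParserPos(r, c + 1), c, t)))
  case ParserState(p, _, Nil) =>
    Left(RowExhaustionFailure(p))
}

// Hypothetical parser that reads two string columns in sequence.
val twoStrings: RowParser[(String, String)] = st =>
  string.parse(st).flatMap { case (a, st1) =>
    string.parse(st1).map { case (b, st2) => ((a, b), st2) }
  }

// Naive stand-in for the line splitter: no quoting or escaping.
def row(line: String): Row = line.split(",", -1).toList

// Hand-rolled replacement for cats' traverse: stops at the first failure.
def traverse[A, B](as: List[A])(f: A => CSVResult[B]): CSVResult[List[B]] =
  as.foldLeft[CSVResult[List[B]]](Right(Nil)) { (acc, a) =>
    acc.flatMap(bs => f(a).map(_ :: bs))
  }.map(_.reverse)

// Every row starts at column 0 with its own row index.
def parseRow[T](p: RowParser[T])(r: Row, rowIdx: Int): CSVResult[T] =
  p.parse(ParserState(ParserPos(rowIdx, 0), 0, r)).map(_._1)

def parseLines[T](p: RowParser[T])(lines: List[String]): CSVResult[List[T]] =
  traverse(lines.map(row).zipWithIndex) { case (r, i) => parseRow(p)(r, i) }

val ok = parseLines(twoStrings)(List("1,Torsten Test", "2,Andrea Anders"))
val tooShort = parseLines(twoStrings)(List("1,Torsten Test", "2"))
```

With these inputs, ok holds both parsed pairs, and tooShort is a RowExhaustionFailure at ParserPos(1, 1): the second line (row index 1) runs out of data at column 1.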

…and we get proper location reports for parse failure,…

sbt:nanocsv> runMain de.sangamon.nanocsv.step06.main data/users.csv
1,Torsten Test,1970-01-01
2,Andrea Anders,2000-02-20x
ERROR: ColumnParseFailure(java.time.format.DateTimeParseException: 
  Text '2000-02-20x' could not be parsed, unparsed text found at index 10,
  ParserPos(1,2))

…exhaustion…

sbt:nanocsv> runMain de.sangamon.nanocsv.step06.main data/users.csv
1,Torsten Test,1970-01-01
2,Andrea Anders
ERROR: RowExhaustionFailure(ParserPos(1,2))

…and trailing data.

sbt:nanocsv> runMain de.sangamon.nanocsv.step06.main data/users.csv
1,Torsten Test,1970-01-01
2,Andrea Anders,2000-02-20,x
ERROR: ColumnParseFailure(java.lang.IllegalStateException: 
  trailing data,ParserPos(1,3))

The full code for this post can be found in package de.sangamon.nanocsv.step06. In the next step we will see how this kind of state propagation blends in with the reusable abstractions we’ve been working with already.