Parsinator, a tale of a PDF parser

One day your boss asks you to read a PDF file, extract the relevant information, and later process it in your main software. What can you do? How in the world are you going to read a PDF file?

The Requirements

There you are, on a normal day at the office, facing a new challenge. One of your clients can’t connect to your main software; the only input they can provide is a text-based PDF file. This is the challenge: parse this PDF file into something that can be processed later on. And there’s a catch: this is only the first of many files, each with its own layout, so your solution has to support new formats without starting from scratch.

Actual implementation

Since you’re asked to support files with any layout, a for loop through the lines with lots of ifs and regular expressions isn’t the most adequate solution. Every file with a different layout would mean coding the whole thing again. There must be a better way!

On one hand, one of your concerns is how to turn a PDF file into actual text. But a few lines with iTextSharp will do the reading, so no big deal after all! From now on, a PDF file is a list of lists of strings, List&lt;List&lt;string&gt;&gt;: one list per page and one string per line. You could even abstract this step to support other formats besides PDF.
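For example, a minimal reading step could look like the sketch below. It relies on iTextSharp 5’s PdfReader and PdfTextExtractor; the PdfToLines name is made up for illustration:

using System.Collections.Generic;
using System.Linq;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class PdfToLines
{
    // Hypothetical helper: turns a PDF file into
    // one list of lines per page
    public static List<List<string>> Read(string path)
    {
        var reader = new PdfReader(path);
        try
        {
            return Enumerable.Range(1, reader.NumberOfPages)
                             .Select(page => PdfTextExtractor.GetTextFromPage(reader, page)
                                                             .Split('\n')
                                                             .ToList())
                             .ToList();
        }
        finally
        {
            reader.Close();
        }
    }
}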

On the other hand, how can you do the actual parsing? Parser combinators to the rescue! You could borrow this idea from Haskell and other functional languages: create small, composable pieces of code to extract or discard text at the page or line level.

Skippers

First, you can assume that your file has some content that spans from one page to another. Also, there is some content that can be read from a given page and line. A header/detail representation. Imagine an invoice with lots of purchased items that requires a couple of pages.

Second, there are some lines you don’t care about, since they don’t hold any relevant information, so you can ignore them. For example, you can “skip” the first or last lines of a page, all blank lines, or everything between two line numbers or two regular expressions. These are the skippers.

public class SkipLineCountFromStart : ISkip
{
    private readonly int LineCount;

    public SkipLineCountFromStart(int lineCount = 1)
    {
        this.LineCount = lineCount;
    }
            
    // Remove the first LineCount lines from every page
    public List<List<string>> Skip(List<List<string>> lines)
    {
        var skipped = lines.Select(l => l.Skip(LineCount).ToList())
                           .ToList();
        return skipped;
    }
}
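The ISkip contract isn’t shown here; judging by the class above, it is presumably a single method that takes all the pages and returns them with some lines removed:

public interface ISkip
{
    List<List<string>> Skip(List<List<string>> lines);
}

// Usage: drop a two-line header from every page,
// given the pages produced by the reading step
var skipper = new SkipLineCountFromStart(lineCount: 2);
List<List<string>> trimmed = skipper.Skip(pages);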

Parsers

After ignoring all the unnecessary text, you can create separate small functions to extract text: extract a line if it matches a regular expression, read all the text between two consecutive line numbers or regexes, or read a fixed string. These are the parsers. With them, you can read the text at a given line of a page, or use a default value if there isn’t any.

public class ParseFromLineNumberWithRegex : IParse
{
    private readonly string Key;
    private readonly int LineNumber;
    private readonly Regex Pattern;

    // Tracks whether this parser has already matched a line
    public bool HasMatched { get; private set; }
        
    public ParseFromLineNumberWithRegex(string key, int lineNumber, Regex pattern)
    {
        this.Key = key;
        this.LineNumber = lineNumber;
        this.Pattern = pattern;
    }
        
    // Parses the line at the given line number and, if it
    // matches the regex, returns the first matching group
    public IDictionary<string, string> Parse(string line, int lineNumber)
    {
        if (lineNumber == this.LineNumber)
        {
            var matches = this.Pattern.Match(line);
            if (matches.Success)
            {
                HasMatched = true;
                var value = matches.Groups[1].Value;
                return new Dictionary<string, string> { { Key, value.Trim() } };
            }
        }
        return new Dictionary<string, string>();
    }
}
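As with the skippers, the IParse interface isn’t shown; presumably it boils down to the Parse method above. Here is a sketch, plus a parser in action (the key name and regex are made up for illustration):

public interface IParse
{
    IDictionary<string, string> Parse(string line, int lineNumber);
}

// Usage: on line 2, turn "Invoice No: 1234" into { "InvoiceNumber": "1234" }
var parser = new ParseFromLineNumberWithRegex(
    key: "InvoiceNumber",
    lineNumber: 2,
    pattern: new Regex(@"Invoice No:\s*(\d+)"));
var fields = parser.Parse("Invoice No: 1234", lineNumber: 2);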

Transformations

But what about the text spanning many pages? This is where the transformations come in: they flatten all the lines spanning many pages into a single stream of lines, so you can run the same parsers on every one of those lines.

For example, in the case of an invoice, imagine all purchased items are laid out in a table. It starts with a header and ends with the subtotal of all purchased items. So you can reuse some skippers to extract the lines between the header and the subtotal.

public class TransformFromMultipleSkips : ITransform
{
    private readonly IList<ISkip> ToSkip;

    // params lets callers pass the skippers one by one, as below
    public TransformFromMultipleSkips(params ISkip[] skippers)
    {
        ToSkip = skippers;
    }

    public List<string> Transform(List<List<string>> allPages)
    {
        // Chain applies the next skipper to the output of the previous one
        List<string> details = ToSkip.Chain(allPages)
                                     .SelectMany(page => page)
                                     .ToList();
        return details;
    }
}
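One piece not shown here is Chain. Presumably, it’s a small extension method that folds the list of skippers over the pages, feeding each skipper the output of the previous one; a minimal sketch under that assumption:

public static class ChainExtensions
{
    public static List<List<string>> Chain(
        this IEnumerable<ISkip> skippers,
        List<List<string>> pages)
    {
        // A left fold: pages -> skipper1 -> skipper2 -> ...
        return skippers.Aggregate(pages, (current, skipper) => skipper.Skip(current));
    }
}

With that in place, here is the transformation for the invoice example, built from two skippers: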

// Table starts with "Code Description Price Total"
// and ends with "S U B T O T A L"
var transform = new TransformFromMultipleSkips(
    new SkipBeforeRegexAndAfterRegex(
        before: new Regex(@"\|\s+Code\s+.+Total\s+\|"),
        after: new Regex(@"\|\s+\|\s+S U B T O T A L\s+\|")),
    new SkipBlankLines());

All the pieces

Then, you can create a method to put everything in place: apply all the skippers to every page, so only the relevant lines remain, and then run all the parsers on the appropriate pages and lines of the skippers’ output.

public Dictionary<string, Dictionary<string, string>> Parse(List<List<string>> lines)
{
    // First, remove all the irrelevant lines from every page
    List<List<string>> pages = _headerSkipers.Chain(lines);

    // Then, run the header parsers on their target pages
    foreach (var page in pages.Select((Content, Number) => new { Number, Content }))
    {
        var parsers = FindPasersForPage(_headerParsers, page.Number, lines.Count);
        if (parsers.Any())
            ParseOnceInPage(parsers, page.Content);
    }

    // Finally, flatten the detail lines spanning many pages
    // and run the detail parsers on every single line
    if (_detailParsers != null && _detailParsers.Any())
    {
        List<string> details = (_transform != null)
                ? _transform.Transform(pages)
                : pages.SelectMany(t => t).ToList();

        ParseInEveryLine(_detailParsers, details);
    }

    return _output;
}
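To give a feel for the whole pipeline, here is a hypothetical wiring of all the pieces. The Parser constructor and its parameter names are assumptions for illustration, not necessarily Parsinator’s exact API:

// Hypothetical wiring; the constructor shape is assumed.
// pages holds the List<List<string>> from the reading step
var parser = new Parser(
    headerSkippers: new List<ISkip> { new SkipLineCountFromStart(lineCount: 2) },
    headerParsers: new List<IParse>
    {
        new ParseFromLineNumberWithRegex(
            "InvoiceNumber", 1, new Regex(@"Invoice No:\s*(\d+)"))
    },
    transform: transform, // the TransformFromMultipleSkips from above
    detailParsers: new List<IParse> { /* one parser per table column */ });

Dictionary<string, Dictionary<string, string>> output = parser.Parse(pages);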

Conclusion

Finally, with this approach, you or any of your coworkers can reuse the same constructs to parse a new file. You can support new files without coding the whole thing from scratch every time; you only need to come up with the right skippers and parsers based on the structure of the new file.

PS: All these ideas and other suggestions from my coworkers gave birth to Parsinator, a library to turn structured or unstructured text into a header-detail representation. I used Parsinator to connect four legacy client systems to a document API by parsing PDFs and plain-text files into input XML files. In the Sample project, you can see how to parse a plain-text invoice and a GPS frame. Feel free to take a look at it. All ideas and contributions are more than welcome!