Capturing data from varied documents within Ephesoft (pt. 1)


Ephesoft offers a great toolkit for setting up extraction logic to automate the capture of information from scanned
documents. But, sometimes the documents we’d like to feed into the system do not exhibit regular and reliable formatting,
key words, or positioning, even if they are otherwise similar in content and subject matter. In these cases, devising
suitable extraction logic can be much more difficult due to the uncertainty of what Ephesoft will find within each document.

But, Ephesoft has a lot of flexibility that can be leveraged here, and today I’ll take us through some key methods and
practices to handle these more varied document situations. In particular, this blog will look at handling some basic variability in data on which we’re trying to get a regex match. The next blog will address how to build a somewhat more complex tabular extraction model while addressing the same kind of variation in the data.

Step 1: Compile the data

To human eyes, it may appear that a certain field value is common and uniform throughout our set of documents.
But it’s always a better idea to view the data the way Ephesoft will see it. To do that, we’ll need to look within the OCR / XML files that the RecoStar engine generates after scanning each page.

The easiest way to access these XML files is to feed some sample documents into this folder:
[Ephesoft Shared Folders]/[Batch Class ID]/test-extraction

Once documents are loaded there, hit the Test button within the extraction logic that we’ve set up, and Ephesoft will process those documents and generate XML files paired with them. A simplified demonstration of these files is shown below.


XML files paired with page / image renditions

From there we’ll open up each XML file and take a look at what’s been rendered therein. The quickest way to survey the character data that was generated is to scroll to the bottom of each XML file, where we’ll find the unique <HocrContent> tag. This node within the XML file contains a quick summary of all OCR values that were found within the document, in a linear, somewhat more human-readable format.
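If there are more than a handful of documents, pulling the <HocrContent> text out of each XML file by hand gets tedious. Here is a minimal sketch of how that gathering step could be scripted in Python. The helper names (`hocr_content`, `compile_blocks`) are mine, not Ephesoft’s, and the namespace handling is a guess at what the generated XML may contain; adjust to match your actual files.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

def hocr_content(xml_path):
    """Return the text of the <HocrContent> node, ignoring any XML namespace."""
    root = ET.parse(xml_path).getroot()
    for node in root.iter():
        # the tag may appear as '{namespace}HocrContent', so match on the local name
        if node.tag.split('}')[-1] == 'HocrContent':
            return (node.text or '').strip()
    return None

def compile_blocks(folder):
    """Concatenate every XML file's HocrContent into one labeled text blob,
    ready to paste into a regex testing tool."""
    blocks = []
    for path in sorted(Path(folder).glob('*.xml')):
        text = hocr_content(path)
        if text:
            blocks.append(f"=== {path.name} ===\n{text}")
    return "\n\n".join(blocks)
```

Pointing `compile_blocks` at the test-extraction folder then produces the same compiled text we would otherwise assemble by copy-and-paste.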


The HocrContent tag inside an XML file

From here the best practice is to copy the <HocrContent> content out of each document’s XML file and paste it all into a regex test tool. A personal favorite is RegExr; RegexPal is another good and free online option. The purpose of compiling this data in one place is to enable easy and efficient authoring of regular expressions that will produce hits in as many of our documents as possible, within one expression.

Once all the content is pasted into the regex test tool, we’ll see a long stack of text, which comprises all of the OCR data that Ephesoft will have access to. This is about as close as we can get to seeing the data “as Ephesoft would see it”.
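The same check can be run in a few lines of Python rather than an online tool. The sample blocks below are hypothetical stand-ins for four documents’ HocrContent, with the second containing an OCR misread of the kind discussed later in this post:

```python
import re

# Hypothetical HocrContent excerpts from four documents; the second
# simulates a misread from a low-quality scan.
blocks = [
    "Well Report Measured Depth 10,250 ft",
    "Well Report Measvred Dopth 9,870 ft",
    "Summary Measured Depth 11,400 ft",
    "Measured Depth (MD) 8,995 ft",
]

pattern = re.compile(r"Measured\s+Depth")
for i, block in enumerate(blocks, 1):
    status = "match" if pattern.search(block) else "NO MATCH"
    print(f"block {i}: {status}")
```

Just as in the screenshot, this reports hits in three of the four blocks, flagging which document needs investigation.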

The screenshot below shows an early attempt to match the phrase Measured Depth within the OCR data, which initially produces three hits out of the four document blocks that were copied in.


Four blocks of HocrContent pasted and compiled in an online regex testing tool

Step 2: Build regex and investigate variances

From here the task is to start laying out regular expressions that will generate positive hits in the data we’ve compiled. The previous screenshot shows that we made positive matches in all blocks except for the second one. So, the next step is to investigate why there’s no match there. We already know that the text “Measured Depth” is visible to human eyes within each of these documents, but clearly that isn’t what the OCR engine rendered in the case of this second document. The quickest way to investigate this is to drill back into that second XML file using a text editor such as Sublime, and to dig through that same <HocrContent> tag to try to find what’s there instead of Measured Depth.

A quick search reveals the answer: RecoStar generated some “creative” character renderings here.

The culprit for the negative match is found


This was likely due to a low-quality initial scan, and the OCR engine simply made the best guess that it could
regarding what it found here.

Precisely locating this negative match is very instructive, and prompts us to consider several options for
addressing it:

  • We could look again at the original scan for this document and determine whether it can be re-ingested at a higher image quality. Trying this approach might result in a better OCR rendering, which would hopefully generate a positive hit. If this is the case, the same improvement in scan quality across our whole batch of documents would consequently result in a generally higher hit rate across the whole batch as well.
  • Another option would be to build some tolerance into our regex setup, such that near matches like this one might be allowed as hits. This option is reasonable in some cases, but needs to be undertaken carefully and with generous testing. Allowing too much near-match tolerance within the regex setup might accidentally create a number of false positives, thereby capturing erroneous data when we run actual document batches through Ephesoft. In other words, we might do more harm than good here. We would need to consider the overall data set we’re working with, and undertake some trial-and-error to determine what level of near-match tolerance is appropriate.
  • A final option might be to conclude that this document is an outlier in our set due to its poor image quality, and categorize it as a lost cause for automated extraction. As a recourse, our human operators could simply flag this document for manual entry or manual table extraction during the Validation phase within Ephesoft.
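To make the second option concrete, here is a sketch of what a tolerance-building step might look like. The specific character substitutions (u/v, e/o) are illustrative guesses at common OCR confusions, not a prescription; any tolerant pattern should be tuned against your own compiled data and tested for false positives:

```python
import re

# Strict pattern: only matches a clean OCR rendering.
STRICT = re.compile(r"Measured\s+Depth")

# Tolerant variant: allows a few plausible OCR confusions (u/v, e/o).
# These substitutions are illustrative; widen or narrow them based on
# trial-and-error against your real document set.
TOLERANT = re.compile(r"M[ea]a?s[uv]red\s+D[eo]pth", re.IGNORECASE)

samples = [
    "Measured Depth 10,250 ft",  # clean rendering
    "Measvred Dopth 9,870 ft",   # misread from a poor scan
    "Measured Width 9,000 ft",   # must never match
]
for s in samples:
    if STRICT.search(s):
        print(s, "-> strict match")
    elif TOLERANT.search(s):
        print(s, "-> tolerant match")
    else:
        print(s, "-> no match")
```

Note that the tolerant pattern still rejects the third sample: each added character class widens the net, so every substitution should earn its place by rescuing a real misread without admitting erroneous data.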

This post should give some key starting points on how to build out extraction logic to capture data from a varied set of
documents. In the next blog we’ll take this a step further and examine the creation of a whole table extraction model
operating on a varied set of documents.
