Ephesoft Transact (I will refer to it as Ephesoft throughout this article) is an intelligent document processing solution that can classify a document and extract metadata from it. Document classification is the process of determining a document’s type (such as an invoice) based on the contents of the document. Ephesoft can perform this process automatically, which reduces the time it takes for an organization to process its documents.
Ephesoft also takes this a step further by providing automated extraction of metadata based on rules for each document type. The extraction rules tell Ephesoft how to find the correct information and extract it. What a rule does is determined by the extraction method it uses. There are many extraction methods in Ephesoft, and choosing the right one is crucial to getting the right result. In this article, I will focus on the five methods that I find most useful and discuss when to use each:
- Key-Value Extraction
- Cross Section Extraction
- Paragraph Extraction
- Barcode Extraction
- Table Extraction
First Things First
Before getting into the extraction methods, I would like to provide some background on an important piece of most methods: regular expressions (regexes). A regex is a pattern that describes the text you want to find. The pattern is then used to search a document and find matches. A pattern can mix literal text, like ‘Date’, with character classes that stand for a set of characters (for example, \d for digits). Ephesoft has a regex builder that helps you work out the pattern you need, as well as a pool of useful predefined regexes.
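To make the idea of mixing literal text with character classes concrete, here is a minimal sketch. It uses Python's `re` module purely for illustration; Ephesoft's own regex dialect may differ in the details, and the `DATE_PATTERN` name, the `find_dates` helper, and the sample text are all made up for this example.

```python
import re

# Literal text "Date", an optional colon, optional whitespace,
# then a date made of digit groups separated by slashes.
DATE_PATTERN = re.compile(r"Date:?\s*(\d{1,2}/\d{1,2}/\d{4})")


def find_dates(text: str) -> list[str]:
    """Return every date that directly follows the literal label 'Date'."""
    return DATE_PATTERN.findall(text)


sample = "Invoice 1042\nDate: 03/15/2024\nDue Date 4/1/2024"
dates = find_dates(sample)
print(dates)
```

The invoice number is ignored because it is not preceded by the literal ‘Date’, while both dates are picked up even though one is written with a colon and one without.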
Key-Value Extraction
Key-Value extraction is the Swiss army knife of extraction tools. It can do simple extractions, such as for a name field where the key is “Name:” and the value is the text after the colon. It can also handle more complicated cases, such as using the same regex for both the key and the value to find and extract specific pieces of address information based on their expected pattern. The versatility of regex patterns allows Key-Value to be used in many different ways and circumstances.
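The simple “Name:” case can be sketched in a few lines. This is only a conceptual illustration on plain text, not how Ephesoft implements Key-Value extraction (Ephesoft works with positions on the page rather than raw lines). The `extract_key_value` function and the sample form are hypothetical.

```python
import re
from typing import Optional


def extract_key_value(text: str, key_pattern: str, value_pattern: str) -> Optional[str]:
    """Find the key, then return the first value match that follows it on the same line."""
    for line in text.splitlines():
        key_match = re.search(key_pattern, line)
        if key_match is None:
            continue
        # Only look at the text to the right of the key.
        value_match = re.search(value_pattern, line[key_match.end():])
        if value_match is not None:
            return value_match.group(0).strip()
    return None


form = "Name: Jane Doe\nInvoice No: 88213\nTotal: $1,204.50"
name = extract_key_value(form, r"Name:", r"[A-Za-z]+(?: [A-Za-z]+)*")
invoice_number = extract_key_value(form, r"Invoice No:", r"\d+")
print(name, invoice_number)
```

Both the key and the value are regexes here, which is what gives the method its flexibility: swapping the value pattern is enough to pull a name, a number, or a date from the same kind of label.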
However, Key-Value extraction isn’t perfect; it relies on information being in the expected spot. In a form, information sits in specified locations. This isn’t guaranteed in paragraphs or tables. In a paragraph, there is a lot of other text around your expected value. Ephesoft uses fixed boxes, positioned relative to the key, to determine where it should look for the value. If the value’s length varies, Ephesoft may capture too little or pick up excess text, because there is no way to tell it when to stop. In a table, it seems simple to say that for the “height” column, I would like to grab the 10th value by setting “height” as the key and putting the value box on the 10th value. This works until the distance from the column name to the 10th value starts to differ. If your table has any discrepancies in spacing, there is no way to guarantee the location of the 10th value, and so Key-Value extraction cannot get that information for you. While Key-Value cannot cover these cases, Ephesoft has other extraction methods that can.
Cross Section Extraction
Cross section extraction takes Key-Value extraction to the next level by adding a second key. The two keys are expected to be a column and a row that intersect on the value. This is perfect for extracting the value at the intersection of row 10 and the “depth” column. Extracting specific table data is the niche cross section extraction fills, and it is where the method thrives. The use case can, however, be expanded beyond tables whenever you have two keys that intersect on your expected value.
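The intersection idea can be sketched with words that carry page coordinates. This is a simplified illustration under my own assumptions, not Ephesoft's implementation: the `Word` class, the `extract_cross_section` function, the `tolerance` parameter, and the coordinates are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Word:
    text: str
    x: float  # horizontal center of the word on the page
    y: float  # vertical center of the word on the page


def extract_cross_section(
    words: list[Word], column_key: str, row_key: str, tolerance: float = 5.0
) -> Optional[str]:
    """Return the word that lines up vertically with the column key
    and horizontally with the row key."""
    column = next((w for w in words if w.text == column_key), None)
    row = next((w for w in words if w.text == row_key), None)
    if column is None or row is None:
        return None
    for word in words:
        if word is column or word is row:
            continue
        if abs(word.x - column.x) <= tolerance and abs(word.y - row.y) <= tolerance:
            return word.text
    return None


# A tiny table: header row, then rows labelled 9 and 10 (slightly misaligned).
words = [
    Word("row", 10, 10), Word("height", 60, 10), Word("depth", 110, 10),
    Word("9", 10, 30), Word("1.2", 60, 30), Word("3.4", 110, 30),
    Word("10", 10, 50), Word("1.5", 60, 52), Word("3.9", 112, 50),
]
depth_row_10 = extract_cross_section(words, "depth", "10")
print(depth_row_10)
```

Because the value is located relative to both keys, it no longer matters how far row 10 sits below the column header, which is exactly the case that trips up a single-key approach.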
Paragraph Extraction
Paragraph extraction allows for extracting information between point A and point B by setting a start and end pattern. This works best in a paragraph or blob of information where expected words surround a key piece of information, but the length of that information isn’t consistent. It works outside of paragraphs too, in any instance where the information has a clear start and end point. One use case would be extracting the “Pay to” information on a check. The amount of space for writing that information varies between banks’ checks. What we can guarantee is that the information falls between the “Pay to the order of” label and the “$” for the amount. Setting those as the start and end patterns ensures we get all of the correct information and only the correct information.
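The check example can be sketched as a search for a start pattern followed by a search for an end pattern. This is a plain-text illustration of the concept only, not Ephesoft's implementation; the `extract_between` function and the sample check text are hypothetical.

```python
import re
from typing import Optional


def extract_between(text: str, start_pattern: str, end_pattern: str) -> Optional[str]:
    """Return the text between the first start match and the next end match."""
    start = re.search(start_pattern, text, flags=re.IGNORECASE)
    if start is None:
        return None
    remainder = text[start.end():]
    end = re.search(end_pattern, remainder)
    if end is None:
        return None
    # Collapse runs of whitespace left over from the original layout.
    return " ".join(remainder[:end.start()].split())


check = "Pay to the order of   Acme Office Supplies, Inc.   $ 1,250.00"
payee = extract_between(check, r"Pay to the\s+order of", r"\$")
print(payee)
```

The payee can be any length, since the end pattern, not a fixed box, decides where the value stops.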
Barcode Extraction
Barcode extraction works as you might expect. It extracts barcode data based on the specified barcode type and a set barcode zone. Ephesoft supports a variety of barcode formats such as CODE128, ITF, PDF417, and QR. By default, it is intended for pages with one barcode, but the available advanced barcode plugin extends it to extract a specific barcode from a page that has multiple barcodes.
Table Extraction
Table extraction is the tool that always blows my mind when I see it in action. Table extraction is intended for extracting multiple rows of information for multiple columns all at once, but the source doesn’t need to be a standard table. This extraction method works best when you need multiple pieces of data at once and, most importantly, when you can define a row/column of some kind. One use case of a non-standard table is a receipt. It doesn’t look like a standard table, but it can still be broken down into rows (each horizontal line) and columns (item name, quantity, value). When considered like this, we can easily extract information for each item on the receipt into an easy-to-read table. Table extraction has a lot of configuration settings, which allow it to work for a variety of tables beyond the standard table design. How to use each of those configuration settings effectively will be a topic for a future blog post.
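To show how a receipt breaks down into rows and columns, here is a minimal sketch that treats each line as a row and uses one regex to split it into item name, quantity, and value. It illustrates the concept only and says nothing about how Ephesoft's table extraction is configured; `ROW_PATTERN`, `extract_rows`, and the sample receipt are hypothetical.

```python
import re

# One row = item name, then a whole-number quantity, then a price with two decimals.
ROW_PATTERN = re.compile(
    r"^(?P<item>.+?)\s+(?P<quantity>\d+)\s+\$?(?P<value>\d+\.\d{2})$"
)


def extract_rows(text: str) -> list[dict]:
    """Turn every line that looks like an item row into a dict of columns."""
    rows = []
    for line in text.splitlines():
        match = ROW_PATTERN.match(line.strip())
        if match:
            rows.append({
                "item": match["item"],
                "quantity": int(match["quantity"]),
                "value": float(match["value"]),
            })
    return rows


receipt = """CORNER MARKET
Whole Milk 1 gal      2   7.98
Bananas               6   1.74
Paper Towels          1   $5.49
TOTAL                     15.21
Thank you!"""

rows = extract_rows(receipt)
for row in rows:
    print(row)
```

Lines that don’t fit the row shape, such as the store name, the total, and the footer, are skipped, so only the three item rows end up in the table.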
The use cases I have listed above are the recommended ways of using these extraction methods, but the methods can also be used effectively outside of those cases. One of the things that makes Ephesoft extraction so powerful is the range of options offered through its various methods, combined with the flexibility of regex. Ephesoft can offer metadata extraction for most use cases, and making use of this can help any organization improve data extraction from existing files or newly scanned documents.