Extracting structured data from invoices
Xavier Holt * Andrew Chisholm *
Sypht Sypht
xavier@sypht.com andy@sypht.com
Abstract
Business documents encode a wealth of information in a format tailored to human consumption, i.e. aesthetically dispersed natural language text, graphics and tables. We address the task of extracting key fields (e.g. the amount due on an invoice) from a wide variety of potentially unseen document formats. In contrast to traditional template-driven extraction systems, we introduce a content-driven machine-learning approach which is both robust to noise and generalises to unseen document formats. In a comparison of our approach with alternative invoice extraction systems, we observe an absolute accuracy gain of 20% across compared fields, and a 25%–94% reduction in extraction latency.

Figure 1: Energy bill with extracted fields.

1 Introduction
To unlock the potential of data in documents we must first interpret, extract and structure their content. For bills and invoices, data extraction enables a wide variety of downstream applications. Extraction of fields such as the amount due and biller information enables the automation of invoice payment for businesses. Moreover, extraction of information such as the daily usage or supply charge as found on an electricity bill (e.g. Figure 1) enables the aggregation of usage statistics over time and automated supplier switching advice. Manual annotation of document content is a time-consuming, costly and error-prone process (Klein et al., 2004). For many organisations, processing accounts payable or expense claims requires ongoing manual transcription for verification of payment, supplier and pricing information. Template and RegEx driven extraction systems address this problem in part by shifting the burden of annotation from individual documents into the curation of extraction templates which cover a known document format. These approaches still necessitate ongoing human effort to produce reliable extraction templates as new supplier formats are observed and old formats change over time. This presents a significant challenge: Australian bill payments provider BPAY alone covers 26,000 different registered billers [1].

We introduce SYPHT, a scalable machine-learning solution to document field extraction. SYPHT combines OCR, heuristic filtering and a supervised ranking model conditioned on the content of the document to make field-level predictions that are robust to variations in image quality, skew, orientation and content layout. We evaluate system performance on unseen document formats and compare three alternative invoice extraction systems on a common subset of key fields. Our system achieves the best results with an average accuracy of 92% across field types on unseen documents and the fastest median prediction latency of 3.8 seconds. We make our system available as an API [2], enabling low latency key-field extraction scalable to hundreds of documents per second.

[1] www.bpay.com.au
[2] www.sypht.com
* Authors contributed equally to this work
Xavier Holt and Andrew Chisholm. 2018. Extracting structured data from invoices. In Proceedings of Australasian
Language Technology Association Workshop, pages 53−59.
2 Background
Information Extraction (IE) deals broadly with the problem of extracting structured information from unstructured text. In the domain of invoice and bill field extraction, document input is often better represented as a sparse arrangement of multiple text blocks rather than a single contiguous body of text. As financial documents are often themselves machine-generated, there is broad redundancy in this spatial layout of key fields across instances in a corpus. Early approaches exploit this structure by extracting known fields based on their relative position to extracted lines (Tang et al., 1995) and detected forms (Cesarini et al., 1998). Subsequent work aims to better generalise extraction patterns by constructing formal descriptions of document structure (Coüasnon, 2006) and developing systems which allow non-expert end-users to dynamically build extraction templates ad-hoc (Schuster et al., 2013). Similarly, the ITESOFT system (Rusiñol et al., 2013) fits a term-position based extraction model from a small number of human-labelled samples which may be updated iteratively over time. More recently, D’Andecy et al. (2018) build upon this approach by incorporating an a-priori model of term-positions into their iterative layout-specific extraction model, significantly boosting performance on difficult fields.

While these approaches deliver high-precision extraction on observed document formats they cannot reliably or automatically generalise to unseen field layouts. Palm et al. (2017) present the closest work to our own with their CloudScan system for zero-shot field extraction from unseen invoice document forms. They train a recurrent neural network (RNN) model on a corpus of over 300K invoices to recognize 8 key fields, observing an aggregate F-score of 0.84 for fields extracted from held-out invoice layouts on their dataset. We consider a similar supervised approach but address the learning problem as one of value ranking in place of sequence tagging. As they note, system comparison is complicated by a lack of publicly available data for invoice extraction. Given the sensitive nature of invoices and prevalence of personally identifiable information, well-founded privacy concerns constrain open publishing in this domain. We address this limitation in part by rigorously anonymising a diverse set of invoices and submitting them for evaluation to publicly available systems, without making public the data itself.

3 Task
We define the extraction task as follows: given a document and set of fields to query, provide the value of each field as it appears in the document. If no value is present for a given field, return null. This formulation is purely extractive: we do not consider implicit or inferred field values in our experiments or annotation. For example, while it may be possible to infer the value of tax paid with high confidence given the net and gross amount totals on an invoice, without this value being made explicit in text the correct system output is null. We do however consider inference over field names. Regardless of how a value is presented or labeled on a document, if it meets our query field definition systems must extract it. For example, valid invoice number values may be labeled as “Reference”, “Document ID” or even have no explicit label present. This canonicalization of field expression across document types is the core challenge addressed by extraction systems.

To compare system extractions we first normalise the surface form of extracted values by type. For example, dates expressed under a variety of formats are transformed to yyyy-mm-dd and numeric strings or reference number types (e.g. ABN, invoice number) have spaces and extraneous punctuation removed. We adopt the evaluation scheme common to IE tasks such as Slot Filling (McNamee et al., 2009) and relation extraction (Mintz et al., 2009). For a given field, predictions are judged true-positive if the predicted value matches the label; false-positive if the predicted value does not match the label; true-negative if both system and label are null; and false-negative if the predicted value is null and the label is not null. In each instance we consider the type-specific normalised form for both value and label in comparisons. Standard metrics such as F-score or accuracy may then be applied to assess system performance.

Notably we do not consider the position of output values emitted by a system. In practice it is common to find multiple valid expressions of the same field at different points on a document. In this instance, labeling each value explicitly is both laborious for annotators and generally redundant. This may however incorrectly assign credit to systems for a missed prediction in rare cases, e.g. if both the net and gross totals normalise to the
same value (i.e. no applicable tax) a system may be marked correct for predicting either token for each field.

3.1 Fields
SYPHT provides extraction on a range of fields. For the scope of this paper and the sake of comparison, we restrict ourselves to the following fields relevant to invoices and bill payments:

Supplier ABN: represents the Australian Business Number (ABN) of the invoice or bill supplier. For example, 16 627 246 039.

Document Date: the date at which the document was released or printed. Generally distinct from the due date for bills and may be presented in a variety of formats, e.g. 11th December, 2018 or 11-12-2018.

Invoice number: a reference generated by the supplier which uniquely identifies a document, e.g. INV-1447. Customer account numbers are not considered invoice references.

Net amount: the total amount of new charges for goods and services, before taxes, discounts and other bill adjustments, e.g. $50.00.

GST: the amount of GST charged as it relates to the net amount of goods and services, e.g. $5.00.

Gross amount: the total gross cost of new charges for goods and services, including GST or any adjustments, e.g. $55.00.

4 SYPHT
In this section we describe our end-to-end system for key-field extraction from business documents. We introduce a pipeline for field extraction at a high level and describe the prediction model and field annotation components in detail.

Although our system facilitates human-in-the-loop prediction validation, we do not utilise human-assisted predictions in our evaluation of system performance in Section 5.

Preprocessing: documents are uploaded in a variety of formats (e.g. PDF or image files) and normalised to a common form of one JPEG image per page. In development experiments we observe faster performance without degrading prediction accuracy by capping the rendered page resolution (∼8MP) and limiting document colour channels to black and white.

OCR: each page is independently parsed by an Optical Character Recognition (OCR) system in parallel which extracts textual tokens and their corresponding in-document positions.

Filtering: for each query field we filter a subset of tokens as candidates in prediction based on the target field type. For example, we do not consider currency denominated values as candidate fills for a date field.

Prediction: OCRed tokens and page images make up the input to our prediction model. For each field we rank the most likely value from the document for that field. If the most likely prediction falls below a tuned likelihood threshold, we emit null to indicate no field value is predicted. We describe our model implementation and training in Section 4.1.

Validation (optional): consumers of the SYPHT API may specify a confidence threshold at which uncertain predictions are human validated before finalisation. We briefly describe our prediction assisted annotation and verification work-flow system in Section 4.2.

Output: a JSON formatted object containing the extracted field-value pairs, model confidence and bounding-box information for each prediction is returned via an API call.

4.1 Model and training
Given an image and OCRed content as input, our model predicts the most likely value for a given query field. We use spaCy [3] to tokenise the OCR output. Each token is then represented through a wide range of features which describe the token’s syntactic, semantic, positional and visual content and context. We utilise part-of-speech tags, word-shape and other lexical features in conjunction with a sparse representation of the textual neighbourhood around a token to capture local textual context. In addition we capture a broad set of positional features including the x and y coordinates, in-document page offset and relative position of a token in relation to other predictions in the document. Our model additionally includes a range of proprietary engineered features tailored to field and document types of interest.

[3] spacy.io/models/en#en_core_web_sm

Field type information is incorporated into the model through token-level filtering. Examples of
Figure 2: Our annotation and prediction verification tool, SYPHT VALIDATE. Tasks are presented
with fields to annotate on the left and the source document for extraction on the right. We display the
top predictions for each target field as suggestions for the user. In this example the most likely Amount
due has been selected and the position of this prediction in the source document has been highlighted for
confirmation.
field types which benefit from filtering are date, currency and integer fields; and fields with checksum rules. To handle multi-token field outputs, we utilise a combination of heuristic token merging (e.g. pattern based string combination for Supplier ABNs) and greedy token aggregation under a minimum sequence likelihood threshold from token level predictions (e.g. name and address fields).

We train our model by sampling instances at the token level. Matcher functions perform normalisation and comparison to annotated document labels for both single and multi-token fields. All tokens which match the normalised form of the human-agreed value for a field are used to generate positive instances in a process analogous to distant supervision (Mintz et al., 2009). Other tokens in a document which match the field-type filter are randomly sampled as negative training instances. Instances of labels and sparse features are then used to train a gradient boosting decision tree model (LightGBM) [4]. To handle null predictions, we fit a threshold on token-level confidence which optimises a given performance metric; i.e. F-score for the models considered in this work. If the maximum likelihood value for a predicted token-sequence falls below the threshold for that field, a null prediction is returned instead.

4.2 Validation
An ongoing human annotation effort is often central to the training and evaluation of real-world machine learning systems. Well designed user-experiences for a given annotation task can significantly reduce the rate of manual-entry errors and speed up data collection (e.g. Prodigy [5]). We designed a prediction-assisted annotation and validation tool for field extraction: SYPHT VALIDATE. Figure 2 shows a task during annotation. Our tool is used both to supplement the training set and, optionally, where field-level confidence does not meet a configurable threshold, to provide human-in-the-loop prediction verification in real time. Suggestions are pre-populated through SYPHT predictions, transforming an otherwise tedious manual entry task into a relatively simple decision confirmation problem. Usability features such as token-highlighting and keyboard navigation greatly decrease the time it takes to annotate a given document.

We utilise continuous active learning by prioritising the annotation of new documents from our unlabeled corpus where the model is least confident. Conversely we observe that high-confidence predictions which disagree with past human annotations are good candidates for re-annotation, often indicating the presence of annotation errors.
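
The two prioritisation rules described for active learning can be sketched as below. This is a minimal illustration, not the SYPHT implementation: the `Prediction` record, the function names and the 0.9 confidence cut-off are all assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Prediction:
    doc_id: str
    field_name: str
    value: Optional[str]               # predicted value, None for a null prediction
    confidence: float                  # model confidence in [0, 1]
    human_label: Optional[str] = None  # past annotation, if any
    labelled: bool = False             # True once a human has annotated the field


def annotation_queue(preds: List[Prediction]) -> List[str]:
    """Order unlabelled documents so the least confident come first."""
    least_conf: Dict[str, float] = {}
    for p in preds:
        if p.labelled:
            continue
        # Assumption: a document is as uncertain as its least confident field.
        least_conf[p.doc_id] = min(p.confidence, least_conf.get(p.doc_id, 1.0))
    return sorted(least_conf, key=least_conf.get)


def reannotation_candidates(preds: List[Prediction],
                            min_confidence: float = 0.9) -> List[Prediction]:
    """Labelled fields where a confident prediction disagrees with the label."""
    return [p for p in preds
            if p.labelled
            and p.confidence >= min_confidence
            and p.value != p.human_label]
```

Taking the minimum field confidence per document is only one possible aggregation; a mean or a per-field queue would follow the same pattern.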
[4] github.com/Microsoft/LightGBM
[5] https://prodi.gy/
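
The field-type filtering and null-thresholding steps of Section 4.1 can be illustrated with the Supplier ABN field. The sketch below assumes the publicly documented ABN checksum as the filter: subtract 1 from the first digit, weight the 11 digits by 10, 1, 3, 5, ..., 19, and require the weighted sum to be divisible by 89. The candidate scores, the `predict_field` helper and the threshold values are hypothetical stand-ins for the trained model's outputs and the tuned threshold.

```python
import re
from typing import Callable, Iterable, Optional, Tuple

ABN_WEIGHTS = (10, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19)


def normalise_abn(text: str) -> str:
    """Strip spaces and punctuation, keeping digits only."""
    return re.sub(r"\D", "", text)


def is_valid_abn(text: str) -> bool:
    """Checksum filter: reject candidates that cannot be an ABN."""
    digits = normalise_abn(text)
    if len(digits) != 11:
        return False
    values = [int(d) for d in digits]
    values[0] -= 1  # the checksum subtracts 1 from the leading digit
    return sum(w * v for w, v in zip(ABN_WEIGHTS, values)) % 89 == 0


def predict_field(candidates: Iterable[Tuple[str, float]],
                  keep: Callable[[str], bool],
                  threshold: float) -> Optional[str]:
    """Return the top-scoring candidate that passes the filter, or None (null)."""
    kept = [(value, score) for value, score in candidates if keep(value)]
    if not kept:
        return None
    value, score = max(kept, key=lambda pair: pair[1])
    return value if score >= threshold else None


# Hypothetical (candidate, score) pairs for one document.
candidates = [("16 627 246 039", 0.91), ("12 345 678 901", 0.97), ("INV-1447", 0.30)]
print(predict_field(candidates, is_valid_abn, threshold=0.5))
```

Here the highest-scoring candidate fails the checksum and is filtered out before ranking, so the example ABN from Section 3.1 is returned; raising the threshold above 0.91 would yield a null prediction instead.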
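
The judgement scheme defined in Section 3, on which the evaluation in Section 5 relies, can be made concrete as follows. The normalisers are deliberately simplified assumptions: only two numeric date layouts are handled, day-first order is assumed, and `normalise`, `judge`, `accuracy` and the field-type names are illustrative rather than taken from the system.

```python
import re
from collections import Counter
from datetime import datetime
from typing import Iterable, Optional


def normalise(value: Optional[str], field_type: str) -> Optional[str]:
    """Type-specific surface normalisation; None stands for null."""
    if value is None:
        return None
    text = value.strip()
    if field_type == "date":
        # Assumed layouts: dd-mm-yyyy and yyyy-mm-dd.
        for fmt in ("%d-%m-%Y", "%Y-%m-%d"):
            try:
                return datetime.strptime(text, fmt).strftime("%Y-%m-%d")
            except ValueError:
                continue
        return text
    if field_type == "reference":
        # Remove spaces and punctuation from ABNs, invoice numbers, etc.
        return re.sub(r"[\W_]+", "", text)
    return text


def judge(predicted: Optional[str], label: Optional[str], field_type: str) -> str:
    """Classify one prediction as TP, FP, TN or FN after normalisation."""
    pred = normalise(predicted, field_type)
    gold = normalise(label, field_type)
    if pred is None:
        return "TN" if gold is None else "FN"
    return "TP" if pred == gold else "FP"


def accuracy(judgements: Iterable[str]) -> float:
    """Share of judgements that are correct (TP or TN)."""
    counts = Counter(judgements)
    total = sum(counts.values())
    return (counts["TP"] + counts["TN"]) / total
```

Because only normalised values are compared, the position of a value on the page plays no role, which matches the position-agnostic scoring described in Section 3.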
4.3 Service architecture
SYPHT has been developed with performance at scale as a primary requirement. We use a micro-service architecture to ensure our system is both robust to stochastic outages and that we can scale up individual pipeline components to meet demand. Services interact via a dedicated message queue which increases fault-tolerance and ensures consistent throughput. Our system is capable of scaling to service a throughput of hundreds of requests per second at low latency to support mobile and other near real-time prediction use-cases. We consider latency a core metric for real-world system performance and include it in our evaluation of comparable systems in Section 5.

5 Evaluation
In this section we describe our methodology for creating the experimental dataset and system evaluation. We aim to understand how a variety of alternative extraction systems deal with various invoice formats. As a coarse representation of visual document structure, we compute a perceptual hash (Niu and Jiao, 2008) from the first page of each document in a sample of Australian invoices. Personally identifiable information (PII) was then manually removed from each invoice by a human reviewer. SYPHT VALIDATE was used to generate the labels for the task, with between two and four annotators per field dependent on inter-annotator agreement. Annotators worked closely to ensure consistency between their labels and the data definitions listed in Section 3.1, with all fields having a sampled Cohen’s kappa greater than 0.8, and all fields except net amount having a kappa greater than 0.9. During the annotation procedure four documents were flagged as low quality and excluded from the evaluation set, resulting in a final count of 129. In each of these cases annotators could not reliably determine field values due to poor image quality. We evaluated against our deployed system after ensuring that all documents in the evaluation set were excluded from the model’s training set.

5.1 Compared systems
ABBYY [6]: We ran ABBYY FlexiCapture 12 in batch mode on a modern quad-core desktop computer. While ABBYY software provides tools for creating extraction templates by hand, we utilised the generic invoice extraction model for parity with other comparison systems. By contrast with other systems which provided seamless API access, we operated the user interface manually and were unable to reliably record the distribution of prediction time per document. As such we only note the average extraction time aggregated over all test documents in Table 2.

EzzyBills [7]: automates data entry of invoices and accounts payable in business accounting systems. We utilised the EzzyBills REST API.

Rossum [8]: advertises a deep-learning driven data extraction API platform. We utilised their Python API [9] in our experiments.

[6] www.abbyy.com/en-au/flexicapture/
[7] www.ezzybills.com/api/
[8] www.rossum.ai
[9] pypi.org/project/rossum

6 Results
Table 1 presents accuracy results by field for each comparison system. SYPHT delivers the highest performance across measured fields with a macro averaged accuracy exceeding our comparable results by 23.7, 22.8 and 20.2 percentage points (for EzzyBills, ABBYY and Rossum respectively). Interestingly we observe low scores across the board on the net amount field with every system performing significantly worse than on the closely related gross amount. This field also obtained the lowest level of annotator agreement and was notoriously difficult to reliably assess; for example, the inclusion or exclusion of discounts, delivery costs and other adjustments to various sub totals on an invoice often complicates extraction.

The next best system Rossum performed surprisingly well considering their coverage of the European market; excluding support for Australian-specific invoice fields such as ABN. Still, even after excluding ABN, net amount and GST which may align to different field definitions, SYPHT maintains an 8 point accuracy advantage and more than 14 times lower median prediction latency.

Table 2 summarises the average prediction latency in seconds for each system alongside the times for documents at the 25th, 50th and 75th percentile of the response time distribution. Under the constraint of batch processing within the desktop ABBYY extraction environment we were unable to reliably record per-document prediction
times and thus do not indicate their prediction response percentiles. SYPHT was faster than all comparison systems, and significantly faster relative to the other SaaS based API services. Even with the lack of network overhead inherent to ABBYY’s local extraction software, SYPHT maintains a 25% lower average prediction latency. In a direct comparison with other API based products we demonstrate stronger results still, with EzzyBills and Rossum being slower than SYPHT by a factor of 6.6 and 15.9 respectively in terms of mean prediction time per document.

Field            Ezzy   ABBYY  Rossum  Ours
Supplier ABN     76.7   80.6   -       99.2
Invoice Number   72.1   82.2   86.8    94.6
Document Date    67.4   45.0   90.7    96.1
Net Amount       53.5   51.2   55.8    80.6
GST Amount       69.8   72.1   45.0    90.7
Gross Amount     75.2   89.1   84.5    95.3
Avg.             69.1   70.0   72.6    92.8
Table 1: Prediction accuracy (%) by field.

System   Avg.    25th   50th   75th
Rossum   67.06   47.7   54.4   91.0
Ezzy     27.9    20.6   26.9   34.5
ABBYY    5.6     -      -      -
Ours     4.2     3.3    3.8    4.8
Table 2: Prediction latency in seconds.

7 Discussion and future work
While it is not a primary component of our current system, we have developed and continue to develop a number of solutions based on neural network models. Models for sequence labelling, such as LSTM (Gers et al., 1999) or Transformer (Vaswani et al., 2017) networks, can be directly ensembled into the current system. We are also exploring the use of object classification and detection models to make use of the visual component of document data. Highly performant models such as YOLO (Redmon and Farhadi, 2018) are particularly interesting due to their ability to be used in real time. We expect sub-5 second response times to constitute a rough threshold for realistic deployment of extraction systems in real time applications, making SYPHT the best system in contrast to either of the other two API-based services.

We also see an exciting opportunity to provide self-service model development: the ability for a customer to use their own documents to generate a model tailored to their set of fields. This would allow us to offer SYPHT for use cases where either we cannot or would not collect the prerequisite data. SYPHT VALIDATE provides a straightforward method for bootstrapping extraction models by providing rapid data annotation and efficient use of annotator time through active learning.

8 Conclusion
We present SYPHT, a SaaS API for key-field extraction from business documents. Our comparison with alternative extraction systems demonstrates both high accuracy and lower latency across extracted fields, enabling applications in real time for invoices and bill payment.

Acknowledgements
We would like to thank members of the SYPHT team for their contributions to the system, annotation and evaluation effort: Duane Allam, Farzan Maghami, Paarth Arora, Raya Saeed, Saskia Parker, Simon Mittag and Warren Billington.

References
Francesca Cesarini, Marco Gori, Simone Marinai, and Giovanni Soda. 1998. Informys: A flexible invoice-like form-reader system. IEEE Trans. Pattern Anal. Mach. Intell. 20(7):730–745.

Bertrand Coüasnon. 2006. Dmos, a generic document recognition method: application to table structure analysis in a general and in a specific way. International Journal of Document Analysis and Recognition (IJDAR) 8(2):111–122. https://doi.org/10.1007/s10032-005-0148-5.
Vincent Poulain D’Andecy, Emmanuel Hartmann, and Marçal Rusiñol. 2018. Field extraction by hybrid incremental and a-priori structural templates. In 13th IAPR International Workshop on Document Analysis Systems, DAS 2018, Vienna, Austria, April 24-27, 2018. pages 251–256. https://doi.org/10.1109/DAS.2018.29.

Felix A Gers, Jürgen Schmidhuber, and Fred Cummins. 1999. Learning to forget: Continual prediction with LSTM.

Bertin Klein, Stevan Agne, and Andreas Dengel. 2004. Results of a study on invoice-reading systems in Germany. In Simone Marinai and Andreas R. Dengel, editors, Document Analysis Systems VI. Springer Berlin Heidelberg, Berlin, Heidelberg, pages 451–462.

Paul McNamee, Heather Simpson, and Hoa Trang Dang. 2009. Overview of the TAC 2009 Knowledge Base Population Track. In Proceedings of the 2009 Text Analysis Conference.

Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing. Association for Computational Linguistics, Stroudsburg, PA, USA, volume 2, pages 1003–1011. http://dl.acm.org/citation.cfm?id=1690219.1690287.

Xia-mu Niu and Yu-hua Jiao. 2008. An overview of perceptual hashing. Acta Electronica Sinica 36(7):1405–1411.

Rasmus Berg Palm, Ole Winther, and Florian Laws. 2017. CloudScan - A configuration-free invoice analysis system using recurrent neural networks. In 14th IAPR International Conference on Document Analysis and Recognition, ICDAR 2017, Kyoto, Japan, November 9-15, 2017. pages 406–413. https://doi.org/10.1109/ICDAR.2017.74.

Joseph Redmon and Ali Farhadi. 2018. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767.

Marçal Rusiñol, Tayeb Benkhelfallah, and Vincent Poulain D’Andecy. 2013. Field extraction from administrative documents by incremental structural templates. In Proceedings of the 12th International Conference on Document Analysis and Recognition. IEEE Computer Society, pages 1100–1104.

Daniel Schuster, Klemens Muthmann, Daniel Esser, Alexander Schill, Michael Berger, Christoph Weidling, Kamil Aliyev, and Andreas Hofmeier. 2013. Intellix: end-user trained information extraction for document archiving. In Proceedings of the 2013 12th International Conference on Document Analysis and Recognition. IEEE Computer Society, Washington, DC, USA, ICDAR ’13, pages 101–105. https://doi.org/10.1109/ICDAR.2013.28.

Y. Y. Tang, C. Y. Suen, Chang De Yan, and M. Cheriet. 1995. Financial document processing based on staff line and description language. IEEE Transactions on Systems, Man, and Cybernetics 25(5):738–754. https://doi.org/10.1109/21.376488.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. pages 5998–6008.