How document understanding improves invoice, contract and resume processing

As we humans create more and more textual data, natural language processing (NLP) enables computers to analyze it better and make it searchable. Companies are using NLP to find investment signals, to analyze sustainability reports and to understand public speeches. Document understanding means extracting useful information from a document, and with a model called LayoutLM it is possible to take into account the textual content, its structure and the visual appearance of the document. The method simplifies text-heavy work in functions ranging from legal to recruitment.

We sat down with Luiza Sayfullina, our AI Scientist focused on natural language processing, to discuss document understanding. Together with our customers, we incorporate these types of methods and technologies into AI-driven solutions that make their work easier.

Luiza, you have a PhD in natural language processing from Aalto University, and have been building real-world NLP solutions for clients for several years now. Describe briefly what document understanding means.

First, let me define the word "document" here. By document, I mean structured text that can be accompanied by tables and images. Companies and other organizations typically rely heavily on documentation, and they look for ways to use digitalization to reduce the time spent both on creating new documents and on processing existing ones. To give you an example, many companies create hundreds of documents such as contracts or proposals every year, but much of this information gets lost, because it is difficult to search for or nobody knows these documents have been created.

Using technologies like NLP to extract relevant information can be helpful in managing one’s own knowledge base. Once documents are scanned or uploaded digitally, we can automatically process each document to label its type, for example contract type, and extract important entities such as buyer and seller names. So far, the majority of the methods for document understanding have been either text based or image based, although both modalities are important. From a visual perspective, font, style, indentation, position and so on give important clues for identifying titles, section names and things such as signature dates.

Document understanding takes into account two aspects, text and visual layout, to better extract the desired information, such as a client name or signature dates. The document can be any textual document, such as a CV, a contract or an agreement.

Describe the methods in document understanding and how the approach works in practice.

Developed by Microsoft, the model is called LayoutLM, an abbreviation for Layout Language Modelling. The name captures its two important aspects: layout and language modelling. It takes inspiration from the traditional language modelling approach, where large neural networks are trained on huge text corpora in order to learn a statistical model of language. The trick is that we do not need labeled data: we can teach the network to predict masked words, or the next word, for a given sequence of text.

The paper on LayoutLM describes an approach where the researchers perform a similar training, but also take into account visual and layout information. The inputs to the neural network are the words, their four coordinates on the page and the corresponding word images. The four location coordinates and the corresponding words are embedded together using BERT and represent text and layout information, while the corresponding word images are embedded with Faster R-CNN.
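To make the input more concrete, here is a minimal sketch of how words and their four coordinates might be prepared before they reach the model. The names `Word`, `normalize_box` and `build_inputs`, the page size and the OCR values are all illustrative assumptions rather than part of LayoutLM itself. The rescaling to a 0 to 1000 grid follows the convention commonly used with LayoutLM, so that pages of different sizes share one coordinate system.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    box: tuple  # (x0, y0, x1, y1) in page pixels: left, top, right, bottom


def normalize_box(box, page_width, page_height, scale=1000):
    """Rescale pixel coordinates to an integer grid from 0 to `scale`."""
    x0, y0, x1, y1 = box
    return (
        int(scale * x0 / page_width),
        int(scale * y0 / page_height),
        int(scale * x1 / page_width),
        int(scale * y1 / page_height),
    )


def build_inputs(words, page_width, page_height):
    """Return parallel lists: one token and one normalized box per word."""
    tokens = [w.text for w in words]
    boxes = [normalize_box(w.box, page_width, page_height) for w in words]
    return tokens, boxes


# Hypothetical OCR output for the line "Date: 15.11.2020"
# on a page of 1240 x 1754 pixels (assumed values).
ocr_words = [
    Word("Date:", (124, 175, 248, 210)),
    Word("15.11.2020", (260, 175, 496, 210)),
]
tokens, boxes = build_inputs(ocr_words, page_width=1240, page_height=1754)
```

Each token now carries its own box, which is the pairing of word and position that the layout part of the model works with.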

The model defines layout as the combination of words in textual form and their coordinates, while visual data is taken from word images. Coordinates can give a clue about where to look for the value of a certain field: for “Date: 15.11.2020”, for example, the value is located to the right of the keyword “Date”. The model first computes the embedding representations for layout and for word images separately, and only then sums them to get the common representation used to solve the task at hand.
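The summation step can be sketched with toy lookup tables. This is not the actual LayoutLM code: the random tables stand in for learned weights, and `DIM`, `GRID`, `make_table`, `layout_embedding` and `combined_embedding` are hypothetical names chosen for illustration. The sketch also assumes coordinates on a 0 to 1000 grid and a constant placeholder vector in place of a real Faster R-CNN image feature.

```python
import random

DIM = 8      # toy embedding size; real models use hundreds of dimensions
GRID = 1001  # coordinate values 0..1000 (assumed normalized grid)


def make_table(num_rows, dim, seed):
    """Randomly initialised lookup table, a stand-in for learned weights."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(num_rows)]


def add_vectors(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(values) for values in zip(*vectors)]


vocab = {"Date:": 0, "15.11.2020": 1}
word_table = make_table(len(vocab), DIM, seed=0)
x_table = make_table(GRID, DIM, seed=1)  # used for left and right edges
y_table = make_table(GRID, DIM, seed=2)  # used for top and bottom edges


def layout_embedding(token, box):
    """Word embedding plus one embedding per box coordinate."""
    x0, y0, x1, y1 = box
    return add_vectors(
        word_table[vocab[token]],
        x_table[x0], y_table[y0], x_table[x1], y_table[y1],
    )


def combined_embedding(token, box, image_vector):
    """Sum the separately computed layout and image representations."""
    return add_vectors(layout_embedding(token, box), image_vector)


image_vector = [0.5] * DIM  # placeholder for a feature of the cropped word image
box = (209, 99, 400, 119)
layout_vec = layout_embedding("15.11.2020", box)
combined = combined_embedding("15.11.2020", box, image_vector)
```

Summing keeps the representation at a fixed size regardless of how many sources of information are combined, which is one reason this design is convenient for a downstream task head.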

As in text-only language modelling, words are randomly masked and the model tries to predict them using positional and image embeddings. This approach is called the Masked Visual-Language Model (MVLM). The advantage of this model is that after it has been trained to predict masked words, it can be fine-tuned with some labeled data for a chosen task.
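A rough sketch of the masking idea is shown below. It hides some words while leaving every bounding box in place, so the position of a hidden word remains available as a clue. The function name `mask_words`, the `[MASK]` placeholder and the 15% masking rate are assumptions borrowed from common BERT-style practice, not details confirmed in this interview.

```python
import random

MASK = "[MASK]"


def mask_words(tokens, boxes, mask_prob=0.15, seed=None):
    """Randomly hide words but keep all boxes untouched.

    Returns (masked_tokens, boxes, labels). `labels` holds the original
    word at masked positions and None elsewhere, so a training loss would
    only be computed where a word was hidden.
    """
    rng = random.Random(seed)
    masked_tokens, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked_tokens.append(MASK)
            labels.append(token)  # the model must recover this word
        else:
            masked_tokens.append(token)
            labels.append(None)   # nothing to predict here
    return masked_tokens, list(boxes), labels


tokens = ["Date:", "15.11.2020", "Total:", "100", "EUR"]
boxes = [
    (100, 99, 200, 119),
    (209, 99, 400, 119),
    (100, 140, 210, 160),
    (220, 140, 280, 160),
    (290, 140, 350, 160),
]
masked_tokens, kept_boxes, labels = mask_words(tokens, boxes, seed=0)
```

Because no labels are needed beyond the original words themselves, this kind of pre-training can run on large collections of unlabeled documents.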

[Figure: Layout modelling for document understanding]
[Figure: Definitions in language modelling]

What are the potential use cases of this technology?

LayoutLM is pre-trained on the IIT-CDIP Test Collection 1.0 dataset, which contains 6 million English documents. This allows us to use the network as a strong baseline for classification tasks, such as document type classification, as well as for document clustering and sequence labelling tasks. Simply put, LayoutLM is a universal neural model like BERT (it actually contains BERT for layout embedding), but for documents.

Therefore, the use cases for this type of document understanding are quite vast, and almost any text-heavy function, from legal to sales, recruitment and administration, can benefit from it. In practical terms, document understanding could be applied to invoice handling and automation, to making contracts and other documents searchable, to helping employees navigate through various PDFs and to extracting buyer and seller information into other systems, such as your CRM.

What could be the common pitfalls where you need to be careful when applying these methods?

First of all, it is good to have a baseline method that works on the text input only, in order to see the benefit of using the LayoutLM model. A solution that uses both modalities, textual and visual, might run more slowly in production, and sometimes purely text-based methods are enough. For certain applications, prediction speed might be a critical aspect.
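One simple way to set up such a comparison is sketched below: a small harness that measures both accuracy and average prediction time for any predictor. Everything here is illustrative. The regular expression baseline, the `evaluate` helper and the two example documents are assumptions made for the sketch, and a layout-aware model would be plugged in as another `predict` function with its own input format.

```python
import re
import time

# Text-only baseline: look for a date right after the keyword "Date:".
DATE_PATTERN = re.compile(r"Date:\s*(\d{2}\.\d{2}\.\d{4})")


def text_only_baseline(text):
    """Return the date following 'Date:', or None if the pattern is absent."""
    match = DATE_PATTERN.search(text)
    return match.group(1) if match else None


def evaluate(predict, examples):
    """Return (accuracy, average seconds per document) for a predictor."""
    correct = 0
    start = time.perf_counter()
    for text, expected in examples:
        if predict(text) == expected:
            correct += 1
    elapsed = time.perf_counter() - start
    return correct / len(examples), elapsed / len(examples)


# Hypothetical labeled examples: (document text, expected date).
examples = [
    ("Invoice 42\nDate: 15.11.2020\nTotal: 100 EUR", "15.11.2020"),
    ("Signed on 03.02.2021 in Helsinki", "03.02.2021"),
]
accuracy, seconds_per_doc = evaluate(text_only_baseline, examples)
```

Running the same `evaluate` function on both the baseline and the richer model gives two pairs of numbers, which makes the trade-off between quality and speed explicit before committing to the heavier solution.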

Secondly, if you wish to work with a language other than English, you would need to train the model from scratch with a collection of documents written in the desired language. Alternatively, the model can be modified to work in a multilingual setting and be trained with documents in various languages.

Thank you, Luiza, for sharing your expertise on the fascinating topic of document understanding!

If you'd like to work with a stellar AI Scientist such as Luiza, get in touch with our Head of Operations Jaakko Vainio at or via LinkedIn.

Pauliina Alanen
Former Head of Brand
Silo AI