Using AI to extract PDF in Java

We share how Generative AI, guided by prompt engineering techniques, can be employed to effectively extract information from PDFs.

Written by Paul Thorp, Head of Verticals/Insurtech.

Recently there has been much discussion about AI and how it can be used to improve data processes. In this article I share a scenario where we can utilise the OpenAI API (application programming interface) to help extract a range of data from PDF files.

The Spring Framework team is building a new project called Spring AI, which I will be using for the examples below. The project is currently only available as a SNAPSHOT version, so it is subject to change as it progresses towards its first stable release.

Why are PDFs difficult to process?

PDF (Portable Document Format) is a popular file type for sharing documents, such as invoices, across various platforms and systems. It preserves the original layout, fonts, images, and other formatting.

Anyone who has tried to parse PDF data will know that it is not always straightforward, because the data is unstructured. In contrast, an HTML document has a defined set of element types, e.g. paragraphs, tables and lists, which can be more easily filtered and navigated. A PDF document has no such set of defined wrappers, so matching and extracting values accurately requires a lot of custom code and can be extremely difficult and error prone, especially where the layout is not known in advance and the formatting varies between documents.

This is where Generative AI can be of use.

How can AI help?

Imagine that you have a requirement to process invoices for a wide range of establishments and each has its own layout, such as:

[Image: sample invoice]

It cannot be guaranteed that the various data elements will be in the same location or orientation for each invoice, so we need a flexible mechanism for analysing and extracting the information.

Advances in AI, specifically LLMs (large language models), are well suited to understanding the content of such a document. Retrieval-augmented generation (RAG) is a technique for enhancing the accuracy and reliability of generative AI models with facts from external sources. Guided by input prompts, the AI model can find information in these facts and return it in a range of output formats, such as JSON (JavaScript Object Notation).

However, we cannot simply pass the OpenAI API a PDF document, so how do we tell the API what we need?

Guiding the AI to completion

Prompt engineering is a technique employed in natural language processing (NLP) to enhance the performance of language models. It involves refining the instructions or prompts given to these models by providing more context and relevant information, ultimately improving their ability to understand and generate human-like language responses.

Creating effective prompts is imperative to guide the model. Below are some useful strategies for best results:

  • Be clear and concise – the prompt should be easy to understand and provide information for the model to generate the relevant output.
  • Provide context – include relevant background information or specify the context in which the task should be performed.
  • Use contextual embeddings – these help to establish context and relationships between words.
  • Specify the format – clearly state what the expected format should be, e.g. a complete sentence or paragraph, and guide the model on the desired structure of the response.
  • Test and iterate – test your prompts on the model to see how it performs and refine based on the observed outcomes, using different styles, tones or formats.

Prompts in AI have evolved from simple strings to more complex structures that use specific roles, including:

  • ‘System’ – the high-level instructions that guide the AI behaviour and response style.
  • ‘User’ – the input from the user, such as commands, facts, questions, statements, etc.
  • ‘Assistant’ – the AI model’s response.

To retrieve the information, we need to create the appropriate prompt, with the specific instructions and content, much as you would when using ChatGPT in the browser. Our final prompt will combine two role messages, i.e. System and User, to send to the OpenAI API.
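As a rough illustration of the role structure, the sketch below models the two messages as plain Java values. The `Role` enum, the `ChatMessage` record and the `PromptSketch` class are hypothetical names made up for this example and are not Spring AI types; the instruction and invoice text are invented placeholders.

```java
import java.util.List;

/** Hypothetical stand-in for the chat roles described above; not a Spring AI type. */
enum Role { SYSTEM, USER, ASSISTANT }

/** Hypothetical stand-in for a single role message; not a Spring AI type. */
record ChatMessage(Role role, String content) { }

final class PromptSketch {

    /** Combines the System instructions and the User content into one ordered prompt. */
    static List<ChatMessage> buildPrompt(String instructions, String documentText) {
        return List.of(
                new ChatMessage(Role.SYSTEM, instructions),
                new ChatMessage(Role.USER, documentText));
    }

    public static void main(String[] args) {
        List<ChatMessage> prompt = buildPrompt(
                "You extract data from invoices and reply in JSON.", // invented instruction
                "Invoice 1001 Total 250.00");                        // invented sample text
        for (ChatMessage message : prompt) {
            System.out.println(message.role() + ": " + message.content());
        }
    }
}
```

The Assistant role does not appear in the request here, because it is the model's reply that fills it.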

Building the messages

First, we design the parts that will make up the System message. Here we tell the AI model what we are trying to achieve, i.e. assisting with extracting data from an invoice, and the format in which we would like the data returned.

We use both strings to create the required System message for the final prompt later.
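The two strings could look something like the following sketch: an instruction template with a placeholder, plus a separate description of the output format. The template wording, the `{format}` placeholder name and the JSON field names are assumptions made for illustration, not the strings used in the original code, and the substitution uses plain `String.replace` rather than any Spring AI class.

```java
final class SystemMessageSketch {

    // Assumed instruction text; the {format} placeholder is filled in below.
    static final String INSTRUCTIONS =
            "You are assisting with extracting data from an invoice. "
            + "Use only the text supplied by the user. "
            + "Reply with JSON only, matching this structure: {format}";

    // Assumed output structure; the field names are illustrative only.
    static final String FORMAT =
            "{\"invoiceNumber\": string, \"invoiceDate\": string, \"total\": number}";

    /** Merges the format description into the instruction template. */
    static String buildSystemMessage(String instructions, String format) {
        return instructions.replace("{format}", format);
    }

    public static void main(String[] args) {
        System.out.println(buildSystemMessage(INSTRUCTIONS, FORMAT));
    }
}
```

Keeping the format in its own string means the requested fields can be changed without touching the instructions.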

Next, for the User message, we generate the full text value of the PDF document. There are numerous ways to do this, including Apache PDFBox (a Java PDF library). The PDF data is loaded and the string value representing the document is extracted using the PDFTextStripper class.

This produces an output such as the following:

[Image: text extracted from the sample invoice]

As you can see, some of the text would be relatively straightforward to match without using AI methods, while some of it poses challenges. For example, it would be difficult to pick out values where several data points share the same line, where a value spans multiple lines, or where there are multiple line breaks in between, especially as the text is formatted differently in each case. However, the AI model will be able to understand this content and pull out the parts we need, based on the System prompt message.
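To make the difficulty concrete, the sketch below applies one hand-written regular expression to two short snippets of extracted text. Both snippets, the `Invoice No:` label and the class name are invented for illustration and do not come from the sample invoices; the point is only that a pattern tuned to one layout quietly fails on another.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexExtractionSketch {

    // Expects the label and the value on the same line, e.g. "Invoice No: 1001".
    static final Pattern SAME_LINE = Pattern.compile("Invoice No:[ \\t]*(\\S+)");

    /** Returns the invoice number if the text follows the same-line layout. */
    static Optional<String> findInvoiceNumber(String text) {
        Matcher matcher = SAME_LINE.matcher(text);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Invented layout A: label and value share a line with another field.
        String layoutA = "Invoice No: 1001  Date: 01/02/2024";
        // Invented layout B: the value sits two line breaks below its label.
        String layoutB = "Invoice No:\n\n1002\nDate: 01/02/2024";

        System.out.println(findInvoiceNumber(layoutA)); // value found
        System.out.println(findInvoiceNumber(layoutB)); // nothing found
    }
}
```

Each new layout would need another pattern, which is the custom code that the prompt-based approach avoids.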

In our code, the User message is created with the facts from this text.

Generated AI Response

Once we have created the two required messages, we use them to create the main prompt and finally request that the AI client generate the response using that prompt.
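Spring AI hides the HTTP details, but it can help to see roughly what is sent. The sketch below builds the JSON body of an OpenAI chat completions request from the two messages by hand, using only the JDK. It is not the Spring AI code from this article: the class and method names are invented for the example, no request is actually sent, and a real application would use a JSON library rather than manual escaping.

```java
final class ChatRequestSketch {

    /** Escapes a string so it can sit inside a JSON string literal. */
    static String escapeJson(String text) {
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            switch (c) {
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\n': out.append("\\n"); break;
                case '\r': out.append("\\r"); break;
                case '\t': out.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // Other control characters become four-digit hex escapes.
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.toString();
    }

    /** Builds a request body with one System and one User message. */
    static String buildRequestBody(String model, String systemText, String userText) {
        return "{\"model\":\"" + escapeJson(model) + "\","
                + "\"messages\":["
                + "{\"role\":\"system\",\"content\":\"" + escapeJson(systemText) + "\"},"
                + "{\"role\":\"user\",\"content\":\"" + escapeJson(userText) + "\"}"
                + "]}";
    }

    public static void main(String[] args) {
        // Invented message text; extracted PDF text typically contains line breaks.
        System.out.println(buildRequestBody(
                "gpt-3.5-turbo",
                "Extract the invoice data as JSON.",
                "Invoice No: 1001\nTotal: 250.00"));
    }
}
```

Escaping matters here because the text extracted from a PDF is full of line breaks, which are not allowed unescaped inside a JSON string.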

[Image: sample code for generating the AI response]

If we review the response returned for both invoice files, the AI model has managed to successfully extract the requested information, even though the invoice format is different for each, and the data is in the JSON format we specified.

[Image: sample invoice]

Response:

[Image: response for the sample invoice]

[Image: sample invoice]

Response:

[Image: response for the sample invoice]

If the response needs to include more data, such as the number of nights stayed and the rooms information, we simply update the schema definition in the prompt. Even though some of the data is presented in a very different format and located in different areas on the invoices, the AI model is still able to determine all the data we requested, e.g. the start and end dates of the stay.

Tokens

It should be noted that AI models use tokens as their key building block: on input, the model converts the words from the prompt into tokens. Each token equates to roughly 4 characters of text, or 75% of a word, so 100 tokens is roughly 75 words.
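The rule of thumb above can be turned into a quick estimate, as in the sketch below. This is only the approximation from the text, not a real tokenizer: actual counts depend on the model's tokenizer and on the content, and the class and method names are invented for the example.

```java
final class TokenEstimateSketch {

    /** Estimates tokens from text length, assuming about 4 characters per token. */
    static int estimateTokensFromChars(String text) {
        return (text.length() + 3) / 4; // integer division, rounded up
    }

    /** Estimates tokens from a word count, assuming 100 tokens per 75 words. */
    static int estimateTokensFromWords(int wordCount) {
        return (wordCount * 4 + 2) / 3; // wordCount / 0.75, rounded up
    }

    public static void main(String[] args) {
        System.out.println(estimateTokensFromWords(75)); // 100, matching the rule of thumb
        System.out.println(estimateTokensFromChars("Invoice No: 1001")); // 16 characters -> 4
    }
}
```

An estimate like this is enough for a sanity check before sending a prompt; for exact numbers, use the tokenizer for the model in question.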

In our examples, the single page documents are relatively small and using the OpenAI Tokenizer page we can see that these are approximately 350 tokens in size.

[Image: token count shown by the OpenAI Tokenizer]

The AI model being used limits the number of tokens available to any single request. For example, the GPT-3.5-Turbo model allows a maximum of 4096 tokens, shared between the prompt and the completion, so if your prompt is 4000 tokens, the response can be 96 tokens at most.
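The shared limit is simple arithmetic, sketched below with the 4096-token figure quoted above. The class, constant and method names are invented for the example, and the limit should be checked against the model actually in use.

```java
final class TokenBudgetSketch {

    // Limit quoted in the text for GPT-3.5-Turbo; other models differ.
    static final int GPT_35_TURBO_LIMIT = 4096;

    /** Tokens left for the completion once the prompt has been counted. */
    static int maxCompletionTokens(int contextLimit, int promptTokens) {
        return Math.max(0, contextLimit - promptTokens);
    }

    /** True if the prompt leaves room for a completion of the wanted size. */
    static boolean fits(int contextLimit, int promptTokens, int wantedCompletionTokens) {
        return maxCompletionTokens(contextLimit, promptTokens) >= wantedCompletionTokens;
    }

    public static void main(String[] args) {
        System.out.println(maxCompletionTokens(GPT_35_TURBO_LIMIT, 4000)); // 96
        System.out.println(maxCompletionTokens(GPT_35_TURBO_LIMIT, 350));  // 3746
    }
}
```

A check like `fits` before each request makes it obvious when a long document needs to be trimmed or split first.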

When using online AI services, it is important to remember that tokens mean £s – or whatever your currency is – so be careful to limit your prompt size. Other tools can help here, such as vector databases to store document data, which can be pre-filtered to reduce the content sent with the prompt, or running your own AI model locally, e.g. with Ollama.

Conclusion

We have shown how powerful Generative AI can be to process document data, extract relevant information, return the result in a chosen format, and do this with relative ease and accuracy, compared to previous conventional methods that would require a lot of custom, and complex, code.

The new Spring AI project allows us to interact with AI APIs, of which we have only explored OpenAI in this article. Although it is in its infancy, the project has provided a simple mechanism to create the various messages and prompts before calling the AI API to generate the response.

How can Tier 2 help?

If you’re in search of expert assistance in software development, Tier 2 is your dedicated partner. Reach out to discuss your specific requirements and discover how our custom software solutions can elevate your digital initiatives. Come to us for a bespoke software solution leveraging industry-standard Java technologies, agile development, and user-centric design.