Scanned PDFs

Types of PDFs: native versus scanned (image) PDFs

Background

Does the type of PDF created matter?

There are two types of PDF:

  • Native PDF
  • Scanned PDF

Does the type of PDF created matter?  Yes. When converting a PDF, the type of PDF does matter.

Native PDFs

Native PDFs are generated from an electronic source document, for example:

  • Accounts production software
  • Word
  • Excel
  • HTML
  • Adobe InDesign
  • Computer-generated report
  • Etc.

... all of which have an internal structure that can be read and interpreted by software.

These "generated" native PDF documents therefore already contain characters that have an electronic character designation (each character is stored as text rather than as a picture). In most cases, the PDF creation software takes information from the structure of the source document (such as character information and word placement information) and retains it in the created PDF output.  This is why you can run a word search on a text-based PDF document.

Scanned PDFs

A scanned PDF comes about where a physical paper document needs to be converted into an electronic form (i.e. where it is inefficient or not viable to re-type/recreate documents manually into electronic form and then convert them into PDFs).

The solution is to scan the document using an electronic scanning device. The scanner digitally captures the image of the physical document, creating a “snapshot” picture of the document.  (Note: the scanner does not recognize the individual characters of each word when it creates this scanned image.)  This snapshot is then turned into a PDF by software integrated with the scanner.

The result is a scanned PDF document.

However, even though the image may be of a document that contains words, the computer recognizes those words only as “images”, which it displays without any information structure behind them.

This is why, if you try to text search the document, the PDF search engine will not return any results.
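The difference between the two types can be checked programmatically. The following is only a minimal sketch, assuming the text of each page has already been extracted with whatever PDF library is in use; the function name classify_pdf and the min_chars_per_page threshold are hypothetical and chosen for illustration. Pages of a scanned PDF typically yield little or no extractable text.

```python
def classify_pdf(page_texts, min_chars_per_page=20):
    """Guess whether a PDF is native or scanned from its extracted page text.

    page_texts: list of strings, one per page, as produced by a PDF
    text-extraction library (an assumption of this sketch).
    min_chars_per_page: hypothetical threshold below which a page is
    treated as having no usable text layer.
    """
    if not page_texts:
        return "unknown"

    # Count pages that carry a meaningful amount of extractable text.
    pages_with_text = sum(
        1 for text in page_texts if len(text.strip()) >= min_chars_per_page
    )

    if pages_with_text == len(page_texts):
        return "native"
    if pages_with_text == 0:
        return "scanned"
    # Some pages have text and some do not, e.g. scanned pages
    # appended to a native PDF.
    return "mixed"
```

A result of "scanned" or "mixed" suggests that an OCR stage is needed before the text can be searched or converted.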

OCR solution for scanned PDFs

To convert a scanned PDF into a searchable/editable format, OCR (optical character recognition) software is required to analyze the “image” of each character and match it to the corresponding electronic character.  This process is not error free, and it may be difficult to determine whether the character "recognized" by the OCR software is indeed the character on the scanned document.
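Because recognition is uncertain, many OCR engines report a confidence score alongside each recognized word. The sketch below shows one way such scores might be used to pick out words for manual checking. The (word, confidence) input format, the flag_for_review name and the 0.90 threshold are assumptions made for illustration, not the output format of any particular OCR product.

```python
def flag_for_review(ocr_words, threshold=0.90):
    """Return (position, word, confidence) for words below the threshold.

    ocr_words: list of (word, confidence) pairs, with confidence between
    0.0 and 1.0 (a hypothetical format; real OCR engines differ).
    """
    return [
        (position, word, confidence)
        for position, (word, confidence) in enumerate(ocr_words)
        if confidence < threshold
    ]


# Example: the letter "O" has been recognized in place of a zero.
words = [("Turnover", 0.99), ("1O,500", 0.62), ("Cost", 0.97)]
for position, word, confidence in flag_for_review(words):
    print(f"word {position}: {word!r} (confidence {confidence:.2f})")
```

A high confidence score does not guarantee a correct character, so a check like this narrows down manual review rather than replacing it.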

OCR output - quality considerations

Note that the quality of OCR output is affected by factors such as:

  • Poor image quality of the scanned document
  • The selection and mixture of fonts used in the scanned document, including italicized and underlined text, which may blur the shape of individual characters
  • Etc.

OCR output - quality required for financial statements

For financial statements, the quality of OCR conversion is of paramount importance.  Accordingly, these files need to be processed very carefully, followed by manual verification and correction of the OCR output to ensure the accuracy of the results.
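Part of this verification can be assisted by simple arithmetic checks. The sketch below is one possible illustration, not a prescribed procedure: it re-adds a column of OCR'd amounts and compares the result with the stated total, and it reports any amount that cannot be read as a number. The function names parse_amount and check_total, and the convention that brackets denote negative figures, are assumptions of this example.

```python
from decimal import Decimal, InvalidOperation


def parse_amount(text):
    """Parse an amount such as '1,250' or '(300)'; return None if unreadable."""
    cleaned = text.strip().replace(",", "")
    # Assumed convention: a figure in brackets is negative.
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    if negative:
        cleaned = cleaned[1:-1]
    try:
        value = Decimal(cleaned)
    except InvalidOperation:
        return None
    if not value.is_finite():
        return None
    return -value if negative else value


def check_total(line_items, stated_total):
    """Return a list of problems; an empty list means the total agrees."""
    problems = []
    values = []
    for text in line_items:
        value = parse_amount(text)
        if value is None:
            problems.append(f"unreadable amount: {text!r}")
        else:
            values.append(value)

    total = parse_amount(stated_total)
    if total is None:
        problems.append(f"unreadable total: {stated_total!r}")
    elif not problems and sum(values) != total:
        problems.append(f"items sum to {sum(values)}, stated total is {total}")
    return problems
```

For example, check_total(["1,250", "(300)", "50"], "1,000") reports no problems, whereas an amount read as "1O0" (letter O instead of zero) is reported as unreadable. A check like this only catches errors that break the arithmetic, so it supplements manual verification and does not replace it.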

Following the above stage, the file is then ready for conversion to iXBRL or XBRL.

Alternative solution - avoiding the OCR stage

Obtain the source document from which the paper document was printed

An easy solution, to avoid this OCR stage, is to obtain the source document from which the paper document was printed (and then scanned).  This is likely to be:

  • A Word document
  • A native PDF file

... created just before the financial statements were signed.  Note that the actual signature is neither needed nor used for iXBRL/XBRL conversion.