Thursday, January 12, 2017

PDF File comparison for automated unit testing

Unit testing and the automation of those tests are among the key concepts for implementing solid testing in a software project.

The unit tests of the System Concept DMS project need to validate the PDF documents it produces in various scenarios, e.g.:

  • Split a (scanned) stack into page documents
  • Split a scanned stack into single semantic documents using a barcode/QR-code mechanism
  • Combine single page documents to one document
  • ...
To abstract document-oriented unit tests I implemented a base class which can process a flexible number of test document sets in a directory structure.
A document set consists of defined input documents and the corresponding output documents. The following image illustrates the test case AutoIndexBarcodeTest. The test contains 1 document set. The set consists of 1 input document (input1.pdf) and 4 output documents in the subfolder output.

Image 1: test document set, input and output documents
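The actual base class is not shown in this post. As a rough sketch of the idea only: the class name DocumentSetTestBase, the method runDocumentSets and the assumed layout (one subdirectory per document set, with the input PDFs directly inside it) are hypothetical. Only doFile(File inputFile) is taken from the text.

```java
import java.io.File;
import java.util.Arrays;

// Hypothetical base class: walks all document sets of a test case directory
// and hands every input document to the concrete test.
abstract class DocumentSetTestBase {

    // Implemented by the concrete test: process one input document and
    // validate the output it produces.
    protected abstract void doFile(File inputFile) throws Exception;

    // Assumed layout: <testCaseDir>/<setDir>/*.pdf are the inputs of one set.
    // Returns the number of input documents that were processed.
    protected int runDocumentSets(File testCaseDir) throws Exception {
        File[] setDirs = testCaseDir.listFiles(File::isDirectory);
        if (setDirs == null) {
            throw new IllegalArgumentException("Not a directory: " + testCaseDir);
        }
        Arrays.sort(setDirs); // deterministic order across platforms
        int processed = 0;
        for (File setDir : setDirs) {
            // Only regular *.pdf files count as inputs; the output subfolder is skipped.
            File[] inputs = setDir.listFiles(
                    f -> f.isFile() && f.getName().toLowerCase().endsWith(".pdf"));
            if (inputs == null) {
                continue;
            }
            Arrays.sort(inputs);
            for (File input : inputs) {
                doFile(input);
                processed++;
            }
        }
        return processed;
    }
}
```

With a structure like this, adding another document set to a test case means adding another directory, without touching the test code.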
The concrete implementation for the barcode split unit test overrides the method
doFile(File inputFile). The implementation processes the input document and then needs to validate the created output documents against the output templates from the test definition (see Image 1).

@Override
protected void doFile(File inputFile) throws Exception {
   
   // Process input file

   // Validate output files
   //  - count of files
   //  - location/name of the files
   //  - file content
}

The checks for count and location/name of the output documents were quite simple.
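A minimal sketch of such a count and name check, assuming the created files and the templates each sit in one flat directory. The class OutputFileCheck and its methods are hypothetical names, not part of the project.

```java
import java.io.File;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for the "count" and "location/name" checks.
final class OutputFileCheck {

    // Collects the names of all regular files directly inside dir.
    static Set<String> fileNames(File dir) {
        Set<String> names = new TreeSet<>();
        File[] files = dir.listFiles(File::isFile);
        if (files != null) {
            for (File f : files) {
                names.add(f.getName());
            }
        }
        return names;
    }

    // True if both directories contain the same number of files
    // and every file name appears on both sides.
    static boolean sameFileNames(File actualDir, File templateDir) {
        return fileNames(actualDir).equals(fileNames(templateDir));
    }
}
```

Comparing the two name sets covers the count and the names in one step; the file content still has to be checked separately.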

Problem - PDF file comparison using checksum


To compare the PDF documents by content, a file checksum like MD5 seemed to be an appropriate solution: easy to implement and more or less 'bug-safe', which is important for tests.
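For reference, a plain file checksum along those lines could look like the following sketch. The class name PlainChecksum and the method md5 are hypothetical; MessageDigest is the standard JDK API.

```java
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: MD5 over every byte of the stream, nothing skipped.
final class PlainChecksum {

    static byte[] md5(InputStream in) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] buffer = new byte[8192];
        int read;
        // Feed the digest chunk by chunk so large files need not fit in memory.
        while ((read = in.read(buffer)) != -1) {
            md.update(buffer, 0, read);
        }
        return md.digest();
    }
}
```

Two digests obtained this way would be compared with java.util.Arrays.equals (or MessageDigest.isEqual), not with ==.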

Unfortunately, valid PDF output files produced checksums that differed from those of the test templates. The diff command-line tool reported one difference near the end of the documents:

peter@grogonzola:~$ diff -Naur result.pdf template.pdf
 /Root 1 0 R
 /Info 4 0 R
-/ID [<0e90025fbadc9f6434f3d192980836f3> <0e90025fbadc9f6434f3d192980836f3>]
+/ID [<bf69242a5582735a45e72ab0ed370876> <bf69242a5582735a45e72ab0ed370876>]
 /Size 11

The reason for the different checksums is the PDF file identifier (/ID), which is created individually per file and always differs, even if two documents are produced by the exact same piece of code.

If you want further information about it, take a look at the following discussion on Stack Overflow:

http://stackoverflow.com/questions/20039691/why-are-pdf-files-different-even-if-the-content-is-the-same

Mr. Lowagie explains the reason for different checksums and the need for the PDF file identifier. So far so good - but that is really awful for document validation against defined templates and test automation.


Solution 1 - Adapted checksum calculation


One possible solution is to implement an adapted checksum calculation: look for the /ID line and do not pass its data into the checksum calculation (the same applies to /CreationDate).
I implemented a quick hack which does the trick but uses a BufferedReader. That is not the first choice when dealing with binary data, but it works for now:


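import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public byte[] getChecksum(InputStream stream) throws Exception {

  // ISO-8859-1 maps every byte to exactly one char, so binary data is not
  // altered by the platform default charset on its way through the reader.
  // try-with-resources closes the reader (and the stream) in all cases.
  try (BufferedReader reader = new BufferedReader(
      new InputStreamReader(stream, StandardCharsets.ISO_8859_1))) {

    MessageDigest md = MessageDigest.getInstance("MD5");

    String line;

    // readLine() strips the line terminators, so they are not hashed
    while ((line = reader.readLine()) != null) {
      // skip the file identifier line, which differs for every file
      if (!line.startsWith("/ID [<")) {
        md.update(line.getBytes(StandardCharsets.ISO_8859_1));
      }
    }

    return md.digest();
  }
}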

The problem here is that BufferedReader is meant for text files and is line-oriented. The plan is to implement a real binary solution using a byte[] buffer. Basically it is easy to find the "/ID [<" snippet in the byte array. The difficulty is to also find it when it is split across different buffers during the read process.
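One way to sidestep the split-buffer problem is to collect the bytes of the current line across reads and decide only once the line is complete. The following is a sketch under assumptions, not the planned implementation: the class PdfChecksum and the method checksumIgnoringId are hypothetical names, and it assumes lines end with LF (or CR LF), so a file using bare CR line endings would not be handled.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: MD5 over the raw bytes, skipping lines that start with "/ID [<".
final class PdfChecksum {

    private static final byte[] ID_MARKER = "/ID [<".getBytes(StandardCharsets.US_ASCII);

    static byte[] checksumIgnoringId(InputStream in, int bufferSize)
            throws IOException, NoSuchAlgorithmException {
        if (bufferSize < 1) {
            throw new IllegalArgumentException("bufferSize must be positive");
        }
        MessageDigest md = MessageDigest.getInstance("MD5");
        // Holds the current line; it survives across read() calls, so a marker
        // that is split over two buffers is still seen as a whole.
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        byte[] buffer = new byte[bufferSize];
        int read;
        while ((read = in.read(buffer)) != -1) {
            for (int i = 0; i < read; i++) {
                line.write(buffer[i]);
                if (buffer[i] == '\n') {
                    flushLine(md, line);
                }
            }
        }
        flushLine(md, line); // last line without a trailing LF
        return md.digest();
    }

    // Hashes the collected line (including its terminator) unless it is the /ID line.
    private static void flushLine(MessageDigest md, ByteArrayOutputStream line) {
        byte[] bytes = line.toByteArray();
        if (!startsWith(bytes, ID_MARKER)) {
            md.update(bytes);
        }
        line.reset();
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Unlike the BufferedReader version, this sketch also hashes the line terminators, so the two variants produce different digests for the same file; templates and results have to be hashed with the same variant. A binary stream without any LF is buffered as one long line, which costs memory but does not change the result.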


Solution 2 - Using external tools/libraries for comparison


Another solution is to use external tools/libraries to compare PDF documents. Possible tools are:

  • http://jpdfunit.sourceforge.net - Apache PDFBox based framework for JUnit
  • diffpdf (GUI), for inspiration
  • http://www.qtrac.eu/comparepdf.html - command-line based tool
  • imagemagick (compare)

All of these tools more or less interpret/render the documents and so compare them based on their structure and appearance, not on the binary file content itself.

With Jpdfunit in particular, I see a problem in using the same library for test validation that the project itself uses to create the PDF documents (System Concept DMS uses Apache PDFBox), since a bug in that library could then affect both creation and validation and go unnoticed.

Conclusion


Two PDF files created by the same piece of code at different times are not binary identical. So a checksum algorithm produces different fingerprints and a comparison fails.

Reasons:

  • Unique PDF file identifier (/ID)
  • Creation date (/CreationDate)

Possible solutions are:

  1. Adapted checksum calculation which omits /ID and /CreationDate
  2. External tools/libraries to compare PDF documents by their structure/appearance and not by binary content

Advantages of 1. (adapted checksum)

  • no dependencies
  • fast
  • metadata is validated, too
  • will fail if anything but the omitted parts differs

Advantages of 2. (tools)

  • may also work for comparing documents from different sources



To keep tests simple and reduce dependencies on external tools I decided to use the adapted checksum solution.