Saturday, 16 December 2017

Simplify text-based editing of DocBook XML documentation

Writing and maintaining technical documentation is a really broad topic. Our product System Concept DMS needs documentation.

Here is what we need:

  1. PDF output with TOC, page numbers, etc.
  2. HTML help page to publish on the web
  3. Files for the Eclipse help system (in-program online help)
All of this should be generated from a single source.


I started with DocBook about 1.5 years ago. As a programmer I have no difficulties editing XML files, but I realized that writing the documentation in plain DocBook XML is too much work. I also expected problems concerning image parameters: these are present in each DocBook file, so changes must be rolled out to all files.

So I left DocBook again and went back to LibreOffice, at least to collect knowledge for later.


I have now reactivated the DocBook stuff. The idea is to simplify the editing process by pre-processing the source files.

Here is my current draft of a simplified input file:

 <section id="function_objekte_entfernen">  
   
 #height 10cm  
   
 <title>Objekte entfernen</title>  
   
   
 Mit der Aktion #b Objekte entfernen# können mit System Concept DMS aufgebrachte   
 #link Haftnotizen function_haftnotiz#, Markierungen und Schwärzungen aus   
 einem Dokument entfernt werden.  
   
   
 Sie können die Funktion per #b Rechtsklick-Objekte entfernen# direkt aus der   
 Kachelansicht oder aus einem geöffneten Dokument aufrufen.  
    
   
 Es wird der Dialog zum Entfernen von Objekten geöffnet. Markieren Sie den entsprechenden Eintrag  
 in der Tabelle und bestätigen Sie mit #b OK#. Das Objekt wird entfernt und die  
 Dokumentansicht aktualisiert.  
   
 #img img/web/function_objekte_entfernen_1.png 12cm 'Dialog zum Entfernen von Objekten'#  
   
 </section>  


As you can see, some DocBook XML elements are still present. Frequently used or complex elements are simplified to a #tag # syntax.

Two empty lines create a paragraph break (</para><para>).

With the #img src width title# tag the pre-processing centralizes the XML representation of images. So if a change is necessary, I just re-generate the XML files.

I decided on '#' because it does not require the Shift key. Writing the documentation text must be as easy as possible.

#b: emphasis
#img: mediaobject
#icon: inlinemediaobject
#l: itemizedlist
#li: listitem + para
#-: end sequence (e.g. for #l or #li)
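
To give an idea of how little machinery this needs, here is a minimal sketch of one possible pre-processor pass in Java (my illustration, not the actual tool; it only covers the #b and #link tags from the example above plus the paragraph-break rule):

import java.util.regex.Pattern;

public class DocBookPreprocessor {

    // #b text#  ->  <emphasis>text</emphasis>
    private static final Pattern BOLD = Pattern.compile("#b ([^#]+)#");

    // #link text target#  ->  <link linkend="target">text</link>
    private static final Pattern LINK = Pattern.compile("#link (.+?) (\\S+)#");

    public static String process(String source) {
        source = BOLD.matcher(source).replaceAll("<emphasis>$1</emphasis>");
        source = LINK.matcher(source).replaceAll("<link linkend=\"$2\">$1</link>");
        // Two empty lines create a paragraph break
        source = source.replaceAll("\\R[ \\t]*\\R[ \\t]*\\R", "\n</para>\n<para>\n");
        return source;
    }
}

The #img tag would be handled the same way, expanding into the full mediaobject block shown in the output below.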

The above example will produce the following DocBook XML section:


 <section id="function_objekte_entfernen">  
   
 <?dbfo-need height="10cm" ?>  
 <title>Objekte entfernen</title>  
   
 <para>  
 Mit der Aktion <emphasis>Objekte entfernen</emphasis> können mit System Concept DMS aufgebrachte   
 <link linkend="function_haftnotiz">Haftnotizen</link>, Markierungen und Schwärzungen aus   
 einem Dokument entfernt werden.  
   
 </para>  
 <para>  
 Sie können die Funktion per <emphasis>Rechtsklick-Objekte entfernen</emphasis> direkt aus der   
 Kachelansicht oder aus einem geöffneten Dokument aufrufen.  
    
 </para>  
 <para>
 Es wird der Dialog zum Entfernen von Objekten geöffnet. Markieren Sie den entsprechenden Eintrag  
 in der Tabelle und bestätigen Sie mit <emphasis>OK</emphasis>. Das Objekt wird entfernt und die  
 Dokumentansicht aktualisiert.  
   
 <mediaobject>  
 <imageobject condition="print">  
 <imagedata fileref="img/web/function_objekte_entfernen_1.png" format="PNG" contentdepth="12cm" />  
 </imageobject>  
 <textobject>  
 <phrase>Dialog zum Entfernen von Objekten</phrase>  
 </textobject>  
 <caption>  
 <para>Dialog zum Entfernen von Objekten</para>  
 </caption>  
 </mediaobject>  
   
   
 </para>  
 </section>  
   

This is much further away from "just writing" and almost twice as long.


The next step will be to divide the source into logical files (sections) and re-combine them for different purposes. This can be done via XML entities.
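
A minimal sketch of how that could look (file and entity names are made up for illustration): a master document declares one external entity per section file and pulls the sections in wherever they are needed.

 <!DOCTYPE book [
   <!ENTITY function_haftnotiz SYSTEM "function_haftnotiz.xml">
   <!ENTITY function_objekte_entfernen SYSTEM "function_objekte_entfernen.xml">
 ]>
 <book>
   <chapter>
     <title>Funktionen</title>
     &function_haftnotiz;
     &function_objekte_entfernen;
   </chapter>
 </book>

A different master file can then combine another subset of the same section files, e.g. for the Eclipse help build.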


Tuesday, 30 May 2017

Ideas and problems with yellow pin notes on PDF documents

The idea is quite simple: pin these little yellow notes on digital PDF documents in an easy-to-use way. You can do it with Adobe Reader, but there are some problems:

  • It is not easy to use - e.g. you have to use "save as" and cannot simply overwrite the existing document.
  • Adobe Reader is not available on every system.
  • There is no way to integrate the Reader functions into the System Concept DMS product.

The System Concept DMS software has been able to place notes in an easy-to-use way for about two years. But there was no way to remove or edit the notes yet.

The SC DMS feature uses Apache PDFBox and draws a note in three steps (a sketch follows the list):

  1. yellow box (addRect + fill)
  2. text (beginText + showText + endText)
  3. border (addRect + stroke)
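
Roughly, the three steps could look like this with PDFBox 2.x (a hedged sketch with placeholder coordinates and text, not the production code; doc and page are an open PDDocument and the target PDPage):

import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDType1Font;

PDPageContentStream cs = new PDPageContentStream(
        doc, page, PDPageContentStream.AppendMode.APPEND, true);

// 1. yellow box (addRect + fill)
cs.setNonStrokingColor(java.awt.Color.YELLOW);
cs.addRect(100, 500, 200, 80);
cs.fill();

// 2. text (beginText + showText + endText)
cs.setNonStrokingColor(java.awt.Color.BLACK);
cs.beginText();
cs.setFont(PDType1Font.HELVETICA, 11);
cs.newLineAtOffset(110, 560);
cs.showText("This is the note text");
cs.endText();

// 3. border (addRect + stroke)
cs.setStrokingColor(java.awt.Color.BLACK);
cs.addRect(100, 500, 200, 80);
cs.stroke();

cs.close();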

The user interface provides a two-step wizard to enter the text and choose a position for the note.

PDF document with a nice yellow note pinned


Make it removable

There was customer feedback that it would be great if notes were at least removable. This is not a simple task, since a note consists of a number of drawing operations which are not connected in any way within the PDF.
I found a solution for that: I use PDF comments (lines beginning with '%') to identify content streams which contain removable objects like notes.

So far so good. But it turned out that content streams are merged together by a certain page function of PDFBox. This resulted in an empty page if the user removed a note:
the note's META comment was still in the page, but all content had been merged into one single stream.

Use annotations

I tried to rewrite the note feature to make use of PDF annotations. Doing some reverse engineering, I found out that annotations are what Adobe Reader produces for its notes.

Apache PDFBox is able to manage annotations, too:


import java.io.File;
import java.util.Calendar;
import java.util.List;

import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.pdmodel.graphics.color.PDColor;
import org.apache.pdfbox.pdmodel.graphics.color.PDDeviceRGB;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationMarkup;

// doc is an open PDDocument
PDPage page = doc.getPage(0);

List<PDAnnotation> annotations = page.getAnnotations();

PDAnnotationMarkup freeTextMark = new PDAnnotationMarkup();
freeTextMark.setAnnotationName("SCDMS:Note:Peter Pinnau");

freeTextMark.getCOSObject().setName(COSName.SUBTYPE,
   PDAnnotationMarkup.SUB_TYPE_FREETEXT);

freeTextMark.setCreationDate(Calendar.getInstance());
freeTextMark.setAnnotationFlags(4); // 4 = print flag

// Yellow color for the background
PDColor yellow = new PDColor(new float[] { 1, 1, 0 }, PDDeviceRGB.INSTANCE);
freeTextMark.setColor(yellow);

// Position for the annotation
PDRectangle position = new PDRectangle();

position.setLowerLeftX(100);
position.setLowerLeftY(200);
position.setUpperRightX(400);
position.setUpperRightY(500);
freeTextMark.setRectangle(position);

// Set some data
freeTextMark.setTitlePopup("Peter Pinnau");
freeTextMark.setContents("This is the text\nENTER1\nENTER2");
freeTextMark.setPrinted(true);
freeTextMark.setInvisible(false);

// Black color, "Helv" font, 11 point
freeTextMark.getCOSObject().setString(COSName.DA, "0 0 0 rg /Helv 11 Tf");

// Add the annotation
annotations.add(freeTextMark);

// Save the document
doc.save(new File("..."));


The above code places a nice multi-line yellow note in the PDF. It is visible and editable in Adobe Reader, and it is visible in the PDF viewer shipped with Ubuntu.
But it is NOT visible in Mozilla's PDF.js viewer. Unfortunately, SCDMS uses PDF.js to view PDF documents.

I found out that Apache PDFBox and PDF.js do not implement a so-called default appearance for annotations. Since the annotation has no appearance, it is not visible.

Adobe Reader creates a default appearance and displays the annotation correctly. Once the PDF has been saved from Adobe Reader, the annotations also become visible in PDF.js.

There are two open issues concerning this:

PDF.js:
https://github.com/mozilla/pdf.js/issues/6810

PDFBox:
https://issues.apache.org/jira/browse/PDFBOX-2019


The best way to solve this problem for SCDMS will of course be to add a correct appearance stream when generating the annotation.
Unfortunately this goes deep into PDF internals, so I hope that PDFBOX-2019 will be resolved in the near future.

For now I switched back to the old implementation and found another way to perform the page operations mentioned above, so that the empty-page problem could be solved in this particular case.

The content stream merging was done by (page being a page with content from an existing document):

PDDocument.importPage(PDPage page)

I now use:

PDDocument.addPage(PDPage page)

and the content streams are not merged together anymore.
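
In other words (simplified; srcDoc stands for the opened source document):

// Attach pages without merging their content streams.
// Note: srcDoc must stay open until dest has been saved, because the
// added pages still reference resources from the source document.
PDDocument dest = new PDDocument();
for (PDPage page : srcDoc.getPages()) {
    dest.addPage(page);
}
dest.save(new File("..."));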







Friday, 3 March 2017

Noise filter for QR code detection in scanned documents

I am using ZXing in a project to detect QR codes within scanned documents. The goal is to achieve almost 100% recognition, but there were some issues to solve:


  1. ZXing does not find small codes within a full document page. Since the QR code stickers are pinned on the documents manually, the user has to pin the sticker in one of the 4 corners.
    The processor then cuts out corner by corner and searches for the code there.
  2. Unfortunately there were still unrecognized codes. Detection relies on the printing quality of the stickers, which may not be perfect in every situation.
    I did some tests and corrected unrecognized codes with GIMP until they worked. I came to the conclusion that a filter is needed to eliminate false pixels as well as possible.



I spent an evening on that and finally found a specialized solution. Take a look at the sample images:

Left: original, Right: filter applied

The result is amazing, isn't it? ZXing is now able to recognize the code.

How does it work?


My first idea was to use OpenCV to implement the filter, but then I tried a very simple "self-made" algorithm:


  1. The input has to be black/white pixel data already
  2. Iterate over the pixels (by rows and columns)
  3. Leave white pixels as they are
  4. For each black pixel, count the black pixels in the surrounding 7x7 square
  5. Calculate the ratio: black pixels in the 7x7 square / 49
  6. If the ratio is less than 0.4 -> set the pixel to white

It is important to work on a copy of the input data: the filter must not analyse pixels which have already been modified by the algorithm.
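
A minimal sketch in Java (my own illustration of the steps above, not the project code; it assumes a boolean[][] bitmap with true = black and already uses the pre-calculated count threshold discussed below):

// Clearing pass: erase black pixels whose square is too sparse.
// Works on a copy (out) so that pixels already modified by the
// algorithm never influence the neighbour count.
static boolean[][] clearNoise(boolean[][] src, int radius, int threshold) {
    int h = src.length, w = src[0].length;
    boolean[][] out = new boolean[h][w];
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            if (!src[y][x]) {
                continue; // leave white pixels as they are
            }
            // Keep the pixel only if its square is dense enough
            out[y][x] = countBlackNeighbours(src, x, y, radius) >= threshold;
        }
    }
    return out;
}

// Counts black pixels in the (2*radius+1)^2 square around (x, y),
// the current pixel itself not included. The square shrinks at the
// image borders thanks to the bounds checks.
static int countBlackNeighbours(boolean[][] src, int x, int y, int radius) {
    int h = src.length, w = src[0].length;
    int count = 0;
    for (int dy = -radius; dy <= radius; dy++) {
        for (int dx = -radius; dx <= radius; dx++) {
            int ny = y + dy, nx = x + dx;
            if ((dx != 0 || dy != 0)
                    && ny >= 0 && ny < h && nx >= 0 && nx < w
                    && src[ny][nx]) {
                count++;
            }
        }
    }
    return count;
}

For the 7x7 square and the 0.4 threshold this would be called as clearNoise(bitmap, 3, 20).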

Since QR codes consist of rectangular patterns, the filter destroys almost no real data as long as the stickers are pinned reasonably straight.
Typical noise from bad printers or scanning failures is reduced very well.

Close to the borders there is no full 7x7 square available. It would be possible to skip those areas; instead, I decided to shrink the square according to the position and process the data in the same way.

Of course the 7x7 size is chosen to match the QR code size and the scanning resolution.

The following illustration shows the 7x7 square around the current pixel. The black pixel count in this case equals 5 (current pixel not included). The current pixel will therefore be erased and set to white.

Illustration 7x7 square


Make it simpler


A friend of mine pointed out that calculating the ratio is not necessary. The pixel count of the square is always 7x7 = 49, so the threshold of 0.4 can be pre-calculated as 0.4 * 49 ≈ 20.

Exception: the border areas of the image. The square is shrunk there, but it is no problem to use the pre-calculated threshold; the algorithm is then just a little more "aggressive" at the image borders (first 3 pixels).


Close areas


The next step is to use the algorithm to close areas: if the black pixel count in the square is greater than 40, a white pixel is set to black.
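
Reusing the neighbour count from the sketch above, the closing pass is simply the mirror image (again my illustration, not the project code):

// Closing pass: fill white pixels that sit inside dense black areas.
static boolean[][] closeAreas(boolean[][] src, int radius) {
    int h = src.length, w = src[0].length;
    boolean[][] out = new boolean[h][w];
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            // A white pixel becomes black if more than 40 pixels
            // in its square are black
            out[y][x] = src[y][x]
                    || countBlackNeighbours(src, x, y, radius) > 40;
        }
    }
    return out;
}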

The following image shows the progress. Please enlarge the picture and compare the middle and right samples; you will see that some white pixels in the data blocks have been closed.

Left: original, middle: cleared, right: closed




Thursday, 16 February 2017

# c0ders wanted #

If the statistics are to be believed, there seem to be some followers of this blog. Now it's up to you: inspired by a friend of mine, I have created a little programming riddle.

I pinned it at the Mensa (cafeteria) of the tech university near our office. The purpose is to find a student who will be able to support us.
Unfortunately the semester holidays have just started. Perfect timing ;)

Feel free to solve the task and let me know (peter AT pinnau.biz).



Thursday, 12 January 2017

PDF File comparison for automated unit testing

Unit testing, and the automation of those tests, is one of the key concepts for implementing solid testing in a software project.

The unit tests of the System Concept DMS project need to validate the produced PDF documents in various scenarios, e.g.:

  • Split a (scanned) stack into page documents
  • Split a scanned stack into single semantic documents using a barcode/QR-code mechanism
  • Combine single page documents to one document
  • ...
To abstract document-oriented unit tests I implemented a base class which can process a flexible number of test document sets in a directory structure.
A document set consists of defined input documents and the corresponding output documents. The following image illustrates the test case AutoIndexBarcodeTest. The test contains one document set; the set consists of one input document (input1.pdf) and 4 output documents in the subfolder output.

Image 1: test document set, input and output documents
The concrete implementation for the barcode split unit test overrides the method
doFile(File inputFile). The implementation processes the input document and then needs to validate the created output documents against the output templates from the test definition (see image):

@Override
protected void doFile(File inputFile) throws Exception {
   
   // Process input file

   // Validate output files
   //  - count of files
   //  - location/name of the files
   //  - file content
}
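
For context, the base class essentially just walks the document sets and hands every input file to doFile. A rough sketch (class name and directory layout are assumptions, not the actual implementation):

import java.io.File;

public abstract class DocumentSetTest {

    /** Processes one input document; implemented by the concrete test case. */
    protected abstract void doFile(File inputFile) throws Exception;

    /** Walks all document sets below the test directory. */
    protected void runDocumentSets(File testDir) throws Exception {
        for (File set : testDir.listFiles(File::isDirectory)) {
            // Every PDF directly inside the set folder is an input document;
            // the expected results live in the output subfolder.
            for (File input : set.listFiles(f -> f.isFile() && f.getName().endsWith(".pdf"))) {
                doFile(input);
            }
        }
    }
}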

The checks for count and location/name of the output documents were quite simple.

Problem - PDF file comparison using checksum


To compare the PDF documents by content, a file checksum like MD5 seemed to be an appropriate solution. Easy to implement and more or less 'bug-proof', which is an important property for test code.

Unfortunately, valid PDF output produced checksums different from the test templates. The diff command-line tool reported one difference near the end of the documents:

peter@grogonzola:~$ diff -Naur result.pdf template.pdf
 /Root 1 0 R
 /Info 4 0 R
-/ID [<0e90025fbadc9f6434f3d192980836f3> <0e90025fbadc9f6434f3d192980836f3>]
+/ID [<bf69242a5582735a45e72ab0ed370876> <bf69242a5582735a45e72ab0ed370876>]
 /Size 11

The reason for the different checksums is the PDF file identifier (/ID), which is individually created per file and always differs, even if two documents are produced by the exact same piece of code.

If you want further information about it, take a look at the following discussion on Stack Overflow:

http://stackoverflow.com/questions/20039691/why-are-pdf-files-different-even-if-the-content-is-the-same

Mr. Lowagie explains the reason for the different checksums and the need for the PDF file identifier. So far so good, but this is really awful for validating documents against defined templates and for test automation.


Solution 1 - Adapted checksum calculation


One possible solution is an adapted checksum calculation: look for the /ID line and do not pass that data into the checksum calculation (the same for /CreationDate).
I implemented a quick hack which does the trick but uses a BufferedReader. That is not the first choice when dealing with binary data, but it works for now:


public byte[] getChecksum(InputStream stream) throws Exception {

  BufferedReader reader = new BufferedReader(new InputStreamReader(stream));

  try {
    MessageDigest md = MessageDigest.getInstance("MD5");

    String line;

    while ((line = reader.readLine()) != null) {
      // Skip the lines that legitimately differ between otherwise
      // identical documents; note that readLine() drops the line
      // terminators, which is acceptable for this fingerprint.
      if (!line.startsWith("/ID [<") && !line.startsWith("/CreationDate")) {
        md.update(line.getBytes());
      }
    }

    return md.digest();

  } finally {
    // Close the reader; ignore errors on close
    try { reader.close(); } catch (Exception e) {}
  }
}

The problem here is that BufferedReader is intended for text files and is line-oriented. The plan is to implement a real binary solution using a byte[] buffer array. Basically it is easy to find the "/ID [<" snippet in a byte array; the problem is to also find it when it is split across buffer boundaries during the read process.
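
One possible binary sketch (my suggestion, not implemented yet): read the whole file into memory first, which makes the buffer-boundary problem disappear for typically sized test documents:

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;

public static byte[] getChecksum(Path file) throws Exception {
    byte[] data = Files.readAllBytes(file);
    // Line prefixes that legitimately differ between otherwise equal PDFs
    byte[] skipId = "/ID [<".getBytes(StandardCharsets.ISO_8859_1);
    byte[] skipDate = "/CreationDate".getBytes(StandardCharsets.ISO_8859_1);

    MessageDigest md = MessageDigest.getInstance("MD5");
    int lineStart = 0;
    for (int i = 0; i <= data.length; i++) {
        // Treat '\n' as the line terminator (PDFs may also use '\r')
        if (i == data.length || data[i] == '\n') {
            if (!startsWith(data, lineStart, skipId)
                    && !startsWith(data, lineStart, skipDate)) {
                md.update(data, lineStart, i - lineStart);
            }
            lineStart = i + 1;
        }
    }
    return md.digest();
}

private static boolean startsWith(byte[] data, int offset, byte[] prefix) {
    if (offset + prefix.length > data.length) return false;
    for (int i = 0; i < prefix.length; i++) {
        if (data[offset + i] != prefix[i]) return false;
    }
    return true;
}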


Solution 2 - Using external tools/ libraries for comparison


Another solution is to use external tools/libraries to compare PDF documents. Possible candidates:

  • http://jpdfunit.sourceforge.net - Apache PDFBox based framework for JUnit
  • diffpdf (GUI), to get some inspiration
  • http://www.qtrac.eu/comparepdf.html - command-line tool
  • ImageMagick (compare)

All of these tools more or less interpret/render the documents, so they compare based on structure and appearance rather than on the binary file content itself.

Especially concerning JPdfUnit I see a problem in using the same library for test validation that is used within the project to create the PDF documents (System Concept DMS uses Apache PDFBox).

Conclusion


Two PDF files created by the same piece of code at different times are not binary equal, so a checksum algorithm produces different fingerprints and the comparison fails.

Reasons:

  • Unique PDF file identifier (/ID)
  • Creation date (/CreationDate)

Possible solutions are:

  1. Adapted checksum calculation which omits /ID and /CreationDate
  2. External tools/libraries to compare PDF documents by their structure/appearance and not their binary content

Advantages of 1. (adapted checksum)

  • no dependencies
  • fast
  • metadata is validated, too
  • will fail if anything but the omitted parts differs

Advantages of 2. (tools)

  • may also work for comparing documents from different sources



To keep the tests simple and to reduce dependencies on external tools, I decided on the adapted checksum solution.