Wednesday, February 17, 2010

Migrate a Text Processing Document to DocBook XML

We have struggled a lot with this topic at NeoDoc, so here is a quick how-to that should make your life easier when converting Word or OpenOffice.org text documents to DocBook XML.

We start from the DocBook export available in OpenOffice.org (OOo). The result can be catastrophic unless the document is cleaned up first, so this is the process we came up with:
  1. Open the original document in OOo
  2. If it is very large, work on a sample first: copy just the beginning of the document, keeping a few dozen representative pages (in terms of images, tables, section structure, etc.)
  3. Paste that content into a new, blank OOo text document, and save it under a different name.
  4. Now comes the boring part: you will have to fix the document structure and styles:
    • Make sure the titles are correctly styled (title1, title2, etc.)
    • Make sure each of those styles is mapped to the correct level (in the style configuration window) and has no numbering associated with it
    • Make sure the chapter numbering configuration (Tools menu) actually matches the title styles used.

  5. Once this is done, make sure the default styles are applied to the whole document: select all content (Ctrl+A), then right-click -> Default Formatting. Keep track of the steps you took to fix this sample document.
  6. Save as DocBook and check that the output contains the content and structure you expect. If not, go back to step 4.
  7. Once you are satisfied with the sample document, apply all the steps (from step 3) to the real document, and save it as DocBook.
  8. Remove all "anchor" elements from the exported XML: they proved to make FOP fail.
  9. Check the value of the "cols" attribute of all tables. It must be equal to the maximum number of cells in a row. OOo writes incorrect decimal values there.
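Steps 8 and 9 get tedious by hand on a large export, so here is a minimal Python sketch of how they could be automated. It is not the tool we use, only an illustration under a few assumptions: the export is DocBook 4 without an XML namespace, the "cols" attribute lives on whichever element carries it (tgroup in CALS tables), and a cell is an "entry" element directly under a "row". The function names are made up for this example. Note that ElementTree drops the DOCTYPE declaration and comments when it rewrites the file, and that nested tables and column spans are not handled.

```python
import xml.etree.ElementTree as ET


def remove_anchors(root):
    """Delete every <anchor> element, keeping the text that follows it."""
    removed = 0
    for parent in list(root.iter()):
        previous = None  # last sibling that was kept
        for child in list(parent):
            if child.tag != "anchor":
                previous = child
                continue
            # The text after an inline element is stored in its tail,
            # so move it to the preceding node before removing the anchor.
            tail = child.tail or ""
            if previous is None:
                parent.text = (parent.text or "") + tail
            else:
                previous.tail = (previous.tail or "") + tail
            parent.remove(child)
            removed += 1
    return removed


def fix_cols(root):
    """Set each "cols" attribute to the largest number of cells in a row."""
    fixed = 0
    for group in root.iter():
        if "cols" not in group.attrib:
            continue
        widths = [len(row.findall("entry")) for row in group.iter("row")]
        if not widths:
            continue  # no rows found: leave the attribute alone
        correct = str(max(widths))
        if group.get("cols") != correct:
            group.set("cols", correct)
            fixed += 1
    return fixed


def clean_docbook(source, target):
    """Apply both fixes to the file at source and write the result to target."""
    tree = ET.parse(source)
    root = tree.getroot()
    counts = (remove_anchors(root), fix_cols(root))
    tree.write(target, encoding="utf-8", xml_declaration=True)
    return counts
```

Calling clean_docbook("export.xml", "export-clean.xml") (hypothetical file names) returns the number of anchors removed and the number of "cols" values corrected, which is handy for a quick sanity check before running FOP.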
After that you should have a DocBook 4 document that is, hopefully, valid and processable. Some additional steps might be useful:
  1. Process the document through the db4-upgrade.xsl stylesheet to get a DocBook 5 document.
  2. Process the resulting DocBook 5 document through an XSL stylesheet that automatically splits it into modules (xi:include). We will provide one in a future article.
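For the DocBook 5 upgrade in step 1, any XSLT processor you already have should do. As a sketch only, here is how the call could be scripted in Python, assuming xsltproc is installed and on your PATH and that you know where db4-upgrade.xsl lives on your machine. The function names and file names are hypothetical.

```python
import subprocess


def build_upgrade_command(stylesheet, source, target):
    """Return the xsltproc command line that writes the upgraded file to target."""
    return ["xsltproc", "--output", str(target), str(stylesheet), str(source)]


def upgrade_to_docbook5(stylesheet, source, target):
    """Run the upgrade; raises CalledProcessError if xsltproc reports a failure."""
    subprocess.run(build_upgrade_command(stylesheet, source, target), check=True)


# Example call, with hypothetical paths to adjust to your own setup:
# upgrade_to_docbook5("db4-upgrade.xsl", "manual-db4.xml", "manual-db5.xml")
```

Building the command in a separate function makes it easy to print and check it before anything is run.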

Limitations: Image files are not exported by OOo, though they can be recovered by unzipping the .odt. There should be a way to reference the image files in the XML automatically; we may look into it in a future article.
Feedback: Please comment with your successes, failures and tricks for fixing the output.
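Recovering the images can be scripted, since an .odt file is a plain zip archive. The sketch below assumes the embedded images sit under a Pictures/ folder inside the archive, which is where OOo usually puts them. The function name is made up for this example, and it only copies the files: it does not reference them in the DocBook XML.

```python
import zipfile
from pathlib import Path


def extract_odt_images(odt_path, output_dir):
    """Copy every file stored under Pictures/ in the .odt into output_dir."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(odt_path) as archive:
        for info in archive.infolist():
            # Skip directory entries and everything outside Pictures/.
            if info.is_dir() or not info.filename.startswith("Pictures/"):
                continue
            # Keep only the file name, so images land directly in output_dir.
            target = output_dir / Path(info.filename).name
            target.write_bytes(archive.read(info))
            extracted.append(target)
    return extracted
```

For instance, extract_odt_images("manual.odt", "images") (hypothetical names) returns the list of files written to the images folder.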