This repository was archived by the owner on Feb 18, 2021. It is now read-only.

Workflow

axfelix edited this page Feb 4, 2014 · 6 revisions

Basic Workflow of PKP XML Stack

1. Convert document to .docx (Word 2007 XML) if it is not already in .docx form.

Handled by: https://github.com/pkp/xmlps/blob/master/module/DocxConversion/src/DocxConversion/Model/

Because our primary parsing engine is developed against the .docx format, non-.docx input must be converted to .docx before the parser runs. To do this, we use https://github.com/dagwieers/unoconv -- a CLI tool which drives LibreOffice in the background. A list of LibreOffice-supported input formats is at http://en.wikipedia.org/wiki/LibreOffice#Supported_file_formats; in general, you should get good results with DOC/RTF/ODT input, but quality depends on LibreOffice.
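As a rough sketch (not the service's actual code), a conversion call via unoconv can be built like this; the file names are placeholders, and unoconv's `-f` flag selects the output format:

```python
def build_unoconv_command(input_path, output_format="docx"):
    """Build the unoconv invocation that converts input_path to output_format.

    unoconv's -f flag selects the target format; LibreOffice must be
    installed on the host, since unoconv drives it in the background.
    """
    return ["unoconv", "-f", output_format, input_path]
```

The resulting list can be executed with `subprocess.check_call` on a machine where unoconv and LibreOffice are available.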

Known limitations

LibreOffice converts images embedded in DOC files to WMF (Windows Metafile) for some godawful reason when converting the document to .docx. There is code in place to convert these WMF files to the more useful JPG/PNG, but libwmf has been broken on Ubuntu for a while (https://bugs.launchpad.net/ubuntu/+source/gimp/+bug/1001570), so this conversion may not currently work, meaning images in DOC input may be mishandled. We're waiting for an upstream fix.
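For illustration only (the stack's own WMF-handling code may differ), one common way to do this conversion is ImageMagick's `convert`, which delegates WMF decoding to libwmf and therefore hits the same Ubuntu bug:

```python
from pathlib import Path

def build_wmf_to_png_command(wmf_path):
    """Build an ImageMagick call converting a WMF image to PNG.

    ImageMagick relies on libwmf for WMF decoding, so this fails on
    systems where libwmf is broken (the Ubuntu bug referenced above).
    """
    png_path = str(Path(wmf_path).with_suffix(".png"))
    return ["convert", wmf_path, png_path]
```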

2. Run meTypeset to convert .docx to NLM XML

Handled by: https://github.com/pkp/xmlps/blob/master/module/NlmxmlConversion/src/NlmxmlConversion/Model/

This step contains most of the heavy lifting to go from a Word/compatible format to NLM XML (currently targeting the newest JATS NLM spec). Currently it's functionally a one-line call to our external meTypeset library (https://github.com/MartinPaulEve/meTypeset), which does the work of extracting the images from the .docx and transforming the underlying XML using a combination of unsupervised classifier routines and buckets of XSL. Document metadata can be supplied as optional input; otherwise, meTypeset will attempt to parse out article metadata (authors, titles, etc.) from the front matter.
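A sketch of that one-line call, assuming meTypeset has been checked out locally (the script path and the `--metadata` flag here are assumptions based on meTypeset's CLI; check its README for the authoritative usage):

```python
def build_metypeset_command(docx_path, output_dir, metadata_file=None):
    """Build a meTypeset invocation converting a .docx to NLM/JATS XML.

    The "docx <input> <output_dir>" argument shape follows meTypeset's
    CLI; the optional metadata file corresponds to the optional metadata
    input described above. The checkout path is hypothetical.
    """
    cmd = ["python", "meTypeset/bin/meTypeset.py", "docx", docx_path, output_dir]
    if metadata_file is not None:
        cmd += ["--metadata", metadata_file]
    return cmd
```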

Known limitations

Lots :) meTypeset is under very active development. By the time you're reading this, there should be a stable fork which produces reasonably good output for most elements of a document (headers, lists, body text, inline references, footnotes, etc.). However, small issues remain in many cases: detection of front matter is very incomplete, our classifier code doesn't yet know what to do with elements that belong outside of the main body text (though detection is improving), and if the article bibliography isn't detected correctly, many of the later steps in this workflow will fail.

3. Citation Parsing

Handled by: https://github.com/pkp/xmlps/tree/master/module/ReferencesConversion/src/ReferencesConversion/Model

This step passes the Bibliography identified by meTypeset to ParsCit (http://aye.comp.nus.edu.sg/parsCit/) in order to break down the component parts of a given citation into Author, Title, Year, etc. so that they can be reformatted into any desired citation style. ParsCit was selected over other similar citation parsing libraries because it meets the requirements of being under active development (https://github.com/knmnyn/parscit) and reasonably easy to install locally (it's mostly Perl and C++). ParsCit output is then walked through MODS and BibTeX formats:

http://www.loc.gov/standards/mods/
http://www.bibtex.org/
https://github.com/pkp/xmlps/blob/master/module/ReferencesConversion/assets/parsCit.xsl
https://github.com/pkp/xmlps/tree/master/module/BibtexConversion/src/BibtexConversion/Model
https://github.com/pkp/xmlps/tree/master/module/BibtexreferencesConversion/src/BibtexreferencesConversion/Mode
https://github.com/pkp/xmlps/blob/master/module/BibtexreferencesConversion/assets/biblatex2xml.xsl

... and, finally, transformed back to NLM XML citation markup and pasted back into the original document.
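To make the intermediate representation concrete, here is a minimal sketch of serializing parsed citation fields (ParsCit-style author/title/year components; the field names and entry key are illustrative, not the service's actual schema) into a BibTeX entry:

```python
def to_bibtex(key, fields):
    """Serialize a dict of parsed citation components into a BibTeX
    @article entry. Fields are emitted in sorted order for stability."""
    lines = ["@article{%s," % key]
    for name, value in sorted(fields.items()):
        lines.append("  %s = {%s}," % (name, value))
    lines.append("}")
    return "\n".join(lines)

entry = to_bibtex("smith2010", {"author": "Smith, J.", "title": "On Parsing", "year": "2010"})
```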

The BibTeX file containing all of the references extracted from the article, separated into their component parts, can be requested via our API if a user wishes to use the webservice solely for reference extraction rather than utilizing the entire conversion pipeline.

The NLM XML document is not transformed further from this point in our workflow onward, and is returned as one of the output products of our service.

Known limitations

As mentioned, if meTypeset failed to tag the article bibliography, this step will not work. It goes without saying that ParsCit is also not perfect (n.b. "under active development"), but it performs well on most author-formatted citations, even if they do not conform to a particular style. We have considered supplementing ParsCit with an additional citation parsing service, but have not done so because no alternative is as performant and as easy to install locally.

4. HTML Conversion

Handled by:
https://github.com/pkp/xmlps/tree/master/module/HtmlConversion/src/HtmlConversion/Model
https://github.com/pkp/xmlps/tree/master/module/CitationstyleConversion/src/CitationstyleConversion/Model
with JavaScript/CSS assets in https://github.com/pkp/xmlps/tree/master/module/HtmlConversion/assets

NLM XML is converted directly to our own HTML layout using XSL. The currently included layout is Bootstrap-derived and features some basic jQuery niceties to pop out full-size images by clicking on thumbnails; we'll be looking into developing alternate layouts and stylesheets in the future. Inline citations and the article bibliography are reformatted from the now-marked-up XML to any citation format of the user's choosing using Pandoc (johnmacfarlane.net/pandoc/), assuming they have been parsed correctly earlier in the process. Our list of available citation formats is exhaustive, pegged to the http://citationstyles.org/ repo, and can be supplied as an API parameter or selected via a find-as-you-type dialog on the http://pkp-udev.lib.sfu.ca/ main page.
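A sketch of the Pandoc side of this step (the file names are placeholders; `--bibliography` and `--csl` are Pandoc's standard flags for pointing at a BibTeX file and a CSL style from the citationstyles.org repository):

```python
def build_pandoc_command(input_file, bibliography, csl_style, output_html):
    """Build a pandoc call that renders citations in a chosen CSL style.

    --bibliography points at the extracted BibTeX file and --csl at a
    CSL style file; -o names the HTML output.
    """
    return [
        "pandoc", input_file,
        "--bibliography", bibliography,
        "--csl", csl_style,
        "-o", output_html,
    ]
```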

Known limitations

Layouts could use more work :) Otherwise, nothing is wrong with this part of the process. The XSL is currently limited to the subset of NLM tags that meTypeset outputs and will be expanded gradually.

5. PDF Conversion

Handled by:
https://github.com/pkp/xmlps/tree/master/module/PdfConversion/src/PdfConversion/Model
https://github.com/pkp/xmlps/tree/master/module/XmpConversion/src/XmpConversion/Model

Our end-product HTML is converted to PDF by way of what is effectively a headless WebKit "printer" -- https://github.com/antialize/wkhtmltopdf. This approach includes certain niceties which would not necessarily be available to a desktop user running a print-to-file command, e.g., text is never cut off between pages and the PDF's internal bookmarks are pre-populated with the document's table of contents, based on tagged headers. This step is currently configured to provide a relatively clean, direct conversion from the layout HTML, but it would be possible to insert other desired watermarks, stylesheets, or ToC pages with minimal effort. Article metadata is embedded in the PDF as XMP (http://en.wikipedia.org/wiki/Extensible_Metadata_Platform) using ExifTool (http://www.sno.phy.queensu.ca/~phil/exiftool/) so that it can be scraped by mining tools or displayed by compatible PDF viewers.
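The two-stage step above can be sketched as a pair of commands; file names are placeholders, and the ExifTool tag names follow its XMP Dublin Core group (`XMP-dc`), which is one reasonable choice rather than necessarily the tags the service writes:

```python
def build_pdf_commands(html_path, pdf_path, title, author):
    """Build the PDF step: render HTML to PDF with wkhtmltopdf, then
    embed XMP metadata with ExifTool (Dublin Core Title/Creator)."""
    render = ["wkhtmltopdf", html_path, pdf_path]
    tag = [
        "exiftool",
        "-XMP-dc:Title=%s" % title,
        "-XMP-dc:Creator=%s" % author,
        pdf_path,
    ]
    return render, tag
```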

Known limitations

No issues specific to this step, but worth noting that the article PDF will only be as good as the parsed XML/HTML (and the XMP only as good as the extracted or supplied metadata).

The XMP-enhanced PDF is the "final" output from our service, and is returned along with BibTeX, XML, and HTML in a zipfile when making a standard request.

Adding modules

More documentation will be forthcoming here -- if you wish to add another conversion step to the module directory, changes must also be made to the following four files (they should be reasonably self-explanatory):

https://github.com/pkp/xmlps/blob/master/start_queues.sh
https://github.com/pkp/xmlps/blob/master/config/autoload/global.php
https://github.com/pkp/xmlps/blob/master/module/Manager/src/Manager/Entity/Job.php
https://github.com/pkp/xmlps/blob/master/module/Manager/src/Manager/Model/Queue/Manager.php