Sep
05

The Power of XProc: Converting CSV Files to XML

 I've been writing about XProc, the XML Processing Language, for some time, but until recently have had very few chances to actually do anything with it. Overall, I like processing pipelines of any sort - they work surprisingly well for the kind of workflows that XML tends to lend itself to, which is one of the reasons that XProc has excited me as much as it has - it's a language that's intended to generalize XML processing pipelines, and to use them for fairly sophisticated processing.

One of the things that I've begun to explore as well is the ability to use XProc for working with non-XML data. For instance, I've recently been looking at the DROID application, produced by the National Archives (http://droid.sourceforge.net/). DROID is an acronym for Digital Record Object IDentification, and is used to read files and determine their actual type, rather than relying upon file extensions (which are notoriously inconsistent, tend to have name collisions, and often mask considerable variation in the underlying formats). It has both a command-line and a graphical user interface, but for my purposes the command line was the more important of the two.

DROID has a somewhat odd command-line interface - you create a DROID "report" that holds the names of all of the files that you want to process, then you call the component again, passing the report along with an optional CSV file (no XML, though that's in the works, I understand) that holds the results of the operation:

> droid -f report.droid -e report.csv

XProc has the ability to invoke a command-line program through the p:exec step. This step takes the command (as a fully qualified filename) and its arguments as options, and then wraps the generated output in a wrapper element. This would work great, except that DROID doesn't write its results to standard output - it only writes them to the CSV file. The contents of that file look like most "standard" CSV output:

"URI","FILE_PATH","NAME","METHOD","STATUS","SIZE","TYPE","EXT",...

"file:/C:/Users/Kurt/Pictures/","C:\Users\Kurt\Pictures","Pictures",...
"file:/C:/Users/Kurt/Pictures/4873b04928006.jpg",...
"file:/C:/Users/Kurt/Pictures/4a83c510baae4bfd09d3e24a9e1d8c88.jpg",...
"file:/C:/Users/Kurt/Pictures/6a00d8341d417153ef0120a75bcfa1970b-800wi.jpg",...
...

Listing 1. CSV Output From DROID.

The challenge then was two-fold - generating a workflow that would generate this in the first place, then reading the CSV file and converting it into XML. I chose XProc for this both because it may have bearing on another project I'm working with, and because this seemed to be a fairly natural fit for the language.

As it turned out, while it was, the process of understanding how XProc itself works proved to be more than a little challenging. XProc has the promise to be another XSLT - a language that is incredibly powerful and capable of doing any number of things, but that, in order to use it, requires that you undergo the equivalent of one of those ancient Tibetan mysteries involving hot stones, grasshoppers and old men capable of destroying brick walls with a touch. The logic makes sense, but only once you get around the significant conceptual curves involved.

While the execution of external commands turned out to be relatively simple, I am going to wait until the follow-up article to this one to explore how to invoke external processes.  For this article, I'm going to concentrate on the conversion - loading in a CSV file and converting it into a more generalized XML document.

One of the interesting things about XProc is that it is, almost by definition, a language of modules. I decided to use Calabash, the XProc implementation written in Java by Norman Walsh and usable from within Oxygen 11.1 and above (among other tools), as my primary implementation, though with an eye towards minimizing external dependencies as much as possible.

The core module - text:csv-to-xml - was defined as a p:declare-step script:

<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
    xmlns:c="http://www.w3.org/ns/xproc-step"
    xmlns:text="http://xmltoday.org/xmlns/text"
    type="text:csv-to-xml"
    version="1.0">
    <p:output port="result"/>
    <p:import href="text-utils.xpl"/>
    <text:load-wrapped-text-file wrapper="c:data">
        <p:with-option name="href" select="'report.csv'"/>
    </text:load-wrapped-text-file>
    <text:map-csv-to-xml name="csv-result">
        <p:with-option name="record-name" select="'Record'"/>
        <p:with-option name="record-set-name" select="'RecordSet'"/>
    </text:map-csv-to-xml>
    <p:identity/>
    <p:store href="report.xml"/>
</p:declare-step>

Listing 2. The csv-to-xml script.

The import statement imported two previously defined steps - text:load-wrapped-text-file and text:map-csv-to-xml - that handled the loading and conversion of the file respectively. The output statement simply indicated that the result would be the standard result output for a pipeline, while the identity statement ensured that the generated XML could be consumed by the store statement, which saved the resulting file to the drive.

The text:load-wrapped-text-file statement was built to address a significant problem that XProc has - while it is possible to load an XML file dynamically (i.e., by passing the name of the file to be loaded in an XML attribute or element), non-XML files can only be loaded from fixed filenames. While some of this can be laid to the fact that XProc is intended primarily as an XML pipeline language, for my purposes it proved to be a major limitation, since it essentially meant that references had to be hard-coded for non-XML content.

To get around this, I took advantage of the fact that most XProc implementations support both XSLT 2.0 and XQuery. XSLT 2.0 supports a very useful XPath 2.0 function: unparsed-text(). This function, when provided a URL and an optional encoding (which here I hard-coded to UTF-8), loads the content at that URL as unparsed text and returns it as a string. Thus, in order to implement a dynamic XProc statement, I actually created an XSLT stylesheet within the XProc, as shown in Listing 3 (which also includes the library header for text-utils.xpl):

<p:library  xmlns:p="http://www.w3.org/ns/xproc"
    xmlns:c="http://www.w3.org/ns/xproc-step"
    xmlns:text="http://xmltoday.org/xmlns/text"
    version="1.0"
    >

<p:declare-step type="text:load-wrapped-text-file">
    <p:output port="result"/>
    <p:option name="href"/>
    <p:option name="wrapper"/>
    <p:xslt>
        <p:with-param name="href" select="$href"/>
        <p:with-param name="wrapper" select="$wrapper"/>
        <p:input port="source">
            <p:inline><foo/></p:inline>
        </p:input>
        <p:input port="stylesheet">
            <p:inline>
        <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
            xmlns:c="http://www.w3.org/ns/xproc-step"
            version="2.0">
            <xsl:output method="text"/>
            <xsl:param name="href"/>
            <xsl:param name="wrapper"/>
            <xsl:template match="/">
                <xsl:element name="{$wrapper}"
xml:space="preserve"><xsl:value-of select="unparsed-text($href,'utf-8')"/>
</xsl:element>
            </xsl:template>
        </xsl:stylesheet>
            </p:inline>
        </p:input>
        <p:input port="parameters">
            <p:empty/>
        </p:input>
    </p:xslt>
    <p:identity/>
</p:declare-step>

Listing 3. The text:load-wrapped-text-file XProc Script.

In this particular case, the first input (the source) to the p:xslt step contains a single stub XML element, since the real input will come from the text file - the stub is only important in that it gives the transformation something to transform against. The "parameters" port is likewise set to empty, because the parameters are being set directly from the options provided to the step.

This leaves the XSLT transformation itself, within the "stylesheet" port, with the step's options being passed in as stylesheet parameters. The only template matches the document root; in it, the content is loaded via the fn:unparsed-text() function and then wrapped in a user-defined wrapper element. Because options are used, the calling XProc step can either pass literal text values directly in option attributes (as with the @wrapper attribute) or supply them via an XPath expression (as with the p:with-option element for href), where the value can be generated dynamically. Note that this requires that an XSLT 2.0 engine be available.
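
As a rough sketch of what that dynamic usage might look like, the pipeline below accepts the CSV location as an option and hands it to text:load-wrapped-text-file through p:with-option. The option name csv-href is hypothetical (it does not appear in the listings here), and the sketch assumes text-utils.xpl sits alongside the pipeline:

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
    xmlns:c="http://www.w3.org/ns/xproc-step"
    xmlns:text="http://xmltoday.org/xmlns/text"
    version="1.0">
    <p:output port="result"/>
    <!-- Hypothetical option: the caller supplies the CSV location at run time -->
    <p:option name="csv-href" required="true"/>
    <p:import href="text-utils.xpl"/>
    <!-- wrapper is a literal, so the attribute shortcut is enough;
         href is computed, so it goes through p:with-option -->
    <text:load-wrapped-text-file wrapper="c:data">
        <p:with-option name="href" select="$csv-href"/>
    </text:load-wrapped-text-file>
    <!-- The wrapped text flows into the converter via the default connection -->
    <text:map-csv-to-xml record-name="Record" record-set-name="RecordSet"/>
</p:declare-step>
```

An absolute URI is probably the safer value to pass, since unparsed-text() resolves relative references against the static base URI rather than against the current working directory.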

If the first defined step showcases XSLT within XProc, the second serves to illustrate XQuery (Listing 4).

<p:declare-step
    type="text:map-csv-to-xml">
    <p:output port="result"/>
    <p:option name="href" select="'file:///c:/droid/report.csv'"/>
    <p:option name="delimiter" select="concat('&quot;?',',','&quot;?')"/>
    <p:option name="record-name" select="'record'"/>
    <p:option name="record-set-name" select="'records'"/>
    <p:input port="source"/>
    <p:xquery>
        <p:with-param name="delimiter" select="$delimiter"/>
        <p:with-param name="record-name" select="$record-name"/>
        <p:with-param name="record-set-name" select="$record-set-name"/>
        <p:input port="query">
            <p:inline><c:query><![CDATA[
declare namespace c="http://www.w3.org/ns/xproc-step";
declare variable $delimiter external;
declare variable $record-name external;
declare variable $record-set-name external;
declare function local:tokenize-line($line as xs:string) as xs:string* {
    let $trimmed-line := if (starts-with($line,'"')) then substring($line,2) else $line
    let $trimmed-line := if (ends-with($line,'"')) then substring($trimmed-line,1,string-length($trimmed-line) - 1) else $trimmed-line
    let $terms := fn:tokenize($trimmed-line,$delimiter)
    return $terms
    };

let $lines := for $line in tokenize(string(.),'
') return <c:line>{$line}</c:line>
let $keys := local:tokenize-line($lines[1])
let $keys := for $key in $keys return fn:lower-case(fn:translate($key," ","-"))
let $data-lines := subsequence($lines,2)
let $records := element {$record-set-name} {for $data-line in $data-lines return
    let $values := local:tokenize-line($data-line)
    let $record := if (count($values)>0) then element {$record-name} {
        let $fields := for $index in (1 to fn:count($values)) return element {$keys[$index]} {$values[$index]}
        return $fields}
    else ()
    return $record
    }
let $result := $records
return $result
]]></c:query></p:inline>
        </p:input>
    </p:xquery>
</p:declare-step>
</p:library>

Listing 4. The text:map-csv-to-xml XProc script.

This step similarly uses options in order to define three variables - delimiter, for separating fields, the record name, for labeling each line as a separate entry, and the record-set-name, for wrapping the records into a transmittable structure.

CSV files typically have the first line holding the respective field names, while subsequent lines hold the corresponding records. Each field is separated by a delimiter, but despite the name not all CSV files are delimited by commas. Making things more difficult is the fact that field entries are usually wrapped in quotation marks, so that fields with enclosed spaces can be stored - however, empty fields may not necessarily be so wrapped. The code is set up to handle any regular expression as a field delimiter, and in this case is set up to use the regular expression /"?,"?/, which matches any comma together with the quotation marks that may optionally surround it. It's not perfect - a quoted field that itself contains a comma, such as "Smith, John", would be split in two, and a more comprehensive regex might catch more such edge cases - but it seems to work for most CSV files.

The field names become keys, and from there XML elements, with each record value being inserted into the new element. The $record-set-name and $record-name variables are then used to create the record set and records respectively. The advantage to this approach is that it can work with any kind of CSV file without the routine needing to be aware of what specific field names are involved.
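
Because the delimiter is just another option, the same step should be able to handle other separators. The following is a minimal sketch, assuming a tab-separated file named report.tsv (a hypothetical file, as are the Row and Table element names) whose header fields are all usable as XML element names:

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
    xmlns:c="http://www.w3.org/ns/xproc-step"
    xmlns:text="http://xmltoday.org/xmlns/text"
    version="1.0">
    <p:output port="result"/>
    <p:import href="text-utils.xpl"/>
    <!-- report.tsv is a made-up example file -->
    <text:load-wrapped-text-file wrapper="c:data" href="report.tsv"/>
    <text:map-csv-to-xml record-name="Row" record-set-name="Table">
        <!-- Regex: a tab, optionally surrounded by quotation marks.
             &quot; keeps the attribute well-formed; \t is the regex escape for a tab. -->
        <p:with-option name="delimiter" select="'&quot;?\t&quot;?'"/>
    </text:map-csv-to-xml>
</p:declare-step>
```

The value is handed to fn:tokenize() unchanged, so any pattern that function accepts could be substituted here.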

When done, the resulting content, shown in Listing 5, is stored in the file report.xml, though it could just as readily be used as input for another pipeline.

<RecordSet>
   <Record>
      <uri>file:/C:/Users/Kurt/Pictures/</uri>
      <file_path>C:\Users\Kurt\Pictures</file_path>
      <name>Pictures</name>
      <method>OPERATING_SYSTEM</method>
      <status>Identified</status>
      <size/>
      <type>FOLDER</type>
      <ext/>
      <last_modified>2010-09-03T14:33:29</last_modified>
      <puid/>
      <mime_type/>
      <format_name/>
      <format_version/>
   </Record>
   <Record>
      <uri>file:/C:/Users/Kurt/Pictures/4873b04928006.jpg</uri>
      <file_path>C:\Users\Kurt\Pictures\4873b04928006.jpg</file_path>
      <name>4873b04928006.jpg</name>
      <method>BINARY_SIGNATURE</method>
      <status>Identified</status>
      <size>299498</size>
      <type>FILE</type>
      <ext>jpg</ext>
      <last_modified>2010-09-03T08:15:58</last_modified>
      <puid>fmt/42</puid>
      <mime_type>image/jpeg</mime_type>
      <format_name>JPEG File Interchange Format</format_name>
      <format_version>1.00</format_version>
   </Record>
   <Record>
      <uri>file:/C:/Users/Kurt/Pictures/4a83c510baae4bfd09d3e24a9e1d8c88.jpg</uri>
      <file_path>C:\Users\Kurt\Pictures\4a83c510baae4bfd09d3e24a9e1d8c88.jpg</file_path>
      <name>4a83c510baae4bfd09d3e24a9e1d8c88.jpg</name>
      <method>BINARY_SIGNATURE</method>
      <status>Identified</status>
      <size>856163</size>
      <type>FILE</type>
      <ext>jpg</ext>
      <last_modified>2010-07-25T11:25:02</last_modified>
      <puid>fmt/42</puid>
      <mime_type>image/jpeg</mime_type>
      <format_name>JPEG File Interchange Format</format_name>
      <format_version>1.00</format_version>
   </Record>
   <!-- More records -->
</RecordSet>

Listing 5. The output generated by the converted CSV file.
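
To illustrate how the result might feed a further pipeline, here is a small sketch that reloads report.xml and drops the folder entries, leaving only the file records. It assumes the element names shown in Listing 5 and uses only the standard p:load and p:delete steps:

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="1.0">
    <p:output port="result"/>
    <!-- Read the document written by the p:store step in Listing 2 -->
    <p:load href="report.xml"/>
    <!-- Remove every Record whose type field is FOLDER -->
    <p:delete match="/RecordSet/Record[type = 'FOLDER']"/>
</p:declare-step>
```

In a single pipeline the p:load would be unnecessary, since the output of text:map-csv-to-xml could be connected to p:delete directly.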

XProc, at least the Calabash version of it built upon the new Saxon 9.x engine, is remarkably fast - it took under a second to process 800 lines and convert them into XML records. While the learning curve for understanding how XProc works can be fairly hairy, the advantage to this is modularization; by creating steps to handle both loading and wrapping of external text files dynamically and creating a CSV to XML converter step, I can reuse these components with very little overhead. I hope to write more on effective XProc design later in the year, when I've had a chance to explore the language more.

Look for the second article in this series within the next couple of weeks, focusing on how you can use XProc to control external processes, such as generating thumbnails or doing other processing of images, converting documents, generating PDFs and so forth.
