Bridging the Gaps
Between XML and TeX

@xporc@mstdn.social / @mkraetke

What is TeX?

  • macro-based programming language by Donald E. Knuth
  • LaTeX: macro package for TeX by Leslie Lamport
  • typesetting system, popular in academia

Anatomy of a LaTeX Document

\documentclass{article}        % preamble: document class
\usepackage[british]{babel}    % preamble: package declaration
\title{my Markup UK paper}     % global parameters
\author{Martin Kraetke}
\date{\today}
\begin{document}               % document body start
\maketitle
\section{Introduction}           
This is a paragraph with \textit{italicized text}.
\end{document}                 % document body end

The Gaps between XML and TeX

  • curly braces instead of angle brackets
  • special characters: & % $ # _ { } ~ ^ \
  • preamble and document body
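
Because these characters carry meaning in TeX, any XML text node has to be escaped before it is written to a .tex file. The following is a minimal Python sketch of that step; the helper name escape_tex and the choice of replacement macros are assumptions for illustration, not part of any tool mentioned in this talk.

```python
# Replacement table for the ten special characters listed above.
# The right-hand sides are standard LaTeX escapes.
TEX_ESCAPES = {
    "\\": r"\textbackslash{}",
    "&": r"\&",
    "%": r"\%",
    "$": r"\$",
    "#": r"\#",
    "_": r"\_",
    "{": r"\{",
    "}": r"\}",
    "~": r"\textasciitilde{}",
    "^": r"\textasciicircum{}",
}


def escape_tex(text: str) -> str:
    """Escape TeX special characters in plain text (hypothetical helper)."""
    # Work character by character so that the braces introduced by an
    # escape such as \textbackslash{} are not escaped a second time.
    return "".join(TEX_ESCAPES.get(ch, ch) for ch in text)


if __name__ == "__main__":
    print(escape_tex("R&D: 50% of $10 in {a_b}"))
```

A chain of str.replace calls would escape its own output (the braces of \textbackslash{}, for example), which is why the sketch maps each input character exactly once.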

Textmode and Mathmode

\documentclass{article}
\begin{document}

text mode: \textsuperscript{superscript}

math mode: $^{superscript}$

\end{document}

Tables

\documentclass{article}
\begin{document}

\begin{table}
\begin{tabular}{|l|c|r|} \hline
a1 & a2 & a3 \\ \hline
b1 & b2 & b3 \\ \hline
c1 & c2 & c3 \\ \hline
\end{tabular}
\end{table}
\end{document}

Tables: Multirow and Multicolumn

\documentclass{article}
\usepackage{multirow}
\begin{document}

\begin{table}
\begin{tabular}{|l|c|r|} \hline
a1 & \multicolumn{2}{c|}{a2} \\ \hline
b1 & b2 & \multirow{2}{*}{b3} \\ \cline{1-2}
c1 & c2 & \\ \hline
\end{tabular}
\end{table}

\end{document}

TeX engines: Unicode support

  • pdfTeX (pdflatex) ☐ (UTF-8 input only via inputenc, no Unicode fonts)
  • XeTeX (xelatex) ☑
  • LuaTeX (lualatex) ☑

Packages with overlapping functionality

  • default: \underline{…}
  • soul: \ul{…}
  • ulem: \uline{…}

Custom macros

\newcommand{\name}[num][default]{definition}
\newenvironment{name}[num][default]{before}{after}

Methods to convert XML to TeX

1. xmltex

  • non-validating XML parser implemented in TeX by David Carlisle
  • can associate TeX code with XML elements, attributes, processing instructions, and entities

1. xmltex - TeX file

\def\xmlfile{doc.xml} % xml file
\input xmltex.tex % loads xmltex

1. xmltex - catalogue

\NAMESPACE{http://www.tei-c.org/ns/1.0}{tei.xmt}

1. xmltex - mapping

\XMLelement{TEI}
{}                        % attributes
{\documentclass{article}  % start of element
 \begin{document}}
{\end{document}}          % end of element

1. xmltex

  • lightweight and somewhat declarative
  • nested structures require programming
  • no XML query language (just names)
  • error reporting virtually non-existent

2. PassiveTeX

  • by Michel Goossens, Sebastian Rahtz
  • xmltex configuration for XSL Formatting Objects (XSL-FO)
  • TeX acts as the FO formatter

2. PassiveTeX

\XMLelement{fo:root}
{}
{\documentclass{article}
\usepackage{fotex}
\begin{document}
\pagestyle{empty}
\FOSetHyphenation
%\ignorewhitespace
}
{\end{document}}

2. PassiveTeX

  • FO and xmltex not easy to debug
  • experimental approach, rarely adopted
  • very hard to configure

3. Pandoc

  • "universal" markup converter
  • supports XML (DocBook, JATS) and LaTeX among other formats

3. Pandoc – DocBook example

<?xml version="1.0" encoding="UTF-8"?>
<article xmlns="http://docbook.org/ns/docbook" version="5.0">
  <title>my Title</title>
  <sect1>
    <title>Random Section Title</title>
    <para>This is a para and this is 
      <emphasis role="bold">bold</emphasis>
    </para>
  </sect1>
</article>

3. Pandoc – TeX output

\section{Random Section Title}

This is a para and this is \textbf{bold}

3. Pandoc

  • XML input restricted to DocBook and JATS
  • no declarative configuration; customization requires filters written in Lua
  • MathML not well supported

4. XSLT

  • flexible with XPath, regular expressions, grouping etc.
  • <xsl:output method="text"/> for TeX output
  • associate templates with TeX instructions

4. XSLT

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" 
  xpath-default-namespace="http://www.tei-c.org/ns/1.0"
  version="3.0">
  
  <xsl:output method="text"/>
  
  <xsl:template match="TEI">
    <xsl:text>\documentclass{article}&#xa;</xsl:text>
    <xsl:text>\begin{document}&#xa;</xsl:text>
    <xsl:apply-templates select="* except teiHeader"/>
    <xsl:text>\end{document}</xsl:text>
  </xsl:template>
  
  <xsl:template match="p">
    <xsl:apply-templates/>
    <xsl:text>&#xa;&#xa;</xsl:text>
  </xsl:template>
  
</xsl:stylesheet>
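
To make the effect of these two templates concrete, here is a rough Python analogue built on xml.etree.ElementTree. It is not XSLT and not part of any tool from this talk; the function names render and render_children and the sample document are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"


def render_children(elem: ET.Element) -> str:
    """Mimic the built-in template rules: copy text, recurse into children."""
    parts = [elem.text or ""]
    for child in elem:
        parts.append(render(child))
        parts.append(child.tail or "")
    return "".join(parts)


def render(elem: ET.Element) -> str:
    """Dispatch on the element name, like the match patterns in the stylesheet."""
    if elem.tag == TEI_NS + "TEI":
        # select="* except teiHeader": element children only, header skipped
        body = "".join(
            render(child) for child in elem if child.tag != TEI_NS + "teiHeader"
        )
        return "\\documentclass{article}\n\\begin{document}\n" + body + "\\end{document}"
    if elem.tag == TEI_NS + "p":
        # a paragraph is its content followed by an empty line
        return render_children(elem) + "\n\n"
    return render_children(elem)


SAMPLE = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0">'
    "<teiHeader><fileDesc>ignored</fileDesc></teiHeader>"
    "<text><body><p>First paragraph.</p><p>Second paragraph.</p></body></text>"
    "</TEI>"
)

if __name__ == "__main__":
    print(render(ET.fromstring(SAMPLE)))
```

Like the stylesheet, the sketch does not escape TeX special characters in the text content; a real conversion would need that as an additional step.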

4. XSLT

  • very powerful, but not very declarative
  • reusable code with imports
  • problematic when the source XML vocabulary changes
  • implementing tables, math, various packages is complex

An alternative approach: xml2tex

xml2tex

  • module of the le-tex transpect framework
  • based on XProc/XSLT
  • declarative XML configuration
  • modules for MathML, tables

xml2tex - MathML conversion

  1. mml-normalize: MathML normalization
  2. mml2tex: MathML→TeX
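
The two-step idea (normalize first, then convert) can be sketched in a few lines of Python. This is only an illustration covering a handful of MathML elements; it does not reproduce the actual rules of mml-normalize or mml2tex, and the function names normalize and to_tex are assumptions.

```python
import xml.etree.ElementTree as ET

MML = "{http://www.w3.org/1998/Math/MathML}"


def normalize(elem: ET.Element) -> ET.Element:
    """Step 1: trim whitespace in token text and drop stray tails (in place)."""
    for node in elem.iter():
        node.text = (node.text or "").strip()
        node.tail = None
    return elem


def to_tex(elem: ET.Element) -> str:
    """Step 2: map a small subset of MathML elements to TeX."""
    name = elem.tag.replace(MML, "")
    kids = [to_tex(child) for child in elem]
    if name in ("mi", "mn", "mo"):
        return elem.text or ""
    if name == "mfrac":
        return "\\frac{%s}{%s}" % (kids[0], kids[1])
    if name == "msup":
        return "{%s}^{%s}" % (kids[0], kids[1])
    if name == "msqrt":
        return "\\sqrt{%s}" % "".join(kids)
    # math, mrow and anything not handled above: concatenate the children
    return "".join(kids)


SAMPLE = (
    '<math xmlns="http://www.w3.org/1998/Math/MathML"><mfrac>'
    "<mrow><msup><mi>x</mi><mn>2</mn></msup><mo>+</mo><mn>1</mn></mrow>"
    "<mi> y </mi>"
    "</mfrac></math>"
)

if __name__ == "__main__":
    print("$" + to_tex(normalize(ET.fromstring(SAMPLE))) + "$")
```

Separating the steps keeps the conversion rules simple: they can assume clean input instead of handling every variation a MathML export may contain.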

mml2tex - MathML example

xml2tex - table conversion

  • input: CALS and HTML tables (HTML→CALS)
  • output: tabularx or htmltabs

Table normalization by Andrew J. Welch

+-----------+-----------+         +-----+-----+-----+-----+
| a         | b         |         | a   | a   | b   | b   |
|           +-----+-----+         +-----+-----+-----+-----+
|           | c   | d   |         | a   | a   | c   | d   |
+-----------+-----+     |   ==>   +-----+-----+-----+-----+
| e               |     |         | e   | e   | e   | d   |
+-----+-----+-----+     |         +-----+-----+-----+-----+
| f   | g   | h   |     |         | f   | g   | h   | d   |
+-----+-----+-----+-----+         +-----+-----+-----+-----+
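
The effect of the normalization shown in the diagram can be reproduced with a short Python sketch. It is not Andrew J. Welch's XSLT implementation, only an illustration of the result; the cell model (content, colspan, rowspan) and the function name normalize_table are assumptions.

```python
from typing import Dict, List, Tuple

Cell = Tuple[str, int, int]  # (content, colspan, rowspan)


def normalize_table(rows: List[List[Cell]]) -> List[List[str]]:
    """Expand spanned cells so that every grid position holds one cell."""
    grid: List[Dict[int, str]] = [dict() for _ in rows]  # grid[row][col] = content
    for r, row in enumerate(rows):
        col = 0
        for content, colspan, rowspan in row:
            # Skip positions already occupied by a row span from above.
            while col in grid[r]:
                col += 1
            for dr in range(rowspan):
                if r + dr >= len(grid):
                    break  # ignore spans that run past the last row
                for dc in range(colspan):
                    grid[r + dr][col + dc] = content
            col += colspan
    width = max((max(g) + 1 for g in grid if g), default=0)
    return [[g.get(c, "") for c in range(width)] for g in grid]


# The left-hand table from the diagram above.
TABLE = [
    [("a", 2, 2), ("b", 2, 1)],
    [("c", 1, 1), ("d", 1, 3)],
    [("e", 3, 1)],
    [("f", 1, 1), ("g", 1, 1), ("h", 1, 1)],
]

if __name__ == "__main__":
    for line in normalize_table(TABLE):
        print(" | ".join(line))
```

Once every row has the same number of cells, deciding where \multicolumn, \multirow and partial rules belong becomes a lookup in the grid rather than a calculation over spans.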

mml2tex - tables example

xml2tex - configuration

  • <ns/> declare namespaces
  • <import/> import other xml2tex-configs

xml2tex - configuration (document structure)

  • <preamble/> doc class, packages, parameters etc.
  • <front/> frontmatter
  • <back/> backmatter

xml2tex - configuration (text body)

  • <template/> associates XML nodes with TeX
  • <regex/> associates regular expressions with TeX
  • <charmap/> character map

xml2tex - configuration

<template context="dbk:section">
  <rule break-after="2" name="section" type="cmd">
    <param/>
  </rule>
</template>

  • @name → TeX name
  • @type = "cmd"|"env" → command or environment
  • <param> → {parameter}
  • <option> → [options]
  • <text> → regular text
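
How such a rule could turn into TeX output is sketched below in Python. The Rule class and the render function are hypothetical and only mirror the attributes listed above; in particular, passing the parameter text and the environment body in directly is a simplification, and the real xml2tex processing may differ.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Rule:
    """Hypothetical in-memory form of a <rule> element."""
    name: str                                          # @name
    type: str = "cmd"                                  # @type: "cmd" or "env"
    params: List[str] = field(default_factory=list)    # <param> values
    options: List[str] = field(default_factory=list)   # <option> values
    break_after: int = 0                               # @break-after


def render(rule: Rule, body: str = "") -> str:
    """Serialize a rule as a TeX command or environment."""
    opts = "".join("[%s]" % o for o in rule.options)
    args = "".join("{%s}" % p for p in rule.params)
    if rule.type == "cmd":
        tex = "\\%s%s%s" % (rule.name, opts, args)
    elif rule.type == "env":
        # Assumption: the environment wraps a body supplied by the caller.
        tex = "\\begin{%s}%s%s\n%s\n\\end{%s}" % (rule.name, opts, args, body, rule.name)
    else:
        raise ValueError("type must be 'cmd' or 'env'")
    return tex + "\n" * rule.break_after


if __name__ == "__main__":
    section = Rule(name="section", params=["Introduction"], break_after=2)
    print(repr(render(section)))
```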

xml2tex - Regex

<regex regex="(\d{1,2})\.(\d{1,2})\.(\d{4})">
  <rule name="mydate" type="cmd">
    <param select="regex-group(1)"/>
    <param select="regex-group(2)"/>
    <param select="regex-group(3)"/>
  </rule>
</regex>
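
The same mapping can be tried out with Python's re module. This is a sketch of the effect only, not of how xml2tex evaluates the rule; apply_date_rule is a hypothetical name, and \mydate is the custom macro from the configuration above, which still has to be defined in the TeX preamble.

```python
import re

# Same pattern as in the <regex> element: day.month.year
DATE = re.compile(r"(\d{1,2})\.(\d{1,2})\.(\d{4})")


def apply_date_rule(text: str) -> str:
    """Wrap each date in \\mydate{day}{month}{year} (hypothetical helper)."""
    # A function is used as replacement so the backslash needs no
    # special treatment in a replacement template.
    return DATE.sub(lambda m: "\\mydate{%s}{%s}{%s}" % m.groups(), text)


if __name__ == "__main__":
    print(apply_date_rule("Submitted on 3.6.2023."))
```

Text that does not match the full pattern, such as a version number like 1.2, is left untouched.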

xml2tex - Character maps

<charmap ignore-imported-charmaps="false">
  <!-- Uppercase Gamma -->
  <char character="Γ" string="${\Upgamma}$"/>
  <char character="Γ" string="$\boldsymbol{\Upgamma}$" 
        context="*[@css:font-weight eq 'bold']"/> 
  <char character="Γ" string="${\Gamma}$"
        context="*[@css:font-style eq 'italic']"/>
</charmap>
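
The context-dependent lookup can be illustrated with a small Python sketch. It reduces the XPath context to a plain style keyword, and it assumes that an entry with a matching context takes precedence over the entry without one; both points are simplifications for illustration, and the names CHARMAP, map_char and map_text are hypothetical.

```python
from typing import List, Optional, Tuple

# (character, replacement string, required style or None for the default)
CHARMAP: List[Tuple[str, str, Optional[str]]] = [
    ("Γ", r"${\Upgamma}$", None),
    ("Γ", r"$\boldsymbol{\Upgamma}$", "bold"),
    ("Γ", r"${\Gamma}$", "italic"),
]


def map_char(ch: str, style: Optional[str] = None) -> str:
    """Return the TeX string for ch, preferring an entry whose context matches."""
    fallback = None
    for character, string, context in CHARMAP:
        if character != ch:
            continue
        if context is not None and context == style:
            return string
        if context is None and fallback is None:
            fallback = string
    # Characters without any entry pass through unchanged.
    return fallback if fallback is not None else ch


def map_text(text: str, style: Optional[str] = None) -> str:
    """Apply the character map to every character of a text node."""
    return "".join(map_char(ch, style) for ch in text)


if __name__ == "__main__":
    print(map_text("Γ"))
    print(map_text("Γ", "bold"))
    print(map_text("Γ", "italic"))
```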