A Quality Assurance Tool for JATS/BITS with Schematron and HTML reporting

Franziska Bühring (@fbuehring), Martin Kraetke (@mkraetke)

De Gruyter

  • publisher of scientific content
  • 1,200 book titles, 750 journals, databases
  • Berlin, Munich, Boston, Beijing

Franziska Bühring

  • Senior Manager eProducts and Standards
  • working at De Gruyter since 2010

le-tex

  • publishing service vendor
  • typesetting (65,000 journal pages p.a.,
    250,000 book pages p.a.), data conversion (600 E-Books), software development
  • Leipzig

Martin Kraetke

  • Lead Content Engineer
  • working at le-tex since 2011

Motivation

  • time consuming, inefficient error handling process
  • it was not guaranteed that data was:
    • consistent
    • valid with regard to De Gruyter specifications

XML Guidelines and Requirements

XML at De Gruyter

  • JATS/BITS to archive/publish content for journals and books
  • XML delivery (ZIP) also includes
    • corresponding PDF files
    • figure files
    • Electronic Supplementary Material

Requirements

  • parse/validate XML
  • match the internal list of files to referenced files
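
A minimal sketch of the file-matching requirement, assuming that the XML points to local files through xlink:href attributes (as JATS/BITS graphic elements do) and that the delivery is a single ZIP file. The names referenced_files, compare_delivery and build_demo_zip are hypothetical and not part of the tool described here.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def referenced_files(xml_bytes):
    """Collect every xlink:href value that looks like a local file name."""
    root = ET.fromstring(xml_bytes)
    hrefs = set()
    for element in root.iter():
        href = element.get(XLINK_HREF)
        if href and "://" not in href:  # skip absolute URLs
            hrefs.add(href)
    return hrefs


def compare_delivery(zip_file, xml_name):
    """Return (missing, unreferenced) file names for one delivery package."""
    with zipfile.ZipFile(zip_file) as zf:
        listed = {name for name in zf.namelist() if not name.endswith("/")}
        refs = referenced_files(zf.read(xml_name))
    missing = refs - listed                   # referenced in XML, absent from ZIP
    unreferenced = listed - refs - {xml_name}  # in ZIP, never referenced
    return missing, unreferenced


def build_demo_zip():
    """Build a tiny in-memory delivery (assumed layout, for illustration only)."""
    xml = (
        '<article xmlns:xlink="http://www.w3.org/1999/xlink">'
        '<graphic xlink:href="fig1.tif"/>'
        '<graphic xlink:href="fig2.tif"/>'
        "</article>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("article.xml", xml)
        zf.writestr("fig1.tif", b"")
        zf.writestr("article.pdf", b"")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    missing, unreferenced = compare_delivery(build_demo_zip(), "article.xml")
    print("missing:", sorted(missing))
    print("unreferenced:", sorted(unreferenced))
```

Whether an unreferenced file counts as an error depends on the delivery rules; in this demo the PDF is simply reported.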

JATS/BITS at De Gruyter

De Gruyter specific rules regarding:

  • values of attributes
  • required elements/attributes
  • delivery structure and naming conventions

Requirements

  • Schematron to implement business rules
  • avoid redundancies:
    use the concept of cascading rules (adopted from le-tex)

Variables and IDs

{article-id} = {doi-code}-{year}-{article-counter-ID}
{doi-code} = [predefined value for each journal]
{year} = ^\d{4}$
{article-counter-ID} = ^\d{4}$

cclm-2016-0123
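
The composition rule above can be sketched as a single check. This is an illustration only: the set DOI_CODES and the pattern used for the DOI code (lowercase letters) are assumptions, since the slides name just one code, cclm.

```python
import re

# Hypothetical excerpt of the predefined per-journal DOI codes.
DOI_CODES = {"cclm"}

# {doi-code}-{year}-{article-counter-ID}; the [a-z]+ part is an assumption.
ARTICLE_ID = re.compile(r"(?P<doi_code>[a-z]+)-(?P<year>\d{4})-(?P<counter>\d{4})")


def parse_article_id(article_id):
    """Return the components of a valid article ID, or None if it is invalid."""
    match = ARTICLE_ID.fullmatch(article_id)
    if match is None or match.group("doi_code") not in DOI_CODES:
        return None
    return match.groupdict()


if __name__ == "__main__":
    print(parse_article_id("cclm-2016-0123"))
    print(parse_article_id("cclm-16-0123"))
```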

Requirements

  • processing parameters:
    • file listing all variables/IDs, their composition and dependencies
    • list of allowed DOI prefixes according to the publisher
    • ...
  • parameters should not be hard-coded in the program code

Additional Requirements

  • as configurable as possible
  • metadata compliance with the ERP system
  • report with error messages that
    • link to the error location
    • include a link to the respective guideline section

Implementation

Publication Workflow

Talend

Transpect

  • framework for converting and checking data
  • architecture for customer-specific configurations
  • Open Source (FreeBSD license)

Workflow

Internal XML format

One XML to rule them all

  • JATS/BITS
  • PDF check results
  • metadata from ERP system
  • Zip file listing
  • global parameters

Configuration

Configuration: Cascade

  • configuration files can be stored for each workflow, imprint/publishing partner or even individual titles
  • applies to XProc, XSLT, Schematron, CSS, other XML etc.
  • transpect modules dynamically load the most specific matching files
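
The lookup idea behind such a cascade can be sketched as follows. This is not transpect's actual implementation: the directory names (common, journals, journals/cclm), the file names and the function resolve are hypothetical. The search runs from the most specific level to the most general one and stops at the first hit.

```python
import tempfile
from pathlib import Path


def resolve(cascade_dirs, filename):
    """Return the first existing file, searching from most to least specific."""
    for directory in cascade_dirs:
        candidate = Path(directory) / filename
        if candidate.is_file():
            return candidate
    return None


def demo():
    """Create a throwaway configuration tree and resolve two files in it."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for rel in (
            "common/checks.sch",
            "common/report.css",
            "journals/checks.sch",
            "journals/cclm/checks.sch",
        ):
            path = root / rel
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_text("")
        # Most specific level first: title, then workflow, then common defaults.
        cascade = [root / "journals" / "cclm", root / "journals", root / "common"]
        schematron = resolve(cascade, "checks.sch").relative_to(root).as_posix()
        stylesheet = resolve(cascade, "report.css").relative_to(root).as_posix()
        return schematron, stylesheet


if __name__ == "__main__":
    print(demo())
```

Only the title-specific Schematron file overrides the defaults here; the CSS falls through to the common level, so nothing has to be duplicated per title.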

Configuration: Phases

  • custom checking profile
  • includes a RELAX NG schema and a Schematron phase
  • De Gruyter: strict and lax phase

Configuration: Parameters

  • parameters are stored as XML
  • used in XProc, XSLT and Schematron
  • parameters can include variable references to other parameters
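
A minimal sketch of parameters that reference other parameters. The XML layout (param elements with name and value attributes), the {name} reference syntax and the function names are assumptions made for illustration; the actual parameter format is not shown in the slides.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical parameter file; the values reuse the article ID example.
PARAMS_XML = """
<params>
  <param name="doi-code" value="cclm"/>
  <param name="year" value="2016"/>
  <param name="article-counter-ID" value="0123"/>
  <param name="article-id" value="{doi-code}-{year}-{article-counter-ID}"/>
</params>
"""

REFERENCE = re.compile(r"\{([^{}]+)\}")


def load_params(xml_text):
    """Read all param elements into a name -> raw value dictionary."""
    root = ET.fromstring(xml_text)
    return {p.get("name"): p.get("value") for p in root.iter("param")}


def resolve_param(name, params, seen=()):
    """Expand {other-param} references recursively; reject circular ones.

    An unknown parameter name raises KeyError.
    """
    if name in seen:
        raise ValueError("circular reference: " + " -> ".join(seen + (name,)))
    return REFERENCE.sub(
        lambda m: resolve_param(m.group(1), params, seen + (name,)),
        params[name],
    )


if __name__ == "__main__":
    print(resolve_param("article-id", load_params(PARAMS_XML)))
```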

Schematron

Schematron Implementation

Benefits

  • minimized effort for error handling within production
  • improved processing for degruyter.com, A&I Services, Aggregators
  • helps improve the XML guidelines and the tool itself
  • facilitates retrieval and processing for future purposes
    (content re-use)
  • improved XML processing with vendors

Outlook

  • PDF checks
  • IDPF EPUB check
  • external checking framework for vendors

Thanks for your attention.


Martin Kraetke @mkraetke

Franziska Bühring @fbuehring