Transforming XHTML text into formatted plain text

From Xcri


XCRI CAP allows for XHTML marked-up text within the various description elements. Some aggregators may not store XHTML text, but instead require plain text, with some specific formatting rules.
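
As an illustration of the kind of input in question, the fragment below sketches a description element carrying XHTML markup. The element nesting and the two namespace URIs are taken from the match patterns and namespace declarations in the stylesheet on this page; the course text itself is invented for the example, and a real XCRI CAP document would carry further elements that are omitted here.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical sample input: only the structure needed to show XHTML inside a description. -->
<catalog xmlns="http://xcri.org/profiles/catalog"
         xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <provider>
    <course>
      <description>
        <xhtml:div>
          <xhtml:p>This course covers:</xhtml:p>
          <xhtml:ul>
            <xhtml:li>Web design</xhtml:li>
            <xhtml:li>Databases</xhtml:li>
          </xhtml:ul>
        </xhtml:div>
      </description>
    </course>
  </provider>
</catalog>
```

An aggregator that stores only plain text would need the paragraph and list structure above expressed with line breaks and list markers rather than elements.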

It might be useful to provide a sample XHTML-text-to-plain-text XSLT stylesheet. I've put together a basic prototype, which is shown below. The character references it uses are &#x2022; (bullet), &#xa; (new line) and &#x9; (tab).

Please note that some of the element handling, specifically that for span elements, has been written to handle some awkward (and invalid) markup produced by Microsoft InfoPath.

The imported copy.xslt is the identity transform stylesheet as described in XSLT Cookbook.
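
For readers without access to that file, an identity transform is usually written along the following lines. This is a minimal sketch of the standard pattern, on the assumption that copy.xslt follows it; the actual contents of the imported file are not reproduced on this page.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Assumed shape of copy.xslt: the conventional identity transform. -->
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Copy every node and attribute unchanged, recursing into children. -->
  <xsl:template match="node() | @*">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()" />
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```

Because the file is brought in with xsl:import, its templates have lower import precedence than those of the importing stylesheet. Any node matched by a template in the main stylesheet is handled there, and everything else falls through to the identity template and is copied as-is.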

Feel free to update, correct or replace any code here, or point to another stylesheet which does the job better and can be freely used.

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" xmlns="http://xcri.org/profiles/catalog" xmlns:cap="http://xcri.org/profiles/catalog" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:fn="http://www.w3.org/2005/xpath-functions" xmlns:xhtml="http://www.w3.org/1999/xhtml">
	<xsl:import href="http://www.adamsmithcollege.ac.uk/xml/xsl/t/courses/copy.xslt" />
	<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
	<!-- ASP.NET error: White space cannot be stripped from input documents that have already been loaded. Provide the input document as an XmlReader instead. -->
	<!--xsl:strip-space elements="xhtml:*" /-->
	<!-- Takes an XCRI CAP 1.0 document and removes XHTML elements from the description elements, replacing these with basic formatted text. -->
	<!-- Created 2008-04-17 by Tavis Reddick, Adam Smith College-->
	<!-- Modified 2009-03-23 by tavisreddick@adamsmith.ac.uk -->
	<!-- Version 0.1.2 -->
	<!--
This work is licenced under the Creative Commons Attribution 3.0 Unported License.
To view a copy of this licence, visit http://creativecommons.org/licenses/by/3.0/ or send
a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California 94105, USA.
	-->
	<!-- Select your favoured character for bullets. -->
	<xsl:param name="bullet" select="'&#x2022;'" />
	<xsl:template match="/">
		<xsl:apply-templates />
	</xsl:template>
	<xsl:template match="cap:catalog/cap:provider/cap:course/cap:description">
		<xsl:copy>
			<xsl:copy-of select="@*" />
			<xsl:apply-templates />
		</xsl:copy>
	</xsl:template>
	<xsl:template match="xhtml:div | xhtml:p | xhtml:h1 | xhtml:h2 | xhtml:h3 | xhtml:h4 | xhtml:h5">
		<!-- For each block text item, replace with linebreak-content-linebreak. -->
		<!--xsl:value-of select="." /-->
		<xsl:text>&#xa;</xsl:text>
		<xsl:apply-templates />
		<xsl:text>&#xa;</xsl:text>
	</xsl:template>
	<xsl:template match="xhtml:ul | xhtml:ol">
		<!-- For each unordered or ordered list, apply templates and add opening and closing linebreaks. -->
		<xsl:text>&#xa;</xsl:text>
		<xsl:apply-templates />
		<xsl:text>&#xa;</xsl:text>
	</xsl:template>
	<xsl:template match="xhtml:span[./text()] | xhtml:strong | xhtml:u">
		<!-- For each span containing text, strong emphases or underlines, lose the element and apply templates. -->
		<xsl:apply-templates />
	</xsl:template>
	<xsl:template match="xhtml:span[./*]">
		<!-- For each span containing other elements, apply templates and add a closing linebreak. -->
		<xsl:apply-templates />
		<xsl:text>&#xa;</xsl:text>
	</xsl:template>
	<xsl:template match="xhtml:a | xhtml:font">
		<!-- For inline elements like a and font, replace elements by applying templates. -->
		<xsl:apply-templates />
	</xsl:template>
	<xsl:template match="xhtml:li[parent::xhtml:ul]">
		<!-- For each unordered list item, replace with bullet-tab-content-linebreak. -->
		<xsl:value-of select="$bullet" />
		<xsl:text>&#x9;</xsl:text>
		<xsl:value-of select="normalize-space(.)" />
		<xsl:text>&#xa;</xsl:text>
	</xsl:template>
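	<xsl:template match="xhtml:li[parent::xhtml:ol]">
		<!-- For each ordered list item, replace with number-tab-content-linebreak. -->
		<!-- xsl:number counts li siblings only; position() would also count the whitespace text nodes between items when whitespace is not stripped. -->
		<xsl:number count="xhtml:li" format="1" />
		<xsl:text>&#x9;</xsl:text>
		<xsl:value-of select="normalize-space(.)" />
		<xsl:text>&#xa;</xsl:text>
	</xsl:template>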
	<xsl:template match="xhtml:br">
		<!-- For each break element, replace with a linebreak. -->
		<xsl:text>&#xa;</xsl:text>
	</xsl:template>
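	<!-- Remove XHTML elements that are empty or contain only a non-breaking space (normalize-space() never returns a lone ordinary space, and it leaves &#xa0; untouched). -->
	<xsl:template match="xhtml:*[normalize-space(.) = '&#xa0;' or normalize-space(.) = ''][not(self::xhtml:br)]" />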
</xsl:stylesheet>