Text to HTML converter using Perlmantic::StateMachineParser

In this example we show how to write a Perl program to convert a text file with simple markup to HTML format.

We'll be using the following sample text file as input:

* NAME
       (WARNING: This is an incomplete and reformated extract from the apt-get man page, used only as a sample formated text.)
       apt-get - APT package handling utility -- command-line interface

* SYNOPSIS
       : apt-get [-asqdyfmubV] [-o=config_string] [-c=config_file] [-t=target_release] [-a=architecture] {update | upgrade | dselect-upgrade |
               dist-upgrade | install pkg [{=pkg_version_number | /target_release}]...  | remove pkg...  | purge pkg...  |
               source pkg [{=pkg_version_number | /target_release}]...  | build-dep pkg [{=pkg_version_number | /target_release}]...  |
               download pkg [{=pkg_version_number | /target_release}]...  | check | clean | autoclean | autoremove | {-v | --version} |
               {-h | --help}}

* DESCRIPTION
       apt-get is the command-line tool for handling packages, and may be considered the user's "back-end" to other tools using the APT library.
       Several "front-end" interfaces exist, such as dselect(1), aptitude(8), synaptic(8) and wajig(1).

       Unless the -h, or --help option is given, one of the commands below must be present.

        : update
           update is used to resynchronize the package index files from their sources. The indexes of available packages are fetched from the
           location(s) specified in /etc/apt/sources.list. For example, when using a Debian archive, this command retrieves and scans the
           Packages.gz files, so that information about new and updated packages is available. An update should always be performed before an
           upgrade or dist-upgrade. Please be aware that the overall progress meter will be incorrect as the size of the package files cannot be
           known in advance.

       : install
           install is followed by one or more packages desired for installation or upgrading. Each package is a package name, not a fully
           qualified filename (for instance, in a Debian system, apt-utils would be the argument provided, not apt-utils_0.9.7.9_amd64.deb). All
           packages required by the package(s) specified for installation will also be retrieved and installed. The /etc/apt/sources.list file is
           used to locate the desired packages.

           A specific version of a package can be selected for installation by following the package name with an equals and the version of the
           package to select. This will cause that version to be located and selected for install. Alternatively a specific distribution can be
           selected by following the package name with a slash and the version of the distribution or the Archive name (stable, testing,
           unstable).

* OPTIONS
       All command line options may be set using the configuration file, the descriptions indicate the configuration option to set. For boolean
       options you can override the config file by using something like -f-,--no-f, -f=no or several other variations.

       : --no-install-recommends
           Do not consider recommended packages as a dependency for installing. Configuration Item: APT::Install-Recommends.

       : --install-suggests
           Consider suggested packages as a dependency for installing. Configuration Item: APT::Install-Suggests.

* FILES
       : /etc/apt/sources.list
           Locations to fetch packages from. Configuration Item: Dir::Etc::SourceList.

       : /etc/apt/sources.list.d/
           File fragments for locations to fetch packages from. Configuration Item: Dir::Etc::SourceParts.

* SEE ALSO
       apt-cache(8), apt-cdrom(8), dpkg(1), dselect(1), sources.list(5), apt.conf(5), apt-config(8), apt-secure(8), The APT User's guide in
       /usr/share/doc/apt-doc/, apt_preferences(5), the APT Howto.

* AUTHORS
       Jason Gunthorpe

       APT team

* NOTES
        1. packages.debian.org/changelogs
           http://packages.debian.org/changelogs

        2. changelogs.ubuntu.com/changelogs
           http://changelogs.ubuntu.com/changelogs

The HTML output we want is as follows:

<html>
<header><title>txt2html output</title>
</header>
<body>
<div class='section'>
<h1>NAME</h1>
<p>
       (WARNING: This is an incomplete and reformated extract from the apt-get man page, used only as a sample formated text.)
       apt-get - APT package handling utility -- command-line interface

</p>
</div>
<div class='section'>
<h1>SYNOPSIS</h1>
<dl>
<dt>apt-get [-asqdyfmubV] [-o=config_string] [-c=config_file] [-t=target_release] [-a=architecture] {update | upgrade | dselect-upgrade |</dt>
<dd><p>
               dist-upgrade | install pkg [{=pkg_version_number | /target_release}]...  | remove pkg...  | purge pkg...  |
               source pkg [{=pkg_version_number | /target_release}]...  | build-dep pkg [{=pkg_version_number | /target_release}]...  |
               download pkg [{=pkg_version_number | /target_release}]...  | check | clean | autoclean | autoremove | {-v | --version} |
               {-h | --help}}

</p>
</dd>
</dl>
</div>
<div class='section'>
<h1>DESCRIPTION</h1>
<p>
       apt-get is the command-line tool for handling packages, and may be considered the user's "back-end" to other tools using the APT library.
       Several "front-end" interfaces exist, such as dselect(1), aptitude(8), synaptic(8) and wajig(1).

</p>
<p>
       Unless the -h, or --help option is given, one of the commands below must be present.

</p>
<dl>
<dt>update</dt>
<dd><p>
           update is used to resynchronize the package index files from their sources. The indexes of available packages are fetched from the
           location(s) specified in /etc/apt/sources.list. For example, when using a Debian archive, this command retrieves and scans the
           Packages.gz files, so that information about new and updated packages is available. An update should always be performed before an
           upgrade or dist-upgrade. Please be aware that the overall progress meter will be incorrect as the size of the package files cannot be
           known in advance.

</p>
</dd>
<dt>install</dt>
<dd><p>
           install is followed by one or more packages desired for installation or upgrading. Each package is a package name, not a fully
           qualified filename (for instance, in a Debian system, apt-utils would be the argument provided, not apt-utils_0.9.7.9_amd64.deb). All
           packages required by the package(s) specified for installation will also be retrieved and installed. The /etc/apt/sources.list file is
           used to locate the desired packages.

</p>
<p>
           A specific version of a package can be selected for installation by following the package name with an equals and the version of the
           package to select. This will cause that version to be located and selected for install. Alternatively a specific distribution can be
           selected by following the package name with a slash and the version of the distribution or the Archive name (stable, testing,
           unstable).

</p>
</dd>
</dl>
</div>
<div class='section'>
<h1>OPTIONS</h1>
<p>
       All command line options may be set using the configuration file, the descriptions indicate the configuration option to set. For boolean
       options you can override the config file by using something like -f-,--no-f, -f=no or several other variations.

</p>
<dl>
<dt>--no-install-recommends</dt>
<dd><p>
           Do not consider recommended packages as a dependency for installing. Configuration Item: APT::Install-Recommends.

</p>
</dd>
<dt>--install-suggests</dt>
<dd><p>
           Consider suggested packages as a dependency for installing. Configuration Item: APT::Install-Suggests.

</p>
</dd>
</dl>
</div>
<div class='section'>
<h1>FILES</h1>
<dl>
<dt>/etc/apt/sources.list</dt>
<dd><p>
           Locations to fetch packages from. Configuration Item: Dir::Etc::SourceList.

</p>
</dd>
<dt>/etc/apt/sources.list.d/</dt>
<dd><p>
           File fragments for locations to fetch packages from. Configuration Item: Dir::Etc::SourceParts.

</p>
</dd>
</dl>
</div>
<div class='section'>
<h1>SEE ALSO</h1>
<p>
       apt-cache(8), apt-cdrom(8), dpkg(1), dselect(1), sources.list(5), apt.conf(5), apt-config(8), apt-secure(8), The APT User's guide in
       /usr/share/doc/apt-doc/, apt_preferences(5), the APT Howto.

</p>
</div>
<div class='section'>
<h1>AUTHORS</h1>
<p>
       Jason Gunthorpe

</p>
<p>
       APT team

</p>
</div>
<div class='section'>
<h1>NOTES</h1>
<p>
        1. packages.debian.org/changelogs
           http://packages.debian.org/changelogs

</p>
<p>
        2. changelogs.ubuntu.com/changelogs
           http://changelogs.ubuntu.com/changelogs

</p>
</div>
</body>
</html>

Our program consists of two files: txt2html.pl and HtmlWriter.pm. The first is the Perl script we run to perform the conversion; it contains the definition of the state machine that parses the input. The second is a Perl module defining the class we use to write the HTML output; this class implements the Perlmantic::StateMachineParser::IReceiver interface needed to process state machine transitions.

Here is txt2html.pl:

use strict;
use warnings;

use FindBin qw( $Bin );
use lib "$Bin";
use lib "$Bin/../../../..";

use Parser::StateMachineParser;
use Parser::StateMachineParser::ParsingPrinter;
use HtmlWriter;

my $writer = HtmlWriter->new (*STDOUT);
my @anyState = ('section', 'subsection', 'defItem', 'paragraph', 'interParagraph');

my $pp;
my $parsingPrinter;
if (open $pp, '>', 'parse.log') {
   $parsingPrinter = ParsingPrinter->new($pp);
} else {
   print {*STDERR} "Cannot open parse.log for writing: $!\n";
}


my $machineDef = {
   receivers => [
      {obj=>$writer, separateStateCallbacks=>1},
      # Register the parse logger only if parse.log could be opened.
      ($parsingPrinter ? +{obj=>$parsingPrinter} : ()),
   ],
   states => {
      null => {
         properties => { initialState => 1 },
      },
      section => {
         transitions => [{
            from => ['null', @anyState],
            signal => {
               regex => qr/^\s*\*\s+(.*)$/,
               parts => { sectionName=>1 },
            },
         }],
      },
      subsection => {
         transitions => [{
            from => \@anyState,
            signal => {
               regex => qr/^\s*\*{2}\s+(.*)$/,
               parts => { subsectionName=>1 },
            },
         }],
      },
      defItem => {
         transitions => [{
            from => ['section', 'subsection', 'defItem', 'paragraph', 'interParagraph'],
            signal => {
               regex => qr/^\s*:\s+(.*)$/,
               parts => { defName=>1 },
            },
         }],
      },
      interParagraph => {
         transitions => [{
            from => ['paragraph'],
            signal => {
               regex => qr/^\s*$/,
               parts => { },
            },
         }],
      },
      paragraph => {
         transitions => [{
            from => ['section', 'subsection', 'defItem', 'interParagraph'],
            signal => {
               regex => qr/.*\S.*/,
               parts => { },
            },
         }],
      },
   },
};

my $machine = StateMachineParser->new ($machineDef);
$machine->processFile (*STDIN);
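
To see how the signal patterns above sort input lines, here is a minimal standalone sketch that does not use the parser library at all. The regexes and part names are copied from the machine definition; the @signals table and the classify() helper are hypothetical and exist only for this illustration. The sketch assumes that a parts entry such as sectionName=>1 names capture group 1, and it tries paragraph last because that pattern matches any non-blank line. How the library itself orders candidates or applies the from lists is not shown here.

```perl
use strict;
use warnings;

# Signal patterns copied from the machine definition above.
# Each entry: state name, regex, and (optionally) the name given to group 1.
my @signals = (
   [ subsection     => qr/^\s*\*{2}\s+(.*)$/, 'subsectionName' ],
   [ section        => qr/^\s*\*\s+(.*)$/,    'sectionName'    ],
   [ defItem        => qr/^\s*:\s+(.*)$/,     'defName'        ],
   [ interParagraph => qr/^\s*$/                               ],
   [ paragraph      => qr/.*\S.*/                              ],
);

# Hypothetical helper: return the first matching state name and a hash
# reference holding the named part, if the signal defines one.
sub classify {
   my ($line) = @_;
   for my $signal (@signals) {
      my ($state, $regex, $partName) = @$signal;
      if (my @groups = $line =~ $regex) {
         my %info;
         $info{$partName} = $groups[0] if defined $partName;
         return ($state, \%info);
      }
   }
   return;
}

for my $line ('* NAME', '       : update', '', '       apt-get - APT package handling utility') {
   my ($state, $info) = classify($line);
   my $parts = join ', ', map { "$_=$info->{$_}" } sort keys %$info;
   print "$state: $parts\n";
}
```

The first sample line is reported as section with sectionName=NAME, which is the value HtmlWriter later reads from $info->{sectionName}.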

Here is HtmlWriter.pm:

use strict;
use warnings;
package HtmlWriter;

sub new ($$)
{
   my ($className, $outFH) = @_;
   my $self = {
      outFH => $outFH,
      isOpen => {
         paragraph => 0,
         defItem => 0,
         defList => 0,
         subsection => 0,
         section => 0,
      },
      tagMap => {
         'paragraph' => 'p',
         'defList' => 'dl',
         'defItem' => 'dd',
         'subsection' => 'div',
         'section' => 'div',
      },
      closeDeps => {
         'document' => ['paragraph', 'defItem', 'defList', 'subsection', 'section'],
         'section' => ['paragraph', 'defItem', 'defList', 'subsection', 'section'],
         'subsection' => ['paragraph', 'defItem', 'defList', 'subsection'],
         'defList' => ['paragraph', 'defItem', 'defList'],
         'defItem' => ['paragraph', 'defItem'],
         'paragraph' => ['paragraph'],
      },
      omitFirst => ['section', 'subsection', 'firstDefItem', 'defItem'],
      prevPara => undef,
   };
   bless $self, $className;

   # write HTML header
   $self->print("<html>\n<header><title>txt2html output</title>\n</header>\n<body>\n");
   return $self;
}

sub print ($$)
{
   my ($self, $text) = @_;
   print {$self->{outFH}} $text;
}

sub closeTags ($$)
{
   my ($self, $curStateType) = @_;
   return if not exists $self->{closeDeps}->{$curStateType};
   for my $toClose (@{$self->{closeDeps}->{$curStateType}})
   {
      if ($self->{isOpen}->{$toClose})
      {
         $self->print("</$self->{tagMap}->{$toClose}>\n");
         $self->{isOpen}->{$toClose} = 0;
      }
   }
}


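# Emit $text if given, otherwise the default opening tag for $curStateType,
# and mark that element as open.
sub openTag ($$;$)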
{
   my ($self, $curStateType, $text) = @_;

   if (defined $text) {
      $self->print($text);
   } else {
      $self->print("<$self->{tagMap}->{$curStateType}>\n");
   }

   $self->{isOpen}->{$curStateType} = 1;
}


sub DESTROY ($)
{
   my ($self) = @_;
   $self->closeTags('document');
   $self->print("</body>\n</html>\n");
}

sub enter_section ($$$)
{
   my ($self, $raw, $info) = @_;
   $self->closeTags('section');
   $self->openTag('section', "<div class='section'>\n<h1>$info->{sectionName}</h1>\n");
}

sub enter_subsection ($$$)
{
   my ($self, $raw, $info) = @_;
   $self->closeTags('subsection');
   $self->openTag('subsection', "<div class='subsection'>\n<h2>$info->{subsectionName}</h2>\n");
}

sub enter_defItem ($$$)
{
   my ($self, $raw, $info) = @_;
   if (not $self->{isOpen}->{defList})
   {
      $self->closeTags('defList');
      $self->openTag('defList');
   }
   $self->closeTags('defItem');
   $self->openTag('defItem', "<dt>$info->{defName}</dt>\n<dd>");
}

sub enter_paragraph ($$$)
{
   my ($self, $raw, $info) = @_;
   $self->closeTags('paragraph');
   $self->openTag('paragraph');
}

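# Write a line of input to the output. The line that triggered a transition
# into one of the omitFirst states is skipped, since its text has already
# been emitted as a heading or <dt> by the matching enter_* callback.
sub processData ($$$$)
{
   my ($self, $data, $curState, $isTransition) = @_;
   return 0 if $isTransition and grep {$_ eq $curState} @{$self->{omitFirst}};
   $self->print($data);
}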

1;
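
The closeDeps table is what keeps the output properly nested: entering a state closes every still-open element listed for it, innermost first. The following standalone sketch reproduces that cascade without the rest of the class. The two tables are copied from HtmlWriter->new (document entry omitted); the closing_tags() helper is hypothetical and returns the tags instead of printing them.

```perl
use strict;
use warnings;

# Tables copied from HtmlWriter->new above.
my %tagMap = (
   paragraph  => 'p',
   defList    => 'dl',
   defItem    => 'dd',
   subsection => 'div',
   section    => 'div',
);
my %closeDeps = (
   section    => [qw(paragraph defItem defList subsection section)],
   subsection => [qw(paragraph defItem defList subsection)],
   defList    => [qw(paragraph defItem defList)],
   defItem    => [qw(paragraph defItem)],
   paragraph  => [qw(paragraph)],
);

# Hypothetical helper with the same logic as closeTags: collect the closing
# tags for every open element listed in closeDeps and mark each as closed.
sub closing_tags {
   my ($isOpen, $entering) = @_;
   my @tags;
   for my $toClose (@{ $closeDeps{$entering} || [] }) {
      next unless $isOpen->{$toClose};
      push @tags, "</$tagMap{$toClose}>";
      $isOpen->{$toClose} = 0;
   }
   return @tags;
}

# We are inside a paragraph of a definition item within a section.
my %isOpen = (section => 1, defList => 1, defItem => 1, paragraph => 1);

# A new definition item closes only the paragraph and the previous item.
print join(' ', closing_tags(\%isOpen, 'defItem')), "\n";   # </p> </dd>

# A new section then closes the list and the enclosing section.
print join(' ', closing_tags(\%isOpen, 'section')), "\n";   # </dl> </div>
```

This matches the tag sequences in the sample output, where each new section heading is preceded by the closing tags of whatever was still open.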