=head1 htmltable2latex perl package

This small package translates tables written in HTML markup (more
precisely, well-formed XML that uses HTML table tags) into the
corresponding LaTeX code. Not all features of HTML tables are
supported, but basic tables should work fine.

=cut
package htmltable2latex;
use XML::Parser;
use XML::Parser::EasyTree;
use Exporter;
#use Data::Dumper;
use strict;
use vars qw(@ISA @EXPORT $cols $cols_);
@ISA=qw(Exporter);
@EXPORT=qw(tbl2ltx);
=head1 tbl2ltx($html, $encoding)

The function you will use to convert HTML to LaTeX. The first
parameter is a string containing the XML/HTML code. It can contain
any number of <table> tags. Please note that:

1) The tags are parsed with XML::Parser. This means they have to be
well-formed XML code.

2) XML::Parser expects a single root element. So if you have
multiple tables in your $html string, wrap them in some arbitrary
tag.
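As an illustration of note 2, here is a minimal sketch of wrapping
several tables in one root element before conversion. The helper name
wrap_in_root and the tag name "tables" are made up for this example and
are not part of the package. The tbl2ltx call is shown only as a
comment so that the snippet runs without XML::Parser installed.

```perl
    use strict;
    use warnings;

    # Hypothetical helper (not part of htmltable2latex): wrap a fragment
    # in an arbitrary root tag so that XML::Parser sees a single
    # top-level element.
    sub wrap_in_root
    {
        my ($fragment, $root) = @_;
        $root = 'tables' unless defined $root;   # assumed default name
        return "<$root>$fragment</$root>";
    }

    my $fragment = '<table><tr><td>a</td></tr></table>'
                 . '<table><tr><td>b</td></tr></table>';
    my $html = wrap_in_root($fragment);
    print "$html\n";

    # With the package loaded, the wrapped string could then be converted:
    #   print tbl2ltx($html, 'UTF-8');
```

Any tag name works for the wrapper, because the converter only looks
for table elements and descends through everything else.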
The corresponding LaTeX code is returned as a string. The parser ignores
any text that is not inside a table cell or that lies outside of the
<table> tags.

The "border" attribute of the <table> tag is honoured: if it is not 0,
a border is drawn in the LaTeX code. The "colspan" attribute of <td>
and <th> cells is honoured.
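To make the effect of "border" and "colspan" concrete, the following
standalone sketch mirrors how the package builds the tabular preamble
and a spanning cell. The function names tabular_preamble and span_cell
are invented for this illustration; the package does the equivalent
work inside its table and td routines.

```perl
    use strict;
    use warnings;

    # Illustrative only: build the opening of a tabular environment for
    # $cols left-aligned columns, with vertical rules and a top \hline
    # when $border is true.
    sub tabular_preamble
    {
        my ($cols, $border) = @_;
        my $spec = $border ? '|' . ('l|' x $cols) : 'l' x $cols;
        my $txt = "\\begin{tabular}{$spec}\n";
        $txt .= "\\hline\n" if $border;
        return $txt;
    }

    # Illustrative only: a cell with colspan > 1 becomes \multicolumn,
    # any other cell is passed through unchanged.
    sub span_cell
    {
        my ($text, $span) = @_;
        return $text unless defined $span && $span > 1;
        return "\\multicolumn{$span}{l}{$text}";
    }

    print tabular_preamble(2, 1);
    print span_cell('wide cell', 2), " \\\\ \\hline\n";
    print "\\end{tabular}\n";
```

The number of columns is the width of the widest row, where a spanning
cell counts as many columns as its "colspan" value.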
Valid values for $encoding are: qw(UTF-8 ISO-8859-1 UTF-16 US-ASCII)
or any other you may have configured for the expat xml parser. See
"man XML::Parser" for more details.
=cut
sub tbl2ltx
{
    my ($html, $encoding) = @_;
    # Ask EasyTree to drop text nodes that contain only whitespace.
    $XML::Parser::EasyTree::Noempty=1;
    my $p=XML::Parser->new(Style=>'EasyTree');
    die "could not create XML::Parser" unless ref $p;
    my %opts;
    $opts{ProtocolEncoding}=$encoding if $encoding;
    my $tree=$p->parse($html, %opts);
    die "could not parse html" unless ref $tree;
    #print Dumper($tree); return;
    my $body='';
    # Walk the top-level nodes and convert every table found below them.
    foreach my $c (@$tree)
    {
        $body.=searchtable($c);
    }
    return $body;
}
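# Recursively look for table elements and return their LaTeX code.
sub searchtable
{
    my ($tag) = @_;
    die "no tag ref" unless ref $tag;
    #print Dumper($tag);
    # Text nodes cannot contain tables; return '' so callers can concatenate.
    return '' unless $tag->{type} eq 'e';
    return table($tag) if lc($tag->{name}) eq 'table';
    my $body='';
    foreach my $c (@{$tag->{content}})
    {
        $body.=searchtable($c);
    }
    return $body;
}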
# Convert one table element into a tabular environment.
sub table {
    my ($tbl) = @_;
    die "no table ref" unless ref $tbl;
    return '' unless lc($tbl->{name}) eq 'table';
    my $border=$tbl->{attrib}->{border};
    #print Dumper($tbl);
    $cols=0;    # widest row seen so far; updated by tablerow()
    my $body='';
    my $foot='';
    foreach my $c (@{$tbl->{content}})
    {
        #print Dumper($c);
        die "no ref" unless ref $c;
        next unless $c->{type} eq 'e';
        if(lc($c->{name}) eq 'tbody') { $body.=tbody($c) }
        elsif(lc($c->{name}) eq 'thead') { $body.=tbody($c) }
        elsif(lc($c->{name}) eq 'tfoot') { $foot.=tbody($c) }
        elsif(lc($c->{name}) eq 'tr') { $body.=tablerow($c) }
        else { die "unexpected tag in table" }
    }
    # Column specification: one left-aligned column per table column.
    my $txt="\\begin{tabular}{";
    if($border)
    {
        $txt.='|';
        $txt.='l|' x $cols;
    }
    else
    {
        $txt.='l' x $cols;
    }
    $txt.="}\n";
    $txt.="\\hline\n" if $border;
    # Footer rows always go last, wherever tfoot appeared in the source.
    $body=$body.$foot;
    # Draw a horizontal rule after every row of a bordered table.
    $body =~ s/\\\\/\\\\ \\hline/g if $border;
    $txt.=$body;
    $txt.="\\end{tabular}\n";
    return $txt;
}
# Convert the rows of a thead, tbody or tfoot element.
sub tbody
{
    my ($tb) = @_;
    my $txt='';
    foreach my $c (@{$tb->{content}})
    {
        die "no ref" unless ref $c;
        next unless $c->{type} eq 'e';
        if(lc($c->{name}) eq 'tr') { $txt.=tablerow($c) }
    }
    return $txt;
}
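# Convert one tr element into a LaTeX table row.
sub tablerow
{
    my ($row) = @_;
    $cols_=0;    # columns used by this row; incremented by td()
    #print Dumper($row);
    my $txt='';
    foreach my $c (@{$row->{content}})
    {
        die "no ref" unless ref $c;
        # Skip text nodes between the cells (check the child, not the row).
        next unless $c->{type} eq 'e';
        if(lc($c->{name}) eq 'td') { $txt.=td($c) }
        elsif(lc($c->{name}) eq 'th') { $txt.=td($c, 1) }
    }
    $cols=$cols_ if $cols_>$cols;
    # Replace the separator after the last cell with the end-of-row marker.
    $txt =~ s/\& $/\\\\\n/;
    return $txt;
}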
# Convert one td or th element; th cells are set in bold.
sub td
{
    my ($cell, $isTH) = @_;
    #print Dumper($cell);
    my $txt='';
    # A missing colspan attribute counts as a span of 1.
    my $span=$cell->{attrib}->{colspan} || 1;
    if($span>1)
    {
        $cols_+=$span;
        $txt.="\\multicolumn{$span}{l}{";
    }
    else
    {
        $cols_++;
    }
    $txt.="\\textbf{" if $isTH;
    foreach my $c (@{$cell->{content}})
    {
        die "no ref" unless ref $c;
        $txt.=text($c);
    }
    $txt.="}" if $isTH;
    $txt.="}" if $span>1;
    $txt.=' & ';
    return $txt;
}
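# Collect the text of a node, ignoring any markup nested inside it.
sub text
{
    my ($t) = @_;
    return $t->{content} if $t->{type} eq 't';
    my $txt='';    # start with '' so that empty elements yield an empty string
    foreach my $c (@{$t->{content}})
    {
        die "no ref" unless ref $c;
        $txt.=text($c);
    }
    return $txt;
}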
=head1 Bugs

The parser will only recognize the attribute "border" if it is
written in lowercase.

The border of multicolumn cells is not rendered correctly.

The following tags of HTML tables are not supported: .
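One possible way to address the multicolumn border problem, sketched
here as an assumption rather than as the package's behaviour: LaTeX
drops the vertical rules of the columns that a \multicolumn cell
replaces, so the cell's own column specification has to restate them.
The helper name bordered_span_cell and its $is_first argument are
hypothetical.

```perl
    use strict;
    use warnings;

    # Hypothetical fix sketch: repeat the vertical rules in the column
    # specification of a spanning cell. Every cell carries the rule on
    # its right edge; the first cell of a row also needs the rule on
    # its left edge.
    sub bordered_span_cell
    {
        my ($text, $span, $border, $is_first) = @_;
        my $spec = 'l';
        if ($border)
        {
            $spec = $is_first ? '|l|' : 'l|';
        }
        return "\\multicolumn{$span}{$spec}{$text}";
    }

    print bordered_span_cell('The foot', 2, 1, 1), "\n";
```

To use this idea in the package, td would need to know the border
setting and whether the cell is the first one in its row, neither of
which it receives at present.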
=head1 Example
#!/usr/bin/perl
$html=<
| In the secönd Column
|
More äction in row2 col1 |
The foot |
Main Part | second col |
| |