=head1 htmltable2latex perl package

This small package provides a mechanism to translate tables in HTML
markup (or rather, XML with HTML table tags) into corresponding LaTeX
code. Not all features of HTML tables are supported, but basic tables
should work fine.

=cut

package htmltable2latex;

use XML::Parser;
use XML::Parser::EasyTree;
use Exporter;
#use Data::Dumper;
use strict;
use vars qw(@ISA @EXPORT $cols $cols_);
@ISA=qw(Exporter);
@EXPORT=qw(tbl2ltx);

=head1 tbl2ltx($html, $encoding)

The function you will use to convert HTML to LaTeX. The first parameter
is a string containing the XML/HTML code. This can contain any number
of <table> tags. Please note that:

1) The tags are parsed with XML::Parser. This means they have to be
well-formed XML.

2) XML::Parser expects a single (global) root tag. So if you have
multiple tables in your $html string, wrap them in some arbitrary tag.

The corresponding LaTeX code is returned as a string.

The parser ignores any text not included in
<td> cells or outside of the <table> tags. The "border" attribute of the
<table> tag is honoured. If it is not 0, a border is drawn in the LaTeX code. The "colspan" attribute of
<td> and <th> cells is honoured.

Valid values for $encoding are: qw(UTF-8 ISO-8859-1 UTF-16 US-ASCII) or
any other you may have configured for the expat XML parser. See "man
XML::Parser" for more details.

=cut

# Convert every <table> element found in $html into LaTeX tabular code.
sub tbl2ltx {
    my ($html, $encoding) = @_;

    # Drop whitespace-only text nodes from the parsed tree.
    $XML::Parser::EasyTree::Noempty=1;
    my $p=XML::Parser->new(Style=>'EasyTree');
    die "could not create XML::Parser" unless ref $p;

    my %opts;
    $opts{ProtocolEncoding}=$encoding if $encoding;
    my $tree=$p->parse($html, %opts);
    die "could not parse html" unless ref $tree;
    #print Dumper($tree); return;

    my $body='';
    foreach my $c (@$tree) {
        $body.=searchtable($c);
    }
    return $body;
}

# Walk the tree recursively and convert each <table> element encountered.
sub searchtable {
    my ($tag) = @_;
    die "no tag ref" unless ref $tag;
    #print Dumper($tag);
    return '' unless $tag->{type} eq 'e';
    return table($tag) if lc($tag->{name}) eq 'table';

    my $body='';
    foreach my $c (@{$tag->{content}}) {
        $body.=searchtable($c);
    }
    return $body;
}

# Convert a single <table> element into a tabular environment.
sub table {
    my ($tbl) = @_;
    die "no table ref" unless ref $tbl;
    return '' unless lc($tbl->{name}) eq 'table';
    my $border=$tbl->{attrib}->{border};
    #print Dumper($tbl);

    $cols=0;
    my $body='';
    my $foot='';
    foreach my $c (@{$tbl->{content}}) {
        #print Dumper($c);
        die "no ref" unless ref $c;
        next unless $c->{type} eq 'e';
        if(lc($c->{name}) eq 'tbody') { $body.=tbody($c) }
        elsif(lc($c->{name}) eq 'thead') { $body.=tbody($c) }
        elsif(lc($c->{name}) eq 'tfoot') { $foot.=tbody($c) }
        elsif(lc($c->{name}) eq 'tr') { $body.=tablerow($c) }
        else { die "unexpected tag in table" }
    }

    # The column specification uses the widest row seen in this table.
    my $txt="\\begin{tabular}{";
    if($border) {
        $txt.='|';
        $txt.='l|' x $cols;
    }
    else {
        $txt.='l' x $cols;
    }
    $txt.="}\n";
    $txt.="\\hline\n" if $border;

    # The footer rows are emitted after the body rows.
    $body=$body.$foot;
    $body =~ s/\\\\/\\\\ \\hline/g if $border;
    $txt.=$body;
    $txt.="\\end{tabular}\n";
    return $txt;
}

# Convert the rows of a <tbody>, <thead> or <tfoot> element.
sub tbody {
    my ($tb) = @_;
    my $txt='';
    foreach my $c (@{$tb->{content}}) {
        die "no ref" unless ref $c;
        next unless $c->{type} eq 'e';
        if(lc($c->{name}) eq 'tr') { $txt.=tablerow($c) }
    }
    return $txt;
}

# Convert a single <tr> element and track the maximum column count.
sub tablerow {
    my ($row) = @_;
    $cols_=0;
    #print Dumper($row);
    my $txt='';
    foreach my $c (@{$row->{content}}) {
        die "no ref" unless ref $c;
        # Skip text nodes; the original tested $row here instead of $c.
        next unless $c->{type} eq 'e';
        if(lc($c->{name}) eq 'td') { $txt.=td($c) }
        elsif(lc($c->{name}) eq 'th') { $txt.=td($c, 1) }
    }
    $cols=$cols_ if $cols_>$cols;
    # Replace the trailing cell separator with the LaTeX end-of-row marker.
    $txt =~ s/\& $/\\\\\n/;
    return $txt;
}

# Convert a <td> or <th> cell; <th> cells are set in bold.
sub td {
    my ($cell, $isTH) = @_;
    #print Dumper($cell);
    my $txt='';
    my $span=$cell->{attrib}->{colspan} || 1;
    if($span>1) {
        $cols_+=$span;
        $txt.="\\multicolumn{$span}{l}{";
    }
    else {
        $cols_++;
    }
    $txt.="\\textbf{" if $isTH;
    foreach my $c (@{$cell->{content}}) {
        die "no ref" unless ref $c;
        $txt.=text($c);
    }
    $txt.="}" if $isTH;
    $txt.="}" if $span>1;
    $txt.=' & ';
    return $txt;
}

# Collect the text of a node and all of its descendants.
sub text {
    my ($t) = @_;
    return $t->{content} if $t->{type} eq 't';
    my $txt='';
    foreach my $c (@{$t->{content}}) {
        die "no ref" unless ref $c;
        $txt.=text($c);
    }
    return $txt;
}

1;

=head1 Bugs

The parser will only recognize the attribute "border" if it is written
in lowercase.

The border of multicolumn cells is not rendered correctly.

The following tags of HTML tables are not supported: .

=head1 Example

  #!/usr/bin/perl
  $html=<<EOF;
  <tables>
  <table border="1">
  <tr><td></td><th>In the secönd Column</th></tr>
  <tr><td colspan="2">More äction in row2 col1</td></tr>
  </table>
  <table>
  <tfoot><tr><td>The foot</td></tr></tfoot>
  <tbody><tr><td>Main Part</td><td>second col</td></tr></tbody>
  </table>
  </tables>
  EOF
  use htmltable2latex;
  print tbl2ltx($html,'ISO-8859-1');

will output:

  \begin{tabular}{|l|l|}
  \hline \\
  & \textbf{In the secönd Column} \\
  \multicolumn{2}{l}{More äction in row2 col1} \\
  \hline
  \end{tabular}
  \begin{tabular}{ll}
  Main Part & second col \\
  The foot
  \end{tabular}

=head1 Version

Version 1.0 released 2006-01-27

=head1 License

Copyright (c) 2006 Dirk Jagdmann

This software is provided 'as-is', without any express or implied
warranty. In no event will the authors be held liable for any damages
arising from the use of this software.

Permission is granted to anyone to use this software for any purpose,
including commercial applications, and to alter it and redistribute it
freely, subject to the following restrictions:

1. The origin of this software must not be misrepresented; you must not
claim that you wrote the original software. If you use this software in
a product, an acknowledgment in the product documentation would be
appreciated but is not required.

2. Altered source versions must be plainly marked as such, and must not
be misrepresented as being the original software.

3. This notice may not be removed or altered from any source
distribution.

=cut