NAME
Html2Wml -- Program that can convert HTML pages to WML pages
SYNOPSIS
Html2Wml can be used as either a shell command:
$ html2wml file.html
or as a CGI:
/cgi-bin/html2wml.cgi?url=/index.html
In both cases, the file can be either a local file or a URL.
DESCRIPTION
Html2Wml converts HTML pages to WML decks, suitable for being viewed on
a Wap device. The program can be launched from a shell to statically
convert a set of pages, or as a CGI to convert a particular
(potentially dynamic) HTML resource.
Althought the result is not guarantied to be valid WML, it should be
the case for most pages. Good HTML pages will most probably produce
valid WML decks. To check and correct your pages, you can use W3C’s
softwares: the HTML Validator, available online at
http://validator.w3.org and HTML Tidy, written by Dave Raggett.
Html2Wml provides the following features:
· translation of the links
· limitation of the cards size by splitting the result into several
cards
· inclusion of files (similar to the SSI)
· compilation of the result (using the WML Tools, see the section on
"LINKS")
· a debug mode to check the result using validation functions
OPTIONS
Please note that most of these options are also available when calling
Html2Wml as a CGI. In this case, boolean options are given the value
"1" or "0", and other options simply receive the value they expect. For
example, ‘--ascii’ becomes ‘?ascii=1’ or ‘?a=1’. See the file
t/form.html for an example on how to call Html2Wml as a CGI.
Conversion Options
-a, --ascii
When this option is on, named HTML entities and non-ASCII
characters are converted to US-ASCII characters using the same 7
bit approximations as Lynx. For example, ‘©’ is translated to
"(c)", and ‘ß’ is translated to "ss". This option is off by
default.
--[no]collapse
This option tells Html2Wml to collapse redundant whitespaces,
tabulations, carriage returns, lines feeds and empty paragraphs.
The aim is to reduce the size of the WML document as much as
possible. Collapsing empty paragraphs is necessary for two reasons.
First, this avoids empty screens (and on a device with only 4 lines
of display, an empty screen can be quite ennoying). Second,
Html2wml creates many empty paragraphs when converting, because of
the way the syntax reconstructor is programmed. Deleting these
empty paragraphs is necessary like cleaning the kitchen :-)
If this really bother you, you can desactivate this behaviour with
the --nocollapse option.
--ignore-images
This option tells Html2Wml to completly ignore all image links.
--[no]img-alt-text
This option tells Html2Wml to replace the image tags with their
corresponding alternative text (as with a text mode web browser).
This option is on by default.
--[no]linearize
This option is on by default. This makes Html2Wml flattens the HTML
tables (they are linearized), as Lynx does. I think this is better
than trying to use the native WML tables. First, they have
extremely limited features and possibilities compared to HTML
tables. In particular, they can’t be nested. In fact this is normal
because Wap devices are not supposed to have a big CPU running at
some zillions-hertz, and the calculations needed to render the
tables are the most complicated and CPU-hogger part of HTML.
Second, as they can’t be nested, and as typical HTML pages heavily
use imbricated tables to create their layout, it’s impossible to
decide which one could be kept. So the best thing is to keep none
of them.
[Note] Although you can desactivate this behaviour, and although
there is internal support for tables, the unlinearized mode has not
been heavily tested with nested tables, and it may produce
unexpected results.
-n, --numeric-non-ascii
This option tells Html2wml to convert all non-ASCII characters to
numeric entities, i.e., "e" becomes ‘é’, and "ss" becomes
‘ß’. By default, this option is off.
-p, --nopre
This options tells Html2Wml not to use the <pre> tag. This option
was added because the compiler from WML Tools 0.0.4 doesn’t support
this tag.
Links Reconstruction Options
--hreftmpl=TEMPLATE
This options sets the template that will be used to reconstruct the
‘href’-type links. See the section on "LINKS RECONSTRUCTION" for
more information.
--srctmpl=TEMPLATE
This option sets the template that will be used to reconstruct the
‘src’-type links. See the section on "LINKS RECONSTRUCTION" for
more information.
Splitting Options
-s, --max-card-size=SIZE
This option allows you to limit the size (in bytes) of the
generated cards. Default is 1,500 bytes, which should be small
enought to be loaded on most Wap devices. See the section on "DECK
SLICING" for more information.
-t, --card-split-threshold=SIZE
This option sets the threshold of the split event, which can occur
when the size of the current card is between ‘max-card-size’ -
‘card-split-threshold’ and ‘max-card-size’. Default value is 50.
See the section on "DECK SLICING" for more information.
--next-card-label=STRING
This options sets the label of the link that points to the next
card. Default is "[>>]", which whill be rendered as "[>>]".
--prev-card-label=STRING
This options sets the label of the link that points to the previous
card. Default is "[<<]", which whill be rendered as "[<<]".
HTTP Authentication
-U, --http-user=USERNAME
Use this option to set the username for an authenticated request.
-P, --http-passwd=PASSWORD
Use this option to set the password for an authenticated request.
Proxy Support
-[no]Y, --[no]proxy
Use this option to activate proxy support. By default, proxy
support is activated. See the section on "PROXY SUPPORT".
Output Options
-k, --compile
Setting this option tells Html2Wml to use the compiler from WML
Tools to compile the WML deck. If you want to create a real Wap
site, you should seriously use this option in order to reduce the
size of the WML decks. Remember that Wap devices have very little
amount of memory. If this is not enought, use the splitting
options.
Take a look in wml_compilation/ for more information on how to use
a WML compiler with Html2Wml.
-o, --output
Use this option (in shell mode) to specify an output file. By
default, Html2Wml prints the result to standard output.
Debugging Options
-d, --debug[=LEVEL]
This option activates the debug mode. This prints the output result
with line numbering and with the result of the XML check. If the
WML compiler was called, the result is also printed in hexadecimal
an ascii forms. When called as a CGI, all of this is printed as
HTML, so that can use any web browser for that purpose.
--xmlcheck
When this option is on, it send the WML output to XML::Parser to
check its well-formedness.
DECK SLICING
The deck slicing is a feature that Html2Wml provides in order to match
the low memory capabilities of most Wap devices. Many can’t handle
cards larger than 2,000 bytes, therefore the cards must be sufficiently
small to be viewed by all Wap devices. To achieve this, you should
compile your WML deck, which reduce the size of the deck by 50%, but
even then your cards may be too big. This is where Html2Wml comes with
the deck slicing feature. This allows you to limit the size of the
cards, currently only before the compilation stage.
Slice by cards or by decks
On some Wap phones, slicing the deck is not sufficient: the WML browser
still tries to download the whole deck instead of just picking one card
at a time. A solution is to slice the WML document by decks. See the
figure below.
_____________ _____________
│ deck │ │ deck #1 │
│ _________ │ │ _________ │
│ │ card #1 │ │ │ │ card │ │
│ │_________│ │ │ │_________│ │
│ _________ │ │_____________│
│ │ card #2 │ │
│ │_________│ │ . . .
│ _________ │
│ │ ... │ │ _____________
│ │_________│ │ │ deck #n │
│ _________ │ │ _________ │
│ │ card #n │ │ │ │ card │ │
│ │_________│ │ │ │_________│ │
│_____________│ │_____________│
WML document WML document
sliced by cards sliced by decks
What this means is that Html2Wml generates several WML documents. In
CGI mode, only the appropriate deck is sent, selected by the id given
in parameter. If no id was given, the first deck is sent.
Note on size calculation
Currently, Html2Wml estimates the size of the card on the fly, by
summing the length of the strings that compose the WML output, texts
and tags. I say "estimates" and not "calculates" because computing the
exact size would require many more calculations than the way it is done
now. One may objects that there are only additions, which is correct,
but knowing the exact size is not necessary. Indeed, if you compile the
WML, most of the strings of the tags will be removed, but not all.
For example, take an image tag: ‘<img src="images/dog.jpg" alt="Photo
of a dog">’. When compiled, the string ‘"img"’ will be replaced by a
one byte value. Same thing for the strings ‘"src"’ and ‘"alt"’, and
the spaces, double quotes and equal signs will be stripped. Only the
text between double quote will be preserved... but not in every cases.
Indeed, in order to go a step further, the compiler can also encode
parts of the arguments as binary. For example, the string
‘"http://www."’ can be encoded as a single byte (‘8F’ in this case).
Or, if the attribute is ‘href’, the string ‘href="http://’ can become
the byte ‘4B’.
As you see, it doesn’t matter to know exactly the size of the textual
form of the WML, as it will always be far superior to the size of the
compiled form. That’s why I don’t count all the characters that may be
actually written.
Also, it’s because I’m quite lazy ;-)
Why compiling the WML deck?
If you intent to create real WML pages, you should really consider to
always compile them. If you’re not convinced, here is an illustration.
Take the following WML code snipet:
<a href=’http://www.yahoo.com/’>Yahoo!</a>
It’s the basic and classical way to code an hyperlink. It takes 42
bytes to code this, because it is presented in a human-readable form.
The WAP Forum has defined a compact binary representation of WML in its
specification, which is called "compiled WML". It’s a binary format,
therefore you, a mere human, can’t read that, but your computer can.
And it’s much faster for it to read a binary format than to read a
textual format.
The previous example would be, once compiled (and printed here as
hexadecimal):
1C 4A 8F 03 y a h o o 00 85 01 03 Y a h o o ! 00 01
This only takes 21 bytes. Half the size of the human-readable form.
For a Wap device, this means both less to download, and easier things
to read. Therefore the processing of the document can be achieved in a
short time compared to the tectual version of the same document.
There is a last argument, and not the less important: many Wap devices
only read binary WML.
ACTIONS
Actions are a feature similar to (but with far less functionalities!)
the SSI (Server Side Includes) available on good servers like Apache.
In order not to interfere with the real SSI, but to keep the syntax
easy to learn, it differs in very few points.
Syntax
Basically, the syntax to execute an action is:
<!-- [action param1="value" param2=’value’] -->
Note that the angle brackets are part of the syntax. Except for that
point, Actions syntax is very similar to SSI syntax.
Available actions
Only few actions are currently available, but more can be implemented
on request.
include
Description
Includes a file in the document at the current point.
Please note that Html2Wml doesn’t check nor parse the file,
and if the file cannot be found, will silently die (this is
the same behavior as SSI).
Parameters
‘virtual=url’ -- The file is get by http.
‘file=path’ -- The file is read from the local disk.
fsize
Description
Returns the size of a file at the current point of the
document.
Parameters
‘virtual=url’ -- The file is get by http.
‘file=path’ -- The file is read from the local disk.
Notes If you use the file parameter, an absolute path is
recommend.
skip
Description
Skips everything until the first ‘end_skip’ action.
Generic parameters
The following parameters can be used for any action.
for=output format
This paramater restricts the action for the given output format.
Currently, the only available format is "‘wml’" (when using
‘htm.html’ the format is "‘chtml’").
Examples
If you want to share a navigation bar between several WML pages, you
can ‘include’ it this way:
<!-- [include virtual="nav.wml"] -->
Of course, you have to write this navigation bar first :-)
If you want to use your current HTML pages for creating your WML pages,
but that they contains complex tables, or unecessary navigation tables,
etc, you can simply ‘skip’ the complex parts and keep the rest.
<body>
<!--[skip for="wml"]-->
unecessary parts for the WML pages
<!--[end_skip]-->
useful parts for the WML pages
</body>
LINKS RECONSTRUCTION
The links reconstruction engine is IMHO the most important part of
Html2Wml, because it’s this engine that allows you to reconstruct the
links of the HTML document being converted. It has two modes, depending
upon whether Html2Wml was launched from the shell or as a CGI.
When used as a CGI, this engine will reconstructs the links of the HTML
document so that all the urls will be passed to Html2Wml in order to
convert the pointed files (pages or images). This is completly
automatic and can’t be customized for now (but I don’t think it would
be really useful).
When used from the shell, this engine reconstructs the links with the
given templates. Note that absolute URLs will be left untouched. The
templates can be customized using the following syntax.
Templates
HREF Template
This template controls the reconstruction of the ‘href’ attribute
of the ‘A’ tag. Its value can be changed using the --hreftmpl
option. Default value is ‘"{FILEPATH}{FILENAME}{$FILETYPE =~
s/s?html?/wml/o; $FILETYPE}"’.
Image Source Template
This template controls the reconstruction of the ‘src’ attribute of
the ‘IMG’ tag. Its value can be changed using the --srctmpl option.
Default value is ‘"{FILEPATH}{FILENAME}{$FILETYPE =~
s/gif│png│jpe?g/wbmp/o; $FILETYPE}"’
Syntax
The template is a string that contains the new URL. More precisely,
it’s a Text::Template template. Parameters can be interpolated as a
constant or as a variable. The template is embraced between curcly
bracets, and can contain any valid Perl code.
The simplest form of a template is ‘{PARAM}’ which just returns the
value of PARAM. If you want to do something more complex, you can use
the corresponding variable; for example ‘{"foo $PARAM bar"}’, or ‘{join
"_", split " ", PARAM}’.
You may read the Text::Template manpage for more information on what is
possible within a template.
If the original URL contained a query part or a fragment part, then
they will be appended to the result of the template.
Available parameters
URL This parameter contains the original URL from the ‘href’ or ‘src’
attribute.
FILENAME
This parameter contains the base name of the file.
FILEPATH
This parameter contains the leading path of the file.
FILETYPE
This parameter contains the suffix of the file.
This can be resumed this way:
URL = http://www.server.net/path/to/my/page.html
------------^^^^ ----
│ │ \
│ │ \
FILEPATH FILENAME FILETYPE
Note that ‘FILETYPE’ contains all the extensions of the file, so if its
name is index.html.fr for example, ‘FILETYPE’ contains "‘.html.fr’".
Examples
To add a path option:
{URL}$wap
Using Apache, you can then add a Rewrite directive so that URL ending
with ‘$wap’ will be redirected to Html2Wml:
RewriteRule ^(/.*)\$wap$ /cgi-bin/html2wml.cgi?url=$1
To change the extension of an image:
{FILEPATH}{FILENAME}.wbmp
PROXY SUPPORT
Html2Wml uses LWP built-in proxy support. It is activated by default,
and loads the proxy settings from the environment variables, using the
same variables as many others programs. Each protocol (http, ftp, etc)
can be mapped to use a proxy server by setting a variable of the form
‘PROTOCOL_proxy’. Example: use ‘http_proxy’ to define the proxy for
http access, ‘ftp_proxy’ for ftp access. In the shell, this is only a
matter of defining the variable.
For Bourne shell:
$ export http_proxy="http://proxy.domain.com:8080/"
For C-shell:
% setenv http_proxy "http://proxy.domain.com:8080/"
Under Apache, you can add this directive to your configuration file:
SetEnv http_proxy "http://proxy.domain.com:8080"
but this has the default that another CGI, or another program, can use
this to access external ressources. A better way is to edit Html2Wml
and fill the option ‘proxy-server’ with the appropriate value.
CAVEATS
Html2Wml tries to make correct WML documents, but the well-formedness
and the validity of the document are not guarantied.
Inverted tags (like "<b>bold <i>italic</b></i>") may produce unexpected
results. But only bad softwares do bad stuff like this.
LINKS
Download
Html2Wml
This is the web site of the Html2Wml project, hosted by
SourceForge.net. All the stable releases can be downloaded from
this site.
[ http://www.html2wml.org/ ]
Nutialand
This is the web site of the author, where you can find the archives
of all the releases of Html2Wml.
[ http://www.maddingue.org/softwares/ ]
Resources
The WAP Forum
This is the official site of the WAP Forum. You can find some
technical information, as the specifications of all the
technologies associated with the WAP.
[ http://www.wapforum.org/ ]
WAP.com
This site has some useful information and links. In particular, it
has a quite well done FAQ.
[ http://www.wap.com/ ]
The World Wide Web Consortium
Altough not directly related to the Wap stuff, you may find useful
to read the specifications of the XML (WML is an XML application),
and the specifications of the different stylesheet languages (CSS
and XSL), which include support for low-resolution devices.
[ http://www.w3.org/ ]
TuxMobil
This web site is dedicated to Mobile UniX systems. It leads you to
a lot of useful hands-on information about installing and running
Linux and BSD on laptops, PDAs and other mobile computer devices.
[ http://www.tuxmobil.org/ ]
Programmers utilities
HTML Tidy
This is a very handful utility which corrects your HTML files so
that they conform to W3C standards.
[ http://www.w3.org/People/Raggett/tidy ]
Kannel
Kannel is an open source Wap and SMS gateway. A WML compiler is
included in the distribution.
[ http://www.kannel.org/ ]
WML Tools
This is a collection of utilities for WML programmers. This include
a compiler, a decompiler, a viewer and a WBMP converter.
[ http://pwot.co.uk/wml/ ]
WML browsers and Wap emulators
Opera
Opera is originaly a Web browser, but the version 5 has a good
support for XML and WML. Opera is available for free for several
systems.
[ http://www.opera.com/ ]
wApua
wApua is an open source WML browser written in Perl/Tk. It’s easy
to intall and to use. Its support for WML is incomplete, but
sufficient for testing purpose.
[ http://fsinfo.cs.uni-sb.de/~abe/wApua/ ]
Tofoa
Tofoa is an open source Wap emulator written in Python. Its
installation is quite difficult, and its incomplete WML support
makes it produce strange results, even with valid WML documents.
[ http://tofoa.free-system.com/ ]
EzWAP
EzWAP, from EZOS, is a commercial WML browser freely available for
Windows 9x, NT, 2000 and CE. Compared to others Windows WML
browsers, it requires very few resources, and is quite stable. Its
support for the WML specs seems quite complete. A very good
software.
[ http://www.ezos.com/ ]
Deck-It
Deck-It is a commercial Wap phone emulator, available for Windows
and Linux/Intel only. It’s a very good piece of software which
really show how WML pages are rendered on a Wap phone, but one of
its major default is that it cannot read local files.
[ http://www.pyweb.com/tools/ ]
Klondike WAP Browser
Klondike WAP Browser is a commercial WAP browser available for
Windows and PocketPC.
[ http://www.apachesoftware.com/ ]
WinWAP
WinWAP is a commercial Wap browser, freely available for Windows.
[ http://www.winwap.org/ ]
WAPman
WAPman from EdgeMatrix, is a commercial WAP browser available for
Windows and PalmOS.
[
http://www.edgematrix.com/edge/control/MainContentBean?page=downloads
]
Wireless Companion
Wireless Companion, from YourWap.com, is a WAP emulator available
for Windows.
[ http://www.yourwap.com/ ]
Mobilizer
Mobilizer is a Wap emulator available for Windows and Unix.
[ http://mobilizer.sourceforge.net/ ]
QWmlBrowser
QWmlBrowser (formerly known as WML BRowser) is an open source WML
browser, written using the Qt toolkit.
[ http://www.wmlbrowser.org/ ]
Wapsody
Wapsody, developed by IBM, is a freely available simulation
environment that implements the WAP specification. It also features
a WML browser which can be run stand-alone. As Wapsody is written
in Java/Swing, it should work on any system.
[ http://alphaworks.ibm.com/aw.nsf/techmain/wapsody ]
WAPreview
WAPreview is a Wap emulator written in Java. As it uses an HTML
based UI and needs a local web proxy, it runs quite slowly.
[ http://wapreview.sourceforge.net ]
PicoWap
PicoWap is a small WML browser made by three French students.
[ http://membres.lycos.fr/picowap/ ]
ACKNOWLEDGEMENTS
Werner Heuser, for his numerous ideas, advices and his help for the
debugging
Igor Khristophorov, for his numerous suggestions and patches
And all the people that send me bug reports: Daniele Frijia, Axel
Jerabek, Ouyang
AUTHOR
Sebastien Aperghis-Tramoni <sebastien@aperghis.net<gt>
COPYRIGHT
Copyright (C)2000, 2001, 2002 Sebastien Aperghis-Tramoni
This program is free software. You can redistribute it and/or modify it
under the terms of the GNU General Public License, version 2 or later.