SCOUG OS/2 For You - February 1998
Representing Special Characters on the Web
First, I'd like to thank Phil Hoehn, who participates in the
University of California-Stanford Map Librarians list, for prompting me
to write a short piece about the issue of representing special
characters on the Web.
We need to consider two basic issues. The first one is the
specification of the standard for representing characters on the Web
which allows A-Z, a-z, 0-9, and a very small set of certain special
characters (details of which are not covered here) to be represented
directly. The other special characters are those which typically have
an accent mark associated with them as well as a few characters which
are neither letters nor numbers, such as the ampersand and the cent
sign. The ampersand (&) is also a special case because it is the
escape character which triggers the browser to interpret the trailing
string, up to the first semi-colon, as a special character.
The other issue to consider is that different operating systems
(Macs, Windows, Unix, OS/2, OS/390) and, sometimes different programs in
the same operating environment, assume different character sets as a
default. Each character in a set is represented by a different
decimal or hexadecimal (base 16) value. But the ASCII character set
defines only 128 characters, so vendors added their own extensions to
cover everything else: we had IBM's extensions to the ASCII character
set and Microsoft's extensions to the ASCII character set. And the
extensions did not match. For example, in WordPerfect 5.1, the
ligatured letters "a" and "e" are represented by decimal 145.
Ligatured means that the "a" and "e" are stuck together, as in the
example later in this article. Sometimes ligatured letters are called
diphthongs. Meanwhile, in Microsoft Word 6.0a, the same ligatured
combination is represented by decimal 230.
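The mismatch can be checked with a few lines of Python. This is only a sketch: the list of code pages is my own choice for illustration, and pairing WordPerfect 5.1's value of 145 with IBM code page 437 is an assumption on my part, not something the vendors' documentation is quoted for here.

```python
# Look up the single-byte value of the "ae" ligature in several legacy
# character sets. The selection of code pages is an assumption made for
# illustration; it is not taken from the article.
CODE_PAGES = ["cp437", "cp850", "cp1252", "latin-1", "mac_roman"]


def byte_value(char, encoding):
    """Return the decimal byte value of char in a single-byte encoding."""
    encoded = char.encode(encoding)
    if len(encoded) != 1:
        raise ValueError(f"{char!r} is not a single byte in {encoding}")
    return encoded[0]


if __name__ == "__main__":
    # Print one line per code page: its name and the value it uses for the ligature.
    for name in CODE_PAGES:
        print(f"{name:10s} {byte_value('æ', name)}")
```

The IBM PC code page gives 145 and Latin-1 gives 230, which are the two numbers quoted above.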
The net result of these two issues is that representing a particular
special character becomes at least quadruply difficult.
The International Organization for Standardization (ISO) has specified
the standard, ISO 10646, to define multiple-byte character sets, which
are commonly required in non-Roman languages such as Japanese and
Chinese. A proper subset called the Latin-1 character set (ISO 8859-1),
for representing most Roman-language characters, is part of the
character set standard. The
Latin-1 character subset is used by nearly all non-Microsoft operating
systems which use the Roman alphabet as their standard character set
(IBM's mainframe operating systems use EBCDIC). Latin-1 also contains
the 128-character original ASCII with the special characters defined
as shown in the table accompanying this article.
If you are writing a Web page, you can do some semi-easy (but very
tedious) programming, find out the operating system of the target
system, and allow for different character sets. But, if somebody
invents a new operating system and decides to implement yet another
character set, you will need to reprogram the application.
However, the specification of the standard for representing special
characters in HTML on the Web defines a set of names and numbers which
allow you to represent the non-standard characters. This
way, you will not need to depend on the vagaries of various operating
systems to represent your priceless thoughts accurately. Cleverly, the
members of the W3 Consortium who wrote the HTML standard took the
Latin-1 character set to build their special characters table. So, the
very safest way to write your HTML is to use the names of the
characters.
Some special characters and their uses are shown in the table below.
Character | HTML number | Description and usage
Å | 197 | Capital A with ring. This is commonly used in Danish. In the absence of such a character on the keyboard, it is commonly represented in English as a double a, as in the name of a town in Jutland, Aarhus, which should be Århus.
ñ | 241 | Small n with tilde. This is commonly found in Spanish, such as in the proper Spanish spelling of the word we spell canyon in English, where the ny is replaced with the ntilde character to be spelled cañon.
æ | 230 | Small ae ligature. This is commonly used in Latin-based English words. Because my boss prefers to spell archaeology with the aelig character, I have this programmed into nearly all my macros and word processing programs so the word is spelled archæology.
è | 232 | Small e with grave accent. Both egrave and eacute are commonly found in French, with the usage of egrave as in the word premiere, where the second e should be an egrave, as première.
é | 233 | Small e with acute accent. Also common in French, as in the word privee, where the first e should be an eacute, as privée.
ö | 246 | Small o with umlaut. This is commonly found in Germanic languages. For example, a town in southern Sweden is commonly written Malmo in English; the terminal o should be an ouml, as Malmö.
The numeric representations in the center column are those used in
many operating systems (basically, Macs, Unix, and OS/2), and you will
find that they correspond to the numbers in the larger table, derived
directly from the HTML standard, at the end of this article. But if you
use the same numerical coding scheme as Microsoft Word 6.0a when you
code your HTML or other Webness, and somebody is reading your page on a
Mac or Unix box, what you intend them to see will not be there. By the
same token, if your computer is a Mac and you use the non-Microsoft
code, what you intend your reader to see also will not be correct if
the reader lives in a Windows environment.
As an example of how this problem shows up,
some people will see one of the first two lines with an actual
ligatured ae, if they are in the appropriate environment.
Everybody should see the third line as a ligatured ae regardless of the
environment:
First, try it with , that is, 145
Second, try it with æ, that is, 230
Third, try it with æ, that is, with the aelig name
To see how this happens, type the three lines below into a plain
text file and save it as an HTML file (most systems will want an
extension of .htm or .html for this file). Then, load it into your
favorite Web browser (in Netscape/2, click on File and then on Open File
and then on the name of the file you've saved).
First, try it with &#145;, that is, 145
Second, try it with &#230;, that is, 230
Third, try it with &aelig;, that is, with the aelig name
As you can see in the example, the special characters start either
with an ampersand and a pound sign, if they are numeric, or with an
ampersand alone, if names are used. In all three cases, the
specification of the special characters terminates with a semi-colon.
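The two forms can also be told apart and decoded programmatically. The sketch below uses Python's standard html module; the REFERENCE pattern and the describe function are hypothetical names of my own, and the pattern handles only decimal numbers and simple names. Note that html.unescape follows the current HTML parsing rules, which may differ from what a 1998 browser would do with the number 145.

```python
import html
import re

# One character reference: an ampersand, then either a pound sign and
# decimal digits (numeric form) or a name (named form), then a semicolon.
REFERENCE = re.compile(r"&(#\d+|[A-Za-z][A-Za-z0-9]*);")


def describe(ref):
    """Return (kind, character) for a single character reference."""
    match = REFERENCE.fullmatch(ref)
    if match is None:
        raise ValueError(f"not a character reference: {ref!r}")
    kind = "numeric" if match.group(1).startswith("#") else "named"
    return kind, html.unescape(ref)


if __name__ == "__main__":
    # The three references used in the example lines of this article.
    for ref in ["&#145;", "&#230;", "&aelig;"]:
        print(ref, describe(ref))
```

The second and third references both decode to the ligatured ae, while the first does not, which is the same behavior the three example lines are meant to demonstrate.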
So, even though it is a pain in some extremely strategic part of your
anatomy, the safest way to represent special characters is to use the
character names. One way to do this, and to make sure you have it
right, is to write out everything as you normally would and then use
your word processor's or your text editor's mass change function to
change everything to the way it really needs to be. If you use one of
the Web page tools such as PageMill (from Adobe, and fairly costly) or
AOLpress (really free, probably the best thing to come out of AOL),
they generate the special characters properly for you. Also,
fortunately, many of the special character names are mnemonic!
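If you would rather script the mass change than do it by hand, something along these lines should work. It is a minimal sketch, not a finished tool: the function name to_entities is my own invention, and it relies on the name table in Python's standard html.entities module, falling back to the numeric form for any character that has no name there.

```python
from html.entities import codepoint2name


def to_entities(text):
    """Replace every non-ASCII character with an HTML character reference.

    Plain ASCII (including any tags and ampersands already in the text)
    passes through untouched. Named references are used when the standard
    table has a name; otherwise the decimal numeric form is used.
    """
    pieces = []
    for char in text:
        code = ord(char)
        if code < 128:
            pieces.append(char)
        elif code in codepoint2name:
            pieces.append(f"&{codepoint2name[code]};")
        else:
            pieces.append(f"&#{code};")
    return "".join(pieces)


if __name__ == "__main__":
    # Words taken from the table of special characters in this article.
    for word in ["Århus", "archæology", "première", "privée", "Malmö"]:
        print(word, "->", to_entities(word))
```

Because ASCII is left alone, the function can be run over a whole HTML file without disturbing the markup that is already there.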
One item on my to-do list is to take the two tables I have (one for
Microsoft's numbers and one for the numbers used by Unix, Mac, and OS/2)
and blend them into a single document which I will then put out on my
Web site as a .pdf file, latin1cs.pdf. Fortunately for those people who
would like to have a copy of the combined list, I am in the midst of
about 30 weeks of downtime due to a hip replacement, so I should be
able to get this accomplished.
I normally teach a sequence of classes for SHARE and the papers for several of those classes are on my Web site as .pdf files. (I am working on the others.) If you will go to my Web site, click on extracurricular activities and then on classes, you should be able to get them properly. (Files with the .pdf extension are read using the Adobe Acrobat Reader 3.0 or higher. It can be downloaded for free from Adobe -- a pointer on my site will get you there if you do not have this plug-in for your browser.)
If you have questions about this or related Webness,
please feel free to send them to me directly. If the question is fairly general and if you indicate the question is from a member of SCOUG, I will send a copy of the question and answer to Carla to be considered for inclusion in a future issue of the newsletter.
Click here to see the Latin-1 Character Set.
The table was extracted from the HTML 3.2 Reference Specification.
Contact information is:
Virginia R. Hetrick, here in sunny California
Email: drjuice@gte.net
http://home1.gte.net/drjuice
This article is copyright 1998 Virginia R. Hetrick and printed with the author's permission. Most recent update: 04 Feb 1998
The Southern California OS/2 User Group
P.O. Box 26904
Santa Ana, CA 92799-6904, USA
Copyright 1998 the Southern California OS/2 User Group. ALL RIGHTS
RESERVED.
SCOUG is a trademark of the Southern California OS/2 User Group.
OS/2, Workplace Shell, and IBM are registered trademarks of International
Business Machines Corporation.
All other trademarks remain the property of their respective owners.