The views expressed in articles on this site are those of their authors.


Copyright 1998-2021, Southern California OS/2 User Group. ALL RIGHTS RESERVED.

SCOUG, Warp Expo West, and Warpfest are trademarks of the Southern California OS/2 User Group. OS/2, Workplace Shell, and IBM are registered trademarks of International Business Machines Corporation. All other trademarks remain the property of their respective owners.

The Southern California OS/2 User Group
P.O. Box 26904
Santa Ana, CA 92799-6904, USA

SCOUG OS/2 For You - February 1998


Representing Special Characters on the Web

by Virginia Hetrick

First, I'd like to thank Phil Hoehn, who participates in the University of California-Stanford Map Librarians list, for prompting me to write a short piece about the issue of representing special characters on the Web.

We need to consider two basic issues. The first is the standard for representing characters on the Web, which allows A-Z, a-z, 0-9, and a very small set of special characters (details of which are not covered here) to be represented directly. The remaining special characters are typically those with an accent mark, plus a few characters which are neither letters nor numbers, such as the ampersand and the cent sign. The ampersand (&) is also a special case because it is the escape character which tells the browser to interpret the string that follows, up to the first semi-colon, as a special character.

The other issue to consider is that different operating systems (Macs, Windows, Unix, OS/2, OS/390) and, sometimes, different programs in the same operating environment assume different character sets as a default. Each character in the set is represented by a different decimal or hexadecimal (base 16) value. But the ASCII character set itself defines only 128 characters. So, we had IBM's extensions to the ASCII character set and Microsoft's extensions to the ASCII character set. And the extensions did not match. For example, in WordPerfect 5.1, the ligatured letters "a" and "e" are represented by decimal 145. Ligatured means that the "a" and "e" are stuck together, as in the example later in this article. Ligatured letters are sometimes loosely called diphthongs. Meanwhile, in Microsoft Word 6.0a, the same ligatured combination is represented by decimal 230.
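The mismatch between extensions can be checked with a few lines of Python. This is only a sketch: it assumes that the 145 above corresponds to IBM PC code page 437 (`cp437`) and that the Microsoft side corresponds to Windows code page 1252 (`cp1252`); the helper name `decode_byte` is made up for the illustration.

```python
# One byte value can mean different characters under different character sets.
# Assumption: "cp437" stands in for IBM's extensions and "cp1252" for
# Microsoft's; the article itself does not name the code pages.

def decode_byte(value: int, encoding: str) -> str:
    """Interpret a single byte value under the given character set."""
    return bytes([value]).decode(encoding)

ae_ibm = decode_byte(145, "cp437")          # ligatured ae on the IBM PC code page
quote_windows = decode_byte(145, "cp1252")  # a left single quote on Windows
ae_latin1 = decode_byte(230, "latin-1")     # ligatured ae in Latin-1
ae_windows = decode_byte(230, "cp1252")     # ligatured ae on Windows as well
micro_ibm = decode_byte(230, "cp437")       # the micro sign on the IBM PC code page

for label, char in [
    ("145 in cp437", ae_ibm),
    ("145 in cp1252", quote_windows),
    ("230 in latin-1", ae_latin1),
    ("230 in cp1252", ae_windows),
    ("230 in cp437", micro_ibm),
]:
    print(f"{label:>15}: {char}")
```

Under these assumptions, neither 145 nor 230 means the same thing everywhere, which is exactly the trap described above.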

The net result of these two issues is that representing a particular special character becomes at least quadruply difficult.

The International Organization for Standardization (ISO) has specified the standard, ISO 10646, to define multiple-byte character sets, which are required for languages written in non-Roman scripts, such as Japanese and Chinese. A proper subset called the Latin-1 character set (ISO 8859-1), for representing most Roman-language characters, is part of the character set standard. The Latin-1 character subset is used by nearly all non-Microsoft operating systems which use the Roman alphabet as their standard character set (IBM's mainframe operating systems use EBCDIC). Latin-1 also contains the original 128-character ASCII set, with the special characters defined as shown in the table accompanying this article.
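The subset relationship can be verified directly. The sketch below relies on the fact that Python's `chr()` and `ord()` work in Unicode code points, which are aligned with ISO 10646; the variable names are illustrative only.

```python
# Latin-1 byte values 0-255 should coincide with the first 256 code points of
# the ISO 10646 / Unicode repertoire, and byte values 0-127 with plain ASCII.
latin1_matches = all(
    bytes([value]).decode("latin-1") == chr(value) for value in range(256)
)
ascii_matches = all(
    bytes([value]).decode("ascii") == chr(value) for value in range(128)
)

# The ligatured ae therefore has the same number in both numbering schemes.
ae_number = ord("\u00e6")

print(latin1_matches, ascii_matches, ae_number)
```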

If you are writing a Web page, you can do some semi-easy (but very tedious) programming, find out the operating system of the target system, and allow for different character sets. But, if somebody invents a new operating system and decides to implement yet another character set, you will need to reprogram the application.

However, the standard for representing special characters in HTML on the Web defines a set of names which allow you to represent the non-standard characters by names or numbers. This way, you will not need to depend on the vagaries of various operating systems to represent your priceless thoughts accurately. Cleverly, the members of the World Wide Web Consortium (W3C) who wrote the HTML standard took the Latin-1 character set to build their special characters table. So, the very safest way to write your HTML is to use the names of the characters.

Some special characters and their uses are shown in the table below.
| Character | HTML number | Description and usage |
|---|---|---|
| Å | 197 | Capital A with ring. This is commonly used in Danish. In the absence of such a character on the keyboard, it is commonly represented in English as a double a, as in the name of a town in Jutland, Aarhus, which should be Århus. |
| ñ | 241 | Small n with tilde. This is commonly found in Spanish, such as in the proper Spanish spelling of the word we spell canyon in English, where the ny is replaced with the ntilde character to be spelled cañon. |
| æ | 230 | Small ae ligature. This is commonly used in Latin-based English words. Because my boss prefers to spell archaeology with the aelig character, I have this programmed into nearly all my macros and word processing programs so the word is spelled archæology. |
| è | 232 | Small e with grave accent. Both egrave and eacute are commonly found in French. The egrave is used in the word premiere, where the second e should be an egrave, as première. |
| é | 233 | Small e with acute accent. The eacute is used in the word privee, where the first e should be an eacute, as privée. |
| ö | 246 | Small o with umlaut. This is commonly found in Germanic languages. For example, a town in southern Sweden is commonly written Malmo in English; the terminal o should be an ouml, as Malmö. |
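For readers who would rather look the numbers up than memorize them, Python's standard library ships a copy of the HTML entity table in `html.entities`. The sketch below prints the name, the numeric form, and the character for the six names used in this table; the `names` list and `table` dictionary are just for illustration.

```python
from html.entities import name2codepoint

# Entity names for the six characters discussed in the table.
names = ["Aring", "ntilde", "aelig", "egrave", "eacute", "ouml"]

# Map each name to its number as defined in the HTML entity table.
table = {name: name2codepoint[name] for name in names}

for name, number in table.items():
    # Show the named form, the numeric form, and the character itself.
    print(f"&{name};  &#{number};  {chr(number)}")
```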

The numeric representations in the center column are those used in many operating systems (basically, Macs, Unix, and OS/2), and they correspond to the numbers in the larger table, derived directly from the HTML standard, at the end of this article. But if you use the same numerical coding scheme as Microsoft Word 6.0a when you code your HTML or other Webness, and somebody reads your page on a Mac or Unix box, what you intend them to see will not be there. By the same token, if your computer is a Mac and you use the non-Microsoft code, what you intend your reader to see also will not be correct if the reader lives in a Windows environment.

As an example of how this problem shows up, some people will see one of the first two lines with an actual ligatured ae, if they are in the appropriate environment. Everybody should see the third line as a ligatured ae regardless of the environment:

First, try it with ‘, that is, 145
Second, try it with æ, that is, 230
Third, try it with æ, that is, with the aelig name

To see how this happens, type the three lines below into a plain text file and save it as an HTML file (most systems will want an extension of .htm or .html for this file). Then, load it into your favorite Web browser (in Netscape/2, click on File and then on Open File and then on the name of the file you've saved).

First, try it with &#145;, that is, 145
Second, try it with &#230;, that is, 230
Third, try it with &aelig;, that is, with the aelig name

As you can see in the example, the special characters start either with an ampersand and a pound sign, if they are numeric, or with an ampersand alone, if names are used. In all three cases, the specification of the special characters terminates with a semi-colon.
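If no browser is handy, the same three lines can be run through Python's `html.unescape`, which resolves both numeric and named references. Treat this as a rough stand-in for a browser, not a replica of one: `html.unescape` follows the HTML5 rules, under which 145 is read the Windows way (as a left single quote), so it shows only one of the environments discussed above.

```python
import html

# The three test lines, written with literal character references.
lines = [
    "First, try it with &#145;, that is, 145",
    "Second, try it with &#230;, that is, 230",
    "Third, try it with &aelig;, that is, with the aelig name",
]

# Resolve every reference (ampersand ... semi-colon) into a character.
decoded = [html.unescape(line) for line in lines]

for line in decoded:
    print(line)
```

Only the second and third lines come out with a ligatured ae here, and only the third does not depend on a number at all.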

So, even though it is a pain in some extremely strategic part of your anatomy, the safest way to represent special characters is to use the character names. One way to do this, and to make sure you have it right, is to write out everything as you normally would and then use your word processor's or text editor's mass change function to change everything to the way it really needs to be. If you use one of the Web page tools such as PageMill (from Adobe, and fairly costly) or AOLpress (really free, probably the best thing to come out of AOL), they generate the special characters properly for you. Also, fortunately, many of the special character names are mnemonic!
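The mass change can also be scripted. The following is a minimal sketch, not a replacement for a real authoring tool: the function name `to_named_entities` is hypothetical, and it deliberately leaves plain ASCII alone (including `&` and `<`), on the assumption that the text may already contain HTML tags.

```python
from html.entities import codepoint2name

def to_named_entities(text: str) -> str:
    """Replace each non-ASCII character with a named reference where the
    HTML entity table has one, and with a numeric reference otherwise."""
    pieces = []
    for char in text:
        code = ord(char)
        if code < 128:
            pieces.append(char)  # plain ASCII passes through untouched
        elif code in codepoint2name:
            pieces.append(f"&{codepoint2name[code]};")  # named form
        else:
            pieces.append(f"&#{code};")  # numeric fallback
    return "".join(pieces)

converted = to_named_entities("archæology, première, privée, Malmö, Århus")
print(converted)
# prints: arch&aelig;ology, premi&egrave;re, priv&eacute;e, Malm&ouml;, &Aring;rhus
```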

One item on my to-do list is to take the two tables I have (one for Microsoft's numbers and one for the numbers used by Unix, Mac, and OS/2) and blend them into a single document which I will then put out on my Web site as a .pdf file, latin1cs.pdf. Fortunately for those people who would like to have a copy of the combined list, I am in the midst of about 30 weeks of downtime due to a hip replacement, so I should be able to get this accomplished.

I normally teach a sequence of classes for SHARE and the papers for several of those classes are on my Web site as .pdf files. (I am working on the others.) If you will go to my Web site, click on extracurricular activities and then on classes, you should be able to get them properly. (Files with the .pdf extension are read using the Adobe Acrobat Reader 3.0 or higher. It can be downloaded for free from Adobe -- a pointer on my site will get you there if you do not have this plug-in for your browser.)

If you have questions about this or related Webness, please feel free to send them to me directly. If the question is fairly general and if you indicate the question is from a member of SCOUG, I will send a copy of the question and answer to Carla to be considered for inclusion in a future issue of the newsletter.

Click here to see the Latin-1 Character Set. The table was extracted from the HTML 3.2 Reference Specification.

Contact information is:
Virginia R. Hetrick, here in sunny California

Email: drjuice@gte.net

http://home1.gte.net/drjuice

This article is copyright 1998 Virginia R. Hetrick and printed with the author's permission. Most recent update: 04 Feb 1998

