From 6fe6cc890178033df801668dc735ce6403cf4545 Mon Sep 17 00:00:00 2001
From: "Edward Z. Yang"
Date: Sat, 1 Nov 2008 01:51:51 -0400
Subject: [PATCH] Update gitignore with post-release files, new NEWS entry and spellcheck UTF-8.

Signed-off-by: Edward Z. Yang
---
 .gitignore             |  3 +++
 NEWS                   |  2 ++
 docs/enduser-utf8.html | 26 +++++++++++++-------------
 3 files changed, 18 insertions(+), 13 deletions(-)

diff --git a/.gitignore b/.gitignore
index 9d342577..65853502 100644
--- a/.gitignore
+++ b/.gitignore
@@ -3,8 +3,11 @@ test-settings.php
 library/HTMLPurifier/DefinitionCache/Serializer/*/
 library/standalone/
 library/HTMLPurifier.standalone.php
+library/HTMLPurifier*.tgz
+library/package*.xml
 configdoc/*.html
 configdoc/configdoc.xml
+docs/doxygen*
 *.phpt.diff
 *.phpt.exp
 *.phpt.log
diff --git a/NEWS b/NEWS
index 3c8ec4fa..45dd6d3e 100644
--- a/NEWS
+++ b/NEWS
@@ -9,6 +9,8 @@ NEWS ( CHANGELOG and HISTORY ) HTMLPurifier
 . Internal change
 ==========================

+3.3.0, unknown release date
+
 3.2.0, released 2008-10-31
 # Using %Core.CollectErrors forces line number/column tracking on, whereas
   previously you could theoretically turn it off.
diff --git a/docs/enduser-utf8.html b/docs/enduser-utf8.html
index 9ff9da4c..6882c7a4 100644
--- a/docs/enduser-utf8.html
+++ b/docs/enduser-utf8.html
@@ -481,7 +481,7 @@ if we don't know it's character encoding? And how do we figure out the character encoding, if we don't know the contents of the META tag?

-

Fortunantely for us, the characters we need to write the +

Fortunately for us, the characters we need to write the META are in ASCII, which is pretty much universal over every character encoding that is in common use today. So, all the web-browser has to do is parse all the way down until @@ -526,7 +526,7 @@ you don't have to use those user-unfriendly entities.

User-friendly

-

Websites encoded in Latin-1 (ISO-8859-1) which ocassionally need +

Websites encoded in Latin-1 (ISO-8859-1) which occasionally need a special character outside of their scope often will use a character entity reference to achieve the desired effect. For instance, θ can be written θ, regardless of the character encoding's @@ -584,7 +584,7 @@ disappeared off the web, so I am linking to the Web Archive copy.)

application/x-www-form-urlencoded

This is the Content-Type that GET requests must use, and POST requests -use by default. It involves the ubiquituous percent encoding format that +use by default. It involves the ubiquitous percent encoding format that looks something like: %C3%86. There is no official way of determining the character encoding of such a request, since the percent encoding operates on a byte level, so it is usually assumed that it @@ -674,7 +674,7 @@ it up to the module iconv to do the dirty work.

This approach, however, is not perfect. iconv is blithely unaware of HTML character entities. HTML Purifier, in order to protect against sophisticated escaping schemes, normalizes all character -and numeric entitie references before processing the text. This leads to +and numeric entity references before processing the text. This leads to one important ramification:

Any character that is not supported by the target character @@ -770,7 +770,7 @@ the text when you try to convert it to UTF-8. You'll have to convert it to a binary field, convert it to a Shift-JIS field (the real encoding), and then finally to UTF-8. Many a website had pages irreversibly mangled because they didn't realize that they'd been deluding themselves about -the character encoding all along, don't become the next victim.

+the character encoding all along; don't become the next victim.

For PostgreSQL, there appears to be no direct way to change the encoding of a database (as of 8.2). You will have to dump the data, and then reimport @@ -790,7 +790,7 @@ usually supported).

Binary

-

Due to the abovementioned compatibility issues, a more interoperable +

Due to the aforementioned compatibility issues, a more interoperable way of storing UTF-8 text is to stuff it in a binary datatype. CHAR becomes BINARY, VARCHAR becomes VARBINARY and TEXT becomes BLOB. @@ -917,8 +917,8 @@ anyway. So we'll deal with the other two edge cases.

would like to read your website but get heaps of question marks or other meaningless characters. Fixing this problem requires the installation of a font or language pack which is often highly -dependent on what the language is. Here is an example -of such a help file for the Bengali language, I am sure there are +dependent on what the language is. Here is an example +of such a help file for the Bengali language; I am sure there are others out there too. You just have to point users to the appropriate help file.

@@ -928,7 +928,7 @@ help file.

characters embedded in what otherwise would be very bland ASCII are letters of the International -Phonetic Alphabet (IPA), use to designate pronounciations in a very standard +Phonetic Alphabet (IPA), use to designate pronunciations in a very standard manner (you probably see them all the time in your dictionary). Your average font probably won't have support for all of the IPA characters like ʘ (bilabial click) or ʒ (voiced postalveolar fricative). @@ -941,11 +941,11 @@ most widely used browser in the entire world? Microsoft IE 6 is not smart enough to borrow from other fonts when a character isn't present, so more often than not you'll be slapped with a nice big �. To get things to work, MSIE 6 needs a little nudge. You could configure it -to use a different font to render the text, but you can acheive the same +to use a different font to render the text, but you can achieve the same effect by selectively changing the font for blocks of special characters to known good Unicode fonts.

-

Fortunantely, the folks over at Wikipedia have already done all the +

Fortunately, the folks over at Wikipedia have already done all the heavy lifting for you. Get the CSS from the horses mouth here: Common.css, and search for ".IPA" There are also a smattering of @@ -972,7 +972,7 @@ users.

Dealing with variable width in functions

When people claim that PHP6 will solve all our Unicode problems, they're -misinformed. It will not fix any of the abovementioned troubles. It will, +misinformed. It will not fix any of the aforementioned troubles. It will, however, fix the problem we are about to discuss: processing UTF-8 text in PHP.

@@ -1035,7 +1035,7 @@ directory.

Well, that's it. Hopefully this document has served as a very practical springboard into knowledge of how UTF-8 works. You may have decided that you don't want to migrate yet: that's fine, just know -what will happen to your output and what bug reports you may recieve.

+what will happen to your output and what bug reports you may receive.

Many other developers have already discussed the subject of Unicode, UTF-8 and internationalization, and I would like to defer to them for -- 2.11.4.GIT