Title | non-ASCII on non-Unicode Perforce server breaks replicator |
Status | closed |
Priority | essential |
Assigned user | Nick Barnes |
Organization | Ravenbrook |
Description | When non-ASCII characters are stored on a non-Unicode Perforce server (e.g. by users entering them in a changelist description) the P4DTI replicator doesn't know how to interpret them. They are treated as raw binary, and the replicator then breaks when encoding them as (e.g.) Latin-1 or ASCII. |
Analysis | Always use the same encoding for both encoding to and decoding from non-Unicode servers. The most sensible encoding is probably whatever P4Win and/or P4V use. However, research shows that this is locale-dependent. We need a good default option in case this encoding breaks on whatever characters we read from Perforce. The UTF-8 encoding, for instance, balks at any byte in the range 80-bf that does not follow a multi-byte lead byte, such as a lone instance of the common byte 0x92 (which is the Windows-1252 encoding for U+2019 RIGHT SINGLE QUOTATION MARK). Windows-1252 is undefined for bytes 81, 8d, 8f, 90, 9d. Latin-1 has the advantage of being defined (as the identity) on every byte. If we use a fully-defined or mostly-defined encoding, such as Latin-1 or Windows-1252, it might be tolerable to replace undefined characters with a fixed replacement, which is easy in Python (using the "replace" error handler). We can get the locale encoding with (_,encoding) = locale.getdefaultlocale(). Test for existence with codecs.lookup(encoding), and default to "latin-1" if it doesn't exist? |
How found | customer |
Evidence | http://info.ravenbrook.com/mail/2009/04/23/23-14-52/0.txt |
Observed in | 2.4.4 |
Introduced in | 2.4.3 |
Created by | Nick Barnes |
Created on | 2009-04-24 14:02:21 |
Last modified by | Nick Barnes |
Last modified on | 2009-05-07 19:49:08 |
History | 2009-04-24 NB Created |
Change | Effect | Date | User | Description |
---|---|---|---|---|
167927 | closed | 2009-05-06 16:48:44 | Nick Barnes | Improve encoding used for talking to non-Unicode Perforce server. |
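
The approach proposed in the Analysis field can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the replicator's actual code: the names `choose_encoding`, `decode_from_server` and `FALLBACK_ENCODING` are hypothetical, and `locale.getpreferredencoding(False)` stands in for `locale.getdefaultlocale()`, which is deprecated in recent Python 3 releases.

```python
import codecs
import locale

# Latin-1 maps every byte value to a character, so decoding cannot fail.
FALLBACK_ENCODING = "latin-1"


def choose_encoding(candidate=None):
    """Return a usable codec name, falling back to Latin-1.

    Hypothetical helper: if no candidate is given, ask the locale for
    its preferred encoding (a stand-in for locale.getdefaultlocale()[1]).
    """
    if candidate is None:
        candidate = locale.getpreferredencoding(False)
    try:
        # codecs.lookup raises LookupError for an unknown codec name.
        codecs.lookup(candidate)
    except (LookupError, TypeError):
        return FALLBACK_ENCODING
    return candidate


def decode_from_server(raw, encoding):
    """Decode bytes from the server with the "replace" error handler.

    Bytes that the codec cannot decode become U+FFFD instead of
    raising UnicodeDecodeError.
    """
    return raw.decode(encoding, "replace")
```

Under these assumptions, `decode_from_server(b"it\x92s", "cp1252")` yields the text with U+2019 in place of byte 0x92, whereas decoding the same bytes as UTF-8 yields U+FFFD there, because a lone 0x92 is not valid UTF-8. The same encoding name would also be used when encoding text back to the server, in line with the "always use the same encoding" rule.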