#119 closed defect (fixed)
Windows oem character set produces ParseError: not well-formed (invalid token)
Reported by: | silver.aabjoe@… | Owned by: | osimons |
---|---|---|---|
Priority: | critical | Milestone: | 0.6 |
Component: | Build slave | Version: | 0.5.3 |
Keywords: | Cc: | anmar@…, jevans | |
Operating System: |
Description
i have bitten 0.5.2, track 0.9.4, python 2.3.5 the build recipe looks like this
<build xmlns:sh="http://bitten.cmlenz.net/tools/sh"> <step id="build" description="Compile to byte code"> <sh:exec executable="dir"/> </step> </build>
Even if slave reports build compleated successfully the master displays the following:
not well-formed (invalid token): line 1, column 925 Traceback (most recent call last): File "/usr/lib/python2.3/asyncore.py", line 69, in read obj.handle_read_event() File "/usr/lib/python2.3/asyncore.py", line 390, in handle_read_event self.handle_read() File "/usr/lib/python2.3/asynchat.py", line 136, in handle_read self.found_terminator() File "build/bdist.linux-i686/egg/bitten/util/beep.py", line 278, in found_terminator File "build/bdist.linux-i686/egg/bitten/util/beep.py", line 311, in _handle_frame File "build/bdist.linux-i686/egg/bitten/util/beep.py", line 469, in handle_data_frame File "build/bdist.linux-i686/egg/bitten/master.py", line 216, in handle_reply File "build/bdist.linux-i686/egg/bitten/util/xmlio.py", line 182, in parse ParseError: not well-formed (invalid token): line 1, column 925
and stops responding to other slaves.
Attachments (7)
Change History (40)
comment:1 Changed 19 years ago by anonymous
- Version changed from 0.5.2 to 0.5.3
comment:2 Changed 18 years ago by Axel.Thimm@…
- Component changed from Build master to BEEP implementation
- Summary changed from ParseError: not well-formed (invalid token): line 1, column 925 to ParseError: not well-formed (invalid token): line X, column Y
I have a similar issue, but on the slave side and I think this is due to how xml is sometimes being passed over the wire, so I'm changing the component to BEEP.
The system I'm working with is
- master FC4/i386 (python 2.4.1)
- slave FC5/x86_64 (2.4.2)
- trac 1.0dev
- bitten 0.5.3
When the build master is about to send the build recipe the wire-sniffed exchange shows that every character of the build recipe has been padded with three zeros to the right. Looking into the database shows that the build recipe has been stored OK. Maybe this has to do with 64 vs 32 bits (although the padding happens on the master's side which is 32 bits and doesn't know what the slave is other than through os/machine)?
This is the slave's error:
[ERROR ] not well-formed (invalid token): line 1, column 1 Traceback (most recent call last): File "/usr/lib64/python2.4/asyncore.py", line 69, in read obj.handle_read_event() File "/usr/lib64/python2.4/asyncore.py", line 391, in handle_read_event self.handle_read() File "/usr/lib64/python2.4/asynchat.py", line 137, in handle_read self.found_terminator() File "/usr/lib/python2.4/site-packages/bitten/util/beep.py", line 278, in found_terminator self._handle_frame(self.header, self.payload) File "/usr/lib/python2.4/site-packages/bitten/util/beep.py", line 311, in _handle_frame payload) File "/usr/lib/python2.4/site-packages/bitten/util/beep.py", line 465, in handle_data_frame self.profile.handle_msg(msgno, payload) File "/usr/lib/python2.4/site-packages/bitten/slave.py", line 98, in handle_msg elem = xmlio.parse(payload.body) File "/usr/lib/python2.4/site-packages/bitten/util/xmlio.py", line 182, in parse raise ParseError, e ParseError: not well-formed (invalid token): line 1, column 1
And this is the build recipe:
<build xmlns:c="http://bitten.cmlenz.net/tools/c"> <step id="build" description="Configure and build"> <c:configure /> </step> </build>
This is what sniffing reveals:
RPY 0 0 . 0 49 Content-Type: application/beep+xml <greeting/>END RPY 0 0 . 0 119 Content-Type: application/beep+xml <greeting><profile uri="http://bitten.cmlenz.net/beep/orchestration"/></greeting>END MSG 0 0 . 49 124 Content-Type: application/beep+xml <start number="1"><profile uri="http://bitten.cmlenz.net/beep/orchestration"/></start>END RPY 0 0 . 119 98 Content-Type: application/beep+xml <profile uri="http://bitten.cmlenz.net/beep/orchestration"/>END MSG 1 0 . 0 233 Content-Type: application/beep+xml <register name="octopus"><platform processor="x86_64">x86_64</platform><os version="2.6.16-1.2133_FC5xen0" family="posix">Linux</os><package version="4.1.1" type="gfortran" name="fc"/></register>END RPY 1 0 . 0 43 Content-Type: application/beep+xml <ok/>END MSG 1 0 . 43 626 Content-Type: application/beep+xml <...b...u...i...l...d... ...x...m...l...n...s...:...c...=..."...h...t...t...p...:.../.../...b...i...t...t...e...n.......c...m...l...e...n...z.......n...e...t.../...t...o...o...l...s.../...c..."...>... ... ... ... ...<...s...t...e...p... ...i...d...=..."...b...u...i...l...d..."... ...d...e...s...c...r...i...p...t...i...o...n...=..."...C...o...n...f...i...g...u...r...e... ...a...n...d... ...b...u...i...l...d..."...>... ... ... ... ... ... ...<...c...:...c...o...n...f...i...g...u...r...e... .../...>... ... ... ... ...<.../...s...t...e...p...>... ... ...<.../...b...u...i...l...d...>...END
The dots are zero bytes.
Thanks!
comment:3 Changed 18 years ago by Axel.Thimm@…
- Cc Axel.Thimm@… added
comment:4 Changed 18 years ago by Angel Marin <anmar@…>
- Cc anmar@… added
I think the original problem reported and Alex one aren't the same. I've been able to reproduce the original problem under windows with trunk code and it's a windows fault related only to sh:exec recipe command.
I have tracked down the problem to the way windows encodes output for command line apps. The output for the command line interpreter is encoded using the old codepages used by DOS. So when the output is read back by the slave from the temp file, it uses system's default encoding and non ASCII chars are just garbage that the master won't be able to parse.
Forcing the conversion of the read string to utf does not work as python thinks it has read pure ascii content and fails the conversion. Ideas?
comment:5 Changed 18 years ago by Silver Aabjõe <silver.aabjoe@…>
I tried running the dir in unicode output
<build xmlns:sh="http://bitten.cmlenz.net/tools/sh"> <step id="build" description="Compile to byte code"> <sh:exec executable="cmd" args="/U /C dir"/> </step> </build>
but it did not work.
comment:6 Changed 18 years ago by Angel Marin <anmar@…>
After much digging I haven't found anything in python that knows what the default oem codepage is under windows, so the response for commands can't be re-encoded properly without that info. It doesn't matter how you execute the command (popen2.popen3, os.system, ...) if you don't know the encoding used you're lost.
Perhaps something similar to trac's util.to_utf8 could be used here. Master would be able to parse the response though any non-ascii char would be trashed. Other possibility would be to add a config option for this?
The attached patch just converts the response lines to cp850' which is the codepage used by my local windows copy. Of course that's not a solution, just a quick workaround to get it working.
comment:7 Changed 18 years ago by Silver Aabjõe <silver.aabjoe@…>
Under windows there is a utility that displays and allows to set the active code page. Namely CHCP.
comment:8 follow-up: ↓ 9 Changed 18 years ago by anonymous
The attached patch gets the configured OEM codepage calling GetOEMCP win32 function through ctypes and uses it to re-encode the response lines to utf-8.
Though perhaps it's too much overhead to add a dependency on ctypes just for this. Anyway ctypes is included in Python 2.5.
comment:9 in reply to: ↑ 8 Changed 18 years ago by Silver Aabjõe <silver.aabjoe@…>
Seems to work Replying to anonymous:
The attached patch gets the configured OEM codepage calling GetOEMCP win32 function through ctypes and uses it to re-encode the response lines to utf-8.
Though perhaps it's too much overhead to add a dependency on ctypes just for this. Anyway ctypes is included in Python 2.5.
comment:10 Changed 17 years ago by cmlenz
- Component changed from BEEP implementation to Build slave
This is probably not stricly related to the BEEP stuff.
comment:11 Changed 17 years ago by cmlenz
- Milestone set to 0.6
- Status changed from new to assigned
Need better non-ASCII command-line output support in the next release.
comment:12 Changed 16 years ago by dfraser
- Summary changed from ParseError: not well-formed (invalid token): line X, column Y to Windows oem character set produces ParseError: not well-formed (invalid token)
comment:13 Changed 15 years ago by osimons
#115 closed as duplicate.
comment:14 Changed 15 years ago by osimons
- Owner changed from cmlenz to osimons
- Status changed from assigned to new
#151 closed as duplicate, and see also #243 and the currently applied workaround that strips all non-7-bit-ascii characters. The xml parsing needs unicode support, and use a common encoding of utf-8 for all communication.
I'll put it on my todo, and see if I can come up with something that works ok.
comment:15 Changed 15 years ago by osimons
#350 closed as duplicate.
comment:16 Changed 15 years ago by osimons
#142 closed as duplicate (but regarding no longer used beep.py).
comment:17 follow-up: ↓ 18 Changed 15 years ago by osimons
I've made a stab at the issue of unicode and encoding. See attachment:t119-unicode_xml-r655.diff.
This attachment:
- Makes the various xmlio classes aware of encoding - they can accept unicode and utf-8 strings and anything else that cleanly decodes as unicode (like numbers), and any 'xml-stringify-get' will return utf-8 strings (the common xml encoding).
- Allows create+update of recipes that includes non-ascii characters.
- Runs such recipes on various platforms.
- Encodes input for commands and decodes output.
- Removes the old code that just strips away any non-printable/non-ascii7 characters.
On a utf-8 platform like OSX and most current Linux this works quite well, and a simple 'echo' step completes a full cycle with identical result. On Windows, not so well. Windows results are recognizable but not identical, and in my testing a number of characters are wrong or replaced with '?'. Regardless, it does the best it can and the encoding/decoding has yet to result in Bitten errors so far in my testing.
Another interesting artifact is that all 221 tests pass unchanged.
The major problem with encoding is how to deal with commands that return output in another decoding than platform default. On OSX my Terminal and the Bitten slave uses utf-8. However, locale.getpreferredlocale() returns 'mac-roman' so that any external command may return that - or really any encoding that it may employ.
A simple illustration is a python script printing non-ascii characters, and where the script is coded as utf-8 using the common # -*- coding: utf-8 -*- declaration. On OSX that executes on returns nicely, but on Windows the patch will try to convert the utf-8 output using whatever cpxxxx is set in sys.stdout.encoding. The non-ascii part of the result is unrecognizable. Same source, same build script, same output, but no way of knowing how it should be decoded. As the Zen of Python says, "In the face of ambiguity, refuse the temptation to guess"....
Solving the last issue will require a re-think of how recipes are specified, and I've been playing with the idea of an optional encoding= attributes on commands (not part of the patch). That would work for executed programs that return a specific encoding regardless of platform, and would then override the slave-system-default that would otherwise be used.
An encoding= attribute or similar would mean that the encoding would need to be passed through recipe build calls, into each command function, and on through to whatever lower-level mechanism the command employs to get its result - usually the CommandLine class but not necessarily. And of course all the way back again. That requires breaking internal changes, so I don't think that is a good idea before 0.7 even if the basic and unobtrusive encoding-support got enabled now for 0.6.
Please test the patch, and any feedback or help with improving it is much appreciated.
comment:18 in reply to: ↑ 17 Changed 15 years ago by cmlenz
Replying to osimons:
A simple illustration is a python script printing non-ascii characters, and where the script is coded as utf-8 using the common # -*- coding: utf-8 -*- declaration. On OSX that executes on returns nicely, but on Windows the patch will try to convert the utf-8 output using whatever cpxxxx is set in sys.stdout.encoding. The non-ascii part of the result is unrecognizable. Same source, same build script, same output, but no way of knowing how it should be decoded. As the Zen of Python says, "In the face of ambiguity, refuse the temptation to guess"....
Still, I think we should try decoding as UTF-8 first, and only if that fails fallback to the default encoding.
The nice thing about UTF-8 is that there's a very high probably it'll flag something that isn't actually UTF-8 as an error when decoding. So it's not so much guessing as trying a sane & safe default first ;)
comment:19 Changed 15 years ago by osimons
I've update the patch (attachment:t119-unicode_xml-r655.2.diff).
This tries to decode UTF-8 first as cmlenz suggested, and also uses new _to_utf8() and _from_utf8() functions to make the code more rational. Also added some small tests - one of them I can't get to work yet due to some doctest handling of StringIO. Suggestions and improvements welcome...
comment:20 Changed 15 years ago by jevans
- Cc jevans added
comment:21 follow-up: ↓ 22 Changed 15 years ago by osimons
See also #355 for somewhat related issue - characters inside regular ascii that are not parseable.
comment:22 in reply to: ↑ 21 Changed 15 years ago by osimons
Replying to osimons:
See also #355 for somewhat related issue - characters inside regular ascii that are not parseable.
My latest patch actually strips away this code again (that is marked as solved in #243).
Reading up on it: http://www.w3.org/TR/xml11/#charsets
As matching against string.printable is no longer an option (very limited pure ascii), it seems the best way would be to strip characters in the following ranges:
[#x1-#x8], [#xB-#xC], [#xE-#x1F], [#x7F-#x84], [#x86-#x9F]
I'll update the patch.
Changed 15 years ago by osimons
Updated patch that also filters restricted/control characters according to XML spec.
comment:23 follow-up: ↓ 25 Changed 15 years ago by osimons
I've added a filter to remove the most common characters from the XML spec, essentially adding this to the escaping:
text = filter(lambda c: ord(c) not in [ 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0xB, 0xC, 0xE, 0xF, 0x10, 0x12, 0x13, 0x14, 0x15, 0x15, 0x17, 0x18, 0x19, 0x1A, 0x1B, 0x1C, 0x1D, 0x1E, 0x1F, 0x7F, 0x80, 0x81, 0x82, 0x83, 0x84, 0x86, 0x87, 0x88, 0x89, 0x8A, 0x8B, 0x8C, 0x8D, 0x8E, 0x8F, 0x90, 0x91, 0x92, 0x93, 0x94, 0x95, 0x96, 0x97, 0x98, 0x99, 0x9A, 0x9B, 0x9C, 0x9D, 0x9E, 0x9F], text)
Please test.
comment:24 Changed 15 years ago by Axel.Thimm@…
- Cc Axel.Thimm@… removed
comment:25 in reply to: ↑ 23 ; follow-up: ↓ 26 Changed 15 years ago by mgood
Replying to osimons:
I've added a filter to remove the most common characters from the XML spec, essentially adding this to the escaping:
That sounds reasonable, though the translate method that's used currently does this much faster since it's written in C.
It works slightly differently with unicode objects where it takes a dict mapping code points to a replacement with None deleting that character, but this example would work for either:
_trans = string.maketrans('', '') __todel = ('\x01\x02\x03\x04\x05\x06\x07\x08\x0b\x0c\x0e\x0f\x10\x12\x13\x14' '\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f\x7f\x80\x81\x82\x83' '\x84\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94' '\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f') __uni_trans = dict([(ord(c), None) for c in __todel]) if isinstance(text, str): text = text.translate(__trans, __todel) elif isinstance(text, unicode): text = text.translate(__uni_trans)
BTW, there's also a typo in the patch: 0x15 is in the list twice.
comment:26 in reply to: ↑ 25 Changed 15 years ago by osimons
Replying to mgood:
That sounds reasonable, though the translate method that's used currently does this much faster since it's written in C.
I've now also added some escaping tests locally, and they verify the same output for str and unicode. Thanks for the code - I've used it as-is.
BTW, there's also a typo in the patch: 0x15 is in the list twice.
Heh. Brownie points to you for thorough review :-)
comment:27 Changed 15 years ago by osimons
- Resolution set to fixed
- Status changed from new to closed
Patch with changes from mgood (and some additional tests) committed in [701].
comment:28 follow-up: ↓ 30 Changed 15 years ago by wbell
- Resolution fixed deleted
- Status changed from closed to reopened
Found a small issue when I removed our homebrew solution to this problem. The following text from a build step results in an exception and a failed step
pslist v1.28 - Sysinternals PsList Copyright © 2000-2004 Mark Russinovich Sysinternals
It's the <A9> (the copyright symbol) that throws the XML encoder off. Here's the stacktrace:
Traceback (most recent call last): File "c:\python25\lib\site-packages\Bitten-0.6dev-py2.5.egg\bitten\slave.py", line 385, in _execute_step step.execute(recipe.ctxt): File "c:\python25\lib\site-packages\Bitten-0.6dev-py2.5.egg\bitten\recipe.py", line 243, in execute ctxt.run(self, child.namespace, child.name, child.attr) File "c:\python25\lib\site-packages\Bitten-0.6dev-py2.5.egg\bitten\recipe.py", line 106, in run function(self, **args) File "c:\python25\lib\site-packages\Bitten-0.6dev-py2.5.egg\bitten\build\shtools.py", line 40, in exec_ output=output, args=args, dir_=dir_) File "c:\python25\lib\site-packages\Bitten-0.6dev-py2.5.egg\bitten\build\shtools.py", line 147, in execute .replace(ctxt.basedir, '') File "c:\python25\lib\site-packages\Bitten-0.6dev-py2.5.egg\bitten\util\xmlio.py", line 85, in __getitem__ self.append(node) File "c:\python25\lib\site-packages\Bitten-0.6dev-py2.5.egg\bitten\util\xmlio.py", line 102, in append self.children.append(_from_utf8(node)) File "c:\python25\lib\site-packages\Bitten-0.6dev-py2.5.egg\bitten\util\xmlio.py", line 32, in _from_utf8 return text.decode('utf-8') File "C:\Python25\lib\encodings\utf_8.py", line 16, in decode return codecs.utf_8_decode(input, errors, True) UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 10: unexpected code byte
Not high priority.
comment:29 Changed 15 years ago by osimons
Hmm. Not sure why. I suppose the output may still contain 'wrong' characters - or valid characters located at wrong codepoints.
Could you try just using a 'replace' strategy:
-
bitten/util/xmlio.py
a b 29 29 def _from_utf8(text): 30 30 """Convert utf-8 string to unicode. All other input returned as-is.""" 31 31 if isinstance(text, str): 32 return text.decode('utf-8' )32 return text.decode('utf-8', 'replace') 33 33 else: 34 34 return text 35 35
That should at least be a catch-all for characters that can't be handled, and likely a better alternative than trowing errors.
Does that help?
comment:30 in reply to: ↑ 28 Changed 15 years ago by anonymous
Replying to wbell:
Found a small issue when I removed our homebrew solution to this problem. The following text from a build step results in an exception and a failed step
pslist v1.28 - Sysinternals PsList Copyright © 2000-2004 Mark Russinovich SysinternalsIt's the <A9> (the copyright symbol) that throws the XML encoder off. Here's the stacktrace:
OK, so that means this output wasn't UTF-8 - it was probably latin-1. Surely the way to fix this is a flag on the build script that specifies the output encoding of the program and then the slave to pick that up and convert it into UTF-8 before upload? Of course, I may have missed a couple of tricks about how this works here...
comment:31 Changed 15 years ago by osimons
Ooops. No wonder. My mistake - seems I forgot some intra-patch updating when pushing all my large patches last week... The new subprocess-based execute() forgets to decode the output. Could you try this patch instead?
-
bitten/build/api.py
a b 170 170 name, line = queue.get(block=True, timeout=.01) 171 171 line = line and line.rstrip().replace('\x00', '') 172 172 if name == 'stderr': 173 yield (None, line)173 yield (None, _decode(line)) 174 174 else: 175 yield ( line, None)175 yield (_decode(line), None) 176 176 except Empty: 177 177 if self.returncode != None: 178 178 break
comment:32 follow-up: ↓ 33 Changed 15 years ago by osimons
Same result, only cleaner:
-
bitten/build/api.py
a b 168 168 self.returncode = p.returncode 169 169 try: 170 170 name, line = queue.get(block=True, timeout=.01) 171 line = line and line.rstrip().replace('\x00', '')171 line = line and _decode(line.rstrip().replace('\x00', '')) 172 172 if name == 'stderr': 173 173 yield (None, line) 174 174 else:
comment:33 in reply to: ↑ 32 Changed 15 years ago by osimons
- Resolution set to fixed
- Status changed from reopened to closed
same with 0.5.3