Edgewall Software
Modify

Opened 19 years ago

Closed 15 years ago

Last modified 13 years ago

#119 closed defect (fixed)

Windows oem character set produces ParseError: not well-formed (invalid token)

Reported by: silver.aabjoe@… Owned by: osimons
Priority: critical Milestone: 0.6
Component: Build slave Version: 0.5.3
Keywords: Cc: anmar@…, jevans
Operating System:

Description

i have bitten 0.5.2, track 0.9.4, python 2.3.5 the build recipe looks like this

<build xmlns:sh="http://bitten.cmlenz.net/tools/sh">
  <step id="build" description="Compile to byte code">
     <sh:exec executable="dir"/>
  </step>
</build>

Even if slave reports build compleated successfully the master displays the following:

not well-formed (invalid token): line 1, column 925
Traceback (most recent call last):
  File "/usr/lib/python2.3/asyncore.py", line 69, in read
    obj.handle_read_event()
  File "/usr/lib/python2.3/asyncore.py", line 390, in handle_read_event
    self.handle_read()
  File "/usr/lib/python2.3/asynchat.py", line 136, in handle_read
    self.found_terminator()
  File "build/bdist.linux-i686/egg/bitten/util/beep.py", line 278, in found_terminator
  File "build/bdist.linux-i686/egg/bitten/util/beep.py", line 311, in _handle_frame
  File "build/bdist.linux-i686/egg/bitten/util/beep.py", line 469, in handle_data_frame
  File "build/bdist.linux-i686/egg/bitten/master.py", line 216, in handle_reply
  File "build/bdist.linux-i686/egg/bitten/util/xmlio.py", line 182, in parse
ParseError: not well-formed (invalid token): line 1, column 925

and stops responding to other slaves.

Attachments (7)

bitten-119.patch (756 bytes) - added by Angel Marin <anmar@…> 19 years ago.
Quick and dirty fix
bitten-119-ctypes.patch (879 bytes) - added by Angel Marin <anmar@…> 19 years ago.
Possible solution that applies to r372
t119-unicode_xml-r655.diff (5.3 KB) - added by osimons 15 years ago.
First-draft patch for unicode/encoding-aware XML.
t119-unicode_xml-r655.2.diff (8.5 KB) - added by osimons 15 years ago.
Updated patch.
t119-unicode_xml-r655.3.diff (9.3 KB) - added by osimons 15 years ago.
Minor tweaks + fixed tests. Ready for testing.
t119-unicode_xml-r667.diff (9.8 KB) - added by osimons 15 years ago.
Updated patch that also filters restricted/control characters according to XML spec.
t119-unicode_xml-r696.diff (9.7 KB) - added by osimons 15 years ago.
Updated to apply cleanly on latest trunk (moved xmlio test).

Download all attachments as: .zip

Change History (40)

comment:1 Changed 19 years ago by anonymous

  • Version changed from 0.5.2 to 0.5.3

same with 0.5.3

comment:2 Changed 19 years ago by Axel.Thimm@…

  • Component changed from Build master to BEEP implementation
  • Summary changed from ParseError: not well-formed (invalid token): line 1, column 925 to ParseError: not well-formed (invalid token): line X, column Y

I have a similar issue, but on the slave side and I think this is due to how xml is sometimes being passed over the wire, so I'm changing the component to BEEP.

The system I'm working with is

  • master FC4/i386 (python 2.4.1)
  • slave FC5/x86_64 (2.4.2)
  • trac 1.0dev
  • bitten 0.5.3

When the build master is about to send the build recipe the wire-sniffed exchange shows that every character of the build recipe has been padded with three zeros to the right. Looking into the database shows that the build recipe has been stored OK. Maybe this has to do with 64 vs 32 bits (although the padding happens on the master's side which is 32 bits and doesn't know what the slave is other than through os/machine)?

This is the slave's error:

[ERROR   ] not well-formed (invalid token): line 1, column 1
Traceback (most recent call last):
  File "/usr/lib64/python2.4/asyncore.py", line 69, in read
    obj.handle_read_event()
  File "/usr/lib64/python2.4/asyncore.py", line 391, in handle_read_event
    self.handle_read()
  File "/usr/lib64/python2.4/asynchat.py", line 137, in handle_read
    self.found_terminator()
  File "/usr/lib/python2.4/site-packages/bitten/util/beep.py", line 278, in found_terminator
    self._handle_frame(self.header, self.payload)
  File "/usr/lib/python2.4/site-packages/bitten/util/beep.py", line 311, in _handle_frame
    payload)
  File "/usr/lib/python2.4/site-packages/bitten/util/beep.py", line 465, in handle_data_frame
    self.profile.handle_msg(msgno, payload)
  File "/usr/lib/python2.4/site-packages/bitten/slave.py", line 98, in handle_msg
    elem = xmlio.parse(payload.body)
  File "/usr/lib/python2.4/site-packages/bitten/util/xmlio.py", line 182, in parse
    raise ParseError, e
ParseError: not well-formed (invalid token): line 1, column 1

And this is the build recipe:

<build xmlns:c="http://bitten.cmlenz.net/tools/c">
  <step id="build" description="Configure and build">
    <c:configure />
  </step>
</build>

This is what sniffing reveals:

RPY 0 0 . 0 49

Content-Type: application/beep+xml



<greeting/>END

RPY 0 0 . 0 119

Content-Type: application/beep+xml



<greeting><profile uri="http://bitten.cmlenz.net/beep/orchestration"/></greeting>END

MSG 0 0 . 49 124

Content-Type: application/beep+xml



<start number="1"><profile uri="http://bitten.cmlenz.net/beep/orchestration"/></start>END

RPY 0 0 . 119 98

Content-Type: application/beep+xml



<profile uri="http://bitten.cmlenz.net/beep/orchestration"/>END

MSG 1 0 . 0 233

Content-Type: application/beep+xml



<register name="octopus"><platform processor="x86_64">x86_64</platform><os version="2.6.16-1.2133_FC5xen0" family="posix">Linux</os><package version="4.1.1" type="gfortran" name="fc"/></register>END

RPY 1 0 . 0 43

Content-Type: application/beep+xml



<ok/>END

MSG 1 0 . 43 626

Content-Type: application/beep+xml



<...b...u...i...l...d... ...x...m...l...n...s...:...c...=..."...h...t...t...p...:.../.../...b...i...t...t...e...n.......c...m...l...e...n...z.......n...e...t.../...t...o...o...l...s.../...c..."...>...
...
... ... ...<...s...t...e...p... ...i...d...=..."...b...u...i...l...d..."... ...d...e...s...c...r...i...p...t...i...o...n...=..."...C...o...n...f...i...g...u...r...e... ...a...n...d... ...b...u...i...l...d..."...>...
...
... ... ... ... ...<...c...:...c...o...n...f...i...g...u...r...e... .../...>...
...
... ... ...<.../...s...t...e...p...>...
...
...<.../...b...u...i...l...d...>...END

The dots are zero bytes.

Thanks!

comment:3 Changed 19 years ago by Axel.Thimm@…

  • Cc Axel.Thimm@… added

comment:4 Changed 19 years ago by Angel Marin <anmar@…>

  • Cc anmar@… added

I think the original problem reported and Alex one aren't the same. I've been able to reproduce the original problem under windows with trunk code and it's a windows fault related only to sh:exec recipe command.

I have tracked down the problem to the way windows encodes output for command line apps. The output for the command line interpreter is encoded using the old codepages used by DOS. So when the output is read back by the slave from the temp file, it uses system's default encoding and non ASCII chars are just garbage that the master won't be able to parse.

Forcing the conversion of the read string to utf does not work as python thinks it has read pure ascii content and fails the conversion. Ideas?

comment:5 Changed 19 years ago by Silver Aabjõe <silver.aabjoe@…>

I tried running the dir in unicode output

<build xmlns:sh="http://bitten.cmlenz.net/tools/sh">
  <step id="build" description="Compile to byte code">
     <sh:exec executable="cmd" args="/U /C dir"/>
  </step>
</build>

but it did not work.

Changed 19 years ago by Angel Marin <anmar@…>

Quick and dirty fix

comment:6 Changed 19 years ago by Angel Marin <anmar@…>

After much digging I haven't found anything in python that knows what the default oem codepage is under windows, so the response for commands can't be re-encoded properly without that info. It doesn't matter how you execute the command (popen2.popen3, os.system, ...) if you don't know the encoding used you're lost.

Perhaps something similar to trac's util.to_utf8 could be used here. Master would be able to parse the response though any non-ascii char would be trashed. Other possibility would be to add a config option for this?

The attached patch just converts the response lines to cp850' which is the codepage used by my local windows copy. Of course that's not a solution, just a quick workaround to get it working.

comment:7 Changed 19 years ago by Silver Aabjõe <silver.aabjoe@…>

Under windows there is a utility that displays and allows to set the active code page. Namely CHCP.

Changed 19 years ago by Angel Marin <anmar@…>

Possible solution that applies to r372

comment:8 follow-up: Changed 19 years ago by anonymous

The attached patch gets the configured OEM codepage calling GetOEMCP win32 function through ctypes and uses it to re-encode the response lines to utf-8.

Though perhaps it's too much overhead to add a dependency on ctypes just for this. Anyway ctypes is included in Python 2.5.

comment:9 in reply to: ↑ 8 Changed 19 years ago by Silver Aabjõe <silver.aabjoe@…>

Seems to work Replying to anonymous:

The attached patch gets the configured OEM codepage calling GetOEMCP win32 function through ctypes and uses it to re-encode the response lines to utf-8.

Though perhaps it's too much overhead to add a dependency on ctypes just for this. Anyway ctypes is included in Python 2.5.

comment:10 Changed 17 years ago by cmlenz

  • Component changed from BEEP implementation to Build slave

This is probably not stricly related to the BEEP stuff.

comment:11 Changed 17 years ago by cmlenz

  • Milestone set to 0.6
  • Status changed from new to assigned

Need better non-ASCII command-line output support in the next release.

comment:12 Changed 16 years ago by dfraser

  • Summary changed from ParseError: not well-formed (invalid token): line X, column Y to Windows oem character set produces ParseError: not well-formed (invalid token)

comment:13 Changed 15 years ago by osimons

#115 closed as duplicate.

comment:14 Changed 15 years ago by osimons

  • Owner changed from cmlenz to osimons
  • Status changed from assigned to new

#151 closed as duplicate, and see also #243 and the currently applied workaround that strips all non-7-bit-ascii characters. The xml parsing needs unicode support, and use a common encoding of utf-8 for all communication.

I'll put it on my todo, and see if I can come up with something that works ok.

comment:15 Changed 15 years ago by osimons

#350 closed as duplicate.

comment:16 Changed 15 years ago by osimons

#142 closed as duplicate (but regarding no longer used beep.py).

Changed 15 years ago by osimons

First-draft patch for unicode/encoding-aware XML.

comment:17 follow-up: Changed 15 years ago by osimons

I've made a stab at the issue of unicode and encoding. See attachment:t119-unicode_xml-r655.diff.

This attachment:

  • Makes the various xmlio classes aware of encoding - they can accept unicode and utf-8 strings and anything else that cleanly decodes as unicode (like numbers), and any 'xml-stringify-get' will return utf-8 strings (the common xml encoding).
  • Allows create+update of recipes that includes non-ascii characters.
  • Runs such recipes on various platforms.
  • Encodes input for commands and decodes output.
  • Removes the old code that just strips away any non-printable/non-ascii7 characters.

On a utf-8 platform like OSX and most current Linux this works quite well, and a simple 'echo' step completes a full cycle with identical result. On Windows, not so well. Windows results are recognizable but not identical, and in my testing a number of characters are wrong or replaced with '?'. Regardless, it does the best it can and the encoding/decoding has yet to result in Bitten errors so far in my testing.

Another interesting artifact is that all 221 tests pass unchanged.

The major problem with encoding is how to deal with commands that return output in another decoding than platform default. On OSX my Terminal and the Bitten slave uses utf-8. However, locale.getpreferredlocale() returns 'mac-roman' so that any external command may return that - or really any encoding that it may employ.

A simple illustration is a python script printing non-ascii characters, and where the script is coded as utf-8 using the common # -*- coding: utf-8 -*- declaration. On OSX that executes on returns nicely, but on Windows the patch will try to convert the utf-8 output using whatever cpxxxx is set in sys.stdout.encoding. The non-ascii part of the result is unrecognizable. Same source, same build script, same output, but no way of knowing how it should be decoded. As the Zen of Python says, "In the face of ambiguity, refuse the temptation to guess"....

Solving the last issue will require a re-think of how recipes are specified, and I've been playing with the idea of an optional encoding= attributes on commands (not part of the patch). That would work for executed programs that return a specific encoding regardless of platform, and would then override the slave-system-default that would otherwise be used.

An encoding= attribute or similar would mean that the encoding would need to be passed through recipe build calls, into each command function, and on through to whatever lower-level mechanism the command employs to get its result - usually the CommandLine class but not necessarily. And of course all the way back again. That requires breaking internal changes, so I don't think that is a good idea before 0.7 even if the basic and unobtrusive encoding-support got enabled now for 0.6.

Please test the patch, and any feedback or help with improving it is much appreciated.

comment:18 in reply to: ↑ 17 Changed 15 years ago by cmlenz

Replying to osimons:

A simple illustration is a python script printing non-ascii characters, and where the script is coded as utf-8 using the common # -*- coding: utf-8 -*- declaration. On OSX that executes on returns nicely, but on Windows the patch will try to convert the utf-8 output using whatever cpxxxx is set in sys.stdout.encoding. The non-ascii part of the result is unrecognizable. Same source, same build script, same output, but no way of knowing how it should be decoded. As the Zen of Python says, "In the face of ambiguity, refuse the temptation to guess"....

Still, I think we should try decoding as UTF-8 first, and only if that fails fallback to the default encoding.

The nice thing about UTF-8 is that there's a very high probably it'll flag something that isn't actually UTF-8 as an error when decoding. So it's not so much guessing as trying a sane & safe default first ;)

Changed 15 years ago by osimons

Updated patch.

comment:19 Changed 15 years ago by osimons

I've update the patch (attachment:t119-unicode_xml-r655.2.diff).

This tries to decode UTF-8 first as cmlenz suggested, and also uses new _to_utf8() and _from_utf8() functions to make the code more rational. Also added some small tests - one of them I can't get to work yet due to some doctest handling of StringIO. Suggestions and improvements welcome...

Changed 15 years ago by osimons

Minor tweaks + fixed tests. Ready for testing.

comment:20 Changed 15 years ago by jevans

  • Cc jevans added

comment:21 follow-up: Changed 15 years ago by osimons

See also #355 for somewhat related issue - characters inside regular ascii that are not parseable.

comment:22 in reply to: ↑ 21 Changed 15 years ago by osimons

Replying to osimons:

See also #355 for somewhat related issue - characters inside regular ascii that are not parseable.

My latest patch actually strips away this code again (that is marked as solved in #243).

Reading up on it: http://www.w3.org/TR/xml11/#charsets

As matching against string.printable is no longer an option (very limited pure ascii), it seems the best way would be to strip characters in the following ranges:

[#x1-#x8], [#xB-#xC], [#xE-#x1F], [#x7F-#x84], [#x86-#x9F]

I'll update the patch.

Changed 15 years ago by osimons

Updated patch that also filters restricted/control characters according to XML spec.

comment:23 follow-up: Changed 15 years ago by osimons

I've added a filter to remove the most common characters from the XML spec, essentially adding this to the escaping:

text = filter(lambda c: ord(c) not in [
        0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0xB, 0xC, 0xE, 0xF,
        0x10, 0x12, 0x13, 0x14, 0x15, 0x15, 0x17, 0x18, 0x19, 0x1A,
        0x1B, 0x1C, 0x1D, 0x1E, 0x1F, 0x7F, 0x80, 0x81, 0x82, 0x83,
        0x84, 0x86, 0x87, 0x88, 0x89, 0x8A, 0x8B, 0x8C, 0x8D, 0x8E,
        0x8F, 0x90, 0x91, 0x92, 0x93, 0x94, 0x95, 0x96, 0x97, 0x98,
        0x99, 0x9A, 0x9B, 0x9C, 0x9D, 0x9E, 0x9F], text)

Please test.

comment:24 Changed 15 years ago by Axel.Thimm@…

  • Cc Axel.Thimm@… removed

Changed 15 years ago by osimons

Updated to apply cleanly on latest trunk (moved xmlio test).

comment:25 in reply to: ↑ 23 ; follow-up: Changed 15 years ago by mgood

Replying to osimons:

I've added a filter to remove the most common characters from the XML spec, essentially adding this to the escaping:

That sounds reasonable, though the translate method that's used currently does this much faster since it's written in C.

It works slightly differently with unicode objects where it takes a dict mapping code points to a replacement with None deleting that character, but this example would work for either:

_trans = string.maketrans('', '')
__todel = ('\x01\x02\x03\x04\x05\x06\x07\x08\x0b\x0c\x0e\x0f\x10\x12\x13\x14'
           '\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f\x7f\x80\x81\x82\x83'
           '\x84\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94'
           '\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f')
__uni_trans = dict([(ord(c), None) for c in __todel])

if isinstance(text, str):
    text = text.translate(__trans, __todel)
elif isinstance(text, unicode):
    text = text.translate(__uni_trans)

BTW, there's also a typo in the patch: 0x15 is in the list twice.

comment:26 in reply to: ↑ 25 Changed 15 years ago by osimons

Replying to mgood:

That sounds reasonable, though the translate method that's used currently does this much faster since it's written in C.

I've now also added some escaping tests locally, and they verify the same output for str and unicode. Thanks for the code - I've used it as-is.

BTW, there's also a typo in the patch: 0x15 is in the list twice.

Heh. Brownie points to you for thorough review :-)

comment:27 Changed 15 years ago by osimons

  • Resolution set to fixed
  • Status changed from new to closed

Patch with changes from mgood (and some additional tests) committed in [701].

comment:28 follow-up: Changed 15 years ago by wbell

  • Resolution fixed deleted
  • Status changed from closed to reopened

Found a small issue when I removed our homebrew solution to this problem. The following text from a build step results in an exception and a failed step

pslist v1.28 - Sysinternals PsList
Copyright © 2000-2004 Mark Russinovich
Sysinternals

It's the <A9> (the copyright symbol) that throws the XML encoder off. Here's the stacktrace:

Traceback (most recent call last):
  File "c:\python25\lib\site-packages\Bitten-0.6dev-py2.5.egg\bitten\slave.py", line 385, in _execute_step
    step.execute(recipe.ctxt):
  File "c:\python25\lib\site-packages\Bitten-0.6dev-py2.5.egg\bitten\recipe.py", line 243, in execute
    ctxt.run(self, child.namespace, child.name, child.attr)
  File "c:\python25\lib\site-packages\Bitten-0.6dev-py2.5.egg\bitten\recipe.py", line 106, in run
    function(self, **args)
  File "c:\python25\lib\site-packages\Bitten-0.6dev-py2.5.egg\bitten\build\shtools.py", line 40, in exec_
    output=output, args=args, dir_=dir_)
  File "c:\python25\lib\site-packages\Bitten-0.6dev-py2.5.egg\bitten\build\shtools.py", line 147, in execute
    .replace(ctxt.basedir, '')
  File "c:\python25\lib\site-packages\Bitten-0.6dev-py2.5.egg\bitten\util\xmlio.py", line 85, in __getitem__
    self.append(node)
  File "c:\python25\lib\site-packages\Bitten-0.6dev-py2.5.egg\bitten\util\xmlio.py", line 102, in append
    self.children.append(_from_utf8(node))
  File "c:\python25\lib\site-packages\Bitten-0.6dev-py2.5.egg\bitten\util\xmlio.py", line 32, in _from_utf8
    return text.decode('utf-8')
  File "C:\Python25\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 10: unexpected code byte

Not high priority.

comment:29 Changed 15 years ago by osimons

Hmm. Not sure why. I suppose the output may still contain 'wrong' characters - or valid characters located at wrong codepoints.

Could you try just using a 'replace' strategy:

  • bitten/util/xmlio.py

    a b  
    2929def _from_utf8(text):
    3030    """Convert utf-8 string to unicode. All other input returned as-is."""
    3131    if isinstance(text, str):
    32         return text.decode('utf-8')
     32        return text.decode('utf-8', 'replace')
    3333    else:
    3434        return text
    3535

That should at least be a catch-all for characters that can't be handled, and likely a better alternative than trowing errors.

Does that help?

comment:30 in reply to: ↑ 28 Changed 15 years ago by anonymous

Replying to wbell:

Found a small issue when I removed our homebrew solution to this problem. The following text from a build step results in an exception and a failed step

pslist v1.28 - Sysinternals PsList
Copyright © 2000-2004 Mark Russinovich
Sysinternals

It's the <A9> (the copyright symbol) that throws the XML encoder off. Here's the stacktrace:

OK, so that means this output wasn't UTF-8 - it was probably latin-1. Surely the way to fix this is a flag on the build script that specifies the output encoding of the program and then the slave to pick that up and convert it into UTF-8 before upload? Of course, I may have missed a couple of tricks about how this works here...

comment:31 Changed 15 years ago by osimons

Ooops. No wonder. My mistake - seems I forgot some intra-patch updating when pushing all my large patches last week... The new subprocess-based execute() forgets to decode the output. Could you try this patch instead?

  • bitten/build/api.py

    a b  
    170170                    name, line = queue.get(block=True, timeout=.01)
    171171                    line = line and line.rstrip().replace('\x00', '')
    172172                    if name == 'stderr':
    173                         yield (None, line)
     173                        yield (None, _decode(line))
    174174                    else:
    175                         yield (line, None)
     175                        yield (_decode(line), None)
    176176                except Empty:
    177177                    if self.returncode != None:
    178178                        break

comment:32 follow-up: Changed 15 years ago by osimons

Same result, only cleaner:

  • bitten/build/api.py

    a b  
    168168                    self.returncode = p.returncode
    169169                try:
    170170                    name, line = queue.get(block=True, timeout=.01)
    171                     line = line and line.rstrip().replace('\x00', '')
     171                    line = line and _decode(line.rstrip().replace('\x00', ''))
    172172                    if name == 'stderr':
    173173                        yield (None, line)
    174174                    else:

comment:33 in reply to: ↑ 32 Changed 15 years ago by osimons

  • Resolution set to fixed
  • Status changed from reopened to closed

Replying to osimons:

Same result, only cleaner:

Committed in [709] - hopefully this should solve it...

Add Comment

Modify Ticket

Change Properties
Set your email in Preferences
Action
as closed The owner will remain osimons.
The resolution will be deleted. Next status will be 'reopened'.
Author


E-mail address and user name can be saved in the Preferences.

 
Note: See TracTickets for help on using tickets.