Opened 16 years ago
Closed 15 years ago
#355 closed defect (duplicate)
Invalid XML characters cause ParseError: not well-formed (invalid token)
Reported by: | dbronner@… | Owned by: | dfraser |
---|---|---|---|
Priority: | major | Milestone: | 0.6 |
Component: | Build slave | Version: | 0.5.3 |
Keywords: | | Cc: | mgood, osimons |
Operating System: | Linux |
Description
Some of our test output contains characters like '\x01' (the single ASCII character at code point 1), causing a ParseError when minidom tries to read in the XML file. This particular data isn't very useful, but I'd like to prevent tests from crashing the build just by printing something in their assertions.
There are a bunch of potential solutions:
- Base64 encode (and decode) all text data.
- Replace all illegal characters with one legal one (something like �).
- Something else.
Since the whole point of this text is for a human to read it, I think I'd prefer option 2.
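To make the trade-off between the first two options concrete, here is a minimal sketch in modern Python 3 (not the Python 2.4 that Bitten ran on at the time). The helper names `encode_base64` and `replace_illegal` are hypothetical and not part of Bitten, and the pattern only covers the C0 control characters that XML 1.0 forbids.

```python
import base64
import re

# Control characters that XML 1.0 does not allow in character data.
# Tab (\x09), newline (\x0a) and carriage return (\x0d) are legal and left alone.
_ILLEGAL_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def encode_base64(text):
    """Option 1: make the text XML-safe by Base64-encoding all of it."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def replace_illegal(text, replacement="\ufffd"):
    """Option 2: swap each illegal character for one legal placeholder."""
    return _ILLEGAL_XML_CHARS.sub(replacement, text)


message = "assertion failed: \x01"
print(encode_base64(message))    # safe, but unreadable without decoding
print(replace_illegal(message))  # still readable; the bad character is marked
```

Option 1 round-trips the data exactly but has to be decoded before anyone can read it; option 2 loses the original byte but keeps the surrounding text legible, which matches the preference stated above.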
Attachments (1)
Change History (10)
comment:1 Changed 16 years ago by dfraser
comment:2 Changed 16 years ago by dbronner@…
I'm not sure how to invoke build/pythontools outside the context of our continuous build, but the path leading to the problem is:
pythontools.unittest -> xmlio.parse -> minidom.parse
  File "/path/to/Bitten-0.6dev-py2.4.egg/bitten/util/xmlio.py", line 185, in parse
    raise ParseError(e)
bitten.util.xmlio.ParseError: not well-formed (invalid token): line 6, column 16
This is from parsing the results file generated by the test:
def test_foo():
    assert False, u"\x01"
-Dave
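The underlying parser failure can be reproduced without Bitten. The following is a small sketch in Python 3 that feeds a control character straight to `minidom`; the `<message>` element is only an illustrative stand-in for the results file, and Bitten's own `ParseError` wrapper from `xmlio.parse` is not involved here.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# A tiny document whose text content holds the character at code point 1.
document = "<message>\x01</message>"

try:
    minidom.parseString(document)
    error = None
except ExpatError as exc:
    # expat rejects the control character before any DOM is built.
    error = exc

print(error)
```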
comment:3 Changed 16 years ago by dfraser
- Owner changed from cmlenz to dfraser
- Status changed from new to assigned
I'll try to have a stab at this if I get time. Just to clarify, the crash happens in the slave, right?
comment:4 Changed 16 years ago by dbronner@…
Correct.
comment:5 Changed 15 years ago by anonymous
This crash also happens on the master. The previously reported error happened because the python:unittest command in the build recipe was trying to form a unittest result to send back to the server, but the same kind of error happens server-side if another step in the recipe generates bad XML. The server-side error shows up as:
2009-06-04 11:36:32,221 Trac[master] ERROR: Error parsing build step result: not well-formed (invalid token): line 1, column 50604
Traceback (most recent call last):
  File "/tmp/bitten-0.6.0-r638/lib/python2.4/site-packages/bitten/master.py", line 204, in _process_build_step
  File "/tmp/bitten-0.6.0-r638/lib/python2.4/site-packages/bitten/util/xmlio.py", line 195, in parse
ParseError: not well-formed (invalid token): line 1, column 50604
This was caused by a makefile outputting the character "\x1b", which was then sent back to the server in a message tag.
comment:6 Changed 15 years ago by dbronner
Sorry, I forgot to identify myself. The last comment was also from me (the original reporter).
comment:7 Changed 15 years ago by osimons
- Cc osimons added
It would be very interesting if you could try the latest patch on #119 to see how that behaves with your problematic characters.
comment:8 Changed 15 years ago by dbronner
Looking through the comments in there, I came across this patch which does fix the problem: http://bitten.edgewall.org/attachment/ticket/243/bitten-escape-chars.patch
This has been in trunk for ages, so I'm not sure why I don't have it in my copy.
I'd close this as a (resolved) duplicate of #243, but I don't see the option to do that.
BTW... #119 is indeed related and the patch is along the lines of what's needed here. It still doesn't handle the low ascii values which are illegal XML though:
cgi.escape("\x01") == "\x01"
"\x01".encode(sys.getfilesystemencoding(), 'replace') == "\x01"
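The same observation can be checked in current Python 3, where `cgi.escape` no longer exists. This sketch substitutes `xml.sax.saxutils.escape` for it and uses a fixed `"ascii"` codec instead of the filesystem encoding, so it is an approximation of the two expressions above rather than a copy of them.

```python
from xml.sax.saxutils import escape

# Escaping handles markup characters only; the control character survives.
escaped = escape("<\x01>")
print(repr(escaped))

# Code point 1 is valid ASCII, so the 'replace' error handler never fires.
encoded = "\x01".encode("ascii", "replace")
print(repr(encoded))
```

Neither step removes the character, so the resulting XML is still rejected by the parser.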
comment:9 Changed 15 years ago by osimons
- Resolution set to duplicate
- Status changed from assigned to closed
Hmm. My #119 patch strips away that patch again as it can't be used as-is (current trunk only allows standard ascii through (string.printable)).
Reading up on it: http://www.w3.org/TR/xml11/#charsets
Seems I should reimplement this to strip away characters in the following ranges:
[#x1-#x8], [#xB-#xC], [#xE-#x1F], [#x7F-#x84], [#x86-#x9F]
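As a rough illustration of what stripping those ranges could look like, here is a sketch in Python 3. It is not the patch from #119; the name `strip_restricted` is hypothetical, and the pattern covers only the ranges quoted above (so it does not touch the NUL character, which is outside them).

```python
import re

# The ranges quoted above, written as a regular-expression character class:
# [#x1-#x8], [#xB-#xC], [#xE-#x1F], [#x7F-#x84], [#x86-#x9F]
_RESTRICTED = re.compile(r"[\x01-\x08\x0b\x0c\x0e-\x1f\x7f-\x84\x86-\x9f]")


def strip_restricted(text):
    """Remove every character that falls in one of the restricted ranges."""
    return _RESTRICTED.sub("", text)


# \x1b (escape) is removed; tab, newline and \x85 are outside the ranges.
print(repr(strip_restricted("ok\x01 \x1b[0m\tdone\x85\n")))
```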
I'll add these notes to #119, seeing as this is a duplicate of an already solved issue, and I'll make sure to reimplement a way to strip these restricted characters from output in an updated patch. It would be good if you could keep an eye on that ticket, as I'll likely get an updated patch done this evening.
Could you include an example output file, and a traceback of exactly where the problem occurs?
The proper solution is probably to encode them in XML in a way that can be parsed and displayed.