Opened 15 years ago
Last modified 13 years ago
#513 new enhancement
Bitten slave exits too easily in case of an intermittent issue
Reported by: | dobesv@… | Owned by: | |
---|---|---|---|
Priority: | minor | Milestone: | 0.6.1 |
Component: | Build slave | Version: | 0.6b2 |
Keywords: | | Cc: | |
Operating System: | Windows | | |
Description
The bitten slave exits if it has any connection issues with the server; for example:
[INFO ] Build step checkout completed successfully
[DEBUG ] Sending POST request to 'https://seed2.projectlocker.com/habitsoft/books/trac/builds/38/steps/'
[DEBUG ] Server returned error 500: Internal Server Error (Internal Server Error TracError: OSError: [Errno 12] Cannot allocate memory: '/usr/local/lib/python2.5')
[ERROR ] HTTP Error 500: Internal Server Error
[INFO ] Slave exited at 2009-12-15 13:57:00
However, this was a temporary issue on the server.
It also exits if I put the Windows machine to sleep and then wake it up, because its TCP connection may fail.
Ideally it would continue to attempt a build occasionally if the master is down, so that maintenance windows or server capacity problems don't nuke the slaves.
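To illustrate the behaviour requested above, here is a minimal sketch of a retry loop with exponential backoff. It is not Bitten's actual code: `run_with_retry`, `is_transient`, and the `poll_once` callable are hypothetical names, the delays are arbitrary, and it is written for Python 3 (`urllib.error`) rather than the Python 2.5 shown in the log. It assumes that connection errors and HTTP 5xx responses are worth retrying, while other exceptions should still propagate.

```python
import logging
import time
import urllib.error

log = logging.getLogger("slave-retry-sketch")


def is_transient(exc):
    """Guess whether an error is likely temporary (an assumption, not Bitten's rule)."""
    if isinstance(exc, urllib.error.HTTPError):
        # 5xx means the server had a problem; 4xx is treated as permanent here.
        return exc.code >= 500
    # URLError, socket errors and connection resets are all OSError subclasses.
    return isinstance(exc, OSError)


def run_with_retry(poll_once, initial_delay=5.0, max_delay=300.0,
                   sleep=time.sleep):
    """Call poll_once() until it succeeds, waiting longer after each transient failure.

    poll_once is a hypothetical callable standing in for one request to the master.
    """
    delay = initial_delay
    while True:
        try:
            return poll_once()
        except Exception as exc:
            if not is_transient(exc):
                raise  # real bugs should still stop the slave
            # Log the failure so a later investigation has something to go on.
            log.warning("Master unavailable (%s); retrying in %.0f s", exc, delay)
            sleep(delay)
            delay = min(delay * 2, max_delay)  # cap the backoff
```

With something along these lines, a server outage or a dropped TCP connection after sleep would produce a logged warning and a later retry instead of ending the slave process.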
Attachments (0)
Change History (2)
comment:1 Changed 15 years ago by sam.hendley@…
comment:2 Changed 15 years ago by anonymous
When booting my (VMware virtual) Debian machine, bitten-slave (for testing it runs on the same machine as Trac) sometimes starts before the IP address configuration has been set up, or while the system is still changing it. Since bitten-slave quits on some errors, it cannot be run reliably as a daemon.
ERROR: <urlopen error (-5,'No address associated with hostname')> and then bitten-slave exits.
As a workaround I installed Monit to supervise just the bitten-slave process and restart it if needed. I don't feel secure when a networked application like bitten-slave simply quits on a temporary network error. I am therefore raising the priority to critical, because the program left no trace in the logs of what was happening when it couldn't handle the errors (and I spent about 5 hours trying to locate an error that happens only "sometimes", and never when run from the command line).
+1. I have this issue when a test hangs and takes too long: by the time it finally completes, the session has expired and the slave exits. I think it should just go back and try another build.
On a related note, when the session times out the build should be considered failed as well. Currently I have to go in manually and invalidate that build, because it still appears to be running.