#380 closed defect (fixed)
When the build for *one* revision hangs, the slave hangs forever, and builds aren't triggered anymore
Reported by: | edgewall.org@… | Owned by: | cmlenz |
---|---|---|---|
Priority: | critical | Milestone: | 0.6 |
Component: | General | Version: | dev |
Keywords: | Cc: | felix.schwarz@… | |
Operating System: | BSD |
Description (last modified by osimons)
Someone on our team scr*wed up and committed some tests that hang the build. We fixed it a few commits later, but now, whenever the slaves reach one of the revisions that hang, they hang too, and we have to restart them by hand every time there's a new revision to build...
Unchecking "Build all revisions" doesn't seem to change any of this behavior. Actually, this option isn't really doing what I would expect: it should be called "Trigger a build for every commit, even if it is not on the path for the configuration" or something like that. It would be good to have an "Only build latest revision" option.
The problem is worse for me because I build two different configurations, and Bitten tries to build *all the revisions* for one of the configurations before building the other. Since the slaves hang at some of the revisions for the first configuration, the second configuration is never built, for any revision. I cannot get this configuration to be built at all.
I'm currently trying to patch 'queue.py', to simply skip the builds that are causing trouble:
diff -u queue.py queue.py-original
--- queue.py
+++ queue.py-original
@@ -221,9 +221,6 @@
     platforms = []
     for platform, rev, build in collect_changes(repos, config, db):
 
-        if rev > 1710 and rev < 1726:
-            continue
-
         if not self.build_all and platform.id in platforms:
             # We've seen this platform already, so these are older
             # builds that should only be built if built_all=True
I think it will work, but there may be a cleaner way of doing it?
The *slave* code should have a way to stop a build if it doesn't finish before the timeout defined in the admin (currently this timeout is only used on the master). I've looked at the code in the slave, and it doesn't seem too difficult to implement a control thread that would stop the build. However, I'm not familiar enough with threading in Python to do it...
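For illustration, the control-thread idea could look roughly like the following. This is only a minimal sketch, not Bitten's slave code: it assumes the build step runs as a child process, and `run_with_timeout` is a hypothetical helper name.

```python
import subprocess
import sys
import threading


def run_with_timeout(args, timeout):
    """Run a command and kill it if it runs longer than `timeout` seconds.

    Hypothetical helper, not part of Bitten.
    Returns a (returncode, timed_out) tuple.
    """
    proc = subprocess.Popen(args)
    timed_out = threading.Event()

    def kill_build():
        # Runs in the timer (control) thread once the deadline has passed.
        timed_out.set()
        proc.kill()

    timer = threading.Timer(timeout, kill_build)
    timer.daemon = True  # do not keep the interpreter alive for the timer
    timer.start()
    try:
        returncode = proc.wait()
    finally:
        # The build finished (or was killed): the timer is no longer needed.
        timer.cancel()
    return returncode, timed_out.is_set()


if __name__ == "__main__":
    # A stand-in "build" that hangs for 30 seconds, stopped after 1 second.
    hanging = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_timeout(hanging, 1.0))
```

The timeout value would presumably come from the setting in the admin mentioned above; how the slave would obtain it from the master is left open here.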
Attachments (0)
Change History (7)
comment:1 Changed 16 years ago by dfraser
- Description modified (diff)
comment:2 Changed 16 years ago by wbell
I don't like the idea of using a heuristic for how long a build should take: we have many build slaves of differing speeds, and some slaves take 12 hours for builds that take others only 9.
Any time a slave stops running a build (as far as the master is concerned), it should make an effort to stop building it, so that it doesn't stay stuck as an orphan. One consequence of the current timeout behavior is that if your build hangs and exceeds the timeout, the master happily invalidates it and assigns it to another slave. The slave that was processing it keeps running the (hanging) build, the new slave starts and eventually hangs as well, and the process repeats until all slaves are running the same build, while the master shows none of them as running it.
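One way to picture the slave "making an effort to stop" is a loop that periodically checks whether the master still considers the build assigned to it. The sketch below is an assumption-laden illustration, not Bitten's protocol: `still_assigned` is a hypothetical callable standing in for whatever query the slave would make to the master, and `run_while_assigned` is a hypothetical name.

```python
import subprocess
import sys
import time


def run_while_assigned(args, still_assigned, poll_interval=1.0):
    """Run a build command, aborting it once `still_assigned()` is false.

    `still_assigned` is a hypothetical callable representing a check
    against the master. Returns a (returncode, aborted) tuple.
    """
    proc = subprocess.Popen(args)
    aborted = False
    while proc.poll() is None:
        if not still_assigned():
            # The master has given up on this build: stop working on it
            # instead of leaving an orphaned process behind.
            proc.kill()
            proc.wait()
            aborted = True
            break
        time.sleep(poll_interval)
    return proc.returncode, aborted


if __name__ == "__main__":
    checks = []

    def fake_master_check():
        # Pretend the master invalidates the build on the third check.
        checks.append(1)
        return len(checks) < 3

    hanging = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_while_assigned(hanging, fake_master_check, poll_interval=0.2))
```

Unlike a fixed timeout on the slave, this does not need a guess at how long a build should take on a given machine; the slave simply follows the master's decision.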
comment:3 Changed 15 years ago by osimons
- Milestone changed from 0.6 to 0.6.1
comment:4 Changed 15 years ago by osimons
- Description modified (diff)
(Fixed diff formatting in description)
comment:5 Changed 15 years ago by Felix Schwarz <felix.schwarz@…>
- Cc felix.schwarz@… added
comment:6 Changed 15 years ago by wbell
- Resolution set to fixed
- Status changed from new to closed
Closed with [830]
comment:7 Changed 15 years ago by osimons
- Milestone changed from 0.6.1 to 0.6
Agreed, I've had lots of trouble with this before. Useful things would be: