Opened 19 years ago
Closed 15 years ago
#95 closed defect (fixed)
Multiple slaves claim the same build
Reported by: | Walter Bell <wwb2@…> | Owned by: | osimons |
---|---|---|---|
Priority: | major | Milestone: | 0.6 |
Component: | Build master | Version: | 0.5 |
Keywords: | | Cc: | |
Operating System: | | | |
Description
Start up multiple slaves at around the same time. If transferring the tarball takes long enough, several of them will grab the same build, and the log fills with errors:
2006-01-13 09:34:07,645 [bitten.master] INFO: Slave vs2002-jwdesk01 started build 171 ("Countrywide" as of [3533])
2006-01-13 09:34:13,834 [bitten.beep] ERROR: columns build, name are not unique
Traceback (most recent call last):
  File "d:\Python23\lib\asyncore.py", line 69, in read
    obj.handle_read_event()
  File "d:\Python23\lib\asyncore.py", line 390, in handle_read_event
    self.handle_read()
  File "d:\Python23\lib\asynchat.py", line 136, in handle_read
    self.found_terminator()
  File "build\bdist.win32\egg\bitten\util\beep.py", line 278, in found_terminator
  File "build\bdist.win32\egg\bitten\util\beep.py", line 311, in _handle_frame
  File "build\bdist.win32\egg\bitten\util\beep.py", line 469, in handle_data_frame
  File "build\bdist.win32\egg\bitten\master.py", line 221, in handle_reply
  File "build\bdist.win32\egg\bitten\master.py", line 277, in _build_step_completed
  File "build\bdist.win32\egg\bitten\model.py", line 574, in insert
  File "d:\Python23\lib\site-packages\sqlite\main.py", line 255, in execute
    self.rs = self.con.db.execute(SQL % parms)
IntegrityError: columns build, name are not unique
It's not fatal and works itself out, but it's a waste of resources.
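For reference, the failure mode in the traceback can be reproduced in miniature. The snippet below is a standalone sketch, not Bitten's actual schema: the table and column names are simplified stand-ins. It shows how a second slave reporting the same (build, name) step trips the unique constraint and raises the IntegrityError seen above.

```python
import sqlite3

# In-memory stand-in for the step table; the real schema lives in
# bitten/model.py -- this only illustrates the failure mode.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE bitten_step (build INTEGER, name TEXT, "
            "UNIQUE (build, name))")

def record_step(build_id, step_name):
    """Insert a step result, as each slave's reply handler effectively does."""
    con.execute("INSERT INTO bitten_step (build, name) VALUES (?, ?)",
                (build_id, step_name))

record_step(171, "checkout")          # first slave reports the step
try:
    record_step(171, "checkout")      # second slave claimed the same build
except sqlite3.IntegrityError as exc:
    # mirrors "columns build, name are not unique" from the old pysqlite
    print("IntegrityError:", exc)
```

The constraint makes the duplicate claim visible, but it only detects the race after the fact; it does not stop two slaves from starting the same work.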
Attachments (2)
Change History (9)
Changed 19 years ago by Walter Bell <wwb2@…>
comment:1 Changed 19 years ago by cmlenz
- Milestone set to 0.6
- Status changed from new to assigned
Looks good, thanks for the patch!
comment:2 Changed 17 years ago by cmlenz
Need to port this to the HTTP branch.
comment:3 Changed 15 years ago by wbell
The simplest fix I've found for this is to add a constraint into the database, but it's not ideal. Discard this original patch.
comment:4 Changed 15 years ago by osimons
- Milestone changed from 0.6 to 0.7
comment:5 Changed 15 years ago by osimons
- Milestone changed from 0.7 to 0.6
- Owner changed from cmlenz to osimons
- Status changed from assigned to new
I think I've found a problem in current trunk related to this. The code that loops over the pending builds breaks out of the loop if it finds a matching build (leaving the build variable populated with the correct build). However, if no matching build is found by the end of the loop, the build variable is still populated, but now with the last build of the loop. That build will then be updated and given to the new slave.
The patch in attachment:t95-slaves_claim_same_build-r712.diff should hopefully fix this. Could anyone review my understanding of this?
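The loop-variable behaviour described above can be shown in a minimal standalone example (the names here are illustrative only, not the actual queue code): Python's loop variable outlives the loop, so after an exhausted loop it still holds the last item.

```python
# Minimal illustration of the described bug: the loop variable survives the
# loop, so "build" still refers to the last pending build even when nothing
# matched and the loop ran to completion.
pending = ["build-1", "build-2", "build-3"]
matching_platform = "never-matches"

build = None
for build in pending:
    if build == matching_platform:
        break
else:
    # No break occurred, yet build is "build-3" here, not None.
    print("no match; build is now", build)
```

Unless the no-match branch explicitly resets build (or a found-flag is checked instead), this leftover value can be handed to a slave.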
comment:6 Changed 15 years ago by osimons
Actually, seeing that build = None is set when no match is explicitly found, all the changes at the end are not needed. The new simplified patch:
--- bitten/queue.py
+++ bitten/queue.py
@@ -134,15 +134,16 @@
         # Iterate through pending builds by descending revision timestamp, to
         # avoid the first configuration/platform getting all the builds
         platforms = [p.id for p in self.match_slave(name, properties)]
-        build = None
         builds_to_delete = []
+        build_found = False
         for build in Build.select(self.env, status=Build.PENDING, db=db):
             if self.should_delete_build(build, repos):
                 self.log.info('Scheduling build %d for deletion', build.id)
                 builds_to_delete.append(build)
             elif build.platform in platforms:
+                build_found = True
                 break
-        else:
+        if not build_found:
             self.log.debug('No pending builds.')
             build = None
comment:7 Changed 15 years ago by osimons
- Resolution set to fixed
- Status changed from new to closed
A simple patch for #95 introduces a new RESERVED state so that multiple slaves can't claim the same build. Not the cleanest approach, but it seems to work.
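The RESERVED-state idea can be sketched as a compare-and-set UPDATE: a slave only gets a build if a single statement flips it from PENDING to RESERVED, and the statement's rowcount reveals who won the race. This is a hypothetical illustration with invented table and column names, not the actual patch.

```python
import sqlite3

# Sketch of atomically claiming a build. Only one slave's UPDATE can match
# the "status = 'PENDING'" predicate, so only one claim succeeds.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE build (id INTEGER PRIMARY KEY, "
            "status TEXT, slave TEXT)")
con.execute("INSERT INTO build (id, status) VALUES (171, 'PENDING')")
con.commit()

def claim(slave_name, build_id):
    cur = con.execute(
        "UPDATE build SET status = 'RESERVED', slave = ? "
        "WHERE id = ? AND status = 'PENDING'", (slave_name, build_id))
    con.commit()
    return cur.rowcount == 1   # True only for the slave that got there first

print(claim("slave-a", 171))   # True  -- first claim flips PENDING -> RESERVED
print(claim("slave-b", 171))   # False -- build is already reserved
```

Because the check and the state change happen in one statement, there is no window for a second slave to see the build as still pending, which is exactly what the tarball-transfer delay in the original report exploited.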