Opened 15 years ago
Last modified 10 years ago
#403 new enhancement
[PATCH] Add a network retry count for unreliable networks
| Reported by: | mike@… | Owned by: | osimons |
|---|---|---|---|
| Priority: | minor | Milestone: | 0.6.1 |
| Component: | Build slave | Version: | dev |
| Keywords: | | Cc: | |
| Operating System: | Linux | | |
Description
I'm doing long builds on slaves with occasional network interruptions. This adds an exponential back-off retry count so the build doesn't abort due to a transient interruption.
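For illustration, a minimal sketch of the retry idea described above (the actual patch modifies the slave's request handling; the function name, parameter, and default used here are made up):

```python
import time
import urllib2

def post_with_retries(request, max_retries=5):
    """Post a build-status request, retrying with exponential back-off.

    Sketch only: the real change lives in the slave's request handling,
    and the names here are illustrative.
    """
    delay = 1
    for attempt in range(max_retries + 1):
        try:
            return urllib2.urlopen(request)
        except urllib2.URLError:
            if attempt == max_retries:
                raise  # out of retries, let the error propagate
            time.sleep(delay)
            delay *= 2  # wait twice as long before the next attempt
```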
Attachments (2)
Change History (8)
Changed 15 years ago by mike@…
Changed 15 years ago by mike@…
patch (rebased against trunk this time; supersedes the previous patch, which I can't replace)
comment:1 Changed 15 years ago by osimons
#395 closed as duplicate.
comment:2 Changed 15 years ago by osimons
- Owner changed from cmlenz to osimons
- Summary changed from Add a network retry count for unreliable networks to [PATCH] Add a network retry count for unreliable networks
Patch looks good and useful. I'll put it on my todo list.
comment:3 Changed 15 years ago by osimons
I'm not so sure this is the right location to patch. If the server has received the request and returned a response with a status code in the error range, can't we presume that it is an actual error? The only error codes I can imagine being valid for this use are something like "503 Service Unavailable" and "502 Bad Gateway", and if so we should check specifically for such temporary states from the server/proxy side. What are the error codes you see?
However, if I take down my webserver to simulate typical connection errors I get <urlopen error (61, 'Connection refused')> in the logs and the slave keeps retrying. If I break the connection in the middle of a build, the slave just loops and restarts the build when the server is available again. Wouldn't the proper thing instead be to loop the step-posting attempt until the network is available again, so that the slave effectively continues what it is doing? In that case it really should just loop by default forever until halted, rather than making this a separate setting.
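To make that distinction concrete, a slave could retry only connection-level failures and explicitly temporary server/proxy responses; this is a sketch of the idea, not code from the patch, and the helper name and code list are assumptions:

```python
import urllib2

# Status codes that indicate a temporary server/proxy condition (assumption).
TRANSIENT_CODES = (502, 503)

def is_transient_error(exc):
    """Return True if the failure looks transient and is worth retrying."""
    # HTTPError is a subclass of URLError, so check it first.
    if isinstance(exc, urllib2.HTTPError):
        # The server answered; only retry explicitly temporary states.
        return exc.code in TRANSIENT_CODES
    if isinstance(exc, urllib2.URLError):
        # No response at all, e.g. <urlopen error (61, 'Connection refused')>.
        return True
    return False
```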
Could you please elaborate on the actual messages and status codes you see in your slave logs when the network is unreliable?
comment:4 Changed 15 years ago by anonymous
There are a variety of messages/codes that can be generated here. My use case: I'm in a coffee shop, using my laptop as a build machine. A 45-minute build/test cycle completes, and the coffee shop's wireless flakes out, as coffee shop wireless is wont to do. Without this patch, I have to start the 45-minute cycle completely over. With this patch, I just add an exponential back-off retry count, and the build status will make it to the master.
comment:5 Changed 15 years ago by mike@…
Sorry for the dual comment; I submitted early and forgot to put in my name.
There are a variety of messages/codes that can be generated here. My use case: I'm in a coffee shop, using my laptop as a build machine. A 45-minute build/test cycle completes, and the coffee shop's wireless flakes out, as coffee shop wireless is wont to do. Without this patch, I have to start the 45-minute cycle completely over. With this patch, I just add an exponential back-off retry count, and the build status will make it to the master.
I'm loath to add any sort of mandatory infinite retry, because it's possible that the network error isn't transient. In this case, I just happen to know more about my network situation than the HTTP spec does. Ideally, on a network error the slave would just do the next build step and buffer the build status until the network comes back up again (or eventually give up and forget about it if it doesn't come back), but that would have required more code, and this does exactly what I needed it to :)
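A minimal sketch of that buffer-and-post-later idea (not part of the patch; the class and method names are invented for illustration):

```python
import urllib2

class StatusBuffer(object):
    """Hold build-status requests while the network is down and
    re-post them later. Illustrative only."""

    def __init__(self):
        self.pending = []

    def post(self, request):
        try:
            return urllib2.urlopen(request)
        except urllib2.URLError:
            # Network is down: keep the status report and let the slave
            # move on to the next build step instead of aborting.
            self.pending.append(request)

    def flush(self):
        # Called periodically; re-post whatever accumulated while offline.
        still_pending = []
        for request in self.pending:
            try:
                urllib2.urlopen(request)
            except urllib2.URLError:
                still_pending.append(request)
        self.pending = still_pending
```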
comment:6 Changed 15 years ago by osimons
- Milestone changed from 0.6 to 0.6.1
I can see why it works, of course, provided everything is correct with the request. However, if someone has invalidated your build at some point during those 45 minutes, your slave will try making the invalid post over and over. Or authentication fails, or there are problems with the included XML, or really anything out of the ordinary.
I like the retry idea, but it needs to be tuned for the class of errors it is intended to catch. The problem is not critical for a 0.6 release, so I'm rescheduling it and will look at it again before too long.