Opened 16 years ago

Last modified 11 years ago

#403 new enhancement

[PATCH] Add a network retry count for unreliable networks

Reported by: mike@… Owned by: osimons
Priority: minor Milestone: 0.6.1
Component: Build slave Version: dev
Keywords: Cc:
Operating System: Linux

Description

I'm doing long builds on slaves with occasional network interruptions. This adds an exponential die-off retry count so the build doesn't abort due to a transient interruption.
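A minimal sketch of the kind of retry loop described here, for readers who have not opened the attachments. It is not taken from the patch: the names `backoff_delays` and `call_with_retries`, the default delays, and the choice of `OSError` as the "transient" exception class are all assumptions made for illustration, and it is written against Python 3.

```python
import time


def backoff_delays(retries, base=1.0, factor=2.0, cap=60.0):
    """Yield one delay per retry, growing exponentially up to `cap` seconds.

    All parameter names and defaults are hypothetical.
    """
    delay = base
    for _ in range(retries):
        yield min(delay, cap)
        delay *= factor


def call_with_retries(func, retries=5, sleep=time.sleep, transient=(OSError,)):
    """Call func(); on a transient error, wait and try again.

    Makes at most `retries` additional attempts after the first one,
    then re-raises the last error.
    """
    delays = backoff_delays(retries)
    while True:
        try:
            return func()
        except transient:
            delay = next(delays, None)
            if delay is None:
                raise  # retries exhausted: let the caller see the error
            sleep(delay)
```

With these defaults the waits are 1, 2, 4, 8 and 16 seconds, so a short outage is ridden out while a permanent failure still surfaces after a bounded time.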

Attachments (2)

0002-Add-a-network-retry-count-for-unreliable-networks.patch (4.4 KB) - added by mike@… 16 years ago.
patch
0002-Add-a-network-retry-count-for-unreliable-networks.2.patch (4.3 KB) - added by mike@… 16 years ago.
patch (rebased against trunk this time) (this supersedes the previous patch, but I can't replace it)

Change History (8)

Changed 16 years ago by mike@…

patch (rebased against trunk this time) (this supersedes the previous patch, but I can't replace it)

comment:1 Changed 16 years ago by osimons

#395 closed as duplicate.

comment:2 Changed 16 years ago by osimons

  • Owner changed from cmlenz to osimons
  • Summary changed from Add a network retry count for unreliable networks to [PATCH] Add a network retry count for unreliable networks

Patch looks good and useful. I'll put it on my todo list.

comment:3 Changed 15 years ago by osimons

I'm not so sure this is the right location to patch. If the server has received the request and returned a response with a status code in the error range, can't we presume that it is an actual error? The only error codes I can imagine being valid for this use are something like "503 Service Unavailable" and "502 Bad Gateway", and if so we should check specifically for such temporary states from the server/proxy side. What are the error codes you see?

However, if I take down my webserver to simulate typical connection errors I get <urlopen error (61, 'Connection refused')> in the logs and the slave keeps retrying. If I break the connection in the middle of a build, the slave just loops and restarts the build when the server is available again. Wouldn't the proper thing be to instead loop the step-posting attempt until the network is available again, so that the slave in effect continues what it is doing? In that case it really should just loop forever by default until halted, and not make this a separate setting.

Could you please elaborate on the actual messages and status codes you see in your slave logs when the network is unreliable?
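The distinction drawn in this comment, between a request that never reached the server and one the server answered with an error, could be sketched roughly as follows. This is an illustration only: `is_transient` and `TRANSIENT_STATUS` are hypothetical names, the status set is just the two codes mentioned above, and the code uses the Python 3 `urllib.error` classes rather than anything taken from the patch or the slave.

```python
import urllib.error

# Assumption: only the two codes named in the comment count as temporary.
TRANSIENT_STATUS = frozenset({502, 503})


def is_transient(exc):
    """Return True if `exc` looks like a temporary network problem."""
    # HTTPError is a subclass of URLError, so it must be checked first:
    # here the server (or a proxy) did answer, so trust the status code.
    if isinstance(exc, urllib.error.HTTPError):
        return exc.code in TRANSIENT_STATUS
    # A plain URLError means no HTTP response at all,
    # for example "Connection refused".
    if isinstance(exc, urllib.error.URLError):
        return True
    # Low-level socket failures raised outside urllib.
    return isinstance(exc, (ConnectionError, TimeoutError))
```

Under this sketch a rejected or unauthorized post (for example a 401 or 404 response) is treated as a real error and is not retried.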

comment:4 Changed 15 years ago by anonymous

There are a variety of messages/codes that can be generated here. My use case: I'm in a coffee shop, using my laptop as a build machine. A 45-minute build/test cycle completes, and the coffee shop's wireless flakes out, as coffee shop wireless is wont to do. Without this patch, I have to start the 45-minute cycle completely over. With this patch, I just add an exponential die-off retry count, and the build status will make it to the master.

comment:5 Changed 15 years ago by mike@…

Sorry for the dual comment, I submitted early, and I forgot to put in my name.

There are a variety of messages/codes that can be generated here. My use case: I'm in a coffee shop, using my laptop as a build machine. A 45-minute build/test cycle completes, and the coffee shop's wireless flakes out, as coffee shop wireless is wont to do. Without this patch, I have to start the 45-minute cycle completely over. With this patch, I just add an exponential die-off retry count, and the build status will make it to the master.

I'm loath to add any sort of mandatory infinite retry, because it's possible that the network error isn't transient. In this case, I just happen to know more about my network situation than the HTTP spec does. Ideally, on a network error the slave would just do the next build step, and buffer the build status until the network comes back up again (or give up and forget about it eventually if it doesn't come back) but that would have required more code, and this does exactly what I needed it to :)
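For comparison, the "buffer the build status" alternative mentioned above might look something like the sketch below. It is not part of the patch; `StatusBuffer`, `send` and `max_pending` are hypothetical names, and "give up eventually" is approximated by a bounded queue that drops the oldest report when full.

```python
from collections import deque


class StatusBuffer:
    """Queue build-status reports while the network is down (illustrative only)."""

    def __init__(self, send, max_pending=100):
        # `send` is any callable that posts one report and raises OSError
        # (or a subclass, such as ConnectionRefusedError) on network failure.
        self._send = send
        # A bounded deque silently discards the oldest entry when full.
        self._pending = deque(maxlen=max_pending)

    def submit(self, report):
        """Queue a report, then try to deliver everything that is pending."""
        self._pending.append(report)
        return self.flush()

    def flush(self):
        """Send queued reports in order; stop at the first network failure.

        Returns the number of reports delivered by this call.
        """
        sent = 0
        while self._pending:
            try:
                self._send(self._pending[0])
            except OSError:
                break  # still offline: keep the report for the next attempt
            self._pending.popleft()
            sent += 1
        return sent

    def __len__(self):
        return len(self._pending)
```

The slave would call `submit()` after each step and carry on with the next one regardless of the result, which is the extra code the retry count avoids.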

comment:6 Changed 15 years ago by osimons

  • Milestone changed from 0.6 to 0.6.1

I can see why it works, of course, provided all is correct with the request. However, if someone has invalidated your build at some point during those 45 minutes, your slave will try making the invalid post over and over. The same goes if authentication fails, if there are problems with the included XML, or really anything out of the ordinary.

I like the retry idea, but it needs to be tuned for the class of errors it is intended to catch. The problem is not critical for a 0.6 release, so I'm rescheduling it and will look at it again before too long.
