Bitten Master/Slave Protocol, using HTTP
Note: This is a proposal, and the final or current implementation may differ. Please refer to the documentation for the current state of features and commands.
This is a proposal for an HTTP-based protocol enabling communication between the build master and various build slaves. The protocol presented here is not final yet. Implementation was done on the sandbox/http@440 branch, and has been merged to trunk as of r438.
Comparison to the Previous BEEP Protocol
The BEEP-based protocol currently used by Bitten is described on Master Slave Protocol.
The differences can be summarized as follows:
- The build master would be a simple HTTP server, implemented as part of the Trac plugin. That means there would no longer be a separate daemon process needed for the master.
- The build slaves are simply HTTP clients, probably using httplib2 and falling back to the httplib or urllib modules in the standard library.
- Both SSL and the various authentication methods of HTTP can be used to secure the communication.
- Communication always flows from the slave to the master. The master no longer initiates actions on the slave; instead, the slave polls the master for pending actions when it is idle.
- The build master would no longer be responsible for packaging tarballs and sending them to the slaves; instead, the slaves receive connection details for the repository and perform a normal checkout. This has been requested in a long-standing ticket.
Build Creation
A new slave connects to the build master and “asks” the master whether there are any pending builds it could perform. The slave does this by POSTing its profile to the master. The profile contains information such as:
- the platform/architecture of the slave machine,
- the operating system,
- the product name and version number of each of the dependencies of the project to build (for example, the C compiler or the Python runtime), and
- the name and email address of the maintainer of the machine.
    POST /builds/ HTTP/1.1
    Host: example.org
    Content-Type: application/x-bitten+xml
    Content-Length: 666

    <slave name="lamech">
      <maintainer>Christopher Lenz &lt;cmlenz@gmx.de&gt;</maintainer>
      <platform>Power Macintosh</platform>
      <os family="posix" version="8.1.0">Darwin</os>
    </slave>
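A minimal sketch of how a slave might assemble this profile, using only the standard library. The function name `build_slave_profile` and its parameters are hypothetical and chosen for illustration; only the element and attribute names come from the example above.

```python
import xml.etree.ElementTree as ET


def build_slave_profile(name, maintainer, platform, os_name, os_family, os_version):
    """Return the <slave> profile document as UTF-8 encoded bytes."""
    slave = ET.Element("slave", {"name": name})
    ET.SubElement(slave, "maintainer").text = maintainer
    ET.SubElement(slave, "platform").text = platform
    os_elem = ET.SubElement(slave, "os", {"family": os_family, "version": os_version})
    os_elem.text = os_name
    # ElementTree escapes "<", ">" and "&" in text nodes, so an e-mail
    # address in angle brackets still yields well-formed XML.
    return ET.tostring(slave, encoding="utf-8")


body = build_slave_profile(
    name="lamech",
    maintainer="Christopher Lenz <cmlenz@gmx.de>",
    platform="Power Macintosh",
    os_name="Darwin",
    os_family="posix",
    os_version="8.1.0",
)

# Headers for the POST to /builds/; Content-Length is computed from the
# actual body rather than hard-coded.
headers = {
    "Content-Type": "application/x-bitten+xml",
    "Content-Length": str(len(body)),
}
```

The resulting `body` and `headers` could be passed to any HTTP client library for the POST request.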
If the build master finds a pending build whose target platform matches the slave, it would send back a response similar to the following:

    HTTP/1.1 201 Created
    Location: http://example.org/builds/trunk/123/
    Set-Cookie: slave=lamech; Path=/builds/trunk/123/
The response contains the URL to a build recipe as the value of the Location header. At this point, the master has allocated a pending build entity in its database. The progress on this build can be viewed as HTML at the specified URL using any HTTP user agent.
The master also sets a cookie on the slave so that it can be identified on subsequent requests. In the example above, the cookie contains only the slave name; we'll probably need to include more information, such as when the build was started.
On the other hand, if the master has no work for the slave, it would return a 204 No Content response:
HTTP/1.1 204 No Content
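The two outcomes can be told apart by the status code alone. The following is a sketch of that decision on the slave side; `recipe_url_from_response` is a hypothetical helper, and treating every other status as an error is an assumption, not something the proposal specifies.

```python
def recipe_url_from_response(status, headers):
    """Return the build URL for a 201 response, or None for a 204 response."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {key.lower(): value for key, value in headers.items()}
    if status == 201:
        location = normalized.get("location")
        if not location:
            raise ValueError("201 response without a Location header")
        return location
    if status == 204:
        # No pending build for this slave; the caller should poll again later.
        return None
    # Assumption: any other status is treated as a protocol error.
    raise ValueError("unexpected status: %d" % status)
```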
Open issue: we'd need to either repost the slave name/info with every request, or set a cookie that identifies the slave on subsequent requests.
Build Initiation
When the slave has received the URL to a build recipe, it can request the build recipe using a simple GET request:
    GET /builds/trunk/123/ HTTP/1.1
    Host: example.org
    Cookie: slave=lamech
    Accept: application/x-bitten+xml
If the master still has that build in pending state in the database, it will respond with the recipe:
    HTTP/1.1 200 OK
    Content-Type: application/x-bitten+xml
    Content-Length: 666

    <build path="trunk" revision="42"
           xmlns:python="http://bitten.cmlenz.net/tools/python"
           xmlns:svn="http://bitten.cmlenz.net/tools/svn">
      <step id="checkout">
        <svn:checkout url="http://svn.example.org/repos/"
                      path="${path}" revision="${revision}" />
      </step>
      <step id="compile">
        <python:distutils command="build"/>
      </step>
      <step id="dist">
        <python:distutils command="sdist"/>
      </step>
      <upload>
        <file path="dist/foobar*.tar.gz"/>
        <file path="dist/foobar*.zip"/>
      </upload>
    </build>
The first step would almost always be a “checkout” step that retrieves the source from the version control repository.
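As an illustration, a slave could read such a recipe with the standard library as sketched below. The helper `parse_recipe` is hypothetical. The sketch also assumes that `${path}` and `${revision}` are filled in from the attributes of the `<build>` element; the proposal does not spell out the substitution rules.

```python
import xml.etree.ElementTree as ET
from string import Template

RECIPE = """\
<build path="trunk" revision="42"
       xmlns:python="http://bitten.cmlenz.net/tools/python"
       xmlns:svn="http://bitten.cmlenz.net/tools/svn">
  <step id="checkout">
    <svn:checkout url="http://svn.example.org/repos/"
                  path="${path}" revision="${revision}" />
  </step>
  <step id="compile"><python:distutils command="build"/></step>
  <step id="dist"><python:distutils command="sdist"/></step>
  <upload>
    <file path="dist/foobar*.tar.gz"/>
    <file path="dist/foobar*.zip"/>
  </upload>
</build>
"""


def parse_recipe(xml_text):
    """Return (steps, uploads) extracted from a recipe document.

    steps is a list of (step_id, [(qualified_tag, attributes), ...]);
    uploads is the list of file patterns from the <upload> element.
    """
    root = ET.fromstring(xml_text)
    # Assumption: ${...} placeholders refer to attributes of <build>.
    variables = dict(root.attrib)
    steps = []
    for step in root.findall("step"):
        commands = []
        for command in step:
            attrs = {
                key: Template(value).safe_substitute(variables)
                for key, value in command.attrib.items()
            }
            # ElementTree reports namespaced tags as "{namespace-uri}name".
            commands.append((command.tag, attrs))
        steps.append((step.get("id"), commands))
    uploads = [item.get("path") for item in root.findall("upload/file")]
    return steps, uploads


steps, uploads = parse_recipe(RECIPE)
```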
Build Status Reporting
As soon as the slave has received the recipe, it should perform the checkout and execute the steps outlined in the build.
After every completed step, the slave should make a POST request to the steps member of the build collection:
    POST /builds/trunk/123/steps/ HTTP/1.1
    Host: example.org
    Cookie: slave=lamech
    Content-Type: application/x-bitten+xml
    Content-Length: 666

    <result step="test" status="success" started="2005-06-29T16:41:53" duration="7.61">
      ...
    </result>
The started attribute specifies the date and time at which processing of this step was started. The duration attribute contains the number of seconds that it took to complete the step (this may include fractions).
The <result> element may contain one or more of the following child elements:
- <error></error> elements indicate errors in the execution of the step,
- <log></log> elements contain the build log output, and
- <report></report> elements contain generated report data.
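A sketch of how the slave might serialize a step result. The helper `build_step_result` is hypothetical, and placing a plain text message inside each `<error>` element is an assumption, since the proposal does not define the content of the child elements.

```python
import xml.etree.ElementTree as ET
from datetime import datetime


def build_step_result(step_id, status, started, duration, errors=()):
    """Return a <result> document for one completed step as UTF-8 bytes."""
    result = ET.Element("result", {
        "step": step_id,
        "status": status,
        # Same timestamp layout as the example: 2005-06-29T16:41:53
        "started": started.replace(microsecond=0).isoformat(),
        # Seconds, possibly fractional.
        "duration": "%.2f" % duration,
    })
    for message in errors:
        # Assumption: an <error> element carries a plain text message.
        ET.SubElement(result, "error").text = message
    return ET.tostring(result, encoding="utf-8")


payload = build_step_result(
    "test", "success", datetime(2005, 6, 29, 16, 41, 53), 7.61
)
```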
The build is assumed to be complete after the master has received a request for every step in the recipe.
The server responds with a 201 Created response.
Uploading of Build Artifacts
If the recipe contains an <upload> element at the end (after all <step> elements), the slave is expected to upload each of the specified files. This is done using POST requests to the files member of the build collection:
    POST /builds/trunk/123/files/ HTTP/1.1
    Host: example.org
    Cookie: slave=lamech
    Content-Type: multipart/form-data
    Content-Length: 666

    ...
The server responds with a 201 Created response.
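The file paths in the recipe example contain wildcards such as `dist/foobar*.tar.gz`. The following sketch shows one way a slave might select the files to upload; `select_uploads` is a hypothetical helper, and interpreting the paths as shell-style glob patterns is an assumption.

```python
from fnmatch import fnmatchcase


def select_uploads(patterns, candidates):
    """Return the candidate paths that match at least one upload pattern."""
    # fnmatchcase compares case-sensitively on every platform, which keeps
    # the result independent of the slave's operating system.
    return [
        path for path in candidates
        if any(fnmatchcase(path, pattern) for pattern in patterns)
    ]


selected = select_uploads(
    ["dist/foobar*.tar.gz", "dist/foobar*.zip"],
    ["dist/foobar-1.0.tar.gz", "dist/foobar-1.0.zip", "dist/notes.txt"],
)
```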
Cancelling Builds
Using the BEEP protocol, the build master would mark builds as aborted if the connection to the slave was closed unexpectedly. This is no longer possible when using HTTP.
To handle the case of build slaves going away at some point between creating a build and completing it, the build master should have a configurable timeout. All in-progress builds would be checked against this timeout; if there has been no activity on a build for longer than the timeout, the master should cancel the build, resetting it to the PENDING state. If a slave later does come back to life and posts results, it would get 404 (Not Found) or 409 (Conflict) errors, and should cancel the build on its side, too.
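The timeout check on the master could look roughly like the sketch below. The dictionary-based build records, the field names `status`, `slave`, and `last_activity`, and the state constants are all hypothetical stand-ins for whatever the master stores in its database.

```python
PENDING = "pending"
IN_PROGRESS = "in-progress"


def reset_stale_builds(builds, timeout, now):
    """Reset in-progress builds with no activity for more than `timeout` seconds.

    Returns the ids of the builds that were reset to PENDING.
    """
    reset_ids = []
    for build in builds:
        idle = now - build["last_activity"]
        if build["status"] == IN_PROGRESS and idle > timeout:
            build["status"] = PENDING
            # Release the build so that another slave can pick it up.
            build["slave"] = None
            reset_ids.append(build["id"])
    return reset_ids


builds = [
    {"id": 1, "status": IN_PROGRESS, "slave": "lamech", "last_activity": 100.0},
    {"id": 2, "status": IN_PROGRESS, "slave": "enoch", "last_activity": 950.0},
    {"id": 3, "status": PENDING, "slave": None, "last_activity": 0.0},
]
stale = reset_stale_builds(builds, timeout=600, now=1000.0)
```

In this example only build 1 has been idle for more than 600 seconds, so it is the only one reset.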
There should probably be a background thread posting heartbeat requests to the master while lengthy build steps are executed.
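One possible shape for such a heartbeat, sketched with the standard `threading` module. The class name `Heartbeat` and the `send` callable are hypothetical; in a real slave, `send` would issue a request to the master, whose format the proposal leaves open.

```python
import threading
import time


class Heartbeat:
    """Call `send` every `interval` seconds on a background thread."""

    def __init__(self, send, interval):
        self._send = send
        self._interval = interval
        self._stopped = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # Event.wait returns False on timeout and True once stop is requested,
        # so the loop ends promptly when the build step finishes.
        while not self._stopped.wait(self._interval):
            self._send()

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self._stopped.set()
        self._thread.join()
        return False


beats = []
with Heartbeat(lambda: beats.append(time.monotonic()), interval=0.01) as heartbeat:
    time.sleep(0.2)  # stands in for a lengthy build step
```

Using a context manager ensures the heartbeat stops even if the build step raises an exception.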