Edgewall Software

Bitten Master/Slave Protocol

This document refers to the original protocol as implemented by Bitten <= 0.5.x. For all newer (and supported) versions of Bitten, see wiki:Documentation for the reference documentation and user guide.

To decouple the build master and build slave, Bitten defines an application-level protocol on top of the meta-protocol BEEP (Blocks Extensible Exchange Protocol, RFC 3080). BEEP was chosen because it provides peer-to-peer communication (so that both the client and the server can initiate exchanges), and because of its relative simplicity compared to protocols such as XMPP.

BEEP is simple and flexible, and is explicitly designed as a foundation for custom application protocols. Bitten includes a simple implementation of BEEP that does not yet support some of the advanced protocol features, such as authentication (SASL) and privacy/encryption (TLS).

Protocol Overview

Any build slave will connect to exactly one build master, but the master can be connected to a theoretically unlimited number of slaves simultaneously. The connections between master and slave are kept alive across many exchanges.

The following diagram shows an example of the exchanges between a single build slave and a build master.

Figure: Overview of the orchestration protocol

This includes the registration of the slave with the master, the initiation of a build by the build master, and finally the actual execution of the build by the slave. These phases are explained in detail in the following sections.

Slave Registration

A new client connects to the build master and signals its availability for executing builds by starting a channel for the Bitten orchestration profile.

First, the master needs some information about the slave for orchestration:

  • The platform/architecture of the slave machine,
  • the operating system,
  • the product name and version number of each of the dependencies of the project to build (for example, the C compiler or the Python runtime), and
  • the name and email address of the maintainer of the machine.

After the build orchestration channel has been started, the client would send a message like this to the server:

  MSG 1 0 . 0 78
  Content-Type: application/beep+xml
    
  <register name="levi" maintainer="Christopher Lenz &lt;cmlenz@gmx.de&gt;">
    <platform>Power Macintosh</platform>
    <os family="posix" version="8.1.0">Darwin</os>
  </register>
  END
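As an illustration, the payload of the registration message above could be assembled as in the following sketch. It uses Python's standard xml.etree.ElementTree module rather than Bitten's own XML helpers, and the function name build_register_payload and its parameters are hypothetical. The BEEP frame header and trailer are not produced here.

```python
import xml.etree.ElementTree as ET


def build_register_payload(name, maintainer, platform, os_name, os_family, os_version):
    """Return a <register> payload as a string (hypothetical helper)."""
    register = ET.Element("register", {"name": name, "maintainer": maintainer})
    ET.SubElement(register, "platform").text = platform
    os_elem = ET.SubElement(register, "os", {"family": os_family, "version": os_version})
    os_elem.text = os_name
    # The serializer escapes "<" in attribute values, so the maintainer's
    # email address can be passed in unescaped.
    return ET.tostring(register, encoding="unicode")


payload = build_register_payload(
    name="levi",
    maintainer="Christopher Lenz <cmlenz@gmx.de>",
    platform="Power Macintosh",
    os_name="Darwin",
    os_family="posix",
    os_version="8.1.0",
)
```

Building the element tree instead of formatting a string by hand avoids escaping mistakes in attribute values such as the maintainer address.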

The server acknowledges that it received the registration with a positive or negative reply, using the <ok/> or <error/> elements in the payload, respectively.

The master may reject the registration of a slave if no build configuration has a target platform that matches the properties of the slave. Effectively this means that the build master doesn't have any build that the slave could perform. Registration of a slave may also be rejected if there are already too many slaves connected to the build master.

If registration of the slave is accepted, the server checks whether there are any pending builds for the target platform matching the slave. For example, if it is the only slave that supports GCC 4.0, and there has been no build of some revision with GCC 4 yet, the build master will initiate a build on that slave. In any case, the master remembers the slave configuration for as long as the connection is open, and may choose to route build requests to that machine when repository changes are detected.

Build Initiation

When the build server detects that builds are necessary for some revision of the project, it queries its database of available slaves and chooses a set of slaves with non-overlapping configurations. For example, if there are 10 clients available that could execute the build of a Java project on Windows 2000 with JDK 1.4, it will only select one of those to actually perform the build.
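The selection of slaves with non-overlapping configurations can be sketched roughly as follows. This is only an illustration: it assumes that a slave's configuration can be represented as a hashable tuple and that two configurations overlap only when they are equal. The function name select_slaves and the sample data are hypothetical.

```python
def select_slaves(slaves):
    """Pick one slave per distinct configuration.

    `slaves` is an iterable of (name, config) pairs, where `config` is a
    hashable value such as a tuple (an assumption made for this sketch).
    Returns a dict mapping each configuration to the chosen slave name.
    """
    chosen = {}
    for name, config in slaves:
        # Keep the first slave seen for each configuration, skip the rest.
        chosen.setdefault(config, name)
    return chosen


available = [
    ("win-01", ("Windows 2000", "JDK 1.4")),
    ("win-02", ("Windows 2000", "JDK 1.4")),
    ("mac-01", ("Darwin", "JDK 1.4")),
]
selected = select_slaves(available)
```

With the sample data, only one of the two Windows 2000 slaves is selected, alongside the single Darwin slave.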

A build request consists of the name of the project and the build recipe that contains the instructions that the slave must follow to execute the build:

  MSG 1 1 . 0 78
  Content-Type: application/beep+xml
  
  <build project="example" xmlns:python="http://bitten.cmlenz.net/tools/python">
    <step id="compile">
      <python:distutils command="build"/>
    </step>
    <step id="dist">
      <python:distutils command="sdist"/>
    </step>
  </build>
  END

The slave should validate the build recipe and check whether all of the referenced recipe commands are available before starting the build. In case of a problem, the slave must decline the build request using a negative reply containing an <error></error> element in the payload.

  ERR 1 1 . 0 60
  Content-Type: application/beep+xml
  
  <error code="550">
    Unsupported recipe command http://bitten.cmlenz.net/tools/python#distutils
  </error>
  END
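The validation step described above might look like the following sketch. It assumes that a recipe command is identified by its XML namespace and local name joined with "#", as in the error message of the example. The set AVAILABLE_COMMANDS, its single entry, and both function names are hypothetical, and the code uses the standard xml.etree.ElementTree module rather than Bitten's own XML helpers.

```python
import xml.etree.ElementTree as ET

# Hypothetical set of commands this slave can execute, as "namespace#name".
AVAILABLE_COMMANDS = {"http://example.org/tools/demo#run"}


def find_unsupported_commands(recipe_xml, available=AVAILABLE_COMMANDS):
    """Return the recipe commands that are not in `available`, without duplicates."""
    root = ET.fromstring(recipe_xml)
    unsupported = []
    for step in root.findall("step"):
        for command in step:
            tag = command.tag
            if tag.startswith("{"):
                # ElementTree spells namespaced tags as "{namespace}local".
                namespace, local = tag[1:].split("}", 1)
                qualified = namespace + "#" + local
            else:
                qualified = tag
            if qualified not in available and qualified not in unsupported:
                unsupported.append(qualified)
    return unsupported


def build_error_payload(message, code="550"):
    """Return an <error> payload for a negative reply."""
    error = ET.Element("error", {"code": code})
    error.text = message
    return ET.tostring(error, encoding="unicode")


recipe = """<build project="example" xmlns:python="http://bitten.cmlenz.net/tools/python">
  <step id="compile"><python:distutils command="build"/></step>
  <step id="dist"><python:distutils command="sdist"/></step>
</build>"""
missing = find_unsupported_commands(recipe)
reply = build_error_payload("Unsupported recipe command " + missing[0]) if missing else None
```

A slave would send the resulting payload in an ERR frame only when the list of missing commands is not empty.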

A build initiation can also be declined because the machine on which the slave process is running is under too high a load.

When a build request is declined, the build master must select the next available client with the same (or sufficiently similar) configuration. The slave remains in the pool maintained by the master, but the master should attempt to prioritize slaves that accept build requests over those that regularly reject them, so as to avoid constantly polling the latter with requests that will probably be rejected again anyway.
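One possible way to prioritize slaves that tend to accept requests is sketched below. The protocol does not prescribe a strategy, so the decline ratio used here is only an assumption, and the class SlaveStats with its methods is hypothetical.

```python
from collections import defaultdict


class SlaveStats:
    """Track accepted and declined build requests per slave (hypothetical)."""

    def __init__(self):
        self.accepted = defaultdict(int)
        self.declined = defaultdict(int)

    def record(self, slave, accepted):
        """Record the outcome of one build request sent to `slave`."""
        if accepted:
            self.accepted[slave] += 1
        else:
            self.declined[slave] += 1

    def decline_ratio(self, slave):
        """Fraction of requests the slave declined; 0.0 if it was never asked."""
        total = self.accepted[slave] + self.declined[slave]
        return self.declined[slave] / total if total else 0.0

    def order(self, candidates):
        """Return candidates sorted so that frequent decliners come last.

        sorted() is stable, so slaves with equal ratios keep their order.
        """
        return sorted(candidates, key=self.decline_ratio)


stats = SlaveStats()
stats.record("a", accepted=False)
stats.record("a", accepted=False)
stats.record("b", accepted=True)
ordered = stats.order(["a", "b", "c"])
```

Slaves that were never asked are treated like slaves that always accept, so a new slave is not penalized.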

Build Execution

If the client accepts a build request by sending a positive reply, the server will transmit an archive containing a snapshot of the code base that is to be built. The client does not need to know which exact revision (or branch) of the project it is building, nor does it need to perform a checkout itself.

A client accepts a build request by responding with a RPY message containing a <proceed/> element in the payload. For example:

  RPY 1 1 . 0 123
  Content-Type: application/beep+xml

  <proceed/>
  END

After having received such a reply, the master can proceed by transmitting a snapshot of the code base to the slave:

  MSG 1 2 * 0 3421
  Content-Type: application/zip
  Content-Disposition: myproject-r456.zip
  
  ...

If the slave is not able to handle the received archive, it should respond to this transmission with a negative reply:

  ERR 1 2 . 0 60
  Content-Type: application/beep+xml
  
  <error code="550">
    Invalid ZIP archive
  </error>
  END

Otherwise, the slave should proceed immediately with the execution of the build, and respond with a sequence of ANS replies, terminated by a final NUL message (see the next section).

Build Status Reporting

After having received and unpacked the snapshot archive, the slave responds with an ANS message containing a <started/> element in the payload:

  ANS 1 2 . 0 54 0
  Content-Type: application/beep+xml

  <started time="2005-06-29T16:41:22"/>
  END

The time attribute contains the date and time (in ISO 8601 format) at which the build was started. These timestamps must be UTC, and must not contain a timezone offset.
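A timestamp in this form can be produced as in the following sketch, which converts a timezone-aware datetime to UTC and formats it without an offset. The function name format_timestamp is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def format_timestamp(moment):
    """Format a timezone-aware datetime as UTC in ISO 8601, without an offset."""
    if moment.tzinfo is None:
        # A naive datetime carries no zone, so it cannot be converted safely.
        raise ValueError("expected a timezone-aware datetime")
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S")


# 18:41:22 at UTC+02:00 is 16:41:22 in UTC.
local_start = datetime(2005, 6, 29, 18, 41, 22, tzinfo=timezone(timedelta(hours=2)))
stamp = format_timestamp(local_start)
```

Rejecting naive datetimes avoids silently sending local time where the master expects UTC.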

The slave then begins executing the build steps in the recipe one-by-one, in the order they appear in the recipe. After each step is completed, the client informs the server about the step it has processed, and what the outcome was (success or failure), using an ANS message containing a <step/> element in the payload:

  ANS 1 2 . 0 92 1
  Content-Type: application/beep+xml

  <step id="test" description="Run all unit tests" result="success"
        time="2005-06-29T16:41:53" duration="7.61">
    ...
  </step>
  END

The time attribute specifies the date and time at which processing of this step was started. The duration attribute contains the number of seconds that it took to complete the step (this may include fractions).

The <step></step> element may contain one or more child elements:

  • <error></error> elements indicate errors in the execution of the step,
  • <log></log> elements contain the build log output, and
  • <report></report> elements contain generated report data.
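A step report with these child elements could be assembled as in the sketch below. The inner structure of the <log> and <error> elements is not spelled out in this document, so plain text content is assumed here for illustration. The function name build_step_payload and its parameters are hypothetical, and <report> elements are left out because their content is not described.

```python
import xml.etree.ElementTree as ET


def build_step_payload(step_id, description, result, started, duration,
                       log_text=None, errors=()):
    """Return a <step> payload as a string (hypothetical helper).

    `started` is an already formatted UTC timestamp and `duration` is the
    number of seconds the step took, possibly fractional.
    """
    step = ET.Element("step", {
        "id": step_id,
        "description": description,
        "result": result,
        "time": started,
        "duration": "%.2f" % duration,
    })
    for message in errors:
        # Assumption: each error is reported as plain text.
        ET.SubElement(step, "error").text = message
    if log_text is not None:
        # Assumption: the log output is carried as plain text.
        ET.SubElement(step, "log").text = log_text
    return ET.tostring(step, encoding="unicode")


step_payload = build_step_payload(
    "test", "Run all unit tests", "failure", "2005-06-29T16:41:53", 7.61,
    log_text="Ran 12 tests", errors=["1 test failed"],
)
```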

After the slave has processed all of the build steps, it sends a final ANS message containing the element <completed/> in the payload:

  ANS 1 2 . 0 66 2
  Content-Type: application/beep+xml

  <completed time="2005-06-29T16:44:02"/>
  END

Furthermore, in case the slave is unexpectedly interrupted while executing a build, it should send an ANS message containing the element <aborted></aborted> in the payload:

  ANS 1 2 . 0 66 2
  Content-Type: application/beep+xml

  <aborted>Build cancelled</aborted>
  END

Usually, the slave will disconnect directly after having aborted a build, but this is not required. If it stays connected, it remains in the slave pool maintained by the master until the orchestration channel is closed.

In any case, the slave must finish this exchange by sending a final NUL message to the master.

  NUL 1 2 . 0 0
  END

At this point, the build is considered completed (or aborted), and the master is free to initiate a new build on that slave.
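On the master side, following such a reply sequence could be sketched as below. The header layout (type, channel, message number, continuation indicator, sequence number, size, plus an answer number for ANS frames) follows RFC 3080. The function names, the state names, and the representation of frames as (header line, payload) pairs are assumptions made for this sketch, and it does not reassemble fragmented frames or verify sizes.

```python
import xml.etree.ElementTree as ET

FRAME_TYPES = ("MSG", "RPY", "ERR", "ANS", "NUL")


def parse_frame_header(line):
    """Parse a BEEP data frame header line into a dict."""
    fields = line.split()
    if not fields or fields[0] not in FRAME_TYPES:
        raise ValueError("unknown frame type in %r" % line)
    # ANS frames carry one extra field, the answer number.
    expected = 7 if fields[0] == "ANS" else 6
    if len(fields) != expected:
        raise ValueError("malformed frame header %r" % line)
    header = {
        "type": fields[0],
        "channel": int(fields[1]),
        "msgno": int(fields[2]),
        "more": fields[3] == "*",  # "*" means more frames follow
        "seqno": int(fields[4]),
        "size": int(fields[5]),
    }
    if fields[0] == "ANS":
        header["ansno"] = int(fields[6])
    return header


def follow_build(frames):
    """Return the final build state once the terminating NUL frame arrives.

    `frames` is an iterable of (header_line, payload_xml) pairs.
    """
    state = "pending"
    for header_line, payload in frames:
        header = parse_frame_header(header_line)
        if header["type"] == "NUL":
            return state  # the exchange is finished
        if header["type"] != "ANS":
            raise ValueError("unexpected frame type %s" % header["type"])
        tag = ET.fromstring(payload).tag
        if tag == "started":
            state = "running"
        elif tag in ("completed", "aborted"):
            state = tag
        # <step> reports do not change the overall state.
    raise ValueError("reply sequence ended without a NUL frame")


outcome = follow_build([
    ("ANS 1 2 . 0 54 0", '<started time="2005-06-29T16:41:22"/>'),
    ("ANS 1 2 . 0 92 1", '<step id="test" result="success"/>'),
    ("ANS 1 2 . 0 66 2", '<completed time="2005-06-29T16:44:02"/>'),
    ("NUL 1 2 . 0 0", ""),
])
```

Treating the NUL frame as the only valid end of the sequence matches the rule that the slave must always finish the exchange with it.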

Error Handling

A build slave can abort the build whenever it wants (announcing it and saying goodbye), or it can just disconnect (as would happen on a hard shutdown of the machine). Both can be easily detected by the build master, in which case it will choose the next client from its list that matches the given requirements.

Another case to deal with is a timeout: the client has started a build but then fails to respond for an exceptionally long period of time. The server would then disconnect and choose the next available slave, as above. The timeout would have to be configurable, as suitable values may vary significantly between projects.
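A configurable timeout of this kind could be tracked with a small helper such as the following. This is only a sketch of one possible approach: the class BuildWatchdog and its methods are hypothetical, and the clock is injectable so that the behaviour can be checked without waiting.

```python
import time


class BuildWatchdog:
    """Detect a slave that has been silent for too long (hypothetical)."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self._clock = clock
        self._last_activity = clock()

    def touch(self):
        """Call whenever a frame arrives from the slave."""
        self._last_activity = self._clock()

    def timed_out(self):
        """Return True if the slave has been silent longer than the timeout."""
        return self._clock() - self._last_activity > self.timeout_seconds


# Usage with a fake clock, so that no real waiting is needed.
now = [0.0]
watchdog = BuildWatchdog(timeout_seconds=600, clock=lambda: now[0])
now[0] = 300.0
still_alive = not watchdog.timed_out()
watchdog.touch()
now[0] = 1000.0
expired = watchdog.timed_out()
```

The master would call touch() for every ANS frame it receives and check timed_out() periodically, disconnecting the slave once it returns True.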

Last modified on Jul 25, 2014, 4:20:30 PM
