Version 11 (modified by cmlenz, 19 years ago)

Updated path to orchestration profile

Bitten Master/Slave Protocol

To decouple the master and slave, an application protocol will be defined on top of the meta-protocol BEEP (Blocks Extensible Exchange Protocol, RFC 3080). BEEP was chosen because it provides peer-to-peer communication (so that both the client and the server can send requests to the other) and because of its relative simplicity compared to protocols such as XMPP.

Why BEEP?

I first looked at Jabber/XMPP, but it seemed to be very complex (with dozens of related specifications), and there are no sufficiently mature implementations for Python. I could live with the complexity, but not if I have to implement the whole stack myself. I didn't look into other IM protocols because I wanted to build on something open and standardized. Note that even if I had chosen XMPP or something similar, I would still have to design a protocol on top of the provided infrastructure.

BEEP is simple and flexible, and explicitly designed as a foundation for custom application protocols. While the only Python implementation I found (BEEPy) uses Twisted and looks dead, BEEP is really simple enough to be implemented in a basic way in the scope of this project (i.e. minus the authentication and security features, which could of course be added later).

Slave Registration

A new client connects to the build master and signals its availability for executing builds by starting a channel for the Bitten orchestration profile.

First, the server needs to query some information about the client for orchestration:

  • Platform/architecture
  • Operating system
  • The product name and version number of each of the dependencies of the project to build (for example, the C compiler or the Python runtime).
  • Name and email address of the maintainer

After the Bitten channel has been started, the client would send a message like this to the server:

  MSG 1 0 . 0 78
  Content-Type: application/beep+xml
    
  <register name="levi" maintainer="Christopher Lenz &lt;cmlenz@gmx.de&gt;">
    <platform>Power Macintosh</platform>
    <os family="posix" version="8.1.0">Darwin</os>
  </register>
  END

The server acknowledges that it received the registration with a positive or negative reply.
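As a rough illustration, the registration message could be assembled along the following lines. This is only a sketch: `register_payload` and `beep_frame` are hypothetical helper names, not part of Bitten, and the frame layout follows the general header/payload/trailer structure of RFC 3080. The size field is computed from the actual payload, so it will not match the illustrative value used in the example message.

```python
import xml.etree.ElementTree as ET

MIME_HEADER = b"Content-Type: application/beep+xml\r\n\r\n"


def register_payload(name, maintainer, platform, os_name, os_family, os_version):
    """Serialize a <register> element, preceded by its MIME header."""
    register = ET.Element("register", {"name": name, "maintainer": maintainer})
    ET.SubElement(register, "platform").text = platform
    os_elem = ET.SubElement(register, "os", {"family": os_family, "version": os_version})
    os_elem.text = os_name
    # encoding="unicode" yields a str without an XML declaration.
    xml = ET.tostring(register, encoding="unicode")
    return MIME_HEADER + xml.encode("utf-8")


def beep_frame(kind, channel, msgno, seqno, payload, more=False, ansno=None):
    """Wrap a payload in a BEEP frame header and trailer."""
    fields = [kind, channel, msgno, "*" if more else ".", seqno, len(payload)]
    if ansno is not None:  # only ANS frames carry an answer number
        fields.append(ansno)
    header = " ".join(str(field) for field in fields) + "\r\n"
    return header.encode("ascii") + payload + b"END\r\n"


payload = register_payload(
    name="levi",
    maintainer="Christopher Lenz <cmlenz@gmx.de>",
    platform="Power Macintosh",
    os_name="Darwin",
    os_family="posix",
    os_version="8.1.0",
)
frame = beep_frame("MSG", channel=1, msgno=0, seqno=0, payload=payload)
```

Letting the XML library escape the angle brackets in the maintainer attribute avoids having to write `&lt;` and `&gt;` by hand.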

Next, the server checks whether there are any pending builds for that client (see Build Configurations). For example, if it is the only client that supports GCC 4.0, and there has been no build of some revision with GCC 4 yet, it will initiate a build on that client. In any case, the server remembers the client configuration for as long as the connection is open, and may choose to route build requests to that client when repository changes are detected or when a build is otherwise triggered.

Build Initiation

When the build server detects that builds are necessary for some revision of the project, it queries its database of available slaves and chooses a set of slaves with non-overlapping configurations. For example, if there are 10 clients available that could execute the build of a Java project on Windows 2000 with JDK 1.4, it will only select one of those to actually perform the build.
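One possible reading of "non-overlapping configurations" is that the master keeps one slave per distinct configuration. The sketch below assumes that reading; the record layout (`platform`, `os`, `tools`) and the function names are hypothetical and only serve the illustration.

```python
def config_key(slave):
    """Reduce a slave record to a hashable description of its configuration."""
    return (slave["platform"], slave["os"], frozenset(slave["tools"].items()))


def select_slaves(available):
    """Keep the first slave seen for each distinct configuration."""
    chosen = {}
    for slave in available:
        chosen.setdefault(config_key(slave), slave)
    return list(chosen.values())


# Ten interchangeable Windows slaves plus one slave with a different setup.
windows = [
    {"name": f"win{i}", "platform": "x86", "os": "Windows 2000", "tools": {"jdk": "1.4"}}
    for i in range(10)
]
mac = {"name": "levi", "platform": "Power Macintosh", "os": "Darwin", "tools": {"jdk": "1.4"}}
selected = select_slaves(windows + [mac])
```

With this input, only one of the ten Windows slaves is selected, alongside the single Darwin slave.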

A build request might look like this (the text is optional and only provided for diagnostic purposes):

  MSG 1 1 . 0 78
  Content-Type: application/beep+xml
  
  <build recipe="path/to/recipe.xml">trunk as of revision 492</build>
  END

The build request must include the path to the recipe file relative to the root of the code base.

A client can decline a build request, in which case the build master selects the next available client with the same (or sufficiently similar) configuration. A build request is declined using a negative reply containing an <error></error> element in the payload:

  ERR 1 1 . 0 60
  Content-Type: application/beep+xml
  
  <error code="550">Too busy</error>
  END

In this case the slave remains in the pool maintained by the master, but the master should attempt to prioritize slaves that accept build requests over those that regularly reject them, so as to avoid constantly polling the latter with requests that will probably be rejected again anyway.
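The prioritization is not specified further here. One simple way to approximate it, purely as an assumption, is to track how often each slave declines and to try slaves with a lower decline rate first. `AcceptanceLog` is a hypothetical name.

```python
from collections import Counter


class AcceptanceLog:
    """Track accepted and declined build requests per slave."""

    def __init__(self):
        self.accepted = Counter()
        self.declined = Counter()

    def record(self, slave, accepted):
        """Note the outcome of one build request sent to `slave`."""
        (self.accepted if accepted else self.declined)[slave] += 1

    def decline_rate(self, slave):
        """Fraction of requests the slave declined; 0.0 for unknown slaves."""
        total = self.accepted[slave] + self.declined[slave]
        return self.declined[slave] / total if total else 0.0

    def prioritize(self, candidates):
        """Order candidates so that rarely declining slaves are tried first."""
        return sorted(candidates, key=self.decline_rate)


log = AcceptanceLog()
for _ in range(3):
    log.record("busy", accepted=False)
log.record("busy", accepted=True)
log.record("idle", accepted=True)
order = log.prioritize(["busy", "idle", "new"])
```

Because Python's sort is stable, slaves with the same decline rate keep their original relative order.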

TODO: Specify error scenarios and error codes.

Build Execution

If the client accepts a build request by sending a positive reply, the server will transmit a tarball of the code base that is to be built. The client does not need to know which exact revision (or branch) of the project it is building, nor does it need to perform a checkout itself.

A client accepts a build request by responding with a RPY message containing a <proceed></proceed> element in the payload. The reply must contain a list of archive formats that the slave supports for transmission of the code. For example:

  RPY 1 1 . 0 123
  Content-Type: application/beep+xml

  <proceed>
   <accept type="application/tar" encoding="bzip2" />
   <accept type="application/tar" encoding="gzip" />
  </proceed>
  END

In this message, the client indicates that it will accept tar archives with bzip2 or gzip compression (preferring the former). Another client might specify that it supported only ZIP archives, for example.
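On the master side, the negotiation could look roughly like this: walk the <accept> elements in the order given (the order expresses the client's preference) and take the first format the master can produce. The function name and the representation of formats as `(type, encoding)` tuples are assumptions made for this sketch.

```python
import xml.etree.ElementTree as ET


def choose_archive_format(reply_xml, supported):
    """Return the first (type, encoding) pair from the reply that is supported."""
    root = ET.fromstring(reply_xml)
    if root.tag != "proceed":
        raise ValueError(f"expected <proceed>, got <{root.tag}>")
    for accept in root.findall("accept"):
        candidate = (accept.get("type"), accept.get("encoding"))
        if candidate in supported:
            return candidate
    return None  # no common format; the master has to pick another slave


REPLY = """<proceed>
 <accept type="application/tar" encoding="bzip2" />
 <accept type="application/tar" encoding="gzip" />
</proceed>"""

both = {("application/tar", "gzip"), ("application/tar", "bzip2")}
preferred = choose_archive_format(REPLY, both)
```

Here `preferred` is the bzip2 variant, because the client listed it first.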

After having received such a reply, the master can proceed by transmitting a snapshot of the code base to the slave:

  MSG 1 2 * 0 78
  Content-Type: application/tar
  Content-Disposition: myproject-r456.tar
  Content-Transfer-Encoding: gzip
  
  ...

The client may respond to this transmission either with a negative reply (ERR containing an <error></error> element with a description of the error), or by starting a sequence of ANS replies, terminated by a final NUL message (see next section).

TODO: Specify error scenarios and error codes.

Build Status Reporting

After having received and unpacked the snapshot archive, and having successfully parsed the build recipe, the slave responds with an ANS message containing a <started/> element in the payload:

  ANS 1 2 . 0 54 0
  Content-Type: application/beep+xml

  <started time="2005-06-29T16:41:22"/>
  END

The time attribute contains the date and time (in ISO 8601 format) at which the build was started. These timestamps must be UTC, and consequently must not contain a timezone offset.
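A slave written in Python could produce such timestamps as sketched below. `format_timestamp` is a hypothetical helper; it converts any timezone-aware datetime to UTC and omits the offset, as required above.

```python
from datetime import datetime, timedelta, timezone


def format_timestamp(moment):
    """Render an aware datetime as UTC ISO 8601 without a timezone offset."""
    if moment.tzinfo is None:
        raise ValueError("refusing to guess the timezone of a naive datetime")
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S")


# The same instant, expressed in UTC and in a zone two hours ahead of UTC.
utc_start = datetime(2005, 6, 29, 16, 41, 22, tzinfo=timezone.utc)
local_start = datetime(2005, 6, 29, 18, 41, 22, tzinfo=timezone(timedelta(hours=2)))
```

Both values format to `2005-06-29T16:41:22`.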

The slave then begins executing the steps in the recipe one-by-one (in the order they appear in the file). After each step of the build recipe, the client informs the server, with ANS messages containing a <step/> element in the payload, about the step it has processed, and what the outcome was (success or failure):

  ANS 1 2 . 0 92 1
  Content-Type: application/beep+xml

  <step id="test" description="Run all unit tests" result="success"
        time="2005-06-29T16:41:53" duration="7.61"/>
  END

The time attribute specifies the date and time at which processing of this step was started. The duration attribute contains the number of seconds that it took to complete the step (this may include fractions).

In case of an error, the message should include the primary error message in the body of the <step></step> element:

  ANS 1 2 . 0 135 1
  Content-Type: application/beep+xml

  <step id="test" description="Run all unit tests" result="failure"
        time="2005-06-29T16:41:53" duration="7.61">
    Could not load command "unittest".
  </step>
  END
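The master might turn such a <step> payload into a small record, for example as follows. `StepResult` and `parse_step` are hypothetical names, and treating every result other than "success" as a failure is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class StepResult:
    id: str
    description: str
    success: bool
    time: str
    duration: float
    error: str


def parse_step(payload):
    """Parse the XML body of a <step> ANS message."""
    elem = ET.fromstring(payload)
    if elem.tag != "step":
        raise ValueError(f"expected <step>, got <{elem.tag}>")
    return StepResult(
        id=elem.get("id"),
        description=elem.get("description", ""),
        success=elem.get("result") == "success",
        time=elem.get("time"),
        duration=float(elem.get("duration", "0")),
        # The element body carries the primary error message, if any.
        error=(elem.text or "").strip(),
    )


failed = parse_step(
    """<step id="test" description="Run all unit tests" result="failure"
        time="2005-06-29T16:41:53" duration="7.61">
    Could not load command "unittest".
  </step>"""
)
```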

TODO: Transmission of build log and generated reports to the master

After the slave has processed all of the build steps, it sends an ANS message containing the element <completed/> in the payload:

  ANS 1 2 . 0 66 2
  Content-Type: application/beep+xml

  <completed time="2005-06-29T16:44:02"/>
  END

Furthermore, in case the slave is unexpectedly interrupted while executing a build, it should send an ANS message containing the element <aborted></aborted> in the payload:

  ANS 1 2 . 0 66 2
  Content-Type: application/beep+xml

  <aborted>Build cancelled</aborted>
  END

Usually, the slave will disconnect directly after having aborted a build, but this is not required. If it stays connected, it should remain in the slave pool maintained by the master until the orchestration channel is closed.

In any case, the slave must finish this exchange by sending a final NUL message to the master.

  NUL 1 2 . 0 0
  END

At this point, the build is considered completed (or aborted), and the master is free to initiate a new build on that slave.
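The reply sequence described in this section (started, any number of steps, then completed or aborted, and finally NUL) can be followed on the master with a small tracker. The sketch below is an assumption about how that might be organized. `BuildTracker` is a hypothetical name, and it does not validate the order of messages.

```python
import xml.etree.ElementTree as ET


class BuildTracker:
    """Follow the ANS/NUL replies a slave sends for one build."""

    def __init__(self):
        self.state = "waiting"   # waiting -> running -> completed | aborted
        self.steps = []          # (step id, result) in order of arrival
        self.finished = False    # set once the final NUL has been seen

    def on_ans(self, payload):
        """Handle the XML body of one ANS message."""
        elem = ET.fromstring(payload)
        if elem.tag == "started":
            self.state = "running"
        elif elem.tag == "step":
            self.steps.append((elem.get("id"), elem.get("result")))
        elif elem.tag in ("completed", "aborted"):
            self.state = elem.tag
        else:
            raise ValueError(f"unexpected element <{elem.tag}>")

    def on_nul(self):
        """Handle the terminating NUL; the slave is free for new builds."""
        self.finished = True


tracker = BuildTracker()
tracker.on_ans('<started time="2005-06-29T16:41:22"/>')
tracker.on_ans('<step id="test" result="success" time="2005-06-29T16:41:53" duration="7.61"/>')
tracker.on_ans('<completed time="2005-06-29T16:44:02"/>')
tracker.on_nul()
```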

Error Handling

A build slave can abort the build whenever it wants (announcing it and saying goodbye), or it can just disconnect (as would happen on a hard shutdown of the machine). Both can be easily detected by the build master, in which case it will choose the next client from its list that matches the given requirements.

Another case to handle is a timeout: the client has started a build but then fails to respond for an exceptionally long period of time. The server would then disconnect and choose the next available slave, as above. The timeout would have to be configurable, as it may vary significantly between projects.
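A configurable timeout of this kind could be tracked with a small watchdog per slave, reset whenever a frame arrives. This is a sketch under assumptions: `SlaveWatchdog` is a hypothetical name, and the clock is injectable only so the behaviour can be demonstrated without waiting.

```python
import time


class SlaveWatchdog:
    """Detect a slave that has been silent for longer than `timeout` seconds."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock
        self.last_seen = clock()

    def touch(self):
        """Call whenever any frame arrives from the slave."""
        self.last_seen = self.clock()

    def expired(self):
        """True if the slave has been silent for longer than the timeout."""
        return self.clock() - self.last_seen > self.timeout


# Demonstration with a fake clock instead of real waiting.
now = [0.0]
watchdog = SlaveWatchdog(timeout=600, clock=lambda: now[0])
now[0] = 500.0
still_alive = not watchdog.expired()
watchdog.touch()          # a <step> message arrived at t=500
now[0] = 1101.0
timed_out = watchdog.expired()
```

When `expired()` returns true, the master would close the connection and hand the build to the next matching slave.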
