Network considerations

This tuning guide briefly explains the network philosophy of drasi, and what networking equipment is needed to achieve the highest performance.

drasi network philosophy

drasi attempts to pass data along from every node as quickly as possible. This applies to readout nodes as well as event mergers, i.e. event builders and time sorters.

The merging nodes are designed to read data from multiple sources at the same time, at the full speed of each sending node when possible. The preferred protocol is a variant of the transport protocol (called ‘drasi’ in the options). Data is sent in a streaming fashion, without any spurious acknowledgement cycles in between, such that the full output bandwidth of the sending nodes can be utilised.
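As an illustration of this design (a simplified sketch, not drasi source code; the function and buffer handling are assumptions), a merger-style process can drain several already-connected TCP senders with a single poll() loop, reading whatever each connection has available instead of acknowledging every buffer:

  /* Minimal sketch: drain several already-connected TCP senders
   * concurrently with one poll() loop.  Each readable socket is read
   * as far as it will go, with no per-buffer acknowledgements, so
   * every sender can stream at its own full speed.
   */
  #include <poll.h>
  #include <unistd.h>

  void drain_senders(int *fds, int nfds)
  {
    struct pollfd pfd[16];            /* assume nfds <= 16 for this sketch */
    char buf[65536];
    int i;

    for (i = 0; i < nfds; i++) {
      pfd[i].fd = fds[i];
      pfd[i].events = POLLIN;
    }

    for (;;) {
      if (poll(pfd, (nfds_t) nfds, -1) < 0)
        break;                        /* error handling omitted */

      for (i = 0; i < nfds; i++) {
        if (pfd[i].revents & POLLIN) {
          ssize_t n = read(pfd[i].fd, buf, sizeof buf);
          if (n <= 0)
            pfd[i].fd = -1;           /* sender done; ignore this entry */
          /* else: hand n bytes to this source's input buffer */
        }
      }
    }
  }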

The cost of hardware

The cost of an entire data acquisition system is usually completely dominated by the front-end digitiser modules (ADCs, TDCs, QDCs). This is generally followed by any crates needed to host those modules, together with readout processors that have access to the particular data buses being used (e.g. VME, CAMAC, and other custom buses).

Very cheap in comparison are the ordinary computers used for event building and time sorting, and the network switches. This applies even if the most performant hardware available is used for these last categories of equipment.

This very skewed cost distribution implies that:

  • The data transport pipeline should be designed to make the best use of the front-end modules and readout processor hardware (including its network cards). Drasi is designed accordingly.

  • If needed, network hardware and event merger nodes should be upgraded. This is a very cheap investment. Most often, what is needed is a faster network card for the event merger node and a switch supporting that speed.

    Such equipment can often be acquired as used machines directly from a nearby computing centre, with completely satisfactory results. Server-grade network equipment can usually also be bought second-hand, but that is a topic for a different howto. (A satisfactory result is when the hardware is fast enough not to impact the total DAQ performance.)

Use a fat switch!

The above discussions make the network design straightforward:

When multiple readout nodes are involved, the event merger node should generally have a network interface card in a speed grade above that of the readout nodes' network interfaces. If, e.g., some readout nodes have heavily used 1 Gbps interfaces, this implies a 10 Gbps interface for the merger node. (This is < 300 EUR.)

Together with this, the readout nodes and the merger node should be connected using a single switch that supports the higher speed for the needed number of ports. (Continuing the above example, a switch with 4x10 Gbps and 24x1 Gbps ports is < 1500 EUR.)
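As a back-of-the-envelope check of such a setup (the numbers below follow the example above and are illustrative assumptions, not measurements), one can simply compare the summed readout link speeds with the merger NIC speed:

  /* Back-of-the-envelope link sizing for the example above.
   * All numbers are illustrative assumptions.
   */
  #include <stdio.h>

  int main(void)
  {
    double readout_gbps = 1.0;   /* link speed per readout node  */
    int    n_readout    = 8;     /* hypothetical number of nodes */
    double merger_gbps  = 10.0;  /* merger node NIC speed        */

    double aggregate = readout_gbps * n_readout;

    printf("aggregate readout output: %.1f Gbps\n", aggregate);
    printf("merger NIC %.1f Gbps: %s\n", merger_gbps,
           aggregate <= merger_gbps ? "not a bottleneck"
                                    : "needs a faster NIC or uplink");
    return 0;
  }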

It is naturally also possible to connect readout and merger nodes using multiple switches. In this case, care must be taken that no link between switches becomes a bottleneck. This simply means that the connections between switches should also use the higher speed.

Fast CPU for event merger

The processing needs of an event merger are rather small, and a machine with a few CPU cores, even of an older generation, is usually enough not to present any bottleneck. (See some performance tests.)

For more demanding cases, it should be noted that both event building and time sorting are bottlenecked by a single thread, and thus a machine for this task should have a few fast cores rather than many slow ones.

Would NxM help?

Probably not.

The multiple event-builder mode of MBS (NxM) is designed to fulfil the overarching principle above that the network hardware of the readout nodes shall be used to its limits. It does this by utilising multiple event builders, while only sending data from one readout node at a time to each event builder. The drawback is that the user is presented with multiple data streams, one from each event builder. (This latter issue can be alleviated by employing a time-sorter stage after the event building.)

Drasi takes another approach: it can make use of the network hardware on the event builder node such that data can be read at full speed from all readout nodes simultaneously, with only one event builder node.

Delayed event-building

With a setup where the beam is not continuous, but rather comes in few-second spills (due to being produced by a synchrotron rather than a cyclotron), readout performance during the on-spill periods can be enhanced by not sending data during those periods. The purpose is to keep the CPU of the readout nodes fully available for readout tasks during on-spill.

If the buffer of a readout node is large enough to hold all data for one spill, and its network bandwidth is high enough to transmit all the data of one spill period during the off-spill period alone, then this method gives the maximally achievable readout performance. If either of these conditions is not fulfilled, normal event building has to be resumed within the spill.
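To make these two conditions concrete, the small check below (with hypothetical rates, spill structure and buffer size) verifies that the readout buffer can hold one spill of data and that the off-spill window is long enough to drain it:

  /* Check the two conditions for fully delayed event building.
   * All numbers are hypothetical.
   */
  #include <stdio.h>

  int main(void)
  {
    double rate_mb_s  = 40.0;   /* data rate during spill (MB/s)   */
    double spill_s    = 4.0;    /* on-spill length (s)             */
    double offspill_s = 2.0;    /* off-spill length (s)            */
    double buffer_mb  = 256.0;  /* readout node buffer size (MB)   */
    double link_mb_s  = 110.0;  /* usable network bandwidth (MB/s) */

    double spill_data_mb = rate_mb_s * spill_s;

    int fits   = spill_data_mb <= buffer_mb;
    int drains = spill_data_mb <= link_mb_s * offspill_s;

    printf("one spill of data: %.0f MB\n", spill_data_mb);
    printf("fits in buffer:    %s\n", fits   ? "yes" : "no");
    printf("drains off-spill:  %s\n", drains ? "yes" : "no");
    printf("%s\n", fits && drains
           ? "fully delayed event building possible"
           : "transmission must resume within the spill");
    return 0;
  }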

Drasi implements delayed event building, controlled by the presence of triggers 12 and 13 to mark the begin- and end-of-spill. Transmission of data from each readout node is delayed independently, made possible by the deep input buffers of the event builder. If the buffer fill level of a readout node passes a threshold close to full, transmission from that node is resumed within the spill. Global event building is forced if some event builder input buffer becomes close to full. To avoid forcing transmission from readout nodes that would otherwise not require it (because they have enough buffer space and bandwidth to send during the off-spill periods only), the event builder input buffers need to hold one spill's worth of data for each readout node.
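The per-node decision can be pictured roughly as follows (a simplified sketch, not drasi's actual implementation; the trigger numbers 12 and 13 come from the description above, while the structure and names are illustrative):

  /* Simplified per-readout-node transmission gating for delayed
   * event building.  Not drasi's actual code; names and threshold
   * handling are illustrative.
   */
  enum { TRIG_BEGIN_SPILL = 12, TRIG_END_SPILL = 13 };

  struct node_state {
    int    on_spill;       /* currently inside a spill?       */
    double fill_fraction;  /* readout buffer fill level, 0..1 */
  };

  void handle_trigger(struct node_state *s, int trig)
  {
    if (trig == TRIG_BEGIN_SPILL)
      s->on_spill = 1;
    else if (trig == TRIG_END_SPILL)
      s->on_spill = 0;
  }

  /* Return 1 if this node should transmit data right now. */
  int should_transmit(const struct node_state *s, double threshold)
  {
    if (!s->on_spill)
      return 1;                          /* off-spill: always send          */
    return s->fill_fraction > threshold; /* on-spill: only if close to full */
  }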

Since the readout task usually is single-threaded, delayed event building does not gain much on multi-core machines. (Such machines are, however, still rare for readout tasks.) There is a small gain to be made though, so transmission is delayed on such machines as well, with a slightly lower fill-level threshold.

Finding bottlenecks

Generally, the hardware after the readout nodes should not affect the performance of a data acquisition system. (If it does, it means that the lack of some cheap equipment is preventing the expensive readout nodes and beamtime from collecting more statistics.)

Bottlenecks are generally found by monitoring the fill level of buffers in the data path. If a buffer has a tendency to run full rather than being mostly empty, it means that the following stage is a bottleneck.

If a buffer before a network connection is running full, the network connection is too thin. (Currently, it may also be that the single thread handling the input connections on the merger side is the limit. However, this thread is capable of handling > 10 Gbps on recent hardware, so this is not likely.)

If an input buffer before a merger stage is running full, the CPU is too slow to handle the merging task.
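This reasoning can be summarised in code form (buffer names and fill levels below are made up; the diagnosis rules follow the paragraphs above):

  /* Map buffer fill levels to likely bottlenecks, following the
   * rules above.  Names and numbers are illustrative only.
   */
  #include <stdio.h>

  struct buffer_stat {
    const char *name;        /* buffer location in the data path */
    const char *next_stage;  /* stage the buffer feeds           */
    double      fill;        /* fraction full, 0..1              */
  };

  int main(void)
  {
    struct buffer_stat stats[] = {
      { "readout output buffer", "network link to merger", 0.95 },
      { "merger input buffer",   "event building (CPU)",   0.10 },
    };
    size_t i;

    for (i = 0; i < sizeof stats / sizeof stats[0]; i++) {
      if (stats[i].fill > 0.9)
        printf("%s mostly full -> bottleneck is: %s\n",
               stats[i].name, stats[i].next_stage);
      else
        printf("%s mostly empty -> %s keeps up\n",
               stats[i].name, stats[i].next_stage);
    }
    return 0;
  }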

Infiniband

Experience from general tests of setting up Infiniband systems is that this is much more complicated than Ethernet. Plain networking equipment with equivalent speeds can easily be obtained, often refurbished. The low-latency advantage of Infiniband offers no benefit for a DAQ system that is designed for streaming data. Therefore, this path is not pursued.