Implementation details

This note explains some details of the drasi implementation and operation that do not fit in the other sections. They may help in understanding drasi behaviour.

CPU affinity

The most important thread on a readout node is the readout thread itself. On machines with multiple CPU cores, it should preferably be the exclusive user of one (physical) CPU core. The OS kernel scheduler often manages to do this, but sometimes makes a sub-optimal placement, interrupting the readout with other tasks. Drasi avoids this by assigning fixed CPU core affinities to threads at start-up, as long as there are enough cores remaining in the set of cores it has been given. Assignment is performed in decreasing order of thread importance:

  1. Readout thread.
  2. Merger thread (for event builder / time sorter).
  3. File writer thread.
  4. Network data output thread.
  5. Merger input thread(s).
  6. Other threads. (Generally low CPU usage.)

If no further cores remain to assign for a level, the placement of its threads is left to be decided dynamically by the OS kernel, but only among the cores used for the previous level. If there are not enough cores to cater for all the threads within a level, their placement is likewise decided dynamically, among the remaining cores.

Cores are assigned starting with the highest-numbered. The rationale is that interrupts from the network interfaces generally seem to be routed to the first core of the machine. This strategy thus avoids the readout thread being interrupted by network tasks. (See some performance measurements.)
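The pinning itself is just a per-thread affinity mask. A minimal sketch of the idea in C, using sched_getaffinity(2) and pthread_setaffinity_np(3) (an illustration only, not drasi's actual code):

  #define _GNU_SOURCE
  #include <pthread.h>
  #include <sched.h>

  /* Sketch: pin the calling thread to the highest-numbered core
   * still allowed for the process (e.g. as restricted by taskset). */
  static int pin_to_highest_core(void)
  {
    cpu_set_t allowed, pin;
    int core;

    /* The set of cores the process has been given. */
    if (sched_getaffinity(0, sizeof (allowed), &allowed) != 0)
      return -1;

    /* Pick the highest-numbered core in the set. */
    for (core = CPU_SETSIZE - 1; core >= 0; core--)
      if (CPU_ISSET(core, &allowed))
        break;
    if (core < 0)
      return -1;

    CPU_ZERO(&pin);
    CPU_SET(core, &pin);
    return pthread_setaffinity_np(pthread_self(), sizeof (pin), &pin);
  }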

If one machine hosts several drasi processes, the strategy above is not particularly clever: the most important threads of all instances would be pinned to the same highest-numbered core. This is easily prevented by the user, by selecting different cores for each process: taskset -c core-list <drasi-cmd>.
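For example, two drasi processes on an eight-core machine could be given disjoint core sets (the drasi commands are placeholders):

  taskset -c 4-7 <drasi-cmd-1>
  taskset -c 0-3 <drasi-cmd-2>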

No measures to deal with SMT (hyper-threaded) CPU cores have been taken so far. A good strategy would probably be to remove SMT siblings when dedicated cores are assigned. For the time being, this effect can be achieved using taskset.
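On Linux, the SMT siblings of a core can be looked up in sysfs, in order to include only one core of each sibling pair in the core list given to taskset:

  cat /sys/devices/system/cpu/cpu7/topology/thread_siblings_list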

Drasi threads

With one exception, all drasi threads use CPU resources only in proportion to the amount of data they process. They multiplex I/O by waiting for ready file descriptors using select(2), and wake each other up using internal notification pipes (which mix well with other I/O under select(2)). Separate threads are only created for tasks that can be expected to need a substantial amount of CPU resources.
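The wake-up mechanism is the classic self-pipe pattern. A minimal sketch in C (an illustration only, not drasi's actual code; pipe(wake_pipe) is assumed to have been called once at start-up):

  #include <sys/select.h>
  #include <unistd.h>

  static int wake_pipe[2];  /* [0] = read end, [1] = write end */

  /* Called from any other thread to wake the select() below. */
  static void wake_up(void)
  {
    char c = 0;
    (void) write(wake_pipe[1], &c, 1);
  }

  /* Wait for either data on data_fd or an internal notification. */
  static void wait_for_work(int data_fd)
  {
    fd_set rfds;
    int maxfd = data_fd > wake_pipe[0] ? data_fd : wake_pipe[0];

    FD_ZERO(&rfds);
    FD_SET(data_fd, &rfds);
    FD_SET(wake_pipe[0], &rfds);

    if (select(maxfd + 1, &rfds, NULL, NULL, NULL) > 0 &&
        FD_ISSET(wake_pipe[0], &rfds))
      {
        char buf[64];
        (void) read(wake_pipe[0], buf, sizeof (buf));  /* Drain. */
      }
    /* Readiness of data_fd is then handled by the caller. */
  }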

Readout thread

This thread deals with triggers, fetches data from external sources, and places it in the internal data buffer. Most of the work is performed in the user-supplied readout functions. In TRIVA/MI mode, the thread will use all (remaining) CPU cycles on its core, since it waits for the next trigger by continuously polling the trigger module. (Tests have so far not shown any benefit of interrupted mode, which is therefore not implemented. Interrupted mode in itself has a higher overhead per trigger than polling, due to interrupt handling and a system call. The likely reason interrupted mode gives no improvement is that the other threads are completely inactive (waiting in select(2)) when they have no work to do, and thus spend no time polling.)
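The polling wait amounts to a busy loop on a status register. A minimal sketch, where the register pointer and the TRIG_PENDING bit are hypothetical names, not drasi's actual code:

  #include <stdint.h>

  #define TRIG_PENDING 0x1  /* Hypothetical status bit. */

  /* Busy-wait until the trigger module flags a pending trigger.
   * Burns the dedicated core, but reacts with minimal latency and
   * avoids the per-trigger interrupt and system-call overhead. */
  static void wait_for_trigger(volatile uint32_t *status_reg)
  {
    while (!(*status_reg & TRIG_PENDING))
      ;  /* Spin. */
  }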

The readout thread is the main/original thread of the drasi process, to simplify debugging of user readout functions.

Merger thread

For an event builder / time sorter, the merger thread plays the role of the readout thread. It takes data from the input buffers and places the result in the internal data buffer. This single thread will see all data, so it should preferably have a dedicated core.

File writer thread

This thread also sees all data, but has less processing to do.

Network data server thread

Note that, for technical reasons, this thread is currently only employed in instances where the server is operated in hold mode. This however covers the important case of readout nodes sending data to event mergers.

This thread also sees all data, but, like the file writer, has less processing to do.

Merger input thread(s)

The network inputs to a merger are handled by a separate input thread. (So far only a single thread, but most of the code is prepared to use several threads, up to one per network connection.)

Merger timestamp analysis thread

The timestamp alignment analysis of the time-sorter is performed in a separate thread.

General network thread

This thread is responsible for all incoming network connections. It also sends messages to the logger, sends monitor data, and handles the network part of all control connections.

It currently also handles the network data output for the cases not handled above.

This thread is also responsible for running the overall DAQ operation state machine, which on a master node is responsible for communication with all slaves and event builders.

TRIVA control thread

This thread is a partner of the readout thread when working in TRIVA/MI mode. It handles all communication with the DAQ operation state machine, as well as providing it with regular status updates (upon request). Such updates can thus be provided independently of the readout thread.

Sacrificial TCP port

TCP network servers must bind and listen to specific ports for clients to know where to reach them. drasi is no exception. When a TCP connection is closed, whoever closes the connection cannot bind to that port again for a short while (usually about a minute). This is not a problem for a client, which uses any free port to initiate connections. A server, however, needs to bind to the same port number again when it is restarted, and at that point it most likely has just closed one or more connections. Thus the restart is delayed by waiting for network timeouts, at the most inconvenient of times: when the user is actively waiting for the program to start again.

To avoid this frustrating side-effect of the TCP protocol, drasi uses sacrificial TCP ports: the server binds to two ports, one with the fixed number and one with a random number. Any client first connects to the fixed port, where it only gets to know the number of the other, random port, on which all actual authentication and data transmission take place. The client closes the original connection after having received the second port number, so any timeout is generally on the client side, where it does not matter. If the client does not close the port-map connection, the server will do so within a few seconds (in the hope that the timeout period passes before the server is possibly restarted). This avoids about 99% of the failure-to-bind issues during restart.
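The random port can be obtained by binding to port 0 and asking the kernel which port it chose. A minimal sketch in C of such a dual-purpose listen helper (an illustration only, not drasi's actual code):

  #include <arpa/inet.h>
  #include <netinet/in.h>
  #include <stdint.h>
  #include <string.h>
  #include <sys/socket.h>

  /* Create a listening socket on the given port; port 0 makes the
   * kernel pick a random free port, whose number is retrieved with
   * getsockname(). */
  static int listen_on(uint16_t port, uint16_t *actual)
  {
    struct sockaddr_in addr;
    socklen_t len = sizeof (addr);
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
      return -1;

    memset(&addr, 0, sizeof (addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);

    if (bind(fd, (struct sockaddr *) &addr, sizeof (addr)) != 0 ||
        listen(fd, 5) != 0 ||
        getsockname(fd, (struct sockaddr *) &addr, &len) != 0)
      return -1;

    *actual = ntohs(addr.sin_port);
    return fd;
  }

The server would call such a helper twice: once with the fixed port number for the port-map socket, and once with port 0 for the actual data socket, reporting the resulting random number to each client that connects to the fixed port.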

It may be argued that the network timeouts can be avoided by closing connections in the right order. While true, and drasi actually tries to do this as well, the approach fails if the programs crash, which naturally only happens during development. That, however, is more than reason enough.