Contents
========

1.  Purpose
2.  License
3.  Compression format
4.  User routines
5.  C function interface
5a. Decompression in C
5b. Compression in C
6.  VHDL compression interface
6a. Additional pipelining
7.  Performance
8.  Compilation and testing
9.  Acknowledgements (referencing)
10. Contact

1. Purpose
==========

DPTC, difference predicted trace compression, is a simple bit-packed
format to store flash-ADC traces (or other smooth data) in a
space-efficient manner.  It comes with a VHDL module for on-the-fly
compression and with C routines for decompression and compression.

2. License
==========

DPTC is free software, distributed under the 3-clause BSD license.
For details, see the accompanying LICENSE file.

3. Compression format
=====================

The input data consist of a sequence of n-bit data words.  Typically,
n is 12 or 14.  The code currently allows n to have any value in the
range 5 - 16, inclusive.

For the compression to be efficient (i.e. space-saving), the data
sequence should describe a smooth function, i.e. most consecutive
values should have small differences.  This is typically the case with
flash-ADC data from detectors that record pulses.

The compression can logically be viewed as two steps: first a
treatment of the data values, such that the modified values can more
easily be stored compactly, followed by the bit-packing.

The first stage, treating each data value, is:

- Calculate the difference to the previous value.

  (Purpose: with smooth traces, the values to process further are then
  centred around 0.)

- If the previous three differences had the same sign, then instead
  use the difference to the previous difference.  0 is considered to
  have no sign, so it breaks such sequences.

  (Purpose: when pulses are built up over many samples, especially
  trailing decay parts, the double differencing removes the linear
  component.  It makes the distribution narrower around 0.
  However, for flat parts of a trace (which only contain noise), the
  double difference would lead to a wider distribution of values to
  store.  Such flat parts change sign often, or have 0 differences,
  and are thereby detected by the three-most-recent rule.)

  Tests on actual data show that this rule on average causes larger
  storage space to be used.  It is therefore by default NOT enabled.

- If the value to be stored is negative, change the sign of the
  following values.  If positive, store the following values as they
  are.  A 0 does not change how to store following values.

  (Purpose: the later bit-encoding is asymmetric - with a certain
  number of bits it can store one more negative value than positive.
  This sign-changing scheme makes negative values more common than
  positive values for flat (noise) parts of a trace.)

Compressed data are packed into 32-bit words, which are filled from
the least significant bits.  When a value to store does not fit, the
overflowing bits are stored in the next 32-bit output word.

Values are stored in groups of four.  All four values are stored using
the same number of bits.  Before each group, a header gives the number
of bits to store for each value that is encoded.

Since the number of bits to use will often not change very much
between adjacent groups, a short and a long encoding of the number of
bits are used.  The short group header consists of two bits.  If its
value is 1, 2 or 3, the number of bits to store for each value in the
group is the same as in the previous group with a change of -1, 0 or
+1 bits.  If the two-bit short header is 0, it is followed by the
difference - 2, so as not to repeat codes handled by the short group
header.  The long header difference is encoded using the number of
bits needed to store at most n-3, i.e. 2 for n <= 7, 3 for n <= 11 and
4 for n <= 19.

The number of bits is interpreted with a bias of 1, meaning that
storing 0 bits per value is not supported.

The data values are then stored with the given number of bits.
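
As an illustration, the default value treatment and the choice of
group header described above can be sketched in C as follows.  This is
not the code from c/: all sketch_* names are made up for this example,
the optional double differencing is left out, any wrap-around of large
differences is ignored, and the sign rule is assumed to act on the
difference before it is flipped, which is the reading that matches the
stated purpose of the rule.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: first-stage treatment with default settings.
 * Writes ndata-1 treated differences to out[].  Assumption: the sign
 * decision is taken on the raw difference, before any flip. */
void sketch_treat(const uint16_t *data, size_t ndata, int32_t *out)
{
  int flip = 0;
  for (size_t i = 1; i < ndata; i++) {
    int32_t diff = (int32_t) data[i] - (int32_t) data[i - 1];
    out[i - 1] = flip ? -diff : diff;
    if (diff < 0)
      flip = 1;          /* negative: flip the following values */
    else if (diff > 0)
      flip = 0;          /* positive: store the following as they are */
    /* diff == 0: leave the flip state unchanged */
  }
}

/* Smallest number of bits b >= 1 such that v lies in the asymmetric
 * range [-2^(b-1), 2^(b-1)-1]. */
int sketch_bits_needed(int32_t v)
{
  int b = 1;
  assert(v > -((int32_t) 1 << 30) && v < ((int32_t) 1 << 30));
  while (v < -((int32_t) 1 << (b - 1)) ||
         v > ((int32_t) 1 << (b - 1)) - 1)
    b++;
  return b;
}

/* All four values of a group share the largest bit count needed. */
int sketch_group_bits(const int32_t v[4])
{
  int bits = 1;
  for (int i = 0; i < 4; i++) {
    int b = sketch_bits_needed(v[i]);
    if (b > bits)
      bits = b;
  }
  return bits;
}

/* Two-bit short header: 1, 2 or 3 for a change of -1, 0 or +1 bits
 * relative to the previous group; 0 means a long header follows. */
int sketch_short_header(int prev_bits, int bits)
{
  int change = bits - prev_bits;
  return (change >= -1 && change <= 1) ? change + 2 : 0;
}

/* Width of the long header field, as listed in the text:
 * 2 for n <= 7, 3 for n <= 11 and 4 for n <= 19. */
int sketch_long_header_bits(int n)
{
  assert(n >= 5 && n <= 19);
  return n <= 7 ? 2 : n <= 11 ? 3 : 4;
}
```

For the samples 100, 101, 100, 101 the differences are +1, -1, +1, and
the sketch yields +1, -1, -1.  Each of these fits in 2 bits, so under
these assumptions a group holding only such values would be stored
with 2 bits per value.
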
The value stored is relative to the most negative value that can be
stored with that number of bits.  This is to simplify decoding, as the
stored value only has to be unmasked, and the bias added.  (Storing
directly would have required the decoder to fill with the sign bit.)

As an exception to the above rules, the first data value is stored
verbatim, unconditionally, with n bits.  This is to avoid having the
entire first group of data values stored with many bits, in case the
baseline of the ADC trace is far from 0.

It is the responsibility of the enclosing data format to keep track of
the number of original data bits, the number of original data values,
and the number of data words produced by the compression.  These
values are used by the decompression procedure.

In case the original data contain an excessive number of lower bits
with noise that shall not be stored, they must be shifted out of the
original data before giving them to the compression routines.  Just
masking them out will *not* improve the compression efficiency, as the
compression routines effectively are looking for the most significant
bit of the differences to be stored.

It is however quite harmless to use a compressor with n larger than
necessary.  The loss in compression will be negligible.

4. User routines
================

Only the routines in the directories c/ and vhdl/ are needed for any
user program or firmware.  The code in all other directories is for
testing purposes.

5. C function interface
=======================

The C interface consists of two functions.

5a. Decompression in C
======================

The decompression is performed by one function:

  int dptc_unpack16(uint32_t *compr, size_t ncompr,
                    uint16_t *output, size_t ndata, int bits)

  compr    Pointer to the compressed input buffer.
  ncompr   Number of elements in the input buffer.
  output   Pointer to a buffer to receive the decompressed values.
  ndata    Number of original/decompressed values.
  bits     Number of bits of each value that was stored.  This must be
           the same as the number used during compression.

On success, 0 is returned, otherwise a non-zero value.

The decompression routine will not read items beyond the end of the
source buffer even if it runs out of data (due to e.g. a corrupted
data stream).  It will however report decompression failure.  Failure
will also be reported if there are non-zero bits left in the input
buffer, or entire words have not been used.

A version that handles data words with more than 16 bits is also
available, dptc_unpack32(), with uint32_t *output.

5b. Compression in C
====================

A C compression routine is also available:

  size_t dptc_compress16(uint16_t *data, size_t ndata,
                         uint32_t *compr, size_t ncompr, int bits)

  data     Pointer to a buffer with the original values.
  ndata    Number of original elements.
  compr    Pointer to the compressed output buffer.
  ncompr   Number of elements in the output buffer.
  bits     Number of bits to store of each value.

Return value: the number of words written in the output buffer.

A version that handles data words with more than 16 bits is also
available, dptc_compress32(), with uint32_t *data.

It is rather easy to calculate the worst-case number of output words
that can be produced (to be done).  A safe value is to have
ncompr = ndata.

6. VHDL compression interface
=============================

The VHDL compression presents itself as a single entity to the user,
configurable using a generic map.  The other entities are solely for
internal use.  All identifiers that are exposed through work.* have
dptc_ prefixes, to avoid clashing with other user code.

The compression interface:

  entity dptc_module is
    generic (predictor : integer := 0;
             dc_cd_reg : integer := 0;
             dc_pd_reg : integer := 0;
             dc_od_reg : integer := 0;
             ec_mc_reg : integer := 0;
             ec_cb_reg : integer := 0;
             op_ss_reg : integer := 0
             );
    port (clk:      in  std_logic;
          reset:    in  std_logic;
          data_in:  in  std_logic_vector;
          dv_in:    in  std_logic;
          flush:    in  std_logic;
          dv_out:   out std_logic;
          out_word: out std_logic_vector;
          done:     out std_logic
          );
  end;

  clk       The clock network signal that drives the compression.
            For the minimum period of this clock, see section 6a.

  reset     Used to reset the compression system between each trace.
            It needs to be held for at least 6 clock cycles (the
            number depends on the extra pipeline registers).  It is
            suggested, although not required, to hold it until right
            before the first data value is provided.

  data_in   The data value to compress each clock cycle.  The number
            of bits of the compressor is given by the length of this
            std_logic_vector.

  dv_in     Set to '1' for each input data value to be compressed.
            It is currently NOT valid to omit values by letting
            'dv_in' be '0' during some cycles.

  flush     Must be set after the last data word has been given, and
            then held, in order to produce the last output data words
            (even if not all 32 bits have been completely filled).
            It is valid to set 'flush' while 'dv_in' is still set.
            But 'dv_in' must not be set to '1' after 'dv_in' has been
            '0' while 'flush' is '1'.  'flush' must currently be set
            directly after the last data value.

  out_word  Gives the compressed data words.

  dv_out    Marks when output words are produced.  The content of
            'out_word' shall only be stored when 'dv_out' is '1'.

  done      Marks that all data words have been produced.  (It will
            be set a number of cycles after 'flush' has been set and
            no further input data has been given.)  Note that a data
            word may be produced in the same cycle as 'done' is first
            set to '1', but not later.

6a. Additional pipelining
=========================

The achievable minimum clock cycle period for an FPGA depends on the
length of the longest logic chain between register latches.  Which
logic expression becomes the longest depends on the model and grade of
the FPGA which is targeted.

In order to allow some flexibility in inserting pipeline stages in the
code, a few generic parameters (..._reg in the interface above)
control some optional pipeline stages.  Generally, introducing a
pipeline stage will cause more LUTs to be used, as well as flip-flops.

Minimum period values for some FPGA kinds are extracted from synthesis
of the code for those FPGA models.  They can be found in the file
min_periods.txt.

The VHDL code can generally be configured to reach a minimum clock
period well below 10 ns (i.e. above 100 MHz) even on 10-year-old
FPGAs, and 5 ns with pipelining stages.  On modern FPGAs, going below
3 ns seems rather easy.

7. Performance
==============

Both the C compression and decompression routines process samples at
an almost constant rate (depending on CPU).  Some typical values (time
per sample):

  Machine        E3-1285v6  E3-1276v3  X5450    PPC 7455
  Speed          (4.5 GHz)  (4 GHz)    (3 GHz)  (1 GHz)
  Introduced     (2017)     (2014)     (2007)   (2002)

  Decompress C   2.2 ns     3.1 ns     6.2 ns   36 ns
  Compress C     3.3 ns     4.6 ns     8.8 ns   50 ns

The values become easier to compare when expressed in clock cycles:

  Decompress C   10         12         19       36
  Compress C     15         18         26       50

It should be noted that a decompression time per sample of < 2.5 ns
(achievable on modern hardware) corresponds to > 400 MS/s.  Compared
to raw storage using 16-bit words for each value, this produces
> 800 MB/s of decompressed data.

8. Compilation and testing
==========================

Usage of the code as such does not involve any separate compilation
steps.  The code is intended to be included into other projects either
by symlinking or copying.

The distribution includes machinery to test the routines using sample
and random data, as well as to determine the approximate minimum clock
periods achievable using different pipelining options.

The test machinery will build and test the C and VHDL codes using all
values for n from 5 to 16, and all pipelining combinations.  The VHDL
code is compiled using ghdl (http://ghdl.free.fr/).

To perform the tests:

  (cd traces ; ./mkgen.pl)
  rm -rf sim/
  make -j 12 runsim   # or 'short' instead of 'runsim' for a quicker test
                      # and possibly add FEWBITS=1 FEWREGS=1

The flag '-j 12' sets the number of jobs to run in parallel.  This may
be omitted or changed, depending on the number of threads available on
the testing machine.

Determining approximate FPGA clocking capabilities requires that the
development chain from the manufacturer can be accessed from the
command line.

To test with Xilinx tools ('xst' must be accessible):

  rm -rf syntiming/
  make min_periods.txt -j 16

Again, adjust the '-j 16' flag as appropriate.

Some files with real traces are also included in the distribution.
The compression efficiency can be tested for them (here with n=12 and
n=16):

  make testpack_b12
  make testpack_b16

To test the execution speed of the C routines:

  make testspeed_b16

Recompiling with '-march=native' can have some effect:

  make clean && CFLAGS=-march=native make testspeed_b16

9. Acknowledgements (referencing)
=================================

The recommended way to refer to DPTC, when used for work that is
published in a research article, is to cite the following paper:

  G. Bruni and H. T. Johansson,
  DPTC - an FPGA-based trace compression,
  IEEE Transactions on Circuits and Systems I: Regular Papers,
  67(1) (2020), 189-197.
  Pre-print (2019) at arXiv:1903.10984.

  @ARTICLE{bruni2019,
    author  = {G. Bruni and H. T. Johansson},
    title   = {DPTC---An FPGA-Based Trace Compression},
    journal = {IEEE Transactions on Circuits and Systems I:
               Regular Papers},
    year    = {2020},
    volume  = {67},
    number  = {1},
    pages   = {189-197},
    doi     = {10.1109/TCSI.2019.2945179}
  }

10. Contact
===========

  Håkan T. Johansson          e-mail: f96hajo@chalmers.se
  Subatomic and Plasma Physics
  Department of Physics
  Chalmers University of Technology
  412 96 Göteborg
  Sweden

  Giovanni Bruni              e-mail: bruni.gvn@gmail.com
  Subatomic and Plasma Physics
  Department of Physics
  Chalmers University of Technology
  412 96 Göteborg
  Sweden