logo_white

A C++17 Data Stream Processing Parallel Library for Multicores and GPUs 

Operators

WindFlow provides a quite rich set of operators like popular streaming frameworks (e.g, Apache Flink). The figure below shows the list of operators and a brief description of each of them.

placeholder image

Operator Parallelism

Sources, the other basic operators, and sinks have an immediately clear semantics for programmers a bit expert with open-source stream processing frameworks. During their instantiation, the programmer can specify a parallelism level (number of replicas). Each replica receives a fraction of the input stream in order to process inputs in parallel to improve throughput. The Map, Filter, FlatMap and Sink operators can be configured to work on a keyby basis, i.e., all the inputs with the same key attribute are delivered to the same destination replica by the preceding operator. The Reduce operator always works on a keyby basis.

In addition, the library provides a set of window-based operators that offer a useful set of programming abstractions [1,2] to express different kinds of parallelism approaches solving window-based streaming analytics tasks:

  • Key-based parallelism: it is the default parallelism exploitation pattern in stream processing systems dealing with stateful computations. The assumption is that the stream conveys input items belonging to multiple logical substreams, each composed of items all having the same value of a key attribute (a.k.a. keyed streams). The idea is to execute in parallel the processing of sliding windows belonging to different substreams. The default operator enabling this behavior is the Keyed_Windows (KW).
  • Inter-window parallelism: in case of high-speed streams, sliding windows can become ready to be computed very frequently. The idea is to split the set of windows to compute among different parallel entities in charge of doing the processing in parallel on distinct windows (regardless the fact that they belong to different or to the same keyed substream). The operators enabling this behavior are the Parallel_Windows (PW) and the Paned_Windows (PaW).
  • Intra-window parallelism: the idea is to go in parallel within the computation of each window, by splitting its content in partitions that are processed in parallel while the results of the partitions are computed and merged to obtain window-wise results. The operators enabling this behavior are the Paned_Windows (PaW) and the MapReduce_Windows (MRW).

The idea of these approaches is sketched in the figure below. For simplicity, we show an example with count-based windows of length of six items sliding every two items. However, the library also provides support to time-based windows (with event time semantics only at the moment) instantiated with both non-incremental and incremental queries.

Parallelism exploitation patterns on sliding windows

Intra-window parallelism needs additional explanations. The library supports operators enabling the parallel processing within each window. The first is based on the paned approach [3], where each window is split into tumbling windows called panes with length equal to the greatest common divisor between the window length and its slide (in the example, each pane is two items long). The operator computes a partial result per pane (by processing them in parallel if possible), and the pane results are eventually aggregated to produce window-wise results (in the example three pane results are needed per window). A good point in favor of this approach is that pane results shared by consecutive windows do not need to be recomputed by improving the computation performance.

The second approach to intra-window parallelism is based on the map-reduce pattern: each window is split into partitions, and a partial result is computed per partition. Then, partition results are aggregated into window-wise results. Although similar to the paned approach, here the size of the partitions depends on the number of parallel replicas involved in the processing of the operator (two in the figure below, right-hand side), leading to the possibility to decrease the processing latency proportionally to the parallelism level used by the operator. However, differently from the paned approach, windows are recomputed from scratch although in parallel.

Intra-window parallelism

We point out that the existing frameworks do not provide all these parallel operators in an integrated and easy-to-use manner. As an example, Apache Storm and Apache Flink provide key-based parallelism natively, while intra-window parallelism is naturally provided in Spark Streaming as a derivation from the Spark engine for batch processing and map-reduce computations.

Starting from version 2.8 of the library, WindFlow provides a further window-based operator called FFAT_Windows. As for the Keyed_Windows operator, this operator processes windows of different keyed substreams in parallel on different cores of the CPU. However, windows of the same substream are processed using the Flat Fixed-size Aggregator approach (Flat-FAT) based on balanced binary trees. The approach avoids recomputing windows from scratch. Further details can be found in the paper presenting this approach for the first time [4].

GPU Operators

WindFlow version 3.x supports operators targeting NVIDIA GPU devices. The idea is that the processing of hotspot computationally intensive operators can be accelerated leveraging GPU devices and small batching approaches. A GPU operator receives a batch of inputs from the preceding operators in the dataflow graph and processes each batch in parallel on GPU.

For stateless operators, i.e., the one created without a keyby modifier, all inputs within the batch are processed in parallel on GPU by a CUDA thread each. This idea is shown in the figure below.

GPU stateless processing

Stateful operators require a keyby distribution and maintain a state partition per key. The general requirement is that all inputs belonging to the same substream are processed sequentially by reading and modifying the associated state. WindFlow provides a clever approach to do that on GPU, by keeping a device object per key representing the state partition, and enforcing the stateful constraints in the design of the internal CUDA kernels.

GPU stateful processing

From the release 3.0.0, WindFlow supports both Map and Filter operators on GPU. Furthermore, a GPU variant of the FFAT_Windows has also been provided (FFAT_Windows_GPU), which uses a modified version of the FlatFAT structure that can efficiently be computed on GPU.

References

Useful references for this page:

  1. G. Mencagli, M. Torquati, D. Griebler, M. Danelutto and L. G. L. Fernandes. Raising the Parallel Abstraction Level for Streaming Analytics Applications. IEEE Access, vol. 7, pp. 131944-131961, 2019. DOI: 10.1109/ACCESS.2019.2941183
  2. Tiziano Matteis and Gabriele Mencagli. 2017. Parallel Patterns for Window-Based Stateful Operators on Data Streams: An Algorithmic Skeleton Approach. Int. J. Parallel Program. 45, 2 (April 2017), 382-401. DOI: https://doi.org/10.1007/s10766-016-0413-x
  3. Jin Li, David Maier, Kristin Tufte, Vassilis Papadimos, and Peter A. Tucker. 2005. No pane, no gain: efficient evaluation of sliding-window aggregates over data streams. SIGMOD Rec. 34, 1 (March 2005), 39-44. DOI: http://dx.doi.org/10.1145/1058150.1058158
  4. Kanat Tangwongsan, Martin Hirzel, Scott Schneider, and Kun-Lung Wu. 2015. General incremental sliding-window aggregation. Proc. VLDB Endow. 8, 7 (February 2015), 702–713. DOI:https://doi.org/10.14778/2752939.2752940
  5. T. De Matteis, G. Mencagli, D. De Sensi, M. Torquati and M. Danelutto. GASSER: An Auto-Tunable System for General Sliding-Window Streaming Operators on GPUs. IEEE Access, vol. 7, pp. 48753-48769, 2019. DOI: 10.1109/ACCESS.2019.2910312