WindFlow provides a rich set of operators, similar to popular streaming frameworks (e.g., Apache Flink). The figure below lists the operators with a brief description of each of them.
Sources, sinks, and the other basic operators have semantics that are immediately clear to programmers familiar with open-source stream processing frameworks. During their instantiation, the programmer can specify a parallelism level (number of replicas). Each replica receives a fraction of the input stream, so inputs are processed in parallel to improve throughput. The Map, Filter, FlatMap and Sink operators can be configured to work on a keyby basis, i.e., all inputs with the same key attribute are delivered to the same destination replica by the preceding operator. The Reduce operator always works on a keyby basis.
In addition, the library provides a set of window-based operators that offer useful programming abstractions [1,2] to express different kinds of parallelism for window-based streaming analytics tasks:
The idea behind these approaches is sketched in the figure below. For simplicity, we show an example with count-based windows of length six items sliding every two items. However, the library also supports time-based windows (with event-time semantics only, at the moment), instantiated with both non-incremental and incremental queries.
Intra-window parallelism needs additional explanation. The library provides operators that enable parallel processing within each window. The first is based on the paned approach [3], where each window is split into tumbling sub-windows called panes, whose length is the greatest common divisor of the window length and its slide (in the example, each pane is two items long). The operator computes a partial result per pane (processing panes in parallel where possible), and the pane results are eventually aggregated to produce window-wise results (in the example, three pane results are needed per window). A strong point of this approach is that pane results shared by consecutive windows do not need to be recomputed, thereby improving performance.
The second approach to intra-window parallelism is based on the map-reduce pattern: each window is split into partitions, and a partial result is computed per partition. Then, partition results are aggregated into window-wise results. Although similar to the paned approach, here the size of the partitions depends on the number of parallel replicas of the operator (two in the figure below, right-hand side), making it possible to decrease the processing latency proportionally to the parallelism level of the operator. However, unlike the paned approach, windows are recomputed from scratch, although in parallel.
We point out that existing frameworks do not provide all these parallel operators in an integrated and easy-to-use manner. For example, Apache Storm and Apache Flink natively provide key-based parallelism, while intra-window parallelism is provided in Spark Streaming as a natural derivation from the Spark engine for batch processing and map-reduce computations.
Starting from version 2.8 of the library, WindFlow provides a further window-based operator called FFAT_Windows. Like the Keyed_Windows operator, it processes windows of different keyed substreams in parallel on different CPU cores. However, windows of the same substream are processed using the Flat Fixed-size Aggregator (FlatFAT) approach, based on balanced binary trees, which avoids recomputing windows from scratch. Further details can be found in the paper that first presented this approach [4].
WindFlow version 3.x supports operators targeting NVIDIA GPU devices. The idea is that the processing of computationally intensive hotspot operators can be accelerated by leveraging GPU devices together with small batching. A GPU operator receives batches of inputs from the preceding operators in the dataflow graph and processes each batch in parallel on the GPU.
For stateless operators, i.e., the ones created without a keyby modifier, all inputs within a batch are processed in parallel on the GPU, one CUDA thread each. This idea is shown in the figure below.
Stateful operators require a keyby distribution and maintain a state partition per key. The general requirement is that all inputs belonging to the same substream are processed sequentially, reading and modifying the associated state. WindFlow provides a clever approach to do this on the GPU: it keeps a device object per key representing the state partition, and enforces the stateful constraints in the design of its internal CUDA kernels.
Since release 3.0.0, WindFlow supports both Map and Filter operators on GPU. Furthermore, a GPU variant of FFAT_Windows is also provided (FFAT_Windows_GPU), which uses a modified version of the FlatFAT structure that can be computed efficiently on the GPU.
Useful references for this page: