
WindFlow provides a rich set of operators similar to those found in popular streaming frameworks (e.g., Apache Flink). The figure below shows the list of operators along with a brief description of each.

Sources, the other basic operators, and sinks have straightforward semantics for programmers who are familiar with open-source stream processing frameworks. When instantiating these components, the programmer can specify a parallelism level (i.e., the number of replicas). Each replica receives a portion of the input stream, allowing inputs to be processed in parallel and improving throughput. The Map, Filter, FlatMap, and Sink operators can be configured to work on a key-by basis, meaning that all inputs with the same key attribute are delivered to the same destination replica by the preceding operator. The Reduce operator always operates on a key-by basis.
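To make the key-by distribution concrete, the sketch below shows one common way such routing can be implemented: hashing the key modulo the number of replicas, so that every input carrying the same key is always delivered to the same destination replica. This is an illustrative sketch, not WindFlow's internal routing code; the function name is hypothetical.

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical sketch of key-by routing: map a key to one of
// `num_replicas` replicas of the next operator. Because the mapping is
// deterministic, all inputs sharing a key reach the same replica, and
// any per-key state stays local to that replica.
template <typename Key>
std::size_t route_to_replica(const Key &key, std::size_t num_replicas)
{
    // std::hash gives a stable hash for the key within a single run.
    return std::hash<Key>{}(key) % num_replicas;
}
```

A real runtime must also keep this mapping stable while replicas are alive, since moving a key between replicas would require migrating its state.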
In addition, the library provides a set of window-based operators that offer useful programming abstractions [1,2] for expressing different kinds of parallelism strategies aimed at solving window-based streaming analytics tasks.
The idea behind these approaches is illustrated in the figure below. For simplicity, we present an example using count-based windows of length six items, sliding every two items. However, the library also supports time-based windows (currently with event-time semantics only), which can be instantiated with both non-incremental and incremental queries.

Intra-window parallelism requires additional explanation. The library supports operators that enable parallel processing within each window. The first approach is based on the paned technique [3], in which each window is divided into tumbling subwindows, called panes, whose length is equal to the greatest common divisor between the window length and its slide (in the example, each pane is two items long). The operator computes a partial result for each pane—processing them in parallel when possible—and then aggregates these pane results to produce the final window-level output (in the example, three pane results are needed per window). A notable advantage of this approach is that pane results shared across consecutive windows do not need to be recomputed, thereby improving computational performance.
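The arithmetic of the paned technique can be sketched with plain C++ and summation as the (associative) aggregation function. This is an illustration of the idea, not WindFlow's implementation: for a window of length 6 sliding by 2, the pane length is gcd(6, 2) = 2, each window aggregates 3 pane results, and consecutive windows reuse the 2 panes they share.

```cpp
#include <cstddef>
#include <numeric>   // std::gcd
#include <vector>

// Illustrative sketch of the paned technique for count-based windows of
// length `win` sliding by `slide`, with summation as the aggregation.
struct PanedSum {
    std::size_t pane_len;            // gcd(win, slide)
    std::size_t panes_per_win;       // pane results aggregated per window
    std::size_t panes_per_slide;     // panes the window advances by
    std::vector<long> pane_results;  // one partial sum per completed pane

    PanedSum(std::size_t win, std::size_t slide)
        : pane_len(std::gcd(win, slide)),
          panes_per_win(win / pane_len),
          panes_per_slide(slide / pane_len) {}

    // Compute and store the partial result of the next pane.
    // In a parallel runtime, independent panes can be processed concurrently.
    void add_pane(const std::vector<long> &pane_items) {
        long partial = 0;
        for (long v : pane_items) partial += v;
        pane_results.push_back(partial);
    }

    // Final result of the i-th window: aggregate `panes_per_win`
    // consecutive pane results. Panes shared with the previous window
    // are reused, never recomputed.
    long window_result(std::size_t i) const {
        long out = 0;
        std::size_t first = i * panes_per_slide;
        for (std::size_t p = first; p < first + panes_per_win; ++p)
            out += pane_results[p];
        return out;
    }
};
```

With items 1..8 arriving in panes {1,2}, {3,4}, {5,6}, {7,8}, the first window (items 1..6) reuses the two middle panes when the second window (items 3..8) is computed.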
The second approach to intra-window parallelism is based on the map-reduce pattern: each window is divided into partitions, and a partial result is computed for each partition. The partition results are then aggregated to produce the final window-level output. Although similar to the paned approach, in this case the size of the partitions depends on the number of parallel replicas involved in processing the operator (two in the figure below, right-hand side), making it possible to reduce processing latency proportionally to the operator’s parallelism level. However, unlike the paned approach, windows must be recomputed from scratch—albeit in parallel.
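The map-reduce pattern described above can be sketched as follows, again using summation and assuming the two phases of the pattern (this is a sequential sketch; a real runtime would execute the per-partition loop on different replicas in parallel):

```cpp
#include <algorithm>  // std::min
#include <cstddef>
#include <numeric>    // std::accumulate
#include <vector>

// Sketch of the map-reduce window pattern: split one window into as many
// partitions as there are replicas, compute a partial result per
// partition (the "map" phase, parallel in a real runtime), then combine
// the partials into the window-level result (the "reduce" phase).
long mapreduce_window_sum(const std::vector<long> &window,
                          std::size_t replicas)
{
    std::vector<long> partials(replicas, 0);
    std::size_t chunk = (window.size() + replicas - 1) / replicas;
    for (std::size_t r = 0; r < replicas; ++r) {   // map phase
        std::size_t begin = r * chunk;
        std::size_t end = std::min(window.size(), begin + chunk);
        for (std::size_t i = begin; i < end; ++i)
            partials[r] += window[i];
    }
    // reduce phase: combine the per-partition partials
    return std::accumulate(partials.begin(), partials.end(), 0L);
}
```

Note that, unlike the paned sketch, every invocation processes the whole window from scratch: the latency gain comes from the partitions being computed concurrently, not from reuse across windows.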

It is worth noting that existing frameworks do not provide all these parallel operators in an integrated and easy-to-use manner. For example, Apache Storm and Apache Flink offer native support for key-based parallelism, while intra-window parallelism is naturally supported in Spark Streaming as an extension of the Spark engine’s batch-processing and map-reduce capabilities.
Starting from version 2.8 of the library, WindFlow provides an additional window-based operator called FFAT_Windows. Similar to the Keyed_Windows operator, it processes windows from different keyed substreams in parallel on different CPU cores. However, windows belonging to the same substream are processed using the Flat Fixed-size Aggregator (Flat-FAT) approach, which is based on balanced binary trees. This method avoids recomputing windows from scratch. Further details can be found in the paper that first introduced this approach [4].
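To illustrate the balanced-binary-tree idea behind Flat-FAT, the sketch below implements a minimal fixed-size aggregator tree for summation. It is an illustration of the general technique from [4], not WindFlow's implementation: leaves hold window items, internal nodes cache partial aggregates, so replacing one item updates only O(log n) nodes instead of recomputing the whole window.

```cpp
#include <cstddef>
#include <vector>

// Minimal sketch of a Flat Fixed-size Aggregator (FlatFAT)-style tree
// for summation. The complete binary tree is stored flat in an array
// with a 1-based heap layout (root at index 1, children of i at 2i and
// 2i+1); leaves occupy indices [n, 2n).
class FlatFATSum {
    std::size_t n;           // number of leaves (rounded to a power of two)
    std::vector<long> tree;  // flat array storage of the tree
public:
    explicit FlatFATSum(std::size_t capacity) {
        n = 1;
        while (n < capacity) n <<= 1;  // round capacity up to a power of two
        tree.assign(2 * n, 0);         // 0 is the identity element of sum
    }

    // Overwrite the leaf at position `pos` and refresh its O(log n)
    // ancestors up to the root.
    void update(std::size_t pos, long value) {
        std::size_t i = n + pos;
        tree[i] = value;
        for (i >>= 1; i >= 1; i >>= 1)
            tree[i] = tree[2 * i] + tree[2 * i + 1];
    }

    // The root always holds the aggregate of the whole window.
    long query() const { return tree[1]; }
};
```

With this structure, sliding a window by one item amounts to one leaf update and one root read, which is why windows of the same substream need not be recomputed from scratch.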
WindFlow version 3.x supports operators designed for NVIDIA GPUs. The idea is to accelerate the execution of computationally intensive operators by leveraging GPUs together with small batching techniques. A GPU operator receives batches of inputs from the preceding operators in the dataflow graph and processes each batch in parallel on the GPU.
For stateless operators—that is, those created without a key-by modifier—all inputs within a batch are processed in parallel on the GPU, with each input handled by a separate CUDA thread. This idea is illustrated in the figure below.

Stateful operators require a key-by distribution and maintain a state partition for each key. The general requirement is that all inputs belonging to the same substream must be processed sequentially, since they read and modify the associated state. WindFlow provides an effective approach to achieve this on GPUs by keeping a device-side object per key to represent the state partition, and by enforcing stateful constraints directly within the design of its internal CUDA kernels.

Starting from release 3.0.0, WindFlow supports both Map and Filter operators on GPUs. Additionally, a GPU variant of FFAT_Windows (called FFAT_Windows_GPU) is provided, which uses a modified version of the FlatFAT structure designed for efficient computation on GPUs.
Useful references: