Computing systems are undergoing a multitude of interesting changes: from the platforms (cloud, appliances) to the workloads, data types, and operations (big data, machine learning). Many of these changes are driven by, or tackled through, innovation in hardware, even to the point of fully specialized designs for particular applications. In this talk I will review some of the most important changes happening in hardware and discuss how they affect system design as well as the opportunities they create. I will focus on data processing with an emphasis on streams and event-based systems, but also discuss applications in other areas. I will also briefly discuss how these trends are likely to result in a very different form of IT, and consequently of Computer Science, from the one we know today.
Quantitative models in informatics are well accepted and widely employed, whereas qualitative, conceptual modeling is frequently regarded with scepticism: huge effort is said to yield poor benefit.
In this talk I discuss criteria for adequate conceptual models of event-driven systems, and consider what kind of benefit such models may yield.
In particular, I will argue that an adequate model not only supports understanding, abstraction, structure, behavior, and user interaction; in addition, a lot of nontrivial, deep insight can be gained, including invariants in a broad sense, with models in physics serving as a role model.
In a few short years, the Function-as-a-Service paradigm has revolutionized how we think about distributed event processing. Within seconds, developers can deploy and run event-driven functions in the cloud that scale on demand. CIOs love the high availability and zero administration. CFOs love the fine-grained, pay-as-you-go pricing model.
Most FaaS platforms today are biased toward (if not restricted to) simple functions: functions that only run for a short while, use limited CPU and memory, and process relatively small amounts of data. In this talk, we will take a closer look at FaaS platforms (using Apache OpenWhisk as our exemplar) to understand these trade-offs.
Functions are not applications. To build compelling FaaS applications we need to compose functions. In this talk, we will compare approaches to function composition from both a developer's and a system's perspective. We will show how composition can dramatically expand the scope of FaaS.
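To make the notion of composition concrete, the following is a minimal, platform-agnostic sketch of sequential function composition in Python. It is illustrative only: the function names and the `sequence` helper are ours, not the OpenWhisk or Composer API.

```python
# Illustrative only: a minimal sketch of sequential function composition,
# not the actual OpenWhisk/Composer API. All names are hypothetical.

def validate(event: dict) -> dict:
    """First function: reject events without a payload."""
    if "payload" not in event:
        raise ValueError("missing payload")
    return event

def enrich(event: dict) -> dict:
    """Second function: add a derived field."""
    event["length"] = len(event["payload"])
    return event

def sequence(*functions):
    """Compose functions so the output of one becomes the input of the next."""
    def composed(event: dict) -> dict:
        for f in functions:
            event = f(event)
        return event
    return composed

pipeline = sequence(validate, enrich)
print(pipeline({"payload": "sensor reading"}))  # {'payload': ..., 'length': 14}
```

A real FaaS composition service additionally has to decide where the intermediate state lives and who pays for the time spent waiting between functions, which is exactly the kind of system-level trade-off the talk addresses.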
In the last four years, stream processing has gone from niche to mainstream, with real-time data processing systems gaining traction in not only fast-moving startups, but also their more skeptical and cautious enterprise brethren. In light of such pervasive adoption, is it safe to say we've finally reached the point where stream processing is a solved commodity? Are we done and ready to move on to the next big thing?
In this talk, I will argue that the answer to those questions is conclusively "no": stream processing as a field of research is alive and well. In fact, as streaming systems evolve to look more and more alike, the need for active exploration of new ideas is all the more pressing. And though streaming systems are more capable and robust than ever, they remain in many ways difficult to use, difficult to maintain, and difficult to understand. But we can change that.
I don't claim to have all the answers; no one does. But I do have a few ideas of where we can start. And by sharing my thoughts on some of the more interesting open problems in stream processing, and encouraging others to share theirs, I'm hoping that we as a research community can work together to help move the needle just a little bit further.
Online event processing applications abound, ranging from web and mobile applications to data processing. Such applications often require the ability to ingest, store, dispatch and process events. Until now, supporting all of these needs has required different systems for each task -- stream processing engines, message queuing middleware, and pub/sub messaging systems. This has led to unnecessary complexity in developing and operating such applications, raising the barrier to adoption in enterprises.
In this keynote, Karthik will outline the need to unify these capabilities in a single system and to make them easy to develop and operate at scale. Karthik will delve into how Apache Pulsar was designed to address this need with an elegant architecture. Apache Pulsar is a next-generation distributed pub-sub system that was originally developed and deployed at Yahoo and is running in production at more than 100 companies. Karthik will explain how the architecture and design of Pulsar provide the flexibility to support developers and applications needing any combination of queuing, messaging, streaming and lightweight compute for events. Furthermore, he will present real-life use cases of how Apache Pulsar is used for event processing, ranging from data processing tasks to web processing applications.
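As a flavor of the unified publish/consume model the keynote describes, here is a minimal sketch using the Python pulsar-client library. The broker URL, topic name, and subscription name are assumptions for the example; queuing semantics are obtained by attaching several consumers to the same subscription.

```python
# Minimal sketch with the pulsar-client Python library (pip install pulsar-client).
# Broker URL, topic and subscription names below are illustrative assumptions.
import pulsar

client = pulsar.Client('pulsar://localhost:6650')

# Publish an event to a topic.
producer = client.create_producer('my-topic')
producer.send(b'{"sensor": 42, "value": 17.3}')

# Consume the same topic through a named subscription.
consumer = client.subscribe('my-topic', subscription_name='dashboard')
msg = consumer.receive()
print(msg.data())
consumer.acknowledge(msg)

client.close()
```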
The number of computers in the world is still growing at an exponential rate. For instance, there are 9 billion new micro-controllers shipped every year, on top of other devices such as mobile phones and servers in the cloud and on-premises. This talk takes a different perspective on this trend and argues that it makes sense to look at all these micro-controllers, devices, and servers as one big supercomputer.
The talk gives examples that illustrate how to make this supercomputer sustainable, secure, programmable, and useful.
Despite the established scientific knowledge on efficient parallel and elastic data stream processing, it is challenging to combine generality and a high level of abstraction (targeting ease of use) with fine-grained processing aspects (targeting efficiency) in stream processing frameworks. Towards this goal, we propose STRETCH, a framework that aims at guaranteeing (i) high efficiency in throughput and latency of stateful analysis and (ii) fast elastic reconfigurations (without requiring state transfer) for intra-node streaming applications. To achieve these, we introduce virtual shared-nothing parallelization and propose a scheme to implement it in STRETCH, enabling users to leverage parallelization techniques while also taking advantage of shared-memory synchronization, which has been proven to boost the scaling-up of streaming applications while supporting determinism. We provide a fully-implemented prototype and, together with a thorough evaluation, correctness proofs for its underlying claims supporting determinism and a model (also validated empirically) of virtual shared-nothing and pure shared-nothing scalability behavior. As we show, STRETCH can match the throughput and latency figures of the best state-of-the-art solutions, while also achieving fast elastic reconfigurations (taking only a few milliseconds).
In modern Stream Processing Engines (SPEs), numerous diverse applications, which can differ in aspects such as cost, criticality or latency sensitivity, can co-exist in the same computing node. When these differences need to be considered to control the performance of each application, custom scheduling of operators to threads is of key importance (e.g., when a smart vehicle needs to ensure that safety-critical applications always have access to computational power, while other applications are given lower, variable priorities).
Many solutions have been proposed regarding schedulers that allocate threads to operators to optimize specific metrics (e.g., latency), but there is still a lack of a tool that allows arbitrarily complex scheduling strategies to be seamlessly plugged on top of an SPE. We propose Haren to fill this gap. More specifically, we (1) formalize the thread scheduling problem in stream processing in a general way, allowing ad-hoc scheduling policies to be defined, (2) identify the bottlenecks and the opportunities of scheduling in stream processing, (3) distill a compact interface to connect Haren with SPEs, enabling rapid testing of various scheduling policies, (4) illustrate the usability of the framework by integrating it into an actual SPE and (5) provide a thorough evaluation. As we show, Haren makes it possible to adapt the use of computational resources over time to meet the goals of a variety of scheduling policies.
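The kind of pluggable policy interface described above could look roughly like the following Python sketch, where a policy ranks operators and a worker picks the highest-ranked one. The class and method names are ours, chosen for illustration; they are not Haren's actual (Java) API.

```python
# Hypothetical sketch of a pluggable operator-scheduling policy, in the spirit
# of the compact interface described above; names are ours, not Haren's.
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class OperatorStats:
    operator_id: str
    queue_size: int        # pending tuples
    avg_latency_ms: float  # observed per-tuple processing latency
    priority: int          # application-assigned priority

class SchedulingPolicy(ABC):
    @abstractmethod
    def rank(self, stats: OperatorStats) -> float:
        """Higher rank means the operator is scheduled earlier."""

class LatencyFirst(SchedulingPolicy):
    def rank(self, stats: OperatorStats) -> float:
        return stats.queue_size * stats.avg_latency_ms

class PriorityFirst(SchedulingPolicy):
    def rank(self, stats: OperatorStats) -> float:
        return stats.priority

def pick_next(operators, policy: SchedulingPolicy) -> OperatorStats:
    """A worker thread asks the policy which operator to execute next."""
    return max(operators, key=policy.rank)
```

Swapping `LatencyFirst` for `PriorityFirst` changes the scheduling goal without touching the SPE, which is the separation of concerns the framework aims for.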
Data Stream Processing (DSP) has emerged as a key enabler for pervasive services that need to process data in a near real-time fashion. DSP applications keep up with the high volume of produced data by scaling their execution over multiple computing nodes, so as to process the incoming data flow in parallel. Workload variability requires elastically adapting the application parallelism at run-time in order to avoid over-provisioning. Elasticity policies for DSP have been widely investigated, but mostly under the simplifying assumption of homogeneous infrastructures. The resulting solutions do not capture the richness and inherent complexity of modern infrastructures, where heterogeneous computing resources are available on-demand. In this paper, we formulate the problem of controlling elasticity on heterogeneous resources as a Markov Decision Process (MDP). The resulting MDP is not easily solved by traditional techniques due to state space explosion, and thus we show how linear function approximation and tile coding can be used to efficiently compute elasticity policies at run-time. In order to deal with parameter uncertainty, we integrate the proposed approach with Reinforcement Learning algorithms. Our numerical evaluation shows the efficacy of the presented solutions compared to standard methods in terms of accuracy and convergence speed.
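Tile coding, mentioned above, turns a continuous state (e.g., input rate and current parallelism) into a small set of active binary features that a linear value-function approximation can use. The following self-contained sketch shows the idea; the number of tilings, ranges, and resolution are arbitrary assumptions, not the paper's configuration.

```python
# Illustrative tile-coding encoder: maps a continuous state to sparse binary
# features usable by linear function approximation in an elasticity controller.
import math

def tile_code(state, low, high, tiles_per_dim=8, num_tilings=4):
    """Return the index of the active tile in each of the overlapping tilings."""
    dims = len(state)
    active = []
    for t in range(num_tilings):
        offset = t / num_tilings  # each tiling is slightly shifted
        index = 0
        for d in range(dims):
            span = high[d] - low[d]
            pos = (state[d] - low[d]) / span * tiles_per_dim + offset
            coord = min(tiles_per_dim - 1, max(0, int(math.floor(pos))))
            index = index * tiles_per_dim + coord
        active.append(t * tiles_per_dim ** dims + index)
    return active

# Example state: (input rate in tuples/s, current parallelism).
features = tile_code((12000.0, 6), low=(0.0, 1), high=(50000.0, 16))
print(features)  # 4 active feature indices out of 4 * 8^2 = 256 total features
```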
In this paper, we address issues in the design and operation of a Big Active Data Publish/Subscribe (BAD Pub/Sub) system to enable the next generation of enriched notification systems that can scale to societal levels. The proposed BAD Pub/Sub system aims to ingest massive amounts of data from heterogeneous publishers and sources and deliver customized, enriched notifications to end users (subscribers) who express interest in these data items via parameterized channels. To support scalability, we employ a hierarchical architecture that combines a back-end big data cluster (to receive publications and data feeds, store data and process subscriptions) with a client-facing distributed broker network that manages user subscriptions and scales the delivery process. A key aspect of our brokers is their ability to aggregate subscriptions from end users to greatly reduce end-to-end overheads and loads. The skewed distribution of subscribers, their interests and the dynamic nature of societal-scale publications create load imbalance in the distributed broker network. We mathematically formulate the notion of broker load in this setting and derive an optimization problem to minimize the maximum load (an NP-hard problem). We propose a staged approach for broker load balancing that executes in multiple stages --- initial assignment of subscribers to brokers, dynamic subscriber migration during operation to handle transient and instantaneous loads, and occasional shuffles to re-stabilize the system. We develop a prototype implementation of our staged load balancing on a real BAD Pub/Sub testbed (multi-node cluster) with a distributed broker network and conduct experiments using real-world workloads. We further evaluate our schemes via detailed simulation studies.
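Since minimizing the maximum broker load is NP-hard, the initial-placement stage can be approximated with a standard greedy heuristic: repeatedly hand the heaviest remaining subscriber group to the currently least-loaded broker. The sketch below shows this LPT-style heuristic only as an illustration; it is not necessarily the scheme the paper proposes.

```python
# Sketch of a greedy initial placement that approximates "minimize the maximum
# broker load" (LPT-style heuristic); load estimates and inputs are toy values.
import heapq

def place(subscriber_loads, num_brokers):
    """subscriber_loads: dict subscriber_id -> load estimate."""
    heap = [(0.0, b, []) for b in range(num_brokers)]   # (load, broker, members)
    heapq.heapify(heap)
    # Assign heaviest subscribers first, always to the least-loaded broker.
    for sub, load in sorted(subscriber_loads.items(), key=lambda kv: -kv[1]):
        total, broker, members = heapq.heappop(heap)
        members.append(sub)
        heapq.heappush(heap, (total + load, broker, members))
    return {broker: (total, members) for total, broker, members in heap}

assignment = place({"s1": 5.0, "s2": 3.0, "s3": 3.0, "s4": 2.0, "s5": 1.0}, 2)
print(assignment)  # maximum broker load is 7.0 for this toy input
```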
Vehicles exchange Floating Car Data (FCD) to improve awareness beyond their local perception and thereby increase traffic safety and comfort. If the FCD is required at distant locations, FCD can be shared using the cellular network to notify vehicles early of upcoming road events. However, this monitoring of the roads congests the cellular network, which is already utilized by other applications. The available bandwidth for monitoring is expected to decrease further with the introduction of fully autonomous vehicles.
In this paper, we propose a hybrid dissemination approach for the distribution of road events in vehicular networks. Our approach aims to utilize only a predefined bandwidth for information exchange, which is achieved by two mechanisms: (i) offloading information to Wifi-based Vehicle-to-Vehicle (V2V) communication and (ii) filtering low-impact information. We offload the information to Wifi-based communication using non-cooperative game theory: each vehicle chooses the minimum impact of information it wants to receive via the cellular network. Through cooperation, the vehicles in proximity may provide information the other vehicles cannot receive. In the evaluation, we show that our approach significantly improves the data quality at the vehicles compared to traditional offloading approaches while respecting the predefined bandwidth constraints.
Location-based feed-following is a trending service that can provide contextually relevant information to users based on their locations. In this paper, we consider the view selection problem in a location-based feed-following system that continuously provides aggregated query results over feeds located within a certain range from users. Previous solutions adopt a user-centric approach and require re-optimization of the view selection whenever users move. Such methods limit the system's scalability in the number of users and can be very costly when a substantial number of users change their locations. To solve the problem, we propose the new concept of location-centric query plans. In this approach, we use a grid to partition the space into cells and generate view selection and query processing plans for each cell, and user queries are evaluated using the query plans associated with the users' current locations. In this way, the problem's complexity and dynamicity are largely determined by the granularity of the grid instead of the number of users. To minimize the query processing cost, we further propose an algorithm to generate an optimized set of materialized views that store the aggregated events of some feeds, and a number of location-centric query plans for each grid cell. The algorithm can also efficiently adapt the plans according to the movement of the users. We implement a prototype system using Redis as the back-end in-memory storage system for the materialized views and conduct extensive experiments over two real datasets to verify the effectiveness and efficiency of our approach.
We present a system for online, incremental composite event recognition. In streaming environments, the usual case is for data to arrive with a (variable) delay from, and to be retracted/revised by, the underlying sources. We propose RTECinc, an incremental version of RTEC, a composite event recognition engine with a formal, declarative semantics that has been shown to scale to several real-world data streams. RTEC deals with delayed arrival and retraction of events by computing composite event intervals from scratch at each query time. This often results in redundant computations. Instead, RTECinc deals with delays and retractions in a more efficient way, by updating only the affected events. We evaluate RTECinc theoretically, presenting a complexity analysis, and identify the conditions under which it outperforms RTEC. Moreover, we compare RTECinc and RTEC experimentally using two real-world datasets. The results are compatible with our theoretical analysis and show that RTECinc may outperform RTEC.
Processing event streams is an increasingly important area for modern businesses aiming to detect and efficiently react to critical situations in near real-time. The need to govern the behavior of systems where such streams exist has led to the development of numerous Complex Event Processing (CEP) engines, capable of detecting patterns and analyzing event streams. Although current CEP systems provide real-time analysis foundations for a variety of applications, several challenges arise due to language limitations and imprecise semantics, as well as the lack of power to handle big data requirements. In this paper, we discuss such systems, analyzing some of the most sensitive issues in this domain. Further in this context, we present our contributions expressed in LEAD, a formal specification for processing complex events. LEAD provides an algebra that consists of a set of operators for constructing complex events (patterns), temporally restricting the construction process and choosing among several selection and consumption policies. We show how to build LEAD rules to demonstrate the expressive power of our approach. Furthermore, we introduce a novel approach of interpreting these rules into a logical execution plan, built with temporal prioritized colored Petri nets.
Many domains, such as the Internet of Things and Social Media, demand combining data streams with background knowledge to enable meaningful analysis in real-time. When background knowledge takes the form of taxonomies and class hierarchies, Semantic Web technologies are valuable tools, and their extension to data streams, namely RDF Stream Processing (RSP), offers the opportunity to integrate the background knowledge with RDF streams. In particular, RSP engines can continuously answer SPARQL queries while performing reasoning. However, current RSP engines are at risk of failing to perform reasoning at the required throughput. In this paper, we formalize continuous hierarchical reasoning. We propose an optimized algorithm, namely C-Sprite, that operates in constant time and scales linearly in the number of continuous queries (to be evaluated in parallel). We present two implementations of C-Sprite: one exploits a language feature often found in existing stream processing engines, while the other is an optimized implementation. The empirical evaluation shows that the proposed solution is at least twice as fast as current approaches.
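The constant-time flavor of hierarchical reasoning can be illustrated by precomputing, once and offline, the transitive superclass closure of the taxonomy, so that each streamed type assertion is checked against a query class with a single hash lookup. The sketch below is only an illustration of that idea, with a made-up taxonomy; it is not C-Sprite's actual implementation.

```python
# Sketch: precompute the superclass closure, then answer "is this event an
# instance of class C?" per streamed event in constant time via a set lookup.
from collections import defaultdict

subclass_of = {              # toy taxonomy: child -> direct parent
    "Tram": "PublicTransport",
    "Bus": "PublicTransport",
    "PublicTransport": "Vehicle",
    "Car": "Vehicle",
}

ancestors = defaultdict(set)
for cls in subclass_of:
    cur = cls
    while cur in subclass_of:        # walk up to the root, done once offline
        cur = subclass_of[cur]
        ancestors[cls].add(cur)
    ancestors[cls].add(cls)

def matches(event_type: str, query_class: str) -> bool:
    """Constant-time check used for every streamed type assertion."""
    return query_class in ancestors.get(event_type, {event_type})

print(matches("Tram", "Vehicle"))          # True
print(matches("Car", "PublicTransport"))   # False
```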
Physical event detection has long been the domain of static event processors operating on numeric sensor data. This works well for large-scale strong-signal events such as hurricanes, and for important classes of events such as earthquakes. However, for a variety of domains there is insufficient sensor coverage, e.g., landslides, wildfires, and flooding. Social networks provide massive volumes of data from billions of users, but data from these generic social sensors contain much more noise than physical sensors. One of the most difficult challenges presented by social sensors is concept drift, where the terms associated with a phenomenon evolve and change over time, rendering static machine learning (ML) classifiers less effective. To address this problem, we develop the ASSED (Adaptive Social Sensor Event Detection) framework with an ML-based event processing engine and show how it can perform simple and complex physical event detection on strong- and weak-signal events with low latency, high scalability, and accurate coverage. Specifically, ASSED is a framework to support continuous filter generation and updates with machine learning using streaming data from high-confidence sources (physical and annotated sensors) and social networks. We build ASSED to support procedures for integrating high-confidence sources into social sensor event detection to generate high-quality filters and to perform dynamic filter selection by tracking its own performance. We demonstrate ASSED's capabilities through a landslide detection application that detects almost 350% more landslides compared to static approaches. More importantly, ASSED automates the handling of concept drift: four years after initial data collection and classifier training, ASSED achieves an event detection accuracy of 0.988 (without expert manual intervention), compared to 0.762 for static approaches.
Cyber Physical Systems (CPS) combine communication, computation and data storage capabilities to oversee and control physical processes in domains including manufacturing, medical monitoring and smart grids. CPS behavior can be remotely monitored by aggregating event data from various sensors, forwarded over wireless networks. One of the main challenges for CPS application developers is to manage event arrival-time boundaries and to trade off between timeliness and completeness: waiting too long until all events arrive can fail to produce a useful result, while not waiting long enough may lead to faults because the status information is incomplete. Monitoring the production lines in a factory, for example, depends on the aggregation of event data from multiple sensors in the distributed CPS, such as temperature and movement. Yet, predicting time-boundaries for individual event arrivals is difficult, if not impossible, for an application developer, because the wireless network and the sensing devices introduce latencies which vary continuously along with the load, status or environmental conditions of the network and the sensors. This paper proposes Khronos, a middleware that automatically determines timeouts for event arrivals that improve timeliness, given completeness constraint(s) specified by the CPS application developer and taking into account variations in event propagation delays. Extensive evaluations on a physical testbed show that Khronos considerably improves timeliness under varying network configurations and conditions, while satisfying application-specific completeness constraints.
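One common way to realize such a completeness-driven timeout (the paper's actual estimator may differ) is to keep a sliding window of recently observed event propagation delays and set the timeout to the delay percentile that corresponds to the requested completeness. The following self-contained sketch illustrates that idea with made-up numbers.

```python
# Sketch of a completeness-driven timeout: track recent propagation delays and
# pick the percentile that matches the completeness constraint, e.g. "wait
# until ~90% of events are expected to have arrived". Illustration only.
from collections import deque

class TimeoutEstimator:
    def __init__(self, completeness=0.95, window=200):
        self.completeness = completeness
        self.delays = deque(maxlen=window)   # recent delays in milliseconds

    def observe(self, delay_ms: float) -> None:
        self.delays.append(delay_ms)

    def timeout_ms(self) -> float:
        if not self.delays:
            return 100.0                     # conservative default (assumed)
        ordered = sorted(self.delays)
        k = int(self.completeness * (len(ordered) - 1))
        return ordered[k]

est = TimeoutEstimator(completeness=0.9)
for d in (12, 15, 11, 40, 13, 14, 90, 16, 12, 18):
    est.observe(d)
print(est.timeout_ms())   # 40 ms: long enough for ~90% of observed events
```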
Motivated by the growth of Internet of Things (IoT) technologies and the volumes and velocity of data that they can and will produce, we investigate automated data repair for event-driven, IoT applications. IoT devices are heterogeneous in their hardware architectures, software, size, cost, capacity, network capabilities, power requirements, etc. They must execute in a wide range of operating environments where failures and degradations of service due to hardware malfunction, software bugs, network partitions, etc. cannot be immediately remediated. Further, many of these failure modes cause corruption in the data that these devices produce and in the computations "downstream" that depend on this data.
To "repair" corrupted data from its origin through its computational dependencies in a distributed IoT setting, we explore SANS-SOUCI--a system for automatically tracking causal data dependencies and re-initiating dependent computations in event-driven IoT deployment frameworks. SANS-SOUCI presupposes an event-driven programming model based on cloud functions, which we extend for portable execution across IoT tiers (device, edge, cloud). We add fast, persistent, append-only storage and versioning for efficient data robustness and durability. SANS-SOUCI records events and their causal dependencies using a distributed event log and repairs applications dynamically, across tiers via replay. We evaluate SANS-SOUCI using a portable, open source, distributed IoT platform, example applications, and microbenchmarks. We find that SANS-SOUCI is able to perform repair for both software (function) and sensor produced data corruption with very low overhead.
Termination is an important non-functional property of distributed algorithms. In an event-driven setting, the interesting aspect of termination is the possibility of control flow loops through communication, which this paper aims to investigate.
In practice, it is often difficult to spot the possible communication behaviour of an algorithm at a glance. With a static analysis, the design process can be supported by visualizing the possible flow of messages and giving hints on possible sources of non-termination.
We propose a termination analysis for distributed algorithms formulated in an event-driven specification language. The idea is to construct a message flow graph describing the possible communication between components (input-action pairs). We show that acyclicity of that graph implies termination. While many interesting algorithms indeed contain cycles, we also suggest ways of detecting cycles which cannot lead to non-termination.
As a practical evaluation, we describe a concrete programming language together with a tool for automated termination analysis.
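The acyclicity check behind this analysis can be illustrated with a few lines of Python: nodes are (component, input) pairs, edges say "handling this input may send that message", and a depth-first search looks for a back edge. The example graph is made up for illustration and is not taken from the paper.

```python
# Sketch of the acyclicity check behind the termination analysis: build a
# message flow graph over (component, input) pairs and detect cycles via DFS.
def has_cycle(graph):
    """graph: dict node -> list of successor nodes."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {n: WHITE for n in graph}

    def visit(node):
        color[node] = GRAY
        for succ in graph.get(node, []):
            if color.get(succ, WHITE) == GRAY:
                return True                 # back edge -> communication cycle
            if color.get(succ, WHITE) == WHITE and visit(succ):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in graph)

message_flow = {                            # illustrative example
    ("client", "start"): [("server", "request")],
    ("server", "request"): [("client", "reply")],
    ("client", "reply"): [],                # no further messages: terminates
}
print(has_cycle(message_flow))  # False -> acyclic, so termination is implied
```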
Maritime monitoring systems support safe shipping as they allow for the real-time detection of dangerous, suspicious and illegal vessel activities. We present such a system using the Run-Time Event Calculus, a composite event recognition system with formal, declarative semantics. For effective recognition, we developed a library of maritime patterns in close collaboration with domain experts. We present a thorough evaluation of the system and the patterns both in terms of predictive accuracy and computational efficiency, using real-world datasets of vessel position streams and contextual geographical information.
In complex event processing (CEP), simple derived event tuples are combined in pattern matching procedures to derive complex events (CEs) of interest. Big Data applications analyze event streams online and extract CEs to support decision making procedures. At massive scale, such applications operate over distributed networks of sites where efficient CEP requires reducing communication as much as possible. Besides, events often encompass various types of uncertainty. Therefore, massively distributed Big event Data applications in a world of uncertain events call for communication-efficient, uncertainty-aware CEP solutions, which is the focus of this work. As a proof-of-concept for the applicability of our techniques, we show how we bridge the gap between two recent CEP prototypes which use the same CEP engine and each extend it towards only one of the dimensions of distribution and uncertainty.
For performance analysis, optimization and anomaly detection, there is a strong need to monitor industrial systems, which, along with modern datacenter infrastructures, feature a high level of decentralization. Continuous Distributed Monitoring (CDM) corresponds to the task of continuously keeping track of statistics from different distributed nodes. We review the feasibility of implementing state-of-the-art CDM algorithms in a large-scale, distributed, performance-critical production system. The study focuses on the Evolved Packet Core (EPC), an inherently distributed component of the 4G architecture for processing mobile broadband data. In this work, we propose adjustments to classical models that are needed to account for communication and computation delays and the hierarchical architecture present in production systems. We further demonstrate efficient CDM implementations in the EPC, and analyze trade-offs of accuracy versus savings in communication, as well as availability of the monitoring system.
Wikipedia states: "Situational awareness or situation awareness (SA) is the perception of environmental elements and events with respect to time or space, the comprehension of their meaning, and the projection of their future status." [1]
In 1976 John Boyd published a model he named the OODA loop (Observe, Orient, Decide, and Act) to capture the main elements of Situation Awareness. However, this is a pre-IT model that needs to be updated to leverage as much IT technology as possible; KIDS (Knowledge Intensive Data-management Systems) is such a model [6]. KIDS captures data, identifies abnormal conditions, transforms the data related to these abnormal conditions into the language of domain experts, gives guidance for the interpretation (classification and assessment), and recommends reactions. While many required technologies, especially event processing, are available, they have to be embedded and organized into a comprehensive SA model; some of these technologies must be extended and (significantly) improved.
Event-based systems encounter challenges of correctness and consistency. Correctness means that the execution results match the intention of the designer. Consistency means that the different data elements that co-exist within a system remain internally consistent with respect to the system's requirements. In this tutorial, we cover the different aspects of correctness and consistency. We discuss issues of correctness with respect to the temporal properties of the system, such as the order of events and the boundaries of time windows. We further discuss the different aspects of fine-tuning required in event-based systems, where different semantic interpretations are possible, such as repeating events or event consumption. The consistency discussion relates to two classic issues in the data management world: data dependencies and integrity constraint enforcement. Since event-based systems typically consist of loosely coupled components in a distributed environment, the challenge is compliance with data dependencies and assertions about consistency. Finally, yet importantly, we discuss the validation of event-based systems using static and dynamic analysis.
In the Big Data context, data streaming systems have been introduced to tame velocity and enable reactive decision making. However, approaching such systems is still too complex due to the paradigm shift they require, i.e., moving from scalable batch processing to continuous analysis and detection. Initially, modern big stream processing systems (e.g., Flink, Spark, Storm) lacked support for declarative languages to express streaming-based data processing tasks and mainly relied on providing low-level APIs for end-users to implement their tasks. Recently, however, this has been changing, and most of them have started to provide SQL-like languages for their end-users.
In general, declarative languages are playing a crucial role in fostering the adoption of stream processing. This tutorial focuses on introducing various approaches for declarative querying of the state-of-the-art big data streaming frameworks. In addition, we provide guidelines and practical examples on developing and deploying stream processing applications using a variety of SQL-like languages, such as Flink-SQL, KSQL and Spark Streaming SQL.
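To give a flavor of the queries the tutorial covers, the snippet below holds a one-minute tumbling-window aggregation in Flink-SQL-like syntax. Table and column names are illustrative, and the exact dialect differs between Flink SQL, KSQL, and Spark Streaming SQL (and between versions of each).

```python
# A declarative streaming query of the kind covered by the tutorial, kept as a
# string. Table/column names are assumptions; syntax approximates Flink SQL's
# group-window functions and may vary by engine and version.
WINDOWED_AVG = """
SELECT
  sensor_id,
  TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS window_start,
  AVG(temperature) AS avg_temp
FROM SensorReadings
GROUP BY
  sensor_id,
  TUMBLE(event_time, INTERVAL '1' MINUTE)
"""
# Roughly the same logic is expressed in KSQL with
# `WINDOW TUMBLING (SIZE 1 MINUTE)` and in Spark Structured Streaming SQL with
# the `window(event_time, '1 minute')` grouping function.
```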
Developing distributed systems is a complex task that requires programming different peers, often using several languages on different platforms, writing communication code, and handling data serialization and conversion.
We show how the multitier programming paradigm can alleviate these issues, supporting a development model where all peers in the system can be written in the same language and coexist in the same compilation units, communication code is automatically inserted by the compiler, and the language abstracts over data conversion and serialization. We present multitier programming abstractions, discuss their applicability step by step for the development of small applications, and examine larger case studies on distributed stream processing systems such as Apache Flink and Apache Gearpump.
The ACM DEBS 2019 Grand Challenge is the ninth in a series of challenges that seek to provide a common ground and evaluation criteria for a competition aimed at both research and industrial event-based systems. The focus of the 2019 Grand Challenge is on the application of machine learning to LiDAR data. The goal of the challenge is to perform classification of objects found in urban environments and sensed in several 3D scenes by the LiDAR. The applications of LiDAR and object detection go well beyond autonomous vehicles and are suitable for use in agriculture, waterway maintenance and flood prevention, and construction. This paper describes the specifics of the data streams provided in the challenge as well as the benchmarking platform that supports the testing of corresponding solutions.
In this paper, we present our approach to solving the DEBS 2019 Grand Challenge, which consists of classifying urban objects in different scenes originating from a LiDAR sensor. In general, at any point in time, LiDAR data can be considered as a point cloud, where a reliable feature extractor and a classification model are required to recognize 3-D objects in such scenes. Herein, we propose and describe an implementation of a 3-D point cloud object detection and classification system based on a 3-D global feature called Ensemble of Shape Functions (ESF) and a random forest object classifier.
In many robotic applications, a LiDAR (Light Detection and Ranging) scanner is used to gather data about the environment. Applications like autonomous vehicles require real-time processing of LiDAR point cloud data with high accuracy.
In this paper, we describe our implementation for the DEBS 2019 Grand Challenge: an object recognition system for high-speed LiDAR data streams. Our system includes a data processing pipeline with three main stages: (1) LiDAR data filtering, (2) object segmentation and noise reduction, and (3) multi-class object classification using a Convolutional Neural Network (CNN). Our evaluation shows that we can classify objects with high accuracy using the point cloud data and the neural network. However, we observed that the classification may fail if the object segmentation does not separate objects correctly into different segments, especially when the objects largely overlap each other. We propose a pre-processing approach for object segmentation based on separating LiDAR data into multiple area sectors before segmenting the objects.
The main mission of smart cities is to offer core infrastructure and quality of life to their citizens. This quality of life can be achieved by providing a cleaner and more sustainable environment. However, this mission is undermined by existing data-centric applications. Data-centric approaches feed on users' data and lack user privacy and security. As a result, smart city solutions selected by city administrations are not fully adopted by citizens. We need to accept that every city and its citizens are different, and that the same solution cannot be applied to all of them. If we accept this fact, we can offer a sustainable solution that is not only used by the city administration but also welcomed by its citizens.
In this paper we discuss the city- and user-centric solution "axxessity", which has been designed for and proposed to the city administration of Bonn. The solution is powered by the Open Urban Platform. In our approach, citizens and cities are at the center of attention when deciding on a digitization solution for the respective city. Our main achievement is that our solution puts citizens' privacy first. Moreover, the platform can be scaled and adapted to the heterogeneous demands of cities and citizens. As a result, it is welcomed by citizens and accepted by the city administration. At the same time, we accept that solutions and processes need to be upgraded and strengthened over time. Therefore, we take this opportunity to present our axxessity solution and platform, show its capabilities, and gather valuable feedback from researchers and industry experts.
Complex Event Processing (CEP) is the state-of-the-art technology for continuously monitoring and analyzing streams of events. One of the key features of CEP is support for pattern matching queries that detect user-defined sequences of predicates on event streams. However, due to the vast number of parameters, tweaking the queries to deliver the desired results is challenging. For this demonstration, we connected a database-backed CEP system (Jepc, ChronicleDB) with a scientific toolbox for interactive data exploration and geo visualization (Vat System), thus allowing users to interactively explore event data via CEP queries. Furthermore, the pairing of these systems allows the results of event queries to be combined with additional non-relational data-sets such as raster data, leading to insights beyond pure event-based analytics. In this demonstration, we showcase the first promising results of this combination by evaluating three use cases involving high-volume event data collected from aircraft in flight.
Information filtering has emerged as a prominent paradigm of timely information delivery; in such a setup, users submit profiles that express their information needs to a server, which is responsible for notifying them when information that matches their profiles becomes available. Traditionally, information filtering research has focused mainly on providing algorithmic solutions that enhance the efficiency and effectiveness of the filtering process, and has largely neglected the development of tools/systems that showcase the usefulness of the paradigm. In this work, we put forward Ping, a fully-functional content-based information filtering system aiming (i) to showcase the realisability of information filtering and (ii) to explore and test the suitability of the existing technological arsenal for information filtering tasks. The proposed system is entirely based upon open-source tools and components, is customisable enough to be adapted for different textual information filtering tasks, and puts emphasis on user profile expressivity, intuitive UIs, and timely information delivery. To assess the customisability of Ping, we deployed it in two distinct application scenarios and assessed its performance under both.
In present times, where the shipping industry plays an important role in global trade, there is an increased need for safer and more secure shipping. Maritime monitoring systems allow the real-time detection of dangerous, suspicious and illegal vessel activities. We demonstrate such a system that uses the Run-Time Event Calculus, a composite event recognition system with formal, declarative semantics. All the activities we demonstrate were detected using real-world maritime data and contextual geographical information.
Today's data-driven applications need to process, analyze and act on real-time data as it arrives. Massive amounts of data are continuously generated from multiple sources and arrive in a streaming fashion with high volume and high velocity, which makes them hard to process and analyze in real time. We introduce Striim, a distributed streaming platform that enables real-time integration and intelligence. Striim provides high-throughput, low-latency event processing. It can ingest streaming data from multiple sources, process data with an SQL-like query language, analyze data with sophisticated machine learning models, write data into a variety of targets, and visualize data for real-time decision making. In this demonstration, we showcase Striim's ability to collect, integrate, process, analyze and visualize large streaming data in real time.
Remarkable advantages of Containers (CNs) over Virtual Machines (VMs), such as lower overhead and faster startup, have gained the attention of Communication Service Providers (CSPs), as using CNs for providing Virtual Network Functions (VNFs) can save costs while increasing service agility. However, since it is not feasible to realise all types of VNFs in CNs, the coexistence of VMs and CNs is proposed. To put VMs and CNs together, an orchestration framework that can chain services across distributed and heterogeneous domains is required. To this end, we implemented a framework by extending and consolidating state-of-the-art tools and technologies originating from Network Function Virtualization (NFV), Software-Defined Networking (SDN) and cloud computing environments. This framework chains services provisioned across Kubernetes and OpenStack domains. During the demo, we deploy a service consisting of CN- and VM-based VNFs to demonstrate the different features provided by our framework.
DCEP-Sim facilitates simulation of distributed CEP where the latency and bandwidth limitations in the network are well reflected, but it currently lacks models to simulate the temporal behavior of event processing. In this demonstration, we use a modeling methodology to model the software execution of a CEP system called T-Rex. We instrument and trace T-Rex to parameterize a software execution model that is integrated into DCEP-Sim. Furthermore, we use this instance of DCEP-Sim to run simulations and see how significant the processing delay introduced by the model is compared to the transmission delay.
Distributed complex event processing evaluates CEP queries over data produced by geographically distributed sources. To cope with bandwidth restrictions, it is common to employ in-network processing where the operators of a query are placed at network nodes, especially those that act as data sources. While existing operator placement algorithms handle distributed sources, query results are gathered at one designated node -- the sink. We argue that such single-sink solutions are not applicable for non-hierarchical systems, in which multiple nodes need to be informed about query results. Also, in compositional applications, where the result of one query is the input to another query, having a single sink enforces centralisation. Using a real-world application scenario, we illustrate the need for a novel DCEP approach with multi-sink operator placement. Furthermore, we elaborate on the requirements to be incorporated by a DCEP approach targeting highly decentralized systems.
Many application domains involve monitoring the temporal evolution of large-scale graph data structures. Unfortunately, this task is not well supported by modern programming paradigms and frameworks for large-scale data processing. This paper presents ongoing work on the implementation of FlowGraph, a framework to recognize temporal patterns over properties of large-scale graphs. FlowGraph combines the programming paradigm of traditional graph computation frameworks with the temporal pattern detection capabilities of Complex Event Recognition (CER) systems. In a nutshell, FlowGraph distributes the graph data structure across multiple nodes that also contribute to the computation and store partial results for pattern detection. It exploits temporal properties to defer expensive computations as much as possible, in order to sustain a high rate of changes.
Data-driven solutions for the investment industry require event-based backend systems to process high-volume financial data feeds with low latency, high throughput, and guaranteed delivery modes.
At vwd we process an average of 18 billion incoming event notifications from 500+ data sources for 30 million symbols per day and peak rates of 1+ million notifications per second using custom-built platforms that keep audit logs of every event.
We currently assess modern open source event-processing platforms such as Kafka, NATS, Redis, Flink or Storm for use in our ticker plant to reduce the maintenance effort for cross-cutting concerns and leverage hybrid deployment models. For comparability and repeatability we benchmark candidates with a standardized workload derived from our real data feeds.
We have enhanced an existing light-weight open source benchmarking tool in its processing, logging, and reporting capabilities to cope with our workloads. The resulting tool wrench can simulate workloads or replay snapshots in volume and dynamics like those we process in our ticker plant. We provide the tool as open source.
As part of ongoing work we contribute details on (a) our workload and requirements for benchmarking candidate platforms for financial feed processing; (b) the current state of the tool wrench.
Financial markets are event- and data-driven to an extremely high degree. For making decisions and triggering actions, stakeholders require notifications about significant events and reliable background information that meet their individual requirements in terms of timeliness, accuracy, and completeness. As one of Europe's leading providers of financial data and regulatory solutions, vwd processes an average of 18 billion event notifications from 500+ data sources for 30 million symbols per day. Our large-scale distributed event-based systems handle daily peak rates of 1+ million event notifications per second and additional load generated by singular pivotal events with global impact.
In this poster we give practical insights into our IT systems. We outline the infrastructure we operate and the event-driven architecture we apply at vwd. In particular we showcase the (geo)distributed publish/subscribe broker network we operate across locations and countries to provide market data to our customers with varying quality of information (QoI) properties.
Evaluating modern stream processing systems in a reproducible manner requires data streams with different data distributions, data rates, and real-world characteristics such as delayed and out-of-order tuples. In this paper, we present an open source stream generator which generates reproducible and deterministic out-of-order streams based on real data files, simulating arbitrary fractions of out-of-order tuples and their respective delays.
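The core mechanism of such a generator can be sketched in a few lines: a fixed random seed makes runs reproducible, a fraction parameter controls how many tuples are delayed, and a bound caps the artificial delay. The parameters and record format below are illustrative and not those of the tool described above.

```python
# Minimal sketch of a deterministic out-of-order stream generator.
import random

def out_of_order_stream(records, fraction=0.1, max_delay=5, seed=42):
    """records: iterable of (timestamp, payload) pairs in timestamp order."""
    rng = random.Random(seed)        # fixed seed -> deterministic, reproducible
    delayed = []                     # held-back tuples: (release_time, ts, payload)
    for ts, payload in records:
        # release every held-back tuple whose artificial delay has elapsed
        due = sorted(d for d in delayed if d[0] <= ts)
        delayed = [d for d in delayed if d[0] > ts]
        for _, orig_ts, p in due:
            yield orig_ts, p         # emitted late: original timestamp < current one
        if rng.random() < fraction:
            delayed.append((ts + rng.randint(1, max_delay), ts, payload))
        else:
            yield ts, payload
    for _, orig_ts, p in sorted(delayed):
        yield orig_ts, p

if __name__ == "__main__":
    for ts, payload in out_of_order_stream((t, f"r{t}") for t in range(10)):
        print(ts, payload)
```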
Streaming applications are used for analysing large volumes of continuous data. Achieving efficiency and effectiveness in data streaming implies challenges that become all the more important when different parties (i) define applications' semantics, (ii) choose the Stream Processing Engine (SPE) to use, and (iii) provide the processing infrastructure (e.g., cloud or fog), and when one party's decisions (e.g., how to deploy applications or when to trigger adaptive reconfigurations) depend on information held by a distinct party (and possibly hard to retrieve). In this context, machine learning can bridge the involved parties (e.g., SPEs and cloud providers) by offering tools that learn from the behavior of streaming applications and help take decisions.
Such a tool, the focus of our ongoing work, can be used to learn which operators are run by a streaming application in a certain SPE, without relying on the SPE itself to provide such information. More concretely, the tool classifies the type of each operator at a desired level of granularity (from a coarse-grained characterization into stateless/stateful to a fine-grained operator classification) using general application-related metrics. As an example application, this tool could help a cloud provider decide which infrastructure to assign to a certain streaming application (run by a certain SPE), based on the type (and thus cost) of its operators.
Vehicles exchange Floating Car Data (FCD) to increase their awareness beyond their local perception. For this purpose, context-sensitive FCD is commonly distributed using the cellular network. To receive FCD via the cellular network, the vehicles share their current context (often their location) with a backend.
This permanent and accurate observation of the vehicle's context interferes with the privacy of the occupants. In this work, we use obfuscation to protect the location privacy of the occupants. For this purpose, we adapt the precision of the vehicle's context monitoring to the privacy needs of the occupants. As unnecessary FCD is transmitted to the vehicle, this imprecision reduces the share of useful FCD, which might lead to additional bandwidth consumption. We compensate for this additional consumption by filtering FCD with low impact for the vehicle and simultaneously relying on less-privacy-aware vehicles to provide FCD via local communication channels like Wifi.
With the ever increasing number of IoT devices getting connected, an enormous amount of streaming data is being produced at very high velocity. In order to process this large number of data streams, a variety of stream processing platforms and query engines are emerging. In stream query processing, an infinite data stream is divided into small chunks of finite data using a window operator. The window size and its type play an important role in the performance of any stream query engine. Due to the dynamic nature of IoT, the data stream rate fluctuates very often, thus impeding the performance of query engines. In this work, we investigate the impact of changes in data stream rates on the performance of a distributed query engine (e.g. Flink - https://flink.apache.org/). Our evaluation results indicate a direct impact of changes in stream rate and window size on the performance of the engine. We propose an adaptive and dynamic query window size and type selector to improve the resilience of query processing engines. We consider several characteristics of the input data streams, application workload, and resource constraints, and propose an optimal stream query window size and type for stream query execution.
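One simple way to picture such a selector (the paper's policy is likely more elaborate) is to observe the current stream rate and size the window so that each window holds roughly a target number of tuples, switching to a count-based window when the rate is very high. The thresholds and target in the sketch are arbitrary assumptions.

```python
# Illustrative sketch of adaptive window selection based on the observed rate.
def select_window(observed_rate_tps, target_tuples_per_window=10_000,
                  min_seconds=1, max_seconds=60):
    """Return (window_type, size) for the current stream rate in tuples/second."""
    if observed_rate_tps <= 0:
        return ("time", max_seconds)          # idle stream: long time window
    seconds = target_tuples_per_window / observed_rate_tps
    if seconds < min_seconds:
        # very fast stream: cap per-window volume with a count-based window
        return ("count", target_tuples_per_window)
    return ("time", min(max_seconds, max(min_seconds, round(seconds))))

print(select_window(500))      # ('time', 20)     -> 20 s windows at 500 tuples/s
print(select_window(50_000))   # ('count', 10000) -> count window at a high rate
```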
Large-scale classification of text streams is an essential problem that is hard to solve. Batch processing systems are scalable and proved their effectiveness for machine learning but do not provide low latency. On the other hand, state-of-the-art distributed stream processing systems are able to achieve low latency but do not support the same level of fault tolerance and determinism. In this work, we discuss how the distributed streaming computational model and fault tolerance mechanisms can affect the correctness of text classification data flow. We also propose solutions that can mitigate the revealed pitfalls.
We propose RTECinc, an incremental version of RTEC, a composite event recognition engine with formal, declarative semantics that has been shown to scale to several real-world data streams. RTEC deals with delayed arrival of events by computing everything from scratch at each query time. This is often inefficient, since it results in redundant computations. Instead, RTECinc deals with delays in a more efficient way, by updating only the affected events. We compare RTECinc and RTEC experimentally using a real-world dataset with position signals of vehicles.
Mobile text messages (SMS) are often used for sending authentication codes, meeting reminders and other messages which need to be forwarded from applications to various mobile network operators, to eventually be delivered to the users' mobile phones. To avoid message loss in the case of server failures, each message needs to be stored on one or more of the other servers until it has been successfully delivered. When this replication is implemented using a traditional clustered database (e.g., FairCom CTree, MongoDB, and MySQL), we achieve an unacceptably low throughput of at most a few hundred messages per second (MPS) getting stored, marked as being processed by a single server, and then finally deleted.
In order to improve upon this throughput, we will combine and extend ideas from existing work in the design and implementation of a new protocol better suited for this use case. An early version of such a protocol has been able to reach a throughput of about 14 000 MPS.
The existing replication protocols use a general system model which covers most common situations. By carefully scrutinizing this model and discarding assumptions not applicable in our particular situation, e.g. a consistent global order, we should be able to achieve a significantly higher throughput.
Temporally evolving graphs are an indispensable requisite of modern-day big data processing pipelines. Existing graph processing systems mostly focus on static graphs and lack essential support for pattern detection and event processing on graph-shaped data. On the other hand, stream processing systems support event and pattern detection, but they are inadequate for graph processing. This work lies at the intersection of the graph and stream processing domains with the following objectives: (i) it introduces the syntax of a language for the detection of temporal patterns in large-scale graphs; (ii) it presents a novel data structure called distributed label store (DLS) to efficiently store graph computation results and discover temporal patterns within them. The proposed system, called FlowGraph, unifies graph-shaped data with stream processing by observing graph changes as a stream flowing into the system. It provides an API to handle temporal patterns that predicate on the results of graph computations, alongside traditional graph computations.
An ever increasing number of services requires real-time analysis of collected data streams. Emerging Fog/Edge computing platforms are appealing for such latency-sensitive applications, encouraging the deployment of Data Stream Processing (DSP) systems in geo-distributed environments. However, the highly dynamic nature of these infrastructures poses challenges on how to satisfy the Quality of Service requirements of both the application and the infrastructure providers.
In this doctoral work we investigate how DSP systems can face the dynamicity of workloads and computing environments by self-adapting their deployment and behavior at run-time. Targeting geo-distributed infrastructures, we specifically search for decentralized solutions, and propose a framework for organizing adaptation using a hierarchical control approach. Focusing on application elasticity, we equip the framework with decentralized policies based on reinforcement learning. We extend our solution to consider multi-level elasticity, and heterogeneous computing resources. In the ongoing research work, we aim to face the challenges associated with mobility of users and computing resources, exploring complementary adaptation mechanisms.
Abstract submission for research track |
Research and Industry paper submission |
Tutorial proposal submission |
Grand challenge solution submission |
Author notification research and industry track |
Poster, demo & doctoral symposium submission |
Early registration | May 31st, 2019