Track 2: Architecture for real-time and scalable system-level and cloud-level monitoring and analysis

Abstract:

Track 2 will propose a new architecture and algorithms to provide a hierarchical streaming framework for the control, monitoring, aggregation and analysis of tracing and debugging data.

Challenges:

Once tracing sources become available, a proper framework is required to monitor, aggregate and process this data. The traditional approach is to collect all the data during execution and process it at a later time. However, this approach is not sufficient when real-time online monitoring is required, for example to dynamically adjust which tracepoints to activate and which snapshots to record.
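
As a rough illustration of such online control, the sketch below (in Python, with purely hypothetical names such as Tracepoint, poll_event_rate and EVENT_RATE_LIMIT, none of which come from an existing tracer API) shows a loop that disables the least essential tracepoints whenever the observed event rate exceeds an assumed off-chip streaming budget.

    import random
    import time

    EVENT_RATE_LIMIT = 10_000  # assumed budget: events/s we are willing to stream off-chip

    class Tracepoint:
        def __init__(self, name, priority):
            self.name = name
            self.priority = priority   # lower value = more essential
            self.enabled = True

    def poll_event_rate(tracepoints):
        """Stand-in for reading per-tracepoint event counters from the target."""
        return {tp.name: random.randint(0, 8_000) for tp in tracepoints if tp.enabled}

    def adjust(tracepoints, rates):
        """Disable the least essential tracepoints while the total rate exceeds the budget."""
        total = sum(rates.values())
        for tp in sorted(tracepoints, key=lambda t: t.priority, reverse=True):
            if total <= EVENT_RATE_LIMIT:
                break
            if tp.enabled:
                tp.enabled = False
                total -= rates.get(tp.name, 0)

    tracepoints = [Tracepoint("sched_switch", 0), Tracepoint("kmem_alloc", 2), Tracepoint("net_rx", 1)]
    for _ in range(3):                 # a few control iterations instead of an endless loop
        rates = poll_event_rate(tracepoints)
        adjust(tracepoints, rates)
        print({tp.name: tp.enabled for tp in tracepoints})
        time.sleep(0.1)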

The challenge arises at several levels. Within a single processing unit, whether a 4096-core GPGPU or a 1024-core Epiphany V, it is impossible to route all of the available detailed tracing data off-chip without severely impacting performance and changing the system behavior. Thus, a suitable organisation must be proposed to reduce the data collected at any given time, through selective activation, filtering, aggregation, sampling and similar techniques. To this end, it may be necessary to dedicate some of the available cores to monitoring tasks such as aggregation and anomaly detection. The same problem arises at the next level, when a cluster or cloud contains thousands of nodes; here again, some of the nodes may dedicate part of their resources to monitoring tasks.
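
The following sketch illustrates, under assumed names only, the two-level reduction idea: each core first aggregates its own events locally, and a worker standing in for a dedicated monitoring core merges the per-core summaries, so that only compact counters ever leave the node.

    import random
    from collections import Counter
    from concurrent.futures import ThreadPoolExecutor

    def core_events(core_id, n=1000):
        """Stand-in for the raw event stream produced by one core."""
        return [random.choice(["irq", "syscall", "page_fault"]) for _ in range(n)]

    def local_aggregate(core_id):
        """First reduction stage: runs next to the data, on (or near) the producing core."""
        return Counter(core_events(core_id))

    def node_aggregate(per_core_summaries):
        """Second stage, on a core dedicated to monitoring: merge per-core summaries."""
        total = Counter()
        for summary in per_core_summaries:
            total.update(summary)
        return total

    with ThreadPoolExecutor(max_workers=8) as pool:   # 8 workers stand in for thousands of cores
        summaries = list(pool.map(local_aggregate, range(8)))
    print(node_aggregate(summaries))   # only this compact summary is forwarded upward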

The monitoring infrastructure ultimately serves to detect and diagnose problems. In a large system running continuously, it is simply impossible to record everything and browse through the stored traces at a later time. The trace viewing and analysis tools therefore need to be integrated with the runtime system, in order to interact with the monitoring and aggregation processes. These processes may in turn decide to store portions of the traces for later detailed interactive analysis.
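
A minimal sketch of this snapshot mechanism, with an entirely hypothetical anomaly detector and output format, could keep recent events in a bounded ring buffer and persist the current window only when the detector fires:

    import json
    import random
    import time
    from collections import deque

    RING = deque(maxlen=5000)   # rolling window of recent events (assumed size)

    def anomaly(event):
        """Placeholder detector; a real one could flag latency spikes, error codes, etc."""
        return event["latency_us"] > 700

    def snapshot(path="snapshot.json"):
        """Persist the current window for later detailed interactive analysis in a trace viewer."""
        with open(path, "w") as f:
            json.dump(list(RING), f)

    for i in range(20_000):
        event = {"id": i, "latency_us": random.expovariate(1 / 100), "ts": time.time()}
        RING.append(event)
        if anomaly(event):
            snapshot()
            break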

At this scale, the trace analysis itself will benefit from specialised co-processors (e.g., GPGPUs) and parallel stream-processing frameworks. While it would be conceptually cleaner to completely separate the trace monitoring and analysis framework from the monitored system, it is often more efficient to process the data locally, where it is created, and to use dedicated resources within the co-processing units and the cluster to perform the required processing. A suitable interface will then be required between these distributed monitoring and analysis nodes and the user display application.
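
One possible shape for such an interface, sketched below with assumed class and method names, is a narrow query API where the display application requests coarse per-interval summaries from each analysis node and merges the replies, instead of pulling raw events across the network:

    from collections import Counter

    class AnalysisNode:
        """Holds locally pre-aggregated data; in a real deployment this sits on the cluster."""
        def __init__(self, name, per_second_counts):
            self.name = name
            self._counts = per_second_counts   # {second: Counter of event types}

        def summary(self, start, end):
            """Return merged counts for [start, end); only this summary crosses the network."""
            merged = Counter()
            for t in range(start, end):
                merged.update(self._counts.get(t, {}))
            return merged

    class DisplayClient:
        """The viewer side: fan a range query out to every node and merge the replies."""
        def __init__(self, nodes):
            self.nodes = nodes

        def histogram(self, start, end):
            total = Counter()
            for node in self.nodes:
                total.update(node.summary(start, end))
            return total

    nodes = [AnalysisNode("node-0", {0: Counter(irq=3), 1: Counter(syscall=5)}),
             AnalysisNode("node-1", {0: Counter(irq=1, page_fault=2)})]
    print(DisplayClient(nodes).histogram(0, 2))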

Plan:

In Track 2, the aim is to propose a new architecture and algorithms providing a hierarchical streaming framework for the control, monitoring, aggregation and analysis of tracing and debugging data. Moreover, this flexible architecture must support specialised user-defined analysis modules and interface with different display applications.
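
As an illustration of what a user-defined analysis module might look like in such a framework, the sketch below assumes a simple plug-in contract (the AnalysisModule base class and run_pipeline driver are hypothetical, not part of any existing tool) where each module consumes the event stream and exposes a result that any display application can read:

    from abc import ABC, abstractmethod

    class AnalysisModule(ABC):
        """Contract a user-defined analysis module would implement."""
        @abstractmethod
        def on_event(self, event): ...
        @abstractmethod
        def result(self): ...

    class LatencyHistogram(AnalysisModule):
        """Example user module: bucket event latencies."""
        def __init__(self, bucket_us=100):
            self.bucket_us = bucket_us
            self.buckets = {}

        def on_event(self, event):
            b = int(event["latency_us"] // self.bucket_us)
            self.buckets[b] = self.buckets.get(b, 0) + 1

        def result(self):
            return dict(sorted(self.buckets.items()))

    def run_pipeline(events, modules):
        """Fan each event out to every registered module; displays read the results afterwards."""
        for event in events:
            for m in modules:
                m.on_event(event)
        return {type(m).__name__: m.result() for m in modules}

    events = [{"latency_us": v} for v in (40, 120, 250, 260, 990)]
    print(run_pipeline(events, [LatencyHistogram()]))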