Track 3: Anomaly Detection and diagnosis with Machine Learning


Track 3 will study the problem of automated analysis, proposing new algorithms and data-processing techniques that reduce the manual intervention required to detect, diagnose and correct problems in the monitored systems.


Low-level tracing is mainly used by system experts who are able to devise efficient strategies to quickly find anomalies and problems. However, as the number and complexity of digital systems increase, the need for automated monitoring, anomaly detection and problem diagnosis becomes ever more obvious. During the discussions with the industrial partners, automated analysis was identified as an important common need. A system may perform badly because of an improper configuration, a change in the environment, unusual and possibly malicious network traffic, or simply an inefficient code modification. With proper tracing tools, a human will eventually find the problem by reading trouble reports (TR), looking at various metrics, comparing sequences of events, and contrasting the behavior of the problematic system with that of a correct system.

Recent advances in Machine Learning have led to impressive results in automating decision and diagnosis tasks, whether for intrusion detection, malware detection, Internet traffic classification, code correlation and program optimisation, or in other applications such as game playing, algorithmic trading and medical diagnosis. These techniques will be harnessed to automatically find correlations between changes in the code, the configuration or the environment on the one hand, and changes in performance on the other.

There are several libraries and frameworks for applying Machine Learning techniques, such as Weka, MOA, Apache Spark, Apache SINGA and Google TensorFlow. These tools are flexible in terms of input data, but a proper structure and model are required. The complexity and volume of the tracing data, and the complexity of the multi-core, multi-node, multi-layer systems studied, are significant challenges.
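As an illustration of what a proper structure could mean in practice, the following minimal sketch turns a flat list of trace events into one fixed-length feature row per task, which is the kind of tabular input most Machine Learning frameworks expect. The event tuples, the event names and the chosen features are hypothetical and are not taken from any particular tracer or from the project itself.

```python
from collections import defaultdict

# Hypothetical trace events: (timestamp_ns, task_id, event_name).
EVENTS = [
    (1000, "req-1", "syscall_entry"),
    (1400, "req-1", "sched_switch"),
    (2500, "req-1", "syscall_exit"),
    (3000, "req-2", "syscall_entry"),
    (9000, "req-2", "syscall_exit"),
]

# Assumed feature layout: total duration, then one count per event type.
FEATURES = ["duration_ns", "syscall_entry", "sched_switch", "syscall_exit"]


def build_feature_rows(events):
    """Group events by task and derive one fixed-length row per task."""
    by_task = defaultdict(list)
    for ts, task, name in events:
        by_task[task].append((ts, name))

    rows = {}
    for task, items in by_task.items():
        timestamps = [ts for ts, _ in items]
        counts = defaultdict(int)
        for _, name in items:
            counts[name] += 1
        duration = max(timestamps) - min(timestamps)
        rows[task] = [duration] + [counts[f] for f in FEATURES[1:]]
    return rows


rows = build_feature_rows(EVENTS)
for task, row in sorted(rows.items()):
    print(task, dict(zip(FEATURES, row)))
```

Real traces would need a richer model (nesting, cross-node links, per-layer metrics); the point here is only that each task execution ends up as a comparable vector.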


In Track 3, Anomaly Detection and diagnosis with Machine Learning, the aim is to enable the developer to properly model the semantics of the tracing events. Then, using the derived metrics and links, Machine Learning techniques will be proposed to group different trace segments, corresponding to different task executions, into clusters. These clusters (e.g., fast and slow executions) will then be compared in order to identify the differences (code versions, configuration, network traffic) and thus the underlying root causes, possibly linked with trouble reports.
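The clustering and comparison steps described above can be sketched as follows. This is a deliberately small illustration, not the algorithm the track will propose: it assumes each task execution has already been reduced to a duration plus a few context attributes, uses a one-dimensional k-means with two clusters as a stand-in for the clustering technique, and reports attributes whose most common value differs between the fast and slow groups. The records, field names and helper functions are all hypothetical.

```python
from collections import Counter

# Hypothetical per-execution records: duration plus context attributes.
EXECUTIONS = [
    {"duration_ms": 10, "code_version": "v1", "config": "A"},
    {"duration_ms": 12, "code_version": "v1", "config": "B"},
    {"duration_ms": 11, "code_version": "v1", "config": "A"},
    {"duration_ms": 48, "code_version": "v2", "config": "A"},
    {"duration_ms": 52, "code_version": "v2", "config": "B"},
    {"duration_ms": 50, "code_version": "v2", "config": "A"},
]


def two_means(values, iterations=20):
    """1-D k-means with k=2; returns a label (0=fast, 1=slow) per value."""
    lo, hi = float(min(values)), float(max(values))
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assign each value to the nearest of the two centroids.
        labels = [0 if abs(v - lo) <= abs(v - hi) else 1 for v in values]
        fast = [v for v, lab in zip(values, labels) if lab == 0]
        slow = [v for v, lab in zip(values, labels) if lab == 1]
        if not fast or not slow:
            break  # all values fell into one cluster
        # Move each centroid to the mean of its cluster.
        lo, hi = sum(fast) / len(fast), sum(slow) / len(slow)
    return labels


def differing_attributes(records, labels, keys):
    """Report attributes whose most common value differs between clusters."""
    diffs = {}
    for key in keys:
        fast = Counter(r[key] for r, lab in zip(records, labels) if lab == 0)
        slow = Counter(r[key] for r, lab in zip(records, labels) if lab == 1)
        if fast and slow:
            top_fast = fast.most_common(1)[0][0]
            top_slow = slow.most_common(1)[0][0]
            if top_fast != top_slow:
                diffs[key] = (top_fast, top_slow)
    return diffs


labels = two_means([r["duration_ms"] for r in EXECUTIONS])
diffs = differing_attributes(EXECUTIONS, labels, ["code_version", "config"])
print(labels)
print(diffs)
```

In this toy data set the slow executions all run a different code version while the configuration is evenly mixed, so only the code version is flagged as a candidate root cause. A difference in the most common value is a crude signal; a realistic comparison would need a statistical test and far more attributes.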