Automated monitoring and debugging of large scale manycore heterogeneous systems
Communication and computing infrastructure has evolved over the years, becoming more efficient, sophisticated, integrated and networked. Traditional central processing units are now supported by specialised co-processing units that speed up specific tasks such as graphics display, networking, signal processing and Machine Learning. These newer heterogeneous systems are becoming more complex at an even faster rate and are used not only in mobile devices and servers, but also in intelligent devices (the Internet of Things, or IoT) such as autonomous cars, smart robots and automated video surveillance systems. These processing units are highly parallel and may each contain over 8 billion logic elements (transistors). For example, newer Graphics Processing Units (GPUs), often used for general-purpose computing (GPGPU), contain several thousand computing cores.
As a result, even a simple operation such as initiating a phone call, making a Web search, routing a packet or displaying a video frame can involve many parallel cores on more than one processing unit, possibly on several servers. Moreover, the same operation a few seconds later may be served in a different way, by different cores and physical servers. Understanding the performance of these operations has therefore become extremely difficult, and the tools for that purpose are severely lacking. In this project, the tracing, monitoring, profiling and debugging tools for manycore systems will be re-architected to efficiently extract information from all units in all layers, from the hardware to the application, and to cope with the large number (several thousand) of cores. Furthermore, because manual problem investigation is becoming increasingly difficult given the sophistication of these systems and their thousands of cores, a particular emphasis of this research project is to develop new methods and algorithms that automate the analysis of the extracted monitoring data through Machine Learning techniques.
The availability of these tools will simplify and automate the debugging, tuning and monitoring of new complex applications running on heterogeneous manycore processors in the era of the cloud and the Internet of Things. With them, engineers will be able to quickly understand system behavior and performance and optimise system operation, leading to faster design of more efficient products.
Project Tracks
- Track 1: Data collection through the whole hardware/software stack
- Track 2: Architecture for real-time and scalable system-level and cloud-level monitoring and analysis
- Track 3: Anomaly detection and diagnosis with Machine Learning
- Track 4: Tracing and debugging support for advanced programming environments