STREAMER - a software platform for machine learning on data streams
With the proliferation of data sources, connected objects and sensors, streamed data is everywhere. Its arrival in real time raises the question of continuous machine learning. To address this major research issue, researchers from CEA List (Université Paris-Saclay, CEA) and the DAVID laboratory (Université Paris-Saclay, UVSQ) have joined forces in the StreamOps project, financed by the DATAIA Institute, to further develop the STREAMER platform. Their aim is to enable users to easily integrate and test machine learning algorithms in realistic data-stream contexts.
It all began in 2014 with the European SmartWater4Europe project (completed at the end of 2017), in which Sandra Garcia Rodriguez, an engineer-researcher at CEA List, participated. “Our goal in this project was to detect leaks in the Paris water network. To do this, we needed software which could receive all the incoming data as a continuous stream, learn from it and use the resulting models to detect anomalies,” she remembers. The problem was how to evaluate learning algorithms on a continuous flow of data. “It was in trying to answer this question that we had the idea of developing a platform capable of realistically simulating these continuous data flows in order to integrate and test machine learning algorithms,” continues Sandra Garcia Rodriguez.
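The evaluation question raised here is commonly addressed in the stream-learning literature with prequential ("test-then-train") evaluation, in which each arriving example is scored by the current model before the model learns from it, so performance is always measured on unseen data. A minimal Python sketch (purely illustrative, not STREAMER code; the toy majority-class model is hypothetical):

```python
# Prequential ("test-then-train") evaluation: score each arriving
# example with the current model first, then train on it.
def prequential_accuracy(stream, predict, update):
    correct = total = 0
    for x, y in stream:
        if predict(x) == y:   # test on the example first...
            correct += 1
        update(x, y)          # ...then let the model learn from it
        total += 1
    return correct / total

# Toy usage: a majority-class "model" on a boolean label stream.
counts = {0: 0, 1: 0}
predict = lambda x: max(counts, key=counts.get)

def update(x, y):
    counts[y] += 1

# Simulated stream where one label in three is True.
stream = [(i, i % 3 == 0) for i in range(300)]
acc = prequential_accuracy(stream, predict, update)
print(f"prequential accuracy: {acc:.2f}")
```

The majority class covers two thirds of the labels, so the measured accuracy settles near that level once the counts stabilise.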
StreamOps: a project at the interface of algorithms, business and software
This initial idea resulted in 2018 in the StreamOps project, led by Cédric Gouy-Pailler, laboratory manager at CEA List (Université Paris-Saclay, CEA), and Karine Zeitouni, a professor at UVSQ who heads the ADAM team at the DAVID laboratory (Données et algorithmes pour une ville intelligente et durable, “Data and algorithms for a smart and sustainable city”; Université Paris-Saclay, UVSQ). “With StreamOps, our aim was to continue the work started by Sandra and offer the scientific community a simple tool for developing and testing powerful algorithms which was as close as possible to the conditions encountered in the field,” explains Cédric Gouy-Pailler. As Karine Zeitouni recalls, this objective was all the more ambitious as it required “developing algorithms which bridge the gap between a community which sees the Internet of Things (IoT) as a flow of data which it analyses proactively as it is recorded, and a community which sees data as a time series which it analyses from a historical point of view”.
STREAMER - an open source platform for researchers and industry
Three years later, the goal has been reached with the stabilisation of STREAMER, the first research and integration platform for retrieving, manipulating and analysing streamed data in realistic operational streaming contexts. STREAMER is an open source solution which can be used on Linux, Windows and macOS. It provides a free interface which facilitates monitoring and supports the integration of algorithms written in any programming language (Python, R, Java, etc.). Now fully operational, STREAMER is aimed at two main target user groups. First, there are the data scientists who would like to test their algorithms in realistic data-stream contexts. “Thanks to the existing modules, data scientists will be able to simulate the sending of data into the platform and integrate their algorithms to test them,” explains Sandra Garcia Rodriguez. “Secondly, we are hoping to reach industrial partners, who are also very interested in having automation tools for processing data which arrives as a stream,” adds Cédric Gouy-Pailler.
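To make this workflow concrete, here is a small self-contained Python sketch (an illustration only, assuming nothing about STREAMER's actual modules or API) of the pattern described above: simulate the sending of data as a stream, then run an anomaly-detection algorithm over it, here a running z-score maintained with Welford's online statistics:

```python
import random

def simulate_stream(n, seed=0):
    """Simulate a sensor feed, one reading at a time."""
    rng = random.Random(seed)
    for i in range(n):
        # Inject one anomalous spike at step 500.
        yield 100.0 if i == 500 else rng.gauss(10.0, 1.0)

def detect_anomalies(stream, threshold=4.0):
    """Flag readings more than `threshold` running std-devs from the
    running mean (Welford's online mean/variance algorithm)."""
    count, mean, m2 = 0, 0.0, 0.0
    anomalies = []
    for i, x in enumerate(stream):
        if count >= 30:  # warm-up period before scoring
            std = (m2 / (count - 1)) ** 0.5
            if std > 0 and abs(x - mean) / std > threshold:
                anomalies.append(i)
                continue  # keep anomalies out of the model's statistics
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    return anomalies

anomalies = detect_anomalies(simulate_stream(1000))
print(anomalies)  # the injected spike at index 500 should be flagged
```

The point of the sketch is the shape of the computation, not the detector itself: the model sees each reading exactly once and keeps only constant-size state, which is the constraint any algorithm integrated into a streaming platform must satisfy.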
Multiple fields of application: cybersecurity, health, environment, etc.
While its development continues thanks to the work of Jingwei Zuo, a PhD student at UVSQ, and Mohammad AlShaer, a post-doctoral fellow at CEA List, STREAMER is being used in 2021 in several projects in various fields. “Internally, we are thinking of using the tool as a platform for experimenting with algorithms for detecting suspicious Internet requests, with a view to making rapid decisions in the field of cybersecurity. We are also working on confidence in AI (a key challenge) with Confiance.ai, led by IRT SystemX, to develop new tools to increase confidence in AI algorithms,” says Cédric Gouy-Pailler. “In the field of the environment, we’re developing algorithms to characterise individual exposure to air pollution using measurements collected by micro-sensors as part of the ANR Polluscope project,” explains Karine Zeitouni. This work will also be extended thanks to the data collected as part of the European GoGreen Routes project on smart cities, which started in September 2020. “This time we will rely on time series generated by fixed sensors placed at strategic locations in the urban environment,” explains Karine Zeitouni. Finally, new applications are already envisaged, whether in the field of health, with the monitoring of patients and the detection of risks, or in Industry 4.0, with a view to the rapid detection of defects on a production line. “The influx of data today is everywhere and the needs are immense. This is just the start. We therefore need to continue our research and experimentation to achieve our goals and enable incremental learning as and when we receive the data on an ongoing basis,” concludes Karine Zeitouni.
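The incremental learning described in the closing quote, updating a model as each observation arrives rather than retraining on stored history, can be illustrated in a few lines of Python. This is a generic sketch, not code from STREAMER or any of the projects named above: a stochastic-gradient linear model that consumes a (hypothetical) stream one pair at a time and never stores it.

```python
import random

def online_sgd(stream, lr=0.01):
    """Fit y ≈ w*x + b incrementally: one gradient step per
    observation, constant memory, no stored history."""
    w, b = 0.0, 0.0
    for x, y in stream:
        err = (w * x + b) - y
        w -= lr * err * x
        b -= lr * err
    return w, b

rng = random.Random(42)
# Simulated stream: y = 2x + 1 plus small noise, seen one pair at a time.
stream = ((x, 2 * x + 1 + rng.gauss(0, 0.05))
          for x in (rng.uniform(-1, 1) for _ in range(5000)))

w, b = online_sgd(stream)
print(round(w, 2), round(b, 2))  # approximately 2.0 and 1.0
```

Because the stream is a generator, no observation is ever held in memory after its gradient step, which is precisely the “learning as and when we receive the data” regime the researchers describe.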