Although analytics should be agnostic with regard to how the data is fed to the platform, we have to consider several potential pitfalls that can affect the efficiency of the analytics. There are several strategies that we can use to feed I-IoT data to the platform:
- Bulk ingestion, for example, one file daily
- Small portions, for example, one file every five minutes
- Data streams, where data is fed continuously with low latency
Data is also affected by several issues:
- It might be in the wrong order. For example, a data point at 18:00 might be sent at 18:10 and a data point at 17:59 might be sent at 18:11.
- It might be of poor quality.
- It might have holes in it.
- It might have anomalous spikes in it.
- It might be frozen. This refers to a situation where you have a suspiciously flat number for a long time.
These issues are illustrated in the following diagram:
Data being affected by issues
Data might also be delayed for a long period of time. We know this from personal experience in the oil and gas industry: one customer reactivated their connection to the cloud after six months of being disconnected, and the backlog of sensor data filled the data lake in three days. Unfortunately, the analytics processed the data for the entire period and raised a whole series of anomalies and alerts. These alerts were useless because they referred to the time in which the customer was disconnected, so the operations center was flooded with junk alerts.
To build a real I-IoT platform, we have to design an architecture that is robust enough to address these issues. For example, we can adopt a timeout for data that arrives too late, or we can pre-process data and mark any data that is in the wrong order or that is frozen. Moreover, we can interpolate data before feeding it to the analytics to avoid holes.
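As a minimal sketch of the timeout idea, assume each data point carries both the time it was measured and the time it arrived at the platform. The field names, the `split_by_lateness` helper, and the 24-hour threshold are all hypothetical. Late points are kept but routed away from the alerting analytics:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical threshold: points arriving more than 24 hours after they
# were measured are treated as too late for alerting.
MAX_DELAY = timedelta(hours=24)


@dataclass
class DataPoint:
    tag: str               # sensor identifier
    measured_at: datetime  # timestamp assigned at the source
    received_at: datetime  # timestamp assigned on arrival at the platform
    value: float


def split_by_lateness(points, max_delay=MAX_DELAY):
    """Return (on_time, late) lists; late points are kept, not dropped."""
    on_time, late = [], []
    for p in points:
        if p.received_at - p.measured_at > max_delay:
            late.append(p)
        else:
            on_time.append(p)
    return on_time, late


if __name__ == "__main__":
    now = datetime(2024, 1, 10, 12, 0)
    points = [
        DataPoint("pump-1.pressure", now - timedelta(minutes=5), now, 4.2),
        DataPoint("pump-1.pressure", now - timedelta(days=180), now, 3.9),
    ]
    on_time, late = split_by_lateness(points)
    print(len(on_time), "on time,", len(late), "late")
```

With a filter of this kind, a six-month backlog could still be stored and used for historical analysis without raising alerts in the operations center.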
We should avoid manipulating the raw data during pre-processing. The best approach is to simply mark the data point or the events so that we can restore the raw data if there are errors.
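The marking approach might look like the following sketch. The `Sample` type, the flag names, and the thresholds are illustrative assumptions, and the checks are deliberately simple (for instance, the jump test also flags the point that returns from a spike). The raw value is never changed; quality marks are stored next to it:

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Sample:
    measured_at: datetime
    value: float                             # raw value, never modified
    flags: set = field(default_factory=set)  # quality marks stored alongside


def mark_samples(samples, max_gap=timedelta(minutes=5),
                 frozen_run=5, spike_delta=50.0):
    """Annotate samples in arrival order; all thresholds are illustrative."""
    previous = None  # last sample that arrived in order
    run_length = 1   # length of the current run of identical values
    for s in samples:
        if previous is not None and s.measured_at < previous.measured_at:
            s.flags.add("out_of_order")
            continue  # do not compare a late point with its neighbours
        if previous is not None:
            if s.measured_at - previous.measured_at > max_gap:
                s.flags.add("after_gap")   # a hole precedes this point
            if abs(s.value - previous.value) > spike_delta:
                s.flags.add("spike")       # large jump from the previous point
            run_length = run_length + 1 if s.value == previous.value else 1
            if run_length >= frozen_run:
                s.flags.add("frozen")      # suspiciously flat signal
        previous = s
    return samples


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1, 18, 0)
    raw = [
        Sample(t0, 10.0),
        Sample(t0 + timedelta(minutes=1), 10.5),
        Sample(t0 - timedelta(minutes=1), 9.8),    # arrives late
        Sample(t0 + timedelta(minutes=20), 11.0),  # arrives after a hole
        Sample(t0 + timedelta(minutes=21), 95.0),  # sudden jump
    ]
    for s in mark_samples(raw):
        print(s.measured_at.time(), s.value, sorted(s.flags))
```

Because the flags live in a separate field, a wrong threshold can be corrected later by clearing the flags and marking the same raw samples again.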
In the previous section, we learned about the technologies required to build analytics. Now, we need to deploy them, assuming that the infrastructure supports our use case. We then need to define the method to trigger the analytics. We have three methods we can use to do this:
- Stream analytics
- Micro-batch analytics
- Condition-based analytics
Streaming versus batch analytics
Streaming analytics are triggered when data flows into the system. This method works very well for simpler analytics, such as simple rules or signal processing. However, things become more complicated if the analytics need an additional data source or a minimum amount of data before they can run.
An alternative to streaming is to schedule the analytics to run regularly after a set period of time, such as every 10 seconds. This involves checking the availability of the data and then pulling it from the data lake or the time-series database. We call these analytics micro-batch analytics.
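A micro-batch run can be sketched as follows. `InMemoryStore` is a hypothetical stand-in for the data lake or time-series database, and `run_micro_batch`, `schedule`, and the `min_rows` parameter are names invented for this illustration:

```python
import time


class InMemoryStore:
    """Hypothetical stand-in for a data lake or time-series database."""

    def __init__(self):
        self._rows = []  # list of (timestamp, value) tuples

    def append(self, timestamp, value):
        self._rows.append((timestamp, value))

    def read_since(self, timestamp):
        return [r for r in self._rows if r[0] > timestamp]


def run_micro_batch(store, analytic, last_processed, min_rows=3):
    """One scheduled run: check availability, pull the data, then compute.

    Returns (result, new_last_processed); result is None when not enough
    new data is available yet.
    """
    rows = store.read_since(last_processed)
    if len(rows) < min_rows:
        return None, last_processed
    result = analytic([value for _, value in rows])
    return result, max(ts for ts, _ in rows)


def schedule(store, analytic, period_s=10.0, runs=3):
    """Call run_micro_batch every period_s seconds, a fixed number of times."""
    last = float("-inf")
    for _ in range(runs):
        result, last = run_micro_batch(store, analytic, last)
        if result is not None:
            print("result:", result)
        time.sleep(period_s)
```

Note that `run_micro_batch` queries the store on every run, even when nothing new has arrived.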
Micro-batch analytics have a few great advantages:
- They are easy to debug and to develop offline
- They can work on relatively large batches of data
- They can re-process data when needed
Unfortunately, these analytics waste resources by continuously checking whether data is available. A good compromise is to use streaming analytics for simple rules and signal processing, and to have the streaming layer also register the timestamp of the last data available for each piece of equipment, so that micro-batch analytics run only when new data has actually arrived.
The following diagram shows the proposed implementation:
Batch and streaming analytics
Data is acquired from sensors and feeds the streaming analytics layer. Data points can be cleaned and a simple rule can be applied. In parallel, we register the last timestamp and we store the cleaned data in the storage layer. Micro-batch analytics are activated by the data’s availability.
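The flow just described can be sketched in a few lines. The `Pipeline` class, its attribute names, and the thresholds are assumptions made for illustration only; a real platform would use a message broker and a proper storage layer instead of in-memory dictionaries:

```python
from collections import defaultdict


class Pipeline:
    def __init__(self, batch_span=60.0, high_limit=100.0):
        self.storage = defaultdict(list)  # equipment -> [(ts, value)]
        self.last_seen = {}               # equipment -> last data timestamp
        self.last_batch = {}              # equipment -> end of the last batch
        self.batch_span = batch_span      # seconds of data per micro-batch
        self.high_limit = high_limit      # threshold for the simple rule
        self.alerts = []
        self.batch_results = []

    def on_data(self, equipment, ts, value):
        """Streaming layer: clean, apply a simple rule, store, register."""
        if value is None:                 # minimal cleaning: drop empty readings
            return
        if value > self.high_limit:
            self.alerts.append((equipment, ts, "high value"))
        self.storage[equipment].append((ts, value))
        self.last_seen[equipment] = max(ts, self.last_seen.get(equipment, ts))
        self._maybe_run_batch(equipment)

    def _maybe_run_batch(self, equipment):
        """Run the micro-batch only when a full span of new data is available."""
        first_ts = self.storage[equipment][0][0]
        start = self.last_batch.setdefault(equipment, first_ts)
        end = self.last_seen[equipment]
        if end - start < self.batch_span:
            return
        values = [v for t, v in self.storage[equipment] if start <= t < end]
        mean = sum(values) / len(values)
        self.batch_results.append((equipment, start, end, mean))
        self.last_batch[equipment] = end


if __name__ == "__main__":
    pipeline = Pipeline()
    for ts in range(0, 130, 10):
        pipeline.on_data("pump-1", float(ts), 150.0 if ts == 30 else 50.0)
    print(pipeline.alerts)
    print(pipeline.batch_results)
```

Here the micro-batch (a simple mean) never polls the storage layer; it runs only when the registered timestamp shows that enough new data has arrived.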
Several cloud-based solutions support additional features for stream analytics such as windowing, which refers to the ability to collect a small batch of data before processing it. We saw this feature previously in our custom architecture and during the Azure exercise.
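To make the idea of windowing concrete, here is a generic sketch of a tumbling (fixed-size, non-overlapping) window. It is not the API of any particular cloud service; the `tumbling_windows` function is a hypothetical helper that assumes points arrive in timestamp order:

```python
def tumbling_windows(points, size_s):
    """Group (timestamp, value) pairs into fixed, non-overlapping windows.

    Yields (window_start, values) each time a window closes. Assumes points
    arrive in timestamp order; the last, still-open window is yielded at
    the end of the input.
    """
    current_start, bucket = None, []
    for ts, value in points:
        start = ts - (ts % size_s)  # start of the window containing ts
        if current_start is not None and start != current_start:
            yield current_start, bucket
            bucket = []
        current_start = start
        bucket.append(value)
    if bucket:
        yield current_start, bucket


if __name__ == "__main__":
    stream = [(0, 1.0), (4, 2.0), (10, 3.0), (12, 4.0), (25, 5.0)]
    for window_start, values in tumbling_windows(stream, 10):
        print(window_start, values)
```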
Analytics can also be triggered by unpredictable events such as human interactions or rare events such as the abrupt shutdown of a piece of equipment that is being monitored. The previous framework remains valid; we simply trigger the execution based on a rule condition.
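A rule-based trigger of this kind could be sketched as follows. The `ConditionTrigger` class, the event fields, and the `TRIPPED` status value are hypothetical; the trigger fires once when the condition becomes true rather than on every matching event:

```python
class ConditionTrigger:
    """Run an analytic when a rule condition becomes true (edge-triggered)."""

    def __init__(self, condition, analytic):
        self.condition = condition  # callable: event dict -> bool
        self.analytic = analytic    # callable: event dict -> result
        self._was_true = False

    def on_event(self, event):
        is_true = bool(self.condition(event))
        fire = is_true and not self._was_true  # fire once per occurrence
        self._was_true = is_true
        return self.analytic(event) if fire else None


if __name__ == "__main__":
    trigger = ConditionTrigger(
        condition=lambda e: e["status"] == "TRIPPED",
        analytic=lambda e: "run shutdown diagnostics for " + e["asset"],
    )
    for status in ["RUNNING", "TRIPPED", "TRIPPED", "RUNNING"]:
        print(status, "->", trigger.on_event({"asset": "gt-7", "status": status}))
```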
It’s a different story, however, if a human wants to work with these analytics to perform, for instance, a what-if scenario, or to discover a specific pattern.
Interactive analytics represents a combination of queries, a user interface, and a core for the analytics. It is not possible to define the best method or technology to implement interactive analytics. The best suggestion is to keep the core of the analytics in a shared library and to develop the user interface as a small micro-app that uses a common framework such as Power BI or Google Data Studio, Python-based libraries such as Dash or Bokeh, JavaScript libraries such as D3.js (https://d3js.org/), or R-based libraries such as R Shiny (https://shiny.rstudio.com/). Alternatively, we can use more advanced tools such as Jupyter (http://jupyter.org/) or Apache Zeppelin (https://zeppelin.apache.org/).
Analytics on the cloud
Cloud analytics offer significant advantages, but also have two important disadvantages. They can be deployed and managed easily, and they can scale rapidly. They can also work on large amounts of data across the entire fleet and across assets. However, they cannot process high-frequency data (more than 100 GB every hour with a sampling interval of less than 1 ms), and they cannot serve use cases that require low latency. Neither issue is due to the cloud analytics themselves: the first is due to the available bandwidth and the second to the latency of the network. For these special cases, we prefer to run on the edge.
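A quick back-of-the-envelope calculation shows why bandwidth is the limit. The helper functions below are illustrative, and the conversion assumes decimal units (1 GB = 10^9 bytes):

```python
def required_bandwidth_mbps(gigabytes_per_hour):
    """Sustained uplink in Mbit/s needed to ship the volume (1 GB = 1e9 bytes)."""
    bits_per_hour = gigabytes_per_hour * 1e9 * 8
    return bits_per_hour / 3600 / 1e6


def samples_per_second(sampling_interval_ms):
    """Samples produced each second by one signal at the given interval."""
    return 1000.0 / sampling_interval_ms


print(round(required_bandwidth_mbps(100), 1))  # 222.2
print(samples_per_second(1.0))                 # 1000.0
```

Shipping 100 GB every hour therefore needs a sustained uplink of roughly 222 Mbit/s from the site, before any protocol overhead is counted.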
Analytics on the edge
On-edge analytics are the newest class of analytics in the IoT but the oldest class of analytics in the industrial sector. Indeed, industry has been developing analytics on the controller from the very beginning. Recently, the idea that everything should be cloud-based has lost ground, and vendors have started deploying cloud analytics on the edge.