In our earlier post on intelligent edge computing management, we identified edge computing as a crucial stepping stone on our journey to the network compute fabric. The distributed, heterogeneous and resource-constrained nature of edge computing brings complexity to management. Intelligent management functions driven by AI/ML can help in addressing these challenges.
In this post, we present a distributed architecture involving several components to enable the intelligent edge management services and discuss a set of fault and performance management related use cases.
The distributed nature of edge computing requires a distributed system to execute management functions. To enable AI/ML-based management tasks, new supporting functionalities need to be dispersed within central data centers (called Global Sites) and geographically distributed within edge sites. We propose an architecture we call Intelligent Edge Cloud (IEC) for enabling the AI/ML-based management of edge computing.
There are four identified subsystems:
- Intelligent Operations (IO)
- Monitoring (MON)
- Data Management (DM)
- Model Management (MM)
Each subsystem has a global component within the global site and multiple local components residing within each of the edge sites. For instance, the IO subsystem is composed of the Global and Local IO subsystems.
The IO subsystem provides the execution environment and its associated interface for running the control logic of the specific management function, such as anomaly detection, fault prediction, performance profiling, or system intrusion detection. While both the global and local IO subsystems provide the runtime needed to execute the management functions, the global IO subsystem also provides an interface for multi/cross-site service logic development and deployment.
The global IO subsystem exposes Platform as a Service (PaaS) and Software as a Service (SaaS) interfaces for the developers of intelligent management functions. These interfaces are used to deploy and run their AI/ML driven intelligent management functions. Beyond the interface, the runtime also provides developers unified access to other subsystems of the IEC architecture such as data access, collection and processing, as well as model lifecycle management.
In this way, a developer can focus on developing the high-level control logic that is specific to the management functions of interest, instead of worrying about the actual data collection, model training and other similar processes.
Let’s take an example!
A developer wants to develop an anomaly detection function for metrics collected from an application. The IO PaaS interface allows the developer to specify the binaries to be deployed at selected sites and the respective function parameters. The binaries may include those to train the anomaly detection model and those to evaluate it against live data. The parameters include the type and amount of data to be collected, the AI/ML model to be used, whether retraining is needed or not, the retraining frequency and the inference frequency. Based on this specification, the IO subsystem will call on the other subsystems to execute the logic for anomaly detection as specified by the developer. Such a specification can also be saved as a template and provisioned by the IO SaaS interface – then an anomaly detection logic developer can simply select the template and its parameters to be updated, such as the sites to deploy, the training data source, or the inferencing frequency.
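As a rough illustration, such a specification could look like the following Python sketch. All field names and values here are hypothetical, not an actual IO PaaS schema:

```python
# Hypothetical IO PaaS deployment specification for an anomaly detection
# function; every field name and value below is illustrative only.
anomaly_detection_spec = {
    "sites": ["edge-site-01", "edge-site-02"],      # where to deploy
    "binaries": {
        "train": "registry/anomaly-train:1.0",      # trains the model
        "infer": "registry/anomaly-infer:1.0",      # evaluates live data
    },
    "data": {"metrics": ["cpu_usage", "latency_ms"], "window_days": 7},
    "model": {"type": "isolation_forest"},
    "retraining": {"enabled": True, "frequency_hours": 24},
    "inference": {"frequency_seconds": 60},
}

def save_as_template(spec, overridable_keys):
    """Save a specification as a reusable SaaS template: keep the defaults
    and record which fields a later user is allowed to override."""
    return {"defaults": spec, "overridable": list(overridable_keys)}

# A SaaS user would then select this template and update only the
# overridable parameters (sites, data source, inference frequency).
template = save_as_template(anomaly_detection_spec,
                            ["sites", "data", "inference"])
```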
The Monitoring (MON) subsystem collects the raw data required to build AI/ML models and makes it available for use. It continuously collects statistics of the managed components and stores the data from each edge site, as well as the central site. The monitoring subsystem generally includes several components, such as data collection, data transformation, data storage, and data visualization. Monitoring data shall be stored close to the system under monitoring, e.g., the monitoring data for an edge site shall be stored locally in the edge site. Resource-constrained edge sites, however, may not have enough storage; in that case the monitoring data may be stored selectively, e.g., by retaining less data or by collecting and storing it less frequently. When it is necessary to retain data beyond the edge storage limitation, the data can be transferred and stored in the global data site.
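A minimal illustration of selective retention under storage constraints; the policy and function names are our own, not taken from any specific MON product:

```python
# Illustrative retention policy: when local storage is constrained, keep
# an evenly spaced subset of samples locally and ship the rest to the
# global data site.
def retain_locally(samples, capacity):
    """Keep at most `capacity` evenly spaced samples locally; return
    (local, overflow) where overflow is transferred to the global site."""
    if len(samples) <= capacity:
        return list(samples), []
    step = len(samples) / capacity
    keep_idx = {int(i * step) for i in range(capacity)}
    local = [s for i, s in enumerate(samples) if i in keep_idx]
    overflow = [s for i, s in enumerate(samples) if i not in keep_idx]
    return local, overflow

# 10 samples but room for only 4: retain a downsampled series locally.
local, overflow = retain_locally(list(range(10)), 4)
```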
The Data Management (DM) subsystem processes the data collected by the monitoring system into a format that can be used by the other subsystems. Data is the input of AI/ML models, and selecting the appropriate data is critical to ensuring model quality (accuracy, for example). Here, DM is required to process the data collected from the monitoring subsystem to identify the relevant data – in the appropriate format, dimension, and location – which allows the AI/ML model to perform training and inferencing with better performance. In collaborative learning cases such as centralized learning and cluster-based ensemble learning, the data may need to be combined for training a model. In other cases, such as an edge site that does not have enough resources to process and store the local data, data may need to be transferred to the global DM. Generally, data shall be processed and stored in the local DM as far as possible.
DM generally includes three main components – data acquisition, data preparation and data storage:
Data acquisition: responsible for finding and collecting the relevant data from the MON subsystem. It may also include data augmentation and synthesis.
Data preparation: includes processes such as data cleansing, transformation, normalization, validation, featurization, and labeling.
Data storage: once the data is processed and transformed into the appropriate shape for the AI/ML model, it is stored in a data storage component.
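A minimal sketch of the preparation step (cleansing, normalization and labeling), with entirely illustrative data and a deliberately naive min-max scheme:

```python
# Toy DM preparation pipeline: drop missing values (cleansing), min-max
# normalize, and label each sample 1 if it falls in a known fault window.
def prepare(raw, fault_times):
    """raw: list of (timestamp, value) pairs, value possibly None."""
    cleansed = [(t, v) for t, v in raw if v is not None]   # cleansing
    values = [v for _, v in cleansed]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0                                # avoid /0
    # normalization + labeling in one pass
    return [((v - lo) / span, 1 if t in fault_times else 0)
            for t, v in cleansed]

# One missing sample, one fault at t=2.
dataset = prepare([(0, 10.0), (1, None), (2, 30.0), (3, 20.0)],
                  fault_times={2})
```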
The Model Management (MM) subsystem builds and stores the AI/ML models required for intelligent management operations. In practice, building an AI/ML model is an iterative process, in which the models progress through multiple stages within the ML lifecycle: training, evaluation, deployment, and monitoring. As the data changes, each of these stages is revisited to ensure that the model performance maintains the expected levels. MM performs model training either at the edge sites or global sites, depending on the developer’s requirements. Specifically, it can be prohibitive to send all training data to the global site for centralized training for reasons such as increased traffic load, incurred latency and data privacy concerns. Therefore, solutions such as distributed, transfer and federated learning, or a combination of these techniques, should be supported by MM. MM deploys the models as requested by the developer. Once deployed, a model’s performance may be monitored and, when necessary (e.g., degraded performance due to concept drift), the model can be retrained with new training data. MM also categorizes and stores the trained models for future use.
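As a hedged sketch of one technique MM could support, the following shows federated-averaging-style model combination, where only model weights (never raw data) leave the edge sites. The site names and weight vectors are illustrative:

```python
# FedAvg-style combination: average the weight vectors trained at each
# edge site, weighting each site by its local sample count.
def federated_average(site_models):
    """site_models: {site: (weights, n_samples)} -> averaged weights."""
    total = sum(n for _, n in site_models.values())
    dim = len(next(iter(site_models.values()))[0])
    avg = [0.0] * dim
    for weights, n in site_models.values():
        for i, w in enumerate(weights):
            avg[i] += w * n / total
    return avg

# Two edge sites contribute; only their weights cross the network.
global_weights = federated_average({
    "edge-a": ([1.0, 2.0], 100),   # 100 local training samples
    "edge-b": ([3.0, 4.0], 300),   # 300 local training samples
})
```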
Deployment of the proposed architecture
The design of an edge cloud depends on who owns it and what it will be used for. Edge cloud for telecom services, for instance, will likely be owned and operated by telco providers and will mostly be used to run telco virtual network functions (VNFs).
Consider a simple edge cloud infrastructure for hosting containerized applications: this cloud could be realized by a container management system such as Kubernetes, where individual Kubernetes clusters are deployed at access and regional sites. Here, the regional sites could act as ‘global’ sites and be used to federate various edge site clusters.
The components of our architecture can either be deployed in parallel with the Kubernetes cluster control plane or as containerized applications (with proper access permissions) inside the Kubernetes clusters. In the latter case, a distributed execution framework such as Ray can be used to make our deployment easier. On top of this framework, we could realize the IO interface using some sort of service logic template (for example a yaml file with predefined fields for service descriptions), and the template can be translated to a set of commands that usually includes:
- Collecting data from the MON subsystems (e.g., Prometheus, Zabbix, DynaTrace) of edge sites where the applications are running.
- Preprocessing the data and labeling the application events using DM (e.g., Tableau).
- Training the defined models using MM (e.g., MLflow, Kubeflow, Ericsson MXE).
- Deploying the trained model on the specified edge sites using local model execution environment provisioned by the MM.
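A minimal sketch of how such a service logic template could be translated into the ordered command steps above; the field names and command strings are hypothetical:

```python
# Hypothetical translation of a template (e.g., parsed from a yaml file)
# into the MON -> DM -> MM command sequence described above.
def translate_template(template):
    steps = []
    for site in template["sites"]:                         # step 1: collect
        steps.append(("MON", f"collect {','.join(template['metrics'])} @ {site}"))
    steps.append(("DM", "preprocess and label events"))    # step 2: prepare
    steps.append(("MM", f"train {template['model']}"))     # step 3: train
    for site in template["sites"]:                         # step 4: deploy
        steps.append(("MM", f"deploy model @ {site}"))
    return steps

plan = translate_template({
    "sites": ["edge-a"], "metrics": ["cpu", "latency"], "model": "lstm",
})
```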
The flexibility of our architecture allows several common AI/ML scenarios. For instance, trained models can be stored in the global model management site and reused by any other edge site where an event with the same characteristics is expected to occur. In situations where delta training may be required to reuse the trained model, the IO component could implement a CI/CD-like functionality. In another scenario, multiple edge sites may need to train joint models so that the models can be shared and generalized across edge sites. In such a case, a user can explicitly define such logic via the IO interface. The SaaS-like interface enables developed service logic to be published and made available for reuse by other developers and among several edge sites.
Demonstrative use cases
Now, let us go through a selection of use cases to demonstrate the role of AI/ML in edge cloud management and how our architecture plays a crucial role. Figure 3 (below) demonstrates how these use cases relate to each other, with the arrows indicating possible information flow between them.
Use case 1: Fault detection
Using our proposed architecture, an effective solution for fault identification, based on the concept of semi-supervised learning, can be built. Such a solution can be trained to automatically identify a large class of faults, ranging from crashed software components to failed hardware components (for example microwave link failures) which can help avoid unnecessary site visits by human administrators. On a high level, the solution would:
- Use anomaly/novelty detection techniques to identify the symptoms of potential faults.
- Employ automated test scripts (or a human) to identify what fault the symptoms indicate.
- Exploit signature generation/clustering techniques to calculate a signature of the detected fault.
- Identify all subsequent problems on this and other edge sites with the corresponding signature automatically.
Our architecture provides the ideal platform to deploy such a solution. When a fault occurs, the anomaly/novelty detection solution will identify that something has gone wrong. The very first time this specific fault occurs, a human (or scripts that check system status) would need to identify and label the fault as a specific issue, and then compute its signature based on the detected anomaly. This new signature will be shared with all other sites so that the next time such an anomaly is detected, it will be automatically matched to the identified fault. The distributed execution environment (IO) and data management functionalities (DM) help in realizing this.
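The signature-matching step above could be sketched as follows; the hashing scheme and metric names are illustrative assumptions, not the actual implementation:

```python
# Toy fault signature: derive a stable fingerprint from the set of
# anomalous metrics, so a once-labeled fault matches automatically later,
# on this site or any other.
import hashlib

def fault_signature(anomalous_metrics):
    canonical = ",".join(sorted(anomalous_metrics))  # order-independent
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

known_faults = {}  # signature -> label assigned by a human or test script

# First occurrence: a human labels the fault once.
sig = fault_signature({"link_errors", "packet_loss"})
known_faults[sig] = "microwave link failure"

# Later, on any site: the same symptom set is matched automatically.
new_sig = fault_signature({"packet_loss", "link_errors"})
label = known_faults.get(new_sig, "unknown")
```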
Use case 2: Root cause analysis
An automated root cause analysis solution could be developed using components of our architecture based on the concept of unsupervised learning where ML models could identify contributing and root causes of detected faults. Specifically:
- Anomaly detection models could be trained, using the MM subsystem, to identify unexpected states from the various subsystems of the edge cloud, and
- IO allows a custom program to build a relationship model, using data collected by the MON component, to identify how events in the various subsystems could be related to each other.
Using these models, the root cause of a fault can be identified as follows: when a fault is detected, all anomalies that occurred before and after the detected fault are analyzed using the relationship model to identify how they relate. Using this information, possible paths leading to the detected fault can be identified from the detected anomaly events. Each candidate path is then assigned a score, and the path with the highest score is identified as the root cause of the detected fault.
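A toy sketch of the path-scoring step, assuming the relationship model is available as a weighted cause-effect graph; the scoring rule (a simple sum of edge strengths) and event names are illustrative:

```python
# Given cause->effect edges with relationship strengths, enumerate the
# paths ending at the detected fault and return the highest-scoring one;
# its first element is taken as the root cause.
def best_root_cause(edges, fault):
    """edges: {(cause, effect): strength}. Returns (path, score)."""
    graph = {}
    for (a, b), w in edges.items():
        graph.setdefault(b, []).append((a, w))  # index by effect

    def paths_to(node, score, path):
        yield path, score
        for parent, w in graph.get(node, []):
            if parent not in path:              # avoid cycles
                yield from paths_to(parent, score + w, [parent] + path)

    candidates = [(p, s) for p, s in paths_to(fault, 0.0, [fault])
                  if len(p) > 1]
    return max(candidates, key=lambda ps: ps[1])

path, score = best_root_cause(
    {("disk_full", "db_slow"): 0.9, ("db_slow", "app_timeout"): 0.8,
     ("net_jitter", "app_timeout"): 0.3},
    fault="app_timeout",
)
```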
One example could be identifying the contributing causes leading to performance bottlenecks and anomalous behavior of the managed infrastructure (consisting of hundreds of servers and thousands of metrics). Such a solution could point out the root issues that caused anomalies/faults in those metrics and the detected performance bottleneck.
Use case 3: Fault prediction
Given the right data, it is possible to predict the faults before they occur and cause disruption of services. For instance, using the test results from passive intermodulation (PIM) failures on frequency division duplex (FDD) and advanced antenna systems (AAS), it’s possible to find patterns that would lead to specific radio system failures. In such cases, time series prediction techniques are usually applied to predict faults that are characterized by a certain pattern.
One example is to train deep learning models (such as RNNs and LSTMs) at the MM component using labeled data containing both faults and non-faults (with the data collected from MON and then labeled at the DM component), so that the model can learn the trends leading up to the fault. Once the patterns/trends are identified, it is feasible to execute the model in MM to predict future occurrences of the fault.
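As a stand-in for the RNN/LSTM approach, the following toy sketch captures the idea of flagging a fault from a trend in a sliding window; the rule and threshold are illustrative and not a substitute for the deep learning models mentioned above:

```python
# Trivial trend-based fault predictor: slide a window over a metric
# (e.g., an error counter) and predict a fault when the average
# step-to-step increase exceeds a threshold learned from labeled data.
def predict_fault(series, window=3, slope_threshold=2.0):
    """Return True if the recent trend suggests the failure pattern."""
    recent = series[-window:]
    slope = sum(b - a for a, b in zip(recent, recent[1:])) / (window - 1)
    return slope > slope_threshold

# Steadily climbing error counts -> a fault is predicted before it occurs.
predicted = predict_fault([1, 1, 2, 5, 9])
```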
A fault prediction model trained for a specific edge site can also be reused at other edge sites; this is enabled by the MM features. Fault prediction usually works together with fault management (FM) systems to achieve closed-loop fault management. For example, an ‘unknown’ fault is first detected by a fault detection system, the root cause of the fault is analyzed by a root cause analysis system, and then relevant data are collected and labeled to train a fault prediction model. Finally, the fault prediction results can be used by a fault prevention system that takes actions to prevent the fault from occurring.
Use case 4: Predictive workload analysis
With predictive workload analysis, the performance management system of an edge cloud can proactively facilitate the efficient placement and scheduling of workloads over edge clouds. Using time series prediction techniques, workload data – collected from edge sites using the MON component – can be analyzed to extract patterns at the MM component. The extracted patterns can be used to build a solution that predicts the characteristics of upcoming workloads on edge cloud sites. An ML model for this purpose can be trained in the MM component by using historical data related to workloads to predict the amount of resources that will be required by an application in the near future. Once predicted, resources can be allocated ahead of time, which can reduce the delay.
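A minimal sketch of this idea, using a simple linear trend as a stand-in for a proper time-series model; the headroom factor and demand figures are illustrative:

```python
import math

# Fit a least-squares linear trend to historical demand and extrapolate
# one interval ahead (a deliberately simple stand-in for a real model).
def forecast_next(history):
    n = len(history)
    x_mean = (n - 1) / 2
    y_mean = sum(history) / n
    cov = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(history))
    var = sum((x - x_mean) ** 2 for x in range(n))
    slope = cov / var
    return y_mean + slope * (n - x_mean)   # value at the next time step

def provision(history, headroom=1.2):
    """Allocate the predicted demand ahead of time, plus 20% headroom."""
    return math.ceil(forecast_next(history) * headroom)

# Demand growing linearly (4, 5, 6, 7 pods): forecast 8, provision 10.
pods = provision([4, 5, 6, 7])
```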
In another scenario, when a fault is predicted, resources can be re-assigned accordingly to reduce its impact. Transfer learning techniques can also be used to transfer a model trained on edge sites with similar traffic patterns to other sites. Our proposed architecture can enable automated capacity planning by predicting resource gaps early, enabling smooth operations and end-user satisfaction. It can thereby allocate the predicted amount of resources for the coming time interval and ensure timely availability of infrastructure in the data centers.
Use case 5: Dynamic reconfiguration
Reconfiguration can be performed in either a reactive or proactive manner. The output of a predictive workload analysis model, as well as the output of a fault prediction/detection model, can be used as input for a proactive reconfiguration that takes actions based on predicted system behaviors. Considering the scale and the dynamics of the infrastructure and workload, there is a need to adopt automated techniques to decide when and how to configure resources for workloads in edge clouds. Using reinforcement learning (RL), it’s possible to continuously monitor the system and suggest adjustments for reconfiguration of the edge system at the MM component (i.e., readjusting the number of resources provisioned for workloads, changing the number of workload instances). By adopting our proposed architecture, the system can automatically learn and update its decisions by interacting with the dynamic environment.
For instance, considering the dynamic nature of the RAN and its traffic, our proposed architecture can identify overloaded cells in a cellular network and decide to transfer traffic from overloaded cells to neighbors with available resources. This allows it to adapt to workload fluctuations, which helps prevent an application from becoming unresponsive or violating its SLA. When the resource requirement of an application component increases, the RL agent proposes a reconfiguration recommendation, such as increasing the capacity of the resource by increasing the number of Kubernetes pods hosting the application components.
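As a heavily simplified stand-in for a full RL agent, the following one-step sketch learns, per load level, which scaling action maximizes a reward that penalizes both SLA risk and over-provisioning; the states, actions and reward values are all illustrative:

```python
# Toy action-value learning for reconfiguration: evaluate each scaling
# action under each load level and recommend the highest-value action.
# (A real agent would learn these values online by interacting with the
# environment, as in Q-learning.)
ACTIONS = {"scale_down": -1, "keep": 0, "scale_up": +1}
BASELINE_PODS = 2

def reward(load, pods):
    if load == "high" and pods < 3:
        return -10.0            # under-provisioned: SLA violation risk
    return -float(pods)         # otherwise, pay the resource cost

# Action values per (load, action), learned by trying each action.
q = {(load, a): reward(load, BASELINE_PODS + delta)
     for load in ("low", "high") for a, delta in ACTIONS.items()}

def recommend(load):
    """Recommend the reconfiguration action with the highest value."""
    return max(ACTIONS, key=lambda a: q[(load, a)])
```

Under high load the learned values favor adding a pod, while under low load they favor releasing one, matching the behavior described above.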
In future blog posts we will discuss the performance of the proposed solution and investigate more use cases related to configuration, accounting and security management.
Learn more about our research on network compute fabric
Deep dive into our white paper, Edge computing and deployment strategies for communication service providers.