Do Enterprises Need an Operating System (OS) for AI?

Why Do We Need an OS for AI?

Production AI deployments are gaining prominence and growing in adoption. New research suggests that the AI market will grow from $40.74 billion in 2020 to $390.9 billion by 2025. Yet most companies that have adopted AI are still in the development and experimentation phase of their AI journey.

What most of these companies will soon realize is that going from development and experimentation to production is a difficult and time-consuming process. You have to set up complex infrastructure that supports a diverse range of tools, workflows, global teams, and regulatory requirements (e.g., GDPR). Investing human capital, and accepting the risk of potential data breaches, is now a critical component of a calculated growth strategy for these projects.

How Is AI Deployed to Production Now?

The hard truth is that most large enterprises have yet to deploy AI in production in a meaningful way. Although this is starting to change and the adoption of AI is now trending, more work is needed to raise the engagement of C-level leadership in this space. The early adopters who had the foresight and risk appetite have deployed traditional machine learning workflows using older technologies and approaches. They are now exploring, or have already started, migrating their old and costly infrastructure, built on a mixture of tools like Spark and Hadoop, to more modern, cloud-native solutions.

“The growth of many new open-source tools centered around Kubernetes has spurred the need to string together these tools to form production AI pipelines with glue code.” – Rush Tehrani, CEO of Onepanel

So Who Is Actually Deploying AI Projects to Production Today? 

Tesla is a great example of a company deploying large-scale HPC infrastructure for its AI pipelines. Running some of the largest and most complex computer vision pipelines, Tesla has developed and patented its own platform, in effect its own “OS.”

“Importance of good tools & infrastructure is underrated.”   – Elon Musk

Cruise offers another example of a similar proprietary solution. In addition to traditional web applications, Cruise uses its platform to “manage 3D maps, navigation services, driving simulations, machine learning, data processing, test pipelines, security suites, and a lot more — both on-premises and in the cloud.”

These companies have spent years and many millions of dollars building these platforms, but what about the rest of us? What we need is a universal, open source, cloud-native OS for AI, so that other innovative companies can deploy their AI projects at a fraction of the time and cost it took Tesla and Cruise.

What Would an OS for AI Look Like?

Knowing that AI workloads differ from traditional software workloads, let’s look at some of the foundational principles that would guide the design of an OS for AI.

Composability — You should be able to combine and recombine resources in different ways to quickly create AI workflows and pipelines with new and existing tools, environments, data, and data sources.
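
As a minimal sketch of this idea (the step functions and names here are hypothetical, not from any particular platform), pipeline steps that share a uniform interface can be combined and recombined freely:

```python
# Hypothetical sketch: each pipeline step takes and returns a dict of artifacts,
# so steps can be freely combined and recombined into new pipelines.
from typing import Callable, Dict, List

Artifacts = Dict[str, object]
Step = Callable[[Artifacts], Artifacts]

def run_pipeline(steps: List[Step], artifacts: Artifacts) -> Artifacts:
    """Run steps in order, passing the accumulated artifacts along."""
    for step in steps:
        artifacts = step(artifacts)
    return artifacts

def label_data(a: Artifacts) -> Artifacts:
    a["labels"] = f"labels for {a['dataset']}"  # placeholder for a labeling tool
    return a

def train_model(a: Artifacts) -> Artifacts:
    a["model"] = f"model trained on {a['dataset']}"  # placeholder for a trainer
    return a

# The same steps can be recombined with new ones to form a different pipeline.
result = run_pipeline([label_data, train_model], {"dataset": "images-v1"})
print(result["model"])
```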

Portability — Portability applies to both infrastructure and workloads. In order for this to work, the underlying hardware must be abstracted by a common abstraction layer that is built on open standards. With this abstraction in place, you can automate your infrastructure on any cloud provider, on-premises or in a hybrid cloud, and move your workflows across any of them as needed.
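
A minimal sketch of such an abstraction layer, assuming artifacts are simply saved and loaded as bytes (the class and method names are our own, for illustration only):

```python
# Hypothetical sketch of a thin abstraction layer: workloads talk to an
# ArtifactStore interface, so the same code runs on-premises or in any cloud.
from abc import ABC, abstractmethod
from pathlib import Path

class ArtifactStore(ABC):
    @abstractmethod
    def save(self, name: str, data: bytes) -> None: ...

    @abstractmethod
    def load(self, name: str) -> bytes: ...

class LocalStore(ArtifactStore):
    """On-premises implementation backed by the local filesystem."""
    def __init__(self, root: str) -> None:
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def save(self, name: str, data: bytes) -> None:
        (self.root / name).write_bytes(data)

    def load(self, name: str) -> bytes:
        return (self.root / name).read_bytes()

# A cloud implementation (e.g., backed by S3) would expose the same interface,
# so moving a workload between environments is a config change, not a code change.
store: ArtifactStore = LocalStore("/tmp/artifacts")
store.save("model.bin", b"weights")
```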

Scalability — Any task in the AI pipeline should be able to scale as needed. This is especially true for model training where the training workload needs to run on one or many GPU instances. It is also the case with inference workloads that need to scale on-demand depending on traffic.
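
As a small illustration, assuming PyTorch is available, a training workload can adapt to however many GPUs the scheduler hands it (large multi-node jobs would typically use DistributedDataParallel instead):

```python
# Minimal sketch: scale a forward pass across however many GPUs are present.
# torch.nn.DataParallel is the simplest multi-GPU wrapper for a single node.
import torch
import torch.nn as nn

model = nn.Linear(128, 10)

if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)  # replicate across all local GPUs
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

batch = torch.randn(64, 128, device=device)
output = model(batch)  # with DataParallel, the batch is split per GPU
print(output.shape)
```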

Reproducibility — Modern robust AI pipelines require transparency around the relationships between code, environments, datasets, models, hyperparameters, and performance metrics. It is essential to innovation that experiments, workflows, and even the infrastructure are reproducible so that they can be iteratively improved.
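
A minimal sketch of what capturing these relationships might look like, assuming a Git repository and a single dataset file (the paths, fields, and values are illustrative):

```python
# Hypothetical sketch: record everything that produced a model in one manifest,
# so the experiment can be reproduced later.
import hashlib, json, subprocess
from datetime import datetime, timezone

def dataset_fingerprint(path: str) -> str:
    """Content hash of a dataset file; ties the run to the exact data used."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

manifest = {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "git_commit": subprocess.check_output(
        ["git", "rev-parse", "HEAD"], text=True).strip(),
    "dataset_sha256": dataset_fingerprint("data/train.csv"),  # hypothetical path
    "hyperparameters": {"lr": 1e-3, "batch_size": 64},
    "metrics": {"val_accuracy": 0.91},  # illustrative value
}
with open("run_manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)
```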

Now that we’ve outlined the principles, let’s look at some of the challenges that you can face when deploying AI projects in the real world:

Distributed or parallel computation — AI projects sometimes require models to be trained on multiple machines at the same time, or many models to be trained in parallel. For example, you may want to train models in parallel on the same dataset with different hyperparameters to see which provides the best results, as in the sketch below. Neural architecture search (NAS) is a typical use case for distributed processing, since many candidate architectures must be trained and evaluated at once.
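
Here is a minimal sketch of such a parallel hyperparameter sweep in Python; train_and_score is a hypothetical stand-in for a real training job:

```python
# Minimal sketch: train the same model on the same dataset with different
# hyperparameters in parallel, then keep the best result.
from concurrent.futures import ProcessPoolExecutor

def train_and_score(lr: float) -> tuple[float, float]:
    """Placeholder training job: returns (lr, validation score)."""
    score = 1.0 - lr  # stand-in for a real validation metric
    return lr, score

if __name__ == "__main__":
    learning_rates = [0.1, 0.01, 0.001]
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(train_and_score, learning_rates))
    best_lr, best_score = max(results, key=lambda r: r[1])
    print(f"best lr={best_lr} (score={best_score:.3f})")
```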

On-demand scaling — AI workloads require expensive hardware, including GPUs and/or TPUs. These workloads need to be scheduled and scaled up on demand, then scaled down when the compute task (e.g., training) is complete. This reduces the cost of projects running in a public cloud and is also essential for resource sharing across an organization. The same applies to inference, where resources need to scale up or down on demand.
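
On Kubernetes, scaling a workload up or down can be a single API call. Here is a hedged sketch using the official Kubernetes Python client; the deployment name and namespace are hypothetical:

```python
# Sketch: scale an inference Deployment up before a traffic spike and back
# down afterwards, using the official Kubernetes Python client.
from kubernetes import client, config

def scale_deployment(name: str, namespace: str, replicas: int) -> None:
    apps = client.AppsV1Api()
    apps.patch_namespaced_deployment_scale(
        name=name,
        namespace=namespace,
        body={"spec": {"replicas": replicas}},
    )

config.load_kube_config()                     # uses the local kubeconfig
scale_deployment("model-inference", "ai", 8)  # scale up for peak traffic
# ... later, when traffic subsides or the job completes:
scale_deployment("model-inference", "ai", 1)  # scale back down to save cost
```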

Multi-disciplinary teams and tools — AI projects are unique in the sense that they require collaboration amongst multi-disciplinary teams, with each team needing their own tools to get the job done.

Model deployment CI/CD — Similar to traditional software, AI models and applications should be deployed, tested, and monitored continuously. This requires implementing CI/CD features that are similar to their traditional software counterparts, but with unique challenges around testing and monitoring.
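
As an illustration, a deployment pipeline might gate each release behind a smoke test like the following sketch; the endpoint URL, payload, and response schema are all assumptions:

```python
# Hypothetical CI/CD smoke test for a deployed model endpoint: after each
# deployment, send a known input and check the response before routing real
# traffic to the new version.
import requests

def smoke_test(endpoint: str) -> None:
    payload = {"inputs": [[0.1, 0.2, 0.3]]}   # a fixed, known-good example
    resp = requests.post(endpoint, json=payload, timeout=10)
    resp.raise_for_status()                   # the deployment must answer 2xx
    prediction = resp.json()["prediction"]    # assumed response schema
    assert 0.0 <= prediction <= 1.0, f"prediction out of range: {prediction}"

if __name__ == "__main__":
    smoke_test("http://model-inference.ai.svc/predict")  # hypothetical URL
    print("smoke test passed")
```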

Reproducibility and version control — This is a key challenge with AI projects: unlike traditional software projects, it’s not just about code. Everything that results in a model needs to be version controlled and reproducible, including code, datasets, environments, and hyperparameters.
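
A companion sketch to the manifest example above: before re-running an experiment, restore the recorded code version and verify the dataset is unchanged (paths and manifest keys follow the earlier, illustrative example):

```python
# Sketch: restore a recorded run before reproducing it. Refuses to proceed
# if the dataset has drifted from the recorded fingerprint.
import hashlib, json, subprocess

with open("run_manifest.json") as f:
    manifest = json.load(f)

# Check out the exact code that produced the model.
subprocess.check_call(["git", "checkout", manifest["git_commit"]])

# Verify the dataset is byte-identical to the recorded version.
h = hashlib.sha256()
with open("data/train.csv", "rb") as f:  # hypothetical path from the manifest
    for chunk in iter(lambda: f.read(1 << 20), b""):
        h.update(chunk)
if h.hexdigest() != manifest["dataset_sha256"]:
    raise RuntimeError("dataset does not match the recorded fingerprint")

print("environment restored; re-run training with", manifest["hyperparameters"])
```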

Regulatory policies — Regulations like HIPAA and GDPR require a platform that has fine-grained role-based access control (RBAC) and other policy enforcement. Ideally, in addition to guidelines, the implementation of these policies should be automated as much as possible.
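
As a small illustration of fine-grained RBAC, assuming the official Kubernetes Python client, this sketch creates a namespaced read-only Role for a labeling team (the names are hypothetical, and real HIPAA/GDPR controls involve much more than RBAC):

```python
# Sketch: a fine-grained, namespaced Role that lets a labeling team read pods
# in its own namespace and nothing else.
from kubernetes import client, config

config.load_kube_config()
rbac = client.RbacAuthorizationV1Api()

role = client.V1Role(
    metadata=client.V1ObjectMeta(name="labeler-readonly", namespace="labeling"),
    rules=[
        client.V1PolicyRule(
            api_groups=[""],                 # core API group
            resources=["pods", "pods/log"],
            verbs=["get", "list", "watch"],  # read-only access
        )
    ],
)
rbac.create_namespaced_role(namespace="labeling", body=role)
```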

Now that we have outlined the principles of this OS and the unique challenges it needs to address, how do we go about implementing it?

Onepanel CE: The distributed, cloud-native operating system for AI

Kubernetes has become the cloud-native operating system of choice for traditional software workloads. We can’t use Kubernetes as is, since AI workloads present different challenges than traditional software workloads. However, Kubernetes provides some of the foundational principles we discussed earlier, so it can serve as the kernel of a robust OS for AI. That is exactly how we have designed Onepanel.

Onepanel is the first open source operating system for building and deploying AI pipelines at scale. Out of the box, Onepanel offers solutions for data labeling as well as support for pipelines for model training, data transformation, and model deployment. Onepanel’s plug-and-play architecture allows multi-disciplinary teams to plug in an array of development, analysis, simulation, and deployment tools at any point in their pipelines to create new pipelines.

A Computer Vision Project: Illustrating a Use Case on Onepanel

In a computer vision project, these are generally the tools and teams involved:

  • Labeling tools used by a team of labelers
  • Deep learning frameworks, simulation tools, and notebooks used by data scientists
  • Extract, transform, load (ETL) tools used by data engineers
  • Model and dashboard deployments, performed by ML and software engineers and used by internal and external customers

And here is what a semi-automated computer vision pipeline looks like:

An example of a computer vision pipeline with auto annotation using Onepanel CE

Onepanel provides an end-to-end solution for this pipeline while allowing these teams to plug in and experiment with their own specialized tools at any point in the pipeline. The entire pipeline is version controlled and reproducible, so teams don’t have to worry about breaking changes. These teams can now manage their own tools and pipelines, relying less on DevOps and IT teams. The open source project’s documentation, along with a live demo of a sample computer vision pipeline, is available online.
