Building Compound AI Systems

September 5, 2024

AI systems that rely on single monolithic models are limited by their design. These models are built to handle a wide range of general tasks but often struggle to adapt to specific contexts. Generative AI models are fundamentally probabilistic in nature, which can lead to hallucinations. Additionally, large language models require significant computational power and memory, making them resource-intensive. Most companies today are optimizing for "intelligence per dollar." All of this has led to the development of complex AI systems with multiple components.

  • For example, ChatGPT Plus can use external tools like web browsing or code execution to improve its responses. It decides when it needs extra information beyond its training, such as accessing up-to-date data from the web or running a Python script.
  • Another example is Retrieval-Augmented Generation (RAG), which can include various components. These may consist of a retriever, embedding model, reranker, large language model (LLM), prompt construction module, post-processing module for filtering and answer verification, and a caching system for frequently used documents.
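A minimal sketch of how those RAG pieces fit together, in plain Python. Everything here is a toy stand-in: the bag-of-words "embedding" replaces a real embedding model, and the function and corpus names are made up for illustration.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": a bag-of-words count vector. A real system would
    # call a trained embedding model here.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    # Retriever: rank documents by similarity to the query, keep top-k.
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_prompt(query: str, context: list[str]) -> str:
    # Prompt construction module: pack the retrieved context around the
    # question before handing it to the LLM.
    return "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}"

docs = [
    "Paris is the capital of France.",
    "The Eiffel Tower is in Paris.",
    "Python is a programming language.",
]
prompt = build_prompt("What is the capital of France?",
                      retrieve("capital of France", docs))
```

A production pipeline would wrap the reranker, post-processing, and caching components listed above around this same skeleton.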

In the tweet below, Matei Zaharia, co-founder and CTO at Databricks and Professor at UC Berkeley, highlights an important point: the shift to ‘thinking in systems’. The example he gives, CoT@32 (chain-of-thought prompting with 32 sampled answers) versus 5-shot prompting, illustrates that systems built around the same base model can behave very differently depending on how the model is used. The point he stresses is that understanding and benchmarking AI performance requires looking at the broader system and its components, not just the model in isolation.

Read the tweet here

What Are Compound AI Systems?

Compound AI systems are AI architectures in which multiple models and components work together to perform tasks that a single model cannot handle efficiently. Berkeley AI Research (BAIR) coined the term in a blog post that highlights the shift from single models to compound systems.

This new paradigm leverages the strengths of various AI models, tools, and processing steps to enhance performance, versatility, and reusability.

Compound AI systems span a continuum of complexity:

  1. Simple Prompt Engineering: Optimizes a single model's performance by refining inputs, useful for straightforward tasks.
  2. Retrieval-Augmented Generation (RAG): Combines LLMs with external knowledge sources, enhancing responses by dynamically retrieving relevant information.
  3. Agents: Advanced systems where multiple models work together, autonomously solving complex, multi-step problems through collaboration and task specialization.
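The jump from prompting to agents can be made concrete with a heavily simplified loop. This is a hypothetical sketch: a real agent would ask an LLM to choose tools and iterate over multiple steps, while here a trivial rule-based planner routes a task to one of two mock tools.

```python
# Mock tools; real agents would wrap web search APIs, code sandboxes, etc.
def web_search(query: str) -> str:
    return f"results for {query!r}"

def calculator(expr: str) -> str:
    # Evaluate arithmetic with builtins disabled (still not sandbox-grade).
    return str(eval(expr, {"__builtins__": {}}))

TOOLS = {"search": web_search, "calc": calculator}

def plan(task: str) -> list[tuple[str, str]]:
    # Stand-in planner: route arithmetic to the calculator, else search.
    # A real agent replaces this with an LLM call that chooses tools.
    if any(ch.isdigit() for ch in task):
        return [("calc", task)]
    return [("search", task)]

def run_agent(task: str) -> str:
    observations = [TOOLS[name](arg) for name, arg in plan(task)]
    return "; ".join(observations)
```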

Convergence of AI Agents & Systems Engineering 

In systems engineering, the focus is on designing and managing large, interconnected systems that meet specific requirements and perform reliably under a variety of conditions. AI agents, particularly within compound AI systems, take this idea a step further by incorporating autonomous, intelligent decision-making into these components.

AI agents share key similarities with traditional software systems in their modular design, task automation, external interactions, and decision logic. Both rely on modular components that perform specific tasks, with traditional systems using functions or services, while AI agents deploy specialized models or sub-agents. 

Key Components of a Compound AI System

A typical Compound AI system can vary depending on the use case, but some common, repeatable components include:

  1. Large Language Models (LLMs): Generate and verify natural language responses based on user inputs and context.
  2. Retrievers: Fetch relevant information from databases or external sources to inform the system's responses.
  3. Databases/ VectorDBs: Store structured and unstructured data for easy querying and retrieval by the AI system.
  4. External Tools: Access APIs and services to perform specific functions, such as web browsing or executing code.
  5. Embedding Models: Convert data into vector representations for efficient similarity searches and retrieval tasks.
  6. Rerankers: Evaluate and prioritize retrieved results to ensure the most relevant information is presented.
  7. Prompt Construction Modules: Formulate effective prompts to optimize input for large language models.
  8. Post-processing Modules: Filter and verify generated outputs to ensure quality and coherence before delivery.
  9. Caching Systems: Store frequently accessed responses to improve efficiency and reduce retrieval latency.
  10. Task and Data Planners: Manage task orchestration and data flow to optimize component interactions and resource use.
  11. Evaluation Modules: Assess the system's performance and output quality to guide improvements and fine-tuning.
  12. Monitoring and Feedback Systems: Continuously track performance and gather user feedback for ongoing adaptation and enhancement.
  13. Fine-Tuning Modules: Adapt pre-trained models to specific tasks or domains by training them on targeted datasets.
  14. Agent Frameworks: Provide a structure for building agents that can autonomously perform tasks and make decisions.
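Two of the components above, the caching system (9) and the reranker (6), can be sketched with only the standard library. The corpus and the overlap-based scorer are illustrative placeholders, not any real retriever's API.

```python
import functools

CORPUS = ["gpu pricing guide", "spot instance basics", "gpu autoscaling tips"]
backend_calls = {"n": 0}  # counts real retrieval calls, to show cache hits

def search(query: str) -> list[str]:
    backend_calls["n"] += 1
    return [doc for doc in CORPUS if any(t in doc for t in query.split())]

@functools.lru_cache(maxsize=256)
def cached_retrieve(query: str) -> tuple[str, ...]:
    # Caching system: repeated queries are served from memory instead of
    # hitting the retriever again.
    return tuple(search(query))

def rerank(query: str, results, top_n: int = 3) -> list[str]:
    # Reranker: re-score retrieved results with a finer-grained signal
    # (naive term overlap here) and keep only the best.
    terms = set(query.lower().split())
    overlap = lambda r: len(terms & set(r.lower().split()))
    return sorted(results, key=overlap, reverse=True)[:top_n]
```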

Example: an advanced RAG pipeline used by Elastic.

Why use compound AI systems?

The Berkeley blog lays out very well why compound systems are important:

  1. Better improvement via system design than training - Some tasks are better improved through system design rather than just adding more model resources. Large language models (LLMs) benefit from more computing power, but the cost often outweighs the gains.
  2. More flexibility to create dynamic systems - Since machine learning models learn from static data sets, their knowledge is fixed. Developers can enhance these models by integrating them with other components like search functions to pull in timely data.
  3. Improve control and trust - Training influences neural networks but doesn't ensure they avoid certain behaviors. Building an AI system that filters outputs can provide tighter control. For example, combining LLMs with fact-checking tools can make them more trustworthy by adding citations or verifying data.
  4. Balancing cost and quality - Developers need to design systems that use budgets effectively. Making trade-offs between cost and quality/precision is often required based on the use case.
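Point 3 above can be illustrated with a minimal output guardrail. The substring check is a deliberately crude stand-in for a real fact-checking tool, and `filter_answer` is a hypothetical name used only for this sketch.

```python
def filter_answer(sentences: list[str], sources: list[str]) -> list[str]:
    # Post-processing guardrail: pass a sentence through only if some
    # retrieved source supports it; otherwise flag it instead of asserting
    # it as fact. Real systems use NLI models or citation checkers here.
    def supported(sentence: str) -> bool:
        return any(sentence.lower() in src.lower() for src in sources)
    return [s if supported(s) else f"[unverified] {s}" for s in sentences]
```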

Challenges of Compound AI Systems

Compound AI systems, however, pose multiple challenges in building, optimizing, and deploying them.

Building 

The complexity of compound AI systems stems from the need to integrate various components, such as AI models, data retrieval mechanisms, and external tools. Each of these components comes with multiple configuration options, creating a vast design space that must be carefully navigated. This complexity requires thoughtful consideration when selecting and combining components.

Building a compound AI system involves managing multiple models and processing steps that must work in harmony. 

The complexity increases when different hardware configurations, such as switching between GPUs and CPUs, are required, even for quick prototyping and testing. This flexibility in hardware management adds another layer of difficulty, as it requires seamless transitions between resources. Without the right tools, constructing such systems can demand significant engineering effort.

 Additionally, robust metrics and logging systems are critical for debugging and performance optimization. Key challenges include:

  • Adapting hardware resources (GPUs, CPUs) and scaling for different processing tasks
  • Seamless integration of open-source or proprietary models into modular workflows
  • Orchestrating interactions between diverse components
  • Maintaining built-in observability for tracking performance metrics across the entire workflow
  • Ensuring efficient testing and debugging capabilities
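One lightweight way to get that built-in observability is to wrap every component in a timing decorator, so each stage reports latency uniformly regardless of what it does internally. A minimal sketch, assuming made-up component names; a production system would also export counters, error rates, and traces.

```python
import functools
import time
from collections import defaultdict

METRICS: dict[str, list[float]] = defaultdict(list)  # stage -> latencies (s)

def observed(stage: str):
    # Decorator recording wall-clock latency for each call to a component.
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                METRICS[stage].append(time.perf_counter() - start)
        return inner
    return wrap

@observed("retriever")
def retrieve_step(query: str) -> list[str]:
    return [query]  # placeholder for real retrieval work
```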

Optimizing 

Co-optimizing System Components - Optimization in compound AI systems goes beyond the individual performance of models—it extends to managing the interplay between multiple models and additional processing steps.

Properly balancing latency, throughput, and resource utilization is essential to avoid bottlenecks. Bottlenecks can easily arise if one component is over-optimized at the expense of others. For example, deploying an extremely fast retrieval system may not yield the expected performance gains if the downstream language model is not equipped to handle the increased input rate. Developers must analyze the system holistically to identify and address such imbalances.
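That bottleneck argument is just a minimum over stage throughputs: a serial pipeline cannot serve more requests per second than its slowest stage, so speeding up an already-fast component buys nothing end to end. The rates below are purely illustrative.

```python
def pipeline_throughput(stage_rates: dict[str, float]) -> tuple[str, float]:
    # End-to-end throughput of a serial pipeline is capped by the slowest
    # stage; return that stage and its rate (requests/second).
    bottleneck = min(stage_rates, key=stage_rates.get)
    return bottleneck, stage_rates[bottleneck]

rates = {"retriever": 500.0, "reranker": 120.0, "llm": 20.0}
# Doubling the retriever to 1000 req/s leaves the pipeline at 20 req/s:
# the LLM stage is still the limit.
```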

The interdependence adds complexity to the optimization process, necessitating meticulous tuning to ensure that all components work together seamlessly. For example, one language model may excel when paired with a specific retrieval system, while another model might not achieve the same level of performance with that system. As a result, careful adjustments are essential to harmonize the interactions among all components effectively.

Cost optimization is a significant challenge when building and maintaining compound AI systems. Balancing performance with budget constraints while managing system complexity is a tough act. It is crucial to establish infrastructure that can detect resource inefficiencies and switch seamlessly between configurations, while implementing strategies like spot compute, fractional GPUs, and autoscaling to maintain cost-effectiveness without sacrificing performance.
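The spot-versus-on-demand trade-off reduces to simple expected-cost arithmetic. The rates and the 70% spot share below are illustrative assumptions, not actual cloud prices.

```python
def blended_cost(hours: float, on_demand_rate: float,
                 spot_rate: float, spot_share: float) -> float:
    # Expected bill when spot_share of the hours run on spot capacity and
    # the remainder falls back to on-demand instances.
    return hours * (spot_share * spot_rate + (1 - spot_share) * on_demand_rate)

# 100 GPU-hours at $3.00/h on-demand vs $0.90/h spot, 70% served by spot:
# 100 * (0.7 * 0.90 + 0.3 * 3.00) = $153, versus $300 all on-demand (~49% saved).
```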

Deploying

Each component of a compound AI system has specific requirements for hardware, software, and scalability. Building a system that meets these diverse needs can require a significant investment of engineering time.

Operational Complexity - Managing compound AI systems requires robust MLOps and DataOps practices, as handling multiple models and tools simultaneously increases the complexity of serving, monitoring, and securing these systems. Balancing and optimizing the performance of individual components while ensuring seamless integration demands extensive experimentation and tuning.

Scalability and Elasticity -  Ensuring compound AI systems scale efficiently requires implementing auto-scaling and load balancing techniques to maintain performance and control costs under varying workloads.

Integration with Existing Infrastructure - Integrating with legacy systems and data sources while maintaining flexibility for future additions presents a significant challenge.

Lack of Best Practices - The novelty of compound AI systems means there are few established best practices, leading developers to rely on trial and error, increasing development time and costs.

Security and Privacy- Protecting sensitive data across multiple components and adhering to governance policies is essential, particularly in regulated industries.

Explainability and Interpretability- Providing understandable, interpretable outputs from compound AI systems is difficult due to the complexity of multiple components contributing to decision-making.

How TrueFoundry Helps Build Compound AI Systems

TrueFoundry helps build compound AI systems by offering a robust framework that streamlines model deployment, scaling, and integration. Here's how it achieves that:

TrueFoundry's Architecture

Modularity 

Abstractions for AI Modules

TrueFoundry allows users to build and compose modular AI systems where each component (e.g., language models, image classifiers, recommendation engines, embedding models, vector DBs, etc.) can work independently or be integrated into a larger application.

Support for Multiple AI Paradigms

Truefoundry supports a range of AI models, from traditional machine learning and deep learning models to more complex architectures like RAG and Agent Frameworks.

TrueFoundry's platform is designed with a modular, API-driven architecture that enables seamless integration of various AI models, data sources, and processing components.

Composable Workflows

Developers can design, test, and deploy compound systems by combining different models or AI components that handle tasks such as reasoning, understanding, generation, or retrieval (e.g., Retrieval-Augmented Generation (RAG) workflows).

Seamless Integration with Existing Infrastructure

TrueFoundry integrates with existing data pipelines, workflows, cloud infrastructure (across AWS, Azure, GCP, and even on-prem), and development environments.

Infrastructure Management 

Infra on AutoPilot

TrueFoundry’s autopilot detects infrastructure inefficiencies and optimization opportunities and fixes them automatically, ensuring that resources are always utilized efficiently without manual intervention.

Cross-Cloud Deployment

TrueFoundry is a cloud-agnostic platform that allows users to deploy their applications seamlessly across multiple cloud providers and even on-prem. This flexibility ensures that organizations can leverage the best services and pricing available without being locked into a single vendor.

Kubernetes-Based Architecture

Built on Kubernetes, TrueFoundry abstracts away the complexity of container orchestration. This means that developers can focus on building and deploying their AI applications without needing to manage the underlying infrastructure intricacies, making it easier to deploy complex systems reliably.

Resource optimization

TrueFoundry enables users to run their applications on both GPUs and CPUs, optimizing resource usage based on the specific needs of different AI models. This capability is crucial for balancing performance and cost, especially when dealing with resource-intensive machine learning tasks.

Autoscaling 

The platform includes autoscaling features that automatically adjust the computational resources based on real-time demand. 

Scale to Zero

TrueFoundry supports a scale-to-zero feature, which allows applications to automatically scale down to zero when not in use. 

Cost Optimization 

TrueFoundry is designed to help organizations significantly reduce infrastructure costs, often achieving savings of 30-60% by leveraging advanced cost optimization techniques.

Bare Instances: TrueFoundry enables workloads to run on bare instances, providing the lowest compute cost by avoiding the 30% markup typically applied by services like SageMaker.

Spot Instances:  TrueFoundry allows teams to leverage discounted spot instances for non-critical tasks, with the option to seamlessly switch to on-demand instances as a fallback for uninterrupted performance.

Fractional GPUs: TrueFoundry provides fractional GPUs, allowing users to pay only for the GPU capacity they need, optimizing costs for smaller workloads.

Avoiding Costly Retraining Errors: With checkpointing and automated validation, TrueFoundry prevents unnecessary retraining, saving both compute resources and time.

Cost Monitoring and Budgeting: TrueFoundry offers cost monitoring and budgeting tools that allow teams to track real-time infrastructure expenses, set spending limits, and ensure that resources are being used efficiently to stay within budget.

Governance and compliance

Deploys in Your VPC

TrueFoundry runs entirely within your VPC, ensuring that no data leaves your cloud environment for maximum security.

Role-Based Access Control

It provides role-based access control for managing data, models, and compute, allowing fine-grained permissions.

Audit Logs

TrueFoundry maintains detailed audit logs, tracking all actions to ensure transparency and compliance.

Regulatory Compliance

The platform supports GDPR, HIPAA, and SOC2 compliance, ensuring adherence to industry security and privacy standards.

Developer experience

Easy-to-Use Interfaces: TrueFoundry offers intuitive UIs and APIs that simplify complex workflows. Developers can quickly deploy models, manage infrastructure, and monitor performance without needing deep expertise in underlying systems like Kubernetes.

Inbuilt best software practices

TrueFoundry embeds best software practices like CI/CD, version control, and automated testing into the platform.

Real-Time Monitoring and Observability

Developers have access to tools, including logs, metrics, and dashboards, providing insights into model performance, infrastructure health, and potential bottlenecks

Modules for Building Compound AI Systems

TrueFoundry offers several pre-built modules to simplify and accelerate the development of compound AI systems:

  • Model as a Service: Simplifies the deployment of AI models, allowing developers to focus on building compound AI systems rather than worrying about infrastructure scalability or reliability.
  • No-Code Model Fine-Tuning: Allows users to fine-tune pre-trained models with minimal effort, making it easier to customize models without extensive coding knowledge.
  • LLM Templates for Agents & RAG Framework: Provides inbuilt templates and frameworks to kickstart projects, especially for Retrieval-Augmented Generation (RAG) systems and AI agents. These are essential components for creating compound AI systems involving multiple models or task-specific agents.
  • AI Gateway: Centralizes prompt management, key management, and provides a unified API for interacting with models, enabling better control and security, especially across distributed teams. The gateway serves as the hub for managing and orchestrating multiple AI components, crucial for compound systems.
Read More About LLM Gateways

TrueFoundry’s Core Abstractions

TrueFoundry simplifies the complexities of building AI systems by providing powerful abstractions:

  • Services: Enables the seamless deployment of AI models as scalable services, managing inference tasks with minimal infrastructure concerns. This abstraction simplifies operational aspects like auto-scaling and health monitoring.
  • Jobs: Facilitates the scheduling of tasks for batch processing, training, or automated workflows. These jobs can be executed on-demand or at specified intervals, offering flexibility for complex workflows.
  • Workflows: Helps connect multiple tasks into a cohesive AI pipeline. By building workflows, users can automate processes and link different models, tasks, or services into compound AI systems.
  • Open-source Helm Charts: Streamlines the packaging and deployment of AI workloads onto Kubernetes clusters, offering ease of use with industry-standard Helm charts.

By combining these features, TrueFoundry enables the development of compound AI systems that integrate multiple models and tasks into cohesive, scalable, and cost-efficient solutions.

Learn more about building Compound AI Systems on TrueFoundry.
