True ML Talks #12 - Cofounder @ Llama-Index

We are back with another episode of True ML Talks. In this episode, we dive deep into LlamaIndex with Jerry Liu.

Jerry Liu is the creator and co-founder of LlamaIndex. He brings his expertise in ML research and engineering from esteemed companies like Uber, Quora, and Robust Intelligence. With a strong focus on generative models and a passion for advancing AI technologies, Jerry has pioneered the development of LlamaIndex, an open-source tool that seamlessly connects language models with private data sources.

📌

Our conversation with Jerry covers the following topics:
- The Genesis of LlamaIndex
- LlamaIndex's Versatile Features
- Anthropic 100k Window Model
- Challenges in Response Synthesis Models
- Comparing Retrieval and Fine Tuning Approaches

Watch the full episode below:

The Genesis of LlamaIndex: Building Stateful Systems for Language Models

Jerry Liu's diverse background in machine learning and AI, including experience at Uber and Quora, prepared him for his work on LlamaIndex. His fascination with generative models, sparked by discovering GANs, drew him into the realm of large language models (LLMs).

Realizing that LLMs like GPT-3 are inherently stateless, Jerry sought to integrate external data into these models to provide them with context. Inspired by computer architecture, he conceived of LlamaIndex as an overall system in which the LLM is complemented by additional memory and storage modules. This allowed the LLM to store and traverse external data using a tree-based structure called GPT Index, enabling reasoning over the data within the tree.

Jerry's initial design project resonated with others facing similar challenges, leading him to recognize the potential for a practical solution. LlamaIndex evolved into a comprehensive toolkit, empowering users to leverage their structured and unstructured data in language model applications.

This pivot enabled LlamaIndex to facilitate data retrieval mechanisms and offer intuitive ways to augment LLMs with state. By bridging the gap between language models and private data, LlamaIndex opened up new possibilities for practical applications in working with unstructured and structured data.

LlamaIndex transformed from an idea into a powerful toolkit, empowering users to overcome the challenges of integrating external data into language models. It streamlined the process of leveraging personalized data and revolutionized language model applications.

Unlocking User Empowerment: The LlamaIndex Advantage

LlamaIndex has gained popularity as a versatile tool, appreciated by users for its various features. Three key features that users love about LlamaIndex are:

- Data Ingestion and Loaders: LlamaIndex simplifies the process of loading data from different sources into the tool. One notable feature is Llama Hub, a community-driven site offering a wide range of data loaders. These loaders enable users to easily import unstructured text from various file formats such as PDFs, PowerPoints, Excel sheets, and data from platforms like Salesforce, Notion, and Slack. By leveraging contributions from the community, LlamaIndex empowers users to harness the capabilities of text parsing and document parsing technologies, enhancing the flexibility and accessibility of the tool.
- Easy Getting Started: Users appreciate the straightforward nature of LlamaIndex's API. With just a few lines of code, users can load, index, and query data, unlocking the tool's value quickly. This simplicity appeals to both technically proficient users and those with limited technical experience. The ability to effortlessly interact with their data and access powerful features empowers users to derive valuable insights without significant technical expertise.
- Advanced Retrieval Capabilities: LlamaIndex offers advanced retrieval functionality, catering to users who require sophisticated features for specific use cases. These capabilities allow users to ask complex questions, compare documents, perform multi-step reasoning, and route to different data sources. Users seeking more advanced retrieval capabilities appreciate LlamaIndex's ability to handle diverse scenarios and support their complex information retrieval needs.

With a combination of user-friendly features, comprehensive data ingestion options, ease of use, and advanced retrieval capabilities, LlamaIndex has garnered a loyal user base. The tool continues to evolve, enabling users to leverage their data effectively and extract meaningful insights from their unstructured and structured data sources.

Deep Dive into the Anthropic 100k Window Model: Insights and Considerations

The Anthropic 100k window model has sparked excitement and revealed fascinating insights. This extensive context window complements existing approaches like LlamaIndex, expanding language modeling possibilities with its ability to process up to 100,000 tokens.

In an experiment with Uber's lengthy SEC 10-K filing, the document exceeded even the 100k-token limit. The experiment nonetheless highlighted the model's advantage: a vast amount of information can be included without complex retrieval methods or selective prompts. Dumping the entire document into the prompt yielded intriguing outcomes.

The 100k-token API showed impressive speed compared to querying GPT-3 in smaller chunks. The underlying algorithm behind these speedups remains undisclosed, fueling speculation and curiosity.

The larger context window enables the language model to understand the data holistically, synthesizing relationships between distant portions of the text reasonably well. The model still occasionally struggles with complex instructions and becomes confused, an area where GPT-4 shows improvement, and further tuning is needed to address this.

While the benefits of the 100k window model are evident, practical considerations arise. Filling the window with certain question types can be computationally costly, leading to increased query expenses. Evaluating economic feasibility becomes crucial, with each query costing approximately $1 to $2, depending on the use case.
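
The cost reasoning above can be made concrete with a small calculation. This is a minimal sketch that assumes simple per-token pricing; the function name and the two rates are hypothetical placeholders chosen for illustration, not quoted prices from any provider.

```python
def query_cost(
    prompt_tokens: int,
    completion_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Dollar cost of one request under simple per-token pricing.

    Prices are expressed in dollars per one million tokens.
    """
    total = (
        prompt_tokens * input_price_per_mtok
        + completion_tokens * output_price_per_mtok
    )
    return total / 1_000_000


# Placeholder rates (assumptions for illustration only, not real prices).
ASSUMED_INPUT_PRICE = 10.0   # dollars per million prompt tokens
ASSUMED_OUTPUT_PRICE = 30.0  # dollars per million completion tokens

if __name__ == "__main__":
    # A prompt that fills the whole 100k window versus a small retrieved context.
    full_window = query_cost(100_000, 500, ASSUMED_INPUT_PRICE, ASSUMED_OUTPUT_PRICE)
    small_prompt = query_cost(4_000, 500, ASSUMED_INPUT_PRICE, ASSUMED_OUTPUT_PRICE)
    print(f"full window: ${full_window:.3f}, small prompt: ${small_prompt:.3f}")
```

Under these assumed rates, the prompt tokens dominate the bill, which is why filling the window on every query adds up quickly compared to sending only a few retrieved chunks.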

Despite the limitations and cost implications, researchers and developers prioritize ongoing exploration of the Anthropic 100k window model. Valuable insights gained from these experiments will drive future advancements in the field.

Tackling Challenges in Response Synthesis Models

Response synthesis remains a critical concern even with Claude's large context window: it addresses the case where the context exceeds the prompt limit. It involves strategies for generating accurate and comprehensive responses when not everything fits into a single prompt. Two such strategies are Create and Refine and Tree Summarization.

Create and Refine

Create and Refine involves breaking down the context into manageable chunks. For example, Uber's SEC document would be split into two 90,000-token chunks. The first chunk is fed into the input prompt, along with the question, to obtain an initial response. This response is then improved through a refine prompt that incorporates the existing answer, additional context, and the question. The process repeats until an answer has been synthesized over all the chunks.

While Create and Refine is effective, the refine prompt tends to confuse the model. Its complexity, with multiple components to consider, hinders the model's reasoning abilities.
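
The loop described above can be sketched in a few lines. This is an illustrative sketch, not LlamaIndex's implementation: the `llm` callable, the prompt templates, and the word-based `chunk_text` helper (a stand-in for token-based splitting) are all assumptions made for the example.

```python
from typing import Callable, List

# Hypothetical interface: any function that maps a prompt string to a completion.
LLM = Callable[[str], str]

# Assumed prompt wording, for illustration only.
QA_PROMPT = "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
REFINE_PROMPT = (
    "Question: {question}\n"
    "Existing answer: {existing_answer}\n"
    "New context:\n{context}\n\n"
    "Improve the existing answer using the new context if it helps; "
    "otherwise repeat the existing answer.\nAnswer:"
)


def chunk_text(text: str, chunk_size: int) -> List[str]:
    """Split text into consecutive pieces of at most chunk_size words."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


def create_and_refine(llm: LLM, chunks: List[str], question: str) -> str:
    """Answer from the first chunk, then refine the answer with each later chunk."""
    if not chunks:
        raise ValueError("at least one chunk is required")
    # Create: initial answer from the first chunk and the question.
    answer = llm(QA_PROMPT.format(context=chunks[0], question=question))
    # Refine: each later prompt carries the existing answer, new context, and question.
    for chunk in chunks[1:]:
        answer = llm(
            REFINE_PROMPT.format(
                existing_answer=answer, context=chunk, question=question
            )
        )
    return answer
```

Note how the refine prompt has three moving parts (existing answer, new context, question) while the initial prompt has only two, which is the extra complexity the text refers to. The calls are also strictly sequential, since each step depends on the previous answer.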

Tree Summarization

Tree Summarization offers an alternative approach that has shown improved performance. In this strategy, each chunk of context is processed independently to generate individual answers. These answers are hierarchically combined, forming a tree-like structure, until a final answer is derived at the root node, based on the question. By simplifying the prompt and leveraging the hierarchical combination of answers, Tree Summarization achieves better results compared to the refine prompt approach.

The precise reason behind the improved effectiveness of Tree Summarization is still not fully understood. However, it can be attributed, at least in part, to the simplicity of the prompt used in this strategy. Ongoing exploration and refinement of these response synthesis strategies will contribute to further advancements in generating accurate and comprehensive responses with Claude and similar models.
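
A rough sketch of the hierarchical combination follows. It is an illustration under stated assumptions rather than LlamaIndex's actual code: the `llm` callable, the prompt wording, and the `fanout` parameter (how many answers are merged per parent node) are hypothetical.

```python
from typing import Callable, List

# Hypothetical interface: any function that maps a prompt string to a completion.
LLM = Callable[[str], str]

# One simple prompt is reused at every level of the tree (assumed wording).
SUMMARY_PROMPT = "Context:\n{context}\n\nQuestion: {question}\nAnswer:"


def tree_summarize(
    llm: LLM, chunks: List[str], question: str, fanout: int = 2
) -> str:
    """Answer per chunk, then merge answers level by level up to a single root."""
    if not chunks:
        raise ValueError("at least one chunk is required")
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    # Leaf level: each chunk is answered independently of the others.
    nodes = [
        llm(SUMMARY_PROMPT.format(context=chunk, question=question))
        for chunk in chunks
    ]
    # Upper levels: group child answers and answer again over each group.
    while len(nodes) > 1:
        groups = [nodes[i:i + fanout] for i in range(0, len(nodes), fanout)]
        nodes = [
            llm(SUMMARY_PROMPT.format(context="\n\n".join(group), question=question))
            for group in groups
        ]
    return nodes[0]
```

Every call uses the same two-part prompt (context and question), in contrast to the three-part refine prompt. The leaf calls are independent of each other, so in practice they could be issued in parallel.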

📌

Practical Challenges in Context Parsing:
When iteratively parsing context within response synthesis strategies, certain challenges arise. These strategies offer effective workarounds to accommodate extensive context within the prompt window, but they come with limitations and trade-offs.

The Create and Refine approach is meant to compress information into a single answer, but it shows an interesting behavior: over successive refinements, the model tends to accumulate details, producing longer answers irrespective of their accuracy or relevance. This accumulation is a drawback of Create and Refine.

In contrast, the Tree Summarization approach hierarchically summarizes context, combining individual chunk responses. However, this summarization process sacrifices finer-level details. Striking a balance between summarization and preserving nuanced information is crucial when employing Tree Summarization.

Retrieval vs. Fine-Tuning: A Comparative Analysis

The choice between retrieval and fine-tuning approaches for working with data is a topic of exploration. Retrieval-augmented generation, commonly used in systems like LlamaIndex, involves feeding retrieved context chunks into a pre-trained language model. It is easy to use and requires no model training.
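
To make the retrieval-augmented flow concrete, here is a toy sketch. It scores chunks by word overlap with the question as a stand-in for the embedding similarity a real system would typically use; the function names, the prompt wording, and the `llm` callable are all assumptions for illustration and not part of any library's API.

```python
import re
from typing import Callable, List, Set

# Hypothetical interface: any function that maps a prompt string to a completion.
LLM = Callable[[str], str]


def tokenize(text: str) -> Set[str]:
    """Lowercased set of alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(chunks: List[str], question: str, top_k: int = 2) -> List[str]:
    """Return the top_k chunks sharing the most words with the question.

    Word overlap is a toy relevance score; ties keep the original chunk order.
    """
    question_words = tokenize(question)
    ranked = sorted(
        chunks,
        key=lambda chunk: len(question_words & tokenize(chunk)),
        reverse=True,
    )
    return ranked[:top_k]


def answer_with_retrieval(
    llm: LLM, chunks: List[str], question: str, top_k: int = 2
) -> str:
    """Feed only the retrieved chunks, plus the question, to an unmodified model."""
    context = "\n\n".join(retrieve(chunks, question, top_k))
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return llm(prompt)
```

The model itself is never trained or modified here: all of the private data reaches it through the prompt, which is what makes this approach easy to adopt.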

Fine-tuning is another approach with significant potential. By building on pre-trained models trained on extensive data, fine-tuning enables tasks like style transfer and poetry generation, and can turn the model into a knowledge source. However, current fine-tuning APIs from larger companies may pose challenges in terms of cost, maintenance, and usability.

Recent advancements, such as LoRA, and the availability of smaller open-source models, offer more accessible avenues for fine-tuning on user data. This suggests that, in the future, fine-tuning may provide a better cost-benefit trade-off compared to relying solely on retrieval-augmented systems.

A hybrid approach that combines retrieval and fine-tuning is anticipated to prevail in the future. This approach involves a continually learning model that can reference external sources of information as needed, allowing for a combination of internal and external knowledge.

As advancements continue and accessibility improves, a combination of retrieval and fine-tuning approaches is expected to shape the future of working with data in language model applications.
