Programmatic Data Labelling and Training LLMs at

March 22, 2024

In the latest episode of TrueML Talks, Nikunj, co-founder of TrueFoundry, talks with Vincent, a foundational figure at Snorkel AI. Snorkel AI's journey from an academic project to a company focused on data-centric AI development offers useful lessons. Vincent shares his experiences from the early days at the Stanford AI Lab to leading product and design at Snorkel AI, and discusses machine learning (ML), Large Language Models (LLMs), and the impact of generative AI on the industry. We touched upon the following topics:

- The Evolution of Snorkel AI
- Data-Centric AI Development
- Transition to Product Leadership
- Generative AI and Open Models
- Career Advice for AI Enthusiasts

Beginning of Snorkel AI

Vincent describes Snorkel AI's roots as an academic project focused on weak supervision and programmatic labeling. This research laid the groundwork for what Snorkel AI has become today: a company that supports AI application development in enterprises. Vincent's journey from graduate student to a leader at Snorkel AI shows how strong academic research can turn into a startup. At Stanford, the team collaborated with doctors and created tailored datasets for them, which gave the research a real-life use case.
He also talks about his time at Y Combinator, his early days in tech, and his hunger for growth and learning.

The Core of Snorkel AI: Data-Centric AI Development

Vincent recalls that, in the beginning, creating databases meant little more than sharing large data sheets between teams, an unorganised task that has since changed. He elaborates on the company's focus on helping enterprise teams manage, curate, and label data at scale, taking on what are often treated as the janitorial tasks of AI development. This data-centric approach lets companies align AI closely with their own objectives and datasets, and it emphasizes the critical role of data in programming AI systems. He also notes that in industries like banking and healthcare, data accuracy cannot be left to chance, because a single mistake by an LLM can be fatal for operations.

  • Programmatic Data Development: Introduction of a scalable, adaptable, and efficient approach to data labeling, moving away from traditional manual methods.
  • Impact on Enterprises: Demonstrating how Snorkel AI's approach has revolutionized data handling for companies, making AI development more agile and responsive to changes.
  • Adaptability and Scalability: The ability for businesses to quickly adapt their data labeling processes without starting from scratch, showcasing a future where AI development is significantly more dynamic.
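To make the idea of programmatic labeling concrete, here is a minimal sketch in plain Python. It is an illustration only, not Snorkel AI's implementation: the spam/ham task, the three labeling functions, and the simple majority vote are all assumptions made for this example. Weak supervision systems typically combine noisy votes with a learned model rather than a plain majority vote.

```python
from collections import Counter

# Hypothetical label values for an illustrative spam-detection task.
ABSTAIN = -1
HAM = 0
SPAM = 1


def lf_contains_link(text):
    """Vote SPAM when the message contains a link, otherwise abstain."""
    return SPAM if "http" in text.lower() else ABSTAIN


def lf_money_words(text):
    """Vote SPAM when the message uses typical bait words."""
    bait = ("free", "winner", "prize")
    return SPAM if any(word in text.lower() for word in bait) else ABSTAIN


def lf_short_greeting(text):
    """Vote HAM for very short messages that open with a greeting."""
    is_short = len(text.split()) <= 4
    greets = text.lower().startswith(("hi", "hello", "thanks"))
    return HAM if is_short and greets else ABSTAIN


LABELING_FUNCTIONS = [lf_contains_link, lf_money_words, lf_short_greeting]


def majority_vote(text, lfs=LABELING_FUNCTIONS):
    """Combine labeling-function votes; abstain on no votes or on a tie."""
    votes = [v for v in (lf(text) for lf in lfs) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # conflicting evidence, leave unlabeled
    return ranked[0][0]


def label_dataset(texts):
    """Label every text programmatically instead of by hand."""
    return [majority_vote(t) for t in texts]
```

Adapting to a new requirement then means editing or adding a labeling function and re-running `label_dataset`, rather than relabeling every example by hand, which is the adaptability the points above describe.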

Shift from ML Engineering to Product Leadership

Coming from an ML background, Vincent shares how his role as Head of Product (AI/ML) and Design lets him talk directly to data scientists and ML engineers. This helps him understand their use cases and pain points, which he can feed directly into the product. Because he is involved across several domains at Snorkel, he can steer the product according to customers' needs.

The Impact of Generative AI and Open Models

The generative AI era and the proliferation of open models have significantly influenced the AI landscape. Vincent explains that LLMs are the newest way to generate datasets for training, but the datasets they produce often fall short on accuracy. As discussed above, LLM-generated data can suit generalized use cases and demo-level tasks, but not use cases where accuracy is critical, in domains such as banking, finance, insurance, and healthcare.

  • Post-ChatGPT Landscape: Reflections on the emergence of generative AI and its impact on the AI community and enterprise applications.
  • Importance of Open-Sourcing Data: The call for open-sourcing not just AI models but also the datasets and development processes to foster innovation and ensure AI safety and reliability.
  • Specialized Data for Enterprise Applications: The ongoing need for high-quality, specialized data to train generative AI models for specific business needs.
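One simple way to act on the accuracy concern raised above is to compare LLM-generated labels against a small, trusted, hand-labeled sample before using them for training. The sketch below is a hypothetical illustration, not a method described in the episode: the function names, the dictionary format, and the 0.95 threshold are all assumptions.

```python
def agreement_rate(candidate_labels, gold_labels):
    """Fraction of shared example IDs where the candidate matches gold.

    Both arguments map an example ID to a label. Only IDs present in
    both mappings are compared.
    """
    shared = candidate_labels.keys() & gold_labels.keys()
    if not shared:
        raise ValueError("no overlapping examples to compare")
    matches = sum(candidate_labels[k] == gold_labels[k] for k in shared)
    return matches / len(shared)


def accept_batch(candidate_labels, gold_labels, min_agreement=0.95):
    """Accept LLM-generated labels only if they agree with gold often enough.

    The 0.95 default is an arbitrary placeholder; a regulated domain
    would set its own bar.
    """
    return agreement_rate(candidate_labels, gold_labels) >= min_agreement
```

A batch that fails the check would go back for review or correction by domain experts instead of straight into training.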

Hot Take on the Current AI Landscape

Vincent's hot take on the current state of AI development emphasizes the pivotal shift towards open-source models and data, proposing a more holistic approach to sharing AI innovations. He argues that the true essence of open-sourcing in AI should extend beyond merely releasing the model weights; it should include making datasets, development processes, and the rationale behind model training accessible. This approach fosters a collaborative ecosystem that accelerates innovation, ensures reproducibility, and builds safer AI systems. By advocating for the open data movement, Vincent highlights the importance of transparency in AI development, enabling a broader community to contribute to and benefit from the advancements in the field. This perspective not only challenges the conventional practices of AI sharing but also calls for a comprehensive strategy that could democratize AI development, ensuring that the benefits of AI technologies are widely distributed and accessible.

  • Accelerates Innovation: Open-source datasets and development processes encourage the community to innovate, building upon existing work rather than starting from scratch.
  • Ensures Reproducibility: Transparency in AI development processes allows for the verification of results and methodologies, which is crucial for scientific progress and trust in AI applications.
  • Builds Safer Systems: Access to the datasets and logic used in training models helps identify biases and errors, contributing to the development of more reliable and ethical AI solutions.
  • Democratizes AI Development: Making comprehensive AI resources available to a wider audience levels the playing field, allowing individuals and organizations with varied resources to contribute to and benefit from AI advancements.
  • Challenges Conventional Practices: Vincent's take invites the AI community to rethink how AI technologies are shared and developed, advocating for a more inclusive and collaborative approach.

Advice for Aspiring AI Professionals

Vincent says that hackathon-level projects are not enough: you have to get your hands dirty and build something you would actually use, which will help you get results and stand out. Reflecting on his journey, he offers advice to those starting their AI careers. He emphasizes the value of hands-on experience and encourages people to build and iterate on AI projects that address real-world challenges. This experiential learning, coupled with collaboration and a passion for exploration, is pivotal in navigating the rapidly evolving AI domain.
