Leveraging Data Beyond Text: Multimodal AI at Scale

TL;DR

  • Multimodal AI at scale demands more than fast hardware—it requires a fundamentally different architecture.
  • Vespa AI brings compute to the data, enabling real-time performance across text, images, and video.
  • Companies like Spotify, Perplexity, and Vinted rely on Vespa to power search, recommendations, and RAG at global scale.
  • Tensor-based retrieval and hybrid ranking strategies make Vespa uniquely capable of supporting complex multimodal use cases.
  • Video search use cases—like content licensing and ad analytics—showcase the need for token-level and patch-level retrieval.

Why go beyond text in AI systems?

Multimodal AI is no longer a nice-to-have. From search to safety, leading companies need to retrieve and rank across modalities—text, images, and video—while maintaining low latency and high relevance. Yet most production systems break down at scale, especially when they rely on flat vector representations and traditional architectures that separate storage and compute.

What makes Vespa different for multimodal AI?

As Bonnie Chase, Director of Product Marketing @ Vespa AI, explains, Vespa was built for hybrid vector search from the start. Unlike traditional stacks that tack on vector support as an afterthought, Vespa’s architecture co-locates data, models, and computation on the same nodes. That means fast, cost-efficient retrieval—even across billions of documents or video segments.

How does this architecture perform in practice?

Chase detailed several real-world deployments:

  • Spotify: Uses Vespa to index every word of 5M+ podcasts, enabling semantic retrieval based on content, not just titles.
  • Perplexity: Powers web-scale RAG with over 1.5B documents indexed (targeting 10B+), ensuring low-latency responses from Vespa-backed retrieval.
  • SafeKiddo: Detects online threats in real time, leveraging Vespa for multimodal moderation across text, memes, screenshots, and more.
  • Vinted: Delivers highly personalized secondhand shopping recommendations across a billion-plus listings.

Why are tensors essential for multimodal search?

Vectors flatten structure. That’s fine for some use cases, but when precision matters—like distinguishing scenes in a movie or tracking objects over time—you need more. Vespa supports rich tensor representations, enabling token-level, frame-level, and patch-level embedding storage and retrieval. This is critical for aligning modalities in complex queries.
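To see why per-token and per-patch tensors matter, consider a ColBERT-style late-interaction ("MaxSim") score, a common way to exploit such representations. The sketch below (pure Python, toy 2-D vectors; not Vespa's actual implementation) shows how a document that keeps its structure outscores one whose patches were averaged into a flat vector:

```python
def maxsim(query_vecs, doc_vecs):
    """Late-interaction score: for each query vector, take its best
    match among the document's per-token/per-patch vectors, then sum.
    A single flat vector averages this structure away."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Two "documents", each a set of per-patch embeddings (toy 2-D vectors).
doc_a = [[1.0, 0.0], [0.0, 1.0]]   # keeps both concepts as distinct patches
doc_b = [[0.5, 0.5], [0.5, 0.5]]   # same average, structure flattened
query = [[1.0, 0.0], [0.0, 1.0]]   # query asks for both concepts

print(maxsim(query, doc_a))  # 2.0 — each query vector finds an exact patch
print(maxsim(query, doc_b))  # 1.0 — the flattened document scores lower
```

Both documents have the same mean embedding; only the tensor representation lets the ranking function tell them apart.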

How does video retrieval actually work?

Zohar Nisari Husen, Strategic Lead @ Vespa AI, demonstrated a video search use case using Twelve Labs models and Vespa’s tensor framework. Each video is chunked into six-second segments, with embeddings generated per chunk. Vespa stores these embeddings in a tensor field, alongside metadata like titles, keywords, and summaries.
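The segmentation step described above is straightforward to sketch. The snippet below (an illustrative sketch, not code from the demo) splits a video's timeline into fixed six-second windows, producing the per-chunk boundaries that embeddings are then generated for:

```python
CHUNK_SECONDS = 6.0

def chunk_video(duration_s, chunk_s=CHUNK_SECONDS):
    """Split a video's timeline into fixed-length segments,
    returning (start, end) pairs; the final chunk may be shorter."""
    chunks = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        chunks.append((start, end))
        start = end
    return chunks

# A 20-second clip yields four segments; the last is only 2 seconds.
print(chunk_video(20.0))
# [(0.0, 6.0), (6.0, 12.0), (12.0, 18.0), (18.0, 20.0)]
```

Keeping the (start, end) pair with each chunk's embedding is what later lets a query resolve to an exact timestamp rather than just a video ID.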

The result: hybrid queries that combine lexical and semantic retrieval. For example, you can search for “Santa Claus on his sleigh” and retrieve not just the right video—but the exact timestamped segments that match.
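A hybrid query of this kind combines Vespa's `userQuery()` lexical matching with a `nearestNeighbor` operator in a single YQL expression. The sketch below builds such a request body; the document type, tensor field, and rank-profile names (`videos`, `chunk_embeddings`, `hybrid`) are illustrative assumptions, not names from the demo:

```python
def hybrid_query_body(text, query_embedding, target_hits=100):
    """Build a Vespa search request combining lexical (userQuery) and
    semantic (nearestNeighbor) retrieval in one YQL expression.
    Schema names here are illustrative placeholders."""
    yql = (
        "select * from videos where userQuery() or "
        f"({{targetHits:{target_hits}}}"
        "nearestNeighbor(chunk_embeddings, q))"
    )
    return {
        "yql": yql,
        "query": text,                      # feeds userQuery()
        "input.query(q)": query_embedding,  # feeds nearestNeighbor()
        "ranking": "hybrid",                # assumed rank profile name
    }

body = hybrid_query_body("Santa Claus on his sleigh", [0.1, 0.2, 0.3])
print(body["yql"])
```

Because both retrieval arms run in one query, the rank profile can weigh exact keyword matches against semantic similarity per chunk, which is how the matching timestamped segments surface.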

What’s the strategy for high-accuracy ranking?

Vespa uses distributed multi-phase ranking. Lightweight scoring happens early, and heavier ML models are applied only to top candidates. This means you can blend business logic, real-time behavior signals, and metadata to tailor ranking for each use case—search, recommendations, ads, and more.
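The core idea of phased ranking can be mimicked in a few lines. This toy sketch (not Vespa's implementation; the scoring functions stand in for a cheap first-phase expression and an expensive second-phase ML model) scores every candidate cheaply, then re-ranks only the top slice:

```python
def multiphase_rank(docs, cheap_score, expensive_score, rerank_count=100):
    """Mimic phased ranking on one node: score every candidate with a
    cheap function, then apply the expensive model only to the best
    `rerank_count` candidates."""
    shortlist = sorted(docs, key=cheap_score, reverse=True)[:rerank_count]
    return sorted(shortlist, key=expensive_score, reverse=True)

# Toy corpus: 'pop' stands in for a cheap first-phase signal,
# 'ml' for an expensive second-phase model score.
docs = [{"id": i, "pop": i % 7, "ml": (i * 37) % 11} for i in range(1000)]
top = multiphase_rank(docs,
                      cheap_score=lambda d: d["pop"],
                      expensive_score=lambda d: d["ml"],
                      rerank_count=100)
print(len(top))  # only 100 documents ever touched the expensive model
```

The expensive model runs on 100 candidates instead of 1,000, which is the whole economic point: per-use-case business logic and behavior signals can be folded into either phase without rescoring the full corpus.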

What about cost and latency?

Vespa’s architecture avoids expensive network hops by doing everything locally. For performance tuning, Vespa supports techniques like binary quantization and coarse-to-fine re-ranking. As Zohar noted, it’s about matching model precision to each phase of retrieval, so you don’t waste compute where you don’t need it.
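Binary quantization plus coarse-to-fine re-ranking can be sketched in plain Python (an illustrative toy, not Vespa's internals): keep only the sign bit of each dimension for a cheap Hamming-distance first pass, then spend full-precision dot products only on the survivors:

```python
def binarize(vec):
    """Binary quantization: keep only the sign of each dimension,
    packed into an int bitmask (1 bit instead of a float32 per dim)."""
    bits = 0
    for i, x in enumerate(vec):
        if x > 0:
            bits |= 1 << i
    return bits

def hamming(a, b):
    """Number of differing bits — a very cheap distance."""
    return bin(a ^ b).count("1")

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def coarse_to_fine(query, docs, coarse_k=10):
    """Phase 1: Hamming distance over binarized vectors.
    Phase 2: exact float dot product on the shortlist only."""
    qb = binarize(query)
    shortlist = sorted(docs, key=lambda d: hamming(qb, binarize(d)))[:coarse_k]
    return max(shortlist, key=lambda d: dot(query, d))

docs = [[0.9, -0.8, 0.7, 0.1],
        [0.8, 0.7, -0.9, 0.2],
        [-0.5, -0.6, 0.4, 0.3]]
query = [1.0, -1.0, 1.0, 0.0]
print(coarse_to_fine(query, docs, coarse_k=2))  # → [0.9, -0.8, 0.7, 0.1]
```

This is the "matching model precision to each phase" idea in miniature: the coarse pass trades accuracy for a ~32x smaller representation, and the exact pass restores precision only where it pays off.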

Want to try it yourself?

The full video retrieval demo, including code and notebook, is available on the Vespa GitHub. All you need is a Vespa trial account and a Twelve Labs API key. The sample uses public domain cartoons from the Internet Archive and shows how to build a fully working multimodal search engine.

Questions answered in this session

  • How do you build real-time RAG systems at web scale?
  • Why do vectors break down for multimodal data?
  • What are tensors and how do they support better retrieval?
  • How does hybrid ranking improve accuracy and efficiency?
  • What are real-world examples of multimodal AI in production?

Last updated: July 27, 2025. Watch the full session on YouTube.
