One of Google's favorite statistics is that every day, roughly 15% of Google queries are for things that have never before been typed into the search box. And even at Google's impossible scale, the number never seems to go down. "Part of it, I have to admit, is that people find new and creative ways of misspelling words," Pandu Nayak, a Google fellow and the company's VP of search, told me earlier this month. But there are two other reasons, he said: The world changes all the time, and people's curiosity "is quite infinite in its complexity."
Google's challenge on the web is to find ever-better ways to collect and sort information. Crawling web pages is the easy part, relatively speaking. Understanding what's authoritative vaccine guidance and what's dangerous misinformation, or whether you typed "spaghetti" looking for definitions or recipes? That's all much more complicated. Nayak rattled off numbers like the 3,600 changes made to the search system last year, or the 60,000-plus experiments run internally. It's a lot of work, but Google's better at it than most.
But there's a core change happening on the internet that threatens Google in a serious, potentially existential way. An increasing share of the web is no longer web pages full of text and hyperlinks. It's images, video and audio. TikTok and Instagram, podcasts and videos: Those platforms are just as much "the internet" as the Wikipedias and publisher sites Google has long relied on. And for a company that has spent two decades organizing the world's information, that presents a problem.
At Google's Search On event on Wednesday, Google executives showed off some fancy new features, like a camera feature that lets you snap a picture of a shirt and find socks with the same pattern, or take a photo of your broken bike chain and get search results for how to fix it. It's all part of Google Lens, the visual-first search system the company has been building for several years. Google has long talked about wanting to take search beyond the text box, to make it easier for people to input information and get answers. Context is crucial to that, too.
But just as important, and just as difficult, is understanding the information on the other side. It's technically possible to search TikTok and Instagram through Google, but the results are pretty primitive and mostly based on hashtags and video descriptions. Google is reportedly working on deals with ByteDance and Facebook to bring more content with better metadata into Google's search results, but that, too, is only half the battle.
Even on YouTube — itself the world's second-largest search engine, and obviously a Google-owned company — Google's search relies on metadata and automatically generated transcripts to figure out what's going on in a video. Introducing chapter markers made the system better, but only because creators gave Google hints about where to look. Its search crawlers don't understand what's on the screen in any meaningful way.
When he introduced Google's new Multitask Unified Model system (or MUM, as it's known) at Google I/O in May, Nayak hinted that things might be about to change. "MUM is multimodal," he wrote in a blog post, "so it understands information across text and images and, in the future, can expand to more modalities like video and audio." He echoed the sentiment in our conversation. "You can give [MUM] inputs that are both text and images, as a sequence of tokens," he said. "It only thinks about tokens … and it essentially learns the relationships between image tokens and word tokens, and I think we'll see a number of interesting examples coming out of that." He said that's not coming immediately, but "in the maybe not-too-distant future."
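The idea Nayak describes — text and images flattened into one token sequence so a single model can learn cross-modal relationships — can be sketched in miniature. This is a toy illustration, not Google's MUM code: the embedding functions, dimensions, and patch scheme below are invented for the example, standing in for the learned tokenizers a real multimodal transformer would use.

```python
# Toy sketch of multimodal tokenization: both modalities become
# fixed-size vectors ("tokens") in one sequence. In a real system,
# a transformer would then attend across the whole sequence, which
# is how relationships between image and word tokens get learned.

def embed_text(words, dim=4):
    """Toy text embedding: hash each word into a dim-sized vector."""
    return [[(hash(w) >> i) % 7 / 7.0 for i in range(dim)] for w in words]

def embed_image_patches(pixels, patch=2, dim=4):
    """Toy image embedding: split a square grayscale image into
    patch x patch tiles, one token per tile (here just the tile mean,
    repeated to match the text token dimension)."""
    n = len(pixels)
    tokens = []
    for r in range(0, n, patch):
        for c in range(0, n, patch):
            tile = [pixels[r + dr][c + dc]
                    for dr in range(patch) for dc in range(patch)]
            tokens.append([sum(tile) / len(tile)] * dim)
    return tokens

# A multimodal query is just the two token lists concatenated.
text_tokens = embed_text("socks with this pattern".split())
image_tokens = embed_image_patches([[0.1, 0.9, 0.2, 0.8],
                                    [0.9, 0.1, 0.8, 0.2],
                                    [0.2, 0.8, 0.1, 0.9],
                                    [0.8, 0.2, 0.9, 0.1]])
sequence = text_tokens + image_tokens
print(len(text_tokens), len(image_tokens), len(sequence))  # 4 4 8
```

The point of the exercise is the last line: once everything is a token of the same shape, the model doesn't need separate machinery per modality — "it only thinks about tokens," as Nayak put it.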
If Google can unlock a truly visual search engine in both directions — visual queries, visual data, visual output — it can be much better equipped to be to the future what, well, Google was to the past. More than two decades ago, the company took a disparate set of content and put it at users' fingertips. Now the content has changed, but the need hasn't.
The other upside for Google? Shopping. Practically every corner of the internet is embracing shopping as a way to make money, both for creators and for the platforms. For Google, the potential is massive: It could allow users to click on any product in any video or image anywhere on the internet, from the gadget in the foreground to the lamp in the background to the shoes on creators' feet, and be taken to a store to buy that thing. MUM could help Google build the world's biggest catalog, with Google as a happy fulfillment and payment service.
Companies around the industry, from Spotify to Pinterest to Apple to practically every other platform and service that deals in audiovisual content, are trying to figure out how to better understand and index the content in their systems. Google, as the trillion-dollar tech giant predicated on understanding and indexing all content everywhere, is in a high-stakes race to do it better.