Yahoo Search Code Released as Open Source
Oath, Inc., the Verizon subsidiary that’s been the owner of record of Yahoo since June, has released some important Yahoo code as open source under the Apache 2.0 license. The project, called Vespa, was originally based on code Yahoo inherited with its acquisition of AlltheWeb in 2003. The software is used across all Yahoo websites, including Flickr, for everything from handling search queries to serving ads.
“Over the last couple of years we have rewritten most of the engine from scratch to incorporate our experience onto a modern technology stack,” Jon Bratseth, an architect with Vespa, said in a blog post. “Vespa is larger in scope and lines of code than any open source project we’ve ever released. Now that this has been battle-proven on Yahoo’s largest and most critical systems, we are pleased to release it to the world.”
This is an unexpected boon for developers. Unexpected because although Yahoo has a history of releasing some of its code as open source, most famously the big data project Hadoop, it wasn’t known if the practice would continue under Verizon’s ownership. A boon, because Vespa is loaded with potential that reaches far beyond search.
“Building applications increasingly means dealing with huge amounts of data,” Bratseth said. “While developers can use the Hadoop stack to store and batch process big data, and Storm to stream-process data, these technologies do not help with serving results to end users. Serving is challenging at large scale, especially when it is necessary to make computations quickly over data while a user is waiting, as with applications that feature search, recommendation, and personalization.”
Vespa’s scalability should surprise no one. It was designed for use by Yahoo, which despite decades of decreasing traffic is still ranked by Alexa as the sixth most visited site on the web, both globally and in the US.
“Vespa processes and serves content and ads almost 90,000 times every second with latencies in the tens of milliseconds,” he said. “On Flickr alone, Vespa performs keyword and image searches on the scale of a few hundred queries per second on tens of billions of images. Additionally, Vespa makes direct contributions to our company’s revenue stream by serving over 3 billion native ad requests per day via Yahoo Gemini, at a peak of 140k requests per second (per Oath internal data).”
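To put the ad-serving figures in perspective, here is a rough back-of-the-envelope sketch in Python. It takes only the two numbers quoted above (3 billion requests per day and a 140k-per-second peak) and assumes, purely for illustration, that "over 3 billion" can be treated as exactly 3 billion; the function and variable names are made up for this example.

```python
# Back-of-the-envelope check of the quoted Yahoo Gemini figures.
# Assumption: "over 3 billion" requests per day is treated as exactly 3 billion,
# so the average below is a lower bound and the ratio an upper bound.

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_rate_per_second(requests_per_day: float) -> float:
    """Spread a daily request count evenly across every second of the day."""
    return requests_per_day / SECONDS_PER_DAY


daily_ad_requests = 3_000_000_000  # quoted daily volume
peak_rate = 140_000                # quoted peak, in requests per second

avg_rate = average_rate_per_second(daily_ad_requests)
peak_to_average = peak_rate / avg_rate

print(f"average: {avg_rate:,.0f} requests/s")                  # average: 34,722 requests/s
print(f"peak is about {peak_to_average:.1f}x the average")     # peak is about 4.0x the average
```

In other words, the quoted peak is roughly four times the day-long average, which suggests the system has to be sized for bursts rather than for steady load.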
It also comes with out-of-the-box versatility. According to Bratseth, Vespa can be run on premises or in the cloud, and Oath is providing both Docker images and RPM packages, along with guides for running them on laptops or as AWS clusters.
There’s little doubt that, now that it’s open source, Vespa will become a major part of the open source toolbox, alongside the likes of Hadoop, Kubernetes, OpenStack, and even Linux.
“By releasing Vespa, we are making it easy for anyone to build applications that can compute responses to user requests, over large datasets, at real time and at internet scale – capabilities that up until now have been within reach of only a few large companies,” Bratseth said.
For the time being, control of the project seems to remain with Oath. However, it wouldn’t be surprising if it eventually finds a home at the Apache Software Foundation, which is where Hadoop ended up. In the meantime, it can be found on GitHub.