Yahoo! Advances Hadoop From Science to the World's Largest Internet Deployment to Mainstream Business Use

Chief Product Officer Blake Irving to Discuss the Business-Critical Role of Hadoop at Yahoo!, Powering Consumer and Advertiser Experiences With Advanced Cloud Computing

Yahoo! today announced significant enhancements to Hadoop, accelerating its potential for enterprise-wide adoption by mainstream businesses. Hadoop is the open source technology at the epicenter of big data and cloud computing, helping companies get value from their data and better manage their businesses.

As Internet usage continues to grow, data proliferates, making it challenging for businesses of all sizes to manage data in secure and useful ways. Yahoo!, one of the largest online companies in the world, initially used Hadoop for applied science projects and has built it into an enterprise-class platform used across its business to develop increasingly personalized consumer experiences, built on relevance and trust. Hadoop plays a key role in Yahoo!’s popular global home page, Yahoo! Search, Yahoo! Mail, and many other products.

“Hadoop is where science meets big data – it’s the technical underpinning that powers our innovative consumer and advertiser products on the world’s most-advanced digital canvas,” said Blake Irving, Executive Vice President and Chief Product Officer at Yahoo!. “Yahoo!’s cloud and Hadoop make it possible for Yahoo! to rapidly personalize our content and advertising, and deliver highly relevant experiences, while maintaining the trust of our 600 million users.”

Improving Hadoop With Security, Ease of Use, and Reliability for the Enterprise

Today, at the Yahoo!-hosted Third Annual Hadoop Summit, Yahoo! announced the beta release of Hadoop with Security and Oozie, Yahoo!’s workflow engine for Hadoop. The enterprise will benefit from these open source releases because they include better controls for managing business-sensitive data and enable complex processes to be delivered via Hadoop. Hadoop with Security and Oozie are interoperable and have been tested and deployed at Yahoo! on tens of thousands of servers.

Today’s Contributions Include:

  • Hadoop with Security: a set of significant security updates to Hadoop, enabling strong authentication.
    • Integrates Hadoop with Kerberos, a mature, open source authentication standard, enabling more secure collaboration and sharing of authenticated data.
    • Enables multi-tenancy, or the use of hardware by multiple internal parties, providing authenticated secure access and processing of sensitive data.
  • Oozie, Yahoo!’s workflow engine for Hadoop: an open-source workflow management and coordination engine to manage jobs running on Hadoop, including Hadoop Distributed File System, Pig and MapReduce.
    • Designed for Yahoo!’s rigorous use cases that require managing complex work processes and ETL (extract, transform, load) at global scale.
    • Integrates with Hadoop with Security.
    • Tested and deployed across Yahoo!.

“Businesses across all sectors are looking for ways to leverage the vast quantities of data they are accumulating, and Apache Hadoop is an efficient solution for processing data at scale. Hadoop has matured and is now becoming an enterprise-ready cloud computing technology with the addition of Kerberos authentication,” said Melanie Posey, research director at IDC Research. “Now organizations of various sizes can leverage Yahoo!’s Hadoop investment and deployments to run it on their own systems and build out their own Hadoop deployments without starting from scratch on internal science experiments.”

Science and Research on Hadoop and the Cloud

Yahoo! Labs has been at the forefront of using and developing a variety of open source cloud software and has also been an early adopter of Hadoop. Since 2005, Yahoo! Labs has used Hadoop to conduct science at true Internet scale, leveraging Yahoo!’s global network to unearth insights into consumer behavior, social systems, economics, machine learning, and a host of scientific disciplines critical to the development of the Web.

Several projects developed by scientists in Yahoo! Labs have migrated to production in the Yahoo! Cloud and a few have been open sourced to the community. Examples include Pig, a programming language for performing procedural data processing tasks on top of Hadoop, and ZooKeeper, a coordination service for distributed computing environments; the paper describing ZooKeeper recently won the best paper award at USENIX ATC ’10.

In addition, Yahoo! continues to partner closely with the global academic and scientific community as both a founding member of the Open Cirrus Testbed, which is advancing cloud computing research at an international scale, and the Open Cloud Consortium, a testbed for systems research on large-scale data clouds. Top research universities such as Carnegie Mellon, the University of California at Berkeley, Cornell University, and the University of Massachusetts at Amherst use Hadoop on Yahoo!’s M45 supercomputer for a broad range of computer science research.