By Armando Acosta
The Apache™ Hadoop® platform speeds storage, processing and analysis of big, complex data sets, supporting innovative tools that draw immediate insights.
Big data has taken a giant leap beyond its large-enterprise roots, entering boardrooms and data centers across organizations of all sizes and industries. The Apache Hadoop platform has evolved along with the big data landscape and emerged as a major option for storing, processing and analyzing large, complex data sets. In comparison, traditional relational database management systems (RDBMS) or enterprise data warehouse tools often lack the capability to handle such large amounts of diverse data effectively.
Hadoop enables distributed parallel processing of high-volume, high-velocity data across industry-standard servers that both store and process the data. Because it supports structured, semi-structured and unstructured data from disparate systems, the highly scalable Hadoop framework allows organizations to store and analyze more of their data than before to extract business insights. As an open platform for data management and analysis, Hadoop complements existing data systems to bring organizational capabilities into the big data era as analytics environments grow more complex.
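The distributed model that Hadoop popularized — map tasks running in parallel against data splits, followed by a shuffle that groups results by key and a reduce step that aggregates them — can be sketched in a few lines of plain Python. This is a conceptual toy for illustration only, not Hadoop code: the function names and the in-process worker pool standing in for a cluster are assumptions of the sketch.

```python
from collections import defaultdict
from multiprocessing import Pool

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in one input split
    return [(word.lower(), 1) for word in line.split()]

def shuffle(mapped):
    # Shuffle: group all emitted values by key, as Hadoop does between phases
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values into a final count
    return {key: sum(values) for key, values in groups.items()}

if __name__ == "__main__":
    lines = ["big data big insights", "data drives insights"]
    with Pool(2) as pool:  # stand-in for parallel map tasks on cluster nodes
        mapped = pool.map(map_phase, lines)
    counts = reduce_phase(shuffle(mapped))
    print(counts)
```

In a real cluster the map tasks run on the servers that store each data split, so computation moves to the data rather than the other way around.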
Evolving data needs
Early adopters tended to utilize Hadoop for batch processing; prime use cases included data warehouse optimization and extract, transform, load (ETL) processes. Now, IT leaders are expanding the application of Hadoop and related technologies to customer analytics, churn analysis, network security and fraud prevention — many of which require interactive processing and analysis.
As organizations transition to big data technologies, Hadoop has become essential for enabling predictive analytics that use multiple data sources and types. Predictive analytics helps organizations in many different industries answer business-critical questions that had been beyond their reach using basic spreadsheets, databases or business intelligence (BI) tools. For example, financial services companies can move from asking “How much does each customer have in their account?” to answering sophisticated business enablement questions such as “What upsell should I offer a 25-year-old male with checking and IRA accounts?” Retail businesses can progress from “How much did we sell last month?” to “What packages of products are most likely to sell in a given market region?” A healthcare organization can predict which patient is most likely to develop diabetes and when.
Using Hadoop and analytical tools to manage and analyze big data, organizations can personalize each customer experience, predict manufacturing breakdowns to avoid costly repairs and downtime, maximize the potential for business teams to unlock valuable insights, drive increased revenue and more. [See the sidebar, “Doing the (previously) impossible.”]
Parlaying big data to best advantage
Effective use of big data is key to competitive gain, and Dell works with ecosystem partners to help organizations succeed as they evolve their data analytics capabilities. Cloudera plays an important role in the Hadoop ecosystem by providing support and professional feature development to help organizations leverage the open-source platform.
The combination of Cloudera® software on Dell servers enables organizations to successfully implement new data capabilities on field-tested, low-risk technologies. (See the sidebar, “Taking Hadoop for a test-drive.”)
Dell | Cloudera Hadoop Solutions comprise software, hardware, joint support, services and reference architectures that support rapid deployment and streamlined management (see figure). Dell PowerEdge servers, powered by the latest Intel® Xeon® processors, provide the hardware platform.
Solution stack: Dell | Cloudera Hadoop Solutions for big data
Dell | Cloudera Hadoop Solutions are available with Cloudera Enterprise, designed specifically for mission-critical environments. Cloudera Enterprise comprises the Cloudera Distribution including Apache Hadoop (CDH) and the management software and support services needed to keep a Hadoop cluster running consistently and predictably. Cloudera Enterprise allows organizations to implement powerful end-to-end analytic workflows — including batch data processing, interactive query, navigated search, deep data mining and stream processing — from a single common platform.
Accelerated processing. Cloudera Enterprise leverages Hadoop YARN (Yet Another Resource Negotiator), a resource management framework designed to transition users from general batch processing with Hadoop MapReduce to interactive processing. The Apache Spark™ compute engine provides a prime example of how YARN enables organizations to build an interactive analytics platform capable of large-scale data processing. (See the sidebar, “Revving up cluster computing.”)
Built-in security. Role-based access control is critical for supporting data security, governance and compliance. The Apache Sentry system, integrated in CDH, strengthens data access protection by defining what users and applications can do with data, based on permissions and authorization, and it continues to expand its support for other tools in the Hadoop ecosystem. Sentry also incorporates features and functionality from Project Rhino, originally developed by Intel to enable a consistent security framework for Hadoop components and technologies.
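As an illustration of how role-based access control is expressed, Sentry's original file-based policy provider used an INI-style policy file that maps groups to roles and roles to privileges. The server, database, table and group names below are hypothetical; this is a minimal sketch of the format, not a production policy:

```ini
[groups]
# Map an OS/LDAP group to a Sentry role
analysts = analyst_role

[roles]
# analyst_role may only run SELECT against the sales.orders table
analyst_role = server=server1->db=sales->table=orders->action=select
```

Because permissions are attached to roles rather than individual users, administrators can change what a whole class of users may do by editing a single line.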
Supporting rapid big data implementations
Dell | Cloudera Hadoop Solutions, accelerated by Intel, provide organizations of all sizes with several turnkey options to meet a wide range of big data use cases.
Getting started. Dell QuickStart for Cloudera Hadoop enables organizations to easily and cost-effectively engage in Hadoop development, testing and proof-of-concept work. The solution includes Dell PowerEdge servers, Cloudera Enterprise Basic Edition and Dell Professional Services to help organizations quickly deploy Hadoop and test processes, data analysis methodologies and operational needs against a fully functioning Hadoop cluster.
Taking the first steps with Hadoop through Dell QuickStart allows organizations to accelerate cluster deployment to pinpoint effective strategies that address the business and technical demands of a big data implementation.
Going mainstream. The Dell | Cloudera Apache Hadoop Solution is an enterprise-ready, end-to-end big data solution that comprises Dell PowerEdge servers, Dell Networking switches, Cloudera Enterprise software and optional managed Hadoop services. The solution also includes Dell | Cloudera Reference Architectures, which offer tested configurations and known performance characteristics to speed the deployment of new data platforms.
Cloudera Enterprise is thoroughly tested and certified to integrate with a wide range of operating systems, hardware, databases, data warehouses, and BI and ETL systems. Broad compatibility enables organizations to take advantage of Hadoop while leveraging their existing tools and resources.
Advancing analytics. The shift to near-real-time analytics processing necessitates systems that can handle memory-intensive workloads. In response, Dell teamed up with Cloudera and Intel to develop the Dell In-Memory Appliance for Cloudera Enterprise with Apache Spark, aimed at simplifying and accelerating Hadoop cluster deployments. By providing fast time to value, the appliance allows organizations to focus on driving innovation and results, rather than on using resources to deploy their Hadoop cluster.
The appliance’s ease of deployment and scalability address the needs of organizations that want to use high-performance interactive data analysis for analyzing utility smart meter data, social data for marketing applications, trading data for hedge funds, or server and network log data. Other uses include detecting network intrusion and enabling interactive fraud detection and prevention.
Built on Dell hardware and an Intel performance- and security-optimized chipset, the appliance includes Cloudera Enterprise, which is designed to store any amount or type of data in its original form for as long as desired. The Dell In-Memory Appliance for Cloudera Enterprise comes bundled with Apache Spark and Cloudera Enterprise components such as Cloudera Impala and Cloudera Search.
Cloudera Impala is an open-source massively parallel processing (MPP) query engine that runs natively in Hadoop. The Apache-licensed project enables users to issue low-latency SQL queries to data stored in Apache HDFS™ (Hadoop Distributed File System) and the Apache HBase™ columnar data store without requiring data movement or transformation.
Cloudera Search brings full-text, interactive search and scalable, flexible indexing to CDH and enterprise data hubs. Powered by Hadoop and the Apache Solr™ open-source enterprise search platform, Cloudera Search is designed to deliver scale and reliability for integrated, multi-workload search.
Changing the game
Since its beginnings in 2005, Apache Hadoop has played a significant role in advancing large-scale data processing. Likewise, Dell has been working with organizations to customize big data platforms since 2009, delivering some of the first systems optimized to run demanding Hadoop workloads.
Just as Hadoop has evolved into a major data platform, Dell sees Apache Spark as a game-changer for interactive processing, cementing Hadoop’s position as the data platform of choice. With connected devices and embedded sensors generating a huge influx of data, streaming data must be analyzed quickly and efficiently. Spark offers the flexibility and tools to meet these needs, from running machine-learning algorithms to graphing and visualizing the interrelationships among data elements — all on one platform.
Working together with other industry innovators, Dell is enabling organizations of all sizes to harness the power of Hadoop to accelerate actionable business insights.
Joey Jablonski contributed to this article.
Doing the (previously) impossible
Apache Hadoop and big data analytics capabilities enable organizations to do what they couldn’t do before, whether that means making memorable customer experiences or optimizing operations.
Personalized content. A digital media company turned to Hadoop when burgeoning data volumes hindered its mission to simplify marketers’ access to data that would let them tailor content to individual customers. The company’s move to Cloudera Enterprise, powered by Dell PowerEdge servers, enabled complex, large-scale data processing that delivered greater than 90 percent accuracy for its content personalization services. Moreover, the 24x7 reliability of the Hadoop platform lets the company provide the data its customers need, when they need it.
Product quality management. To help global manufacturers efficiently manage product quality, Omneo implemented a software solution based on the Cloudera Distribution including Apache Hadoop (CDH) running on a cluster of Dell PowerEdge servers. Using the solution, Omneo customers can quickly search, analyze and mine all their data in a single place, so they can identify and resolve emerging supply chain issues. “We are able to help customers search billions of records in seconds with Dell infrastructure and support, Cloudera’s Hadoop solution, and our knowledge of supply chain and quality issues,” says Karim Lokas, senior vice president of marketing and product strategy for Omneo, a division of the global enterprise manufacturing software firm Camstar Systems. “With the visibility provided by this solution, manufacturers can put out more consistent, better products and have less suspect product go out the door.”
Information security services. Dell SecureWorks is on deck 24 hours a day, 365 days a year, to help protect customer IT assets against cyberthreats. To meet its enormous data processing challenges, Dell SecureWorks deployed the Dell | Cloudera Apache Hadoop Solution, powered by Intel Xeon processors, to process billions of events every day. “We can collect and more effectively analyze data with the Dell | Cloudera Apache Hadoop Solution,” says Robert Scudiere, executive director of engineering for SecureWorks. “That means we’re able to increase our research capabilities, which helps with our intelligence services and enables better protection for our clients.” By moving to the Dell | Cloudera Apache Hadoop Solution, Dell SecureWorks can put more data into its clients’ hands so they can respond faster to security threats than before.
Taking Hadoop for a test-drive
How can IT decision makers determine the best way to capitalize on an investment in Apache Hadoop and big data initiatives? Dell has teamed up with Intel to offer the Dell | Intel Cloud Acceleration Program at Dell Solution Centers, giving decision makers a firsthand opportunity to see and test Dell big data solutions.
Experts at Dell Solution Centers located worldwide help bolster the technical skills of anyone new (and not so new) to Hadoop. Participants gain hands-on experience in a variety of areas, from optimizing performance for an application deployed on Dell servers to exploring big data solutions using Hadoop. At a Dell Solution Center, participants can attend a technical briefing with a Dell expert, take part in an architectural design workshop or build a proof of concept to comprehensively validate a big data solution and streamline deployment. Using an organization’s specific configurations and test data, participants can discover how a big data solution from Dell meets their business needs.
For more information, visit Dell Solution Centers.
Revving up cluster computing
The expansion of the Internet of Things (IoT) has led to a proliferation of connected devices and machines with embedded sensors that generate tremendous amounts of data. To derive meaningful insights quickly from this data, organizations need interactive processing and analytics, as well as simplified ecosystems and solution stacks.
Apache Spark is poised to become the underpinning technology driving the analysis of IoT data. Spark utilizes in-memory computing to deliver high-performance data processing. It enables applications in Hadoop clusters to run up to 100 times faster than Hadoop MapReduce in memory or 10 times faster on disk. Integrated with Hadoop, Spark runs on the Hadoop YARN (Yet Another Resource Negotiator) cluster manager and is designed to read any existing Hadoop data.
Within its computing framework, Spark is tooled with analytics capabilities that support interactive query, iterative processing, streaming data and complex analytics such as machine learning and graph analytics. Because Spark combines these capabilities in a single workflow out of the box, organizations can use one tool instead of traditional specialized systems for each type of analysis, streamlining their data analytics environments.
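The single-workflow style described above — chaining transformations on a dataset and then triggering an action — can be illustrated with a small, standard-library-only Python toy. The class below mimics the shape of calls such as map, filter and reduce; it is a conceptual sketch for illustration, not Spark’s actual API, and it runs in one process rather than distributing work in memory across a cluster.

```python
import functools

class ToyRDD:
    """A tiny, single-process stand-in for a Spark-style dataset."""

    def __init__(self, data):
        self._data = list(data)

    def map(self, fn):        # transformation: apply fn to each element
        return ToyRDD(fn(x) for x in self._data)

    def filter(self, pred):   # transformation: keep only matching elements
        return ToyRDD(x for x in self._data if pred(x))

    def reduce(self, fn):     # action: fold all elements into one value
        return functools.reduce(fn, self._data)

    def collect(self):        # action: materialize the results as a list
        return list(self._data)

# Chain transformations, then trigger an action -- the Spark-style workflow
readings = ToyRDD([3, 8, 12, 5, 20])
total = (readings
         .filter(lambda v: v >= 5)   # drop low readings
         .map(lambda v: v * 2)       # scale the rest
         .reduce(lambda a, b: a + b))
print(total)  # (8 + 12 + 5 + 20) * 2 = 90
```

The point of the sketch is the programming model: one dataset abstraction carries an analysis from filtering through aggregation, instead of handing data off between separate specialized systems.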
Learn More:
Hadoop@Dell.com