Wednesday, February 10, 2021

Oracle Exadata Cloud vs. AWS and Spark for big data analytics

Oracle Database Tutorial and Material, Oracle Database Learning, Oracle Database Exam Prep, Database Preparation

This blog takes a closer look at why the Moat Reach team moved their data warehouse from Spark on AWS to Exadata on OCI.

What is Moat Reach?

Moat Reach, by Oracle Data Cloud, is a cross-platform TV and digital measurement solution that measures how many people and households your campaigns reach, at what frequency, and how relevant those people are to your marketing objectives.

It integrates Moat Analytics’ digital impression data with TV advertisement impression data from iSpot.tv, which uses 14 million opted-in TV devices, against the people and households in the Oracle ID Graph. This data allows marketers to measure valid and viewable impressions for the audiences that matter to them across TV and digital channels.

In marketing terminology, an impression refers to the number of times an advertisement was seen when browsing the internet on a web browser, watching a sports game on TV, playing a game in an app, and so on. Examples of marketing channels include browsers, apps, traditional cable TV, and streaming video.

Oracle Database Tutorial and Material, Oracle Database Learning, Oracle Database Exam Prep, Database Preparation

For example, a consumer packaged goods (CPG) advertiser can use Moat Reach to measure the unique, people-based reach of its new soft drink campaign using the same metrics across all of its TV and digital channels. Even better, it can also measure its audience based on custom or first-party segments, like soft drink buyers or frequent shoppers, and analyzes key demographics, like age and gender, through each channel.

By comparing the reach, frequency, and viewability of its ads through each channel, the CPG advertiser might discover that it was reaching more of its intended audience through celebrity magazine websites than DIY cable channels. It can then adjust its media strategy to invest more in those channels.

After years of cross-channel confusion, Moat Reach gives marketers a self-service tool to measure the consolidated reach and frequency of their campaigns, so they can see whether they reached their intended audiences through each platform, publisher, and network on a consolidated and de-duplicated basis.

Version 1: Apache Spark on Amazon Web Services (AWS)


Moat Reach answers the impact of the marketing campaign on an individual across TV and digital media. To do that, it counts distinct records for that individual in the data set from Moat and iSpot.tv.

The Moat team developed their first data warehouse solution using Apache Spark, Qubole, and Scala on AWS infrastructure. This solution operated on a dataset with about 97 billion impressions from Moat and 49 billion impressions from iSpot.tv.

The dataset in this first version architecture spanned 34 clients, 275 thousand campaigns, and two years of data. The Moat team loaded this data from object storage into the Apache Spark cluster and ran analytics using Qubole and Scala. This data was precomputed into dimensions to assist with query processing.

Although this method has limitations, other analytics companies in the industry use similar technologies and the same methodology of precomputing the dimensions because it was too hard a problem to solve doing it in real time.

Getting answers out of this v1 data warehousing solution took roughly seven days from when a user asked the question to the return of a response. This solution also costs a lot of money, with each nightly run costing about $5–8K USD, $100–160K USD for the monthly runs, and about $1.2–2.8M USD annually.

Moat Reach v1 relied on a snapshot model where the Moat team ingested impression event, id-graph, demographic, and audience data into a set of parquet tables on AWS S3. They then ran several large Spark jobs to populate reach and frequency metrics in a web UI. The metrics were stored in a database by the nightly job.

The Moat team wrote the job to quickly return user requests without computing any more information. This approach allowed them a responsive site but at the cost of only computing a finite set of metrics.

This lack of flexibility frustrated customers. For example, they computed the number of 18–30 year-olds who saw an ad, the number of women who saw an ad, how many mobile views, or five-second views an ad received. But changing the range of women to 18–29 year olds would have taken at least a day.

A cross-section of 18-year-old women who saw the ad for five seconds on their mobile phones with all other cross-sections was similarly financially infeasible. It would be too expensive to run a job every night for every value of every dimension across every client and campaign. The Moat team set a goal to gain insights into and across all dimensions of the Moat Reach data.

Version 2: Oracle Database on Oracle Cloud Infrastructure and Exadata


While the v1 Moat Reach service was already differentiated in the market because of its uniquely consolidated insights, the team felt there was an opportunity to be more comprehensive and offer more customized insights for customers in near real time. The Moat team faced the challenge of finding technology with enough storage to handle terabytes of data—and over a petabyte a year—and make much faster queries than the existing architecture.

Oracle Cloud Infrastructure (OCI) is the only provider to offer Oracle Database and Oracle Exadata as a service. This offering has vast amounts of space and equally impressive compute capabilities. In the on-demand architecture, they only had to ingest impression event, id-graph, demographic, and audience data into their Exadata appliance, and then let an API do the rest. When users visit the site, an API queries the database for counts, based on the selected filters. Data science methodology is applied to estimate total reach and frequency from the observed data. This architecture extends the product use case from broad, top-line insights to precise and deeply detailed metrics that clients can use to optimize their ad campaigns’ reach in real-time.

To query billions of rows quickly, they used Exadata’s probabilistic data structures and hyperloglog estimation functions to obtain bounded estimates of the needed inputs for their data science methodology. To further optimize performance, they took advantage of Exadata’s tuning abilities and support team. By working with Oracle engineering staff, they tuned Exadata to be more performant than its competitors AWS Redshift and Clickhouse to run a baseline set of queries meant to mimic the live production system requirements. In-house Oracle expertise and the wide availability of Oracle Database and SQL skill sets were other factors to encourage them toward OCI and Exadata.

The following table compares the two approaches:

V1 solution V2 solution 
Warehouse  On-demand
Spark and AWS  Exadata on OCI 
Precomputed dimensions for reporting  Flexible dimensions for reporting 
Customer answers in days  Custom answers in seconds 
Limited reports  Unlimited number of custom reports 
Long development cycles  Rapid development cycles 
Delayed innovation turnaround  Real-time access for Innovation 
$$$  $$ 

While improved customer experience was the Moat team’s top priority, cost was also a factor in their decision to use Exadata over the V1 architecture and AWS’s data warehouse, Redshift. An AWS Redshift instance cost roughly 2 million dollars for a baseline of 912 CPUs and 940 TB of storage. For the same cost, Exadata on OCI offered much faster performance and real-time data queries with an optimized database shape with 150 CPUs, 23 TB of RAM, persistent memory for an in-memory database, and close to 1 PB of flash and disk storage for historical data. Exadata features like smart scan, storage indexes, and smart flash cache were some of the factors for the faster query performance.

V1 solution Alternative  V2 solution 
Warehouse Warehouse On-demand
Spark and AWS  AWS Redshift  Exadata on OCI
Customer answers in days  Customer answers in hours  Customer answers in seconds 
Limited reports  Less limited reports  Unlimited number of custom reports 
$$$  $$  $$ 

Source: oracle.com

Related Posts

0 comments:

Post a Comment