Home Data Reducing Cost in Big Data using Statistics and In-memory Technology – Part...

Reducing Cost in Big Data using Statistics and In-memory Technology – Part 1

July 3, 2015 - 12:00 am

1142

4 min read

The world is shifting from private, dedicated data centers to on-demand computing in the cloud. This shift moves the onus of cost from the hands of IT companies to the hands of developers. As your data sizes start to rise, the computing cost grows linearly with it. We have found that using statistical algorithms gives us a 95 percent accuracy rate, is faster, and is a lot more beneficial than waiting for the exact results. The following are some common analytical queries that we have often come across in applications:

How many distinct elements are in the data set (that is, what is the cardinality of the data set)?
What are the most frequent elements (that is, the “heavy hitters” and “top elements”)?
What are the frequencies of the most frequent elements?
Does the data set contain a particular element (search query)?
Can you filter data based upon a category?

Statistical algorithms for quicker analytics

Frequently, statistical algorithms avoid storing the original data, replacing it with hashes that eliminate a lot of network. Let’s get into the details of some of these algorithms that can help answer queries similar to those mentioned previously.

A Bloom filter is a data structure designed to tell you, rapidly and memory-efficiently, whether an element is present in a set. It is suitable in cases when we need to quickly filter items that are present in a set.

HyperLogLog is an approximate technique for computing the number of distinct entries in a set (cardinality). It does this while using only a small amount of memory. For instance, to achieve 99 percent accuracy, it needs only 16 KB. In cases when we need to count distinct elements in a dataset spread across a Hadoop cluster, we could compute the hashes on different machines, build the bit index, and combine the bit index to compute the overall distinct elements. This eliminates the need of moving the data across the network and thus saves us a lot of time.

The Count–min sketch is a probabilistic sub-linear space streaming algorithm that can be used to summarize a data stream to obtain the frequency of elements. It allocates a fixed amount of space to store count information, which does not vary over time even as more and more counts are updated. Nevertheless, it is able to provide useful estimated counts, because the accuracy scales with the total sum of all the counts stored.

Spark – a faster execution engine

Spark is a faster execution engine that provides 10 times the performance over MapReduce when combined with these statistical algorithms. Using Spark with statistical algorithms gives us a huge benefit both in terms of cost and time savings. Spark gets most of its speed by constructing Directed Acyclic Graphs (DAGs) out of the job operations and uses memory to save intermediate data, thus making the reads faster. When using statistical algorithms, saving the hashes in memory makes the algorithms work much faster.

Case study

Let’s say we have a continuous stream of user log data coming every hour at a rate of 4.4 GB per hour, and we need to analyze the distinct IPs in the logs on a daily basis. At my old company, when MapReduce was used to process the data, it was taking about 6 hours to process one day’s worth of data at a size of 106 GB. We had an AWS cluster consisting of 50 spot instances and 4 on-demand instances running to perform the analysis at a cost of $150 per day.

Our system was then shifted to use Spark and HyperLogLog. This shift brought down the cost to $16.50 per day.

To summarize, we had a 3.1 TB stream of data processed every month at a cost of $495, which was costing about $4,500 on the original system using MapReduce without the statistical algorithm in place.

About the author

Praveen Rachabattuni is a tech lead at Sigmoid Analytics, a company that provides a real-time streaming and ETL framework on Apache Spark. Praveen is also a committer to Apache Pig.

Top 6 Cybersecurity Books from Packt to Accelerate Your Career

Your Quick Introduction to Extended Events in Analysis Services from Blog…

Logging the history of my past SQL Saturday presentations from Blog…

Storage savings with Table Compression from Blog Posts – SQLServerCentral

Daily Coping 31 Dec 2020 from Blog Posts – SQLServerCentral

Learning Essential Linux Commands for Navigating the Shell Effectively

Exploring the Strategy Behavioral Design Pattern in Node.js

How to integrate a Medium editor in Angular 8

Implementing memory management with Golang’s garbage collector

How to create sales analysis app in Qlik Sense using DAR…

Reducing Cost in Big Data using Statistics and In-memory Technology – Part 1

Statistical algorithms for quicker analytics

Spark – a faster execution engine

Case study

Further reading

About the author

LEAVE A REPLY Cancel reply

Must Read in Cloud & Networking

Top life hacks for prepping for your IT certification exam

Learning Essential Linux Commands for Navigating the Shell Effectively

ServiceNow Partners with IBM on AIOps from DevOps.com

Must Read in Data

Learn Transformers for Natural Language Processing with Denis Rothman

Scientific Analysis of Donald Trump’s Tweets on COVID-19 with Transformers

Distributed training in TensorFlow 2.x

Interviews

Learn Transformers for Natural Language Processing with Denis Rothman

Clean Coding in Python with Mariano Anaya

Bringing AI to the B2B world: Catching up with Sidetrade CTO Mark Sheldon [Interview]

On Adobe InDesign 2020, graphic designing industry direction and more: Iman Ahmed, an Adobe Certified Partner and Instructor [Interview]

Is DevOps experiencing an identity crisis? [Interview]

MobilePro

datapro

Programming

Subscribe to our newsletter