
Reducing Cost in Big Data using Statistics and In-memory Technology – Part 1


The world is shifting from private, dedicated data centers to on-demand computing in the cloud. This shift moves the onus of cost from IT departments into the hands of developers, and as data sizes grow, computing costs grow with them. We have found that statistical algorithms, which give us roughly 95 percent accuracy, are faster and far more beneficial than waiting for exact results. The following are some common analytical queries that we have often come across in applications:

  • How many distinct elements are in the data set (that is, what is the cardinality of the data set)?
  • What are the most frequent elements (that is, the “heavy hitters” and “top elements”)?
  • What are the frequencies of the most frequent elements?
  • Does the data set contain a particular element (search query)?
  • Can you filter data based upon a category?

Statistical algorithms for quicker analytics

Frequently, statistical algorithms avoid storing the original data, replacing it with compact hashes, which eliminates a lot of network traffic. Let's get into the details of some of these algorithms that can help answer queries similar to those mentioned previously.

A Bloom filter is a data structure designed to tell you, rapidly and memory-efficiently, whether an element is present in a set. It is suitable in cases when we need to quickly filter items that are present in a set.
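As a rough illustration, here is a minimal Bloom filter sketch in Python. The bit-array size, the number of hash functions, and the use of hashlib are illustrative assumptions, not a production implementation:

```python
import hashlib


class BloomFilter:
    def __init__(self, size=10_000, num_hashes=7):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        # derive several bit positions from differently seeded hashes of the item
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means definitely absent; True means possibly present
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter()
bf.add("10.0.0.1")
print(bf.might_contain("10.0.0.1"))   # True
print(bf.might_contain("10.0.0.99"))  # False (with high probability)
```

Because a lookup can only produce false positives, never false negatives, a Bloom filter is a safe pre-filter: anything it rejects can be skipped without touching the underlying data.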

HyperLogLog is an approximate technique for computing the number of distinct entries in a set (its cardinality). It does this while using only a small amount of memory; for instance, to achieve 99 percent accuracy, it needs only 16 KB. When we need to count distinct elements in a dataset spread across a Hadoop cluster, we can compute the hashes on different machines, build the bit indexes locally, and combine them to compute the overall number of distinct elements. This eliminates the need to move the data across the network and thus saves us a lot of time.
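To make the idea concrete, the following is a simplified HyperLogLog-style sketch in Python. The register count, the hash function, and the small-range correction are illustrative assumptions; the important part is the merge step, which is what lets each machine build its sketch locally and ship only a few kilobytes of registers:

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, p=14):
        self.p = p
        self.m = 1 << p                  # 2**14 registers, roughly 16 KB
        self.registers = [0] * self.m

    def add(self, item):
        h = int(hashlib.sha1(str(item).encode()).hexdigest(), 16)
        idx = h & (self.m - 1)           # low p bits choose a register
        w = h >> self.p                  # remaining bits feed the rank
        rank = 1
        while w & 1 == 0 and rank < 64:  # position of the lowest set bit
            w >>= 1
            rank += 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # register-wise max: sketches built on different machines combine losslessly
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        harmonic = sum(2.0 ** -r for r in self.registers)
        estimate = alpha * self.m * self.m / harmonic
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            estimate = self.m * math.log(self.m / zeros)  # small-range correction
        return int(estimate)
```

Each worker would call `add` on its own partition of the data and send back only its registers; the driver then calls `merge` on the incoming sketches and `count` once, instead of shuffling raw records across the network.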

The Count–min sketch is a probabilistic, sub-linear-space streaming algorithm that can be used to summarize a data stream and obtain the frequency of its elements. It allocates a fixed amount of space to store count information, which does not grow over time even as more and more counts are updated. It is still able to provide useful estimated counts: the estimates are upper bounds, and their error is bounded by a small fraction of the total sum of all the counts stored.
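A minimal Count–min sketch in Python might look like the following; the width and depth parameters and the hashing scheme are arbitrary illustrative choices that trade memory for accuracy:

```python
import hashlib


class CountMinSketch:
    def __init__(self, width=2000, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _columns(self, item):
        # one column per row, derived from a differently seeded hash
        for row in range(self.depth):
            digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.width

    def add(self, item, count=1):
        for row, col in enumerate(self._columns(item)):
            self.table[row][col] += count

    def estimate(self, item):
        # collisions only inflate counts, so the minimum over rows is the tightest estimate
        return min(self.table[row][col] for row, col in enumerate(self._columns(item)))


cms = CountMinSketch()
for ip in ["10.0.0.1", "10.0.0.1", "10.0.0.2"]:
    cms.add(ip)
print(cms.estimate("10.0.0.1"))  # approximately 2
```

This is what makes it a natural fit for the "heavy hitters" and frequency queries listed earlier: the table stays the same size no matter how long the stream runs.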

Spark – a faster execution engine

Spark is a faster execution engine that can provide roughly 10 times the performance of MapReduce when combined with these statistical algorithms, giving us a huge benefit in terms of both cost and time savings. Spark gets most of its speed by constructing Directed Acyclic Graphs (DAGs) out of the job operations and by keeping intermediate data in memory, which makes subsequent reads faster. When using statistical algorithms, keeping the hashes in memory makes them work much faster.
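As a small hypothetical sketch of that in-memory behavior (the input path and the log contents are assumptions made for illustration), caching a dataset lets several downstream actions reuse it without going back to disk:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("log-analytics").getOrCreate()

# Hypothetical log location; one DataFrame with a single "value" column per line.
logs = spark.read.text("hdfs:///logs/2015-01-01/")
logs.cache()  # keep the data in memory so later stages avoid re-reading from disk

# Both actions reuse the cached data; Spark plans each job as a DAG of stages
# instead of writing intermediate results between steps as MapReduce would.
total_lines = logs.count()
error_lines = logs.filter(logs.value.contains("ERROR")).count()
```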

Case study

Let's say we have a continuous stream of user log data arriving at a rate of 4.4 GB per hour, and we need to analyze the distinct IPs in the logs on a daily basis. At my previous company, when MapReduce was used to process this data, it took about 6 hours to process one day's worth of data, roughly 106 GB. We had an AWS cluster of 50 spot instances and 4 on-demand instances performing the analysis at a cost of $150 per day.

Our system was then shifted to use Spark and HyperLogLog. This shift brought down the cost to $16.50 per day.
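As a sketch of how such a daily job could be expressed, Spark's built-in `approx_count_distinct` (which uses a HyperLogLog++ sketch internally) can replace an exact distinct count. The log path, the log format, and the IP-extraction pattern below are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import approx_count_distinct, regexp_extract

spark = SparkSession.builder.appName("daily-distinct-ips").getOrCreate()

# Hypothetical location: one day of raw text logs (~106 GB).
logs = spark.read.text("s3://logs/2015-01-01/*")

# Pull the IP out of each line; the regex is an assumption about the log format.
ips = logs.select(regexp_extract("value", r"^(\d+\.\d+\.\d+\.\d+)", 1).alias("ip"))

# approx_count_distinct aggregates a HyperLogLog++ sketch per partition and merges
# the sketches, so only small summaries cross the network; rsd is the target error.
daily_distinct = ips.agg(approx_count_distinct("ip", rsd=0.01).alias("distinct_ips"))
daily_distinct.show()
```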

To summarize, about 3.1 TB of data was processed every month at a cost of $495, compared with roughly $4,500 per month on the original MapReduce system without the statistical algorithm in place.

Further reading

In the second part of this two-part blog series, we will discuss two tools in depth: Apache Spark and Apache Pig. We will take a look at how Pig combined with Spark makes existing ETL pipelines 100 times faster, and we will further our understanding of how statistical perspectives positively affect data analytics.

About the author

Praveen Rachabattuni is a tech lead at Sigmoid Analytics, a company that provides a real-time streaming and ETL framework on Apache Spark. Praveen is also a committer to Apache Pig.

