
This article by Simon Chapple, author of the book Mastering Parallel Programming with R, helps us understand the intricacies of parallel computing. Here, we’ll gaze into Delores’ Crystal Ball at what the future holds for massively parallel computation, which will likely have a significant impact on the world of R programming, particularly when applied to big data.


Three steps to successful parallelization

The following three-step distilled guidance is intended to help you decide what form of parallelism might be best suited for your particular algorithm/problem and summarizes what you learned throughout this article. Necessarily, it applies a level of generalization, so approach these guidelines with due consideration:

  1. Determine the type of parallelism that may best apply to your algorithm.

    Is the problem you are solving more computationally bound or data bound? If the former, then your problem may be amenable to GPUs; if the latter, then your problem may be more amenable to cluster-based computing; and if your problem requires a complex processing chain, then consider using the Spark framework.

    Can you divide the problem data/space to achieve a balanced workload across all processes, or do you need to employ an adaptive load-balancing scheme, such as a task farm-based approach? (A minimal sketch of both appears after this list.)

    Does your problem/algorithm naturally divide spatially? If so, consider whether a grid-based parallel approach can be used.

    Perhaps your problem is on an epic scale? If so, maybe you can develop your message passing-based code and run it on a supercomputer.

    Is there an implied sequential dependency between tasks; that is, do processes need to cooperate and share data during their computation, or can each divided task be executed entirely independently of the others?

    A large proportion of parallel algorithms typically have a work distribution phase, a parallel computation phase, and a result aggregation phase. To reduce the overhead of the startup and close-down phases, consider whether a tree-based approach to work distribution and result aggregation may be appropriate in your case.

  2. Ensure that the computational core of your algorithm is optimally implemented.

    Profile your code in serial to determine whether there are any bottlenecks, and target these for improvement (a short profiling sketch appears after this list).

    Is there an existing parallel implementation similar to your algorithm that you can use directly or adapt?

    Review CRAN Task View: High-Performance and Parallel Computing with R at https://cran.r-project.org/web/views/HighPerformanceComputing.html; in particular, take a look at the subsection entitled Parallel Computing: Applications, a snapshot of which at the time of writing can be seen in the following figure:

    Figure 1: CRAN provides various parallelized packages you can use in your own program.

  3. Test and evaluate the parallel efficiency of your implementation.

    Use the P-estimated form of Amdahl’s Law to predict the level of scalability you can achieve (a worked sketch appears after this list).

    Test your algorithm at varying amounts of parallelism, particularly odd numbers that trigger edge-case behaviors, and don’t forget to run with just a single process. Running with more processes than processors can help to trigger lurking deadlock/race conditions (this is most applicable to message passing-based implementations).

    Where possible, to reduce overhead, ensure that your method of deployment/initialization places the data being consumed locally to each parallel process.
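
To make the load-balancing question in step 1 concrete, here is a minimal sketch, using only base R’s parallel package, that contrasts a static division of work with a simple task farm; the simulate_task() function and the task sizes are invented purely for illustration:

# Contrast a static split with a load-balanced task farm, where each task
# is handed to the next free worker. Task runtimes are deliberately uneven.
library(parallel)

simulate_task <- function(n) {   # runtime grows with n, so workloads differ
  Sys.sleep(n / 100)
  n^2
}

tasks <- sample(1:50, 40, replace = TRUE)
cl <- makeCluster(4)

# Static division of the task list: fine when workloads are well balanced
static <- clusterApply(cl, tasks, simulate_task)

# Task farm: each worker pulls the next task as soon as it becomes free,
# adapting automatically to uneven runtimes
farmed <- clusterApplyLB(cl, tasks, simulate_task)

stopCluster(cl)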
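
For the profiling advice in step 2, R’s built-in sampling profiler is often all you need to find serial bottlenecks; the slow_step()/fast_step() workload below is a stand-in invented for illustration:

# Profile a serial workload and report where the time is spent
slow_step <- function(n) sort(rnorm(n))     # deliberately the expensive part
fast_step <- function(n) mean(rnorm(n))

Rprof("profile.out")                        # start collecting stack samples
for (i in 1:20) {
  slow_step(1e6)
  fast_step(1e3)
}
Rprof(NULL)                                 # stop profiling
summaryRprof("profile.out")$by.self         # time attributed to each function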
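
Finally, for step 3, one way to apply the P-estimated form of Amdahl’s Law is to infer the parallelizable fraction P from a single measured speedup and then extrapolate to higher process counts; the measured figures below are invented for illustration:

# Amdahl's Law: speedup(N) = 1 / ((1 - P) + P / N), where P is the
# parallelizable fraction of the work and N is the number of processes
amdahl_speedup <- function(P, N) 1 / ((1 - P) + P / N)

# Suppose we measured a 3.4x speedup when running on 4 processes;
# solving Amdahl's Law for P gives the estimate:
measured_speedup <- 3.4
N_measured <- 4
P_est <- (1 / measured_speedup - 1) / (1 / N_measured - 1)

# Predicted scalability at higher process counts
N <- c(8, 16, 64, 1024)
data.frame(processes = N, predicted_speedup = amdahl_speedup(P_est, N))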

What does the future hold?

Obviously, this final section is at considerable risk of “crystal ball gazing” and getting it wrong. However, there are a number of clear directions in which we can see both hardware and software developing, and they make it clear that parallel programming will play an ever more important role in our computational future. Besides, it has now become critical for us to be able to process vast amounts of information within a short window of time in order to ensure our own individual and collective safety. For example, we are experiencing increased momentum towards significant climate change and extreme weather events and will therefore require increasingly accurate weather prediction to help us deal with them; this will only be possible with highly efficient parallel algorithms.

In order to gaze into the future, we need to look back at the past. The hardware technology available to parallel computing has evolved at a phenomenal pace through the years. The levels of performance that can be achieved today by single chip designs are truly staggering in terms of recent history.

The history of HPC

For an excellent infographic review of the development of computing performance, I would urge you to visit the following web page:

http://pages.experts-exchange.com/processing-power-compared/

This beautifully illustrates how, for example, an iPhone 4 released in 2010 has near-equivalent performance to the Cray 2 supercomputer from 1985 of around 1.5 gigaflops, and the Apple Watch released in 2015 has around twice the performance of the iPhone 4 and Cray 2!

While chip manufacturers have managed to maintain the famous Moore’s law that predicts transistor count doubling every two years, we are now at 14 nanometers (nm) in chip production, giving us around 100 complex processing cores in a single chip. In July 2015, IBM announced a prototype chip at 7 nm (1/10,000th the width of a human hair). Some scientists suggest that quantum tunneling effects will start to have an impact at 5 nm (which Intel expects to bring to market by 2020), although a number of research groups have demonstrated individual transistors as small as 1 nm in the lab using materials such as graphene. All of this suggests that the placement of 1,000 independent high-performance computational cores, together with sufficient high-speed cache memory, inside a single-chip package comparable in size to today’s chips may well be possible within the next 10 years.

NVIDIA and Intel are arguably at the forefront of dedicated HPC chip development, with their respective offerings used in the world’s fastest supercomputers and also available for your desktop computer. NVIDIA produces Tesla, a family of GPU-based accelerators; the current K80 peaks at 1.87 teraflops double precision and 5.6 teraflops single precision, utilizing 4,992 cores (dual processor) and 24 GB of on-board RAM. Intel produces Xeon Phi, the collective family brand name for its Many Integrated Core (MIC) architecture; the new Knights Landing is expected to peak at 3 teraflops double precision and 6 teraflops single precision, utilizing 72 cores (single processor) and 16 GB of highly integrated on-chip fast memory when it is released, likely in fall 2015.

The successors to these chips, namely NVIDIA’s Volta and Intel’s Knights Hill, will be the foundation for the next generation of American $200 million supercomputers in 2018, delivering around 150 to 300 petaflops of peak performance (around 150 million iPhone 4s), as compared to China’s TIANHE-2, ranked as the fastest supercomputer in the world in 2015, with a peak performance of around 50 petaflops from 3.1 million cores.

At the other extreme, within the somewhat smaller and less expensive world of mobile devices, most currently use between two and four cores, though mixed multicore capability such as ARM’s big.LITTLE octacore makes eight cores available. However, this is already on the increase with, for example, MediaTek’s new SoC MT6797, which has 10 main processing cores split into a pair and two groups of four cores, each with different clock speeds and power requirements, to serve as the basis for the next generation of mobile phones. Top-end mobile devices therefore exhibit a rich heterogeneous architecture with mixed power cores, separate sensor chips, GPUs, and Digital Signal Processors (DSPs) to direct different aspects of workload to the most power-efficient component. Mobile phones increasingly act as the communications hub and signal-processing gateway for a plethora of additional devices, such as biometric wearables and the rapidly expanding number of ultra-low-power Internet of Things (IoT) sensing devices smartening all aspects of our local environment.

While we are still some way from running R itself natively on mobile devices, the time will come when we seek to harness the distributed computing power of all our mobile devices. In 2014 alone, around 1.25 billion smartphones were sold. That’s a lot of crowd-sourced compute power, and it potentially far outstrips any dedicated supercomputer on the planet, either existing or planned.

The software that enables us to utilize parallel systems, which, as we noted, are increasingly heterogeneous, continues to evolve. In this book, we have examined how you can utilize OpenCL from R to gain access to both the GPU and CPU, making it possible to perform mixed computation across both components, exploiting the particular strengths of each for certain types of processing. Indeed, another related initiative, Heterogeneous System Architecture (HSA), which enables even lower-level access to the spectrum of processor capabilities, may well gain traction over the coming years and help promote the uptake of OpenCL and its counterparts.

HSA Foundation

The HSA Foundation was founded by a cross-industry group led by AMD, ARM, Imagination, MediaTek, Qualcomm, Samsung, and Texas Instruments. Its stated goal is to help support the creation of applications that seamlessly blend scalar processing on the CPU, parallel processing on the GPU, and optimized processing on the DSP via high-bandwidth shared memory access, enabling greater application performance at low power consumption. To this end, the foundation is defining key interfaces for parallel computation using CPUs, GPUs, DSPs, and other programmable and fixed-function devices, thus supporting a diverse set of high-level programming languages and creating the next generation in general-purpose computing. You can find the recently released version 1.0 of the HSA specification at the following link:

http://www.hsafoundation.com/html/HSA_Library.htm

Hybrid parallelism

As a final wrap-up, I thought I would show how you can overcome the inherently single-threaded nature of R even further and demonstrate a hybrid approach to parallelism that combines, within a single R program, two of the different techniques we covered previously. We’ve also discussed how heterogeneous computing is potentially the way of the future.

This example refers to the code we would develop to utilize MPI through pbdMPI together with ROpenCL, enabling us to exploit both the CPU and GPU simultaneously. While this is a slightly contrived example, and both devices compute the same dist() function, the intention is to show you just how far you can take things with R to get the most out of all your available compute resources.

Basically, all we need to do is to top and tail our implementation of the dist() function in OpenCL with the appropriate pbdMPI initialization and termination and run the script with mpiexec on two processes, as follows:

# Initialise both ROpenCL and pbdMPI
require(ROpenCL)
library(pbdMPI, quietly = TRUE)
init()

# Select device based on my MPI rank
r <- comm.rank()
if (r == 0) { # rank 0 uses the GPU
  device <- 1
} else {      # all other ranks use the CPU
  device <- 2
}
... # elided: OpenCL context/kernel setup for the selected device
# Execute the OpenCL dist() function on my assigned device
comm.print(sprintf("%d executing on device %s", r, getDeviceType(deviceID)),
           all.rank = TRUE)
res <- teval(openclDist(kernel))
comm.print(sprintf("%d done in %f secs", r, res$Duration), all.rank = TRUE)
finalize()
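
Assuming the complete script is saved as, say, hybrid_dist.R (a filename chosen here just for illustration), it is launched across two processes in the usual pbdMPI manner:

mpiexec -np 2 Rscript hybrid_dist.R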

This is simple and very effective!

Summary

In this article, we gazed into Delores’ Crystal Ball and saw the prospects for the combination of heterogeneous compute hardware that is here today and that will expand in capability even further in the future, not only in our supercomputers and laptops but also in our personal devices. Parallelism is the only way these systems can be utilized effectively.

As the volume of new quantified self- and environmentally-derived data increases and the number of cores in our compute architectures continues to rise, so does the importance of being able to write parallel programs to make use of it all—job security for parallel programmers looks good for many years to come!
