In this article, written by Jalem Raj Rohit, author of the book Julia Cookbook, we will cover the following recipes:
- Basic concepts of parallel computing
- Data movement
- Parallel maps and loop operations
- Channels
In this article, you will learn about performing parallel computing and using it to handle big data. Concepts such as data movement, sharded arrays, and the Map-Reduce framework are important to know in order to handle large amounts of data by computing on them with parallelized CPUs. All the concepts discussed in this article will help you build solid parallel computing and multiprocessing basics, including efficient data handling and code optimization.
Parallel computing is a way of carrying out many computations at the same time. This can be done by connecting multiple computers as a cluster and using their CPUs to carry out the computations.
This style of computation is used when handling large amounts of data and when running complex algorithms over significantly large data. The computations execute faster because multiple CPUs run them in parallel, each with direct access to its own RAM.
Julia has built-in support for parallel computing and multiprocessing, so these computations rarely require any external libraries.
First, start the Julia REPL in multiprocessing mode with two worker processes:
julia -p 2
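Alternatively, worker processes can be added from inside an already running session. Here is a minimal sketch using the standard addprocs() function (part of the Distributed standard library on Julia 1.x):
# Add two worker processes to the current Julia session
addprocs(2)
# nprocs() counts all processes (master plus workers); workers() lists the worker ids
nprocs()
workers()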
# Ask worker 2 to execute rand(2, 2); this returns immediately with a remote reference
task = remotecall(2, rand, 2, 2)
The preceding command returns a remote reference to the pending result, because the computation runs asynchronously on worker 2.
# Wait for the remote computation to finish and retrieve its result
fetch(task)
The preceding command returns the 2 x 2 matrix of random numbers that was generated on worker 2.
# Run a follow-up computation on worker 2, adding 5 to each element of the fetched result
task2 = @spawnat 2 5 .+ fetch(task)
# Read the element at row 1, column 1 of the remote result, directly on worker 2
remotecall_fetch(2, getindex, task2, 1, 1)
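Note that the argument order above follows the older Julia releases the book targets. On recent Julia versions (0.5 and later, including 1.x), the function comes before the worker id. A sketch of the equivalent calls, assuming a 1.x session:
# When Julia is started with -p, the Distributed standard library is loaded automatically
task = remotecall(rand, 2, 2, 2)           # run rand(2, 2) on worker 2
fetch(task)                                # retrieve the 2 x 2 matrix
task2 = @spawnat 2 5 .+ fetch(task)        # add 5 to each element, again on worker 2
remotecall_fetch(getindex, 2, task2, 1, 1) # read element [1, 1] of the remote result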
In parallel computing, data movements are quite common, and they should be minimized because each movement adds time and network overhead. In this recipe, we will see how data movement can be optimized to reduce latency as much as possible.
To get ready for this recipe, you need to have the Julia REPL started in the multiprocessing mode. This is explained in the Getting ready section of the preceding recipe.
# Create a 200 x 200 matrix of random numbers on the master process
mat = rand(200, 200)
# Spawn the squaring of the matrix on one of the available workers
exec_mat = @spawn mat^2
# Retrieve the result of the remote computation
fetch(exec_mat)
The preceding command returns the 200 x 200 matrix obtained by squaring mat on the worker.
# Create and square the random matrix in a single remote computation,
# so that the matrix never has to move between processes
mat = @spawn rand(200, 200)^2
fetch(mat)
The preceding command also returns a squared 200 x 200 matrix; this time, the random matrix was both created and squared on the worker.
In the first example, we construct a 200 x 200 matrix on the master process and then use the @spawn macro to execute the squaring for us on a worker. The @spawn macro picks one of the two running worker processes and carries out the computation there, which means mat has to be copied from the master to that worker first.
In the second example, you learned how to use the @spawn macro directly, without the extra initialization step: the matrix is created and squared on the same worker, so no data has to move between processes. The fetch() function retrieves the result of a spawned computation back to the master process. More on this will be covered in the following recipes.
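To see the cost of data movement directly, you can time both variants with the @time macro. This is a rough sketch; the exact numbers depend on your machine:
# Variant 1: mat lives on the master process and must be copied to the worker
mat = rand(200, 200)
@time fetch(@spawn mat^2)
# Variant 2: the matrix is both created and squared on the worker, so mat never moves
@time fetch(@spawn rand(200, 200)^2)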
In this recipe, you will learn a bit about the famous Map-Reduce framework and why it is one of the most important ideas in the domains of big data and parallel computing. You will learn how to parallelize loops and run reducing functions over their results across several CPUs and machines, using the parallel computing concepts you learned in the previous recipes.
Just like in the previous sections, Julia only needs to be running in multiprocessing mode to follow along with these examples. This can be done using the instructions given in the first section.
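The example below loads a file named count_heads.jl, whose contents are not shown in this excerpt. A minimal sketch of what it could contain, modeled on the parallel computing example in the Julia manual:
# count_heads.jl
# Count how many of n simulated coin flips come up heads
function count_heads(n)
    c = 0
    for i = 1:n
        c += rand(Bool)    # rand(Bool) contributes 1 for heads, 0 for tails
    end
    return c
end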
require("count_heads")
a = @spawn count_heads(100)
b = @spawn count_heads(100)
fetch(a) + fetch(b)
# Parallelize the loop across the workers (map) and combine the
# partial sums with the + operator (reduce);
# on Julia 1.0 and later, @parallel has been renamed @distributed
nheads = @parallel (+) for i = 1:200
    Int(rand(Bool))
end
The count_heads() function is a simple Julia function that adds up random bits over the loop iterations. It was created just to demonstrate Map-Reduce operations.
In the second step, we spawn two separate processes to execute the function, then fetch both results and add them up.
However, this is not a very clean way to carry out the parallel computation of functions and loops. Instead, the @parallel macro provides a better way: it lets the user parallelize the loop across the workers (the map step) and then combine the computations through a reduction operator (the reduce step). Together, these two steps make up the Map-Reduce operation.
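For per-element work, Julia also provides the pmap() function, a parallel map that distributes individual calls over the available workers. A minimal sketch reusing count_heads():
# Run count_heads(100) four times, with the calls spread across the workers;
# the result is a vector of the four partial counts
partial_counts = pmap(count_heads, [100, 100, 100, 100])
sum(partial_counts)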
Channels are like the background plumbing for parallel computing in Julia. They are the reservoirs from which the individual processes access their data.
The prerequisites are similar to those of the previous sections. This is mostly a theoretical section, so you just need to run the experiments on your own. For that, run your Julia REPL in multiprocessing mode.
Channels are shared queues with a fixed length. They are common data reservoirs for the running processes.
Channels are common data resources that multiple readers or workers can access. The data can be read through the fetch() function, which we already discussed in the previous sections.
The workers can also write to the channel through the put!() function. This means that the workers can add more data to the resource, which can be accessed by all the workers running a particular computation.
Closing a channel after usage is a good practice to avoid data corruption and unnecessary memory usage. It can be done using the close() function.
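As a minimal sketch of these ideas on recent Julia versions (note that take!() removes an item from the channel, while fetch() reads it without removing it):
# Create a channel that can hold up to 10 Int values
c = Channel{Int}(10)
# A worker or task writes data into the channel with put!()
put!(c, 1)
put!(c, 2)
# fetch() reads the first value without removing it
println(fetch(c))   # prints 1
# take!() removes and returns values in order
println(take!(c))   # prints 1
println(take!(c))   # prints 2
# Close the channel once it is no longer needed
close(c)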
In this article, we covered the basic concepts of parallel computing and the data movement that takes place across the network.
We also learned about parallel maps and loop operations, along with the famous Map-Reduce framework. At the end, we got a brief understanding of channels and how individual processes access their data through them.