In this article by Akhil Wali, the author of the book Mastering Clojure, we will study reducers, an abstraction of collections that is quite orthogonal to viewing collections as sequences.

Sequences and laziness are a great way of handling collections. The Clojure standard library provides several functions to handle and manipulate sequences. However, abstracting a collection as a sequence has an unfortunate consequence: any computation that is performed over all the elements of a sequence is inherently sequential. Also, all standard sequence functions create a new collection that is similar to the collection they are passed. Interestingly, it is often useful to perform a computation over a collection without creating a similar collection, even as an intermediary result. For example, we frequently need to reduce a given collection to a single value through a series of transformations applied in an iterative manner. This sort of computation does not necessarily require the intermediary result of each transformation to be saved.
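
For instance, summing the even values obtained by incrementing a range of numbers creates intermediate sequences that exist only to feed the next step and are then discarded:

user> (reduce + (filter even? (map inc (range 5))))
6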

A consequence of iteratively computing values from a collection is that we cannot parallelize the computation in a straightforward way. Modern map-reduce frameworks handle this kind of computation by pipelining the elements of a collection through several transformations in parallel, and finally combining the results into a single result. Of course, the result could be a new collection as well. A drawback of this methodology is that it produces concrete collections as the intermediate result of each transformation, which is rather wasteful. For example, if we want to filter values from a collection, a map-reduce strategy would require creating empty collections to represent the values that are left out, and the reduction step that produces the final result must then do additional work to discard them. This incurs unnecessary memory allocation. Hence, there’s scope for optimizing these kinds of computations.

This brings us to the notion of treating computations over collections as reducers to attain better performance. Of course, this doesn’t mean that reducers are a replacement for sequences. Sequences and laziness are great for abstracting computations that create and manipulate collections, while reducers are a specialized, high-performance abstraction for computations in which a collection is piped through several transformations and finally combined to produce a result. Reducers achieve a performance gain in the following ways:

  • Reducing the amount of memory allocated to produce the desired result
  • Parallelizing the process of reducing a collection into a single result, which could be an entirely new collection

The clojure.core.reducers namespace provides several functions for processing collections using reducers. Let’s now examine how reducers are implemented and also study a few examples that demonstrate how reducers can be used.

Using reduce to transform collections

Sequences and functions that operate on sequences preserve the sequential ordering between the constituent elements of a collection. Lazy sequences avoid the unnecessary realization of elements in a collection until they are required for a computation, but the realization of these values is still performed in a sequential manner. However, this characteristic of sequential ordering may not be desirable for all computations performed over a collection. For example, it’s not possible to map a function over a vector and then lazily realize values in the resulting collection by random access, since the map function converts the collection it is supplied into a sequence. Also, functions such as map and filter are lazy, but still sequential by nature.
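
We can observe this conversion at the REPL; mapping a function over a vector yields a lazy sequence rather than a new vector, so efficient random access is lost:

user> (vector? (map inc [1 2 3]))
false
user> (seq? (map inc [1 2 3]))
true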

Consider the unary function shown in Example 3.1, which we intend to map over a given vector. The function must compute a value from the one it is supplied, and also perform a side effect so that we can observe its application over the elements in a collection:

Example 3.1. A simple unary function

(defn square-with-side-effect [x]
  (do
    (println (str "Side-effect: " x))
    (* x x)))

The square-with-side-effect function defined here simply returns the square of a number x using the * function. This function also prints the value of x using a println form whenever it is called. Suppose this function is mapped over a given vector. The resulting collection will have to be realized completely if a computation has to be performed over it, even if all the elements from the resulting vector are not required. This can be demonstrated as follows:

user> (def mapped (map square-with-side-effect [0 1 2 3 4 5]))
#'user/mapped
user> (reduce + (take 3 mapped))
Side-effect: 0
Side-effect: 1
Side-effect: 2
Side-effect: 3
Side-effect: 4
Side-effect: 5
5

As previously shown, the mapped variable contains the result of mapping the square-with-side-effect function over a vector. If we try to sum the first three values in the resulting collection using the reduce, take, and + functions, all the values in the [0 1 2 3 4 5] vector are printed as a side effect. This means that the square-with-side-effect function was applied to all the elements in the initial vector, despite the fact that only the first three elements were actually required by the reduce form. Of course, this could be avoided by converting the vector into an unchunked lazy sequence before we map the square-with-side-effect function over it. But then, we lose the ability to efficiently access elements in a random order in the resulting collection.

To dive deeper into why this actually happens, you first need to understand how the standard map function is actually implemented. A simplified definition of the map function is shown here:

Example 3.2. A simplified definition of the map function

(defn map [f coll]
  (cons (f (first coll))
        (lazy-seq (map f (rest coll)))))

The definition of map in Example 3.2 is a simplified and rather incomplete one, as it doesn’t check for an empty collection and cannot be used over multiple collections. That aside, this definition of map does indeed apply a function f to all the elements in a coll collection. This is implemented using a composition of the cons, first, rest, and lazy-seq forms. The implementation can be interpreted as, “apply the f function to the first element in the coll collection, and then map f over the rest of the collection in a lazy manner.” An interesting consequence of this implementation is that the map function has the following characteristics:

  • The ordering among the elements in the coll collection is preserved.
  • This computation is performed recursively.
  • The lazy-seq form is used to perform the computation in a lazy manner.
  • The use of the first and rest forms indicates that coll must be a sequence, and the cons form will also produce a result that is a sequence. Hence, the map function accepts a sequence and builds a new one.

Another interesting characteristic of lazy sequences is that they are realized in chunks. This means that a lazy sequence is realized in chunks of 32 elements each, as an optimization, when the values in the sequence are actually required. Sequences that behave this way are termed chunked sequences. Of course, not all sequences are chunked, and we can check whether a given sequence is chunked using the chunked-seq? predicate. The range function returns a chunked sequence, as shown here:

user> (first (map #(do (print \!) %) (range 70)))
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
0
user> (nth (map #(do (print \!) %) (range 70)) 32)
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
32

Both statements in the previous output select a single element from a sequence returned by the map function. The function passed to map in both statements prints the ! character and returns the value supplied to it. In the first statement, the first 32 elements of the resulting sequence are realized even though only the first element is required. Similarly, the second statement realizes the first 64 elements of the resulting sequence when the element at index 32 is obtained using the nth function.
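
We can also apply the chunked-seq? predicate mentioned earlier to other common sequence types. For instance, the seq of a vector is chunked, whereas the seq of a list is not:

user> (chunked-seq? (seq [1 2 3]))
true
user> (chunked-seq? (seq '(1 2 3)))
false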

Chunked sequences have been an integral part of Clojure since version 1.1.

However, none of the sequence properties described above are required to transform a given collection into a result that is not a sequence. To handle such computations efficiently, we cannot build on functions that return sequences, such as map and filter. Incidentally, the reduce function does not necessarily produce a sequence, and it has a couple of other interesting properties:

  • The reduce function actually lets the collection it is passed define how it is computed over, or reduced. Thus, reduce is collection independent.
  • Also, the reduce function is versatile enough to build a single value or an entirely new collection as well. For example, using reduce with the * or + function will create a single-valued result, while using it with the cons or concat function can create a new collection as the result. Thus, reduce can build anything.

A collection is said to be reducible if it defines how it can be reduced to a single result. The binary function that is used by the reduce function along with a collection is also termed as a reducing function. A reducing function requires two arguments: one to represent the result of the computation so far, and another to represent an input value that has to be combined into the result. Several reducing functions can be composed into one, which effectively changes how the reduce function processes a given collection. This composition is done using reducers, which can be thought of as a shortened version of the term reducing function transformers.
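
For instance, both conj and + can serve as reducing functions: each takes the result accumulated so far and the next input value, and combines the two into a new result:

user> (reduce conj [] [1 2 3])
[1 2 3]
user> (reduce + 0 [1 2 3])
6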

The trade-off between using sequences and laziness and using reducers to perform a given computation is nicely captured by Rich Hickey’s famous pie-maker analogy. Suppose a pie-maker has been supplied a bag of apples with the intent of reducing the apples to a pie. A couple of transformations are needed to perform this task. First, the stickers on all the apples have to be removed; that is, we map a function to “take the sticker off” over the apples in the collection. Also, all the rotten apples have to be removed, which is analogous to using the filter function to remove elements from a collection. Instead of performing this work herself, the pie-maker delegates it to her assistant. The assistant could first take the stickers off all the apples, thus producing a new collection, and then take out the rotten apples to produce another new collection, which illustrates the use of lazy sequences. But then, the assistant would be doing unnecessary work by removing the stickers off the rotten apples, which will have to be discarded later anyway. On the other hand, the assistant could delay this work until the actual reduction of the processed apples into a pie is performed. Once the work actually needs to be done, the assistant composes the two tasks of mapping and filtering over the collection of apples, thus avoiding any unnecessary work. This case depicts the use of reducers for composing and transforming the tasks needed to effectively reduce a collection of apples to a pie. By using reducers, we create a recipe of tasks to reduce a collection of apples to a pie and delay all processing until the final reduction, instead of dealing with collections of apples as the intermediary result of each task.

The following namespace must be included in your namespace declaration for the upcoming examples:

(ns my-namespace
  (:require [clojure.core.reducers :as r]))

The clojure.core.reducers namespace requires Java 6 with the jsr166y.jar or Java 7+ for fork/join support. 

Let’s now briefly explore how reducers are actually implemented. Functions that operate on sequences use the clojure.lang.ISeq interface to abstract the behavior of a collection. In the case of reducers, the common interface that we must build upon is that of a reducing function. As mentioned earlier, a reducing function is a two-arity function in which the first argument is the result produced so far and the second argument is the current input, which has to be combined with the first argument. The process of performing a computation over a collection and producing a result can be generalized into three distinct cases, described as follows and illustrated in the brief REPL session after this list:

  • The computation produces a new collection with the same number of elements as the collection it is supplied. This one-to-one case is analogous to using the map function.
  • The computation shrinks the supplied collection by removing elements from it. This can be done using the filter function.
  • The computation could also be expansive, in which case it produces a new collection that contains an increased number of elements. This is like what the mapcat function does.
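
A quick REPL session demonstrates these three cases using the standard sequence functions:

user> (map inc [1 2 3])        ; one-to-one
(2 3 4)
user> (filter even? [1 2 3])   ; shrinking
(2)
user> (mapcat range [1 2 3])   ; expansive
(0 0 1 0 1 2)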

These cases depict the different ways by which a collection can be transformed into the desired result. Any computation, or reduction, over a collection can be thought of as an arbitrary sequence of such transformations. These transformations are represented by transformers, which are functions that transform a reducing function. They can be implemented as shown here in Example 3.3:

Example 3.3. Transformers

(defn mapping [f]
  (fn [rf]
    (fn [result input]
      (rf result (f input)))))

(defn filtering [p?]
  (fn [rf]
    (fn [result input]
      (if (p? input)
        (rf result input)
        result))))

(defn mapcatting [f]
  (fn [rf]
    (fn [result input]
      (reduce rf result (f input)))))

The mapping, filtering, and mapcatting functions in Example 3.3 represent the core logic of the map, filter, and mapcat functions respectively. All of these functions are transformers that take a single argument and return a new function. The returned function transforms a supplied reducing function, represented by rf, and returns a new reducing function, created using the (fn [result input] … ) expression. Functions returned by the mapping, filtering, and mapcatting functions are termed reducing function transformers.

The mapping function applies the f function to the current input, represented by the input variable. The value returned by the f function is then combined with the accumulated result, represented by result, using the reducing function rf. This transformer is a frighteningly pure abstraction of the standard map function that applies an f function over a collection. The mapping function makes no assumptions about the structure of the collection it is supplied, or how the values returned by the f function are combined to produce the final result.

Similarly, the filtering function uses a predicate, p?, to check whether the current input of the rf reducing function must be combined into the final result, represented by result. If the predicate is not true, the reducing function simply returns the result value without any modification. The mapcatting function uses the reduce function to combine the result value with the result of the (f input) expression. In this transformer, we can assume that the f function will return a new collection and that the rf reducing function will somehow combine two collections into a new one.
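
Since a transformer merely wraps one reducing function in another, the functions from Example 3.3 can already be exercised with the standard reduce function. For example, here we transform the + function with mapping and filtering:

user> (reduce ((mapping inc) +) 0 [1 2 3 4])
14
user> (reduce ((filtering even?) +) 0 [1 2 3 4])
6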

One of the foundations of the reducers library is the CollReduce protocol defined in the clojure.core.protocols namespace. This protocol defines the behavior of a collection when it is passed as an argument to the reduce function, and it is declared as shown in Example 3.4:

Example 3.4. The CollReduce protocol

(defprotocol CollReduce
  (coll-reduce [coll rf init]))

The clojure.core.reducers namespace defines a reducer function that creates a reducible collection by dynamically extending the CollReduce protocol, as shown in Example 3.5:

Example 3.5. The reducer function

(defn reducer
  ([coll xf]
   (reify
     CollReduce
     (coll-reduce [_ rf init]
       (coll-reduce coll (xf rf) init)))))

The reducer function combines a collection (coll) and a reducing function transformer (xf), such as the ones returned by the mapping, filtering, and mapcatting functions, to produce a new reducible collection. When reduce is invoked on a reducible collection, it ultimately asks the collection to reduce itself using the reducing function returned by the (xf rf) expression. Through this mechanism, several reducing functions can be composed into a single computation to be performed over a given collection. Also, the reducer function needs to be defined only once, as the actual implementation of coll-reduce is provided by the collection supplied to the reducer function.
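
To see this composition in action, we can stack two calls to the reducer function; the transformation of the inner reducer is applied to each input first. A small sketch, using the mapping and filtering transformers from Example 3.3 together with r/reducer and r/reduce, which increments each element and then keeps only the even results:

user> (r/reduce + 0 (r/reducer (r/reducer [1 2 3 4] (mapping inc)) (filtering even?)))
6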

Now, we can redefine the reduce function to simply invoke the coll-reduce function implemented by a given collection, as shown in Example 3.6:

Example 3.6. Redefining the reduce function

(defn reduce
  ([rf coll]
   (reduce rf (rf) coll))
  ([rf init coll]
   (coll-reduce coll rf init)))

As shown in Example 3.6, the reduce function delegates the job of reducing a collection to the collection itself using the coll-reduce function. Also, the reduce function will use the rf reducing function to supply the init argument when it is not specified. An interesting consequence of this definition of reduce is that the rf function must produce an identity value when it is supplied no arguments. The standard reduce function also uses the CollReduce protocol to delegate the job of reducing a collection to the collection itself, but falls back on the default definition of reduce if the supplied collection does not implement the CollReduce protocol.
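
For example, the + and * functions return their respective identity values when called with no arguments, which is what makes them usable with the two-argument arity of reduce shown above:

user> (+)
0
user> (*)
1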

Since Clojure 1.4, the reduce function allows a collection to define how it is reduced using the clojure.core.protocols.CollReduce protocol. Clojure 1.5 introduced the clojure.core.reducers namespace, which extends the use of this protocol.

All the standard Clojure collections, namely lists, vectors, sets, and maps, implement the CollReduce protocol. The reducer function can be used to build a sequence of transformations to be applied on a collection when it is passed as an argument to the reduce function. This can be demonstrated as follows:

user> (r/reduce + 0 (r/reducer [1 2 3 4] (mapping inc)))
14
user> (reduce + 0 (r/reducer [1 2 3 4] (mapping inc)))
14

In this output, the mapping function is used with the inc function to create a reducing function transformer that increments all the elements in a given collection. This transformer is then combined with a vector using the reducer function to produce a reducible collection. The call to reduce in both of the above statements is effectively transformed into the (reduce + [2 3 4 5]) expression, thus producing the result 14. We can now redefine the map, filter, and mapcat functions using the reducer function, as shown in Example 3.7:

Example 3.7. Redefining the map, filter, and mapcat functions using the reducer form

(defn map [f coll]
  (reducer coll (mapping f)))

(defn filter [p? coll]
  (reducer coll (filtering p?)))

(defn mapcat [f coll]
  (reducer coll (mapcatting f)))

As shown in Example 3.7, the map, filter, and mapcat functions are now simply compositions of the reducer form with the mapping, filtering, and mapcatting transformers respectively.

The definitions of CollReduce, reducer, reduce, map, filter, and mapcat are simplified versions of their actual definitions in the clojure.core.reducers namespace.

The definitions of the map, filter, and mapcat functions shown in Example 3.7 have the same shape as the standard versions of these functions, shown as follows:

user> (r/reduce + (r/map inc [1 2 3 4]))
14
user> (r/reduce + (r/filter even? [1 2 3 4]))
6
user> (r/reduce + (r/mapcat range [1 2 3 4]))
10

Hence, the map, filter, and mapcat functions from the clojure.core.reducers namespace can be used in the same way as the standard versions of these functions. The reducers library also provides a take function that can be used as a replacement for the standard take function. We can use this function to reduce the number of calls to the square-with-side-effect function (from Example 3.1) when it is mapped over a given vector, as shown below:

user> (def mapped (r/map square-with-side-effect [0 1 2 3 4 5]))
#'user/mapped
user> (reduce + (r/take 3 mapped))
Side-effect: 0
Side-effect: 1
Side-effect: 2
Side-effect: 3
5

Thus, using the map and take functions from the clojure.core.reducers namespace as shown above avoids applying the square-with-side-effect function to all six elements in the [0 1 2 3 4 5] vector, since only the first three are required.

The reducers library also provides variants of the standard take-while, drop, flatten, and remove functions, which are based on reducers. These variants compose with r/reduce in the same way as the functions shown earlier, for example:
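
user> (r/reduce + (r/remove even? [1 2 3 4]))
4
user> (r/reduce + (r/drop 2 [1 2 3 4]))
7

Effectively, functions based on reducers require fewer allocations than sequence-based functions, which leads to improved performance. For example, consider the process and process-with-reducer functions, shown here: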

Example 3.8. Functions to process a collection of numbers using sequences and reducers

(defn process [nums]
  (reduce + (map inc (map inc (map inc nums)))))

(defn process-with-reducer [nums]
  (reduce + (r/map inc (r/map inc (r/map inc nums)))))

The process function in Example 3.8 applies the inc function three times over a collection of numbers, represented by nums, using the map function. The process-with-reducer function performs the same computation using the reducer variant of the map function. The process-with-reducer function takes less time than the process function to produce its result from a large vector, as shown here:

user> (def nums (vec (range 1000000)))
#'user/nums
user> (time (process nums))
"Elapsed time: 471.217086 msecs"
500002500000
user> (time (process-with-reducer nums))
"Elapsed time: 356.767024 msecs"
500002500000

The process-with-reducer function gets a slight performance boost, as it requires fewer memory allocations than the process function. The performance of this computation can be improved to a much greater extent if we can somehow parallelize it.
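
As a brief preview of that idea, the clojure.core.reducers namespace also provides the fold function, which performs the reduction in parallel over a fork/join thread pool. A minimal sketch, assuming the same nums vector and r alias used above; when fold is called with a single function, that function is used both to reduce and to combine partial results, so it must supply an identity value when called with no arguments, as + does:

(defn process-with-fold [nums]
  (r/fold + (r/map inc (r/map inc (r/map inc nums)))))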

Summary

In this article, we explored the clojure.core.reducers library in detail.

We took a look at how reducers are implemented and also how we can use reducers to handle large collections of data in an efficient manner.
