Home Programming Languages Understanding Ranges

Understanding Ranges

October 28, 2015 - 12:00 am

1846

40 min read

In this article by Michael Parker, author of the book Learning D, explains since they were first introduced, ranges have become a pervasive part of D. It’s possible to write D code and never need to create any custom ranges or algorithms yourself, but it helps tremendously to understand what they are, where they are used in Phobos, and how to get from a range to an array or another data structure. If you intend to use Phobos, you’re going to run into them eventually. Unfortunately, some new D users have a difficult time understanding and using ranges.

The aim of this article is to present ranges and functional styles in D from the ground up, so you can see they aren’t some arcane secret understood only by a chosen few. Then, you can start writing idiomatic D early on in your journey. In this article, we lay the foundation with the basics of constructing and using ranges in two sections:

Ranges defined
Ranges in use

(For more resources related to this topic, see here.)

Ranges defined

In this section, we’re going to explore what ranges are and see the concrete definitions of the different types of ranges recognized by Phobos. First, we’ll dig into an example of the sort of problem ranges are intended to solve and in the process, develop our own solution. This will help form an understanding of ranges from the ground up.

The problem

As part of an ongoing project, you’ve been asked to create a utility function, filterArray, which takes an array of any type and produces a new array containing all of the elements from the source array that pass a Boolean condition. The algorithm should be nondestructive, meaning it should not modify the source array at all. For example, given an array of integers as the input, filterArray could be used to produce a new array containing all of the even numbers from the source array.

It should be immediately obvious that a function template can handle the requirement to support any type. With a bit of thought and experimentation, a solution can soon be found to enable support for different Boolean expressions, perhaps a string mixin, a delegate, or both. After browsing the Phobos documentation for a bit, you come across a template that looks like it will help with the std.functional.unaryFun implementation. Its declaration is as follows:

template unaryFun(alias fun, string parmName = "a");

The alias fun parameter can be a string representing an expression, or any callable type that accepts one argument. If it is the former, the name of the variable inside the expression should be parmName, which is “a” by default. The following snippet demonstrates this:

int num = 10;
assert(unaryFun!("(a & 1) == 0")(num));
assert(unaryFun!("(x > 0)", "x")(num));

If fun is a callable type, then unaryFun is documented to alias itself to fun and the parmName parameter is ignored. The following snippet calls unaryFun first with struct that implements opCall, then calls it again with a delegate literal:

struct IsEven {
  bool opCall(int x) {
    return (x & 1) == 0;
  }
}
IsEven isEven;
assert(unaryFun!isEven(num));
assert(unaryFun!(x => x > 0)(num));

With this, you have everything you need to implement the utility function to spec:

import std.functional;
T[] filterArray(alias predicate, T)(T[] source)
  if(is(typeof(unaryFun!predicate(source[0])))
{
  T[] sink;
  foreach(t; source) {
    if(unaryFun!predicate(t))
      sink ~= t;
  }
  return sink;
}
unittest {
  auto ints = [1, 2, 3, 4, 5, 6, 7];
  auto even = ints.filterArray!(x => (x & 1) == 0)();
  assert(even == [2, 4, 6]);
}

The unittest verifies that the function works as expected. As a standalone implementation, it gets the job done and is quite likely good enough. But, what if, later on down the road, someone decides to create more functions that perform specific operations on arrays in the same manner? The natural outcome of that is to use the output of one operation as the input for another, creating a chain of function calls to transform the original data.

The most obvious problem is that any such function that cannot perform its operation in place must allocate at least once every time it’s called. This means, chain operations on a single array will end up allocating memory multiple times. This is not the sort of habit you want to get into in any language, especially in performance-critical code, but in D, you have to take the GC into account. Any given allocation could trigger a garbage collection cycle, so it’s a good idea to program to the GC; don’t be afraid of allocating, but do so only when necessary and keep it out of your inner loops.

In filterArray, the naïve appending can be improved upon, but the allocation can’t be eliminated unless a second parameter is added to act as the sink. This allows the allocation strategy to be decided at the call site rather than by the function, but it leads to another problem. If all of the operations in a chain require a sink and the sink for one operation becomes the source for the next, then multiple arrays must be declared to act as sinks. This can quickly become unwieldy.

Another potential issue is that filterArray is eager, meaning that every time the function is called, the filtering takes place immediately. If all of the functions in a chain are eager, it becomes quite difficult to get around the need for allocations or multiple sinks. The alternative, lazy functions, do not perform their work at the time they are called, but rather at some future point. Not only does this make it possible to put off the work until the result is actually needed (if at all), it also opens the door to reducing the amount of copying or allocating needed by operations in the chain. Everything could happen in one step at the end.

Finally, why should each operation be limited to arrays? Often, we want to execute an algorithm on the elements of a list, a set, or some other container, so why not support any collection of elements? By making each operation generic enough to work with any type of container, it’s possible to build a library of reusable algorithms without the need to implement each algorithm for each type of container.

The solution

Now we’re going to implement a more generic version of filterArray, called filter, which can work with any container type. It needs to avoid allocation and should also be lazy. To facilitate this, the function should work with a well-defined interface that abstracts the container away from the algorithm. By doing so, it’s possible to implement multiple algorithms that understand the same interface. It also takes the decision on whether or not to allocate completely out of the algorithms. The interface of the abstraction need not be an actual interface type. Template constraints can be used to verify that a given type meets the requirements.

You might have heard of duck typing. It originates from the old saying, If it looks like a duck, swims like a duck, and quacks like a duck, then it’s probably a duck. The concept is that if a given object instance has the interface of a given type, then it’s probably an instance of that type. D’s template constraints and compile-time capabilities easily allow for duck typing.

The interface

In looking for inspiration to define the new interface, it’s tempting to turn to other languages like Java and C++. On one hand, we want to iterate the container elements, which brings to mind the iterator implementations in other languages. However, we also want to do a bit more than that, as demonstrated by the following chain of function calls:

container.getType.algorithm1.algorithm2.algorithm3.toContainer();

Conceptually, the instance returned by getType will be consumed by algorithm1, meaning that inside the function, it will be iterated to the point where it can produce no more elements. But then, algorithm1 should return an instance of the same type, which can iterate over the same container, and which will in turn be consumed by algorithm2. The process repeats for algorithm3. This implies that instances of the new type should be able to be instantiated independent of the container they represent.

Moreover, given that D supports slicing, the role of getType previously could easily be played by opSlice. Iteration need not always begin with the first element of a container and end with the last; any range of elements should be supported. In fact, there’s really no reason for an actual container instance to exist at all in some cases. Imagine a random number generator; we should be able to plug one into the preceding function chain; just eliminate the container and replace getType with the generator instance. As long as it conforms to the interface we define, it doesn’t matter that there is no concrete container instance backing it.

The short version of it is, we don’t want to think solely in terms of iteration, as it’s only a part of the problem we’re trying to solve. We want a type that not only supports iteration, of either an actual container or a conceptual one, but one that also can be instantiated independently of any container, knows both its beginning and ending boundaries, and, in order to allow for lazy algorithms, can be used to generate new instances that know how to iterate over the same elements.

Considering these requirements, Iterator isn’t a good fit as a name for the new type. Rather than naming it for what it does or how it’s used, it seems more appropriate to name it for what it represents. There’s more than one possible name that fits, but we’ll go with Range (as in, a range of elements). That’s it for the requirements and the type name. Now, moving on to the API.

For any algorithm that needs to sequentially iterate a range of elements from beginning to end, three basic primitives are required:

There must be a way to determine whether or not any elements are available
There must be a means to access the next element in the sequence
There must be a way to advance the sequence so that another element can be made ready

Based on these requirements, there are several ways to approach naming the three primitives, but we’ll just take a shortcut and use the same names used in D. The first primitive will be called empty and can be implemented either as a member function that returns bool or as a bool member variable. The second primitive will be called front, which again could be a member function or variable, and which returns T, the element type of the range. The third primitive can only be a member function and will be called popFront, as conceptually it is removing the current front from the sequence to ready the next element.

A range for arrays

Wrapping an array in the Range interface is quite easy. It looks like this:

auto range(T)(T[] array) {
  struct ArrayRange(T) {
    private T[] _array;
    bool empty() @property {
      return _array.length == 0;
    }
    ref T front() {
      return _array[0];
    }
    void popFront() {
      _array = _array[1 .. $];
    }
  }
  return ArrayRange!T(array);
}

By implementing the iterator as struct, there’s no need to allocate GC memory for a new instance. The only member is a slice of the source array, which again avoids allocation. Look at the implementation of popFront. Rather than requiring a separate variable to track the current array index, it slices the first element out of _array so that the next element is always at index 0, consequently shortening.length of the slice by 1 so that after every item has been consumed, _array.length will be 0. This makes the implementation of both empty and front dead simple.

ArrayRange can be a Voldemort type because there is no need to declare its type in any algorithm it’s passed to. As long as the algorithms are implemented as templates, the compiler can infer everything that needs to be known for them to work. Moreover, thanks to UFCS, it’s possible to call this function as if it were an array property. Given an array called myArray, the following is valid:

auto range = myArray.range;

Next, we need a template to go in the other direction. This needs to allocate a new array, walk the iterator, and store the result of each call to element in the new array. Its implementation is as follows:

T[] array(T, R)(R range) @property {
  T[] ret;
  while(!range.empty) {
    ret ~= range.front;
    range.popFront();
  }
  return ret;
}

This can be called after any operation that produces any Range in order to get an array. If the range comes at the end of one or more lazy operations, this will cause all of them to execute simply by the call to popFront (we’ll see how shortly). In that case, no allocations happen except as needed in this function when elements are appended to ret. Again, the appending strategy here is naïve, so there’s room for improvement in order to reduce the potential number of allocations. Now it’s time to implement an algorithm to make use of our new range interface.

The implementation of filter

The filter function isn’t going to do any filtering at all. If that sounds counterintuitive, recall that we want the function to be lazy; all of the work should be delayed until it is actually needed. The way to accomplish that is to wrap the input range in a custom range that has an internal implementation of the filtering algorithm. We’ll call this wrapper FilteredRange. It will be a Voldemort type, which is local to the filter function. Before seeing the entire implementation, it will help to examine it in pieces as there’s a bit more to see here than with ArrayRange.

FilteredRange has only one member:

private R _source;

R is the type of the range that is passed to filter. The empty and front functions simply delegate to the source range, so we’ll look at popFront next:

void popFront() {
    _source.popFront();
    skipNext();
}

This will always pop the front from the source range before running the filtering logic, which is implemented in the private helper function skipNext:

private void skipNext() {
    while(!_source.empty && !unaryFun!predicate(_source.front))
        _source.popFront();
}

This function tests the result of _source.front against the predicate. If it doesn’t match, the loop moves on to the next element, repeating the process until either a match is found or the source range is empty. So, imagine you have an array arr of the values [1,2,3,4]. Given what we’ve implemented so far, what would be the result of the following chain? Let’s have a look at the following code:

arr.range.filter!(x => (x & 1) == 0).front;

As mentioned previously, front delegates to _source.front. In this case, the source range is an instance of ArrayRange; its front returns _source[0]. Since popFront was never called at any point, the first value in the array was never tested against the predicate. Therefore, the return value is 1, a value which doesn’t match the predicate. The first value returned by front should be 2, since it’s the first even number in the array.

In order to make this behave as expected, FilteredRange needs to ensure the wrapped range is in a state such that either the first call to front will properly return a filtered value, or empty will return true, meaning there are no values in the source range that match the predicate. This is best done in the constructor:

this(R source) {
    _source = source;
    skipNext();
}

Calling skipNext in the constructor ensures that the first element of the source range is tested against the predicate; however, it does mean that our filter implementation isn’t completely lazy. In an extreme case, that _source contains no values that match the predicate; it’s actually going to be completely eager. The source elements will be consumed as soon as the range is instantiated. Not all algorithms will lend themselves to 100 percent laziness. No matter what we have here is lazy enough. Wrapped up inside the filter function, the whole thing looks like this:

import std.functional;
auto filter(alias predicate, R)(R source)
  if(is(typeof(unaryFun!predicate))) {
    struct FilteredRange {
      private R _source;
      this(R source) {
        _source = source;
        skipNext();
      }
      bool empty() { return _source.empty; }
      auto ref front() { return _source.front; }
      void popFront() {
        _source.popFront();
        skipNext();
      }
      private void skipNext() {
        while(!_source.empty && 
          !unaryFun!predicate(_source.front))
          _source.popFront();
      }
    }
    return FilteredRange(source);
}

It might be tempting to take the filtering logic out of the skipNext method and add it to front, which is another way to guarantee that it’s performed on every element. Then no work would need to be done in the constructor and popFront would simply become a wrapper for _source.popFront. The problem with that approach is that front can potentially be called multiple times without calling popFront in between. Aside from the fact that it should return the same value each time, which can easily be accommodated, this still means the current element will be tested against the predicate on each call. That’s unnecessary work. As a general rule, any work that needs to be done inside a range should happen as a result of calling popFront, leaving front to simply focus on returning the current element.

The test

With the implementation complete, it’s time to put it through its paces. Here are a few test cases in a unittest block:

unittest {
  auto arr = [10, 13, 300, 42, 121, 20, 33, 45, 50, 109, 18];
  auto result = arr.range
                     .filter!(x => x < 100 )
                     .filter!(x => (x & 1) == 0)
                     .array!int();
  assert(result == [10,42,20,50,18]);

  arr = [1,2,3,4,5,6];
  result = arr.range.filter!(x => (x & 1) == 0).array!int;
  assert(result == [2, 4, 6]);

  arr = [1, 3, 5, 7];
  auto r = arr.range.filter!(x => (x & 1) == 0);
  assert(r.empty);

  arr = [2,4,6,8];
  result = arr.range.filter!(x => (x & 1) == 0).array!int;
  assert(result == arr);
}

Assuming all of this has been saved in a file called filter.d, the following will compile it for unit testing:

dmd -unittest -main filter

That should result in an executable called filter which, when executed, should print nothing to the screen, indicating a successful test run. Notice the test that calls empty directly on the returned range. Sometimes, we might not need to convert a range to a container at the end of the chain. For example, to print the results, it’s quite reasonable to iterate a range directly. Why allocate when it isn’t necessary?

The real ranges

The purpose of the preceding exercise was to get a feel of the motivation behind D ranges. We didn’t develop a concrete type called Range, just an interface. D does the same, with a small set of interfaces defining ranges for different purposes. The interface we developed exactly corresponds to the basic kind of D range, called an input range, one of one of two foundational range interfaces in D (the upshot of that is that both ArrayRange and FilteredRange are valid input ranges, though, as we’ll eventually see, there’s no reason to use either outside of this article). There are also certain optional properties that ranges might have, which, when present, some algorithms might take advantage of. We’ll take a brief look at the range interfaces now, then see more details regarding their usage in the next section.

Input ranges

This foundational range is defined to be anything from which data can be sequentially read via the three primitives empty, front, and popFront. The first two should be treated as properties, meaning they can be variables or functions. This is important to keep in mind when implementing any generic range-based algorithm yourself; calls to these two primitives should be made without parentheses. The three higher-order range interfaces, we’ll see shortly, build upon the input range interface.

To reinforce a point made earlier, one general rule to live by when crafting input ranges is that consecutive calls to front should return the same value until popFront is called; popFront prepares an element to be returned and front returns it. Breaking this rule can lead to unexpected consequences when working with range-based algorithms, or even foreach.

Input ranges are somewhat special in that they are recognized by the compiler. The opApply enables iteration of a custom type with a foreach loop. An alternative is to provide an implementation of the input range primitives. When the compiler encounters a foreach loop, it first checks to see if the iterated instance is of a type that implements opApply. If not, it then checks for the input range interface and, if found, rewrites the loop. In a given range someRange, take for example the following loop:

foreach(e; range) { ... }

This is rewritten to something like this:

for(auto __r = range; !__r.empty; __r.popFront()) {
    auto e = __r.front;
    ...
}

This has implications. To demonstrate, let’s use the ArrayRange from earlier:

auto ar = [1, 2, 3, 4, 5].range;
foreach(n; ar) {
    writeln(n);
}
if(!ar.empty) writeln(ar.front);

The last line prints 1. If you’re surprised, look back up at the for loop that the compiler generates. ArrayRange is a struct, so when it’s assigned to __r, a copy is generated. The slices inside, ar and __r, point to the same memory, but their .ptr and .length properties are distinct. As the length of the __r slice decreases, the length of the ar slice remains the same.

When implementing generic algorithms that loop over a source range, it’s not a good idea to assume the original range will not be consumed by the loop. If it’s a class instead of struct, it will be consumed by the loop, as classes are references types. Furthermore, there are no guarantees about the internal implementation of a range. There could be struct-based ranges that are actually consumed in a foreach loop. Generic functions should always assume this is the case.

Test if a given range type R is an input range:

import std.range : isInputRange;
static assert(isInputRange!R);

There are no special requirements on the return value of the front property. Elements can be returned by value or by reference, they can be qualified or unqualified, they can be inferred via auto, and so on. Any qualifiers, storage classes, or attributes that can be applied to functions and their return values can be used with any range function, though it might not always make sense to do so.

Forward ranges

The most basic of the higher-order ranges is the forward range. This is defined as an input range that allows its current point of iteration to be saved via a primitive appropriately named save. Effectively, the implementation should return a copy of the current state of the range. For ranges that are struct types, it could be as simple as:

auto save() { return this; }

For ranges that are class types, it requires allocating a new instance:

auto save() { return new MyForwardRange(this); }

Forward ranges are useful for implementing algorithms that require a look ahead. For example, consider the case of searching a range for a pair of adjacent elements that pass an equality test:

auto saved = r.save;
if(!saved.empty) {
  for(saved.popFront(); !saved.empty;
              r.popFront(), saved.popFront()) {
    if(r.front == saved.front)
      return r;
  }
}
return saved;

Because this uses a for loop and not a foreach loop, the ranges are iterated directly and are going to be consumed. Before the loop begins, a copy of the current state of the range is made by calling r.save. Then, iteration begins over both the copy and the original. The original range is positioned at the first element, and the call to saved.popFront in the beginning of the loop statement positions the saved range at the second element. As the ranges are iterated in unison, the comparison is always made on adjacent elements. If a match is found, r is returned, meaning that the returned range is positioned at the first element of a matching pair. If no match is found, saved is returned—since it’s one element ahead of r, it will have been consumed completely and its empty property will be true.

The preceding example is derived from a more generic implementation in Phobos, std.range.findAdjacent. It can use any binary (two argument) Boolean condition to test adjacent elements and is constrained to only accept forward ranges.

It’s important to understand that calling save usually does not mean a deep copy, but it sometimes can. If we were to add a save function to the ArrayRange from earlier, we could simply return this; the array elements would not be copied. A class-based range, on the other hand, will usually perform a deep copy because it’s a reference type. When implementing generic functions, you should never make the assumption that the range does not require a deep copy. For example, given a range r:

auto saved = r;         // INCORRECT!!
auto saved = r.save;    // Correct.

If r is a class, the first line is almost certainly going to result in incorrect behavior. It would in the preceding example loop.

To test if a given range R is a forward range:

import std.range : isForwardRange;
static assert(isForwardRange!R);

Bidirectional ranges

A bidirectional range is a forward range that includes the primitives back and popBack, allowing a range to be sequentially iterated in reverse. The former should be a property, the latter a function. Given a bidirectional range r, the following forms of iteration are possible:

foreach_reverse(e; r) writeln(e);
for(; !r.empty; r.popBack)
    writeln(r.back);
}

Like its cousin foreach, the foreach_reverse loop will be rewritten into a for loop that does not consume the original range; the for loop shown here does consume it.

Test whether a given range type R is a bidirectional range:

import std.range : isBidirectionalRange;
static assert(isBidirectionalRange!R);

Random-access ranges

A random-access range is a bidirectional range that supports indexing and is required to provide a length primitive unless it’s infinite (two topics we’ll discuss shortly). For custom range types, this is achieved via the opIndex operator overload. It is assumed that r[n] returns a reference to the (n+1)th element of the range, just as when indexing an array.

Test whether a given range R is a random-access range:

import std.range : isRandomAccessRange;
static assert(isRandomAccessRange!R);

Dynamic arrays can be treated as random-access ranges by importing std.array. This pulls functions into scope that accept dynamic arrays as parameters and allows them to pass all the isRandomAccessRange checks. This makes our ArrayRange from earlier obsolete. Often, when you need a random-access range, it’s sufficient just to use an array instead of creating a new range type. However, char and wchar arrays (string and wstring) are not considered random-access ranges, so they will not work with any algorithm that requires one.

Getting a random-access range from char[] and wchar[]

Recall that a single Unicode character can be composed of multiple elements in a char or wchar array, which is an aspect of strings that would seriously complicate any algorithm implementation that needs to directly index the array. To get around this, the thing to do in a general case is to convert char[] and wchar[] into dchar[]. This can be done with std.utf.toUTF32, which encodes UTF-8 and UTF-16 strings into UTF-32 strings. Alternatively, if you know you’re only working with ASCII characters, you can use std.string.representation to get ubyte[] or ushort[] (on dstring, it returns uint[]).

Output ranges

The output range is the second foundational range type. It’s defined to be anything that can be sequentially written to via the primitive put. Generally, it should be implemented to accept a single parameter, but the parameter could be a single element, an array of elements, a pointer to elements, or another data structure containing elements. When working with output ranges, never call the range’s implementation of put directly; instead, use the Phobos’ utility function std.range.put. It will call the range’s implementation internally, but it allows for a wider range of argument types. Given a range r and element e, it would look like this:

import std.range : put;
put(r, e);

The benefit here is if e is anything other than a single element, such as an array or another range, the global put does what is necessary to pull elements from it and put them into r one at a time. With this, you can define and implement a simple output range that might look something like this:

MyOutputRange(T) {
  private T[] _elements;
  void put(T elem) {
    _elements ~= elem;
  }
}

Now, you need not worry about calling put in a loop, or overloading it to accept collections of T. For example, let’s have a look at the following code:

MyOutputRange!int range;
auto nums = [11, 22, 33, 44, 55];
import std.range : put;
put(range, nums);

Note that using UFCS here will cause compilation to fail, as the compiler will attempt to call MyOutputRange.put directly, rather than the utility function. However, it’s fine to use UFCS when the first parameter is a dynamic array. This allows arrays to pass the isOutputRange predicate

Test whether a given range R is an output range:

import std.range : isOutputRange;
static assert(isOutputRange!(R, E));

Here, E is the type of element accepted by R.put.

Optional range primitives

In addition to the five primary range types, some algorithms in Phobos are designed to look for optional primitives that can be used as an optimization or, in some cases, a requirement. There are predicate templates in std.range that allow the same options to be used outside of Phobos.

hasLength

Ranges that expose a length property can reduce the amount of work needed to determine the number of elements they contain. A great example is the std.range.walkLength function, which will determine and return the length of any range, whether it has a length primitive or not. Given a range that satisfies the std.range.hasLength predicate, the operation becomes a call to the length property; otherwise, the range must be iterated until it is consumed, incrementing a variable every time popFront is called. Generally, length is expected to be a O(1) operation. If any given implementation cannot meet that expectation, it should be clearly documented as such. For non-infinite random-access ranges, length is a requirement. For all others, it’s optional.

isInfinite

An input range with an empty property, which is implemented as a compile-time value set to false is considered an infinite range. For example, let’s have a look at the following code:

struct IR {
    private uint _number;
    enum empty = false;
    auto front() { return _number; }
    void popFront() { ++_number; }
}

Here, empty is a manifest constant, but it could alternatively be implemented as follows:

static immutable empty = false;

The predicate template std.range.isInfinite can be used to identify infinite ranges. Any range that is always going to return false from empty should be implemented to pass isInfinite. Wrapper ranges (such as the FilterRange we implemented earlier) in some functions might check isInfinite and customize an algorithm’s behavior when it’s true. Simply returning false from an empty function will break this, potentially leading to infinite loops or other undesired behavior.

Other options

There are a handful of other optional primitives and behaviors, as follows:

hasSlicing: This returns true for any forward range that supports slicing. There are a set of requirements specified by the documentation for finite versus infinite ranges and whether opDollar is implemented.
hasMobileElements: This is true for any input range whose elements can be moved around in the memory (as opposed to copied) via the primitives moveFront, moveBack, and moveAt.
hasSwappableElements: This returns true if a range supports swapping elements through its interface. The requirements are different depending on the range type.
hasAssignableElements: This returns true if elements are assignable through range primitives such as front, back, or opIndex.

At http://dlang.org/phobos/std_range_primitives.html, you can find specific documentation for all of these tests, including any special requirements that must be implemented by a range type to satisfy them.

Ranges in use

The key concept to understand ranges in the general case is that, unless they are infinite, they are consumable. In idiomatic usage, they aren’t intended to be kept around, adding and removing elements to and from them as if they were some sort of container. A range is generally created only when needed, passed to an algorithm as input, then ultimately consumed, often at the end of a chain of algorithms. Even forward ranges and output ranges with their save and put primitives usually aren’t intended to live long beyond an algorithm.

That’s not to say it’s forbidden to keep a range around; some might even be designed for long life. For example, the random number generators in std.random are all ranges that are intended to be reused. However, idiomatic usage in D generally means lazy, fire-and-forget ranges that allow algorithms to operate on data from any source.

For most programs, the need to deal with ranges directly should be rare; most code will be passing ranges to algorithms, then either converting the result to a container or iterating it with a foreach loop. Only when implementing custom containers and range-based algorithms is it necessary to implement a range or call a range interface directly. Still, understanding what’s going on under the hood helps in understanding the algorithms in Phobos, even if you never need to implement a range or algorithm yourself. That’s the focus of the remainder of this article.

Custom ranges

When implementing custom ranges, some thought should be given to the primitives that need to be supported and how to implement them. Since arrays support a number of primitives out of the box, it might be tempting to return a slice from a custom type, rather than struct wrapping an array or something else. While that might be desirable in some cases, keep in mind that a slice is also an output range and has assignable elements (unless it’s qualified as const or immutable, but those can be cast away). In many cases, what’s really wanted is an input range that can never be modified; one that allows iteration and prevents unwanted allocations.

A custom range should be as lightweight as possible. If a container uses an array or pointer internally, the range should operate on a slice of the array, or a copy of the pointer, rather than a copy of the data. This is especially true for the save primitive of a forward iterator; it could be called more than once in a chain of algorithms, so an implementation that requires deep copying would be extremely suboptimal (not to mention problematic for a range that requires ref return values from front).

Now we’re going to implement two actual ranges that demonstrate two different scenarios. One is intended to be a one-off range used to iterate a container, and one is suited to sticking around for as long as needed. Both can be used with any of the algorithms and range operations in Phobos.

Getting a range from a stack

Here’s a barebones, simple stack implementation with the common operations push, pop, top, and isEmpty (named to avoid confusion with the input range interface). It uses an array to store its elements, appending them in the push operation and decreasing the array length in the pop operation. The top of the stack is always _array[$-1]:

struct Stack(T) {
    private T[] _array;
    void push(T element) {
        _array ~= element;
    }
    void pop() {
        assert(!isEmpty);
        _array.length -= 1;
    }
    ref T top() {
        assert(!isEmpty);
        return _array[$-1];
    }
    bool isEmpty() { return _array.length == 0; }
}

Rather than adding an opApply to iterate a stack directly, we want to create a range to do the job for us so that we can use it with all of those algorithms. Additionally, we don’t want the stack to be modified through the range interface, so we should declare a new range type internally. That might look like this:

private struct Range {
    T[] _elements;
    bool empty() { return _elements.length == 0; }
    T front() { return _elements[$-1]; }
    void popFront() { _elements.length -= 1; }
}

Add this anywhere you’d like inside the Stack declaration. Note the iteration of popFront. Effectively, this range will iterate the elements backwards. Since the end of the array is the top of the stack, that means it’s iterating the stack from the top to the bottom. We could also add back and popBack primitives that iterate from the bottom to the top, but we’d also have to add a save primitive since bidirectional ranges must also be forward ranges.

Now, all we need is a function to return a Range instance:

auto elements() { return Range(_array); }

Again, add this anywhere inside the Stack declaration. A real implementation might also add the ability to get a range instance from slicing a stack. Now, test it out:

Stack!int stack;
foreach(i; 0..10)
    stack.push(i);
writeln("Iterating...");
foreach(i; stack.elements)
    writeln(i);
stack.pop();
stack.pop();
writeln("Iterating...");
foreach(i; stack.elements)
    writeln(i);

One of the great side effects of this sort of range implementation is that you can modify the container behind the range’s back and the range doesn’t care:

foreach(i; stack.elements) {
        stack.pop();
        writeln(i);
    }
    writeln(stack.top);

This will still print exactly what was in the stack at the time the range was created, but the writeln outside the loop will cause an assertion failure because the stack will be empty by then. Of course, it’s still possible to implement a container that can cause its ranges not just to become stale, but to become unstable and lead to an array bounds error or an access violation or some such. However, D’s slices used in conjunction with structs give a good deal of flexibility.

A name generator range

Imagine that we’re working on a game and need to generate fictional names. For this example, let’s say it’s a music group simulator and the names are those of group members. We’ll need a data structure to hold the list of possible names. To keep the example simple, we’ll implement one that holds both first and last names:

struct NameList {
private:
    string[] _firstNames;
    string[] _lastNames;
    struct Generator {
        private string[] _first;
        private string[] _last;
        private string _next;
        enum empty = false;
        this(string[] first, string[] last) {
            _first = first;
            _last = last;
            popFront();
        }
        string front() {
            return _next;
        }
        void popFront() {
            import std.random : uniform;
            auto firstIdx = uniform(0, _first.length);
            auto lastIdx = uniform(0, _last.length);
            _next = _first[firstIdx] ~ " " ~ _last[lastIdx];
        }
    }
public:
    auto generator() {
        return Generator(_firstNames, _lastNames);
    }
}

The custom range is in the highlighted block. It’s a struct called Generator that stores two slices, _first and _last, which are both initialized in its only constructor. It also has a field called _next, which we’ll come back to in a minute. The goal of the range is to provide an endless stream of randomly generated names, which means it doesn’t make sense for its empty property to ever return true. As such, it is marked as an infinite range by the manifest constant implementation of empty that is set to false.

This range has a constructor because it needs to do a little work to prepare itself before front is called for the first time. All of the work is done in popFront, which the constructor calls after the member variables are set up. Inside popFront, you can see that we’re using the std.random.uniform function. By default, this function uses a global random number generator and returns a value in the range specified by the parameters, in this case 0 and the length of each array. The first parameter is inclusive and the second is exclusive. Two random numbers are generated, one for each array, and then used to combine a first name and a last name to store in the _next member, which is the value returned when front is called. Remember, consecutive calls to front without any calls to popFront should always return the same value.

std.random.uniform can be configured to use any instance of one of the random number generator implementations in Phobos. It can also be configured to treat the bounds differently. For example, both could be inclusive, exclusive, or the reverse of the default. See the documentation at http://dlang.org/phobos/std_random.html for details.

The generator property of NameList returns an instance of Generator. Presumably, the names in a NameList would be loaded from a file on disk, or a database, or perhaps even imported at compile-time. It’s perfectly fine to keep a single Generator instance handy for the life of the program as implemented. However, if the NameList instance backing the range supported reloading or appending, not all changes would be reflected in the range. In that scenario, it’s better to go through generator every time new names need to be generated.

Now, let’s see how our custom range might be used:

auto nameList = NameList(
    ["George", "John", "Paul", "Ringo", "Bob", "Jimi",
     "Waylon", "Willie", "Johnny", "Kris", "Frank", "Dean",
     "Anne", "Nancy", "Joan", "Lita", "Janice", "Pat",
     "Dionne", "Whitney", "Donna", "Diana"],
    ["Harrison", "Jones", "Lennon", "Denver", "McCartney",
     "Simon", "Starr", "Marley", "Dylan", "Hendrix", "Jennings",
     "Nelson", "Cash", "Mathis", "Kristofferson", "Sinatra",
     "Martin", "Wilson", "Jett", "Baez", "Ford", "Joplin",
     "Benatar", "Boone", "Warwick", "Houston", "Sommers",
     "Ross"]
);
import std.range : take;
auto names = nameList.generator.take(4);
writeln("These artists want to form a new band:");
foreach(artist; names)
    writeln(artist);

First up, we initialize a NameList instance with two array literals, one of first names and one of last names. Next, the highlighted line is where the range is used. We call nameList.generator and then, using UFCS, pass the returned Generator instance to std.range.take. This function creates a new lazy range containing a number of elements, four in this case, from the source range. In other words, the result is the equivalent of calling front and popFront four times on the range returned from nameList.generator, but since it’s lazy, the popping doesn’t occur until the foreach loop. That loop produces four randomly generated names that are each written to standard output. One iteration yielded the following names for me:

These artists want to form a new band:

Dionne Wilson
Johnny Starr
Ringo Sinatra
Dean Kristofferson

Other considerations

The Generator range is infinite, so it doesn’t need length. There should never be a need to index it, iterate it in reverse, or assign any values to it. It has exactly the interface it needs. But it’s not always so obvious where to draw the line when implementing a custom range. Consider the interface for a range from a queue data structure.

A basic queue implementation allows two operations to add and remove items—enqueue and dequeue (or push and pop if you prefer). It provides the self-describing properties empty and length. What sort of interface should a range from a queue implement?

An input range with a length property is perhaps the most obvious, reflecting the interface of the queue itself. Would it make sense to add a save property? Should it also be a bidirectional range? What about indexing? Should the range be random-access? There are queue implementations out there in different languages that allow indexing, either through an operator overload or a function such as getElementAt. Does that make sense? Maybe. More importantly, if a queue doesn’t allow indexing, does it make sense for a range produced from that queue to allow it? What about slicing? Or assignable elements? For our queue type at least, there are no clear-cut answers to these questions.

A variety of factors come into play when choosing which range primitives to implement, including the internal data structure used to implement the queue, the complexity requirements of the primitives involved (indexing should be an O(1) operation), whether the queue was implemented to meet a specific need or is a more general-purpose data structure, and so on. A good rule of thumb is that if a range can be made a forward range, then it should be.

Custom algorithms

When implementing custom, range-based algorithms, it’s not enough to just drop an input range interface onto the returned range type and be done with it. Some consideration needs to be given to the type of range used as input to the function and how its interface should affect the interface of the returned range. Consider the FilteredRange we implemented earlier, which provides the minimal input range interface. Given that it’s a wrapper range, what happens when the source range is an infinite range? Let’s look at it step by step.

First, an infinite range is passed in to filter. Next, it’s wrapped up in a FilteredRange instance that’s returned from the function. The returned range is going to be iterated at some point, either directly by the caller or somewhere in a chain of algorithms. There’s one problem, though: with a source range that’s infinite, the FilteredRange instance can never be consumed. Because its empty property simply wraps that of the source range, it’s always going to return false if the source range is infinite. However, since FilteredRange does not implement empty as a compile-time constant, it will never match the isInfiniteRange predicate. This will cause any algorithm that makes that check to assume it’s dealing with a finite range and, if iterating it, enter into an infinite loop. Imagine trying to track down that bug.

One option is to prohibit infinite ranges with a template constraint, but that’s too restrictive. The better way around this potential problem is to check the source range against the isInfinite predicate inside the FilteredRange implementation. Then, the appropriate form of the empty primitive of FilteredRange can be configured with conditional compilation:

import std.range : isInfinite;
static if(isInfinite!T)
  enum empty = false;
else
  bool empty(){ return _source.empty; }

With this, FilteredRange will satisfy the isInfinite predicate when it wraps an infinite range, avoiding the infinite loop bug.

Another good rule of thumb is that a wrapper range should implement as many of the primitives provided by the source range as it reasonably can. If the range returned by a function has fewer primitives than the one that went in, it is usable with fewer algorithms. But not all ranges can accommodate every primitive.

Take FilteredRange as an example again. It could be configured to support the bidirectional interface, but that would have a bit of a performance impact as the constructor would have to find the last element in the source range that satisfies the predicate in addition to finding the first, so that both front and back are primed to return the correct values. Rather than using conditional compilation, std.algorithm provides two functions, filter and filterBidirectional, so that users must explicitly choose to use the latter version. A bidirectional range passed to filter will produce a forward range, but the latter maintains the interface.

The random-access interface, on the other hand, makes no sense on FilteredRange. Any value taken from the range must satisfy the predicate, but if users can randomly index the range, they could quite easily get values that don’t satisfy the predicate. It could work if the range was made eager rather than lazy. In that case, it would allocate new storage and copy all the elements from the source that satisfies the predicate, but that defeats the purpose of using lazy ranges in the first place.

Summary

In this article, we’ve taken an introductory look at ranges in D and how to implement them into containers and algorithms. For more information on ranges and their primitives and traits, see the documentation at http://dlang.org/phobos/std_range.html.

Resources for Article:

Further resources on this subject:

Transactions and Operators [article]
Using Protocols and Protocol Extensions [article]
Hand Gesture Recognition Using a Kinect Depth Sensor [article]