10 min read

(For more resources related to this topic, see here.)

Sorting ranges efficiently

Phobos’ std.algorthm includes sorting algorithms. Let’s look at how they are used, what requirements they have, and the dangers of trying to implement range primitives without minding their efficiency requirements.

Getting ready

Let’s make a linked list container that exposes an appropriate view, a forward range, and an inappropriate view, a random access range that doesn’t meet its efficiency requirements. A singly-linked list can only efficiently implement forward iteration due to its nature; the only tool it has is a pointer to the next element. Implementing any other range primitives will require loops, which is not recommended. Here, however, we’ll implement a fully functional range, with assignable elements, length, bidirectional iteration, random access, and even slicing on top of a linked list to see the negative effects this has when we try to use it.

How to do it…

We’re going to both sort and benchmark this program.

To sort

Let’s sort ranges by executing the following steps:

  1. Import std.algorithm.

  2. Determine the predicate you need. The default is (a, b) => a < b, which results in an ascending order when the sorting is complete (for example, [1,2,3]). If you want ascending order, you don’t have to specify a predicate at all. If you need descending order, you can pass a greater-than predicate instead, as shown in the following line of code:

    auto sorted = sort!((a, b) => a > b)([1,2,3]); // results: [3,2,1]

  3. When doing string comparisons, the functions std.string.cmp (case-sensitive) or std.string.icmp (case-insensitive) may be used, as is done in the following code:

    auto sorted = sort!((a, b) => cmp(a, b) < 0)(["b", "c", "a"]); // results: a, b, c

  4. Your predicate may also be used to sort based on a struct member, as shown in the following code:

    auto sorted = sort!((a, b) => a.value < b.value)(structArray);

  5. Pass the predicate as the first compile-time argument. The range you want to sort is passed as the runtime argument.

  6. If your range is not already sortable (if it doesn’t provide the necessary capabilities), you can convert it to an array using the array function from std.range, as shown in the following code:

    auto sorted = sort(fibanocci().take(10)); // won't compile, not enough capabilities auto sorted = sort(fibanocci().take(10).array); // ok, good

  7. Use the sorted range. It has a unique type from the input to signify that it has been successfully sorted. Other algorithms may use this knowledge to increase their efficiency.

To benchmark

Let’s sort objects using benchmark by executing the following steps:

  1. Put our range and skeleton main function from the Getting ready section of this recipe into a file.

  2. Use std.datetime.benchmark to test the sorting of an array from the appropriate walker against the slow walker and print the results at the end of main. The code is as follows:

    auto result = benchmark!( { auto sorted = sort(list.walker.array); }, { auto sorted = sort(list.slowWalker); } )(100); writefln("Emulation resulted in a sort that was %d times slower.", result[1].hnsecs / result[0].hnsecs);

  3. Run it. Your results may vary slightly, but you’ll see that the emulated, inappropriate range functions are consistently slower. The following is the output:

    Emulation resulted in a sort that was 16 times slower.

  4. Tweak the size of the list by changing the initialization loop. Instead of 1000 entries, try 2000 entries. Also, try to compile the program with inlining and optimization turned on (dmd –inline –O yourfile.d) and see the difference. The emulated version will be consistently slower, and as the list becomes longer, the gap will widen.

On my computer, a growing list size led to a growing slowdown factor, as shown in the following table:

List size

Slowdown factor

500

13

1000

16

2000

29

4000

73

How it works…

The interface to Phobos’ main sort function hides much of the complexity of the implementation. As long as we follow the efficiency rules when writing our ranges, things either just work or fail, telling us we must call an array in the range before we can sort it. Building an array has a cost in both time and memory, which is why it isn’t performed automatically (std.algorithm prefers lazy evaluation whenever possible for best speed and minimum memory use). However, as you can see in our benchmark, building an array is much cheaper than emulating unsupported functions.

The sort algorithms require a full-featured range and will modify the range you pass to it instead of allocating memory for a copy. Thus, the range you pass to it must support random access, slicing, and either assignable or swappable elements. The prime example of such a range is a mutable array. This is why it is often necessary to use the array function when passing data to sort.

Our linked list code used static if with a compile-time parameter as a configuration tool. The implemented functions include opSlice and properties that return ref. The ref value can only be used on function return values or parameters. Assignments to a ref value are forwarded to the original item. The opSlice function is called when the user tries to use the slice syntax: obj[start .. end].

Inside the beSlow condition, we broke the main rule of implementing range functions: avoid loops. Here, we see the consequences of breaking that rule; it ruined algorithm restrictions and optimizations, resulting in code that performs very poorly. If we follow the rules, we at least know where a performance problem will arise and can handle it gracefully.

For ranges that do not implement the fast length property, std.algorithm includes a function called walkLength that determines the length by looping through all items (like we did in the slow length property). The walkLength function has a longer name than length precisely to warn you that it is a slower function, running in O(n) (linear with length) time instead of O(1) (constant) time. Slower functions are OK, they just need to be explicit so that the user isn’t surprised.

See also

  • The std.algorithm module also includes other sorting algorithms that may fit a specific use case better than the generic (automatically specialized) function. See the documentation at http://dlang.org/phobos/std_algorithm.html for more information.

Searching ranges

Phobos’ std.algorithm module includes search functions that can work on any ranges. It automatically specializes based on type information. Searching a sorted range is faster than an unsorted range.

How to do it…

Searching has a number of different scenarios, each with different methods:

  • If you want to know if something is present, use canFind.

  • Finding an item generically can be done with the find function. It returns the remainder of the range, with the located item at the front.

  • When searching for a substring in a string, you can use haystack.find(boyerMooreFinder(needle)). This uses the Boyer-Moore algorithm which may give better performance.

  • If you want to know the index where the item is located, use countUntil. It returns a numeric index into the range, just like the indexOf function for strings.

  • Each find function can take a predicate to customize the search operation.

  • When you know your range is sorted but the type doesn’t already prove it, you may call assumeSorted on it before passing it to the search functions. The assumeSorted function has no runtime cost; it only adds information to the type that is used for compile-time specialization.

How it works…

The search functions in Phobos make use of the ranges’ available features to choose good-fit algorithms. Pass them efficiently implemented ranges with accurate capabilities to get best performance.

The find function returns the remainder of the data because this is the most general behavior; it doesn’t need random access, like returning an index, and doesn’t require an additional function if you are implementing a function to split a range on a given condition. The find function can work with a basic input range, serving as a foundation to implement whatever you need on top of it, and it will transparently optimize to use more range features if available.

Using functional tools to query data

The std.algorithm module includes a variety of higher-order ranges that provide tools similar to functional tools. Here, we’ll see how D code can be similar to a SQL query.

A SQL query is as follows:

SELECT id, name, strcat("Title: ", title) FROM users WHERE name LIKE 'A%' ORDER BY id DESC LIMIT 5;

How would we express something similar in D?

Getting ready

Let’s create a struct to mimic the data table and make an array with the some demo information. The code is as follows:

struct User { int id; string name; string title; } User[] users; users ~= User(1, "Alice", "President"); users ~= User(2, "Bob", "Manager"); users ~= User(3, "Claire", "Programmer");

How to do it…

Let’s use functional tools to query data by executing the following steps:

  1. Import std.algorithm.

  2. Use sort to translate the ORDER BY clause. If your dataset is large, you may wish to sort it at the end. This will likely require a call to an array, but it will only sort the result set instead of everything. With a small dataset, sorting early saves an array allocation.

  3. Use filter to implement the WHERE clause.

  4. Use map to implement the field selection and functions. The std.typecons.tuple module can also be used to return specific fields.

  5. Use std.range.take to implement the LIMIT clause.

  6. Put it all together and print the result.

The code is as follows:

import std.algorithm; import std.range; import std.typecons : tuple; // we use this below auto resultSet = users. sort!((a, b) => a.id > b.id). // the ORDER BY clause filter!((item) => item.name.startsWith("A")). // the WHERE clause take(5). map!((item) => tuple(item.id, item.name,"Title: " ~ item.title));
// the field list and transformations import std.stdio; foreach(line; resultSet) writeln(line[0], " ", line[1], " ", line[2]);

It will print the following output:

1 Alice Title: President

How it works…

Many SQL operations or list comprehensions can be expressed in D using some building blocks from std.algorithm. They all work generally the same way; they take a predicate as a compile-time argument. The predicate is passed one or two items at a time and you perform a check or transformation on it. Chaining together functions with the dot syntax, like we did here, is possible thanks to uniform function call syntax. It could also be rewritten as take(5, filter!pred(map!pred(users))). It depends on author’s preference, as both styles work exactly the same way.

It is important to remember that all std.algorithm higher-order ranges are evaluated lazily. This means no computations, such as looping over or printing, are actually performed until they are required. Writing code using filter, take, map, and many other functions is akin to preparing a query. To execute it, you may print or loop the result, or if you want to save it to an array for use later, simply call .array at the end.

There’s more…

The std.algorithm module also includes other classic functions, such as reduce. It works the same way as the others.

D has a feature called pure functions. The functions in std.algorithm are conditionally pure, which means they can be used in pure functions if and only if the predicates you pass are also pure. With lambda functions, like we’ve been using here, the compiler will often automatically deduce this for you. If you use other functions you define as predicates and want to use it in a pure function, be sure to mark them pure as well.

See also

Summary

In this article, we learned how to sort ranges in an efficient manner by using sorting algorithms. We learned how to search a range using different functions. We also learned how to use functional tools to query data (similar to a SQL query).

Resources for Article:


Further resources on this subject:


LEAVE A REPLY

Please enter your comment!
Please enter your name here