
Memory management is part of practically every computing system. Multiple programs must coexist inside a limited memory space, and that is only possible if the operating system takes care of it. When a program needs memory, for example to create an object, it asks the operating system, which gives it a slice of shared memory. When an object is no longer needed, that memory is returned to the loving care of the operating system. In this tutorial, we will touch upon memory management techniques, one of the most important factors in parallel programming.

This article is an excerpt from the book Delphi High Performance, written by Primož Gabrijelčič.

Slicing and dicing memory straight from the operating system is a relatively slow operation. In many cases, the operating system also doesn't know how to return small chunks of memory. For example, if you call Windows' VirtualAlloc function to get 20 bytes of memory, it will actually reserve 4 KB (or 4,096 bytes) for you. In other words, 4,076 bytes would be wasted.

To fix these and other problems, programming languages typically implement their own internal memory management algorithms. When you request 20 bytes of memory, the request goes to that internal memory manager. It still requests memory from the operating system but then splits it internally into multiple parts.

In a hypothetical scenario, the internal memory manager would request 4,096 bytes from the operating system and give 20 bytes of that to the application. The next time the application would request some memory (30 bytes for example), the internal memory manager would get that memory from the same 4,096-byte block.
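To make the carving idea concrete, here is a minimal sketch of such an internal manager. It is written in Python purely for illustration (Delphi's actual manager is far more elaborate), and the class name SimpleCarver and all its details are invented for this example:

```python
# Illustrative model of an internal memory manager that carves small
# allocations out of one large block obtained from the operating system.
# This is a sketch of the idea, not FastMM's actual algorithm.

OS_BLOCK_SIZE = 4096  # VirtualAlloc grants memory in 4 KB pages

class SimpleCarver:
    def __init__(self):
        self.os_blocks = 0           # how many 4 KB blocks we asked the OS for
        self.offset = OS_BLOCK_SIZE  # force an OS request on first use

    def alloc(self, size):
        # If the current block cannot satisfy the request, get a new one.
        if self.offset + size > OS_BLOCK_SIZE:
            self.os_blocks += 1
            self.offset = 0
        address = (self.os_blocks - 1) * OS_BLOCK_SIZE + self.offset
        self.offset += size
        return address

carver = SimpleCarver()
a = carver.alloc(20)  # first request triggers one OS allocation
b = carver.alloc(30)  # second request is carved from the same 4 KB block
```

Both requests are satisfied with a single trip to the operating system; only when a block runs out does the carver ask for another one.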

To move from hypothetical to specific, Delphi also includes such a memory manager. Since Delphi 2006, this memory manager has been called FastMM. It was written as an open source memory manager by Pierre le Riche with help from other Delphi programmers, and was later licensed by Borland. FastMM was a great improvement over the previous Delphi memory manager and, although it does not perform perfectly in the parallel programming world, it still functions very well after more than ten years.

Optimizing strings and array allocations

When you create a string, the code allocates memory for its content, copies the content into that memory, and stores the address of this memory in the string variable.

If you append a character to this string, it must be stored somewhere in that memory. However, there is no room for it: the original memory block was just big enough to store the original content. The code must, therefore, enlarge that memory block, and only then can the appended character be stored in the newly acquired space.

A very similar scenario plays out when you extend a dynamic array. Memory that contains the array data can sometimes be extended in place (without moving), but often this cannot be done.

If you do a lot of appending, these constant reallocations will start to slow down the code. The Reallocation demo shows a few examples of such behavior and possible workarounds.

The first example, activated by the Append String button, simply appends the '*' character to a string 10 million times. The code looks simple, but the s := s + '*' assignment hides a potentially slow string reallocation:

procedure TfrmReallocation.btnAppendStringClick(Sender: TObject);
var
  s: String;
  i: Integer;
begin
  s := '';
  for i := 1 to CNumChars do
    s := s + '*';
end;

By now, you probably know that I don’t like to present problems that I don’t have solutions for and this is not an exception. In this case, the solution is called SetLength. This function sets a string to a specified size. You can make it shorter, or you can make it longer. You can even set it to the same length as before.

In case you are enlarging the string, you have to keep in mind that SetLength will allocate enough memory to store the new string, but it will not initialize it. In other words, the newly allocated string space will contain random data.

A click on the SetLength String button activates the optimized version of the string appending code. As we know that the resulting string will be CNumChars long, the code can call SetLength(s, CNumChars) to preallocate all the memory in one step. After that, we should not append characters to the string, as that would add new characters at the end of the preallocated string. Rather, we have to store characters directly into the string by writing to s[i]:

procedure TfrmReallocation.btnSetLengthClick(Sender: TObject);
var
  s: String;
  i: Integer;
begin
  SetLength(s, CNumChars);
  for i := 1 to CNumChars do
    s[i] := '*';
end;

Comparing the speed shows that the second approach is significantly faster. It runs in 33 ms instead of the original 142 ms.

A similar situation happens when you are extending a dynamic array. The code triggered by the Append array button shows how an array may be extended by one element at a time in a loop. Admittedly, the code looks very weird as nobody in their right mind would write a loop like this. In reality, however, similar code would be split into multiple longer functions and may be hard to spot:

procedure TfrmReallocation.btnAppendArrayClick(Sender: TObject);
var
  arr: TArray<char>;
  i: Integer;
begin
  SetLength(arr, 0);
  for i := 1 to CNumChars do begin
    SetLength(arr, Length(arr) + 1);
    arr[High(arr)] := '*';
  end;
end;

The solution is similar to the string case. We can preallocate the whole array by calling the SetLength function and then write the data into the array elements. We just have to keep in mind that the first array element always has index 0:

procedure TfrmReallocation.btnSetLengthArrayClick(Sender: TObject);
var
  arr: TArray<char>;
  i: Integer;
begin
  SetLength(arr, CNumChars);
  for i := 1 to CNumChars do
    arr[i-1] := '*';
end;

Improvements in speed are similar to the string demo. The original code needs 230 ms to append ten million elements, while the improved code executes in 26 ms.

The third case when you may want to preallocate storage space is when you are appending to a list. As an example, I'll look into a TList<char> class. Internally, it stores the data in a TArray<char>, so it again suffers from constant memory reallocation when you are adding data to the list.

The short demo code appends 10 million elements to a list. As opposed to the previous array demo, this is completely normal-looking code, found many times in many applications:

procedure TfrmReallocation.btnAppendTListClick(Sender: TObject);
var
  list: TList<char>;
  i: Integer;
begin
  list := TList<char>.Create;
  try
    for i := 1 to CNumChars do
      list.Add('*');
  finally
    FreeAndNil(list);
  end;
end;

To preallocate memory inside a list, you can set the Capacity property to an expected number of elements in the list. This doesn’t prevent the list from growing at a later time; it just creates an initial estimate. You can also use Capacity to reduce memory space used for the list after deleting lots of elements from it.

The difference between a list and a string or an array is that, after setting Capacity, you still cannot access list[i] elements directly. First, you have to Add them, just as if Capacity was not assigned:

procedure TfrmReallocation.btnSetCapacityTListClick(Sender: TObject);
var
  list: TList<char>;
  i: Integer;
begin
  list := TList<char>.Create;
  try
    list.Capacity := CNumChars;
    for i := 1 to CNumChars do
      list.Add('*');
  finally
    FreeAndNil(list);
  end;
end;

Comparing the execution speed shows only a small improvement. The original code executed in 167 ms, while the new version needed 145 ms. The reason for that relatively small change is that TList already manages its storage array. When it runs out of space, it will always at least double the previous size. Internal storage therefore grows from 1 to 2, 4, 8, 16, 32, 64, … elements.

This can, however, waste a lot of memory. In our example, the final size of the internal array is 16,777,216 elements, which is almost 68% more than required. By setting the capacity to the exact required size, we have therefore saved 6,777,216 * SizeOf(Char) bytes, or almost 13 megabytes.
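These numbers are easy to check. The following Python fragment (purely illustrative arithmetic, not Delphi code) replays the doubling strategy described above:

```python
# Model of a list that doubles its capacity whenever it runs out of space,
# growing 1, 2, 4, 8, ... as described in the text.

def final_capacity(num_items, initial=1):
    capacity = initial
    while capacity < num_items:
        capacity *= 2
    return capacity

items = 10_000_000
capacity = final_capacity(items)     # 16,777,216 = 2^24
wasted_elements = capacity - items   # 6,777,216 unused elements
wasted_bytes = wasted_elements * 2   # SizeOf(Char) = 2 bytes
```

The 13,554,432 wasted bytes are the "almost 13 megabytes" mentioned above.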

Other data structures also support the Capacity property. We can find it in TList<T>, TObjectList<T>, TInterfaceList, TStrings, TStringList, TDictionary<K,V>, TObjectDictionary<K,V>, and others.

Memory management functions

Besides the various internal functions that the Delphi runtime library (RTL) uses to manage strings, arrays and other built-in data types, RTL also implements various functions that you can use in your program to allocate and release memory blocks. In the next few paragraphs, I’ll tell you a little bit about them.

Memory management functions can be best described if we split them into a few groups, each including functions that were designed to work together.

The first group includes GetMem, AllocMem, ReallocMem, and FreeMem.

The procedure GetMem(var P: Pointer; Size: Integer) allocates a memory block of size Size and stores the address of this block in the pointer variable P. This variable is not limited to the Pointer type; it can be of any pointer type (for example, PByte).

The new memory block is not initialized and will contain whatever is stored in the memory at that time. Alternatively, you can allocate a memory block with a call to the function AllocMem(Size: Integer): Pointer which allocates a memory block, fills it with zeroes, and then returns its address.

To change the size of a memory block, call the procedure ReallocMem(var P: Pointer; Size: Integer). Variable P must contain a pointer to a memory block, and Size can be either smaller or larger than the original block size. FastMM will try to resize the block in place. If that fails, it will allocate a new memory block, copy the original data into it, and return the address of the new block in P. Just as with GetMem, the newly allocated bytes will not be initialized.

To release memory allocated in this way, you should call the FreeMem(var P: Pointer) procedure.

The second group includes GetMemory, ReallocMemory, and FreeMemory. These three work just the same as the functions from the first group, except that they can also be used from C++Builder.

The third group contains just two functions, New and Dispose.

These two functions can be used to dynamically create and destroy variables of any type. To allocate such a variable, call New(var X: Pointer), where X is again of any pointer type. The compiler will automatically provide the correct size for the memory block and it will also initialize all managed fields to zero. Unmanaged fields will not be initialized.

To release such variables, don’t use FreeMem but Dispose(var X: Pointer). In the next section, I’ll give a short example of using New and Dispose to dynamically create and destroy variables of a record type.

You must never use Dispose to release memory allocated with GetMem or AllocMem. You must also never use FreeMem to release memory allocated with New.

The fourth and last group also contains just two functions, Initialize and Finalize. Strictly speaking, they are not memory management functions.

If you create a variable containing managed fields (for example, a record) with a function other than New or AllocMem, it will not be correctly initialized. Managed fields will contain random data and that will completely break the execution of the program. To fix that, you should call Initialize(var V) passing in the variable (and not the pointer to this variable!). An example in the next section will clarify that.

Before you return such a variable to the memory manager, you should clean up all references to managed fields by calling Finalize(var V). It is better to use Dispose, which will do that automatically, but sometimes that is not an option and you have to do it manually.

Both functions also exist in a form that accepts a number of variables to initialize. This form can be used to initialize or finalize an array of data:

procedure Initialize(var V; Count: NativeUInt);
procedure Finalize(var V; Count: NativeUInt);

In the next section, I’ll dig deeper into the dynamic allocation of record variables. I’ll also show how most of the memory allocation functions are used in practice.

Dynamic record allocation

While it is very simple to dynamically create new objects—you just call the Create constructor—dynamic allocation of records and other data types (arrays, strings …) is a bit more complicated.

In the previous section, we saw that the preferred way of allocating such variables is with the New method. The InitializeFinalize demo shows how this is done in practice.

The code will dynamically allocate a variable of type TRecord. To do that, we need a pointer variable, pointing to TRecord. The cleanest way to do that is to declare a new type PRecord = ^TRecord:

type
  TRecord = record
    s1, s2, s3, s4: string;
  end;
  PRecord = ^TRecord;

Now, we can just declare a variable of type PRecord and call New on that variable. After that, we can use the rec variable as if it were a normal record and not a pointer. Technically, we would have to always write rec^.s1, rec^.s2, and so on, but the Delphi compiler is friendly enough to allow us to drop the ^ character:

procedure TfrmInitFin.btnNewDispClick(Sender: TObject);
var
  rec: PRecord;
begin
  New(rec);
  try
    rec.s1 := '4';
    rec.s2 := '2';
    rec.s4 := rec.s1 + rec.s2 + rec.s4;
    ListBox1.Items.Add('New: ' + rec.s4);
  finally
    Dispose(rec);
  end;
end;

Technically, you could just use rec: ^TRecord instead of rec: PRecord, but it is customary to use explicitly declared pointer types, such as PRecord.

Another option is to use GetMem instead of New, and FreeMem instead of Dispose. In this case, however, we have to manually prepare allocated memory for use with a call to Initialize. We must also prepare it to be released with a call to Finalize before we call FreeMem.

If we use GetMem for allocation, we must manually provide the correct size of the allocated block. In this case, we can simply use SizeOf(TRecord).

We must also be careful with parameters passed to GetMem and Initialize. You pass a pointer (rec) to GetMem and FreeMem and the actual record data (rec^) to Initialize and Finalize:

procedure TfrmInitFin.btnInitFinClick(Sender: TObject);
var
  rec: PRecord;
begin
  GetMem(rec, SizeOf(TRecord));
  try
    Initialize(rec^);
    rec.s1 := '4';
    rec.s2 := '2';
    rec.s4 := rec.s1 + rec.s2 + rec.s4;
    ListBox1.Items.Add('GetMem+Initialize: ' + rec.s4);
  finally
    Finalize(rec^);
    FreeMem (rec);
  end;
end;

This demo also shows how the code doesn't work correctly if you allocate a record with GetMem but then don't call Initialize. To test this, click the third button (GetMem). While in actual code the program may sometimes work and sometimes not, I have taken some care so that in this demo GetMem will always return a memory block that is not initialized to zero, and the program will certainly fail:

It is certainly possible to create records dynamically and use them instead of classes, but one question still remains—why? Why would we want to use records instead of objects when working with objects is simpler? The answer, in one word, is speed.

The demo program, Allocate, shows the difference in execution speed. A click on the Allocate objects button will create ten million objects of type TNodeObj, which is a typical object that you would find in an implementation of a binary tree. Of course, the code then cleans up after itself by destroying all those objects:

type
  TNodeObj = class
    Left, Right: TNodeObj;
    Data: NativeUInt;
  end;

procedure TfrmAllocate.btnAllocClassClick(Sender: TObject);
var
  i: Integer;
  nodes: TArray<TNodeObj>;
begin
  SetLength(nodes, CNumNodes);
  for i := 0 to CNumNodes-1 do
    nodes[i] := TNodeObj.Create;
  for i := 0 to CNumNodes-1 do
    nodes[i].Free;
end;

A similar code, activated by the Allocate records button creates ten million records of type TNodeRec, which contains the same fields as TNodeObj:

type
  PNodeRec = ^TNodeRec;
  TNodeRec = record
    Left, Right: PNodeRec;
    Data: NativeUInt;
  end;

procedure TfrmAllocate.btnAllocRecordClick(Sender: TObject);
var
  i: Integer;
  nodes: TArray<PNodeRec>;
begin
  SetLength(nodes, CNumNodes);
  for i := 0 to CNumNodes-1 do
    New(nodes[i]);
  for i := 0 to CNumNodes-1 do
    Dispose(nodes[i]);
end;

Running both methods shows a big difference. While the class-based approach needs 366 ms to initialize objects and 76 ms to free them, the record-based approach needs only 76 ms to initialize records and 56 ms to free them. Where does that big difference come from?

When you create an object of a class, lots of things happen. Firstly, TObject.NewInstance is called to allocate an object. That method calls TObject.InstanceSize to get the size of the object, then GetMem to allocate the memory and in the end, InitInstance which fills the allocated memory with zeros. Secondly, a chain of constructors is called. After all that, a chain of AfterConstruction methods is called (if such methods exist). All in all, that is quite a process which takes some time.

Much less is going on when you create a record. If it contains only unmanaged fields, as in our example, a GetMem is called and that’s all. If the record contains managed fields, this GetMem is followed by a call to the _Initialize method in the System unit which initializes managed fields.

The problem with records is that we cannot declare generic pointers. When we are building trees, for example, we would like to store some data of type T in each node. The initial attempt at that, however, fails. The following code does not compile with the current Delphi compiler:

type
  PNodeRec<T> = ^TNodeRec<T>;
  TNodeRec<T> = record
    Left, Right: PNodeRec<T>;
    Data: T;
  end;

We can circumvent this by moving the TNodeRec declaration inside the generic class that implements the tree. The following code from the Allocate demo shows how we could declare such an internal type as a generic object and as a generic record:

type
  TTree<T1> = class
  strict private type
    TNodeObj = class
      Left, Right: TNodeObj;
      Data: T1;
    end;
    PNodeRec = ^TNodeRec;
    TNodeRec = record
      Left, Right: PNodeRec;
      Data: T1;
    end;
  end;

If you click the Allocate node button, the code will create a TTree object and then create 10 million class-based nodes and the same amount of record-based nodes. This time, New must initialize the managed field Data: string but the difference in speed is still big. The code needs 669 ms to create and destroy class-based nodes and 133 ms to create and destroy record-based nodes.

Another big difference between classes and records is that each object contains two hidden pointer-sized fields. Because of that, each object is 8 bytes larger than you would expect (16 bytes in 64-bit mode). That amounts to 8 * 10,000,000 bytes or a bit over 76 megabytes. Records are therefore not only faster but also save space!
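The arithmetic behind that last claim can be spelled out. The Python fragment below is just illustrative bookkeeping, working from the statement above that each object carries two hidden pointer-sized fields:

```python
# Per-object overhead in 32-bit mode, following the claim above:
# two hidden pointer-sized fields per object.
HIDDEN_FIELDS = 2
POINTER_SIZE = 4  # bytes in 32-bit mode; would be 8 in 64-bit mode

overhead_per_object = HIDDEN_FIELDS * POINTER_SIZE   # 8 bytes
total_overhead = overhead_per_object * 10_000_000    # for ten million objects
total_megabytes = total_overhead / (1024 * 1024)     # a bit over 76 MB
```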

FastMM internals

To get full speed out of anything, you have to understand how it works, and memory managers are no exception to this rule. To write very fast Delphi applications, you should therefore understand how Delphi's default memory manager works.

FastMM is not just a memory manager—it is three memory managers in one! It contains three significantly different subsystems—small block allocator, medium block allocator, and large block allocator.

The first one, the allocator for small blocks, handles all memory blocks smaller than 2.5 KB. This boundary was determined by observing existing applications. As it turned out, in most Delphi applications, this covers 99% of all memory allocations. This is not surprising, as in most Delphi applications most memory is allocated when an application creates and destroys objects and works with arrays and strings, and those are rarely larger than a few hundred characters.

Next comes the allocator for medium blocks, which are memory blocks with a size between 2.5 KB and 160 KB. The last one, the allocator for large blocks, handles all other requests.

The difference between allocators lies not just in the size of memory that they serve, but in the strategy they use to manage memory.

The large block allocator implements the simplest strategy. Whenever it needs some memory, it gets it directly from Windows by calling VirtualAlloc. This function allocates memory in 4 KB blocks so this allocator could waste up to 4,095 bytes per request. As it is used only for blocks larger than 160 KB, this wasted memory doesn’t significantly affect the program, though.

The medium block allocator gets its memory from the large block allocator. It then carves this larger block into smaller blocks, as they are requested by the application. It also keeps all unused parts of the memory in a linked list so that it can quickly find a memory block that is still free.

The small block allocator is where the real smarts of FastMM lie. There are actually 56 small memory allocators, each serving only one size of memory block. The first one serves 8-byte blocks, the next one 16-byte blocks, followed by allocators for 24, 32, 40, … 256, 272, 288, … 960, 1056, … 2384, and 2608-byte blocks. They all get their memory from the medium block allocator.

If you want to see block sizes for all 56 allocators, open FastMM4.pas and search for SmallBlockTypes.

What that actually means is that each memory allocation request will waste some memory. If you allocate 28 bytes, they’ll be allocated from the 32-byte allocator, so 4 bytes will be wasted. If you allocate 250 bytes, they’ll come from the 256-byte allocator and so on. The sizes of memory allocators were carefully chosen so that the amount of wasted memory is typically below 10%, so this doesn’t represent a big problem in most applications.
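The rounding rule can be sketched in a few lines. The size table below lists only the class sizes quoted in the text (the real table in FastMM4.pas has 56 entries), so this is an illustration of the idea, not FastMM's actual data:

```python
# Round an allocation request up to the nearest small-block size class.
# Deliberately incomplete table: only sizes mentioned in the text.
SIZE_CLASSES = [8, 16, 24, 32, 40, 256, 272, 288, 960, 1056, 2384, 2608]

def size_class_for(request):
    # The smallest class that can hold the request serves it.
    for cls in SIZE_CLASSES:
        if cls >= request:
            return cls
    raise ValueError('not a small block; medium or large allocator takes over')

waste_28 = size_class_for(28) - 28     # 28 bytes come from the 32-byte class
waste_250 = size_class_for(250) - 250  # 250 bytes come from the 256-byte class
```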

Each allocator is basically just an array of equally sized elements (memory blocks). When you allocate a small amount of memory, you’ll get back one element of an array. All unused elements are connected into a linked list so that the memory manager can quickly find a free element of an array when it needs one.
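Such a fixed-size pool with a free list can be modeled compactly. The sketch below is a Python illustration of the idea only, not FastMM's implementation; blocks are represented by their indexes instead of real addresses:

```python
# A pool of equally sized blocks whose free blocks form a linked list,
# like one small block allocator. Illustrative model only.
class FixedPool:
    def __init__(self, num_blocks):
        # All blocks start out free, each linking to the next (None ends it).
        self.next_free = list(range(1, num_blocks)) + [None]
        self.free_head = 0

    def alloc(self):
        # Pop the first free block off the free list.
        if self.free_head is None:
            raise MemoryError('pool exhausted')
        block = self.free_head
        self.free_head = self.next_free[block]
        return block

    def free(self, block):
        # Push the block back onto the front of the free list.
        self.next_free[block] = self.free_head
        self.free_head = block

pool = FixedPool(4)
first = pool.alloc()   # takes the head of the free list
second = pool.alloc()  # takes the next free block
pool.free(first)       # returns the first block to the list
third = pool.alloc()   # reuses the freed block immediately
```

Because the free list is just a chain of indexes, both allocation and release run in constant time.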

The following image shows a very simplified representation of FastMM allocators. Only two small block allocators are shown. Boxes with thick borders represent allocated memory. Boxes with thin borders represent unused (free) memory. Free memory blocks are connected into linked lists. Block sizes in different allocators are not to scale:

Simplified representation of FastMM memory manager

FastMM implements a neat trick which helps a lot when you resize strings or arrays by a small amount.

Well, truth be told, I had to append lots and lots of characters—ten million of them—for this difference to show. If I were appending only a few characters, both versions would run at nearly the same speed. If you can, on the other hand, get your hands on a pre-2006 Delphi and run the demo program there, you'll see that the one-by-one approach runs terribly slowly. The difference in speed would be a few orders of magnitude larger than in my example.

The trick I'm talking about assumes that if you have resized memory once, you'll probably want to do it again soon. If you are enlarging the memory, FastMM will make the new memory block at least twice the size of the original block, plus 32 bytes. The next time you want to resize, FastMM will (hopefully) just update the internal information about the allocated memory and return the same block, knowing that there's enough space at the end.

All that trickery is hard to understand without an example, so here’s one. Let’s say we have a string of 5 characters which neatly fits into a 24-byte block. Sorry, what am I hearing? “What? Why!? 5 unicode characters need only 10 bytes!” Oh, yes, strings are more complicated than I told you before.

In reality, each Delphi UnicodeString and AnsiString contains some additional data besides the actual characters that make up the string. Each string also stores a 4-byte length, a 4-byte reference count, a 2-byte field storing the size of each string character (either 1 for AnsiString or 2 for UnicodeString), and a 2-byte field storing the character code page. In addition to that, each string includes a terminating Chr(0) character. For a 5-character string, this gives us 4 (length) + 4 (reference count) + 2 (character size) + 2 (code page) + 5 (characters) * 2 (size of a character) + 2 (terminating Chr(0)) = 24 bytes.
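The same sum, written out as code (plain arithmetic in Python, mirroring the derivation above term by term):

```python
# Memory footprint of a 5-character UnicodeString.
length_field   = 4  # 4-byte length of string
refcount_field = 4  # 4-byte reference count
elemsize_field = 2  # size of each character (2 for UnicodeString)
codepage_field = 2  # character code page
char_size      = 2  # each UnicodeString character takes 2 bytes
num_chars      = 5

total = (length_field + refcount_field + elemsize_field + codepage_field
         + num_chars * char_size   # the characters themselves
         + char_size)              # terminating Chr(0)
```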

When you add one character to this string, the code will ask the memory manager to enlarge a 24-byte block to 26 bytes. Instead of returning a 26-byte block, FastMM will round that up to 2 * 24 + 32 = 80 bytes. Then it will look for an appropriate allocator, find one that serves 80-byte blocks (great, no memory loss!) and return a block from that allocator. It will, of course, also have to copy data from the original block to the new block.

This formula, 2 * size + 32, is used only in small block allocators. A medium block allocator only overallocates by 25%, and a large block allocator doesn’t implement this behavior at all.

The next time you add one character to this string, FastMM will just look at the memory block, determine that there's still enough space inside this 80-byte memory block, and return the same memory. This will continue for quite some time, while the string grows in two-byte increments to fill the 80-byte block. After that, the block will be resized to 2 * 80 + 32 = 192 bytes (yes, there is an allocator for this size), the data will be copied, and the game will continue.
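The growth sequence can be replayed with the formula alone. The Python fragment below is only a model of the behavior described above; it ignores everything else FastMM does:

```python
def grown_size(old_size):
    # Small-block over-allocation rule: at least twice the old size plus 32.
    return 2 * old_size + 32

# A 5-character string occupies a 24-byte block. Appending characters
# one at a time (2 bytes each) only reallocates when the block overflows.
block = 24
needed = 24
resizes = 0
while needed < 192:
    needed += 2               # append one more character
    if needed > block:
        block = grown_size(block)
        resizes += 1
```

Appending 84 characters triggers only two reallocations (24 to 80 bytes, then 80 to 192 bytes) instead of 84.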

This behavior indeed wastes some memory but, under most circumstances, significantly boosts the speed of code that was not written with speed in mind.

Memory allocation in a parallel world

We’ve seen how FastMM boosts the reallocation speed. The life of a memory manager is simple when there is only one thread of execution inside a program. When the memory manager is dealing out the memory, it can be perfectly safe in the knowledge that nothing can interrupt it in this work.

When we deal with parallel processing, however, multiple paths of execution simultaneously execute the same program and work on the same data. Because of that, life from the memory manager’s perspective suddenly becomes very dangerous.

For example, let's assume that one thread wants some memory. The memory manager finds a free memory block on a free list and prepares to return it. At that moment, however, another thread also needs some memory from the same allocator. This second thread (running in parallel with the first one) would also find a free memory block on the free list. If the first thread hasn't yet updated the free list, that may even be the same memory block! That can only result in one thing—complete confusion and crashing programs.

It is extremely hard to write code that manipulates data structures (such as a free list) in a manner that functions correctly in a multithreaded world. So hard that FastMM doesn't even try. Instead, it regulates access to each allocator with a lock. Each of the 56 small block allocators gets its own lock, as do the medium and large block allocators.

When a program needs some memory from, say, the 16-byte allocator, FastMM will lock this allocator until the memory is returned to the program. If, during this time, another thread requests memory from the same 16-byte allocator, it will have to wait until the first thread finishes.

This indeed fixes all problems but introduces a bottleneck—a part of the code where threads must wait to be processed in a serial fashion. If threads do lots of memory allocation, this serialization will completely negate the speed-up that we expected to get from the parallel approach. Such a memory manager would be useless in a parallel world.

To fix that, FastMM introduces memory allocation optimization which only affects small blocks.

When accessing a small block allocator, FastMM will try to lock it. If that fails, it will not wait for the allocator to become unlocked but will instead try to lock the allocator for the next block size. If that succeeds, it will return memory from this second allocator. That wastes some memory but helps with execution speed. If the second allocator also cannot be locked, FastMM will try to lock the allocator for the block size after that. If this third allocator can be locked, you'll get memory from it. Otherwise, FastMM will repeat the process from the beginning.

This process can be described with the following pseudo-code:

allocIdx := find best allocator for the memory block
repeat
  if can lock allocIdx then
    break;
  Inc(allocIdx);
  if can lock allocIdx then
    break;
  Inc(allocIdx);
  if can lock allocIdx then
    break;
  Dec(allocIdx, 2)
until false
allocate memory from allocIdx allocator
unlock allocIdx

A careful reader would notice that this code fails when the first line finds the last allocator in the table or the one before that. Instead of adding some conditional code to work around the problem, FastMM rather repeats the last allocator in the list three times. The table of small allocators actually ends with the following sizes: 1,984; 2,176; 2,384; 2,608; 2,608; 2,608. When requesting a block size above 2,384 the first line in the pseudo-code above will always find the first 2,608 allocator, so there will always be two more after it.
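The pseudo-code, together with the tripled last entry, can be turned into a small runnable model. Python's Lock.acquire(blocking=False) provides the same try-and-fail locking the sketch relies on; everything else here is a simplification invented for this illustration:

```python
import threading

# Tail of the size table, with the tripled last allocator described above.
# A model of the lock-retry loop, not FastMM's actual code.
SIZES = [1984, 2176, 2384, 2608, 2608, 2608]
LOCKS = [threading.Lock() for _ in SIZES]

def lock_an_allocator(alloc_idx):
    # Try the best-fitting allocator and the next two; start over from the
    # best fit if all three are busy. Never indexes past the table's end,
    # because the last real allocator is repeated three times.
    while True:
        for offset in range(3):
            if LOCKS[alloc_idx + offset].acquire(blocking=False):
                return alloc_idx + offset  # caller must release this lock

# Simulate another thread holding the best-fitting allocator locked.
LOCKS[3].acquire()
got = lock_an_allocator(3)  # falls through to the next 2608-byte allocator
LOCKS[got].release()
LOCKS[3].release()
```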

This approach works great when memory is allocated but hides another problem. And how can I better explain a problem than with a demonstration …?

An example of this problem can be found in the ParallelAllocation program. If you run it and click the Run button, the code will compare the serial version of an algorithm with a parallel one. I'm aware that I have not explained parallel programming at all, but the code is so simple that even somebody with no understanding of the topic can guess what it does.

The core of the test runs a loop that calls the Execute method on all objects in a list. If the parallelTest flag is set, the loop is executed in parallel; otherwise, it is executed serially. The only mysterious part of the code, TParallel.For, does exactly what it says: it executes a for loop in parallel.

if parallelTest then
  TParallel.For(0, fList.Count - 1,
    procedure(i: integer)
    begin
      fList[i].Execute;
    end)
else
  for i := 0 to fList.Count - 1 do
    fList[i].Execute;

If you run the program, make sure that you execute it without the debugger (Ctrl + Shift + F9 does that). Running with the debugger slows down parallel execution and can skew the measurements.

On my test machine I got the following results:

In essence, parallelizing the program made it almost 4 times faster. Great result!

Well, no. Not a great result. You see, the machine I was testing on has 12 cores. If all of them were running in parallel, I would expect an almost 12x speed-up, not a mere 4x improvement!

If you take a look at the code, you'll see that each Execute allocates a ton of objects. It is obvious that the problem lies in the memory manager. The question remains, though: where exactly does the problem lie, and how can we find it?

I ran into exactly the same problem a few years ago. A highly parallel application which processed gigabytes and gigabytes of data was not running fast enough. There were no obvious problem points and I suspected that the culprit was FastMM. I tried swapping the memory manager for a more multithreading-friendly one and, indeed, the problem was somewhat reduced, but I still wanted to know where the original sin lay in my code. I also wanted to continue using FastMM, as it offers great debugging tools.

In the end, I found no other solution than to dig in the FastMM internals, find out how it works, and add some logging there. More specifically, I wanted to know when a thread is waiting for a memory manager to become unlocked. I also wanted to know at which locations in my program this happens the most.

To cut a (very) long story short, I extended FastMM with support for this kind of logging. This extension was later integrated into the main FastMM branch. As these changes are not included in Delphi, you have to take some steps to use this code.

Firstly, you have to download FastMM from the official repository at https://github.com/pleriche/FastMM4. Then you have to unpack it somewhere on the disk and add FastMM4 as the first unit in the project file (.dpr). For example, the ParallelAllocation program starts like this:

program ParallelAllocation;
uses
  FastMM4 in 'FastMM\FastMM4.pas',
  Vcl.Forms,
  ParallelAllocationMain in 'ParallelAllocationMain.pas' {frmParallelAllocation};

When you have done that, first rebuild your program and test whether everything still works. (It should, but you never know …)

To enable the memory manager logging, you have to define a conditional symbol LogLockContention, rebuild (as FastMM4 has to be recompiled) and, of course, run the program without the debugger.
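The symbol can be defined either in the project's conditional defines (Project | Options | Delphi Compiler | Conditional defines) or, if your copy of FastMM includes it, directly in FastMM4Options.inc:

```pascal
// In FastMM4Options.inc, the option is typically present but disabled by a
// dot in front of $define; removing the dot activates it:
//   {.$define LogLockContention}   becomes   {$define LogLockContention}
// Alternatively, add LogLockContention to the project's conditional defines
// so that FastMM4.pas sees it when it is recompiled.
```

Either way, a full rebuild (not just compile) is needed so FastMM4.pas picks up the define.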

If you do that, you’ll see that the program runs quite a bit slower than before. On my test machine, the parallel version was only 1.6x faster than the serial one. The logging takes its toll, but that is not important. The important part will appear when you close the program.

At that point, the logger will collect all results and sort them by frequency. The 10 most frequent sources of locking in the program will be saved to a file called _MemoryManager_EventLog.txt. You will find it in the folder with the .exe. The three most frequent sources of locking will also be displayed on the screen.

The following screenshot shows a cropped version of this log. Some important parts are marked out:

For starters, we can see that at this location the program waited 19,020 times for a memory manager to become unlocked. Next, we can see that the memory function that caused the problem was FreeMem. Furthermore, we can see that somebody tried to delete from a list (InternalDoDelete) and that this deletion was called from TSpeedTest.Execute, line 130. FreeMem was called because the list in question is actually a TObjectList and deleting elements from the list caused it to be destroyed.

The most important part here is the memory function causing the problem—FreeMem. Of course! Allocations are optimized. If an allocator is locked, the next one will be used and so on. Releasing memory, however, is not optimized! When we release a memory block, it must be returned to the same allocator that it came from. If two threads want to release memory to the same allocator at the same time, one will have to wait.
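The allocation-side optimization described above can be sketched roughly like this (hypothetical names, not the real FastMM4 code; the real implementation only tries a few sibling allocators before falling back to waiting):

```pascal
// Sketch: GetMem does not insist on one particular small-block allocator.
// If the preferred allocator for this block size is locked, it simply
// moves on to another allocator that serves the same size class.
function GetMemFromAnyAllocator(size: NativeInt): Pointer;
var
  idx: integer;
begin
  idx := AllocatorIndexForSize(size);
  while not TryLockAllocator(Allocators[idx]) do
    idx := NextAllocatorForSize(idx);   // try a sibling allocator instead
  try
    Result := InternalGetMem(Allocators[idx], size);
  finally
    UnlockAllocator(Allocators[idx]);
  end;
end;
```

FreeMem has no such freedom: the block must go back to the allocator it came from, which is exactly where the contention appears.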

I had an idea of how to improve this situation: adding a small stack (called a release stack) to each allocator. When FreeMem is called and cannot lock the allocator, the address of the memory block to be released is stored on that stack and FreeMem quickly exits.

When a FreeMem call successfully locks an allocator, it first releases its own memory block. Then it checks whether anything is waiting on the release stack and releases those memory blocks too (if there are any).
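The idea can be sketched as follows (a simplified illustration, not the actual FastMM4 source; the allocator record and helper routines are hypothetical stand-ins for the real internals):

```pascal
// Sketch of the release-stack idea: instead of blocking when the target
// allocator is locked, park the block on that allocator's release stack.
procedure FreeMemWithReleaseStack(var allocator: TBlockAllocator; p: Pointer);
var
  waiting: Pointer;
begin
  if not TryLockAllocator(allocator) then
  begin
    // Allocator is busy: leave the block for whoever holds the lock.
    allocator.ReleaseStack.Push(p);
    Exit;
  end;
  try
    InternalFreeMem(allocator, p);          // release our own block first
    while allocator.ReleaseStack.Pop(waiting) do
      InternalFreeMem(allocator, waiting);  // drain blocks parked by others
  finally
    UnlockAllocator(allocator);
  end;
end;
```

The trade-off is that parked blocks stay allocated until the next thread drains the stack, which is why this option increases memory consumption, as noted below.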

This change is also included in the main FastMM branch, but it is not activated by default as it increases the overall memory consumption of the program. However, in some situations, it can do miracles and if you are developing multithreaded programs you certainly should test it out.

To enable release stacks, open the project settings for the program, remove the conditional define LogLockContention (as that slows the program down) and add the conditional define UseReleaseStack. Rebuild, as FastMM4.pas has to be recompiled.

On my test machine, I got much better results with this option enabled. Instead of a 3.9x speed-up, the parallel version was 6.3x faster than the serial one. The factor is still not close to 12x, as the threads still fight over memory, but the improvement is significant.


That is as far as FastMM will take us. For a faster execution, we need a more multithreading-friendly memory manager.

To summarize, this article covered the memory management techniques used by Delphi. We looked into allocation, contention, and the internals of FastMM in the context of efficient parallel programming.

If you found this post useful, do check out the book Delphi High Performance to learn more about the intricacies of high-performance programming with Delphi.
