A brief look at C++0x move semantics performance

November 29, 2008 by Tor Brede Vekterli

In my earlier post about the Visual Studio 2010 Community Tech Preview and its inclusion of a few C++0x features, I briefly touched upon rvalues and move semantics and how they can be used to dramatically speed up the passing of temporary objects. In an attempt to demonstrate, I've whipped up a tiny—admittedly highly unscientific—benchmark just to show the difference between good old copy construction and fancy, shiny new move construction.

Method

Since the VS 2010 CTP comes in a virtual machine image, that is also where the code has been tested. As such, the performance and timings are not native. This should not matter, however, as we're not interested in raw numbers but rather in the difference between the timings of copy-construction vs. move-construction. Please also note that only a debug build without named return-value optimization (NRVO) has been tested, as that is more likely to be free of clever compiler optimizations that might skew the results.

A very simple, naïve fixed vector class was used for the testing, with a conditional compile flag for whether or not move semantics should be used. Each regular construction of the vector incurs a heap allocation, and each regular copy-construction incurs a heap allocation and an element-wise copy.

The relevant move semantics code is as follows (the move assignment operator is included but not actually exercised here, since the exact same performance principles apply):

// Move constructor: steal the temporary's buffer instead of copying it.
fixed_vector(fixed_vector&& other)
  : buf_(other.buf_), size_(other.size_)
{
  // Null out the source so its destructor won't free the buffer we took.
  other.buf_ = 0;
  other.size_ = 0;
}

// Move assignment: swap buffers; the temporary's destructor frees ours.
fixed_vector& operator=(fixed_vector&& other)
{
  using std::swap;
  swap(buf_, other.buf_);
  swap(size_, other.size_);
  return *this;
}

For those who haven't seen this kind of reference syntax before, I refer you to this article by Bjarne Stroustrup et al. or this in-depth, awesome rvalue reference article on MSDN. In a nutshell, the rvalue reference allows us to modify temporary objects. For the move constructor, we know that the current (i.e. the to-be-constructed) vector does not have a buffer allocated yet, so we can safely overwrite our own pointer with the temporary object's pointer. However (and this is where rvalue references differ from lvalue references), we will in addition set the temporary object's buffer pointer to null and its size to zero, effectively invalidating its contents. This means that when the temporary object is destroyed, no heap memory has to be deallocated at all.

Similarly for the move assignment operator, we swap the values of the vectors' buffer pointers (not the buffers' actual elements) and size member variables. This transparently handles taking over the temporary object's buffer as well as safely disposing of any current buffer data we might have, since that pointer is now the property of the temporary object and will subsequently be deallocated when that object is destroyed.

To be able to vary the depth of return levels, the following simple recursion template was used:

template <size_t RecurCalls>
struct recur_val
{
  static fixed_vector<int> invoke(size_t elems)
  {
    fixed_vector<int> mt(recur_val<RecurCalls-1>::invoke(elems));
    return mt; // Avoids RVO
  }
};

template <>
struct recur_val<0>
{
  static fixed_vector<int> invoke(size_t elems)
  {
    fixed_vector<int> mt(elems);
    return mt;
  }
};

Here, calling recur_val<3>::invoke(10) would recurse down to a depth of 4, return by value a fixed_vector of 10 elements and propagate it up the call chain. Since the biggest cost of copy-construction is, unsurprisingly, the actual copying, the number of elements is the key factor for showing the importance of move semantics.

Another very similar template was used to test how well returning an auto_ptr fared against copy/move construction (sort of as a baseline check).

Results

The timing was done by running 100,000 vector creations in recursive calls of depth 6 and then recording the total time spent. Each vector was given a randomly chosen size in the range [4096, 8192] ints. The results will naturally deviate somewhat for each run, but they show the general pattern.

Total time (lower is better):

- copy constructor: 66.3s
- auto_ptr return: 5.6s
- move constructor: 4.5s

Again, since the run was done with a debug build on a virtual machine, the actual numbers themselves aren't important, just their relative magnitudes. As one might imagine, regular copy construction took quite some time to complete; over a minute more than auto_ptr and move construction! Move construction consistently took slightly less time than auto_ptr, despite being more complex than auto_ptr's copy-constructor. This is of course due to the latter's need for a dynamic allocation of the vector object itself. In a more realistic usage scenario, the cost of constant small allocations would most likely be quite a bit higher, and consequently so would the benefit of move construction. If the number of bytes allocated per vector were to increase, the copy construction time would keep on growing while the others would stay roughly constant.

The very fact that such a linear increase in time and temporary memory usage can be easily avoided without additional heap allocations or code complexity should be reason enough to welcome move semantics with open arms.

The source code for the test is available here (requires the VC++2010 CTP).

As always, comments and/or corrections are very welcome!
