c++: inlining code away

Thu Jul 1 21:57:32 CDT 2004

Good day,

Sometimes you want the same line of source code to do something in
one case and completely disappear in the other. assert() is a good
example. The common way of achieving this in a C/C++ environment is
to use the preprocessor. With this approach, however, you are stuck
with the function call notation. For assert() that's exactly what we
need but what if we want code like

  a + b

or even

  cerr << "allocation " << size << " bytes" << endl;

to go away? We could write

  if_debug (cerr << "allocation " << size << " bytes" << endl;)

or something along those lines but it doesn't look very appealing.
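
For reference, here is a minimal sketch of what such a macro could look
like. The switch name DEBUG_TRACE and the helper allocate_traced() are
made up for this illustration, and the output goes to a string stream
instead of cerr so that the effect is easy to check:

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical switch: set to 0 to make if_debug() expand to nothing.
#define DEBUG_TRACE 1

#if DEBUG_TRACE
#  define if_debug(code) code
#else
#  define if_debug(code)
#endif

// Hypothetical helper: returns whatever the wrapped statement wrote.
std::string
allocate_traced (std::size_t size)
{
  std::ostringstream log;

  // The whole statement, including the semicolon, is the macro argument.
  if_debug (log << "allocation " << size << " bytes";)

  return log.str ();
}
```

Besides the looks, note that an unparenthesized comma inside the wrapped
code would be taken as a macro argument separator, which is one more
reason to search for something better.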

To restate the problem: can we achieve conditional compilation for
operator-based language constructs like cerr << "hello"? The answer
could be a do-nothing inline function and a C++ compiler optimization.
But will this actually work? That's what we are going to test today.

Our example will be a piece of a tracing facility I've been playing with
lately. One of the requirements was to be able to turn tracing
completely off with zero overhead in the resulting code. Also, I didn't
want to impose any notational burden on the user, so I decided to provide
an interface similar to the one in the iostream library:

  tout << "operator new (" << size << "): " << p;

Even though the code snippet above looks very innocent, there are quite
a lot of things going on under the hood. While the inner workings of
the tracing facility are not the topic of this essay, the number of
actions performed under the hood is quite relevant to our discussion.
Therefore, I am going to provide a quick overview of how everything
works. There are two main objects involved: a record and a stream.
The records are traced into the stream:

  class record
  {
  public:

    // ...

    template <typename x>
    record&
    operator<< (x const& arg);
  };

  class stream
  {
  public:

    // ...

    stream&
    operator<< (record const& r);
  };

Having these definitions we can write something like this:

  stream tout;

  record r;
  r << "operator new (" << size << "): " << p;
  tout << r;

Or even this:

  tout << (record () << "operator new (" << size << "): " << p);
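
To make the two forms above concrete, here is a small self-contained
sketch. The bodies are my assumptions for the sake of illustration, not
the real implementation: the record collects its text in a std::string,
the stream writes each record as one line to a std::ostream, and the
helper trace_both_ways() exists only to exercise them. The pointer
argument is left out so that the output is predictable:

```cpp
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>

// Assumption: a record simply accumulates formatted text.
class record
{
public:
  template <typename x>
  record&
  operator<< (x const& arg)
  {
    std::ostringstream os;
    os << arg;
    text_ += os.str ();
    return *this;
  }

  std::string const&
  text () const
  {
    return text_;
  }

private:
  std::string text_;
};

// Assumption: a stream prints every record on a line of its own.
class stream
{
public:
  explicit
  stream (std::ostream& os)
      : os_ (os)
  {
  }

  stream&
  operator<< (record const& r)
  {
    os_ << r.text () << '\n';
    return *this;
  }

private:
  std::ostream& os_;
};

// Hypothetical helper: traces the same message using both forms.
std::string
trace_both_ways (std::size_t size)
{
  std::ostringstream out;
  stream tout (out);

  record r;
  r << "operator new (" << size << ")";
  tout << r;

  tout << (record () << "operator new (" << size << ")");

  return out.str ();
}
```

Both forms produce the same line, and both make the caller spell out
the record.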

It is not exactly what we want, however. We would like the temporary
record to be automatically created for us:

  class stream
  {

    // ...

  private:
    class mediator
    {
    public:
      mediator (stream& s)
        : s_ (s)
      {
      }

      ~mediator ()
      {
        s_ << r_;
      }

      stream& s_;
      record  r_;
    };

    friend record&
    operator<< (mediator const& mc, char const* s)
    {
      mediator& m (const_cast<mediator&> (mc));
      return m.r_ << s;
    }

    template <typename x>
    friend record&
    operator<< (mediator const& mc, x const& arg)
    {
      mediator& m (const_cast<mediator&> (mc));
      return m.r_ << arg;
    }
  };

Do you see how this works? Let's start from a simple example and walk
through it step-by-step:

  tout << "hello";

When the compiler sees this line, it must decide which operator<< to
call. Let's see what choices it has:

 stream& stream::
 operator << (record const& r);

This one doesn't work since "hello" is of type char const [6], not
record, and there is no conversion from char const[6] (or decayed
char const*) to the type record.

 record&
 operator<< (mediator const& mc, char const* s);

The second argument matches after decaying to char const*. The first
formal argument is of type mediator, and the stream can be implicitly
converted to the type mediator (see mediator::mediator (stream&)).
We've got the match. In order to make the call the compiler creates
a temporary of type mediator and passes a const reference to it as
the first actual argument. The generated code will be something
equivalent to this:

  {
    mediator m (tout);
    operator<< (m, "hello");
  }

And our original example

  tout << "operator new (" << size << "): " << p;

will be turned into this:

  {
    mediator m (tout);
    operator<< (m, "operator new (").operator<< (size).
      operator<< ("): ").operator<< (p);
  }

The innocent-looking piece of code turned out to do quite a lot:
the compiler has to create the temporary (with all the constructors)
and then call a number of functions, each of which depends on the return
value of the previous one.
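
The walk-through above can be checked with a self-contained sketch.
As before, the bodies of record and stream are my assumptions (text
collected in a std::string, one line per record written to a
std::ostream), and trace_with_mediator() is a made-up helper. The part
that matters is that the mediator temporary is destroyed at the end of
the full expression, which is when the record reaches the stream:

```cpp
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>

// Assumption: a record simply accumulates formatted text.
class record
{
public:
  template <typename x>
  record&
  operator<< (x const& arg)
  {
    std::ostringstream os;
    os << arg;
    text_ += os.str ();
    return *this;
  }

  std::string const&
  text () const
  {
    return text_;
  }

private:
  std::string text_;
};

class stream
{
public:
  explicit
  stream (std::ostream& os)
      : os_ (os)
  {
  }

  stream&
  operator<< (record const& r)
  {
    os_ << r.text () << '\n';
    return *this;
  }

private:
  class mediator
  {
  public:
    mediator (stream& s)
        : s_ (s)
    {
    }

    // Runs at the end of the full expression and flushes the record.
    ~mediator ()
    {
      s_ << r_;
    }

    stream& s_;
    record  r_;
  };

  friend record&
  operator<< (mediator const& mc, char const* s)
  {
    return const_cast<mediator&> (mc).r_ << s;
  }

  template <typename x>
  friend record&
  operator<< (mediator const& mc, x const& arg)
  {
    return const_cast<mediator&> (mc).r_ << arg;
  }

  std::ostream& os_;
};

// Hypothetical helper: each statement should end up as one line.
std::string
trace_with_mediator (std::size_t size)
{
  std::ostringstream out;
  stream tout (out);

  tout << "hello";
  tout << "operator new (" << size << ")";

  return out.str ();
}
```

Each statement yields exactly one line even though no record is ever
mentioned by the caller.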

Now let's go back to our zero-overhead problem: if we provide
a do-nothing inlined implementation, will the compiler be able to
optimize the whole thing away?

Here is our zero-overhead implementation:

  class record
  {
  public:

    // ...

    template <typename x>
    record&
    operator<< (x const& arg)
    {
      return *this;
    }
  };

  class stream
  {

    // ...

  private:
    class mediator
    {
    public:
      mediator (stream& s)
        : s_ (s)
      {
      }

      ~mediator ()
      {
      }

      stream& s_;
      record  r_;
    };

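    // Return the embedded record so that the rest of the chain
    // (record::operator<<) has something to bind to.
    //
    friend record&
    operator<< (mediator const& mc, char const* s)
    {
      return const_cast<mediator&> (mc).r_;
    }

    template <typename x>
    friend record&
    operator<< (mediator const& mc, x const& arg)
    {
      return const_cast<mediator&> (mc).r_;
    }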
  };

Even though it's a do-nothing implementation, we are still performing
some initializations and returning some values. Therefore, it's not quite
obvious that the compiler will be able to figure out that all those
actions don't produce anything.

Our test case will be a simple function whose assembler code we are
going to inspect:

  stream tout;

  int
  bar (size_t size, void* p)
  {
    tout << "operator new (" << size << "): " << p;
    return 0;
  }

Here is the assembler code for this function when compiled by
g++ 3.4.0 with -O2:

.globl _Z3barmPv
	.type	_Z3barmPv, @function
_Z3barmPv:
.LFB1528:
.L11:
	xorl	%eax, %eax
	ret
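
If you would like to try this yourself, the fragments can be put
together into a single translation unit along the following lines. This
is a sketch: the parts elided with "// ..." in the listings are simply
left out, and I assume nothing else in them affects the result. Feeding
it to g++ -O2 -S should give you something to compare against, though
the exact output will depend on the compiler version and target:

```cpp
#include <cstddef>

// Do-nothing record: every insertion just returns the record.
class record
{
public:
  template <typename x>
  record&
  operator<< (x const&)
  {
    return *this;
  }
};

// Do-nothing stream: the mediator is still created and destroyed,
// but neither step has any observable effect.
class stream
{
private:
  class mediator
  {
  public:
    mediator (stream& s)
        : s_ (s)
    {
    }

    ~mediator ()
    {
    }

    stream& s_;
    record  r_;
  };

  friend record&
  operator<< (mediator const& mc, char const*)
  {
    return const_cast<mediator&> (mc).r_;
  }

  template <typename x>
  friend record&
  operator<< (mediator const& mc, x const&)
  {
    return const_cast<mediator&> (mc).r_;
  }
};

stream tout;

int
bar (std::size_t size, void* p)
{
  tout << "operator new (" << size << "): " << p;
  return 0;
}
```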

For comparison, here is the same function compiled with -g instead of -O2:

.globl _Z3barmPv
	.type	_Z3barmPv, @function
_Z3barmPv:
.LFB1496:
	.loc 2 26 0
	pushq	%rbp
.LCFI6:
	movq	%rsp, %rbp
.LCFI7:
	pushq	%rbx
.LCFI8:
	subq	$72, %rsp
.LCFI9:
	movq	%rdi, -24(%rbp)
	movq	%rsi, -32(%rbp)
.LBB6:
	.loc 2 27 0
	leaq	-64(%rbp), %rdi
	movl	$tout, %esi
	call	_ZN4cult5trace6stream8mediatorC1ERS1_
	leaq	-64(%rbp), %rdi
	movl	$.LC1, %esi
	call	_ZN4cult5tracelsERKNS0_6stream8mediatorEPKc
	movq	%rax, %rdi
	leaq	-24(%rbp), %rsi
.LEHB0:
	call	_ZN4cult5trace6recordlsImEERS1_RKT_
	movq	%rax, %rdi
	movl	$.LC2, %esi
	call	_ZN4cult5trace6recordlsIA4_cEERS1_RKT_
	movq	%rax, %rdi
	leaq	-32(%rbp), %rsi
	call	_ZN4cult5trace6recordlsIPvEERS1_RKT_
.LEHE0:
	jmp	.L13
.L16:
	movq	%rax, -72(%rbp)
.L12:
	movq	-72(%rbp), %rbx
	leaq	-64(%rbp), %rdi
	call	_ZN4cult5trace6stream8mediatorD1Ev
	movq	%rbx, -72(%rbp)
.L14:
	movq	-72(%rbp), %rdi
.LEHB1:
	call	_Unwind_Resume
.LEHE1:
.L13:
	leaq	-64(%rbp), %rdi
	call	_ZN4cult5trace6stream8mediatorD1Ev
	.loc 2 28 0
	movl	$0, %eax
.LBE6:
	.loc 2 29 0
	addq	$72, %rsp
	popq	%rbx
	leave
	ret

I also ran this test on Intel C++ with the same results. This shows
that contemporary compilers are smart enough to make the technique
of inlining code away practical. Keep in mind, however, that in order
for this technique to work, the compiler must be able to see
through the function calls all the way down to elementary operations.
In particular, if you have a call to a non-inline function as part of
your expression, there is nothing the compiler can do about it except
make the call. To illustrate, consider this code fragment:

  stream tout;

  char const*
  foo ();

  int
  bar (size_t size, void* p)
  {
    tout << foo () << size << p;
    return 0;
  }

When compiled by g++ 3.4.0 with -O2:

.globl _Z3barmPv
	.type	_Z3barmPv, @function
_Z3barmPv:
.LFB1527:
	subq	$40, %rsp
.LCFI0:
	movq	tout(%rip), %rax
	movq	$tout, (%rsp)
	movq	%rax, 8(%rsp)
	movl	tout+8(%rip), %eax
	movl	%eax, 16(%rsp)
.LEHB0:
	call	_Z3foov
.LEHE0:
	xorl	%eax, %eax
	addq	$40, %rsp
	ret

This is because a C/C++ compiler cannot assume anything about a
function whose definition it cannot see: the call may have side effects,
so it has to be made. Using GCC's function attributes we can declare
that our function is "pure", which means it has no side effects and
consequently can be called fewer times than the program says:

  char const*
  foo () __attribute__ ((pure));

With this hint GCC eliminates the call:

.globl _Z3barmPv
	.type	_Z3barmPv, @function
_Z3barmPv:
.LFB1528:
.L11:
	xorl	%eax, %eax
	ret
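
The same caveat can be observed at run time without reading assembler.
In the sketch below null_stream is a made-up, stripped-down stand-in
for the do-nothing stream/mediator/record trio, and foo() counts its
own invocations. Since the counter update is a side effect, the call
has to stay no matter how thoroughly the rest is inlined away:

```cpp
#include <cstddef>

// Hypothetical stand-in for the do-nothing tracing stream.
struct null_stream
{
  template <typename x>
  null_stream&
  operator<< (x const&)
  {
    return *this;
  }
};

null_stream tout;

// Number of times foo() has been called.
static int calls = 0;

// Not pure: it modifies a global, so the compiler must keep the call.
char const*
foo ()
{
  ++calls;
  return "foo";
}

// Hypothetical helper: reports how many times foo() actually ran.
int
count_calls (std::size_t size)
{
  calls = 0;
  tout << foo () << size;
  return calls;
}
```

Here count_calls() reports one call even though the tracing itself does
nothing. In other words, only expressions free of side effects can
disappear together with the trace statement.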

If you have made it this far, thank you for your time. Permission is
granted to copy, distribute and/or modify this document under the terms
of the GNU Free Documentation License, Version 1.2; with no Invariant
Sections, no Front-Cover Texts, and no Back-Cover Texts.
