Introduction

It is known that non-recursive makefile architectures have a higher degree of parallelism than recursive ones. It is not clear just how much higher. This document presents results of benchmarking recursive and non-recursive build systems based on GNU make. Both build systems were used to build the same project under the same conditions.

Description of the benchmark, the build systems and hardware are in the Benchmark section, results of the benchmark are in the Results section, and finally some thoughts and interpretation of the results are presented in the Comments sections.

Benchmark

I believe the benchmark to be accurate; the conditions and tasks performed by both build systems are the same except where recursive vs non-recursive design comes to play. For example, both build systems use so-called advanced auto-dependency generation technique. If you find any inaccuracy or unfairness towards one build system or the other, please let me know and I will try to fix it.

The benchmark is based on the source code of CIDL compiler. According to sloccount, it contains 277 source code files of which 102 are translation units with a total of 27,589 SLOC (source lines of code). The source code is split over 15 directories with the translation-units-per-directory ratio being about 7.

The build systems being tested are a traditional recursive build system distributed as part of Utility library and a non-recursive build system called build

If you want to reproduce the benchmark you will need to install a few third-party libraries (described in CCF/Documentation/Build.html in archive above) as well as the build systems.

The tests were run on four different machines which are described below.

P1: IBM ThinkPad T41p laptop with single Pentium M 1.6GHz, 1MB L2 cache, 512MB RAM. P1 is running Debian GNU/Linux with 2.4.21 kernel. Build partition is ext2 mounted with noatime option.
X2: Dell Precision 650 workstation with dual Pentium Xeon 2.4GHz HT, 512KB L2 cache, 1GB RAM. X2 is running Debian GNU/Linux with 2.6.4 kernel. Build partition is reiserfs on IDE.
X4: Dell BigIron with quad Pentium Xeon MP 1.9GHz HT, 512KB L2 cache, 2GB RAM. X4 is running Debian GNU/Linux with 2.4.26 kernel. Build partition is xfs on SCSI.
D6: X2 and X4 combined in a distcc build over 100MB ethernet. Localhost is X2 and DISTCC_HOSTS="X2/2 X4/10". Both 3/rest and 1/rest splits were also tested with the results worse that the 2/rest split.

Results

Up-to-date check

This test consists of running make on an up-to-date build and measures the time it takes make to discover the build is up-to-date. You can view this result as the amount of time between when you start make and when the first compilation command starts execution.

build machine	recursive	non-recursive
P1	1.79	1.57
X2	5.51 (-j 3)	0.70
X4	1.04 (-j 3)	0.80

Build from ground up

This test measures the time it takes to build everything from scratch.

machine	recursive	non-recursive
P1	597.34	598.46

machine	-j	recursive	non-recursive
X2	-j 2	178.91	165.80
	-j 3	174.16	152.81
	-j 4	178.92	152.77
	-j 5	179.90	152.06
	-j 6	207.48	192.39

machine	-j	recursive	non-recursive
X4	-j 4	177.30	123.75
	-j 5	176.49	113.34
	-j 6	162.59	103.93
	-j 7	169.03	101.87
	-j 8	168.71	104.87
	-j 9	173.49	109.18
	-j 10	177.26	113.28

machine	-j	recursive	non-recursive
D6	-j 6	108.39	80.10
	-j 7	111.12	79.71
	-j 8	112.87	75.52
	-j 9	120.62	71.29
	-j 10	130.90	67.85
	-j 11	114.18	73.01
	-j 12	126.97	71.56

Comments

In the up-to-date check both build systems perform pretty close except in case of X2. My only explanation to such a poor performance of the recursive build system is 2.6 kernel and/or reiserfs. I would also expect a bigger difference in up-to-date check times between recursive and non-recursive build systems in larger projects and in projects with lower translation-units-per-directory ratio.

The results of the build from scratch benchmark on P1 indirectly proves that both build systems were tested in equivalent conditions and perform the same tasks.

Not surprisingly, the non-recursive build outperforms the recursive one on every parallel configuration. On X2 the non-recursive build takes about 87% of the recursive time. On X4 that time drops to 63% and on D6 to 61%.

What's surprising is the speed-up of the recursive build when moved from X4 to D6 being 32% (33% for the non-recursive build) in comparison to the speed-up when moved from X2 to X4 being just 7% (33% for the non-recursive build). I tend to think it is not the recursive build that scales well, but rather the non-recursive one that does not. The potential distcc bottlenecks (e.g., the fact that X4 depends on X2 for preprocessed material as well as on saving the output) may result in flow control and consequently in the reduction of parallelism. This reduction in some sense evens the chances of the recursive and non-recursive builds. Supporting evidence to this explanation is the load of the two hosts. In the non-recursive build the load fluctuates significantly the first 10 seconds or so and then both boxes become uniformly loaded. The recursive build never loads either of the two boxes to 100%, however the fluctuations are not as big as in the first 10 seconds of the non-recursive build. It would be interesting to see the results with reversed roles, i.e., X4 being the localhost.

It is also interesting to note that the highest performance of the non-recursive build is achieved about two -j points higher than of the recursive build in equivalent conditions.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2; with the whole document being an Invariant Section, no Front-Cover Texts and no Back-Cover Texts.

Last updated on Aug 21 2004