OCaml Planet

July 04, 2015

Amir Chaudhry

Unikernels at PolyConf!

[Slides: "Unikernels at PolyConf" (embedded via SpeakerDeck)]

Above are my slides from a talk at PolyConf this year. I was originally going to talk about the MISO tool stack and personal clouds (i.e. how we’ll build towards Nymote) but after some informal conversations with other speakers and attendees, I thought it would be way more useful to focus the talk on unikernels themselves — specifically, the ‘M’ in MISO. As a result, I ended up completely rewriting all my slides! Since I pushed this post just before my talk, I hope that I’m able to stick to the 30min time slot (I’ll find out very soon).

In the slides I mention a number of things we’ve done with MirageOS so I thought it would be useful to list them here. If you’re reading this at the conference now, please do give me feedback at the end of my talk!

To get involved in the development work, please do join the MirageOS devel list and try out some of the examples for yourselves!

by Amir Chaudhry at July 04, 2015 01:00 PM

July 02, 2015

OCamlCore Forge Projects

POSIX message queues

The mq library provides an OCaml interface to the POSIX message queue API. The bindings try to keep the semantics close to the POSIX API of mqueue.h.

July 02, 2015 06:19 PM

Github OCaml jobs

Full Time: Software Developer (Functional Programming) at Jane Street in New York, NY; London, UK; Hong Kong

Software Developer

Jane Street is a proprietary quantitative trading firm, focusing primarily on trading equities and equity derivatives. We use innovative technology, a scientific approach, and a deep understanding of markets to stay successful in our highly competitive field. We operate around the clock and around the globe, employing over 400 people in offices in New York, London and Hong Kong.

The markets in which we trade change rapidly, but our intellectual approach changes faster still. Every day, we have new problems to solve and new theories to test. Our entrepreneurial culture is driven by our talented team of traders and programmers. At Jane Street, we don't come to work wanting to leave. We come to work excited to test new theories, have thought-provoking discussions, and maybe sneak in a game of ping-pong or two. Keeping our culture casual and our employees happy is of paramount importance to us.

We are looking to hire great software developers with an interest in functional programming. OCaml, a statically typed functional programming language with similarities to Haskell, Scheme, Erlang, F# and SML, is our language of choice. We've got the largest team of OCaml developers in any industrial setting, and probably the world's largest OCaml codebase. We use OCaml for running our entire business, supporting everything from research to systems administration to trading systems. If you're interested in seeing how functional programming plays out in the real world, there's no better place.

The atmosphere is informal and intellectual. There is a focus on education, and people learn about software and trading, both through formal classes and on the job. The work is challenging, and you get to see the practical impact of your efforts in quick and dramatic terms. Jane Street is also small enough that people have the freedom to get involved in many different areas of the business. Compensation is highly competitive, and there's a lot of room for growth.

You can learn more about Jane Street and our technology from our main site, janestreet.com. You can also look at a talk given at CMU about why Jane Street uses functional programming (http://ocaml.janestreet.com/?q=node/61), and our programming blog (http://ocaml.janestreet.com).

We also have extensive benefits, including:

  • 90% book reimbursement for work-related books
  • 90% tuition reimbursement for continuing education
  • Excellent, zero-premium medical and dental insurance
  • Free lunch delivered daily from a selection of restaurants
  • Catered breakfasts and fresh brewed Peet's coffee
  • An on-site, private gym in New York with towel service
  • Kitchens fully stocked with a variety of snack choices
  • Full company 401(k) match up to 6% of salary, vests immediately
  • Three weeks of paid vacation for new hires in the US
  • 16 weeks fully paid maternity/paternity leave for primary caregivers, plus additional unpaid leave

More information at http://janestreet.com/culture/benefits/

July 02, 2015 01:35 PM

June 28, 2015

Shayne Fletcher

Cumulative moving average

We have $nA_{n} = s_{1} + \cdots + s_{n}$ and $(n + 1)A_{n + 1} = s_{1} + \cdots + s_{n + 1}$ so $s_{n + 1} = (n + 1)A_{n + 1} - nA_{n}$ from which we observe $A_{n + 1} = \frac{s_{n + 1} + nA_{n}}{n + 1} = A_{n} + \frac{s_{n + 1} + nA_{n}}{n + 1} - A_{n} = A_{n} + \frac{s_{n + 1} - A_{n}}{n + 1}$.


let cumulative_moving_average (l : float list) : float =
  let f (s, i) x =
    let k = i + 1 in
    let s = s +. (x -. s) /. float_of_int k in
    (s, k)
  in
  fst @@ List.fold_left f (0., 0) l
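As a quick numeric check of the recurrence above (the definition is restated so the snippet stands alone): for the inputs 1, 2, 3, 4 the running averages are 1, 1.5, 2 and finally 2.5.

```ocaml
let cumulative_moving_average (l : float list) : float =
  let f (s, i) x =
    let k = i + 1 in
    (* A_k = A_{k-1} + (s_k - A_{k-1}) / k *)
    (s +. (x -. s) /. float_of_int k, k)
  in
  fst @@ List.fold_left f (0., 0) l

let () =
  Printf.printf "%g\n" (cumulative_moving_average [1.; 2.; 3.; 4.])
  (* prints 2.5 *)
```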


fun cumulative_moving_average (l : list[double]) : double = {
  var f = fun (s : double, i : int) (x : double) : double * int = {
    var k = i + 1;
    return (s + (x - s) / (double)k, k);
  };
  return (fold_left f (0.0, 0) l) . 0 ;
}

#include <numeric>
#include <utility>

namespace detail {
  typedef std::pair<double, int> acc_t;

  inline acc_t cumulative_moving_average (acc_t const& p, double x) {
    double s = p.first;
    int k = p.second + 1;
    return std::make_pair (s + (x - s) / k, k);
  }
}//namespace detail

template <class ItT>
double cumulative_moving_average (ItT begin, ItT end) {
  return std::accumulate (
    begin, end, std::make_pair (0.0, 0),
    detail::cumulative_moving_average).first;
}


template <class T>
auto cumulative_moving_average (T const& c) {
  return std::accumulate (
    std::begin (c), std::end (c), std::make_pair (0.0, 0),
    [](auto p, auto x) {
      ++std::get<1> (p);
      return std::make_pair (
        std::get<0> (p) + (x - std::get<0> (p)) / std::get<1> (p),
        std::get<1> (p));
    }).first;
}

(credit: Juan Alday)

by Shayne Fletcher (noreply@blogger.com) at June 28, 2015 04:35 PM

June 23, 2015


Xavier Leroy

Using Coq's evaluation mechanisms in anger

Coq offers several internal evaluation mechanisms that have many uses, from glorious proofs by reflection to mundane testing of functions. Read on for an account of my quest to get large parts of the CompCert verified C compiler to execute from within Coq. It was a mighty battle between the forces of transparency and opacity, with the innocent-looking "decide equality" tactic as the surprise villain...

Recently, I spent significant time trying to execute parts of the CompCert verified C compiler directly within Coq. The normal execution path for CompCert is to extract (using Coq's extraction mechanism) OCaml code from the Coq function definitions of CompCert, then compile this OCaml code and link it with hand-written OCaml code. Executing Coq definitions from within Coq itself bypasses extraction and provides a more lightweight way to run them on sample inputs.

That's the theory; in practice, I ran into many problems, some specific to CompCert, others of a more general nature. In this post, I describe some of the "gotcha's" I ran into. Most of this material is old news for expert Coq users, but writing it down could help others avoiding the pitfalls of Coq's execution mechanisms.

Coq's evaluation mechanisms

Readers unfamiliar with Coq may wonder why a proof assistant should be able to evaluate functional programs at all. Owing to the conversion rule (types/propositions are identified up to reductions), evaluation is an integral part of Coq's logic. The Coq implementation provides several evaluation mechanisms:

  • An interpreter (invoked via Eval cbv, Eval lazy, etc.) that supports several evaluation strategies (call-by-value, lazy, etc).
  • vm_compute, which relies on compilation to the bytecode of a virtual machine, implements call-by-value evaluation, and delivers performance comparable to that of the OCaml bytecode compiler.
  • native_compute, introduced in the upcoming 8.5 release of Coq, relies on the OCaml native-code compiler to produce even higher performance.

Here is an example of evaluation:

Require Import Zarith.
Open Scope Z_scope.

Fixpoint fib (n: nat) : Z :=
  match n with 0%nat => 1 | 1%nat => 1 | S (S n2 as n1) => fib n1 + fib n2 end.

Compute (fib 30%nat).

The Coq toplevel prints = 1346269 : Z, taking about 0.2s for the evaluation. The Compute command is shorthand for Eval vm_compute in, and therefore uses the virtual machine evaluator. We can also use the interpreter instead, obtaining the same results, only slower:

Eval cbv  in (fib 30%nat).   (* takes 2.3s *)
Eval lazy in (fib 30%nat).   (* takes 16s *)

On the "ML subset" of Coq, like the example above, Coq's evaluators behave exactly as one would expect. This "ML subset" comprises non-dependent data types and plain function definitions. Other features of Coq are more problematic evaluation-wise, as we now illustrate.

Beware opacity!

The most common source of incomplete evaluations is opaque names. An obvious example is names that are declared using Parameter or Axiom but not given a definition. Consider:

Parameter oracle: bool.
Compute (if oracle then 2+2 else 3+3).
    (* = if oracle then 4 else 6 : Z *)

Coq is doing its best here, evaluating the then and else branches to 4 and 6, respectively. Since it does not know the value of the oracle Boolean, it is of course unable to reduce the if entirely.

A less obvious source of opacity is definitions conducted in interactive proof mode and terminated by Qed:

Definition nat_eq (x y: nat) : {x=y} + {x<>y}.
Proof. decide equality. Qed.

Compute (nat_eq 2 2).
    (* = nat_eq 2 2 : {2%nat = 2%nat} + {2%nat <> 2%nat} *)

Qed always creates opaque definitions. To obtain a transparent definition that evaluates properly, the proof script must be terminated by Defined instead:

Definition nat_eq (x y: nat) : {x=y} + {x<>y}.
Proof. decide equality. Defined.

Compute (nat_eq 2 2).
    (* = left eq_refl : {2%nat = 2%nat} + {2%nat <> 2%nat} *)

The Print Assumptions command can be used to check for opaque names. However, as we see later, an opaque name does not always prevent full evaluation, and opaque definitions are sometimes preferable to transparent ones.

Another source of opacity, of the "after the fact" kind, is the Opaque command, which makes opaque a previous transparent definition. Its effect can be undone with the Transparent command. The virtual machine-based evaluator ignores opacity coming from Opaque, but the interpreter-based evaluator honors it:

Definition x := 2.
Opaque x.
Compute (x + x).      (* = 4 : Z *)
Eval cbv in (x + x).  (* = huge useless term *)

Dependently-typed data structures

A very interesting feature of Coq, from a functional programming standpoint, is its support for dependently-typed data structures, containing both data and proofs of properties about them. Such data structures can be a challenge for evaluation: intuitively, we want to evaluate the data parts fully, but may not want to evaluate the proof parts fully (because proof terms can be huge and their evaluation takes too much time), or may not be able to evaluate the proof parts fully (because they use theorems that were defined opaquely).

A paradigmatic example of dependently-typed data structure is the subset type { x : A | P x }, which is shorthand for sig A P and defined in the Coq standard library as:

Inductive sig (A:Type) (P:A -> Prop) : Type :=
    exist : forall x:A, P x -> sig P.

Intuitively, terms of type { x : A | P x } are pairs of a value of type A and of a proof that this value satisfies the predicate P.

Let us use a subset type to work with integers greater than 1:

Definition t := { n: Z | n > 1 }.

Program Definition two : t := 2.
Next Obligation. omega. Qed.

Program Definition succ (n: t) : t := n + 1.
Next Obligation. destruct n; simpl; omega. Qed.

The Program facility that we used above makes it easy to work with subset types: to build terms of such types, the programmer specifies the data part (e.g. 2 or n + 1 above), and the proof parts are left as proof obligations, which can be solved using proof scripts. There are other ways, for example by writing proof terms by hand:

Definition two : t := exist _ 2 (refl_equal Gt).

or by using interactive proof mode for the whole term:

Definition two : t.
Proof. exists 2. omega. Defined.

But how well does this compute? Let's compute succ two:

Compute (succ two).
   (*  = exist (fun n : Z => n > 1) 3
            (succ_obligation_1 (exist (fun n : Z => n > 1) 2 two_obligation_1)) : t *)

This is not too bad: the value part of the result was completely evaluated to 3, while the proof part got stuck on the opaque lemmas introduced by Program. The reason why this is not too bad is that, often, subset types are used locally to transport invariants on data structures, but the final result we are interested in is just the data part of the subset type, as obtained with the proj1_sig projection.

Compute (proj1_sig (succ two)).
   (*  = 3 : Z *)

In other words, the proof parts of values of type t are carried around during computation, but do not contribute to the final result obtained with proj1_sig. So, it's not a problem to have opaque names in the proof parts. Indeed, making these names transparent (e.g. by using Defined instead of Qed to terminate Next Obligation) just creates bigger proof terms that will be discarded eventually anyway.

Another classic example of dependent data type is the type {P} + {Q} of informative Booleans. Values of this type are either the left constructor carrying a proof of P, or the right constructor carrying a proof of Q. A typical use is for decidable equality functions that return not just a Boolean "equal/not equal", but also a proof of equality or disequality. For example, here is a decidable equality for the subset type t above:

Require Import Eqdep_dec.

Program Definition t_eq (x y: t) : {x=y} + {x<>y} :=
  if Z.eq_dec (proj1_sig x) (proj1_sig y) then left _ else right _.
Next Obligation.
  destruct x as [x Px], y as [y Py]. simpl in H; subst y.
  f_equal. apply UIP_dec. decide equality.
Qed.
Next Obligation.
  red; intros; elim H; congruence.
Qed.

Again, such definitions compute relatively well:

Compute (t_eq two two).
    (* = left
         (t_eq_obligation_1 (exist (fun n : Z => n > 1) 2 eq_refl)
            (exist (fun n : Z => n > 1) 2 eq_refl) eq_refl)
       : {two = two} + {two <> two} *)

The proof part blocks, again, on an opaque name, but, more importantly, evaluation went far enough to determine the head constructor left, meaning that the two arguments are equal. Typically, we use decidable equality functions like t_eq in the context of an if expression that just looks at the head constructor and discards the proof parts:

Compute (if t_eq two two then 1 else 2).
    (* = 1 : Z *)

Bottom line: dependently-typed data structures such as subset types or rich Booleans compute quite well indeed, even if their proof parts are defined opaquely. This is due to a phase distinction that functions over those types naturally obey: the data part of the result depends only on the data part of the argument, while the proof part of the argument is used only to produce the proof part of the result. Consider again the succ function above, with type

   succ: { x : Z | x > 1 } -> { y : Z | y > 1 }

The y data part of the result depends only on the x data part of the argument, via y = x + 1. The x > 1 proof part of the argument contributes only to proving the y > 1 part of the result.
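This phase distinction is also what makes Coq's extraction work so well on such types: proof components live in Prop and are erased, so the OCaml code extracted from succ is essentially just the data part. Roughly (a hand-written sketch with Z rendered as int for brevity, not the literal extraction output):

```ocaml
(* { x : Z | x > 1 } extracts to bare Z: the 'x > 1' proof part is erased.
   (Z is shown as OCaml's int here for brevity.) *)
type t = int

let succ (n : t) : t = n + 1
```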

The odd "decide equality" tactic

At first sight, the phase distinction outlined above is a natural consequence of Coq's sorting rules, which, to a first approximation, prevent a term of sort Type to depend on a term of sort Prop. But there are exceptions to this sorting rule, which result in completely mysterious failures of evaluation. As I learned through painful debugging sessions, the decide equality tactic violates the phase distinction in a mysterious way.

Consider a decidable equality over the type list t, autogenerated by decide equality:

Definition t_list_eq: forall (x y: list t), {x=y} + {x<>y}.
Proof. decide equality. apply t_eq. Defined.

Compute (if t_list_eq (two::nil) (two::nil) then 1 else 2).
(*   = if match
            ... (exist (fun n : Z => n > 1) 2 two_obligation_1)
                (exist (fun n : Z => n > 1) 2 two_obligation_1) eq_refl ...
            in (_ = x) with
          | eq_refl => left eq_refl
          ... (* ~40 lines of blocked term elided *)
       then 1
       else 2 : Z *)

Whazzat? The normal form is 40 lines long! Clearly, t_list_eq (two::nil) (two::nil) failed to reduce to left of some equality proof. Apparently, it got stuck on the opaque proof t_eq_obligation_1 before reaching the point where it can decide between left (equal) and right (not equal). But that violates the phase distinction! The left/right data part of the result should not depend on the proof term t_eq_obligation_1!

Something fishy is going on. But maybe we can circumvent it by using Defined instead of Qed in the proof obligations of t_eq? Doing so only delays failure: computation goes further but produces a 200-line term that is blocked on the opaque lemma UIP_dec from Coq's standard library... I played this "whack-a-mole" game for hours, copying parts of the Coq standard library to make lemmas more transparent, in the hope that functions produced by decide equality will compute eventually.

Then I realized that the problem lies with decide equality. The term it produces is roughly the same one would get with the following proof script:

Definition bad_t_list_eq: forall (x y: list t), {x=y} + {x<>y}.
  induction x as [ | xh xt]; destruct y as [ | yh yt].
- left; auto.
- right; congruence.
- right; congruence.
- destruct (t_eq xh yh).
+ subst yh. (* HERE IS THE PROBLEM *)
  destruct (IHxt yt).
  * left; congruence.
  * right; congruence.
+ right; congruence.
Defined.

Notice the subst in the first + bullet? In the case where x and y are not empty and their heads are equal, it eliminates the proof of equality between the heads before recursing over the tails and finally deciding whether to produce a left or a right. This makes the left/right data part of the final result dependent on a proof term, which in general does not reduce!

In this particular example of lists, and in all cases involving ML-like data types, this early elimination of an equality proof is useless: if we just remove the subst xh, we get a perfectly good decidable equality that respects the phase distinction and computes just fine:

Definition good_t_list_eq: forall (x y: list t), {x=y} + {x<>y}.
  induction x as [ | xh xt]; destruct y as [ | yh yt].
- left; auto.
- right; congruence.
- right; congruence.
- destruct (t_eq xh yh).
+ destruct (IHxt yt).
  * left; congruence.
  * right; congruence.
+ right; congruence.
Defined.

Compute (if (good_t_list_eq (two::nil) (two::nil)) then 1 else 2).
    (* = 1 : Z *)

The only case where the current behavior of decide equality would be warranted is for dependently-typed data types like the following:

Inductive bitvect : Type := Bitvect (n: nat) (v: Vector.t bool n).

When comparing two values of bitvect type, after checking that their n components are equal, we must substitute one n by the other so that the two v components have the same type and can be compared in turn. However, decide equality just does not work on the bitvect type above, producing an ill-typed term... So much for my sympathetic explanation of the odd behavior of decide equality!


After reimplementing decide equality in 20 lines of Ltac that generate phase-distinction-correct functions (thank you very much), and performing a zillion other changes in CompCert's Coq sources, I was finally able to execute whole CompCert passes from within Coq. If you are wondering about performance, Coq's VM runs CompCert at 2/3 of the speed of CompCert extracted to OCaml then compiled to bytecode, and 15% of the speed of the "real" CompCert, extracted to OCaml then compiled to native code. Happy end for a terrible hack!

by Xavier Leroy at June 23, 2015 08:00 AM

Gerd Stolpmann

OMake On Steroids (Part 3)

Faster builds with omake, part 3: Caches
In this (last) part of the series we take a closer look at how OMake uses caches, and what could be improved in this field. Remember that we saw roughly double speed overall for large OMake projects, and that we could also reduce the time for incremental builds. For the latter in particular, the effect of caching is important.
This text is part 3/3 of a series about the OMake improvements sponsored by LexiFi: The original publishing is on camlcity.org.

Caching more is better, right? Unfortunately, this common assumption among application programmers does not hold if you look closer at how caches work. Basically, you trade memory for time, but there are also unwanted effects. As we learned in the last part, bigger process images may also cost time. What we examined there using the example of the fork() system call is also true for any memory that is managed in a fine-grained way. Look at the garbage collector of the OCaml runtime: if more memory blocks are allocated, the collector also needs to cycle through more blocks in order to mark and reclaim memory. Although the runtime includes some clever logic to alleviate this effect (namely by allowing more waste for bigger heaps and by adjusting the collection speed to the allocation speed), the slowdown is still measurable.

Another problem for large setups is that if processes consume more memory the caches maintained by the OS have less memory to work with. The main competitor on the OS level is the page cache that stores recently used file blocks. After all, memory is limited, and it is the question for what we use it. Often enough, the caches on the OS level are the most effective ones, and user-maintained caches need to be justified.

In the case of OMake there are mainly two important caches:

  • The target cache answers the question whether a file can be built in a given directory. The cache covers both types of build rules: explicit and implicit rules. For the latter it is very important to have this cache because the applicable implicit rules need to be searched. As OMake normally uses the "-modules" switch of ocamldep, it has to find out on its own in which directory an OCaml module is built.
  • The file cache answers the question whether a file is still up to date, or whether it needs to be rebuilt. This is based on three data blobs: first, the Unix.stat() properties of the file (and whether the file exists at all); second, the MD5 digest of the file; third, the digest of the command that created the file. If any of these blobs changes, the file is out of date. The details are somewhat complicated, though: in particular, the computation of the digest costs some time and should only be done if it helps avoiding other expensive actions. Parts of the file cache survive OMake invocations as these are stored in the ".omakedb" file.
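The decision logic these three blobs support can be sketched as follows (a simplification for illustration, not OMake's actual Omake_cache code; the record fields are made up): trust the cached file digest while the stat data is unchanged, recompute it otherwise, and regard the file as up to date only if both the file digest and the command digest match.

```ocaml
(* Simplified model of a file-cache entry: a stat subset plus two digests.
   Illustrative only; not OMake's actual data structure. *)
type file_meta = { mtime : float; size : int }
type entry = { meta : file_meta; file_digest : string; cmd_digest : string }

let up_to_date ~(cached : entry) ~(meta_now : file_meta)
    ~(recompute_digest : unit -> string) ~(cmd_digest_now : string) : bool =
  let digest_now =
    if meta_now = cached.meta then cached.file_digest  (* cheap path: trust it *)
    else recompute_digest ()                           (* stat changed: recompute *)
  in
  digest_now = cached.file_digest && cmd_digest_now = cached.cmd_digest
```

Note that a changed timestamp alone does not invalidate the file: if the recomputed digest still matches, the rule does not fire.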

All in all, I was looking for ways of reducing the size of the caches, and for a cleverer organization that makes the cache operations cheaper.

The target cache

The target cache is used for searching the directory where a file can be built, and also the applicable file extensions (e.g. if a file m.ml is generated from m.mly there will be entries for both m.ml and m.mly). As I found it, it was very simple, just a mapping
filepath ↦ buildable_flag
and if a file f could potentially exist in many directories d there was a separate entry d/f for every d. For a given OCaml module m, there were entries for every potential suffix (i.e. for .cmi, .cmo, .cmx etc.), and also for every casing of m (remember that a module M can be stored in both m.ml and M.ml). In total, the cache had 2 * D * S * M entries (where D = the number of build directories, S = the number of file suffixes, and M = the number of modules). That is a lot of entries.

The problem is not only the size, but also the speed: For every test we need to walk the mapping data structure.

The new layout of the cache compresses the data in the following way:

filename ↦ (directories_buildable, directories_non_buildable)
On the left side, only simple filenames without paths are used, so we now need only 1/D as many entries as before. On the right side, we have two sets: the directories where the file can be built, and the directories where the file cannot be built (and if a directory appears in neither set, we don't know yet). As the number of directories is very limited, these sets can be represented as bitsets.
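With directory indices bounded by the word size, such a bitset pair can be sketched like this (illustrative only, not OMake's actual code):

```ocaml
(* Directories are numbered 0..62; a set of directories is an int bitmask. *)
type dirset = int
let empty : dirset = 0
let add (d : int) (s : dirset) : dirset = s lor (1 lsl d)
let mem (d : int) (s : dirset) : bool = s land (1 lsl d) <> 0

(* Per-filename cache entry: a three-valued answer per directory. *)
type entry = { buildable : dirset; non_buildable : dirset }

let lookup (e : entry) (d : int) : bool option =
  if mem d e.buildable then Some true
  else if mem d e.non_buildable then Some false
  else None                                  (* not determined yet *)
```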

Note that if we were to program a lame build system, we could even simplify this to

filename ↦ directory_buildable option
but we want to take into account that files can potentially be built in several directories, and that it depends on the include paths currently in scope which directory is finally picked.

It's not only that the same information is now stored in a compressed way. Also, the main user of the target cache picks a single file and searches the directory where it can be built. Because the data structure is now aligned with this style of accessing it, only one walk over the mapping is needed per file (instead of one walk per combination of directory and file). Inside the loop over the directories we only need to look into the bitsets, which is very cheap.

The file cache

Compared to the target cache, the file cache is really complicated. For every file we have three meta data blobs (stat, file digest, command digest). Also, there are two versions of the cache: the persistent version, as stored in the .omakedb file, and the live version.

Many simpler build systems (like "make") only use the file stats for deciding whether a file is out of date. This is somewhat imprecise, in particular when the filesystem stores the timestamps of the files with only low granularity (e.g. in units of seconds). Another problem occurs when the timestamps are not synchronous with the system clock, as it happens with remote filesystems.

There is now a pre-release omake-0.10.0-test1 that can be bootstrapped! It contains all of the described improvements, plus a number of bugfixes.

OMake is programmed so that it only uses the timestamps between invocations. This means that if OMake is started another time, and the timestamp of a file changed compared with the previous invocation of OMake, it is assumed that the file has changed. OMake does not use timestamps during its runs. Instead it relies on the file cache as the instance that decides which files need to be created again. For doing so, it only uses digests (i.e. a rule fires when the digests of the input files change, or when the digest of the command changes).

The role of the .omakedb file is now that a subset of the file cache is made persistent between invocations. This file stores the timestamps of the files and the digests. OMake simply assumes that the saved digest is still the current one if the timestamp of the file remains the same. Otherwise it recomputes the digest. This is the only purpose of the timestamps. Inaccuracies do not play a big role, as we can assume that users typically do not start omake instances so quickly after each other that clock deviations would matter.

The complexity of the file cache is better understood if you look at key operations:

  • Load the .omakedb file and interpret it in the right way
  • Decide whether the cached file digest can be trusted or not (and in the latter case the digest is recomputed from the existing file)
  • Decide whether a rule is out of date or not. This check needs to take the cache contents for the inputs and the outputs of the rule into account.
  • Sometimes, we want to avoid expensive checks, and e.g. only know whether a digest might be out of date from the available information without having to recompute the digest.

After finding a couple of imprecise checks in the existing code, I went through the whole Omake_cache module and reviewed all the data cases. I am now sure that it is perfect in the sense that only those digests are recomputed that are really needed for deciding whether a rule is out of date.

There are also some compressions:

  • The cache no longer stores the complete Unix.stat records, but only the subset of the fields that are really meaningful (timestamps, inode), and represent these fields as a single string.
  • There is a separate data structure for the question whether a file exists. This is one of the cases where OS-level caches already do a good job, so this information is now kept only for the n most recently accessed files (with n = 100). On Linux with its fast system calls this cache is probably unnecessary, but on Windows I actually saw some speedup.
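Such a bounded existence cache might look like this in OCaml (a sketch assuming simple FIFO eviction; not OMake's actual code):

```ocaml
(* Bounded cache of file-existence answers: at most cache_size entries,
   evicting in FIFO order.  Sys.file_exists is the underlying check. *)
let cache_size = 100
let table : (string, bool) Hashtbl.t = Hashtbl.create cache_size
let order : string Queue.t = Queue.create ()   (* insertion order, for eviction *)

let file_exists (path : string) : bool =
  match Hashtbl.find_opt table path with
  | Some b -> b                               (* cached answer *)
  | None ->
    let b = Sys.file_exists path in
    if Hashtbl.length table >= cache_size then
      Hashtbl.remove table (Queue.pop order); (* evict the oldest entry *)
    Hashtbl.add table path b;
    Queue.push path order;
    b
```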

All taken together, this gives another little boost. This is mostly observable on Windows as this OS does not profit from the improvements described in the previous article of the series.

Gerd Stolpmann works as OCaml consultant.

June 23, 2015 12:00 AM

June 19, 2015

Richard Jones

New in nbdkit 1.1.10: OCaml plugins

You can now write OCaml plugins for nbdkit – the liberally licensed NBD server. You will, however, need OCaml ≥ 4.02.2+rc1 because of this fix.

by rich at June 19, 2015 01:29 PM

Gerd Stolpmann

OMake On Steroids (Part 2)

Faster builds with omake, part 2: Linux
The Linux version of OMake suffered from specific problems, and it is worth looking at these in detail.
This text is part 2/3 of a series about the OMake improvements sponsored by LexiFi:
  • Part 1: Overview
  • Part 2: Linux (this page)
  • Part 3: Caches (will be released on Tuesday, 6/23)
The original publishing is on camlcity.org.

While analyzing the performance characteristics of OMake, I found that the features of the OS were used in a non-optimal way. In particular, the fork() system call can be very expensive, and by avoiding it the speed of OMake could be dramatically improved. This is the biggest contribution to the performance optimizations allowing OMake to run roughly twice as fast on Linux (see part 1 for numbers).

The fork/exec problem

The traditional way of starting commands is to use the fork/exec combination: The fork() system call creates an almost identical copy of the process, and in this copy the exec() call starts the command. This has a number of logical advantages, namely that you can run code between fork() and exec() that modifies the environment for the new command. Often, the file descriptors 0, 1, and 2 are reassigned, as is required for creating pipelines. You can also do other things, e.g. change the working directory.
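In OCaml, the pattern looks like this (a generic sketch using the standard Unix library, not OMake's actual command starter):

```ocaml
(* fork/exec: code between fork and exec runs only in the child and can
   adjust its environment (dup2 file descriptors, chdir, ...) before the
   process image is replaced by the new command. *)
let run (prog : string) (args : string list) : Unix.process_status =
  match Unix.fork () with
  | 0 ->
    (* child: fd redirections / Unix.chdir would go here *)
    (try Unix.execvp prog (Array.of_list (prog :: args))
     with _ -> exit 127)                     (* exec failed *)
  | pid ->
    let _, status = Unix.waitpid [] pid in   (* parent: wait for the child *)
    status
```

On success execvp never returns; everything between the fork and the exec runs only in the child.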

The whole problem with this is that it is slow. Even for a modern OS like Linux, fork() includes a number of expensive operations. Although actually copying memory can be avoided, the new address space must be set up by duplicating the page table. This gets more expensive the bigger the address space is. Also, memory must be set aside even if it is not immediately used. The entries for all file mappings must be duplicated (and every linked-in shared library needs such mappings). The point is that all these actions are not really needed, because at exec() time the whole process image is replaced by a different one.

In my performance tests I could measure that forking a 450 MB process image needs around 10 ms. In the n=8 test, compiling each of the 4096 modules requires two commands (ocamldep.opt and ocamlopt.opt). The time for these forks alone sums up to 2 × 4096 × 10 ms ≈ 80 seconds. Even worse, this dramatically limits the benefit of parallelizing the build, because this time is always spent in the main process.

The POSIX standard includes an alternate way of starting commands, the posix_spawn() call. It was originally developed for small systems without virtual memory where it is difficult to implement fork() efficiently. However, because of the mentioned problems of the fork/exec combination it was quickly picked up by all current POSIX systems. The posix_spawn() call takes a potentially long list of parameters that describes all the actions needed to be done between fork() and exec(). This gives the implementer all freedom to exploit low-level features of the OS to speed the call up. Some OSes, e.g. Mac OS X, even implement posix_spawn directly as a system call.

On Linux, posix_spawn is a library function of glibc. By default, however, it is of no real help because it uses fork/exec internally (being very conservative). If you pass the flag POSIX_SPAWN_USEVFORK, though, it switches to a fast alternative implementation. I was pointed (by Török Edwin) to a few emails showing that the quality of this implementation in glibc is not yet optimal. In particular, there are weaknesses in signal handling and in thread cancellation. Fortunately, these weaknesses do not matter for this application (signals are not actively used, and on Linux OMake is single-threaded).

Note that I developed the wrapper for posix_spawn years ago for OCamlnet, where it is still used. So, if you want to test the speed advantage yourself, just use OCamlnet's Shell library for starting commands.

Pipelines and fork()

It turned out that there is another application of fork() in OMake. When creating pipelines, it is sometimes required that the OMake process forks itself, namely when one of the commands of the pipeline is implemented in the OMake language. This is somewhat expected, as the parts of a pipeline need to run concurrently. However, this feature turned out to be a little bit in the way, because the default build rules used it. In particular, there is the pipeline

$(OCAMLFIND) $(OCAMLDEP) ... -modules $(src_file) | ocamldep-postproc
which is started for scanning OCaml modules. While the first command, $(OCAMLFIND), is a normal external command, the second command, ocamldep-postproc, is written in the OMake language.

Forking for creating pipelines is even more expensive than the fork/exec combination discussed above, because memory really needs to be copied. I could finally avoid this fork() by some trickery in the command starter. When used for scanning, and the command is the last one in the pipeline (as in the above pipeline), a workaround is activated that writes the data to a temporary file, as if the pipeline read

$(OCAMLFIND) $(OCAMLDEP) ... -modules $(src_file) >$(tmpfile);
ocamldep-postproc <$(tmpfile)

(NB. You can actually also program this in the OMake language. However, this does not solve the problem, because for sequences of commands $(cmd1);$(cmd2) the process also needs to be forked. Hence, I had to find a solution deeper in the OMake internals.)

There is now a pre-release omake-0.10.0-test1 that can be bootstrapped! It contains all of the described improvements, plus a number of bugfixes.

There is one drawback, though: the latency of the pipeline increases when the commands are run sequentially rather than in parallel. The effect is that OMake takes longer for a j=1 build even though fewer CPU resources are consumed. A number of further improvements compensate for this:

  • Most importantly, ocamldep-postproc can now use a builtin function, speeding this part up by switching the implementation language (now OCaml, previously the OMake language).
  • Because ocamldep-postproc mainly accesses the target cache, speeding up this cache also helped (see the next part of this article series).
  • Finally, there is now a way for functions like ocamldep-postproc to propagate updates of the target cache to the main environment. The background here is that functions implementing commands run in a sub-environment that simulates some isolation from the parent environment. This isolation prevented updates of the target cache found by one invocation of ocamldep-postproc from being used by the next invocation. Removing this restriction for the target cache also speeds the function up.

Windows is not affected

The Windows port of OMake is not affected by the fork problems. For starting commands, an optimized technique similar to posix_spawn() is used anyway. For pipelines and other internal uses of fork(), the Windows port uses threads. (Side note: you may ask why we don't use threads on Linux. There are a couple of reasons: first, emulating the process environment with threads is probably not quite as stable as the original using real processes. Second, there are difficult interoperability problems between threads and signals, something that does not exist on Windows. Finally, this would not save us from maintaining the code branch using real processes and fork(), because OCaml does not support multi-threading on all POSIX systems. Of course, this does not mean we cannot implement it as an optional feature, and probably this will be done at some point in the future.)

The trick of using temporary files for speeding up pipelines is not enabled on Windows. Here, it is more important to get the benefits of parallelization that the real pipeline allows.

The next part will be published on Tuesday, 6/23.
Gerd Stolpmann works as OCaml consultant.

June 19, 2015 12:00 PM