OCaml Planet

April 22, 2015

Coq

Coq 8.5 beta 2 is out!

The second beta release of Coq 8.5 is available for testing. The 8.5 version brings several major features to Coq:
  • asynchronous editing of documents under CoqIDE, to keep working on a proof while Coq checks the other proofs in the background (by Enrico Tassi);
  • universe polymorphism making it possible to reuse the same definitions at various universe levels (by Matthieu Sozeau);
  • primitive projections improving space and time efficiency of records, and adding eta-conversion for records (by Matthieu Sozeau);
  • a new tactic engine allowing dependent subgoals, fully backtracking tactics, as well as tactics which can consider multiple goals together (by Arnaud Spiwack);
  • a new reduction procedure called native_compute to evaluate terms using the OCaml native compiler, for proofs with large computational steps (by Maxime Dénès).
More information about the changes from 8.4 to 8.5 and since the first beta release can be found in the CHANGES file. Feedback and bug reports are extremely welcome. Enjoy!

by Matthieu Sozeau at April 22, 2015 07:00 PM

Caml Spotting

Recover the good old C-x C-b (list-buffers) behaviour in Emacs 24.4 and later

Something changed around Emacs 24.4:  my favorite C-x C-b (list-buffers) no longer works as before.  It displays the buffer menu in a random window.  Which window is chosen is almost nondeterministic, with no way to predict it.  Sometimes the current window is selected. :-(

This change is awful for someone like me, who has built special neural circuitry for Emacs.  I often did:
  • C-x C-b to display the buffer list in a window other than the current one
  • then immediately C-x o to move the cursor from the current window to the buffer list.
I can type these strokes in 0.5 seconds, and this worked fine since what happened with C-x C-b was almost predictable.  24.x ruined it.

After a long investigation, I finally found a way to recover the good old behaviour of C-x C-b (or at least something pretty similar):
(defun good-old-list-buffers ()
  (interactive)
  (display-buffer (list-buffers-noselect)))
(global-set-key (kbd "C-x C-b") 'good-old-list-buffers)
That's it.  I hope this helps some other Emacs users.

by Jun Furuse (noreply@blogger.com) at April 22, 2015 06:27 PM

April 16, 2015

Functional Jobs

Mid/Senior Software Development Engineer at Lookingglass Cyber Solutions (Full-time)

Lookingglass is the world leader in cyber threat intelligence management. We collect and process all-source intelligence, connecting organizations to valuable information through our cyber threat intelligence monitoring and management platform.

Our solutions allow customers to continuously monitor threats far and near, such as the presence of botnets, hosts associated with cybercriminal networks, unexpected route changes and the loss of network resiliency.

We are seeking qualified Software Development Engineers to join our team!

Required Skills & Experience:

  • US Citizen or Green Card Holder
  • Location: MD/VA based
  • Bachelor’s or Master’s degree in computer science, engineering, information systems or mathematics
  • 3-5 yrs experience with the full development life-cycle of shipping products
  • 2+ yrs experience with functional and OO languages – have worked in functional paradigms with immutable data models
  • 3+ yrs building distributed system products including messaging, NoSQL, RPC / RMI mechanisms – including building Service Oriented Architectures
  • Proficiency with data structure and algorithm analysis
  • Experienced working in a TDD Environment

Nice to Have:

  • Product development experience in network security, content security or cyber threat intelligence
  • Experience with concurrency models
  • Experience with key-value distributed databases
  • Experience deploying production software in Haskell, OCAML, Clojure, or Lisp
  • Comfortable writing one or more of the following: JavaScript, GoLang, Ruby

At Lookingglass, we believe our employees are our greatest assets. We offer competitive salaries, a full benefits package, with available medical, dental, vision, and disability insurance, a 401k retirement package, and stock options. We offer generous PTO, a well supplied kitchen, and regular team activities. Most importantly, we offer the opportunity to build a world-class product with a team of talented engineers.

Get information on how to apply for this position.

April 16, 2015 09:46 PM

Senior/Principal Software Development Engineer at Lookingglass Cyber Solutions (Full-time)

Are you an experienced software engineer in security, networking, cloud and big data? Are you interested in cyber security or improving the security of the Internet? Do you push yourself to be creative and innovative and expect the same of others?

At Lookingglass, we are driven and passionate about what we do. We believe that teams, not individuals, deliver great products. We inspire each other and our customers every day with technology that improves the security of the Internet and of our customers. Behind our success is a team that thrives on collaboration and creativity, delivering meaningful impact to our customers.

We are currently seeking qualified Senior/Principal Software Development Engineers to join our team!

Required Skills & Experience:

  • US Citizen or Green Card Holder
  • Location: MD/VA or CA based
  • Bachelor’s or Master’s degree in computer science, engineering, information systems or mathematics
  • Strong technical leader with proven experience leading projects and technologies and mentoring junior development team members
  • 7-10 yrs experience with the full development life-cycle of shipping products
  • 4-8 yrs experience with functional and OO languages – have worked in functional paradigms with immutable data models
  • 5+ yrs building distributed system products including messaging, NoSQL, RPC / RMI mechanisms, and Service Oriented Architectures
  • Proficiency with data structure, algorithm analysis, and concurrency programming
  • Experienced working in a TDD Environment
  • Comfortable with aggressive refactoring
  • Architectural and design skills to map a solution across hardware, software, and business layers in a distributed architecture

Nice to Have:

  • Product development experience in network security, content security or cyber threat intelligence
  • Experience with CSP concurrency models
  • Experience with key-value distributed databases
  • Experience deploying production software in Haskell, OCAML, Clojure, or Lisp
  • Polyglot programmer with production experience in imperative, declarative, OO, functional, strongly/weakly typed, static/dynamic, interpreted/compiled languages

Lookingglass believes our employees are our greatest assets. We offer competitive salaries, a full benefits package, with available medical, dental, vision, and disability insurance, a 401k retirement package, and stock options. We offer generous PTO, a well supplied kitchen, and regular team activities. Most importantly, we offer the opportunity to build a world-class product with a team of talented engineers.

Get information on how to apply for this position.

April 16, 2015 09:46 PM

April 13, 2015

OCamlPro

Yes, ocp-memprof (s)can(f)!

A few months ago, a memory leak in the Scanf.fscanf function of OCaml's standard library was reported on the OCaml mailing list. The following "minimal" example reproduces this misbehavior:

(* in file scanf_leak.ml *)
for i = 0 to 100_000 do
  let ic = open_in "some_file.txt" in
  Scanf.fscanf ic "%s" (fun _s -> ());
  close_in ic
done;;

read_line ();;

Let us see how to identify the origin of the leak and fix it with our OCaml memory profiler.

Installing the OCaml Memory Profiler

We first install our modified OCaml compiler and the memory profiling tool with the following opam commands:

#### Add memprof repository ####
$ opam remote add memprof http://memprof.typerex.org/opam
$ opam update
    
#### Install the patched compiler and ocp-memprof ####
$ opam switch 4.01.0+ocp1-20150202
$ opam install ocp-memprof
$ eval `opam config env`

That's all! Installation takes only five opam commands.

Compiling and Executing the Example

The second step consists of compiling the example above and profiling it. This is simply achieved with the following commands:

#### Compile your example ####
$ ocamlopt scanf_leak.ml -o scanf.x
    
#### Execute your program with ocp-memprof ####
$ ocp-memprof --exec scanf.x

You may notice that no instrumentation of the source is needed to enable profiling.

Visualizing the Results

In the last command above, scanf.x dumps a lot of information related to memory occupation during its execution. Our OCaml memory profiler then analyzes these dumps and generates a human-readable graph showing the evolution of memory consumption after each OCaml garbage collection. Concretely, this yields the graph below (the interactive graph generated by ocp-memprof is available here). As you can see, memory consumption grows abnormally and exceeds 240MB! Note that we stopped scanf.x after 90 seconds.

[Graph: memory consumption of scanf.x over time, showing the leak]

Playing With (Some of) ocp-memprof Capabilities

ocp-memprof can group and display the data in the graph according to several criteria. For instance, data are grouped by "Modules" in the capture below, which lets us deduce that most allocations are performed in the Scanf and Buffer modules.

[Graph: the same run with allocations grouped by module; Scanf and Buffer dominate]

In addition to aggregation, the interactive graph generated by ocp-memprof also lets you "zoom" in on particular data. For instance, zooming in on Scanf yields the graph below, which shows the different functions allocating in this module. The most allocating function is Scanf.Scanning.from_ic. Let us have a look at this function.

[Graph: allocations inside the Scanf module, broken down by function]

From Profiling Graphs to Source Code

The code built around from_ic that is responsible for most of the allocation in Scanf is the following memoizing wrapper, memo_from_ic:

let memo_from_ic =
  let memo = ref [] in
  (fun scan_close_ic ic ->
    try List.assq ic !memo with
    | Not_found ->
      let ib = from_ic scan_close_ic (From_channel ic) ic in
      memo := (ic, ib) :: !memo;
      ib)
;;

It looks like the leak is caused by the memo list, which associates with each input channel a lookahead buffer resulting from the call to from_ic. Since the list holds ordinary (strong) pointers, every channel ever scanned, together with its buffer, stays reachable forever.

Patching the Code

Benoit Vaugon quickly sent a patch based on weak pointers that seems to solve the problem. He modified the code as follows:

  • he put the key (the input channel) in a weak set (IcMemo), in order to test whether it is still alive;

  • he created a pair that stores the key and the associated value;

  • he put this pair in a weak set (PairMemo), from which it would otherwise be reclaimed at the next GC;

  • he added a finaliser on the pair that re-adds the pair to the weak set at each GC, for as long as the key survives.

let memo_from_ic =
  let module IcMemo = Weak.Make (struct
    type t = Pervasives.in_channel
    let equal ic1 ic2 = ic1 = ic2
    let hash ic = Hashtbl.hash ic
  end) in
  let module PairMemo = Weak.Make (struct
    type t = Pervasives.in_channel * in_channel
    let equal (ic1, _) (ic2, _) = ic1 = ic2
    let hash (ic, _) = Hashtbl.hash ic
  end) in
  let ic_memo = IcMemo.create 16 in
  let pair_memo = PairMemo.create 16 in
  let rec finaliser ((ic, _) as pair) =
    if IcMemo.mem ic_memo ic then (
      Gc.finalise finaliser pair;
      PairMemo.add pair_memo pair) in
  (fun scan_close_ic ic ->
    try snd (PairMemo.find pair_memo (ic, stdin)) with
    | Not_found ->
      let ib = from_ic scan_close_ic (From_channel ic) ic in
      let pair = (ic, ib) in
      IcMemo.add ic_memo ic;
      Gc.finalise finaliser pair;
      PairMemo.add pair_memo pair;
      ib)
;;
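
For readers unfamiliar with the Weak module, here is a minimal, self-contained sketch (not part of the patch) of the property the fix relies on: a weak set does not keep its elements alive, so an entry can disappear as soon as the program drops all other references to it.

(* A minimal sketch of Weak.Make, independent of the patch above. A weak
   set holds its elements weakly: once no strong reference to an element
   remains, the GC may remove it from the set. *)
module StringSet = Weak.Make (struct
  type t = string
  let equal = (=)
  let hash = Hashtbl.hash
end)

let () =
  let set = StringSet.create 16 in
  let s = String.make 10 'x' in   (* heap-allocated, held only weakly by set *)
  StringSet.add set s;
  assert (StringSet.mem set s);
  Gc.compact ();                  (* s is still referenced below, so it survives *)
  assert (StringSet.mem set s)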

Checking the Fixed Version

Curious to see the memory behavior after applying this patch? The graph below shows the memory consumption of the patched version of the Scanf module. Again, the interactive version is available here. After each iteration of the for-loop, the memory is released as expected, and consumption never exceeds 2.1MB.

[Graph: memory consumption of the patched version; memory is released at each iteration]

Conclusion

This example is online in our gallery of examples if you want to see and explore the graphs (with the leak and without the leak).

Do not hesitate to use ocp-memprof on your applications. Of course, all feedback and suggestions on using ocp-memprof are welcome – just send us an email!

by Çagdas Bozman (cagdas.bozman@ocamlpro.com) at April 13, 2015 12:00 AM

April 10, 2015

Jane Street

Building a lower-latency GC

We've been doing a bunch of work recently on improving the responsiveness of OCaml's garbage collector. I thought it would be worth discussing these developments publicly to see if there was any useful feedback to be had on the ideas that we're investigating.

The basic problem is a well-known one: GCs can introduce unpredictable pauses into your application, and depending on how your GC is configured, these pauses can be quite long. Unpredictable latencies are a problem in a wide variety of applications, from trading systems to web stacks.

One approach people often take is to avoid using the allocator altogether: pool all your objects, and never allocate anything else. You can even keep many of your pooled objects outside of the heap.

This works, but makes for a less pleasant coding experience (and code that is trickier and harder to reason about). So while pooling is a valuable technique, we'd like to have a GC that lets you run with low latencies without sacrificing the ability to allocate.
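
For concreteness, here is a minimal sketch of the pooling technique, with made-up record fields; it is only an illustration, not the pooling code we actually use:

(* A toy object pool: preallocate mutable records and recycle them, so the
   steady state allocates nothing. A real pool would keep a free list
   instead of scanning for a free slot. *)
type slot = { mutable price : float; mutable qty : int; mutable in_use : bool }

let pool = Array.init 1024 (fun _ -> { price = 0.; qty = 0; in_use = false })

let acquire () =
  let rec find i =
    if i = Array.length pool then failwith "pool exhausted"
    else if pool.(i).in_use then find (i + 1)
    else (pool.(i).in_use <- true; pool.(i))
  in
  find 0

let release slot = slot.in_use <- false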

What are the problems?

OCaml's garbage collector is already pretty good from a latency perspective. Collection of the major heap in OCaml is incremental, meaning it can be done in small slices spread out over time, so no single transaction need experience the full latency of walking the major heap. Also, collection of the minor heap is pretty fast, and OCaml programs tend to do well with a relatively small minor heap --- typical advice in Java-land is to have a young generation in the 5-10 GiB range, whereas our minor heaps are measured in megabytes.

Still, there are problems with OCaml's collector.

No profiling

There's no good way in the stock runtime to see how long the different parts of collection take, and that makes it hard to optimize.

False promotions

OCaml's generational collector is very simple: objects are typically allocated first on the minor heap, where the work is effectively three inlined instructions to bump a pointer and check whether you've hit the end. When you do hit the end, you do a minor collection, walking the minor heap to figure out what's still live, and promoting that set of objects to the major heap.

In a typical functional workload, most of your allocations are short-lived, and so most of the minor heap is dead by the time you do the minor collection, so the walk of the minor heap can be quite cheap. But there's always a small number of false promotions, objects that would have become unreachable shortly, but were promoted because the minor collection came at an inconvenient time.

Clocking

One fundamental issue with the stock runtime is that the collector is clocked in terms of minor allocations --- ignoring, critically, the amount of time that has gone by.

This clocking makes sense for many applications, but if you're building a server that needs to respond to bursty traffic with low and predictable latencies, this is the opposite of what you want. Really, what you'd prefer to do is to defer GC work when you're busy, instead scheduling it at times when the application would otherwise be idle.

One solution here is to allow the application to drive the scheduling of the GC, but the runtime in its current form doesn't really support doing this. In particular, while you can choose to explicitly run a major slice, the collector accounting doesn't take note of the work that has been done that way, so the major collector works just as hard as it did previously.

Furthermore, the major slice always forces a minor collection. But running minor collections all the time is problematic in its own right: if you run them while the minor heap is still mostly empty, you'll end up accidentally promoting a rather large fraction of your minor allocations.
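
For reference, this is what explicitly driving the collector looks like against the stock Gc module; note that, as just described, the major slice drags a minor collection along with it:

(* Forcing a major slice from application code with the stock runtime.
   Gc.major_slice first runs a minor collection, then a slice of the major
   collection; the argument 0 lets the runtime pick the slice size, and the
   result estimates the work left for the next slice. *)
let force_gc_work () =
  let (_ : int) = Gc.major_slice 0 in
  ()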

Insufficient incrementality

While the major collector is mostly incremental, not everything about it runs incrementally. In particular, when the major collector hits an array, it walks the array all at once. This is problematic if you start using large arrays, which does happen when one is using pooling techniques. Similarly, the collector is not incremental when it comes to scanning GC roots.

Immediate accounting

The stock runtime decides how big of a major slice to do based on how much was promoted to the major heap in the last minor collection. This is part of a heuristic that is meant to make sure that the collector keeps up with the rate of allocation, without running needlessly when the application isn't promoting much.

But the accounting is in some sense too immediate: if you do a lot of promotion in a given cycle, you're forced to do the collection immediately. While all that work needs to be done, it's not clear that it needs to be done immediately. Again, for responsive systems, it's often better to push off work until after the busy times.

Making it better

Happily, Damien Doligez, the author of OCaml's GC, has been visiting with us for the last few months, and has been doing a lot of good work to improve the runtime, and in particular to address the concerns raised above. Here's the summary of the changes made thus far.

Better profiling

A set of probes was added to the GC, allowing us to record every phase of the collection process in considerable detail: the phase (marking vs sweeping) and the sub-phase, along with a collection of useful counters. This is available in the instrument branch.

Aging

Damien has also implemented aging in the minor heap. Aging is a technique whereby objects stay in the minor heap for several minor collections before being promoted to the major heap. The goal of aging is to reduce the amount of false promotion.

Better incrementalization

Several of the stages of the collector have been made interruptible, including scanning of arrays and of the roots. The effect here is to reduce the worst-case delays imposed by the collector. This is in the low-latency branch.

Separating major slices from minor collections

In the stock runtime, major slices and minor collections are always done together. In the low-latency branch, you can run one without the other, and you can basically run them at any time. This has a couple of advantages --- one is that it's essentially another form of incrementalization, allowing you to do less work per GC pause.

The other is that it gives you more freedom to schedule collections when you want to. One way we're looking at using this is to have an application-level job that wakes up periodically, and does a heuristic check to see if the system appears busy. If it doesn't, then it schedules some GC work, and it may choose to do either a minor collection or a major slice. A minor collection would only be chosen when the minor heap is fuller than some configured level, to avoid too much false promotion; but a major slice can be done at any time.
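
A rough sketch of such a job, written against the stock Gc API (the low-latency branch exposes finer control; is_busy and the threshold are hypothetical, application-specific choices):

(* Called periodically from an application-level job. Cumulative minor
   allocation is used as a crude proxy for how full the minor heap is,
   since occupancy is not directly exposed by the stock Gc module. *)
let words_at_last_forced_minor = ref 0.

let gc_pacer_tick ~is_busy ~minor_words_threshold =
  if not (is_busy ()) then begin
    let minor_words = (Gc.quick_stat ()).Gc.minor_words in
    if minor_words -. !words_at_last_forced_minor > minor_words_threshold
    then begin
      Gc.minor ();                  (* empty the sufficiently full minor heap *)
      words_at_last_forced_minor := minor_words
    end
    else ignore (Gc.major_slice 0)  (* otherwise chip away at major work *)
  end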

Smoothed work-accounting

Instead of just keeping track of the amount of work that needs to be done in the next major slice, the GC in the low-latency branch tracks work that must be done over the next n major slices, by keeping these numbers in a circular buffer.

The runtime also uses these buckets to keep track of extra work that has been done by application-forced major slices. A forced major slice takes work away from the front-most bucket, potentially bringing the bucket into negative territory.

When the runtime checks if it needs to do a major slice, it looks at the first bucket. If it's got a positive amount of work in it, then that work is done in that slice, if possible. Whatever is left over (which may be positive or negative) is spread out uniformly over the next n buckets.
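
As a toy model of that accounting (the real bookkeeping lives in the runtime's C code, and every name here is made up):

(* n buckets of pending major-GC work in a circular buffer. Forced slices
   subtract from the front bucket, possibly driving it negative; a scheduled
   slice does the front bucket's work if positive, then spreads the leftover
   (positive or negative) uniformly over the n buckets and advances. *)
let n = 8
let buckets = Array.make n 0.
let front = ref 0

let add_work w = buckets.(!front) <- buckets.(!front) +. w

let forced_slice work_done = buckets.(!front) <- buckets.(!front) -. work_done

let scheduled_slice ~capacity =
  let w = buckets.(!front) in
  let done_now = if w > 0. then min w capacity else 0. in
  let leftover = w -. done_now in
  buckets.(!front) <- 0.;
  Array.iteri (fun i b -> buckets.(i) <- b +. leftover /. float_of_int n) buckets;
  front := (!front + 1) mod n;
  done_now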

Segmented free lists

A big part of the cost of minor collections is the cost of finding free blocks. One observation is that in many OCaml applications, block sizes are quite small. One way of taking advantage of this is to have a set of size-segregated free lists for a configurable set of sizes. For example, one could have a different free list for blocks with 1, 2, 3 and 4 slots.

This is still ongoing (read: not working yet), but it will show up in the multi-free-list branch eventually.
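
Although the branch isn't finished, the data structure itself is easy to sketch; here is a toy model (the actual implementation is C code inside the runtime and will differ in every detail), in which one list per small block size gives O(1) allocation whenever an exact-size block is available:

(* Toy size-segregated free lists: free blocks are represented as abstract
   addresses (ints); blocks of more than max_small slots go to a general list. *)
type free_lists = {
  by_size : int list array;            (* index s: blocks of exactly s slots *)
  mutable general : (int * int) list;  (* (addr, size) for larger blocks *)
}

let create ~max_small =
  { by_size = Array.make (max_small + 1) []; general = [] }

let free fl addr size =
  if size < Array.length fl.by_size
  then fl.by_size.(size) <- addr :: fl.by_size.(size)
  else fl.general <- (addr, size) :: fl.general

let alloc fl size =
  if size < Array.length fl.by_size then
    match fl.by_size.(size) with
    | addr :: rest -> fl.by_size.(size) <- rest; Some addr  (* O(1) hit *)
    | [] -> None   (* caller falls back to the general list or triggers GC *)
  else None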

How is it going?

This is all very much a work in progress, but the results so far have been quite promising. By using a version of the compiler with most of these changes and with an application-driven job that forces major slices in quiet times, we were able to reduce tail latencies by a factor of 3 in a real production application. That's pretty good considering that we've done essentially no parameter tuning at this point.

That said, some of the results are less promising. We were somewhat disappointed to see that when doing more traditional batch jobs, aging didn't provide a significant improvement in overall compute time. It seems like in many applications, aging saves some on promotion, but the minor collection itself gets a little more expensive, and these seem to nearly cancel out.

This seems especially surprising given that aging is present in most GCs, including those for Java's HotSpot, the .NET CLR, and GHC. Given that everyone seems to use aging, I would have expected aging to have a quite noticeable benefit for lots of workloads, not just carefully tuned packet processors.

A call for help

The progress we've made so far is quite promising, but a lot of things are still up in the air. The reason that I wanted to post about it now is that I was hoping to hear feedback from others who have experience dealing with similar issues in other languages.

So, if you have thoughts about the techniques we've tried for making OCaml's runtime more responsive, or suggestions for other techniques we should consider, please comment!

by Yaron Minsky at April 10, 2015 08:37 PM

April 09, 2015

Coq

Coq 8.4pl6 is out

Version 8.4pl6 of Coq fixes several bugs present in version 8.4pl5. More information can be found in the CHANGES file.

by coq-www at April 09, 2015 04:25 PM

Jane Street

Faster OCaml to C calls

The official OCaml documentation "Interfacing C with OCaml" doesn't document some interesting performance features.

C functions with no OCaml allocation

A C call can allocate OCaml data and pass it back to OCaml, for example using caml_copy_string(s). Between the C call allocating OCaml data and passing it back, it has to make sure that OCaml's garbage collector doesn't collect that data, as the garbage collector can be triggered during the C call. There's an intricate mechanism which ensures this, part of which are the CAMLparam, CAMLlocal and CAMLreturn macros.

This mechanism can be bypassed if the C call is not going to allocate any OCaml data. This can yield performance benefits, especially in shorter functions. To bypass it, CAMLparam, CAMLlocal and CAMLreturn should not be used and the primitive should be declared with "noalloc". Of course, this is only safe if the C function really never allocates on the OCaml heap.

For example, OCaml's compare is not smart enough to avoid branch mispredictions for floats. Moving the comparison to C speeds it up a little bit; "noalloc" speeds it up a lot.

float compare            8.93 ns
float_compare_c          7.88 ns
float_compare_c_noalloc  5.32 ns

external float_compare_noalloc : float -> float -> int =
  "float_compare_noalloc_stub" "noalloc"

CAMLprim value float_compare_noalloc_stub(value vf, value vg)
{
  double f = Double_val(vf);
  double g = Double_val(vg);
  return Val_int((f > g) - (f < g) + (f == f) - (g == g));
}
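
As a quick sanity check, the stub's arithmetic reproduces compare's ordering on floats, placing NaN below everything:

(* Usage sketch: (f > g) - (f < g) yields -1/0/1 for ordered floats, and the
   (f == f) - (g == g) term pushes NaN below every other value. *)
let () =
  assert (float_compare_noalloc 1.0 2.0 = -1);
  assert (float_compare_noalloc 2.0 1.0 = 1);
  assert (float_compare_noalloc 1.0 1.0 = 0);
  assert (float_compare_noalloc nan 1.0 = -1);
  assert (float_compare_noalloc nan nan = 0)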

C functions with float arguments and float return

Since C code must use boxed OCaml floats, any unboxed float must be boxed prior to the C call. This is not cheap, especially for fast functions. This boxing can be avoided if the C call accepts and returns only floats.

For example, float_min can be replaced with a single CPU instruction. Unfortunately, the C implementation is much slower because of boxing floats:

float_min                  6.09 ns
float_min_c               15.92 ns

let float_min (x : float) y = if x < y then x else y

CAMLprim value float_min_stub(value x, value y)
{
  CAMLparam2(x, y);
  CAMLlocal1(v);
  double z = Double_val(y);

  __asm__ ("minsd %1, %0;" : "+&x"(z) : "x"(Double_val(x)));
  v = caml_copy_double(z);
  CAMLreturn(v);
}

external float_min_c : float -> float -> float = "float_min_stub"

To avoid boxing, the C function's arguments and return type should be "double"; CAMLparam, CAMLlocal and CAMLreturn should be avoided; and the primitive should be declared with "float" and given both a bytecode and a native-code implementation:

float_min_c_float          1.95 ns

external float_min_inan_c_float : float -> float -> float
  = "float_min_inan_float_bytecode" "float_min_inan_float_stub" "float"

CAMLprim double float_min_inan_float_stub(double x, double y)
{
  double z = y;
  __asm__ ("minsd %1, %0;" : "+&x"(z) : "x"(x));
  return z;
}

C functions with float arguments and non-float return

We might be able to further speed up float_compare_noalloc if we avoided boxing. Alas, that function returns an integer, so it's impossible to use "float". Is it still possible to avoid boxing? The answer is yes: simply return the result as a float and convert it to an int on the OCaml side.

float_compare_c_float    3.73 ns

CAMLprim double float_compare_float_stub(double f, double g)
{
  return (f > g) - (f < g) + (f == f) - (g == g);
}

external float_compare_float : float -> float -> float
  = "float_compare_float_bytecode" "float_compare_float_stub" "float"
let float_compare_float x y = int_of_float (float_compare_float x y)

by Vladimir Brankov at April 09, 2015 04:20 PM